Deep Convolutional and Recurrent Networks for Polyphonic Instrument Classification from Monophonic Raw Audio Waveforms

Kleanthis Avramidis∗, Agelos Kratimenos∗, Christos Garoufis, Athanasia Zlatintsi and Petros Maragos
School of ECE, National Technical University of Athens, 15773 Athens, Greece
Robot Perception and Interaction Unit, Athena Research Center, 15125 Maroussi, Greece
[email protected], [email protected], cgaroufi[email protected], {nzlat, maragos}@cs.ntua.gr
∗ The first two authors contributed equally.

ABSTRACT
Sound Event Detection and Audio Classification tasks are traditionally addressed through time-frequency representations of audio signals such as spectrograms. However, the emergence of deep neural networks as efficient feature extractors has enabled the direct use of audio signals for classification purposes. In this paper, we attempt to recognize musical instruments in polyphonic audio by only feeding their raw waveforms into deep learning models. Various recurrent and convolutional architectures incorporating residual connections are examined and parameterized in order to build end-to-end classifiers with low computational cost and only minimal preprocessing. We obtain competitive classification scores and useful instrument-wise insight through the IRMAS test set, utilizing a parallel CNN-BiGRU model with multiple residual connections, while maintaining a significantly reduced number of trainable parameters.
Index Terms — Raw Waveforms, End-to-End Learning, Polyphonic Music, Instrument Classification
1. INTRODUCTION
Waveforms are abstract representations of sound waves and, when recorded, they constitute convoluted signals that incorporate noise from the complexity of the recorded sound event, the acoustic scene, as well as the recording equipment. Complex sound events such as spoken dialogues or simultaneously playing musical instruments (i.e. polyphonic music) make it challenging to extract meaningful information. Thus, audio classification tasks traditionally discard direct waveform modeling in favor of richer time-frequency feature representations [1]. In fields like Speech Recognition [2] and Music Information Retrieval [3], such methods take advantage of the discriminative information of the signals' spectra, which is aligned with the human auditory system.

In Instrument Classification in particular, there is strong intuition towards utilizing frequency-related representations, since musical notes and instruments are closely associated with specific frequency events. Thus, most research works in the field incorporate spectrograms in their analysis. It is, however, challenging and computationally expensive to design specialized feature representations for each different recognition task, especially when contemporary deep learning models emerge as strong feature extractors for end-to-end classification. In this paper we address this challenge by parameterizing deep recurrent and convolutional networks to model raw audio waveforms efficiently.
Fig. 1. Intermediate activation of the Residual FCN Model (Sec. 3.2) for a 1-sec piano sample.

Our analysis concentrates on handling the high input dimensionality of the waveforms, mining temporal features and preserving their low-level spatial locality, while reducing the computational cost of the process. We propose a lightweight end-to-end classifier for Instrument Classification that shows comparable performance to state-of-the-art spectrogram-based architectures, including our previous work on the task [4].

The rest of the paper is organized as follows: Sec. 2 provides a review of related research in Audio Signal Processing using raw waveforms, as well as in Instrument Classification. The architectures used throughout our experiments are analyzed in Sec. 3. Sec. 4 describes the experimental setup, the dataset and the evaluation methods, whereas in Sec. 5 we discuss the results of our experiments. Finally, in Sec. 6 we present our conclusions as well as propose further directions for future work.
2. RELATED WORK
Deep neural networks, which have achieved state-of-the-art performance in audio recognition [5] by operating directly in the time domain, have blurred the line between representation learning and predictive modeling. Convolutional networks in particular have shown competitive performance, in some cases matching that of classical spectrogram-based models [6]. In speech analysis and synthesis, WaveNet [7] is a benchmark model fed with audio waveforms, whereas [8] achieves robust representation learning by utilizing WaveNet Autoencoders that use waveforms directly as input. In Music Information Retrieval, a number of works have attempted to acquire high-level features like melody and pitch [9], while waveform-based architectures have also recorded competitive results in music [10] and speech [11] source separation.

As far as Instrument Classification is concerned, until recently the majority of works utilized time-frequency representations and datasets of solo recordings or excerpt-level annotations (e.g. IRMAS [12], MedleyDB [13]). While traditional research, partially due to the challenge of labeling multi-instrumental music, focused upon monophonic audio [14], recent studies address polyphonic tasks, relying on the efficiency of deep learning models. Specific points of focus include the investigation of the optimal input temporal resolution [15, 16] and the design of the convolutional filters involved [17], while we have also experimented with sophisticated augmentation methods, attempting to isolate timbre-like characteristics [4].

Fig. 2. The DCNN, FCN and RFCN architectures used in the experimental evaluation.
3. ARCHITECTURES

3.1. Recurrent Networks
Recurrent neural networks (RNN) have been widely used in waveform modeling and classification, thanks to their ability to model long-range temporal dependencies. In our baseline network we employ the Bidirectional Gated Recurrent Unit (BiGRU). GRU architectures have shown comparable performance to Long Short-Term Memory (LSTM) units in processing audio sequences [18], while having a less complex structure and lower computational cost. Moreover, Bidirectional units consider both past and future audio context, which intuitively assists our task.
Table 1. BiGRU Architecture Configurations.
Number of Layers    Number of Units
1                   128
1                   256
2                   128, 64

Fig. 3. The utilized CNN cell structure.

Regarding this recurrent module, we experiment with the optimal number of layers and utilized units. Specifically, we trained 3 different networks: one with 1 BiGRU layer of 128 units, one with 1 BiGRU layer of 256 units and one with 2 BiGRU layers of 128 and 64 units respectively, as shown in Table 1. We used a fully connected layer to produce the models' output, as well as a Dropout layer right before it, to enhance the models' ability to generalize. In order to effectively reduce the high dimensionality of the waveforms, we apply max pooling at the input level, with a pool size of 3 samples.
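The paper does not state an implementation framework; the following PyTorch sketch illustrates the best Table 1 configuration (2 BiGRU layers with 128 and 64 units) with the input max pooling of size 3, a Dropout layer and a dense sigmoid output. The dropout rate and the use of the last time step as the sequence summary are assumptions.

import torch
import torch.nn as nn

class BiGRUClassifier(nn.Module):
    """Baseline recurrent classifier (Sec. 3.1), assumed PyTorch implementation."""
    def __init__(self, n_classes=11, dropout=0.3):
        super().__init__()
        self.pool = nn.MaxPool1d(kernel_size=3)       # 22050 samples -> 7350 time steps
        self.gru1 = nn.GRU(1, 128, batch_first=True, bidirectional=True)
        self.gru2 = nn.GRU(256, 64, batch_first=True, bidirectional=True)
        self.drop = nn.Dropout(dropout)               # assumed rate
        self.fc = nn.Linear(2 * 64, n_classes)

    def forward(self, x):                             # x: (batch, 22050) raw waveform
        x = self.pool(x.unsqueeze(1)).transpose(1, 2) # (batch, 7350, 1)
        x, _ = self.gru1(x)
        x, _ = self.gru2(x)                           # (batch, 7350, 128)
        x = self.drop(x[:, -1, :])                    # last time step as summary (assumption)
        return torch.sigmoid(self.fc(x))              # 11 independent instrument scores

model = BiGRUClassifier()
scores = model(torch.randn(4, 22050))                 # -> (4, 11)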
3.2. Convolutional Networks

Convolutional Neural Networks (CNN) traditionally operate on images [19] or, in Audio Classification, on time-frequency representations like spectrograms [15], being an extremely coherent feature extraction method for Deep Learning. However, they can provide useful results when applied to 1D signals as well [20]. We base our CNN on the architecture used in [4], which yielded strong results on the IRMAS Dataset. In order to adjust to the different input form, we substitute 2D with 1D Convolutional layers and fine-tune the pooling parameters to handle the high input dimensionality. The CNN cell structure is composed of 2 stacked convolutional layers with the same number of filters, a Batch Normalization layer that enhances the training efficiency [21], a Leaky ReLU activation and a max pooling layer with kernels of varied length (Fig. 3). We place 5 CNN cells in a row with 16, 32, 64, 64 and 32 filters respectively, using kernel and pooling sizes equal to 3 for the first three layers and 5 for the rest. This module is followed by two fully connected layers (denoted as DCNN), which substantially increases the number of its trainable parameters. We thus experiment with removing all the fully connected layers to form a Fully Convolutional Network (FCN). The 11-D output vector is then estimated through an additional convolutional layer of unit kernel, followed by Global Average Pooling. The adoption of FCNs in modeling raw audio waveforms sharply reduces the number of parameters that we need to train, while it can force the network to learn meaningful features in its hidden layers, which keep their temporal locality throughout the architecture [20]. The final configuration is a Residual FCN (RFCN), where we embed skip connections into the previous model, as shown in Fig. 2. Through residual connections the model is able to propagate low-level features throughout the network.

Table 2. Results for the Recurrent Networks discussed in Sec. 3.1, subject to the number of GRU layers and the number of units.

Table 3. Results for the Densely Connected Neural Network (DCNN), the Fully Convolutional Network (FCN) and the Residual FCN (RFCN) discussed in Sec. 3.2.
Models    F1-micro %    F1-macro %    LRAP %
DCNN      55.32
FCN
RFCN
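A minimal PyTorch sketch of the CNN cell of Fig. 3 and the resulting RFCN is given below. Filter counts, kernel and pooling sizes are those listed above; the same-length padding and the way the skip paths are projected to match channels and temporal resolution (1x1 convolutions with pooling) are assumptions, since Fig. 2 does not fix these details.

import torch
import torch.nn as nn

def cnn_cell(in_ch, out_ch, size):
    """One CNN cell (Fig. 3): two stacked 1D convolutions with the same number
    of filters, Batch Normalization, Leaky ReLU and max pooling."""
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel_size=size, padding=size // 2),
        nn.Conv1d(out_ch, out_ch, kernel_size=size, padding=size // 2),
        nn.BatchNorm1d(out_ch),
        nn.LeakyReLU(0.01),
        nn.MaxPool1d(size),
    )

class RFCN(nn.Module):
    """Residual FCN (Sec. 3.2): five cells with 16/32/64/64/32 filters
    (kernel and pool sizes 3, 3, 3, 5, 5), a unit-kernel convolution down to
    11 classes and Global Average Pooling. Skip projections are assumptions."""
    def __init__(self, n_classes=11):
        super().__init__()
        self.c1 = cnn_cell(1, 16, 3)
        self.c2 = cnn_cell(16, 32, 3)
        self.c3 = cnn_cell(32, 64, 3)
        self.c4 = cnn_cell(64, 64, 5)
        self.c5 = cnn_cell(64, 32, 5)
        # assumed skip projections: match channels and temporal resolution
        self.skip13 = nn.Sequential(nn.Conv1d(16, 64, 1), nn.MaxPool1d(9))
        self.skip24 = nn.Sequential(nn.Conv1d(32, 64, 1), nn.MaxPool1d(15))
        self.head = nn.Conv1d(32, n_classes, kernel_size=1)   # unit-kernel conv

    def forward(self, x):                       # x: (batch, 1, 22050)
        h1 = self.c1(x)
        h2 = self.c2(h1)
        h3 = self.c3(h2) + self.skip13(h1)      # 1st cell -> 3rd cell
        h4 = self.c4(h3) + self.skip24(h2)      # 2nd cell -> 4th cell
        h5 = self.c5(h4)
        y = self.head(h5).mean(dim=-1)          # Global Average Pooling over time
        return torch.sigmoid(y)                 # (batch, 11)

Dropping the FCN head (self.head plus the pooling) and appending two dense layers instead recovers the DCNN variant described above.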
3.3. Combined Convolutional-Recurrent Networks

The utilized feature extraction and prediction methods are capable of learning different types of features. It has been demonstrated [5] that convolutional nets concentrate on spatial features and, in the context of waveforms, on temporally local correlations, while recurrent ones are useful in modeling longer-term temporal structure. We can therefore expect that a combined Convolutional-Recurrent Neural Network (CRNN) would further enhance the performance of the highlighted architectures, so we attach the best performing BiGRU model to our RFCN model. In order to keep the temporal resolution of the RNN at a feasible magnitude, we experiment by embedding it after the 2nd, the 3rd and the 4th CNN cell, as well as by feeding the CNN output to the recurrent units. Specifically, the embedded model takes the output of the corresponding CNN cell, and its output is reduced to the number of classes through an additional convolution with a unit kernel, followed by Global Average Pooling. The two 11-D vectors are then averaged before the Sigmoid activation. In this way we empirically search for the optimal way of integrating the recurrent model's information into a robust classifier.
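Building on the assumed RFCN sketch above, the following shows how the BiGRU branch could be attached after the 3rd CNN cell, reduced to 11 classes with a unit-kernel convolution and Global Average Pooling, and averaged with the convolutional logits before the sigmoid; the class name and exact wiring are illustrative, not the authors' code.

import torch
import torch.nn as nn

class CRNN3(nn.Module):
    """Combined model (Sec. 3.3), BiGRU branch after the 3rd CNN cell.
    Reuses the RFCN sketch defined above; hypothetical naming."""
    def __init__(self, n_classes=11):
        super().__init__()
        self.backbone = RFCN(n_classes)
        self.gru1 = nn.GRU(64, 128, batch_first=True, bidirectional=True)
        self.gru2 = nn.GRU(256, 64, batch_first=True, bidirectional=True)
        self.rnn_head = nn.Conv1d(2 * 64, n_classes, kernel_size=1)

    def forward(self, x):                                   # x: (batch, 1, 22050)
        b = self.backbone
        h1 = b.c1(x); h2 = b.c2(h1)
        h3 = b.c3(h2) + b.skip13(h1)                        # branch point: 3rd CNN cell
        h4 = b.c4(h3) + b.skip24(h2)
        cnn_logits = b.head(b.c5(h4)).mean(dim=-1)          # (batch, 11)
        r, _ = self.gru1(h3.transpose(1, 2))                # BiGRU over the cell-3 feature map
        r, _ = self.gru2(r)
        rnn_logits = self.rnn_head(r.transpose(1, 2)).mean(dim=-1)
        return torch.sigmoid((cnn_logits + rnn_logits) / 2) # average the two 11-D vectors

model = CRNN3()
out = model(torch.randn(2, 1, 22050))                        # -> (2, 11)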
4. EXPERIMENTAL SETUP

4.1. Dataset & Training
The IRMAS dataset [12] is used to train and test our models, as it has been extensively researched for the task of Instrument Classification. IRMAS is divided into a training set containing 6705 audio segments of 3 seconds each and a testing set containing polyphonic tracks of various lengths. Each of the 3-sec training snippets is annotated with exactly 1 out of the 11 available predominant instruments, while the polyphonic tracks in the testing set contain 2–4 instruments. We choose to cut each track into 1-sec segments, since this temporal resolution increases the data volume and helps the model generalize better, as indicated in [15, 16]. Each waveform is then downsampled to 22.05 kHz, downmixed to mono and normalized by its root-mean-square energy. Since we are interested in classifying only raw waveforms, no further pre-processing is applied.
Table 4. Results for the combined CNN and RNN networks discussed in Sec. 3.3. The subscript denotes the layer in which the latter was connected to the former.

The training data are then partitioned into 5 subsets to perform cross-validation for each of the above-mentioned architectures. All networks were trained using binary cross-entropy loss, since the task is modeled as multi-class (11 instruments) and multi-label (instruments can co-play). The Adam optimizer [22] is used to optimize the loss function, with an initial learning rate of 0.001 and a 10% decay rate per 4 epochs of non-decreasing validation loss. The batch size is set to 64 after fine-tuning. We also perform Early Stopping by monitoring the validation loss with a patience of 7 epochs.
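The preprocessing of Sec. 4.1 maps to a few lines of standard Python; the sketch below assumes librosa for loading and resampling, and details such as the small epsilon in the normalization are implementation choices not given in the paper.

import librosa
import numpy as np

def preprocess(path, sr=22050, seg_dur=1.0):
    """Load audio downmixed to mono at 22.05 kHz, split it into non-overlapping
    1-sec segments and normalize each segment by its RMS energy (Sec. 4.1)."""
    wav, _ = librosa.load(path, sr=sr, mono=True)   # resample + downmix
    seg_len = int(sr * seg_dur)
    segments = []
    for i in range(len(wav) // seg_len):
        seg = wav[i * seg_len:(i + 1) * seg_len]
        rms = np.sqrt(np.mean(seg ** 2)) + 1e-8     # avoid division by zero
        segments.append(seg / rms)
    return np.stack(segments)                        # (n_segments, 22050)

The remaining training configuration also corresponds to standard components: binary cross-entropy loss, Adam with a 0.001 learning rate, a scheduler that decays the rate by 10% after 4 epochs of non-decreasing validation loss, batch size 64 and early stopping with patience 7.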
4.2. Evaluation

Each model is evaluated on the IRMAS test set, consisting of 2355 polyphonic music tracks ranging from 5 to 20 sec in duration. During the evaluation process, we partition each track into 1-sec segments, compute the frame-level predictions and then average them in order to extract a single track-level prediction. This method produces reliable results because each labeled instrument is always active for the whole duration of the track. For this polyphonic classification task we utilize two metrics. The first one is the F1 Score, which is widely used in many related studies [12, 16] and provides a balanced view of multi-class performance. In order to calculate an overall score, we compute the per-instrument scores and aggregate them at both micro and macro scales. The second one is Label Ranking Average Precision (LRAP), a rank-based metric proposed in [23]. LRAP is suitable for multi-label evaluation as it is threshold-independent and measures the classifier's ability to assign higher scores to the correct labels associated with each sample.
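The evaluation protocol can be sketched with scikit-learn; the segment-score averaging follows the description above, while the 0.5 decision threshold used for the F1 scores is an assumption (the paper does not state it), and LRAP is computed directly from the averaged scores.

import numpy as np
from sklearn.metrics import f1_score, label_ranking_average_precision_score

def track_prediction(segment_scores, threshold=0.5):
    """segment_scores: per-segment sigmoid outputs of one track, shape (n_segments, 11).
    Returns the averaged track-level scores and thresholded binary predictions."""
    track_scores = segment_scores.mean(axis=0)      # average the 1-sec predictions
    return track_scores, (track_scores >= threshold).astype(int)

def summarize(Y_true, Y_score, Y_pred):
    """Y_true, Y_score, Y_pred: (n_tracks, 11) arrays over the whole test set."""
    return {
        "f1_micro": f1_score(Y_true, Y_pred, average="micro"),
        "f1_macro": f1_score(Y_true, Y_pred, average="macro"),
        "lrap": label_ranking_average_precision_score(Y_true, Y_score),
    }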
5. RESULTS AND DISCUSSION

5.1. Architecture Comparison
Table 2 shows the accuracy scores for the recurrent neural models proposed in Sec. 3.1. It is clear that a simple recurrent network cannot sufficiently decode the information included in a waveform. Still, the best model emerges from a combination of two BiGRU layers with 128 and 64 units, respectively, which by far outperforms the single GRU layer. Further experiments show that adding a CNN cell before the recurrent units significantly improves the classification, indicating the efficiency of the models described in Sec. 3.3. On the other hand, as we see in Table 3, 1D convolutional models are capable of extracting the most discriminative features from raw waveforms, almost as well as 2D convolutional models that work on spectrogram inputs [16, 4]. Furthermore, removing the dense layers not only reduces the number of trainable parameters, and thus the training time, but also increases the accuracy substantially. We argue that, in the absence of a dense layer, the network generalizes better upon the information from the spatial processing. Additionally, connecting the output of the first CNN cell with the third one, and the output of the second CNN cell with the fourth, slightly increases the results with no additional cost in parameters or training time. Other types of residual connections, though, do not seem to yield consistently improved performance.

Fig. 4. Instrument-wise performance of the proposed model and the monophonic model of [4] in terms of F1-score.
Table 5. Comparison of our work with previous performances on the IRMAS Dataset.
Models                   F1-micro    F1-macro    LRAP
Bosch et al. [12]        0.503       0.432       –
Pons et al. [17]         0.589       0.516       –
Han et al. [16]          0.602       0.503       –
Kratimenos et al. [4]
To optimally combine the temporal information extracted from the RNN and the spatial characteristics already drawn from the residual network, we additionally assess the performance of the combined networks described in Sec. 3.3. Simply averaging the RNN and CNN model outputs, though, lowers the classification accuracy, something we attribute to the inadequate standalone performance of the BiGRU (see Table 2). We thus inserted the RNN model at various locations within the RFCN architecture, in search of optimal accuracy. From the scores reported in Table 4, we notice that there is no observed improvement in model performance as far as the LRAP metric is concerned. However, there is a steady increase in F1 scores, of about 2% and 4% at the micro and macro scales, respectively. It should be mentioned at this point that the combined models consist of a significantly larger number of parameters than the fully convolutional ones, while the DCNN is the model with the most parameters, despite its lower performance.

The optimal architecture is constructed by embedding the BiGRU module after the 3rd CNN cell of the residual fully convolutional model and by averaging the two models' outputs in the end. The proposed model yields an LRAP score close to 75% and F1 scores comparable to the literature [16, 4, 17] on the IRMAS Dataset. Specifically, F1-micro surpasses most studies on the task, while we observe dominant performance on the more competitive F1-macro score, in which case the model even surpasses the performance achieved in our previous work with the use of complex data augmentation [4]. These results are obtained with a significantly reduced number of trainable parameters, low training and testing time and minimal pre-processing. In order to emphasize this, we attempt to train the model utilized in [4] with a significantly reduced number of parameters, without altering the architecture. Specifically, we cut the number of filters and the final dense layer of the model. The results (Table 6) show that this network falls behind the proposed one by 6% in LRAP and by 8% on average in F1 scores.
Table 6. Comparison between the proposed architecture and the state-of-the-art after parameter reduction.
Models         F1-micro    F1-macro    LRAP     # Params
Proposed
[4] Reduced    0.520       0.458       0.689    1.20M
5.2. Instrument-wise Evaluation

To get a more thorough insight into the waveform characteristics of different instruments and how much these can assist the task of Instrument Classification, we examine the class-wise performance in terms of the F1 metric. The results are visualized in Fig. 4, along with the corresponding results obtained from Constant-Q Transform (CQT) spectrogram modeling in our previous work [4]. We clearly observe that brass instruments are recognized much better using raw waveforms, compared to CQT representations. Specifically, clarinet, flute, saxophone and trumpet achieve 14%, 7%, 13% and 9% increases in F1-score, respectively, when waveforms are considered. On the other hand, predominant instruments, i.e. electric/acoustic guitar, piano/organ and the human voice, are distinguished better through processing their CQT representation, with the highest difference observed for piano, at a 10% change in F1-score. Apart from that, the absolute performance of the instruments largely resembles the findings of the CQT-based study.
6. CONCLUSIONS
In this paper we attempt to perform polyphonic instrument classification from monophonic music data using their raw audio waveforms. We experiment with various architectures that are favourable towards waveforms, like Fully Convolutional and Residual Nets, and we also attempt to embed a recurrent module into the optimal architecture so as to fuse their discriminating information. A residual FCN-BiGRU model with a total of 1 million parameters outperforms the state-of-the-art model, which utilizes CQT spectrograms and holds 24 million parameters, in the F1-macro metric by 4%, while it is comparable in the F1-micro and LRAP metrics. A more thorough experiment on the performance of each instrument independently shows that brass instruments are identified more easily through waveforms, while predominant instruments, like the piano or the electric guitar, benefit more from time-frequency representations. Future work should therefore deal with alternative methods to exploit a recurrent neural network fed with raw waveforms, as well as with methods to enhance the recognition performance of predominant instruments. Finally, additional experiments should address how noise, which is highly incorporated within waveform signals, affects the model's capabilities.

7. REFERENCES

[1] C. C. Chatterjee, M. Mulimani, and S. G. Koolagudi, "Polyphonic sound event detection using transposed convolutional recurrent neural network," in Proc. IEEE Int'l Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2020.
[2] B. Puterka and J. Kacur, "Time window analysis for automatic speech emotion recognition," in Proc. Int'l Symposium ELMAR, 2018.
[3] S. Gururani, M. Sharma, and A. Lerch, "An attention mechanism for musical instrument recognition," in Proc. Int'l Society for Music Information Retrieval Conference (ISMIR).
[4] A. Kratimenos, K. Avramidis, C. Garoufis, A. Zlatintsi, and P. Maragos, "Augmentation methods on monophonic audio for instrument classification in polyphonic music," in Proc. European Signal Processing Conference (EUSIPCO), 2020.
[5] S. H. Bae, I. Choi, and N. S. Kim, "Acoustic scene classification using parallel combination of LSTM and CNN," in Proc. Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE), 2016.
[6] T. Sainath, R. J. Weiss, A. Senior, K. Wilson, and O. Vinyals, "Learning the speech front-end with raw waveform CLDNNs," in Proc. INTERSPEECH, 2015.
[7] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," in Proc. ISCA Speech Synthesis Workshop, Sept. 2016.
[8] J. Chorowski, R. J. Weiss, S. Bengio, and A. van den Oord, "Unsupervised speech representation learning using WaveNet autoencoders," IEEE/ACM Trans. on Audio, Speech, and Language Processing, vol. 27, no. 12, pp. 2041–2053, 2019.
[9] J. W. Kim, J. Salamon, P. Li, and J. P. Bello, "CREPE: A convolutional representation for pitch estimation," CoRR, vol. abs/1802.06182, 2018.
[10] D. Stoller, S. Ewert, and S. Dixon, "Wave-U-Net: A multi-scale neural network for end-to-end audio source separation," in Proc. Int'l Society for Music Information Retrieval Conference (ISMIR), 2018.
[11] Y. Luo and N. Mesgarani, "Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation," IEEE/ACM Trans. Audio Speech Lang. Process., vol. 27, no. 8, pp. 1256–1266, 2019.
[12] J. J. Bosch, J. Janer, F. Fuhrmann, and P. Herrera, "A comparison of sound segregation techniques for predominant instrument recognition in musical audio signals," in Proc. Int'l Society for Music Information Retrieval Conference (ISMIR), 2012.
[13] R. M. Bittner, J. Salamon, M. Tierney, M. Mauch, C. Cannam, and J. Bello, "MedleyDB: A multitrack dataset for annotation-intensive MIR research," in Proc. Int'l Society for Music Information Retrieval Conference (ISMIR), 2014.
[14] P. Herrera-Boyer, G. Peeters, and S. Dubnov, "Automatic classification of musical instrument sounds," Journal of New Music Research, vol. 32, no. 1, pp. 3–21, 2003.
[15] S. Gururani, C. Summers, and A. Lerch, "Instrument activity detection in polyphonic music using deep neural networks," in Proc. Int'l Society for Music Information Retrieval Conference (ISMIR), 2018.
[16] Y. Han, J. Kim, and K. Lee, "Deep convolutional neural networks for predominant instrument recognition in polyphonic music," IEEE/ACM Trans. on Audio, Speech, and Language Processing, vol. 25, pp. 208–221, 2017.
[17] J. Pons, O. Slizovskaia, R. Gong, E. Gómez, and X. Serra, "Timbre analysis of music audio signals with convolutional neural networks," pp. 2744–2748, 2017.
[18] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," in Proc. NIPS 2014 Workshop on Deep Learning, Dec. 2014.
[19] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," CoRR, vol. abs/1409.1556, 2015.
[20] W. Dai, C. Dai, S. Qu, J. Li, and S. Das, "Very deep convolutional neural networks for raw waveforms," in Proc. IEEE Int'l Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2017.
[21] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proc. Int'l Conf. on Machine Learning (ICML), vol. 37, Jul. 2015.
[22] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. Int'l Conf. on Learning Representations (ICLR), May 2015.
[23] R. E. Schapire and Y. Singer, "BoosTexter: A boosting-based system for text categorization," Machine Learning, vol. 39, pp. 135–168, 2000.