Artificially Synthesising Data for Audio Classification and Segmentation to Improve Speech and Music Detection in Radio Broadcast
Satvik Venkatesh¹, David Moffat¹, Alexis Kirke¹, Gözel Shakeri², Stephen Brewster², Jörg Fachner³, Helen Odell-Miller³, Alex Street³, Nicolas Farina⁴, Sube Banerjee⁵, and Eduardo Reck Miranda¹

¹ Interdisciplinary Centre for Computer Music Research, University of Plymouth, UK
² School of Computing Science, University of Glasgow, UK
³ Cambridge Institute of Music Therapy Research, Anglia Ruskin University, UK
⁴ Centre for Dementia Studies, Brighton and Sussex Medical School, UK
⁵ Faculty of Health, University of Plymouth, UK
ABSTRACT
Segmenting audio into homogeneous sections such as music and speech helps us understand the content of audio. It is useful as a pre-processing step to index, store, and modify audio recordings, radio broadcasts, and TV programmes. Deep learning models for segmentation are generally trained on copyrighted material, which cannot be shared. Annotating these datasets is time-consuming and expensive and therefore significantly slows down research progress. In this study, we present a novel procedure that artificially synthesises data that resembles radio signals. We replicate the workflow of a radio DJ in mixing audio and investigate parameters like fade curves and audio ducking. We trained a Convolutional Recurrent Neural Network (CRNN) on this synthesised data and outperformed state-of-the-art algorithms for music-speech detection. This paper demonstrates the data synthesis procedure as a highly effective technique to generate large datasets to train deep neural networks for audio segmentation.
Index Terms — Audio Segmentation, Audio Classification, Music-speech Detection, Training Set Synthesis, Deep Learning
1. INTRODUCTION
Automatically understanding the content of audio data is useful for indexing audio archives, target-based distribution of media, speech recognition, and intelligent remixing. It includes the task of audio segmentation, which divides an audio signal into homogeneous segments. These segments contain audio classes like music, speech, environmental sounds, and noise, to name but a few. The specificity of audio classes depends on the application. For instance, in radio broadcast, some relevant audio classes include music, speech, noise, and silence [1].

Primarily, there are two approaches to audio segmentation: (1) distance-based segmentation and (2) segmentation-by-classification [2]. In the former, boundaries of acoustic events are directly detected. This is done by calculating a distance metric, such as Euclidean distance, Bayesian information criterion (BIC) [3], or generalized likelihood ratio (GLR). For a given audio signal, a distance curve is plotted. The peaks on this distance curve are associated with the boundaries of audio events because they correspond to high acoustic change. The advantage of this technique is that it is generally unsupervised and does not require knowledge of the individual audio classes. However, the disadvantage is that it is more sensitive to dissimilarities within each audio class.

In segmentation-by-classification, as the name suggests, the audio is divided into individual frames, typically in the range of 10 to 25 ms. These frames are independently classified and eventually the boundaries of audio events are detected. Traditionally, this was performed with algorithms like Gaussian mixture models (GMM) [4], support vector machines (SVM), and factor analysis (FA) [5]. In recent years, due to the advances in deep learning, segmentation-by-classification has gained more popularity through neural network architectures like bidirectional long short-term memory (B-LSTM) [6], Convolutional Recurrent Neural Networks (CRNN) [7], and Temporal Convolutional Networks (TCN) [8].

Machine learning models are generally trained using proprietary audio such as television and radio broadcast. This imposes a serious hindrance on the repeatability of research because this audio cannot be shared across different research groups. Annotating these datasets is a time-consuming and expensive task. For example, the study by Schlüter et al. [9] annotated 42 hours of radio broadcast with the help of paid students. Moreover, a dataset called Open Broadcast Media Audio from TV (OpenBMAT) [10] was cross-annotated by three different annotators, and each of them spent approximately 130 hours to annotate 27.4 hours of audio. As the labels in these datasets need to be precise enough for the models to train, the annotations in such datasets are generally verified by at least one other person. These factors impose many challenges for a researcher who wants to freshly explore audio segmentation.

The literature comprises many datasets that contain individual files of music and speech. However, these files are different from broadcast audio because broadcast audio is well-mixed. To our knowledge, the only openly available annotated database for this task is the MuSpeak dataset [11], which contains approx. 5 hours of audio. Moreover, OpenBMAT [10] focused on estimating the relative loudness of music, but not speech and music detection.

In this paper, we present a novel approach to artificially synthesise audio that resembles a radio broadcast. We replicate the process of a radio DJ in mixing audio content. This was done by investigating fade curves, audio ducking, fade durations, and silences. The artificially mixed audio only uses openly available music-speech datasets that contain individual files of music and speech. Using this data synthesis procedure, large amounts of training data can be generated to train deep neural networks. The trained models are useful for real-world applications and achieve state-of-the-art performance on human-labelled datasets. The implementation, code, and pre-trained models associated with this study are openly available in a GitHub repository: https://github.com/satvik-venkatesh/audio-seg-data-synth/

This study was supported by Engineering and Physical Sciences Research Council (EPSRC) grants EP/S026991/1, EP/S027491/1, EP/S027203/1, and EP/S026959/1.
2. DATA SYNTHESIS

2.1. Datasets
In this study, we used datasets that contain audio files labelled as either music or speech. We did not use data that contains mixed audio. Instead, the radio content was synthesised by combining and mixing the music and speech data together. We used the MUSAN corpus [12], the GTZAN music and speech detection dataset [13], and the Scheirer & Slaney dataset [14]. These are the commonly used datasets in music-speech detection studies [8]. When we conducted initial tests with our neural network, we observed that there were confusions between wind instruments like flute and speech. Additionally, some vocal sections without accompaniment were confused with speech. Therefore, we extended our data repository to include the Instrument Recognition in Musical Audio Signals (IRMAS) dataset, which contains many examples of wind instruments [15], the GTZAN genre recognition dataset [16] for additional music examples, the Singing Voice Audio Dataset, which contains unaccompanied vocals [17], and a section of the LibriSpeech corpus [18] for more speech examples. We also considered noise examples from the MUSAN corpus to enable the neural network to detect task-irrelevant examples. These are sounds that cannot be labelled as either music or speech, for instance, environmental sounds, babble noise, unintelligible speech, footsteps, and so on. The total number of audio files for music, speech, and noise was 6876, 6885, and 665 respectively.
2.2. Audio Transitions

In radio programmes, shifts between music and speech (and vice versa) are generally smoothed through transitions. We broadly observed two types of transitions, which we have termed normal transition and cross-fade transition. In a normal transition, an audio event is faded out, followed by a short period of silence, and then a new audio event is faded in. An example can be found in Figure 1a. In a cross-fade transition, as the name suggests, the two audio signals overlap: while one is fading out, the other is fading in, as shown in Figure 1b.

Fig. 1. Two types of audio transitions: (a) normal fade transition; (b) cross-fade transition.

2.3. Fade Curves

During audio mixing, engineers use different fade curves depending on the context. We have considered four popular curves that are commonly used in mixing [19]: linear, exponential convex, exponential concave, and s-curve. Figure 2 illustrates the types of fade curves.
Fig. 2. Four types of fade curves: linear, exponential convex, exponential concave, and s-curve.
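For illustration, the four fade shapes can be generated with a few lines of NumPy. The sketch below is not taken from the released implementation; in particular, the power-law approximation of the exponential shapes and the raised-cosine s-curve are assumptions.

```python
import numpy as np

def fade_curve(n_samples: int, shape: str = "linear") -> np.ndarray:
    """Return a fade-in gain ramp from 0 to 1; reverse it for a fade-out."""
    t = np.linspace(0.0, 1.0, n_samples)
    if shape == "linear":
        return t
    if shape == "exp_convex":    # rises quickly, then flattens
        return t ** 0.5
    if shape == "exp_concave":   # stays low, then rises quickly
        return t ** 2.0
    if shape == "s_curve":       # raised cosine, smooth at both ends
        return 0.5 * (1.0 - np.cos(np.pi * t))
    raise ValueError(f"unknown fade shape: {shape}")

# Example: a 2 s s-curve fade-out at 22.05 kHz
fade_out = fade_curve(int(2 * 22050), "s_curve")[::-1]
```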
2.4. Randomising the Mixing Parameters

Each audio example that was synthesised in this study was 8 s long. We felt that 8 s was a long enough duration to capture an entire fade curve. The three audio classes, speech, music, and noise, were stored in different directories. Each time an audio class was chosen for the data synthesis, a random file was selected from the entire list of files and then a random segment was extracted.

For an audio transition, the two audio events and a time stamp for the transition are randomly chosen, for example, music to speech, speech to noise, or speech to speech, transitioning at a specific time. Note that we also allowed repetition of the same audio class, such as speech to speech, because this corresponds to cases like interviews.

To cover a wide range of possibilities that can occur while mixing radio programmes, we randomised the various parameters of audio transitions. Each time an audio example was synthesised, a random fade curve was chosen. Subsequently, a random fade duration was chosen from a uniform distribution ranging from 0 s to the maximum possible duration. For a normal fade transition, the gap of silence between audio events was also randomised.
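The sketch below illustrates how one randomly parameterised cross-fade could be applied to two 8 s, peak-normalised events, reusing fade_curve() from the sketch above. The helper names, the 0 to 4 s fade-duration range, and the margins around the transition time stamp are assumed for illustration; a normal transition would additionally insert a randomised gap of silence between the fade-out and the fade-in.

```python
import numpy as np

SR = 22050
EXAMPLE_LEN = 8 * SR                   # every synthesised example is 8 s long
rng = np.random.default_rng()

def cross_fade(a: np.ndarray, b: np.ndarray, fade: np.ndarray, t: int) -> np.ndarray:
    """Mix two events into one example: `a` fades out and `b` fades in,
    starting at sample `t`, over len(fade) samples."""
    n = len(fade)
    out = np.zeros(EXAMPLE_LEN)
    out[:t + n] = a[:t + n]                       # first event up to the end of its fade-out
    out[t:t + n] *= fade[::-1]                    # fade the first event out
    gain_in = np.concatenate([fade, np.ones(EXAMPLE_LEN - t - n)])
    out[t:] += b[:EXAMPLE_LEN - t] * gain_in      # fade the second event in, then hold
    return out

def random_transition(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """One cross-fade with a random curve, duration, and time stamp."""
    fade_len = int(rng.uniform(0.0, 4.0) * SR)               # assumed duration range
    t = int(rng.integers(SR, EXAMPLE_LEN - fade_len - SR))   # keep the fade inside the example
    shape = rng.choice(["linear", "exp_convex", "exp_concave", "s_curve"])
    return cross_fade(a, b, fade_curve(fade_len, shape), t)  # fade_curve() from the sketch above
```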
2.5. Audio Ducking

In radio programmes, it is very common to have background music playing alongside foreground speech. Audio ducking is the process of reducing the volume of the background music. It is generally performed to make the speech intelligible. Many radio broadcasters have their own guidelines for audio ducking [20]. Therefore, in order to artificially synthesise audio examples with background music, it is important for us to consider these guidelines.

We adopted the integrated loudness metric from ITU-R BS.1770-4 [21] to calculate the loudness of audio. This is measured in loudness units (LU). There is no ideal loudness difference (LD) between speech and background music because it is highly subjective. Commonly, the literature recommends a minimum LD of 7 to 10 LU [20]. Moreover, in cases of very quiet background music, the LD can be as high as 23 LU.

In order to implement our data synthesis procedure, we require a minimum and maximum LD from which to choose random values from a uniform distribution. We empirically observed the average performance of the network over multiple training cycles with different LDs and also manually listened to synthesised audio examples. We set the LD range to be between 7 and 18 LU.

In radio programmes, audio ducking can be performed through either volume automation or side-chain compression. We chose the former technique because it was relatively straightforward to achieve accurate LDs during data synthesis.
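One way to realise the loudness-matched ducking is with the open-source pyloudnorm package, which implements ITU-R BS.1770 integrated loudness. The package choice and the helper below are illustrative; the text only specifies BS.1770-4 loudness, volume automation, and an LD drawn uniformly between 7 and 18 LU.

```python
import numpy as np
import pyloudnorm as pyln

SR = 22050

def duck_background(speech: np.ndarray, music: np.ndarray,
                    ld_range=(7.0, 18.0)) -> np.ndarray:
    """Scale `music` so that it sits a random number of LU below `speech`,
    then return the mix."""
    rng = np.random.default_rng()
    meter = pyln.Meter(SR)                        # BS.1770 integrated loudness meter
    l_speech = meter.integrated_loudness(speech)
    l_music = meter.integrated_loudness(music)
    target_ld = rng.uniform(*ld_range)            # loudness difference in LU
    gain_db = (l_speech - target_ld) - l_music    # gain to apply to the music
    return speech + music * 10.0 ** (gain_db / 20.0)
```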
2.6. Combinations of Audio Classes

There are different combinations of audio classes that occur in the synthesised examples: music, speech, noise/silence, and speech over background music. Each example has either no transitions (that is, purely a single audio class) or one transition (that is, two audio classes connected through fade curves), each with a probability of 0.5. An overview of the data synthesis procedure can be found in Figure 3.
Fig. 3. An overview of the data synthesis procedure: choose the number of audio transitions and the time stamp, pick the audio classes, choose the fade curves and transition type, pick random sound files and segments, and synthesise the example.
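Putting the pieces together, the top level of the generator might look like the sketch below, which reuses random_transition() and duck_background() from the earlier sketches. The stub segment loader, the uniform choice among the no-transition cases, and the omission of the frame-level music/speech labels are simplifications for illustration.

```python
import numpy as np

SR = 22050
CLASSES = ["music", "speech", "noise"]
rng = np.random.default_rng()

def load_random_segment(audio_class: str) -> np.ndarray:
    """Stand-in loader: the real pipeline picks a random file from the class
    directory and cuts a random 8 s segment; here we simply return noise."""
    return 0.1 * rng.standard_normal(8 * SR)

def synthesise_example() -> np.ndarray:
    """Return one 8 s example; frame-level labels are built alongside in the
    real pipeline and are omitted here."""
    if rng.random() < 0.5:
        # One transition between two (possibly identical) audio classes.
        c1, c2 = rng.choice(CLASSES), rng.choice(CLASSES)
        return random_transition(load_random_segment(c1), load_random_segment(c2))
    # No transition: a single class, or speech over ducked background music.
    c = rng.choice(CLASSES + ["speech_over_music"])
    if c == "speech_over_music":
        return duck_background(load_random_segment("speech"),
                               load_random_segment("music"))
    return load_random_segment(c)
```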
3. EXPERIMENTS

3.1. Pre-processing and Feature Extraction
All audio files used in this study were resampled to 22.05 kHz mono signals. Silences in the audio files were removed or shortened using Sound eXchange (SoX). For our data synthesis to work smoothly, we required the audio files to have a minimum duration of 8 s. Some datasets, such as IRMAS, contain only 3 s audio files. To address this, we looped the audio to obtain the required duration.

Mel spectrograms have been commonly adopted by audio segmentation studies to extract features [6, 8]. We set the hop size to 220 samples (10 ms) and the FFT size to 1024 samples (46 ms). The selected audio segments were peak-normalised before synthesising an example. After synthesis, the whole example was peak-normalised again. We extracted 80 log-scaled Mel bands from 64 Hz to 8 kHz.
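Assuming a standard toolchain such as librosa (the library choice is an assumption; only the parameters are stated in the text), the feature extraction could be written as:

```python
import numpy as np
import librosa

def extract_features(path: str) -> np.ndarray:
    """Log-Mel spectrogram: 22.05 kHz mono, 1024-sample FFT (~46 ms),
    220-sample hop (10 ms), 80 Mel bands from 64 Hz to 8 kHz."""
    y, sr = librosa.load(path, sr=22050, mono=True)
    y = y / (np.max(np.abs(y)) + 1e-9)            # peak normalisation
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1024, hop_length=220,
        n_mels=80, fmin=64, fmax=8000)
    return librosa.power_to_db(mel).T             # shape: (time_steps, 80)
```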
3.2. Validation and Test Data

We are using artificially synthesised data for training. As these are not real-world examples, we cannot use synthesised data for validation and testing. We incorporated the MuSpeak dataset, which contains 5 h 14 m 14 s of audio. We also collected 9 h of broadcast audio from BBC Radio Devon, which was manually annotated by the authors. The audio in the BBC dataset was split into files of 1 h. Three hours of our annotations were verified by an external audio mixing engineer who was not involved in the research and was paid for his time. Additionally, a random section of 15 minutes was blind-annotated by him independently. We found an agreement of 99.49% with our annotations by using 10 ms segments. This was done to ensure that audio events were similarly perceived by different people. In order to explore the robustness of data synthesis, we did not use any of this data for training. The data from the MuSpeak and BBC datasets was shuffled and used as validation and test sets in a 50-50% split.

In order to compare our model with other state-of-the-art algorithms, we also evaluated it on dataset number 1 of the Music Information Retrieval Evaluation eXchange (MIREX) 2018 music and speech detection competition (https://music-ir.org/mirex/wiki/2018:Music and/or Speech Detection). This dataset contains 27 hours of audio from various TV programmes. Although our data synthesis was designed for radio programmes, this dataset would provide us with a good evaluation of our model.

3.3. Network Architecture

For this study, we adopted a CRNN, which is a state-of-the-art architecture for audio classification and segmentation tasks [7, 22]. The input shape of the network was 802 × 80 × 1, equivalent to 802 time steps and 80 Mel bins. The output of the network comprised 802 × 2 neurons with sigmoid activations, where two neurons perform a binary classification for music and speech at every time step. The network performs multi-output detection, independently detecting the regions of music and speech. This is important for models working with radio data because music and speech can occur simultaneously. Binary cross-entropy was used as the loss function.

We used the Adam optimizer with a constant learning rate of 0.001 and a batch size of 128. The first two layers of the network were 2D convolutional layers with a kernel size of 7 and a stride of 1. The input was padded with zeros such that 'same' convolutions were performed to ensure that the time resolution remains the same. The next two layers were bidirectional gated recurrent units (B-GRU) with 80 units each.

In this study, we evaluated the model using different training sets, as explained in section 3.4. Hence, a model architecture was finalised by optimising the performance across the different training datasets. For regularisation, we implemented early stopping and used batch normalisation after all the layers. Max pooling along the dimension of Mel bins was performed after the convolutional layers. A dropout of 0.2 was added only after the convolutional layers because we observed that it was not effective for the B-GRU layers. A sketch of this architecture appears after the list of training sets below.

3.4. Training Sets

In order to evaluate the effectiveness of our data synthesis algorithm, we constructed four training datasets. All datasets contain 40960 examples of 8 s audio (approximately 91 h of audio). Initial tests conveyed that this was an adequate number of examples to train the network.

1. Dataset-only files (d-OF): This dataset contains audio segments of only speech, music, or noise. There was no mixing of audio events within each example. 40960 examples were randomly sampled from our data repository.
We did not include the whole corpus because of computational limitations and to manage redundancy.

2. Dataset-only files and background music (d-OFB): In addition to d-OF, this dataset contains examples of speech over background music. The volume of the background music was normalised according to the method explained in section 2.5. However, this dataset did not contain any audio transitions.
3. Dataset-no normalisation (d-NN): In this dataset, the data synthesis was performed as explained in section 2, except for the loudness normalisation of background music according to the loudness of foreground speech. However, all examples of speech, music, and noise were peak-normalised before synthesis.

4. Dataset-data synthesis (d-DS): In this dataset, the data synthesis was performed exactly as explained in section 2.
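A minimal Keras sketch of the CRNN described in Section 3.3 is given below. The number of convolutional filters, the pooling factor over Mel bins, and the exact placement of batch normalisation relative to the activations are not stated in the text and are therefore assumptions; the released implementation may differ.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_crnn(time_steps: int = 802, n_mels: int = 80) -> tf.keras.Model:
    """Two 7x7 'same' convolutions, max pooling over Mel bins only, two
    bidirectional GRU layers (80 units each), and per-frame sigmoid outputs
    for music and speech."""
    inp = layers.Input(shape=(time_steps, n_mels, 1))
    x = inp
    for n_filters in (16, 16):                        # filter counts assumed
        x = layers.Conv2D(n_filters, kernel_size=7, strides=1, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
        x = layers.MaxPooling2D(pool_size=(1, 5))(x)  # pool along the Mel axis only
        x = layers.Dropout(0.2)(x)
    x = layers.Reshape((time_steps, -1))(x)           # (time, remaining Mel bins * filters)
    x = layers.Bidirectional(layers.GRU(80, return_sequences=True))(x)
    x = layers.BatchNormalization()(x)
    x = layers.Bidirectional(layers.GRU(80, return_sequences=True))(x)
    x = layers.BatchNormalization()(x)
    out = layers.TimeDistributed(layers.Dense(2, activation="sigmoid"))(x)
    model = models.Model(inp, out)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="binary_crossentropy")
    return model
```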
3.5. Post-processing

A threshold of 0.5 was used to make binary classifications on the output layer. The length of each file in the test set was approximately 1 h. We traversed the audio file with a window size of 8 s and a hop size of 6 s. We discarded the predictions made on the first and last second of each audio example because they might be unreliable. This technique was adopted from the study by Gimeno et al. [6].

In the audio segmentation pipeline, predictions made by the model are generally sent through a post-processing phase to remove spurious transitions and events. This is done through either median filtering [6, 9] or setting thresholds for minimum durations of audio events [8]. We adopted the latter approach and set thresholds for minimum speech duration, minimum music duration, maximum silence between speech, and maximum silence between music. These values were obtained from the study by Lemaire et al. [8] and set to 1.3 s, 3.4 s, 0.4 s, and 0.6 s respectively.
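The minimum-duration rule can be expressed as two passes over a binary frame sequence at 10 ms resolution: first fill short gaps, then delete short events. The sketch below is one way to realise it, not the authors' exact post-processing code.

```python
import numpy as np

FRAME = 0.01  # 10 ms frames

def _remove_short_runs(x: np.ndarray, min_len: int) -> np.ndarray:
    """Set runs of True shorter than min_len frames to False."""
    x = x.copy()
    edges = np.flatnonzero(np.diff(np.r_[0, x.astype(int), 0]))
    for start, end in zip(edges[::2], edges[1::2]):
        if end - start < min_len:
            x[start:end] = False
    return x

def smooth(active: np.ndarray, min_event_s: float, max_gap_s: float) -> np.ndarray:
    """Fill gaps shorter than max_gap_s, then remove events shorter than min_event_s."""
    active = active.astype(bool)
    active = ~_remove_short_runs(~active, int(round(max_gap_s / FRAME)))
    return _remove_short_runs(active, int(round(min_event_s / FRAME)))

# Example usage with a threshold of 0.5 on the sigmoid outputs
# (the column order of music and speech is an assumption):
# music  = smooth(pred[:, 0] > 0.5, min_event_s=3.4, max_gap_s=0.6)
# speech = smooth(pred[:, 1] > 0.5, min_event_s=1.3, max_gap_s=0.4)
```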
4. RESULTS
To evaluate the models, we adopted metrics implemented in the sed_eval toolbox [23], which has been widely adopted by audio event detection studies [8, 22, 24]. The segment-level evaluation was performed with a segment size of 10 ms. Table 1 presents the model's performance on the different training datasets. The highest overall F-measure was obtained by d-DS, which implemented the entire data synthesis procedure. The F-measures of d-OF and d-OFB were at least 3% lower than d-DS because their datasets did not contain audio transitions. This demonstrates that modelling radio DJ-like transitions is an effective technique. Additionally, there is only a marginal difference between d-OF and d-OFB, which shows that adding background music to speech in the training examples is not sufficient on its own; there also need to be audio transitions.

The dataset d-NN contained background music that was peak-normalised, but not normalised with respect to the loudness of foreground speech. The music F-measure of d-DS surpasses that of d-NN by more than 2%. This shows that randomising the loudness of background music with respect to foreground speech within an LD of 7 to 18 LU was an effective method. The speech F-measure for d-NN was slightly greater than for d-DS. However, this might be because the background music in d-NN was at a relatively constant volume, which improves speech detection but compromises music detection.
Dataset   F_overall   F_s     F_m
d-OF      93.54       94.58   92.99
d-OFB     93.68       94.95   92.99
d-NN      95.33
d-DS

Table 1. The F-measure of our CRNN model trained on different datasets.

Table 2 shows the segment-level evaluation of our d-DS model on the MIREX speech and music detection dataset. The evaluations of the other submissions were obtained from the MIREX website. Our model significantly outperforms the other models on F-measure for music. This is attributed to the presence of audio transitions and the loudness normalisation of background music in the synthesised dataset. Our model also obtains the highest F-measure for speech detection.

All the other submissions in the competition used real-world data [25, 26]. Therefore, these results demonstrate that our data synthesis is a highly effective approach for audio segmentation. Moreover, there was another task in MIREX 2018 that was solely for music detection. Our model places second in this task, preceded by the submission by Meléndez-Catalán et al. [27]. Their model was trained on 30 hours of TV programmes, which come from the same data distribution as the evaluation set. It is important to note that the MIREX evaluation dataset can contain background music over foreground speech, audience noises, sound effects, everyday-life sounds, sounds of the city, and so on. As our data synthesis procedure only considered foreground speech, this explains the poorer precision for music in Table 2: our model predicted many of the sound effects as music. The performance of our model on TV programmes could be improved by considering these factors in the data synthesis.

Algo.   F_m     P_m     R_m     F_s     P_s     R_s
[25]    49.36   62.4    40.82   77.18
d-DS

Table 2. F-measure, precision, and recall of our CRNN model trained on d-DS and other algorithms, evaluated on dataset number 1 of the MIREX 2018 speech and music detection competition.
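For reference, segment-level metrics of this kind can be computed with the sed_eval toolbox [23] roughly as follows; the event-list construction, label names, and file names are illustrative.

```python
import sed_eval
import dcase_util

def to_event_list(events, filename="radio_recording.wav"):
    """events: iterable of (onset_s, offset_s, label) triples."""
    return dcase_util.containers.MetaDataContainer(
        [{"filename": filename, "onset": on, "offset": off, "event_label": lab}
         for on, off, lab in events])

reference = to_event_list([(0.0, 120.5, "music"), (118.0, 310.2, "speech")])
estimated = to_event_list([(0.0, 119.0, "music"), (119.5, 309.0, "speech")])

metrics = sed_eval.sound_event.SegmentBasedMetrics(
    event_label_list=["music", "speech"],
    time_resolution=0.01)   # 10 ms segments, as in the evaluation above
metrics.evaluate(reference_event_list=reference, estimated_event_list=estimated)
print(metrics.results_overall_metrics()["f_measure"])
```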
5. CONCLUDING DISCUSSIONS
In this study, only artificially synthesised data was used to train a model for audio segmentation and classification. We adopted a training dataset belonging to a different distribution from the validation and test sets. Despite this, we obtained a high F-measure on our local test set. Furthermore, we obtained state-of-the-art performance for speech and music detection on the MIREX 2018 competition dataset.

There were noticeable differences between the BBC Radio Devon recordings and the data repository we used for data synthesis. The BBC recordings have greater dynamic range compression, cleaner speech, and generally use side-chain compression for audio ducking. Therefore, including a small number of radio recordings in the training dataset might improve the model's performance. Additionally, incorporating audio effects like dynamic range compression in the data synthesis pipeline might improve the model's performance.

Many studies have suggested end-to-end deep learning as a potential pathway for future audio classification and segmentation research [8, 28]. However, it requires much more data than using Mel spectrograms as features. As labelling large amounts of data is an expensive and time-consuming task, our data synthesis procedure serves as a potential solution to generate large amounts of training data and advance the state of the art in audio segmentation and classification systems.

6. REFERENCES

[1] Theodoros Theodorou, Iosif Mporas, and Nikos Fakotakis, "An overview of automatic audio segmentation," International Journal of Information Technology and Computer Science (IJITCS), vol. 6, no. 11, pp. 1, 2014.
[2] Taras Butko and Climent Nadeu, "Audio segmentation of broadcast news in the Albayzin-2010 evaluation: overview, results, and discussion," EURASIP Journal on Audio, Speech, and Music Processing, vol. 2011, no. 1, pp. 1, 2011.
[3] Hao Xue, HaiFeng Li, Chang Gao, and ZiQiang Shi, "Computationally efficient audio segmentation through a multi-stage BIC approach," IEEE, 2010, vol. 8, pp. 3774–3777.
[4] Marko Kos, Matej Grasic, Damjan Vlaj, and Zdravko Kacic, "On-line speech/music segmentation for broadcast news domain," IEEE, 2009, pp. 1–4.
[5] Diego Castán, Alfonso Ortega, Antonio Miguel, and Eduardo Lleida, "Audio segmentation-by-classification approach based on factor analysis in broadcast news domain," EURASIP Journal on Audio, Speech, and Music Processing, vol. 2014, no. 1, pp. 34, 2014.
[6] Pablo Gimeno, Ignacio Viñals, Alfonso Ortega, Antonio Miguel, and Eduardo Lleida, "Multiclass audio segmentation based on recurrent neural networks for broadcast domain data," EURASIP Journal on Audio, Speech, and Music Processing, vol. 2020, no. 1, pp. 1–19, 2020.
[7] Keunwoo Choi, György Fazekas, Mark Sandler, and Kyunghyun Cho, "Convolutional recurrent neural networks for music classification," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 2392–2396.
[8] Quentin Lemaire and Andre Holzapfel, "Temporal convolutional networks for speech and music detection in radio broadcast," in International Society for Music Information Retrieval Conference (ISMIR), 2019.
[9] Jan Schlüter and Reinhard Sonnleitner, "Unsupervised feature learning for speech and music detection in radio broadcasts," in Proceedings of the 15th International Conference on Digital Audio Effects (DAFx), 2012.
[10] Blai Meléndez-Catalán, Emilio Molina, and Emilia Gómez, "Open Broadcast Media Audio from TV: A dataset of TV broadcast audio with relative music loudness annotations," Transactions of the International Society for Music Information Retrieval, vol. 2, no. 1, 2019.
[11] MuSpeak Team, "MIREX MuSpeak sample dataset," 2015, http://mirg.city.ac.uk/datasets/muspeak/ [Last accessed on 13-10-2020].
[12] David Snyder, Guoguo Chen, and Daniel Povey, "MUSAN: A music, speech, and noise corpus," arXiv preprint arXiv:1510.08484, 2015.
[13] George Tzanetakis and Perry Cook, "MARSYAS: A framework for audio analysis," Organised Sound, vol. 4, no. 3, pp. 169–175, 2000.
[14] Eric Scheirer and Malcolm Slaney, "Construction and evaluation of a robust multifeature speech/music discriminator," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1997, vol. 2, pp. 1331–1334.
[15] Juan J. Bosch, Jordi Janer, Ferdinand Fuhrmann, and Perfecto Herrera, "A comparison of sound segregation techniques for predominant instrument recognition in musical audio signals," in International Society for Music Information Retrieval Conference (ISMIR), 2012, pp. 559–564.
[16] George Tzanetakis and Perry Cook, "Musical genre classification of audio signals," IEEE Transactions on Speech and Audio Processing, vol. 10, no. 5, pp. 293–302, 2002.
[17] Dawn A. A. Black, Ma Li, and Mi Tian, "Automatic identification of emotional cues in Chinese opera singing," 2014.
[18] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, "LibriSpeech: an ASR corpus based on public domain audio books," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210.
[19] Eric Tarr, Hack Audio: An Introduction to Computer Programming and Digital Signal Processing in MATLAB, Routledge, 2018.
[20] Matteo Torcoli, Alex Freke-Morin, Jouni Paulus, Christian Simon, and Ben Shirley, "Background ducking to produce esthetically pleasing audio for TV with clear speech," in Audio Engineering Society Convention 146, 2019.
[21] ITU-R, "ITU-R Rec. BS.1770-4: Algorithms to measure audio programme loudness and true-peak audio level," 2017.
[22] Emre Cakır, Giambattista Parascandolo, Toni Heittola, Heikki Huttunen, and Tuomas Virtanen, "Convolutional recurrent neural networks for polyphonic sound event detection," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 6, pp. 1291–1303, 2017.
[23] Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen, "Metrics for polyphonic sound event detection," Applied Sciences, vol. 6, no. 6, pp. 162, 2016.
[24] Annamaria Mesaros, Toni Heittola, Aleksandr Diment, Benjamin Elizalde, Ankit Shah, Emmanuel Vincent, Bhiksha Raj, and Tuomas Virtanen, "DCASE 2017 challenge setup: Tasks, datasets and baseline system," in Workshop on Detection and Classification of Acoustic Scenes and Events, 2017.
[25] Minsuk Choi, Jongpil Lee, and Juhan Nam, "Hybrid features for music and speech detection," Music Information Retrieval Evaluation eXchange (MIREX), 2018.
[26] Matija Marolt, "Music/speech classification and detection submission for MIREX 2018," Music Information Retrieval Evaluation eXchange (MIREX), 2018.
[27] Blai Meléndez-Catalán, E. Molina, and E. Gomez, "Music and/or speech detection MIREX 2018 submission," Music Information Retrieval Evaluation eXchange (MIREX), 2018.
[28] Jongpil Lee, Taejun Kim, Jiyoung Park, and Juhan Nam, "Raw waveform-based audio classification using sample-level CNN architectures," in 31st Conference on Neural Information Processing Systems (NIPS), 2017.