DESED-FL and URBAN-FL: Federated Learning Datasets for Sound Event Detection
David S. Johnson∗, Wolfgang Lorenz, Michael Taenzer, Stylianos Mimilakis, Sascha Grollmisch, Jakob Abeßer, Hanna Lukashevich
Fraunhofer Institute for Digital Media Technology (IDMT), Ilmenau, Germany
∗[email protected]

Abstract—Research on sound event detection (SED) in environmental settings has seen increased attention in recent years. The large amounts of (private) domestic or urban audio data needed raise significant logistical and privacy concerns. The inherently distributed nature of these tasks makes federated learning (FL) a promising approach to take advantage of large-scale data while mitigating privacy issues. While FL has also seen increased attention recently, to the best of our knowledge there is no research towards FL for SED. To address this gap and foster further research in this field, we create and publish novel FL datasets for SED in domestic and urban environments. Furthermore, we provide baseline results on the datasets in an FL context for three deep neural network architectures. The results indicate that FL is a promising approach for SED, but faces challenges with divergent data distributions inherent to distributed client edge devices.
Index Terms—federated learning, sound event detection, deep learning, distributed learning
I. INTRODUCTION
The aim of sound event detection (SED) is to automatically identify the occurrence of target sound events, such as glass breaking or dog barking, within an audio signal capturing an acoustic scene. Identifying these sound events within complex scenes is a challenging and open research problem that has attracted much attention in recent years, as observed in the growing literature and particularly in the growing research interest within the DCASE community (http://dcase.community). Two common SED use-cases are acoustic monitoring in domestic [1] and urban [2] environments. In both scenarios, the use of audio to train detection models in a centralized training context raises considerable privacy concerns: these environments contain speech and other confidential sounds that should not be shared or stored insecurely.

State-of-the-art approaches to SED are most commonly based on deep learning [3], which requires large centralized datasets for model training, posing significant security and logistical challenges. Federated learning (FL) offers an attractive approach to mitigate some of these concerns. Instead of sending private data to a centralized data store, FL performs model training directly on many client edge devices (from here on referred to as clients) using locally stored data. The clients then share only their updated parameters with a coordination server, which aggregates the shared parameters to update a global model. The new global model is then transferred back to the clients. This process continues until convergence, or indefinitely if new data is continuously acquired [4].

Current research in FL has focused on image- or text-based tasks. To our knowledge, the only known research or practical applications of FL in the audio domain are related to keyword spotting [5]–[7]. Due to the limited research on FL for SED, there remain questions about the effectiveness of the approach due to varying acoustic conditions inherent to distributed clients.
For example, data may be captured from clients in multiple locations with different background noise characteristics, or in locations with only a subset of the sound event classes. This leads to differences in data distributions amongst the clients involved in the training process. For centralized training, data from multiple devices is combined into a single training dataset that is typically assumed to be independent and identically distributed (IID); with FL, however, distributed data collection leads to models being trained using data from divergent distributions, i.e., data that is non-IID. Existing SED datasets do not capture the non-IID characteristics seen with FL. To address this gap and foster research on FL for SED, we contribute novel SED datasets specifically designed for FL training. Additionally, we provide baseline results for three neural network architectures to evaluate the effects of FL hyperparameters and non-IID data on SED performance.

II. RELATED WORK
A. Federated Learning
McMahan et al. first proposed the idea of FL as a method to allow data to remain on distributed devices while training a shared model by aggregating locally trained updates [8]. For a comprehensive overview of FL, refer to the technical report by Kairouz et al. [4]. There are two main challenges for FL methods. The first is the need to communicate over unreliable networks to transmit data. A second issue results from data captured in varying contexts, leading to datasets that are statistically dissimilar, i.e., non-IID, between clients. To address these challenges, Sattler et al. [9] proposed a compression framework, sparse ternary compression (STC). Similarly, Lin et al. [10] and Bernstein et al. [11] proposed methods to remove the redundancies of gradient information in node-distributed learning frameworks. Hsieh et al. evaluated the challenges of non-IID data in an FL scenario [12]. They identified problems with the batch normalization layer, a common layer in many deep neural network (DNN)

TABLE I: Sound event and background classes for each of the datasets.
Dataset | Sound Events | Background Noise Types
DESED-FL | e1: Dishes; e2: Cat; e3: Frying; e4: Dog; e5: Blender; e6: Speech; e7: Vacuum cleaner; e8: Electric shaver/toothbrush; e9: Alarm bell; e10: Running water | apartment room, computer interior, computer lab, emergency staircase, and library
URBAN-FL | e1: Children playing; e2: Siren; e3: Drilling; e4: Street music; e5: Car horn; e6: Gunshot; e7: Jackhammer; e8: Dog bark; e9: Air conditioner; e10: Engine idling | birds, crowd, fountain, rain, and traffic

architectures, and proposed to use group normalization [13] instead. Similarly, to address the problems of non-IID data, Sattler et al. [14] proposed a clustering operation to group clients whose data distributions have similar characteristics. While the previous research proposed methods to overcome challenges in FL, there are no known datasets on which to evaluate them for SED. We address this gap by presenting new SED datasets specifically designed for FL with non-IID data.

B. Sound Event Detection
State-of-the-art SED algorithms build upon deep neural networks, the most common being convolutional neural network (CNN) and convolutional recurrent neural network (CRNN) based architectures. Both architectures include convolutional front-ends, where multiple convolutional layers are trained to learn sound-specific features. As input to the network, either fixed two-dimensional signal transformations such as mel spectrograms [15] or raw one-dimensional audio samples are used (end-to-end learning) [16]. As a back-end, CNNs use fully-connected layers for classification, whereas CRNNs employ recurrent layers such as gated recurrent unit (GRU) or long short-term memory (LSTM) layers to model the temporal progression of the extracted features. We focus our work on CNN architectures as a lightweight approach, to enable model training on the low-resource devices required by FL.

Training SED models requires strongly labeled datasets in which onset and offset times are labeled for each sound event. Because of the laborious effort required to annotate real-world samples, researchers often use synthetically generated datasets. Creating synthetic datasets requires mixing events from a curated sound bank with a background signal to generate soundscapes with multiple, possibly overlapping, events. For example, the URBAN-SED dataset [17] is composed of sound events from the UrbanSound8K (URBAN-8K) [2] dataset mixed with Brownian noise. A recent trend is to use a combination of synthetic and real recordings for training and evaluation, as in the Domestic Environment Sound Event Detection (DESED) dataset [1]. For both datasets, however, sound events are distributed uniformly during soundscape generation. For the FL context, sound events should instead be distributed in a structured fashion to simulate real-world distributed learning conditions. To enable research in FL for SED, our proposed datasets distribute soundscapes to simulated clients with different background characteristics and class distributions.

III. DATASETS
In this section, we present DESED-FL and URBAN-FL, datasets for acoustic monitoring of domestic and urban environments with FL. Each use-case contains two independent training sets: an IID dataset, in which sound event classes are distributed evenly amongst devices, and a non-IID dataset, in which only a subset of the overall classes is assigned to each client. To imitate different acoustic recording conditions, we mix the sound events with one of five background noise classes. Each training dataset is pre-partitioned into 100 simulated clients with 20 clients per background noise class. It is possible to simulate more than 100 clients by partitioning further, or fewer than 100 by combining or removing clients. For reproducible evaluation, each use case also includes an evaluation dataset in which sound events are uniformly distributed to each background class.

To generate the soundscapes for DESED-FL, sound events and background noises are sourced from DESED [1]. URBAN-FL soundscapes are generated using sound events from URBAN-8K [2] and background noises from the Isolated Urban Sound Database (IUSD) [18]. The sound event and noise classes for each dataset are listed in Table I.
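The client partitioning just described can be sketched in a few lines. The helper below is a minimal illustration, not the published generation code: the background names are the URBAN-FL noise classes from Table I, and the non-IID class subsets shown are hypothetical placeholders.

```python
# Sketch of the partitioning: 100 simulated clients, 20 per background noise
# class, with either full (IID) or subset (non-IID) access to the ten classes.
BACKGROUNDS = ["birds", "crowd", "fountain", "rain", "traffic"]
ALL_CLASSES = list(range(10))   # ten sound event classes per dataset
CLIENTS_PER_BACKGROUND = 20     # 5 backgrounds x 20 clients = 100 clients


def make_clients(non_iid_distributions=None):
    """Assign a background and a set of event classes to each client.

    With no argument, every client sees all ten classes (IID scheme).
    Otherwise, `non_iid_distributions` is a list of five 5-class subsets;
    each subset is assigned to four clients per background (non-IID scheme).
    """
    clients = []
    for bg in BACKGROUNDS:
        for i in range(CLIENTS_PER_BACKGROUND):
            if non_iid_distributions is None:
                classes = ALL_CLASSES                    # IID: all classes
            else:
                classes = non_iid_distributions[i % 5]   # 4 clients per subset
            clients.append({"background": bg, "classes": classes})
    return clients


iid_clients = make_clients()
# Hypothetical non-IID distributions: five subsets of five classes each.
non_iid = [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [0, 2, 4, 6, 8],
           [1, 3, 5, 7, 9], [0, 1, 5, 6, 7]]
non_iid_clients = make_clients(non_iid)
print(len(iid_clients), len(non_iid_clients))  # → 100 100
```

Further partitioning or merging of these 100 client entries yields the larger or smaller client populations mentioned above.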
A. IID and Non-IID Training Datasets
To generate the IID and non-IID dataset variations, two sound event distribution schemes are implemented for assigning events to clients. The IID scheme implements a uniform distribution of sound event classes across all devices. This simulates the best possible case for training FL models, since all clients have access to all classes. For the non-IID scheme, sound event classes are distributed to the clients using one of five class distributions, each containing a subset of five classes. To minimize background noise bias, each of the five distributions is assigned to four clients per background class, for a total of 100 clients. This results in each set of 20 clients per background comprising five different class distributions. A detailed view of each data distribution is presented in Appendix A. In the non-IID scheme, the five class distributions together comprise 25 event class assignments; since 25 is not divisible by ten (the number of total classes), five event classes are used three times and five event classes are used only twice.

One goal in designing the distributions is to have each collection of classes be as different as possible from any other collection. Hence, the algorithm to select class distributions minimizes the penalty value p computed by

p = Σ_{k=1}^{N_coll − 1} Σ_{l=k+1}^{N_coll} N_eq(k, l),

where N_coll equals the number of collections and N_eq(k, l) represents the number of equal classes in collections k and l. The five distributions that minimize the penalty p each consist of five of the event classes e1–e10. The mapping of the event classes to these positions (i.e., e_i) has been randomized and can be found in Table I.

B. Data Generation
The datasets consist of ten-second soundscapes synthetically generated using Scaper [17] by mixing between one and five possibly overlapping source events with one background noise type. Each event is mixed with a signal-to-noise ratio (SNR) chosen from N(µ, σ) with µ = 10 dB and σ = 3 dB. The sound events are selected by sampling from the corresponding class distributions, discussed in Section III-A. Additionally, source events are augmented by pitch shifting the audio by an amount uniformly sampled from a symmetric semitone range, and by time stretching by a factor uniformly sampled from a fixed range. These augmentations are only applied to the training data.

Before generating the soundscapes, the source data is split into training and evaluation sets to ensure that there is no data leakage. For the DESED-FL events, the data is partitioned into training and evaluation data according to the split used for the DESED dataset. For URBAN-FL, we take the approach employed for the URBAN-SED dataset, using the existing stratified folds of the URBAN-8K dataset for the split: folds 1–6 are used for training and 9–10 for evaluation. The background noise for each dataset is split into training and evaluation by splitting each source file into separate training and evaluation segments.

The final training datasets each contain 300 ten-second soundscapes per edge device, totaling 30 000 soundscapes. The evaluation datasets contain an equal number of soundscapes per background class. To enable reproducibility, the dataset creation scripts are available for download.

IV. EXPERIMENTAL SETUP
A. Architectures
We propose three baseline architectures to evaluate different model complexities and their effects on FL training performance. One of the goals driving this research is developing small models that can be trained on low-resource devices, such as neuromorphic hardware. This limits the types of architectures to those without recurrent layers. Therefore, we evaluate two standard CNN architectures of different sizes and a Residual Network (ResNet) architecture [19]. The baseline CNN, CNN-Base, is a medium-sized architecture based on the feature extraction front-end of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 baseline architecture. It is composed of seven convolutional blocks and a linear classification layer, for a total of 542 442 parameters. The second CNN architecture, a small CNN called CNN-Sm, was designed using neural architecture search with Bayesian optimization [20] to limit the model to roughly 100 000 parameters while optimizing the F-score on the URBAN-SED dataset. The resulting model has four convolutional blocks, a single feed-forward layer, and a classification layer, resulting in 115 434 parameters. Lastly, we propose a medium-sized ResNet architecture, ResNet, with five independent-component (IC) ResNet blocks [21] and a classification layer, for a total of 422 090 parameters. Detailed descriptions of each architecture may be found in Appendix B.
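Parameter budgets like those above can be sanity-checked with a short counting helper. The four-block channel progression below is a hypothetical configuration in the spirit of CNN-Sm, not the published architecture; it only illustrates how convolutional and dense layer sizes add up toward such a budget.

```python
def conv2d_params(in_ch, out_ch, k=3):
    """Parameter count of a k x k 2-D convolution with bias terms."""
    return out_ch * (in_ch * k * k + 1)


def dense_params(in_dim, out_dim):
    """Parameter count of a fully connected layer with bias terms."""
    return out_dim * (in_dim + 1)


# Hypothetical four-block CNN: the channel progression is an assumption.
channels = [1, 16, 32, 64, 64]
total = sum(conv2d_params(i, o) for i, o in zip(channels, channels[1:]))
total += dense_params(64, 64)   # single feed-forward layer
total += dense_params(64, 10)   # classifier over ten event classes
print(total)  # well under a ~100k parameter budget
```

Counting parameters this way makes it easy to see why recurrent back-ends were excluded: a single GRU or LSTM layer of comparable width would add several times as many parameters as the convolutional front-end.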
B. Preprocessing

1) Input Representation:
The input for each model is a perceptually weighted mel spectrogram [19]: the input signal is first downsampled to 22 050 Hz. The short-time Fourier transform (STFT) is applied with a fixed fast Fourier transform (FFT) size and hop size, and is followed by perceptual weighting. A mel filterbank of 256 mel bands is then applied. Finally, 43 windows are stacked together, resulting in a feature representation (43x256x1) of one second.
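A minimal version of this front-end can be sketched without audio libraries. Since the exact FFT and hop sizes are not restated here, the values below are assumptions, and the perceptual weighting step is omitted for brevity; the sketch only shows how a one-second signal becomes a (frames x mel bands) feature matrix.

```python
import numpy as np

SR = 22050      # target sample rate from the paper
N_MELS = 256    # mel bands from the paper
N_FFT = 1024    # assumed FFT size
HOP = 512       # assumed hop size


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(sr=SR, n_fft=N_FFT, n_mels=N_MELS):
    """Triangular mel filterbank mapping |STFT| bins to mel bands."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for j in range(l, c):
            fb[i, j] = (j - l) / max(c - l, 1)   # rising slope
        for j in range(c, r):
            fb[i, j] = (r - j) / max(r - c, 1)   # falling slope
    return fb


def logmel(signal):
    """Log-scaled mel spectrogram of a 1-D signal (frames x mel bands)."""
    window = np.hanning(N_FFT)
    n_frames = 1 + (len(signal) - N_FFT) // HOP
    frames = np.stack([signal[i * HOP:i * HOP + N_FFT] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(power @ mel_filterbank().T + 1e-10)


one_second = np.random.randn(SR)
print(logmel(one_second).shape)  # → (42, 256)
```

With the assumed hop size, one second yields 42 frames rather than the paper's 43; the exact frame count depends on the (unspecified here) FFT and hop parameters and on padding.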
2) Data Augmentation:
Similar to Salamon and Bello [22], we apply pitch shifting to the raw audio data before extracting the mel spectrograms. However, instead of applying all shifts from a set of positive and negative semitone values for a total increase factor of 4, we randomly select one semitone value from this range for each input file, for a total augmentation factor of 1 (i.e., doubling the size of the dataset). This significantly reduces the final size of the dataset and has only a minor impact on model performance.

C. Experimental Design
In our first experiment, the three proposed network architectures are evaluated in a centralized training scenario. First, we train the models using the original URBAN-SED dataset [2] to validate that the architectures work as expected on a well-known dataset. Then, we train each of the three models using the FL training sets. We evaluate the models first with batch normalization, and then replace all batch normalization layers with group normalization to ensure that this substitution for mitigating non-IID issues in FL [12] does not significantly affect the baseline results. All models are trained for 50 epochs with early stopping using a patience of 25 epochs, monitoring the validation loss. We use the Adam optimizer [23] with a cosine learning rate schedule.

The next set of experiments evaluates the influence of FL hyperparameters on SED performance, namely the total number of clients, N, the participation rate during each communication round, r_p, and the number of local epochs performed during each round, E_L. For N we evaluate 1, 25, and 100 clients, with N = 1 providing an FL baseline. For r_p, three values up to full participation (r_p = 1.0) are used in order to simulate the unreliability of client participation. At each communication round, a fraction r_p of the N clients is uniformly sampled (with replacement after each communication round) to participate in training. As the amount of network communication should be limited in FL, we evaluate the effects of three settings of E_L to reduce training time. For all FL experiments, the local client models are optimized using Adam with a fixed learning rate, as proposed by Leroy et al. [5]. The local weight updates are aggregated by the coordinator using the standard Federated Averaging algorithm with stochastic gradient descent (SGD) [8].

Due to the number of experiments being run, the size of the datasets, and the training times required for each experiment, we limit the number of communication rounds during training to 60 rounds per experiment, reducing the time and resources required for experimental evaluation. While in some cases the models may not have completely converged, the results provide valuable insights to improve our understanding of the different effects of FL hyperparameters and to better focus future research.

V. RESULTS
A. Centralized Training
Table II lists the segment-based F-scores for each of the proposed datasets and architectures under centralized training. Additionally, we include the results of the architectures trained using the URBAN-SED [17] dataset as a baseline to validate the network architectures' performance on a well-established dataset. In the case of URBAN-SED, the models perform comparably to or better than the original baseline reported by Salamon et al. [17]. In general, the ResNet architecture is typically the best-performing model. Furthermore, replacing batch normalization with group normalization has only minor effects on the performance of all architectures, and in many cases it improves performance. The results also indicate that the data distribution, either IID or non-IID, has minimal effect on model performance in a centralized context. This is expected, since all data is used during each training epoch and the model does not fit to a particular data distribution.

B. Federated Learning
Figure 1 shows the segment-based F-scores on evaluation data for models trained with each of the training datasets. DESED-FL results are shown in Figure 1a, and URBAN-FL in Figure 1b. Each subfigure includes the results from the IID (top row) and non-IID (bottom row) datasets. The results are shown for the hyperparameters N, E_L, and r_p as a function of communication rounds. Here we present the results for a single value of r_p, with the remaining results found in Appendix C. The training curves for the IID data look similar to what is expected in a centralized training scenario, with the training times (i.e., number of communication rounds) being influenced by the architecture type, participation rate, and the number of local epochs. (In real-world FL, these training times would not be an issue due to the inherent parallelization of FL, as opposed to simulating FL on a single server.) Typically in the IID setting, increasing the model
TABLE II: Centralized Training F-scores.
Dataset | IID | Norm | CNN-Sm | CNN-Base | ResNet
URBAN-SED | – | batch | 0.566 | 0.567 | –
URBAN-SED | – | group | 0.532 | 0.587 | 0.589
URBAN-FL | ✓ | batch | 0.600 | 0.625 | –
URBAN-FL | ✓ | group | 0.574 | 0.638 | –
URBAN-FL | ✗ | batch | 0.593 | 0.609 | 0.634
URBAN-FL | ✗ | group | 0.564 | 0.618 | –
DESED-FL | ✓ | batch | 0.627 | 0.632 | 0.630
DESED-FL | ✓ | group | 0.628 | 0.632 | –
DESED-FL | ✗ | batch | 0.618 | 0.625 | 0.625
DESED-FL | ✗ | group | 0.621 | 0.634 | –

complexity and E_L reduces the amount of communication needed between client and server by improving the training time, whereas increasing N slows down training. This could be attributed to the fact that each local client has less data, resulting in smaller gradient deltas between the global and newly trained local model at each round. These observations, however, do not necessarily hold in a non-IID context, where performance degrades significantly for all models (except for the baseline N = 1). In this case, the results indicate that larger models are more prone to overfit to the data seen during an individual training round. This is especially true with a small number of clients, such as N = 25, since there is inherently less variation in the randomly selected client distributions. However, even in cases when all data is seen during each communication round, i.e., r_p = 1.0, the ResNet architecture continues to overfit. Reducing E_L helps to alleviate this issue by limiting the gradient values, but slows down training. Training with a larger number of overall clients helps to mitigate this issue as well. In general, the models in a non-IID scenario tend to overfit to local distributions, and the federated averaging process does not correct for this on its own. Damping the gradients before performing aggregation, for example by reducing the server learning rate or normalizing gradients, may help to reduce the effects of large local gradients.

VI. CONCLUSION
In this work, we introduce DESED-FL and URBAN-FL, two novel datasets to foster research in FL for SED. To better understand the effects of previously identified challenges associated with non-IID data in FL, we include both IID and non-IID training sets for each use case. Additionally, we contribute the first known research on FL for SED through the evaluation of three baseline neural network architectures. The results show that while FL is a promising approach for SED, it is prone to challenges with non-IID data, similar to previous FL research [9], [12]. By contributing non-IID datasets, we hope to enable further research to identify potential solutions to mitigate these issues.