[PDF] DESED-FL and URBAN-FL: Federated Learning Datasets for Sound Event Detection

Abstract

Research on sound event detection (SED) in environmental settings has seen increased attention in recent years. The large amounts of (private) domestic or urban audio data needed raise significant logistical and privacy concerns. The inherently distributed nature of these tasks, make federated learning (FL) a promising approach to take advantage of largescale data while mitigating privacy issues. While FL has also seen increased attention recently, to the best of our knowledge there is no research towards FL for SED. To address this gap and foster further research in this field, we create and publish novel FL datasets for SED in domestic and urban environments. Furthermore, we provide baseline results on the datasets in a FL context for three deep neural network architectures. The results indicate that FL is a promising approach for SED, but faces challenges with divergent data distributions inherent to distributed client edge devices.

Full PDF

DDESED-FL and URBAN-FL: Federated LearningDatasets for Sound Event Detection

David S. Johnson ∗ , Wolfgang Lorenz, Michael Taenzer, Stylianos Mimilakis,Sascha Grollmisch, Jakob Abeßer and Hanna Lukashevich Fraunhofer Institute for Digital Media Technology (IDMT), Ilmenau, Germany ∗ [email protected] Abstract —Research on sound event detection (SED) in envi-ronmental settings has seen increased attention in recent years.The large amounts of (private) domestic or urban audio dataneeded raise signiﬁcant logistical and privacy concerns. Theinherently distributed nature of these tasks, make federatedlearning (FL) a promising approach to take advantage of large-scale data while mitigating privacy issues. While FL has alsoseen increased attention recently, to the best of our knowledgethere is no research towards FL for SED. To address this gapand foster further research in this ﬁeld, we create and publishnovel FL datasets for SED in domestic and urban environments.Furthermore, we provide baseline results on the datasets in aFL context for three deep neural network architectures. Theresults indicate that FL is a promising approach for SED, butfaces challenges with divergent data distributions inherent todistributed client edge devices.

Index Terms —federated learning, sound event detection, deeplearning, distributed learning

I. I

NTRODUCTION

The aim of sound event detection (SED) is to automaticallyidentify the occurrence of target sound events, such as glassbreaking or dog barking, within an audio signal capturing anacoustic scene. Identifying these sound events within complexscenes is a challenging and open research problem that hasattracted much attention in recent years, observed with theincrease in the literature and particularly with the growingresearch interest in the DCASE Community . Two commonSED uses-cases are acoustic monitoring in domestic [1] andurban [2] environments. In both scenarios, the use of audio totrain detection models in a centralized training context raisesconsiderable privacy concerns. In these environments, there arespeech and other sounds that are conﬁdential which should notbe shared or stored insecurely.State-of-the-art approaches to SED are most commonlybased on deep learning [3] which requires large centralizeddatasets for model training, posing signiﬁcant security andlogistical challenges. Federated learning (FL) offers an attrac-tive approach to mitigate some of these concerns. Instead ofsending private data to a centralized data store, FL performsmodel training directly on many client edge devices (from hereon referred to as clients) using locally stored data. The clientsthen share only their updated parameters with a coordinationserver, which aggregates the shared parameters to update aglobal model. The new global model is then transferred back http://dcase.community to the clients. This process continues until convergence, orindeﬁnitely if new data is continuously acquired [4].Current research in FL has focused on image or text-basedtasks. To our knowledge, the only known research or practicalapplications of FL in the audio domain are related to keywordspotting [5]–[7]. Due to the limited research on FL for SED,there remain questions about the effectiveness of the approachdue to varying acoustic conditions inherent to distributedclients. For example, data may be captured from clients inmultiple locations with different background noise character-istics or in locations with only a subset of the sound eventclasses. This leads to differences in data distributions amongstthe clients involved in the training process. For centralizedtraining, data from multiple devices is combined into singletraining dataset that is typically assumed to be independentand identically distributed (IID), but with FL distributed datacollection leads to models being trained using data fromdivergent distributions, i.e., data that is non-IID. Existing SEDdatasets do not capture the non-IID characteristics seen withFL. To address this gap and foster research on FL for SED,we contribute novel SED datasets speciﬁcally designed for FLtraining. Additionally, we provide baseline results for threeneural network architectures to evaluate the effects of FLhyperparameters and non-IID data on SED performance.II. R ELATED W ORK

A. Federated Learning

McMahan et al. ﬁrst proposed the idea of FL as a methodto allow data to remain on distributed devices while traininga shared model by aggregating locally trained updates [8].For a comprehensive overview of FL, refer to the technicalreport by Kairouz et al. [4]. There are two main challengesfor FL methods. First is the need for communicating overunreliable networks to transmit data. A second issue resultsfrom data captured in varying contexts, leading to datasetsthat are statistically dissimilar, non-IID, between clients.To address these challenges, Sattler et al. [9] proposed acompression framework sparse ternary compression (STC).Similarly, Lin et al. [10] and Bernstein et al. [11] proposedmethods to remove the redundancies of gradient informa-tion in node-distributed learning frameworks. Hsieh et al.evaluated the challenges of non-IID data in an FL scenario[12]. They identiﬁed problems with the batch normalizationlayer, a common layer in many deep neural network (DNN) a r X i v : . [ c s . S D ] F e b ABLE I: Sound event and background classes for each of the datasets.

Dataset Sound Events Background Noise Types

DESED-FL e : Dishes; e : Cat; e : Frying; e : Dog; e : Blender; e : Speech; e : Vacuum cleaner; e : Electric shaver/toothbrush; e : Alarm bell; e : Running water apartment room, computer interior, computerlab, emergency staircase, and libraryURBAN-FL e : Children playing; e : Siren; e : Drilling; e : Street music; e : Car horn; e : Gunshot; e : Jackhammer; e : Dog bark; e : Air conditioner; e : Engine idling birds, crowd, fountain, rain, and trafﬁc architectures, and proposed to use group normalization [13]instead. Similarly, to address the problems of non-IID data,Sattler et al. [14] proposed a clustering operation to groupclients whose data distributions have similar characteristics.While the previous research proposed methods to overcomechallenges in FL, there are no known datasets to evaluate themfor SED. We address this gap, by presenting new SED datasetsspeciﬁcally designed for FL with non-IID data. B. Sound Event Detection

State-of-the-art SED algorithms build upon deep neural net-works, the most common being convolutional neural networks(CNNs) and convolutional recurrent neural networks (CRNNs)based architectures. Both architectures include convolutionalfront-ends, where multiple convolutional layers are trainedto learn sound-speciﬁc features. As input to the network,either ﬁxed two-dimensional signal transformations such asmel spectrograms [15] or raw one-dimensional audio samplesare used (end-to-end learning) [16]. As a back-end, CNNsuse fully-connected layers for classiﬁcation whereas CRNNsemploy recurrent layers such as gated recurrent unit (GRU) orlong short-term memory (LSTM) layers to model the temporalprogression of the extracted features. We focus our work onCNN architectures for a lightweight approach, to enable modeltraining on low resource devices required by FL.Training SED models requires strongly labeled datasets inwhich onset and offset times are labeled for each sound event.Because of the laborious effort required to annotate real-worldsamples, researchers often use synthetically generated datasets.Creating synthetic datasets requires mixing events from acurated sound bank with a background signal to generatesoundscapes with multiple, possibly overlapping, events. Forexample, the URBAN-SED dataset [17] is composed of soundevents from the UrbanSound-8K (URBAN-8K) [2] datasetmixed with Brownian noise. A recent trend is to use acombination of synthetic and real recordings for training andevaluation as in the Domestic Environment Sound Event De-tection (DESED) dataset [1]. For both datasets, however, soundevents are distributed uniformly during soundscape generation.For the FL context, sound events should be distributed in astructured fashion to simulate real world distributed learningconditions. To enable research in FL for SED, our proposeddatasets distribute soundscapes to simulated clients with dif-ferent background characteristics and class distributions.III. D

ATASETS

In this section, we present DESED-FL and URBAN-FL,datasets for acoustic monitoring of domestic and urban envi- ronments with FL. Each use-case contains two independenttraining sets: an IID dataset, in which sound event classes aredistributed evenly amongst devices, and a non-IID dataset, inwhich only subset of overall classes is assigned to each client.To imitate different acoustic recording conditions, we mix thesound events with one of ﬁve background noise classes. Eachtraining dataset is pre-partitioned into 100 simulated clientswith 20 clients per each background noise class. It is possibleto simulate more than 100 clients by partitioning further, orless than 100 by combining or removing clients. For repro-ducible evaluation, each use case also includes an evaluationdataset in which sound events are uniformily distributed toeach background class.To generate the soundscapes for DESED-FL, sound eventsand background noises are sourced from DESED [1].URBAN-FL soundscapes are generated using sound eventsfrom URBAN-8K [2], and background noises from IsolatedUrban Sound Database (IUSD) [18]. The sound event andnoise classes for each dataset are listed in Table I.

A. IID and Non-IID Training Datasets

To generate the IID and non-IID dataset variations, twosound event distribution schemes are implemented for assign-ing events to clients. The IID scheme implements a uniformdistribution of sound event classes across all devices. Thissimulates the best possible case for training FL models sinceall clients have access to all classes. For the non-IID scheme,sound event classes are distributed to the clients using oneof ﬁve class distributions each containing a subset of ﬁveclasses. To minimize background noise bias, each of the ﬁvedistributions is assigned to four clients per background classfor a total of 100 clients. This results in each set of 20 clientsper background comprising ﬁve different class distributions.A detailed view of each data distribution is presented inAppendix A. In the non-IID scheme, each of the ﬁve classdistributions contains ﬁve event classes for a total of 25 eventclasses, which is not divisible by ten (the number of totalclasses). This means that ﬁve event classes were used threetimes and ﬁve event classes were used only twice.One goal in designing the distributions is to have eachcollection of classes be as different as possible from any othercollection. Hence, the algorithm to select class distributionsminimizes the penalty value p computed by p = N coll − (cid:88) k =1 N coll (cid:88) l = k +1 N eq ( k, l ) , here N coll equals the number of collections and N eq ( k, l ) represents the number of equal classes in collections k and l .The distributions that minimize the penalty p consists of the sets { e , e , e , e , e } , { e , e , e , e , e } , { e , e , e , e , e } , { e , e , e , e , e } ,and { e , e , e , e , e } , where each event e i is assigned oneof the available event classes. The mapping of the eventclasses to the positions (i.e. e ) have been randomized, andcan be found in Table I. B. Data Generation

The datasets consist of ten-second soundscapes syntheticallygenerated using Scaper [17] by mixing between one and ﬁvepossibly overlapping source events with one background noisetype. Each event is mixed with an signal-to-noise ratio (SNR)chosen from N ( µ, σ ) with µ = 10 dB and σ = 3 dB . Thesound events are selected by sampling from the correspondingclass distributions, discussed in Section III-A. Additionally,source events are augmented by pitch shifting the audio byan amount uniformly sampled from the range [ − , , andby time stretching by a value uniformly sampled from therange [0 . , . . These augmentations are only applied to thetraining data.Before generating the soundscapes, the source data is splitinto training and evaluation sets to ensure that there is no dataleakage. For the DESED-FL events, the data is partitionedinto training and evaluation data according to split used forthe DESED dataset. For Urban-FL, we take the approachemployed for the URBAN-SED dataset, by using the existingstratiﬁed folds from the URBAN-8K dataset for the split:folds 1-6 are used for training and 9-10 for evaluation. Thebackground noise for each dataset is split into training andevaluation by splitting each source ﬁle into separate trainingand evaluation segments.The ﬁnal training datasets each contain 300 ten-secondsoundscapes per edge device, totaling

30 000 soundscapes. Theevaluation datasets contain soundscapes per backgroundclass for a total of soundscapes. To enable reproducibil-ity, the dataset creation scripts are available for download IV. E

XPERIMENTAL S ETUP

A. Architectures

We propose three baseline architectures to evaluate differentmodel complexities and their effects on FL training perfor-mance. One of the goals driving this research is developingsmall models that are able to be trained on low resourcedevices, such as neuromorphic hardware. This limits the typesof architectures to those without recurrent layers. Therefore,we evaluate two standard CNN architectures of differentsizes, and a Residual Network (ResNet) architecture [19].The baseline CNN,

CNN-Base , is a medium-sized architecturebased on the feature extraction front-end of the Detection andClassiﬁcation of Acoustic Scenes and Events (DCASE) 2019baseline architecture. It is composed of seven convolutional blocks and linear classiﬁcation layer, for a total of

542 442 parameters. The second CNN architecture, a small CNN called

CNN-Sm , was designed using neural architecture search withBayes optimization [20] to limit the model to nearly

100 000 parameters while optimizing the F-score on the URBAN-SEDdataset. The found model has four convolutional blocks, asingle feed forward layer, and a classiﬁcation layer, resultingin

115 434 parameters. Lastly, we propose a medium-sizedResNet architecture,

ResNet , with ﬁve independent component(IC) ResNet blocks [21] and a classiﬁcation layer for a total of

422 090 parameters. Detailed descriptions for each architecturemay be found in Appendix B.

B. Preprocessing1) Input Representation:

The input for each model is aperceptually weighted mel spectrogram [19]: The input signalis ﬁrst downsampled to

22 050 Hz . The short-time Fouriertransform (STFT) is applied with an fast Fourier transform(FFT) of size and a hop size of , and is followedby perceptual weighting. A mel-ﬁlter bank of 256 mel bandsis then applied. Finally, 43 windows are stacked togetherresulting in a feature representation (43x256x1) of one second.

2) Data Augmentation:

Similar to Salamon and Bello [22],we apply pitch shifting to the raw audio data before extractingthe mel spectrograms. However, instead of applying all shiftsof ± and ± semitones for a total increase factor of 4, werandomly select one semitone value between ± for each inputﬁle and for a total augmentation factor of 1 (i.˜e., doubling thesize of the dataset). This helps signiﬁcantly reduce the ﬁnalsize of the dataset and has only a minor impact on modelperformance. C. Experimental Design

In our ﬁrst experiment, the three proposed network ar-chitectures are evaluated in a centralized training scenario.First, we train the models using the original URBAN-SEDdataset [2] to validate that the architectures work as expectedon a well-known dataset. Then, we train each of the threemodels using the FL training sets. We evaluate the models,ﬁrst, with batch normalization, and then, replace all batchnormalization layers with group normalization to ensure thatthis substitution for mitigating non-IID issues in FL [12] doesnot signiﬁcantly affect the baseline results. All models aretrained for 50 epochs with early stopping using a patience of25 epochs, which monitors the validation loss. We use theAdam optimizer [23] with a cosine learning rate schedule inthe range [1 e − , e − .The next set of experiments evaluate the inﬂuence of FL hy-perparameters on SED performance, namely the total numberof clients, N , the participation rate during each communicationround, r p , and number of local epochs performed duringeach round, E L . For N we evaluate 1, 25, and 100 clients,with N = 1 providing a FL baseline. For r p , values of { . , . , . } are used in order to simulate the unreliabil-ity of client participation. At each communication round afraction, r p , of the N clients are uniformly sampled (witheplacement after each communication round) to participatein training. As network communication the amount shouldbe limited in FL, we evaluate the effects of E L = { , , } to reduce training time. For all FL experiments, the localclient models are optimized using Adam, with a learningrate lr = 1 e − , proposed by Leroy et al. [5]. Thelocal weight updates are aggregated by the coordinator usingthe standard Federating Averaging algorithm using stochasticgradient descent (SGD) with a learning rate of . [8].Due to the number of experiments being run, the size of thedatasets, and the training times required for each experiment,we limit the number of communication rounds during trainingto 60 rounds per experiment. Thus, reducing the time andresources required for experimental evaluation . While in somecases the models may not have completely converged, theresults provide valuable insights to improve our understandingof the different effects of FL hyperparameters to better focusfuture research. V. R ESULTS

A. Centralized Training

Table II lists the segment based F-Scores for each of theproposed datasets and architectures under centralized training.Additionally, we include the results of the architectures trainedusing the URBAN-SED [17] dataset as a baseline to validatethe network architectures’ performance on a well-establisheddataset.In the case of URBAN-SED, the models perform compara-ble or better than the original baseline of F = 0 . fromSalamon et al. [17]. In general, the ResNet architecture istypically the best performing model. Furthermore, replacingbatch normalization with group normalization has only minoreffects on the performance of all architectures, and in manyof the cases it improves performance. The results also indicatethat data distribution, either IID or non-IID, has minimaleffect on model performance in a centralized context. Thisis expected since all data is used during each training epoch,and the model does not ﬁt to a particular data distribution. B. Federated Learning

Figure 7 shows the segment-based F-scores on evaluationdata for models trained with each training datasets. DESED-FL results are shown in Figure 1a, and URBAN-FL in Figure1b. Each subﬁgure includes the the results from IID (toprow) and non-IID (bottom row) datasets. The results areshown for hyperparameters N , E L , and r p as a functionof communication rounds. Here we present the results for r p = 0 . with the remaining results found in Appendix C. Thetraining curves for the IID data look similar to what is expectedin a centralized training scenario with the training times (i. e.,number of communication rounds) being inﬂuenced by thearchitecture type, participation rate, and the number of localepochs. Typically in the IID setting, increasing the model In real-world FL, this would not be an issue due to the inherent paral-lelization of FL, as opposed to simulating FL on a single server.

TABLE II: Centralized Training F-scores.

Dataset IID Norm CNN-Sm CNN-Base ResNet

URBAN-SED - batch 0.566 0.567

URBAN-SED - group 0.532 0.587 0.589URBAN-FL (cid:51) batch 0.600 0.625

URBAN-FL (cid:51) group 0.574 0.638

URBAN-FL (cid:55) batch 0.593 0.609 0.634URBAN-FL (cid:55) group 0.564 0.618

DESED-FL (cid:51) batch 0.627 0.632 0.630DESED-FL (cid:51) group 0.628 0.632

DESED-FL (cid:55) batch 0.618 0.625 0.625DESED-FL (cid:55) group 0.621 0.634 complexity and E L reduces the amount of communicationneeded between client and server by improving the trainingtime; whereas, increasing N slows down training. This couldbe attributed to the fact that each local client has less dataresulting in smaller gradient deltas between the global andnewly trained local model at each round. These observations,however, do not necessarily hold in a non-IID context, whereperformance degrades signiﬁcantly for all models (except forthe baseline N = 1 ). In this case, the results indicate that largermodels are more prone to overﬁt to the data seen during anindividual training round. This is especially true with a smallnumber of clients, such as N = 25 , since there is inherentlyless variation in the randomly selected client distributions.However, even in cases when all data is seen during eachcommunication round, i. e., r p = 1 . , the ResNet architecturecontinues to overﬁt. Reducing E L helps to alleviate this issueby limiting the gradient values, but slowing down training.Training with a larger number of overall clients helps tomitigate this issue as well. In general, the models in a non-IID scenario tend to overﬁt to local distributions, and thefederating averaging process does not correct for this on itsown. Damping the gradients before performing aggregation,by reducing the server learning rate or normalization gradientsfor example, may help to reduce the effects of large localgradients. VI. C ONCLUSION

In this work, we introduce DESED-FL and URBAN-FL,two novel datasets to foster research in FL for SED. Tobetter understand the effects of previously identiﬁed challengesassociated with non-IID data in FL, we include both IIDand non-IID training sets for each use case. Additionally, wecontribute the ﬁrst known research on FL for SED through theevaluation of three baseline neural network architecture. Theresults show that while FL is a promising approach for SEDit is prone to challenges with non-IID data similar to previousFL research [9], [12]. By contributing non-IID datasets, wehope enable further research to identify potential solutions tomitigate these issues. ,, ' ) 6 F R U H 1 35 1 35 1 35 &RPPXQLFDWLRQ5RXQGV 1 RQ ,, ' ) 6 F R U H &RPPXQLFDWLRQ5RXQGV &RPPXQLFDWLRQ5RXQGV 0RGHO &11( 0RGHO &11( 0RGHO &11( 0RGHO &116( 0RGHO &116( 0RGHO &116( 0RGHO 5HV1HW( 0RGHO 5HV1HW( 0RGHO 5HV1HW( (a) DESED-FL ,, ' ) 6 F R U H 1 35 1 35 1 35 &RPPXQLFDWLRQ5RXQGV 1 RQ ,, ' ) 6 F R U H &RPPXQLFDWLRQ5RXQGV &RPPXQLFDWLRQ5RXQGV (b) URBAN-FL Fig. 1: F-scores as a function of communication round for

P R = 50 % , and for N = { , , } . The top row shows resultsof the IID datasets and bottom row non-IID. Each column in the subﬁgures show the results for a speciﬁed N .Future research directions include addressing the issues ofnon-IID data through adaptive techniques such as the auto-matic adjustment of parameters [12], or clustering techniquesto identify groups of distributions and train individual modelsaccordingly [14]. Furthermore, an investigation on strategiesfor local data management is needed. Since clients may havelimited storage capacities, an evaluation of how much datashould be stored and for how long is needed. Additionally, itis important to be able to identify which data samples containthe most information and should be stored for later training. Tothat end, the selective sampling of data based on entropy [24]is an interesting research direction.R EFERENCES[1] R. Serizel, N. Turpault, A. Shah, and J. Salamon, “Sound event detec-tion in synthetic domestic environments,” in

Proceedings of the 45thInternational Conference on Acoustics, Speech, and Signal Processing(ICASSP2020) , Barcelona, Spain, May 2020.[2] J. Salamon, C. Jacoby, and J. P. Bello, “A dataset and taxonomy for urbansound research,” in , Orlando, FL, USA, 11 2014, pp. 1041–1044.[3] A. Dang, T. H. Vu, and J. Wang, “A survey of deep learning forpolyphonic sound event detection,” in , 2017, pp. 75–78.[4] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, and et al.,“Advances and Open Problems in Federated Learning,” 2019.[5] D. Leroy, A. Coucke, T. Lavril, T. Gisselbrecht, and J. Dureau, “FederatedLearning for Keyword Spotting,” in

Proceedings of the 44th InternationalConference on Acoustics, Speech, and Signal Processing (ICASSP2019) ,2019.[6] Apple. (2019) Designing for privacy (video and slide deck). [Online].Available: https://developer.apple.com/videos/play/wwdc2019/708[7] A. Bhowmick, J. Duchi, J. Freudiger, G. Kapoor, and R. Rogers, “Pro-tection against reconstruction and its applications in private federatedlearning,” arXiv preprint arXiv:1812.00984 , 2018.[8] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas,“Communication-Efﬁcient Learning of Deep Networks from Decentral-ized Data,” in

Proceedings of the 20th International Conference onArtiﬁcial Intelligence and Statistics , 2017.[9] F. Sattler, S. Wiedemann, K.-R. Muller, and W. Samek, “Robust andCommunication-Efﬁcient Federated Learning From Non-i.i.d. Data,”

IEEE Transactions on Neural Networks and Learning Systems , 2019. [10] Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally, “Deep GradientCompression: Reducing the Communication Bandwidth for DistributedTraining,” 2017.[11] J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar,“signSGD: Compressed optimisation for non-convex problems,” 2018.[12] K. Hsieh, A. Phanishayee, O. Mutlu, and P. B. Gibbons, “The Non-IIDData Quagmire of Decentralized Machine Learning,” 2019. [Online].Available: http://arxiv.org/abs/1910.00189[13] Y. Wu, “Group Normalization,”

ECCV , 2018. [Online]. Available:https://research.fb.com/publications/group-normalization/[14] F. Sattler, K.-R. M¨uller, and W. Samek, “Clustered Federated Learn-ing: Model-Agnostic Distributed Multi-Task Optimization under PrivacyConstraints,” 2019.[15] Y. Hou, Q. Kong, S. Li, and M. D. Plumbley, “Sound Event Detec-tion with Sequentially Labelled Data Based on Connectionist TemporalClassiﬁcation and Unsupervised Clustering,” in

Proceedings of the 44thInternational Conference on Acoustics, Speech, and Signal Processing(ICASSP2019) , 2019.[16] E. Cakir and T. Virtanen, “End-to-End Polyphonic Sound Event De-tection Using Convolutional Recurrent Neural Networks with LearnedTime-Frequency Representation Input,”

Proceedings of the InternationalJoint Conference on Neural Networks , vol. 2018-July, 2018.[17] J. Salamon, D. MacConnell, M. Cartwright, P. Li, and J. P. Bello, “Sca-per: A library for soundscape synthesis and augmentation,” in

Proceedingsof the IEEE Workshop on Applications of Signal Processing to Audio andAcoustics (WASPAA) , 2017.[18] J.-R. Gloaguen, M. Lagrange, A. Can, and J.-F. Petiot, “Isolated urbansound database,” Apr. 2018. [Online]. Available: https://doi.org/10.5281/zenodo.1213793[19] K. Koutini, H. Eghbal-zadeh, M. Dorfer, and G. Widmer, “The ReceptiveField as a Regularizer in Deep Convolutional Neural Networks forAcoustic Scene Classiﬁcation,” in

Proceedings of the 27th EuropeanSignal Processing Conference (EUSIPCO) , 2019.[20] J. Snoek, H. Larochelle, and R. P. Adams, “Practical Bayesian Optimiza-tion of Machine Learning Algorithms,” in

NeurIPS , Lake Tahoe, Nevada,2012, pp. 2951–2959.[21] G. Chen, P. Chen, Y. Shi, C.-Y. Hsieh, B. Liao, and S. Zhang,“Rethinking the usage of batch normalization and dropout in the trainingof deep neural networks,” arXiv preprint arXiv:1905.05928 , 2019.[22] J. Salamon and J. P. Bello, “Deep Convolutional Neural Networks andData Augmentation for Environmental Sound Classiﬁcation,”

IEEE SignalProcessing Letters , vol. 24, no. 3, pp. 279–283, 3 2017.[23] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980 , 2014.[24] F. Wiewel and B. Yang, “Entropy-based sample selection for online con-tinual learning,” in

Proceedings of the 28th European Signal ProcessingConference (EUSIPCO) , 2020.

PPENDIX AD ATASETS

TABLE III: DESED-FL and URBAN-FL Sound Event Statistics per Client Device

Dataset Version Max Poly Avg Poly Avg Num Events Avg Dur. (s) Min Dur. (s) Max Dur. (s)

URBAN-FL IID 5 . ± .

09 2 . ± .

14 2 . ± .

10 0 . ± .

03 4 . ± . URBAN-FL Non-IID 5 . ± .

08 3 . ± .

14 2 . ± .

15 0 . ± .

03 4 . ± . URBAN-FL Eval 5 . ± .

05 3 . ± .

08 2 . ± .

03 0 . ± .

00 4 . ± . DESED-FL IID 5 . ± .

09 2 . ± .

14 3 . ± .

20 0 . ± .

02 10 . ± . DESED-FL Non-IID 5 . ± .

28 2 . ± .

15 3 . ± .

18 0 . ± .

02 9 . ± . DESED-FL Eval 5 . ± .

04 3 . ± .

07 2 . ± .

04 0 . ± .

00 10 . ± . In this section, we present a detailed view of the device data and distributions for the proposed IID and non-IID datasets,DESED-FL and URBAN-FL. Table III lists detailed statistics regarding polyphony, i.e. the number of events occurringsimultaneously, number and duration of events per sound soundscape. The statistics indicate that the datasets all exhibit similarcharacteristics, with the only signiﬁcant different in the max durations between the DESED-FL and URBAN-FL datasets. Thisis due to the nature of the source datasets. The max duration of the sound events in the URBAN-8K datasets is four seconds,whereas in the DESED-FL source dataset, some events can be longer than four seconds so the max duration is limited to thelength of the soundscapes.Figures 2 and 4 show the total duration of each of the sound events as distributed to the background noise classes for each ofthe training and evaluation datasets of DESED-FL and URBAN-FL respectively. As expected, the distributions of data betweeneach background. The data was distributed evenly amongst background classes to mitigate any potential background biases inthe ﬁnal trained models. Figures 3 and 5, provide a detailed look at the individual client distributions. The event durations barcharts for the IID data, in Figures 3a and 5a, show that the sound events are identically distributed for all devices. Whereas,in the non-IID data, shown in Figures 3b and 5b, there are ﬁve different distributions for each background class, indicated bythe surrounding green boxes. This distribution scheme reduces the potential for background bias by ensuring all sound eventsare distributed among each background type while ensuring non-IID client devices. DSDUWPHQWBURRP FRPSXWHUBLQWHULRU FRPSXWHUBODE HPHUJHQF\BVWDLUFDVH OLEUDU\ (a) DESED-FL IID training data DSDUWPHQWBURRP FRPSXWHUBLQWHULRU FRPSXWHUBODE HPHUJHQF\BVWDLUFDVH OLEUDU\ (b) DESED-FL non-IID training data DSDUWPHQWBURRP FRPSXWHUBLQWHULRU FRPSXWHUBODE HPHUJHQF\BVWDLUFDVH OLEUDU\ (c) DESED-FL evaluation data

Fig. 2: DESED-FL: Total sound event durations for each background for the two versions of training data and the evaluationdataset. The sound event classes from left to right are: alarm bell ringing, blender, cat, dishes, dog, electric shaver toothbrush,frying, running water, speech, and vacuum cleaner . a) DESED-FL IID training data event duration per client(b) DESED-FL non-IID training data event duration per client Fig. 3: Distribution of data, by event duration, to client devices for the DESED-FL dataset. Each row shows the distributionfor a speciﬁc background type. The green boxes indicate the ﬁve different class distribution groupings for the non-IID dataset. ELUG FURZG IRXQWDLQ UDLQ WUDIILF (a) URBAN-FL IID training data ELUG FURZG IRXQWDLQ UDLQ WUDIILF (b) URBAN-FL non-IID training data ELUG FURZG IRXQWDLQ UDLQ WUDIILF (c) URBAN-FL evaluation data

Fig. 4: URBAN-FL: Total sound event durations for each background for the two versions of training data and the evaluationdataset. The sound event classes from left to right are: air conditioner, car horn, children playing, dog bark, drilling,engine idling, gun shot, jackhammer, siren, street music a) URBAN-FL IID training data event duration per client(b) URBAN-FL non-IID training data event duration per client

Fig. 5: Distribution of data, by event duration, to client devices for the URBAN-FL dataset. Each row shows the distributionfor a speciﬁc background type. The green boxes indicate the ﬁve different class distribution groupings for the non-IID dataset.

PPENDIX BD ETAILED A RCHITECTURES

In this section, we provide details on the three architectures used in our experiments on SED with FL to foster reproducibilityof baseline results on the proposed datasets. As baseline architectures we propose two standard CNN architectures, a smallone with close to

100 000 parameters and a medium-sized one. Additionally, we proprosed a medium-sized ResNet basedarchitecture. TABLE IV: CNN-Base Architecture with

542 442 parameters

Layer Output Kernel Size Droput

Conv Block 16 (3, 3) 0.20Avg Pooling - (2, 2) -Conv Block 32 (3, 3) 0.20Avg Pooling - (2, 2) -Conv Block 64 (3, 3) 0.20Avg Pooling - (1, 2) -Conv Block 128 (3, 3) 0.20Avg Pooling - (1, 2) -Conv Block 128 (3, 3) 0.20Avg Pooling - (1, 2) -Conv Block 128 (3, 3) 0.20Avg Pooling - (1, 2) -Conv Block 128 (3, 3) 0.20Global Avg Pooling - - -Dense 10 - -Sigmoid - - -

Both CNN-Base and CNN-Small are standard CNN architectures composed of convolution blocks each followed by a poolinglayer. Each convolutional block, shown in Figure 6b, contains a 2D convolution, a normalization layer, an activation layer,and 20% dropout. Global pooling is performed before the ﬁnal output layer. CNN-Base is based on the baseline architecturefrom the DCASE 2019 challenge for SED [App1] without the recurrent layers, and is composed of seven convolutionalblocks. The recurrent layers were removed in order to maintain compatibility with neuromorphic hardware. The architectureand hyperparameters are listed in Table IV. The resulting architecture contains

542 442 trainable parameters.TABLE V: small CNN (CNN-S) Architecture with . parameters Layer Output Kernel Size Dropout

Conv Block 16 (3, 3) 0.2Pooling 2D - (2, 2) -Conv Block 32 (3, 3) 0.2Pooling 2D - (2, 2) -Conv Block 64 (3, 3) 0.2Pooling 2D - (2, 2) -Conv Block 128 (3, 3) 0.2Global Pooling - - -Dense 128 - -ReLu - - -Dense 10 - -Sigmoid - - -

CNN-Small is a CNN that was designed using Bayesian Optimization neural architecture search [20] to keep the modelclose to

100 000 parameters. The ﬁnal architecture is composed of four convolutional blocks, seen in 6b and a classiﬁcationnetwork with one hidden layer and one output layer, for a total of

115 434 parameters. The architecture and hyperparametersare listed in Table IV. Implemented with http://github.com/fmfn/BayesianOptimization e L U I C C o n v D R e L U I C C o n v D R e L U (a) ResNet Block with IC Layers C o n v D N o r m a li z a t i o n R e L U D r o p o u t (b) Convolution Block Fig. 6: Blocks used for the ResNet and CNN architecturesTABLE VI: Resnet Architecture with

422 090 parameters

Layer Output Kernel Size Dropout

Conv 2D 64 (5, 5) -Relu - - -ResNet Block 64 (3, 1) 0.10Avg Pooling - (2, 2) -ResNet Block 64 (3, 3) 0.10Avg Pooling - (2, 2) -Res Block 64 (3, 3) 0.10Avg Pooling - (2, 2) -ResNet Block 128 (3, 1) 0.10Avg Pooling - (2, 2) -ResNet Block 128 (1, 1) 0.10Avg Global Pooling - - -Dense 10 - -Sigmoid - - -

The ResNet architecture is composed of ResNet blocks with IC layers as proposed by Chen et al. [21] shown in Fig. 6awith hyperparameters inspired by Koutini et al. [19] to minimize the receptive ﬁeld of the model. The IC blocks containa normalization layer followed by dropout. For additional regularization, l2 weight regularization is performed as well asdisabling of centering and scaling during the normalization process [App5]. The ﬁnal architecture contains

422 090 trainableparameters.

PPENDIX CE XTENDED R ESULTS

In this section, we present extended results from our FL experiments. In addition, to r p = 0 . , presented in the main article,results for r p = { . , . } are shown in Figures 7 and 8. The percentage of clients involved in a training round has minimaleffects on the training peformance for IID data, while in the case of non-IID training is more stabilized as the percentageincreases. This is most likely a result of the central server receiving weight gradients from all possible distributions at eachtraining round. ) 6 F R U H ,, ' 35 &RPPXQLFDWLRQ5RXQGV ) 6 F R U H 1 RQ ,, ' 0RGHO &11( 0RGHO &11( 0RGHO &11( 0RGHO &116( 0RGHO &116( 0RGHO &116( 0RGHO 5HV1HW( 0RGHO 5HV1HW( 0RGHO 5HV1HW( (a) N Clients = 1 ) 6 F R U H ,, ' 35 35 35 &RPPXQLFDWLRQ5RXQGV ) 6 F R U H 1 RQ ,, ' &RPPXQLFDWLRQ5RXQGV &RPPXQLFDWLRQ5RXQGV0RGHO &11( 0RGHO &11( 0RGHO &11( 0RGHO &116( 0RGHO &116( 0RGHO &116( 0RGHO 5HV1HW( 0RGHO 5HV1HW( 0RGHO 5HV1HW( (b) N Clients = 25 ) 6 F R U H ,, ' 35 35 35 &RPPXQLFDWLRQ5RXQGV ) 6 F R U H 1 RQ ,, ' &RPPXQLFDWLRQ5RXQGV &RPPXQLFDWLRQ5RXQGV0RGHO &11( 0RGHO &11( 0RGHO &11( 0RGHO &116( 0RGHO &116( 0RGHO &116( 0RGHO 5HV1HW( 0RGHO 5HV1HW( 0RGHO 5HV1HW( (c) N Client = 100 Fig. 7: DESED-FL: Federated Learning F1 (micro average) Score on Validation Data ) 6 F R U H ,, ' 35 &RPPXQLFDWLRQ5RXQGV ) 6 F R U H 1 RQ ,, ' 0RGHO &11( 0RGHO &11( 0RGHO &11( 0RGHO &116( 0RGHO &116( 0RGHO &116( 0RGHO 5HV1HW( 0RGHO 5HV1HW( 0RGHO 5HV1HW( (a) N Clients = 1 ) 6 F R U H ,, ' 35 35 35 &RPPXQLFDWLRQ5RXQGV ) 6 F R U H 1 RQ ,, ' &RPPXQLFDWLRQ5RXQGV &RPPXQLFDWLRQ5RXQGV 0RGHO &11( 0RGHO &11( 0RGHO &11( 0RGHO &116( 0RGHO &116( 0RGHO &116( 0RGHO 5HV1HW( 0RGHO 5HV1HW( 0RGHO 5HV1HW( (b) N Clients = 25 ) 6 F R U H ,, ' 35 35 35 &RPPXQLFDWLRQ5RXQGV ) 6 F R U H 1 RQ ,, ' &RPPXQLFDWLRQ5RXQGV &RPPXQLFDWLRQ5RXQGV 0RGHO &11( 0RGHO &11( 0RGHO &11( 0RGHO &116( 0RGHO &116( 0RGHO &116( 0RGHO 5HV1HW( 0RGHO 5HV1HW( 0RGHO 5HV1HW( (c) N Client = 100 Fig. 8: URBAN-FL: Federated Learning F1 (micro average) Score on Validation Data

HE OTHER LIST[App1] N. Turpault, R. Serizel, A. Parag Shah, and J. Salamon, “Sound event detection in domestic environments with weakly labeled data and soundscapesynthesis,” in

Workshop on Detection and Classiﬁcation of Acoustic Scenes and Events , New York City, United States, Oct. 2019. [Online].Available: https://hal.inria.fr/hal-02160855[App2] J. Snoek, H. Larochelle, and R. P. Adams, “Practical Bayesian Optimization of Machine Learning Algorithms,” in

NeurIPS , Lake Tahoe, Nevada,2012, pp. 2951–2959.[App3] G. Chen, P. Chen, Y. Shi, C.-Y. Hsieh, B. Liao, and S. Zhang, “Rethinking the usage of batch normalization and dropout in the training of deepneural networks,” arXiv preprint arXiv:1905.05928 , 2019.[App4] K. Koutini, H. Eghbal-zadeh, M. Dorfer, and G. Widmer, “The Receptive Field as a Regularizer in Deep Convolutional Neural Networks for AcousticScene Classiﬁcation,” in

Proceedings of the 27th European Signal Processing Conference (EUSIPCO) , 2019.[App5] M. D. McDonnell and W. Gao, “Acoustic Scene Classiﬁcation Using Deep Residual Networks with Late Fusion of Separated High and LowFrequency Paths,” in