Sound Event Detection in Urban Audio With Single and Multi-Rate PCEN
Christopher Ick
New York University, Center for Data Science, 60 5th Ave., New York, NY 10011
Brian McFee
New York University, Center for Data Science, 60 5th Ave., New York, NY 10011
ABSTRACT
Recent literature has demonstrated that per-channel energy normalization (PCEN) yields significant performance improvements over traditional log-scaled mel-frequency spectrograms in acoustic sound event detection (SED) in a multi-class setting with overlapping events. However, the configuration of PCEN's parameters is sensitive to the recording environment, the characteristics of the class of events of interest, and the presence of multiple overlapping events [1]. This leads to improvements on a class-by-class basis, but poor cross-class performance. In this article, we experiment with PCEN spectrograms as an alternative input representation for SED in urban audio using the UrbanSED dataset, demonstrating per-class improvements that depend on parameter configuration. Furthermore, we address cross-class performance with a novel method, Multi-Rate PCEN (MRPCEN), and demonstrate its cross-class SED performance, showing improvements over traditional single-rate PCEN.
Index Terms — Acoustic noise, acoustic sensors, acoustic signal detection, signal classification, spectrogram.
1. INTRODUCTION
Noise suppression is a critical step in acoustic signal detection, particularly in practical sound event detection (SED) in field recordings. Popular approaches to this task use convolutional operators, mimicking methods from computer vision [2]. When audio is the input to these methods, the "images" are typically time-frequency representations (spectrograms). Traditionally, log-scaling is the primary approach for noise reduction, and is the standard approach when spectrograms are the feature of interest. For single-source audio in clean acoustic environments, this is sufficient for SED.

However, real-world environments typically don't have clean, separated events of a single class; this is particularly true of urban audio. Field recordings often contain multiple sound sources of varying acoustic qualities, leading to varying cross-class performance. Furthermore, the introduction of auditory deformations can lead to rapid performance degradation [3].

Prior research has proposed per-channel energy normalization (PCEN) as a time-frequency representation that mitigates the effects of background noise, demonstrating its use as an input to convolutional methods in SED. PCEN has proven beneficial in various single-source tasks, including bioacoustic event detection [1], keyword spotting [3], and vocal detection in music [4]. Its per-channel background suppression makes PCEN an attractive choice for SED in urban environments, as background noise in urban environments tends to be Brownian rather than Gaussian, the assumption made when log-scaling spectrograms [5]. One constraint of PCEN is its dependence on parameter configurations tied to the specific acoustic properties of the sound event of interest [6].
This makes PCEN poorly suited for multi-class classification, as a given PCEN parameter configuration will likely suit only one particular class of sound events while performing poorly across others. In this article, we demonstrate the effectiveness of PCEN for SED in urban environments. Furthermore, we propose a new approach to acoustic event detection that combines the foreground separation characteristic of PCEN with preserved multi-class performance: Multi-Rate PCEN (MRPCEN). By computing a multi-layered PCEN spectrogram at different parameter configurations, we gain the advantages of PCEN, particularly robustness to varying acoustic conditions and auditory deformations, without the performance loss associated with cross-class performance.
2. AUDIO REPRESENTATION
While log-scaling mel-spectrograms is a simple and computationally efficient method of range compression, it has limited effect in certain conditions, particularly those in which the background noise is non-Gaussian. PCEN has been used as a pre-processing method for time-frequency representations that reduces the effects of noise on convolutional neural networks by Gaussianizing the distribution of magnitudes across mel-frequency spectrogram coefficients [3].

Fig. 1. A comparison of standard SED audio featurizations compared to Multi-Rate PCEN. Note the unevenly distributed noise in the log-mel spectrogram, with more noise in the lower end of the spectrum (due to Brown noise found in urban environments), which is removed in the PCEN spectrogram.

PCEN is a sequence of audio processing steps on a spectrogram E(t, f), using adaptive gain control scaled by the response of an autoregressive filter φ_T, followed by dynamic range compression. It takes the form:

    PCEN(t, f) = ( E(t, f) / (ε + (E ∗ φ_T)(t, f))^α + δ )^r − δ^r    (1)

The gain α scales the smoothing of the spectrogram, δ and r are the bias and power of the dynamic range compression that PCEN provides, and ε is a small offset term. The filter response M(t, f) = (E ∗ φ_T)(t, f) that scales our PCEN magnitude is given by:

    M(t, f) = s E(t, f) + (1 − s) M(t − τ, f)    (2)

where 0 < s < 1 is the weight of the first-order autoregressive filter (AR(1)), and τ is the discretized step defined by the hop size of the input spectrogram. The resulting filter is a low-pass filter with 0 dB gain, a 3 dB cutoff frequency of ω_c = 2πτ/T = arccos(1 − s²/(2(1 − s))), and sidelobe falloff of 10 dB per decade near ω_c [6].

We consider this filter response M(t, f) an approximation to the relative magnitude of stationary background noise at each frequency band f.
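As a concrete reference, Eqs. (1)-(2) amount to a few lines of NumPy. This is a minimal sketch; the parameter defaults here (s, ε, α, δ, r) are illustrative placeholders, not the configurations used in the experiments below:

```python
import numpy as np

def pcen(E, s=0.025, eps=1e-6, alpha=0.98, delta=2.0, r=0.5):
    """Per-channel energy normalization of a non-negative spectrogram E
    of shape (n_bands, n_frames), following Eqs. (1)-(2).

    Parameter values are illustrative defaults, not the paper's settings.
    """
    M = np.empty_like(E, dtype=float)
    M[:, 0] = E[:, 0]  # initialize the AR(1) smoother with the first frame
    for t in range(1, E.shape[1]):
        # Eq. (2): first-order autoregressive smoothing along time
        M[:, t] = s * E[:, t] + (1.0 - s) * M[:, t - 1]
    # Eq. (1): adaptive gain control followed by dynamic range compression
    return (E / (eps + M) ** alpha + delta) ** r - delta ** r
```

In practice a vectorized filter (e.g. `librosa.pcen`) is preferable to the explicit Python loop, but the recurrence above is the underlying computation.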
The effect of scaling the output PCEN spectrogram by the reciprocal of this response is to amplify the response of foreground events and suppress background events on a per-channel basis. In regimes where the background noise isn't Gaussian, as with Brownian noise in urban environments, this decorrelates noise across different frequency bands.

A critical tuning parameter of this autoregressive smoothing is the rate parameter T, which defines the cutoff frequency through ω_c = 2πτ/T. Setting T too low will reduce the noise-reduction effects of PCEN, but setting it too high will suppress the sound event of interest, especially if that event is stationary. Prior practical recommendations for the rate parameter are defined by the stationarity of the sound, the frequency range, and the chirp rate. For single sources in a consistent acoustic environment, tuning the rate parameter T to these specifications proves sufficient for audio detection and classification tasks [6].

In a field setting, with several audio classes of varying acoustic properties and variable recording conditions, a single ideal value of T is much harder to identify. For example, in the case of urban audio, the combination of short, fast-decaying sounds (such as gunshots or dog barks) and longer ambient sounds (e.g., sirens, air conditioners) leads to widely varying preferred values of T. To capture information across varying regimes of the rate parameter T, we take inspiration from 3-channel RGB images in computer vision: three separate but correlated feature maps are passed to a convolutional neural network (CNN), which can use this multi-frequency image to make predictions that a black-and-white image could not.

We replicate this multi-regime approach, but instead of frequency responses, we vary the rate parameter T in each layer, as shown in figure 1.
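The relationship between the AR(1) smoother weight s and its 3 dB cutoff ω_c stated above can be verified numerically. This sketch assumes the arccos form of the cutoff and the standard AR(1) frequency response; function names are ours:

```python
import numpy as np

def ar1_cutoff(s):
    """3 dB cutoff frequency (radians/frame) of the AR(1) smoother with weight s."""
    return np.arccos(1.0 - s**2 / (2.0 * (1.0 - s)))

def ar1_gain(s, omega):
    """Magnitude response of M(t) = s*E(t) + (1-s)*M(t-1) at frequency omega."""
    return s / np.abs(1.0 - (1.0 - s) * np.exp(-1j * omega))
```

At ω = 0 the gain is exactly 1 (0 dB), at ω = ω_c it is 1/√2 (−3 dB), and smaller s (i.e., larger time constants T) gives a lower cutoff, which is why large T suppresses slow, stationary events.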
Each i-th layer of the image has a differing level of gain control applied, based on the cutoff threshold from T_i. By using multiple logarithmically scaled values of T_i, we can produce a multi-layered image that captures information about sounds at varying decay lengths. This ensures that sound events that may be suppressed at one value of T_i will be preserved at another. The resulting multi-channel image can be used as an input to a CNN much like a multi-channel color image is used in machine vision. By incorporating information at varying degrees of gain control, our model preserves the robustness of PCEN without degrading multi-class performance.
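The multi-layer construction can be sketched as follows. The mapping from a time constant T (in frames) to the smoother weight s follows librosa's convention, and the PCEN parameters are illustrative placeholders rather than the experimental settings:

```python
import numpy as np

def mrpcen(E, rates=tuple(2**k for k in range(10)),
           eps=1e-6, alpha=0.98, delta=2.0, r=0.5):
    """Stack PCEN layers computed at logarithmically spaced rate parameters.

    E: non-negative spectrogram of shape (n_bands, n_frames).
    rates: time constants T in frames; the T -> s mapping below follows
    librosa's convention and is one of several reasonable choices.
    Returns a channels-last array of shape (n_bands, n_frames, len(rates)).
    """
    layers = []
    for T in rates:
        # smoothing weight for an AR(1) filter with time constant T frames
        s = (np.sqrt(1.0 + 4.0 * T**2) - 1.0) / (2.0 * T**2)
        M = np.empty_like(E, dtype=float)
        M[:, 0] = E[:, 0]
        for t in range(1, E.shape[1]):
            M[:, t] = s * E[:, t] + (1.0 - s) * M[:, t - 1]
        layers.append((E / (eps + M) ** alpha + delta) ** r - delta ** r)
    return np.stack(layers, axis=-1)
```

The resulting 3-D array can be fed to a CNN exactly as an RGB image would be, with the rate axis playing the role of the color channels.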
3. EXPERIMENTAL DESIGN

3.1. UrbanSED
We used the UrbanSED dataset to evaluate the performance of this method for the task of sound event detection [5]. UrbanSED contains 10,000 synthetic soundscapes covering 10 distinct sound classes, with approximately 1000 instances per class, drawn from the UrbanSound8K dataset. Each soundscape contains 1-9 time/class-labeled foreground events with additive background Brownian noise. The dataset is pre-split into 6-2-2 training, validation, and evaluation subsets. Because UrbanSED is synthetic, we can ensure no spurious unlabeled audio is included in the soundscapes, making it a standard benchmark for state-of-the-art SED models.

3.2. Data augmentation
Because one of the primary predicted benefits of PCEN is improved robustness to audio deformations, we augmented our dataset with several reverberant duplicates of UrbanSED. In this implementation, reverb is modeled as a convolution of a source signal with the impulse response of a given acoustic environment. Augmenting a given audio clip with reverb is done by computing the convolution of the impulse response and the audio file; this effectively computes the response of the audio clip in the acoustic environment associated with the impulse. We used 6 distinct reverb responses: three impulse responses recorded in different acoustic environments, and three synthetic impulse responses of white noise with an exponential decay envelope e^(−t/τ_c). The three real impulse responses were recorded in a bedroom, an alleyway, and a tunnel, each with increasing decay times. The three synthetic impulse responses used three different decay time constants τ_c. In addition to reverb augmentation, we also duplicated our dataset by pitch shifting each sample by a fixed set of positive and negative semitone offsets. We applied pitch shifts and convolutional reverb to our dataset using MUDA, a library for musical data augmentation [7].
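The augmentation step above, reverb as convolution with an impulse response, can be sketched as follows. This is a simplified stand-in for the MUDA pipeline; the function names and the 2-second IR duration are our own illustrative choices:

```python
import numpy as np

def synth_impulse_response(sr, tau_c, duration=2.0, seed=0):
    """Synthetic IR: white noise shaped by an exponential decay envelope e^(-t/tau_c)."""
    rng = np.random.RandomState(seed)
    t = np.arange(int(sr * duration)) / sr
    ir = rng.randn(t.size) * np.exp(-t / tau_c)
    return ir / np.max(np.abs(ir))  # normalize peak amplitude

def apply_reverb(audio, ir):
    """Reverberate by full convolution, then trim back to the original length."""
    return np.convolve(audio, ir)[: audio.size]
```

For long signals an FFT-based convolution (e.g. `scipy.signal.fftconvolve`) is substantially faster, but the operation is the same.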
Training audio was processed using librosa 0.7.2 [8], which generated a PCEN spectrogram for each audio sample and rate parameter T. Each spectrogram was computed with a fixed sampling rate, window size, and hop length, using 128 mel-frequency bands; the resulting frame rate was approximately 86 Hz, so at a 10-second sample length each spectrogram had a horizontal length of 862 frames per band.

Our rate parameters were 10 logarithmically spaced values, ranging from T = 1 to T = 512, corresponding to averaging over windows ranging from milliseconds to seconds. The adaptive gain-control offset ε, the gain α, the dynamic-range-compression bias δ = 2, and the compression power r were held fixed for all rate configurations; these parameters were chosen based on prior work in audio classification [1]. Each audio sample had 10 PCEN spectrograms, one per rate parameter; individual layers/subsets were selected during training and evaluation. Following the channels-last convention of Keras, the resulting PCEN spectrograms have shape (128, 862, 10), and the resulting log-mel spectrograms have shape (128, 862, 1).

The network architecture is inspired by the L3 audio subnetwork [9] for discriminative audio-video correspondence embeddings. This architecture has demonstrated success in classification and prediction at a fine time resolution with mel-spectrogram inputs [2]. We follow the implementation of this architecture found in [2], with the input layer adjusted to accept multi-layer images, ensuring that all methods were evaluated on near-identical architectures.

"My Bedroom", https://freesound.org/people/Uzbazur/sounds/382907/
"alley.wav", https://freesound.org/people/NoiseCollector/sounds/126804/
"tunnel 2013", https://freesound.org/people/recordinghopkins/sounds/175358/
Additional parameters in the input layer may lead to lower performance due to additional depth requirements, but this is assumed to be negligible for this application.

Models were built and trained using Keras 2.3.1 [10] with TensorFlow 2.2.0 [11]. The model was trained using the Adam optimizer [12] on UrbanSED's pre-folded training and validation sets. The loss function was binary cross-entropy on a per-class, per-frame basis; the validation metric was accuracy. If no improvement in accuracy was seen within 10 epochs, the learning rate was reduced; early stopping was triggered if no improvement was seen after 30 epochs. Evaluation was handled via the sed_eval package [13], which computed segment-based classification metrics, including precision, recall, and error rate. We used the F1-score, the harmonic mean of precision and recall, as our main metric of effectiveness. These metrics were computed both per-class and overall. For reproducibility, the implementation and experimental framework, including data preprocessing, model training, and evaluation, is publicly available on GitHub.

We trained and evaluated models across multiple datasets to test robustness and stability under audio deformation. We primarily used the dry dataset, UrbanSED with no reverb augmentation, as our baseline, to ensure we were achieving the state-of-the-art performance seen in [2]. The realreverb set is built from the reverb-augmented audio using the 3 real recorded impulse responses, and is the primary set we evaluate, as it is closest to real-world conditions. The simreverb set, using the remaining 3 synthetic impulse responses, demonstrated the strongest separation between PCEN and log-mel models, but the results don't generalize as well as with real impulse responses. All models were trained on pitch-shift-augmented data, but were validated and evaluated on non-shifted data.

We trained and evaluated models on fixed sets of rate parameters.
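For reference, the micro-averaged segment-based F1 used as the main metric (computed by sed_eval in the actual pipeline) reduces to the following simplified sketch over binary activity matrices:

```python
import numpy as np

def segment_f1(y_true, y_pred):
    """Micro-averaged F1 over binary per-segment, per-class activity matrices.

    y_true, y_pred: arrays of shape (n_segments, n_classes) with {0, 1} entries.
    """
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

Micro-averaging pools true/false positives across classes before computing the harmonic mean, so frequent classes weigh more heavily than in a per-class (macro) average.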
We experimented with single-rate-parameter models and varying sets of multi-layered models, each with rate constants T_k = 2^k for k ∈ {0, 1, ..., 9}. We tested a total of 44 unique rate parameter configurations for our PCEN models. This included single-rate-parameter models and n-layer models including rate parameters from T_i to T_{i+n−1}. As a baseline, we also computed a model using a traditional log-scaled mel-spectrogram, as seen in previous literature for this application and dataset [2].

https://github.com/ChrisIck/pcen-t-varying

Fig. 2. F1-score of single-rate PCEN models at various T values (blue dots), compared to 8-layer MRPCEN (dashed green) and log-mel (dotted red) models evaluated on the 100 bootstrapped samples of the realreverb dataset.

Each set of models was trained and evaluated on both datasets to see how stable each model was under reverberation conditions both contained in the respective training dataset and distinct from those in the training set. Evaluation was done on bootstrapped subsamples of each evaluation set, sampling 100 evaluation examples with replacement, 100 times per model and dataset. Evaluations were computed both as overall micro-averaged performance metrics across classes and on a per-class basis.
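The bootstrapped evaluation above (100 resamples of 100 examples, with replacement) can be sketched as follows, with a per-example score array standing in for the sed_eval outputs:

```python
import numpy as np

def bootstrap_scores(scores, n_boot=100, n_samples=100, seed=0):
    """Bootstrap a per-example metric: draw n_samples examples with replacement,
    n_boot times, and return the mean metric of each resample."""
    rng = np.random.RandomState(seed)
    scores = np.asarray(scores)
    idx = rng.randint(0, scores.size, size=(n_boot, n_samples))
    return scores[idx].mean(axis=1)
```

The spread of the returned means gives an empirical confidence interval for each model/dataset pair, which is what the error structure in figure 2 reflects.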
4. RESULTS

4.1. Per-class performance
We demonstrate PCEN's robustness to noise by comparing the performance of models trained and evaluated on the realreverb dataset.

Table 1. Overall micro-averaged F1 scores on the realreverb and simreverb datasets.

    Model     Overall F1 (realreverb)   Overall F1 (simreverb)   Overall F1 (All Data)
    Logmel    0.334                     0.175                     0.274
    PCEN      0.345                     0.167                     0.268
    MRPCEN

We see in figure 2 that single-rate PCEN performs at or above log-scaled spectrograms, depending on the choice of rate parameter T and the class of the target event. Certain classes, like jackhammer and gunshot, prefer higher T values, where mid-low-frequency content is filtered out by the adaptive gain stage of PCEN. Similarly, for higher- and full-band-frequency stationary events, such as air conditioners and sirens, lower values of T perform better, as the low-frequency noise doesn't interfere with the relevant frequency bands. However, it is crucial to note that, due to the diversity of sonic characteristics across these classes, no single value of T produces the strongest performance overall.

MRPCEN successfully outperforms most single-rate models in cross-class performance, as seen in figure 2. The plot shows the per-class performance of each model evaluated on the realreverb dataset, compared to a log-mel-based model and an 8-layer MRPCEN. Because the MRPCEN model preserves information at multiple rates, it consistently performs at or above the majority of the single-rate models. Furthermore, MRPCEN performs at or above the level of the standard log-mel spectrogram with the exception of the "car horn" and "jackhammer" classes, giving it the highest overall performance on this dataset (compared to the strongest-performing single-rate PCEN model) in table 1.
5. CONCLUSION
The results here show PCEN as a viable alternative to log-mel spectrograms, with equivalent or improved performance depending on the choice of rate parameter, which in turn depends on the acoustic characteristics of the target sound event and acoustic environment. We can also see that MRPCEN provides cross-class performance improvements over single-rate PCEN models. MRPCEN permits simultaneous prediction of differing classes with distinct acoustic characteristics in a single model. In a field setting, this will reduce the per-class knowledge and tuning expertise needed to effectively deploy a model that performs well in multi-class applications.

6. ACKNOWLEDGEMENTS
This work was supported in part by NSF award 1955357, and in part by NYU IT High Performance Computing resources, services, and staff expertise.
7. REFERENCES

[1] Vincent Lostanlen, Kaitlin Palmer, Elly Knight, Christopher Clark, Holger Klinck, Andrew Farnsworth, Tina Wong, Jason Cramer, and Juan Bello, "Long-distance detection of bioacoustic events with per-channel energy normalization," Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), 2019.

[2] B. McFee, J. Salamon, and J. P. Bello, "Adaptive pooling operators for weakly labeled sound event detection," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 11, pp. 2180–2193, 2018.

[3] Yuxuan Wang, Pascal Getreuer, Thad Hughes, Richard F. Lyon, and Rif A. Saurous, "Trainable frontend for robust and far-field keyword spotting," CoRR, vol. abs/1607.05666, 2016.

[4] Jan Schlüter and Bernhard Lehner, "Zero-Mean Convolutions for Level-Invariant Singing Voice Detection," in Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR 2018), Paris, France, 2018.

[5] J. Salamon, D. MacConnell, M. Cartwright, P. Li, and J. P. Bello, "Scaper: A library for soundscape synthesis and augmentation," 2017, pp. 344–348.

[6] V. Lostanlen, J. Salamon, M. Cartwright, B. McFee, A. Farnsworth, S. Kelling, and J. P. Bello, "Per-channel energy normalization: Why and how," IEEE Signal Processing Letters, vol. 26, no. 1, pp. 39–43, 2019.

[7] Brian McFee, Eric J. Humphrey, and Juan P. Bello, "A software framework for musical data augmentation," in ISMIR, 2015.

[8] Brian McFee, Colin Raffel, Dawen Liang, Daniel P.W. Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto, "librosa: Audio and Music Signal Analysis in Python," in Proceedings of the 14th Python in Science Conference, Kathryn Huff and James Bergstra, Eds., 2015, pp. 18–24.

[9] Relja Arandjelovic and Andrew Zisserman, "Look, listen and learn,"