Acoustic event detection for multiple overlapping similar sources
Dan Stowell and David Clayton
Queen Mary University of London
London, UK
[email protected]

July 1, 2018
Abstract
Many current paradigms for acoustic event detection (AED) are not adapted to the organic variability of natural sounds, and/or they assume a limit on the number of simultaneous sources: often only one source, or one source of each type, may be active. These aspects are highly undesirable for applications such as bird population monitoring. We introduce a simple method modelling the onsets, durations and offsets of acoustic events to avoid intrinsic limits on polyphony or on inter-event temporal patterns. We evaluate the method in a case study with over 3000 zebra finch calls. In comparison against a HMM-based method we find it more accurate at recovering acoustic events, and more robust for estimating calling rates.
1 Introduction

Acoustic event detection (AED) is useful for various purposes, such as security monitoring, wildlife monitoring, and music transcription [1, 2, 3, 4]. Many approaches to AED assume that the acoustic scene is monophonic (having no simultaneous or overlapping events), which is unrealistic but useful for some applications. Approaches which allow for polyphonic scenes are more flexible, but often assume that each stream is different in kind, allowing one monophonic stream for each class of event considered [5, 6]. They may also assume a fixed number of simultaneous streams, for example when source separation is applied as a first step and then each separated channel is treated as a monophonic scene [7]. In this paper we explore approaches for AED in cases with an unknown number of similar sources. As an example, consider a sound recording in which a flock of birds can be heard calling, all of the same species. This is representative of practical scenarios in which AED might assist ecologists or conservation organisations wishing to estimate the total number of individuals detected in a recording, or alternatively the total number of calls in a recording. Both such "point counts" are used (in most cases manually detected) for monitoring trends in breeding populations [4].

Note the specific information need here: overall event counts, which is distinct from the "transcription" task in which the objective is to recover a list of the true events. An exact transcription would itself give us an exact value for the event count; but an imperfect transcription may be an inefficient or biased route to count estimation. In the abstract, event counting tasks are regression problems, and may not require an estimated event transcript at any point. Also important is that we wish to avoid placing limits or inappropriate biases on the number of events that can be simultaneously active.
Methods that assume monophonic sequences per event class are likely to be inappropriate, and other methods may bias estimation due to assumptions implicit in their models. With this in mind, we briefly consider some event detection paradigms in previous work.

To decompose an audio scene, one strand of research uses non-negative matrix factorisation (NMF), in particular convolutive NMF which allows events to have spectro-temporal structure [8]. However, these models are inflexible about the temporal evolution within an event, depending on good matching of spectro-temporal templates. This is particularly problematic for sounds with much inherent within-class variability, such as animal calls.

Hidden Markov models (HMMs) have been used in various systems for acoustic event detection (e.g. [9, 5]). These can allow for variation in the temporal evolution of events. However, a typical HMM corresponds to a monophonic model of events; extensions such as the factorial HMM extend this to a specific fixed number of parallel sources, and thus retain strong limits on the level of polyphony. In one example of a polyphonic adaptation of HMM tracking, [5] train and apply a standard HMM for event detection, where in their case each state corresponds to a class of event. To achieve polyphonic detection they perform multiple Viterbi decoding passes: after each Viterbi pass, the states used are taken out of consideration for the subsequent passes. In this way a transcription is obtained which allows multiple event classes to occur in parallel. However, it does not allow multiple simultaneous events of the same class, and retains the fixed limit on polyphony.

In Section 3 we will describe an alternative way to adapt a HMM to the multiple-detection scenario. However, that is primarily as a point of comparison against the main model we wish to explore here, which uses an onset-duration-offset model of acoustic events to allow for unbounded polyphony. We describe this method in the next section.
Then we will describe our alternative HMM method, before evaluating both methods using a dataset of bird calls.

2 Detection using onsets, durations and offsets

Physiological studies indicate that biological auditory processing involves early-stage "edge detectors" having separate auditory detection units for onsets and for offsets, both in humans [10] and in songbirds [11]. The information from these detectors is then combined in later processing for cognition of "auditory objects" or events.

Figure 1: Schematic diagram of onset-duration-offset event detection model. [Diagram: audio is passed to an onset detector and an offset detector; their outputs are fused with prior beliefs about duration to produce a multiple event sequence.]

Although there is no requirement for our computational systems to mimic the organisation found in nature, this suggests that a processing strategy starting with onset and offset detectors and combining their outputs may be fruitful. We can combine onsets and offsets with other information to yield posterior beliefs about the events observed (Figure 1). If these components are based on the characteristics of individual events, and not the temporal relationships between events, we should be able to design a system that imposes few constraints or biases on the observable event patterns.

The scheme just presented assumes that the onset and offset characteristics are the reliable, relatively invariant characteristics of the events of interest. However, it makes no strong assumptions on event durations, nor even the signal content in the middle of the event, thus allowing for organic variability. It also makes no strong assumptions on the temporal occurrence patterns, and in particular the level of polyphony is unbounded: at any particular time, if the system overall has detected k more onsets than offsets, then the current number of parallel active events is k.

Our approach is probabilistic: we will use onset/offset detectors that yield detection probabilities at each time point, and our prior beliefs about event duration will be expressed as a distribution over durations.
To combine these probabilities together, we characterise acoustic events in a two-dimensional space indexed by onset time t and duration τ. We express the conditional probability of an event at some point in that space as

    p_evt(t, τ | y) ∝ p_on(t | y) p_off(t + τ | y) p_dur(τ)

where y is the observation (the audio signal). The conditional probabilities p_on and p_off come from the detectors, and p_dur is our duration prior.

Figure 2: Schematic diagram of how probabilities are combined in the onset-duration-offset event detection model. Each source of probabilistic information is a marginal with respect to a different direction in the [onset × duration] space. [Diagram panels: onset detection probability, offset detection probability, prior on event durations, posterior over onset time × event duration.]

When dealing with discretised time (as we do here), this Bernoulli model imposes a mild constraint: no two events can have exactly the same onset time and duration. This constraint is very mild, since events can co-occur in our scheme as long as they have slight mutual differences in onset time or duration. (We will impose a slightly stronger constraint to recover an event transcript, described shortly.)

Each of our probabilistic sources of information (onsets, offsets, durations) gives us information that acts as a one-dimensional marginal, when considered in our two-dimensional [onset × duration] space (Figure 2). Note particularly that offset detection probabilities are translated into [onset × duration] space with an off-axis influence, since offset time is equivalent to onset time plus duration. In this space, we assume that the three types of probability information are conditionally independent and multiply them together to produce a posterior "intensity" over possible events (Figure 2). This could be thresholded to give a polyphonic event sequence, or marginalised to give posterior beliefs about onsets and offsets. Such posterior beliefs will be related to the raw onset/offset detections but refined using the other information sources. Note that the posterior is not a probability distribution: it does not sum to one over our 2D space. It represents a set of Bernoulli probabilities; the sum over the 2D posterior gives the expected number of events.

In the implementation we use here, for onset and offset detectors we use random forest regression (cf. [12]).
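Returning to the probability combination above: over discretised time it can be implemented directly as an outer product-like construction. The following is an illustrative sketch, not the authors' code; the function name `odo_posterior` and the edge handling are our own assumptions.

```python
import numpy as np

def odo_posterior(p_on, p_off, p_dur):
    """Combine per-frame onset probabilities p_on[t], offset probabilities
    p_off[t] and a duration prior p_dur[d] into a 2D posterior intensity
    over (onset time, duration): p_evt(t, d) ~ p_on(t) p_off(t+d) p_dur(d)."""
    T, D = len(p_on), len(p_dur)
    post = np.zeros((T, D))
    for t in range(T):
        dmax = min(D, T - t)  # offsets must fall within the recording
        post[t, :dmax] = p_on[t] * p_off[t:t + dmax] * p_dur[:dmax]
    return post
```

Since each cell holds a Bernoulli probability, `post.sum()` gives the expected number of events, as described in the text.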
Our detectors take spectrogram patches as input, the spectrograms having been treated with background noise reduction by median-thresholding, and then first differencing in time. The trained random forest outputs detection probabilities for each patch. Since the regression makes an independence assumption for adjacent (overlapping) patches, the outputs are liable to be correlated in time; to reduce this effect, after training the random forest we then train an ordinary least squares regression from a sliding window of 11 outputs from the detector onto the ground truth, to recover "sharper" detections. To implement our prior on event durations, we will train a Gaussian mixture model (GMM) on the durations observed in training data.

We note some resemblances between our approach and that of [12]. Those authors also use random forest regression as a recognition component that contributes towards an eventual event segmentation. However, their method is fundamentally different in that its elements for recognition are not the onsets/offsets, but the frames "within" an event, which have been augmented with pointers to their associated onset/offset. For this and other reasons their approach is limited to monophonic event detection.

In the following we will refer to our onset-duration-offset model as ODO for short.
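A minimal sketch of the detector pipeline just described, assuming flattened spectrogram patches as feature vectors and using the sklearn estimators named in Section 4; the function names and the edge-padding choice are our own illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

def _windows(raw, window):
    # Sliding windows of raw detector outputs, edge-padded at the boundaries
    half = window // 2
    padded = np.pad(raw, half, mode="edge")
    return np.stack([padded[i:i + window] for i in range(len(raw))])

def train_onset_detector(patches, labels, window=11, seed=0):
    """Random forest regression from patches to detection probability,
    followed by an OLS 'sharpening' stage over 11-frame windows of raw
    outputs, as in the text."""
    rf = RandomForestRegressor(n_estimators=20, random_state=seed)
    rf.fit(patches, labels)
    ols = LinearRegression().fit(_windows(rf.predict(patches), window), labels)
    return rf, ols

def detect(rf, ols, patches, window=11):
    # Apply both stages; clip to keep outputs interpretable as probabilities
    return np.clip(ols.predict(_windows(rf.predict(patches), window)), 0.0, 1.0)
```

An identical pair of detectors would be trained for offsets, with offset ground-truth labels.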
To recover a definite set of events, our 2D posterior can be thresholded using a threshold determined during the training phase. In practice, however, we observe that this tends to yield a large number of duplicated events, since any particular ground-truth event of duration τ will often be detected with a relatively strong probability for duration τ + 1, τ − 1, etc., each of which is a separate position in our 2D posterior. This effect can be reduced by imposing further assumptions: perhaps about the maximum polyphony, or the temporal pattern. In the present work we wish to avoid imposing assumptions that might strongly bias event counts and the like. We choose to impose an assumption which is partly implicit in the detectors themselves: the assumption that only one onset, and one offset, may happen in any time frame. This assumption may bias detection in very dense audio recordings, but for many densities encountered in practice it holds almost always. Hence to recover an event transcript, we keep only the events whose posterior probability is stronger than all other events with matching onset time or offset time. From these events we then select a threshold for discarding low-probability events.

Taking Figure 2 as an example, in the posterior we see that the density due to the second detected onset overlaps in 2D space with two possible offsets. In this case we would keep no more than one of these events, namely that which yields the strongest probability.

A straightforward way to count events is to count the number of items included in an event transcript. However, as just described, producing an event transcript requires making "hard" thresholding decisions, discarding some information from the posterior. We can avoid the transcription step and simply use the sum of the posterior, over the time range of interest, as the expected event count in that time range. We will use this in our evaluation.
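The transcript-recovery rule and the posterior-sum count estimate can be sketched as follows; this greedy, strongest-first selection is our own illustrative reading of the rule, not necessarily the authors' exact procedure.

```python
import numpy as np

def recover_transcript(post, threshold):
    """Keep only candidate events that beat every other candidate sharing
    their onset frame or their offset frame, then apply the threshold.
    post[t, d] is the posterior for an event with onset t, duration d."""
    cands = [(post[t, d], t, t + d)
             for t in range(post.shape[0]) for d in range(post.shape[1])
             if post[t, d] > threshold]
    used_on, used_off, events = set(), set(), []
    for p, on, off in sorted(cands, reverse=True):  # strongest first
        if on not in used_on and off not in used_off:
            events.append((on, off, p))
            used_on.add(on)
            used_off.add(off)
    return events

def expected_count(post):
    """Sum of the per-cell Bernoulli probabilities = expected event count."""
    return float(post.sum())
```

Note that `expected_count` needs no thresholding at all, which is why the evaluation uses it directly for counting.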
3 A HMM method for comparison

As an alternative to the method presented above, we also describe a hidden Markov model (HMM) method for event detection. As described in Section 1, HMM approaches to event detection impose limits on the possible polyphony, and also may bias the durations and timing patterns of detected events. With those caveats acknowledged, we wish to use the HMM as a point of comparison since it is in widespread use. So in order to detect events of a single type, but with potential polyphony, we apply a HMM where the hidden states are not simply 'on' and 'off', but the count of currently-active events, i.e. the count of events that have started but not yet ended. The state space is thus {0, 1, ..., K} where K is the maximum number of simultaneous events observed in the training data.

For modelling the observations, we train a separate Gaussian mixture model (GMM) with ten components for each cardinality. Again we use spectral patches to train this model, but without differencing them in time, since in this case we aim to model states rather than transitions.

To recover an event transcription from this HMM, we perform Viterbi decoding. From the decoded sequence of cardinalities, we deduce onset and offset times, and we associate onsets and offsets with each other in order of occurrence. This transcript is then also used for event counting.

3.1 Combining the ODO and HMM approaches

The ODO and HMM models we have described offer two very different approaches to event detection. We note that it is possible to combine the two, as follows. We can expand the HMM state space to include not only the current event cardinality, but also two binary indicators of whether the current frame includes an onset and/or an offset. This expands the set of HMM states by a factor of four. Not all state transitions are possible: e.g. a change in cardinality from 3 to 4 can only occur when the onset state is 1. We do not impose such limits manually but allow the system to learn them from the transitions seen in training data.
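A sketch of the expanded state space, with an explicit version of the kind of transition constraint the system could learn (shown purely for illustration; as the text notes, such constraints are learned from training data rather than imposed):

```python
from itertools import product

def expanded_states(K):
    """Combined ODO+HMM state space: (cardinality, onset flag, offset flag),
    i.e. (K+1) cardinalities times 4 indicator combinations."""
    return list(product(range(K + 1), (0, 1), (0, 1)))

def transition_feasible(s, t):
    """Illustrative feasibility check: a cardinality increase requires the
    onset flag in the new state, a decrease requires the offset flag, and
    (assuming one onset/offset per frame) the change is at most one."""
    (c1, _, _), (c2, on2, off2) = s, t
    if c2 > c1 and not on2:
        return False
    if c2 < c1 and not off2:
        return False
    return abs(c2 - c1) <= 1
```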
Even with this expanded state space, HMM-based models inherit the limitations already mentioned: cardinalities higher than seen in the training data will not be correctly detected, and the HMM may bias timing patterns. However, in the following evaluation we will compare the empirical characteristics of the methods we have described.
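Looking back at the decoding step of Section 3: deducing onsets and offsets from a decoded cardinality sequence, pairing them in order of occurrence, might look like this sketch (the FIFO pairing and the handling of events still open at the end are our own assumptions).

```python
def cardinality_to_events(cards, frame_dur=1.0):
    """Turn a decoded sequence of event cardinalities into (onset, offset)
    time pairs, associating onsets and offsets in order of occurrence."""
    open_onsets, events = [], []
    prev = 0
    for i, c in enumerate(cards):
        if c > prev:  # one or more events start at this frame
            open_onsets.extend([i] * (c - prev))
        elif c < prev:  # one or more events end at this frame
            for _ in range(prev - c):
                onset = open_onsets.pop(0)  # earliest onset ends first
                events.append((onset * frame_dur, i * frame_dur))
        prev = c
    for onset in open_onsets:  # close any events still open at the end
        events.append((onset * frame_dur, len(cards) * frame_dur))
    return events
```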
4 Evaluation

We recorded a set of four female zebra finches (Taeniopygia guttata) in an indoor aviary. The birds exchanged contact calls at a rate of approximately 40 calls per minute. Birds were recorded for extended periods, and their calls were transcribed as event sequences. Transcription was performed separately by two human annotators, whose annotations were combined automatically, and any discrepancies resolved by the first author.

This recording setup was designed for use in various studies; in the present paper we use it as a case study for event detection. We took two 30-minute recording sessions, recorded with the same birds but on separate days, and used these two sessions for two-fold crossvalidation. The 30-minute sessions contained 1663 and 1770 annotated calls in total. Here we use single-channel omni mic recordings of the sessions.

The true polyphony in the original recordings ranged from zero to four. In order to investigate heavier densities, as well as to investigate the effect of density mismatches between training and test data, for each 30-minute session we also created an artificial 10-minute mixture with the three 10-minute segments superimposed. All experiments thus used the same set of calls, but in some cases the training or test data was "folded" down to a denser 10-minute recording by superimposition.

As in other event detection evaluations [1, 2], for evaluating event transcription we use the F-measure metric, and we consider an event to be correctly recovered if the onset matches within a fixed tolerance and the duration matches within 50% of the true duration. The typical event duration in this data was approximately 100 ms, so we chose ±25 ms as our tolerance.

Separately, for evaluating event counts we divide the data into ten-second windows and measure the RMS error between the true and estimated number of events for each window. Note that both systems do exhibit some miscalibration, in that their estimated counts even on the training data could exhibit a multiplicative deviation from the truth. For the ODO system we believe this is largely due to the independence assumption already mentioned in the underlying detectors, and thus more sophisticated edge detectors might remedy this. To account for this most basic aspect of miscalibration, during training of all systems we used the training data to choose a multiplicative calibration factor to apply to all event counts. Calibration did not make use of test data. RMS error statistics are reported from the calibrated outputs.

We tested the following five configurations of event detector:

• The full ODO system of Section 2.
• The ODO system with a flat duration prior rather than the GMM. This constrained events to lie within a reasonable duration but did not learn a distribution on the duration, and allowed us to probe the extent of the benefit provided by the GMM duration prior.

• The HMM system of Section 3.

• The combined ODO+HMM system of Section 3.1.

• The raw output from the onset-detector component. This can be evaluated for event counts only, not transcription, but indicates the extent to which the detector component contributes to ODO performance.

Specific implementation details were as follows. Audio was recorded at 96 kHz, and spectrograms calculated with frame length 2048, 50% overlap and Hann windowing. Spectral information outside the range 0.5–20 kHz was discarded. Noise reduction was implemented using spectral median-subtraction, with the median calculated in a sliding ten-second window for each spectral band. Spectral patches for the onset/offset detectors were taken from the time-differenced spectrogram, of size five frames before and five frames after the frame under consideration. The detectors were implemented using random forest regression from the sklearn module, with 20 trees. Spectral patches for the HMM-GMM modelling were taken from the non-time-differenced spectrogram, of size five frames after the frame under consideration.

Results for event transcription (Figure 3) show a number of tendencies. Firstly, the full ODO model consistently outperforms the ODO model with a flat duration prior, indicating that the learned prior on event durations adds useful information. Secondly, the HMM system generally performs much worse than the ODO system.
This poor performance might be attributed to various points of difference between the two systems, and so it is interesting to observe that the combined ODO+HMM system improves on HMM but does not approach ODO's strong performance, despite making use of ODO's onset/offset output as part of its input observations.

Figure 3: Event transcription results (event-wise peak F-measure, %) for ODO, ODO (flat), HMM and ODO+HMM, across true event density ratios (training:test) of 1:3, 1:1 and 3:1, averaged over the two cross-validation folds. Error bars cover the range of results obtained within individual folds.

The experiments with mismatched density in training and test give a general performance degradation for all systems, indicating that there is still some way to go to perform detection robust to very wide variation in event density. The HMM-based systems perform relatively well in the experiment with increased training density, an exception to the general pattern. Conversely, the poor performance of the HMM-based systems in the experiment with increased testing density matches expectations, since the training did not encompass all the event cardinalities found in the testing data.

In best conditions, our ODO system achieved an F-measure of around 37%. Although quite distant from the ideal of 100%, it is on a similar scale as the results reported for state-of-the-art methods for related tasks [1, 2].

Results for event counting (Figure 4) show a slightly different picture. Again, the mismatch in training and testing conditions has a general negative impact on performance. In the main experiment, most of the systems perform at very similar quality levels. This is except for the raw onset detector output, included for comparison, which performs notably worse than the full systems, illustrating that the ODO method performs much better than its underlying detector. However, although the HMM and ODO+HMM systems achieve similar performance as ODO in the main experiment, this is not the case in the experiments with mismatched training and test densities, for which their performance degrades further.

Figure 4: Count estimation results (RMS error of counts in ten-second time windows) for ODO, ODO (flat), HMM, ODO+HMM and raw onsets, across the same density ratios. The y-axis is inverted so that upwards means better performance, to match Figure 3.

Taken together, these evaluations indicate that the ODO method is more accurate and more robust than the HMM method for detecting or counting events in polyphonic bird recordings such as those we have studied. The method can be used for any data with events well-characterised by 'landmarks' such as onsets and offsets, including animal and human sounds. However, we note that all the systems evaluated here showed quite some decay in performance when evaluated with mismatched event densities. Improving these polyphonic detection paradigms to be robust to these wide ranges of event density remains as future work. Good detector components must be a key to strong performance: for the present work we used simple detection methods using spectral patches as data; improvements such as feature learning [13] could improve detection performance. It also remains to evaluate systems on a wider range of event-annotated audio recordings.
Acknowledgments

Our thanks to Maeve McMahon for help with bird management, and to Robert Jack and Alex Wilson for data annotation.

References

[1] D. Giannoulis, E. Benetos, D. Stowell, M. Rossignol, M. Lagrange, and M. D. Plumbley, "Detection and classification of acoustic scenes and events: an IEEE AASP challenge," in Proceedings of the Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2013.

[2] D. Stowell, D. Giannoulis, E. Benetos, M. Lagrange, and M. D. Plumbley, "Detection and classification of acoustic scenes and events," IEEE Transactions on Multimedia, 2015.

[3] R. Stiefelhagen, K. Bernardin, R. Bowers, J. Garofolo, D. Mostefa, and P. Soundararajan, "The CLEAR 2006 evaluation," Multimodal Technologies for Perception of Humans, pp. 1–44, 2007.

[4] T. A. Marques, L. Thomas, S. W. Martin, D. K. Mellinger, J. A. Ward, D. J. Moretti, D. Harris, and P. L. Tyack, "Estimating animal population density using passive acoustics," Biological Reviews, 2012.

[5] A. Diment, T. Heittola, and T. Virtanen, "Sound event detection for office live and office synthetic AASP challenge," in IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (WASPAA 2013 special session), 2013, http://c4dm.eecs.qmul.ac.uk/sceneseventschallenge/abstracts/OL/DHV.pdf.

[6] S. Ewert, M. D. Plumbley, and M. Sandler, "A dynamic programming variant of non-negative matrix deconvolution for the transcription of struck string instruments," in Proc ICASSP 2015, 2015.

[7] T. Heittola, A. Mesaros, T. Virtanen, and A. Eronen, "Sound event detection in multisource environments using source separation," in Workshop on Machine Listening in Multisource Environments (CHiME 2011), 2011, pp. 36–40. [Online]. Available: http://spandh.dcs.shef.ac.uk/projects/chime/workshop/papers/pS32_heittola.pdf

[8] P. D. O'Grady and B. A. Pearlmutter, "Convolutive non-negative matrix factorisation with a sparseness constraint," in Machine Learning for Signal Processing, 2006. Proceedings of the 2006 16th IEEE Signal Processing Society Workshop on. IEEE, 2006, pp. 427–432.

[9] E. Benetos, M. Lagrange, and S. Dixon, "Characterisation of acoustic scenes using a temporally-constrained shift-invariant model," in Conference on Digital Audio Effects (DAFx-12), vol. 17, 2012, p. 21.

[10] M. Chait, D. Poeppel, and J. Z. Simon, "Auditory temporal edge detection in human auditory cortex," Brain Research, vol. 1213, pp. 78–90, 2008.

[11] S. M. Woolley, P. R. Gill, T. Fremouw, and F. E. Theunissen, "Functional groups in the avian auditory system," The Journal of Neuroscience, vol. 29, no. 9, pp. 2780–2793, 2009.

[12] H. Phan, M. Maasz, R. Mazur, and A. Mertins, "Random regression forests for acoustic event detection and classification," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 20–31, 2014.

[13] D. Stowell and M. D. Plumbley, "Automatic large-scale classification of bird sounds is strongly improved by unsupervised feature learning,"