Acoustic event detection for multiple overlapping similar sources
Dan Stowell and David Clayton
Queen Mary University of London
London, UK
[email protected]

July 1, 2018
Abstract
Many current paradigms for acoustic event detection (AED) are not adapted to the organic variability of natural sounds, and/or they assume a limit on the number of simultaneous sources: often only one source, or one source of each type, may be active. These aspects are highly undesirable for applications such as bird population monitoring. We introduce a simple method modelling the onsets, durations and offsets of acoustic events to avoid intrinsic limits on polyphony or on inter-event temporal patterns. We evaluate the method in a case study with over 3000 zebra finch calls. In comparison against a HMM-based method we find it more accurate at recovering acoustic events, and more robust for estimating calling rates.
1 Introduction

Acoustic event detection (AED) is useful for various purposes, such as security monitoring, wildlife monitoring, and music transcription [1, 2, 3, 4]. Many approaches to AED assume that the acoustic scene is monophonic (having no simultaneous or overlapping events), which is unrealistic but useful for some applications. Approaches which allow for polyphonic scenes are more flexible, but often assume that each stream is different in kind, allowing one monophonic stream for each class of event considered [5, 6]. They may also assume a fixed number of simultaneous streams, for example when source separation is applied as a first step and then each separated channel is treated as a monophonic scene [7]. In this paper we explore approaches for AED in cases with an unknown number of similar sources. As an example, consider a sound recording in which a flock of birds can be heard calling, all of the same species. This is representative of practical scenarios in which AED might assist ecologists or conservation organisations wishing to estimate the total number of individuals detected in a recording, or alternatively the total number of calls in a recording. Both such "point counts" are used (in most cases manually detected) for monitoring trends in breeding populations [4].

Note the specific information need here: overall event counts, which is distinct from the "transcription" task in which the objective is to recover a list of the true events. An exact transcription would itself give us an exact value for the event count; but an imperfect transcription may be an inefficient or biased route to count estimation. In the abstract, event counting tasks are regression problems, and may not require an estimated event transcript at any point. Also important is that we wish to avoid placing limits or inappropriate biases on the number of events that can be simultaneously active.
Methods that assume monophonic sequences per event class are likely to be inappropriate, and other methods may bias estimation due to assumptions implicit in their models. With this in mind, we briefly consider some event detection paradigms in previous work.

To decompose an audio scene, one strand of research uses non-negative matrix factorisation (NMF), in particular convolutive NMF which allows events to have spectro-temporal structure [8]. However, these models are inflexible about the temporal evolution within an event, depending on good matching of spectro-temporal templates. This is particularly problematic for sounds with much inherent within-class variability, such as animal calls.

Hidden Markov models (HMMs) have been used in various systems for acoustic event detection (e.g. [9, 5]). These can allow for variation in the temporal evolution of events. However, a typical HMM corresponds to a monophonic model of events; extensions such as the factorial HMM extend this to a specific fixed number of parallel sources, and thus retain strong limits on the level of polyphony. In one example of a polyphonic adaptation of HMM tracking, [5] train and apply a standard HMM for event detection, where in their case each state corresponds to a class of event. To achieve polyphonic detection they perform multiple Viterbi decoding passes: after each Viterbi pass, the states used are taken out of consideration for the subsequent passes. In this way a transcription is obtained which allows multiple event classes to occur in parallel. However, it does not allow multiple simultaneous events of the same class, and retains the fixed limit on polyphony.

In Section 3 we will describe an alternative way to adapt a HMM to the multiple-detection scenario. However, that is primarily as a point of comparison against the main model we wish to explore here, which uses an onset-duration-offset model of acoustic events to allow for unbounded polyphony. We describe this method in the next section.
Then we will describe our alternative HMM method, before evaluating both methods using a dataset of bird calls.

2 Detection using onsets, durations and offsets

Physiological studies indicate that biological auditory processing involves early-stage "edge detectors" having separate auditory detection units for onsets and for offsets, both in humans [10] and in songbirds [11]. The information from these detectors is then combined in later processing for cognition of "auditory objects" or events.

Figure 1: Schematic diagram of onset-duration-offset event detection model. [Diagram: audio is passed to an onset detector and an offset detector; their outputs are fused with prior beliefs about duration to produce a multiple event sequence.]

Although there is no requirement for our computational systems to mimic the organisation found in nature, this suggests that a processing strategy starting with onset and offset detectors and combining their outputs may be fruitful. We can combine onsets and offsets with other information to yield posterior beliefs about the events observed (Figure 1). If these components are based on the characteristics of individual events, and not the temporal relationships between events, we should be able to design a system that imposes few constraints or biases on the observable event patterns.

The scheme just presented assumes that the onset and offset characteristics are the reliable, relatively invariant characteristics of the events of interest. However, it makes no strong assumptions on event durations, nor even the signal content in the middle of the event, thus allowing for organic variability. It also makes no strong assumptions on the temporal occurrence patterns, and in particular the level of polyphony is unbounded: at any particular time, if the system overall has detected k more onsets than offsets, then the current number of parallel active events is k.

Our approach is probabilistic: we will use onset/offset detectors that yield detection probabilities at each time point, and our prior beliefs about event duration will be expressed as a distribution over durations.
To combine these probabilities together, we characterise acoustic events in a two-dimensional space indexed by onset time t and duration τ. We express the conditional probability of an event at some point in that space as

    p_evt(t, τ | y) ∝ p_on(t | y) p_off(t + τ | y) p_dur(τ)

where y is the observation (the audio signal). The conditional probabilities p_on and p_off come from the detectors, and p_dur is our duration prior.

Figure 2: Schematic diagram of how probabilities are combined in the onset-duration-offset event detection model. Each source of probabilistic information is a marginal with respect to a different direction in the [onset × duration] space. [Diagram panels: onset detection probability, offset detection probability, prior on event durations, posterior over onset time × event duration.]

When dealing with discretised time (as we do here), this Bernoulli model imposes a mild constraint: no two events can have exactly the same onset time and duration. This constraint is very mild, since events can co-occur in our scheme as long as they have slight mutual differences in onset time or duration. (We will impose a slightly stronger constraint to recover an event transcript, described shortly.)

Each of our probabilistic sources of information (onsets, offsets, durations) gives us information that acts as a one-dimensional marginal, when considered in our two-dimensional [onset × duration] space (Figure 2). Note particularly that offset detection probabilities are translated into [onset × duration] space with an off-axis influence, since offset time is equivalent to onset time plus duration. In this space, we assume that the three types of probability information are conditionally independent and multiply them together to produce a posterior "intensity" over possible events (Figure 2). This could be thresholded to give a polyphonic event sequence, or marginalised to give posterior beliefs about onsets and offsets. Such posterior beliefs will be related to the raw onset/offset detections but refined using the other information sources. Note that the posterior is not a probability distribution: it does not sum to one over our 2D space. It represents a set of Bernoulli probabilities; the sum over the 2D posterior gives the expected number of events.

In the implementation we use here, for onset and offset detectors we use random forest regression (cf. [12]).
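Returning to the probability combination above: over discretised time it can be implemented directly as an outer product-like construction. The following is an illustrative sketch, not the authors' code; the function name `odo_posterior` and the edge handling are our own assumptions.

```python
import numpy as np

def odo_posterior(p_on, p_off, p_dur):
    """Combine per-frame onset probabilities p_on[t], offset probabilities
    p_off[t] and a duration prior p_dur[d] into a 2D posterior intensity
    over (onset time, duration): p_evt(t, d) ~ p_on(t) p_off(t+d) p_dur(d)."""
    T, D = len(p_on), len(p_dur)
    post = np.zeros((T, D))
    for t in range(T):
        dmax = min(D, T - t)  # offsets must fall within the recording
        post[t, :dmax] = p_on[t] * p_off[t:t + dmax] * p_dur[:dmax]
    return post
```

Since each cell holds a Bernoulli probability, `post.sum()` gives the expected number of events, as described in the text.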
Our detectors take spectrogram patches as input, the spectrograms having been treated with background noise reduction by median-thresholding, and then first differencing in time. The trained random forest outputs detection probabilities for each patch. Since the regression makes an independence assumption for adjacent (overlapping) patches, the outputs are liable to be correlated in time; to reduce this effect, after training the random forest we then train an ordinary least squares regression from a sliding window of 11 outputs from the detector onto the ground truth, to recover "sharper" detections. To implement our prior on event durations, we will train a Gaussian mixture model (GMM) on the durations observed in training data.

We note some resemblances between our approach and that of [12]. Those authors also use random forest regression as a recognition component that contributes towards an eventual event segmentation. However, their method is fundamentally different in that its elements for recognition are not the onsets/offsets, but the frames "within" an event, which have been augmented with pointers to their associated onset/offset. For this and other reasons their approach is limited to monophonic event detection.

In the following we will refer to our onset-duration-offset model as ODO for short.
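A minimal sketch of the detector pipeline just described, assuming flattened spectrogram patches as feature vectors and using the sklearn estimators named in Section 4; the function names and the edge-padding choice are our own illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

def _windows(raw, window):
    # Sliding windows of raw detector outputs, edge-padded at the boundaries
    half = window // 2
    padded = np.pad(raw, half, mode="edge")
    return np.stack([padded[i:i + window] for i in range(len(raw))])

def train_onset_detector(patches, labels, window=11, seed=0):
    """Random forest regression from patches to detection probability,
    followed by an OLS 'sharpening' stage over 11-frame windows of raw
    outputs, as in the text."""
    rf = RandomForestRegressor(n_estimators=20, random_state=seed)
    rf.fit(patches, labels)
    ols = LinearRegression().fit(_windows(rf.predict(patches), window), labels)
    return rf, ols

def detect(rf, ols, patches, window=11):
    # Apply both stages; clip to keep outputs interpretable as probabilities
    return np.clip(ols.predict(_windows(rf.predict(patches), window)), 0.0, 1.0)
```

An identical pair of detectors would be trained for offsets, with offset ground-truth labels.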
To recover a definite set of events, our 2D posterior can be thresholded using a threshold determined during the training phase. In practice, however, we observe that this tends to yield a large number of duplicated events, since any particular ground-truth event of duration τ will often be detected with a relatively strong probability for duration τ + 1, τ − 1, etc., each of which is a separate position in our 2D posterior. This effect can be reduced by imposing further assumptions: perhaps about the maximum polyphony, or the temporal pattern. In the present work we wish to avoid imposing assumptions that might strongly bias event counts and the like. We choose to impose an assumption which is partly implicit in the detectors themselves: the assumption that only one onset, and one offset, may happen in any time frame. This assumption may bias detection in very dense audio recordings, but for many densities encountered in practice it holds almost always. Hence to recover an event transcript, we keep only the events whose posterior probability is stronger than all other events with matching onset time or offset time. From these events we then select a threshold for discarding low-probability events.

Taking Figure 2 as an example, in the posterior we see that the density due to the second detected onset overlaps in 2D space with two possible offsets. In this case we would keep no more than one of these events, namely that which yields the strongest probability.

A straightforward way to count events is to count the number of items included in an event transcript. However, as just described, producing an event transcript requires making "hard" thresholding decisions, discarding some information from the posterior. We can avoid the transcription step and simply use the sum of the posterior, over the time range of interest, as the expected event count in that time range. We will use this in our evaluation.
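The transcript-recovery rule and the posterior-sum count estimate can be sketched as follows; this greedy, strongest-first selection is our own illustrative reading of the rule, not necessarily the authors' exact procedure.

```python
import numpy as np

def recover_transcript(post, threshold):
    """Keep only candidate events that beat every other candidate sharing
    their onset frame or their offset frame, then apply the threshold.
    post[t, d] is the posterior for an event with onset t, duration d."""
    cands = [(post[t, d], t, t + d)
             for t in range(post.shape[0]) for d in range(post.shape[1])
             if post[t, d] > threshold]
    used_on, used_off, events = set(), set(), []
    for p, on, off in sorted(cands, reverse=True):  # strongest first
        if on not in used_on and off not in used_off:
            events.append((on, off, p))
            used_on.add(on)
            used_off.add(off)
    return events

def expected_count(post):
    """Sum of the per-cell Bernoulli probabilities = expected event count."""
    return float(post.sum())
```

Note that `expected_count` needs no thresholding at all, which is why the evaluation uses it directly for counting.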
3 A HMM method for comparison

As an alternative to the method presented above, we also describe a hidden Markov model (HMM) method for event detection. As described in Section 1, HMM approaches to event detection impose limits on the possible polyphony, and also may bias the durations and timing patterns of detected events. With those caveats acknowledged, we wish to use the HMM as a point of comparison since it is in widespread use. So in order to detect events of a single type, but with potential polyphony, we apply a HMM where the hidden states are not simply 'on' and 'off', but the count of currently-active events, i.e. the count of events that have started but not yet ended. The state space is thus {0, 1, ..., K} where K is the maximum number of simultaneous events observed in the training data.

For modelling the observations, we train a separate Gaussian mixture model (GMM) with ten components for each cardinality. Again we use spectral patches to train this model, but without differencing them in time, since in this case we aim to model states rather than transitions.

To recover an event transcription from this HMM, we perform Viterbi decoding. From the decoded sequence of cardinalities, we deduce onset and offset times, and we associate onsets and offsets with each other in order of occurrence. This transcript is then also used for event counting.

3.1 Combining the ODO and HMM approaches

The ODO and HMM models we have described offer two very different approaches to event detection. We note that it is possible to combine the two, as follows. We can expand the HMM state space to include not only the current event cardinality, but also two binary indicators of whether the current frame includes an onset and/or an offset. This expands the set of HMM states by a factor of four. Not all state transitions are possible: e.g. a change in cardinality from 3 to 4 can only occur when the onset state is 1. We do not impose such limits manually but allow the system to learn them from the transitions seen in training data.
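A sketch of the expanded state space, with an explicit version of the kind of transition constraint the system could learn (shown purely for illustration; as the text notes, such constraints are learned from training data rather than imposed):

```python
from itertools import product

def expanded_states(K):
    """Combined ODO+HMM state space: (cardinality, onset flag, offset flag),
    i.e. (K+1) cardinalities times 4 indicator combinations."""
    return list(product(range(K + 1), (0, 1), (0, 1)))

def transition_feasible(s, t):
    """Illustrative feasibility check: a cardinality increase requires the
    onset flag in the new state, a decrease requires the offset flag, and
    (assuming one onset/offset per frame) the change is at most one."""
    (c1, _, _), (c2, on2, off2) = s, t
    if c2 > c1 and not on2:
        return False
    if c2 < c1 and not off2:
        return False
    return abs(c2 - c1) <= 1
```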
Even with this expanded state space, HMM-based models inherit the limitations already mentioned: cardinalities higher than seen in the training data will not be correctly detected, and the HMM may bias timing patterns. However, in the following evaluation we will compare the empirical characteristics of the methods we have described.
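Looking back at the decoding step of Section 3: deducing onsets and offsets from a decoded cardinality sequence, pairing them in order of occurrence, might look like this sketch (the FIFO pairing and the handling of events still open at the end are our own assumptions).

```python
def cardinality_to_events(cards, frame_dur=1.0):
    """Turn a decoded sequence of event cardinalities into (onset, offset)
    time pairs, associating onsets and offsets in order of occurrence."""
    open_onsets, events = [], []
    prev = 0
    for i, c in enumerate(cards):
        if c > prev:  # one or more events start at this frame
            open_onsets.extend([i] * (c - prev))
        elif c < prev:  # one or more events end at this frame
            for _ in range(prev - c):
                onset = open_onsets.pop(0)  # earliest onset ends first
                events.append((onset * frame_dur, i * frame_dur))
        prev = c
    for onset in open_onsets:  # close any events still open at the end
        events.append((onset * frame_dur, len(cards) * frame_dur))
    return events
```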
4 Evaluation

We recorded a set of four female zebra finches (Taeniopygia guttata) in an indoor aviary. The birds exchanged contact calls at a rate of approximately 40 calls per minute. Birds were recorded for extended periods, and their calls were transcribed as event sequences. Transcription was performed separately by two human annotators, whose annotations were combined automatically, and any discrepancies resolved by the first author.

This recording setup was designed for use in various studies; in the present paper we use it as a case study for event detection. We took two 30-minute recording sessions, recorded with the same birds but on separate days, and used these two sessions for two-fold crossvalidation. The 30-minute sessions contained 1663 and 1770 annotated calls in total. Here we use single-channel omni mic recordings of the sessions.

The true polyphony in the original recordings ranged from zero to four. In order to investigate heavier densities, as well as to investigate the effect of density mismatches between training and test data, for each 30-minute session we also created an artificial 10-minute mixture with the three 10-minute segments superimposed. All experiments thus used the same set of calls, but in some cases the training or test data was "folded" down to a denser 10-minute recording by superimposition.

As in other event detection evaluations [1, 2], for evaluating event transcription we use the F-measure metric, and we consider an event to be correctly recovered if the onset matches within a fixed tolerance and the duration matches within 50% of the true duration. The typical event duration in this data was approximately 100 ms, so we chose ±25 ms as our tolerance.

Separately, for evaluating event counts we divide the data into ten-second windows and measure the RMS error between the true and estimated number of events for each window. Note that both systems do exhibit some miscalibration, in that their estimated counts even on the training data could exhibit a multiplicative deviation from the truth. For the ODO system we believe this is largely due to the independence assumption already mentioned in the underlying detectors, and thus more sophisticated edge detectors might remedy this. To account for this most basic aspect of miscalibration, during training of all systems we used the training data to choose a multiplicative calibration factor to apply to all event counts. Calibration did not make use of test data. RMS error statistics are reported from the calibrated outputs.

We tested the following five configurations of event detector:

• The full ODO system of Section 2.
• The ODO system with a flat duration prior rather than the GMM. This constrained events to lie within a reasonable duration but did not learn a distribution on the duration, and allowed us to probe the extent of the benefit provided by the GMM duration prior.

• The HMM system of Section 3.

• The combined ODO+HMM system of Section 3.1.

• The raw output from the onset-detector component. This can be evaluated for event counts only, not transcription, but indicates the extent to which the detector component contributes to ODO performance.

Specific implementation details were as follows. Audio was recorded at 96 kHz, and spectrograms calculated with frame length 2048, 50% overlap and Hann windowing. Spectral information outside the range 0.5–20 kHz was discarded. Noise reduction was implemented using spectral median-subtraction, with the median calculated in a sliding ten-second window for each spectral band. Spectral patches for the onset/offset detectors were taken from the time-differenced spectrogram, of size five frames before and five frames after the frame under consideration. The detectors were implemented using random forest regression from the sklearn module, with 20 trees. Spectral patches for the HMM-GMM modelling were taken from the non-time-differenced spectrogram, of size five frames after the frame under consideration.

Results for event transcription (Figure 3) show a number of tendencies. Firstly, the full ODO model consistently outperforms the ODO model with a flat duration prior, indicating that the learned prior on event durations adds useful information. Secondly, the HMM system generally performs much worse than the ODO system.
This poor performance might be attributed to various points of difference between the two systems, and so it is interesting to observe that the combined ODO+HMM system improves on HMM but does not approach ODO's strong performance, despite making use of ODO's onset/offset output as part of its input observations.

Figure 3: Event transcription results (event-wise peak F-measure, %) for ODO, ODO (flat), HMM and ODO+HMM, across true event density ratios (training:test) of 1:3, 1:1 and 3:1, averaged over the two cross-validation folds. Error bars cover the range of results obtained within individual folds.

The experiments with mismatched density in training and test give a general performance degradation for all systems, indicating that there is still some way to go to perform detection robust to very wide variation in event density. The HMM-based systems perform relatively well in the experiment with increased training density, an exception to the general pattern. Conversely, the poor performance of the HMM-based systems in the experiment with increased testing density matches expectations, since the training did not encompass all the event cardinalities found in the testing data.

In best conditions, our ODO system achieved an F-measure of around 37%. Although quite distant from the ideal of 100%, it is on a similar scale as the results reported for state-of-the-art methods for related tasks [1, 2].

Results for event counting (Figure 4) show a slightly different picture. Again, the mismatch in training and testing conditions has a general negative impact on performance. In the main experiment, most of the systems perform at very similar quality levels. This is except for the raw onset detector output, included for comparison, which performs notably worse than the full systems, illustrating that the ODO method performs much better than its underlying detector. However, although the HMM and ODO+HMM systems achieve similar performance as ODO in the main experiment, this is not the case in the experiments with mismatched training and test densities, for which their performance degrades further.

Figure 4: Count estimation results (RMS error of counts in ten-second time windows) for ODO, ODO (flat), HMM, ODO+HMM and raw onsets, across the same density ratios. The y-axis is inverted so that upwards means better performance, to match Figure 3.

Taken together, these evaluations indicate that the ODO method is more accurate and more robust than the HMM method for detecting or counting events in polyphonic bird recordings such as those we have studied. The method can be used for any data with events well-characterised by 'landmarks' such as onsets and offsets, including animal and human sounds. However, we note that all the systems evaluated here showed quite some decay in performance when evaluated with mismatched event densities. Improving these polyphonic detection paradigms to be robust to these wide ranges of event density remains as future work. Good detector components must be a key to strong performance: for the present work we used simple detection methods using spectral patches as data; improvements such as feature learning [13] could improve detection performance. It also remains to evaluate systems on a wider range of event-annotated audio recordings.
Acknowledgments

Our thanks to Maeve McMahon for help with bird management, and to Robert Jack and Alex Wilson for data annotation.

References

[1] D. Giannoulis, E. Benetos, D. Stowell, M. Rossignol, M. Lagrange, and M. D. Plumbley, "Detection and classification of acoustic scenes and events: an IEEE AASP challenge," in Proceedings of the Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2013.

[2] D. Stowell, D. Giannoulis, E. Benetos, M. Lagrange, and M. D. Plumbley, "Detection and classification of acoustic scenes and events," IEEE Transactions on Multimedia, 2015.

[3] R. Stiefelhagen, K. Bernardin, R. Bowers, J. Garofolo, D. Mostefa, and P. Soundararajan, "The CLEAR 2006 evaluation," Multimodal Technologies for Perception of Humans, pp. 1–44, 2007.

[4] T. A. Marques, L. Thomas, S. W. Martin, D. K. Mellinger, J. A. Ward, D. J. Moretti, D. Harris, and P. L. Tyack, "Estimating animal population density using passive acoustics," Biological Reviews, 2012.

[5] A. Diment, T. Heittola, and T. Virtanen, "Sound event detection for office live and office synthetic AASP challenge," in IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (WASPAA 2013 special session), 2013, http://c4dm.eecs.qmul.ac.uk/sceneseventschallenge/abstracts/OL/DHV.pdf.

[6] S. Ewert, M. D. Plumbley, and M. Sandler, "A dynamic programming variant of non-negative matrix deconvolution for the transcription of struck string instruments," in Proc ICASSP 2015, 2015.

[7] T. Heittola, A. Mesaros, T. Virtanen, and A. Eronen, "Sound event detection in multisource environments using source separation," in Workshop on Machine Listening in Multisource Environments (CHiME 2011), 2011, pp. 36–40. [Online]. Available: http://spandh.dcs.shef.ac.uk/projects/chime/workshop/papers/pS32_heittola.pdf

[8] P. D. O'Grady and B. A. Pearlmutter, "Convolutive non-negative matrix factorisation with a sparseness constraint," in Machine Learning for Signal Processing, 2006. Proceedings of the 2006 16th IEEE Signal Processing Society Workshop on. IEEE, 2006, pp. 427–432.

[9] E. Benetos, M. Lagrange, and S. Dixon, "Characterisation of acoustic scenes using a temporally-constrained shift-invariant model," in Conference on Digital Audio Effects (DAFx-12), vol. 17, 2012, p. 21.

[10] M. Chait, D. Poeppel, and J. Z. Simon, "Auditory temporal edge detection in human auditory cortex," Brain Research, vol. 1213, pp. 78–90, 2008.

[11] S. M. Woolley, P. R. Gill, T. Fremouw, and F. E. Theunissen, "Functional groups in the avian auditory system," The Journal of Neuroscience, vol. 29, no. 9, pp. 2780–2793, 2009.

[12] H. Phan, M. Maasz, R. Mazur, and A. Mertins, "Random regression forests for acoustic event detection and classification," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 20–31, 2014.

[13] D. Stowell and M. D. Plumbley, "Automatic large-scale classification of bird sounds is strongly improved by unsupervised feature learning,"