Large-scale analysis of frequency modulation in birdsong databases
Dan Stowell and Mark D. Plumbley
Centre for Digital Music, Queen Mary University of London
[email protected]
Abstract
Birdsong often contains large amounts of rapid frequency modulation (FM). It is believed that the use or otherwise of FM is adaptive to the acoustic environment, and also that there are specific social uses of FM such as trills in aggressive territorial encounters. Yet temporal fine detail of FM is often absent or obscured in standard audio signal analysis methods such as Fourier analysis or linear prediction. Hence it is important to consider high resolution signal processing techniques for analysis of FM in bird vocalisations. If such methods can be applied at big data scales, this offers a further advantage as large datasets become available.

We introduce methods from the signal processing literature which can go beyond spectrogram representations to analyse the fine modulations present in a signal at very short timescales. Focusing primarily on the genus Phylloscopus, we investigate which of a set of four analysis methods most strongly captures the species signal encoded in birdsong. In order to find tools useful in practical analysis of large databases, we also study the computational time taken by the methods, and their robustness to additive noise and MP3 compression.

We find three methods which can robustly represent species-correlated FM attributes, and that the simplest method tested also appears to perform the best. We find that features representing the extremes of FM encode species identity supplementary to that captured in frequency features, whereas bandwidth features do not encode additional information. Large-scale FM analysis can efficiently extract information useful for bioacoustic studies, in addition to measures more commonly used to characterise vocalisations.
Frequency modulation (FM) is an important component of much birdsong: various species of bird can discriminate the fine detail of frequency-chirped signals (Dooling et al., 2002; Lohr et al., 2006), and use fine FM information as part of their social interactions (Trillo & Vehrencamp, 2005; de Kort et al., 2009). Use of FM is also strongly species-dependent, in part due to adaptation of birds to their acoustic environment (Brumm & Naguib, 2009; Ey & Fischer, 2009).

Songbirds have specific musculature around the syrinx which endows them with independent fine control over frequency (Goller & Riede, 2012). They can control the two sides of their syrinx largely independently: a sequence of two tones might be produced by each side separately, or by one side alone, a difference shown by the absence/presence of brief FM "slurs" between notes (Marler & Slabbekoorn, 2004, e.g. Figure 9.8). Therefore, if we can analyse bird vocalisation recordings to characterise the use of FM across species and situations, this information could cast light upon acoustic adaptations and communicative issues in bird vocalisations. As Slabbekoorn et al. (2002) concluded, "Measuring note slopes [FM], as well as other more traditional acoustic measures, may be important for comparative studies addressing these evolutionary processes in the future."

Frequency analysis of birdsong is typically carried out using the short-time Fourier transform (STFT) and displayed as a spectrogram. FM can be observed implicitly in spectrograms, especially at slower modulation rates. However, FM data are rarely explicitly quantified in bioacoustics analyses of birdsong (one exception is Gall et al. (2012)), although the amount of FM is partly implicit in measurements such as the rate of syllables and the bandwidth (e.g. in Podos (1997); Vehrencamp et al. (2013)).

The relative absence of fine FM analysis may be due to the difficulty in extracting good estimates of FM rates from spectrograms, especially with large data volumes. Some previous work has indicated that the FM data extracted from a chirplet representation can improve the accuracy of a bird species classifier (Stowell & Plumbley, 2012). However, there exists a variety of signal processing techniques which can characterise frequency-modulated sounds, and no formal study has considered their relative merits for bird vocalisation analysis. In the present work we aim to facilitate the use of direct FM measurements in bird bioacoustics, by conducting a formal comparison of four methods for characterising FM. Each of these methods goes beyond the standard spectrogram analysis to capture detail of local modulations in a signal on a fine time-scale. To explore the merits of these methods we will use the machine learning technique of feature selection (Witten & Frank, 2005) for a species classification task.

In the present work our focus is on methods that can be used with large bird vocalisation databases. Many hypotheses about vocalisations could be explored using FM information, most fruitfully if data can be analysed at relatively large scale. For this reason, we will describe an analysis workflow for audio which is simple enough to be fully automatic and to run across a large number of files. We will consider the runtime of the analysis techniques as well as the characteristics of the statistics they extract.

The genus Phylloscopus (leaf warblers) has been studied previously for evidence of adaptive song variation. For example Irwin et al. (2008) studied divergence of vocalisation in a ring species (Phylloscopus trochiloides), suggesting that stochastic genetic drift may be a major factor in diversity of vocalisations. Mahler & Gil (2009) found correlations between aspects of frequency range and body size across the Phylloscopus genus. They also considered character displacement effects, which one might expect to cause the song of sympatric species to diverge, but found no significant such effect on the song features they measured. Linhart et al. (2012) studied Phylloscopus collybita, also finding a connection between song frequency and body size. Such research context motivated our choice to use Phylloscopus as our primary focus in this study, in order to develop signal analysis methods that might provide further data on song structure. However, we also conducted a larger-scale FM analysis using a database with samples representing species across the wider order of Passeriformes. We first discuss the FM analysis methods to be considered.
For many purposes, the standard representation of audio signals is the spectrogram, calculated from the magnitudes of the windowed short-time Fourier transform (STFT). The STFT is applied to each windowed "frame" of the signal (of duration typically 10 or 20 ms), resulting in a representation of variations across time and frequency. The spectrogram is widely used in bioacoustics, and a wide variety of measures are derived from this, manually or automatically: it is common to measure the minimum and maximum frequencies in each recording or each syllable, as well as durations, amplitudes and so forth (Marler & Slabbekoorn, 2004). Notable for the present work is the FM rate measure of Gall et al. (2012), derived from manual identification of frequency inflection points (i.e. points at which the modulation changes from upward to downward, or downward to upward) on a spectrogram. Trillo & Vehrencamp (2005) characterise "trill vigour" in a related manner but applicable only to trilled syllables. For fully automatic analysis, in Section 2.2 we will describe a method related to that of Gall et al. (2012) but with no manual intervention.

The spectrogram is a widespread tool, but it does come with some limitations. Analysing a 10 or 20 ms frame with the STFT implies the assumption that the signal is locally stationary (or pseudo-stationary), meaning it is produced by a process whose parameters (such as the fundamental frequency) do not change across the duration of the individual frame (Mallat, 1999, Section 10.6.3). However, many songbirds sing with very dramatic and fast FM (as well as AM), which may mean that the local stationarity assumption is violated and that there is fine-resolution FM which cannot be represented with a standard spectrogram.

Signal analysis is under-determined in general: many different processes can in principle produce the same audio signal. Hence the representations derived by STFT and LPC analysis are but two families of possible "explanation" for the observed signal. A large body of research in signal processing has considered alternative representations, tailored to various classes of signal including signals with fast FM. One recent example which was specifically described in the context of birdsong is that of Stowell & Plumbley (2012), which uses a kind of chirplet analysis to add an extra chirp-rate dimension to a spectrogram. A "chirplet" is a short-time packet of signal having a central frequency, amplitude, and a parametric chirp-rate which modulates the frequency over time. More generally, the field of sparse representations allows one to define a "dictionary" of a large number of elements from which a signal may be composed, and then to analyse the signal into a small number of components selected from the dictionary (Plumbley et al., 2010). For the present purposes, notable is the method of Gribonval (2001), which applies an accelerated version of a technique known as Matching Pursuit, specifically adapted to analyse a signal as a sparse combination of chirplets.

Alternative paradigms are also candidates for performing high-resolution FM analysis. One paradigm is that of spectral reassignment, based on the idea that after performing an STFT analysis it is possible to "reassign" the resulting list of frequencies and magnitudes to shift them to positions which are in some sense a better fit to the evidence (Fulop & Fitz, 2006). The distribution derivative method (DDM) of Mušević (2013, Chapter 10) is one such approach, which is able to reassign a spectrum to find the best-matching parameters on the assumption that the signal is composed of amplitude- and frequency-modulated sinusoids.

Another approach is that of Badeau et al. (2006), which uses a subspace model to achieve high-resolution characterisation of signals with smooth modulations. However, there may be limitations on the rate of FM that can be reflected faithfully: this method relies on a smoothness assumption in the frame-to-frame evolution of the sound, which means that it is most suited to relatively moderate rates of FM, such as the vibrato in human singing.

In the following we will apply a selection of analysis techniques to birdsong recordings, and study whether the FM information extracted is a reliable signal of species identity. This is not the only application for which FM information is relevant: our aim is that this exploration will encourage other researchers to add high-resolution FM analysis to their toolbox.
We first collected a set of recordings of birds in the genus Phylloscopus from a dataset made available by the Animal Sound Archive in Berlin. This consisted of 45 recordings over 5 species, in WAV format, with durations ranging from 34 seconds to 19 minutes. In the following we will refer to this dataset as PhyllASA.

As a second dataset, we also considered a broader set of audio from the Animal Sound Archive, not confined to Phylloscopus but spanning the order Passeriformes (762 recordings over 84 species). We will refer to this as PassaASA.

Thirdly we collected a larger Phylloscopus dataset from the online archive Xeno Canto. This consisted of 1390 recordings across 56 species, ranging widely in duration from one second to seven minutes. Our criteria for selecting files from the larger Xeno Canto archive were: genus Phylloscopus; quality level A or B. We will refer to this dataset as PhyllXC.

Note that the "crowdsourced" Xeno Canto dataset is qualitatively different from PhyllASA. Firstly, it was compiled from various contributors online, and so is not as tightly controlled. The noise conditions and recording quality can vary widely. Secondly, all audio content is compressed in MP3 format (with original uncompressed audio typically unavailable). The MP3 format reduces filesize by discarding information which is considered unnecessary for audio quality as judged by human perception (International Standards Organisation, 1993). However, human and avian audition differ in important ways, including time and frequency resolution, and we cannot assume that MP3 compression is "transparent" regarding the species-specific information that might be important in bird communication. Hence in our study we used this large crowdsourced MP3 dataset only after testing experimentally the impact of compression and signal degradation on the features we measured (using the PhyllASA data).

For each dataset considered here, we resampled audio files to 48 kHz mono WAV format before processing, and truncated long files to a maximum duration of 5 minutes. All of the datasets contain an uneven distribution, with some species represented in more recordings than others (Table 1). This is quite common but carries implications for the evaluation of automatic classification, as will be discussed below.
For all analysis methods we used a frame size of 512 samples (10.7 milliseconds, at 48 kHz), with Hann windowing for STFT, and the frequency range of interest was restricted to 2–10 kHz. For each recording in each dataset, we applied a fully automatic analysis using each of four signal processing techniques. Our requirement of full automation excludes a preprocessing step of manually segmenting birdsong syllables from the background. We chose to use the simplest form of automatic segmentation: simply to select the 10% of highest-energy frames in each recording. More sophisticated procedures can be applied in future; however, as well as simplicity this method has an advantage of speed when analysing large databases. We analysed each recording using each of the following techniques (which we assign two-letter identifiers for reference):

ss: a spectrographic method related to the method of Gall et al. (2012) but with no manual intervention, as follows. Given a sample of birdsong, for every temporal frame we identify the frequency having peak energy, within the frequency region of interest. We calculate the absolute value of the first difference, i.e. the magnitude of the frequency jump between successive frames. We then summarise this by the median or other statistics, to characterise the distribution over the depth of FM present in each recording. This method relies on the peak energy within each frame rather than manual identification of inflection points in the pitch trace, which means that it is potentially susceptible to noise and other corruptions, but it is simple and fast.

rm: the heterodyne (ring modulation) chirplet analysis of Stowell & Plumbley (2012), taking information from the peak-energy detection in each frame.

mp: the Matching Pursuit technique of Gribonval (2001), implemented using the open-source Matching Pursuit ToolKit (MPTK) v0.7. For this technique the 10% highest-energy threshold is not applicable, since the method is iterative and could return many more results than there are signal frames: we automatically set a threshold at a number of results which recovers roughly the same amount of signal as the 10% threshold.

dd: the distribution derivative method (DDM) of Mušević (2013, Chapter 10), taking information from the peak-energy sinusoid detected in each frame.

Table 1: Counts of species occurrence in our three datasets. Note that PhyllASA is a subset of PassaASA, as reflected in the counts. The five species present in PhyllASA are shown with three counts (PhyllASA / PassaASA / PhyllXC); each other species appears in a single dataset and is listed with that count.

Acrocephalus arundinaceus 9
Acrocephalus palustris 12
Acrocephalus schoenobaenus 3
Acrocephalus scirpaceus 5
Aegithalos caudatus 1
Alauda arvensis 8
Anthus pratensis 1
Anthus trivialis 74
Apalis chariessa 3
Calcarius lapponicus 1
Carduelis carduelis 1
Carduelis chloris 3
Carduelis spinus 4
Certhia brachydactyla 3
Certhia familiaris 1
Corvus corax 1
Corvus corone 3
Cyanocitta cristata 2
Delichon urbica 4
Emberiza calandra 4
Emberiza citrinella 34
Emberiza hortulana 94
Emberiza pusilla 1
Emberiza rustica 3
Emberiza schoeniclus 11
Emberiza spodocephala 2
Erithacus rubecula 14
Ficedula albicollis 1
Ficedula hypoleuca 4
Ficedula parva 7
Fringilla coelebs 87
Fringilla montifringilla 9
Garrulax subunicolor 1
Garrulus glandarius 2
Hippolais icterina 19
Hirundo rustica 3
Lanius collurio 4
Locustella fluviatilis 5
Locustella lanceolata 1
Locustella luscinioides 3
Locustella naevia 6
Loxia curvirostra 1
Lullula arborea 6
Luscinia calliope
Luscinia luscinia 10
Luscinia megarhynchos 26
Luscinia svecica 3
Motacilla alba 1
Motacilla flava 3
Muscicapa striata 1
Nucifraga caryocatactes 20
Panurus biarmicus 1
Parus ater 5
Parus caeruleus 8
Parus major 9
Parus montanus 4
Parus palustris 3
Perisoreus infaustus 1
Phoenicurus ochruros 3
Phoenicurus phoenicurus 22
Phylloscopus affinis 7
Phylloscopus amoenus 2
Phylloscopus armandii 6
Phylloscopus bonelli 3 / 3 / 71
Phylloscopus borealis 25
Phylloscopus borealoides 1
Phylloscopus budongoensis 1
Phylloscopus calciatilis 9
Phylloscopus canariensis 11
Phylloscopus cantator 6
Phylloscopus cebuensis 4
Phylloscopus chloronotus 10
Phylloscopus claudiae 15
Phylloscopus collybita 12 / 12 / 323
Phylloscopus coronatus 6
Phylloscopus davisoni 2
Phylloscopus emeiensis 4
Phylloscopus examinandus 3
Phylloscopus forresti 14
Phylloscopus fuligiventer 4
Phylloscopus fuscatus 5 / 5 / 33
Phylloscopus griseolus 6
Phylloscopus hainanus 3
Phylloscopus herberti 4
Phylloscopus humei 51
Phylloscopus ibericus 42
Phylloscopus ijimae 2
Phylloscopus inornatus 53
Phylloscopus kansuensis 4
Phylloscopus laetus 1
Phylloscopus maculipennis 16
Phylloscopus magnirostris 13
Phylloscopus makirensis 7
Phylloscopus neglectus 3
Phylloscopus nigrorum 7
Phylloscopus nitidus 9
Phylloscopus occisinensis 5
Phylloscopus ogilviegranti 15
Phylloscopus olivaceus 2
Phylloscopus orientalis 5
Phylloscopus plumbeitarsus 10
Phylloscopus poliocephalus 8
Phylloscopus presbytes 15
Phylloscopus proregulus 17
Phylloscopus pulcher 6
Phylloscopus reguloides 26
Phylloscopus ricketti 1
Phylloscopus ruficapilla 7
Phylloscopus sarasinorum 11
Phylloscopus schwarzi 16
Phylloscopus sibilatrix 11 / 11 / 105
Phylloscopus sindianus 7
Phylloscopus subviridis 1
Phylloscopus tenellipes 28
Phylloscopus trivirgatus 16
Phylloscopus trochiloides 61
Phylloscopus trochilus 14 / 14 / 208
Phylloscopus tytleri 1
Phylloscopus umbrovirens 5
Phylloscopus xanthoschistos 25
Phylloscopus yunnanensis 11
Prunella modularis 2
Pyrrhula pyrrhula 1
Regulus ignicapillus 3
Regulus regulus 2
Saxicola rubetra 2
Sitta europaea 6
Smithornis capensis 1
Sturnus vulgaris 1
Sylvia atricapilla 14
Sylvia borin 10
Sylvia communis 9
Sylvia curruca 2
Sylvia nisoria 2
Troglodytes troglodytes 11
Turdus iliacus 2
Turdus merula 36
Turdus philomelos 21
Turdus pilaris 4
Turdus viscivorus 7
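To make the ss procedure concrete, here is a minimal sketch of how such features might be computed from a magnitude spectrogram. This is our own illustrative reading of the description above, not the authors' code; the function and feature names are ours, and for simplicity the first difference is taken over the retained frames in sequence (a simplification of frame adjacency).

```python
import numpy as np

def ss_features(spec, freqs, energy_quantile=0.9):
    """Sketch of the 'ss' FM measure.

    spec: (n_frames, n_bins) magnitude spectrogram, already restricted
          to the frequency region of interest.
    freqs: bin centre frequencies in Hz, length n_bins.
    Returns a dict of frequency and FM summary features.
    """
    # Crude segmentation: keep the ~10% highest-energy frames.
    energies = spec.sum(axis=1)
    keep = energies >= np.quantile(energies, energy_quantile)
    # Peak frequency in each retained frame.
    peak_freq = freqs[np.argmax(spec[keep], axis=1)]
    # FM proxy: magnitude of frequency jump between successive retained frames.
    fm = np.abs(np.diff(peak_freq))
    f05, f95 = np.percentile(peak_freq, [5, 95])
    return {
        "freq_05pc": float(f05),
        "freq_med": float(np.median(peak_freq)),
        "freq_95pc": float(f95),
        "bandwidth": float(f95 - f05),
        "fm_med": float(np.median(fm)),
        "fm_75pc": float(np.percentile(fm, 75)),
        "fm_95pc": float(np.percentile(fm, 95)),
    }
```

The returned summary statistics correspond to the percentile features described in the text (medians plus 5-, 75- and 95-percentiles).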
We also conducted a preliminary test with the subspace method of Badeau et al. (2006), but this proved to be inappropriate for the rapid FM modulations found in birdsong, because of an assumption of smooth FM variation inherent in the method (Badeau, pers. comm.).

(Python source code for the method of Stowell & Plumbley (2012) is available at https://code.soundsoftware.ac.uk/projects/chirpletringmod. MPTK is available at http://mptk.irisa.fr/. Matlab/Octave source code for the method of Mušević (2013) is available at https://code.soundsoftware.ac.uk/projects/ddm.)

Each of these methods resulted in a list of "frames" or "atoms" for a recording, each with an associated frequency and FM rate. In order to characterise each recording as a whole, we selected summary statistics over these frames in a recording to use as features. We summarised the frequency data by their median, and by their 5- and 95-percentiles. The 5- and 95-percentiles are robust measures of minimum and maximum frequency; we also calculated the "bandwidth" as the difference between the 5- and 95-percentiles. We summarised the FM data by their median, and also by their 75- and 95-percentiles. These percentiles were chosen to explore whether information about the relative extremes of FM found in the recording provides useful extra information.

So, for each recording and each analysis method we can extract a set of frequency and FM summary features. It remains to determine which of these features might be most useful in looking for signals of species identity in recorded bird vocalisations. We explored this through two interrelated approaches: feature selection, and automatic classification experiments. Through these two approaches, we were able to compare the different features against each other, and also compare the features as extracted by each of the four signal-processing techniques given above.

One approach that has been used to explore the value of different features is principal components analysis (PCA) applied to the features, to determine axes that represent the strongest dimensions of variance in the features. However, the machine learning field also offers feature selection techniques to evaluate directly the predictive power that a feature (or a set of features) has with respect to some attribute (Witten & Frank, 2005). We used an information-theoretic feature selection technique from that field. In information gain feature selection, each of our features is evaluated by measuring the information gain with respect to the species label, which is the amount by which the feature reduces our uncertainty in the label:

    IG(Species, Feature) = H(Species) − H(Species | Feature)

where H(·) is the Shannon entropy. The value H(Species) represents the number of binary bits of information that must typically be conveyed in order to identify the species of an individual (from a fixed set of species). The information gain IG(Species, Feature) then tells us how many of those binary bits are already encoded in a particular feature, i.e. the extent to which that feature reduces the uncertainty of the species identity. If a feature is repeatedly ranked highly, this means that it contains a stronger signal of species identity than lower-ranked features, and thus suggests it should be a useful measure. The approach just described is reminiscent of the information-theoretic method introduced by Beecher (1989), except that his concern was with signals of individual identity rather than species identity.

Having performed feature selection, we were then able to choose promising subsets of features which might concisely represent species information. To evaluate these subsets concretely we conducted an experiment in automatic species classification. For this we used a leading classification algorithm, the Support Vector Machine (SVM), implemented in the libsvm library version 3.1, choosing the standard radial basis function SVM classifier.
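The information gain formula above can be computed directly once a feature has been discretised into bins. The following is a minimal illustrative implementation (our own code, with hypothetical function names, not the toolchain used in the study):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H(labels) in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(species, feature_bins):
    """IG(Species, Feature) = H(Species) - H(Species | Feature).

    species: list of species labels, one per recording.
    feature_bins: same-length list of discretised feature values.
    """
    n = len(species)
    h_cond = 0.0
    for bin_value in set(feature_bins):
        # Entropy of the species labels within this feature bin,
        # weighted by the probability of the bin.
        subset = [s for s, f in zip(species, feature_bins) if f == bin_value]
        h_cond += (len(subset) / n) * entropy(subset)
    return entropy(species) - h_cond
```

A feature whose bins perfectly separate the species yields IG equal to H(Species); a constant feature yields zero.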
The evaluation statistic we used was the weighted "area under the receiver operating characteristic curve" (the weighted AUC), which summarises the rates of true-positive and false-positive detections made (Fawcett, 2006). This measure is more appropriate than raw accuracy when analysing datasets with wide variation in numbers per class, as in the present case (ibid.). The AUC yields the same information as the Wilcoxon signed-rank statistic (Hanley & McNeil, 1982). The feature selection and classification experiments were all performed using R version 2.13.1 (R Development Core Team, 2010).

An important issue when considering automatic feature extraction is the robustness of the features to corruptions that may be found in audio databases, such as background noise or MP3 compression artifacts. This has particular pertinence for the crowdsourced PhyllXC dataset, as discussed above. For this reason, we also studied our first dataset after putting the audio files through two corruption processes: added white noise (−45 dB relative to full scale, judged by ear to be noticeable but not overwhelming), and MP3 compression (64 kbps, using the lame software library version 3.99.5). To quantify whether an audio feature was badly impacted by such corruption, we measured the Pearson correlations of the features measured on the original dataset with their corrupted equivalents. This test does not depend on species identity as in our main experimental tests, but simply on the numerical stability of the summary statistics we consider.

In this study we focussed on frequency and FM characteristics of sounds, both of which can be extracted completely automatically from short time frames. We did not include macro-level features such as syllable lengths or syllable rates, because reliable automatic extraction of these is complex. Rather, we compared the fine-detail FM analyses against frequency measures, the latter being common in the bioacoustics literature: our feature set included features corresponding to the lower, central and upper frequency, and the frequency bandwidth.
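The robustness test reduces to a per-feature Pearson correlation between clean and degraded measurements. A minimal sketch (pure Python; the 0.95 threshold echoes the value discussed in the results, and all names here are illustrative):

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def is_robust(clean_values, degraded_values, threshold=0.95):
    """Flag a feature as robust if the correlation between its values on
    clean recordings and on degraded versions exceeds the threshold."""
    return pearson(clean_values, degraded_values) > threshold
```

Here each list holds one feature's values across all recordings in a dataset, measured before and after a degradation such as added noise or MP3 recoding.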
We first illustrate the data produced by the analysis methods tested, using a recording of Phylloscopus collybita (Chiffchaff) from PhyllASA as an example. Figure 1 shows a conventional spectrogram plot for our chosen excerpt. We can infer FM characteristics visually, but the underlying data (a grid of intensity "pixels") does not directly present FM for analysis.

[Figure 1: Standard spectrogram for a short excerpt of Chiffchaff (Phylloscopus collybita). The FM can be seen by eye but is not explicit in the underlying data, being spread across many "pixels".]

Figure 2 represents the same excerpt analysed by each of the methods we consider. Each of the plots appears similar to a conventional spectrogram, showing the presence of energy at particular time and frequency locations. However, instead of a uniform grid the image is created from a set of line segments, each segment having a location in time and frequency but also a slope. It is clear from Figure 2 that each of the methods can build up a portrait of the birdsong syllables, although some are more readable than others. The plot from mp appears more fragmented than the others. This can be traced back to the details of the method used, but for now we merely note that the apparent neatness of each representation does not necessarily indicate which method most usefully captures species-specific FM characteristics.

[Figure 2: Time-frequency plots of the "chirp" data recovered by each method (ss, rm, mp, dd), for the same excerpt as in Figure 1.]

The relative speeds of the analysis methods described here are given in Table 2. The simple spectrogram method is by far the fastest, as is to be expected given its simplicity. All but one of the methods run much faster than real-time, though the difference in speed between the simple spectrogram and the more advanced methods is notable, and certainly pertinent when considering the analysis of large databases.

Table 2: Time taken to run each analysis method on our first dataset PhyllASA, expressed as a proportion of the total duration of the audio files (so that any number below 1 indicates faster than real-time processing). Times were measured on a laptop with an Intel i5 2.5 GHz processor.

    Method    Time taken (relative to audio duration)
    ss        0.02
    rm        0.40
    mp        0.58
    dd        1.22

Features extracted by methods ss, rm and dd were highly robust to the noise and MP3 degradations applied, in all cases having a correlation with the original features better than 0.95 (Figure 3). Method rm showed particularly strong robustness. The mp method, on the other hand, yielded features of very low robustness: correlation with the original features was never above 0.95, in some cases going as low as to be around zero. This indicates that features from the mp method may be generally unreliable when applied to the PhyllXC dataset considered next.

Our feature selection experiments revealed notable trends in the information gain (IG) values associated with certain features, with broad commonalities across the three datasets tested (see Appendix for details). In particular, the bandwidth features achieve very low IG values in all cases. Conversely, the median frequency feature performs strongly for all datasets and all methods. The FM features perform relatively strongly on PhyllASA, appearing generally stronger than frequency features, but this pattern does not persist into the other (larger) datasets. However, the 75-percentile of FM did generally rank highly in the feature selection results.

Based on the results of feature selection, we chose to take the following four feature sets forward to the classification experiment:

• Three FM features (fm_med, fm_75pc, fm_95pc);
• Three frequency-based features (freq_05pc, freq_med, freq_95pc);
• The "Top-2" performing features (freq_med, fm_75pc);
• All six FM and frequency-based features together.

We did not include the poorly-performing bandwidth features. This yielded an advantage that the FM and frequency-based features had the same cardinality, ensuring the fairness of our experimental comparison of the two feature types.

Results for the classification experiment with different extraction methods and different feature subsets are shown in Figure 4 and Table 3. This is a difficult classification task (across 56 species), and the average AUC score in this case peaks at around 70%. A repeated-measures factorial ANOVA confirmed, for both datasets, a significant effect on accuracy for both feature set and method, with no significant interaction term found. Pairwise comparisons found significant differences between all pairs of methods except ss vs. dd (ss ≈ dd > rm > mp). For the choice of feature set, means were found to differ for all pairs of feature sets except Top-2 vs. Freq (FM+Freq > Freq ≈ Top-2 > FM).
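The AUC's rank-statistic interpretation (cf. Hanley & McNeil, 1982) gives a direct, if quadratic-time, way to compute it per class; averaging with class frequencies as weights is our reading of the "weighted AUC" used in the evaluation. A minimal sketch with illustrative names:

```python
def auc(pos_scores, neg_scores):
    """AUC for one class: the probability that a randomly chosen positive
    example is scored above a randomly chosen negative one (ties count 1/2)."""
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def weighted_auc(per_class_aucs, class_counts):
    """Average per-class AUCs weighted by class frequency, so that
    well-represented species contribute proportionally more."""
    total = sum(class_counts)
    return sum(a * c for a, c in zip(per_class_aucs, class_counts)) / total
```

A perfect separation of a class's scores yields an AUC of 1.0, while chance-level scoring yields 0.5, which is the baseline shown in Figure 4.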
The fine detail of frequency modulation (FM) is known to be used by various songbird species to carry information (Marler & Slabbekoorn (2004, Chapter 7); Brumm & Naguib (2009); Sprau et al. (2010); Vehrencamp et al. (2013)), but automatic tools for analysis of such FM are not yet commonly used. Our experiments have demonstrated that FM information can be extracted efficiently from large datasets, in a fashion which captures species-related information despite the simplicity of the method (we used no source separation, syllable segmentation or pitch tracking). The workflow was explicitly designed for application on large collections: our experiments used up to 1390 individual recordings, larger numbers than in many bioacoustic studies.

[Figure 3: Squared Pearson correlation between audio features and their values after applying audio degradation, across the PhyllASA dataset. Each point represents one feature; features are grouped by analysis method and degradation type (noise, MP3). We inspected the variation according to feature, and found no general tendencies; therefore features are collapsed into a single column per analysis method in order to visualise the differences in range. Note that the vertical axis is warped to enhance visibility at the top end of the scale.]

Our results show an effect of the choice of summary features, both for frequency and for FM data. The consistently strongest-performing summary feature was the median frequency, which is similar to measures of central tendency used elsewhere in the literature and can be held to represent a bird's central "typical" frequency. On the contrary, we were surprised to find that bandwidth measurements as implemented in our study showed rather little predictive power for species identity, since bandwidth has often been discussed with respect to the variation in vocal capacities across avian species (Podos, 1997; Trillo & Vehrencamp, 2005; Mahler & Gil, 2009). In our case the upper frequency extent alone (represented by the 95-percentile) appears more reliable, which may reflect the importance of production limits in the highest frequencies of song.

The FM features, taken alone, were not as predictive of species identity as were the frequency features. However, they provided a significant boost in predictive power when appended to the frequency features. This tells us not only that FM features encode aspects of species identity, but that they encode complementary information which is not captured in the frequency measurements.

In light of our results we note that Trillo & Vehrencamp (2005) explored a measure of "trill vigour": "Because of the known production constraint trade-off between note rate and bandwidth of trilled songs (Podos 1997), we derived an index of trill vigour by multiplying the standardized scores of these two
Figure 4: Performance of species classification across 56 species, evaluated using datasets
PassaASA (upper) and
PhyllXC (lower). Results are shown for each analysis method, and for four different subsets of the available features (see text for details). The horizontal dashed line indicates the baseline chance performance at 50%. The vertical axis in each panel shows the weighted-average AUC score (%) for species classification.

Table 3: Marginal mean of the weighted area under the curve (AUC) scores for the results shown in Figure 4.

Dataset    Method     AUC (%)
PassaASA   ss
PassaASA   dd
PassaASA   rm
PassaASA   mp
PhyllXC    ss
PhyllXC    dd
PhyllXC    rm
PhyllXC    mp

Dataset    Features   AUC (%)
PassaASA   FM+Freq
PassaASA   Top-2      65.8
PassaASA   Freq       66.9
PassaASA   FM         58.9
PhyllXC    FM+Freq
PhyllXC    Top-2      64.4
PhyllXC    Freq       63.6
PhyllXC    FM         59.1

parameters" (Trillo & Vehrencamp, 2005, p. 925). This index was not further pursued since in their study it yielded similar results to the raw bandwidth data. However, if we assume for the moment that each note in the trills studied by Trillo & Vehrencamp is one full sweep of the bandwidth of the trill (this is the case for all except "hooked" trills), then multiplying the bandwidth (in Hz) by the note rate (in sec^-1) yields exactly the mean value of the instantaneous absolute FM rate (in Hz/sec). This "trill vigour" calculation is thus very close in spirit to our measurement of the median FM rate. Their comparison of bandwidth features against trill vigour features served for them as a kind of feature selection, although in their case the focus was on trills in a single species.

A further aspect of our study is the comparison of four different methods for extracting FM data. A clear result emerges from this: the simplest method (ss) attains the strongest classification results (tied with method dd), and is sufficiently robust to the degradations we tested. This should be taken together with the observation that it runs at least 20 times faster than any of the other methods on the same audio data, to yield a strong recommendation for the ss method.

This outcome came as a surprise to us, especially considering the simplifying assumptions implicit in the ss method. It considers the peak-amplitude frequencies found in adjacent STFT frames (i.e. in adjacent "slices" of a spectrogram), which may in many cases relate to the fundamental frequency of the bird vocalisation, but can often happen to relate to a harmonic, or a chance fluctuation in background noise. It contains no larger-scale corrections for continuity, as might be used in pitch-tracking-type methods (though note that as we found with the method of Badeau et al.
(2006), those methods can incur difficulties tracking fast modulations).

The statistical strength of simple methods has been studied elsewhere in the literature. For example, Kershenbaum et al. (2013) found that bottlenose dolphin signature whistles could usefully be summarised by a strongly decimated representation of the pitch track: a so-called "Parsons code" based on whether the pitch is rising or falling at a particular timescale, and which completely omits the magnitude of such rises or falls. The method is not analogous to ours, but has in common that it uses surprisingly simple statistics to summarise temporal variation. Audio "fingerprinting" systems such as Shazam (Wang, 2003) also rely on highly-reduced summary data, customised to the audio domain of interest.

Our ss method relies on finding a temporal difference between adjacent frames, as does that of Kershenbaum et al. (2013). This is partly reminiscent of the "delta" features often added to MFCCs to reflect how they may be changing. Such deltas are common in speech recognition and are also used in some automatic species classification (for example Trifa et al. (2008)). However, note that MFCC "deltas" represent differences in magnitude, not in frequency.

Separately from the classification experiment, we studied the effects of noise and MP3 degradation on our summary features. Such issues are pertinent for crowdsourced datasets such as PhyllXC. Measures such as minimum and maximum frequency carry some risk of dependence on recording conditions, particularly when derived from manual inspection of spectrograms (Zollinger et al., 2012; Cardoso & Atwell, 2012). We have demonstrated that our automatic FM measures using methods rm, dd or ss are robust against two common types of degradation (noise and compression), with rm particularly robust.
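To make the simplicity of the ss approach concrete, the following is a minimal sketch of the peak-difference idea described above (our own illustrative code, not the implementation used in this study; the function and parameter names are invented): take the peak-magnitude frequency in each STFT frame, then difference adjacent frames to obtain an instantaneous FM rate in Hz per second, which is summarised over the whole signal.

```python
import numpy as np

def ss_fm_features(signal, sr, n_fft=512, hop=256):
    """Sketch of the 'ss' idea: frame-wise spectral peak frequencies,
    with frame-to-frame differences as an instantaneous FM rate."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    peaks = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop:i * hop + n_fft] * window
        # Peak-magnitude frequency of this spectrogram "slice", in Hz
        peaks[i] = np.argmax(np.abs(np.fft.rfft(frame))) * sr / n_fft
    fm_rate = np.diff(peaks) / (hop / sr)  # Hz per second
    return {
        "median_freq": float(np.median(peaks)),
        "mean_abs_fm": float(np.mean(np.abs(fm_rate))),
        "fm_95pc": float(np.percentile(np.abs(fm_rate), 95)),
    }

# Demo: a chirp sweeping 2 kHz -> 4 kHz over one second,
# i.e. a true mean absolute FM rate of 2000 Hz/s.
sr = 22050
t = np.arange(sr) / sr
feats = ss_fm_features(np.sin(2 * np.pi * (2000 * t + 1000 * t ** 2)), sr)
```

Because the peak frequency is quantised to FFT bins (about 43 Hz at these settings), individual frame-to-frame differences are coarse, but summary statistics taken over many frames still recover the underlying modulation rate; this coarseness is one reason the robustness of such a simple estimator needed empirical testing.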
They are therefore suitable tools to explore the variation in songbirds' use of FM in the laboratory and in the field.

Future work: in this study we did not use any higher-level temporal modelling such as the temporal structure of trill syllables, nor did we use advanced methods for segmenting song/call syllables from background. We have demonstrated the utility of fully automatic extraction of fine temporal structure information, and in future work we aim to combine this with richer modelling of other aspects of vocalisation. We also look forward to combining fine FM analysis with physiological models of the songbird vocal production mechanism, as has already been done with linear prediction for the source-filter model (Markel, 1972), but explicitly accounting for songbirds' capacity for rapid nonstationary modulation and their use of two separate sound sources in the syrinx.

In much research involving acoustic analysis of birdsong, frequency modulation (FM) has been measured manually, described qualitatively or left implicit in other measurements such as bandwidth. We have demonstrated that it is possible to extract data about FM on a fine temporal scale, from large audio databases, in fully automatic fashion, and that these data encode aspects of ecologically pertinent information such as species identity. Further, we have demonstrated that a relatively simple technique based on spectrogram data is sufficient to extract information pertinent to species, which one might expect could only be extracted with more advanced signal-processing techniques. Our study provides evidence that researchers can and should measure such FM characteristics when analysing the acoustic characteristics of bird vocalisations.
Acknowledgments
DS & MP are supported by an EPSRC Leadership Fellowship EP/G007144/1. Our thanks to: Alan McElligott for helpful advice while preparing the manuscript; Sašo Mušević for discussion and for making his DDM software available; and Rémi Gribonval and team at INRIA Rennes for discussion and software development during a research visit.
Data accessibility
The feature values for each sound file are available in online data tables. The original audio for the PhyllXC dataset can be retrieved from the Xeno Canto website, using the XC ID numbers given in the online data table. The original audio for the PhyllASA and PassaASA datasets can be requested from the Animal Sound Archive, using the track filenames given in the online data table.
References
Badeau, R., David, B. & Richard, G. (2006) High-resolution spectral analysis of mixtures of complex exponentials modulated by polynomials. IEEE Transactions on Signal Processing, (4), 1341–1350, doi:10.1109/TSP.2006.870556.

Beecher, M.D. (1989) Signalling systems for individual recognition: An information theory approach. Animal Behaviour, (2), 248–261, doi:10.1016/S0003-3472(89)80087-9.

Brumm, H. & Naguib, M. (2009) Environmental acoustics and the evolution of bird song. Vocal Communication in Birds and Mammals, Advances in the Study of Behavior, vol. 40. Academic Press, Massachusetts, USA, pp. 1–33, doi:10.1016/S0065-3454(09)40001-9.

Cardoso, G.C. & Atwell, J.W. (2012) On amplitude and frequency in birdsong: A reply to Zollinger et al. Animal Behaviour, (4), e10–e15, doi:10.1016/j.anbehav.2012.08.012.

http://dx.doi.org/10.6084/m9.figshare.795273
de Kort, S.R., Eldermire, E.R.B., Valderrama, S., Botero, C.A. & Vehrencamp, S.L. (2009) Trill consistency is an age-related assessment signal in banded wrens. Proceedings of the Royal Society B: Biological Sciences, (1665), 2315–2321, doi:10.1098/rspb.2009.0127.

Dooling, R.J., Leek, M.R., Gleich, O. & Dent, M.L. (2002) Auditory temporal resolution in birds: discrimination of harmonic complexes. Journal of the Acoustical Society of America, , 748, doi:10.1121/1.1494447.

Ey, E. & Fischer, J. (2009) The "acoustic adaptation hypothesis" – a review of the evidence from birds, anurans and mammals. Bioacoustics, (1-2), 21–48, doi:10.1080/09524622.2009.9753613.

Fawcett, T. (2006) An introduction to ROC analysis. Pattern Recognition Letters, (8), 861–874, doi:10.1016/j.patrec.2005.10.010.

Fulop, S.A. & Fitz, K. (2006) A spectrogram for the twenty-first century. Acoustics Today, (3), 26–33, doi:10.1121/1.2961138.

Gall, M.D., Brierley, L.E. & Lucas, J.R. (2012) The sender–receiver matching hypothesis: Support from the peripheral coding of acoustic features in songbirds. Journal of Experimental Biology, (21), 3742–3751, doi:10.1242/jeb.072959.

Goller, F. & Riede, T. (2012) Integrative physiology of fundamental frequency control in birds. Journal of Physiology–Paris, (3), 230–242, doi:10.1016/j.jphysparis.2012.11.001.

Gribonval, R. (2001) Fast matching pursuit with a multiscale dictionary of Gaussian chirps. IEEE Transactions on Signal Processing, (5), 994–1001, doi:10.1109/78.917803.

Handford, P. & Lougheed, S.C. (1991) Variation in duration and frequency characters in the song of the rufous-collared sparrow, Zonotrichia capensis, with respect to habitat, trill dialects and body size. Condor, 644–658, doi:10.2307/1368196.

Hanley, J.A. & McNeil, B.J. (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve.
Radiology, (1), 29–36.

International Standards Organisation (1993) Information technology – Coding of moving pictures and associated audio for digital storage media at up to about 1,5 Mbit/s – Part 3: Audio. Tech. Rep. ISO/IEC 11172-3:1993, International Standards Organisation.

Irwin, D.E., Thimgan, M.P. & Irwin, J.H. (2008) Call divergence is correlated with geographic and genetic distance in greenish warblers (Phylloscopus trochiloides): A strong role for stochasticity in signal evolution? Journal of Evolutionary Biology, (2), 435–448, doi:10.1111/j.1420-9101.2007.01499.x.

Kershenbaum, A., Sayigh, L.S. & Janik, V.M. (2013) The encoding of individual identity in dolphin signature whistles: How much information is needed? PLoS ONE, (10), e77671, doi:10.1371/journal.pone.0077671.

Linhart, P., Slabbekoorn, H. & Fuchs, R. (2012) The communicative significance of song frequency and song length in territorial chiffchaffs. Behavioral Ecology, (6), 1338–1347, doi:10.1093/beheco/ars127.

Lohr, B., Dooling, R.J., Bartone, S. et al. (2006) The discrimination of temporal fine structure in call-like harmonic sounds by birds. Journal of Comparative Psychology, (3), 239–251, doi:10.1037/0735-7036.120.3.239.

Mahler, B. & Gil, D. (2009) The evolution of song in the Phylloscopus leaf warblers (Aves: Sylviidae): A tale of sexual selection, habitat adaptation, and morphological constraints. Vocal Communication in Birds and Mammals, Advances in the Study of Behavior, vol. 40 (eds. M. Naguib, K. Zuberbühler, N.S. Clayton & V.M. Janik). Academic Press, Massachusetts, USA, pp. 35–66, doi:10.1016/S0065-3454(09)40002-0.

Mallat, S.G. (1999)
A Wavelet Tour of Signal Processing, 2nd edn. Academic Press, London, UK.

Markel, J. (1972) Digital inverse filtering – A new tool for formant trajectory estimation. IEEE Transactions on Audio and Electroacoustics, (2), 129–137, doi:10.1109/TAU.1972.1162367.

Marler, P.R. & Slabbekoorn, H. (2004) Nature's Music: The Science of Birdsong. Academic Press, Massachusetts, USA.

Mušević, S. (2013)
Non-Stationary Sinusoidal Analysis. Ph.D. thesis, Universitat Pompeu Fabra, Barcelona, Spain, URL http://mtg.upf.edu/node/2763.

Plumbley, M.D., Blumensath, T., Daudet, L., Gribonval, R. & Davies, M.E. (2010) Sparse representations in audio and music: From coding to source separation. Proceedings of the IEEE, (6), 995–1005, doi:10.1109/JPROC.2009.2030345.

Podos, J. (1997) A performance constraint on the evolution of trilled vocalizations in a songbird family (Passeriformes: Emberizidae). Evolution, (2), 537–551.

R Development Core Team (2010) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.

Slabbekoorn, H., Ellers, J. & Smith, T.B. (2002) Birdsong and sound transmission: The benefits of reverberations.
The Condor, (3), 564–573, doi:10.1650/0010-5422(2002)104[0564:BASTTB]2.0.CO;2.

Sprau, P., Roth, T., Schmidt, R., Amrhein, V. & Naguib, M. (2010) Communication across territory boundaries: Distance-dependent responses in nightingales. Behavioral Ecology, (5), 1011–1017, doi:10.1093/beheco/arq097.

Stowell, D. & Plumbley, M.D. (2012) Framewise heterodyne chirp analysis of birdsong. Proceedings of the European Signal Processing Conference (EUSIPCO), pp. 2694–2698.

Trifa, V., Kirschel, A., Taylor, C. & Vallejo, E. (2008) Automated species recognition of antbirds in a Mexican rainforest using hidden Markov models. Journal of the Acoustical Society of America, , 2424, doi:10.1121/1.2839017.

Trillo, P.A. & Vehrencamp, S.L. (2005) Song types and their structural features are associated with specific contexts in the banded wren. Animal Behaviour, (4), 921–935, doi:10.1016/j.anbehav.2005.02.004.

Vehrencamp, S.L., Yantachka, J., Hall, M.L. & de Kort, S.R. (2013) Trill performance components vary with age, season, and motivation in the banded wren. Behavioral Ecology and Sociobiology, (3), 409–419, doi:10.1007/s00265-012-1461-x.

Wang, A. (2003) An industrial strength audio search algorithm. Proceedings of the 4th International Conference on Music Information Retrieval (ISMIR '03), pp. 7–13.

Witten, I.H. & Frank, E. (2005)
Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. Morgan Kaufmann, San Francisco, CA, USA.

Zollinger, S.A., Podos, J., Nemeth, E., Goller, F. & Brumm, H. (2012) On the relationship between, and measurement of, amplitude and frequency in birdsong. Animal Behaviour, (4), e1–e9, doi:10.1016/j.anbehav.2012.04.026.

Appendix: Feature selection results

We performed feature selection on each of our three datasets, evaluated using Information Gain (IG) as described in the main text (Table 4, Figure 5). The overall pattern of IG values shows broad similarities between
PhyllASA and
PhyllXC, indicating commonalities in species-dependent features. The IG values evaluated on
PhyllXC are consistently lower than those in
PhyllASA, suggesting that the species information in the former may be affected by noise and MP3 effects. However, the tendency to lower IG values may also be an artefact of differences in species distribution within each dataset.

Table 4: Ranked results of information-gain (IG) feature selection applied to each of our three datasets. Features are ranked in order of how strongly they predict species identity. Left to right: