Automatic acoustic identification of individual animals: Improving generalisation across species and recording conditions
Dan Stowell, Tereza Petrusková, Martin Šálek, Pavel Linhart

Abstract
Many animals emit vocal sounds which, independently from the sounds' function, embed some individually-distinctive signature. Thus the automatic recognition of individuals by sound is a potentially powerful tool for zoology and ecology research and practical monitoring. Here we present a general automatic identification method that can work across multiple animal species with various levels of complexity in their communication systems. We further introduce new analysis techniques based on dataset manipulations that can evaluate the robustness and generality of a classifier. Using these techniques, we confirmed the presence of experimental confounds in situations resembling those from past studies. We introduce data manipulations that can reduce the impact of these confounds, compatible with any classifier. We suggest that assessment of confounds should become a standard part of future studies to ensure they do not report over-optimistic results. We provide the annotated recordings used for our analyses along with this study, and we call for dataset sharing to become common practice, to enhance the development of methods and the comparison of results.
Keywords: animal communication; individual differences; individuality; acoustic monitoring; song repertoire; vocalisation.

Introduction
Animal vocalisations exhibit consistent individually-distinctive patterns, often referred to as acoustic signatures. Individual differences in acoustic signals have been reported universally across vertebrate species (e.g., fish [1], amphibians [2], birds [3], mammals [4]). Individual differences may arise from various sources; for example, a distinctive fundamental frequency and harmonic structure of the acoustic signal can result from individual vocal tract anatomy [4, 5], while distinct temporal or frequency modulation patterns of vocal elements may result from inaccurate matching of an innate or learned template, or can arise de novo through improvisation [6]. Such individual signatures provide individual recognition cues for other conspecific animals, and individual recognition based on acoustic signals is widespread among animals [7]. Long-lasting individual recognition, spanning one or more years, has also often been demonstrated [8, 9, 10]. External and internal factors, such as sound degradation during transmission [11, 12], variable ambient temperature [13], inner motivational state [14, 15], or acquisition of new sounds during life [16], may potentially increase variation of acoustic signals. Despite these potential complications, robust individual signatures have been found in many taxa.

Besides being studied for their crucial importance in social interactions [17, 18, 19], individual signatures can become a valuable tool for monitoring animals. Acoustic monitoring of individuals of various species based on vocal cues could become a powerful tool in conservation (reviewed in [3, 20, 21]). Classical capture-mark methods of individual monitoring involve physically disturbing the animals of interest and might have a negative impact on the health or behaviour of the studied animals (e.g. [22, 23, 24, 25]). Also, concerns have been raised about possible biases in demographic and behavioural studies resulting from trap boldness or shyness of specific individuals [26]. Individual acoustic monitoring offers the great advantage of being non-invasive, and thus can be deployed across species with fewer concerns about effects on behaviour [3]. It may also reveal complementary or more detailed information about species behaviour than classical methods [27, 28, 29, 30].

Despite many pilot studies [31, 28, 32, 33], automatic acoustic individual identification is still not routinely applied. It is usually restricted to a particular research team or even to a single research project, and may eventually be abandoned altogether for a particular species. Part of the problem probably lies in the fact that methods of acoustic individual identification have been closely tailored to a single species (software platform, acoustic features used, etc.). This is good for obtaining the best possible results for a particular species, but it also hinders general, widespread application, because methods need to be developed from scratch for each new species or even each project. Little attention has been paid to developing general methods of automatic acoustic individual identification (henceforth "AAII") which could be used across different species.

A few studies in the past have proposed to develop a general, call-type-independent acoustic identification, working towards approaches that could be used across different species having simple as well as complex vocalisations [34]. Despite promising results, most of the published papers included vocalisations recorded within very limited periods of time (a few hours in a day) [34, 35, 36, 37].
Hence, these studies might have failed to separate the effects of the target signal from the potentially confounding effects of particular recording conditions and background sound, which have been reported as notable problems in other machine learning tasks [38, 39]. Reducing such confounds directly, by recording an animal in different backgrounds, may not be achievable in field conditions, since animals typically live within limited home ranges and territories. However, the acoustic background can change during the breeding season due to vegetation changes or cycles in the activity of different bird species. Also, songbirds may change territories in subsequent years or even within a single season [27]. Some other studies of individual acoustic identification, on the other hand, provided evidence that machine-learning acoustic identification can be robust with respect to possible long-term changes in the acoustic background, but did not provide evidence of being generally usable for multiple species [30, 32]. Therefore, reliable generalisation of machine learning approaches to acoustic individual identification, across different conditions and different species, has not yet been satisfactorily demonstrated.
We briefly review studies representing methods for automatic classification of individuals. Note that in the present work, as in many of the cited works, we set aside questions of automating the prior steps of recording focal birds and isolating the recording segments in which they are active. It is common, in preparing data sets, for recordists to collate recordings and manually trim them to the regions containing the "foreground" individual of interest (often with some background noise), discarding the regions containing only background sound. In the present work we will make use of both the foreground and background clips, and our method will be applicable whether such segmentation is done manually or automatically.

Matching a signal against a library of templates is a well-known bioacoustic technique, most commonly using spectrogram (sonogram) representations of the sound, via spectrogram cross-correlation [40]. For identifying individuals, template matching will work in principle when the individuals' vocalisations are strongly stereotyped with stable individual differences, and in practice this can give good recognition results for some species [41]. However, template matching is only applicable to a minority of species. It is strongly call-type dependent and requires a library covering all of the vocalisation units that are to be identified. It is unlikely to be useful for species which have a very large vocabulary, high variability, or whose vocabulary changes substantially across seasons.

An approach which can be more independent of call type is that of Gaussian mixture models (GMMs), previously used extensively in human speech technology [42, 30]. These do not rely on a strongly fixed template but rather build a statistical model summarising the observations (e.g. the spectral shapes) that are likely to be produced by each individual. A particularly useful aspect of the GMM paradigm is that it can straightforwardly incorporate the concept of a "universal background model" (UBM), which represents not "background" as ordinarily understood but a universal pool of the sounds that might be produced by individuals known and unknown. It therefore allows for the practical possibility that a given sound might come from unknown individuals that are not part of the target set [42]. This approach has been used in songbirds, although without testing across multiple seasons [42], and for orangutans, including across-season evaluation [30].

The GMM is a very basic statistical model which does not incorporate any notion of temporal structure. It thus misses out on making use of a large amount of information in the signal. One way to improve on this, again well-developed in human speech technology, is to apply hidden Markov models (HMMs). HMMs are statistical models of temporal structure and have more flexibility than template matching. However, in general they are likely to be call-type-dependent, since they encode the temporal structure observed in each vocalisation. Adi et al. used HMMs for recognising individual songbirds, in this case ortolan buntings, with a pragmatic approach to call-type dependence [32]. They first applied HMMs to infer the call type active in a given recording (independent of individual), and then, given the call type, applied GMMs to infer which individual was active.

Other computational approaches have been studied. Cheng et al. compared four classifier methods, aiming to develop call-type-independent recognition across three passerine species [37].
They found HMMs and support vector machines to be favourable among the methods they tested. However, the data used in this study were relatively limited: they were based on single recording sessions per individual, and thus could not test across-year performance; and the authors deliberately curated the data to select clean recordings with minimal noise, acknowledging that this would not be representative of realistic recordings. Fox et al. also focused on the challenge of call-independent identification, across three other passerine species [35, 34]. They used a neural network classifier, and achieved good performance for their species. However, again the data for this study were based on a single session per individual, which makes it unclear how far the findings generalise across days and years, and also does not fully test whether the results may be affected by confounding factors such as recording conditions.

Computational methods for various automatic recognition tasks have recently been dominated and dramatically improved by new trends in machine learning, including deep learning. Within that broad field, the challenge of reliable generalisation is far from solved, and is an active research topic. Within bioacoustics this has recently been studied for detection of bird sounds [43]. In deep learning, it was discovered that even the best-performing deep neural networks might be surprisingly non-robust, and could be forced to change their decisions by the addition of tiny, imperceptible amounts of background noise to an image [38].

Note that deep learning systems also typically require very large amounts of data to train, meaning they may currently be infeasible for tasks such as acoustic individual ID in which the number of recordings per individual is necessarily limited. For deep learning, "data augmentation" has been used to expand dataset sizes. Data augmentation refers to the practice of synthetically creating additional data items by modifying or recombining existing items. In the audio domain, this could be done for example by adding noise, filtering, or mixing audio clips together [44]. However, simple unprincipled data augmentation does not reduce issues such as undersampling (e.g. some vocalisations unrepresented in the data set) or confounding factors.

There thus remains a gap in applying machine learning for automatic individual identification as a general-purpose tool that can be shown to be reliable for multiple species and can generalise correctly across recording conditions.

In the work reported in this paper, we tested generalisation of machine learning across species and across recording conditions in the context of individual acoustic identification. We used extensive data for three different bird species, including repeated recordings of the same individuals within and across two breeding seasons. As well as directly evaluating across seasons, we also introduced ways to modify the evaluation data to probe the generalisation properties of the classifier. We then improved on the baseline approach by developing novel methods which help to improve generalisation performance, again by modifying the data used. Although tested with selected species and classifiers, our approach of modifying the data rather than the classification algorithm was designed to be compatible with a wide variety of automatic identification workflows.
For this study we chose three bird species of varying vocal complexity (Figure 1), in order to explore how a single method might apply to the same task at differing levels of difficulty and variation. Little owl (Athene noctua) represents a species with a simple vocalisation (Figure 1a): the territorial call is a single syllable which is individually unique and is held to be stable over time (Linhart and Šálek unpubl. data), as has been shown in several other owl species (e.g. [31, 45]). Then, we selected two passerine species which exhibit vocal learning: chiffchaff (Phylloscopus collybita) and tree pipit (Anthus trivialis). Tree pipit songs are also individually unique and stable over time [27]; but a male uses on average 11 syllable types (range 6–18), which are repeated in phrases that can be variably combined to create a song ([46], Figure 1b). Chiffchaff song, when visualised, may seem simpler than that of the pipit. However, the syllable repertoire size might actually be higher (9 to 24 types) and, contrary to the other species considered, chiffchaff males may change the syllable composition of their songs over time ([47], Figure 1c). The selected species also differ in their ecology. While little owls are sedentary and extremely faithful to their territories [48], tree pipits and chiffchaffs are migratory species with high fidelity to their localities. Annual return rates for both are 25% to 30% ([27], Linhart unpubl. data).

Figure 1. Example spectrograms representing our three study species: (a) little owl, (b) tree pipit, (c) chiffchaff.

For each of these species, we used targeted recordings of single vocally active individuals. Distance to the recorded individual varied across individuals and species according to their tolerance towards people. We tried to get the best recording and minimise distance to each singing individual without disturbing its activities. Recordings were always done under favourable weather conditions (no rain, no strong wind). In general, the signal-to-noise ratio is very good in all of our recordings (not rigorously assessed), but there are also environmental sounds and sounds from other animals or conspecifics in the recording background. All three species were recorded with the following equipment: Sennheiser ME67 microphone, Marantz PMD660 or 661 solid-state recorder (sampling frequency 44.1 kHz, 16 bit, PCM).
Little owl (Linhart and Šálek 2017) [49]:
Little owls were recorded in two Central European farmland areas, including northern Bohemia, Czech Republic (50° ...).

Chiffchaff (Průchová et al. 2017 [47], Ptáček et al. 2016 [42]):
Chiffchaff males were recorded in a former military training area on the outer boundary of České Budějovice town, the Czech Republic (48° ...).

Tree Pipit (Petrusková et al. 2015 [27]):
Tree Pipit males were recorded at the locality Brdská vrchovina, the Czech Republic (49° ...).

Table 1. Details of the audio recording datasets used. Each pair of values gives training : testing counts.

Evaluation scenario       Num. of inds   Foreground                   Background
Chiffchaff within-year    13             5107 : 1131    451 : 99      5011 : 1100    453 : 92
Chiffchaff only-15        13              195 : 1131     18 : 99       195 : 1100     21 : 92
Chiffchaff across-year    10              324 : 201      32 : 20       304 : 197      31 : 24
Little owl across-year    16              545 : 407      11 : 8        546 : 409      34 : 27
Pipit within-year         10              409 : 303      27 : 21       398 : 293      49 : 47
Pipit across-year         10              409 : 313      27 : 19       398 : 306      49 : 37

"Data augmentation" in machine learning refers to creating artificially large or diverse data sets by synthetically manipulating items in data sets to create new items, for example by adding noise or performing mild distortions. These artificially enriched data sets, used for training, often lead to improved automatic classification results, helping to mitigate the effects of limited data availability [50, 51]. Data augmentation is increasingly used in machine learning applied to audio. Audio-specific manipulations might include filtering or pitch-shifting, or the mixing together of audio files (i.e. summing their signals together) [52, 53]. Some of the highest-performing automatic species recognition systems rely in part on such data augmentations to attain their strongest results [44].

In this work, we describe two augmentation methods used specifically to evaluate and to reduce the confounding effect of background sound. These structured data augmentations are based on audio mixing, but with the combinations of files to mix selected based on foreground and background identity metadata. We make use of the fact that when recording audio from focal individuals in the wild, it is common to obtain recording clips in which the focal individual is vocalising (Figure 2a), as well as 'background' recordings in which the focal individual is silent (Figure 2b). The latter are commonly discarded. We used them as follows:

Adversarial data augmentation:
To evaluate the extent to which confounding from background information is an issue, we created datasets in which each foreground recording has been mixed with one background recording from some other individual (Figure 2c). In the best case, this should make no difference, since the resulting sound clip is acoustically equivalent to a recording of the foreground individual, but with a little extra irrelevant background noise. In fact it could be considered a synthetic test of the case in which an individual is recorded having travelled out of their home range. In the worst case, a classifier that has learnt undesirable correlations between foreground and background will be misled by the modification, either increasing the probability of classifying as the individual whose territory provided the extra background, or simply confusing the classifier and reducing its general ability to classify well. In our implementation, each foreground item was used once, each mixed with a different background item. Thus the evaluation set remains the same size as the unmodified set. We evaluated the robustness of a classifier by looking at any changes in the overall correctness of classification, or in more detail via the extent to which the classifier outputs are modified by the adversarial augmentation.
Stratified data augmentation:
We can use a similar principle during the training process, to create an enlarged and improved training data set. We created training datasets in which each training item had been mixed with an example of background sound from each other individual (Figure 2d). If there are K individuals, this means that each item is converted into K synthetic items, and the data set size increases by a factor of K. Stratifying the mixing in this way, rather than selecting background samples purely at random, is intended to expose a classifier to training data with reduced correlation between foreground and background, and thus reduce the chance that it uses confounding information in making decisions.

To implement the foreground and background audio file mixing, we used the sox processing tool v14.4.1.
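To make the two structured augmentations concrete, the following is a minimal sketch of how such identity-aware mixing could be scripted around sox's mix combiner (`sox -m`). The directory layout (clips/&lt;individual&gt;/fg and clips/&lt;individual&gt;/bg), file naming and the choice of which background clip to pair with each foreground clip are hypothetical illustrations, not the exact pipeline used in this study.

```python
import random
import subprocess
from pathlib import Path

def mix(fg: Path, bg: Path, out: Path) -> None:
    """Mix two audio clips into one file using sox's mix combiner ('sox -m')."""
    out.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(["sox", "-m", str(fg), str(bg), str(out)], check=True)

def backgrounds_of(clips_dir: Path, ind: str):
    """All background-only clips recorded for one individual (assumed layout)."""
    return sorted((clips_dir / ind / "bg").glob("*.wav"))

def stratified_augment(clips_dir: Path, out_dir: Path) -> None:
    """Training-time augmentation: every foreground clip of every individual is
    mixed with one background clip from each *other* individual, multiplying
    the training set size roughly by the number of individuals."""
    inds = sorted(p.name for p in clips_dir.iterdir() if p.is_dir())
    for ind in inds:
        for fg in sorted((clips_dir / ind / "fg").glob("*.wav")):
            for other in inds:
                if other != ind:
                    bg = random.choice(backgrounds_of(clips_dir, other))
                    mix(fg, bg, out_dir / ind / f"{fg.stem}__bg_{other}.wav")

def adversarial_augment(clips_dir: Path, out_dir: Path) -> None:
    """Evaluation-time augmentation: each foreground clip is mixed with a single
    background clip from one other individual, keeping the evaluation set the
    same size as the unmodified set."""
    inds = sorted(p.name for p in clips_dir.iterdir() if p.is_dir())
    for ind in inds:
        for fg in sorted((clips_dir / ind / "fg").glob("*.wav")):
            other = random.choice([i for i in inds if i != ind])
            bg = random.choice(backgrounds_of(clips_dir, other))
            mix(fg, bg, out_dir / ind / f"{fg.stem}__adv_{other}.wav")
```

Because the approach only manipulates the audio files, it can be placed in front of any classification workflow without changes to the classifier itself.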
Figure 2. Explanatory illustration of our data augmentation interventions: (a) 'foreground' recordings, which also contain some signal content coming from the background habitat; the foreground and background might not vary independently, especially in the case of territorial animals; (b) 'background' recordings, recorded when the focal animal is not vocalising; (c) in adversarial data augmentation, we mix each foreground recording with a background recording from another individual, and measure the extent to which this alters the classifier's decision; (d) in stratified data augmentation, each foreground recording is mixed with a background recording from each other class, creating an enlarged training set with reduced confounding correlation between foreground and background.

Alongside our data augmentation, we can also consider simple interventions in which the background sound recordings are used alone, without modification. One way of diagnosing confounding-factor issues in AAII is to apply the classifier to background-only sound recordings. If there are no confounds in the trained classifier, trained on foreground sounds, then it should be unable to identify the corresponding individual for any given background-only sound (identifying 'a' or 'b' in Figure 2b): automatic identification for background-only sounds should yield results at around chance level.

A second use of the background-only recordings is to create an explicit 'wastebasket' class during training. As well as training the classifier to recognise individual labels A, B, C, ..., we created an additional 'wastebasket' class which should be recognised as 'none of the above', or in this case, explicitly as 'background'. The explicit-background class may or may not be used in the eventual deployment of the system. Either way, its inclusion in the training process could help to ensure that the classifier learns not to make mistaken associations with the other classes. This approach is related to the universal background model (UBM) used in open-set recognition methods [42]. Note that the 'background' class is likely to be different in kind from the other classes, having very diverse sounds. In methods with an explicit UBM, the background class can be handled differently from the others [42]. Here, we chose to use methods that can work with any classifier, and so the background class was simply treated analogously to the classes of interest.
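As a minimal illustration of this explicit-background setup with an off-the-shelf classifier, the background clips simply enter the training set as one more class label. The sketch below uses scikit-learn's random forest purely as a stand-in; the function and variable names are illustrative assumptions, not the study's actual code.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_with_explicit_background(fg_feats, fg_labels, bg_feats):
    """Train a classifier on foreground items labelled by individual, plus
    background-only items collected into one extra 'background' class."""
    X = np.vstack([fg_feats, bg_feats])
    y = np.concatenate([np.asarray(fg_labels), ["background"] * len(bg_feats)])
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X, y)
    return clf

# Usage sketch: fg_feats / bg_feats are per-clip feature vectors (e.g. summaries
# of the mel spectrogram features described below), and fg_labels the individual
# identities. clf.predict_proba(test_feats) then yields per-class probabilities,
# including the probability of 'background', which may be ignored at deployment.
```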
In this work, we started with a standard automatic classification processing workflow (Figure 3a), and then experimented with inserting our proposed improvements. We modified the feature processing stage, but our main innovations in fact came during the data set preparation stage, using the foreground and/or background data sets in various combinations to create different varieties of training and testing data (Figure 3b).

As in many other works, the audio files (which in this case may be the originals or their augmented versions) were not analysed in their raw waveform format, but were converted to a mel spectrogram representation, 'mel' referring to a perceptually-motivated compression of the frequency axis of a standard spectrogram. We used audio files (44.1 kHz mono) converted into spectrograms using frames of length 1024 (23 ms), with Hamming windows, 50% frame overlap, and 40 mel bands. We applied median-filtering noise reduction to the spectrogram data.

Following the findings of [54], we also applied unsupervised feature learning to the mel spectrogram data as a preprocessing step. This procedure scans through the training data in unsupervised fashion (i.e. neglecting the data labels), finding a linear projection that provides an informative transformation of the data. We evaluated the audio feature data with and without this feature learning step, to evaluate whether the data representation had an impact on the robustness and generalisability of automatic classification. In other words, as input to the classifier we used either the mel spectrograms, or the learned representation obtained by transforming the mel spectrogram data.

The automatic classifier we used was based on a random forest classifier that had previously been tested successfully for bird species classification, but had not been tested for AAII [54].
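The feature-extraction settings just described can be sketched as follows, assuming the librosa library. The exact noise-reduction recipe and the downstream spherical k-means feature learning of [54] are not reproduced here; a simple per-band median subtraction stands in for the median-filtering step.

```python
import librosa
import numpy as np

def mel_features(path: str, n_mels: int = 40) -> np.ndarray:
    """Load mono audio at 44.1 kHz and return a noise-reduced mel spectrogram
    (shape: n_mels x n_frames)."""
    y, sr = librosa.load(path, sr=44100, mono=True)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=1024,        # 1024 samples ~= 23 ms at 44.1 kHz
        hop_length=512,    # 50% frame overlap
        window="hamming",
        n_mels=n_mels,
    )
    mel = librosa.power_to_db(mel)  # log-magnitude scaling, a common convention
    # Assumed stand-in for the median-filtering noise reduction: subtract each
    # band's median over time, then clip negative values.
    mel = np.maximum(mel - np.median(mel, axis=1, keepdims=True), 0.0)
    return mel
```

Either this representation, or its transformation through the learned linear projection, is then passed to the classifier.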
As is standard in automatic classification evaluation, we divided our datasets into portions used for training the system, and portions used for evaluating system performance.
Figure 3. Classification workflows. (a) A standard workflow for automatic audio classification: data set preparation (training and testing foreground data), feature processing (mel spectrogram) and classification (train/apply classifier); the upper portion shows the training procedure, and the lower shows the application or evaluation procedure. (b) Workflow for our automatic classification experiments, adding background data, stratified augmentation (training), adversarial augmentation (testing) and optional feature learning; dashed boxes represent steps which we enable/disable as part of our experiment. The upper portion shows the training procedure, and the lower shows the evaluation procedure. The two portions are very similar. However, the purpose and method of augmentation is different in each, as is the use of background-only audio: in the training phase the 'concatenation' block creates an enlarged training set as the union of the background items and the foreground items, while in the evaluation phase the 'choose' block selects only one of the two, for the system to make predictions about.
Items used in training were not used in evaluation, and the allocation of items to the training or evaluation sets was done to create a partitioning through time: evaluation data came from different days within the breeding season, or subsequent years, than the training data. This corresponds to a plausible use-case in which a system is trained with existing recordings and then deployed; the partitioning also helps to reduce the probability of over-estimating performance.

To quantify performance we used receiver operating characteristic (ROC) analysis, and as a summary statistic the area under the ROC curve (AUC). The AUC summarises classifier performance and has various desirable properties for evaluating classification [55].

We evaluated the classifiers following the standard paradigm used in machine learning. Note that during evaluation, we optionally modified the evaluation data sets in two possible ways, as already described: adversarial data augmentation, and background-only classification. In all cases we used AUC as the primary evaluation measure. However, we also wished to probe the effect of adversarial data augmentation in finer detail: even when the overall decisions made by a classifier are not changed by modifying the input data, there may be small changes in the full set of probabilities it outputs. A classifier that is robust to adversarial augmentation should be one whose probabilities change little, if at all. Hence, for the adversarial augmentation test, we also took the probabilities output from the classifier and compared them against the equivalent probabilities from the same classifier in the non-adversarial case. We measured the difference between these sets of probabilities simply by their root-mean-square error (RMS error).
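A minimal sketch of the two evaluation measures follows: AUC computed from per-class probabilities (which can equally be applied to foreground, background-only or adversarially mixed test items), and the RMS difference between the probabilities produced with and without adversarial mixing. The function names and the particular multi-class AUC averaging shown here are illustrative assumptions, not necessarily the exact settings of the study.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate_auc(y_true: np.ndarray, proba: np.ndarray) -> float:
    """Area under the ROC curve for a multi-class problem, from predicted
    per-class probabilities (rows: items, columns: individuals; column order
    assumed to match the classifier's class ordering)."""
    return roc_auc_score(y_true, proba, multi_class="ovr", average="macro")

def adversarial_rms_error(proba_clean: np.ndarray, proba_adv: np.ndarray) -> float:
    """RMS difference between classifier output probabilities on the original
    test items and on their adversarially mixed counterparts; a robust
    classifier should yield a value close to zero."""
    return float(np.sqrt(np.mean((proba_clean - proba_adv) ** 2)))
```

In the background-only probe, the probabilities would come from applying the trained classifier to background-only clips; an AUC well above chance level there flags a confound.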
For our first phase of testing, we wished to compare the effectiveness of the different proposed interventions, and their relative effectiveness on data tested within-year or across-year. We chose to use the chiffchaff datasets for these tests, since the chiffchaff song has an appropriate level of complexity to elucidate the differences between classifier performance, in particular the possible change of syllable composition across years. The chiffchaff dataset is also by far the largest.

We wanted to explore the difference in estimated performance when evaluating a system with recordings from the same year, separated by days from the training data, versus recordings from a subsequent year. In the latter case, the background sounds may have changed intrinsically, or the individual may have moved to a different territory; and of course the individual's own vocalisation patterns may change across years. This latter effect may be an issue for AAII with a species such as the chiffchaff, and also imposes limits on the application of previous approaches such as template-based matching. Hence we wanted to test whether this more flexible machine learning approach could detect an individual signature in the chiffchaff even when applied to data from a different field season. We thus evaluated performance on 'within-year' data (recordings from the same season) and 'across-year' data (recordings from the subsequent year, or a later year).

Since the size of data available is often a practical constraint in AAII, and since dataset size can have a strong influence on classifier performance, we further performed a version of the 'within-year' test in which the training data had been restricted to only 15 items per individual. The evaluation data was not restricted.

To evaluate formally the effect of the different interventions, we applied generalised linear mixed models (GLMM) to our evaluation statistics, using the glmmadmb package within R version 3.4.4 [56, 57], treating AUC as a continuous value constrained to the range [0, 1].

In the second phase of our investigations, we evaluated the selected approach across the three species separately: chiffchaff, pipit and little owl. For each of these we compared the most basic version of the classifier (using mel features, no augmentation, and no explicit-background) against the improved version that was selected from phase one of the investigation. For each species separately, and using within-year and across-year data according to availability, we evaluated the basic and the improved classifier for the overall performance (AUC measured on foreground sounds). We also evaluated their performance on background-only sounds, and on the adversarial data augmentation test, both of which checked the relationship between improved classification performance and improvements or degradations in the handling of confounding factors.

For both of these tests (background-only testing and adversarial augmentation), we applied GLMM tests similar to those already stated. In these cases we entered separate factors for the testing condition and for whether the improved classifier was in use, as well as an interaction term between the two factors. This therefore tested for an effect of whether our improved classifier indeed mitigated the problems that the tests were designed to expose.
AAII performance over the 13 chiffchaff individuals was strong, above 85% AUC in all variants of the within-year scenario (Figure 4). For interpretation, note that this corresponds to over 85% probability that a random true-positive item is ranked higher than a random true-negative item by the system [55]. This reduced to around 70–80% when the training set was limited to 15 items per individual, and reduced even further to around 60% in the across-year evaluation scenario. Recognising chiffchaff individuals across years remains a challenging task even under the studied interventions.

Figure 4. Performance of the classifier (AUC) across the three chiffchaff evaluation scenarios, and with various combinations of configuration: with/without augmentation ('aug'), learnt features, and explicit-background ('exbg') training; results are shown separately for mel spectrogram features and learnt features.

The focus of our study is on discriminating between individuals, but our "explicit-background" configuration additionally made it possible for the same classifier to discriminate between cases where a focal individual was singing, and cases where it was not. Across all three of the conditions mentioned above, foreground-vs-background discrimination (i.e. "detection" of any focal individual) for chiffchaff was strong at over 95% AUC. Mel spectral features performed slightly better for this (range 96.6–98.6%) than learnt features (range 95.3–96.7%). Given this, in the remainder of the results we focus on our main question of discriminating between individuals.

We tested the GLMM residuals for the two evaluation measures (AUC, RMSE) and found no evidence for overdispersion. We also tested all possible reduced models with factors removed, comparing among models using AIC. In both cases, the full model as well as a model with 'exbg' (explicit-background training) removed gave the best fit, with the full model less than 2 units above the exbg-reduced model and leading to no difference in significance estimates. We therefore report results from the full models.

Feature-learning and structured data augmentation were both found to significantly improve classifier performance (Table 2), as well as robustness to adversarial data augmentation (Table 3). Explicit-background training was found to lead to mild improvement, but this was a long way below significance.
Table 2. Results of GLMM test for AUC, across the three chiffchaff evaluation scenarios.

                      Estimate   p-value
(Intercept)            0.8199    0.041 *
Feature-learning       0.3093    0.014 *
Augmentation           0.2509    0.048 *
Explicit-bg class      0.0626    0.621
Table 3. Results of GLMM fit for RMSE in the adversarial data augmentation test, across the three chiffchaff evaluation scenarios.

                      Estimate   p-value
(Intercept)            1.8543    1.9e-05 ***
Feature-learning      -0.5044    1.9e-08 ***
Augmentation          -0.8734    < ...

Based on the results of our first study, we took forward an improved version of the classifier (using stratified data augmentation and learnt features, but not explicit-background training) to test across multiple species.

Applying this classifier to the different species and conditions, we found that it led in most cases to a dramatic improvement in recognition performance on foreground recordings, and little change in the recognition of background recordings (Figure 5, Table 4). This suggests that the improvement is based on the individuals' signal characteristics and not confounding factors.

Our adversarial augmentation, intended as a diagnostic test to adversarially reduce classification performance, did not have strong overall effects on the headline performance indicated by the AUC scores (Figure 6, Table 4). Half of the cases examined (the across-year cases) were not adversely impacted, in fact showing a very small increase in AUC score. The chiffchaff within-year tests were the only ones to show a strong negative impact of adversarial augmentation, and this negative impact was removed by our improved classification method.

We also conducted a more fine-grained analysis of the effect of augmentation, by measuring the amount of deviation induced in the probabilities output from the classifier. On this measure we observed a consistent effect, with our improvements reducing the RMS error by ratios of approximately 2–6, while the overall magnitude of the error differed across species (Figure 7).
Figure 5. Our selected interventions (data augmentation and feature-learning) improve classification performance, in some cases dramatically (left-hand pairs of points), without any concomitant increase in the background-only classification (right-hand pairs of points), which would be an indication of confounding.
Table 4. Results of GLMM test for AUC, across all three species, to quantify the general effect of our improvements on the foreground test and the background test (cf. Figure 5).

                              Estimate   p-value
(Intercept)                    0.792     0.00150 **
Use of improved classifier     0.852     0.00032 ***
Background-only testing       -0.562     0.00624 **
Interaction term              -0.896     0.00391 **
Table 5. Results of GLMM test for AUC, across all three species, to quantify the general effect of our improvements on the adversarial test (cf. Figure 6).

                                Estimate   p-value
(Intercept)                      0.873     0.0121 *
Use of improved classifier       0.820     0.0027 **
Adversarial data augmentation   -0.333     0.1713
Interaction term                 0.225     0.5520
Figure 6. Adversarial augmentation has a varied impact on classifier performance (left-hand pairs of points), in some cases giving a large decline. Our selected interventions vastly reduce the impact of this adversarial test, while also generally improving classification performance (right-hand pairs of points).
Figure 7. Measuring in detail how much effect the adversarial augmentation has on classifier decisions: RMS error of classifier output, in each case applying adversarial augmentation and then measuring the differences compared against the non-adversarial equivalent applied to the exact same data. In all five scenarios, our selected interventions lead to a large decrease in the RMS error.
Discussion
We demonstrate that a single approach to automatic acoustic identification of individuals (AAII) can be successfully used across different species with different complexity of vocalisations. One exception to this is the hardest case, chiffchaff tested across years, in which automatic classification performance remains modest. The chiffchaff case (complex song, variable song content), in particular, highlights the need for proper assessment of identification performance. Without proper assessment we cannot be sure whether promising results reflect the real potential of a proposed identification method. We document that our proposed improvements to the classifier training process are able, in some cases, to improve the generalisation performance dramatically and, on the other hand, to reveal confounds causing over-optimistic results.

We evaluated spherical k-means feature-learning as previously used for species classification [54]. We found that for individual identification it provides an improvement over plain mel spectral features, not just in accuracy (as previously reported) but also in resistance to confounding factors. We believe this is due to the feature-learning having been tailored to reflect fine temporal details of bird sound; if so, this lesson would carry across to related systems such as convolutional neural networks. Our machine-learning approach may be particularly useful for automatic identification of individuals in species with more complex songs, such as pipits (note the huge increase in performance over mel features in Figure 5), or chiffchaffs (on a short time scale, though).

Using silence-regions from focal individuals to create an "explicit-background" training category provided only a mild improvement in the behaviour of the classifier, under various evaluations. Also, we found that the best-performing configuration for detecting the presence/absence of a focal individual was not the same as the best-performing configuration for discriminating between individuals. Hence, it seems generally preferable not to combine the detection and AAII tasks into one classifier.

By contrast, using silence-regions to perform dataset augmentation of the foreground sounds was found to give a strong boost to performance as well as resistance against confounding factors. Background sounds are thus useful in training a system for AAII, through data augmentation rather than explicit-background training.

We found that adversarial augmentation provided a useful tool to diagnose concerns about the robustness of an AAII system. In the present work we found that the classifier was robust against this augmentation (and thus we can infer that it was largely not using background confounds to make its decisions), except for the case of chiffchaff with the simple mel features (Figure 6). This latter case exhorts us to be cautious, and suggests that results from previous call-type independent methods may have been over-optimistic in assessing performance [34, 35, 36, 37, 42]. Our adversarial augmentation method can help to test for this even in the absence of across-year data.

Background-only testing was useful to confirm that when the performance of a classifier was improved, the confounding factors were not aggravated in parallel, i.e. that the improvement was due to signal and not confound (Figure 5).
However, the performance on background sound recordings was not reduced to chance, but remained at some level reflecting the foreground-background correlations in each case, so results need to be interpreted comparatively against the foreground improvement, rather than in isolation. This individual specificity of the background may be related to the time interval between recordings. This is clear from the across-year outcomes; within-year, we note that there was one day of temporal separation for chiffchaffs (close to 70 percent AUC on background-only sound), whereas there was an interval of weeks for pipits (chance-level classification of background). These effects surely depend on characteristics of the habitat.

Our improved classifier performs much more reliably than the standard one; however, the most crucial factor still seems to be the target species. For the little owl we found good performance, least affected by modifications in methods, consistent with the fact that it is the species with the simplest vocalisations. The little owl represents a species well suited to template-matching individual identification methods, which have been used in the past for many species with similarly simple, fixed vocalisations (discriminant analysis, cross-correlation). For these cases, it seems that our automatic identification method does not bring an advantage in terms of improved classification performance. However, a general classifier such as ours, automatically adjusting a set of features for each species, would allow common users to start individual identification right away without the need to choose an appropriate template-matching method (e.g. [49]).

We found that feature learning gave the best improvement in the case of pipits (Figure 5). Pipits have a more complex song, where simple template matching cannot be used to identify individuals. In pipits, each song may have a different duration and may be composed of a different subset of the syllable repertoire, and so no single song can be used as a template for a template-matching approach. This singing variation likely also prevents good identification performance based on mel features in pipits. Nevertheless, a singing pipit male will cycle through the whole syllable repertoire within a relatively low number of songs, and individual males can be identified based on their unique syllable repertoires [27]. We think that our improvements to the automatic identification might allow the system to pick up the correct features associated with the stable repertoire of each male. This extends the use of the same automatic identification method to the large part of songbird species that organise songs into several song types and, at the same time, are so-called closed-ended learners [58].

Our automatic identification, however, cannot be considered fully independent of song content in the sense defined earlier (e.g. [34, 36]). Such a content-independent identification method should be able to classify across-year recordings of chiffchaffs, in which syllable repertoires of males differ almost completely between the two years [47].
Due to the vulnerability of mel feature classification to confounds reported here, and because the performance of content-independent identification has been tested only on short-term recordings, we believe that the concept of fully content-independent individual identification has yet to be reliably demonstrated.

Our approach seems clearly suitable for species with individual vocalisation stable over time, even if that vocalisation is complex (a very wide range of species), in general outdoor conditions. For such species it might be successfully used for individual automatic acoustic monitoring, although this needs to be tested at larger scale: in various species and in large populations. In future work these approaches should also be tested with 'open-set' classifiers allowing for the possibility that new unknown individuals might appear in the data. This is well developed in the "universal background model" (UBM) used in GMM-based speaker recognition [42], and future work in machine learning is needed to develop this for the case of more powerful classifiers.

Important for further work in this topic is open sharing of data in standard formats. Only this way can diverse datasets from individuals be used to develop and evaluate automatic recognition that works across many taxa and recording conditions.

We conclude by listing the recommendations that emerge from this work for users of automatic classifiers, in particular for acoustic recognition of individuals:

1. Record 'background' segments for each individual (class), and publish background audio samples alongside the trimmed individual audio samples. Standard data repositories can be used for these purposes (e.g. Dryad, Zenodo).

2. Improve robustness by:
(a) suitable choice of input features;
(b) structured data augmentation, using background sound recordings.

3. Probe your classifier for robustness by:
(a) background-only recognition: higher-than-chance recognition strongly implies confound;
(b) adversarial distraction with background: a large change in classifier outputs implies confound;
(c) across-year testing (if such data are available): a stronger test than within-year.

4. Be aware of how species characteristics will affect recognition. The vocalisation characteristics of the species will influence the ease with which automatic classifiers can identify individuals. Songbirds whose song changes within and between seasons will always be harder to identify reliably, as is also the case in manual identification.

5. Best practice is to test both manual features and learned features, since their generalisation and performance characteristics are rather different. In the present work we compare basic features against learned features; for a different example see [12]. Manual features are usually of lower accuracy, but with learned features more care must be taken with respect to confounds and generalisation.

Ethics
Our study primarily involved only non-invasive recording of vocalising individuals. In the case of ringed individuals (all chiffchaffs and some tree pipits and little owls), ringing was done by experienced ringers (PL, MŠ, TP) who all held ringing licences at the time of the study. Tree pipit and chiffchaff males were recorded during spontaneous singing. Only for little owls was a short playback recording (1 min) used to provoke calling. Playback provocations as well as handling during ringing were kept as short as possible, and we are not aware of any consequences for the subjects' breeding or welfare.
Data Accessibility
Our audio data and the associated metadata files are available online under the Creative Commons Attribution licence (CC BY 4.0) at http://doi.org/10.5281/zenodo.1413495
Competing Interests
We have no competing interests.
Authors’ Contributions
DS and PL conceived and designed the study. PL, TP and MŠ recorded audio. PL processed the audio recordings into data sets. DS carried out the classification experiments and performed data analysis. DS, PL and TP wrote the manuscript. All authors gave final approval for publication.
Funding
DS was supported by EPSRC Early Career research fellowship EP/L020505/1. PL was supported by the National Science Centre, Poland, under Polonez fellowship reg. no. UMO-2015/19/P/NZ8/02507, funded by the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 665778. TP was supported by the Czech Science Foundation (project P505/11/P572). MŠ was supported by the research aim of the Czech Academy of Sciences (RVO 68081766).

References
1. Amorim MCP, Vasconcelos RO. Variability in the mating calls of the Lusitanian toadfish Halobatrachus didactylus: cues for potential individual recognition. Journal of Fish Biology. 2008;73:1267–1283.
2. Bee MA, Gerhardt HC. Neighbour-stranger discrimination by territorial male bullfrogs (Rana catesbeiana): I. Acoustic basis. Animal Behaviour. 2001;62:1129–1140.
3. Terry AM, Peake TM, McGregor PK. The role of vocal individuality in conservation. Frontiers in Zoology. 2005;2(1):10.
4. Taylor AM, Reby D. The contribution of source-filter theory to mammal vocal communication research. Journal of Zoology. 2010;280(3):221–236. Available from: http://onlinelibrary.wiley.com/doi/10.1111/j.1469-7998.2009.00661.x/abstract.
5. Gamba M, Favaro L, Araldi A, Matteucci V, Giacoma C, Friard O. Modeling individual vocal differences in group-living lemurs using vocal tract morphology. Current Zoology. 2017;63(4):467–475.
6. Janik V, Slater PB. Vocal Learning in Mammals. vol. 26. Academic Press; 1997. p. 59–99.
7. Wiley RH. Specificity and multiplicity in the recognition of individuals: implications for the evolution of social behaviour. Biological Reviews. 2013;88(1):179–195. WOS:000317066700011.
8. Boeckle M, Bugnyar T. Long-Term Memory for Affiliates in Ravens. Current Biology. 2012;22(9):801–806.
9. Insley SJ. Long-term vocal recognition in the northern fur seal. Nature. 2000;406(6794):404–405.
10. Briefer EF, de la Torre MP, McElligott AG. Mother goats do not forget their kids' calls. Proceedings of the Royal Society B: Biological Sciences. 2012 Jun;279(1743):3749–3755.
11. Slabbekoorn H. Singing in the wild: the ecology of birdsong. In: Marler P, Slabbekoorn H, editors. Nature's music: the science of birdsong. Elsevier Academic Press; 2004. p. 178–205.
12. Mouterde SC, Elie JE, Theunissen FE, Mathevon N. Learning to cope with degraded sounds: Female zebra finches can improve their expertise at discriminating between male voices at long distance. The Journal of Experimental Biology. 2014;p. jeb–104463.
13. Gambale PG, Signorelli L, Bastos RP. Individual variation in the advertisement calls of a Neotropical treefrog (Scinax constrictus). Amphibia-Reptilia. 2014;35(3):271–281. Available from: http://booksandjournals.brillonline.com/content/journals/10.1163/15685381-00002949.
14. Collins SA. Vocal fighting and flirting: the functions of birdsong. In: Marler PR, Slabbekoorn H, editors. Nature's music: the science of birdsong. Elsevier Academic Press; 2004. p. 39–79.
15. Linhart P, Jaška P, Petrusková T, Petrusek A, Fuchs R. Being angry, singing fast? Signalling of aggressive motivation by syllable rate in a songbird with slow song. Behavioural Processes. 2013;100:139–145.
16. Kroodsma DE. The diversity and plasticity of bird song. In: Marler PR, Slabbekoorn H, editors. Nature's music: the science of birdsong. Elsevier Academic Press; 2004. p. 108–131.
17. Thom MDF, Dytham C. Female Choosiness Leads to the Evolution of Individually Distinctive Males. Evolution. 2012;66(12):3736–3742. WOS:000312218200008.
18. Bradbury JW, Vehrencamp SL. Principles of animal communication. 1st ed. Sinauer Associates; 1998.
19. Crowley PH, Provencher L, Sloane S, Dugatkin LA, Spohn B, Rogers L, et al. Evolving cooperation: the role of individual recognition. Biosystems. 1996;37(1):49–66.
20. Mennill DJ. Individual distinctiveness in avian vocalizations and the spatial monitoring of behaviour. Ibis. 2011;153(2):235–238.
Available from: http://onlinelibrary.wiley.com/doi/10.1111/j.1474-919X.2011.01119.x/abstract.
21. Blumstein DT, Mennill DJ, Clemins P, Girod L, Yao K, Patricelli G, et al. Acoustic monitoring in terrestrial environments using microphone arrays: applications, technological considerations and prospectus. Journal of Applied Ecology. 2011;48(3):758–767. Available from: http://onlinelibrary.wiley.com/doi/10.1111/j.1365-2664.2011.01993.x/abstract.
22. Johnsen A, Lifjeld J, Rohde PA. Coloured leg bands affect male mate-guarding behaviour in the bluethroat. Animal Behaviour. 1997;54(1):121–130.
23. Gervais JA, Catlin DH, Chelgren ND, Rosenberg DK. Radiotransmitter mount type affects burrowing owl survival. The Journal of Wildlife Management. 2006;70(3):872–876.
24. Linhart P, Fuchs R, Poláková S, Slabbekoorn H. Once bitten twice shy: long-term behavioural changes caused by trapping experience in willow warblers Phylloscopus trochilus. Journal of Avian Biology. 2012;43(2):186–192.
25. Rivera-Gutierrez HF, Pinxten R, Eens M. Songbirds never forget: long-lasting behavioural change triggered by a single playback event. Behaviour. 2015;152(9):1277–1290. Available from: http://booksandjournals.brillonline.com/content/journals/10.1163/1568539x-00003278.
26. Camacho C, Canal D, Potti J. Lifelong effects of trapping experience lead to age-biased sampling: lessons from a wild bird population. Animal Behaviour. 2017;130:133–139.
27. Petrusková T, Pišvejcová I, Kinštová A, Brinke T, Petrusek A. Repertoire-based individual acoustic monitoring of a migratory passerine bird with complex song as an efficient tool for tracking territorial dynamics and annual return rates. Methods in Ecology and Evolution. 2015 Nov;7(3):274–284. Available from: https://doi.org/10.1111%2F2041-210x.12496.
28. Laiolo P, Vögeli M, Serrano D, Tella JL. Testing acoustic versus physical marking: two complementary methods for individual-based monitoring of elusive species. Journal of Avian Biology. 2007;38(6):672–681.
29. Kirschel ANG, Cody ML, Harlow ZT, Promponas VJ, Vallejo EE, Taylor CE. Territorial dynamics of Mexican Ant-thrushes Formicarius moniliger revealed by individual recognition of their songs. Ibis. 2011;153:255–268.
30. Spillmann B, van Schaik CP, Setia TM, Sadjadi SO. Who shall I say is calling? Validation of a caller recognition procedure in Bornean flanged male orangutan (Pongo pygmaeus wurmbii) long calls. Bioacoustics. 2017;26(2):109–120.
31. Delport W, Kemp AC, Ferguson JWH. Vocal identification of individual African Wood Owls Strix woodfordii: a technique to monitor long-term adult turnover and residency. Ibis. 2002;144:30–39.
32. Adi K, Johnson MT, Osiejuk TS. Acoustic censusing using automatic vocalization classification and identity recognition. Journal of the Acoustical Society of America. 2010 Feb;127(2):874–883.
33. Terry AMR, McGregor PK. Census and monitoring based on individually identifiable vocalizations: the role of neural networks. Animal Conservation. 2002;5:103–111.
34. Fox EJS. A new perspective on acoustic individual recognition in animals with limited call sharing or changing repertoires. Animal Behaviour. 2008 Mar;75(3):1187–1194.
35. Fox EJS, Roberts JD, Bennamoun M. Call-independent individual identification in birds. Bioacoustics. 2008;18(1):51–67.
36. Cheng J, Sun Y, Ji L. A call-independent and automatic acoustic system for the individual recognition of animals: A novel model using four passerines. Pattern Recognition. 2010 Nov;43(11):3846–3852.
37. Cheng J, Xie B, Lin C, Ji L.
A comparative study in birds: call-type-independent species and individual recognition using four machine-learning methods and two acoustic features. Bioacoustics. 2012 Jun;21(2):157–171.
38. Szegedy C, Zaremba W, Sutskever I, Bruna J, Erhan D, Goodfellow I, et al. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. 2013.
39. Mesaros A, Heittola T, Virtanen T. Acoustic Scene Classification: an Overview of DCASE 2017 Challenge Entries. In: 16th International Workshop on Acoustic Signal Enhancement (IWAENC). Tokyo, Japan; 2018.
40. Khanna H, Gaunt S, McCallum D. Digital spectrographic cross-correlation: tests of sensitivity. Bioacoustics. 1997;7(3):209–234.
41. Foote JR, Palazzi E, Mennill DJ. Songs of the Eastern Phoebe, a suboscine songbird, are individually distinctive but do not vary geographically. Bioacoustics. 2013;22(2):137–151.
42. Ptáček L, Machlica L, Linhart P, Jaška P, Muller L. Automatic recognition of bird individuals on an open set using as-is recordings. Bioacoustics. 2016;25(1):55–73.
43. Stowell D, Stylianou Y, Wood M, Pamuła H, Glotin H. Automatic acoustic detection of birds through deep learning: the first Bird Audio Detection challenge. ArXiv e-prints. 2018 Jul.
44. Lasseck M. Audio-based Bird Species Identification with Deep Convolutional Neural Networks. Working Notes of CLEF. 2018.
45. Grava T, Mathevon N, Place E, Balluet P. Individual acoustic monitoring of the European Eagle Owl Bubo bubo. Ibis. 2008;150:279–287.
46. Petrusková T, Osiejuk TS, Linhart P, Petrusek A. Structure and Complexity of Perched and Flight Songs of the Tree Pipit (Anthus trivialis). Annales Zoologici Fennici. 2008 Apr;45(2):135–148. Available from: https://doi.org/10.5735%2F086.045.0205.
47. Průchová A, Jaška P, Linhart P. Cues to individual identity in songs of songbirds: testing general song characteristics in Chiffchaffs Phylloscopus collybita. Journal of Ornithology. 2017 Apr. Available from: https://doi.org/10.1007%2Fs10336-017-1455-6.
48. Nieuwenhuyse DV, Génot JC, Johnson DH. The Little Owl: Conservation, Ecology and Behavior of
Athene noctua. Cambridge University Press; 2008.
49. Linhart P, Šálek M. The assessment of biases in the acoustic discrimination of individuals. PLoS ONE. 2017;12(5):e0177206.
50. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems (NIPS); 2012. p. 1097–1105. Available from: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.
51. Cireşan D, Meier U, Schmidhuber J. Multi-column deep neural networks for image classification. arXiv preprint arXiv:1202.2745. 2012.
52. Schlüter J, Grill T. Exploring Data Augmentation for Improved Singing Voice Detection with Neural Networks. In: Proceedings of the International Conference on Music Information Retrieval (ISMIR); 2015. p. 121–126.
53. Salamon J, Bello JP. Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Processing Letters. 2017;24(3):279–283.
54. Stowell D, Plumbley MD. Automatic large-scale classification of bird sounds is strongly improved by unsupervised feature learning. PeerJ. 2014;2:e488.
55. Fawcett T. An introduction to ROC analysis. Pattern Recognition Letters. 2006;27(8):861–874.
56. Fournier DA, Skaug HJ, Ancheta J, Ianelli J, Magnusson A, Maunder MN, et al. AD Model Builder: using automatic differentiation for statistical inference of highly parameterized complex nonlinear models. Optim Methods Softw. 2012;27:233–249.
57. Skaug H, Fournier D, Bolker B, Magnusson A, Nielsen A. Generalized Linear Mixed Models using 'AD Model Builder'; 2016-01-19. R package version 0.8.3.3.
58. Beecher MD, Brenowitz EA. Functional aspects of song learning in songbirds. Trends in Ecology & Evolution. 2005;20(3):143–149.