Diversity-Robust Acoustic Feature Signatures Based on Multiscale Fractal Dimension for Similarity Search of Environmental Sounds
This paper is being submitted to IEICE TRANS. INF. & SYST., XXXX 2021

PAPER
Motohiro SUNOUCHI†a) and Masaharu YOSHIOKA††b), Members
SUMMARY
This paper proposes new acoustic feature signatures based on the multiscale fractal dimension (MFD), which are robust against the diversity of environmental sounds, for content-based similarity search. The diversity of sound sources and acoustic compositions is a typical feature of environmental sounds. Several acoustic features have been proposed for environmental sounds, among them the widely used Mel-Frequency Cepstral Coefficients (MFCCs), which describe frequency-domain features. However, in addition to these frequency-domain features, environmental sounds have other important features in the time domain at various time scales. In our previous paper, we proposed the enhanced multiscale fractal dimension signature (EMFD) for environmental sounds. This paper extends EMFD by using the kernel density estimation method (EMFD-KDE), which results in increased stability and robustness against small fluctuations in the parameters of sound sources. Furthermore, it newly proposes another acoustic feature signature based on MFD, namely the very-long-range multiscale fractal dimension signature (MFD-VL). The MFD-VL signature describes several features of the time-varying envelope over long periods of time. The descriptiveness of EMFD-KDE and MFD-VL is evaluated through experiments on the similarity search of environmental sounds. We define a similarity index to evaluate the performance of the similarity search. Our evaluation shows that EMFD-KDE and MFD-VL improve the similarity index by 17.2%.
key words: environmental sound analysis, fractals, content-based retrieval, feature extraction
1. Introduction
Acoustic feature extraction is a basic audio signal processing issue. Acoustic features are important and necessary for various contexts and applications related to environmental sound recognition (ESR), such as large-scale content-based retrieval, auditory scene analysis, visualization, and event detection for surveillance. During the last decade, handy digital sound recorders have gained popularity, and at present, not only professional creators but also amateurs have started recording environmental sounds and sharing them on web services such as Freesound [1], [2] and SoundCloud [3]. These sound recordings are not only appreciated as music works, but also sampled for creating sound effects, new music works, and live performances in music genres such as ambient, drone, and electronic [4], [5]. These sound recordings are also utilized for research to analyze and understand the variety of sound environments that we live in [6].

† The author is with the Design Department, Sapporo City University, Geijutsu-no-mori 1, Minami-ku, Sapporo, Hokkaido 005-0864, Japan
†† The author is with the Graduate School of Information Science and Technology, Hokkaido University, Kita 14, Nishi 9, Kita-ku, Sapporo, Hokkaido 060-0814, Japan
a) E-mail: [email protected]
b) E-mail: [email protected]
DOI: 10.1587/transinf.E0.D.1

1.1 Applications Using Acoustic Features for Environmental Sounds

In recent years, the research on ESR for understanding a scene and its context has received considerable attention [7]. The workshop challenges on
Detection and Classification of Acoustic Scenes and Events (DCASE) have demonstrated performance evaluations of systems for the detection and classification of sound events [8]. With the best result in Task 1B of DCASE2020, Koutini et al. evaluated their Receptive Field (RF) regularized CNN model with some parameter reduction methods [9].

Classification is a basic application that uses acoustic features. In 2003, Cowling and Sitte [10] presented a comprehensive comparative study of classification techniques that use various acoustic features for environmental sounds. They reported that the test patterns using each of the Mel-Frequency Cepstral Coefficients (MFCCs) and the continuous wavelet transform achieved the best recognition performance. In 2009 and 2012, Chu et al. [11] and Mogi et al. [12] reported that recognition systems that use the Matching-Pursuit-based acoustic feature as a time-domain feature show better classification performance than systems that use only the popular MFCCs as a frequency-domain feature. In 2013, Bauge et al. [13] proposed a new acoustic feature for environmental sounds based on the scattering transform. This feature is robust against frequency transposition.

Content-based retrieval is another basic application that uses acoustic features. Web-based sound archives such as Freesound and SoundCloud are becoming popular, and the amount of their sound content is increasing. The online users who utilize these sound archives can share and browse sound content by means of content-based retrieval. In 2008, Xue et al. [14] proposed a similarity search system, which employs a cluster-based indexing approach for environmental sounds. In 2010, Roma et al. [15] proposed a method for the retrieval of environmental sounds using the general sound-events taxonomy defined based on the principles of ecological acoustics. Chechik et al.
[16] compared the scalability of several classification methods using MFCCs for large-scale content-based sound retrieval. In 2013, Sunouchi and Tanaka [17] proposed a new acoustic feature signature, namely, the enhanced multiscale fractal dimension signature (EMFD), and demonstrated the effectiveness of EMFD for the content-based similarity search of environmental sounds.

In recent years, the workshop challenges on DCASE have focused on improving machine learning methods for ESR and produced high-performance results for their tasks. Acoustic features are still essential as input data for the machine learning methods for ESR. Hence, finding new acoustic features that can properly describe the features of environmental sounds is fundamental to improving the performance of these ESR applications. In addition, by studying how acoustic features can describe the features of environmental sounds and affect the performance on ESR tasks, we can develop an understanding of how we listen to environmental sounds.

1.2 Acoustic Feature Extraction for ESR

Environmental sounds are produced by action and movement. We can identify things by listening to their acoustic properties, which are the results of the sound production process. However, environmental sound signals of the same type cannot be physically identical to each other due to the differences in their production processes. Furthermore, the different sound signals generated by simultaneous events are mixed with each other, which obscures the properties of each sound source [18]. Various acoustic features have been proposed for content-based audio retrieval. Feature selection is an important process for ESR [19], [20]. Cepstral features that include MFCCs and their first and second derivatives (MFCCs Δ and MFCCs ΔΔ) are widely used as frequency-domain acoustic features.
The MP-based acoustic feature has been proposed as a useful time-domain feature for ESR [11], [12], [21].

Recent research has focused on the evaluation of time-domain features of environmental sounds. For ESR, we need acoustic features that describe the non-stationary characteristics of target sounds as time-domain features and are robust against the diversity of environmental sounds [7]. We recognize three main causes of the diversity of environmental sounds.

D1) Small fluctuations of sound source parameters, such as carrier signal frequency, due to the individuality of the sound source.
D2) Background noises that the person who recorded the target sound did not expect to record.
D3) Mixed composition of different types of sound sources.

For the third cause, D3, it is necessary to apply, for example, independent component analysis or non-negative matrix factorization to the sound signal before the feature extraction process [12], [22]. In this study, we focus on the extraction of new acoustic feature signatures that are robust against the diversity caused by D1 and D2.

1.3 Problems of the EMFD Signature and Their Solutions

In our previous work [17], we proposed an EMFD signature that can describe both the frequency-domain features and time-domain features of target sounds. The EMFD signature is a feature vector, which consists of time-varying multiscale fractal dimension (MFD) values. We demonstrated that EMFD improves the performance of similarity search by supplementing MFCCs. Unfortunately, it was found that EMFD includes error values that depend on the number of analysis windows and the histogram's bin size used for computing its histogram. Furthermore, EMFD seems to be oversensitive in discriminating the features of environmental sounds, and may lack robustness against the diversity of environmental sounds.

In this study, we extend the EMFD signature by improving the process of computing its histogram using the kernel density estimation method.
By optimizing the bandwidth parameter used for kernel density estimation, the histogram of the enhanced multiscale fractal dimension using kernel density estimation signature (EMFD-KDE) becomes sufficiently smooth and robust against the diversity of environmental sounds as an acoustic feature signature. In Sect. 2, we present the basic theory and characteristics of EMFD. In Sect. 3, we propose a method to compute the EMFD-KDE signature. In Sect. 5, we demonstrate that EMFD-KDE improves the performance of the similarity search system.

Furthermore, we enhance the idea of EMFD and propose a new acoustic feature signature, namely the very-long-range multiscale fractal dimension signature (MFD-VL). Environmental sounds have important acoustic features over long time periods. However, EMFD cannot describe time-domain features for time periods longer than 10 ms. In Sect. 4, we propose a method to compute the MFD-VL signature. In addition, we demonstrate that MFD-VL can describe the features of the time-varying envelope over long periods of time, and that it is robust against noises and frequency fluctuations.

In Sect. 6, we conclude that the proposed feature signatures EMFD-KDE and MFD-VL solve the problems of EMFD and are effective in supplementing MFCCs in the similarity search of environmental sounds.
2. Basic Theory of the Enhanced Multiscale Fractal Dimension Signature
Mandelbrot, who advocated the concept of the fractal for the first time in 1975, demonstrated that some structures in nature can be modeled well by the theory of fractals [23]. One of the most important characteristics of fractals is that they have self-similarity properties at multiple scales. In the field of acoustics, Voss and Clarke analyzed the power spectrum of fluctuating physical variables, including frequency, loudness, and pitch, in music and speech [24]. They observed $1/f^{\gamma}$ behavior in the power spectrum of each variable against the frequency of a signal passed through a low-pass filter having a range of 0 Hz to 1 Hz. Hsu [25] compared the fractal geometry of classical music works, and found that there is a relation, defined by the theory of fractals, between the interval of successive notes and their frequency of occurrence.

2.1 Multiscale Fractal Dimension

A fractal dimension is an index value that can describe the characteristics of a fractal by quantifying its complexity as a ratio of the change in detail to the change in scale. Acoustic features based on the fractal dimension have been proposed and utilized for various practical applications in fields such as acoustics, music analysis, image analysis, physics, physiology, and neuroscience. Maragos et al. [26], [27] proposed the short-time fractal dimension of speech signals as an acoustic feature and used it for speech segmentation and sound classification. Zlatintsi and Maragos [28], [29] proposed a multiscale fractal dimension (MFD) profile as a short-time descriptor and found that this descriptor can discriminate several aspects among different musical instruments.

2.2 Steps to Compute the EMFD Signature

In our previous work [17], we developed EMFD as a feature signature of environmental sounds for a similarity search system.
The EMFD is computed as follows.

2.2.1 Preprocessing Target Sounds

The maximum amplitude of each target sound to be analyzed is first normalized to −0.1 dB. The sounds are converted to a standard format with the following specifications: a sampling rate of 44.1 kHz and a bit depth of 16 bits.

2.2.2 Computing the Area of the Minkowski Sausage

The fractal dimension of a sound signal can be computed based on the Minkowski-Bouligand dimension. A covering area can be drawn by moving a unit disk of radius $r$ along the curve of the waveform. This covering area is called a Minkowski sausage. The center of the unit disk is placed at every position on the curve of the waveform, and the width of the Minkowski sausage becomes $2r$. Figure 1 shows the Minkowski sausage obtained by moving the unit disk along the waveform. To compute the area of the Minkowski sausage of a discrete sound signal, the unit disk vector $C(r)$ is defined as Eq. (1), where $r$ denotes the radius of the unit disk and $i$ denotes the discrete position on the horizontal axis. Figure 2 shows how the model of the unit disk is built. The vertical distance from the center to the top of the unit disk at each horizontal position is given by the unit disk vector $C(r)$. Let $n$ be the sampling position, $r$ the radius of the unit disk, $p$ the discrete position within the unit disk, and $\mathrm{sig}(x)$ the amplitude of the sound signal at sampling position $x$. The area of the Minkowski sausage $\mathrm{area}(n, r)$ at each sampling position $n$ is computed as Eq. (2).

Fig. 1
A sound waveform and a Minkowski sausage

Examples of the mesh-approximated unit disk: $C(1) = \{0, 1, 0\}$, $C(2) = \{0, 1, 2, 1, 0\}$, $C(5) = \{0, 3, 4, 4, 4, 5, 4, 4, 4, 3, 0\}$.

Fig. 2
Mesh approximation of a unit disk

$$C(r) = \left\{ \mathrm{floor}\left(\sqrt{r^2 - i^2}\right) \;\middle|\; -r \le i \le r,\ i \in \mathbb{Z} \right\} \tag{1}$$

$$\mathrm{area}(n, r) = \max_{\substack{0 \le p \le 2r \\ p \in \mathbb{Z}}} \left( \mathrm{sig}(n - r + p) + \mathrm{floor}\left(\sqrt{2rp - p^2}\right) \right) - \min_{\substack{0 \le p \le 2r \\ p \in \mathbb{Z}}} \left( \mathrm{sig}(n - r + p) - \mathrm{floor}\left(\sqrt{2rp - p^2}\right) \right) \tag{2}$$

2.2.3 Definition of the Multiscale Fractal Dimension

The MFD values are computed for each analysis window whose period is 50 ms. Let $A(r)$ be the area of the Minkowski sausage drawn by the unit disk of radius $r$ in each analysis window. The MFD of each analysis window is defined by Eq. (3). The minimum radius ($r = 1$) corresponds to the sampling period of the signal (1/44.1 ms), and the range of $r$ from 1 to 132 corresponds to the range of time scales from 1/44.1 ms to 3 ms.

$$MFD = \left\{ 2 - \frac{\log\left(A(r+1)/A(r)\right)}{\log\left((r+1)/r\right)} \;\middle|\; 1 \le r \le 132,\ r \in \mathbb{Z} \right\} \tag{3}$$

2.2.4 Definition of the EMFD Signature

In our previous work [17], we found that MFD has informative values for unit disks larger than the disk with a radius
of 3 ms ($r = 132$). The maximum radius of the unit disk was extended to 218, which corresponds to 5 ms (1/10 of the period of the analysis window), and the discrete values of the unit disk radius were modified to grow exponentially. The enhanced MFD value at the $x$-th discrete radius of the unit disk is defined as Eq. (4). The enhanced MFD values are computed for each analysis window. The EMFD signature is then defined as the two-dimensional histogram ($32 \times 16$) of the time-varying enhanced MFD. Let $period$ be the period (ms) of the analysis window and $sound$ be the target sound. The set of analysis windows of the target sound is defined as Eq. (5). Let $rbin$ be the bins that correspond to the series of 16 numbers used to define the different radii of the unit disk for computing the enhanced MFD, and $dbin$ be the bins that correspond to the series of 32 small intervals into which the range of the fractal dimension is divided. The set of analysis windows whose enhanced MFD values fall into the bin $(dbin, rbin)$ is defined by Eq. (6). The value of each bin of the EMFD histogram is defined by Eq. (7). Figure 3 shows the histogram that visualizes the EMFD signature of a cuckoo sound $S_{cuckoo}$.

$$MFD_{enhanced}(x) = 2 - \frac{\log\left(A(r(x+1))/A(r(x))\right)}{\log\left(r(x+1)/r(x)\right)}, \quad \text{where } r(x) = \mathrm{round}(1.4^{x}) \tag{4}$$

$$AW(sound, period) = \left\{ 0,\ period,\ 2 \cdot period,\ \ldots,\ \left(\mathrm{floor}\left(\frac{\text{the length of } sound}{period}\right) - 1\right) \times period \right\} \tag{5}$$

$$FAW(dbin, rbin) = \left\{ t \;\middle|\; t \in AW(sound, 50),\ 1 + \frac{dbin - 1}{16} \le MFD_{enhanced}(rbin) \text{ of analysis window } t < 1 + \frac{dbin}{16} \right\} \tag{6}$$

$$EMFD(sound) = \left\{ \frac{\mathrm{card}(FAW(dbin, rbin))}{\mathrm{card}(AW(sound, 50))} \;\middle|\; 1 \le dbin \le 32,\ dbin \in \mathbb{Z},\ 1 \le rbin \le 16,\ rbin \in \mathbb{Z} \right\} \tag{7}$$
The length of $S_{cuckoo}$ is 21.08 s.

2.3 Known Characteristics of the EMFD Signature

Zlatintsi and Maragos [29] concluded that MFD profiles are useful for quantifying the multiscale complexity and fragmentation of the different states of instrument sound waveforms. In our previous work [17], we confirmed that EMFD describes the frequency-domain features and several other effective features of environmental sounds that MFCCs cannot describe. Furthermore, we confirmed that the EMFD signature is robust against changes in volume levels and phase shifting of sound signals in the analysis window. In the next subsection, we show another characteristic of EMFD through quantitative analysis of simulated sound signals.
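As a concrete illustration, the per-window computation of Subsect. 2.2 can be sketched in Python. This is a minimal sketch under our own assumptions: the function names are ours, the area of Eq. (2) is summed over the sampling positions of the window to obtain $A(r)$, and the demo parameters are illustrative, not the authors' implementation.

```python
import numpy as np

def minkowski_area(sig, r):
    """Sum of the Minkowski sausage areas of Eq. (2) over all positions n of
    a window, using the mesh-approximated unit disk of Eq. (1).

    Aggregating area(n, r) over n into one A(r) per window is our assumption.
    """
    # Unit disk vector C(r): vertical half-extent of the disk at offset i.
    i = np.arange(-r, r + 1)
    c = np.floor(np.sqrt(r * r - i * i))
    area = 0.0
    for center in range(r, len(sig) - r):
        window = sig[center - r:center + r + 1]
        upper = np.max(window + c)  # top edge of the sausage at this n
        lower = np.min(window - c)  # bottom edge of the sausage at this n
        area += upper - lower
    return area

def enhanced_mfd(window, x):
    """Enhanced MFD value of Eq. (4) with r(x) = round(1.4**x)."""
    r1, r2 = round(1.4 ** x), round(1.4 ** (x + 1))
    a1, a2 = minkowski_area(window, r1), minkowski_area(window, r2)
    return 2.0 - np.log(a2 / a1) / np.log(r2 / r1)

# Demo: one 50 ms analysis window of a 440 Hz tone at 44.1 kHz,
# with amplitudes in 16-bit sample units as in Subsect. 2.2.1.
t = np.arange(2205) / 44100.0
w = 16384.0 * np.sin(2 * np.pi * 440 * t)
d = enhanced_mfd(w, 4)
print(round(d, 2))  # a dimension-like value for a smooth curve, near 1
```

For a smooth sinusoid the sausage area grows roughly linearly in the radius, so the value stays close to 1; rougher, more fragmented waveforms push it toward 2.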
Fig. 3
Visualization of the EMFD signature of a cuckoo sound $S_{cuckoo}$
3. Extending EMFD by Employing a Kernel Density Estimation Method
As mentioned in Subsect. 2.2, EMFD is computed as the two-dimensional histogram of time-varying enhanced MFD values. Let
$N_{FAW}(bin)$ be the number of analysis windows whose enhanced MFD values fall into the bin, and $N_{AW}(sound)$ be the total number of analysis windows of the target sound $sound$, as defined in Eq. (8). The EMFD value of each bin, $EMFD(sound, bin)$, is computed as Eq. (9). This method has the following two problems.

$$N_{AW}(sound) = \mathrm{card}\left(AW(sound, 50)\right) \tag{8}$$

$$EMFD(sound, bin) = \frac{N_{FAW}(bin)}{N_{AW}(sound)} \tag{9}$$

The first problem is that the value of each bin necessarily includes an error, because the value can only be one of the discrete values given by the density of analysis windows. In particular, a lower number of analysis windows for the target sound increases the errors. Ideally, the EMFD histogram should be a continuous probability distribution of the time-varying enhanced MFD values, regardless of the number of analysis windows.

The second problem is that EMFD computed by the existing method is oversensitive in discriminating the features of environmental sounds. The tones and frequencies of each environmental sound may often vary, depending on the recording conditions and the individual characteristics of the sources that generate the sound, even if a person tries to record the same type of environmental sound in the same way. Therefore, a feature signature of environmental sounds should be robust against the diversity of environmental sounds caused by D1, defined in Subsect. 1.2.

To solve these problems, we introduce the kernel density estimation method to compute the EMFD histogram.

3.1 Definition of the EMFD-KDE Signature

The kernel density estimation method is employed to compute the probability distribution of the enhanced MFD values at each radius of the unit disk. The values of each bin of EMFD-KDE are defined by Eq. (4), Eq. (8), Eq. (10), Eq. (11), and Eq. (12), where $K(\cdot)$ is the kernel function, which is a Gaussian function, and $h$ in Eq.
(11) is the smoothing parameter called the bandwidth.

$$K(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}} \tag{10}$$

$$f_{emfd\text{-}kde}(dbin_{val}, rbin) = \frac{1}{N_{AW} \times h} \sum_{t=1}^{N_{AW}} K\left(\frac{dbin_{val} - MFD_{enhanced}(rbin) \text{ of analysis window } t}{h}\right) \tag{11}$$

$$EMFD\text{-}KDE = \left\{ f_{emfd\text{-}kde}\left(1 + \frac{dbin - 0.5}{16},\ rbin\right) \;\middle|\; 1 \le dbin \le 32,\ dbin \in \mathbb{Z},\ 1 \le rbin \le 16,\ rbin \in \mathbb{Z} \right\} \tag{12}$$

3.2 Optimization of the Bandwidth for Kernel Density Estimation

The bandwidth $h$ is a smoothing parameter, which is usually determined by the trade-off between the number of data samples and their standard deviation. Let $n$ be the number of data samples and $\sigma$ be the standard deviation of the data samples. The bandwidth $h$ of a Gaussian kernel density estimator is given by the normal reference rule defined by Eq. (13). The normal reference rule is the most commonly used way to determine the bandwidth [31].

$$h = \left(\frac{4\sigma^5}{3n}\right)^{1/5} \approx 1.06\,\sigma\, n^{-1/5} \tag{13}$$

We define the bandwidth $h_{rbin}(\alpha)$, which is optimized for each radius of the unit disk, as Eq. (14), Eq. (15), and Eq. (16), where $avg$ is the arithmetic mean of the enhanced MFD values of the analysis windows at $rbin$, and $\sigma_{rbin}$ is the standard deviation of the enhanced MFD values at $rbin$. The smoothing parameter $\alpha$ in Eq. (16) is a constant. Through experiments with different values of $\alpha$, we found that the best result for the similarity search is obtained for $\alpha =$
32. In this study, $h_{rbin}(32)$ is used as the bandwidth for each radius of the unit disk to compute the EMFD-KDE signature.

$$avg = \frac{1}{N_{AW}} \sum_{t=1}^{N_{AW}} MFD_{enhanced}(rbin) \text{ of analysis window } t \tag{14}$$

$$\sigma_{rbin} = \sqrt{\frac{1}{N_{AW}} \sum_{t=1}^{N_{AW}} \left( MFD_{enhanced}(rbin) \text{ of analysis window } t - avg \right)^2} \tag{15}$$

$$h_{rbin}(\alpha) = 1.06\,\sigma_{rbin}\, N_{AW}^{-1/5}\, \alpha \tag{16}$$

Figure 4 shows the 3D histogram visualizing the EMFD signature of the cuckoo sound $S_{cuckoo}$. Figure 5 shows the 3D histogram visualizing the EMFD-KDE signature of the same cuckoo sound $S_{cuckoo}$. The 3D histogram of the EMFD-KDE signature is much smoother than that of the EMFD signature. At each radius of the unit disk, a larger standard deviation of the enhanced MFD values $\sigma_{rbin}$ results in a smoother histogram.
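One column of the EMFD-KDE histogram, i.e., the smoothed density over the 32 $dbin$ centres for a single radius bin, can be sketched as follows. This is a sketch under our assumptions: the function name is ours, and the bandwidth follows Eqs. (13) and (16) with the constant $\alpha$ as a multiplicative factor.

```python
import numpy as np

def emfd_kde_column(mfd_values, alpha=32):
    """Smoothed histogram column for one rbin (Eqs. (10)-(12) and (16)).

    mfd_values: enhanced MFD values of every analysis window at this rbin.
    Returns 32 densities evaluated at the bin centres 1 + (dbin - 0.5)/16.
    """
    x = np.asarray(mfd_values, dtype=float)
    n = len(x)
    # Normal reference rule (Eq. (13)) scaled by the constant alpha (Eq. (16)).
    h = 1.06 * x.std() * n ** (-0.2) * alpha
    centres = 1.0 + (np.arange(1, 33) - 0.5) / 16.0
    u = (centres[:, None] - x[None, :]) / h          # (32, n) standardized offsets
    k = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)   # Gaussian kernel, Eq. (10)
    return k.sum(axis=1) / (n * h)                   # Eq. (11) at each centre

# Demo with alpha = 1 (the plain normal reference rule; the paper uses 32):
# 200 enhanced MFD values fluctuating around 1.5.
rng = np.random.default_rng(0)
col = emfd_kde_column(1.5 + 0.05 * rng.standard_normal(200), alpha=1)
print(int(col.argmax()))  # index of the densest dbin, near dimension 1.5
```

Larger $\alpha$ widens every kernel, which is what trades histogram detail for the robustness against the D1 fluctuations described above.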
4. Very-Long-Range Multiscale Fractal Dimension Signature
Environmental sounds have important acoustic features with varying time periods. However, EMFD cannot describe time-domain features for time periods longer than 10 ms.
Fig. 4
The 3D histogram image that visualizes the existing EMFD signature of the cuckoo sound $S_{cuckoo}$

Fig. 5
The 3D histogram image that visualizes the EMFD-KDE signature of the cuckoo sound $S_{cuckoo}$. The bandwidth is $h_{rbin}(32)$.

To solve this problem, we propose a new acoustic feature signature for environmental sounds based on the multiscale fractal dimension. This feature signature is called the very-long-range multiscale fractal dimension signature (MFD-VL). The basic idea of MFD-VL is to extend the size range of the unit figure used to compute the area of the Minkowski sausage toward much larger sizes. In this section, we define the method to compute the MFD-VL signature and demonstrate its characteristics.

4.1 Definition of the MFD-VL Signature

The multiscale fractal dimension values of MFD-VL are computed for an entire target sound, and not for each fixed-length analysis window of the target sound. A unit square, instead of a unit disk, is used to compute the area of the Minkowski sausage for MFD-VL. Figure 6 shows the Minkowski sausage obtained by moving the unit square along the waveform. The method using the unit square is much faster than the one using the unit disk. Let $n$ denote the sampling position, $r$ the half side-length of the unit square, $p$ the discrete position within the unit square, and $\mathrm{sig}(x)$ the amplitude value of the sound signal at each sampling position $x$. The area of the Minkowski sausage $area_{sq}(n, r)$ at each sampling position $n$ is computed as Eq. (17).

Fig. 6
A sound waveform and a Minkowski sausage obtained by moving the unit square

$$area_{sq}(n, r) = \max_{\substack{0 \le p \le 2r \\ p \in \mathbb{Z}}} \left( \mathrm{sig}(n - r + p) + r \right) - \min_{\substack{0 \le p \le 2r \\ p \in \mathbb{Z}}} \left( \mathrm{sig}(n - r + p) - r \right) \tag{17}$$

The half side-length $r$ of the unit square for each scale is defined as Eq. (18), where $sf$ is the sampling frequency of the target sound. In this study, $sf$ is 44100 Hz. Let $A_{sq}(r)$ denote the area of the Minkowski sausage obtained for an entire target sound by moving the unit square whose side length is $2r$. The MFD-VL signature is defined as Eq. (19). The MFD-VL signature is a feature vector that contains 10 elements.

$$r(x) = \mathrm{round}\left( sf \times 2^{-(x+1)/2} \right) \tag{18}$$

$$MFD\text{-}VL = \left\{ 2 - \frac{\log\left(A_{sq}(r(x))/A_{sq}(r(x+1))\right)}{\log\left(r(x)/r(x+1)\right)} \;\middle|\; 1 \le x \le 10,\ x \in \mathbb{Z} \right\} \tag{19}$$

4.2 Basic Characteristics of the MFD-VL Signature

We found several basic characteristics of MFD-VL through experiments using test sounds.

4.2.1 MFD-VL's Descriptiveness of the Beats of Single Sine Waves

The MFD-VL signature is expected to describe acoustic features over very long time periods. We found that MFD-VL can discriminate frequencies of amplitude envelopes between 22.6 Hz and 1 Hz. This range of envelope wavelengths corresponds to the range of the side length of the unit square between 0.044 s and 1 s. Let $f_{beat}$ denote the frequency of the beats and $f_{content}$ denote the frequency of the single sine waves inside the amplitude envelopes. The set of test sounds $SS_{t1}$ is defined as Eq. (20), Eq. (21), and Eq. (22). Each test sound is filtered by the pink noise filter function $f_{pn}$ of Eq. (20). A sound that is artificially synthesized using pure tones usually has distinct or sparse spectra. This kind of sound may cause numerical instabilities while calculating its acoustic features.
To solve this problem, the pink noise filter function $f_{pn}$ is used to add background pink noise; it is defined as Eq. (20), where $sig$ is an input signal and $Noise_{pink}$ is a background pink noise whose maximum amplitude is normalized to −0.1 dB. The signal-to-noise ratio is 24 dB. Pink noise, known as $1/f$ noise, is a signal whose power spectral density is inversely proportional to the signal frequency. The pink noise signal is known to exist widely in the natural world. The frequency components below 40 Hz contained in the pink noise are cut off by a low-cut filter before the amplitude normalization, because components with lower frequencies can be neither recorded nor played using common microphones and speakers.

$$f_{pn}(sig) = sig + Noise_{pink} \tag{20}$$

$$s_{t1}(f_{beat}, f_{content}) = f_{pn}\left( \cos(2\pi f_{beat}\, t)\, \sin(2\pi f_{content}\, t) \right) \tag{21}$$

$$SS_{t1} = \left\{ s_{t1}(f_{beat}, f_{content}) \;\middle|\; f_{content} = 440,\ f_{beat} \in \{0.5, 1, 2, 4, 8, 16, 32\} \right\} \tag{22}$$

Figure 7 shows the line charts of the MFD-VL values of a single sine wave of frequency 440 Hz, filtered by Eq. (20), and those of the test sounds $SS_{t1}$. In Fig. 7, the frequencies of the beats are indicated by the troughs of the line charts, in which the side length of the unit square is less than the wavelengths of the beats. This characteristic can be understood morphologically, as shown in Fig. 8. When the side length of the unit square is greater than the wavelength of the beat, the area of the Minkowski sausage becomes almost the same as that of a single sine wave without beats. When the side length of the unit square is less than the wavelength of the beat, a shorter side length of the unit square results in a smaller area of the Minkowski sausage.

4.2.2 MFD-VL's Descriptiveness of Amplitude Envelope Shapes

Here we analyze some other characteristics of MFD-VL that are related to the descriptiveness of amplitude envelope shapes.
Let $f_{pulse}$ be the frequency of the rectangular pulse waves and $w_{pulse}$ be the ratio of the rectangular

Fig. 7
Line charts of the MFD-VL signatures of single sine waves of frequency 440 Hz, with and without a beat. The beat frequencies are 0.5 Hz, 1 Hz, 2 Hz, 4 Hz, 8 Hz, 16 Hz, and 32 Hz.
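The beat test sounds of Eqs. (20)-(22) shown in Fig. 7 can be synthesized as, for example, the sketch below. It uses our own simplifications: pink noise is approximated by $1/\sqrt{f}$ spectral shaping of white noise, the 24 dB SNR is set from RMS power rather than peak normalization, and the 40 Hz low-cut filter of the paper is omitted.

```python
import numpy as np

def pink_noise(n, rng, sf=44100):
    """Approximate 1/f noise by shaping white noise in the frequency domain."""
    spec = np.fft.rfft(rng.standard_normal(n))
    f = np.fft.rfftfreq(n, d=1.0 / sf)
    f[0] = f[1]                                  # avoid division by zero at DC
    noise = np.fft.irfft(spec / np.sqrt(f), n)   # 1/f power spectral density
    return noise / np.abs(noise).max()

def f_pn(sig, rng, snr_db=24.0):
    """Pink noise filter of Eq. (20): add background pink noise at the given SNR."""
    noise = pink_noise(len(sig), rng)
    gain = np.sqrt((sig ** 2).mean() / (noise ** 2).mean()) * 10 ** (-snr_db / 20)
    return sig + gain * noise

def s_t1(f_beat, f_content, seconds=4.0, sf=44100, seed=0):
    """Beat test sound of Eq. (21): a sine carrier masked by a cosine envelope."""
    t = np.arange(int(seconds * sf)) / sf
    rng = np.random.default_rng(seed)
    return f_pn(np.cos(2 * np.pi * f_beat * t) * np.sin(2 * np.pi * f_content * t), rng)

x = s_t1(2.0, 440.0)
print(len(x))  # 4 s of a 440 Hz tone beating at 2 Hz: 176400 samples
```

The full set $SS_{t1}$ of Eq. (22) is then a loop over the listed $f_{beat}$ values with $f_{content} = 440$.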
Fig. 8
Minkowski sausage obtained by moving the unit square whose side length is less than the wavelength of the beat.

pulse width to the wavelength of the rectangular pulse waves. The rectangular pulse function $rect(f_{pulse}, w_{pulse}, t)$ for generating the amplitude envelopes is defined as Eq. (23). Let $f_{content}$ be the frequency of a single sine wave inside the amplitude envelopes generated by the rectangular pulse function. The test sound $s_{t2}$ is defined as Eq. (23) and Eq. (24). The test sound is filtered by the pink noise filter function $f_{pn}$ of Eq. (20).

$$rect(f_{pulse}, w_{pulse}, t) = \begin{cases} 1 & \left( t \bmod \dfrac{1}{f_{pulse}} \le \dfrac{w_{pulse}}{f_{pulse}} \right) \\ 0 & (\text{otherwise}) \end{cases} \tag{23}$$

$$s_{t2}(f_{pulse}, w_{pulse}, f_{content}) = f_{pn}\left( rect(f_{pulse}, w_{pulse}, t)\, \sin(2\pi f_{content}\, t) \right) \tag{24}$$

We define the set of test sounds $SS_{t2}$ as Eq. (25). The line charts of MFD-VL of the single sine wave of frequency 440 Hz, filtered by Eq. (20), and of $SS_{t2}$ are shown in Fig. 9. Here, we compare the line charts of MFD-VL for $s_{t1}$ and those for $s_{t2}$. This comparison shows that the bottom of the line chart trough of $s_{t2}$ for $f_{pulse} =$

Fig. 9
Line charts of the MFD-VL signatures for single sine waves having a frequency of 440 Hz: those masked by the cosine function and those masked by the rectangular pulse function.

2 is deeper than that of $s_{t1}$ for $f_{beat} =$
2, and that the bottom of the line chart trough of $s_{t2}$ for $f_{pulse} = 4$ is deeper than that of $s_{t1}$ for $f_{beat} = 4$.

$$SS_{t2} = \left\{ s_{t1}(f_{beat}, f_{content}) \;\middle|\; f_{content} = 440,\ f_{beat} \in \{2, 4\} \right\} \cup \left\{ s_{t2}(f_{pulse}, w_{pulse}, f_{content}) \;\middle|\; f_{content} = 440,\ f_{pulse} \in \{2, 4\},\ w_{pulse} = 0.5 \right\} \tag{25}$$

For another comparison, we define a set of test sounds $SS_{t3}$ as Eq. (26). The set of test sounds $SS_{t3}$ contains single sine waves having a frequency of 440 Hz masked by the rectangular pulse functions with various widths $w_{pulse}$. In Fig. 10, we compare the line charts of MFD-VL of the single sine wave having a frequency of 440 Hz, of $s_{t1}(f_{beat} = 4, f_{content} = 440)$, and of $SS_{t3}$. This comparison shows that a narrower width of the amplitude envelopes made by the rectangular pulse function results in deeper troughs in the line chart.

$$SS_{t3} = \left\{ s_{t2}(f_{pulse}, w_{pulse}, f_{content}) \;\middle|\; f_{content} = 440,\ f_{pulse} = 4,\ w_{pulse} \in \{0.2, 0.5, 0.8\} \right\} \tag{26}$$
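The MFD-VL computation of Eqs. (17)-(19) can be sketched as follows. This is a sketch under our assumptions: areas are averaged per valid sampling position (rather than an unspecified aggregation) to reduce boundary bias on short demo signals, and the demo runs at a reduced sampling frequency instead of the paper's 44100 Hz to keep it fast.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def area_sq(sig, r):
    """Minkowski sausage area for a unit square of half side-length r (Eq. (17)),
    averaged over all valid sampling positions (our normalization choice)."""
    w = sliding_window_view(sig, 2 * r + 1)
    return float(np.mean(w.max(axis=-1) - w.min(axis=-1)) + 2 * r)

def mfd_vl(sig, sf):
    """MFD-VL signature (Eqs. (18)-(19)): 10 values from the half side-lengths
    r(x) = round(sf * 2**(-(x + 1) / 2)), x = 1..11."""
    radii = [round(sf * 2 ** (-(x + 1) / 2)) for x in range(1, 12)]
    areas = [area_sq(sig, r) for r in radii]
    return [2.0 - np.log(areas[i] / areas[i + 1]) / np.log(radii[i] / radii[i + 1])
            for i in range(10)]

# Demo: 2 s of a 110 Hz tone beating at 2 Hz, in 16-bit sample units
# (cf. Subsect. 2.2.1), at a reduced sampling frequency of 4410 Hz.
sf = 4410
t = np.arange(2 * sf) / sf
sig = 16384.0 * np.cos(2 * np.pi * 2 * t) * np.sin(2 * np.pi * 110 * t)
d = mfd_vl(sig, sf)
print(len(d))  # 10 elements, one per pair of adjacent square sizes
```

As in Fig. 7, the elements at square sizes below the beat wavelength form a trough relative to those at larger sizes, which is the behavior MFD-VL exploits.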
5. Experimental Evaluation

Fig. 10
Line charts of the MFD-VL signatures of single sine waves having a frequency of 440 Hz: those masked by the cosine function and those masked by the rectangular pulse function having various widths.

5.1 Dataset

Freesound is a web-based sound archive that allows online users to share their sounds and describe metadata regarding their shared sounds on the web. Each sound is labeled with a group of tags that are relatively well maintained as user-generated content [1]. The tags represent the objects the users were listening to in their listening experiences. Based on the following rules defined by the authors, sounds were chosen and imported into our dataset to be used in the similarity search system. The sounds that were tagged with field-recording and had lengths between 1 and 600 s were chosen. We imported the top 3000 sounds in descending order of the number of downloads by unspecified users, as counted by the Freesound system for each sound. Each sound was converted to a uniform format (1 channel, 44100 Hz sampling frequency, 16-bit depth, and maximum amplitude normalized to −0.1 dB) for normalization before extracting the acoustic features, including EMFD, EMFD-KDE, MFD-VL, and MFCCs. The average length of the imported sounds is 70.4 s.

5.2 Feature Signature Extraction

One of the most well-known acoustic features used for ESR is MFCCs. Here, we used MFCCs to compare their descriptiveness with that of our newly proposed feature signatures. The SPTK toolkit [30] was used to compute the 13-coefficient MFCCs (MFCC13) and MFCC39, which appends the first- and second-order derivatives to MFCC13. MFCC13 and MFCC39 were computed using a fixed-width analysis window of length 50 ms. The feature sets of MFCC13 and MFCC39 consist of the mean values of their coefficients over the analysis windows. EMFD and EMFD-KDE consist of the 512 elements defined in Sects. 2 and 3, and MFD-VL consists of the 10 elements defined in Sect. 4.

Table 1 lists the different feature sets to be compared through the experimental evaluation.
L1 represents the total number of features in the concatenated feature sets, and L2 represents the number of features after dimensionality reduction through principal component analysis (PCA). To achieve the best possible performance of the similarity search using the k-NN method, PCA was applied to the feature vectors of the 600 most frequently downloaded sounds in the dataset to extract the eigenvectors used for dimensionality reduction. The "prcomp" function of the R language was used for PCA processing. The L2 values of feature sets 1, 3, 6, and 8 were determined so that each of their cumulative contribution ratios was 99%. The L2 values of feature sets 2, 4, 5, 7, 9, and 10 were fixed to 114.

Table 1  List of acoustic feature sets for the comparison of their descriptiveness.

Feature sets 1 and 6 are the baselines against which the other feature sets are compared. Each of feature sets 2 to 5 consists of MFCC13 and feature signatures based on the multiscale fractal dimension; we defined feature sets 1 to 5 as group FS1. Similarly, each of feature sets 7 to 10 consists of MFCC39 and feature signatures based on the multiscale fractal dimension; we defined feature sets 6 to 10 as group FS2.

The suffix "(×γ)" of a feature vector denotes a weighting coefficient γ. Each value of the feature vector is multiplied by γ when the feature vector is combined with other feature(s). Through the experimental evaluation, the weighting coefficient γ of each feature vector was chosen so as to give the best result.

5.3 Evaluation Method

Given a search-key sound, the similarity search system using the k-NN method returns a list of environmental sounds ranked by distance in the space of the selected feature set. To evaluate the performance of each feature set in Table 1, we defined the similarity index SI between the tag group of the search-key sound, tags_key, and that of a retrieved sound, tags_s, as Eq. (27). This index is the Jaccard similarity coefficient, which measures the similarity between finite sets.

SI = card(tags_key ∩ tags_s) / card(tags_key ∪ tags_s)    (27)

To improve the accuracy of the similarity index SI, we removed the commoner morphological and inflexional endings from all tags in advance by using the Porter stemmer [32].
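The PCA reduction used for the L2 feature sets, keeping the smallest number of principal components whose cumulative contribution ratio reaches 99%, can be sketched as below. This is a generic NumPy version of what the paper does with R's prcomp; the function name and the synthetic data are illustrative.

```python
import numpy as np

def pca_reduce(X: np.ndarray, target_ratio: float = 0.99):
    """Project the rows of X onto the smallest number of principal
    components whose cumulative contribution ratio reaches target_ratio."""
    Xc = X - X.mean(axis=0)                        # center each feature
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    var = s ** 2                                   # proportional to component variances
    cum = np.cumsum(var) / var.sum()               # cumulative contribution ratio
    k = int(np.searchsorted(cum, target_ratio)) + 1
    return Xc @ Vt[:k].T, Vt[:k]                   # reduced vectors, eigenvectors

# Illustrative stand-in: 600 most-downloaded sounds with 200 raw features each.
rng = np.random.default_rng(1)
X = rng.normal(size=(600, 200)) @ rng.normal(size=(200, 200))
Z, components = pca_reduce(X)
print(Z.shape[1] < X.shape[1])
```

New sounds would be projected with the same `components` matrix, matching the paper's use of eigenvectors extracted from the 600-sound subset.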
Furthermore, predefined stop words, including sound formats such as "mp3" and "stereo" and tool makers such as "sony" and "tascam," were removed from the tag groups before computing the SIs. Tags containing the text "fieldrecord" were also removed from the tag groups, because all sounds in the dataset have them. For each of the 3000 sounds in the dataset, the SIs between the search-key sound and each retrieved sound in the search-result list were computed. We compared the average values of the SIs over the 3000 sounds for each acoustic feature set.

5.4 Evaluation Results

Figure 11 shows the SIs of "top n" for each feature set. The SI of "top n" is the average of the SIs between a search-key sound and each retrieved sound within the top n ranks of the search-result list; Fig. 11 plots the averages of these values over the 3000 sounds. For reference, the average SI between two randomly chosen sounds in the dataset is 0.014.

Comparing feature sets 4 and 9 with 2 and 7, respectively, confirms that the descriptiveness of EMFD-KDE is superior to that of EMFD: the SI of "top 1" of feature set 4 was 8.1% higher than that of feature set 2, and the SI of "top 1" of feature set 9 was 4.9% higher than that of feature set 7.

Comparing feature sets 3 and 8 with 1 and 6, respectively, confirms that the newly developed MFD-VL signature improves the similarity search results; the MFD-VL signature can describe acoustic features that MFCCs cannot. The SI of "top 1" of feature set 3 was 8.2% higher than that of feature set 1, and the SI of "top 1" of feature set 8 was 1.6% higher than that of feature set 6.

Feature sets 5 and 10 achieved the best similarity search performance in groups FS1 and FS2, respectively. The SI of "top 1" of feature set 5 was 17.2% higher than that of feature set 1, and the SI of "top 1" of feature set 10 was 8.7% higher than that of feature set 6.
It was thus confirmed that EMFD-KDE and MFD-VL are effective acoustic feature signatures for environmental sounds.
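The evaluation pipeline of Sect. 5.3 can be sketched end to end: retrieve the n nearest library sounds to a search-key sound and score each hit with the Jaccard-type SI of Eq. (27) over preprocessed tag groups. This is a minimal sketch under stated assumptions: the stop-word list contains only the paper's examples, the crude suffix rule merely stands in for the full Porter stemmer [32], and the tiny library and its tags are invented for illustration.

```python
import numpy as np

STOP_WORDS = {"mp3", "stereo", "sony", "tascam"}  # the paper's examples only

def stem(tag: str) -> str:
    """Crude stand-in for the Porter stemmer: strip a few common endings."""
    for suffix in ("ing", "ed", "es", "s"):
        if tag.endswith(suffix) and len(tag) > len(suffix) + 2:
            return tag[: -len(suffix)]
    return tag

def preprocess(tags):
    """Drop stop words and 'fieldrecord*' tags, then stem the rest."""
    return {stem(t) for t in tags
            if t not in STOP_WORDS and "fieldrecord" not in t}

def similarity_index(tags_key, tags_s) -> float:
    """Eq. (27): Jaccard coefficient between two preprocessed tag groups."""
    a, b = preprocess(tags_key), preprocess(tags_s)
    return len(a & b) / len(a | b) if a | b else 0.0

def top_n_si(key_vec, key_tags, library, n) -> float:
    """Average SI between the search key and its n nearest library sounds."""
    ranked = sorted(library, key=lambda e: np.linalg.norm(e[0] - key_vec))
    return float(np.mean([similarity_index(key_tags, tags)
                          for _, tags in ranked[:n]]))

# Toy library of (feature vector, tag group) pairs.
library = [
    (np.array([0.1, 0.2]), {"rain", "street", "fieldrecording"}),
    (np.array([0.9, 0.8]), {"birds", "forest", "fieldrecording"}),
    (np.array([0.2, 0.1]), {"rain", "cars", "sony"}),
]
print(top_n_si(np.array([0.0, 0.0]), {"rain", "city"}, library, n=2))
```

In the actual evaluation, the feature vectors would be the PCA-reduced signatures of Table 1 and the SIs would be averaged over all 3000 search keys.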
6. Conclusion
Recent research on ESR has focused on the evaluation of time-domain features of environmental sounds. For ESR, an acoustic feature must describe the nonstationary characteristics of target sounds as time-domain features and must be robust against the following three main causes of the diversity of environmental sounds.

D1) Small fluctuations of sound-source parameters, such as carrier signal frequency, due to the individuality of the sound source.
D2) Background noises that the person who recorded the target sound did not expect to record.
D3) Mixed composition of different types of sound sources.

Fig. 11
Evaluation results of each feature set quantified using SIs.

In this study, we have focused on extracting acoustic feature signatures that are robust against the diversities caused by D1 and D2.

In a previous study, we proposed the EMFD feature signature to describe both frequency- and time-domain features of target sounds. However, we also recognized the following problems in EMFD.

p1) EMFD includes error values.
p2) It is oversensitive when discriminating the features of environmental sounds.
p3) It lacks robustness against the diversity of environmental sounds.
p4) It cannot describe time-domain features for periods longer than 10 ms.

To solve these problems, we proposed the EMFD-KDE and MFD-VL feature signatures.

The newly proposed EMFD-KDE feature signature is the probability distribution of the enhanced MFD values at each radius of the unit disk, computed using the kernel density estimation method. In Subsect. 3.2, we studied how to optimize the bandwidth h used for kernel density estimation. Based on the normal reference rule, we defined the bandwidth h_rbin(α), optimized for each radius of the unit disk, by Eq. (16). Through experiments with different values of α, we determined that the best similarity search results are obtained for α = 32. In addition, we showed that the 3D histogram of the EMFD-KDE signature computed with the optimized bandwidth h_rbin(32) is much smoother than that of the EMFD signature.

In Sect. 4, we proposed the MFD-VL signature and demonstrated its characteristics through experiments using simulated cricket sounds, as follows.
• MFD-VL can discriminate the frequencies of amplitude envelopes between 1 and 22.6 Hz.
• It can discriminate the shapes of amplitude envelopes.
The MFD-VL signature is therefore expected to describe time-domain features for periods longer than 10 ms.

From the experimental evaluation results, we confirmed that the descriptiveness of EMFD-KDE supplementing MFCC13 and MFCC39 is evidently higher than that of EMFD supplementing MFCC13 and MFCC39. We conclude that the smoothness of the EMFD-KDE signature solves problems p1, p2, and p3. Furthermore, the experimental evaluation showed that the MFD-VL signature supplementing EMFD-KDE and MFCCs improves the performance of the similarity search. The MFD-VL signature functions as an effective time-domain feature and solves problems p3 and p4.

The EMFD-KDE signature has 512 feature elements, many more than the other feature signatures, which implies that it requires more computational time for feature extraction than the conventional methods. For the similarity search task, however, the feature signatures of library sounds can be computed a priori, so the feature extraction time of EMFD-KDE is not a concern. Although EMFD-KDE also requires more search time than the conventional methods, this overhead is negligible compared with the time required to retrieve matched sounds from the library. Neither the computational time of EMFD-KDE nor its search time is therefore a practical problem.

Environmental sounds have acoustic features in the frequency domain, as well as other important features in the time domain at various time scales.
We conclude that both the EMFD-KDE and MFD-VL signatures can describe the essential acoustic features of environmental sounds while remaining robust against their diversity. In further research, we must evaluate the performance of these signatures in other applications, such as classification tasks, using publicly available datasets.
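As an illustration of the kernel density estimation step underlying EMFD-KDE, the sketch below uses a Gaussian kernel with the generic normal-reference-rule bandwidth of Scott [31]. The paper's per-radius bandwidth h_rbin(α) of Eq. (16) is a refinement that this sketch does not reproduce, and the simulated MFD values are invented for illustration.

```python
import numpy as np

def normal_reference_bandwidth(x: np.ndarray) -> float:
    """Generic normal reference rule: h = 1.06 * sigma * n^(-1/5) [31]."""
    return 1.06 * np.std(x, ddof=1) * len(x) ** (-0.2)

def gaussian_kde(x: np.ndarray, grid: np.ndarray, h: float) -> np.ndarray:
    """Evaluate the Gaussian-kernel density estimate of samples x on grid."""
    z = (grid[:, None] - x[None, :]) / h
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (len(x) * h * np.sqrt(2 * np.pi))

# Smooth a cloud of simulated fractal-dimension values at one unit-disk radius.
rng = np.random.default_rng(2)
samples = rng.normal(loc=1.5, scale=0.1, size=400)   # MFD values near 1.5
grid = np.linspace(1.0, 2.0, 512)
density = gaussian_kde(samples, grid, normal_reference_bandwidth(samples))
dx = grid[1] - grid[0]
print(round(float(density.sum() * dx), 2))  # ~1.0: the estimate integrates to unity
```

Repeating this estimate at every radius of the unit disk would yield a smooth probability distribution per radius, which is the structural idea behind the 512-element EMFD-KDE signature.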
References

[1] V. Akkermans, F. Font, J. Funollet, B. de Jong, G. Roma, S. Togias, and X. Serra, "FREESOUND 2.0: An Improved Platform for Sharing Audio Clips," (2011).
Organised Sound, vol. 16, no. 3, pp. 206–210 (2011).
[6] T. H. Park, J. Lee, J. You, M.-J. Yoo, and J. Turner, "Towards Soundscape Information Retrieval (SIR)," Proc. ICMC|SMC|2014, Athens, Greece, pp. 1218–1225 (2014).
[7] S. Chachada and C. C. J. Kuo, "Environmental sound recognition: A survey," APSIPA Trans. Signal Inf. Process., vol. 3 (2014).
[8] D. Stowell, D. Giannoulis, E. Benetos, M. Lagrange, and M. D. Plumbley, "Detection and Classification of Acoustic Scenes and Events," IEEE Trans. Multimed., vol. 17, no. 10, pp. 1733–1746 (2015).
[9] K. Koutini, F. Henkel, H. Eghbal-zadeh, and G. Widmer, "CP-JKU Submissions to DCASE'20: Low-Complexity Cross-Device Acoustic Scene Classification with RF-Regularized CNNs," DCASE2020 Challenge, Tech. Rep. (2020).
[10] M. Cowling and R. Sitte, "Comparison of techniques for environmental sound recognition," Pattern Recognit. Lett., vol. 24, no. 15, pp. 2895–2907 (2003).
[11] S. Chu, S. Narayanan, and C.-C. Kuo, "Environmental Sound Recognition With Time-Frequency Audio Features," IEEE Trans. Audio Speech Lang. Process., vol. 17, no. 6, pp. 1142–1158 (2009).
[12] R. Mogi and H. Kasai, "Noise-Robust environmental sound classification method based on combination of ICA and MP features," Artif. Intell. Res., vol. 2, no. 1, p. 107 (2012).
[13] C. Bauge, M. Lagrange, J. Anden, and S. Mallat, "Representing environmental sounds using the separable scattering transform," ICASSP, IEEE Int. Conf. Acoust. Speech Signal Process. - Proc., pp. 8667–8671 (2013).
[14] J. Xue, G. Wichern, H. Thornburg, and A. Spanias, "Fast query by example of environmental sounds via robust and efficient cluster-based indexing," ICASSP, IEEE Int. Conf. Acoust. Speech Signal Process. - Proc., pp. 5–8 (2008).
[15] G. Roma, J. Janer, S. Kersten, M. Schirosa, P. Herrera, and X. Serra, "Ecological Acoustics Perspective for Content-Based Retrieval of Environmental Sounds," EURASIP J. Audio Speech Music Process., vol. 2010, pp. 1–11 (2010).
[16] G. Chechik, E. Ie, M. Rehn, S. Bengio, and D. Lyon, "Large-scale content-based audio retrieval from text queries," Vancouver, British Columbia, Canada, pp. 105–112 (2008).
[17] M. Sunouchi and Y. Tanaka, "Similarity Search of Freesound Environmental Sound Based on Their Enhanced Multiscale Fractal Dimension," Sound Music Comput. Conf. 2013, SMC 2013, pp. 715–721 (2013).
[18] S. Handel, "Timbre perception and auditory object identification," Hearing, Academic Press, p. 468 (1995).
[19] D. Mitrović, M. Zeppelzauer, and H. Eidenberger, "On Feature Selection in Environmental Sound Recognition," 51st Int. Symp. ELMAR, pp. 28–30 (2009).
[20] D. Mitrović, M. Zeppelzauer, and C. Breiteneder, "Features for Content-Based Audio Retrieval," Adv. Comput., vol. 78, ch. 3, pp. 71–150 (2010).
[21] S. Mallat and Z. Zhang, "Matching pursuits with time-frequency dictionaries," IEEE Trans. Signal Process., vol. 41, no. 12, pp. 3397–3415 (1993).
[22] S. Innami and H. Kasai, "NMF-based environmental sound source separation using time-variant gain features," Comput. Math. with Appl., vol. 64, no. 5, pp. 1333–1342 (2012).
[23] B. Mandelbrot, "The Fractal Geometry of Nature," W. H. Freeman and Company (1982).
[24] R. F. Voss and J. Clarke, "'1/f noise' in music and speech," Nature, vol. 258, no. 5533, pp. 317–318 (1975).
[25] K. J. Hsu and A. J. Hsu, "Fractal Geometry of Music," Proc. Natl. Acad. Sci., vol. 87, no. 3, pp. 938–941 (1990).
[26] P. Maragos, "Fractal aspects of speech signals: dimension and interpolation," ICASSP-91 - Proc., pp. 417–420 (1991).
[27] P. Maragos and A. Potamianos, "Fractal dimensions of speech sounds: computation and application to automatic speech recognition," J. Acoust. Soc. Am., vol. 105, no. 3, pp. 1925–1932 (1999).
[28] A. Zlatintsi and P. Maragos, "Musical Instruments Signal Analysis and Recognition Using Fractal Features," 19th Eur. Signal Process. Conf. (EUSIPCO 2011), Barcelona, Spain, pp. 684–688 (2011).
[29] A. Zlatintsi and P. Maragos, "Multiscale Fractal Analysis of Musical Instrument Signals With Application to Recognition," IEEE Trans. Audio Speech Lang. Process., vol. 21, no. 4, pp. 737–748 (2013).
[30] SPTK working group, "Speech Signal Processing Toolkit (SPTK)," http://sp-tk.sourceforge.net/ (Retrieved 2020-12-20).
[31] D. W. Scott, "Multivariate Density Estimation: Theory, Practice, and Visualization," 1st ed., Wiley (1992).
[32] M. Porter, "An algorithm for suffix stripping," Progr. Electron. Libr. Inf. Syst., vol. 14, no. 3, pp. 130–137 (1980).
Motohiro Sunouchi received the master of environmental studies degree in human and engineered environmental studies from the University of Tokyo in 2004. He is currently pursuing a Ph.D. in information science at Hokkaido University. He has been a research assistant since 2007 and a senior lecturer since 2016 with the Department of Design, Sapporo City University, Japan. His research interests lie in the areas of audio signal processing and auditory culture.