Acoustic Scene Analysis Using Partially Connected Microphones Based on Graph Cepstrum
Keisuke Imoto
Ritsumeikan University, Japan
Abstract—In this paper, we propose an effective and robust method for acoustic scene analysis based on spatial information extracted from partially synchronized and/or closely located distributed microphones. In the proposed method, to extract spatial information from distributed microphones while taking into account whether any pair of microphones is synchronized and/or closely located, we derive a new cepstrum feature utilizing a graph-based basis transformation. Specifically, in the proposed graph-based cepstrum, the logarithm of the amplitude in a multichannel observation is converted to a feature vector by an inverse graph Fourier transform, which can consider whether any pair of microphones is connected. Our experimental results indicate that the proposed graph-based cepstrum effectively extracts spatial information with consideration of the microphone connections. Moreover, the results show that the proposed method classifies acoustic scenes more robustly than conventional spatial features when the observed sounds have a large synchronization mismatch between partially synchronized microphone groups.
I. INTRODUCTION
Acoustic scene analysis (ASA), which analyzes the scenes in which sounds are produced, is now a very active research area in acoustics, and ASA is expected to enable many useful applications such as systems for monitoring elderly people or infants [1], [2], automatic surveillance systems [3]–[6], automatic life-logging systems [7]–[9], and advanced multimedia retrieval [10]–[13].

To analyze scenes from an acoustic signal, many approaches based on machine learning techniques have been proposed. For instance, Eronen et al. [7] and Mesaros et al. [14] have proposed methods based on spectral features such as mel-frequency cepstral coefficients (MFCCs) combined with Gaussian mixture models (GMMs). Han et al. [15] and Jallet et al. [16] have proposed methods using the mel-spectrogram as the input feature and a convolutional neural network (CNN) or recurrent convolutional neural network (RCNN) as the classifier. Guo and Li [17], Kim et al. [18], and Imoto and co-workers [8], [19] have investigated ASA utilizing intermediate feature representations based on acoustic event histograms.

ASA based on spatial information extracted from a microphone array composed of smartphones, smart speakers, and IoT devices has also been proposed [20]–[22]. Many of these methods extract spatial information from the observed time differences or sound power ratios between channels, and they therefore require that the microphones are synchronized and that the microphone locations and array geometry are known. However, since the distributed microphones in multiple smartphones, smart speakers, or IoT devices are often unsynchronized and their locations and array geometry are unknown, conventional methods cannot be applied to such distributed microphone arrays.
Fig. 1. Example of microphone connections: fully connected microphone array, distributed microphone array, and partially connected microphone array.
To extract spatial information using unsynchronized distributed microphones whose locations and array geometry are unknown, Imoto and Ono proposed the spatial cepstrum, which can be applied under these conditions [23]. In this approach, the log-amplitudes obtained by the multiple microphones are converted to a feature vector, similarly to the cepstrum, by a basis transformation based on principal component analysis (PCA).

On the other hand, the number of smartphones, smart speakers, and IoT devices that have multiple microphones has been increasing. A microphone array composed of such devices is often partially synchronized or closely located, as shown in Fig. 1; we refer to these synchronized or closely located microphones collectively as connected microphones. The time delay or sound power ratio between channels is a significant cue for extracting spatial information even when the microphones are only partially connected; however, the conventional spatial cepstrum does not consider whether some of the microphones are partially connected.

In this paper, we propose a novel spatial feature extraction method for a distributed microphone array that can take into account whether or not microphones are partially connected. To consider whether any pair of microphones is connected, we utilize a graph representation of the microphone connections, in which the power observations and microphone connections are represented by the weights of the nodes and edges, respectively. The proposed method then introduces a graph Fourier transform, which enables spatial feature extraction that considers the connections between microphones.

This paper is organized as follows. In Section II, the spatial cepstrum used in conventional spatial feature extraction for a distributed microphone array is introduced. In Section III, the proposed method of extracting a spatial feature for partially connected distributed microphones and the similarity of the proposed method to the conventional cepstrum and spatial cepstrum are discussed. In Section IV, experiments performed to evaluate the proposed method are reported. In Section V, we conclude this paper.
II. CONVENTIONAL SPATIAL FEATURE EXTRACTION FOR DISTRIBUTED MICROPHONES
To extract spatial information from unsynchronized distributed microphones whose locations and array geometry are unknown, the spatial cepstrum, a technique similar to the cepstrum feature, has been proposed [23].

Suppose that a multichannel observation is recorded by N microphones and that ā_{τ,n} denotes the power observed by microphone n at time frame τ. In the case of unsynchronized distributed microphones, synchronization over channels is still a challenging problem and phase information may be unreliable. Therefore, the spatial cepstrum utilizes only the log-amplitude vector

\[
\mathbf{q}_\tau = \left( \log \bar{a}_{\tau,1},\ \log \bar{a}_{\tau,2},\ \ldots,\ \log \bar{a}_{\tau,n},\ \ldots,\ \log \bar{a}_{\tau,N} \right)^{\mathsf{T}},
\tag{1}
\]

which is relatively robust to a synchronization mismatch. Considering that the distributed microphones may be non-uniformly located, PCA is then applied for the basis transformation of the spatial cepstrum instead of the inverse discrete Fourier transform (IDFT). Suppose that R_q is the covariance matrix of q_τ, given by

\[
\mathbf{R}_q = \frac{1}{T} \sum_{\tau} \mathbf{q}_\tau \mathbf{q}_\tau^{\mathsf{T}},
\tag{2}
\]

where T is the number of time frames. Since R_q is a symmetric matrix, its eigendecomposition can be represented as

\[
\mathbf{R}_q = \mathbf{E} \mathbf{D} \mathbf{E}^{\mathsf{T}},
\tag{3}
\]

where E and D are the eigenvector matrix and the diagonal matrix whose diagonal elements are the eigenvalues in descending order, respectively. Using this eigenvector matrix E, the spatial cepstrum is defined as

\[
\mathbf{d}_\tau = \mathbf{E}^{\mathsf{T}} \mathbf{q}_\tau.
\tag{4}
\]

The spatial cepstrum can extract spatial information without the microphone locations or the array geometry, although it requires training sounds to estimate the eigenvector matrix E by PCA. Moreover, since the spatial cepstrum does not consider whether or not the microphones are connected, the observed time differences or sound power ratios between channels cannot be utilized for spatial feature extraction.
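As a concrete illustration of Eqs. (1)–(4), the following NumPy sketch computes the spatial cepstrum from a matrix of per-frame channel powers. It is a minimal sketch under assumed array shapes, not the implementation used in [23]; the function name is ours, and we follow Eq. (2) as written, without mean-centering.

```python
import numpy as np

def spatial_cepstrum(powers):
    """Spatial cepstrum of Eqs. (1)-(4).

    powers: (T, N) array of per-frame powers observed by N
    unsynchronized microphones over T time frames.
    Returns (Dsc, E): the (T, N) spatial cepstra d_tau (as rows)
    and the PCA eigenvector matrix E.
    """
    Q = np.log(powers)                   # log-amplitude vectors q_tau, Eq. (1)
    R_q = (Q.T @ Q) / Q.shape[0]         # covariance matrix R_q, Eq. (2)
    eigvals, E = np.linalg.eigh(R_q)     # eigendecomposition, Eq. (3)
    E = E[:, np.argsort(eigvals)[::-1]]  # eigenvalues in descending order
    return Q @ E, E                      # rows are d_tau = E^T q_tau, Eq. (4)
```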
Fig. 2. Example of observations on a graph and the relationship between microphone connections and the adjacency matrix (node weights: log-power of observation; edge weights: connection weight α).
III. SPATIAL FEATURE EXTRACTION BASED ON GRAPH CEPSTRUM
A. Graph Cepstrum
We consider the situation in which a microphone array is composed of multiple generic acoustic sensors mounted on smartphones, smart speakers, or IoT devices, where some of the microphones mounted on each device are connected. To extract spatial information while considering the microphone connections, we here propose a novel spatial feature extraction method that utilizes a graph representation of the multichannel observations and microphone connections. Specifically, the proposed method performs the graph Fourier transform [24] instead of the PCA used in the spatial cepstrum. This makes it possible to take into account which pairs of microphones are connected.

Consider the logarithmic powers of the observations on the graph shown in Fig. 2, where the power observations and microphone connections are represented by the weights of the nodes and edges, respectively. Here, the N × N adjacency matrix is defined as

\[
\mathbf{A}(m, n) =
\begin{cases}
1 & (m \text{ and } n \text{ are connected}) \\
0 \text{ or } \alpha & (\text{otherwise}),
\end{cases}
\tag{5}
\]

where α is an arbitrary weight of the connection within the range of 0.0–1.0. We also define the N × N degree matrix D, a diagonal matrix whose diagonal elements are

\[
\mathbf{D}(m, m) = \sum_{n} \mathbf{A}(m, n).
\tag{6}
\]

The m-th diagonal element of the degree matrix indicates the number of microphones connected with microphone m. The unweighted graph Laplacian is then written as

\[
\mathbf{L} \triangleq \mathbf{D} - \mathbf{A},
\tag{7}
\]

where L is also a symmetric matrix since both D and A are symmetric matrices. Thus, the eigendecomposition of L can be expressed as

\[
\mathbf{L} = \mathbf{U} \boldsymbol{\Lambda} \mathbf{U}^{\mathsf{T}},
\tag{8}
\]

where U and Λ are the eigenvector matrix and the diagonal matrix whose diagonal elements λ_m are the eigenvalues in ascending order, respectively.
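The construction of Eqs. (5)–(7) can be written compactly, as in the following sketch, which builds the Laplacian from a list of connected microphone pairs. The function name and edge-list format are illustrative assumptions.

```python
import numpy as np

def graph_laplacian(n_mics, edges, alpha=0.0):
    """Graph Laplacian L = D - A of Eqs. (5)-(7).

    edges: (m, n) index pairs of connected (synchronized and/or
    closely located) microphones; alpha: the weight in [0, 1)
    optionally assigned to unconnected pairs, as in Eq. (5).
    """
    A = np.full((n_mics, n_mics), alpha)  # "0 or alpha" for unconnected pairs
    np.fill_diagonal(A, 0.0)
    for m, n in edges:
        A[m, n] = A[n, m] = 1.0           # connected pairs, Eq. (5)
    D = np.diag(A.sum(axis=1))            # degree matrix, Eq. (6)
    return D - A                          # graph Laplacian, Eq. (7)
```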
Fig. 3. Examples of the ring graph condition (left) and circularly symmetric microphone arrangements (right): equilateral triangle, square, and regular pentagon.
The eigenvector matrix U^T and its transpose U are the inverse graph Fourier transform (IGFT) matrix and the graph Fourier transform (GFT) matrix, respectively, which enable basis transformations that consider the connections between microphones.

Thus, the proposed spatial feature, which can consider the connections between microphones, is defined in terms of the IGFT of the log-amplitude vector q_τ as

\[
\mathbf{e}_\tau = \mathbf{U}^{\mathsf{T}} \mathbf{q}_\tau.
\tag{9}
\]

Because this proposed spatial feature resembles both the conventional cepstrum and the spatial cepstrum, we call it the graph cepstrum (GC).
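Given the Laplacian, the GC of Eq. (9) follows in a few lines. The sketch below relies on the fact that numpy.linalg.eigh returns eigenvalues in ascending order, matching Eq. (8); names and shapes are again assumptions for illustration.

```python
import numpy as np

def graph_cepstrum(powers, L):
    """Graph cepstrum of Eq. (9): the IGFT of the log-amplitude vectors.

    powers: (T, N) per-frame channel powers; L: N x N graph Laplacian
    (e.g., from graph_laplacian above). Returns the (T, N) graph
    cepstra with rows e_tau = U^T q_tau.
    """
    eigvals, U = np.linalg.eigh(L)  # Eq. (8); eigenvalues sorted ascending
    Q = np.log(powers)              # log-amplitude vectors q_tau, Eq. (1)
    return Q @ U                    # e_tau = U^T q_tau, Eq. (9)
```

Unlike the spatial cepstrum, no training sounds are needed here: the basis comes from the known microphone connections rather than from PCA of the observations.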
B. Graph Cepstrum on Ring Graph

Let us consider a circularly connected condition, namely, the ring graph condition shown in Fig. 3. For this condition, the graph Laplacian is represented by the circulant matrix

\[
\mathbf{L}_{\mathrm{sym}} =
\begin{pmatrix}
2 & -1 & 0 & \cdots & 0 & -1 \\
-1 & 2 & -1 & \cdots & 0 & 0 \\
0 & -1 & 2 & \ddots & \vdots & \vdots \\
\vdots & \vdots & \ddots & \ddots & -1 & 0 \\
0 & 0 & \cdots & -1 & 2 & -1 \\
-1 & 0 & \cdots & 0 & -1 & 2
\end{pmatrix}.
\tag{10}
\]

On the basis of the fact that a circulant matrix is diagonalized by the IDFT matrix Z_N [25], defined by

\[
\mathbf{Z}_N = \frac{1}{\sqrt{N}}
\begin{pmatrix}
1 & 1 & \cdots & 1 & 1 \\
1 & \zeta & \cdots & \zeta^{N-2} & \zeta^{N-1} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
1 & \zeta^{N-2} & \cdots & \zeta^{(N-2)(N-2)} & \zeta^{(N-1)(N-2)} \\
1 & \zeta^{N-1} & \cdots & \zeta^{(N-2)(N-1)} & \zeta^{(N-1)(N-1)}
\end{pmatrix},
\tag{11}
\]

\[
\zeta = e^{j 2 \pi / N},
\tag{12}
\]

the IGFT is identical to the IDFT. Thus, in the case of a ring graph, the GC is identical to the definition of the cepstrum. Moreover, it is also identical to the definition of the spatial cepstrum of circularly symmetric microphones in an isotropic sound field [23]. This means that the ring connection in the GC domain corresponds to the circularly symmetric arrangement of microphones in an isotropic sound field in the acoustic spatial condition.
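This identity is easy to verify numerically. The following sketch, with an illustrative ring of N = 8 nodes, checks that the IDFT matrix of Eqs. (11) and (12) diagonalizes the ring-graph Laplacian of Eq. (10) and that the eigenvalues are 2 − 2cos(2πk/N).

```python
import numpy as np

N = 8                                         # illustrative ring size
A = np.roll(np.eye(N), 1, axis=1) + np.roll(np.eye(N), -1, axis=1)
L_ring = 2.0 * np.eye(N) - A                  # circulant Laplacian, Eq. (10)

k = np.arange(N)
Z = np.exp(2j * np.pi / N) ** np.outer(k, k) / np.sqrt(N)  # IDFT matrix, Eqs. (11), (12)

Lam = Z.conj().T @ L_ring @ Z                 # should be diagonal
print(np.allclose(Lam - np.diag(np.diag(Lam)), 0.0))  # True: Z_N diagonalizes L_ring
print(np.allclose(np.diag(Lam).real,
                  2.0 - 2.0 * np.cos(2.0 * np.pi * k / N)))  # True: eigenvalues
```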
Fig. 4. Microphone arrangement and sound source locations. Channel indices (1–13) and group indices of synchronized microphones (I–V) are also indicated.

IV. EXPERIMENTS
A. Experimental Conditions
To evaluate the effectiveness of the proposed method for partially synchronized microphones, we conducted classification experiments on acoustic scenes in a living room. Since most of the public datasets for acoustic scene analysis, including TUT Acoustic Scenes 2017 [26] and AudioSet [27], are provided as single-channel or stereo recordings, we recorded a multichannel sound dataset with 13 synchronized microphones in a real environment. The sound dataset includes nine acoustic scenes that frequently occur in and around a living room: "vacuuming," "cooking," "dishwashing," "eating," "reading a newspaper," "operating a PC," "chatting," "watching TV," and "doing the laundry." The microphone arrangement and the locations of the sound sources are shown in Fig. 4. The recorded sounds consisted of 257.1 min of recordings, which were randomly separated into 5,180 sound clips for model training and 2,532 sound clips for classification evaluation; no sound clip contained more than one acoustic scene. To evaluate the scene classification performance under synchronization mismatch among the microphone groups, the recorded sounds for classification evaluation were misaligned with various error times among the microphone groups shown in Fig. 4. The error times were randomly sampled from a Gaussian distribution with µ = 0 and various standard deviations σ. The other recording and experimental conditions are listed in Table I.
TABLE I
EXPERIMENTAL CONDITIONS

Sampling rate               48 kHz
Quantization bit rate       16 bits
Sound clip length           8 s
Frame length / FFT points   20 ms / 2,048
Connection weight α         ×
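As a sketch of how such a synchronization mismatch can be simulated, the following function applies one Gaussian-sampled offset per synchronized group, so that channels within a group remain mutually synchronized. The circular shift and the function signature are simplifying assumptions, not necessarily the exact procedure used in the experiments.

```python
import numpy as np

def misalign(signals, groups, sigma, rng=None):
    """Shift each synchronized microphone group by a random error time.

    signals: (N, L) multichannel waveform; groups: list of channel-index
    lists (e.g., groups I-V in Fig. 4); sigma: standard deviation of the
    error time in samples.
    """
    rng = rng or np.random.default_rng()
    out = signals.copy()
    for ch in groups:
        offset = int(round(rng.normal(0.0, sigma)))      # error time ~ N(0, sigma^2)
        out[ch] = np.roll(signals[ch], offset, axis=-1)  # circular shift (simplification)
    return out
```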
B. Spatial Information Extracted by Graph Cepstrum

To clarify how the GC extracts spatial information, we show the IGFT matrix U^T in Fig. 5. The n-th row vector of U^T corresponds to the n-th eigenvector of the graph Laplacian L. The n-th-order GC is calculated using the n-th row vector of U^T as follows:

\[
e_{\tau,n} = \mathbf{u}_n^{\mathsf{T}} \mathbf{q}_\tau = \sum_{m=1}^{N} u_{n,m}\, q_{\tau,m},
\tag{13}
\]

where e_{τ,n}, u_n^T, u_{n,m}, and q_{τ,m} are the n-th-order GC, the n-th row vector of U^T, the (n, m) entry of U^T, and the m-th element of q_τ, respectively. This indicates that the n-th-order GC is obtained by a linear combination of the log-amplitudes q_{τ,m}, where u_{n,m} is the weight of the linear combination. From Fig. 5, it can be interpreted that the first-order GC represents the average sound level in the whole space because all the weights of the first eigenvector are positive. For the middle-order eigenvectors, the signs of the weights of connected microphones are similar, which indicates that the GC can capture spatial information while taking the connections of microphones into account. For the higher-order eigenvectors, the weights of only part of a connected microphone group are active and the signs of the weights differ. These eigenvectors capture spatial information of sound sources close to the microphone groups: if a sound source is far from a microphone group, its contributions to the linear combination in Eq. (13) cancel out within the group.
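The sign structure described above can be reproduced on a toy example. In the following sketch, the group layout ({0, 1, 2} and {3, 4}) and the value of α are illustrative assumptions; the small α between unconnected microphones keeps the graph connected so that the zero eigenvalue is unique.

```python
import numpy as np

# Toy array: two connected groups {0, 1, 2} and {3, 4}, with a small
# weight alpha between unconnected microphones as allowed by Eq. (5).
N, alpha = 5, 1e-3
A = np.full((N, N), alpha)
np.fill_diagonal(A, 0.0)
for m, n in [(0, 1), (0, 2), (1, 2), (3, 4)]:
    A[m, n] = A[n, m] = 1.0
L = np.diag(A.sum(axis=1)) - A                 # Eqs. (6), (7)

eigvals, U = np.linalg.eigh(L)
print(np.round(U[:, 0], 3))        # first eigenvector: constant sign, so the
                                   # first-order GC tracks the average level
print(np.sign(np.round(U.T, 6)))   # rows of U^T: connected microphones tend
                                   # to share weight signs at low orders
```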
C. Acoustic Scene Classification

Acoustic scenes were then modeled and classified for each sound clip using a Gaussian mixture model (GMM), a supervised acoustic topic model (sATM) [8], [28], and a convolutional neural network (CNN). Specifically, the GMM was applied to the acoustic feature vectors e_τ and d_τ for each acoustic scene x. The acoustic scene x of sound clip c was then estimated by calculating the product of the likelihoods over the sound clip as follows:

\[
x_c = \operatorname*{arg\,max}_{x} \prod_{\tau=1}^{T_c} p_\tau(\mathbf{f}_\tau \mid x),
\tag{14}
\]

where T_c, f_τ, and p_τ(f_τ | x) are the number of frames in sound clip c, an acoustic feature vector calculated frame by frame, such as d_τ or e_τ, and the likelihood of acoustic scene x at time frame τ, respectively. As other methods for acoustic scene classification utilizing a distributed microphone array, we also evaluated late fusion-based classification methods [29].
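A minimal sketch of this GMM-based classifier using scikit-learn's GaussianMixture is shown below; the number of mixture components, covariance type, and data layout are illustrative assumptions. The product in Eq. (14) is computed as a sum of log-likelihoods for numerical stability.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_scene_gmms(train_feats, n_components=16):
    """Fit one GMM p(f | x) per scene; train_feats maps a scene label x
    to a (T, N) array of training feature frames (e.g., GC vectors)."""
    return {x: GaussianMixture(n_components=n_components,
                               covariance_type="diag",
                               random_state=0).fit(F)
            for x, F in train_feats.items()}

def classify_clip(gmms, clip):
    """Eq. (14): pick the scene maximizing the product of per-frame
    likelihoods over the clip, i.e., the sum of log-likelihoods."""
    scores = {x: g.score_samples(clip).sum() for x, g in gmms.items()}
    return max(scores, key=scores.get)
```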
Fig. 5. IGFT matrix U^T in red-blue color map representation (channel index vs. eigenvector index, values from −1.0 to 1.0).

Fig. 6. Acoustic scene classification accuracy (average F-score, %) as a function of the synchronization error (standard deviation σ / frame length) between connected microphone groups. Compared methods: graph cepstrum + GMM; spatial cepstrum + GMM; graph cepstrum (BoW) + sATM; spatial cepstrum (BoW) + sATM; graph cepstrum (BoW) + CNN; spatial cepstrum (BoW) + CNN; and MFCC + GMM, MFCC (BoW) + GMM, and MFCC (BoW) + classifier stacking [29], each with late fusion of channel-based classification.
D. Experimental Results

The classification performance for the acoustic scenes is shown in Fig. 6. For each experimental condition, the acoustic scene modeling and classification were conducted ten times with various synchronization error times sampled randomly. These results show that when the synchronization error between microphone groups is small, both the GC and the conventional spatial cepstrum classify acoustic scenes effectively. When the synchronization error between microphone groups increases, the scene classification performance of the GC decreases only slightly, whereas the classification accuracy of the conventional methods decreases rapidly. This indicates that the proposed GC is more robust against synchronization error than the conventional methods.
V. CONCLUSION
In this paper, we proposed an effective spatial feature extraction method for acoustic scene analysis using partially synchronized or closely located distributed microphones. In the proposed method, we derived the graph cepstrum (GC), which is defined as the inverse graph Fourier transform of the logarithmic power of a multichannel observation. We then demonstrated that the GC on a ring graph is identical to the conventional cepstrum, and to the spatial cepstrum for a circularly symmetric microphone arrangement in an isotropic sound field. Our experimental results using real environmental sounds showed that the GC classifies acoustic scenes more robustly than conventional spatial features even when the synchronization mismatch between partially synchronized microphone groups is large.
ACKNOWLEDGMENTS
Part of this work was supported by the Support Center for Advanced Telecommunications Technology Research, Foundation.
REFERENCES

[1] Y. Peng, C. Lin, M. Sun, and K. Tsai, "Healthcare audio event classification using hidden Markov models and hierarchical hidden Markov models," Proc. IEEE International Conference on Multimedia and Expo (ICME), pp. 1218–1221, 2009.
[2] P. Guyot, J. Pinquier, and R. André-Obrecht, "Water sound recognition based on physical models," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 793–797, 2013.
[3] A. Harma, M. F. McKinney, and J. Skowronek, "Automatic surveillance of the acoustic activity in our living environment," Proc. IEEE International Conference on Multimedia and Expo (ICME), 2005.
[4] R. Radhakrishnan, A. Divakaran, and P. Smaragdis, "Audio analysis for surveillance applications," Proc. 2005 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 158–161, 2005.
[5] S. Ntalampiras, I. Potamitis, and N. Fakotakis, "On acoustic surveillance of hazardous situations," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 165–168, 2009.
[6] T. Komatsu and R. Kondo, "Detection of anomaly acoustic scenes based on a temporal dissimilarity model," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 376–380, 2017.
[7] A. Eronen, V. T. Peltonen, J. T. Tuomi, A. P. Klapuri, S. Fagerlund, T. Sorsa, G. Lorho, and J. Huopaniemi, "Audio-based context recognition," IEEE Trans. Audio Speech Lang. Process., vol. 14, no. 1, pp. 321–329, 2006.
[8] K. Imoto and S. Shimauchi, "Acoustic scene analysis based on hierarchical generative model of acoustic event sequence," IEICE Trans. Inf. Syst., vol. E99-D, no. 10, pp. 2539–2549, 2016.
[9] J. Schröder, J. Anemüller, and S. Goetze, "Classification of human cough signals using spectro-temporal Gabor filterbank features," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6455–6459, 2016.
[10] T. Zhang and C. J. Kuo, "Audio content analysis for online audiovisual data segmentation and classification," IEEE Trans. Audio Speech Lang. Process., vol. 9, no. 4, pp. 441–457, 2001.
[11] Q. Jin, P. F. Schulam, S. Rawat, S. Burger, D. Ding, and F. Metze, "Event-based video retrieval using audio," Proc. INTERSPEECH, 2012.
[12] Y. Ohishi, D. Mochihashi, T. Matsui, M. Nakano, H. Kameoka, T. Izumitani, and K. Kashino, "Bayesian semi-supervised audio event transcription based on Markov Indian buffet process," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3163–3167, 2013.
[13] J. Liang, L. Jiang, and A. Hauptmann, "Temporal localization of audio events for conflict monitoring in social media," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1597–1601, 2017.
[14] A. Mesaros, T. Heittola, A. Eronen, and T. Virtanen, "Acoustic event detection in real life recordings," Proc. 18th European Signal Processing Conference (EUSIPCO), pp. 1267–1271, 2010.
[15] Y. Han, J. Park, and K. Lee, "Convolutional neural networks with binaural representations and background subtraction for acoustic scene classification," Proc. Detection and Classification of Acoustic Scenes and Events Workshop (DCASE), pp. 1–5, 2017.
[16] H. Jallet, E. Çakır, and T. Virtanen, "Acoustic scene classification using convolutional recurrent neural networks," Proc. Detection and Classification of Acoustic Scenes and Events Workshop (DCASE), pp. 1–5, 2017.
[17] G. Guo and S. Z. Li, "Content-based audio classification and retrieval by support vector machines," IEEE Trans. Neural Networks, vol. 14, no. 1, pp. 209–215, 2003.
[18] S. Kim, S. Narayanan, and S. Sundaram, "Acoustic topic models for audio information retrieval," Proc. 2009 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 37–40, 2009.
[19] K. Imoto, Y. Ohishi, H. Uematsu, and H. Ohmuro, "Acoustic scene analysis based on latent acoustic topic and event allocation," Proc. IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2013.
[20] H. Kwon, H. Krishnamoorthi, V. Berisha, and A. Spanias, "A sensor network for real-time acoustic scene analysis," Proc. IEEE International Symposium on Circuits and Systems (ISCAS), pp. 169–172, 2009.
[21] H. Phan, M. Maass, L. Hertel, R. Mazur, and A. Mertins, "A multi-channel fusion framework for audio event detection," Proc. 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5, 2015.
[22] P. Giannoulis, A. Brutti, M. Matassoni, A. Abad, A. Katsamanis, M. Matos, G. Potamianos, and P. Maragos, "Multi-room speech activity detection using a distributed microphone network in domestic environments," Proc. 23rd European Signal Processing Conference (EUSIPCO), pp. 1271–1275, 2015.
[23] K. Imoto and N. Ono, "Spatial cepstrum as a spatial feature using distributed microphone array for acoustic scene analysis," IEEE/ACM Trans. Audio Speech Lang. Process., vol. 25, no. 6, pp. 1335–1343, 2017.
[24] A. Ribeiro, A. G. Marques, and S. Segarra, "Graph signal processing: Fundamentals and applications to diffusion processes," Proc. 24th European Signal Processing Conference (EUSIPCO), 2016.
[25] G. Golub and C. Van Loan, Matrix Computations, Johns Hopkins University Press, 1996.
[26] A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, and T. Virtanen, "DCASE 2017 challenge setup: Tasks, datasets and baseline system," Proc. Detection and Classification of Acoustic Scenes and Events Workshop (DCASE), pp. 85–92, 2017.
[27] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, "Audio Set: An ontology and human-labeled dataset for audio events," Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 776–780, 2017.
[28] K. Imoto and N. Ono, "Acoustic scene classification based on generative model of acoustic spatial words for distributed microphone array," Proc. 25th European Signal Processing Conference (EUSIPCO), pp. 2343–2347, 2017.
[29] J. Kürby, R. Grzeszick, A. Plinge, and G. A. Fink, "Bag-of-features acoustic event detection for sensor networks," Proc. Detection and Classification of Acoustic Scenes and Events Workshop (DCASE), 2016.