Anomaly Detection in Audio with Concept Drift using Adaptive Huffman Coding
PRATIBHA KUMARI,
Indian Institute of Technology, Ropar, India
MUKESH SAINI,
Indian Institute of Technology, Ropar, India

In this work, we propose a framework to apply Huffman coding for anomaly detection in audio. There are a number of advantages of using the Huffman coding technique for anomaly detection, such as less dependence on a priori information about clusters (e.g., number, size, density) and support for variable event lengths: the coding cost can be calculated for any duration of audio. Huffman codes are mostly used to compress non-time-series data or data without concept drift. However, the normal class distribution of audio data varies greatly with time due to environmental noise. In this work, we explore how to adapt the Huffman tree to incorporate this concept drift. We found that, instead of creating new nodes, merging existing nodes gives more effective performance. Note that with node merging, the history is never completely forgotten, at least theoretically. To the best of our knowledge, this is the first work on applying Huffman coding techniques for anomaly detection in temporal data. Experiments show that this scheme improves the result without much computational overhead. The approach is time-efficient and can be easily extended to other types of time series data (e.g., video).
CCS Concepts: • Computer systems organization → Embedded systems; Redundancy; Robotics; • Networks → Network reliability.

Additional Key Words and Phrases: anomaly detection, long term audio surveillance, unsupervised modelling, Huffman coding
ACM Reference Format:
Pratibha Kumari and Mukesh Saini. 2018. Anomaly Detection in Audio with Concept Drift using Adaptive Huffman Coding. 1, 1 (February 2018), 22 pages. https://doi.org/10.1145/1122445.1122456

Authors' addresses: Pratibha Kumari, Indian Institute of Technology, Ropar, Ropar, Punjab, India, [email protected]; Mukesh Saini, Indian Institute of Technology, Ropar, Ropar, Punjab, India, [email protected].
1 INTRODUCTION
Initial automated surveillance systems were based on video sensors, which are quite popular even today. However, exclusively relying on video data can lead to unavoidable errors in monitoring. For example, video analysis algorithms fail to work in the presence of adverse weather conditions, abrupt illumination changes, shadows, reflections, or when there is an obstacle in the line of sight [10]. Moreover, a slight movement of the camera may cause background models to fail [27]. Audio sensors are less sensitive to these errors. They provide a complementary view of the environment and facilitate better scene understanding when used along with a video camera [6, 9, 26]. While there have been a large number of works on detecting suspicious activities (e.g., vehicle crashes, shouting, and gunshots) using audio analysis [7, 10, 14], detecting anomalies is still a challenging task due to inherent concept drift. The phenomenon of concept drift refers to the change in the data distribution over time in dynamically changing and non-stationary environments [32]. A typical example of concept drift is the change in noise level observed from
day to night in an office. A noisy environment in the office appears normal during working hours, whereas it becomes abnormal if observed in non-working hours. To accommodate concept drift, the model needs to be updated online, also referred to as adaptive learning. Research related to learning with concept drift has been growing steadily, and many drift-aware adaptive learning algorithms have been developed. In spite of the popularity of this research topic, there is a lack of focus on concept drift handling for audio-visual data. There have been a few attempts to adapt GMMs (Gaussian Mixture Models) for anomaly detection in audio [8, 9]. However, the learning and forgetting strategy in an adaptive GMM (AGMM) approach reacts poorly to concept drift that carries a longer time-span of context. Moreover, such modeling requires crucial prior knowledge about the number of clusters, their size, and their density. In addition, there is a need to calculate the inverse covariance matrix for the Gaussians, which may not even exist in the case of multivariate data.

Coding-cost based approaches do not need such prior knowledge; they are considered mostly parameter-free [1]. In this article, we propose to use the Huffman coding compression technique for audio anomaly detection. Huffman coding is a very simple yet effective technique for lossless data compression. It assigns more bits to less probable data and fewer bits to frequent data. We utilize this property to detect anomalies, which are defined as rare events. The existing works on Huffman coding based anomaly detection train a stationary model [1] [4]. To the best of our knowledge, Huffman coding has not been applied for anomaly detection in time-series data with concept drift; a single model is trained and applied to the whole data [1] [4]. To cope with the concept drift in audio data, we utilize a modified version of the adaptive Huffman coding tree.

There are numerous challenges in applying Huffman coding to data coupled with concept drift. The nodes in the tree can no longer be fixed. The Huffman tree should learn the environment in an online fashion. What the leaf nodes represent also has to change with time. We learn an adaptive codebook for the leaf nodes using the audio bag-of-words approach [20]. The audio segments are converted to words, and a group of words is classified as an anomaly based on the coding cost. Along with the classification, the new data is also used to update the codebook as well as the tree structure.
Traditionally, when a new sample does not match any of the Gaussians in the AGMM based approach, the weakest Gaussian is replaced with a new one corresponding to the new data [8, 9]. We propose to use a node merging strategy instead of node replacement in the Huffman tree. After updating the tree parameters at the arrival of a new sample, the tree nodes that are relatively close are merged. We experimentally found that the number of required nodes in the tree never explodes, no matter how long the data is. The node merging strategy also solves the problem of two distribution modes drifting close to each other when they actually represent a single mode. In this way, we never completely forget any past data.

We also found that most of the existing public audio datasets have samples from stationary classes (also sometimes called fixed anomalies), e.g., glass breaking, gunshot, shouting, etc. [10, 30]. These are mostly short audio clips containing such events, mixed with miscellaneous background [21]. Foggia et al. [13] proposed a dataset that contains specific event samples useful for road surveillance. While these datasets are good for the evaluation of supervised classification techniques, they do not have concept drift; the audio clips are too short to have any drift in the data distribution. Therefore, we collect various challenging datasets that contain multiple natural anomalies (not staged). We intend to make the dataset available for researchers in this field. Experiments on this dataset reveal that (1) the node merging scheme is more effective than node replacement, and (2) the proposed method outperforms the previous method on adaptive anomaly modeling.

We summarize our contributions as follows:
• We give a framework to apply Huffman coding to model the concept drift in adaptive anomaly detection. The proposed method outperforms previous works and needs less prior knowledge.
• We propose a node-merging scheme for adapting the Huffman tree, which gives better results than the traditional node replacement scheme.
• We propose a long duration audio dataset with concept drift for the evaluation of adaptive anomaly detection works.

The rest of the paper is structured as follows. We do a literature survey in Section 2. Then we discuss the proposed work in Section 3. The experimental results and analysis are presented in Section 4. We discuss the limitations of the work in Section 5. Finally, Section 6 concludes the article.
2 RELATED WORK
Audio analysis has been effectively used for detecting defined suspicious activities (e.g., vehicle crashes, shouting, and gunshot detection) both in indoor and outdoor scenarios [7, 10, 14]. Carletti et al. [5] separated gunshot, scream, broken glass, and background noise using multiple SVM classifiers, one for each type of anomalous event. Foggia et al. [13] used a similar approach for the task of car crash and tire skidding detection. Rushe et al. [24] used a convolutional autoencoder-based semi-supervised approach to detect baby crying, broken glass, and gunshots. The above supervised methods are only effective in detecting stationary anomalies without any concept drift [32].

Real-world time-series data like weather data, audio streams, CCTV videos, etc., often have concept drift. A stationary anomaly model relies on one-time learned fixed distributions and hence performs poorly on new, unseen time series data. To detect an anomaly in the presence of concept drift, we need to adapt the model with time [16, 19, 23, 34]. Saurav et al. [25] discussed the concept drift issue for point anomaly detection, where each sample has a semantic meaning. The authors proposed incremental learning of an RNN (Recurrent Neural Network) model by feeding the data present in a window of a given length. Fenza et al. [12] used an LSTM (Long Short Term Memory) based drift-aware model to detect anomalies in smart grid sensory data. They store the errors made in the last 24 hours to retrain the LSTM. The problem with deep network-based approaches is that they take a large number of samples and huge computing power to reflect any change in the distribution. In the case of audio, after mapping to the semantic space (feature calculation), we do not get as many examples.

AGMM is one of the successful models for anomaly detection in multimedia data. The baseline AGMM based approach [27] was proposed for visual foreground-background detection. Each Gaussian in the mixture represents either a background or a foreground sample. The anomalous samples have a lower likelihood of being generated from the mixture. The number of Gaussians is kept fixed (3-6). Over time, Gaussians representing rare events are replaced by newer anomalies; hence, the model only captures recent history. The model was later used to detect anomalies in audio data. Cristani et al. [8, 9] build feature-wise univariate AGMMs. Each Gaussian in a mixture represents a possible aural word distribution. Moncrieff et al. [22] introduced an idea to keep recently deleted Gaussians in a fixed size cache to extend the model memory. Applying AGMM for audio anomaly detection requires a lot of prior information about the number of clusters, cluster size, density, etc. Furthermore, an AGMM framework requires the inverse covariance matrix for likelihood calculation, which may not even exist for multivariate data.

In this work, we propose to use Huffman coding as an alternative approach for anomaly detection in audio. The idea is that the coding cost is generally higher for an anomalous data segment [1]. Callegari et al. [4] did a one-time training of the Huffman tree and used it for network anomaly detection.
Fig. 1. Coarse level architecture of the proposed framework. (Modules shown: continuous audio stream, feature extraction, find match in tree, node addition, tree reorganization, merge nodes, codebook generator, anomaly score estimator.)
Similarly, Böhm et al. [1] trained a Huffman tree on sports data. Uthayakumar et al. [29] also employed Huffman coding (and some other compression methods) to detect anomalies in network data. All these methods perform one-time training of the Huffman tree. In this work, we propose a framework to adapt the Huffman coding approach for audio data with concept drift. We propose a new idea of merging the tree nodes, which gives better performance than the traditional strategy of replacing past data with new data.
3 PROPOSED FRAMEWORK
Huffman coding has been used in various anomaly detection applications like software obfuscation techniques [33], network traffic anomaly detection [28], etc. Callegari et al. [4] used standard Huffman coding for offline anomaly detection in TCP network traffic data. The authors found it to perform better than clustering-based outlier detection approaches. Böhm et al. [1] used a probability density-based coding method for offline outlier detection in the data-mining field. However, to the best of our knowledge, coding techniques have never been exploited for complex anomaly detection in multimedia data. In this paper, we use a modified version of dynamic Huffman coding for anomalous event detection using urban audio (outdoor as well as indoor) information. In this section, we first discuss the standard dynamic Huffman coding algorithm and then customize it to handle the inherent concept drift in the audio anomaly detection task. Figure 1 gives a high level overview of the proposed approach. We give a detailed description of each module of the framework in the subsections below.
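To make the coding-cost premise concrete before turning to the dynamic algorithm, the following small, self-contained Python sketch (our own illustration, not the paper's implementation) builds a static Huffman tree over a toy symbol stream and reports the code length of each symbol; rare symbols end up deeper in the tree and therefore cost more bits, which is exactly the quantity the anomaly score is built on.

```python
import heapq
from collections import Counter

def huffman_code_lengths(freqs):
    """Build a static Huffman tree and return the code length (depth) per symbol."""
    # Each heap entry: (subtree weight, tie-breaker, {symbol: depth within subtree})
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)          # two lightest subtrees
        w2, _, d2 = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}  # one level deeper
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

# Toy "audio word" stream: symbol 'c' is rare, so it receives the longest code.
stream = "aaaaaaaabbbbddc"
lengths = huffman_code_lengths(Counter(stream))
print(lengths)   # a:1, b:2, d:3, c:3 -- the rare word 'c' gets the longest code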
The Huffman coding algorithm is commonly used for lossless compression of data [17]. A binary tree is built on the probabilities of unique data samples. The tree outputs a variable-length code (also known as a minimal prefix code) for each data sample. The algorithm requires prior knowledge of the data distribution; therefore, it is only suitable for encoding offline data. The dynamic version of Huffman coding [18] allows encoding without prior knowledge of the source distribution. The tree evolves as it encounters new samples. Initially, there is a single node in the tree, known as the root node. There is a concept of a 0-node or NYT (not yet transmitted/traversed) label for a node. The root node is labelled as NYT in the beginning. The NYT node is broken into two child nodes when a new unique sample is encountered. The left child becomes the new NYT node, and the right child contains the data corresponding to the new unique sample. The left child is given a '0' weight, and the right child is assigned a weight of '1'. The weight of a parent node is the sum of the weights of its child nodes. For n unique data samples, the tree possesses a maximum of 2n+1 nodes (1 root, 1 NYT, n data nodes, and (n−1) internal nodes). Each node is given a unique index from 1 to 2n+1. If the root node is given an index of 1, then its right and left children are indexed 2 and 3. Thus, the indexing follows a top-to-down and right-to-left order. The lower the value of the index, the higher the rank, and hence the equivalent probability of the node in the Huffman tree.

When a new sample matches any of the existing data nodes (let us name this node x), its weight is to be increased by one. Before increasing the weight, the node's correct future position is determined. Positions of all the higher ranking nodes with the same weight as x are considered. Then the node x is swapped with the highest ranking node among the candidate nodes (let us name it ŷ), along with the sub-trees. After the swap, the weight of x (in its new position) is increased by one. Further, we go to the parent of x and follow the same procedure to find its correct position in the tree. If we do not find any such node ŷ, then we simply increase x's weight by one and go to its parent node. This way, we keep increasing weights until we reach the root node. The tree maintains a sibling property [18, 31] to ensure the correctness of the dynamic tree. The tree is said to possess the sibling property if each node has a sibling (except the root node), and the nodes are numbered in order of non-decreasing weight with each node adjacent to its sibling. Also, the index of a parent node is lower than the indexes of its children. Whenever there is an addition of a node or an update to a node, the tree reorganizes itself by cascaded weight updates and swapping of the rule-violating nodes. Thus it remains an optimal Huffman tree for the data seen so far. Three types of swapping are possible while correcting the tree, namely, on the level (→), up (↑), and down (↓). However, a ↓ swap is always followed by an immediate ↑ swap. Nodes at different levels may be swapped. The NYT node may also move up or down during the swap of nodes which contain the NYT node in their sub-trees.
However, after the tree correction process finishes, the NYT node will always be at the lowest height (largest depth) in the tree since it has the lowest weight (0).

In our case, the weights are not integers but fractional numbers. In addition, on matching, the weights are not increased by a fixed amount; the increment depends on the previous weight as well as the learning rate of the environment. Therefore, we modify the node swapping procedure while updating the tree (Section 3.3). Further, we add a concept of merging the nodes, which removes the constraint on the maximum number of nodes in the tree.

In a practical application, the audio captured by a surveillance device is mostly continuous in nature. The given audio is first divided into frames. Let t represent the frame number, and ΔF be the duration of each frame. An event is represented by a set of contiguous frames. Let ℰ be the set of frame numbers that represent event E. We utilize the most frequently used audio features, namely MFCC (Mel-Frequency Cepstral Coefficients), energy, and ZCR (Zero Crossing Rate), to represent the audio signal. For a frame, we compute these features and concatenate them to get a column vector representation. Thus, we have a frame-wise feature vector representation of event E as {f_t | t ∈ ℰ}.

Feature normalization is an important step of data preprocessing in order to avoid any bias of the classifier towards numerically dominating features. However, for online settings, normalization or feature scaling is not straightforward, as the actual mean and standard deviation of the data are not known in advance. Therefore, we adopt a dynamic scaling technique [2] where approximated mean and standard deviation are dynamically computed for each of the features (energy, ZCR, and MFCC). These parameters are also updated on the arrival of each instance to accommodate the changes.
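As an illustration, the following Python sketch (our own, with assumed choices — e.g., 13 MFCCs and librosa as the feature extractor, neither of which is fixed by the paper) computes one concatenated MFCC/energy/ZCR vector per 0.5 s frame and applies Welford-style running statistics as a stand-in for the dynamic scaling technique of [2].

```python
import numpy as np
import librosa   # assumed available; any MFCC/ZCR/energy implementation works

def frame_features(y, sr, frame_sec=0.5, n_mfcc=13):
    """One feature vector (MFCC + energy + ZCR) per non-overlapping 0.5 s frame."""
    hop = int(frame_sec * sr)
    feats = []
    for start in range(0, len(y) - hop + 1, hop):
        frame = y[start:start + hop]
        mfcc = librosa.feature.mfcc(y=frame, sr=sr, n_mfcc=n_mfcc).mean(axis=1)
        energy = librosa.feature.rms(y=frame).mean()
        zcr = librosa.feature.zero_crossing_rate(frame).mean()
        feats.append(np.concatenate([mfcc, [energy, zcr]]))
    return np.asarray(feats)

class RunningScaler:
    """Dynamic feature scaling: running mean/std updated on every new frame."""
    def __init__(self, dim):
        self.n = 0
        self.mean = np.zeros(dim)
        self.m2 = np.zeros(dim)            # sum of squared deviations (Welford)

    def update_and_scale(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        std = np.sqrt(self.m2 / max(self.n - 1, 1)) + 1e-8
        return (x - self.mean) / std
```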
Algorithm 1: Adapting the scene on arrival of a new audio frame
Input: Tree, f_t, ID_NYT_old, θ_cos, θ_merge, Codebook
Output: Codebook, m_t, score(f_t), ID_NYT_t
1  // Search the tree to find the best match
2  match_flag, î, S(μ_t^î, f_t) = Search_Tree(Tree, f_t, ID_NYT_old) using Equation 1;
3  if match_flag == 1 and S(μ_t^î, f_t) ≥ θ_cos then update w_t^î and μ_t^î using Equations 2 and 3; recompute î's parent weight; m_t ← î;
4  else replace the old NYT node with two children such that the left child is the new NYT node and the right child is the data node corresponding to f_t; ID_rightchild ← ID_NYT_old + 1; ID_NYT_t ← ID_NYT_old + 2 (NYT node for the future); w_rightchild ← w_o; w_NYT_t ← 0; w_NYT_old ← w_o; μ_rightchild ← f_t; m_t ← ID_rightchild;
5  score(f_t) = D(m_t) / D(ID_NYT_t);   // D gives the depth of a node using the Codebook
6  Collect all {w_t^i, μ_t^i} corresponding to leaf nodes into a list: filtered_wμ ← {w_t^i, μ_t^i};
7  merge_flag, merged_wμ = Recursive_Merge(θ_merge, filtered_wμ);   // recursively finds two appropriate nodes to merge, deletes them, and adds the merged node to the list
8  if merge_flag == 1 then redraw a static Huffman tree with the existing leaf nodes (merged_wμ); else Reorganize_Tree(m_t, Tree);   // corrects sibling property violations, if any
9  Normalize {w_t^i | ∀ i ∈ H} between 0 and 1 and update parents' weights accordingly;
10 Update the Codebook;   // apply a tree traversal to visit each node and update its depth in the Codebook
11 procedure Reorganize_Tree(x, Tree):
     visitList ← [x];
     while x is not the right child of the root do
       recompute the weight of x if it is an internal node;
       ŷ ← arg max_y (w_x − w_y) over all y with ID_y < ID_x;
       if (w_x − w_ŷ) > 0 then
         swap x and ŷ along with their subtrees (if any) and adjust the depth information;
         recompute the parents' weights of nodes x and ŷ;
         append ŷ to visitList if not already present;
       z ← parent(x) if x is a right child, otherwise sibling(x);
       append z to visitList if not already present, and remove x from visitList;
       x ← the lowest rank (highest ID) node in visitList;
     recompute the weight of the node with ID=2, followed by the node with ID=1;

We list all the steps of adapting the scene at the arrival of a new audio frame in Algorithm 1. After computing the low-level representation f_t of the t-th frame and dynamic normalization, the existing adaptive Huffman tree is traversed from top to bottom to find the best match (step 2). We use cosine similarity as the similarity measure between the mean vector of the i-th data node (μ_t^i) and the current frame vector (f_t). The choice of the cosine similarity measure is due to its bounded output, which makes it easier to keep a generic similarity threshold. Each node in the tree is assigned a unique index (ID). Indexing of nodes starts from the root node with index '1', and it increases from top to down and right to left in the tree. Let î be the index of the best-matched node, which is computed as follows:

î = arg max_{i ∈ H} S(μ_t^i, f_t)    (1)

where H is the set of indices of all the data nodes in the tree, and S(μ_t^i, f_t) is a function that returns the cosine similarity between the mean vector μ_t^i of the i-th node and f_t. If S(μ_t^î, f_t) is greater than or equal to a threshold (θ_cos), then this is called a hit.
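A minimal sketch of this matching step (Equation 1): for brevity it scans a flat dictionary of leaf-node mean vectors rather than traversing the tree as Algorithm 1 does; the default θ_cos=0.93 is the value later selected for the 'Outdoor Canteen-2' dataset.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def find_best_match(data_nodes, f_t, theta_cos=0.93):
    """Equation (1): pick the data node whose mean is most similar to frame f_t.
    data_nodes: dict {node_id: mean_vector}; returns (hit, node_id, similarity)."""
    if not data_nodes:
        return False, None, -1.0
    best_id = max(data_nodes, key=lambda i: cosine(data_nodes[i], f_t))
    best_sim = cosine(data_nodes[best_id], f_t)
    return best_sim >= theta_cos, best_id, best_sim
```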
The hit node is thus the candidate node with the largest similarity value. When we get a hit, the parameters (weight and mean) of the hit node are updated (step 3) as follows:

w_{t+1}^î = (1 − α) · w_t^î + α    (2)

μ_{t+1}^î = (1 − γ_t) · μ_t^î + γ_t · f_t    (3)

where w_t^î and μ_t^î represent the weight and the mean vector of the î-th node at frame t, and w_{t+1}^î and μ_{t+1}^î those at frame t+1. α is a learning parameter kept between 0 and 1; γ_t is the distribution update factor corresponding to f_t. It depends on the similarity between μ_t^î and f_t: cosine similarity varies between −1 and 1, and the closer S(μ_t^î, f_t) is to 1, the higher the update factor. The coefficient γ_t is computed as follows:

γ_t = [(S(μ_t^î, f_t) − θ_cos) / (1 − θ_cos)] · (γ_max − γ_min) + γ_min    (4)

After updating the hit node, we recompute the weight of its parent node, as the weight of one of its children has changed. The weight of the parent is recomputed simply as the sum of the weights of its children. If we do not get a hit, then we break the current NYT node into a left child and a right child (step 4). The node IDs for the right and left children are assigned as ID_parent+1 and ID_parent+2, respectively. Now the left child becomes the new NYT node for the future. The right child represents the new sample. The initial mean vector of the right child node is set to the current frame vector. We give an initial weight (w_o) to this newly created data node. The new NYT node is given a '0' weight, and thus the old NYT node gets a weight of w_o (the sum of its children's weights).
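These update and node-creation steps can be sketched as follows (our simplified node representation — plain dictionaries holding a weight and a NumPy mean vector instead of a full tree implementation; default values follow the hyper-parameters reported later):

```python
import numpy as np

def gamma_t(sim, theta_cos, g_min=0.0, g_max=0.5):
    """Equation (4): similarity-dependent update factor for the mean vector."""
    return (sim - theta_cos) / (1.0 - theta_cos) * (g_max - g_min) + g_min

def on_hit(node, f_t, sim, alpha=1e-4, theta_cos=0.93):
    """Equations (2)-(3): move the hit node's weight and mean toward the new frame."""
    g = gamma_t(sim, theta_cos)
    node["w"] = (1.0 - alpha) * node["w"] + alpha
    node["mu"] = (1.0 - g) * node["mu"] + g * f_t

def on_miss(tree, f_t, w0=1e-3):
    """Step 4: split the NYT node; the right child becomes the data node for f_t,
    the left child becomes the new NYT node."""
    nyt = tree["nyt"]
    right = {"id": nyt["id"] + 1, "w": w0, "mu": np.array(f_t, dtype=float), "is_nyt": False}
    left = {"id": nyt["id"] + 2, "w": 0.0, "mu": None, "is_nyt": True}
    nyt["children"] = [right, left]
    nyt["w"] = w0                 # parent weight = sum of its children's weights
    tree["nyt"] = left            # the left child is the NYT node from now on
    return right
```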
After a hit/miss, we reorganize the tree in order to maintain the sibling property (step 8). In the process of cascaded weight updates, a node may violate the sibling property. In that case, we first swap the node violating this property with an appropriate node and then continue the weight update process until we reach the root node. We state the reorganization process in step 11 of Algorithm 1. We maintain a list, namely 'visitList', where we keep the node IDs that are to be visited in order to correct the tree. Initially, the list has the right child of the old NYT node in the case of a miss and the î-th node in the case of a hit. Let us denote this node as x, and its weight as w_x. We take all the nodes with an ID lower (higher rank) than the ID of x and a weight lower than w_x. If we get any such nodes, then we swap x with the node having the lowest weight among all the candidate nodes (let us denote this node as ŷ). The sub-trees are also swapped along with the nodes. The weights of the parents of nodes x and ŷ are also recomputed in decreasing order of ID, i.e., the weight of the parent with the higher ID is recomputed first, then the one with the lower ID. After this, we add the node ŷ to the visitList. The successor node of x (its right sibling if x is a left child, else its parent node) is also added to the list. At this stage, the node x is removed from the list. In case we do not find any such ŷ, we simply add the successor of x and remove x from the list. This way, we again take a new x from the list as the node with the maximum index value (lowest rank) and perform the reorganization. We keep repeating the process until we reach the right child of the root node. After reaching the right child of the root node, we simply recompute the weight of the root node. Note that the weight of the NYT node is not affected by the tree reorganization process; it remains at the lowest height in the tree.

Fig. 2. An example of the reorganization process in the proposed algorithm. Yellow, blue, green, and white nodes denote a data node, the NYT node, the root node, and an internal node, respectively. A new audio frame matches node-12 (Status-1). Adapting to this new frame is shown via Status-2, 3, 4, 5, and 6. Red boxes in Status-2, 3, and 4 highlight the nodes to be swapped.

We demonstrate the tree reorganization process with an example tree in Figure 2. Status-1 in the figure shows a tree with 13 nodes. There are six unique data nodes (with IDs 4, 6, 7, 9, 10, and 12) in the tree. A new sample is matched with node-12 (for simplicity, we denote the node with ID 12 as node-12). So, the weight of this node is updated using Equation 2, followed by recomputing the weight of its parent (node-11), as shown in Status-2. Currently, the 'visitList' consists of only node-12 (x). We find a suitable node ŷ (node-10) as explained in the above paragraph. We swap node-12 and node-10, followed by updating their parents' weights (Status-3). Now node-10 and node-11 (parent of node-12) are added to the list, and node-12 is removed from the list. For the next round, we take the node with the maximum index from the list, i.e., node-11, and look for the corresponding ŷ. We do not get any such ŷ, so we add its parent, node-8, to the list and delete node-11 from the list. In the next round, node-10 is taken out from the list. As there is no need to swap for node-10 either, we push its parent to the list (in this case the parent is already in the list, so we do not push it). The next node taken out from the list is node-8. We find the corresponding ŷ as node-7 and perform the swap, as shown in Status-4 in the figure. This time, node-7 and node-5 are added to the list, and node-8 is removed from the list. For the next round, node-7 is taken out. We do not find a ŷ for node-7, so we push node-6 and remove node-7 from the list. Now the list has only node-6 and node-5, so node-6 is taken out. There is no need of swapping for node-6, so we push its parent, node-3, to the list and remove node-6. The next node taken from the list is node-5. We get node-4 as ŷ for this node (Status-4). We swap these two nodes and recompute the parents' weights, followed by pushing node-2 into the list (Status-5). The next node taken out is node-3, and we do not find the need for swapping.
Now we are left with just the right child of the root node, so we finish the procedure by simply recomputing the weight of the root node (Status-6).

The weight of a node may explode in the long run if we keep incrementing it. Therefore, we normalize the node weights between 0 and 1 at the end of each hit or miss (step 9). Basically, we traverse the tree and take out all the weights of the data nodes (all leaf nodes except the NYT node). Then we divide the weight of each data node by the sum of the weights of all the data nodes. After this, the normalized weights of all the nodes are propagated by performing a postorder traversal of the tree. Thus, the NYT node has the lowest weight (0), and the root node gets the highest weight (1) after the weight normalization process.

The next step of our algorithm is to generate the codebook. The codebook contains the μ_i's (distribution modes) and the D(i)'s (the distance of the i-th node from the root node) (step 10). For anomaly detection, we do not need the actual codes but only the length of the code. The anomaly score is proportional to the code length, i.e., the depth of the node. Along with the weight and the mean vector, we also store the depth information for each node. Whenever an NYT node is split, its children are assigned a depth of D(ID_NYT_t)+1. Also, at the time of a swap, the depth of the two nodes, as well as of the attached sub-trees, is updated. The depth of the nodes in a sub-tree can simply be updated by visiting all nodes using a pre-order traversal with the parent node as the root. Further, the D(i)'s in the codebook can be updated after processing the new sample by traversing the tree using any tree traversal algorithm. Let the index of the hit node for the t-th frame be m_t. With the codebook and the matched node ID, we can compute the score for the audio frame as the ratio of the depth of the matched node to the depth of the NYT node (step 5 of the algorithm). There, the numerator term is the distance of the matched node from the root. ID_NYT_t represents the ID of the NYT node in the current tree for the t-th frame. Thus, the denominator term D(ID_NYT_t) is nothing but the height of the tree for the t-th frame. Hence, the anomaly score of a frame is the relative distance of the hit node from the root node. Further, the anomaly score Ω for the event E is computed by averaging the anomaly scores of all the frames of the event, i.e.,

Ω = (1/|ℰ|) · Σ_{t ∈ ℰ} D(m_t) / D(ID_NYT_t)    (5)

Contrary to traditional methods, our algorithm does not keep an upper limit on the maximum number of nodes in the tree. The algorithm keeps adding new unique frames to the tree on each miss. Thus the codebook contains each possible unique frame distribution seen so far. The code length for a truly anomalous frame is initially large, as it does not have many neighbors. The other, normal nodes have sufficient neighbors and thus remain in the upper part of the tree, leading to reduced code lengths. An abnormal frame that has seen sufficient neighbors may get upgraded to the upper levels (toward the root node) of the tree through the reorganization process. Thus, the code length transition from large to small makes it a new normal. Remembering everything takes care of the drifting nature of anomaly patterns. The drifting nature may also bring two nodes near each other over time. We therefore introduce the concept of merging close distributions. At the arrival of each new frame, we check whether nodes need to be merged.
We iteratively merge the two most similar data nodes if the similarity between them is above a merging threshold (steps 6-7). These two nodes can be at the same or different levels and need not be siblings. To perform the iterative merge, we take out all the data nodes by traversing the tree into a list or array data structure. We extract the two most similar nodes, and if the similarity between these nodes is above θ_merge, then we merge them. We assign the weight of the output node as the sum of the weights of the two nodes. The mean vector of the output node is set to the point-wise average of the mean vectors of these nodes. After this, we delete the two nodes from the list and add the merged node to the list. We keep merging until we do not find any more such nodes to merge. After this, we redraw the tree with the existing nodes and weights, similar to a fixed Huffman tree. Note that the redrawn tree is already a correct Huffman tree; hence we do not need to perform reorganization after the merge operation. For n data nodes, there are a total of C(n, 2) unique data pairs; hence, the time complexity of finding such a pair is O(n²). For performing consecutive merges, where the array/list shrinks by one element on each merge operation, the worst case time complexity is O(n³). The merging process lowers the memory requirement as well as reduces redundant distributions causing false alarms. A sketch of this step is given below.
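The following Python sketch, assuming the leaf nodes have already been collected into a flat list of weight/mean pairs, implements the pairwise merge loop; redrawing the static Huffman tree afterwards is omitted for brevity.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def recursive_merge(nodes, theta_merge=0.97):
    """Repeatedly merge the two most similar leaf distributions (weights add,
    means are averaged) until no pair exceeds the merging threshold.
    nodes: list of dicts {"w": float, "mu": np.ndarray}; returns (merged?, nodes)."""
    merged_any = False
    while len(nodes) > 1:
        # find the most similar pair among all O(n^2) candidate pairs
        i, j = max(
            ((a, b) for a in range(len(nodes)) for b in range(a + 1, len(nodes))),
            key=lambda p: cosine(nodes[p[0]]["mu"], nodes[p[1]]["mu"]),
        )
        if cosine(nodes[i]["mu"], nodes[j]["mu"]) < theta_merge:
            break
        new = {"w": nodes[i]["w"] + nodes[j]["w"],
               "mu": (nodes[i]["mu"] + nodes[j]["mu"]) / 2.0}
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [new]
        merged_any = True
    return merged_any, nodes
```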
4 EXPERIMENTS AND RESULTS
In this section, we discuss the datasets, experimental setup, and results. We compare the proposed approach with a fixed Huffman coding-based approach and a baseline multivariate AGMM framework.

We utilized three datasets to show the efficacy of our work, namely '12 Angry Men' (1.5 hours), 'Outdoor Canteen-2' (3 hours), and 'An Outdoor Area' (3 hours).
Fig. 3. Representative images from the different data collection scenarios: (a) Reception Area (morning), (b) Reception Area (night), (c) Lab-1 & 2, (d) Outdoor Canteen (morning), (e) Outdoor Canteen (evening), (f) An Outdoor Area.

Table 1. Example anomalies and the percentage of anomaly in the datasets used for performance evaluation.
Dataset | Anomaly percentage | Example anomalies
12 Angry Men | 2.27% | Shout, laugh, door knock, water tap, raining, fan, etc.
Outdoor Canteen-2 | 1.57% | Grinder sound, dragging chair, shout, vehicle passing, music, sudden loud phone ringtone, fire siren, etc.
An Outdoor Area | 2.6% | Police siren, dog bark, floor cleaning machine, laugh, music, bike, car, construction vehicle sound, etc.

The '12 Angry Men' dataset is taken from the movie of the same name. The movie is shot in a single closed room. It contains multiple natural anomalies like shouting, door knocking, sudden rain, laughing, etc. We collected the 'Outdoor Canteen-2' (3 hours) and 'An Outdoor Area' (3 hours) datasets by deploying a digital voice recorder, 'SONY ICD-UX560F', with a sampling frequency of 22.05 kHz in two different outdoor environments. Sudden noise of a grinder machine, a fire siren, mixer noise, chair dragging, a loud phone ringtone, etc., are some of the natural anomalies observed in the 'Outdoor Canteen-2' dataset. The 'An Outdoor Area' dataset contains more real-life outdoor anomalies, such as dog barking, a police vehicle siren, sudden bike/car arrivals, heavy construction vehicles passing by, etc. We give details of the anomaly duration and example anomalies for each of these datasets in Table 1. We labeled these three datasets with the assistance of security guards posted at these places. They were instructed to label interesting or attention-catching phenomena as anomalies. If an anomaly is observed for enough time, all its future occurrences are treated as normal. Again, here the drift from abnormal to normal is subjective; hence we opted for a majority voting strategy for labeling each dataset. We have used 2/3 of the volume of each of these datasets for hyper-parameter selection and the rest for performance evaluation.

In addition to these datasets, we use four other datasets, each with a duration of 10 hours, to study the maximum tree size (number of nodes in the tree). These datasets are as follows: 'TV Series' (Breaking Bad), 'Reception Area' (of an educational institute), 'Outdoor Canteen-1', and 'Lab-1'.
Table 2. The datasets used in this paper. The 'TV Series' and '12 Angry Men' datasets are taken from YouTube; the rest are self-collected.
Dataset Name | Location/Description | Length | Purpose
Lab-1 | a research lab | 10 hours | No. of Nodes
Reception Area | an institute | 10 hours | No. of Nodes
Outdoor Canteen-1 | an open canteen | 10 hours | No. of Nodes
TV Series | Breaking Bad | 10 hours | No. of Nodes
12 Angry Men | single room movie | 1.5 hours | Performance
Outdoor Canteen-2 | an open canteen | 3 hours | Performance
An Outdoor Area | an institute | 3 hours | Performance
Lab-2 | a research lab | 1.33 hours | Adaptiveness

The last three datasets are collected by us, and the first one is downloaded from the web. The choice of datasets covers a versatile range of outdoor scenarios. The 'TV Series' dataset is chosen to show that the tree size does not burst even if we use multiple random audios mixed together. Apart from these datasets, we collected one more dataset, namely 'Lab-2', of length 80 minutes (1.33 hours). This dataset also contains intentionally created anomalies: we played a song twice with an interval of 20 minutes. All the above-described datasets are listed in Table 2 with duration, recording location, and purpose. We also show scene images in Figure 3 for better visualization of the datasets. Note that the camera is placed at the same location as the microphone to take the image.

We test the approach on an event size of 4 seconds (extracted with 50% overlap). However, the approach can be used for any event length because the tree is built for a fixed frame size, not for the event. We fix the length of a frame at 0.5 seconds. Thus, there are a total of 8 frames in an event of 4 seconds. We keep w_o=1e-3, γ_max=0.5, and γ_min=0 throughout the experiments. The hyper-parameters {θ_merge, θ_cos, α} need to be tuned for each surveillance environment, which is a common practice in adaptive anomaly detection frameworks [9, 27].

For performance evaluation with a skewed class distribution, as in anomaly detection, the ROC (receiver operating characteristic) curve and the AUC (area under the ROC curve) measures have generally been used in the literature. The ROC curve illustrates the performance of a binary classifier as the discrimination threshold for the two classes is varied. It is a plot of the true positive rate (tpr) against the false positive rate (fpr) at multiple threshold values. In the case of the Huffman tree-based approaches (soft-type classifiers that produce a raw anomaly score), ROC points were generated using the algorithm proposed in [11]. In AGMM, we get hard class labels (abnormal or normal) for a fixed value of the foreground-background threshold and hence only one point in ROC space. However, to generate a full ROC curve, a discrete (hard-type) classifier can be mapped to a scoring (soft-type) classifier by "looking inside" the model [11]. For AGMM, we utilize the distribution probabilities to map the binary classifier into a scoring classifier. A score for each Gaussian in the mixture is maintained according to its weight. Thus, the Gaussian with the maximum weight gets a score of '0', and the Gaussian with the minimum weight gets a score of '1'. An audio frame is assigned an anomaly score equal to the score of the matching or newly added Gaussian. Further, the anomaly score for an event is computed as the average score of all the audio frames in the event. Once we convert the hard labels into soft scores, ROC points can be generated by the same algorithm as used for the Huffman tree-based approaches. The AUC score is computed as the area under the ROC curve by adding the successive areas of trapezoids [3]. It gives a single scalar value for each ROC curve, which is used to compare the various algorithms.
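For reference, event-level scoring and AUC can be computed as in the sketch below (our illustration only: it ignores the 50% event overlap and uses scikit-learn's roc_auc_score in place of the ROC/trapezoid procedure of [11] and [3]; the labels and scores are dummy values).

```python
import numpy as np
from sklearn.metrics import roc_auc_score   # stand-in for the ROC/AUC procedure of [11], [3]

def event_scores(frame_scores, frames_per_event=8):
    """Average consecutive frame scores into event scores (cf. Equation (5):
    a 4 s event contains 8 frames of 0.5 s; overlap is ignored here)."""
    n = len(frame_scores) // frames_per_event * frames_per_event
    return np.asarray(frame_scores[:n]).reshape(-1, frames_per_event).mean(axis=1)

# Dummy example: 10 events, the last two labelled anomalous.
frame_scores = np.random.rand(80)
labels = np.array([0] * 8 + [1] * 2)
print("AUC:", roc_auc_score(labels, event_scores(frame_scores)))
```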
Table 3. AUC performance for various choices of α, θ_cos, and θ_merge on the 'Outdoor Canteen-2' dataset.

To achieve the best performance, the hyper-parameters need to be tuned according to the targeted surveillance environment. We can shrink the search space with prior expert knowledge. For example, if the scene is a static indoor environment, the similarity threshold should be kept lower compared to a dynamic environment. Also, for a frequently changing environment, the weight update rate should be high; otherwise, the model will not be able to adapt to the speedy variations in the environment. The experimental results for selecting the optimum hyper-parameters on the 'Outdoor Canteen-2' dataset are given in Table 3. There are 5394 events of 4 seconds in the 'Outdoor Canteen-2' dataset; hence, a 2/3 fraction, i.e., 3,596 events, was used for hyper-parameter selection. As a new audio frame arrives, it is matched against the existing tree, an anomaly score is assigned accordingly, and finally the tree is updated as well. Likewise, the anomaly scores of all the audio frames in one event are collected to predict the anomaly score of the event using Equation 5. Further, we calculate the anomaly score of all the events in the train data and compute the AUC value. We varied the model hyper-parameters α, θ_cos, and θ_merge in the ranges {1e-05 to 1e-01}, {0.88 to 0.95}, and {0.96 to 0.99}, respectively, and report the corresponding AUC measures in Table 3. If we look at the average AUC per α value for a fixed value of θ_cos, we observe that the model overall performs better with α=1e-04 (with some exceptions). Likewise, when we look at the average AUC for each θ_cos for a fixed value of α, we find that a similarity threshold (θ_cos) of 0.93 or 0.94 gives better results in this scenario. We conclude that the model attains the best performance with α=1e-04, θ_merge=0.97, and θ_cos=0.93. Using the selected values of the hyper-parameters, we report AUC and ROC on the test
datasets and found that performance does not degrade. Hence, we say that the parameters for an environment need to be tuned only once, and later they can be used for that particular environment. The hyper-parameters for the other scenarios are tuned in a similar way. The model attains the best performance for the other scenarios at the following (θ_cos, α, θ_merge) values: 12 Angry Men (0.88, 1e-04, 0.98), An Outdoor Area (0.94, 1e-03, 0.98).

Fig. 4. The number of nodes in the tree over a duration of 10 hours on multiple challenging datasets (Reception Area, Outdoor Canteen-1, TV Series, Lab-1).

Figure 4 shows the number of nodes ('N') in the tree over a duration of 10 hours. The 'Lab-1' dataset in this plot (red line in the figure) is recorded from 6 PM to 4 AM. Generally, the lab is vacant from midnight to early morning (12 midnight - 5 AM). We can observe that after 6 hours, the number of nodes becomes almost static (approximately 20 nodes). It verifies that with no change in the environment, the tree becomes stable. Another important observation is that the number of nodes required in an outdoor environment ('Outdoor Canteen-1' dataset, orange line) is larger than in an indoor environment ('Lab-1' dataset). This is because change in an outdoor scenario is much more rapid than in an indoor scenario. The plot for the random audio dataset ('TV Series') increases comparatively over time, but it also shows convergence later on. We can infer from the plot that even if we use random audio data, the tree will converge after some time, once it has seen enough variations.
We compare the proposed approach, which has a variable tree size, with the fixed-length Huffman tree-based approach on the test volume (1/3 of the volume) of the three datasets, namely 'Outdoor Canteen-2', '12 Angry Men', and 'An Outdoor Area'. The results on the three datasets are given in Tables 4, 5, and 6, respectively. The corresponding ROC curves are shown in Figure 5. We keep the suitable values of θ_cos and α selected in the hyper-parameter tuning step for each of the three datasets recorded in different environments. The finalized values (θ_cos, α) are as follows: Outdoor Canteen-2 (0.93, 1e-04), 12 Angry Men (0.88, 1e-04), An Outdoor Area (0.94, 1e-03). We vary 'N' between 9 and 15 in the fixed Huffman tree method, as the number of unique distributions is generally kept low in fixed-mode approaches.

Table 4. Performance of the fixed length Huffman tree and the proposed approach on the 'Outdoor Canteen-2' dataset (θ_cos=0.93, α=1e-04). Columns: Fixed Length Tree (N, Replaced, AUC(%)) and Variable Length Tree (AUC(%), avg. N, Merged, θ_merge).
Table 5. Performance of the fixed length Huffman tree and the proposed approach on the '12 Angry Men' dataset (θ_cos=0.88, α=1e-04). Columns: Fixed Length Tree (N, Replaced, AUC(%)) and Variable Length Tree (AUC(%), avg. N, Merged, θ_merge).
Table 6. Performance of the fixed length Huffman tree and the proposed approach on the 'An Outdoor Area' dataset (θ_cos=0.94, α=1e-03). Columns: Fixed Length Tree (N, Replaced, AUC(%)) and Variable Length Tree (AUC(%), avg. N, Merged, θ_merge).
We report the number of replaced modes and the AUC measure for the fixed-length case. For the proposed approach, we report the average number of nodes in the tree, the total number of merged modes over the test dataset duration, and the AUC measure for different values of θ_merge. Note that the tree is built on audio frames; hence, the reported replaced or merged counts denote counts of frame instances, not events. The number of replaced modes decreases as we increase 'N' in the fixed Huffman method. It should be so because with a larger 'N', the chances of getting a hit increase. The optimal performance is obtained with N=15, 11, and 13 for the 'Outdoor Canteen-2', '12 Angry Men', and 'An Outdoor Area' datasets, respectively. We can make a similar observation for the proposed approach with θ_merge and the number of merged modes. As we increase the merging threshold, modes are less likely to be merged. For a higher value of θ_merge, two modes which are indeed similar may be left unmerged, whereas a lower θ_merge may end up merging two distinct modes. In the case of the 'Outdoor Canteen-2' dataset, we get a best AUC of 86.00% for the proposed approach, whereas for the fixed Huffman tree we get a best AUC of 79.78%. We observe improvements for the other two datasets as well in Tables 5 and 6.
Fig. 5 panel legends (AUC values): (a) 12 Angry Men [Ours]: θ_merge=0.96 AUC=80.28, 0.97 AUC=81.34, 0.98 AUC=83.61, 0.99 AUC=75.44; (b) 12 Angry Men [Fixed Huffman]: N=9 AUC=61.23, N=11 AUC=75.48, N=13 AUC=69.38, N=15 AUC=66.61; (c) Outdoor Canteen-2 [Ours]: θ_merge=0.96 AUC=80.94, 0.97 AUC=86.00, 0.98 AUC=82.16, 0.99 AUC=82.42; (d) Outdoor Canteen-2 [Fixed Huffman]: N=9 AUC=76.57, N=11 AUC=75.44, N=13 AUC=76.14, N=15 AUC=79.78; (e) An Outdoor Area [Ours]: θ_merge=0.96 AUC=57.04, 0.97 AUC=58.40, 0.98 AUC=61.12, 0.99 AUC=60.94; (f) An Outdoor Area [Fixed Huffman]: N=9 AUC=58.48, N=11 AUC=52.49, N=13 AUC=58.62, N=15 AUC=48.14.
Fig. 5. Performance comparison between the fixed length and variable length Huffman tree approaches on the three test datasets.
Here, we discuss the advantage of the 'Merge Nodes' module in our approach. The experiments are conducted on the test volume of the three datasets 'An Outdoor Area', '12 Angry Men', and 'Outdoor Canteen-2' in Figure 6. We choose the same selected (through hyper-parameter tuning) values of α and θ_cos for both cases, i.e., merge and no merge. A drastic drop in AUC performance is observed when we remove the merging module. It signifies the importance of merging redundant distributions, which reduces false alarms.
Fig. 6 legend (AUC values): Outdoor Canteen-2 [ours] 86.00 vs. [no merge] 79.71; An Outdoor Area [ours] 61.12 vs. [no merge] 57.70; 12 Angry Men [ours] 83.61 vs. [no merge] 79.64.
Fig. 6. The figure compares the performance of the proposed approach with and without the 'Merge Nodes' module on the 'Outdoor Canteen-2', 'An Outdoor Area', and '12 Angry Men' datasets.

Table 7. Comparison of the hyper-parameters of AGMM and the proposed adaptive Huffman coding based approach.
Description | AGMM | Ours | Comments
No. of distributions | K (range 4-6) | none | adaptive in ours
Merge threshold | none | θ_merge | –
Learning/update rates | α_w, α_g | α; γ_min, γ_max (0, 0.5) | required in both
Matching threshold | θ_mahalanobis ((2-3)σ) | θ_cos | –

As discussed earlier, most of the existing works assume a stationary anomaly distribution and hence ignore concept drift. We cannot compare the proposed work with those works (supervised anomaly classification), as the anomaly score in our case changes over time due to the concept drift. To the best of our knowledge, unsupervised AGMM is the main approach used to address concept drift in audio anomaly detection [8, 9, 22]. Table 7 describes the required hyper-parameters along with their ranges for both approaches. We see that the 'No. of distributions' parameter is not required for our method. Moreover, node merging avoids multiple instances of the same mode. In the literature, the hyper-parameters of the multivariate AGMM are varied as follows: K=3-6 (number of Gaussians), w_o=0.01-0.1 (initial weight), α_w=0.001-0.1 (weight update), α_g=0.001-0.1 (Gaussian update), T=0.92-0.99 (% data accounted for background events), and θ_mahalanobis=4.5 (the Z-score is replaced by the Mahalanobis distance for multivariate AGMM, and for a feature length of 15 the sigma rule says 4.5σ contains 90% of the population [15]). We compare the AUC performance of our approach with the AGMM approach for different values of 'K' (No. of distributions) on each of the three test datasets in Table 8. We can see that the proposed approach outperforms the existing AGMM approach on all three datasets. The table also reports the number of replaced (in the case of AGMM) or merged modes.

Table 8. Performance of the proposed approach and AGMM on the 'An Outdoor Area', '12 Angry Men', and 'Outdoor Canteen-2' datasets. Columns: Dataset, Method, No. of data modes, AUC(%), Replaced/Merged.
Fig. 7 legend (AUC values): Outdoor Canteen-2 [ours] 86.00 vs. [AGMM] 78.02; An Outdoor Area [ours] 61.12 vs. [AGMM] 58.48; 12 Angry Men [ours] 83.61 vs. [AGMM] 81.36.
Fig. 7. The figure compares the performance of the proposed approach and AGMM on the 'Outdoor Canteen-2', 'An Outdoor Area', and '12 Angry Men' datasets.

We can see that, similar to the fixed Huffman approach, the number of replaced modes decreases as we increase 'K' in AGMM, i.e., with a larger value of 'K', the chances of a miss are lower. However, the learning and forgetting strategy in the AGMM causes a large number of replaced modes, as observed from Table 8. Such frequent replacement of modes causes poor memory of the model and hence a degradation in performance. Another limitation of this approach is node drifting: two nodes may drift and come very close to each other but still represent two different Gaussians, and hence may cause misclassifications. For better visualization, we compare the ROC curves of our approach and the AGMM approach (with the best-performing 'K') in Figure 7.

Fig. 8. Anomaly scores given by the proposed approach, fixed Huffman coding, and the AGMM approach for the 'Lab-2' dataset. The raw audio is plotted in (a). (b) shows the anomaly scores predicted by the proposed approach. (c), (d), and (e) show the anomaly scores predicted by the 9-node dynamic Huffman tree, the 17-node dynamic Huffman tree, and AGMM, respectively. The parameters for each sub-figure are as follows: (b) proposed approach (θ_cos=0.90, θ_merge=0.98, α=1e-04, w_o=1e-04), (c) fixed length Huffman tree (θ_cos=0.90, N=9, α=1e-04, w_o=1e-04), (d) fixed length Huffman tree (θ_cos=0.90, N=17, α=1e-04, w_o=1e-04), and (e) AGMM (K=6, w_o=0.1, α_w=0.1, α_g=0.01, θ_mahalanobis=4.5). The dataset contains an event (a song being played) which is observed multiple times. The proposed approach shows adaptiveness to the new normality: once it has seen the event for a significant time, it remembers it and produces a low anomaly score. On the other hand, the fixed-node approaches (c, d, e), which tend to keep only short term temporal context, forget the long past; when they encounter the same event again, they produce a high score.

In this section, we closely analyze what happens when the model encounters the same event again and again. We utilize the 1.33 hours long 'Lab-2' dataset for this purpose. Figure 8(a) shows the raw audio plot over time. A song was played at the following (start-end) timestamps (minute:second): {(34:4-35:42), (74:30-74:47)}. We have marked these stamps on the plots. Figure 8(b) depicts the raw anomaly score (orange in the figure) as well as the 0/1 decision (blue in the figure) for the proposed approach. The threshold to convert the raw score into a 0/1 prediction is kept at 0.6 (this is just for the plot; it does not affect any inference in this experiment). The best suitable hyper-parameters for this dataset are θ_cos=0.90, θ_merge=0.98, and α=1e-04. The total number of nodes reaches up to 17 in this case. We plot the same for the fixed-length Huffman coding.
Figures 8(c) and 8(d) represent the anomaly scores for fixed lengths of 9 and 17 nodes, respectively. Figure 8(e) shows the raw score as well as the 0/1 predictions from AGMM with hyper-parameters K=6, w_o=0.1, α_w=0.1, α_g=0.01, and θ_mahalanobis=4.5. From the above plots, it can be observed that the anomaly score gradually decreases in the interval {34:4-35:42} for the proposed approach. It should decrease because the mode corresponding to the song gets enough neighbors, and hence the node attains a lower depth in the tree. The same behavior can be observed in the fixed-node cases as well (Figure 8(c), (d), (e)). Whether we use the fixed length or the variable length approach, the online models are able to adapt to the short term temporal context. But when the K-node models encounter the same event (the same song played in the interval 74:30-74:47) after a long duration, they again produce a high anomaly score and classify it as abnormal. This is because models with fixed modes and a replacement policy tend to forget the long past; they only remember recent history. In contrast, our model remembers a longer history. It puts the mode corresponding to this event in the upper part of the tree and produces a very small score (<0.3), because the same event was observed for a large amount of time (>95 seconds) in the past, as seen in Figure 8(b).

5 LIMITATIONS
The proposed model is unsupervised; it trains itself automatically with time. In order to train the model, the data samples (frames in our case) need to arrive in temporal order. Therefore, the model cannot be trained on traditional datasets where temporal order is missing (such as DCASE [21]). Another limitation of the work is that the hyper-parameters need to be tuned according to the application scenario. However, it is only a one-time task, and the same set of parameters should work in all similar scenarios. Finally, like all previous multimedia based anomaly detection works, we fail to accommodate the periodicity of events; we are only able to track the trend in the data.
The proposed framework is able to effectively apply Huffman coding to audio anomaly detection. The dynamic tree construction and reorganization accommodate concept drift in audio, and the codes representing the leaf nodes adapt to new data. Instead of replacing old nodes, we propose to merge nodes when they are close enough. The proposed node-merging framework is able to retain the long past, while fixed-mode approaches with a replacement policy forget the long-term history. The experiments also show that the maximum number of nodes remains tractable; the number of nodes converges to between 70 and 75 for most continuous audio datasets, even when the audio is very long. The proposed method works effectively on three datasets corresponding to challenging outdoor scenarios, and we are able to outperform an AGMM framework that requires prior knowledge about the data distribution. In the future, we want to extend this work to include video information for building the Huffman tree and detecting anomalies.
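For completeness, the snippet below sketches the merge-instead-of-replace idea in Python. It is an illustration under stated assumptions rather than the exact procedure of the paper: centroids are merged when their cosine similarity exceeds the θ_merge value reported above (0.98), and the merged centroid is a count-weighted mean so that neither node's history is discarded, whereas a replacement policy would simply drop the older node. All names and the update rule are hypothetical.

```python
import numpy as np

THETA_MERGE = 0.98  # merge threshold used in the experiments; the update rule below is an assumption

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def merge_closest(centroids, counts, theta=THETA_MERGE):
    """Merge the most similar pair of codebook centroids if their cosine similarity exceeds theta.

    Counts are summed, so the merged node keeps the observation history of both parents
    instead of forgetting one of them, as a replacement policy would.
    """
    best, pair = -1.0, None
    for i in range(len(centroids)):
        for j in range(i + 1, len(centroids)):
            s = cosine(centroids[i], centroids[j])
            if s > best:
                best, pair = s, (i, j)
    if pair is None or best < theta:
        return centroids, counts              # nothing close enough: keep every node
    i, j = pair
    w_i, w_j = counts[i], counts[j]
    merged = (w_i * centroids[i] + w_j * centroids[j]) / (w_i + w_j)   # count-weighted mean
    keep = [k for k in range(len(centroids)) if k not in (i, j)]
    return ([centroids[k] for k in keep] + [merged],
            [counts[k] for k in keep] + [w_i + w_j])

# Two nearly parallel centroids collapse into one node whose count is the sum of both.
cents = [np.array([1.0, 0.0]), np.array([0.995, 0.1]), np.array([0.0, 1.0])]
cnts = [120, 15, 40]
cents, cnts = merge_closest(cents, cnts)
print(len(cents), cnts)   # 2 [40, 135]
```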
REFERENCES
[1] Christian Böhm, Katrin Haegler, Nikola S Müller, and Claudia Plant. 2009. CoCo: coding cost for parameter-free outlier detection. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York, NY, USA, Paris, France, 149–158.
[2] Danushka Bollegala. 2017. Dynamic feature scaling for online learning of binary classifiers. Knowledge-Based Systems 129 (2017), 97–105.
[3] Andrew P Bradley. 1997. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition 30, 7 (1997), 1145–1159.
[4] Christian Callegari, Stefano Giordano, and Michele Pagano. 2009. On the use of compression algorithms for network anomaly detection. In . IEEE, Dresden, Germany, 1–5.
[5] Vincenzo Carletti, Pasquale Foggia, Gennaro Percannella, Alessia Saggese, Nicola Strisciuglio, and Mario Vento. 2013. Audio surveillance using a bag of aural words classifier. In . IEEE, Krakow, Poland, 81–86.
[6] Chloé Clavel, Thibaut Ehrette, and Gaël Richard. 2005. Events detection for an audio-based surveillance system. In . IEEE, Amsterdam, Netherlands, 1306–1309.
[7] Donatello Conte, Pasquale Foggia, Gennaro Percannella, Alessia Saggese, and Mario Vento. 2012. An ensemble of rejecting classifiers for anomaly detection of audio events. In . IEEE, Beijing, China, 76–81.
[8] Marco Cristani, Manuele Bicego, and Vittorio Murino. 2004. On-line adaptive background modelling for audio surveillance. In Proceedings of the 17th International Conference on Pattern Recognition. IEEE, Cambridge, UK, 399–402.
[9] Marco Cristani, Manuele Bicego, and Vittorio Murino. 2007. Audio-visual event recognition in surveillance video sequences. IEEE Transactions on Multimedia 9, 2 (2007), 257–267.
[10] Marco Crocco, Marco Cristani, Andrea Trucco, and Vittorio Murino. 2016. Audio surveillance: A systematic review. Comput. Surveys 48, 4 (2016), 1–46.
[11] Tom Fawcett. 2006. An introduction to ROC analysis. Pattern Recognition Letters 27, 8 (2006), 861–874.
[12] Giuseppe Fenza, Mariacristina Gallo, and Vincenzo Loia. 2019. Drift-aware methodology for anomaly detection in smart grid. IEEE Access.
[13] IEEE Transactions on Intelligent Transportation Systems 17, 1 (2015), 279–288.
[14] Pasquale Foggia, Nicolai Petkov, Alessia Saggese, Nicola Strisciuglio, and Mario Vento. 2015. Reliable detection of audio events in highly noisy environments. Pattern Recognition Letters 65 (2015), 22–28.
[15] Guillermo Gallego, Carlos Cuevas, Raul Mohedano, and Narciso Garcia. 2013. On the Mahalanobis distance classification criterion for multidimensional normal distributions. IEEE Transactions on Signal Processing 61, 17 (2013), 4387–4396.
[16] João Gama, Indrė Žliobaitė, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia. 2014. A survey on concept drift adaptation. ACM Computing Surveys (CSUR) 46, 4 (2014), 1–37.
[17] David A Huffman. 1952. A method for the construction of minimum-redundancy codes. Proceedings of the IRE 40, 9 (1952), 1098–1101.
[18] Donald E Knuth. 1985. Dynamic Huffman coding. Journal of Algorithms 6, 2 (1985), 163–180.
[19] Louis Kratz and Ko Nishino. 2009. Anomaly detection in extremely crowded scenes using spatio-temporal motion pattern models. In . IEEE, Miami, FL, USA, 1446–1453.
[20] Hyungjun Lim, Myung Jong Kim, and Hoirin Kim. 2015. Robust sound event classification using LBP-HOG based bag-of-audio-words feature representation. In . ISCA, Dresden, Germany, 3325–3329.
[21] A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, and T. Virtanen. 2017. DCASE 2017 Challenge Setup: Tasks, Datasets and Baseline System. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017). Inria, Nancy, France, 85–92.
[22] Simon Moncrieff, Svetha Venkatesh, and Geoff West. 2007. Online audio background determination for complex audio environments. ACM Transactions on Multimedia Computing, Communications, and Applications 3, 2 (2007), 8–es.
[23] Mehrsan Javan Roshtkhari and Martin D Levine. 2013. An on-line, real-time learning method for detecting anomalies in videos using spatio-temporal compositions. Computer Vision and Image Understanding.
[24] IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, Brighton, United Kingdom, 3597–3601.
[25] Sakti Saurav, Pankaj Malhotra, Vishnu TV, Narendhar Gugulothu, Lovekesh Vig, Puneet Agarwal, and Gautam Shroff. 2018. Online anomaly detection with concept drift adaptation using recurrent neural networks. In Proceedings of the ACM India Joint International Conference on Data Science and Management of Data. ACM, New York, NY, USA, Goa, India, 78–87.
[26] Alan F Smeaton and Mike McHugh. 2005. Towards event detection in an audio-based sensor network. In Proceedings of the Third ACM International Workshop on Video Surveillance & Sensor Networks. ACM, New York, NY, USA, Hilton, Singapore, 87–94.
[27] Chris Stauffer and W Eric L Grimson. 1999. Adaptive background mixture models for real-time tracking. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, Fort Collins, CO, USA, 246–252.
[28] Teng Sun, Hui Tian, and Xuan Mei. 2015. Anomaly detection and localization by diffusion wavelet-based analysis on traffic matrix. Computer Science and Information Systems 12, 4 (2015), 1361–1374.
[29] J Uthayakumar, T Vengattaraman, and J Amudhavel. 2017. A simple data compression algorithm for anomaly detection in Wireless Sensor Networks. International Journal of Pure and Applied Mathematics.
[30] . IEEE, London, UK, 21–26.
[31] Jeffrey Scott Vitter. 1987. Design and analysis of dynamic Huffman codes. Journal of the ACM (JACM) 34, 4 (1987), 825–845.
[32] Gerhard Widmer and Miroslav Kubat. 1996. Learning in the presence of concept drift and hidden contexts. Machine Learning 23, 1 (1996), 69–101.
[33] Zhenyu Wu, Steven Gianvecchio, Mengjun Xie, and Haining Wang. 2010. Mimimorphism: A new approach to binary code obfuscation. In Proceedings of the 17th ACM Conference on Computer and Communications Security. ACM, New York, NY, USA, Chicago, Illinois, USA, 536–546.
[34] Yiru Zhao, Bing Deng, Chen Shen, Yao Liu, Hongtao Lu, and Xian-Sheng Hua. 2017. Spatio-temporal autoencoder for video anomaly detection. In