Unsupervised Contextual Anomaly Detection using Joint Deep Variational Generative Models
Yaniv Shulman [email protected]
Abstract
A method for unsupervised contextual anomaly detection is proposed using a cross-linked pair of Variational Auto-Encoders (VAE) for assigning a normality score to an observation. The method enables a distinct separation of contextual from behavioral attributes and is robust to the presence of anomalous or novel contextual attributes. The method can be trained with data sets that contain anomalies without any special pre-processing.
1. Introduction
Anomaly detection is an important area of research since anomalies represent a substantial deviation from the normal characteristics of a system or process of interest. Often these processes result in high-dimensional data sets with complex relationships within the data, and exhibit stochastic behavior. Furthermore, anomalies by definition carry a high self-information measure and therefore provide useful information about the underlying data generation process. A number of similar definitions of what constitutes an anomaly exist; in this paper the following definition is adopted [11]:

1. Anomalies are different from the norm with respect to their attributes.
2. They are rare in a data set compared to the normal instances.
3. In addition, a novel observation is defined as an observation that is substantially different from any observation in the training data set.

In this paper a method for contextual anomaly detection is proposed using a cross-linked pair of Variational Auto-Encoders (VAE) for assigning a normality score to an observation. The method enables a distinct separation of contextual from behavioral attributes and is robust to the presence of anomalous or novel contextual attributes. The method can be trained with data sets that contain anomalies without any special pre-processing. In addition, the method can be extended in a straightforward way to further decompose and separately model the joint variational approximation by introducing additional independent recognition networks, thus allowing for a more accurate representation in the latent space.

In summary, the key contributions of this paper are:

• A novel architecture for auto-encoding joint latent variational Bayes.
• A novel method for robust unsupervised anomaly detection in the presence of contextual anomalies.
2. Preliminaries
In this section a number of criteria for broadly categorizing anomaly detection algorithms are briefly discussed. These concepts are covered in more detail in [1, 8, 11].
Proximity-based anomaly detection assumes that anomalous data are isolated from the majority of the data, whether in relation to clusters or to global/local dense regions. To determine whether an observation is anomalous, the distance to the clusters or the density estimate is calculated to generate a normality score. Statistical anomaly detection assumes that the data are generated from a known probability distribution which can be described by a parametric or non-parametric formulation. To determine whether a data point is an anomaly, the probability of it being generated from the assumed distribution is determined and a normality score is derived from this probability.
Deviation-based anomaly detection is based on the reconstruction errors following a spectral or other transformation of the data to a lower-dimensional space and then back to the original space. The magnitude of the reconstruction error is used to generate a normality score.
Supervised anomaly detection is employed where both the training and test data sets specify for each observation whether it is normal or anomalous;
Semi-supervised anomaly detection is typically defined as scenarios where the training data contains only normal observations;
Unsupervised anomaly detection is the case where no labels are provided in either the training or the testing data sets and no assumptions are made on the existence or number of anomalous observations in the available data.
Contextual anomaly detection is formulated such that the data contains two types of attributes: behavioral and contextual attributes.
Behavioral attributes are attributes that relate directly to the process of interest, whereas contextual attributes relate to exogenous but highly influential factors in relation to the process. Generally the behavioral attributes are conditional on the contextual attributes.

In this section a brief overview of the Variational Auto-Encoder (VAE) [14] is provided to introduce the notation used in subsequent sections of the paper. A Variational Auto-Encoder is a directed probabilistic graphical model that enables efficient variational inference for intractable posterior distributions, which are approximated by a neural network. The VAE is comprised of two serially adjoined neural networks, referred to as the encoder/recognizer and the decoder/generator respectively. The generator network $g(\mathbf{z}, \theta)$, where $\mathbf{z}$ is a latent variable, approximates the generative process $p_\theta(\mathbf{x}) = p_\theta(\mathbf{x} \mid \mathbf{z}) \, p_\theta(\mathbf{z})$. The recognition network $f(\mathbf{x}, \phi)$ models $q_\phi(\mathbf{z} \mid \mathbf{x})$, a variational approximation of the intractable posterior $p_\theta(\mathbf{z} \mid \mathbf{x})$. All parameters are learned jointly and efficiently by employing the Stochastic Gradient Variational Bayes (SGVB) [14] estimator. As the marginal likelihood of the data $p(\mathbf{x})$ is intractable, the problem is transformed into an optimization problem where the objective function of the VAE is the Evidence Lower Bound (ELBO), a lower bound on $\log p(\mathbf{x})$ as formulated in equation (2c):

$$\log p_\theta\big(\{\mathbf{x}^{(i)}\}_{i=1}^{N}\big) = \sum_{i=1}^{N} \log p_\theta\big(\mathbf{x}^{(i)}\big) \qquad \text{(1a)}$$

$$\log p_\theta(\mathbf{x}) = \mathrm{KL}\big(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p_\theta(\mathbf{z} \mid \mathbf{x})\big) + \mathcal{L}(\theta, \phi, \mathbf{x}) \qquad \text{(1b)}$$

$$\log p_\theta(\mathbf{x}) \geq \mathcal{L}(\theta, \phi, \mathbf{x}) \qquad \text{(2a)}$$
$$\mathcal{L}(\theta, \phi, \mathbf{x}) = \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\big[-\log q_\phi(\mathbf{z} \mid \mathbf{x}) + \log p_\theta(\mathbf{x}, \mathbf{z})\big] \qquad \text{(2b)}$$
$$= -\mathrm{KL}\big(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p_\theta(\mathbf{z})\big) + \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\big[\log p_\theta(\mathbf{x} \mid \mathbf{z})\big] \qquad \text{(2c)}$$

where the inequality in equation (2a) follows from the non-negativity of the Kullback-Leibler divergence. A complete derivation can be found in [5].
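The ELBO of equation (2c) can be made concrete with a short sketch. This is a minimal numpy illustration, not part of the original formulation: it assumes a diagonal-Gaussian variational posterior and a unit-variance Gaussian likelihood, and `decoder` is a placeholder callable standing in for the generator network.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_kl(mu, log_var):
    # Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dimensions.
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def elbo_estimate(x, mu_z, log_var_z, decoder, L=1):
    """Monte Carlo estimate of equation (2c):
    -KL(q(z|x) || p(z)) + (1/L) sum_l log p(x | z^(l))."""
    recon = 0.0
    for _ in range(L):
        eps = rng.standard_normal(mu_z.shape)
        z = mu_z + np.exp(0.5 * log_var_z) * eps   # reparameterization trick
        # Unit-variance Gaussian likelihood, up to an additive constant.
        recon += -0.5 * np.sum((x - decoder(z)) ** 2) / L
    return -gaussian_kl(mu_z, log_var_z) + recon
```

Maximizing this estimate over the encoder parameters (which produce `mu_z`, `log_var_z`) and the decoder parameters is exactly the SGVB training loop described above.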
In this section a very brief overview of the Conditional Variational Auto-Encoder (CVAE) [17] is given. The CVAE expands on the learning capacity of the VAE by defining an architecture that enables the model to learn an explicit joint variational approximation of the latent variable, $q_\phi(\mathbf{z} \mid \mathbf{x}, \mathbf{y})$, and a directly modulated conditional generative model, $p_\theta(\mathbf{y} \mid \mathbf{x}, \mathbf{z})$. In the CVAE the input is denoted as $\mathbf{x}$, the output as $\mathbf{y}$ and the latent variable as $\mathbf{z}$. The CVAE utilizes the SGVB optimization framework and an objective function closely related to that of the VAE, defined in equation (3e):

$$\log p_\theta(\mathbf{y} \mid \mathbf{x}) = \mathrm{KL}\big(q_\phi(\mathbf{z} \mid \mathbf{x}, \mathbf{y}) \,\|\, p_\theta(\mathbf{z} \mid \mathbf{x}, \mathbf{y})\big) + \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x}, \mathbf{y})}\big[-\log q_\phi(\mathbf{z} \mid \mathbf{x}, \mathbf{y}) + \log p_\theta(\mathbf{y}, \mathbf{z} \mid \mathbf{x})\big] \qquad \text{(3a, 3b)}$$
$$\geq \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x}, \mathbf{y})}\big[-\log q_\phi(\mathbf{z} \mid \mathbf{x}, \mathbf{y}) + \log p_\theta(\mathbf{y}, \mathbf{z} \mid \mathbf{x})\big] \qquad \text{(3c)}$$
$$= \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x}, \mathbf{y})}\big[-\log q_\phi(\mathbf{z} \mid \mathbf{x}, \mathbf{y}) + \log p_\theta(\mathbf{z} \mid \mathbf{x})\big] + \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x}, \mathbf{y})}\big[\log p_\theta(\mathbf{y} \mid \mathbf{x}, \mathbf{z})\big] \qquad \text{(3d)}$$
$$= -\mathrm{KL}\big(q_\phi(\mathbf{z} \mid \mathbf{x}, \mathbf{y}) \,\|\, p_\theta(\mathbf{z} \mid \mathbf{x})\big) + \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x}, \mathbf{y})}\big[\log p_\theta(\mathbf{y} \mid \mathbf{x}, \mathbf{z})\big] \qquad \text{(3e)}$$

Anomaly detection has attracted substantial interest from the research community over the decades due to its varied areas of application and theoretical importance. Many methods have been suggested for the general case, however a much smaller number of methods deal explicitly with contextual anomaly detection. A review of related work is given in [8] and in [11], the latter being more recent and also endeavoring to provide an elaborate comparative evaluation of a large number of methods. In this section the focus is on more recent methods proposed either for contextual anomaly detection or for anomaly detection that makes use of variational inference and deep learning methods. Note that supervised, semi-supervised and unsupervised methods are all included. [15] proposes a contextual anomaly detection method (ROCOD) for dealing with situations where there are abnormal or sparse contextual attributes by utilizing local and global behavioral models conditional on the context.
[15] also performs a comparative analysis of a number of methods and demonstrates that state-of-the-art point anomaly detection methods achieve relatively poor results on contextual anomaly detection problems. [19] proposes a method for general contextual anomaly detection together with three different expectation-maximization algorithms for learning the model. Additionally, [19] comparatively evaluates more than 13 different data sets against several other non-contextual anomaly detection methods. [12] propose a multivariate conditional outlier detection framework for clinical applications by defining a multivariate function to calculate the normality score. [3] propose a method for improved unsupervised learning of L2-constrained representations for clustering analysis using deep Auto-Encoders; normality scores are then calculated based on a similarity measure to clusters. Note that in [3] the number of clusters is assumed to be known. [18] apply a Stochastic Recurrent Network (STORN) [4] for supervised detection of anomalies in robot sensor time series data. [2] suggest an anomaly detection method using a VAE and propose the Reconstruction Probability, a novel normality score based on the probabilistic measure expressed in the objective function of the VAE. [20] suggest Donut, an unsupervised anomaly detection algorithm utilizing a Variational Auto-Encoder for anomaly detection in seasonal KPIs arising from web applications, also utilizing the Reconstruction Probability.
3. Problem Description
Most anomaly detection methods known to the author at this time do not treat contextual and behavioral attributes separately but simply merge the two attribute types into a single observation, thus transforming the original task into standard point anomaly detection [11, 8]. On the other hand, some contextual anomaly detection methods either require a labeled data set for training or are designed for specific domains; therefore few methods exist that perform general unsupervised contextual anomaly detection. Furthermore, by definition relatively little information is available on the distribution of the behavioral attributes in low-density areas of the contextual subspace, which poses an additional challenge for existing algorithms, especially when there is no information available on the distribution of the behavioral attributes because the context is itself novel.

In this paper the focus is on unsupervised contextual anomaly detection where the training and testing data sets are generated by the same process. It is of interest to develop a robust model that is able to learn the state of the process efficiently and correctly predict an observation as an anomaly when the behavioral attributes are in fact anomalous given the context. However, it is desirable for such a model to be robust to anomalies present in the contextual attributes and to use the best available relevant context to make meaningful predictions.

Figure 1: Illustration of the generative model as a directed graphical model. $\mathbf{x}$ is the behavioral attributes for the process of interest; $\mathbf{c}$ is the contextual attributes, which in this case do not participate directly in the generative process $p_\theta(\mathbf{x}) = \int\!\!\int p_\theta(\mathbf{x} \mid \mathbf{z}_x, \mathbf{z}_c) \, p_\theta(\mathbf{z}_x) \, p_\theta(\mathbf{z}_c) \, d\mathbf{z}_x \, d\mathbf{z}_c$. Solid lines denote the generative process whereas dashed lines denote the variational approximations.
4. Proposed Method
Given a data set of observations $\mathcal{D} = \{\mathbf{d}^{(i)} = [\mathbf{c}^{(i)}, \mathbf{x}^{(i)}] \mid \mathbf{c}^{(i)} \in C, \mathbf{x}^{(i)} \in X\}_{i=1}^{N}$, where $[\cdot, \cdot]$ denotes concatenation, the set $X = \{\mathbf{x}^{(i)}\}_{i=1}^{N}$ contains only behavioral attributes, the set $C = \{\mathbf{c}^{(i)}\}_{i=1}^{N}$ contains only the corresponding contextual attributes, and the pairs $[\mathbf{x}^{(i)}, \mathbf{c}^{(i)}]$ are jointly and independently drawn. The data generation process from which the $N$ samples are taken can be modeled as follows:

1. A sample $\mathbf{z}_x^{(i)}$ is taken from a latent variable $\mathbf{z}_x$ with prior distribution $p_\theta(\mathbf{z}_x)$.
2. A sample $\mathbf{z}_c^{(i)}$ is taken from a latent variable $\mathbf{z}_c$ with prior distribution $p_\theta(\mathbf{z}_c)$.
3. A sample $\mathbf{c}^{(i)}$ is taken from a variable $\mathbf{c}$ with conditional distribution $p_\theta(\mathbf{c} \mid \mathbf{z}_c)$.
4. A sample $\mathbf{x}^{(i)}$ is taken from a variable $\mathbf{x}$ with conditional distribution $p_\theta(\mathbf{x} \mid \mathbf{z}_x, \mathbf{z}_c)$.

The generative process is defined as $p_\theta(\mathbf{x}) = \int\!\!\int p_\theta(\mathbf{x} \mid \mathbf{z}_x, \mathbf{z}_c) \, p_\theta(\mathbf{z}_x) \, p_\theta(\mathbf{z}_c) \, d\mathbf{z}_x \, d\mathbf{z}_c$ and $p_\theta(\mathbf{c}) = \int p_\theta(\mathbf{c} \mid \mathbf{z}_c) \, p_\theta(\mathbf{z}_c) \, d\mathbf{z}_c$, and is chosen so as to prevent $\mathbf{c}$ from modulating the generative process of $\mathbf{x}$ directly, for reasons discussed in subsequent sections. Figure 1 provides an overview of the generative process. $p_\theta(\mathbf{x})$ and $p_\theta(\mathbf{c})$ are often intractable. Let $\mathbf{z} = [\mathbf{z}_x, \mathbf{z}_c]$ denote the complete set of latent variables. The variational lower bounds of $p_\theta(\mathbf{x})$ and $p_\theta(\mathbf{c})$ are defined as follows:

$$\log p_\theta(\mathbf{c}) \geq -\mathrm{KL}\big(q_\phi(\mathbf{z}_c \mid \mathbf{x}, \mathbf{c}) \,\|\, p_\theta(\mathbf{z}_c)\big) + \mathbb{E}_{q_\phi(\mathbf{z}_c \mid \mathbf{x}, \mathbf{c})}\big[\log p_\theta(\mathbf{c} \mid \mathbf{z}_c)\big] \qquad \text{(4)}$$
$$\log p_\theta(\mathbf{x}) \geq -\mathrm{KL}\big(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p_\theta(\mathbf{z})\big) + \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\big[\log p_\theta(\mathbf{x} \mid \mathbf{z})\big] \qquad \text{(5)}$$

To jointly optimize the variational lower bound of the two marginal likelihoods, equations (4) and (5) are combined.
$$\log p_\theta(\mathbf{c}) + \log p_\theta(\mathbf{x}) \geq -\mathrm{KL}\big(q_\phi(\mathbf{z}_c \mid \mathbf{x}, \mathbf{c}) \,\|\, p_\theta(\mathbf{z}_c)\big) + \mathbb{E}_{q_\phi(\mathbf{z}_c \mid \mathbf{x}, \mathbf{c})}\big[\log p_\theta(\mathbf{c} \mid \mathbf{z}_c)\big] - \mathrm{KL}\big(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p_\theta(\mathbf{z})\big) + \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\big[\log p_\theta(\mathbf{x} \mid \mathbf{z})\big] \qquad \text{(6)}$$

Given that the KL terms in equation (6) may be integrated analytically under certain conditions when calculating the empirical loss, the objective is optimized using the Stochastic Gradient Variational Bayes (SGVB) [14] estimator:

$$\mathcal{L}\big(\theta, \phi, \mathbf{c}^{(i)}, \mathbf{x}^{(i)}\big) = -\mathrm{KL}\big(q_\phi(\mathbf{z}_c^{(i)} \mid \mathbf{x}^{(i)}, \mathbf{c}^{(i)}) \,\|\, p_\theta(\mathbf{z}_c^{(i)})\big) - \mathrm{KL}\big(q_\phi(\mathbf{z}^{(i)} \mid \mathbf{x}^{(i)}) \,\|\, p_\theta(\mathbf{z}^{(i)})\big) + \frac{1}{L} \sum_{l=1}^{L} \log p_\theta\big(\mathbf{c}^{(i)} \mid \mathbf{z}_c^{(i,l)}\big) + \frac{1}{L} \sum_{l=1}^{L} \log p_\theta\big(\mathbf{x}^{(i)} \mid \mathbf{z}^{(i,l)}\big) \qquad \text{(7)}$$

where $\mathbf{z}_c^{(i,l)} = g_\phi\big(\mathbf{x}^{(i)}, \mathbf{c}^{(i)}, \boldsymbol{\varepsilon}_c^{(i,l)}\big)$ with $\boldsymbol{\varepsilon}_c \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, $\mathbf{z}^{(i,l)} = h_\phi\big(\mathbf{x}^{(i)}, \mathbf{c}^{(i)}, \boldsymbol{\varepsilon}^{(i,l)}\big)$ with $\boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, and $L$ is the number of samples. The first two KL terms in equation (7) represent the latent error for the two variational distributions $q_\phi(\mathbf{z}_c \mid \mathbf{x}, \mathbf{c})$ and $q_\phi(\mathbf{z} \mid \mathbf{x})$, and the two remaining terms the log probability of the reconstruction errors for the contextual and behavioral attributes $C = \{\mathbf{c}^{(i)}\}_{i=1}^{N}$ and $X = \{\mathbf{x}^{(i)}\}_{i=1}^{N}$ respectively.

To approximate the posteriors of the joint generative models $p_\theta(\mathbf{c} \mid \mathbf{z}_c)$ and $p_\theta(\mathbf{x} \mid \mathbf{z}_x, \mathbf{z}_c)$, two recognition networks and two generator networks are jointly trained. The behavioral attributes $\mathbf{x}$ are input into one of the recognition networks, and both the contextual and behavioral attributes $[\mathbf{x}, \mathbf{c}]$ are input into the other. Both recognition networks output the parameters of the variational approximations to the prior, followed by $L$ samples that are drawn from the variational approximations to form a Monte Carlo approximation of the expectations of the reconstruction with respect to the variational approximations [14]. This architecture provides a number of benefits:

1. Explicit treatment of behavioral and contextual attributes.
2.
Enables an indirect modulation of the generative process of $p_\theta(\mathbf{x} \mid \mathbf{z})$ by $\mathbf{c}$, based on the latent representation of the contextual attributes rather than a direct modulation of the process as done in a CVAE architecture, which results in increased robustness to the presence of outliers and novelties in the contextual space. Intuitively this can be explained by the similarity of the latent representation of $\mathbf{c}$ to a spectral dimensionality reduction representation which maps the data into a known subspace, but with the increased model capacity of the recognition network and the benefit of a probabilistic interpretation.

3. Enables assigning different priors for the contextual and behavioral spaces, and having multiple of each as a method to decompose and separately model the joint latent distribution.

All recognition and generator networks are jointly trained using the Stochastic Gradient Variational Bayes (SGVB) [14] estimator. Having learned the model parameters, a normality score can be obtained either by calculating a reconstruction error norm for the behavioral attributes, $\|\mathbf{x}^{(i)} - \hat{\mathbf{x}}^{(i)}\|$, or by calculating the Reconstruction Probability of $\mathbf{x}^{(i)}$, defined as $\mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\big[\log p_\theta(\mathbf{x} \mid \mathbf{z})\big]$ [2]. Note that the reconstructed context $\hat{\mathbf{c}}^{(i)}$ is not strictly required for assigning a normality score for classification but can be used to estimate the normality score of the context if desired. Figure 2 provides an overview of the architecture.

Figure 2: Illustration of the architecture.
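As a concrete illustration, the per-observation objective of equation (7) and the reconstruction-error normality score can be sketched in numpy. This is a minimal sketch under the assumption of diagonal-Gaussian variational posteriors and unit-variance Gaussian likelihoods; the four networks are passed in as plain callables and their architecture is deliberately left abstract, so none of the names below are part of the original formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_kl(mu, log_var):
    # Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dimensions.
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def sample(mu, log_var):
    # Reparameterization: z = mu + sigma * eps with eps ~ N(0, I).
    return mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)

def joint_elbo(x, c, ctx_enc, ctx_dec, beh_enc, beh_dec, L=1):
    """One-observation estimate of the joint lower bound (equation 7).
    ctx_enc([x, c]) parameterizes q(z_c | x, c); beh_enc(x) parameterizes
    q(z_x | x); the behavioral decoder receives [z_x, z_c], i.e. the context
    modulates x only through its latent representation z_c."""
    mu_zc, lv_zc = ctx_enc(np.concatenate([x, c]))
    mu_zx, lv_zx = beh_enc(x)
    bound = -gaussian_kl(mu_zc, lv_zc) - gaussian_kl(mu_zx, lv_zx)
    for _ in range(L):
        zc = sample(mu_zc, lv_zc)
        zx = sample(mu_zx, lv_zx)
        # Unit-variance Gaussian log-likelihoods, up to additive constants.
        bound += -0.5 * np.sum((c - ctx_dec(zc)) ** 2) / L                      # log p(c | z_c)
        bound += -0.5 * np.sum((x - beh_dec(np.concatenate([zx, zc]))) ** 2) / L  # log p(x | z_x, z_c)
    return bound

def normality_score(x, x_hat_samples):
    # Reconstruction-error score ||x - x_hat||, averaged over the L
    # reconstructions; higher values indicate a more anomalous observation.
    return float(np.mean([np.linalg.norm(x - xh) for xh in x_hat_samples]))
```

A convenient sanity check is that with zero-valued inputs and degenerate networks the KL and reconstruction terms both vanish, so the bound reduces to zero.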
5. Experimental Results
Comparative evaluation of contextual anomaly detection methods is a challenging task due to the lack of availability of common and suitable data sets that are both labeled and partitioned into behavioral and contextual attributes. To overcome this challenge the publicly available Kddcup99 data set (http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html) was adopted, as well as an evaluation method used in [15], to provide a performance baseline. The Kddcup99 is "the data set used for The Third International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-99, The Fifth International Conference on Knowledge Discovery and Data Mining. The competition task was
to build a network intrusion detector, a predictive model capable of distinguishing between 'bad' connections, called intrusions or attacks, and 'good' normal connections. This database contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment." The Kddcup99 data set is by a large margin the most challenging data set evaluated by [15] and was therefore selected for this experiment. An effort was made to adhere to the same method of pre-processing and data inclusion as described in [15], however there are some differences, as described subsequently.

The observations from the r2l and u2r attack families were retained, as well as attacks of type ipsweep and nmap, and normal observations. This results in a total of 605,803 observations, out of which 595,797 are labeled as normal and the remaining 10,006 are considered anomalies (approx. 1.652%). Similarly to [15], the service, duration, src_bytes and dst_bytes attributes were used as behavioral attributes and all others as contextual attributes. The logarithm of duration, src_bytes and dst_bytes was taken since these attributes are processed in the same manner in [15]. All categorical features were one-hot encoded and finally all attributes were normalized to the [0, 1] range. The resulting data set contains 65 behavioral attributes and 45 contextual attributes and enables quantitative analysis of the proposed algorithm's effectiveness against the algorithms evaluated in [15] on a similar data set.

The model is comprised of behavioral recognizer and generator networks and contextual recognizer and generator networks, as in the basic architecture described in section 4 and illustrated in figure 2. The arrangement of units in the behavioral recognizer MLP was: 65 (input), 58, 32 and 4 units for the latent output, with the generator having a mirrored architecture. The arrangement of units in the contextual recognizer MLP was: 110 (input), 40, 22 and 4 units for the latent output, with the generator having a mirrored architecture except for the output layer containing 45 units. All activation functions in the MLPs are ReLU where applicable, however the latent parameter layers as well as the outputs of both generators employ linear activations. Isotropic normal distributions were assumed for the data and latent distributions, which leads to the total empirical objective presented in equation (8); note there is an added L1 regularization term over the MLPs' weights, weighted by $\lambda$.

$$\mathcal{L}\big(\theta, \phi, \mathbf{c}^{(i)}, \mathbf{x}^{(i)}\big) = -\frac{1}{2}\Bigg[\sum_{j=1}^{|\mathbf{z}|}\Big(1 + \log\big((\sigma_z^{(j)})^2\big) - (\mu_z^{(j)})^2 - (\sigma_z^{(j)})^2\Big) + \sum_{j=1}^{|\mathbf{z}_c|}\Big(1 + \log\big((\sigma_{z_c}^{(j)})^2\big) - (\mu_{z_c}^{(j)})^2 - (\sigma_{z_c}^{(j)})^2\Big)\Bigg] + \frac{1}{L}\sum_{l=1}^{L}\big\|\mathbf{x}^{(i)} - \hat{\mathbf{x}}^{(i,l)}\big\|^2 + \frac{1}{L}\sum_{l=1}^{L}\big\|\mathbf{c}^{(i)} - \hat{\mathbf{c}}^{(i,l)}\big\|^2 + \lambda \sum_{w} |w| \qquad \text{(8)}$$

where $L$ is the number of samples and $|\mathbf{z}|$, $|\mathbf{z}_c|$ and $|\mathbf{z}_x|$ are the dimensions of the latent variables $\mathbf{z}$, $\mathbf{z}_c$ and $\mathbf{z}_x$ respectively.
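The attribute pre-processing described above (log transform of the heavy-tailed numeric attributes, one-hot encoding of categoricals, min-max scaling to the [0, 1] range) can be sketched as follows. The helper names and the category vocabulary are illustrative assumptions, and `log1p` is used so that zero-valued counters stay finite (the paper only states that the logarithm was taken).

```python
import numpy as np

def log_transform(values):
    # log(1 + v) rather than log(v), so zero durations/byte counts stay finite
    # (an assumption; the source states only that the logarithm was taken).
    return np.log1p(np.asarray(values, dtype=float))

def one_hot(value, vocabulary):
    # One-hot encode a single categorical value against a fixed vocabulary.
    return np.array([1.0 if value == v else 0.0 for v in vocabulary])

def minmax_scale(X):
    # Column-wise min-max normalization to the [0, 1] range.
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard against constant columns
    return (X - lo) / span
```

In practice the scaling parameters would be fitted on the training partition only and reused for the validation and test partitions.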
For optimization Adam [13] was employed. Note that the aforementioned architecture is likely not optimal and was chosen based on previous personal experience for illustrative purposes, with no attempt to find an optimal hyper-parameter setting for this experiment. Training was performed with an early-stopping strategy once the loss on the validation set started increasing.

Although the aim is to compare primarily against the results presented in [15], for diligence the same data set was evaluated with three additional algorithms: Isolation Forest [16], One-Class SVM [9] and Local Outlier Factor [7]. Little effort was put into fine-tuning these algorithms on the target data set and their results should be taken as indicative only. The following metrics were evaluated for each of the methods:

1. Area under the Precision-Recall Curve (PRC): the area under the curve when plotting recall on the x-axis against precision on the y-axis for all relevant threshold values for discriminating between normal and anomalous observations. PRC is recommended in scenarios where the data set is highly imbalanced [10]. The area under the curve (AUC) provides a summary statistic of the performance of a classifier in PRC space.

2. Average Precision Score (APS): provides a summary statistic for the Precision-Recall Curve as a weighted mean of the precision obtained at each threshold, with the weight being the increase in recall from the previous threshold, calculated as
$\mathrm{APS} = \sum_n (R_n - R_{n-1}) P_n$, where $P_n$ and $R_n$ are the precision and recall at the $n$-th threshold.

3. Area under the Receiver Operating Characteristic Curve (ROC): the ROC curve enables the visualization of the relative trade-off between the true-positive rate (TPR) and the false-positive rate (FPR) by plotting FPR on the x-axis against TPR on the y-axis for all relevant threshold values. The area under the curve (AUC) provides a summary statistic of the performance of a classifier in ROC space.

4. Top-100 Precision: the fraction of correctly detected anomalies among the top 100 scored observations.

Due to the challenges related to binary classification over highly imbalanced data sets [6], cross-validation with 5-fold stratified partitioning was performed, where the ratio of the two classes in each of the train/test partitions was kept equal to the distribution in the complete data set. The results are summarized in the following tables:

Method                 PRC (AUC)   APS       ROC (AUC)   Top-100 Precis.
JLVAE                  0.51848     0.51874   0.99257     0.018
Isolation Forest       0.00842     0.00855   0.01937     0
One Class SVM          0.00846     0.00853   0.02459     0
Local Outlier Factor   0.02458     0.03579   0.64849     0.056
Table 1: Summary of mean results obtained over the 5-folds for all methods.
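For reference, the APS and Top-100 precision metrics reported in these tables can be computed from anomaly scores and binary labels with a plain numpy sketch equivalent to the definitions above (thresholds are taken at every observation when sorted by descending score):

```python
import numpy as np

def average_precision(scores, labels):
    """APS = sum_n (R_n - R_{n-1}) P_n, with P_n and R_n the precision and
    recall at the n-th threshold."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    y = np.asarray(labels, dtype=float)[order]
    tp = np.cumsum(y)                              # true positives at each cut
    precision = tp / np.arange(1, len(y) + 1)
    recall = tp / y.sum()
    aps, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):
        aps += (r - prev_r) * p
        prev_r = r
    return aps

def top_k_precision(scores, labels, k=100):
    # Fraction of true anomalies among the k highest-scored observations.
    order = np.argsort(-np.asarray(scores, dtype=float))[:k]
    return float(np.asarray(labels, dtype=float)[order].mean())
```

A perfect ranking (all anomalies scored above all normal observations) yields an APS of 1.0, which is a useful sanity check when wiring up an evaluation.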
The results obtained demonstrate a substantial improvement compared to the benchmark algorithms tested in the described setting and to the results obtained by [15] for a similar data set. The following tables contain detailed information on the results obtained for each of the algorithms and k-folds.

K-Fold   PRC (AUC)   APS       ROC (AUC)   Top-100 Precision
1        0.50543     0.5057    0.99240     0.01
2        0.53492     0.53524   0.99321     0.03
3        0.51293     0.5131    0.99227     0.01
4        0.53134     0.53162   0.99264     0.03
5        0.50777     0.50805   0.99233     0.01
mean     0.51848     0.51874   0.99257     0.018
Table 2: JLVAE - proposed method.
K-Fold   PRC (AUC)   APS       ROC (AUC)   Top-100 Precision
1        0.0084      0.00854   0.01773     0
2        0.0084      0.0085    0.01203     0
3        0.00843     0.00859   0.02282     0
4        0.00848     0.0086    0.02942     0
5        0.00838     0.00853   0.01486     0
mean     0.00842     0.00855   0.01937     0
Table 3: Isolation Forest.
K-Fold   PRC (AUC)   APS       ROC (AUC)   Top-100 Precision
1        0.00847     0.00854   0.02476     0
2        0.00847     0.00853   0.02475     0
3        0.00846     0.00853   0.02462     0
4        0.00846     0.00853   0.02469     0
5        0.00846     0.00852   0.02414     0
mean     0.00846     0.00853   0.02459     0
Table 4: One Class SVM.
K-Fold   PRC (AUC)   APS       ROC (AUC)   Top-100 Precision
1        0.02442     0.03593   0.64976     0.01
2        0.02464     0.03627   0.64744     0.09
3        0.02394     0.0353    0.64671     0.03
4        0.02483     0.0351    0.64564     0.08
5        0.02506     0.03633   0.65290     0.07
mean     0.02458     0.03579   0.64849     0.056
Table 5: Local Outlier Factor.

5.2. Waste Water Treatment Plant

5.2.1. Robustness to Contextual Anomalies

To demonstrate the effectiveness of the method in dealing with contextual anomalies, it was evaluated on a real-world waste water treatment plant located in Western Australia. The plant design features a splitter chamber that divides the incoming waste water into two wells, each having two pumps. Waste water pumped by the pumps is then merged into a single outlet pipe by a series of two joiner pipes, one joining the pumps' output in each well, and one joining the two wells' output. The control logic for the plant under normal conditions turns pumps on and off as required to meet inflow conditions and also uses variable speed drives to modulate the speed of the operational pumps based on a level reading of the splitter chamber. This design results in a system where the operational characteristics of a pump are not independent of the other pumps.

The data set contains roughly 30 months of operational data, close to 150 attributes and about 690,200 coincident observations at 2-minute frequency, and is comprised of the following information:

1. Sensors specific to each pump such as vibration, temperature, speed, operational pressures and flows, power supply characteristics, and more.
2. Generated features per pump such as efficiency.
3. Environmental readings from the two wells.
4. Other useful data such as the splitter chamber level and external weather conditions.

The data is assumed to contain anomalies of unknown nature and frequency. The data was partitioned such that data generated in a particular pump run-cycle was kept together and not split across sets. Partitioning was done into training (65%), validation (15%) and testing (20%) sets, where the percentages represent the portion of pump run-cycles rather than single observations.
Lastly, the data was not pre-processed except for aligning observations in time by mean interpolation and discarding partial observations, with the remaining observations standardized. Note that there are no categorical attributes in this data set.

A model is developed for each pump individually, where the behavioral attributes are the data relating directly to the operational sensor readings of the pump, and where the contextual attributes are some of the behavioral attributes of the remaining pumps as well as environmental factors such as the splitter chamber level and weather conditions. For example, a model for pump one will include as context the inflow and outflow rate and pressure of pumps 2-4, the splitter chamber level and environmental information. The setup was similar to the one described in section 5.1.2, with the arrangement of units in the behavioral recognizer as follows: 28 (input), 20, 10 and 5 units for the latent output, with the generator having a mirrored architecture. The arrangement of units in the contextual recognizer was: 38 (input), 20, 10 and 2 units for the latent output, with the generator having 4, 7 and 10 units in the output layer. Note that, similarly to the previous experiment, the aforementioned architecture is likely not optimal and was chosen based on the personal experience of the author for illustrative purposes.

5.2.4. Metrics
In this case it is intended to evaluate the model's robustness to contextual anomalies and novelties. To do so the following method was applied: a threshold was set so that the number of anomalies detected by the model in the test data set is roughly 1%; then 10,000 normal observations were randomly selected and transformed by scaling and offsetting a randomly chosen subset of the attributes element-wise, where $scale \sim U(-., .5)$ and $offset \sim U(-., .)$.

Table 6: Summary of the number of anomalies detected in the noisy data sets.
The results demonstrate that the algorithm is robust to anomalies and novelties in the contextual data attributes whilst maintaining sensitivity to anomalies in the behavioral space. It is notable that even when the entire set of contextual attributes is transformed in data set Fc, fewer anomalies are reported than for data set Dx, where only two behavioral attributes are corrupted with noise.
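The perturbation procedure used in this evaluation can be sketched as follows. The exact uniform bounds for the scale and offset are only partially legible in the source, so they are passed in as parameters rather than hard-coded, and the multiplicative form `x * scale + offset` is an assumption about how the transformation is applied.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_attributes(X, attr_idx, scale_range, offset_range):
    """Corrupt the chosen attribute columns of X element-wise with a random
    scale and offset drawn from the given uniform ranges; the remaining
    columns are left untouched."""
    X_noisy = np.array(X, dtype=float, copy=True)
    n = X_noisy.shape[0]
    for j in attr_idx:
        scale = rng.uniform(*scale_range, size=n)
        offset = rng.uniform(*offset_range, size=n)
        X_noisy[:, j] = X_noisy[:, j] * scale + offset
    return X_noisy
```

The robustness check then compares the number of detections the fitted model reports on the clean and corrupted copies of the selected normal observations.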
6. Conclusion
In this paper a novel algorithm for contextual anomaly detection is presented, together with a novel ANN architecture comprised of multiple cross-linked VAEs that model directed graphical distribution models for generative processes. The algorithm performs well in the test scenarios and is robust to contextual anomalies and novelties.

7. Acknowledgements
This research was supported by the Water Corporation of Western Australia. I gratefullyacknowledge my colleagues from the Water Corporation for access to infrastructure and for theircooperation, which greatly assisted the research.
References

[1] C. C. Aggarwal. Outlier Analysis. Springer Publishing Company, Incorporated, 2013.
[2] J. An and S. Cho. Variational autoencoder based anomaly detection using reconstruction probability. 2015.
[3] Ç. Aytekin, X. Ni, F. Cricri, and E. Aksu. Clustering and unsupervised anomaly detection with L2 normalized deep auto-encoder representations. CoRR, 2018.
[4] J. Bayer and C. Osendorfer. Learning stochastic recurrent networks. CoRR, 2014.
[5] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
[6] P. Branco, L. Torgo, and R. P. Ribeiro. A survey of predictive modelling under imbalanced distributions. CoRR, 2015.
[7] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. SIGMOD Rec., 29(2):93–104, May 2000.
[8] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Computing Surveys, 41(3), 2009.
[9] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, Sep 1995.
[10] J. Davis and M. Goadrich. The relationship between precision-recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning, ICML '06, pages 233–240, New York, NY, USA, 2006. ACM.
[11] M. Goldstein and S. Uchida. A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLoS ONE, 11(4):1–31, 2016.
[12] C. Hong and M. Hauskrecht. Multivariate conditional outlier detection and its clinical application. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, pages 4216–4217, 2016.
[13] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, 2014.
[14] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. CoRR, 2013.
[15] J. Liang and S. Parthasarathy. Robust contextual outlier detection: Where context meets sparsity. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management, CIKM '16, 2016.
[16] F. T. Liu, K. M. Ting, and Z.-H. Zhou. Isolation-based anomaly detection. ACM Trans. Knowl. Discov. Data, 6(1):3:1–3:39, Mar. 2012.
[17] K. Sohn, H. Lee, and X. Yan. Learning structured output representation using deep conditional generative models. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 3483–3491. Curran Associates, Inc., 2015.
[18] M. Sölch, J. Bayer, M. Ludersdorfer, and P. van der Smagt. Variational inference for on-line anomaly detection in high-dimensional time series. CoRR, 2016.
[19] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Transactions on Knowledge and Data Engineering, 19, 2007.
[20] H. Xu, W. Chen, N. Zhao, Z. Li, J. Bu, Z. Li, Y. Liu, Y. Zhao, D. Pei, Y. Feng, J. Chen, Z. Wang, and H. Qiao. Unsupervised anomaly detection via variational auto-encoder for seasonal KPIs in web applications. In Proceedings of the 2018 World Wide Web Conference, WWW '18, pages 187–196, 2018. International World Wide Web Conferences Steering Committee.