Detecting Novel Processes with CANDIES -- A Holistic Novelty Detection Technique based on Probabilistic Models
Christian Gruhl, Bernhard Sick
Abstract — In this article, we propose CANDIES (Combined Approach for Novelty Detection in Intelligent Embedded Systems), a new approach to novelty detection in technical systems. We assume that in a technical system several processes interact. If we observe these processes with sensors, we are able to model the observations (samples) with a probabilistic model, where, in an ideal case, the components of the parametric mixture density model we use correspond to the processes in the real world. Eventually, at run-time, novel processes emerge in the technical system, such as in the case of an unpredictable failure. As a consequence, new kinds of samples are observed that require an adaptation of the model. CANDIES relies on mixtures of Gaussians, which can be used for classification purposes, too. New processes may emerge in regions of the models' input spaces where few samples were observed before (low-density regions) or in regions where many samples were already available (high-density regions). The latter case is more difficult, but most existing solutions focus on the former. Novelty detection in low- and high-density regions requires different detection strategies. With CANDIES, we introduce a new technique to detect novel processes in high-density regions by means of a fast online goodness-of-fit test. For detection in low-density regions we combine this approach with 2SND (Two-Stage Novelty Detector), which we presented in preliminary work. The properties of CANDIES are evaluated using artificial data and benchmark data from the field of intrusion detection in computer networks, where the task is to detect new kinds of attacks.
Index Terms — Novelty Detection, Gaussian Mixture Models, CANDIES, Online Goodness-of-Fit
1 INTRODUCTION
Today, so-called "smart" or "intelligent" technical systems are often equipped with abilities to act in real environments that are termed "dynamic" in the sense that their characteristics are time-variant (change over time). But typically, knowledge about the basic nature of these changes is built into these systems, and it is assumed that only the time when these changes occur cannot be predicted. Future systems, however, have to evolve over time. Not all knowledge about the situations the system will face at run-time will be available at design-time. That is, the system has to detect and react to fundamental changes in time-variant environments. As
C. Gruhl and B. Sick are with the University of Kassel, Department of Electrical Engineering and Computer Science, Wilhelmshoeher Allee 73, 34121 Kassel, Germany (email: cgruhl,[email protected]).

an example, we may consider technical systems that make observations of their environment with sensors and classify these observations (samples). In a time-invariant environment, there may be different causes for different kinds of observations (called "processes" in the following). The data are modeled (e.g., by means of probabilistic models such as Gaussian mixtures) and then a classifier is built (e.g., by gradually assigning components of the Gaussian mixture to classes). At run-time, the data model and the classifier are not adapted. An example of such a system could be a machine that produces various parts (cf. left part of Figure 1). A process in this hypothetical environment would correspond to the production of a certain part. If the system is monitored with multiple sensors (e.g., S1 and S2), the resulting sensor signals span a two-dimensional input space (as shown in the middle part of Figure 1). Each value pair is called an observation or sample. With suitable machine learning techniques we can approximate the resulting distribution of samples. On the right side of Figure 1 a Gaussian mixture model (GMM, cf. Section 4.1) is used to approximate, and thus model, the sample distribution. Ideally, each component of the GMM describes a physical process in the environment. Reasons to rely on GMM for this purpose are that arbitrary continuous densities can be approximated by GMM (with any desired precision, based on the number of components) and the generalized central limit theorem, which states that the sum of i.i.d. random samples tends to be normally distributed (assuming that the variance is finite). In technical systems this is frequently the case, since observed sensor values are often the outcome of various random parameters that influence each other.
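The modeling step described above can be sketched as follows; the three synthetic clusters, the sensor layout, and all parameter choices are illustrative, not the paper's actual setup. Like the paper, the sketch uses a variational Bayesian GMM (here via scikit-learn's BayesianGaussianMixture).

```python
# Illustrative sketch: approximating a two-sensor sample distribution with a
# variationally trained GMM, as in the machine-monitoring example.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Three hypothetical processes, each producing a cluster of (S1, S2) readings.
samples = np.vstack([
    rng.normal(loc=[-1.5, -1.5], scale=0.3, size=(200, 2)),
    rng.normal(loc=[0.0, 1.5], scale=0.3, size=(200, 2)),
    rng.normal(loc=[1.5, -0.5], scale=0.3, size=(200, 2)),
])

gmm = BayesianGaussianMixture(n_components=3, covariance_type="full",
                              random_state=0).fit(samples)
# Ideally, each fitted component corresponds to one physical process.
print(np.round(gmm.means_, 1))
```

Ideally the three fitted component means land near the three cluster centers, illustrating the correspondence between components and processes.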
In a time-variant environment, there may eventually be some conspicuous samples (cf. Fig. 2). Then (if the system is able to detect such a situation), some questions come up: Are these samples outliers of existing processes or not? If not, is there an anomaly in the observed environment, or did a new process emerge that was unknown at design-time? And how can we build novelty detection methods to identify such novel processes? How can we adapt the data model to suit the changed environment, and when (in order to find a trade-off between fast and accurate reaction)? And if there are new model components, to which class do we have to assign them?

Fig. 1. Hypothetical scenario of a monitored machine. On the left: abstract machine with monitoring sensors S1 and S2. In the middle: the two-dimensional input space consisting of measured sensor signals from S1 and S2. In this case, the outcomes of three different processes are gathered in three clusters. On the right: approximated density model; the ellipses are called components and correspond to multivariate Gaussians that represent the physical processes.

Fig. 2. A situation with samples produced by three processes (represented by three components and marked in red) and a model of the situation with three components. Some observations appear in a low-density region and are "suspicious"; they may indicate that a novel process currently emerges.
With CANDIES (Combined Approach for Novelty Detection in Intelligent Embedded Systems), some of these questions can be answered. One major challenge is the reliable detection of (possibly multiple) novel processes in the complete input space. Hereby we assume that the input space is divided into two parts: 1)
High-density regions (HDR): These are regions that are already covered by one or more components of the mixture model (i.e., the support of the kernels, in our case Gaussians, is high). This implies that normal observations are expected to appear in these regions, thus forming the normal model. However, new processes might also emerge here (e.g., "close" to, or between existing components, or even
Fig. 3. Same situation with samples produced by three processes (represented by three components and marked in red) and corresponding model. Some observations (green circles ◦ and blue crosses +) in the high-density region covered by the model components are the outcome of two not yet known processes and should be considered to be "suspicious".

totally overlapping these) and therefore change the characteristics of the approximated density. We assume that HDR can be considered to be spatially compact in the input space, and that they contain the majority of the overall density mass.

2) Low-density regions (LDR): These regions are distant from any component centers, resulting in a low support of the kernels. Thus, normal data is not expected to be observed here, and observations appearing here are considered to be suspicious. In contrast to HDR we assume that LDR are widely spread in the input space and that usually only a single LDR exists (i.e., one not separated by HDR).

The transition between HDR and LDR is not strictly defined and is application dependent. Due to their different characteristics, different problems have to be solved to detect novel processes. Since LDR have a potentially infinite support, the main difficulty is to efficiently find spatial relations (i.e., clusters) between suspicious observations. On the other hand, for HDR, two issues must be addressed: 1) which observations are assumed to be normal (outcome of an already known and modeled process) and which are suspicious (i.e., outcome of a novel process, or anomalies), and 2) when a novel process is present.

In a preliminary article (see [15]), we presented 2SND, an approach to solve the novelty detection problem sketched above for situations where novel processes start to "generate" data in LDR of a probabilistic knowledge model (based on Gaussian mixtures). Figure 2 depicts such an exemplary scenario (where a novel process is emerging in a LDR).
To detect novelty in HDR, CANDIES relies on a new approach that is premised on statistical goodness-of-fit testing (i.e., measuring how well observed samples fit the assumed distribution), adjusted to suit Gaussian mixture models (GMM, cf. Section 4.1) and online environments. Figure 3 shows a different situation, where two novel processes started to "generate" samples in a HDR but are not yet represented in the current model.

Altogether, it is possible to address a specific kind of time-variance in the observed environment, which is useful for many applications. We may imagine other kinds of situations where processes disappear (obsoleteness) or change some basic parameters (concept shift or concept drift). Our current research addresses these situations as well.

The remainder of this article is organized as follows: Section 3 gives a broad overview of related work, including other common novelty detection techniques, related topics, and the distinctions to CANDIES. Section 4 briefly summarizes methodical foundations essential for this article. Preliminary to the technical in-depth details, a simplified overview of the idea behind the proposed technique is given in Section 2. The main body, introducing CANDIES in detail, is contained in Section 5. In Section 6 a small case study based on the KDD Cup 99 Computer Intrusion data set is presented. Finally, a conclusion and outlook to future work is given in Section 7.

2 OVERVIEW OF CANDIES
With CANDIES we aim at three main goals: 1) detecting clusters of suspicious samples (i.e., those that differ notably from what is expected); 2) detecting such clusters in the complete input space, that is, in LDR and in HDR; 3) using the discovered clusters to model new processes.

The algorithm consists of multiple detectors for HDR and a single one for the LDR. It works (simplified) in the following manner. The foundation of the whole approach is a GMM that provides a density estimate of the expected data. An advantage of GMM is that they can easily be extended to a classifier and that they belong to the family of generative models; thus additional structural information about the expected observations can be deduced (in contrast to discriminative classifiers, e.g., SVM). At first, a new sample x′ is located either in a HDR or in a LDR. Depending on the location it is marked as normal (located in HDR) or suspicious (located in LDR) (suspicious is what comparable algorithms mark as novelties). Depending on that decision, either the LDR detector or one of the HDR detectors is responsible for handling the new sample. If the sample is marked as suspicious, it is stored in a ring buffer on which a nonparametric clustering is performed. If a cluster in the buffer reaches a certain size, the detector will report the detection of a novel process. Otherwise, when the sample is regarded as normal, it is used to update one of the HDR detectors (there is one HDR detector for each individual component of the GMM); the decision which detector is updated is made at random. The HDR detector works by testing how well the last m samples fit the estimated Gaussian bell. This is done by using a χ² test. If the t-value exceeds the critical value, the detector reports the detection of a novel process.

3 RELATED WORK
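The simplified dispatch logic described above might be sketched as follows. The class name, parameters, and return convention are our own illustrative choices; the per-component HDR detectors and the buffer clustering are omitted here, only the routing of a sample into HDR or LDR is shown.

```python
# Illustrative sketch of the CANDIES dispatch step: a new sample is marked
# 'normal' (inside some component's HDR) or 'suspicious' (in the LDR) and
# routed accordingly.
from collections import deque
import numpy as np
from scipy.stats import chi2

class CandiesSketch:
    def __init__(self, means, inv_covs, alpha=0.95, buffer_size=100):
        self.means = means              # component centers of the GMM, shape (J, D)
        self.inv_covs = inv_covs        # precision matrices, shape (J, D, D)
        D = means.shape[1]
        # Squared-Mahalanobis threshold separating HDR from LDR.
        self.rho = chi2.ppf(alpha, df=D)
        self.ldr_buffer = deque(maxlen=buffer_size)  # ring buffer for suspicious samples

    def mahalanobis_sq(self, x):
        d = self.means - x
        return np.einsum('ji,jik,jk->j', d, self.inv_covs, d)

    def process(self, x):
        dists = self.mahalanobis_sq(x)
        if dists.min() <= self.rho:
            # Normal: would update the HDR detector of a matching component.
            return ('normal', int(dists.argmin()))
        self.ldr_buffer.append(x)       # suspicious: handed to the LDR detector
        return ('suspicious', None)

det = CandiesSketch(means=np.array([[0.0, 0.0]]), inv_covs=np.array([np.eye(2)]))
print(det.process(np.array([0.1, 0.1]))[0])   # near the center: routed as 'normal'
```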
The main task of a novelty detector is to distinguish whether a previously unseen sample belongs to a normal model or whether it is different in some way, so that it does not belong to the normal data and is therefore novel. Closely related to the topic are the fields of anomaly and outlier detection. Over a decade ago it was sufficient to roughly group novelty detection approaches into two classes: either statistical (cf. [24]) or neural network based (cf. [25]).

Most of the statistical approaches rely on a model of the processed data. Observations are identified as (potentially) novel if they differ too much from what is expected, e.g., as described by an appropriate model. Further, these approaches can be discerned based on the models they are using – either parametric or nonparametric. Novelty detection techniques based on nonparametric density modeling are, for example, those using k-nearest neighbors approaches or kernel density estimators; see [37] for a sample application in intrusion detection. Parametric models, on the other side, make assumptions about the distribution of the observed samples, e.g., Gaussian mixture models. In preliminary work [12] we detect novelty based on a parametric Gaussian mixture model and a state variable which monitors how well the observations fit the model. That approach is used for comparison to CANDIES and briefly presented in the case study in Section 6.

The second group comprises detection techniques that are based on neural networks, e.g., multi-layer perceptrons or radial basis function neural networks [4], [6], but, according to Markou and Singh [25], also includes methods based on support vector machines, e.g., One-Class SVM as described by Tax and Duin [35].

Since the early 2000s, the topic has drawn much attention as an objective of research and changed considerably (i.e., new ranges of applications or whole new techniques, due to advances in computing power).
A more recent survey [29] suggests five different categories to group novelty detection approaches: i) probabilistic, ii) distance-based, iii) reconstruction-based, iv) domain-based, and v) information theoretical.

The first category covers a large part of the approaches that were previously affiliated with the statistical group. Typically, these techniques are built upon a parametric density estimate of training data as a model. Frequently used are mixtures of Gaussians ([12], [20], [38], for instance). Novelty is usually detected if samples are observed in low-density regions (i.e., the density for the observed sample is below a selected threshold). Several methods to define a threshold are based on Extreme Value Theory (EVT, cf. [8], [18], [32]). The idea in EVT is to estimate the distribution of extreme values (i.e., maximum or minimum for legit samples) for a given density model and a given sample size. Then, samples that exceed the expected maximum or surpass the expected minimum are identified as novel. Recently, Extreme Learning Machines with decision making depending on EVT were proposed by Al-Behadili et al. [2] to implement incremental semi-supervised learning based on novelty detection. Thus, probabilistic approaches are not limited to generative models, cf. [10], for example, where Support Vector Machines are used for detection and the resulting novelty values are calibrated in order to be interpreted as class-conditional probabilities.

To the second category belong approaches that are based on distances. Popular representatives of this category are approaches based on k-nearest neighbors (knn), e.g., [7] or [17], [27], where the latter use the density of a k-neighborhood (i.e., a radius required to enclose k neighbors) to identify novel samples. A sample is novel if its neighborhood density is considerably lower than the density of its neighbors. Clustering-based approaches also belong to category ii).
Typically, normal samples are aggregated to form clusters; novelty is then determined by the minimal distance of an unseen sample to any centroid (e.g., [33], [36]). It is questionable whether categories i) and ii) are sharply differentiable. Gaussian mixture models, for example, consist of multiple location-invariant kernels, and the density finally depends greatly on the applied distance measure.

Our new CANDIES approach does not fit into a single category but is a hybrid in the sense that it belongs to the first two categories: probabilistic and distance-based. For a detailed summary of the remaining categories iii), iv), and v) cf. [29].

However, most of the introduced paradigms are designed to spot only single samples as novelties and do not relate those samples to one another. Thus, potential new knowledge (structural information in form of a cluster, that is, evidence of a novel physical process) is unexploited and discarded. In some common applications such as medical condition monitoring [9], [31], [34] or machinery monitoring [30], this is not a real drawback, since anomalies might arise everywhere in the input space and are very specifically tied to a concrete application (i.e., monitoring a specific patient or a specific engine). But in other fields, such as network intrusion detection, this discovered knowledge has great potential to be used to detect future attacks.

The contributions of this article are:

1) CANDIES is trimmed to detect novel processes (clusters of suspicious observations, cf. knowledge) in such a way that the process can easily be integrated as a new component into the existing GMM. This leads to the result that learning does not only happen in an isolated training phase, off-line at design-time, but is also conducted at run-time [16].

2) Novelty is not only detected in low-density regions (where normal observations are not likely to appear), but also in high-density regions, i.e., where normal observations are expected.

4 METHODICAL FOUNDATIONS
For the purpose of a self-contained article, we briefly recap the most important techniques used to implement our approach. This includes an overview of Gaussian mixtures, a method to extend those to a classifier, a short introduction to nonparametric density estimation as a foundation for cluster analysis, and a brief description of statistical goodness-of-fit testing.
4.1 Gaussian Mixture Models

One frequently used approach to generative modeling is the Gaussian mixture model (GMM), that is, a superposition of multiple multivariate normal distributions (denoted as N and commonly referred to as Gaussians) with mixing coefficients π_j (Eq. (1)). Each Gaussian is called a component and has its own set of parameters, which are the mean vector µ_j ∈ R^D and a covariance matrix Σ_j (with dim(Σ_j) = D × D) that describes its shape. The mixing coefficients π_j (with constraints Σ_{j=1}^{J} π_j = 1, π_j ∈ R⁺) ensure that the resulting p(x) (x ∈ R^D is the random variable) still fulfills the requirements of a density function. They may also be interpreted as priors for each component (i.e., the probability that an unobserved sample is generated by the corresponding component). Altogether, we get:

  p(x) = Σ_{j=1}^{J} π_j · N(x | µ_j, Σ_j).   (1)

Fig. 4. (a) Training set with classes green circle ◦ and blue cross +. The resulting GMM is trained in an unsupervised manner and models the density with three components. (b) Model with class conclusions extended to a classifier. The thick black line is the decision boundary that divides the input space into decision regions. Each × denotes the center µ_j of the j-th component, while each ellipse represents the shape defined by the j-th covariance matrix Σ_j. The distance between µ_j and the associated ellipse, which is a constant density surface, corresponds to a Mahalanobis distance of 1.

An ordinary GMM models only the density of an associated training set and can be trained in an unsupervised manner (i.e., labels are not required). Since the sufficient statistics for the components cannot be computed in closed form, we pursue this goal with an expectation-maximization (EM) like approach that uses 2nd order (or hyper-) distributions and is heavily based on variational Bayesian inference (VI).
An extensive introduction to VI is given by [5]. For clarification, a trained GMM for a two-dimensional data set is shown in Figure 4(a). Relying on VI gives rise to two advantages: (1) prior knowledge about the data can be included, which is especially valuable in real-world applications, and (2) multiple GMM can be fused into one model as described in [14]. The final GMM is obtained from the expectations (a point estimate from the second-order distributions) of the hyper-distributions after the VI training finishes.

Since we assume a certain functional form of the underlying distribution and estimate its parameters, GMM are parametric density models.
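Eq. (1) can be evaluated directly once the parameters are known. The following sketch does so for an illustrative three-component mixture; all parameter values are made up for the example.

```python
# Illustrative sketch: evaluating the mixture density p(x) of Eq. (1).
import numpy as np
from scipy.stats import multivariate_normal

pi = np.array([0.5, 0.3, 0.2])                       # mixing coefficients, sum to 1
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0]), np.array([-1.5, 1.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2), np.diag([0.3, 0.8])]

def gmm_density(x):
    # p(x) = sum_j pi_j * N(x | mu_j, Sigma_j)
    return sum(p * multivariate_normal.pdf(x, mean=m, cov=S)
               for p, m, S in zip(pi, mus, Sigmas))

print(gmm_density(np.array([0.0, 0.0])))
```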
4.2 Classification

To derive a classifier h(x) from the trained density model p(x), we estimate the class posteriors p(c|x) in a second, supervised (i.e., with respect to class labels) iteration. The classification of a given sample x is then done, as shown in Eq. (2), by selecting the maximum a-posteriori (MAP) of the class probabilities:

  h(x) = argmax_c {p(c|x)},   (2)

with

  p(c|x) = Σ_{j=1}^{J} p(c|j) · p(j|x) = Σ_{j=1}^{J} ξ_{j,c} · γ_{x,j},   (3)

where

  γ_{x,j} = π_j N(x | µ_j, Σ_j) / Σ_{j'=1}^{J} π_{j'} N(x | µ_{j'}, Σ_{j'}),   (4)

  ξ_{j,c} = (1/N_j) Σ_{x_n ∈ X_c} γ_{x_n,j}.   (5)

Eq. (4) shows the responsibilities γ_{x,j}, which are the probability that a given sample x was generated by the j-th component. For each component j and class c, the conclusion is determined by Eq. (5), which is the fraction of all responsibilities for samples x_n ∈ X_c that are labeled with class c and the effective number of samples (denoted as N_j = Σ_{n=1}^{N} γ_{x_n,j}) belonging to the j-th component (X is the overall set of labeled samples, X_c the subset of X associated with class c).

Finally, the class posteriors p(c|x) given in Eq. (3) are a composition of the responsibilities γ_{x,j} and the class conclusions ξ_{j,c}. The resulting decision boundary, which describes the classifier for the previously estimated density model, is shown in Figure 4(b).

4.3 Nonparametric Density Estimation

Rather than assuming a specific functional form as parametric methods do, nonparametric techniques provide a point estimate for the density p(x) at a given point x. One well-known nonparametric method is the Parzen window (or kernel) density estimator, here with Gaussian kernel:

  p(x) = 1/(N · h^D) Σ_{n=1}^{N} k((x − x_n)/h).   (6)

It is a sum over a finite set of N samples x_n of an underlying training set to which an appropriate kernel function is applied. The kernel is placed at the point x where the density should be estimated.
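Eqs. (2)-(5) can be sketched for a toy two-component, two-class mixture. The mixture parameters and the conclusion matrix ξ below are illustrative values, not learned from data.

```python
# Illustrative sketch of MAP classification per Eqs. (2)-(5): responsibilities
# gamma_{x,j} (Eq. 4) weighted by class conclusions xi_{j,c} (Eq. 5).
import numpy as np
from scipy.stats import multivariate_normal

pi = np.array([0.6, 0.4])
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.eye(2)]
# Conclusions xi[j, c]: component 0 mostly concludes class 0, component 1 class 1.
xi = np.array([[0.9, 0.1],
               [0.2, 0.8]])

def responsibilities(x):
    # Eq. (4): gamma_{x,j}, normalized over all components
    num = np.array([p * multivariate_normal.pdf(x, mean=m, cov=S)
                    for p, m, S in zip(pi, mus, Sigmas)])
    return num / num.sum()

def classify(x):
    # Eqs. (2)-(3): argmax_c sum_j xi[j, c] * gamma_{x,j}
    return int(np.argmax(responsibilities(x) @ xi))

print(classify(np.array([0.0, 0.0])), classify(np.array([3.0, 3.0])))  # 0 1
```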
The parameter h is a smoothing factor that controls how smooth the estimation is, while D is the number of dimensions. Closely related to the Parzen window are histograms (cf. [5]).

The DBSCAN clustering algorithm (cf. [11]) uses a density estimate that is quite similar to a Parzen window estimator. Based on the density at each sample, the algorithm decides whether a sample belongs to a cluster, lies at its edge, or is outside of it (in that case it is considered as noise). To do so, the kernel in Eq. (7):

  k(x) = 1 if dist(x, 0) ≤ ε, 0 otherwise   (7)

is used, which forms a D-dimensional sphere around the point x with radius ε. Typically, dist is realized with a Euclidean metric. If a sample is part of a cluster, all samples inside the sphere are also assigned to the same cluster. The advantage of this approach is that clusters of arbitrary shapes can be identified.

4.4 Goodness-of-Fit Testing

To validate whether an observed sample matches a hypothesized distribution or not, goodness-of-fit tests can be applied. A fast and reliable method is Pearson's chi-squared (χ²) test [28]. The test compares observed frequencies of mutually exclusive events (a finite set of possible outcomes/values of a discrete random variable) against the expected theoretical frequencies (obtained from a suitably fitted distribution) of these events. The test statistic (or t-value) is calculated by:

  t = Σ_{i=1}^{k} (x_i − e_i)² / e_i,   (8)

where x_i is the observed frequency of event i and k is the total number of different events. The expected frequencies of events i are given by e_i:

  e_i = P_fit(i | Θ),   (9)

where P_fit is the fitted distribution. The test aggregates the squared deviations between observed and expected frequencies and weights them by the expected frequency.

Fig. 5. Distribution of t-values for a χ² test with k = 11 degrees of freedom. The marked region at the right is the rejection area for a significance level of α = 5%.
The beginning of the rejection region is equal to the critical value χ²_{upper, α}.

This leads to stronger penalties when only small frequencies are expected. To accept or reject the null hypothesis that the sample is drawn from the hypothesized distribution, the t-value must be less than the critical value:

  χ²_{upper, α, k} = F⁻¹_{χ²_ν}(1 − α).   (10)

The critical value is calculated by evaluating the inverse cumulative distribution function F⁻¹ of the χ² distribution with ν degrees of freedom at point 1 − α, where α is the significance level, which implies that the error rate for type I errors is at most α (often α = 5% or 1%). The degrees of freedom ν are given by the number of events minus the number p of covariate parameters Θ of the fitted density P_fit (e.g., p = 2 for a univariate Gaussian with Θ = (µ, σ)). Figure 5 emphasizes the relation between the χ² distribution of t-values and the critical value χ²_{upper} for a significance level of α = 5%.

5 ONLINE NOVELTY DETECTION
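As a small worked illustration of Eqs. (8) and (10), the following sketch tests illustrative die-roll counts against a uniform distribution, with SciPy supplying the critical value.

```python
# Illustrative sketch of Pearson's chi-squared goodness-of-fit test,
# Eqs. (8) and (10): are these (made-up) 120 die rolls consistent with a fair die?
import numpy as np
from scipy.stats import chi2

observed = np.array([18, 22, 21, 19, 24, 16])        # x_i, N = 120 rolls
expected = np.full(6, 120 / 6)                       # e_i under the fitted model

t = ((observed - expected) ** 2 / expected).sum()    # Eq. (8): t = 2.1
alpha, dof = 0.05, len(observed) - 1                 # no parameters estimated here
critical = chi2.ppf(1 - alpha, df=dof)               # Eq. (10): about 11.07

print(t <= critical)   # True: no reason to reject the hypothesized distribution
```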
This section is split into three parts: the first part is based on our previous work [15] and discusses novelty detection and reaction in LDR with 2SND. The second part treats novelty detection in HDR with online-capable χ² goodness-of-fit tests. The last part then introduces CANDIES, a detector which is able to detect novelties in the whole input space by combining both previously mentioned techniques. All techniques share the property of being applicable in online environments (i.e., soft real-time).
5.1 Novelty Detection in LDR with 2SND

To detect novel processes in sparse LDR we developed the Two-Stage Novelty Detection (2SND) algorithm. The algorithm works on top of an existing GMM or CMM (as described in Section 4.2) and extends it with novelty detection capabilities. Further, with 2SND it is possible to update the underlying GMM/CMM and to enhance them by including components that model the detected novel processes.

The algorithm itself consists of two procedures: a main procedure 2SND (Alg. 1) and an auxiliary procedure PROPAGATE, which propagates the cluster id to all affiliated samples using a modified breadth-first search.

To detect novel processes, we propose a two-stage approach which identifies suspicious samples in the first stage and novel processes in the second. Each assessed sample is individually tested for how well it suits the current model by determining whether it resides in a high- (HDR) or low-density region (LDR). This is done by exploiting the fact that the squared Mahalanobis distances between samples from a Gaussian j and its mean µ_j:

  ∆²_j(x) = (x − µ_j)ᵀ Σ⁻¹_j (x − µ_j)   (11)

are χ²_D-distributed, where Σ⁻¹_j is the inverted covariance (or precision) matrix. With the quantile function F⁻¹_{χ²_D} of the χ²_D distribution, we can determine a squared Mahalanobis distance ρ = F⁻¹_{χ²_D}(α) such that a fraction α of samples (which belong to the Gaussian) have a smaller squared Mahalanobis distance to the mean than ρ. Figure 6 depicts the relationship for one- and two-dimensional Gaussians.

Separating the input space into HDR and LDR simplifies the model considerably. Legitimate samples are assumed to appear in the dense regions, while samples in the low-density regions are less likely to be observed. To detect novel processes in HDR, additional detectors are required (cf. Section 5.2), since 2SND focuses only on LDR. By selecting α we specify how much of the total probability mass is covered by HDR, thus defining the transition between HDR and LDR.
Samples with a Mahalanobis distance ∆²_j(x) ≤ ρ to at least one of the component centers µ_j are located within a HDR and therefore seen as not suspicious. Samples with a higher distance ∆²_j(x) > ρ to all centers are regarded as being suspicious. This constitutes the first stage of our novelty detection.

Figures 6(b) and (c) illustrate an exemplary situation for a GMM with a single component and different values of α. Observations inside the α-region (which is equal to a HDR) are depicted by circles ◦, while suspicious samples (located in a LDR) are shown as triangles △. The first stage is implemented in the first part of the procedure 2SND given in Alg. 1.

The second stage utilizes a density-based clustering approach. Each sample x′ that is identified as being suspicious is cached in a circular buffer B with size b̃. Based on the distance to the nearest neighbor of x′ in B, the algorithm decides if x′ belongs to an already existing cluster. This behavior depends on the kernel given in Eq. (7), with ε being the maximum distance between a sample x′ and its nearest neighbor, and it is implemented in the second part of the 2SND procedure given in Alg. 1. If the sample is associated with a cluster C, the cluster is extended to include all ε-neighbors (i.e., all buffered samples with a distance dist(x′, x_B) ≤ ε). This is achieved with the procedure PROPAGATE, which is basically a breadth-first search with constraints. In fact, expanding a cluster can lead to a merger of multiple clusters and, thus, create a much larger cluster.

Fig. 6. Relation between normal distribution and χ²_D-distributed distances. (a) A normal distribution with mean µ = 0 and variance σ² = 1. The darker green area shows the region where 66% of the probability mass is located; the combination of both areas corresponds to a mass of 90%. Since the squared distances are χ²-distributed, the radii of the areas are equal to the square root of the quantile function F⁻¹_{χ²_D} of the χ²_D distribution, which is √ρ ≈ 0.95 for the darker green area (blue line) and √ρ ≈ 1.64 for the combined area (green line). The marked areas are identical to the respective high-density regions. (b) Bivariate Gaussian with α-region and maximum Mahalanobis distance of √ρ = 2.14. (c) Bivariate Gaussian with α-region and maximum Mahalanobis distance of √ρ = 1.46. The dashed ellipses in (b) and (c) are level curves with a Mahalanobis distance of 1, while the black ellipses have a distance of √ρ = √(F⁻¹_{χ²_D}(α)) to their center. Samples displayed as red triangles △ are suspicious (potentially novel), while samples depicted as blue circles ◦ are not suspicious. High-density regions (HDR) are colored in blue, low-density regions (LDR) in red.

A novel process is detected as soon as a cluster C fulfills the adaptation criterion |C| ≥ minPts, which refers to the number of samples associated with the cluster C. In Section 5.3.2 a measure is proposed to represent the current amount of novelty in LDR in human-readable form.

The last part of the 2SND procedure is responsible for deciding whether a novel process exists in the monitored LDR and how the model is adapted. If a new process in the form of a cluster is identified, the underlying GMM needs to be updated. This is done by performing a VI training on all samples that are associated with the corresponding cluster. After that training step, the
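The second stage described above can be sketched as a simplified ε-linkage clustering on a ring buffer. The class name and parameter defaults are illustrative, and the full PROPAGATE breadth-first search of Alg. 1 is replaced here by a direct relabeling shortcut.

```python
# Illustrative sketch of the 2SND second stage: suspicious samples accumulate
# in a ring buffer and are linked into clusters via their epsilon-neighborhood,
# in the spirit of DBSCAN; a cluster of minPts samples signals a novel process.
import numpy as np
from collections import deque

class LdrDetector:
    def __init__(self, eps=0.5, min_pts=5, size=50):
        self.eps, self.min_pts = eps, min_pts
        self.samples = deque(maxlen=size)
        self.labels = deque(maxlen=size)   # cluster id per sample, -1 = noise
        self.next_id = 0

    def add(self, x):
        """Insert a suspicious sample; return a cluster id if a novel
        process is detected, else None."""
        x = np.asarray(x, dtype=float)
        if self.samples:
            d = np.linalg.norm(np.asarray(self.samples) - x, axis=1)
            near = [i for i, di in enumerate(d) if di <= self.eps]
        else:
            near = []
        ids = {self.labels[i] for i in near if self.labels[i] != -1}
        if not near:
            label = -1                     # isolated: noise for now
        elif not ids:
            label = self.next_id           # start a new cluster with its neighbors
            self.next_id += 1
            for i in near:
                self.labels[i] = label
        else:
            label = min(ids)               # join (and possibly merge) clusters
            for i, l in enumerate(self.labels):
                if l in ids:
                    self.labels[i] = label
            for i in near:
                self.labels[i] = label
        self.samples.append(x)
        self.labels.append(label)
        if label != -1 and list(self.labels).count(label) >= self.min_pts:
            return label                   # adaptation criterion |C| >= minPts met
        return None
```

Feeding five closely spaced samples into `LdrDetector(eps=0.5, min_pts=5)` yields `None` four times and then a cluster id, mirroring how a novel process is reported only once enough suspicious samples have accumulated.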
novel process is represented by another GMM. To update the model, we exploit the properties of the hyper-distributions and use a fusion technique proposed in [14]. To fuse two given GMM, we measure the pairwise divergence between each component and fuse only those which exceed a given threshold (0.5). In this case, we may assume that both components model the same process. This might happen if a process emerges close to the border of the α-region (which also separates LDR from HDR). As divergence measure the Hellinger distance is used (cf. [3], [19]). The actual fusion combines the hyper-parameters of both components. Non-overlapping components are simply inserted into the existing model. In each case the hyper-distributions of the mixing coefficients must be adjusted, such that the mixing coefficients of the combined GMM still form a distribution.

Algorithm 1 2SND
Input: sample x′; parameters α, ε, minPts, b̃
Global: model M, buffer B
  Initialize ρ = F⁻¹_{χ²_D}(α).
  {Stage 1}
  for all components j in M do
    if ∆²_j(x′) ≤ ρ then
      {The observation is not suspicious}
      return classification of x′ based on M.
    end if
  end for
  if |B| = b̃ then
    Remove oldest sample from buffer B.
  end if
  Add suspicious sample x′ to buffer B.
  {Stage 2}
  Find nearest neighbor nn_x′ of x′.
  if dist(x′, nn_x′) ≤ ε then
    if nn_x′ belongs to noise cluster then
      Create new cluster C_new with samples x′ and nn_x′.
      C = C_new
    else
      Assign x′ to the same cluster C_nn as nn_x′.
      C = C_nn
      PROPAGATE C to ε-neighborhood of x′.
    end if
    {Process detected – model adaptation}
    if |C| ≥ minPts then
      Train GMM M_novel of process C with VI.
      Update M and fuse it with M_novel.
      Remove C and delete all samples of C from B.
      return classification of x′ based on updated M.
    end if
  end if
  return classification of x′ based on M.
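The fusion criterion relies on the Hellinger distance between Gaussian components, which has a closed form for multivariate normals. The sketch below computes it for two illustrative components; note the text applies the threshold of 0.5 to the fusion decision, while the convention here (small distance = similar components, candidates for fusion) is our own reading.

```python
# Illustrative sketch: closed-form Hellinger distance between two Gaussian
# components, as used for the pairwise fusion decision.
import numpy as np

def hellinger_gaussians(mu1, S1, mu2, S2):
    Sbar = (S1 + S2) / 2.0
    diff = mu1 - mu2
    # Bhattacharyya coefficient for two multivariate normals
    bc = (np.linalg.det(S1) ** 0.25 * np.linalg.det(S2) ** 0.25
          / np.sqrt(np.linalg.det(Sbar))) * np.exp(
          -0.125 * diff @ np.linalg.solve(Sbar, diff))
    return np.sqrt(max(0.0, 1.0 - bc))   # H in [0, 1]; 0 = identical components

mu_a, S_a = np.array([0.0, 0.0]), np.eye(2)
mu_b, S_b = np.array([0.2, 0.0]), np.eye(2)
# Nearly identical components -> small distance -> candidates for fusion.
print(hellinger_gaussians(mu_a, S_a, mu_b, S_b) < 0.5)  # True
```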
After this fusion step, the model is adapted to the changes in its environment, and all samples belonging to the cluster C are removed from buffer B. If the updated model is used as the basis for a classifier, the class label (conclusion) for the new component must be determined. Possible solutions to this problem are the involvement of a human domain expert, if meaningful labels are required, or the automatic generation of new, unique labels. As investigated in [12], it is also possible to exchange knowledge with other systems so that a novel process can be detected faster by another system. This kind of behavior is especially interesting for cyber-physical systems that share knowledge about their environment. For clarification, an exemplary scenario that highlights the important operational phases of the new approach is shown in Figure 7.

The method presented above detects suspicious samples in LDR only. Observations in HDR, however, are always considered to be normal, making this approach unable to detect novel processes there. For the moment we focus on the detection of overlapping processes for single components and later extend the idea to GMM (where multiple components are present). The difficulty in detecting novel processes in dense regions is that, without knowing a sample's affiliation (which in this case is a latent variable), we cannot decide whether it is the legitimate outcome of a known process (i.e., the existing component) or of an unknown overlapping process. We therefore make use of a sliding window that keeps track of the last ω observed samples, no matter whether they are actually novel or not. Clearly, if a novel process is present (one that deviates at least in its mean or covariance from the existing component), the observed sample population (i.e., the content of the sliding window) will no longer match the distribution described by the component, and a noticeable difference between population and component must be measurable.
Due to their high computational complexity, divergence measures such as the Kullback-Leibler divergence [22] or the Hellinger distance [19] are intractable for these measurements, especially if the input space is of high dimensionality. To tackle this problem, we do not measure the divergence between the sliding window and the existing component directly; instead, we test how well the distances between the samples in the buffer and the component's center fit the expected distribution (a χ² distribution, as stated in Section 5.1). This task can be performed with the χ² test described in Section 4.4. Since the test is performed on the sliding window, the approach is suitable for online environments.

One of the requirements of the χ² goodness-of-fit test is that the different events must be mutually exclusive. Therefore, the continuous distance density must be transformed into a discrete one. Since any distribution

(a) Initial training set with samples from two different classes, green circles ◦ and blue crosses +. The density model is trained with VI and extended to a classifier as described in Section 4.2. (b) Resulting initial GMM with two components after VI training. The black line is the combination of the decision boundary and the α-regions. Samples that appear in the outer (cyan colored) LDR are identified as suspicious. (c) Situation after the observation of potentially novel samples. Different symbols represent samples of the same cluster, while blue triangles △ are samples not yet assigned to a cluster. (d) After the appearance of some more suspicious samples, the cluster in the upper center reached a certain size and is considered to be a novel process. Its samples are isolated and used to train a parametric model with VI.
(e) After the VI training converges, the novel process is represented by a GMM which consists of a single component. The newly acquired knowledge will be fused with the initial model shown in (b). (f) Updated GMM and classifier after the novel process is integrated. The updated decision regions are shown as well. The red component and region correspond to the novel process.

Fig. 7. Illustration of the proposed technique. In the training set only samples of two processes are present. In the operational phase a third process emerges and starts to generate samples. After enough potentially novel samples are observed, the model and the classifier are updated.

is normalized (i.e., ∫ p(x) dx = 1) and, furthermore, distances (here x) are always strictly positive, it is quite easy to transform the density of distances into a discrete, uniform distribution. Choosing a finite uniform distribution with λ events (also called cells or buckets) brings several advantages, including constant-time calculation of the expected frequencies:

unif_λ(i) = 1/λ, 1 ≤ i ≤ λ, i ∈ ℕ. (12)

The fitted uniform distribution has only one free parameter (λ); therefore the degree of freedom is given by:

k = λ − 1. (13)

For the actual transformation, the boundaries of the individual cells (cell_i) must be estimated such that each cell is equally likely. This is done with the inverse cumulative distribution function F⁻¹ of the continuous density (as stated before, χ²_d for the squared Mahalanobis distances of d-dimensional Gaussians) by dividing the density into λ areas of equal size:

cell_i = [l_i, r_i) for 1 ≤ i < λ; [l_i, ∞) for i = λ, (14)

l_i = F⁻¹_χ²_d((i − 1)/λ), (15)

r_i = l_{i+1}. (16)
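For a bivariate Gaussian (d = 2), the χ²_2 distribution of the squared Mahalanobis distances has the closed-form inverse CDF F⁻¹(p) = −2 ln(1 − p), so the cell construction of Eqs. (14) and (15) can be sketched without a statistics library (function names are ours; for general d, a χ²_d quantile function, e.g. from SciPy, would replace the closed form):

```python
import math

def cell_boundaries_2d(lam):
    """Left boundaries l_1..l_lam of lam equally likely cells for squared
    Mahalanobis distances of a bivariate Gaussian, which follow chi^2 with
    2 dof, i.e. Exp(1/2): l_i = -2*ln(1 - (i-1)/lam)."""
    return [-2.0 * math.log(1.0 - (i - 1) / lam) for i in range(1, lam + 1)]

def cell_index(dist_sq, bounds):
    """1-based index of the cell a squared distance falls into; the last
    cell is unbounded to the right."""
    i = 1
    while i < len(bounds) and dist_sq >= bounds[i]:
        i += 1
    return i
```

A linear scan suffices for small λ; the O(log n) lookup mentioned below would use bisection on the sorted boundaries instead.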
Thus the first cell always begins at 0, and the last one is unbounded (its right boundary tends to ∞). Now, if a previously unseen sample x′ is observed, it is stored in the sliding-window buffer and a counter b_i for the responsible cell cell_i is incremented. The lookup of the correct cell can be done in O(log n) (e.g., using a tree structure). The t-value for the current buffer configuration at time point n is then calculated as follows:

t_n = Σ_{i=1}^{λ} (b_{i,n} − e_i)² / e_i, (17)

with

e_i = unif_λ(i) · Σ_{j=1}^{λ} b_j ≈ ω/λ. (18)

The expected value e_i is only approximated by ω/λ, since the buffer may not be completely filled (i.e., contain fewer than ω samples). If the buffer is at capacity when a new observation x′ is processed, the oldest element is removed. The t-value is compared, at a significance level of α = 0.
01 = 1 %, against the critical value:

χ²_k,upper = F⁻¹_χ²_k(1 − α) = F⁻¹_χ²_k(0.99). (19)

If t > χ²_k,upper, the threshold is exceeded and the detector reports novelty.

Figure 8 shows on the left an exemplary data set with a single component trained to fit the green circles ◦. Additionally, the set contains samples from three more processes: purple crosses +, red triangles △, and blue circles ◦ close to the component's center.

Fig. 8. Test data set and corresponding test statistics t. First crosses appear, then triangles, and finally circles. The parameters are: w = 50, λ = 12. The red line is the test statistic t; the horizontal black line indicates the critical value.

All additional processes represent novelties and appear in the given order. The image on the right shows the curve of the calculated t-values for the sliding window in red. When samples from the novel processes appear, the curve changes and exceeds the critical value (given as a black line) considerably. The signal is, however, noisy, so that at some points in time the critical value is slightly exceeded even though no novel processes are present. To compensate for this effect, smoothing the t-values with a moving average,

t_{ma,n} = (1/M) Σ_{i=0}^{M−1} t_{n−i}, (20)

where M is the number of considered previous t-values, is a promising approach, as the blue curve in the right image of Figure 8 illustrates.

Real-world data sets or sensory data from embedded systems often differ from their assumed distributions. While this is not a problem for classification, it is one for our goodness-of-fit approach to novelty detection, as the t-value curve in Figure 9 highlights. Here a single Gaussian is fitted to the depicted samples, which are uniformly rather than normally distributed.
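The sliding-window statistic of Eqs. (17) and (18) can be maintained incrementally; a minimal sketch (the class name is ours), assuming each observation has already been reduced to its cell index:

```python
from collections import deque

class SlidingGoF:
    """Sliding-window chi-square goodness-of-fit statistic over lam
    equally likely cells. Expected frequency per cell is the current
    buffer fill divided by lam (Eq. 18)."""

    def __init__(self, omega, lam):
        self.omega, self.lam = omega, lam
        self.window = deque()      # cell indices of the last omega samples
        self.counts = [0] * lam    # b_i: observed frequency per cell

    def update(self, cell):
        """Insert the cell index of a new observation, evicting the oldest
        one if the buffer is at capacity; returns the current t-value."""
        if len(self.window) == self.omega:
            self.counts[self.window.popleft()] -= 1
        self.window.append(cell)
        self.counts[cell] += 1
        e = len(self.window) / self.lam   # expected frequency per cell
        return sum((b - e) ** 2 / e for b in self.counts)
```

The returned t-value would then be compared against the critical value of Eq. (19), and optionally smoothed with the moving average of Eq. (20).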
Therefore the distances are not χ² distributed, as presumed by our test, and the critical value is almost permanently exceeded by the test statistic. The problem can be solved by estimating the cell boundaries directly from the samples X_train that were used to train the component:

cell_i = [0, r_i) for i = 1; [l_i, r_i) for 1 < i < λ; [l_i, ∞) for i = λ, (21)

l_i = ∆(x_⌈i·w/λ⌉), x_j ∈ Sort(X_train), (22)

r_i = l_{i+1}, (23)

where ∆(x) is the Mahalanobis distance to the center µ of the component. By using the distance of every ⌈i·w/λ⌉-th
Fig. 9. 1000 uniformly distributed 2D samples and the trained Gaussian component. On the right, the test statistic (red) and the moving-average test statistic (blue) over the whole applied training set. The critical value (black line) is clearly exceeded.
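The boundary estimation from training distances (Eqs. (21) and (22)) amounts to sorting and index arithmetic; a minimal sketch (the function name is ours):

```python
import math

def empirical_boundaries(train_dists, lam):
    """Estimate cell boundaries directly from the (Mahalanobis) distances
    of the w training samples: sort the distances and take the distance of
    every ceil(i*w/lam)-th element as the next left boundary, so that each
    of the lam cells contains roughly the same number of samples."""
    xs = sorted(train_dists)
    w = len(xs)
    bounds = [0.0]                        # the first cell starts at 0
    for i in range(1, lam):
        bounds.append(xs[math.ceil(i * w / lam) - 1])
    return bounds                         # the last cell is [l_lam, inf)
```

Sorting dominates the cost at O(n log n), matching the complexity stated in the text.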
Fig. 10. Ordered Mahalanobis distances to the center for observed samples drawn from a bounded uniform distribution. The horizontal and vertical lines mark the boundaries for the cells of the resulting discrete uniform distribution.

element of the ordered samples, each cell contains approximately the same number of entries, thus again forming a discrete uniform distribution. The computationally most complex part is the sorting of the training samples X_train, which can be done in O(n log n). Note that the last cell λ might be underestimated due to the rounding. In Figure 10 the ordered training set X_train is depicted. The y-axis (index of the sorted samples) is divided into λ = 12 equally sized parts (each part corresponding to a cell); the associated function arguments on the x-axis (distances to the component center) are equal to the interval boundaries. If the curve is normalized, an approximation of the cumulative distribution function of the real distance distribution can be obtained. Figure 11 shows the resulting t-value curves for two uniformly distributed data sets (on the left the same example as in Figure 9 with 2 dimensions, on the right another sample set with 5 dimensions), where the expected frequencies for the test are estimated according to Equation (21). The (moving-average) curves are now clearly below
Fig. 11. Test statistics (red) and moving-average test statistics (blue), each for uniformly distributed samples with estimated distance distribution. On the left for 1000 two-dimensional samples, on the right for 1000 five-dimensional samples. w = 50, λ = 12, ma = 100.

the critical value (black line), thus not indicating any novelty.

To extend the high-density approach to GMM with multiple components, each component needs its own detector. If a new sample x′ is observed, it should be used to update the detector of its affiliated component. The affiliation, however, is a latent variable and thus not known at run-time. One method to estimate the affiliation is (Monte Carlo) random sampling, which requires only the evaluation of the unnormalized densities P_j(x′) = N(x′ | µ_j, Σ_j) (i.e., without mixing coefficients π_j) for each component j and a continuous uniform pseudo-random number generator unif(0, 1) for the unit interval [0, 1]. The sampling works by partitioning the unit interval into J parts. Each partition m_j is associated with exactly one component j, and the boundaries are given by:

m_j = [0, r_j) for j = 1; [l_j, r_j) for 1 < j < J; [l_j, 1] for j = J, (24)

p_{x′,j} = P_j(x′) / Σ_{k=1}^{J} P_k(x′), (25)

l_j = r_{j−1} for j > 1, (26)

r_j = l_j + p_{x′,j}, (27)

where the p_{x′,j} are the normalized densities, ensuring that the supports of the individual parts sum up to 1. To identify a winner component (i.e., the one that will be affiliated with the observation x′), a random value r′ is drawn from the uniform generator unif(0, 1). The partition m_j that covers the drawn value r′ indicates the winning component j.

Figure 12(a) shows clouds (5000 samples, 2 dimensions, 2 classes), a widely used artificial data set from the UCI Repository [23], with the trained classifier.
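The random sampling of Eqs. (24)-(27) never needs the partition boundaries explicitly: accumulating the normalized weights until the drawn value is covered is equivalent. A sketch (the function name is ours; the component densities are passed as callables):

```python
import random

def winner_component(x, densities, rng=random):
    """Monte Carlo affiliation: evaluate the unnormalized component
    densities P_j(x), normalize them to p_{x,j}, and walk the implied
    partition of the unit interval until the uniform draw r' is covered."""
    p = [d(x) for d in densities]
    total = sum(p)
    r = rng.random()                 # r' ~ unif(0, 1)
    right = 0.0
    for j, pj in enumerate(p):
        right += pj / total          # r_j = l_j + p_{x,j}
        if r < right:
            return j
    return len(p) - 1                # guard against floating-point round-off
```

Components with higher (relative) density at x are selected proportionally more often, which is exactly the behavior the partitioning of the unit interval describes.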
The CMM is trained in a 5-fold cross-validation fashion, where four

(a) Clouds data set from the UCI ML repository with CMM trained via 5-fold cross-validation. (b) Average novelty measures for the high-density detectors. Clearly, nothing suspicious is happening here.

Fig. 12. Data set with trained CMM and corresponding average-novelty curve.
Fig. 13. Test statistics (red) and moving-average test statistics (blue), each for the remaining folds applied to all detectors. w = 50, λ = 12, ma = 100.

folds are used for the actual training and the remaining fold for testing. This leads to an experiment in which no novel (unknown) processes are present, since all folds contain samples from all four known processes. The test result for one experiment is displayed in Figure 12(b). The curve shows the average high-density novelty measure (discussed in Section 5.3.2), which does not exceed the critical value given by the black line, thus indicating that no novel processes are present. The test statistics of the individual component detectors are depicted in Figure 13.

A modified test setup for clouds is illustrated in Figure 14(a). Here the training is performed on 2000 samples, and the remaining 3000 samples are interspersed with 400 samples from two overlapping novel processes (red triangles △). The novel processes appear around time steps 1000 and 2200. The corresponding high-density novelty measure curve for this experiment is given in Figure 14(b) and indicates novelty (blue bars rising to 1) in the regions where the novel samples are interspersed. The test statistics of the individual component detectors

(a) Samples drawn from clouds and two novel processes (red) which appear at ts ≈ 1000 and ≈ 2200. (b) Average novelty measure for the high-density detectors. The blue line marks regions where a novel process is detected.

Fig. 14. Data set with interspersed samples from novel processes and corresponding average-novelty curve.
Fig. 15. Test statistics (red) and moving-average test statistics (blue), each for the modified clouds set applied to all detectors. w = 50, λ = 12, ma = 100.

of this experiment are depicted in Figure 15. From the curves it can be inferred that the main contribution to the detection comes from the component (bottom right) that represents the blue crosses +.

CANDIES is our holistic approach to detecting novelty in high- as well as in low-density regions of a GMM. This is achieved by combining a single 2SND detector (cf. Section 5.1) with multiple HDR detectors, one for each component (cf. Section 5.2.3).
For the system to operate, some adjustments are necessary. At first, a previously unseen sample x′ is checked as to whether it is located in an HDR or not (similar to the first stage of 2SND). If this is not the case, the sample is passed to the second stage of 2SND and the density-based clustering is refreshed. At this point a novel process might be detected. Otherwise, the random sampling described in Section 5.2.3 is executed and x′ gets affiliated with exactly one of the J components. Then

Algorithm 2
CANDIES
Input: sample x′; parameters α, ω, ε, minPts, b̃
Global: model M, buffer B
Initialize ρ = F⁻¹_χ²_D(α).
{decide whether x′ is located in a high- or low-density region}
for all components j in M do
  if ∆_j(x′) ≤ ρ then
    {The observation is in a dense region}
    {get winner according to Section 5.2.3}
    j = winner component for x′
    Update the χ² detector of component j.
    {Compare t-value with critical value}
    if t_j > χ²_upper then
      {Process detected}
    end if
    return classification of x′ based on M.
  end if
end for
{The observation is in a low-density region}
return 2SND(x′, α, ε, minPts, b̃).

for this component j, a new t-value is estimated and compared against the critical value. At this point an overlapping novel process might be detected. Since samples x′ with a distance ∆_j(x′) > ρ for all j ∈ J are always processed by the LDR detection part, the assumed distance distribution of the components' HDR detectors will not match the observed samples. However, by establishing the dependency λ = 1/(1 − α) between the 2SND LDR detector and the HDR detectors, and adjusting the calculation of the t-values to

t_{n,j} = Σ_{i=1}^{λ−1} (b_{i,n} − e_i)² / e_i, (28)

e_i ≈ ω/(λ − 1), (29)

the last cell cell_λ represents exactly the fraction 1 − α of samples that are located in the LDR (but with e_λ = 0), while the first λ − 1 cells cover the remaining α. The critical value is changed to:

χ²_{λ−2},upper = F⁻¹_χ²_{λ−2}(0.99). (30)

Thus the goodness-of-fit test is adjusted to evaluate only the frequencies of samples expected to appear in the HDR. The whole approach is summarized and commented in Algorithm 2.

Figure 16 shows another modification of the previously used clouds data set. Again, additional samples (red triangles △) from two novel processes are interspersed.
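The adjusted statistic of Eqs. (28) and (29) simply ignores the notional LDR cell; a minimal sketch (the function name is ours, assuming a non-empty HDR buffer):

```python
def t_adjusted(counts, lam):
    """Adjusted HDR statistic: only the first lam-1 cells are evaluated,
    since LDR samples (the notional last cell, with e_lam = 0) are routed
    to the 2SND detector and never reach the HDR buffer."""
    hdr_counts = counts[:lam - 1]
    e = sum(hdr_counts) / (lam - 1)   # expected frequency per HDR cell
    return sum((b - e) ** 2 / e for b in hdr_counts)
```

The resulting value is compared against the critical value of Eq. (30), whose degrees of freedom shrink accordingly.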
The locations are chosen in such a way that one process shares a large fraction of its support with two of the known processes, while the other one is positioned in a

Fig. 16. Samples drawn from clouds and two novel processes (red) which appear at two different points in time.
Fig. 17. Test statistics (red) and moving-average test statistics (blue), each for the modified clouds set applied to all detectors.

low-density region. The t-value curves of the four known components are illustrated in Figure 17. Furthermore, the novelty measures (discussed in Section 5.3.2) given in Figure 18 clearly indicate novelty in the expected time ranges (blue bars rising to 1). The left curve gives the novelty value in low-density regions, where the first novel process starts generating samples. The right curve illustrates the average novelty measure for high-density regions. Here, the novel process is also detected, but a small delay between the appearance of novel samples and the detection can be observed. This is most likely due to the random sampling, which disperses novel samples over multiple components.

We propose two novelty measures to quantify how much novelty is present in the different regions (i.e., HDR or LDR) in a way that is comprehensible for humans (e.g., data scientists). The measures should express the absence of a novel process (or novelty) with a value near 0, while the presence of such a process should be expressed by a value ≥ 1. The measure ν_snd for the LDR is given by:

ν_{snd,n} = 1 − (|C| + |Noise|) / |B|, (31)

Fig. 18. Measured novelty for modified clouds (cf. Figure 16). On the left: curve indicating a novel process in low-density regions. On the right: average-novelty curve that represents novelty in high-density regions. The rising blue bars indicate that a novel process is present.

where |B| is the number of observations currently stored in the buffer, |C| is the number of different clusters, and |Noise| is the number of samples associated with the noise cluster. If a single cluster containing most of the samples currently kept in the buffer is present (which is a strong indicator for a novel process), the measure will be close to 1.
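Equation (31) translates directly into code; a minimal sketch (the function name is ours):

```python
def nu_snd(n_clusters, n_noise, n_buffer):
    """LDR novelty measure (Eq. 31): close to 1 when one large cluster
    dominates the buffer, close to 0 when the buffer consists of noise
    or many tiny clusters."""
    return 1.0 - (n_clusters + n_noise) / n_buffer
```

For example, a buffer of 100 samples forming a single cluster with no noise yields a value of 0.99, a strong indication of a novel process.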
On the other hand, if all samples are considered to be noise, or if multiple clusters with only a few samples each are present, the novelty value will be closer to 0.

The HDR measure ν̄ (average novelty) is based on the geometric mean of the normalized t-values ν_j of the individual components:

ν̄_n = (Π_{j=1}^{J} w(ν_j))^{1/J}, (32)

ν_j = t_{n,j} / χ²_k,upper. (33)

The normalization constant is given by the critical value. As Equation (32) shows, the ν_j are passed to a function w, a non-linear transform that boosts values near 1:

w(x) = x · (2 − comp(1 − x, µ)), (34)

comp(x, µ) = log(1 + µ·x) / log(1 + µ). (35)

The idea here is that if multiple components approach the critical value (an indicator for a novel process located between these components), the novelty measure should express this as well. If the model, however, consists of considerably more components (with t-values distant from the critical value), the mean is dominated by those components. Boosting values that are already close to 1 thus allows overcoming the influence of the normal components and increases the mean, so that novelty is expressed there as well. Exemplary curves for both measures are given in Figure 18 (novel processes present) and Figure 19 (only normal processes observed).

Fig. 19. Measured novelty for clouds. On the left: curve showing low novelty values for low-density regions. On the right: average-novelty curve that represents novelty in high-density regions.
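Equations (32)-(35) can be sketched as follows (function names are ours; the boost constant µ is an assumed value, since the paper's choice is not reproduced here, and the normalized statistics are clamped to [0, 1] for numerical safety):

```python
import math

def comp(x, mu):
    """Logarithmic compression (Eq. 35)."""
    return math.log1p(mu * x) / math.log1p(mu)

def boost(x, mu=10.0):
    """Non-linear transform w (Eq. 34) that boosts values near 1;
    mu = 10 is an assumed constant."""
    return x * (2.0 - comp(1.0 - x, mu))

def average_novelty(t_values, crit):
    """Average novelty (Eqs. 32-33): geometric mean of the boosted,
    normalized component statistics. Values >= 1 indicate novelty."""
    nus = [boost(min(max(t / crit, 0.0), 1.0)) for t in t_values]
    return math.prod(nus) ** (1.0 / len(t_values))
```

Since boost(1) = 2, a component sitting exactly at the critical value pushes the geometric mean toward and beyond 1, which is the novelty signal described above.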
CANDIES comes with a considerable number of adjustable parameters. Table 1 gives an overview of all parameters present in CANDIES, including a short description, recommendations for (good) default values (where possible), and the detector that is influenced by the parameter. Note that especially the buffer-size parameters depend on the application: on how many novel processes are expected to appear at once, and on how many samples they will generate.
Since the novelty detection is designed to detect novel processes and not single observations, it is rather robust against noise distributed over the input space. While the samples of a novel process will appear in a dense form, random noise is scattered across the input space, so that it is quite unlikely to form sufficiently large clusters. For the LDR detection part (based on 2SND), robustness is achieved by the two-stage architecture that suspicious samples pass through. Figure 20 shows a scenario that includes uniformly distributed noise mixed into a test set with observations from one known process and one novel process (located to the right, outside of the α-zone in a low-density region). Depending on its parametrization, the LDR approach only detects a novel process where the novel observations are actually located.

Figure 21(a) depicts the same exemplary data set that was already used in Section 5.2.1, interspersed with uniform random noise (purple crosses +). The corresponding t-value curve is displayed in Figure 21(b) and shows a recognizable up-shift introduced by the noise. Nevertheless, this undesired effect can be circumvented by adjusting the distance distribution according to Section 5.2.2. The curve of the adjusted test is illustrated in Figure 21(c). The course of the moving average is now clearly below the threshold in intervals where no novelty is present, but rises clearly above the critical value when the novel processes start to generate samples, although more weakly than in the application without noise. Therefore the high-density approach

TABLE 1
Different parameters necessary for the proposed combined novelty detection approach.
Parameter | Description | Default | Detector
ω | Window size per component detector | … · λ | High-density
λ | Number of discrete levels | 1/(1 − α) | High-density
ma | Size of the moving-average filter | … · ω | High-density
p | α-value for the χ² test | 0.01 | High-density
α | Size of the α-region | 0.95 | Low-density: 1st stage
ε | Maximum distance between samples in a cluster | 2 | Low-density: 2nd stage
|B| | Buffer for samples in the low-density region | 100 | Low-density: 2nd stage
P(C) = minPts | Size of a cluster to be considered the outcome of a new process | 10 | Low-density: 2nd stage

is essentially capable of handling noise, but requires the presence of noise in the training data.

(a) Test set consisting of samples from one previously known process (green circles ◦), noise samples (red triangles △), and samples from a novel process (blue crosses +). (b) Model identified by our approach after the process (blue ×, surrounded by a dotted ellipse) that was responsible for the blue crosses is detected and integrated into the classification model.

Fig. 20. Scenario with samples from a novel process and additional uniformly distributed noise scattered over the input space. The region where observations are identified as novel is colored red; regions with different class assignments are also separated by a solid black decision boundary.
At first sight, the two approaches to novelty detection for LDR and HDR might seem quite different. It is, however, possible to obtain a consistent view by interpreting one detector by means of the other. As mentioned before, the last bin cell_λ of each HDR detector matches the low-density parts of the input space. Therefore the ring buffer B used for 2SND can be seen as a cell shared across all HDR detectors. On the other hand, the individual buffers of each HDR detector allow an interpretation as clusters (with a different adaptation predicate P), thus suiting the second stage of 2SND.

CASE STUDY
To validate that the presented approach can be applied to real-world problems, we show experimental results

(a) The same data set as in Figure 8, but with uniform noise added (purple crosses +).
(b) Test statistics over time for the non-adjusted test.
(c) Test statistics over time with the estimated distance distribution.
Fig. 21. Scenario with samples from three novel processes and additional uniformly distributed noise scattered over the input space, with the corresponding test curves: test statistics (red) and moving-average test statistics (blue).

based on the well-known
KDD Cup 1999 network intrusion data set [21]. Even though it has been pointed out that there are some serious flaws in this data set, which make it inappropriate for the evaluation of real intrusion detection systems, its properties are still suitable for our purposes, since we are not interested in building a state-of-the-art intrusion detection system.

As mentioned before, our new approach is compared to the novelty detection technique proposed in [12]. There, novel samples are also identified using a GMM and the squared Mahalanobis distance between processed samples and the means of the different components. Each time a new sample is processed, an internal state variable S_n is updated such that S_n = S_{n−1} + χ_nov, with

χ_nov(x) = η Σ_{j=1}^{J} p(j | x) (δ_{α,j}(x) − (α / (1 − α)) (1 − δ_{α,j}(x))) (36)

being a penalty or reward, depending on how well the new sample fits the model. To compute whether the state variable is rewarded or punished, the indicator functions

δ_{α,j}(x) = 1 if ∆_j(x) ≤ ρ = F⁻¹_χ²_D(α), and 0 otherwise, (37)

of each component are evaluated, and the results are multiplied with the responsibilities of the components. If the algorithm is run in an environment without emerging processes, the expectation of the state variable will be equal to its initial value, E[S_n] = 1. The presence of a novel process will lead to a decrease of the value of the state variable S_n. This can be exploited to detect novel processes as soon as the state variable falls below a given threshold. The parameter η controls how fast the state variable changes. A detection causes a model adaptation that uses the most recent observations to retrain the model, which is done with a modified VI algorithm that allows inserting new components into an existing GMM and training only those, keeping the existing components "fixed". After the model is adapted to its changed environment, the state variable is reset to its initial value.
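The state update of Eqs. (36) and (37) can be sketched as follows (the function name is ours; α and η are passed explicitly, since their concrete values are configuration choices):

```python
def csnd_step(S, responsibilities, inside, alpha, eta):
    """One CSND state update (Eq. 36): reward components whose alpha-region
    contains the sample (delta = 1), penalize with weight alpha/(1-alpha)
    otherwise, so that E[S_n] stays at its initial value when only known
    processes generate samples."""
    chi_nov = eta * sum(
        r * (1.0 if d else -(alpha / (1.0 - alpha)))
        for r, d in zip(responsibilities, inside)
    )
    return S + chi_nov
```

The weights balance exactly: a fraction α of samples lands inside the region and is rewarded with 1, a fraction 1 − α lands outside and is penalized with α/(1 − α), so the expected increment is α − (1 − α)·α/(1 − α) = 0.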
We refer to this approach as CSND (χ²-novelty detection). Originating from the various recorded connections in the KDD99 data set, different attack scenarios are sampled (namely: ipsweep, neptune, nmap, satan, and smurf). Each scenario consists of background connections (legitimate network traffic) and connections related to the specific attack. As a preprocessing step, a dimension reduction to 6 out of the 41 dimensions is performed. Additionally, due to the massive support in terms of categories, we interpret the discrete attributes as nearly continuous. Each scenario consists of three parts with an overall of 25000 connections. The first part contains 10000 connections drawn from a pool of background connections only. The second part is a mixture of background and attack connections (with the attack name as label), with a ratio of 3:1 and a total of 10000 connections. The last 5000 connections form the third part, which again consists only of legitimate traffic.

Both adaptive classifiers are trained with the first 5000 samples of the first part of each scenario to learn an initial GMM with VI. The experiments themselves are conducted in a 5-fold cross-validation fashion, with independent folds for the train sets, which consist of connections from the first and third parts of each scenario, and a single test set that is equal to the second part of each scenario.

Additionally, to get a baseline for the classification performance, a static classifier (as described in Section 4.2, referred to as GMM-Static) is trained on samples of all classes (background connections and attacks). That is, this classifier can be seen as omniscient, as it anticipates future attacks that are completely new and unpredictable for the two adaptive classifiers above. In order to get meaningful results, a stratified 5-fold cross-validation, with all connections mixed together, is carried out.
Then the accuracy and the F-score of the class assigned to samples of the novel process are used to evaluate the classification performance.

The resulting averaged classification performances are summarized in Table 2, which shows that both adaptive approaches are able to identify the attacks and to perform model adaptations that integrate the acquired knowledge. In all scenarios, the accuracy and the observed F-score of CANDIES are equal to or higher than those of the CSND approach. In three out of five scenarios our approach performs comparably well to the static baseline, and still satisfactorily on the other two.

TABLE 2
Comparison of classification accuracies and F-scores for the novel process (in the form of the applied attack) for both novelty detection approaches and the GMM-Static baseline.
Scenario | CANDIES: Acc (F_nov) | CSND: Acc (F_nov) | GMM-Static: Acc (F_nov)
IPSWEEP | . (.) | . (.) | . (.)
NEPTUNE | . (.) | . (.) | . (.)
NMAP | . (.) | . (.) | . (.)
SATAN | . (.) | . (.) | . (.)
SMURF | . (.) | . (.) | . (.)
ø | 97. (.) | . (.) | . (.)

The higher performance of CANDIES over CSND is explained by Table 3, which shows the average number of actually novel samples (samples belonging to the attack) that are processed before the novel process is detected and a model adaptation is triggered. Here CANDIES displays its strength of exploiting the spatial information between suspicious samples in the LDR in the form of clusters, which accelerates the detection compared to the slowly changing state variable of CSND.

The algorithm is designed to be processed in an online mode. Therefore, the number of triggered model adaptation steps and the number of inserted components are also investigated. Table 4 shows the averaged numbers of adaptation and insertion steps for each scenario. As

TABLE 3
Number of actually novel samples needed until a novel process is detected and a model adaptation is triggered.

Scenario | CANDIES | CSND (required observations)
IPSWEEP | . | .
NEPTUNE | . | .
NMAP | . | .
SATAN | . | .
SMURF | . | .
ø | 46. | .

TABLE 4
Number of triggered model adaptations and average number of inserted components (in parentheses) for both approaches.

Scenario | CANDIES: Adapt. (Comp.) | CSND: Adapt. (Comp.)
IPSWEEP | . (.) | . (.)
NEPTUNE | . (.) | . (.)
NMAP | . (.) | . (.)
SATAN | . (.) | . (.)
SMURF | . (.) | . (.)
ø | 1. (.) | . (.)

we can see, both approaches tend to only a single model adaptation, which is the optimum here. The CSND approach has fewer model adaptations on average than CANDIES, but a higher average number of inserted components, which is not negligible since the number of components in the GMM has a direct influence on the run-time of both algorithms.

CONCLUSION AND OUTLOOK
We introduced CANDIES, a holistic approach to novelty detection for (newly) emerging processes throughout the complete input space of a probabilistic classifier. To achieve this, different novelty detectors for low-density regions (LDR, where it is less likely to observe samples) and high-density regions (HDR, where samples are expected to be observed) are combined and are thus able to cover the complete input space. For the LDR we resort to 2SND, an algorithm that works in two stages. First, suspicious observations are identified with the help of a GMM (which is based on parametric densities). In the second stage, the suspicious samples are clustered in a nonparametric way (inspired by DBSCAN). A novel process is recognized as soon as one of the (nonparametric) clusters reaches a sufficient size. The detection in the HDR, on the other hand, is purely based on parametric density estimation. We showed how to use multiple detectors (one detector per component) to identify novelty in a GMM. The presence of novel processes in the HDR is identified directly by maintaining a sliding window of recent observations and performing statistical goodness-of-fit tests between the sliding window and the affiliated component.

In a compact case study in the field of computer network intrusion detection, we showed that CANDIES is applicable to real-world data sets. We tested it on a subset of the well-known KDD Cup '99 intrusion detection data set, where rather promising results were obtained. So far, first experiments on artificial laboratory data sets lead us to the conclusion that CANDIES will be a satisfactory solution to novelty detection with model adaptation in the near future.

In our future work we will focus on extending the described novelty detector further; this includes in particular reaction procedures for the HDR detection. We will evaluate the performance of CANDIES on more sample applications, e.g., in the fields of robotics or video-based surveillance.
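The per-component HDR detection scheme summarized above can be sketched as follows. Under a Gaussian component, the squared Mahalanobis distance of a d-dimensional sample follows a χ² distribution with d degrees of freedom, so a sliding window of these distances can be tested against that reference distribution. Note that this class and the one-sample Kolmogorov-Smirnov test it uses are illustrative stand-ins; the paper's actual method is a fast online χ² goodness-of-fit test.

```python
from collections import deque
import numpy as np
from scipy.stats import chi2, kstest

class ComponentDetector:
    """Sliding-window goodness-of-fit detector for one Gaussian
    component of a GMM (a sketch; hypothetical names and thresholds)."""

    def __init__(self, mean, cov, window=100, alpha=0.01):
        self.mean = np.asarray(mean, dtype=float)
        self.prec = np.linalg.inv(cov)       # precision matrix of the component
        self.dim = self.mean.shape[0]        # degrees of freedom of the chi2
        self.window = deque(maxlen=window)   # recent squared Mahalanobis distances
        self.alpha = alpha                   # significance level of the test

    def observe(self, x):
        """Process one sample affiliated with this component.
        Returns True if the window deviates significantly from the model."""
        d = np.asarray(x, dtype=float) - self.mean
        self.window.append(float(d @ self.prec @ d))
        if len(self.window) < self.window.maxlen:
            return False                     # not enough evidence yet
        # Under the model, the distances follow chi2 with `dim` dof;
        # a KS test stands in here for the paper's online chi2 test.
        _, p = kstest(np.fromiter(self.window, float), chi2(self.dim).cdf)
        return p < self.alpha
```

In a full detector bank, each incoming sample would update the detector of the component with the highest responsibility for it, so that one such test runs per mixture component.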
Detection and handling of obsoleteness or concept shift will be accomplished with techniques similar to the ones proposed here. The same holds for concept drift, but here it will be quite difficult to strike the trade-off between under- and overreaction (too early or too late). The accuracy of our techniques must be set in relation to a "degree" of time-variance in the observed system. It will be possible to detect emergent phenomena in the observed environment and to numerically assess the degree of emergence (cf. [13]). Also, these techniques allow for an application to various anomaly detection problems. Furthermore, the design of the approach is not necessarily limited to GMM but is applicable to other mixture models as well.

Another possible application field that could benefit from our proposed technique are systems equipped with awareness capabilities. Often, terms such as location-aware, context-aware, self-aware, or environment-aware are used in the literature (see, e.g., [1], [26]). In our opinion, awareness is essentially
• the capability to compare knowledge about the self, the environment, other systems, etc. to current observations in order to detect when expectations concerning current observations no longer meet the actual observations, and
• the ability to adapt the knowledge model in a way such that the system meets some performance requirements, which includes a solution to the problem of when to adapt the model in order to avoid a performance loss due to either too fast or too slow reactions.
Altogether, awareness techniques will be a key to developing new kinds of technical systems that could actually be termed "intelligent" or "smart" with some higher degree of justification.

ACKNOWLEDGMENTS
The authors would like to thank the German Research Foundation (DFG) for its support within the DFG project CYPHOC (SI 674/9-1).

REFERENCES

[1] G. Abowd, A. Dey, P. Brown, N. Davies, M. Smith, and P. Steggles. Towards a better understanding of context and context-awareness. In H.-W. Gellersen, editor, Handheld and Ubiquitous Computing, volume 1707 of Lecture Notes in Computer Science, pages 304–307. Springer Berlin Heidelberg, 1999.
[2] H. Al-Behadili, A. Grumpe, C. Dopp, and C. Wöhler. Extreme Learning Machine based Novelty Detection for Incremental Semi-Supervised Learning. In International Conference on Image Information Processing, pages 230–235, 2015.
[3] A. Bhattacharyya. On a measure of divergence between two multinomial populations. Sankhyā: The Indian Journal of Statistics, pages 401–406, 1946.
[4] C. M. Bishop. Novelty detection and neural network validation. In Vision, Image and Signal Processing, volume 141, pages 217–222, 1994.
[5] C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, 2006.
[6] J. Bonifacio, A. Cansian, A. Carvalho, and E. Moreira. Neural networks applied in intrusion detection systems. In Proc. of IJCNN, volume 1, pages 205–210, 1998.
[7] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying Density-Based Local Outliers. ACM SIGMOD Record, 29(2):93–104, 2000.
[8] D. A. Clifton, S. Hugueny, and L. Tarassenko. Novelty detection with multivariate extreme value statistics. Journal of Signal Processing Systems, 65(3):371–389, 2011.
[9] L. Clifton, D. A. Clifton, P. J. Watkinson, and L. Tarassenko. Identification of patient deterioration in vital-sign data using one-class support vector machines. (2):125–131, 2011.
[10] L. Clifton, D. A. Clifton, Y. Zhang, P. Watkinson, L. Tarassenko, and H. Yin. Probabilistic novelty detection with support vector machines. IEEE Transactions on Reliability, 63(2):455–467, 2014.
[11] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. of KDD-96, pages 226–231. AAAI Press, 1996.
[12] D. Fisch, M. Jänicke, E. Kalkowski, and B. Sick. Techniques for knowledge acquisition in dynamically changing environments. TAAS, 7(1):1–25, 2012.
[13] D. Fisch, M. Jänicke, B. Sick, and C. Müller-Schloer. Quantitative emergence – a refined approach based on divergence measures. In SASO, pages 94–103, 2010.
[14] D. Fisch, E. Kalkowski, and B. Sick. Knowledge fusion for probabilistic generative classifiers with data mining applications. TKDE, 26(3):652–666, 2014.
[15] C. Gruhl, B. Sick, A. Wacker, S. Tomforde, and J. Hähner. A building block for awareness in technical systems: Online novelty detection and reaction with an application in intrusion detection. In IEEE iCAST, pages 194–200. IEEE, 2015.
[16] J. Haehner, U. Brinkschulte, P. Lukowicz, S. Mostaghim, B. Sick, and S. Tomforde. Runtime Self-Integration as Key Challenge for Mastering Interwoven Systems. pages 1–8, 2015.
[17] V. Hautamäki, I. Kärkkäinen, and P. Fränti. Outlier detection using k-nearest neighbour graph. Proc. – International Conference on Pattern Recognition, 3:430–433, 2004.
[18] A. Hazan, J. Lacaille, and K. Madani. Extreme value statistics for vibration spectra outlier detection. In International Conference on Condition Monitoring and Machinery Failure Prevention Technologies, pages 736–744, London, UK, 2012.
[19] E. Hellinger. Neue Begründung der Theorie quadratischer Formen von unendlich vielen Veränderlichen. Journal für die reine und angewandte Mathematik, 136:210–271, 1909.
[20] J. Ilonen, P. Paalanen, J. Kamarainen, and H. Kälviäinen. Gaussian mixture pdf in one-class classification: computing and utilizing confidence values. In ICPR, volume 2, pages 577–580, 2006.
[21] KDD Cup. KDD Cup 1999 Data – Data Set. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html, 1999. (last access: 06.02.2015).
[22] S. Kullback and R. A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22:79–86, 1951.
[23] M. Lichman. UCI machine learning repository, 2013.
[24] M. Markou and S. Singh. Novelty Detection: a review – part 1: statistical approaches. Signal Processing, 83:2481–2497, 2003.
[25] M. Markou and S. Singh. Novelty Detection: a review – part 2: neural network based approaches. Signal Processing, 83:2499–2521, 2003.
[26] C. Müller-Schloer, H. Schmeck, and T. Ungerer. Organic Computing – A Paradigm Shift for Complex Systems. Springer-Verlag Berlin Heidelberg, 2011.
[27] S. Papadimitriou, H. Kitagawa, P. B. Gibbons, and C. Faloutsos. LOCI: Fast Outlier Detection Using the Local Correlation Integral. In Data Engineering, volume 1, pages 315–326, 2003.
[28] K. Pearson. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine Series 5, 50(302):157–175, 1900.
[29] M. A. Pimentel, D. A. Clifton, L. Clifton, and L. Tarassenko. A review of novelty detection. Signal Processing, 99:215–249, June 2014.
[30] N. H. Pontoppidan and J. Larsen. Unsupervised condition change detection in large diesel engines. Neural Networks for Signal Processing – Proc. of the IEEE XIII Workshop, pages 565–574, 2003.
[31] S. Roberts. Extreme value statistics for novelty detection in biomedical data processing. IEE Proc. – Science, Measurement and Technology, 147(6):363–367, 2000.
[32] S. J. Roberts. Novelty detection using extreme value statistics. IEE Proc. – Vision, Image and Signal Processing, 146(3):124–129, 1999.
[33] E. J. Spinosa, A. P. de Leon F. de Carvalho, and J. Gama. Novelty detection with application to data streams. Intelligent Data Analysis, 13(3):405–422, 2009.
[34] L. Tarassenko, P. Hayton, N. Cerneaz, and M. Brady. Novelty detection for the identification of masses in mammograms. In Fourth International Conference on Artificial Neural Networks, pages 442–447, 1995.
[35] D. Tax and R. Duin. Uniform Object Generation for Optimizing One-class Classifiers. The Journal of Machine Learning Research, 2:155–173, 2002.
[36] C. H. Wang. Outlier identification and market segmentation using kernel-based clustering techniques. Expert Systems with Applications, 36(2):3744–3750, 2009.
[37] D. Yeung and C. Chow. Parzen-window network intrusion detectors. In Proc. of ICPR, volume 4, pages 385–388, 2002.
[38] F. Zorriassatine, A. Al-Habaibeh, R. M. Parkin, M. R. Jackson, and J. Coy. Novelty detection for practical pattern recognition in condition monitoring of multivariate processes: A case study.