Local projections for high-dimensional outlier detection
Thomas Ortner, Peter Filzmoser, Maia Zaharieva, Sarka Brodinova, Christian Breiteneder
TU Wien, Institute of Statistics and Mathematical Methods in Economics, and Institute of Software Technology and Interactive Systems

Corresponding author: Thomas Ortner, [email protected]

August 7, 2017
Abstract
In this paper, we propose a novel approach for outlier detection, called local projections, which is based on concepts of the Local Outlier Factor (LOF) (Breunig et al., 2000) and RobPCA (Hubert et al., 2005). By using aspects of both methods, our algorithm is robust towards noise variables and is capable of performing outlier detection in multi-group situations. Furthermore, we do not rely on any specific underlying data distribution. For each observation of a dataset, we identify a local group of dense nearby observations, which we call a core, based on a modification of the k-nearest neighbours algorithm. By projecting the dataset onto the space spanned by those observations, two aspects are revealed. First, we can analyze the distance of an observation to the center of the core within the projection space in order to provide a measure of the quality of description of the observation by the projection. Second, we consider the distance of the observation to the projection space in order to assess the suitability of the core for describing the outlyingness of the observation. These novel interpretations lead to a univariate measure of outlyingness based on aggregations over all local projections, which outperforms LOF and RobPCA as well as other popular methods like PCOut (Filzmoser et al., 2008) and subspace-based outlier detection (Kriegel et al., 2009) in our simulation setups. Experiments in the context of real-world applications employing datasets of various dimensionality demonstrate the advantages of local projections.
1 Introduction

Classical outlier detection approaches in the field of statistics are experiencing multiple problems in the course of the latest developments in data analysis, such as the increasing number of variables. Our approach builds on guided projections for analyzing the structure of high-dimensional data (Ortner et al., 2017). We identify a subset of observations locally, describing the structure of a dataset in order to evaluate the outlyingness of other nearby observations. While guided projections create a sequence of projections by exchanging one observation for another and re-projecting the data onto the new selection of observations, in this work we re-initiate the subset selection in order to cover the full data structure as well as possible with n local descriptions, where n represents the total number of observations. We discuss how outlyingness can be interpreted with regard to local projections, why local projections are suitable for describing the outlyingness of an observation, and how to combine those projections in order to receive an overall outlyingness estimation for each observation of a dataset.

The procedure of utilizing projections linked to specific locations in the data space has the crucial advantage of avoiding any assumptions about the distribution of the analyzed data, as is the case for other knn-based outlier detection methods as well (e.g. Kriegel et al., 2009). Furthermore, multi-group structures do not pose a problem due to the local investigation.

We compare our approach to related and well-established methods for measuring outlyingness. Besides RobPCA and LOF, we consider PCOut (Filzmoser et al., 2008), an outlier detection method from statistics focusing on high-dimensional data; KNN (Campos et al., 2016), since our algorithm incorporates knn-selection similar to LOF; subspace-based outlier detection (SOD) (Kriegel et al., 2009), a popular subspace selection method from computer science; and Outlier Detection in Arbitrary Subspaces (COP) (Kriegel et al., 2012), which follows a similar approach but has difficulties when dealing with flat data structures. Our main focus in this comparison is exploring the robustness towards an increasing number of noise variables.

The paper is structured as follows: Section 2 provides the background for a single local projection including a demonstration example. We then provide an interpretation of outlyingness with respect to a single local projection and a solution for aggregating the information based on series of local projections in Section 3. Section 4 describes all methods used in the comparison, which are then applied in two simulated settings in Section 5. In Section 6, we show the impact on three real-world data problems of varying dimensionality and group structure before we provide a brief discussion of the computation time in Section 7. We conclude with a discussion in Section 8.

2 Local projections

Let $\mathbf{X}$ denote a data matrix with $n$ rows (observations) and $p$ columns (variables) drawn from a $p$-dimensional random variable $X$, following a non-specified distribution function $F_X$. We explicitly consider the possibility of $p > n$ to emphasize the situation of high-dimensional low sample size data, referred to as flat data, which commonly emerges in modern data analysis problems. We assume that $F_X$ represents a mixture of multiple distributions $F_{X_1}, \dots, F_{X_q}$, where the number of sub-distributions $q$ is unknown. The distributions are unspecified and can differ from each other. However, we assume that the distributions are continuous.
Therefore, no ties are present in the data, which is a reasonable assumption especially for a high number of variables. In the case of ties, a preprocessing step excluding ties can be applied in order to meet this assumption. An outlier in this context is any observation which deviates from each of the groups of observations associated with the $q$ sub-distributions.

Our approach for evaluating the outlyingness of observations is based on the concept of using robust approximations of $F_X$, which do not necessarily need to provide a good overall estimation of $F_X$ on the whole support, but only on the local neighborhood of each observation. Therefore, we aim at estimating the local distribution around each observation $\mathbf{x}_i$, for $i = 1, \dots, n$, not by all available observations but by a small subset located close to $\mathbf{x}_i$. We limit the number of observations included in the local description in order to avoid the influence of inhomogeneity in the distribution of the underlying random variable (e.g. multimodal distributions or outliers being present in the local neighbourhood).

For complex problems, especially high-dimensional ones, such approximations are difficult to find. We use projections onto groups of observations locally describing the distribution. Therefore, we start by introducing the concept of a local projection, which will then be used as one such approximation, before describing a possibility for combining those local approximations. In order to provide a more practical incentive, we demonstrate the technical idea in a simulated example throughout the section.

2.1 Definition of local projections

Let $\mathbf{y}$ denote one particular observation of the data matrix $\mathbf{X} = (\mathbf{x}_1, \dots, \mathbf{x}_n)'$, where $\mathbf{x}_i = (x_{i1}, \dots, x_{ip})'$. For any such $\mathbf{y}$, we can identify its $k$ nearest neighbours using the Euclidean distance between $\mathbf{y}$ and $\mathbf{x}_i$, denoted by $d(\mathbf{y}, \mathbf{x}_i)$, for all $i = 1, \dots, n$:

$$knn(\mathbf{y}) = \{\mathbf{x}_i : d(\mathbf{y}, \mathbf{x}_i) \leq d_k\}, \tag{1}$$

where $d_k$ is the $k$-smallest distance from $\mathbf{y}$ to any other observation in the dataset. Using the strategy of robust estimation, we consider a subset of $\lceil \alpha \cdot k \rceil$ observations from $knn(\mathbf{y})$ for the description of the local distribution, where $\alpha$ represents a trimming parameter describing the proportion of observations which are assumed to be non-outlying in any knn. Here, $\lceil c \rceil$ denotes the smallest integer $\geq c$, and $\alpha$ is usually set to 0.5. These $\lceil \alpha \cdot k \rceil$ observations, which we call the core of the projection initiated by $\mathbf{y}$, do not include $\mathbf{y}$ itself. The center of this core is defined by

$$\bar{\mathbf{x}} = \mathop{\arg\min}_{\mathbf{x}_i \in knn(\mathbf{y})} \left\{ d_{(\lceil \alpha \cdot k \rceil)}(\mathbf{x}_i) \right\}, \tag{2}$$

where $d_{(\lceil \alpha \cdot k \rceil)}(\mathbf{x}_i)$ represents the $\lceil \alpha \cdot k \rceil$-largest distance between $\mathbf{x}_i$ and any other observation from $knn(\mathbf{y})$. The observation $\bar{\mathbf{x}}$ can be used to define the core of the local projection initiated by $\mathbf{y}$:

$$core(\mathbf{y}) = \{\mathbf{x}_i : d(\bar{\mathbf{x}}, \mathbf{x}_i) < d_{(\lceil \alpha \cdot k \rceil)}(\bar{\mathbf{x}}) \,\wedge\, \mathbf{x}_i \in knn(\mathbf{y}) \,\wedge\, \mathbf{x}_i \neq \mathbf{y}\} \tag{3}$$
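To make this selection concrete, the following R sketch implements Equations (1) to (3) under our reading of them. All function and variable names (select_core and its arguments) are illustrative and are not taken from the authors' published package.

```r
# Core selection for one initiating observation y (Equations (1)-(3)).
# A minimal sketch; not the authors' implementation.
select_core <- function(X, y_idx, k = 20, alpha = 0.5) {
  m <- ceiling(alpha * k)
  # Equation (1): the k nearest neighbours of y, excluding y itself
  d_y <- sqrt(colSums((t(X) - X[y_idx, ])^2))   # Euclidean distances to y
  knn_idx <- order(d_y)[2:(k + 1)]              # position 1 is y itself (distance 0)

  # Equation (2): the neighbour whose m-largest distance to the other
  # members of knn(y) is smallest, taken as the centre x_bar
  D_knn <- as.matrix(dist(X[knn_idx, ]))
  crit  <- apply(D_knn, 1, function(d) sort(d, decreasing = TRUE)[m])
  ctr   <- which.min(crit)

  # Equation (3): the m members of knn(y) closest to x_bar (x_bar included)
  core_idx <- knn_idx[order(D_knn[ctr, ])[1:m]]
  list(core = core_idx, center = knn_idx[ctr])
}
```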
In order to provide intuitive access to the proposed approach, we explain the concept of local projections for a set of simulated observations. In this example, we use 200 observations drawn from a two-dimensional normal distribution. The original observations and the procedure of selecting $core(\mathbf{y})$ are visualized in Figure 1: The red observation was manually selected to initiate our local projection process and refers to $\mathbf{y}$. It can be exchanged for any other observation. However, in order to emphasize the necessity of the second step of our procedure, we selected an observation off the center. The blue observations are the $k = 20$ nearest neighbours of $\mathbf{y}$, and the filled blue circles represent the core of $\mathbf{y}$ using $\alpha = 0.5$.
We note that the observations of $core(\mathbf{y})$ tend to be closer to the center of the distribution than $\mathbf{y}$ itself, since we can expect an increasing density towards the center of the distribution, which likely leads to denser groups of observations.

A projection onto the space spanned by the observations contained in $core(\mathbf{y})$ provides a description of the similarity between any observation and the core, which is especially of interest for $\mathbf{y}$ itself. Such a projection can efficiently be computed using the singular value decomposition (SVD) of the matrix of observations in $core(\mathbf{y})$, centered and scaled with respect to the core itself. In order to estimate the location and scale parameters for scaling the data, we can apply classical estimators on the core while preserving robustness properties, since the observations have been included into the core in a robust way.

Figure 1: Visualization of the core-selection process. The red observation represents the initiating observation $\mathbf{y}$. The blue observations represent $knn(\mathbf{y})$ and the filled blue observations represent $core(\mathbf{y})$. The center $\bar{\mathbf{x}}$ itself is not visualized, but it is known to be an element of $core(\mathbf{y})$.

$$\mathbf{X}_{core(\mathbf{y})} = (\mathbf{x}_{\mathbf{y},1}, \dots, \mathbf{x}_{\mathbf{y},\lceil \alpha \cdot k \rceil})', \quad \mathbf{x}_{\mathbf{y},j} \in core(\mathbf{y}) \ \forall j \in \{1, \dots, \lceil \alpha \cdot k \rceil\} \tag{4}$$

$$\hat{\boldsymbol{\mu}}_{\mathbf{y}} = \frac{1}{\lceil \alpha \cdot k \rceil} \sum_{\mathbf{x}_i \in core(\mathbf{y})} \mathbf{x}_i \tag{5}$$

$$\hat{\boldsymbol{\sigma}}_{\mathbf{y}} = \left( \sqrt{Var(x_{\mathbf{y},1,1}, \dots, x_{\mathbf{y},\lceil \alpha \cdot k \rceil,1})}, \dots, \sqrt{Var(x_{\mathbf{y},1,p}, \dots, x_{\mathbf{y},\lceil \alpha \cdot k \rceil,p})} \right)' = (\hat{\sigma}_{\mathbf{y},1}, \dots, \hat{\sigma}_{\mathbf{y},p})', \tag{6}$$

where $Var$ denotes the sample variance. Using $\hat{\boldsymbol{\mu}}_{\mathbf{y}}$, the centered observations are given by

$$\mathbf{x}^c_{\mathbf{y}} = (x^c_{\mathbf{y},1}, \dots, x^c_{\mathbf{y},p})' = \mathbf{x}_{\mathbf{y}} - \hat{\boldsymbol{\mu}}_{\mathbf{y}}, \tag{7}$$

which can be used to provide the centered and column-wise scaled data matrix with respect to the core of $\mathbf{y}$:

$$\tilde{\mathbf{X}}_{\mathbf{y}} = \left( \left( \frac{x^c_{\mathbf{y},1,1}}{\hat{\sigma}_{\mathbf{y},1}}, \dots, \frac{x^c_{\mathbf{y},1,p}}{\hat{\sigma}_{\mathbf{y},p}} \right)', \dots, \left( \frac{x^c_{\mathbf{y},\lceil \alpha \cdot k \rceil,1}}{\hat{\sigma}_{\mathbf{y},1}}, \dots, \frac{x^c_{\mathbf{y},\lceil \alpha \cdot k \rceil,p}}{\hat{\sigma}_{\mathbf{y},p}} \right)' \right)' \tag{8}$$

Based on $\tilde{\mathbf{X}}_{\mathbf{y}}$, we provide a projection onto the space spanned by the observations of $core(\mathbf{y})$ by $\mathbf{V}_{\mathbf{y}}$ from the SVD of $\tilde{\mathbf{X}}_{\mathbf{y}}$,

$$\tilde{\mathbf{X}}_{\mathbf{y}} = \mathbf{U}_{\mathbf{y}} \mathbf{D}_{\mathbf{y}} \mathbf{V}'_{\mathbf{y}}. \tag{9}$$

Any observation $\mathbf{x}$ can be projected onto the projection space by centering with $\hat{\boldsymbol{\mu}}_{\mathbf{y}}$, scaling with $\hat{\boldsymbol{\sigma}}_{\mathbf{y}}$, and applying the linear transformation $\mathbf{V}'_{\mathbf{y}}$. The projection of the whole dataset is given by $\tilde{\mathbf{X}}_{\mathbf{y}} \mathbf{V}_{\mathbf{y}}$. We refer to the projected observations as the representation of observations in the core space of $\mathbf{y}$. Since the dimension of the core space is limited by $\lceil \alpha \cdot k \rceil$, in any case where $p > \lceil \alpha \cdot k \rceil$ holds and $\mathbf{X}_{core(\mathbf{y})}$ is of full rank, a non-empty orthogonal complement of this core space exists. Therefore, any observation $\mathbf{x}$ has two representations: the core representation $\mathbf{x}_{core(\mathbf{y})}$ in the core space,

$$\mathbf{x}_{core(\mathbf{y})} = \mathbf{V}'_{\mathbf{y}} \left( \frac{x^c_1}{\hat{\sigma}_{\mathbf{y},1}}, \dots, \frac{x^c_p}{\hat{\sigma}_{\mathbf{y},p}} \right)', \tag{10}$$

where $\mathbf{x}^c = (x^c_1, \dots, x^c_p)'$ is computed as defined in Equation (7), and the orthogonal representation $\mathbf{x}_{orth(\mathbf{y})}$ in the orthogonal complement of the core space,

$$\mathbf{x}_{orth(\mathbf{y})} = \mathbf{x}^c - \mathbf{V}_{\mathbf{y}}\, \mathbf{x}_{core(\mathbf{y})}. \tag{11}$$

Figure 2a shows the representation of our 200 simulated observations in the core space. Note that in this special case, the orthogonal representation is constantly 0 due to the non-flat data structure of the core observations ($p < k$). We further see that the center of the core is now located in the center of the coordinate system.
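Continuing the sketch, the construction of Equations (4) to (11) reduces to a few lines of R around the base svd() function. The names are again illustrative; in particular, we read Equation (11) as operating on the centered and scaled observation, and we drop numerically zero singular values so that later distance computations remain stable.

```r
# Local projection (Equations (4)-(11)) built from a given core; a sketch.
local_projection <- function(X, core_idx) {
  X_core <- X[core_idx, , drop = FALSE]
  mu  <- colMeans(X_core)                         # Equation (5)
  sig <- apply(X_core, 2, sd)                     # Equation (6)
  sig[sig == 0] <- 1                              # guard (ties are excluded by assumption)
  X_tilde <- scale(X_core, center = mu, scale = sig)  # Equation (8)
  sv <- svd(X_tilde)                              # Equation (9): X~ = U D V'
  keep <- sv$d > max(sv$d) * 1e-8                 # drop numerically zero directions
  list(mu = mu, sig = sig, d = sv$d[keep], V = sv$v[, keep, drop = FALSE])
}

# Core and orthogonal representation of an arbitrary observation x
represent <- function(proj, x) {
  x_c    <- (x - proj$mu) / proj$sig              # centre and scale (Equation (7))
  x_core <- drop(crossprod(proj$V, x_c))          # Equation (10): V' x_c
  x_orth <- x_c - drop(proj$V %*% x_core)         # Equation (11)
  list(core = x_core, orth = x_orth)
}
```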
Given a large enough number of observations and a small enough dimension of the sample space, we can approximate $F_X$ with arbitrary accuracy in any desired neighborhood. However, in practice, the quality of this approximation is limited by the finite number of observations. It depends on various aspects like the size of $d_k$ and $d_{(\lceil \alpha \cdot k \rceil)}$, and thus the approximation is always limited by the restrictions imposed by the properties of the dataset. Especially the behavior of the core observations will, in practice, deviate significantly from the expected distribution with increasing $d_{(\lceil \alpha \cdot k \rceil)}$.

In order to take this local distribution into account, it is useful to include the properties of the core observations in the core space in the distance definition within the core space. A more advantageous way to measure the deviation of core distances from the center of the core than using Euclidean distances is the usage of Mahalanobis distances (e.g. De Maesschalck et al., 2000). For the projection space, an orthogonal basis is defined by the left eigenvectors of the SVD from Equation (9), while the singular values given by the diagonal of the matrix $\mathbf{D}_{\mathbf{y}}$ provide the standard deviation for each direction of the projection basis. Therefore, weighting the directions of the Euclidean distances with the inverse singular values directly leads to Mahalanobis distances in the core space, which take the variation of each direction into account:

$$CD_{\mathbf{y}}(\mathbf{x}) = \sqrt{ \frac{ \mathbf{x}'_{core(\mathbf{y})} \,\mathbf{D}^{-2}_{\mathbf{y}}\, \mathbf{x}_{core(\mathbf{y})} }{ \min(\lceil \alpha \cdot k \rceil - 1,\ p) } } \tag{12}$$

The computation of core distances can be retraced in Figure 2a. The green cross in the center of the coordinate system refers to the (projected) left singular vectors of the SVD. We note that the scales of the two axes in Figure 2 differ appreciably. The green ellipses represent Mahalanobis distances based on the variation of the two orthogonal axes, which provide a more suitable measure for describing the distribution locally.
Figure 2: Plot (a) provides a visualization of the transformed observations from Figure 1. The red observation represents the initiating observation $\mathbf{y}$. The blue observations represent $knn(\mathbf{y})$ and the filled blue observations represent $core(\mathbf{y})$. The green ellipses represent the covariance structure estimated by the core observations, representing the local distribution. Plot (b) uses the same representation as Figure 1 but shows the concept of multiple local projections initiated by different observations marked as red dots. Each of the core distances represented by green ellipses refers to the same constant value, taking the different covariance structures of the different cores into account.

The distances of the representation in the orthogonal complement of the core cannot be rescaled as in the core space. All observations from the core, which are used to estimate the local structure, i.e. to span the core space, are fully located in the core space. Therefore, their orthogonal complement is equal to 0:

$$\mathbf{x}_{orth(\mathbf{y})} = \mathbf{0} \quad \forall\, \mathbf{x} \in core(\mathbf{y}) \tag{13}$$

Since no variation in the orthogonal complement is available, we cannot estimate rescaling parameters for the orthogonal components. Therefore, we directly use the Euclidean distance in order to describe the distance from any observation $\mathbf{x}$ to the core space of $\mathbf{y}$. We will refer to this distance as the orthogonal distance ($OD$):

$$OD_{\mathbf{y}}(\mathbf{x}) = \| \mathbf{x}_{orth(\mathbf{y})} \| \tag{14}$$

The two measures of similarity, $CD$ and $OD$, are inspired by the score distance and the orthogonal distance of Hubert et al. (2005). In contrast to Hubert et al. (2005), we do not try to elaborate critical values for $CD$ and $OD$ to directly decide if an observation is an outlier. Such critical values always depend on an underlying normal distribution and on the variation of the core and the orthogonal distances of the core observations. Instead, we aim at providing multiple local projections in order to be able to estimate the degree of outlyingness for observations in any location of a data set. A core and its core distances can be defined for every observation. Therefore, a total of $n$ projections with core and orthogonal distances are available for analyzing the data structure. Figure 2b visualizes a small number (5) of such projections in order to demonstrate how the concept works in practice. The red observations are used as the initiating observations, and the green ellipses represent core distances based on each of the 5 cores. Each core distance refers to the same constant value considering the different covariance estimations of each core. We see that observations closer to the boundary of the data are described less adequately by their respective core, while other observations, close to the center of the distribution, are well described by multiple cores.
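Under the same assumptions, the two distances of Equations (12) and (14) can be sketched as follows, reusing the illustrative represent() helper defined above:

```r
# Core distance (Equation (12)) and orthogonal distance (Equation (14))
# of an observation x with respect to one local projection; a sketch.
core_distance <- function(proj, x, k = 20, alpha = 0.5) {
  r <- represent(proj, x)
  m <- min(ceiling(alpha * k) - 1, length(x))
  sqrt(sum((r$core / proj$d)^2) / m)      # Mahalanobis-type distance in the core space
}

orthogonal_distance <- function(proj, x) {
  sqrt(sum(represent(proj, x)$orth^2))    # Euclidean norm of the orthogonal part
}
```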
3 Outlyingness with respect to local projections

Most subspace-based outlier detection methods, including PCA-based methods such as PCOut (Filzmoser et al., 2008) and projection pursuit methods (e.g. Henrion et al., 2013), focus on the outlyingness of observations within a single subspace only. The risk of missing outliers due to the subspace selection by the applied method is evident, as the critical information might remain outside the projection space. RobPCA (Hubert et al., 2005) is one of the few methods that also considers the distance to the projection space in order to monitor this risk.

We would like to use both aspects, distances within the projection space and to the projection space, to evaluate the outlyingness of observations as follows: The projection space itself is often used as a model, employed to measure the outlyingness of an observation. Since we are using a local knn-based description, we cannot directly apply this concept, as our projections are bound to a specific location defined by the cores. The core distance from the location of our projection rather describes whether an observation is close to the selected core. If this is the case, we can assume that the model of description (the projection represented by the projection space) fits the observation well. Therefore, if the observation is well described, there should be little information remaining in the orthogonal complement, leading to small orthogonal distances.

We visualize this approach in Figure 3 in two plots. Plot (a) shows the first two principal components of the core space, and plot (b) the first principal component of the core and the orthogonal space, respectively. In order to retrace our concept of interpreting core distances as the quality of the local description model and orthogonal distances as a measure of outlyingness with respect to this description, we look at the two observations marked in red and blue. While the red observation is close to the center of our core, as seen in plot (a), the blue one is located far off. Therefore, the blue observation is not as well described by the core as the red observation, which becomes evident when looking at the first principal component of the orthogonal complement in plot (b), where the blue observation is located far off the green line representing the projection space.

Figure 3: Visualization of orthogonal and core distances for a local projection of a multivariate 100-dimensional normal distribution. Plot (a) describes the core space by its first two principal components. The measurement of the core distances is represented by the green ellipses. Plot (b) includes the orthogonal distance. The vertical green line represents the projection space.

Note that this interpretation does not hold for core observations. This is due to the fact that the full information of core distances is located in the core space. With increasing $p$, the distance of all observations from the same group converges to a constant, as shown in e.g. Filzmoser et al. (2008) for multivariate normal distributions. While this distance is completely represented in the core space for core observations, a proportion of the distances from non-core observations will be represented in the orthogonal complement of the core space. Therefore, the probability of the core distance of a core observation being larger than the core distance of any other observation from the same group converges to 1 with increasing $p$:
$$\lim_{p \to \infty} P\left( CD_{\mathbf{y}}(\mathbf{x}_i) > CD_{\mathbf{y}}(\mathbf{z}) \right) = 1, \quad \mathbf{x}_i \in core(\mathbf{y}),\ \mathbf{z} \notin core(\mathbf{y}) \tag{15}$$

So far we considered a single projection, while in fact we deal with a total of $n$ projections. Let $\mathcal{X}$ denote the set of $n$ observations $\{\mathbf{x}_1, \dots, \mathbf{x}_n\}$. For each initializing observation $\mathbf{x} \in \mathcal{X}$, the core distance $CD_{\mathbf{x}}$ and the orthogonal distance $OD_{\mathbf{x}}$ are well-defined for all observations from $\mathcal{X}$. As motivated above, we want to measure the quality of local description using the core distances and the local outlyingness using the orthogonal distances. The smaller the core distance of an observation for a specific projection is, the more relevant this projection is for the overall evaluation of the outlyingness of this observation. Therefore, we downweight the orthogonal distance based on the inverse core distances. In order to make the final outlyingness score comparable, we scale these weights such that they sum to 1 over all local-projection-initiating observations $\mathbf{y}$:

$$w_{\mathbf{y}}(\mathbf{x}) = \begin{cases} 0, & \mathbf{x} \in core(\mathbf{y}) \\[1ex] \dfrac{ CD_{\mathbf{y}}(\mathbf{x})^{-1} - \min_{\tilde{\mathbf{z}} \in \mathcal{X}} \left( CD_{\tilde{\mathbf{z}}}(\mathbf{x})^{-1} \right) }{ \sum_{\mathbf{z} \in \mathcal{X}} \left( CD_{\mathbf{z}}(\mathbf{x})^{-1} - \min_{\tilde{\mathbf{z}} \in \mathcal{X}} \left( CD_{\tilde{\mathbf{z}}}(\mathbf{x})^{-1} \right) \right) }, & \text{else} \end{cases} \tag{16}$$

The scaled weights $w_{\mathbf{y}}$ make sure that the sum of contributions by all available projections remains the same. Therefore, the sum of weighted orthogonal distances, corresponding to local outlyingness through local projections (LocOut),

$$LocOut(\mathbf{x}) = \sum_{\mathbf{y} \in \mathcal{X}} w_{\mathbf{y}}(\mathbf{x}) \cdot OD_{\mathbf{y}}(\mathbf{x}), \tag{17}$$

provides a useful, comparable measure of outlyingness for each observation.

Note that this concept of outlyingness is limited to high-dimensional spaces. Whenever we analyze a space where $p \leq \lceil \alpha \cdot k \rceil$ holds, the full information of all observations will be located in the core space of each local projection. Therefore, for varying core distances, the orthogonal distance will always remain zero. Thus, the weighted sum of orthogonal distances cannot provide any information on outlyingness unless there is information available in the orthogonal representation of observations.
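Putting the pieces together, a naive sketch of the weights from Equation (16) and the LocOut score from Equation (17) could look as follows. It reuses the illustrative helpers defined above and reads the minimum in Equation (16) as running over the projections for which x is not a core member; it aims at clarity, not efficiency.

```r
# LocOut scores (Equations (16)-(17)); a readable but unoptimized sketch.
locout_scores <- function(X, k = 20, alpha = 0.5) {
  n <- nrow(X)
  sel   <- lapply(seq_len(n), function(y) select_core(X, y, k, alpha))
  projs <- lapply(sel, function(s) local_projection(X, s$core))

  CD <- OD <- matrix(NA_real_, n, n)      # rows: projections y, columns: observations x
  for (y in seq_len(n)) for (x in seq_len(n)) {
    CD[y, x] <- core_distance(projs[[y]], X[x, ], k, alpha)
    OD[y, x] <- orthogonal_distance(projs[[y]], X[x, ])
  }

  W <- matrix(0, n, n)
  for (x in seq_len(n)) {
    icd <- 1 / CD[, x]                    # inverse core distances
    icd[vapply(sel, function(s) x %in% s$core, logical(1))] <- NA  # w_y(x) = 0 on cores
    w <- icd - min(icd, na.rm = TRUE)     # numerator of Equation (16)
    w[is.na(w)] <- 0
    W[, x] <- w / sum(w)                  # scaled so that the weights sum to 1
  }
  colSums(W * OD)                         # Equation (17): LocOut(x)
}
```

A call such as locout_scores(X, k = 20, alpha = 0.5) then returns one score per row of X, with larger values indicating stronger outlyingness.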
4 Methods for comparison

In order to evaluate the performance of our proposed methodology, we compare it with related algorithms, namely LOF (Breunig et al., 2000), RobPCA (Hubert et al., 2005), PCOut (Filzmoser et al., 2008), COP (Kriegel et al., 2012), KNN (Ramaswamy et al., 2000), and SOD (Kriegel et al., 2009). Each of those algorithms tries to identify outliers in the presence of noise variables. Some methods use a configuration parameter describing the dimensionality of the resulting subspace or the number of neighbours in a knn-based algorithm. In our algorithm, we use $\lceil \alpha \cdot k \rceil$ observations to create a subspace, which we employ for assessing the outlyingness of observations. In order to provide a fair comparison, the configuration parameters of each method are adjusted individually for each dataset: We systematically test different configuration values and report the best achieved performance for each method. Instead of outlier classification, we use each of the computed measures of outlyingness, since not all methods provide cutoff values. The performance itself is reported in terms of the commonly used area under the ROC curve (AUC) (Fawcett, 2006).

The Local Outlier Factor (LOF) (Breunig et al., 2000) is one of the main inspirations for our approach. The similarity of observations is described using ratios of Euclidean distances to the k nearest observations. Whenever this ratio is close to 1, there is a consistent group of observations and, therefore, no outliers. As for most outlier detection methods, no explicit critical value can be provided for LOF (e.g. Campos et al., 2016; Zimek et al., 2012). In order to optimize the performance of LOF, we optimize the number of k-nearest neighbours for each evaluation. We used the R-package Rlof (Hu et al., 2015) for the computation of LOF.
The second main inspiration for our approach is the RobPCA algorithm by Hubert et al. (2005).
The approach employs distances (similar to the proposed core and orthogonal distances) for describing the outlyingness of observations with respect to the majority of the underlying data. This method should work well with one consistent majority of observations. In the presence of a multi-group structure, we would expect it to fail, since the majority of the data cannot be properly described with a model of a single normal distribution. RobPCA calculates two outlyingness scores, namely orthogonal and score distances, and usually flags observations as outliers if these distances exceed critical values based on $\chi^2$ distributions. (For a multivariate interpretation of outlyingness based on those two scores, we refer to Pomerantsev, 2008.) We use the maximum quantile of each observation for the distributions of orthogonal and score distances as a measure of outlyingness in order to stay consistent with the original outlier detection concept of RobPCA. The dimension of the subspace used for dimension reduction is dynamically adjusted. We used the R-package rrcov (Todorov and Filzmoser, 2009) for the computation of RobPCA.

In addition to LOF and RobPCA, we compare the proposed local projections with PCOut by Filzmoser et al. (2008). PCOut is an outlier detection algorithm where location and scatter outliers are identified based on robust kurtosis and biweight measures of robustly estimated principal components. The dimension of the projection space is automatically selected based on the proportion of the explained overall variance. A combined outlyingness weight is calculated during the process, which we use as an outlyingness score. The method is implemented in the R-package mvoutlier (Filzmoser and Gschwandtner, 2015).

Another method included in our comparison is subspace-based outlier detection (SOD) by Kriegel et al. (2009). The method looks for relevant axis-parallel dimensions in which outliers can be identified. The identification of those subspaces is based on knn, where k is optimized in a way similar to LOF and local projections. We used the implementation of SOD in the ELKI framework (Achtert et al., 2008) for performance reasons.

All three methods, LocOut, LOF and SOD, implement knn-estimations in their respective procedures. Therefore, it is reasonable to monitor the performance of k-nearest neighbors (KNN), which can be directly used for outlier detection as suggested in Ramaswamy et al. (2000). The performance is optimized over all reasonable k between 1 and the minimal number of non-outlying observations of a group, which we take from the ground truth in our evaluation. We used the R-package dbscan for the computation of KNN.

Similar to the proposed local projections, Outlier Detection in Arbitrary Subspaces (COP) by Kriegel et al. (2012) locally evaluates the outlyingness of observations. The k-nearest neighbours of each observation are used to estimate the local covariance structure and robustly compute the representation of the evaluated observation in the principal component space. The last principal components are then used to measure the outlyingness, while the number of principal components to be cut off is dynamically computed. Although the initial concept looks similar to our proposed algorithm, it has some disadvantages. The number of observations used for the knn estimation needs to be a lot larger than the number of variables; a proportion of observations to variables of three to one is suggested.
Therefore, the method cannot be employed for flat data structures, which represent the focus of the proposed approach for outlier analysis. While COP performed competitively for simulations with no or a very small number of noise variables, the computation of COP is not possible in flat data settings. As the non-flat settings only represent a minor fraction of the overall simulations, we did not include COP in the simulated evaluation but only in the low-dimensional real data evaluation of Section 6.
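For reference, the R-based competitors can be invoked along the following lines. This is a sketch based on the packages cited above (SOD was computed in the Java-based ELKI framework and is therefore not shown); exact argument names and return values may differ between package versions.

```r
# Outlyingness scores of the R-based comparison methods; illustrative only.
library(Rlof)       # LOF
library(rrcov)      # RobPCA
library(mvoutlier)  # PCOut
library(dbscan)     # knn distances

X <- as.matrix(iris[, 1:4])        # placeholder data for illustration

lof_score <- lof(X, k = 10)        # LOF: larger values = more outlying
rob <- PcaHubert(X, k = 2)         # RobPCA with 2 components
rob_sd <- rob@sd                   # score distances
rob_od <- rob@od                   # orthogonal distances
pc <- pcout(X)                     # PCOut: pc$wfinal are weights, small = outlying
knn_score <- kNNdist(X, k = 10)    # distance to the k-th nearest neighbour
```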
5 Simulation results

We used two simulation setups to evaluate the performance of the methods for an increasing number of noise variables in order to determine their usability for high-dimensional data. We do so by starting with 50 informative variables and 0 noise variables, increasing the number of noise variables up to 5000. We use three groups of observations with 150, 150, and 100 observations. Starting from 350 noise variables, the data structure becomes flat, which we expect to lead to performance drops as the estimation of the underlying density becomes more and more problematic. Each of the three groups of observations is simulated based on a randomly rotated covariance matrix $\Sigma_i$, as performed in Campello et al. (2015),

$$\Sigma_i = \begin{pmatrix} \Sigma_i^{inf} & 0 \\ 0 & I^{noise} \end{pmatrix}, \quad \Sigma_i^{inf} = \Omega_i \begin{pmatrix} 1 & \rho_i & \cdots & \rho_i \\ \rho_i & 1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & \rho_i \\ \rho_i & \cdots & \rho_i & 1 \end{pmatrix} \Omega'_i, \tag{18}$$
for $i = 1, 2, 3$, where $I^{noise}$ is an identity matrix describing the covariance of the uncorrelated noise variables and $\Sigma_i^{inf}$ is the covariance matrix of the informative variables, i.e. the variables containing information about the separation of the present groups. The correlation $\rho_i$ is randomly selected between 0.1 and 0.9, $\rho_i \sim U[0.1, 0.9]$, and $\Omega_i$ represents a randomly generated orthonormal rotation matrix. For our simulation setups, we always consider the dimensionality of $\Sigma_i^{inf}$ to be 50. During the simulation, we evaluate the impact of such noise variables and therefore perform the simulation for a varying number of noise variables. While the mean values of the noise variables are fixed to zero for all groups, the mean values of the informative variables are set as follows (in block form, where each row refers to a block of informative variables and each column to one group):

$$(\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \boldsymbol{\mu}_3) = \begin{pmatrix} \mu & \mu & 0 \\ \mu & 0 & \mu \\ 0 & \mu & \mu \end{pmatrix}. \tag{19}$$
Therefore, for each informative variable, one group can be distinguished from the two other groups. The degree of separation, given by $\mu$, is randomly selected from a uniform distribution over an interval of the form $[-c, -3] \cup [3, c]$, bounded away from zero. The first simulation setup uses multivariate normally distributed groups of observations with the parameters $\boldsymbol{\mu}_i$ and $\Sigma_i^{inf}$, for $i \in \{1, 2, 3\}$, and the second setup uses multivariate log-normally distributed groups of observations with the same parameters. Note that noise variables can be problematic for several of the outlier detection methods, and skewed distributions can create difficulties for methods relying on elliptical distributions.

After simulating the groups of observations, scatter outliers are generated by replacing 5% of the observations of each group with outlying observations. For those, we use the same location parameter $\boldsymbol{\mu}_i$, but their covariance matrix is a diagonal matrix with constant diagonal elements $\sigma^2$, randomly generated between 3 and 9, $\sigma^2 \sim U[3, 9]$, for the informative variables. The reason for using scatter outliers instead of location outliers (changed $\boldsymbol{\mu}_i$) is that scatter outliers will not form a separate group but will stick out of their respective group in random directions.
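A condensed R sketch of this data-generating process for a single group is given below. The separation is fixed at µ = 3 and the block structure of Equation (19) is collapsed to a common mean shift, so the snippet illustrates Equation (18) and the scatter-outlier mechanism rather than the full three-group design.

```r
# One simulated group (Equation (18)) with 5% scatter outliers; a sketch.
set.seed(1)
p_inf <- 50; p_noise <- 1000; n_g <- 150; mu_sep <- 3

rho   <- runif(1, 0.1, 0.9)
C     <- matrix(rho, p_inf, p_inf); diag(C) <- 1     # equicorrelation block
Omega <- qr.Q(qr(matrix(rnorm(p_inf^2), p_inf)))     # random orthonormal rotation
Sigma_inf <- Omega %*% C %*% t(Omega)

mu <- c(rep(mu_sep, p_inf), rep(0, p_noise))         # nonzero means on informative part only

R <- chol(Sigma_inf)                                 # Sigma_inf = R'R
X_inf   <- matrix(rnorm(n_g * p_inf), n_g) %*% R     # correlated informative variables
X_noise <- matrix(rnorm(n_g * p_noise), n_g)         # uncorrelated noise variables
X <- sweep(cbind(X_inf, X_noise), 2, mu, "+")

# Scatter outliers: same location, inflated diagonal scatter on the
# informative variables (sigma^2 ~ U[3, 9])
out <- sample(n_g, ceiling(0.05 * n_g))
s2  <- runif(1, 3, 9)
X[out, 1:p_inf] <- mu_sep + matrix(rnorm(length(out) * p_inf, sd = sqrt(s2)),
                                   length(out))
```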
The outcome of the first simulation setup, based on the multivariate normal distribution, is visualized in Figure 4. Figure 4a shows the performance for 100 repetitions with 1000 noise variables as boxplots measured by the AUC value. We note that local projections (LocOut) outperform all other methods, while LOF, SOD, and KNN perform approximately at the same level. For smaller numbers of noise variables, especially SOD performs better than local projections. This becomes clear in Figure 4b, showing the median performance of all methods with a varying number of noise variables. We see that the performance of SOD drops more quickly than that of the other methods, while local projections are affected the least by an increasing number of noise variables. The horizontal grey line corresponds to a performance of 0.5, which refers to random outlier scores.

Figure 4: Evaluation of outliers in three multivariate normally distributed groups with a varying number of noise variables. 5% of the observations were replaced by outliers. Plot (a) shows boxplots for the setup with 1000 noise variables. Each setup was repeatedly simulated 100 times. Plot (b) shows the median performance of each method for various numbers of noise variables.

Setup 2, visualized in Figure 5, shows the effect of non-normal distributions on the outlier detection methods. Using the same parameters for log-normal distributions as in the normally distributed setup makes it easier for all methods to identify outliers. Nevertheless, the order of performance changes, since the methods are affected differently. SOD is more strongly affected than LOF, since it is easier for SOD to identify useful subspaces for symmetric distributions, while LOF does not benefit from such properties. LocOut still shows the best performance, at least for an increasing number of noise variables. The most notable difference is the effect on RobPCA, which heavily depends on the violated assumption of a normal distribution.
6 Real-world applications

In order to demonstrate the effectiveness of local projections in real-world applications, we analyze three different datasets, varying in the number of groups, the dimension of the data space, and the separability of the groups. We always use observations from multiple groups as non-outlying observations and a small number of observations from one additional group to simulate outliers in the dataset.
Figure 5: Evaluation of outliers in three multivariate log-normally distributed groups with a varying number of noise variables. 5% of the observations were replaced by outliers. Plot (a) shows boxplots for the setup with 1000 noise variables. Each setup was repeatedly simulated 100 times. Plot (b) shows the median performance of each method for various numbers of noise variables.
The first real-world dataset consists of 120 samples of measurements of 25 chemical compositions (fatty acids, sterols, triterpenic alcohols) of olive oils from Tuscany, Italy (Armanino et al., 1989). The dataset is used as a reference dataset in the R-package rrcovHD (Todorov, 2016) for robust multivariate methods for high-dimensional data and consists of four groups of 50, 25, 34, and 11 observations, respectively. We use observations from the smallest group with 11 observations to sample 5 outliers, repeating the sampling 50 times.

In our context, this dataset represents a situation where the distribution can be well estimated due to its non-flat data structure. Therefore, it is possible to include COP in the evaluation. It is important to note that at least 26 observations must be used by COP in order to be able to locally estimate the covariance structure, while a smallest group of at most 25 observations will be present in each setup. Thus, we would assume that COP has problems distinguishing between outliers and observations from this smallest group, which does not yield enough observations for the covariance estimation.

We show the performance of the compared outlier detection methods based on the AUC values in Figure 6. We note that all methods but PCOut and COP perform at a very high level. For KNN, SOD, and LocOut, there is only a non-significant difference in the median performance.

Figure 6: Performance of different outlier detection methods for the 25-dimensional olive oil dataset measured by the area under the ROC curve (AUC). For each method, the configuration parameters are optimized based on the ground truth.
The second real-world dataset used for the evaluation is a fruit dataset, which consists of 1095 observations in a 256-dimensional space corresponding to spectra of different melon cultivars. The observations are documented as members of three groups of sizes 490, 106 and 499, but in addition, different illumination systems were used during the cultivation processes, leading to subgroups. The dataset is often used to evaluate robust statistical methods (e.g. Hubert and Van Driessen, 2004).

We sample 100 observations from two randomly selected main groups to simulate a highly inconsistent structure of main observations and add 7 outliers, randomly selected from the third remaining group. We repeatedly simulate such a setup 150 times in order to make up for the high degree of inconsistency. As Figure 7 shows, the identification of outliers is extremely difficult for this dataset. A combination of properly reducing the dimensionality and modelling the existing subgroups is required. LocOut outperforms the compared methods, followed by LOF and PCOut.
The observations of the glass vessels dataset, described e.g. in Janssens et al. (1998), refer to archaeological glass vessels which have been excavated in Antwerp (Belgium). In order to distinguish between different types of glass, which have either been imported or locally produced, 180 glass samples were chemically analyzed using electron probe X-ray microanalysis (EPXMA). By measuring a spectrum at 1920 different energy levels, corresponding to different chemical elements, for each of those vessels, a high-dimensional dataset for classifying the glass vessels is created. A few (11) of those variables/energy levels contain no variation and are therefore removed from our experiments in order to avoid problems during the computation of outlyingness.

Figure 7: Evaluation of the performance of the outlier detection algorithms on the fruit dataset, showing boxplots of the performance of 150 repetitions of outlier detection measured by the AUC.

While performing PLS regression analysis, Lemberge et al. (2000) realized that some vessels had been measured at a different detector efficiency and therefore removed those spectra from the dataset. We do not remove those observations, since from an outlier detection perspective they represent bad leverage points, as indicated by Serneels et al. (2005), which we want to be able to identify. These leverage points are visualized in Figure 8a with x-symbols. By including these observations as part of the main groups, it becomes especially difficult to identify outliers sampled from the green group (potasso-calcic). We sample 100 observations from the non-potasso-calcic group 50 times and add 5 randomly selected potasso-calcic observations as outliers. The performance is visualized in Figure 8b. Again, LocOut outperforms all compared methods, while LOF and PCOut have problems dealing with this data setup.
7 Computation time

The algorithm for local projections was implemented in an R-package which is publicly available. The package further includes the glass vessels dataset used in Section 6. Based on this R-package, we performed simulations to test the required computational effort for the proposed algorithm and the impact of changes in the number of observations and the number of variables.

For each projection, the first step of our proposed algorithm is based on the k-nearest-neighbours concept. Therefore, we need to compute the distance matrix for all $n$ available $p$-dimensional observations, leading to an effort of $O(n(n-1)p/2)$, where $n(n-1)/2$ represents the number of pairs of observations and $p$, the dimension of the data space, reflects the effort for the computation of each Euclidean distance.
Figure 8: Evaluation of the performance of the outlier detection algorithms on the glass vessels dataset. Plot (a) shows the classification of the group structure based on the chemical composition. Plot (b) shows boxplots of the performance of 50 repetitions of outlier detection measured by the AUC.

After the basic distance computation, we need to compare those distances (which scales with $n$ but contributes only negligibly to the overall effort) and scale the data based on the location and scale estimation for the selected core (which also does not significantly affect the computation time). For each of the $n$ cores, we perform an SVD, leading to an effort of $O(p^2 n + n^3)$. Therefore, a total effort of $O(n^2 p(1+p) + n^4)$ is expected for the computation of all local projections. In this calculation, reductions, such as the repeated computation of the projection onto the same core, are not taken into account. Such an effect is very common due to the presence of hubs in data sets (Zimek et al., 2012). Figure 9 provides an overview of the overall computation time decomposed into the different aspects of the computation algorithm.

We observe that the computation time increases approximately linearly with $p$, while it increases faster than linearly with $n$. There is an interaction effect between $k$ and $n$ visible in plot (a) of Figure 9 as well, due to the necessity of $n$ knn computations. Plots (c) and (d) show that the key factors are the $n$ SVDs. Especially the core estimation and the computation of the core distance are only marginally affected by increasing $n$ and not affected at all by increasing $p$. The orthogonal distance computation is non-linearly affected by increasing $n$ and $p$, which, however, remains relatively small when compared to the SVD estimations.
Figure 9: Visualisation of the computation time of the local projections. Plots (a) and (b) evaluate the development of the overall computation time for increasing n in plot (a) and increasing p in plot (b). Those evaluations are performed for varying k. Plots (c) and (d) focus on different components of the computation for a fixed k = 40 and increasing n and p.

8 Conclusions

We proposed a novel approach for evaluating the outlyingness of observations based on their local behaviour, named local projections. By combining techniques from the existing robust outlier detection method RobPCA (Hubert et al., 2005) and from the Local Outlier Factor (LOF) (Breunig et al., 2000), we created a method for outlier detection which is highly robust towards large numbers of non-informative noise variables and which is able to deal with multiple groups of observations, not necessarily following any specific standard distribution.

These properties are gained by creating a local description of a data structure by robustly selecting a number of observations based on the k-nearest neighbours of an initiating observation and projecting all observations onto the space spanned by those observations. Doing so repeatedly, where each available observation initiates a local description, we describe the full space in which the dataset is located. In contrast to existing subspace-based methods, we create a new concept for interpreting the outlyingness of observations with respect to such a projection by introducing the concept of quality of local description of a model for outlier detection. By aggregating the measured outlyingness of each projection and by downweighting the outlyingness with this quality measure of local description, we define the univariate local outlyingness score LocOut.
LocOut measures the outlyingness of each observation in comparison to other observations and results in a ranking of outlyingness for all observations. We do not provide cutoff values for classifying observations as outliers and non-outliers. While at first sight this poses a disadvantage, it allows us to disregard any assumptions about the data distribution; such assumptions would be required in order to compute theoretical critical values.

We showed that this approach is more robust towards the presence of non-informative noise variables in the dataset than the other well-established methods we compared to (LOF, SOD, PCOut, KNN, COP, and RobPCA). Additionally, skewed non-symmetric data structures have less impact on it than on the compared methods. These properties, in combination with the new interpretation of outlyingness, allowed for a competitive analysis of high-dimensional datasets, as demonstrated on three real-world applications of varying dimensionality and group structure.

The overall concept of the proposed local projections utilized for outlier detection opens up possibilities for more general data analysis concepts. Any clustering method and any discriminant analysis method is based on the idea of observations being outliers for one group and therefore being part of another group. By combining the different local projections, a possibility is provided for avoiding assumptions about the data distribution, which are in reality often violated. Thus, applying local projections to data analysis problems could not only provide a suitable method for analyzing high-dimensional problems but could also reveal additional information on method-influencing observations due to the quality-of-local-description interpretation of local projections.
Acknowledgements
This work has been partly funded by the Vienna Science and Technology Fund (WWTF) through project ICT12-010 and by the K-project DEXHELPP through COMET (Competence Centers for Excellent Technologies), supported by BMVIT, BMWFW and the province of Vienna. The COMET program is administered by FFG.
References
Achtert, E., Kriegel, H.-P., and Zimek, A. (2008). ELKI: a software system for evaluation of subspace clustering algorithms. In Scientific and Statistical Database Management, pages 580–585. Springer.

Aggarwal, C. C. and Yu, P. S. (2001). Outlier detection for high dimensional data. In ACM SIGMOD Record, volume 30, pages 37–46. ACM.

Armanino, C., Leardi, R., Lanteri, S., and Modi, G. (1989). Chemometric analysis of Tuscan olive oils. Chemometrics and Intelligent Laboratory Systems, 5(4):343–354.

Breunig, M. M., Kriegel, H.-P., Ng, R. T., and Sander, J. (2000). LOF: identifying density-based local outliers. In ACM SIGMOD Record, volume 29, pages 93–104. ACM.

Campello, R. J., Moulavi, D., Zimek, A., and Sander, J. (2015). Hierarchical density estimates for data clustering, visualization, and outlier detection. ACM Transactions on Knowledge Discovery from Data (TKDD), 10(1):5.

Campos, G. O., Zimek, A., Sander, J., Campello, R. J., Micenková, B., Schubert, E., Assent, I., and Houle, M. E. (2016). On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Data Mining and Knowledge Discovery, 30(4):891–927.

De Maesschalck, R., Jouan-Rimbaud, D., and Massart, D. L. (2000). The Mahalanobis distance. Chemometrics and Intelligent Laboratory Systems, 50(1):1–18.

Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874.

Filzmoser, P. and Gschwandtner, M. (2015). mvoutlier: Multivariate outlier detection based on robust methods. R package version 2.0.6.

Filzmoser, P., Maronna, R., and Werner, M. (2008). Outlier identification in high dimensions. Computational Statistics & Data Analysis, 52(3):1694–1711.

Henrion, M., Hand, D. J., Gandy, A., and Mortlock, D. J. (2013). CASOS: a subspace method for anomaly detection in high dimensional astronomical databases. Statistical Analysis and Data Mining: The ASA Data Science Journal, 6(1):53–72.

Hu, Y., Murray, W., and Shan, Y. (2015). Rlof: R parallel implementation of Local Outlier Factor (LOF). R package version 1.1.1.

Hubert, M., Rousseeuw, P. J., and Vanden Branden, K. (2005). ROBPCA: a new approach to robust principal component analysis. Technometrics, 47(1):64–79.

Hubert, M. and Van Driessen, K. (2004). Fast and robust discriminant analysis. Computational Statistics & Data Analysis, 45(2):301–320.

Janssens, K., Deraedt, I., Freddy, A., and Veekman, J. (1998). Composition of 15–17th century archaeological glass vessels excavated in Antwerp, Belgium. Mikrochimica Acta, 15(Suppl.):253–267.

Kriegel, H.-P., Kröger, P., Schubert, E., and Zimek, A. (2009). Outlier detection in axis-parallel subspaces of high dimensional data. In Advances in Knowledge Discovery and Data Mining, pages 831–838.

Kriegel, H.-P., Kröger, P., Schubert, E., and Zimek, A. (2012). Outlier detection in arbitrarily oriented subspaces. In Data Mining (ICDM), 2012 IEEE 12th International Conference on, pages 379–388. IEEE.

Lemberge, P., De Raedt, I., Janssens, K. H., Wei, F., and Van Espen, P. J. (2000). Quantitative analysis of 16–17th century archaeological glass vessels using PLS regression of EPXMA and µ-XRF data.