Generalized Prediction Intervals for Arbitrarily Distributed High-Dimensional Data
Steffen Kühn
Technische Universität Berlin
Chair of Electronic Measurement and Diagnostic Technology
Einsteinufer 17, 10585 Berlin, Germany
steff[email protected]
Abstract
This paper generalizes the traditional statistical concept of prediction intervals to arbitrary probability density functions in high-dimensional feature spaces by introducing significance level distributions, which provide interval-independent probabilities for continuous random variables. The advantage of transforming a probability density function into a significance level distribution is that it enables one-class classification and outlier detection in a direct manner.
A prediction interval is an interval that will, with a specified degree of confidence, contain future realizations or, in the terminology of pattern recognition, feature vectors (Hahn and Meeker, 1991). The appeal of this concept is its clear stochastic meaning. Its great disadvantage is that the definition is usually too restrictive, for example for multimodal distributions. It is intuitively clear that, in this case, more than one interval of probable feature vectors can exist, and it would be better to speak of prediction regions. The situation is even more complicated for high-dimensional feature spaces. This lack of generality is probably the reason why prediction intervals are rarely used in pattern recognition.

This is a pity, because prediction regions would be very useful, for example, for the recognition of outliers (Barnett and Lewis, 1994) or the detection of novelty or normality. Instead of prediction intervals, numerous other methods are used for this purpose. They can be grouped roughly into two categories:

1. Distance-based and novelty or normality score-based approaches (Knorr et al., 2000; Dolia, 2002; Moonesinghe and Tan, 2006; Angiulli et al., 2006; Ting et al., 2007),

2. Methods that introduce a separate rejection class in combination with a classifier (Singh and Markou, 2004; Steinwart et al., 2005).

Applied to outlier detection, the method I propose here belongs to the first category, with a probability as normality score. Before going into the details, I give a short overview of related work.

Simple distance-based methods rely on the concept of the neighborhood of a point, for example the $k$-nearest neighborhood (Knorr et al., 2000). Outliers are those points for which there are fewer than $k$ points within a distance $\delta$ in the dataset. Ramaswamy et al. (2000) propose a method to choose this threshold $\delta$ automatically from the dataset: the outliers are the points with the largest distances to their $k$th nearest neighbors. Of course, a threshold is still necessary, but it now has a statistical interpretation as a quantile of the $k$th-nearest-neighbor distance distribution, which simplifies the choice. A more recent article based on this idea was published by Angiulli et al. (2006), who apply a weighted sum of the distances to the $k$ nearest neighbors of each point. Although the idea is quite simple, these methods have low computational costs, and they make only minor assumptions about the underlying distribution.

Another category of algorithms related to outlier detection is robust regression. Here, outlier detection is more a means to an end, because the goal is to prevent outliers from influencing the estimation of the regression function; it is therefore sufficient to detect outliers indirectly. Ting et al. (2007), for example, apply an outlier score to control the influence that a point has on the parameter estimation of the regression function. For this purpose, weights are introduced, which are estimated under the assumption that the noise is Gaussian. This is often sufficient, for example for most sensor signals. The algorithm is real-time capable, but far from general. This method also belongs to the first category.
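To make the first category concrete, the following is a minimal sketch of the $k$th-nearest-neighbor distance score in the spirit of Ramaswamy et al. (2000). It is my illustration, not code from any of the cited works; numpy is assumed, and the brute-force distance computation is only suitable for small datasets.

    import numpy as np

    def knn_outlier_scores(data, k):
        # Pairwise Euclidean distances, shape (n, n); brute force, O(n^2) memory.
        diff = data[:, None, :] - data[None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=-1))
        # After sorting each row, column 0 is the distance of a point to itself,
        # so column k holds the distance to the k-th nearest neighbor.
        dist.sort(axis=1)
        return dist[:, k]

Points with the largest scores are reported as outliers; choosing how many to report, or a cutoff on the score, corresponds to the threshold choice discussed above.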
The idea of the second category is very different. At first glance, it seems impossible to use classifiers to detect outliers, because classifiers need samples from both inliers and outliers for the estimation of their parameters, whereas usually only samples of inliers are available. The idea is to create an enclosing cloud of outlier samples synthetically with a random generator. Afterwards, it is possible to train the classifier. Singh and Markou (2004), for example, apply a neural network for this purpose. Other classifiers are also possible, for example an SVM (Steinwart et al., 2005). Regardless of the applied classifier, these probabilistic methods need, for the generation of the hull, a measure of the degree to which a generated sample point is an outlier. Singh and Markou (2004), for example, use simple prediction intervals ($\sigma$-ranges) for this purpose.

In conclusion, both categories have to solve the same problem: to find an appropriate zero level set for the inlier-generating density. In the subsequent sections, I show that this problem can be mapped to the choice of a significance level and that it is possible to generalize the traditional statistical concept of prediction intervals to prediction regions.

A prediction interval denotes a region in which future feature vectors $x$ occur with a predetermined probability. For its computation, it is essential that the generating probability density function $p_X(x)$ of the random variable $X$ is known. In this case,

$P(x_1 \le X \le x_2) = \int_{x_1}^{x_2} p_X(\omega)\, d\omega \qquad (1)$

is the probability that a future feature vector falls within the region $[x_1, x_2]$. For a prediction interval, the region borders have to be established so that the probability for outliers is lower than a given, fixed threshold: the significance level $\alpha$. A typical example is $\alpha = 5\%$, which means that the region is to be determined so that 95% of all possible feature vectors fall into it. Usually, an infinite number of borders fulfill this requirement. For Gaussian distributions, the region is therefore centered on the expected value. But this definition is only appropriate for this special case.

The reason is that the integration borders are defined directly. A way to solve this problem is to apply the level set idea with the probability density function $p_X(x)$ as level set function. The set

$\Gamma_\theta = \{\, x \mid p_X(x) - \theta = 0 \,\} \qquad (2)$

can serve as an implicit definition of the integration borders. The question is how the threshold $\theta$ corresponds to the significance level $\alpha$ so that

$\alpha \overset{!}{=} \int_{p_X(\omega) \le \theta} p_X(\omega)\, d\omega \qquad (3)$

holds. Let $F_Y$ be the cumulative distribution function of the probability density function values $y = p_X(x)$ and $F_Y^{-1}$ its inverse. In this case, the threshold $\theta$ can be computed from a given significance level $\alpha$ by

$\theta = F_Y^{-1}(\alpha). \qquad (4)$
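For illustration, consider a standard normal distribution and $\alpha = 5\%$. The classical two-sided prediction interval is $|x| \le 1.96$, and the density value at its border is

$\theta = p_X(1.96) = \frac{1}{\sqrt{2\pi}}\, e^{-1.96^2/2} \approx 0.0584.$

All $x$ with $p_X(x) \le \theta$ satisfy $|x| \ge 1.96$, so the integral in (3) equals $P(|X| \ge 1.96) \approx 0.05 = \alpha$, as required.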
I now prove this assertion.

Proof 1

The feature vectors $x$ can be interpreted as realizations of the vectorial random variable $X$. Because of this, $Y = p_X(X)$ is also a random variable. But contrary to $X$, $Y$ is scalar-valued. The relation between $X$ and $Y$ is strictly deterministic, that is,

$p_{Y|X}(y \mid x) = \delta(y - p_X(x)) \qquad (5)$

and consequently

$p_{Y,X}(y, x) = \delta(y - p_X(x))\, p_X(x). \qquad (6)$

Marginalization with respect to $X$ finally gives a formula to convert the probability density function $p_X$ into the probability density function $p_Y$:

$p_Y(y) = \int_\omega \delta(y - p_X(\omega))\, p_X(\omega)\, d\omega. \qquad (7)$

Now we can calculate the cumulative distribution function $F_Y(y)$, which is defined by

$F_Y(y) = \int_{-\infty}^{y} p_Y(y')\, dy'. \qquad (8)$

Inserting (7) into (8) yields

$F_Y(y) = \int_{-\infty}^{y} \int_\omega \delta(y' - p_X(\omega))\, p_X(\omega)\, d\omega\, dy'. \qquad (9)$

With the interval function

$\sqcap_a^b(x) = \begin{cases} 1, & a \le x \le b \\ 0, & \text{otherwise} \end{cases} \qquad (10)$

expression (9) can be transformed to

$F_Y(y) = \int_\omega \sqcap_{-\infty}^{y}(p_X(\omega))\, p_X(\omega)\, d\omega = \int_{p_X(\omega) \le y} p_X(\omega)\, d\omega. \qquad (11)$

A comparison shows that the right side of expression (3) is identical to $F_Y(\theta)$. For this reason, we can write (3) as

$\alpha \overset{!}{=} F_Y(\theta). \qquad (12)$

The cumulative distribution function $F_Y$ is by definition monotonic. If it is even strictly monotonic, the inverse function $F_Y^{-1}$ exists and we can compute $\theta$ for a given $\alpha$ by expression (4). Otherwise, we obtain for one value of $\alpha$ an interval of possible values for $\theta$. Because all values in this interval result in the same $\alpha$ in the integration (3), any value in this interval can be used to solve the equation.

To summarize: a significance level $\alpha$, as it is usually applied in statistics, can be transformed with the scalar-valued function $F_Y^{-1}$ into a level set threshold $\theta$. With this threshold, it is possible to classify those feature vectors whose probability density function values $p_X(x)$ are lower than $\theta$ as outliers, with a statistical significance of $1 - \alpha$. The ranges for which the condition $p_X(x) \ge \theta$ is fulfilled are the prediction regions.

It has been shown that the cumulative distribution function $F_Y(y)$ can be used to transform a given significance level $\alpha$ into a level set threshold $\theta$. But because of expression (11), $1 - F_Y(y)$ can also be applied as a measure describing our degree of surprise about a certain feature vector $x$. We summarize and define

$b_X(x) := \int_{p_X(\omega) \le p_X(x)} p_X(\omega)\, d\omega = F_Y(p_X(x)) \qquad (13)$

and name $b_X(x)$ the significance level distribution of $x$ for the random variable $X$.

The significance level distribution is in the true sense of the word a "probability distribution", because it provides a probability (the significance level) for every continuous realization $x$. Unfortunately, the term "probability distribution" is already used for probability density functions, which do not provide probabilities but probability density values. Note that the significance level distribution does not deliver the probability of a single realization $x$ itself, but the probability of all realizations even more unlikely than $x$. Nevertheless, $b_X(x)$ provides valuable information for the assessment of the realization $x$ and allows one to decide whether it is certain, probable, or only possible.
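Definition (13) translates directly into a Monte Carlo estimate, anticipating the estimation procedure described below. A minimal sketch (my illustration; numpy assumed, names hypothetical):

    import numpy as np

    def significance_level(pdf, x, samples):
        # b_X(x) = P(p_X(X) <= p_X(x)), eq. (13), estimated from
        # samples drawn from p_X.
        y = pdf(samples)
        return np.mean(y <= pdf(x))

For a standard normal pdf and $x = 1.96$, this returns a value close to $0.05$, in line with the illustration after expression (4).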
Figure 1: For unimodal and symmetric distributions, the significance level distribution (C) and the prediction interval (A) lead to the same results. But for multimodal distributions, only the significance level distribution is reasonable (D). [Four panels (A)-(D), each plotting $p_X(x)$ over $x$ with the respective prediction regions marked.]

For simple standard distributions, such as the Gaussian distribution or the Cauchy distribution, the significance level distribution can be given in closed form. Note that for a symmetric and unimodal distribution, the significance level distribution and the prediction interval are identical (see Fig. 1). For more complex distributions this is usually not the case, and it is seldom possible to give the significance level distribution in closed form. In these cases, it is reasonable to estimate the cumulative distribution function $F_Y$. The next section proposes a method for this and investigates its convergence speed. Figure 2 shows an example of a significance level distribution for a non-trivial probability density function. Please note that significance level distributions are not restricted to the one-dimensional case.
Figure 2: An example of a probability density function $p_X(x)$ and its related significance level distribution $b_X(x)$. The white zones are the "outlier regions" for a significance level of 5%; the threshold $\theta$ corresponds to $\alpha = 5\%$.

Estimation of $F_Y$

Contrary to the complicated integration (3), expression (12) can be computed easily if we estimate $F_Y$. We assume that the probability density function $p_X(x)$ is known or can be appropriately estimated. With the knowledge of $p_X(x)$, it is possible to generate $n$ correspondingly distributed random samples:

$D_X = \{x_1, \ldots, x_n\}. \qquad (14)$

Now we can transform this dataset into a dataset of probability density function values:

$D_Y = \{p_X(x_1), \ldots, p_X(x_n)\} = \{y_1, \ldots, y_n\}. \qquad (15)$

With this dataset $D_Y$, the cumulative distribution function $F_Y$ can be estimated by

$\tilde{F}_Y(y) = \frac{1}{n} \sum_{k=1}^{n} \Theta(y - y_k) \qquad (16)$

with the Heaviside function

$\Theta(y) = \begin{cases} 1, & y \ge 0 \\ 0, & \text{otherwise.} \end{cases} \qquad (17)$

The Glivenko-Cantelli theorem guarantees the convergence of this empirical distribution function $\tilde{F}_Y(y)$ to the true cumulative distribution function $F_Y(y)$ for $n \to \infty$. Note that it is unnecessary to sum over all elements $y_k$ to evaluate expression (16) if the dataset $D_Y$ is sorted; in this case, a binary search with computational costs of $O(\log n)$ can be applied.

It is possible to give the root mean squared error of this estimator as a function of the dataset size $n$:

$\mathrm{RMSE} = \sqrt{\frac{F_Y (1 - F_Y)}{n}} \le \frac{1}{2\sqrt{n}}. \qquad (18)$

This formula makes it possible to calculate the number of samples $n$ necessary for a desired accuracy at a given significance level $\alpha = F_Y$. It is important that the convergence speed depends neither on the generating density $p_X(x)$ nor on the dimension of the original problem (3).
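A minimal sketch of the estimator (16) for a standard normal density (my illustration; numpy assumed). Sorting $D_Y$ once makes every subsequent evaluation of $\tilde{F}_Y$ a binary search; with $n = 10000$, the bound (18) gives $\mathrm{RMSE} \le 1/(2\sqrt{n}) = 0.005$.

    import numpy as np

    rng = np.random.default_rng(0)
    sigma = 1.0

    def pdf(x):
        # Standard normal density, standing in for a known p_X.
        return np.exp(-x**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

    n = 10000
    d_x = rng.normal(0.0, sigma, n)   # D_X, eq. (14)
    d_y = np.sort(pdf(d_x))           # D_Y, sorted once, eq. (15)

    def f_tilde(y):
        # Empirical CDF (16): fraction of y_k <= y, found by binary search.
        return np.searchsorted(d_y, y, side="right") / n

    print(f_tilde(pdf(1.96)))  # approx. 0.05, cf. the Gaussian illustration above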
A proof of expression (18) follows.

Proof 2

We calculate the mean squared error (MSE) by computing the expectation value
$E\{(F_Y - \tilde{F}_Y)^2\} = \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} (F_Y - \tilde{F}_Y)^2\, p_Y(y_1)\, dy_1 \cdots p_Y(y_n)\, dy_n \qquad (19)$

of the squared error with respect to all elements of the dataset (15). The MSE is usually written as the sum
$\mathrm{MSE} = \mathrm{BIAS}^2 + \mathrm{VAR} \qquad (20)$

with

$\mathrm{BIAS} = E\{\tilde{F}_Y\} - F_Y \qquad (21)$

and

$\mathrm{VAR} = E\{\tilde{F}_Y^2\} - E\{\tilde{F}_Y\}^2. \qquad (22)$

For the estimation formula (16), we have

$E\{\tilde{F}_Y\} = \frac{1}{n} \sum_{k=1}^{n} E\{\Theta(y - y_k)\} = \frac{1}{n} \sum_{k=1}^{n} \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} \Theta(y - y_k)\, p_Y(y_1) \cdots p_Y(y_n)\, dy_1 \cdots dy_n = \frac{1}{n} \sum_{k=1}^{n} \int_{-\infty}^{+\infty} \Theta(y - y_k)\, p_Y(y_k)\, dy_k = \frac{1}{n} \sum_{k=1}^{n} \int_{-\infty}^{y} p_Y(y_k)\, dy_k = \frac{1}{n} \sum_{k=1}^{n} F_Y = F_Y. \qquad (23)$

This shows that the estimator (16) is unbiased. Furthermore, we have
$E\{\tilde{F}_Y^2\} = \frac{1}{n^2} \sum_{k=1}^{n} \sum_{j=1}^{n} E\{\Theta(y - y_k)\, \Theta(y - y_j)\} = \frac{1}{n^2} \sum_{k=1}^{n} \sum_{j \ne k} E\{\Theta(y - y_k)\}\, E\{\Theta(y - y_j)\} + \frac{1}{n^2} \sum_{k=1}^{n} E\{\Theta^2(y - y_k)\}. \qquad (24)$

Because $\Theta^2(y - y_k) = \Theta(y - y_k)$, we obtain

$E\{\tilde{F}_Y^2\} = \frac{1}{n^2} \sum_{k=1}^{n} \sum_{j \ne k} F_Y^2 + \frac{1}{n^2} \sum_{k=1}^{n} F_Y = F_Y^2 - \frac{1}{n} F_Y^2 + \frac{1}{n} F_Y. \qquad (25)$

Inserting (25) and (23) into (22) yields
$\mathrm{VAR} = \mathrm{MSE} = \frac{1}{n} F_Y (1 - F_Y). \qquad (26)$

(MSE and VAR coincide here because the estimator is unbiased; (26) is the variance of an average of $n$ independent Bernoulli variables $\Theta(y - y_k)$ with success probability $F_Y$.) Finally, because of
$\mathrm{RMSE} = \sqrt{\mathrm{MSE}}$, we obtain expression (18); the upper bound in (18) follows from $F_Y(1 - F_Y) \le 1/4$.

Overview of the method
Estimation
1. If the inlier-generating density $p_X(x)$ is unknown, estimate it with a suitable algorithm.
2. Choose an upper bound for the RMSE and calculate the necessary number $n$ of random samples to generate: $n = 1/(2\,\mathrm{RMSE})^2$.
3. Generate the random samples $D_X = \{x_1, \ldots, x_n\}$.
4. Compute the derived dataset $D_Y = p_X(D_X)$.
5. Sort $D_Y$.

Application

1. Choose a significance level $\alpha$.
2. Compute the density value $y = p_X(x)$ for the feature vector $x$ of interest.
3. Calculate the significance level value $z = b_X(x)$ by computing $z = F_Y(y)$.
4. Classify $x$ as an outlier if $z < \alpha$.

Both phases are illustrated in the sketch below.
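A compact end-to-end sketch of both phases (my illustration; numpy assumed). The inlier-generating density is taken as known here: an equal-weight mixture of two unit-variance Gaussians, i.e. a multimodal case in which a classical prediction interval would fail.

    import numpy as np

    rng = np.random.default_rng(1)

    def pdf(x):
        # Known p_X: 0.5 N(-3, 1) + 0.5 N(+3, 1).
        g = lambda m: np.exp(-(x - m)**2 / 2) / np.sqrt(2 * np.pi)
        return 0.5 * g(-3.0) + 0.5 * g(3.0)

    # Estimation steps 2-5: RMSE bound -> n, sample D_X, derive and sort D_Y.
    rmse = 0.005
    n = int(1.0 / (2.0 * rmse)**2)                 # 10000
    comp = rng.integers(0, 2, n)                   # mixture component per sample
    d_x = rng.normal(np.where(comp == 0, -3.0, 3.0), 1.0)
    d_y = np.sort(pdf(d_x))

    # Application steps 1-4: b_X(x) via binary search, comparison with alpha.
    def b(x):
        return np.searchsorted(d_y, pdf(x), side="right") / n

    alpha = 0.05
    for x in (0.0, 3.0, 8.0):
        print(x, b(x), "outlier" if b(x) < alpha else "inlier")

Note that the density trough at $x = 0$, although it lies between the two modes, is classified as an outlier region, as in Fig. 1 (D).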
Usually, the inlier-generating density $p_X(x)$ is unknown and has to be estimated. For this reason, the quality of the proposed method for outlier detection depends significantly on the quality of the applied density estimation algorithm. The influence of the $F_Y$ estimation is, however, marginal, because the RMSE (18) can, in contrast to the estimation error of $p_X(x)$, be reduced arbitrarily by increasing the number $n$ of random samples. The following experiment verifies this by comparing a theoretically determined significance level distribution $b_X$ with an estimated version $\tilde{b}_X$.

Only for some simple distributions is it possible to give the significance level distribution in closed form. One such example is the Gaussian distribution

$p_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{x^2}{2\sigma^2}}. \qquad (27)$

The significance level distribution has, in this case, the form

$b_X(x) = 1 - \operatorname{erf}\!\left(\frac{|x|}{\sqrt{2}\,\sigma}\right). \qquad (28)$

In an experiment, I have estimated the significance level distribution of a standard normal distribution with the proposed method, varying the random sample number $n$. For each value of $n$, I have averaged the squared differences between the closed form (28) and the estimated versions over 2000 single estimations. The results are summarized in Fig. 3, which shows that the experiment confirms the theoretical predictions.

Figure 3: Averaged root mean squared errors of 2000 single experiments for the significance level distribution estimation of a standard normal distribution for different random sample numbers $n$ (black lines), in comparison to the theoretical errors (fat white lines in the background). [Four panels for $n = 100$, $1000$, $10000$, $100000$, plotting the RMSE over $x \in [-4, 4]$.]
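A minimal sketch of a single run of this comparison (my illustration; numpy and scipy assumed):

    import numpy as np
    from scipy.special import erf

    rng = np.random.default_rng(2)
    sigma = 1.0

    def pdf(x):
        return np.exp(-x**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

    def b_closed(x):
        # Closed form (28) for the Gaussian case.
        return 1.0 - erf(np.abs(x) / (np.sqrt(2) * sigma))

    n = 10000
    d_y = np.sort(pdf(rng.normal(0.0, sigma, n)))

    def b_estimated(x):
        return np.searchsorted(d_y, pdf(x), side="right") / n

    x = np.linspace(-4.0, 4.0, 81)
    print(np.max(np.abs(b_closed(x) - b_estimated(x))))
    # Single-run deviations are on the order of the bound
    # 1/(2*sqrt(n)) = 0.005 from (18).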
In this article, I have shown that it is always possible to compute prediction regions as a generalization of prediction intervals, no matter whether the generating density is high-dimensional or multimodal. Only the density has to be known or estimated.

The idea was to define the integration borders indirectly by a zero level set with the probability density function as level set function. This led to the problem of transforming the significance level defining a prediction interval into a level set threshold. I have shown that this can be easily accomplished with the cumulative distribution function of the probability density function values. The advantage is that the complicated integration in the high-dimensional feature space is mapped to a one-dimensional function evaluation.

Furthermore, I have introduced a new probability measure, the significance level distribution, which can be easily derived from the probability density function. Its advantage is that it enables the assessment of the "plausibility" of a realization or feature vector, because it provides probabilities even for continuous realizations. The transformation procedure has low computational costs, and the estimation error of the method is negligible.

Please note that in practice the performance of the proposed method for one-class classification tasks depends significantly on the quality of the applied density estimation method, just like the quality of a Bayes classifier for multi-class classification. Conversely, for an optimally estimated density, the method is necessarily optimal for one-class classification, just as a Bayes classifier is optimal in the multi-class case. Because density estimation is not the topic of this article, I have deliberately omitted experimental comparisons with other outlier recognition or one-class classification methods.
References
Angiulli, F., Basta, S., Pizzuti, C., 2006. Distance-based detection and prediction of outliers. IEEE Transactions on Knowledge and Data Engineering 18 (2), 145–160.

Barnett, V., Lewis, T., 1994. Outliers in Statistical Data. John Wiley & Sons.

Dolia, A., 2002. Statistical descriptor of normality based on Hotelling's T²