Generalized Prediction Intervals for Arbitrarily Distributed High-Dimensional Data
Steffen Kühn
Technische Universität Berlin
Chair of Electronic Measurement and Diagnostic Technology
Einsteinufer 17, 10585 Berlin, Germany
steff[email protected]
Abstract
This paper generalizes the traditional statistical concept of prediction intervals to arbitrary probability density functions in high-dimensional feature spaces by introducing significance level distributions, which provide interval-independent probabilities for continuous random variables. The advantage of transforming a probability density function into a significance level distribution is that it enables one-class classification and outlier detection in a direct manner.
A prediction interval is an interval that will, with a specified degree of confidence, contain future realizations or, in the terminology of pattern recognition, feature vectors (Hahn and Meeker, 1991). The appeal of this concept is its clear stochastic meaning. Its great disadvantage is that the definition is usually too restrictive, for example for multimodal distributions. It is intuitively clear that, in this case, more than one interval of probable feature vectors can exist, and it would be better to speak of prediction regions. The situation is even more complicated for high-dimensional feature spaces. This lack of generality is probably the reason why prediction intervals are rarely used in pattern recognition.

This is a pity, because prediction regions would be very useful, for example, for the recognition of outliers (Barnett and Lewis, 1994) or the detection of novelty or normality. Instead of prediction intervals, numerous other methods are used for this purpose. They can be grouped roughly into two categories:

1. Distance-based and novelty or normality score-based approaches (Knorr et al., 2000; Dolia, 2002; Moonesinghe and Tan, 2006; Angiulli et al., 2006; Ting et al., 2007),

2. Methods that introduce a separate rejection class in combination with a classifier (Singh and Markou, 2004; Steinwart et al., 2005).

Applied to outlier detection, the method I propose here belongs to the first category, with a probability as normality score. Before going into the details, I give a short overview of related work.

Simple distance-based methods rely on the concept of the neighborhood of a point, for example the $k$-nearest neighborhood (Knorr et al., 2000). Outliers are those points for which there are fewer than $k$ points within a distance $\delta$ in the dataset. Ramaswamy et al. (2000) propose a method to choose this threshold $\delta$ automatically from the dataset: the outliers are the points with the largest distances to their $k$th nearest neighbors. Of course, a threshold is still necessary, but it now has a statistical interpretation as a quantile of the $k$th-nearest-neighbor distance distribution, which simplifies the choice. A more recent article based on this idea was published by Angiulli et al. (2006), who apply a weighted sum of the distances to the $k$ nearest neighbors of each point. Although the idea is quite simple, these methods have low computational costs, and they make only minor assumptions about the underlying distribution.

Another category of algorithms related to outlier detection is robust regression. Here, outlier detection is more a means to an end, because the goal is to prevent outliers from influencing the estimation of the regression function; it is therefore sufficient to detect outliers indirectly. Ting et al. (2007), for example, apply an outlier score to control the influence that a point has on the parameter estimation of the regression function. For this purpose, weights are introduced, which are estimated under the assumption that the noise is Gaussian. This is often sufficient, for example for most sensor signals. The algorithm is real-time capable, but far from general. This method also belongs to the first category.
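To make the first category concrete, the following is a minimal sketch of the $k$th-nearest-neighbor distance score in the spirit of Ramaswamy et al. (2000). It is my illustration, not code from any of the cited works; numpy is assumed, and the brute-force distance computation is only suitable for small datasets.

    import numpy as np

    def knn_outlier_scores(data, k):
        # Pairwise Euclidean distances, shape (n, n); brute force, O(n^2) memory.
        diff = data[:, None, :] - data[None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=-1))
        # After sorting each row, column 0 is the distance of a point to itself,
        # so column k holds the distance to the k-th nearest neighbor.
        dist.sort(axis=1)
        return dist[:, k]

Points with the largest scores are reported as outliers; choosing how many to report, or a cutoff on the score, corresponds to the threshold choice discussed above.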
The idea of the second category is very different. At first glance, it seems impossible to use classifiers to detect outliers, because classifiers need samples from both inliers and outliers for the estimation of their parameters, whereas usually only samples of inliers are available. The idea is to create an enclosing cloud of outlier samples synthetically with a random generator. Afterwards, it is possible to train the classifier. Singh and Markou (2004), for example, apply a neural network for this purpose. Other classifiers are also possible, for example an SVM (Steinwart et al., 2005). Regardless of the applied classifier, these probabilistic methods need, for the generation of the hull, a measure of the degree to which a generated sample point is an outlier. Singh and Markou (2004), for example, use simple prediction intervals ($\sigma$-ranges) for this purpose.

In conclusion, both categories have to solve the same problem: to find an appropriate zero level set for the inlier-generating density. In the subsequent sections, I show that this problem can be mapped to the choice of a significance level and that it is possible to generalize the traditional statistical concept of prediction intervals to prediction regions.

A prediction interval denotes a region in which future feature vectors $x$ occur with a predetermined probability. For its computation, it is essential that the generating probability density function $p_X(x)$ of the random variable $X$ is known. In this case,

$P(x_1 \le X \le x_2) = \int_{x_1}^{x_2} p_X(\omega)\, d\omega \qquad (1)$

is the probability that a future feature vector falls within the region $[x_1, x_2]$. For a prediction interval, the region borders have to be established so that the probability for outliers is lower than a given, fixed threshold: the significance level $\alpha$. A typical example is $\alpha = 5\%$, which means that the region is to be determined so that 95% of all possible feature vectors fall into it. Usually, an infinite number of borders fulfill this requirement. For Gaussian distributions, the region is therefore centered on the expected value. But this definition is only appropriate for this special case.

The reason is that the integration borders are defined directly. A way to solve this problem is to apply the level set idea with the probability density function $p_X(x)$ as level set function. The set

$\Gamma_\theta = \{\, x \mid p_X(x) - \theta = 0 \,\} \qquad (2)$

can serve as an implicit definition of the integration borders. The question is how the threshold $\theta$ corresponds to the significance level $\alpha$ so that

$\alpha \overset{!}{=} \int_{p_X(\omega) \le \theta} p_X(\omega)\, d\omega \qquad (3)$

holds. Let $F_Y$ be the cumulative distribution function of the probability density function values $y = p_X(x)$ and $F_Y^{-1}$ its inverse. In this case, the threshold $\theta$ can be computed from a given significance level $\alpha$ by

$\theta = F_Y^{-1}(\alpha). \qquad (4)$
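For illustration, consider a standard normal distribution and $\alpha = 5\%$. The classical two-sided prediction interval is $|x| \le 1.96$, and the density value at its border is

$\theta = p_X(1.96) = \frac{1}{\sqrt{2\pi}}\, e^{-1.96^2/2} \approx 0.0584.$

All $x$ with $p_X(x) \le \theta$ satisfy $|x| \ge 1.96$, so the integral in (3) equals $P(|X| \ge 1.96) \approx 0.05 = \alpha$, as required.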
I now prove this assertion.

Proof 1

The feature vectors $x$ can be interpreted as realizations of the vectorial random variable $X$. Because of this, $Y = p_X(X)$ is also a random variable. But contrary to $X$, $Y$ is scalar-valued. The relation between $X$ and $Y$ is strictly deterministic, that is,

$p_{Y|X}(y \mid x) = \delta(y - p_X(x)) \qquad (5)$

and consequently

$p_{Y,X}(y, x) = \delta(y - p_X(x))\, p_X(x). \qquad (6)$

Marginalization with respect to $X$ finally gives a formula to convert the probability density function $p_X$ into the probability density function $p_Y$:

$p_Y(y) = \int_\omega \delta(y - p_X(\omega))\, p_X(\omega)\, d\omega. \qquad (7)$

Now we can calculate the cumulative distribution function $F_Y(y)$, which is defined by

$F_Y(y) = \int_{-\infty}^{y} p_Y(y')\, dy'. \qquad (8)$

Inserting (7) into (8) yields

$F_Y(y) = \int_{-\infty}^{y} \int_\omega \delta(y' - p_X(\omega))\, p_X(\omega)\, d\omega\, dy'. \qquad (9)$

With the interval function

$\sqcap_a^b(x) = \begin{cases} 1, & a \le x \le b \\ 0, & \text{otherwise} \end{cases} \qquad (10)$

expression (9) can be transformed to

$F_Y(y) = \int_\omega \sqcap_{-\infty}^{y}(p_X(\omega))\, p_X(\omega)\, d\omega = \int_{p_X(\omega) \le y} p_X(\omega)\, d\omega. \qquad (11)$

A comparison shows that the right side of expression (3) is identical to $F_Y(\theta)$. For this reason, we can write (3) as

$\alpha \overset{!}{=} F_Y(\theta). \qquad (12)$

The cumulative distribution function $F_Y$ is by definition monotonic. If it is even strictly monotonic, the inverse function $F_Y^{-1}$ exists and we can compute $\theta$ for a given $\alpha$ by expression (4). Otherwise, we obtain for one value of $\alpha$ an interval of possible values for $\theta$. Because all values in this interval result in the same $\alpha$ in the integration (3), any value in this interval can be used to solve the equation.

To summarize: a significance level $\alpha$, as it is usually applied in statistics, can be transformed with the scalar-valued function $F_Y^{-1}$ into a level set threshold $\theta$. With this threshold, it is possible to classify those feature vectors whose probability density function values $p_X(x)$ are lower than $\theta$ as outliers, with a statistical significance of $1 - \alpha$. The ranges for which the condition $p_X(x) \ge \theta$ is fulfilled are the prediction regions.

It has been shown that the cumulative distribution function $F_Y(y)$ can be used to transform a given significance level $\alpha$ into a level set threshold $\theta$. But because of expression (11), $1 - F_Y(y)$ can also be applied as a measure describing our degree of surprise about a certain feature vector $x$. We summarize and define

$b_X(x) := \int_{p_X(\omega) \le p_X(x)} p_X(\omega)\, d\omega = F_Y(p_X(x)) \qquad (13)$

and name $b_X(x)$ the significance level distribution of $x$ for the random variable $X$.

The significance level distribution is in the true sense of the word a "probability distribution", because it provides a probability (the significance level) for every continuous realization $x$. Unfortunately, the term "probability distribution" is already used for probability density functions, which do not provide probabilities but probability density values. Note that the significance level distribution does not deliver the probability of a single realization $x$ itself, but the probability of all realizations even more unlikely than $x$. Nevertheless, $b_X(x)$ provides valuable information for the assessment of the realization $x$ and allows one to decide whether it is certain, probable, or only possible.
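Definition (13) translates directly into a Monte Carlo estimate, anticipating the estimation procedure described below. A minimal sketch (my illustration; numpy assumed, names hypothetical):

    import numpy as np

    def significance_level(pdf, x, samples):
        # b_X(x) = P(p_X(X) <= p_X(x)), eq. (13), estimated from
        # samples drawn from p_X.
        y = pdf(samples)
        return np.mean(y <= pdf(x))

For a standard normal pdf and $x = 1.96$, this returns a value close to $0.05$, in line with the illustration after expression (4).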
Figure 1: For unimodal and symmetric distributions, the significance level distribution (C) and the prediction interval (A) lead to the same results. But for multimodal distributions, only the significance level distribution is reasonable (D). [Four panels (A)-(D), each plotting $p_X(x)$ over $x$ with the respective prediction regions marked.]

For simple standard distributions, such as the Gaussian distribution or the Cauchy distribution, the significance level distribution can be given in closed form. Note that for a symmetric and unimodal distribution, the significance level distribution and the prediction interval are identical (see Fig. 1). For more complex distributions this is usually not the case, and it is seldom possible to give the significance level distribution in closed form. In these cases, it is reasonable to estimate the cumulative distribution function $F_Y$. The next section proposes a method for this and investigates its convergence speed. Figure 2 shows an example of a significance level distribution for a non-trivial probability density function. Please note that significance level distributions are not restricted to the one-dimensional case.
Figure 2: An example of a probability density function $p_X(x)$ and its related significance level distribution $b_X(x)$. The white zones are the "outlier regions" for a significance level of 5%; the threshold $\theta$ corresponds to $\alpha = 5\%$.

Estimation of $F_Y$

Contrary to the complicated integration (3), expression (12) can be computed easily if we estimate $F_Y$. We assume that the probability density function $p_X(x)$ is known or can be appropriately estimated. With the knowledge of $p_X(x)$, it is possible to generate $n$ correspondingly distributed random samples:

$D_X = \{x_1, \ldots, x_n\}. \qquad (14)$

Now we can transform this dataset into a dataset of probability density function values:

$D_Y = \{p_X(x_1), \ldots, p_X(x_n)\} = \{y_1, \ldots, y_n\}. \qquad (15)$

With this dataset $D_Y$, the cumulative distribution function $F_Y$ can be estimated by

$\tilde{F}_Y(y) = \frac{1}{n} \sum_{k=1}^{n} \Theta(y - y_k) \qquad (16)$

with the Heaviside function

$\Theta(y) = \begin{cases} 1, & y \ge 0 \\ 0, & \text{otherwise.} \end{cases} \qquad (17)$

The Glivenko-Cantelli theorem guarantees the convergence of this empirical distribution function $\tilde{F}_Y(y)$ to the true cumulative distribution function $F_Y(y)$ for $n \to \infty$. Note that it is unnecessary to sum over all elements $y_k$ to evaluate expression (16) if the dataset $D_Y$ is sorted; in this case, a binary search with computational costs of $O(\log n)$ can be applied.

It is possible to give the root mean squared error of this estimator as a function of the dataset size $n$:

$\mathrm{RMSE} = \sqrt{\frac{F_Y (1 - F_Y)}{n}} \le \frac{1}{2\sqrt{n}}. \qquad (18)$

This formula makes it possible to calculate the number of samples $n$ necessary for a desired accuracy at a given significance level $\alpha = F_Y$. It is important that the convergence speed depends neither on the generating density $p_X(x)$ nor on the dimension of the original problem (3).
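A minimal sketch of the estimator (16) for a standard normal density (my illustration; numpy assumed). Sorting $D_Y$ once makes every subsequent evaluation of $\tilde{F}_Y$ a binary search; with $n = 10000$, the bound (18) gives $\mathrm{RMSE} \le 1/(2\sqrt{n}) = 0.005$.

    import numpy as np

    rng = np.random.default_rng(0)
    sigma = 1.0

    def pdf(x):
        # Standard normal density, standing in for a known p_X.
        return np.exp(-x**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

    n = 10000
    d_x = rng.normal(0.0, sigma, n)   # D_X, eq. (14)
    d_y = np.sort(pdf(d_x))           # D_Y, sorted once, eq. (15)

    def f_tilde(y):
        # Empirical CDF (16): fraction of y_k <= y, found by binary search.
        return np.searchsorted(d_y, y, side="right") / n

    print(f_tilde(pdf(1.96)))  # approx. 0.05, cf. the Gaussian illustration above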
A proof of expression (18) follows.

Proof 2

We calculate the mean squared error (MSE) by computing the expectation value
$E\{(F_Y - \tilde{F}_Y)^2\} = \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} (F_Y - \tilde{F}_Y)^2\, p_Y(y_1)\, dy_1 \cdots p_Y(y_n)\, dy_n \qquad (19)$

of the squared error with respect to all elements of the dataset (15). The MSE is usually written as the sum
$\mathrm{MSE} = \mathrm{BIAS}^2 + \mathrm{VAR} \qquad (20)$

with

$\mathrm{BIAS} = E\{\tilde{F}_Y\} - F_Y \qquad (21)$

and

$\mathrm{VAR} = E\{\tilde{F}_Y^2\} - E\{\tilde{F}_Y\}^2. \qquad (22)$

For the estimation formula (16), we have

$E\{\tilde{F}_Y\} = \frac{1}{n} \sum_{k=1}^{n} E\{\Theta(y - y_k)\} = \frac{1}{n} \sum_{k=1}^{n} \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} \Theta(y - y_k)\, p_Y(y_1) \cdots p_Y(y_n)\, dy_1 \cdots dy_n = \frac{1}{n} \sum_{k=1}^{n} \int_{-\infty}^{+\infty} \Theta(y - y_k)\, p_Y(y_k)\, dy_k = \frac{1}{n} \sum_{k=1}^{n} \int_{-\infty}^{y} p_Y(y_k)\, dy_k = \frac{1}{n} \sum_{k=1}^{n} F_Y = F_Y. \qquad (23)$

This shows that the estimator (16) is unbiased. Furthermore, we have
$E\{\tilde{F}_Y^2\} = \frac{1}{n^2} \sum_{k=1}^{n} \sum_{j=1}^{n} E\{\Theta(y - y_k)\, \Theta(y - y_j)\} = \frac{1}{n^2} \sum_{k=1}^{n} \sum_{j \ne k} E\{\Theta(y - y_k)\}\, E\{\Theta(y - y_j)\} + \frac{1}{n^2} \sum_{k=1}^{n} E\{\Theta^2(y - y_k)\}. \qquad (24)$

Because $\Theta^2(y - y_k) = \Theta(y - y_k)$, we obtain

$E\{\tilde{F}_Y^2\} = \frac{1}{n^2} \sum_{k=1}^{n} \sum_{j \ne k} F_Y^2 + \frac{1}{n^2} \sum_{k=1}^{n} F_Y = F_Y^2 - \frac{1}{n} F_Y^2 + \frac{1}{n} F_Y. \qquad (25)$

Inserting (25) and (23) into (22) yields
$\mathrm{VAR} = \mathrm{MSE} = \frac{1}{n} F_Y (1 - F_Y). \qquad (26)$

(MSE and VAR coincide here because the estimator is unbiased; (26) is the variance of an average of $n$ independent Bernoulli variables $\Theta(y - y_k)$ with success probability $F_Y$.) Finally, because of
$\mathrm{RMSE} = \sqrt{\mathrm{MSE}}$, we obtain expression (18); the upper bound in (18) follows from $F_Y(1 - F_Y) \le 1/4$.

Overview of the method
Estimation
1. If the inlier-generating density $p_X(x)$ is unknown, estimate it with a suitable algorithm.
2. Choose an upper bound for the RMSE and calculate the necessary number $n$ of random samples to generate: $n = 1/(2\,\mathrm{RMSE})^2$.
3. Generate the random samples $D_X = \{x_1, \ldots, x_n\}$.
4. Compute the derived dataset $D_Y = p_X(D_X)$.
5. Sort $D_Y$.

Application

1. Choose a significance level $\alpha$.
2. Compute the density value $y = p_X(x)$ for the feature vector $x$ of interest.
3. Calculate the significance level value $z = b_X(x)$ by computing $z = F_Y(y)$.
4. Classify $x$ as an outlier if $z < \alpha$.

Both phases are illustrated in the sketch below.
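A compact end-to-end sketch of both phases (my illustration; numpy assumed). The inlier-generating density is taken as known here: an equal-weight mixture of two unit-variance Gaussians, i.e. a multimodal case in which a classical prediction interval would fail.

    import numpy as np

    rng = np.random.default_rng(1)

    def pdf(x):
        # Known p_X: 0.5 N(-3, 1) + 0.5 N(+3, 1).
        g = lambda m: np.exp(-(x - m)**2 / 2) / np.sqrt(2 * np.pi)
        return 0.5 * g(-3.0) + 0.5 * g(3.0)

    # Estimation steps 2-5: RMSE bound -> n, sample D_X, derive and sort D_Y.
    rmse = 0.005
    n = int(1.0 / (2.0 * rmse)**2)                 # 10000
    comp = rng.integers(0, 2, n)                   # mixture component per sample
    d_x = rng.normal(np.where(comp == 0, -3.0, 3.0), 1.0)
    d_y = np.sort(pdf(d_x))

    # Application steps 1-4: b_X(x) via binary search, comparison with alpha.
    def b(x):
        return np.searchsorted(d_y, pdf(x), side="right") / n

    alpha = 0.05
    for x in (0.0, 3.0, 8.0):
        print(x, b(x), "outlier" if b(x) < alpha else "inlier")

Note that the density trough at $x = 0$, although it lies between the two modes, is classified as an outlier region, as in Fig. 1 (D).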
Usually, the inlier-generating density $p_X(x)$ is unknown and has to be estimated. For this reason, the quality of the proposed method for outlier detection depends significantly on the quality of the applied density estimation algorithm. The influence of the $F_Y$ estimation is, however, marginal, because the RMSE (18) can, in contrast to the estimation error of $p_X(x)$, be reduced arbitrarily by increasing the number $n$ of random samples. The following experiment verifies this by comparing a theoretically determined significance level distribution $b_X$ with an estimated version $\tilde{b}_X$.

Only for some simple distributions is it possible to give the significance level distribution in closed form. One such example is the Gaussian distribution

$p_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{x^2}{2\sigma^2}}. \qquad (27)$

The significance level distribution has, in this case, the form

$b_X(x) = 1 - \operatorname{erf}\!\left(\frac{|x|}{\sqrt{2}\,\sigma}\right). \qquad (28)$

In an experiment, I have estimated the significance level distribution of a standard normal distribution with the proposed method, varying the random sample number $n$. For each value of $n$, I have averaged the squared differences between the closed form (28) and the estimated versions over 2000 single estimations. The results are summarized in Fig. 3, which shows that the experiment confirms the theoretical predictions.

Figure 3: Averaged root mean squared errors of 2000 single experiments for the significance level distribution estimation of a standard normal distribution for different random sample numbers $n$ (black lines), in comparison to the theoretical errors (fat white lines in the background). [Four panels for $n = 100$, $1000$, $10000$, $100000$, plotting the RMSE over $x \in [-4, 4]$.]
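A minimal sketch of a single run of this comparison (my illustration; numpy and scipy assumed):

    import numpy as np
    from scipy.special import erf

    rng = np.random.default_rng(2)
    sigma = 1.0

    def pdf(x):
        return np.exp(-x**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

    def b_closed(x):
        # Closed form (28) for the Gaussian case.
        return 1.0 - erf(np.abs(x) / (np.sqrt(2) * sigma))

    n = 10000
    d_y = np.sort(pdf(rng.normal(0.0, sigma, n)))

    def b_estimated(x):
        return np.searchsorted(d_y, pdf(x), side="right") / n

    x = np.linspace(-4.0, 4.0, 81)
    print(np.max(np.abs(b_closed(x) - b_estimated(x))))
    # Single-run deviations are on the order of the bound
    # 1/(2*sqrt(n)) = 0.005 from (18).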
In this article, I have shown that it is always possible to compute prediction regions as a generalization of prediction intervals, no matter whether the generating density is high-dimensional or multimodal. Only the density has to be known or estimated.

The idea was to define the integration borders indirectly by a zero level set with the probability density function as level set function. This led to the problem of transforming the significance level defining a prediction interval into a level set threshold. I have shown that this can be easily accomplished with the cumulative distribution function of the probability density function values. The advantage is that the complicated integration in the high-dimensional feature space is mapped to a one-dimensional function evaluation.

Furthermore, I have introduced a new probability measure, the significance level distribution, which can be easily derived from the probability density function. Its advantage is that it enables the assessment of the "plausibility" of a realization or feature vector, because it provides probabilities even for continuous realizations. The transformation procedure has low computational costs, and the estimation error of the method is negligible.

Please note that in practice the performance of the proposed method for one-class classification tasks depends significantly on the quality of the applied density estimation method, just like the quality of a Bayes classifier for multi-class classification. Conversely, for an optimally estimated density, the method is necessarily optimal for one-class classification, just as a Bayes classifier is optimal in the multi-class case. Because density estimation is not the topic of this article, I have deliberately omitted experimental comparisons with other outlier recognition or one-class classification methods.
References
Angiulli, F., Basta, S., Pizzuti, C., 2006. Distance-based detection and prediction of outliers. IEEE Transactions on Knowledge and Data Engineering 18 (2), 145–160.

Barnett, V., Lewis, T., 1994. Outliers in Statistical Data. John Wiley & Sons.

Dolia, A., 2002. Statistical descriptor of normality based on Hotelling's T²