Prediction and outlier detection in classification problems
Leying Guan∗, Rob Tibshirani†

Abstract
We consider the multi-class classification problem when the training data and the out-of-sample test data may have different distributions and propose a method called BCOPS (balanced and conformal optimized prediction sets). BCOPS constructs a prediction set C(x) as a subset of class labels, possibly empty. It tries to optimize the out-of-sample performance, aiming to include the correct class as often as possible, but also detecting outliers x, for which the method returns no prediction (corresponding to C(x) equal to the empty set). The proposed method combines supervised-learning algorithms with the method of conformal prediction to minimize a misclassification loss averaged over the out-of-sample distribution. The constructed prediction sets have a finite-sample coverage guarantee without distributional assumptions. We also propose a method to estimate the outlier detection rate of a given method. We prove asymptotic consistency and optimality of our proposals under suitable assumptions and illustrate our methods on real data examples.

1 Introduction

We consider the multi-class classification problem where the training data and the test data may be mismatched; that is, the training data and the test data may have different distributions. We assume access to the labeled training data and to unlabeled test data. Let {(x_i, y_i), i = 1, ..., n} be the training data set, with continuous features x_i ∈ R^p and response y_i ∈ {1, ..., K} for K classes.

In classification problems, one usually aims to produce a good classifier using the training data that predicts the class k at each x. Instead, here we construct a prediction set C(x) at each x by solving an optimization problem that minimizes the out-of-sample loss directly. The prediction set C(x) might contain multiple labels or be empty. When K = 2, for example, C(x) ∈ {{1}, {2}, {1, 2}, ∅}. If C(x) contains multiple labels, it indicates that x could be any of the listed classes. If C(x) = ∅, it indicates that x is likely to be far from the training data, so we do not assign it to any class and consider it an outlier.

There are many powerful supervised learning algorithms that try to estimate P(y = k | x), the conditional probability of y given x, for k = 1, ..., K. When the test data and the training data have the same distribution, we can often obtain reasonably good performance and a relatively faithful evaluation of the out-of-sample performance using sample splitting of the training data. However, the posterior probability P(y = k | x) may not reveal the fact that the training and test data are mismatched. In particular, when erroneously applied to mismatched data, the standard approaches may yield predictions for x far from the training samples, where it is usually better not to make a prediction at all. Figure 1 shows a two-dimensional illustrative example. In this example, we have a training data set with two classes and train a logistic regression model with it. The average misclassification loss based on sample splitting of the training data is extremely low. The test data come from a very different distribution. We plot the training data in the upper left plot: the black points represent class 1 and the blue points represent class 2, and we plot the test data in the upper right plot using red points. The black and blue dashed curves in these two plots are the boundaries for P(y = 1 | x) = 0.05 and P(y = 1 | x) = 0.95
from the logistic regression model. Based on the predictions from the logistic model, we are quite confident that the majority of the red points are from class 1. However, in this case, since the test samples are relatively far from the training data, we would most likely consider them to be outliers and not want to make predictions.

∗ Dept. of Statistics, Stanford Univ, [email protected]
† Depts. of Biomedical Data Sciences, and Statistics, Stanford Univ, [email protected]

Figure 1: Illustrative example I. We show the training data in the upper left plot; the black points represent class 1 and the blue points represent class 2, and the black and blue dashed curves are the boundaries for P(y = 1 | x) = 0.05 and P(y = 1 | x) = 0.95 based on the logistic model. In the upper right plot, we use red points to represent the test samples. The black and blue dashed lines in the upper half of the figure are the decision boundaries for the posterior probability of class 2 being 0.05 and 0.95 based on the logistic regression model. In the lower half of Figure 1, the interior of the dashed curves represents the density-level sets achieving 95% coverage for class 1 and class 2 respectively. [Scatter plots of training and test samples omitted; axes x1 vs x2.]

Figure 2: Illustrative example II. We have two classes with x ∈ R^p, and the two classes are well separated in the first dimension and follow a standard normal distribution in the other dimensions. The left plot of Figure 2 shows the data colored by their actual class: the black points represent class 1 and the blue points represent class 2. The right plot of Figure 2 shows the data colored according to their density-level set: x is colored green if C(x) = {1, 2}, black if C(x) = {1}, blue if C(x) = {2} and red if C(x) = ∅.
[Figure 2 panels (Actual class label; Density-level set) omitted; axes x1 vs x2.]

As an alternative, the density-level set (Lei et al. 2013, Hartigan 1975, Cadre 2006, Rigollet et al. 2009) considers f_y(x), the density of x given y. For each new sample x, it constructs a prediction set C(x) = {k : x ∈ A_k}, where A_k = {x | f_k(x) ≥ f_{k,α}} and f_{k,α} is the lower α percentile of f_k(x) under the distribution of class k.
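To make the construction concrete, here is a minimal one-dimensional sketch of a plug-in density-level set (our illustration, not the paper's code), using a Gaussian kernel density estimate of each f_k and the empirical lower-α percentile as the cutoff f_{k,α}; the data and the function name are assumptions made purely for the example.

```python
import numpy as np
from scipy.stats import gaussian_kde

def density_level_sets(train_by_class, x_new, alpha=0.05):
    """Plug-in density-level prediction sets: C(x) = {k : f_k_hat(x) >= f_hat_{k,alpha}}."""
    pred_sets = [set() for _ in range(len(x_new))]
    for k, x_k in train_by_class.items():
        f_k = gaussian_kde(x_k)                    # kernel density estimate of f_k
        cutoff = np.quantile(f_k(x_k), alpha)      # empirical lower-alpha percentile of f_k under class k
        for i in np.where(f_k(x_new) >= cutoff)[0]:
            pred_sets[i].add(k)                    # x_new[i] lies in the estimated A_k
    return pred_sets

rng = np.random.default_rng(0)
train = {1: rng.normal(0.0, 1.0, 500), 2: rng.normal(3.0, 1.0, 500)}
x_new = np.array([0.0, 3.0, 10.0])                 # the last point is far from both classes
print(density_level_sets(train, x_new))            # e.g. [{1}, {2}, set()]; an empty set flags an outlier
```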
In Figure 1, the lower half shows the result of the density-level set with α = 0.05. Again, the lower left plot contains the training samples and the lower right plot contains the test samples. The black and blue dashed ellipses are the boundaries of the decision regions A_1 and A_2 from the oracle density-level sets (with given densities). We call a prediction with C(x) = ∅ an abstention. In this example, we can see that the oracle density-level sets have successfully abstained from predictions while assigning correct labels to most training samples. The density-level set is also suggested as a way of making predictions with abstention in Hechtlinger et al. (2018).

However, the density-level set has its own drawbacks. It does not try to utilize information comparing different classes, which can potentially lead to a large deterioration in performance. Figure 2 shows another example where the oracle density-level set has less than ideal performance. In this example, we have two classes with x ∈ R^p, and the two classes are well separated in the first dimension and follow the standard normal distribution in the other dimensions. In Figure 2, we show only the first two dimensions. In the left plot of Figure 2, we have colored the samples based on their actual class: the black points represent class 1 and the blue points represent class 2. In the right plot of Figure 2, we have colored the data based on their oracle density-level set results: x is colored green if C(x) = {1, 2}, black if C(x) = {1}, blue if C(x) = {2} and red if C(x) = ∅. Even though classes 1 and 2 can be well separated in the first dimension, we still have C(x) = {1, 2} for a large portion of the data, especially for samples from class 2.

In this paper, following the previous approach of density-level sets, we propose a method called BCOPS (balanced and conformal optimized prediction set) to construct a prediction set C(x) ⊆ {1, 2, ..., K} for each x, which tries to make good predictions for samples that are similar to the training data and refrains from making predictions otherwise. BCOPS is usually more powerful than the density-level set because it combines information from different classes when constructing C(x). We also describe a new regression-based method for the evaluation of outlier detection ability under some assumptions on how the test data may differ from the training data. The paper is organized as follows. In section 2, we first describe our model and related works, then we introduce BCOPS. In section 3, we describe methods to evaluate the performance regarding outlier detection. Some asymptotic behaviors of our proposals are given in section 4. Finally, we provide real data examples in section 5.

2 The model and BCOPS

2.1 The model

It is often assumed that the distribution of the training data and the out-of-sample data are the same. Let π_k ∈ (0,
1) be the proportion of samples from class k ∈ {1, ..., K} in the training data, with Σ_{k=1}^K π_k = 1. Let f_k(x) be the density of x from class k, and let f(x) and f_test(x) be the marginal in-sample and out-of-sample densities. Under this assumption, we know that

f(x) = Σ_{k=1}^K π_k f_k(x),   f_test(x) = Σ_{k=1}^K π_k f_k(x)   (1)

In particular, f_test(x) can be written as a mixture of the f_k(x), and the mixture proportions π_k remain unchanged.

In this paper, we allow for distributional changes and assume that the out-of-sample data may have different mixture proportions ˜π_k ∈ [0, 1) for k ∈ {1, ..., K}, as well as a new class R (the outlier class) which is unobserved in the training data. We let

f_test(x) = Σ_{k=1}^K ˜π_k f_k(x) + ε · e(x)   (2)

where e(x) is the density of x from the outlier class, ε ∈ [0,
1) is its proportion, and Σ_{k=1}^K ˜π_k + ε = 1.

Under this new model assumption, we want to find a prediction set C(x) that aims to minimize the size of C(x) averaged over a properly chosen importance measure µ(x), with guaranteed coverage (1 − α) for each class. Let |C(x)| be the size of C(x). We consider the optimization problem P below:

min ∫ |C(x)| µ(x) dx
s.t. P_k(k ∈ C(x)) ≥ 1 − α, ∀ k = 1, ..., K   (3)

where P_k(A) is the probability of the event A under the distribution of class k, and µ(x) is a weighting function which we will choose later to trade off classification accuracy and outlier detection. The constraint (3) says that we want to have k ∈ C(x) for at least a (1 − α) fraction of samples that are actually from an observed class k (coverage requirement). If C(x) = ∅, x is considered to be an outlier at the given level α and we refrain from making a prediction.

It is easy to check that problem P can be decomposed into K independent problems for the different classes, referred to as problem P_k:

min ∫_{x ∈ A_k} µ(x) dx   (4)
s.t. P_k(x ∈ A_k) ≥ 1 − α   (5)

Let A_k be the solution to problem P_k; then the solution to problem P is C(x) = {k : x ∈ A_k}. The set A_k has an explicit form using the density functions.

For an event A, let P_F(A) be the probability of A under distribution F, and let Q(α; g, F) be the lower α percentile of a real-valued function g(x) under distribution F:

Q(α; g, F) = sup{ t : P_F(g(x) ≤ t) ≤ α }

We use Q(α; g(x_1), ..., g(x_n)), or Q(α; g(x_1^n)) for short, to denote the lower α percentile of g(x) from the empirical distribution using samples x_1, ..., x_n. Let F_k be the distribution of x from class k. It is easy to check that

A_k = {x : v_k(x) ≥ Q(α; v_k, F_k)},   v_k(x) = f_k(x)/µ(x)   (6)

is the solution to problem P_k. We call this A_k the oracle set for class k; the oracle prediction set C(x) for problem P is constructed using the oracle sets A_k. Since A_k depends only on the ordering of v_k(x), we can also use any order-preserving transformation of v_k(x) when constructing A_k. (An order-preserving transformation o : R → R satisfies v_1 < v_2 ⇔ o(v_1) < o(v_2) for all v_1, v_2 ∈ R.)

2.2 The choice of µ(x)

How should we choose µ(x)? For any given µ(x), while the coverage requirement is satisfied by definition, the choice of µ(x) influences how well we separate different observed classes from each other, and inliers from outliers. In practice, besides the coverage, one also wants to minimize the misclassification loss averaged over the out-of-sample data, e.g.,

Err = E_{(x,y)∼f_test} Σ_{k ≠ y} 1{k ∈ C(x)}   (7)

It is easily shown that the solution minimizing the above misclassification loss under the coverage requirement is the same as the one minimizing the objective ∫ |C(x)| µ(x) dx in problem P with µ(x) = f_test(x). This makes f_test(x) a natural choice for µ(x).

BCOPS constructs Ĉ(x) to approximate the oracle solution C(x) to P with µ(x) = f_test(x). Some previous work is closely related or equivalent to other choices of µ(x). For example, the density-level set described in the introduction can be written equivalently as the solution of problem P with µ(x) ∝ 1. Ĉ(x) is constructed by combining a properly chosen learning algorithm with the conformal prediction idea to meet the coverage requirement without distributional assumptions (Vovk et al. 2005).
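As a toy illustration of (6) with the natural choice µ(x) = f_test(x) discussed above (a sketch under assumed Gaussian densities, not the paper's code): when the densities are known, the oracle set A_k only requires the lower-α quantile of v_k(x) = f_k(x)/f_test(x) over class-k samples.

```python
import numpy as np
from scipy.stats import norm

alpha, eps = 0.05, 0.2                        # coverage level and assumed outlier fraction
f1 = lambda x: norm.pdf(x, 0, 1)              # class-1 density (assumed for illustration)
f2 = lambda x: norm.pdf(x, 3, 1)              # class-2 density (assumed)
e  = lambda x: norm.pdf(x, 8, 1)              # outlier density (assumed)
f_test = lambda x: 0.4 * f1(x) + 0.4 * f2(x) + eps * e(x)   # test-data mixture, cf. eq. (2)

rng = np.random.default_rng(1)
samples = {1: rng.normal(0, 1, 2000), 2: rng.normal(3, 1, 2000)}   # class-k draws for Q(alpha; v_k, F_k)
dens = {1: f1, 2: f2}

def prediction_set(x):
    C = set()
    for k in (1, 2):
        v = lambda z: dens[k](z) / f_test(z)              # v_k = f_k / mu with mu = f_test
        if v(x) >= np.quantile(v(samples[k]), alpha):     # x lies in A_k of eq. (6)
            C.add(k)
    return C

for x in (0.0, 3.0, 8.0):
    print(x, prediction_set(x))   # the point near the outlier component gets the empty set
```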
For the remainder of this section, we first have a brief discussion of some related methods in section 2.3, and review the idea of conformal prediction in section 2.4. We give details of BCOPS in section 2.5, and we show a simulated example in section 2.6.

2.3 Related work

The new model assumption described in equation (2),

f_test(x) = Σ_{k=1}^K ˜π_k f_k(x) + ε · e(x),

allows for changes in the mixture proportions, and treats as outliers the part of the distributional change that cannot be explained. This assumption is different from the assumption that fixes f(y|x) and allows changes in f(x) without constraint. We use this model because P(y = k) is much easier to estimate than f(x), and it explicitly describes what kind of data we would like to reject.

Without the extra term ε · e(x), the change in mixture proportions π_k is also called label shift/target shift (Zhang et al. 2013, Lipton et al. 2018). Zhang et al. (2013) also allows for a location-scale transformation in x | y. When only label shift happens, a better prediction model can be constructed through sample reweighting using the labeled training data and the unlabeled out-of-sample data.

The extra term ε · e(x) corresponds to the proportion and distribution of outliers. We do not want to make a prediction if a sample comes from the outlier distribution. There is an enormous literature on the outlier detection problem, and interested readers can find a thorough review of traditional outlier detection approaches in Hodge & Austin (2004) and Chandola et al. (2009).

Here, we go back to the density-level set. Both BCOPS and the density-level set are based on the idea of prediction sets/tolerance regions/minimum volume sets (Wilks 1941, Wald 1943, Chatterjee & Patra 1980, Li et al. 2008), where for each observation x, we assign it a prediction set C(x) instead of a single label so as to minimize a certain objective, usually the length or volume of C(x), while satisfying some coverage requirements. As we pointed out before, the density-level set is the optimal solution when µ(x) ∝ 1:

min ∫ |C(x)| dx
s.t. P_k(k ∈ C(x)) ≥ 1 − α, ∀ k = 1, ..., K

With µ(x) being the Lebesgue measure, it is not obvious that µ(x) ∝ 1 is a good choice. Another related choice is µ(x) = f(x), the in-sample density. When µ(x) = f(x), we encounter the same problem as in the usual classification methods that learn P(y|x) and could assign confident predictions to test samples which are far from the training data. As a contrast, BCOPS chooses µ(x) to utilize as much information as possible to minimize Err = E_{(x,y)∼f_test} Σ_{k≠y} 1{k ∈ C(x)}, which usually leads to not only good predictions for inliers but also abstentions for the outliers.

In an independent recent work, Barber et al. (2019) also used information from the unlabeled out-of-sample data, under a different model and goal.

2.4 Conformal prediction

BCOPS constructs Ĉ(x) using the method of conformal prediction. We give a brief recap of conformal prediction here for completeness.

Let X_1, ..., X_n ~ i.i.d. P and let X_{n+1} be a new observation. Conformal prediction considers the question whether X_{n+1} also comes from P and aims to find a decision rule such that, if X_{n+1} is also independently generated from P, then we will accept X_{n+1} (as being from P) with probability at least 1 − α. The key step of conformal prediction is to construct a real-valued conformal score function σ({X_1, ..., X_{n+1}}, x) of x that may depend on the observations {X_1, ..., X_n, X_{n+1}} but is permutation invariant in its first argument. Let σ_i = σ({X_1, ..., X_n, X_{n+1}}, X_i). Then, if X_{n+1} is also independently generated from P, by symmetry, we have that

s_i = (1/(n+1)) Σ_{j=1}^{n+1} 1{σ_i ≥ σ_j}

is uniformly distributed on {1/(n+1), 2/(n+1), ..., n/(n+1), 1} (ties broken randomly). For any feature value x, we decide whether the set A contains x by letting X_{n+1} = x and considering the corresponding s_{n+1}:

A = {x | s_{n+1} ≥ ⌊(n+1)α⌋/(n+1)}

Then we have P(X_{n+1} ∈ A) ≥ 1 − α (Vovk et al. 2005).

The most familiar valid conformal score may be the sample-splitting conformal score σ({X_1, ..., X_n}, x) = σ(x), where the conformal score function is independent of the new observation X_{n+1} and of the observations {X_1, ..., X_n} that are used to construct s_{n+1} given the conformal score function (but can depend on other training data). From now on, we will call a procedure based on this independence the sample-splitting conformal construction.

Another simple example, where the conformal score function actually relies on the permutation invariance, is given below:

σ({X_1, ..., X_n, X_{n+1}}, x) = −(x − (1/(n+1)) Σ_{i=1}^{n+1} X_i)²

Since σ({X_1, ..., X_n, X_{n+1}}, x) is permutation invariant in its first argument {X_1, ..., X_n, X_{n+1}}, we will have the desired coverage with this score function. We call a procedure of the above type, which relies on the permutation invariance but not on independence between the observations and the conformal score function, the data-augmentation conformal construction.

In BCOPS, we estimate the v_k(x) used in eq. (6) through either a sample-splitting conformal construction or a data-augmentation conformal construction, so as to have coverage validity without distributional assumptions.
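As a concrete illustration of the mechanics (a minimal sketch of ours, not code from the paper), the toy score above can be turned into an acceptance rule; the threshold ⌊(n+1)α⌋/(n+1) is exactly the one used in the definition of A.

```python
import numpy as np

def conformal_pvalue(X, x_new):
    """Data-augmentation conformal score for one candidate x_new.
    sigma is permutation invariant in its first argument: here
    sigma({X_1,...,X_{n+1}}, x) = -(x - mean of the augmented sample)^2."""
    aug = np.append(X, x_new)
    sigma = -(aug - aug.mean()) ** 2            # conformity score of every point, including x_new
    return np.mean(sigma[-1] >= sigma)          # s_{n+1} = (1/(n+1)) * #{j : sigma_{n+1} >= sigma_j}

rng = np.random.default_rng(0)
X = rng.normal(0, 1, 200)
alpha, n = 0.1, len(X)
threshold = np.floor((n + 1) * alpha) / (n + 1)
for x in (0.5, 4.0):
    accepted = conformal_pvalue(X, x) >= threshold   # keep x in A iff s_{n+1} >= floor((n+1)alpha)/(n+1)
    print(x, accepted)                               # 0.5 is typically accepted, 4.0 rejected
```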
2.5 BCOPS

With observations from the out-of-sample data, we can consider directly problem P_k with µ(x) = f_test(x):

min ∫_{x ∈ A_k} f_test(x) dx
s.t. P_k(x ∈ A_k) ≥ 1 − α

This has the solution

A_k = {x : v_k(x) ≥ Q(α; v_k, F_k)},   v_k(x) = f_k(x) / (f_k(x) + f_test(x))

where we have applied an order-preserving transformation to the density ratio f_k(x)/f_test(x) to get v_k(x). Thus, for example, a test point x will have an empty prediction set and be deemed an outlier if each class density f_k(x) relative to the overall density f_test(x) is low.

If we knew f_k(x), f_test(x) and hence v_k(x), we could form the oracle A_k and C(x). They are of course unknown: one could use density estimation to approximate them, but this would suffer in high dimensions. Instead, our proposed BCOPS constructs sets Â_k to approximate the above A_k using the idea of conformal prediction, where the conformal score function is learned via a supervised binary classifier L. When the density ratio v_k(x) has low-dimensional structure, the density ratio function learned by the binary classifier is often much better than one that uses density estimates directly. Since we use a conformal construction when constructing Â_k, the constructed prediction set Ĉ(x) = {k : x ∈ Â_k} will also have finite-sample coverage validity. Algorithm 1 gives details of its implementation.

Algorithm 1 BCOPS

function BCOPS(D^tr, D^te, α, L)
Input: Coverage level α, a binary classifier L, labeled training data D^tr, unlabeled test data D^te.
Output: For each x ∈ D^te, the prediction set Ĉ(x).
1. Randomly split the training and test data into {D^tr_1, D^tr_2} and {D^te_1, D^te_2}. Let D^tr_{k,1}, D^tr_{k,2} contain the samples from class k in D^tr_1, D^tr_2 respectively.
2. For each k, apply L to {D^tr_{k,1}, D^te_1} to separate D^tr_{k,1} from D^te_1 and learn a prediction function v̂_{k,1}(x) for v_k(x) = f_k(x)/(f_k(x) + f_test(x)). Do the same thing with {D^tr_{k,2}, D^te_2}, and denote the learned prediction function by v̂_{k,2}(x).
3. For x ∈ D^te, let t ∈ {1, 2} be such that x ∈ D^te_t, and let t' = {1, 2} \ t. We construct

s_k(x) = (1/(|D^tr_{k,t}| + 1)) Σ_{z ∈ D^tr_{k,t} ∪ {x}} 1{v̂_{k,t'}(x) ≥ v̂_{k,t'}(z)}

and Â_k = {x : s_k(x) ≥ ⌊(|D^tr_{k,t}| + 1)α⌋ / (|D^tr_{k,t}| + 1)}, Ĉ(x) = {k : x ∈ Â_k}.
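Below is a compact sketch of Algorithm 1 (our illustration, not the authors' implementation), using scikit-learn's random forest as the binary classifier L; the function name bcops and all variable names are ours, and the sketch assumes every class appears in both training folds.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def bcops(x_train, y_train, x_test, alpha=0.05, seed=0):
    """Sketch of BCOPS (Algorithm 1). Returns one prediction set per test point."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y_train)
    # 1. split the training and the test data into two folds
    tr_fold = rng.integers(0, 2, len(y_train))
    te_fold = rng.integers(0, 2, len(x_test))
    # 2. for each fold t and class k, learn v_hat_{k,t} separating class k (fold t) from test fold t
    v_hat = {}
    for t in (0, 1):
        for k in classes:
            a = x_train[(tr_fold == t) & (y_train == k)]
            b = x_test[te_fold == t]
            clf = RandomForestClassifier(n_estimators=200, random_state=seed)
            clf.fit(np.vstack([a, b]), np.r_[np.ones(len(a)), np.zeros(len(b))])
            v_hat[(k, t)] = lambda x, clf=clf: clf.predict_proba(x)[:, 1]   # proxy for f_k/(f_k+f_test)
    # 3. conformal step: score test fold t with the model from the other fold, calibrate on D^tr_{k,t}
    pred_sets = [set() for _ in range(len(x_test))]
    for t in (0, 1):
        tp = 1 - t
        for k in classes:
            cal = x_train[(tr_fold == t) & (y_train == k)]       # calibration samples D^tr_{k,t}
            v_cal = v_hat[(k, tp)](cal)
            idx = np.where(te_fold == t)[0]
            v_te = v_hat[(k, tp)](x_test[idx])
            cut = np.floor((len(cal) + 1) * alpha) / (len(cal) + 1)
            for i, v in zip(idx, v_te):
                s = (1 + np.sum(v >= v_cal)) / (len(cal) + 1)    # s_k(x); the +1 counts z = x itself
                if s >= cut:
                    pred_sets[i].add(k)
    return pred_sets
```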
Remark 1. Algorithm 1 uses the sample-splitting conformal construction. We can also use the data-augmentation conformal construction instead. For a new observation x and class k, we can consider the augmented data D_k = {x_{k,1}, ..., x_{k,n_k}, x}, where x_{k,i} for i = 1, ..., n_k are the samples from class k in the training data set, and build a classifier separating D_k from D^te \ {x}, the test data excluding x. For each new observation, we let the trained prediction model be v̂_k(·|x) and let

s_k(x) = (1/(n_k + 1)) Σ_{z ∈ D^tr_k} 1{v̂_k(x|x) ≥ v̂_k(z|x)},   Â_k = {x : s_k(x) ≥ ⌊(n_k + 1)α⌋/(n_k + 1)},   Ĉ(x) = {k | x ∈ Â_k}

By exchangeability, we also have a finite-sample coverage guarantee using this approach (data-augmentation conformal construction). However, we use the sample-splitting conformal construction in this paper to avoid a huge computational cost.
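For completeness, here is a sketch of the data-augmentation variant of Remark 1 for a single class k (again our illustration, with a logistic-regression classifier standing in for L); note that one classifier is refit per test point, which is the computational cost the remark warns about.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def augmented_conformal_keep(x_train_k, x_test, alpha=0.05):
    """Data-augmentation construction for one class k: returns, for each test point,
    whether class k is kept in its prediction set."""
    n_k = len(x_train_k)
    cut = np.floor((n_k + 1) * alpha) / (n_k + 1)
    keep = []
    for i in range(len(x_test)):
        x = x_test[i:i + 1]
        rest = np.delete(x_test, i, axis=0)                   # D^te \ {x}
        aug = np.vstack([x_train_k, x])                       # D_k = class-k samples plus x
        clf = LogisticRegression(max_iter=1000)
        clf.fit(np.vstack([aug, rest]),
                np.r_[np.ones(len(aug)), np.zeros(len(rest))])
        v = clf.predict_proba(aug)[:, 1]                      # v_hat_k(.|x) on D_k; v[-1] is the score of x
        s = np.sum(v[-1] >= v[:-1]) / (n_k + 1)               # s_k(x), sum over the class-k samples
        keep.append(s >= cut)
    return np.array(keep)
```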
By exchangeability, we know that the above procedures have finite sample validity (Vovk et al. 2009, Cadre et al. 2009, Lei et al. 2013, Lei 2014, Lei & Wasserman 2014):

Proposition 1. Ĉ(x) is finite-sample valid: P_k(k ∈ Ĉ(x)) ≥ 1 − α, ∀ k = 1, ..., K.

Algorithm 1 produces a prediction set Ĉ(x) that achieves the same objective as the oracle prediction set C(x) (we will refer to this as asymptotic optimality) if v̂_{k,1}, v̂_{k,2} are good estimates of v_k(x) and v_k(x) is well-behaved. A more rigorous statement can be found in section 4.

2.6 A simulated example

In this section, we provide a simple simulated example to illustrate the differences between three methods: (1) BCOPS, (2) the density-level set, where µ(x) ∝ 1 in P, and (3) the in-sample ratio set, where µ(x) = f(x) in P. All three methods follow the sample-splitting conformal construction with level α = 0.05.
For both BCOPS and the in-sample ratio set, we use the random forest classifier to learn v_k(x) (v_k(x) = f_k(x)/(f_k(x) + f_test(x)) for BCOPS and v_k(x) = f_k(x)/f(x) for the in-sample ratio set).

We let x ∈ R^p and generate 1000 training samples, half from class 1 and the other half from class 2. The first feature is generated as x_1 ∼ N(0, 1) if y = 1 and x_1 ∼ N(3, 0.5) if y = 2, with x_j ∼ N(0, 1) for j = 2, ..., p. The test data also contain outliers from a new class R, generated by shifting the second coordinate: x_2 has mean 3 while x_j ∼ N(0, 1) for j ≠ 2. In this example, we let the learning algorithm L be the random forest classifier.

Figure 3 plots the first two dimensions of the test samples and shows the regions with 95% coverage for BCOPS, the density-level set and the in-sample ratio set. The upper left plot colors the data based on their correct label, and a point is colored black/blue/red if its label is class 1/class 2/outlier. For the remaining three plots, a sample is colored black if C(x) = {1}, blue if C(x) = {2}, green if C(x) = {1, 2} and red if C(x) = ∅. Table 1 shows the abstention rate for outliers (the higher, the better), the prediction accuracy for data from classes 1 and 2 (a prediction is called correct if C(x) = {y} for a sample (x, y)), and the coverages for class 1 and class 2.

Table 1: An illustrative example. The second column R is the abstention rate for outliers, the third column is the prediction accuracy, and the fourth and fifth columns are the coverage for samples from class 1 and class 2.

                     R      accuracy   coverage I   coverage II
  density-level      0.46   0.57       0.96         0.97
  in-sample ratio    0.20   0.94       0.94         0.95
  BCOPS              0.84   0.95       0.96         0.97

We can see that BCOPS achieves a much higher abstention rate for outliers, and much higher accuracy on the observed classes, compared with the density-level set, while the in-sample ratio set has similar accuracy to BCOPS but the lowest abstention rate in this example.

We also observe that a small α might lead to making predictions on many outliers which are far from the training data (especially for the density-level set and the in-sample ratio set). While the power for outlier detection varies across approaches and problems, we want to learn about the outlier abstention rate no matter what kind of method we are using. In section 3, we provide methods for this purpose.

Figure 3: A simulated example. The upper left plot shows the class label for each sample in the test data set. The upper right, lower left and lower right plots correspond to the prediction results using the density-level set, the in-sample ratio set and BCOPS respectively. The upper left plot colors the data based on their correct label, and a point is colored black/blue/red if its label is class 1/class 2/outlier. For the remaining three plots, a sample is colored black if C(x) = {1}, blue if C(x) = {2}, green if C(x) = {1, 2} and red if C(x) = ∅.
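For readers who want to reproduce the flavor of this example, the following sketch generates data of this type and can be passed to the bcops() sketch given after Algorithm 1. The dimension p, the outlier fraction, and the variances used for the noise coordinates and the outlier shift are our own assumptions; the text above does not pin them all down.

```python
import numpy as np

def simulate(n_train=1000, n_test=1000, p=10, eps=0.3, seed=0):
    """Toy data in the spirit of the simulated example: classes 1 and 2 differ in coordinate 1,
    outliers differ in coordinate 2; remaining coordinates are standard normal noise."""
    rng = np.random.default_rng(seed)
    y_tr = np.repeat([1, 2], n_train // 2)
    x_tr = rng.normal(0, 1, (n_train, p))
    x_tr[y_tr == 2, 0] = rng.normal(3, np.sqrt(0.5), (y_tr == 2).sum())  # class 2: x_1 ~ N(3, 0.5)
    n_out = int(eps * n_test)
    y_te = np.r_[np.repeat([1, 2], (n_test - n_out) // 2), np.full(n_out, -1)]  # -1 marks outliers
    x_te = rng.normal(0, 1, (len(y_te), p))
    x_te[y_te == 2, 0] = rng.normal(3, np.sqrt(0.5), (y_te == 2).sum())
    x_te[y_te == -1, 1] = rng.normal(3, 1, n_out)                        # outliers: x_2 shifted to mean 3
    return x_tr, y_tr, x_te, y_te

# usage with the earlier sketch:
# x_tr, y_tr, x_te, y_te = simulate(); sets = bcops(x_tr, y_tr, x_te, alpha=0.05)
```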
[Figure 3 panels (test class label; density-level set; in-sample ratio set; BCOPS) omitted; axes x1 vs x2.]

3 Outlier abstention rate and false labeling rate
In the previous section, we proposed BCOPS for prediction set construction at a given level α. In this section, we describe a regression-based method to estimate the test set mixture proportions ˜π_k for k = 1, ..., K, and, using this, we estimate the outlier abstention rate and the FLR (false labeling rate). The outlier abstention rate is the expected proportion of outliers with an empty prediction set. The FLR is the expected ratio between the number of outliers given a label and the total number of samples given a label. For a fixed prediction set function C(x), its outlier abstention rate γ and FLR are defined as

γ := P_R(C(x) = ∅)
FLR := E( |{x ∈ D^te : y(x) = R, C(x) ≠ ∅}| / (|{x ∈ D^te : C(x) ≠ ∅}| ∨ 1) )

The expectation is taken over the distribution of x. The outlier abstention rate is the power of C(x) for outlier detection, while the FLR controls the percentage of outliers among the samples given predictions.

Information about the outlier abstention rate and the FLR can be valuable when picking α. For example, while we want to have both small α and large γ, γ is negatively related to α. As a result, we might want to choose α based on the trade-off curve of α and γ. There are different ways we may want to utilize γ or the FLR:

1. Set α ≥ α* to control the FLR; for example, let α* = inf{α : FLR(α) ≤ q} for a pre-specified target q, where FLR(α) is the FLR at the given α.
2. Set α = α* to control the abstention rate, where α* is the smallest α such that γ is above a given threshold.
3. Set α without considering FLR or γ; however, in this case, we can still assign a score to each point measuring how likely it is to be an outlier. For each point x, let α(x) be the largest α such that C(x) = ∅. We let its outlier score be γ(x) := γ_{α(x)}, where γ_{α(x)} is the abstention rate when the required coverage is (1 − α(x)). The interpretation of the outlier score is simple: if we want to refrain from making a prediction for a γ_0 proportion of the outliers, then we do not make a prediction for samples with γ(x) ≥ γ_0, even if C(x) itself is non-empty.

3.1 Estimation of γ and FLR

When the proportion of outliers ε is greater than zero in the test samples, γ can also be expressed as

γ = E[number of outliers with C(x) = ∅] / [total number of outliers] = (E[N_∅] − Σ_{k=1}^K N ˜π_k γ_k) / (N(1 − Σ_{k=1}^K ˜π_k))

where N is the total number of test samples, N_∅ is the total number of samples with abstention (C(x) = ∅) and γ_k := P_k(C(x) = ∅) is the abstention rate for class k. The FLR can be expressed as

FLR = E[ ((N − N_∅) − Σ_{k=1}^K N ˜π_k (1 − γ_k)) / ((N − N_∅) ∨ 1) ]

where (N − N_∅) counts all samples given a non-empty prediction set and N ˜π_k (1 − γ_k) counts, in expectation, the class-k samples among them. If γ̂_k and π̂_k, the estimates of γ_k and ˜π_k, are available, we can construct empirical estimates γ̂ and \widehat{FLR} of γ and FLR:

γ̂ = ((N_∅ − Σ_{k=1}^K N π̂_k γ̂_k) ∨ 0) / ((N(1 − Σ_{k=1}^K π̂_k)) ∨ 1),   \widehat{FLR} = ((N − N_∅ − Σ_{k=1}^K N π̂_k (1 − γ̂_k)) ∨ 0) / ((N − N_∅) ∨ 1)

3.1.1 Estimation of γ_k

We estimate γ_k using the empirical distribution of class k from the training data. More specifically, for BCOPS, we let

γ̂_k = |{x ∈ D^tr_k : Ĉ(x) = ∅}| / |D^tr_k|

where Ĉ(x) follows the same construction as in the BCOPS algorithm: for x ∈ D^tr_{k,t}, we construct Ĉ(x) using training and test samples from fold t' ∈ {1, 2} \ t, as described in the BCOPS algorithm.
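The plug-in estimates γ̂ and \widehat{FLR}, and the α-selection rules above, take only a few lines of code; the helper names below are ours, and the ∨0 / ∨1 guards mirror the truncation in the formulas.

```python
import numpy as np

def gamma_flr_estimates(n_test, n_empty, pi_hat, gamma_k_hat):
    """Plug-in estimates of the outlier abstention rate gamma and the FLR."""
    pi_hat, gamma_k_hat = np.asarray(pi_hat), np.asarray(gamma_k_hat)
    inlier_empty = n_test * np.sum(pi_hat * gamma_k_hat)            # expected abstentions among inliers
    gamma_hat = max(n_empty - inlier_empty, 0) / max(n_test * (1 - pi_hat.sum()), 1)
    inlier_labeled = n_test * np.sum(pi_hat * (1 - gamma_k_hat))    # expected inliers given a label
    flr_hat = max(n_test - n_empty - inlier_labeled, 0) / max(n_test - n_empty, 1)
    return gamma_hat, flr_hat

def pick_alpha(alphas, gamma_hat_curve, target=0.95):
    """Rule 2 above: the smallest alpha on a grid whose estimated abstention rate reaches a target."""
    ok = [a for a, g in zip(alphas, gamma_hat_curve) if g >= target]
    return min(ok) if ok else None
```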
3.1.2 Estimation of ˜π_k

Let S_l be regions such that P_R(S_l) = 0 for l = 1, ..., K. Then, by our model assumption, we know that

P_test(S_l) = Σ_{k=1}^K ˜π_k P_k(S_l)

Let P_{0,l} = P_test(S_l) be the response vector, let P be a K × K design matrix with P_{l,k} = P_k(S_l), and let Σ = P^T P. As long as Σ is invertible, ˜π is the solution to the regression problem that regresses P_0 on P. Next, we give a simple proposal for constructing such S_l.

For a fixed function η : R^p → R^K, let g_{l,k}(·) be the density of η_l in class k. If the outliers occur with probability 0 in the regions of η_l where class l has relatively high density, we can let S_l be the region with relatively high g_{l,l}(·):

1. Let ◦ be the composition operator and S_l = {z : g_{l,l}(z) ≥ Q(ζ; g_{l,l} ◦ η_l, F_l)}, P_{0,l} = P_test(η_l(x) ∈ S_l), P_{l,k} = P_k(η_l(x) ∈ S_l) for a user-specified constant ζ ∈ (0,
1) specifying the separation between inliers and outliers.
2. We then solve J(η) := min_π ‖P_0 − Pπ‖, the oracle problem based on the function η.

We recommend taking η_l(x) = log(f_l(x)/f_test(x)), the log-odds ratio separating class l from the test data, since it automatically tries to separate class l from the other classes, including the outliers.

Neither η nor P_0, P given η will be observed. In practice, we use sample splitting to estimate η in one fold of the data and estimate P_0, P empirically in the other fold, conditional on the estimated η. See Algorithm 2, MixEstimate (mixture proportion estimation), for details.
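The empirical version of the regression step, min_π ‖P_0 − Pπ‖ subject to π_k ≥ 0 and Σ_k π_k ≤ 1 (the constraints discussed in Remark 2 below), can be solved with any constrained least-squares routine. The sketch below uses SciPy's SLSQP solver; the function name and the toy numbers are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def estimate_mixture(P0, P):
    """Solve min_pi ||P0 - P pi||^2 subject to pi >= 0 and sum(pi) <= 1.
    P0 is the length-K vector of test-set probabilities P_test(S_l); P is the K x K matrix
    with P[l, k] = P_k(S_l), both estimated empirically on a held-out fold."""
    K = len(P0)
    obj = lambda pi: np.sum((P0 - P @ pi) ** 2)
    cons = [{"type": "ineq", "fun": lambda pi: 1.0 - pi.sum()}]   # sum(pi) <= 1
    res = minimize(obj, x0=np.full(K, 1.0 / (K + 1)), bounds=[(0.0, 1.0)] * K,
                   constraints=cons, method="SLSQP")
    return res.x                                                  # estimated mixture proportions

# toy check with two classes, true proportions (0.5, 0.3) and an implied outlier fraction of 0.2
P = np.array([[0.95, 0.02], [0.03, 0.95]])        # P_k(S_l), illustrative numbers
pi_true = np.array([0.5, 0.3])
P0 = P @ pi_true                                   # no outlier mass falls in S_l (cf. Assumption 5)
print(estimate_mixture(P0, P).round(3))            # approximately [0.5, 0.3]
```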
Algorithm 2 MixEstimate

function MixEstimate(D^tr, D^te, ζ, L)
Input: Left-out proportion ζ ∈ (0, 1), a binary classifier L, labeled training data D^tr, unlabeled test data D^te.
Output: Estimated mixture proportions {π̂_k, k = 1, ..., K}.
1. Randomly split the training and test data into {D^tr_1, D^tr_2} and {D^te_1, D^te_2}. For t = 1, 2, let D^tr_{k,t} contain the samples from class k in D^tr_t, and apply L to {D^tr_{k,t}, D^te_t} to separate the samples from class k and the test set; this gives η̂^t_l(x) as the estimate of η_l(x) = log(f_l(x)/f_test(x)).
2. For fold t = 1, 2: let t' = {1, 2} \ t, and
   • let F̂^{t'}_l and ĝ^{t'}_{l,l}(·) be the empirical distribution of class l and the Gaussian kernel density estimate of the density of η̂^t_l(x), both computed using D^tr_{l,t'};
   • construct the empirical problem Ĵ(η̂^t) with the empirical probabilities of each class in fold t' falling into the regions Ŝ_l = {z : ĝ^{t'}_{l,l}(z) ≥ Q(ζ; ĝ^{t'}_{l,l} ◦ η̂^t_l, F̂^{t'}_l)}. Let π̂_{k,t} for k = 1, ..., K be the solutions to Ĵ(η̂^t).
3. Output the average mixture proportion estimate π̂_k = (π̂_{k,1} + π̂_{k,2})/2, ∀ k = 1, ..., K.

Remark 2.
• Our proposal can be viewed as an extension of the BBSE method in Lipton et al. (2018) to the presence of outliers.
• In practice, we can add a constraint on the optimization variables π_k and require that Σ_{k=1}^K π_k ≤ 1 and π_k ≥ 0. This constraint guarantees that both ˜π_k and the outlier proportion ε are non-negative.

We show in section 4 that the estimates from MixEstimate converge to ˜π_k under proper assumptions.

As a continuation of the example shown in section 2.6, Figure 4 shows curves of the estimated FLR, the estimated outlier abstention rate γ̂, as well as the FLP (false labeling proportion), which is the sample version of the FLR on the current test data, and the sample version of γ, against different α.

Figure 4: An illustrative example, continuation of section 2.6. The red solid/dashed curves show the FLP and the estimated FLR against different α. The blue solid/dashed curves show the actual outlier abstention rate γ and the estimated γ against different α.

4 Asymptotic properties

Let n_k be the sample size of class k in the training data and n be the size of the training data. Let N be the size of the test data. In this section, we consider the asymptotic regime where n → ∞, and assume that lim_{n→∞} N/n ≥ c for a constant c > 0 and n_k/n → c_k for a constant c_k ∈ (0, 1), k = 1, ..., K. In this asymptotic regime, we show that

1. The prediction set Ĉ(x) constructed using BCOPS achieves the same loss as the oracle prediction set C(x) asymptotically if the estimate of v_k(x) = f_k(x)/(f_k(x) + f_test(x)) is close to it, and under some conditions on the densities of x and the distribution of v_k(x) for k = 1, ..., K.
2. The mixture proportion estimates converge to the true out-of-sample mixture proportions ˜π_k if the outliers are rare where the observed classes have high densities, and under some conditions on the densities of η_l(x), the functions used to construct S_l.

4.1 Asymptotic optimality of BCOPS

Let v̂_k(x) be the estimate of v_k(x), representing either v̂_{k,1}(x) or v̂_{k,2}(x) in the BCOPS algorithm.
Assumption 1. The densities f_1(x), ..., f_K(x), e(x) are upper bounded by a constant. There exist constants 0 < c_1 ≤ c_2 and δ_0, γ > 0, such that for k = 1, ..., K, we have

c_1 |δ|^γ ≤ |P_k({x | v_k(x) ≤ Q(α; v_k, F_k) + δ}) − α| ≤ c_2 |δ|^γ,   ∀ −δ_0 ≤ δ ≤ δ_0

Remark 3. We require that the underlying function v_k(x) is neither too steep nor too flat around the boundary of the optimal decision region A_k. This makes sure that this boundary is not too sensitive to small errors in estimating v_k(x) and that the final loss is not too sensitive to small changes in the decision region.
Assumption 2. The estimated function v̂_k(x) converges to the true model v_k(x): there exist constants B_1, β_1, β_2 > 0 and a set A_n of x depending on n, such that, as n → ∞, we have

P( sup_{x ∈ A_n} |v̂_k(x) − v_k(x)| ≤ B_1 (log n / n)^{β_1} ) → 1,   P( P_out(x ∉ A_n) ≤ B_1 (log n / n)^{β_2} ) → 1

where P_out(·) denotes probability under the out-of-sample density f_test(x).
Remark 4. For such an assumption to hold in a high-dimensional setting, the classifier L in BCOPS usually needs to be parametric and we will also need some parametric model assumptions depending on L. For example, when we let L be logistic regression with the lasso penalty in BCOPS, we could require the approximate correctness of the logistic model, nice behavior of the features and sparsity in the signals (Van de Geer et al. 2008).
Theorem 1. Under Assumptions 1-2, for any fixed level α > 0, let C(x) be the oracle BCOPS prediction set; then, for a large enough constant B, we have

P( ∫ (|Ĉ(x)| − |C(x)|) f_test(x) dx ≥ B (log n / n)^{min(γβ_1, β_2, 1/2)} ) → 0

4.2 Consistency of MixEstimate

In this section, let η̂ represent η̂^t for t = 1 or 2, let ζ be a user-specified positive constant, let η : R^p → R^K be a fixed function, and let g_{l,k}(t) be the density of η_l in class k, S_l = {t : g_{l,l}(t) ≥ Q(ζ; g_{l,l} ◦ η_l, F_l)}, with P_0, P the oracle response vector and design matrix based on η, and Σ = P^T P. We also let Ĵ(η) be the problem with empirical P_0 and P from fold t = 1 or 2, and h_n be the bandwidth of the Gaussian kernel density estimation in Algorithm 2. The bandwidth h_n satisfies h_n → 0 and (log n)/(n h_n) → 0.
Assumption 3. For k = 1, ..., K, R and l = 1, ..., K, the density g_{l,k}(t) is bounded, and g_{l,l}(t) is Hölder continuous (i.e., there exist constants 0 < γ_0 ≤ 1 and B_0 such that |g_{l,l}(z) − g_{l,l}(z')| ≤ B_0 |z − z'|^{γ_0} for all z, z' ∈ R).
Assumption 4. There exist constants γ, c_1, c_2 > 0 and δ_0 > 0, such that for all l = 1, ..., K:

c_1 |δ|^γ ≤ |P_l(g_{l,l}(t) ≤ Q(ζ; g_{l,l} ◦ η_l, F_l) + δ) − ζ| ≤ c_2 |δ|^γ,   ∀ −δ_0 ≤ δ ≤ δ_0
Remark 5. Assumption 4 is similar to Assumption 1; it asks that g_{l,l}(t) is neither too steep nor too flat around the boundary of S_l.

Assumption 5. P_R(η_l(x) ∈ S_l) = 0, ∀ l = 1, ..., K, and Σ is invertible with smallest eigenvalue σ_min ≥ c for a constant c > 0.
Theorem 2. Under Assumptions 3-5, let {π̂_k, k = 1, ..., K} be the solutions to Ĵ(η). Then π̂_k →p ˜π_k as n → ∞.

By the independence of the two folds, once we have learned η̂ in Algorithm 2, we can condition on it and treat it as fixed. Hence, we have Corollary 1 as a direct application of Theorem 2.
Corollary 1. Let A be the event that Assumptions 3-5 are satisfied for η = η̂, and suppose P(A) → 1 as n → ∞. Then, letting {π̂_k, k = 1, ..., K} be the solutions to Ĵ(η̂), we have π̂_k →p ˜π_k as n → ∞.

The proof of Theorem 2 is given in Appendix A.
5 Real data examples

5.1 MNIST handwritten digits
We look at the MNIST handwritten digit data set (LeCun & Cortes 2010). We let the training data contain digits 0-5, while the test data contain digits 0-9. We subsample 10000 training data and 10000 test data and compare the type I and type II errors of different methods. The type I error is defined as (1 − coverage), and no type I error is defined for new digits unobserved in the training data. Rather than considering Err defined in eq. (7), we define the type II error as Err = E_F 1{|C(x) \ {y}| > 0} for samples with distribution F (y is the true label of x), so that Err is in [0, 1]. We compare:

1. BCOPS: with the learning algorithm L being random forest (rf) or logistic+lasso regression (glm).
2. DLS: the density-level set (with µ(x) = 1 in problem P) with the sample-splitting conformal construction.
3. IRS: the in-sample ratio set (with µ(x) = f(x) in problem P) with the sample-splitting conformal construction and a supervised K-class classifier L to learn f_k(x)/f(x). Here, we also let L be either random forest (rf) or multinomial+lasso regression (glm).

Figure 5 plots the actual type I error against the nominal type I error for the digits showing up in the training data (average); we can see that all methods control their claimed type I error.

Figure 5: Type I error control: actual type I error vs nominal type I error α for the different methods (DLS, BCOPS(rf), BCOPS(glm), IRS(rf), IRS(glm)). All methods can control the targeted type I error.

Figure 6 shows plots of the type II error against the type I error, separately for digits in and not in the training set, as α ranges from 0.01 to 0.99. We observe that

• For the digits unobserved in the training data, we see that

DLS < IRS(glm) < IRS(rf) < BCOPS(glm) < BCOPS(rf)   (ordered from worse to better)
Figure 6: Comparisons of the type II ∼ type I error curves using different conformal prediction sets. Results for observed digits (digits ≤ 5) and unobserved digits (digits ≥ 6) are presented separately. BCOPS performs the best for the unobserved digits, and IRS is slightly better than BCOPS for the observed digits using the same learning algorithm. Both BCOPS and IRS are much better than DLS in this example. [Two panels: digits ≤ 5 and digits ≥ 6; curves for DLS, BCOPS(rf), BCOPS(glm), IRS(rf), IRS(glm).]
BCOPS achieves the best performance, borrowing information from the unlabeled data. In this example, IRS also has better results than DLS for the unobserved digits. IRS depends only on the predictions from the given classifier(s) and does not prevent us from making a prediction at a location with sparse observations if the classifiers themselves do not take this into consideration. Although we can easily come up with situations where such methods fail entirely in terms of outlier detection, e.g., the simulated example in section 2.6, in this example the dimension learned by IRS for in-sample classification is also informative for outlier detection.

• For the digits observed in the training data, we see that

DLS < BCOPS(glm) < IRS(glm) < BCOPS(rf) < IRS(rf)

DLS performs much worse than both BCOPS and IRS, and BCOPS performs slightly worse than IRS for a given learning algorithm L.

Overall, in this example, BCOPS trades off some in-sample classification accuracy for higher power in outlier detection.

In practice, we won't have access to the curves in Figure 6. While we can estimate the behavior of the observed digits using the training data, we won't have such luck for the outliers. We can use the methods proposed in section 3 to estimate the FLR and the outlier abstention rate γ. Figure 7 compares the estimated FLR and γ with the actual sample versions of the FLR and γ. We can see that the estimated FLR and γ match reasonably well with the actual FLP and γ (sample version) for both learning algorithms L.

Figure 7: FLR and outlier abstention rate γ estimation. Red curves are the actual FLP and the estimated FLR, and the blue curves are the actual abstention rate γ realized on the current data set and the estimated γ. [Two panels, one per learning algorithm.]
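The error metrics used in this section are easy to compute once the prediction sets are available. The helper below is our sketch (the −1 coding for unobserved labels is an arbitrary convention): the type I error is one minus the per-class coverage, and the type II error is the probability that C(x) contains a wrong label, which for outliers reduces to the probability of making any prediction at all. Sweeping α and recording these numbers traces out curves like those in Figures 5 and 6.

```python
import numpy as np

def error_rates(pred_sets, y_true, observed_classes):
    """Type I error per observed class k: P_k(k not in C(x)) = 1 - coverage.
    Type II error: P(|C(x) \ {y}| > 0), reported separately for observed labels and for
    unobserved (outlier) labels, the latter coded here as -1."""
    y_true = np.asarray(y_true)
    type_I = {k: float(np.mean([k not in C for C, y in zip(pred_sets, y_true) if y == k]))
              for k in observed_classes}
    err_obs = [len(C - {y}) > 0 for C, y in zip(pred_sets, y_true) if y in observed_classes]
    err_out = [len(C) > 0 for C, y in zip(pred_sets, y_true) if y not in observed_classes]
    type_II_obs = float(np.mean(err_obs))
    type_II_out = float(np.mean(err_out)) if err_out else None   # = 1 - abstention rate for outliers
    return type_I, type_II_obs, type_II_out
```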
Figure 7: FLR and outlier abstention rate γ estimation. Red curves are the actual FLR and the estimated FLR, and the blue curves are the actual abstention rate γ realized on the current data set and the estimated γ.

We let L be the random forest and compare BCOPS with the RF prediction. Figure 8 shows the estimated abstention rate and the estimated in-sample accuracy, defined as one minus the estimated type II errors for the observed classes. In this example, BCOPS takes the value of α achieving an estimated outlier abstention rate of 95%.

Figure 8: Estimated outlier abstention rate γ and in-sample accuracy against α. The vertical red line shows the suggested value for α when the estimated outlier abstention rate reaches 95%. In this example, the in-sample accuracy remains almost 1 for extremely small α and is not very instructive for picking α.

Table 2 shows prediction accuracy using BCOPS and RF. We pool smurf and neptune together and call them the observed intrusions. We say that a prediction from RF is correct if it assigns the normal label to normal data or an intrusion label to intrusions. From Table 2, we can see that the original RF classifier assigns the correct label to 99.999% of samples from the observed classes, but claims more than 50% of the other intrusions to be normal, a significant deterioration on the unobserved intrusion types.
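To make the evaluation protocol concrete, the Python sketch below illustrates the two steps described above: choosing α from an estimated outlier abstention-rate curve, and cross-tabulating BCOPS prediction-set outcomes against RF correctness in the style of Table 2. All inputs here (the curve gamma_est, the prediction sets, the labels and the RF correctness flags) are hypothetical placeholders, not outputs of the actual BCOPS pipeline.

import numpy as np
from collections import Counter

# Pick the smallest alpha on the grid whose estimated outlier abstention rate
# reaches the 95% target (placeholder curve for illustration only).
alphas = np.linspace(0.001, 0.2, 200)
gamma_est = np.clip(8.0 * alphas, 0.0, 1.0)
alpha_star = alphas[np.argmax(gamma_est >= 0.95)]
print(f"suggested alpha: {alpha_star:.3f}")

def table2_cell(pred_set):
    # Map a BCOPS prediction set to one of the four column categories of Table 2.
    if not pred_set:
        return "abstention"
    has_normal = "normal" in pred_set
    has_intrusion = any(lab != "normal" for lab in pred_set)
    if has_normal and has_intrusion:
        return "normal+intrusion"
    return "normal" if has_normal else "intrusion"

# Hypothetical predictions for a few test connections. RF is counted correct if it
# labels normal data as normal and intrusions as (any) intrusion.
bcops_sets = [["normal"], [], ["smurf"], ["normal", "neptune"]]
true_class = ["normal", "other", "smurf", "neptune"]
rf_correct = [True, False, True, True]

counts = Counter((c, table2_cell(C)) for c, C in zip(true_class, bcops_sets))
for (cls, col), n in sorted(counts.items()):
    accs = [ok for ok, c, C in zip(rf_correct, true_class, bcops_sets)
            if c == cls and table2_cell(C) == col]
    print(f"{cls:8s} {col:18s} n={n}  RF accuracy={np.mean(accs):.2f}")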
Table 2: Network intrusion results. The column names are the prediction sets from BCOPS, and the row names are the true class labels. In each cell, the upper half is the number of samples falling into that category and the lower half is the prediction accuracy from RF for the samples in that category. For example, the cell in the column "normal+intrusion" and row "normal" gives the number of normal connections whose BCOPS prediction set contains both the normal label and at least one intrusion label (upper half), and the prediction accuracy based on RF for these samples (lower half).
[Table 2: columns are the BCOPS prediction sets normal, intrusion, normal+intrusion and abstention; rows are the true class labels.]
In this paper, we propose a conformal inference method with a balanced objective for outlier detection and class label prediction. Conformal inference provides finite-sample, distribution-free validity, and with a balanced objective, the proposed method achieves good performance at both outlier detection and class label prediction. Moreover, we propose a method for evaluating the outlier detection rate. Simple as it is, it achieves good performance in both the simulations and the real data examples in this paper. Here, we also discuss some potential future work:

1. One extension is to consider the case where just a single new observation is available. In this case, although we have little information about f_test(x), we can still try to design objective functions that lead to good predictions for samples close to the training data while accepting our ignorance at locations where training samples are rare. For example, we can use a truncated version of f(x) and let µ(x) = f(x)·1{f(x) ≥ c} + c·1{f(x) < c}.

Proofs of Theorems 1-2

Before we prove Theorems 1-2, we first state some lemmas that will be useful in the proofs. Let F̂_k denote the empirical distribution of F_k based on the given samples, as determined by context. In this paper, empirical estimates are computed from samples of size cn for a constant c > 0.

Lemma 1. Let G and Ĝ be the CDF and empirical CDF of a univariate variable in R. Let V_n = sup_t |G(t) − Ĝ(t)|. Then, for a large enough constant B, we have P(V_n ≤ B√(log n / n)) → 1.

Proof. Lemma 1 is a result of classical empirical process theory (Wellner et al. 2013); see also Lemma C.1 in Lei et al. (2013).

Lemma 2. Let g(t) be the density of a univariate variable t ∈ R. Let ĝ(t) be its Gaussian kernel density estimate with bandwidth h_n > log n / n. Suppose g(t) is bounded and Hölder continuous with exponent α ∈ (0, 1]. Then there exists a large enough constant B such that, with probability at least 1 − 1/n,

||g(t) − ĝ(t)||_∞ < B( h_n^α + √( log n / (n h_n) ) ).

Proof. The Gaussian kernel K(z) for z ∈ R satisfies Assumptions 2-3 in Jiang (2017) (the spherically symmetric and non-increasing assumption, and the exponential decay assumption), and g(t) is bounded (Assumption 1 in Jiang (2017)). Then, Lemma 2 is a special case of Theorem 2 in Jiang (2017).

A.1 Proof of Theorem 1

The proof follows the same procedure as in Lei et al. (2013). Let A_k be the accepted region for class k under the oracle BCOPS. Let R_{n,1} = P_out(x ∉ A_n), R_{n,2} = sup_{x ∈ A_n} |v̂_k(x) − v_k(x)| and R_{n,3} = |Q(α; v_k, F_k) − Q(α; v_k, F̂_k)|. Then we have

(Â_k \ A_k) ∩ A_n = { x : x ∈ A_n, v_k(x) < Q(α; v_k, F_k), v̂_k(x) ≥ Q(α; v̂_k, F̂_k) }
                  ⊆ { x : Q(α; v_k, F_k) − 2R_{n,2} − R_{n,3} ≤ v_k(x) < Q(α; v_k, F_k) }        (8)

and

∫_{Â_k \ A_k} f_test(x) dx ≤ ∫_{(Â_k \ A_k) ∩ A_n} f_test(x) dx + R_{n,1}
  ≤ ∫_{(Â_k \ A_k) ∩ A_n} ( v_k(x) / [Q(α; v_k, F_k) − 2R_{n,2} − R_{n,3}]_+ ) f_test(x) dx + R_{n,1}
  ≤ P_{F_k}( (Â_k \ A_k) ∩ A_n ) / [Q(α; v_k, F_k) − 2R_{n,2} − R_{n,3}]_+ + R_{n,1},

where the last inequality uses v_k(x) f_test(x) ≤ f_k(x). By Assumption 2, for a large enough constant B, we have

P( R_{n,1} ≤ B (log n / n)^{β_1} ) → 1,   P( R_{n,2} ≤ B (log n / n)^{β_2} ) → 1.        (9)

Let G and Ĝ be the CDF and empirical CDF of v_k(x).
By Lemma 1, on the one hand, with probability approaching 1, for any constant δ and a constant B large enough, we have

|Ĝ( Q(α − δ; v_k, F_k) ) − (α − δ)| ≤ B√(log n / n).

On the other hand, by Assumption 1, we have δ ≥ c |Q(α; v_k, F_k) − Q(α ± δ; v_k, F_k)|^γ. In other words, with probability approaching 1, the following is true:

Q( α − B√(log n / n); v_k, F_k ) ≤ Q( α; v_k, F̂_k ) ≤ Q( α + B√(log n / n); v_k, F_k )

and

|Q( α ± B√(log n / n); v_k, F_k ) − Q( α; v_k, F_k )| ≤ ( (B/c)√(log n / n) )^{1/γ}.        (10)

Hence, R_{n,3} ≤ ( (B/c)√(log n / n) )^{1/γ}.

For the numerator P_{F_k}((Â_k \ A_k) ∩ A_n), we use (9)-(10) and apply Assumption 1 again. Writing v_{k,α} = Q(α; v_k, F_k), for a large enough constant B,

P_{F_k}( (Â_k \ A_k) ∩ A_n ) ≤ P_{F_k}( v_{k,α} − 2R_{n,2} − R_{n,3} ≤ v_k(x) ≤ v_{k,α} )
                             ≤ ( c(2R_{n,2} + R_{n,3}) )^γ
                             ≤ B ( √(log n / n) + (log n / n)^{β_2} )^γ.        (11)

For the denominator [Q(α; v_k, F_k) − 2R_{n,2} − R_{n,3}]_+, when α is a positive constant, since v_k(x) = f_k(x) / (f_test(x) + f_k(x)) is non-zero as long as f_k(x) is non-zero, Q(α; v_k, F_k) is also a positive constant, and we can always take n large enough such that

Q(α; v_k, F_k) − 2R_{n,2} − R_{n,3} ≥ Q(α; v_k, F_k) / 2 > 0.

Plugging (9) and (11) and the bound on the denominator into the earlier display, for B large enough, with probability approaching 1,

∫_{Â_k \ A_k} f_test(x) dx ≤ B ( log n / n )^{min( γβ_2, β_1, γ/2 )}.        (12)

Combining eq. (11)-(12), with probability approaching 1, we have

∫ ( |Ĉ(x)| − |C(x)| ) f_test(x) dx ≤ Σ_k ∫_{Â_k \ A_k} f_test(x) dx ≤ BK ( log n / n )^{min( γβ_2, β_1, γ/2 )}

for a large enough constant B.

A.2 Proof of Theorem 2

Since Σ is invertible with smallest eigenvalue σ_min ≥ c_0 > 0 and (P^T P)^{-1} P^T 𝐏 = π̃ by Assumption 5, to show that π̂ = (P̂^T P̂)^{-1} P̂^T 𝐏̂ →_p π̃, where P̂ and 𝐏̂ are the empirical versions of P and 𝐏, it is sufficient to show

P̂_{l,k} →_p P_{l,k},   ∀ k = 1, ..., K, R,  l = 1, ..., K.

In Section 4.2, we have only defined P_{l,k} for k = 1, ..., K; here we include the class R as well, following the same definition: P_{l,R} = P_R( η_l(x) ∈ S_l ). Recall the definition of P̂_{l,k}:

P̂_{l,k} := P_{F̂_k}( η_l(x) ∈ Ŝ_l ) = ( P̂_{l,k} − P̃_{l,k} ) + P̃_{l,k},

where
• P̃_{l,k} := P_{F_k}( η_l(x) ∈ Ŝ_l );
• Ŝ_l := { t : ĝ_{l,l}(t) ≥ ĝ_{l,ζ} }, where ĝ_{l,l}(t) is the kernel estimate of g_{l,l}(t) and ĝ_{l,ζ} := Q( ζ; ĝ_{l,l} ∘ η_l, F̂_l ).

We first show that

P̃_{l,k} →_p P_{l,k},   ∀ k = 1, ..., K, R,  l = 1, ..., K.

Let Δ = P_{l,k} − P̃_{l,k} = Δ_1 − Δ_2, where

Δ_1 = ∫_t g_{l,k}(t) 1{ g_{l,l}(t) ≥ g_{l,ζ}, ĝ_{l,l}(t) < ĝ_{l,ζ} } dt,   Δ_2 = ∫_t g_{l,k}(t) 1{ ĝ_{l,l}(t) ≥ ĝ_{l,ζ}, g_{l,l}(t) < g_{l,ζ} } dt.

We now show Δ_2 →_p 0; the argument for Δ_1 is the same. Let R_{n,1} = ||ĝ_{l,l}(t) − g_{l,l}(t)||_∞ and R_{n,2} = |ĝ_{l,ζ} − g_{l,ζ}|. On the event { ĝ_{l,l}(t) ≥ ĝ_{l,ζ}, g_{l,l}(t) < g_{l,ζ} } we have g_{l,l}(t) ≥ ĝ_{l,l}(t) − R_{n,1} ≥ ĝ_{l,ζ} − R_{n,1} ≥ g_{l,ζ} − R_{n,1} − R_{n,2}. Hence, for a large enough constant B,

Δ_2 ≤ ∫_t g_{l,k}(t) 1{ g_{l,ζ} − R_{n,1} − R_{n,2} ≤ g_{l,l}(t) < g_{l,ζ} } dt
    ≤ ( max_t g_{l,k}(t) / [ g_{l,ζ} − R_{n,1} − R_{n,2} ]_+ ) P_{F_l}( g_{l,ζ} − R_{n,1} − R_{n,2} ≤ g_{l,l}(η_l(x)) < g_{l,ζ} )
    ≤ B ( max_t g_{l,k}(t) / [ g_{l,ζ} − R_{n,1} − R_{n,2} ]_+ ) ( R_{n,1} + R_{n,2} )^γ.        (13)

The last step is a result of Assumption 4.
Observe the following:

• Under Assumption 3, let the constant α > 0 be the Hölder exponent for g_{l,l}(t). Applying Lemma 2, for a large enough constant B,

P( R_{n,1} ≥ B( √( log n / (n h_n) ) + h_n^α ) ) → 0   ⇒   R_{n,1} →_p 0.

• Notice that, since ||ĝ_{l,l} − g_{l,l}||_∞ ≤ R_{n,1}, for all δ ∈ (−δ_0, δ_0) we have |Q(ζ − δ; g_{l,l}, F̂_l) − Q(ζ − δ; ĝ_{l,l}, F̂_l)| ≤ R_{n,1}. In particular,

R_{n,2} = | ĝ_{l,ζ} − Q(ζ; g_{l,l}, F̂_l) + Q(ζ; g_{l,l}, F̂_l) − g_{l,ζ} | ≤ R_{n,1} + | Q(ζ; g_{l,l}, F̂_l) − g_{l,ζ} |.

Let G and Ĝ be the CDF and empirical CDF of g_{l,l}(η_l(x)) in class l. Applying Lemma 1, there exists a constant B such that, for all δ ∈ (−δ_0, δ_0),

| G( Q(ζ − δ; g_{l,l}, F̂_l) ) − (ζ − δ) | ≤ ||G − Ĝ||_∞ ≤ B√( log n / n ).

Under Assumption 4, we have δ ≥ c | Q(ζ − δ; g_{l,l}, F_l) − g_{l,ζ} |^γ. In other words, with probability approaching 1, for a large enough constant B, the following is true:

Q( ζ − B√(log n / n); g_{l,l}, F_l ) ≤ Q( ζ; g_{l,l}, F̂_l ) ≤ Q( ζ + B√(log n / n); g_{l,l}, F_l )
  ⇒   | g_{l,ζ} − Q(ζ; g_{l,l}, F̂_l) | ≤ B ( log n / n )^{1/(2γ)}.

Hence, R_{n,2} ≤ R_{n,1} + B ( log n / n )^{1/(2γ)} for a large enough constant B.

Hence, as n → ∞, g_{l,ζ} − R_{n,1} − R_{n,2} → g_{l,ζ}, which is a positive constant. Combining the above analysis with eq. (13) and the boundedness of the density g_{l,k}, for a large enough constant B we have

Δ_2 ≤ B ( R_{n,1} + ( log n / n )^{1/(2γ)} )^γ →_p 0,

and the same argument gives Δ_1 →_p 0. Hence Δ →_p 0 and P̃_{l,k} →_p P_{l,k}.

Next, we show that | P̂_{l,k} − P̃_{l,k} | →_p 0. Let G and Ĝ now be the CDF and empirical CDF of g_{l,l}(η_l(x)) in class k. Since ||ĝ_{l,l} − g_{l,l}||_∞ ≤ R_{n,1},

| P̂_{l,k} − P̃_{l,k} | = | P_{F̂_k}( ĝ_{l,l}(η_l(x)) ≤ ĝ_{l,ζ} ) − P_{F_k}( ĝ_{l,l}(η_l(x)) ≤ ĝ_{l,ζ} ) |
  ≤ max( | Ĝ(ĝ_{l,ζ} + R_{n,1}) − G(ĝ_{l,ζ} − R_{n,1}) |, | Ĝ(ĝ_{l,ζ} − R_{n,1}) − G(ĝ_{l,ζ} + R_{n,1}) | )
  ≤ || Ĝ − G ||_∞ + | G(ĝ_{l,ζ} + R_{n,1}) − G(ĝ_{l,ζ} − R_{n,1}) |.

By Lemma 1, we have || Ĝ − G ||_∞ < B√( log n / n ) for a large enough constant B with probability approaching 1. Following the same argument as eq. (13), for a large enough constant B,

| G(ĝ_{l,ζ} + R_{n,1}) − G(ĝ_{l,ζ} − R_{n,1}) | ≤ B ( max_t g_{l,k}(t) / [ g_{l,ζ} − R_{n,1} − R_{n,2} ]_+ ) ( 2R_{n,1} )^γ →_p 0.

Hence | P̂_{l,k} − P̃_{l,k} | →_p 0. Combining the two steps, P̂_{l,k} →_p P_{l,k} for every l and k, and hence π̂ →_p π̃, which completes the proof.

References

Barber, R. F., Candes, E. J., Ramdas, A. & Tibshirani, R. J. (2019), 'Conformal prediction under covariate shift', arXiv preprint arXiv:1904.06019.

Bartlett, P. L. & Wegkamp, M. H. (2008), 'Classification with a reject option using a hinge loss', Journal of Machine Learning Research (Aug), 1823-1840.

Cadre, B. (2006), 'Kernel estimation of density level sets', Journal of Multivariate Analysis (4), 999-1023.

Cadre, B., Pelletier, B. & Pudlo, P. (2009), 'Clustering by estimation of density level sets at a fixed probability'.

Chandola, V., Banerjee, A. & Kumar, V. (2009), 'Anomaly detection: A survey', ACM Computing Surveys (CSUR) (3), 15.

Chatterjee, S. K. & Patra, N. K. (1980), 'Asymptotically minimal multivariate tolerance sets', Calcutta Statistical Association Bulletin (1-2), 73-94.

Hartigan, J. A. (1975), 'Clustering algorithms'.

Hechtlinger, Y., Póczos, B. & Wasserman, L. (2018), 'Cautious deep learning', arXiv preprint arXiv:1805.09460.

Herbei, R. & Wegkamp, M. H. (2006), 'Classification with reject option', Canadian Journal of Statistics (4), 709-721.

Hodge, V. & Austin, J. (2004), 'A survey of outlier detection methodologies', Artificial Intelligence Review (2), 85-126.

Jiang, H. (2017), Uniform convergence rates for kernel density estimation, in 'Proceedings of the 34th International Conference on Machine Learning - Volume 70', JMLR.org, pp. 1694-1703.

LeCun, Y. & Cortes, C. (2010), 'MNIST handwritten digit database'. URL: http://yann.lecun.com/exdb/mnist/

Lei, J. (2014), 'Classification with confidence', Biometrika (4), 755-769.

Lei, J., Robins, J. & Wasserman, L. (2013), 'Distribution-free prediction sets', Journal of the American Statistical Association (501), 278-287.

Lei, J. & Wasserman, L. (2014), 'Distribution-free prediction bands for non-parametric regression', Journal of the Royal Statistical Society: Series B (Statistical Methodology) (1), 71-96.

Li, J., Liu, R. Y. et al. (2008), 'Multivariate spacings based on data depth: I. Construction of nonparametric multivariate tolerance regions', The Annals of Statistics (3), 1299-1323.

Lipton, Z. C., Wang, Y.-X. & Smola, A. (2018), 'Detecting and correcting for label shift with black box predictors', arXiv preprint arXiv:1802.03916.

Rigollet, P., Vert, R. et al. (2009), 'Optimal rates for plug-in estimators of density level sets', Bernoulli (4), 1154-1178.

Stolfo, S. J., Fan, W., Lee, W., Prodromidis, A. & Chan, P. K. (2000), Cost-based modeling for fraud and intrusion detection: Results from the JAM project, in 'Proceedings DARPA Information Survivability Conference and Exposition. DISCEX'00', Vol. 2, IEEE, pp. 130-144.

Van de Geer, S. A. et al. (2008), 'High-dimensional generalized linear models and the lasso', The Annals of Statistics (2), 614-645.

Vovk, V., Gammerman, A. & Shafer, G. (2005), Algorithmic Learning in a Random World, Springer Science & Business Media.

Vovk, V., Nouretdinov, I., Gammerman, A. et al. (2009), 'On-line predictive linear regression', The Annals of Statistics (3), 1566-1590.

Wald, A. (1943), 'An extension of Wilks' method for setting tolerance limits', The Annals of Mathematical Statistics (1), 45-55.

Wellner, J. et al. (2013), Weak Convergence and Empirical Processes: With Applications to Statistics, Springer Science & Business Media.

Wilks, S. S. (1941), 'Determination of sample sizes for setting tolerance limits', The Annals of Mathematical Statistics (1), 91-96.

Zhang, K., Schölkopf, B., Muandet, K. & Wang, Z. (2013), Domain adaptation under target and conditional shift, in 'Proceedings of the 30th International Conference on Machine Learning'.