A Flexible Procedure for Mixture Proportion Estimation in Positive-Unlabeled Learning
Zhenfeng Lin
Department of Statistics, Texas A&M University
3143 TAMU, College Station, TX 77843-3143
[email protected]

James P. Long
Department of Statistics, Texas A&M University
3143 TAMU, College Station, TX 77843-3143
[email protected]
Abstract
Positive–unlabeled (PU) learning considers two samples, a positive set P with observations from only one class and an unlabeled set U with observations from two classes. The goal is to classify observations in U. Class mixture proportion estimation (MPE) in U is a key step in PU learning. Blanchard et al. [2010] showed that MPE in PU learning is a generalization of the problem of estimating the proportion of true null hypotheses in multiple testing problems. Motivated by this idea, we propose reducing the problem to one dimension via construction of a probabilistic classifier trained on the P and U data sets, followed by application of a one–dimensional mixture proportion method from the multiple testing literature to the observation class probabilities. The flexibility of this framework lies in the freedom to choose the classifier and the one–dimensional MPE method. We prove consistency of two mixture proportion estimators using bounds from empirical process theory, develop tuning parameter free implementations, and demonstrate that they have competitive performance on simulated waveform data and a protein signaling problem.

Keywords: mixture proportion estimation; PU learning; classification; empirical processes; local false discovery rate; multiple testing
Short title: Mixture Proportion Estimation for PU Learning

1 Introduction
Let

    X_1, ..., X_n ∼ F = αF_1 + (1−α)F_0,    (1)
    X_{L,1}, ..., X_{L,m} ∼ F_0,

all independent, where F_0 and F_1 are distributions on R^p with densities f_0 and f_1 with respect to a measure µ. The goal is to estimate α and the classifier

    C(x) = (1−α) f_0(x) / (αf_1(x) + (1−α) f_0(x)),    (2)

which can be used to separate the unlabeled data {X_i}_{i=1}^n into the classes 0 and 1. The above problem has been termed Learning from Positive and Unlabeled Examples, Presence Only Data, Partially Supervised Classification, and the
Noisy Label Problem in the machine learning literature [Elkan and Noto, 2008, Liu et al., 2002, Ramaswamy et al., 2016, Scott, 2015, Scott et al., 2013, Ward et al., 2009]. In this work, we use the term PU learning to refer to Model (1). Here we denote the positive set P := {X_{L,i}}_{i=1}^m and the unlabeled set U := {X_i}_{i=1}^n. This setting is more challenging than the traditional classification framework where one possesses labeled training data belonging to both classes. In particular, α and C are not generally identifiable from the data {X_i}_{i=1}^n and {X_{L,i}}_{i=1}^m. PU learning has been applied to text analysis [Liu et al., 2002], time series [Nguyen et al., 2011], bioinformatics [Yang et al., 2012], ecology [Ward et al., 2009], and social networks [Chang et al., 2016].

Several strategies have been proposed for solving the PU problem. Ward et al. [2009] assumes α is known and uses logistic regression to classify U. The SPY method of Liu et al. [2002] classifies U directly by identifying a "reliable negative set." The SPY method has practical challenges, including choosing the reliable negative set. Other strategies estimate α directly. Ramaswamy et al. [2016] estimate α via kernel embedding of distributions. Scott [2015] and Blanchard et al. [2010] estimate α using the ROC curve produced by a classifier trained on P and U.

Blanchard et al. [2010] showed that MPE in the PU model is a generalization of estimating the proportion of true nulls in multiple testing problems. Specifically, suppose that F_0 and F_1 are one–dimensional distributions and F_0 is known. Then the unlabeled set X_1, ..., X_n may be interpreted as test statistics with the hypotheses:

    H_0: X_i ∼ F_0,
    H_a: X_i ∼ F_1.

In this context, 1 − α is the proportion of true null hypotheses and the classifier C is the local FDR [Efron et al., 2001]. There are many works addressing identifiability and estimation of α and C in this simpler setting [Efron, 2012, Genovese and Wasserman, 2004, Meinshausen and Rice, 2006, Patra and Sen, 2015, Robin et al., 2007].

These α estimation methods from the FDR literature were developed for one–dimensional MPE problems and are not directly applicable to the multidimensional PU learning problem in which X_i ∈ R^p. In this work, we show that the PU MPE problem can be reduced to dimension one by constructing a classifier on the P versus U data sets, followed by transforming observations to class probabilities. One–dimensional MPE methods from the FDR literature can then be applied to the class probabilities. Computer implementation of this approach is straightforward because one can use existing classifier and one–dimensional MPE algorithms. We prove consistency for adaptations of two one–dimensional MPE methods: Storey [2002], based on empirical processes, and Patra and Sen [2015], based on isotonic regression. These proofs use results from empirical process theory. We show that the ROC method used in Blanchard et al. [2010] and Scott [2015] is a variant of the method proposed by Storey [2002].

The rest of the paper is organized as follows. In Section 2 we give a sketch of the proposed procedure, which includes the two proposed estimators C-patra/sen and C-roc. This section consists of three parts. First, we motivate the procedure from the hypothesis testing literature. Second, we address identifiability of α. Third, we provide a workflow explaining how to implement the proposed procedure. In Section 3 we show that Model (1) can be reduced to one dimension with a classifier. In Section 4 we show consistency of two α estimators.
In Section 5 we numerically show that the estimators perform well in various settings. We conclude in Section 6. Appendix A.1 gives proofs of the theorems in the paper. Supporting lemmas can be found in Appendix A.2.

2 Background and Proposed Procedure

2.1 Multiple Testing Background
Suppose one conducts n tests of null hypothesis H_0: X_i ∼ F_0 versus alternative hypothesis H_a: X_i ∼ F_1, for i = 1, ..., n. The X_i are typically test statistics or p–values, and the null distribution F_0 is assumed known (usually Unif[0, 1] in the case of X_i being p–values). The distribution of the X_i is F = αF_1 + (1−α)F_0, where 1 − α is the proportion of true null hypotheses. The false discovery rate (FDR) is the expected proportion of false rejections. If R is the number of rejections and V is the number of false rejections, then FDR ≡ E[(V/R) 1{R > 0}]. Benjamini and Hochberg [1995] developed a linear step–up procedure which bounds the FDR at a user specified level β. In fact, this procedure is conservative and results in an FDR of β(1−α) ≤ β. This conservative nature causes the procedure to have less power than other methods which control FDR at β. Adaptive FDR control procedures first estimate 1 − α and then use this estimate to select a β which ensures control at some specified level while maximizing power. Many estimators of α have been proposed [Benjamini and Hochberg, 2000, Benjamini et al., 2006, Blanchard and Roquain, 2009, Langaas et al., 2005, Patra and Sen, 2015, Storey, 2002].

There are two reasons why these procedures cannot be directly applied to the PU learning problem. First, many of the methods have no clear generalization to dimension greater than one because they require an ordering of the test statistics or p–values. Second, the distribution F_0 is assumed known, whereas in the PU learning problem we only have a sample from this distribution. The classifier dimension reduction procedure we outline in Section 2.3 addresses the first point by transforming the PU learning problem to one dimension. The theory we develop in Sections 3 and 4 addresses the second issue.
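As a concrete instance of the one-dimensional estimators cited above, Storey's [2002] estimator of the null proportion 1 − α from p-values takes one line of R. This is our illustrative sketch: the function name is ours and λ = 0.5 is just a common default.

```r
# Storey's estimator of pi0 = 1 - alpha from p-values with null ~ Unif[0, 1]:
# above lambda, mostly null p-values remain, and a null p-value lands there
# with probability 1 - lambda.
storey_pi0 <- function(pvals, lambda = 0.5) {
  mean(pvals > lambda) / (1 - lambda)
}
```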
2.2 Identifiability of α and C

Many works in both the PU learning and multiple testing literature have discussed the non–identifiability of the parameters α and F_1. For any given (α, F_1) pair with α < 1, one can find a γ > 0 such that α′ ≡ α + γ ≤ 1. Define

    F_1′ ≡ (αF_1 + γF_0) / (α + γ).

Then F = α′F_1′ + (1−α′)F_0, which implies (α′, F_1′) and (α, F_1) result in the same distributions for P and U. To address this issue, we follow the approach taken by Blanchard et al. [2010] and Patra and Sen [2015] and estimate a lower bound on α defined as

    α_0 := inf{ γ ∈ (0, 1] : (F − (1−γ)F_0)/γ is a c.d.f. }.    (3)

The parameter α_0 is identifiable. Recall the objective is to estimate

    C(x) = (1−α) f_0(x) / (αf_1(x) + (1−α) f_0(x)),

the probability an observation in U is from class 0. We can use α_0 to upper bound C in the following way. Note that the classifier

    C_1(x) = πf_0(x) / (πf_0(x) + (1−π) f(x))

outputs the probability an observation is from the labeled data set at a given x. We can approximate C_1 by training a model on the P versus U data sets. The classifiers C and C_1 are related through α. To see this, note that after some algebra

    f_0(x)/f(x) = C_1(x)/(1 − C_1(x)) · (1−π)/π.

Thus

    C(x) = (1−α) f_0(x)/f(x) = (1−π)/π · C_1(x)/(1 − C_1(x)) · (1−α).

Since α is not generally identifiable, neither is C. However, the plug-in estimator using C_n (a classifier trained on P versus U) and α̂ (an estimator of the lower bound α_0) serves as an upper bound for C. Specifically,

    Ĉ(x) = (1−π)/π · C_n(x)/(1 − C_n(x)) · (1 − α̂).

We can classify an unlabeled observation X_i as being from F_0 if Ĉ(X_i) > 1/2. The problem has now been reduced to estimation of α_0. The classifier C_n plays an important role in estimation of α_0 as well, as shown in the following section.
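Given a fitted classifier and an estimate α̂, the classification rule above is immediate to implement. A minimal R sketch follows; the function and label names are ours, not from the paper's code.

```r
# prob_u: classifier outputs C_n(X_i) for the unlabeled points; pi_prop = m/(m+n).
# Returns "F0" when the plug-in upper bound Chat exceeds 1/2.
classify_unlabeled <- function(prob_u, alpha_hat, pi_prop) {
  Chat <- (1 - pi_prop) / pi_prop * prob_u / (1 - prob_u) * (1 - alpha_hat)
  ifelse(Chat > 0.5, "F0", "F1")
}
```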
2.3 Workflow for α Estimation

The proposed procedure to estimate α_0 in Model (1) is summarized in Figure 1. The key idea is to reduce the dimension of the PU learning problem via the classifier C_n trained on P versus U and then apply a one-dimensional MPE method to the transformed data. The procedure consists of three steps:

• Step 1. Label the P samples with pseudo label Y = 1 and the U samples with pseudo label Y = 0. Hence we have P̃ := {(X_{L,i}, Y_i = 1), i = 1, ..., m} and Ũ := {(X_i, Y_i = 0), i = 1, ..., n}.

• Step 2. Train a probabilistic classifier C_n(·) = P̂(Y = 1 | X = ·) on P̃ versus Ũ. Compute probabilistic predictions p_1 := {p_{1,i}, i = 1, ..., m} and p_0 := {p_{0,i}, i = 1, ..., n}, where p_{1,i} := C_n(X_{L,i}) and p_{0,i} := C_n(X_i).

• Step 3. Apply a one-dimensional MPE method to p_1 and p_0 to estimate α_0.

Figure 1: Workflow of proposed procedure. In Step 1, "+" denotes the positive samples and "?" denotes the unlabeled samples whose labels are unknown (can be "+" or "−"). We stack the set P and the set U together as a large matrix and add a new column y to impose pseudo labels on the observations: "1" for X_{L,i} and "0" for X_i. In Step 2, a classifier C_n(·) is trained on the stacked matrix and the probability predictions (y = 1 as reference) are obtained. In Step 3, a one-dimensional procedure is applied to the probability output from Step 2; in this paper, the methods C-patra/sen (adapted from Patra and Sen [2015]) and C-roc (adapted from Storey [2002] and Scott [2015]) are introduced as examples. The upper density curve demonstrates that p_1 := {p_{1,i}}_{i=1}^m are from one population, while the bottom density curve shows that p_0 := {p_{0,i}}_{i=1}^n are from a mixture of two populations.

We augment the original data with pseudo labels in Step 1 in order to use a supervised classification algorithm. In Step 2 we use Random Forest [Breiman, 2001], although in principle any probabilistic classifier can be used. Note that the p_{1,i} and p_{0,i} are scalars, hence in Step 3 we can utilize any one-dimensional method to estimate α_0. In this work we adapt two methods: one from Storey [2002] and Scott [2015], another from Patra and Sen [2015]. Note that the original theory developed for these methods assumed that the null distribution is known, but in the PU problem we need to estimate it from p_1. Since this setting is more complex and more challenging, new theory is needed; in Section 4 we prove the consistency of the two estimators in the PU setting, using Theorems 1 and 2. A code sketch of Steps 1–3 appears below.
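The following R sketch implements Steps 1–2 with the randomForest package, using out-of-bag votes as the probabilistic predictions (mirroring the C-roc implementation described in Section 4). The function name pu_probabilities and the default ntree are our own choices.

```r
library(randomForest)

# Steps 1-2: stack P (positives) and U (unlabeled), train on pseudo labels,
# and return out-of-bag estimates of P(Y = 1 | x) for both samples.
pu_probabilities <- function(P, U, ntree = 500) {
  X <- rbind(P, U)
  y <- factor(c(rep(1, nrow(P)), rep(0, nrow(U))))  # Step 1: pseudo labels
  fit <- randomForest(x = X, y = y, ntree = ntree)  # Step 2: train C_n
  oob <- fit$votes[, "1"]                           # OOB vote fraction for y = 1
  list(p1 = oob[seq_len(nrow(P))],                  # C_n(X_{L,i}), i = 1..m
       p0 = oob[-seq_len(nrow(P))])                 # C_n(X_i), i = 1..n
}
# Step 3: feed p1 and p0 to a one-dimensional MPE method (Section 4).
```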
3 Dimension Reduction via a Classifier

Using the P and U samples we can make probabilistic predictions, i.e. compute the probability that an observation is from distribution F_0 versus from distribution F. The true classifier is

    C_1(x) = f_0(x)π / (f_0(x)π + f(x)(1−π)),    (4)

where π = m/(m+n) is the proportion of labeled samples within the entire data. We treat π as a known constant. Denote the distributions of the probabilistic predictions for U and P, respectively, as

    G(t) = P(C_1(X) ≤ t | X ∼ F),    (5)
    G_L(t) = P(C_1(X) ≤ t | X ∼ F_0).    (6)

One can consider the two-component mixture model

    G = α_G G_s + (1 − α_G) G_L,    (7)

for α_G and G_s, which are again potentially non-identifiable. Define

    α_G := inf{ γ ∈ (0, 1] : (G − (1−γ)G_L)/γ is a c.d.f. }.    (8)
Theorem 1. α_G = α_0.

See Section A.1.1 for a proof. Theorem 1 shows one can solve the p–dimensional MPE problem (3) by solving the 1–dimensional MPE problem (8). In what follows we use α_0 instead of α_G to simplify notation.

In practice, the classifier C_1(X) is approximated by a model C_n(X) trained on a given sample. For convenience, we assume the classifier C_n(X) is trained using another independent sample D′_n. The D′_n is omitted in the following to lighten notation. We require the approximated classifier to be a consistent estimator of the true classifier.

Assumptions A.
We assume

    E|C_n(X) − C_1(X)| = O(n^{−τ}),    (9)

for some τ > 0.

Such convergence results have been proven for a variety of probabilistic classifiers, including variants of Random Forest [Biau, 2012]. Define

    G_{L,n}(t) := (1/m) ∑_{i=1}^m 1{C_n(X_{L,i}) ≤ t},
    G_n(t) := (1/n) ∑_{i=1}^n 1{C_n(X_i) ≤ t}.

Intuitively, we can think of G_{L,n} and G_n as approximate empirical distribution functions of G_L and G respectively. The approximation is due to the fact that C_1 is estimated with C_n. Thus we would expect Glivenko–Cantelli and Donsker properties for G_n(t) and G_{L,n}(t). However, problems can arise when C_1(X) is not continuous. Essentially, convergence in probability of C_n(X) to C_1(X), implied by Assumptions A, only implies convergence of distribution functions at points of continuity. By assuming G_L and G possess densities, we can obtain uniform convergence of distribution functions.

Assumptions B. We assume that G and G_L are absolutely continuous and have bounded density functions g and g_L.

Theorem 2.
Under Assumptions A and B, for β = τ/2,

    n^β (G_{L,n}(t) − G_L(t)) is O_P(1),
    n^β (G_n(t) − G(t)) is O_P(1),

where both O_P(1) are uniform in t.

See Section A.1.2 for a proof. The result of Theorem 2 is the key step in showing consistency of our α_0 estimators in the following sections.

4 Estimation of α_0

4.1 C-patra/sen

Patra and Sen [2015] remove as much of the G_{L,n} distribution from G_n as possible, while ensuring that the difference is close to a valid cumulative distribution function. We briefly review the idea and provide theoretical results to support use of this procedure in the PU learning problem. See Patra and Sen [2015] for a fuller description of the method in the one–dimensional case.
For any γ ∈ (0, 1] define

    Ĝ_{s,n}^γ = (G_n − (1−γ) G_{L,n}) / γ.

If γ ≥ α_0, Ĝ_{s,n}^γ will be a valid c.d.f. (up to sampling uncertainty), while the converse is true if γ < α_0. Find the closest valid c.d.f. to Ĝ_{s,n}^γ, termed Ǧ_{s,n}^γ, and measure the distance between Ĝ_{s,n}^γ and Ǧ_{s,n}^γ. Define

    Ǧ_{s,n}^γ = argmin over all c.d.f. W of ∫ (Ĝ_{s,n}^γ(t) − W(t))² dG_n(t),    (10)

    d_n(g, h) = sqrt( ∫ (g(t) − h(t))² dG_n(t) ).

Isotonic regression is used to solve Equation (10). If d_n(Ĝ_{s,n}^γ, Ǧ_{s,n}^γ) ≈ 0, then α_0 ≤ γ, where the level of approximation is a function of the estimation uncertainty and thus the sample size. Given a sequence c_n, define

    α̂_{c_n} = inf{ γ ∈ (0, 1] : γ d_n(Ĝ_{s,n}^γ, Ǧ_{s,n}^γ) ≤ c_n / n^{β−η} },

where η ∈ (0, β) is a constant and the rate β is from Theorem 2.

Theorem 3. Under Assumptions A and B, if c_n = o(n^{β−η}) and c_n → ∞, then α̂_{c_n} →p α_0.

The proof, contained in Section A.1.4, is a generalization of results in Patra and Sen [2015] which accounts for the fact that both G_n and G_{L,n} are estimators. While Theorem 3 provides consistency, there is a wide range of choices of c_n. Patra and Sen [2015] showed that γ d_n(Ĝ_{s,n}^γ, Ǧ_{s,n}^γ) is convex and non-increasing in γ, and proposed letting α̂ be the γ that maximizes the second derivative of γ d_n(Ĝ_{s,n}^γ, Ǧ_{s,n}^γ). We use this implementation in our numerical work in Section 5; a sketch is given below.
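A minimal R sketch of the C-patra/sen computation, assuming classifier scores p0 (unlabeled) and p1 (labeled) from the workflow above. We use pava from the Iso package (listed in the Supplementary Materials); the grid of γ values, the truncation to [0, 1], and the function names are our own choices, and the second-difference rule is a discrete stand-in for the second-derivative criterion.

```r
library(Iso)

# gamma * d_n(Ghat, Gcheck) over a grid of gamma values, evaluated at the
# sorted unlabeled scores so the weights dG_n are 1/n each.
ps_distance <- function(p0, p1, gammas) {
  t   <- sort(p0)
  Gn  <- seq_along(t) / length(t)           # empirical c.d.f. of p0 at t
  GLn <- ecdf(p1)(t)                        # empirical c.d.f. of p1 at t
  sapply(gammas, function(g) {
    Ghat   <- (Gn - (1 - g) * GLn) / g      # candidate component c.d.f.
    Gcheck <- pmin(pmax(pava(Ghat), 0), 1)  # isotonic fit, truncated to [0, 1]
    g * sqrt(mean((Ghat - Gcheck)^2))       # gamma * d_n
  })
}

# Tuning-parameter-free rule: gamma maximizing the numerical second derivative.
ps_alpha <- function(p0, p1, gammas = seq(0.01, 0.99, by = 0.01)) {
  dn <- ps_distance(p0, p1, gammas)
  gammas[which.max(diff(diff(dn))) + 1]
}
```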
4.2 C-roc

Recalling the definitions of G, G_s, and G_L from Section 3, note

    G(t) = α_0 G_s(t) + (1 − α_0) G_L(t) ≤ α_0 + (1 − α_0) G_L(t)

for all t. Thus for any t such that G_L(t) ≠ 1 we have

    k(t) ≡ (G(t) − G_L(t)) / (1 − G_L(t)) ≤ α_0.

In the FDR literature, G_L is the distribution of the test statistic or p–value under the null hypothesis and is generally assumed known. Thus only G must be estimated, usually with the empirical cumulative distribution function. Storey [2002] proposed an estimator for k(t) at fixed t (Equation 6 of that paper) and determined a bootstrap method to find the t which produces the best estimates of the FDR.

The PU problem is more complicated in that one must estimate both G and G_L. However, the structure of G and G_L enables one to estimate the identifiable parameter α_0. Specifically, with t* = inf{t : G_L(t) ≥ 1} we have

    lim_{t↑t*} k(t) = α_0.    (11)

See Lemma 1 for a proof. This result suggests estimating α_0 by substituting the empirical estimators G_n and G_{L,n} into Equation (11) along with a sequence t̂ converging to the (unknown) t*. Such a sequence t̂ must be chosen so that the estimated denominator 1 − G_{L,n}(t̂) is not converging to 0 too fast (and hence too variable). For t̂ we use a quantile of the empirical c.d.f. which is converging to 1, but at a rate slower than the convergence of the empirical c.d.f. For some q ∈ (0, β), define

    t̂ = inf{t : G_{L,n}(t) ≥ 1 − n^{−q}} − n^{−1}.

The n^{−1} term in t̂ avoids technical complications.

Theorem 4. Under Assumptions A and B,

    k_n(t̂) ≡ (G_n(t̂) − G_{L,n}(t̂)) / (1 − G_{L,n}(t̂)) →P α_0.

See Section A.1.3 for a proof. A sketch of this estimator is given below.
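In R, the estimator of Theorem 4 is a few lines. This is a sketch: the function name and the default q = 0.2 are ours, with q required to lie in (0, β).

```r
# Storey-type estimate k_n(t_hat) from Theorem 4 applied to classifier scores.
# p0 = unlabeled scores, p1 = labeled scores.
storey_alpha <- function(p0, p1, q = 0.2) {
  n   <- length(p0)
  Gn  <- ecdf(p0)
  GLn <- ecdf(p1)
  # type = 1 gives the left-continuous inverse of the empirical c.d.f.
  t_hat <- unname(quantile(p1, probs = 1 - n^(-q), type = 1)) - 1/n
  (Gn(t_hat) - GLn(t_hat)) / (1 - GLn(t_hat))
}
```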
The ROC method of Scott [2015] (Proposition 2) and Blanchard et al. [2010] is a variant of the Storey [2002] method with a particular cutoff value t. Define the true ROC curve by the parametric equation {(G_L(t), G(t)) : t ∈ [0, 1]}. Then α_0 is the supremum of one minus the slope between (1, 1) and any point on the ROC curve. This is equivalent to the Storey method because

    α_0 = sup_t [ 1 − (1 − G(t)) / (1 − G_L(t)) ]
        = sup_t (G(t) − G_L(t)) / (1 − G_L(t))
        = sup_t k(t).

The true ROC curve is not known, so α_0 cannot be computed directly from this expression. Blanchard et al. [2010] found a consistent estimator and Scott [2015] determined rates of convergence using VC theory. For application to data, Scott [2015] splits the labeled and unlabeled data sets in half, constructs a kernel logistic regression classifier on half the data, and estimates the slope between (1, 1) and a discrete set of points on the ROC curve. The α estimate is the supremum of one minus each of these slopes. Thus we see that the ROC method and earlier methods developed in the FDR literature are in the same family of α estimation strategies: choosing a t in the Storey approach is equivalent to choosing a point on the ROC curve.

We consider two implementations of these ideas. The method of Scott [2015], using a kernel logistic regression classifier and a PU training–test set split to estimate tuning parameters, is referred to as "ROC." To facilitate comparison with C-patra/sen, we consider another version with a Random Forest classifier using out–of–bag probabilities to construct the ROC curve. We call this method C-roc.
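For intuition, the empirical version of sup_t k(t) can be computed directly from the two empirical c.d.f.s. The sketch below is our illustration, not the paper's implementation; it only crudely truncates the right end of the curve, where 1 − G_{L,n}(t) ≈ 0 makes the ratio unstable, whereas the t̂ of Theorem 4 handles this more carefully.

```r
# Empirical sup_t k(t) over a grid of thresholds (ROC-slope form).
roc_sup_alpha <- function(p0, p1, grid = sort(unique(p0))) {
  Gn  <- ecdf(p0)(grid)
  GLn <- ecdf(p1)(grid)
  ok  <- GLn < 1                     # drop points where the denominator is 0
  max((Gn[ok] - GLn[ok]) / (1 - GLn[ok]))
}
```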
5 Numerical Experiments

To illustrate the proposed methods we carry out numerical experiments on simulated waveform data and a real protein signaling data set, TCDB–SwissProt. We compare the performance of the three methods discussed in Section 4 (C-patra/sen, C-roc, and ROC) and the SPY method. With the SPY method, once the classifications ("positive" or "negative") in the set U are made, we use the proportion of "negative" cases as an approximation of α. For the C-roc and C-patra/sen methods we use Random Forest [Breiman, 2001] to construct C_n(·). Scott [2015] estimated κ = 1 − α; we have modified the ROC method notation to reflect the α notation used in this work.

5.1 Waveform Data

We simulate observations from the waveform data set using the R-package mlbench [Leisch and Dimitriadou, 2010]. The waveform data is a binary classification problem with 21 features. We fix π = 0.5. One way to generate such PU data sets is sketched below.
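The paper does not spell out its exact simulation recipe, so the following R sketch shows one plausible construction, treating waveform class "1" as the positive component F_0 and the remaining classes pooled as F_1; these choices are our assumptions.

```r
library(mlbench)

# Build a PU data set: P has m draws from F0 (waveform class "1"); U mixes
# n draws from F1 (classes "2"/"3") and F0 with mixture proportion alpha.
make_pu <- function(m, n, alpha) {
  draw <- function(k, positive) {
    X <- NULL
    while (is.null(X) || nrow(X) < k) {        # oversample, then filter by class
      w    <- mlbench.waveform(5 * max(k, 1))
      keep <- if (positive) w$classes == "1" else w$classes != "1"
      X <- rbind(X, w$x[keep, , drop = FALSE])
    }
    X[seq_len(k), , drop = FALSE]
  }
  n1 <- rbinom(1, n, alpha)                    # number of F1 draws in U
  list(P = draw(m, TRUE), U = rbind(draw(n1, FALSE), draw(n - n1, TRUE)))
}
```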
Figure 2: Comparison of methods with different α values. On the x-axis, α varies from 0.01 to 0.99 by step size 0.01. The left plot displays the estimates of the lower bound α_0. The middle plot displays the accuracy of classifying observations in U. The right plot displays the F1 score of the classifications.

5.1.1 Varying α
01. For each α the sample sizes are fixed at m = n = 3000. At each α we run the methods described to estimate α and classify observations in U .Results are shown in Figure 2. For α estimation shown in the left panel, the ROC method performs wellwhen α is large, but overestimates α when it is near zero. If α is small, the ROC method is sensitive to therandom seed used to divide samples into training and testing sets. The SPY method depends on a goodchoice of noise level, so with misspecified noise level it usually overestimates or underestimates α . C-roc andC-patra/sen methods are more stable with small α . We empirically examine consistency and convergence rates of the methods by estimating α at increasingsample sizes, keeping the number of labeled and unlabeled observations equal, i.e. n = m . In Figure 3, everymethod is repeated 20 times for each ( n, α ) pair. The 20 α estimates are displayed as a boxplot, whichshow estimator bias and variance. 11 ll l l lll ll
Figure 3: Comparison of methods with different sample sizes. The red solid line represents the true α (0.1, 0.5, 0.9). The range for all y-axes is [0, 1] from bottom to top. The unlabeled sample size n varies as 100 × 4^j (j = 0, ..., 3), i.e. n ∈ {100, 400, 1600, 6400}, and each boxplot summarizes the 20 estimates α̂ for a given (n, α) pair.

We see that 1) all methods except SPY appear consistent under the different settings (α = 0.1, 0.5, 0.9); 2) the spread of the estimates shrinks as n grows; 3) for larger α, the estimators have smaller variance; and 4) C-patra/sen and C-roc are the best methods on average.

5.2 Individual Feature α Estimation
One approach to solving the multidimensional PU learning problem is to estimate α separately using each feature. If X_i ∈ R^p, this results in p estimates α̂_1, ..., α̂_p of the parameter α. Each of these is an estimated lower bound on α, so a naive estimate of α is max(α̂_1, ..., α̂_p). This approach ignores the correlation structure among features; a sketch of it is given below.

Using the waveform data, we compare this strategy to the multi–dimensional classifier approach. To make the problem challenging we select the 14 weakest features, defined as having the lowest Random Forest importance scores. We apply the Patra–Sen one–dimensional method to obtain individual feature α estimates. The results are summarized in Figure 4. Feature importance matches well with the performance of the α estimates. In the right panels of Figure 4, we see that feature 5 is not useful because there is little difference between the unlabeled and labeled samples, leading to a feature-based α estimate of approximately 0.
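A sketch of the naive strategy in R, reusing a one-dimensional estimator from Section 4 (here ps_alpha from our earlier sketch); raw feature values play the role of the one-dimensional scores.

```r
# Apply a one-dimensional MPE estimator to each feature and take the max.
# P, U: feature matrices; est: a function of (unlabeled, labeled) values.
per_feature_alpha <- function(P, U, est = ps_alpha) {
  ests <- vapply(seq_len(ncol(U)),
                 function(j) est(U[, j], P[, j]),
                 numeric(1))
  list(per_feature = ests, naive_max = max(ests))
}
```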
Figure 4: Estimation of α using individual features. In the left panel, the horizontal blue dashed line is the true α (= 0.6); for each feature, points show the α estimate from the Patra/Sen procedure on that single feature (left y-axis) together with the Random Forest importance score (right y-axis). The full multidimensional estimates are C-patra/sen = 0.594, C-roc = 0.567, ROC = 0.541, and SPY = 0.589, versus max(patra/sen) = 0.542 for the best single feature. The right panels are kernel density estimates of the "unlabeled" and "labeled" data for features 5 and 8.

5.3 Protein Signaling Data

The transporter classification database (TCDB) [Saier et al., 2006], here the P set, consists of 2453 proteins involved in signaling across cellular membranes. It is desirable to add proteins to this database from unlabeled databases which contain a mixture of membrane transport and non–transport proteins. Elkan and Noto [2008] and Das et al. [2007] manually identified 348 of the 4906 proteins in the SwissProt [Boeckmann et al., 2003] database as being related to transport. We treat the SwissProt data as the unlabeled set U, for which we have ground truth α = (4906 − 348)/4906 ≈ 0.93. We apply the four methods to estimate α and classify the unlabeled observations.
In the ideal column of Table 1, α = 0.93 is the true proportion of negative samples within the unlabeled set. The accuracy (0.99) and F1 score (0.92) in the ideal column are calculated using 10–fold cross-validation with all of the positive examples (in TCDB and SwissProt) against only the negative examples in SwissProt. This represents an upper bound on the performance one could expect from the PU learning methods. The second through fifth columns (C-patra/sen through SPY) contain results for the four methods discussed in this work. C-roc has the best performance among the methods.

               ideal   C-patra/sen   C-roc   ROC    SPY
    alpha      0.93    0.96          0.94    0.96   0.89
    accuracy   0.99    0.95          0.96    0.89   0.94
    F1 score   0.92    0.54          0.68    0.04   0.66

Table 1: Comparison of methods for protein signaling data.
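The accuracy and F1 entries can be computed from predicted and true labels with a routine like the following (our sketch; the label conventions and edge-case handling are ours).

```r
# Accuracy and F1 for predicted vs. true labels on U; "positive" names the
# class treated as positive (here, transport proteins).
pu_metrics <- function(pred, truth, positive = "F0") {
  tp   <- sum(pred == truth & truth == positive)
  prec <- tp / sum(pred == positive)     # NaN if nothing predicted positive
  rec  <- tp / sum(truth == positive)
  c(accuracy = mean(pred == truth),
    F1 = 2 * prec * rec / (prec + rec))
}
```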
6 Conclusion

In this paper we proposed a framework for estimating the mixture proportion and classifier in the PU learning problem. We implemented this framework using two estimators from the FDR literature, C-patra/sen and C-roc. The framework has the power to incorporate other one-dimensional MPE procedures, such as Meinshausen and Rice [2006], Genovese and Wasserman [2004], Langaas et al. [2005], Efron [2007], Jin [2008], Cai and Jin [2010], or Nguyen and Matias [2014]. More generally, we have strengthened connections between the classification–machine learning literature and the multiple testing literature by constructing estimators using ideas from both communities.
Supplementary Materials

R code and data needed for reproducing results in this work are available online at github.com/zflin/PU_learning. The R packages used in this project are mlbench [Leisch and Dimitriadou, 2010], devtools [Wickham and Chang, 2017], Iso [Turner, 2015], randomForest [Liaw and Wiener, 2002], MASS [Venables and Ripley, 2002], pdist [Wong, 2013], LiblineaR [Helleputte, 2017], foreach [Analytics and Weston, 2015b], doParallel [Analytics and Weston, 2015a], and xtable [Dahl, 2016].
Acknowledgments
Part of this work was completed while the authors were research fellows in the Astrostatistics program at the Statistics and Applied Mathematical Sciences Institute (SAMSI) in Fall 2016. The authors gratefully acknowledge SAMSI's support.
References
R. Analytics and S. Weston. doParallel: Foreach Parallel Adaptor for the 'parallel' Package, 2015a. URL https://CRAN.R-project.org/package=doParallel. R package version 1.0.10.
R. Analytics and S. Weston. foreach: Provides Foreach Looping Construct for R, 2015b. URL https://CRAN.R-project.org/package=foreach. R package version 1.4.3.
Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), 57(1):289–300, 1995. ISSN 00359246.
Y. Benjamini and Y. Hochberg. On the adaptive control of the false discovery rate in multiple testing with independent statistics. Journal of Educational and Behavioral Statistics, 25(1):60–83, 2000.
Y. Benjamini, A. M. Krieger, and D. Yekutieli. Adaptive linear step-up procedures that control the false discovery rate. Biometrika, 93(3):491–507, 2006.
G. Biau. Analysis of a random forests model. J. Mach. Learn. Res., 13(1):1063–1095, Apr. 2012. ISSN 1532-4435.
G. Blanchard and É. Roquain. Adaptive false discovery rate control under independence and dependence. Journal of Machine Learning Research, 10(Dec):2837–2871, 2009.
G. Blanchard, G. Lee, and C. Scott. Semi-supervised novelty detection. J. Mach. Learn. Res., 11:2973–3009, Dec. 2010. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=1756006.1953028.
B. Boeckmann, A. Bairoch, R. Apweiler, M.-C. Blatter, A. Estreicher, E. Gasteiger, M. J. Martin, K. Michoud, C. O'Donovan, I. Phan, S. Pilbout, and M. Schneider. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Research, 31(1):365–370, 2003.
L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
T. T. Cai and J. Jin. Optimal rates of convergence for estimating the null density and proportion of nonnull effects in large-scale multiple testing. Ann. Statist., 38(1):100–145, 2010.
S. Chang, Y. Zhang, J. Tang, D. Yin, Y. Chang, M. A. Hasegawa-Johnson, and T. S. Huang. Positive-unlabeled learning in streaming networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pages 755–764, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-4232-2.
D. B. Dahl. xtable: Export Tables to LaTeX or HTML, 2016. URL https://CRAN.R-project.org/package=xtable. R package version 1.8-2.
S. Das, M. H. Saier, and C. Elkan. Finding Transport Proteins in a General Protein Database, pages 54–66. Springer Berlin Heidelberg, Berlin, Heidelberg, 2007.
B. Efron. Size, power and false discovery rates. Ann. Statist., 35(4):1351–1377, 2007.
B. Efron. Large-scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction, volume 1. Cambridge University Press, 2012.
B. Efron, R. Tibshirani, J. D. Storey, and V. Tusher. Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association, 96(456):1151–1160, 2001.
C. Elkan and K. Noto. Learning classifiers from only positive and unlabeled data. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 213–220. ACM, 2008.
C. Genovese and L. Wasserman. A stochastic process approach to false discovery control. Annals of Statistics, pages 1035–1061, 2004.
T. Helleputte. LiblineaR: Linear Predictive Models Based on the LIBLINEAR C/C++ Library, 2017. R package version 2.10-8.
J. Jin. Proportion of non-zero normal means: Universal oracle equivalences and uniformly consistent estimators. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 70(3):461–493, 2008. ISSN 13697412, 14679868.
M. Langaas, B. H. Lindqvist, and E. Ferkingstad. Estimating the proportion of true null hypotheses, with application to DNA microarray data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(4):555–572, 2005. ISSN 1467-9868.
F. Leisch and E. Dimitriadou. mlbench: Machine Learning Benchmark Problems, 2010. R package version 2.1-1.
A. Liaw and M. Wiener. Classification and regression by randomForest. R News, 2(3):18–22, 2002. URL http://CRAN.R-project.org/doc/Rnews/.
B. Liu, W. S. Lee, P. S. Yu, and X. Li. Partially supervised classification of text documents. In ICML, volume 2, pages 387–394. Citeseer, 2002.
N. Meinshausen and J. Rice. Estimating the proportion of false null hypotheses among a large number of independently tested hypotheses. The Annals of Statistics, 34(1):373–393, 2006.
M. N. Nguyen, X.-L. Li, and S.-K. Ng. Positive unlabeled learning for time series classification. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Two, IJCAI'11, pages 1421–1426. AAAI Press, 2011. ISBN 978-1-57735-514-4.
V. H. Nguyen and C. Matias. On efficient estimators of the proportion of true null hypotheses in a multiple testing setup. Scandinavian Journal of Statistics, 41(4):1167–1194, 2014. ISSN 1467-9469.
R. K. Patra and B. Sen. Estimation of a two-component mixture model with applications to multiple testing. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2015.
H. Ramaswamy, C. Scott, and A. Tewari. Mixture proportion estimation via kernel embedding of distributions. PMLR, pages 2052–2060, 2016.
S. Robin, A. Bar-Hen, J.-J. Daudin, and L. Pierre. A semi-parametric approach for mixture models: Application to local false discovery rate estimation. Computational Statistics & Data Analysis, 51(12):5483–5493, 2007.
M. H. Saier, Jr, C. V. Tran, and R. D. Barabote. TCDB: the transporter classification database for membrane transport protein analyses and information. Nucleic Acids Research, 34(suppl 1):D181–D186, 2006.
C. Scott. A rate of convergence for mixture proportion estimation, with application to learning from noisy labels. In AISTATS, 2015.
C. Scott, G. Blanchard, and G. Handy. Classification with asymmetric label noise: Consistency and maximal denoising. In COLT, pages 489–511, 2013.
J. D. Storey. A direct approach to false discovery rates. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(3):479–498, 2002. ISSN 1467-9868.
R. Turner. Iso: Functions to Perform Isotonic Regression, 2015. URL https://CRAN.R-project.org/package=Iso. R package version 0.0-17.
W. N. Venables and B. D. Ripley. Modern Applied Statistics with S. Springer, New York, fourth edition, 2002. ISBN 0-387-95457-0.
G. Ward, T. Hastie, S. Barry, J. Elith, and J. R. Leathwick. Presence-only data and the EM algorithm. Biometrics, 65(2):554–563, 2009.
H. Wickham and W. Chang. devtools: Tools to Make Developing R Packages Easier, 2017. URL https://CRAN.R-project.org/package=devtools. R package version 1.13.2.
J. Wong. pdist: Partitioned Distance Function, 2013. URL https://CRAN.R-project.org/package=pdist. R package version 1.2.
P. Yang, X.-L. Li, J.-P. Mei, C.-K. Kwoh, and S.-K. Ng. Positive-unlabeled learning for disease gene identification. Bioinformatics, 28(20):2640–2647, 2012.
Appendix to "A Flexible Procedure for Mixture Proportion Estimation in Positive–Unlabeled Learning"
A.1 Technical Notes
A.1.1 Proof of Theorem 1
Equivalently, we are trying to prove

    (G − (1−γ)G_L)/γ is a c.d.f.  ⇔  (F − (1−γ)F_0)/γ is a c.d.f.    (A.1)

It is sufficient to show

    G − (1−γ)G_L is non-decreasing  ⇔  f − (1−γ)f_0 ≥ 0.    (A.2)

First we show ⇐. Consider any t_2 > t_1. Then

    (G(t_2) − (1−γ)G_L(t_2)) − (G(t_1) − (1−γ)G_L(t_1)) = ∫_{x : C_1(x) ∈ (t_1, t_2]} [f(x) − (1−γ)f_0(x)] dµ(x) ≥ 0.

Now we show ⇒ by proving the contrapositive. By assumption there exists A = {x : f(x) − (1−γ)f_0(x) < 0} such that P(A) > 0. Further we have

    A = { x : (1−γ)(1−π)/π > [f(x)/f_0(x)] · (1−π)/π }
      = { x : C_1(x) > 1/(1 + (1−γ)(1−π)/π) ≡ t* }.

So

    (G(1) − (1−γ)G_L(1)) − (G(t*) − (1−γ)G_L(t*)) = ∫_{A = {x : C_1(x) > t*}} [f(x) − (1−γ)f_0(x)] dµ(x) < 0,

so G − (1−γ)G_L is not non-decreasing.

A.1.2 Proof of Theorem 2

Write

    n^β (G_n(t) − G(t)) = (n^β / n^{1/2}) · R_n(t) + Q_n(t),

where

    R_n(t) ≡ n^{1/2} (G_n(t) − E[1{C_n(X) ≤ t} | C_n]),
    Q_n(t) ≡ n^β (E[1{C_n(X) ≤ t} | C_n] − G(t)).

We now show that R_n(t) and Q_n(t) are O_P(1) uniformly in t; together these facts show the expression is O_P(1) uniformly in t.

R_n(t): Note

    R_n(t) = √n ( (1/n) ∑_{i=1}^n 1{C_n(X_i) ≤ t} − E[1{C_n(X) ≤ t} | C_n] ).

By the DKW inequality, P(||R_n||_∞ > x | C_n) ≤ 2e^{−2x²}. Thus ||R_n||_∞ is O_P(1).

Q_n(t): With T_n ≡ 1{C_n(X) ≤ t} − 1{C_1(X) ≤ t}, we have Q_n(t) = n^β E[T_n | C_n], and

    |E[T_n | C_n]| ≤ |E[T_n 1{|C_1(X) − t| ≤ ε_n} | C_n]|  (≡ B_1)
      + |E[T_n 1{|C_1(X) − t| > ε_n} 1{|C_1(X) − C_n(X)| < ε_n} | C_n]|  (≡ B_2)
      + |E[T_n 1{|C_1(X) − t| > ε_n} 1{|C_1(X) − C_n(X)| > ε_n} | C_n]|  (≡ B_3).

Noting that |T_n| ≤ 1 and that X is independent of C_n, we have B_1 ≤ P(|C_1(X) − t| ≤ ε_n) ≤ 2ε_n sup_t g(t), where g is the density of C_1(X), which exists and is bounded by Assumptions B. B_2 is 0 because T_n = 0 whenever the indicator functions in B_2 are both 1. Finally, noting B_3 ≤ E[1{|C_1(X) − C_n(X)| > ε_n} | C_n] and using Markov's inequality twice, we have

    P(B_3 > r_n) ≤ P(|C_1(X) − C_n(X)| > ε_n) / r_n ≤ E|C_n(X) − C_1(X)| / (ε_n r_n).

If we choose ε_n = n^{−τ/2} and r_n ∼ n^{−τ/2}, then B_1 = O(n^{−τ/2}) and B_3 = O_P(n^{−τ/2}), so we can set β = τ/2. An identical argument shows n^β (G_{L,n}(t) − G_L(t)) is O_P(1) uniformly in t.

A.1.3 Proof of Theorem 4
Since t̂ = inf{t : G_{L,n}(t) ≥ 1 − n^{−q}} − n^{−1} and 0 < q < β, we have

    (n^β (1 − G_{L,n}(t̂)))^{−1} ≤ n^q / n^β = o(1).

Recall by Theorem 2 we have

    n^β (G_{L,n}(t) − G_L(t)) ≡ d_L(t) = O_P(1),
    n^β (G_n(t) − G(t)) ≡ d(t) = O_P(1),

where this and subsequent O_P and o_P are uniform in t. We have

    (G_n(t̂) − G_{L,n}(t̂)) / (1 − G_{L,n}(t̂))
      = (G(t̂) − G_L(t̂)) / (1 − G_{L,n}(t̂)) + (d(t̂) − d_L(t̂)) / (n^β (1 − G_{L,n}(t̂)))
      = A · k(t̂) + o_P(1),

where A ≡ (1 − G_L(t̂)) / (1 − G_{L,n}(t̂)) and k(t̂) = (G(t̂) − G_L(t̂)) / (1 − G_L(t̂)). Note that

    A = 1 + d_L(t̂) / (n^β (1 − G_{L,n}(t̂))) = 1 + o_P(1).

Thus it is sufficient to show that k(t̂) → α_0. By Lemma 1, k(t) ↑ α_0 as t ↑ t*. We show that for any ε > 0,

    P(t̂ ∈ (t* − ε, t*)) → 1.

Thus by the continuous mapping theorem, the estimator is consistent.

Part 1: We show P(t* − t̂ > ε) → 0. By the definition of t*, there exists γ > 0 such that G_L(t* − ε/2) = 1 − γ. We have

    P(t* − t̂ > ε) = P(G_{L,n}(t* − ε + n^{−1}) > G_{L,n}(t̂ + n^{−1}))
      ≤ P(G_{L,n}(t* − ε + n^{−1}) > 1 − n^{−q})
      ≤ P(G_L(t* − ε + n^{−1}) > 1 − n^{−q} − γ/2)  (≡ A_n)
      + P(|G_{L,n}(t* − ε + n^{−1}) − G_L(t* − ε + n^{−1})| > γ/2)  (→ 0 by Theorem 2).

A_n → 0 because for large n,

    G_L(t* − ε + n^{−1}) ≤ G_L(t* − ε/2) = 1 − γ < 1 − n^{−q} − γ/2.

Part 2: We show P(t̂ ≥ t*) → 0. We have

    P(t̂ ≥ t*) = P(G_{L,n}(t̂ + n^{−1}) ≥ G_{L,n}(t* + n^{−1}))
      = P(1 − n^{−q} ≥ G_{L,n}(t* + n^{−1}))
      = P(1 − G_{L,n}(t* + n^{−1}) ≥ n^{−q})
      = P( n^β (G_L(t* + n^{−1}) − G_{L,n}(t* + n^{−1})) ≥ n^{β−q} ),

using G_L(t* + n^{−1}) = 1. The quantity n^β (G_L(t* + n^{−1}) − G_{L,n}(t* + n^{−1})) is O_P(1) by Theorem 2, and since β > q we have the result.
A.1.4 Proof of Theorem 3
Proof. For any ε > 0, we need to show P(|α̂_{c_n} − α_0| > ε) → 0. Note

    P(|α̂_{c_n} − α_0| > ε) = P(α̂_{c_n} < α_0 − ε) + P(α̂_{c_n} > α_0 + ε).

First we show that P(α̂_{c_n} < α_0 − ε) → 0. If α_0 ≤ ε, then

    P(α̂_{c_n} < α_0 − ε) ≤ P(α̂_{c_n} < 0) = 0.

If α_0 > ε, suppose we have α̂_{c_n} < α_0 − ε. Then by Lemma 6,

    d_n(Ĝ_{s,n}^{α_0−ε}, Ǧ_{s,n}^{α_0−ε}) ≤ c_n / (n^{β−η} (α_0 − ε)).

The left-hand side converges to a positive constant by Lemma 5, while the right-hand side converges to zero by the choice of c_n; hence P(α̂_{c_n} < α_0 − ε) → 0.

Next we show P(α̂_{c_n} > α_0 + ε) → 0. Suppose we have α̂_{c_n} > α_0 + ε. Then by Lemma 6,

    n^{β−η} d_n(Ĝ_{s,n}^{α_0+ε}, Ǧ_{s,n}^{α_0+ε}) > c_n / (α_0 + ε).

The left-hand side converges to zero by Lemmas 5 and 4, while the right-hand side converges to infinity by the choice of c_n; hence P(α̂_{c_n} > α_0 + ε) → 0.

A.2 Lemmas
Lemma 1. lim_{t↑t*} k(t) = α_0.

Proof. Define α′ = lim_{t↑t*} k(t).

Show α′ ≤ α_0: By the definition of α_0 there exists a c.d.f. G_s such that

    G(t) = α_0 G_s(t) + (1 − α_0) G_L(t) ≤ α_0 + (1 − α_0) G_L(t).

Thus k(t) = (G(t) − G_L(t)) / (1 − G_L(t)) ≤ α_0 for all t, and so α′ = lim_{t↑t*} k(t) ≤ α_0.

Show α′ ≥ α_0: Consider any γ < α_0. We show γ ≤ α′. Since γ < α_0, (G − (1−γ)G_L)/γ is not a c.d.f. Thus there exist t_1 < t_2 such that

    (G(t_1) − (1−γ)G_L(t_1)) / γ > (G(t_2) − (1−γ)G_L(t_2)) / γ.    (A.3)

Since the left hand side is bounded above by 1 and G(t) = 1 for all t ≥ t*, we have t_1 < t*. From Equation (A.3) we have

    G(t_1) − G(t_2) > (1−γ)(G_L(t_1) − G_L(t_2)),

which implies (since G_L(t_1) − G_L(t_2) < 0) that

    (G(t_2) − G(t_1)) / (G_L(t_2) − G_L(t_1)) < 1 − γ.    (A.4)

From Lemma 3 we have

    (1 − G_L(t_2)) / (1 − G(t_2)) = (G_L(1) − G_L(t_2)) / (G(1) − G(t_2)) ≥ (G_L(t_2) − G_L(t_1)) / (G(t_2) − G(t_1)).

Combining this result with Equation (A.4) we obtain

    (1 − G(t_2)) / (1 − G_L(t_2)) ≤ 1 − γ,

which implies

    γ ≤ (G(t_2) − G_L(t_2)) / (1 − G_L(t_2)) = k(t_2).

Since k(t) ↑ as t ↑ t* (see Lemma 2), we have the result.

Lemma 2. k(t) is increasing on t ∈ [0, t*).

Proof. Recall Q(p) = inf{t ∈ (0, 1] : G_L(t) ≥ p} and t* = Q(1). Note that for a, b, c, d > 0, if a/b < c/d, then (a+c)/(b+d) > a/b. Next note that by Lemma 3, for t* > t_2 > t_1,

    (G(t_2) − G(t_1)) / (G_L(t_2) − G_L(t_1)) > (1 − G(t_2)) / (1 − G_L(t_2)).

Thus we have

    1 − k(t_1) = (1 − G(t_1)) / (1 − G_L(t_1))
               = (1 − G(t_2) + G(t_2) − G(t_1)) / (1 − G_L(t_2) + G_L(t_2) − G_L(t_1))
               ≥ (1 − G(t_2)) / (1 − G_L(t_2))
               = 1 − k(t_2).

Lemma 3 (Ratio). For all 0 ≤ t_1 < t_2 ≤ 1 where G(t_2) − G(t_1) > 0, we have

    (1−π)/π · t_1/(1−t_1) < (G_L(t_2) − G_L(t_1)) / (G(t_2) − G(t_1)) ≤ (1−π)/π · t_2/(1−t_2),

where 1/0 ≡ ∞.

Proof. The classifier is

    C_1(x) = πf_L(x) / (πf_L(x) + (1−π)f(x)) = 1 / (1 + (1−π)/π · f(x)/f_L(x)),

where f_L = f_0 denotes the density of the labeled class. Define

    A_t = {x : C_1(x) ≤ t} = {x : (1−t)/t · π/(1−π) · f_L(x) ≤ f(x)}.

Therefore on the set A_{t_2} ∩ A_{t_1}^C we have

    (1−t_2)/t_2 · π/(1−π) · f_L(x) ≤ f(x) < (1−t_1)/t_1 · π/(1−π) · f_L(x).

So

    (G_L(t_2) − G_L(t_1)) / (G(t_2) − G(t_1)) = ∫_{A_{t_2} ∩ A_{t_1}^C} f_L dµ / ∫_{A_{t_2} ∩ A_{t_1}^C} f dµ
      > ∫ f_L dµ / [ (1−t_1)/t_1 · π/(1−π) · ∫ f_L dµ ]
      = t_1/(1−t_1) · (1−π)/π.

We can obtain the upper bound in an identical manner.

Lemma 4.
1] : G L ( t ) ≥ p } and t ∗ = Q (1). Note that with a, b, c, d > a/b < c/d , a + cb + d > ab . Next note that by Lemma 3, for t ∗ > t > t , G ( t ) − G ( t ) G L ( t ) − G L ( t ) > − G ( t )1 − G L ( t ) . Thus we have 1 − k ( t ) = 1 − G ( t )1 − G L ( t )= 1 − G ( t ) + G ( t ) − G ( t )1 − G L ( t ) + G L ( t ) − G L ( t ) ≥ − G ( t )1 − G L ( t )= 1 − k ( t ) . Lemma 3 (Ratio) . For all ≤ t < t ≤ where G ( t ) − G ( t ) > we have − ππ t − t < G L ( t ) − G L ( t ) G ( t ) − G ( t ) ≤ − ππ t − t where / ≡ ∞ .Proof. The classifier is C ( x ) = πf L ( x ) πf L ( x ) + (1 − π ) f ( x ) = 11 + − ππ f ( x ) f L ( x ) Define A t = { x : C ( x ) ≤ t } = { x : − tt π − π f L ( x ) ≤ f ( x ) } . Therefore on the set A t ∩ A Ct we have1 − t t π − π f L ( x ) ≤ f ( x ) < − t t π − π f L ( x )So G L ( t ) − G L ( t ) G ( t ) − G ( t ) = (cid:82) A t ∩ A Ct f L ( x ) (cid:82) A t ∩ A Ct f ( x ) > (cid:82) A t ∩ A Ct f L ( x ) − t t π − π (cid:82) A t ∩ A Ct f L ( x ) = t − t − ππ . We can obtain the upper bound in an identical manner. ppendix A.6Lemma 4. n β − η d n ( G, G n ) = o P (1) ,n β − η d n ( G L , G L,n ) = o P (1) . Proof. n β − η d n ( G, G n ) = (cid:118)(cid:117)(cid:117)(cid:117)(cid:117)(cid:116)(cid:90) n − η (cid:124)(cid:123)(cid:122)(cid:125) = o P (1) n β ( G n ( t ) − G ( t )) (cid:124) (cid:123)(cid:122) (cid:125) = O P (1) dG n ( t ) , where n β ( G n ( t ) − G ( t )) = O P (1) uniformly, and then n − η n β ( G n ( t ) − G ( t )) = o P (1) uniformly. Therefore n β − η d n ( G, G n ) ≤ sup t | n − η n β ( G n ( t ) − G ( t )) | = o P (1) . The G L , G L,n case can be proved in an identical manner.
Lemma 5. For 1 ≥ γ ≥ α_0,

    γ d_n(Ĝ_{s,n}^γ, Ǧ_{s,n}^γ) ≤ d_n(G, G_n) + (1−γ) d_n(G_L, G_{L,n}).

Thus,

    γ d_n(Ĝ_{s,n}^γ, Ǧ_{s,n}^γ) → 0 if γ ≥ α_0, and to a positive constant if γ < α_0.

Proof. Let G_s^γ = (G − (1−γ)G_L)/γ. If γ ≥ α_0, then

    γ d_n(Ĝ_{s,n}^γ, Ǧ_{s,n}^γ) ≤ γ d_n(Ĝ_{s,n}^γ, G_s^γ) ≤ d_n(G, G_n) + (1−γ) d_n(G_L, G_{L,n}).

The first inequality holds by the definition of Ǧ_{s,n}^γ and the fact that G_s^γ is a valid c.d.f. when 1 ≥ γ ≥ α_0; the second inequality is due to the triangle inequality.

Now we prove the limit property of γ d_n(Ĝ_{s,n}^γ, Ǧ_{s,n}^γ). If γ ≥ α_0, then γ d_n(Ĝ_{s,n}^γ, Ǧ_{s,n}^γ) → 0 because d_n(G, G_n) → 0 and d_n(G_L, G_{L,n}) → 0. If γ < α_0, by the definition of α_G, G_s^γ is not a valid c.d.f. Pointwise, Ĝ_{s,n}^γ → G_s^γ, so for large n, Ĝ_{s,n}^γ is not a valid c.d.f., while Ǧ_{s,n}^γ is always a c.d.f. Hence γ d_n(Ĝ_{s,n}^γ, Ǧ_{s,n}^γ) converges to some positive constant.

Lemma 6. B_n := {γ ∈ [0, 1] : n^{β−η} γ d_n(Ĝ_{s,n}^γ, Ǧ_{s,n}^γ) ≤ c_n} is convex. Thus, B_n = (α̂_{c_n}, 1] or B_n = [α̂_{c_n}, 1].

Proof. Obviously, 1 ∈ B_n. Assume γ_1 ≤ γ_2 are both in B_n, and let γ_3 = ξγ_1 + (1−ξ)γ_2, where ξ ∈ [0, 1]. By the definition of Ĝ_{s,n}^γ,

    ξγ_1 Ĝ_{s,n}^{γ_1} + (1−ξ)γ_2 Ĝ_{s,n}^{γ_2} = γ_3 Ĝ_{s,n}^{γ_3}.

Note that (1/γ_3)(ξγ_1 Ǧ_{s,n}^{γ_1} + (1−ξ)γ_2 Ǧ_{s,n}^{γ_2}) is a valid c.d.f. We have γ_3 ∈ B_n because

    d_n(Ĝ_{s,n}^{γ_3}, Ǧ_{s,n}^{γ_3})
      ≤ d_n( Ĝ_{s,n}^{γ_3}, (1/γ_3)(ξγ_1 Ǧ_{s,n}^{γ_1} + (1−ξ)γ_2 Ǧ_{s,n}^{γ_2}) )
      = d_n( (1/γ_3)(ξγ_1 Ĝ_{s,n}^{γ_1} + (1−ξ)γ_2 Ĝ_{s,n}^{γ_2}), (1/γ_3)(ξγ_1 Ǧ_{s,n}^{γ_1} + (1−ξ)γ_2 Ǧ_{s,n}^{γ_2}) )
      ≤ (ξγ_1/γ_3) d_n(Ĝ_{s,n}^{γ_1}, Ǧ_{s,n}^{γ_1}) + ((1−ξ)γ_2/γ_3) d_n(Ĝ_{s,n}^{γ_2}, Ǧ_{s,n}^{γ_2})
      ≤ (ξγ_1/γ_3) · c_n/(n^{β−η} γ_1) + ((1−ξ)γ_2/γ_3) · c_n/(n^{β−η} γ_2)
      = c_n / (n^{β−η} γ_3).