Improved Estimation of Class Prior Probabilities through Unlabeled Data
Norman Matloff
Department of Computer Science
University of California, Davis
Davis, CA 95616

October 7, 2015
Abstract
Work in the classification literature has shown that in computing a classification function, one need not know the class membership of all observations in the training set; the unlabeled observations still provide information on the marginal distribution of the feature set, and can thus contribute to increased classification accuracy for future observations. The present paper will show that this scheme can also be used for the estimation of class prior probabilities, which would be very useful in applications in which it is difficult or expensive to determine class membership. Both parametric and nonparametric estimators are developed. Asymptotic distributions of the estimators are derived, and it is proven that the use of the unlabeled observations does reduce asymptotic variance. This methodology is also extended to the estimation of subclass probabilities.
There has been much work on the issue of unlabeled data in classification problems. Some papers, such as [8], have taken the point of view that the data is missing due to some deficiency in the data collection process, while others, such as [13], are aimed at situations in which some observations are deliberately left unlabeled. The latter approach is motivated by the fact that in many applications it is very difficult or expensive to determine class membership. Thus [13] proposed that in part of the training set, class membership be left undetermined. The unlabeled observations would still provide information on the marginal distribution of the features, and it was shown that this information can contribute to increased classification accuracy for future observations.

In the present work, it is again assumed that we have both labeled and unlabeled data, but the focus is on estimation of the class prior probabilities rather than estimation of the classification function. We wish to estimate those probabilities via a mixture of labeled and unlabeled data, in order to economize on the time and effort needed to acquire labeling information.

For example, consider the geographic application in [2]. Our class variable here is forest cover type, representing one of seven classes. Suppose we wish to estimate the population proportions of the various cover types. The authors note that a problem arises in that cover data is generally “directly recorded by field personnel or estimated from remotely sensed data...[both of which] may be prohibitively time consuming and/or costly in some situations.” However, various feature variables are easily recorded, such as elevation, horizontal distance to the nearest roadway and so on. Previous work in the estimation of classification functions suggests that we may use this other data, without labeling, as a means of reducing the time and effort needed to determine class membership.

As another example, consider the patent analysis reported in [12]. The overall goal was to estimate the proportion of patents whose research was publicly funded. So, we would have just two classes, indicating public funding or lack of it. The determination of which patents were publicly funded involved inspection of not only the patents themselves, but also the papers cited in the patents, a very time-consuming, human-intensive task. However, using the approach discussed here, we could define our features to consist of some key words which have some predictive power for public funding status, such as appearance of the word university in the Assignee section of the patent document, and then use the feature data to help achieve our goal of estimating the class probabilities.

Specifically, one might develop a classification rule from the labeled observations, and then use the rule to classify the remaining observations in the training set. In other words, we would find predicted forest cover types for each of the unlabeled observations. Finally, one would obtain estimates for the proportions of forest cover types in the full training set, using the predicted cover types for those on which labels had not been collected. Actually, one can do even better by using estimated conditional class probabilities of the unlabeled and labeled observations, as will be proven here.

Let us set some notation. Suppose there are c classes, and let Y = (Y^{(1)}, ..., Y^{(c)})^T be the class identification vector, so that Y^{(j)} is 1 or 0, according to whether the item belongs to class j.
Let X = (X^{(1)}, ..., X^{(f)})^T be the feature vector. For some observations in our training set, we will have data on both X and Y, but others will be unlabeled, i.e. we will have data only on X. In our context here, our goal is to estimate q_j = EY^{(j)}, the prior probability of class j. (This 0/1 notation, traditional in the statistics literature, differs from the machine learning custom of having Y^{(j)} take on the values ±1. However, the statistics notation will be convenient here in various ways, e.g. in that P(Y^{(j)} = 1) is simply EY^{(j)}.)

The key point is that the unlabeled observations provide additional information on the marginal distribution of X. Since in the Law of Total Expectation, EY = E[E(Y | X)], the outer expectation is with respect to X, the better our estimate of the marginal distribution of X, the better our estimate of EY. Thus using the two data sets in concert can give us more accurate estimates of the target quantity, EY, than can the labeled set alone. In this paper such estimates will be developed and analyzed.

It is important to note that not only do we want to be able to estimate EY, but we also may need a measure of the accuracy of our estimates. Most journals in the sciences, for example, want statistical inference to accompany findings, in the form of confidence intervals and hypothesis tests. This issue is also addressed in this paper.

In addition, we may be interested in estimating subclass probabilities, again using labeled and unlabeled data. For instance, we may wish to compare the rates of public funding of patents in the computer field on one hand, and in the optics field on the other.

Section 2 first notes that the statistical literature, such as [9] and [10], contains some semiparametric methodology which is related to our goals. A typical case here might be the popular logistic model [1], especially in settings with continuous features such as in the forest example. Asymptotic distributions are derived for our estimators of class prior probabilities in Section 3, and that section also proves that the use of unlabeled data does reduce asymptotic variance in estimating the priors. The question then arises as to how much reduction in asymptotic variance can be attained. An analyst faced with a real data set must decide whether the method proposed here will work well in his/her particular setting. Is it worth collecting some unlabeled data? In Section 4 techniques are developed for assessing this. These techniques may also be useful for the original application of unlabeled data, which was to enhance the accuracy of classification function estimators. Section 5 shows how the methodology works on some real data sets.

The remainder of the paper covers variations. Section 6 addresses the subclass problem, and Section 7 treats the case of purely discrete features. Asymptotic inference methodology is again developed, and again it is proven that asymptotic variance is reduced. Finally, Section 8 will discuss possible extensions of these approaches.

Denote the i-th observation in our data by (Y_i, X_i). Keep in mind that Y_i and X_i are vectors, of length c and f, respectively. We will assume that the first r observations are labeled but the remainder are not. In other words, Y_i is known for i = 1,...,r but unknown for i = r+1,...,n. Let Y_i^{(j)} denote the value of Y^{(j)} in observation i, i.e. the value of component j in Y_i.

As statistical inference is involved, we will be determining asymptotic distributions of various estimators, as n → ∞.
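To fix ideas, here is a minimal R sketch, offered only as an illustration and not part of the paper's formal development, of a training set of the kind just described: n observations on (X, Y), of which only the first r carry a recorded label. The variable names (n, r, x, y_obs) and the particular logistic parameter values are assumptions of this sketch.

# Simulate a two-class training set in which only the first r of n
# observations have their class label recorded (the rest are NA).
set.seed(1)
n <- 1000                               # total number of observations
r <- 200                                # number of labeled observations
f <- 2                                  # number of features

x <- matrix(rnorm(n * f), ncol = f)     # feature matrix X
theta <- c(0.5, 1.0, -1.0)              # illustrative logistic parameters
lin <- cbind(1, x) %*% theta            # linear predictor, with a constant component
p <- 1 / (1 + exp(-lin))                # P(Y = 1 | X), as in the logistic model (2)
y <- rbinom(n, 1, p)                    # true 0/1 class indicators

y_obs <- y
y_obs[(r + 1):n] <- NA                  # labels withheld for i = r+1, ..., n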
It is assumed that the ratio r/n goes to the limit γ as n → ∞.

The main analysis will be semiparametric. This means that we assume that it has been determined that the conditional class membership probability functions belong to some parametric family g(t, θ), i.e. that

P(Y^{(j)} = 1 | X = t) = g(t, θ_j)    (1)

for some vector-valued parameter θ_j, but we assume no parametric model for the distribution of X.

In the material that follows, the reader may wish to keep in mind the familiar logistic model,

P(Y^{(j)} = 1 | X = t) = \frac{1}{1 + e^{-θ_j^T t}}    (2)

where t includes a constant component and T denotes matrix transpose.

The estimator \hat θ_j for the population value θ_j is obtained from (Y_1, X_1), ..., (Y_r, X_r) via the usual iteratively reweighted least squares method, with weight function

w(t, θ_j) = \frac{1}{Var(Y^{(j)} | X = t)} = \frac{1}{g(t, θ_j)[1 - g(t, θ_j)]}    (3)

That is, the estimator is the root of

\sum_{i=1}^{r} w(X_i, \hat θ_j) \{ Y_i^{(j)} - g(X_i, \hat θ_j) \} g'(X_i, \hat θ_j)    (4)

where g' is the vector of partial derivatives of g with respect to its second argument. Routines to perform the computation are commonly available, such as the glm function in the R statistical package, with the argument family=binomial.

Since the Law of Total Expectation implies that

q_j = EY^{(j)} = E[ E(Y^{(j)} | X) ] = E[ g(X, θ_j) ]    (5)

we take as our new estimator of q_j

\hat q_j = \frac{1}{n} \sum_{i=1}^{n} g(X_i, \hat θ_j)    (6)

This will be compared to the classical estimator based on the labeled data,

\bar Y^{(j)} = \frac{1}{r} \sum_{i=1}^{r} Y_i^{(j)}    (7)

We will assume that g satisfies the necessary smoothness conditions, such as the ones given in [7], and that the relevant means exist and so on.
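For illustration only, the estimator (6) can be computed with a few lines of R in the logistic case (2). The paper itself notes that glm with family=binomial performs the required IRLS computation; the function name estimate_prior and the NA-coded labels are assumptions of this sketch, not part of the paper.

# Estimate the class prior q = P(Y = 1) using both labeled and unlabeled data.
# x: numeric matrix or data frame of features (all n observations)
# y: 0/1 labels, with NA for the unlabeled observations
estimate_prior <- function(x, y) {
  dat <- data.frame(y = y, x)
  labeled <- !is.na(y)

  # Fit the logistic model (2) on the labeled observations only;
  # glm with family = binomial carries out the IRLS computation behind (4).
  fit <- glm(y ~ ., data = dat[labeled, ], family = binomial)

  # Average the fitted probabilities g(X_i, theta-hat) over ALL n
  # observations, labeled and unlabeled, as in (6).
  q_new <- mean(predict(fit, newdata = dat, type = "response"))

  # Classical estimator (7): the labeled-data class proportion.
  q_classical <- mean(y[labeled])

  c(semiparametric = q_new, labeled_only = q_classical)
}

# Example: estimate_prior(x, y_obs) with the simulated data shown earlier.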
Lemma 1
Define

A_j = E\{ w(X, θ_j) \, g'(X, θ_j) \, g'(X, θ_j)^T \}    (8)

and

B_j = E[ g'(X, θ_j) ]    (9)

Then \sqrt{n}\,(\hat q_j - q_j) is asymptotically equivalent to

n^{-1/2} \left[ \sum_{i=1}^{n} \{ g(X_i, θ_j) - q_j \} + γ^{-1} B_j^T A_j^{-1} \sum_{i=1}^{r} w(X_i, θ_j) \{ Y_i^{(j)} - g(X_i, θ_j) \} g'(X_i, θ_j) \right]    (10)

Proof 1
Write

\sqrt{n}\,(\hat q_j - q_j) = n^{-1/2} \sum_{i=1}^{n} [ g(X_i, \hat θ_j) - q_j ]    (11)

Form the Taylor expansion of \sum_{i=1}^{n} g(X_i, \hat θ_j) around the point θ_j. Then there is a \tilde θ_j between θ_j and \hat θ_j such that

\sqrt{n}\,(\hat q_j - q_j) = n^{-1/2} \sum_{i=1}^{n} [ g(X_i, θ_j) - q_j ]    (12)
  + \left[ n^{-1} \sum_{i=1}^{n} g'(X_i, \tilde θ_j) \right]^T \sqrt{n}\,(\hat θ_j - θ_j)    (13)
  = C_n + D_n    (14)

The C_n term converges in distribution, by the Central Limit Theorem. What about D_n? It is well established, for example in [7], that \sqrt{n}\,(\hat θ_j - θ_j) is asymptotically equivalent to

n^{-1/2} γ^{-1} A_j^{-1} \sum_{i=1}^{r} w(X_i, θ_j) \{ Y_i^{(j)} - g(X_i, θ_j) \} g'(X_i, θ_j)    (15)

Meanwhile,

\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} g'(X_i, θ_j) = B_j   w.p. 1    (16)

Thus D_n is asymptotically equivalent to

F_n = n^{-1/2} γ^{-1} B_j^T A_j^{-1} \sum_{i=1}^{r} w(X_i, θ_j) \{ Y_i^{(j)} - g(X_i, θ_j) \} g'(X_i, θ_j)    (17)

by Slutsky's Theorem [16]. So D_n - F_n is o_p(1), and thus the same is true for (C_n + D_n) - (C_n + F_n). The result follows.

We can now determine the asymptotic variance of \hat q_j:

Theorem 1
The asymptotic variance of \hat q_j is

AVar(\hat q_j) = \frac{1}{n} Var[ g(X, θ_j) ] + \frac{1}{r} B_j^T A_j^{-1} B_j    (18)

Proof 2
Dividing (10) by \sqrt{n}, and noting that nγ is asymptotically equal to r, write \hat q_j - q_j as asymptotically equivalent to G_n + H_r, where

G_n = \frac{1}{n} \sum_{i=1}^{n} \{ g(X_i, θ_j) - q_j \}    (19)

and

H_r = \frac{1}{r} B_j^T A_j^{-1} \sum_{i=1}^{r} w(X_i, θ_j) \{ Y_i^{(j)} - g(X_i, θ_j) \} g'(X_i, θ_j)    (20)

Let us first establish that G_n and H_r are uncorrelated. To show this, consider the expected value of the product of a typical term g(X, θ_j) from G_n and a typical term w(X, θ_j)\{Y^{(j)} - g(X, θ_j)\} g'(X, θ_j) from H_r. Again applying the Law of Total Expectation, write

E[ g(X, θ_j) \cdot w(X, θ_j) \{ Y^{(j)} - g(X, θ_j) \} g'(X, θ_j) ]
  = E[ E( g(X, θ_j) w(X, θ_j) \{ Y^{(j)} - g(X, θ_j) \} g'(X, θ_j) \mid X ) ]
  = E[ g(X, θ_j) w(X, θ_j) E( Y^{(j)} - g(X, θ_j) \mid X ) g'(X, θ_j) ]
  = 0    (21)

from (1).

For a random vector W, let Cov(W) denote the covariance matrix of W. The Law of Total Expectation can be used to derive the relation

Cov(R) = E[ Cov(R | S) ] + Cov[ E(R | S) ]    (22)

for random vectors R and S. Apply this with S = X and with R equal to a typical summand of H_r,

R = \frac{1}{r} B_j^T A_j^{-1} w(X, θ_j) \{ Y^{(j)} - g(X, θ_j) \} g'(X, θ_j)    (23)

Then E(R | S) = 0. Also,

Var(R | S) = \frac{1}{r^2} B_j^T A_j^{-1} w(X, θ_j)^2 [ Var(Y^{(j)} | X) ] g'(X, θ_j) g'(X, θ_j)^T A_j^{-1} B_j    (24)
  = \frac{1}{r^2} B_j^T A_j^{-1} w(X, θ_j) g'(X, θ_j) g'(X, θ_j)^T A_j^{-1} B_j    (25)

from (3). Since the r summands of H_r are independent, this yields

Var(H_r) = r \, E[ Var(R | S) ] = \frac{1}{r} B_j^T A_j^{-1} B_j    (27)

from (8). Combining this with Var(G_n) = \frac{1}{n} Var[ g(X, θ_j) ], we see that the asymptotic variance of \hat q_j is as given in (18).

Statistical inference is then enabled, with \hat q_j being approximately normally distributed. The standard error is obtained as the square root of the estimated version of the quantities in (18), taking for instance

\widehat{Var}[ g(X, θ_j) ] = \frac{1}{n} \sum_{i=1}^{n} [ g(X_i, \hat θ_j) - \hat q_j ]^2    (28)

An approximate 100(1-α)% confidence interval for q_j is then obtained by adding and subtracting Φ^{-1}(1 - α/2) times the standard error, where Φ is the standard normal distribution function.
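As an illustration, one natural way to compute the plug-in standard error implied by (18) and (28) in R for the logistic case (2) is sketched below. Here A_hat and B_hat are the obvious sample analogs of (8) and (9) (for logistic g, g'(t, θ) = g(1 - g) t and w = 1/[g(1 - g)], so w g' g'^T = g(1 - g) t t^T); the function name and the assumption of purely numeric features are conventions of this sketch rather than of the paper.

# Point estimate, standard error and confidence interval for q_j under (2).
# x: numeric feature matrix (n rows), y: 0/1 labels with NA for unlabeled rows.
prior_ci <- function(x, y, alpha = 0.05) {
  dat <- data.frame(y = y, x)
  labeled <- !is.na(y)
  r <- sum(labeled); n <- length(y)

  fit <- glm(y ~ ., data = dat[labeled, ], family = binomial)
  g_all <- predict(fit, newdata = dat, type = "response")   # g(X_i, theta-hat)
  q_hat <- mean(g_all)                                       # estimator (6)

  tmat <- cbind(1, as.matrix(x))      # design rows t, including constant component
  gg <- g_all * (1 - g_all)

  # Sample versions of A_j (8), from the labeled data, and B_j (9), from all data.
  A_hat <- crossprod(sqrt(gg[labeled]) * tmat[labeled, , drop = FALSE]) / r
  B_hat <- colMeans(gg * tmat)

  # Estimated asymptotic variance (18), using (28) for the first term.
  var_hat <- mean((g_all - q_hat)^2) / n +
             as.numeric(t(B_hat) %*% solve(A_hat, B_hat)) / r
  se <- sqrt(var_hat)

  c(estimate = q_hat,
    lower = q_hat - qnorm(1 - alpha / 2) * se,
    upper = q_hat + qnorm(1 - alpha / 2) * se)
}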
Corollary 1

The asymptotic variance of the estimator based on both the labeled and unlabeled data, \hat q_j, is less than or equal to that of the estimator based only on labeled data, \bar Y^{(j)}. The inequality is strict as long as the random variable g(X, θ_j) is not constant.

Proof 3
If we were to use only the labeled data, the asymptotic variance of our estimator of q_j would be the result of setting n = r in (18), i.e.

\frac{1}{r} \left[ Var\{ g(X, θ_j) \} + B_j^T A_j^{-1} B_j \right]    (29)

Comparing (29) and (18), we see that use of the unlabeled data does indeed yield a reduction in asymptotic variance, by the amount of

\left( \frac{1}{r} - \frac{1}{n} \right) Var\{ g(X, θ_j) \}    (30)

It was proven in [9] that even with no unlabeled data, the asymptotic variance of (6) is less than or equal to that of \bar Y^{(j)}. (This result was later sharpened and extended by [10].) The claimed relation then follows.

Clearly, the stronger the relationship between X and Y, the greater the potential improvement that can accrue from use of unlabeled data. This raises the question of how to assess that potential improvement.

One approach would be to use a measure of the strength of the relationship between X and Y itself. A number of such measures have been proposed, such as those described in [11]. Many aim to act as an analog of the R^2 value traditionally used in linear regression analysis. Here we suggest what appears to be a new measure, whose form more directly evokes the spirit of R^2.

Consider first random variables U and V, with V being continuous. If we are predicting V from U, the population version of R^2 is

ρ^2 = \frac{ E[ (V - EV)^2 ] - E[ \{ V - E(V | U) \}^2 ] }{ E[ (V - EV)^2 ] }    (31)

This is the proportional reduction in mean squared prediction error attained by using U, versus not using it. (The quantity ρ can be shown to be the correlation between V and E(V | U).)

The analog of this quantity in the case of binary V could be defined to be the proportional reduction in classification error. To determine this, note that without U, we always predict V to be 1 if EV ≥ 0.5; otherwise our prediction is always 0. The probability of misclassification is then min(EV, 1 - EV). Similarly, the conditional probability of misclassification, given U, is min[E(V | U), 1 - E(V | U)], and the unconditional probability, still using U, is the expected value of that quantity. Accordingly, we take our binary-V analog of R^2 to be

η = \frac{ \min(EV, 1 - EV) - E[ \min\{ E(V | U), 1 - E(V | U) \} ] }{ \min(EV, 1 - EV) }    (32)

Applying Jensen's Inequality to the convex function φ(t) = -\min(t, 1 - t), and using EV = E[E(V | U)], we have that

E[ \min\{ E(V | U), 1 - E(V | U) \} ] \le \min(EV, 1 - EV)    (33)

Thus 0 ≤ η ≤ 1, as with R^2. One could use η as a means of assessing whether to make use of unlabeled data.

Another approach, motivated by (30), would be to use

σ_j^2 = Var[ g(X, θ_j) ]    (34)

Note that in the extreme case σ_j^2 = 0, g(·) is a constant and EV = E(V | U), so unlabeled data would be useless. Note too that since 0 ≤ g ≤ 1, this measure σ_j^2 is in some sense scale-free.
Accordingly, we might base our decision as to whether to use unlabeled data on (28), making use of the unlabeled data if this quantity is sufficiently far from 0.
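Both diagnostics can be estimated directly from the fitted class probabilities. The R sketch below is illustrative only: it simply plugs the fitted values g(X_i, \hat θ_j) into sample versions of (32) and (34), and the function and object names are assumptions of this sketch.

# Sample versions of the measures eta (32) and sigma_j^2 (34),
# computed from fitted class-1 probabilities g(X_i, theta-hat).
improvement_measures <- function(x, y) {
  dat <- data.frame(y = y, x)
  labeled <- !is.na(y)
  fit <- glm(y ~ ., data = dat[labeled, ], family = binomial)
  g_all <- predict(fit, newdata = dat, type = "response")

  q_hat <- mean(g_all)
  # Proportional reduction in misclassification probability, as in (32).
  eta_hat <- (min(q_hat, 1 - q_hat) - mean(pmin(g_all, 1 - g_all))) /
             min(q_hat, 1 - q_hat)
  # Variance of the fitted probabilities, as in (34) and (28).
  sigma2_hat <- mean((g_all - q_hat)^2)

  c(eta = eta_hat, sigma2 = sigma2_hat)
}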
Rather than merely presenting some examples of our methodology's use on real data, we aim here to assess the methodology's performance on real data. The bootstrap will be used for this purpose. (The reader should note that this is intended just as a means of assessing the general value of the methodology, rather than as part of the methodology itself.)

Suppose we wish to study the efficiency of estimating some population value ν using \hat ν_s, an asymptotically unbiased function of a random sample of size s from the population. We are interested in studying the efficiency of \hat ν_s for various sample sizes s, as measured by mean squared error

MSE(\hat ν_s) = E[ (\hat ν_s - ν)^2 ]    (35)

Now suppose we have an actual sample of size h from the population. We take this as the “population,” and then draw m subsamples of size s. In the i-th of these subsamples, we calculate \hat ν_s, denoting its value by \hat ν_{si}. The point is that the empirical c.d.f. of the \hat ν_{si}, i = 1,...,m, is an approximation to the true c.d.f. of \hat ν_s. Moreover,

\widehat{MSE}(\hat ν_s) = \frac{1}{m} \sum_{i=1}^{m} ( \hat ν_{si} - \hat ν_h )^2    (36)

is an estimate of the true mean squared estimation error, (35).

This bootstrap framework was used to study the efficiency of our classification probability estimator \hat q_j. On each of three real data sets, the ratio was computed of the estimated MSE for \hat q_j to that resulting from merely using \bar Y^{(j)} and the labeled data.

Figure 1 below shows the estimated MSE ratios for the Pima Indians diabetes data in the UCI Machine Learning Repository [14]. (The figures here have been smoothed using the R lowess() function.) Here we are predicting whether a woman will develop diabetes, using glucose and BMI as features. The methodology developed here brings as much as a 22 percent improvement.

As expected, the improvement is better for larger values of n-r. In other words, for each fixed number r of labeled observations, the larger the number n-r of unlabeled observations, the better we do. On the other hand, for each fixed number of unlabeled observations, the smaller the number of labeled observations, the better (since n-r is then a larger proportion of r).

Figure 2 shows similar behavior for another famous UCI data set, on abalones. We are predicting whether the animal has less than or equal to nine rings, based on length and diameter.

As pointed out in Section 4, our method here may produce little or no improvement for some data sets. Here is an example, from the 2000 census data [15]. The data consist of records on engineers and programmers in Silicon Valley, and we are using age and income to predict whether the worker has a graduate degree. The results, seen in Figure 3, show that our methodology produces essentially no improvement in this case.

Now let us examine the two possible measures proposed in Section 4, in the context of the above three data sets. Table 1 shows the results, which are rather striking.

Table 1: Comparison of measures proposed in Section 4

Data set    \hat η_j    \hat σ_j^2
Pima        0.2403      0.0617
Abalone     0.2621      0.0740
PUMS        0.2177      0.0063

Even though the classification-power measure η was approximately equal across the three data sets, the value of σ_j^2 was an order of magnitude smaller for PUMS than for the other two data sets. The latter pattern exactly reflects the pattern we saw in the MSE values in the figures.
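To make the subsampling scheme of (36) concrete, the following R sketch shows how the MSE ratio plotted in the figures could be estimated for one (r, n) combination. It is an illustration under assumed argument names (x, y01, r, n, m), not the exact code used to produce the figures.

# Estimate the ratio of the MSE of the semiparametric estimator (6) to that of
# the labeled-only estimator (7), by repeated subsampling from a full data set.
# x: feature matrix for the full data set (taken as the "population"),
# y01: 0/1 labels for the full data set, r: number kept labeled, n: subsample size.
mse_ratio <- function(x, y01, r, n, m = 200) {
  q_full <- mean(y01)                           # "population" value of q
  err_new <- err_old <- numeric(m)
  for (i in 1:m) {
    idx <- sample(nrow(x), n)                   # subsample of size n
    dat <- data.frame(y = y01[idx], x[idx, , drop = FALSE])
    lab <- 1:r                                  # only the first r labels are used
    fit <- glm(y ~ ., data = dat[lab, ], family = binomial)
    q_new <- mean(predict(fit, newdata = dat, type = "response"))   # estimator (6)
    q_old <- mean(dat$y[lab])                                       # estimator (7)
    err_new[i] <- (q_new - q_full)^2
    err_old[i] <- (q_old - q_full)^2
  }
  mean(err_new) / mean(err_old)                 # estimated MSE ratio, cf. (36)
}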
In other words, classification power by itself does not suffice to indicate whether our method yields an improvement. It is also necessary that the feature vector X have a substantial amount of variation. The measure σ_j^2 provides a direct assessment of the potential for improvement.

We now turn to the estimation of subclass probabilities. Here the focus will be on quantities of the form

q_{j,W} = P( Y^{(j)} = 1 \mid X \in W )    (37)

for various sets W of interest. It was mentioned earlier, for instance, that one may wish to compare the proportions of public funding for subcategories of patents, say computers and optics. Here we outline how the methodology developed earlier in this paper can be extended to this situation, focusing on the semiparametric case.

Write (37) as

q_{j,W} = \frac{ E[ Y^{(j)} I_W(X) ] }{ P(X \in W) }    (38)

where I_W is the indicator function for membership in W. By analogy with (6), then, our estimator for q_{j,W} is

\hat q_{j,W} = \frac{ \sum_{i=1}^{n} g(X_i, \hat θ_j) I_W(X_i) }{ \sum_{i=1}^{n} I_W(X_i) }    (39)

The “ordinary” estimator, based only on labeled Y values, is

\bar Y^{(j,W)} = \frac{ r^{-1} \sum_{i=1}^{r} I_W(X_i) Y_i^{(j)} }{ n^{-1} \sum_{i=1}^{n} I_W(X_i) }    (40)
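As an illustrative sketch only, the subclass estimator (39) differs from (6) merely by restricting the average of fitted probabilities to observations falling in W. In the R fragment below, in_w is an assumed logical vector indicating whether X_i lies in W, and the example subcategory column is hypothetical.

# Estimator (39) for a subclass probability q_{j,W}.
# x: feature matrix, y: 0/1 labels with NA for unlabeled rows,
# in_w: logical vector, TRUE when X_i lies in the set W of interest.
estimate_subclass_prior <- function(x, y, in_w) {
  dat <- data.frame(y = y, x)
  labeled <- !is.na(y)
  fit <- glm(y ~ ., data = dat[labeled, ], family = binomial)
  g_all <- predict(fit, newdata = dat, type = "response")

  sum(g_all[in_w]) / sum(in_w)    # average fitted probability over X in W
}

# Example (hypothetical column "field" coding patent subcategories):
#   q_comp <- estimate_subclass_prior(x, y, in_w = (x[, "field"] == "computers"))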
Theorem 2

The asymptotic variance of \hat q_{j,W} is

AVar(\hat q_{j,W}) = \frac{1}{P(X \in W)^2} \left[ \frac{1}{n} Var[ g(X, θ_j) I_W(X) ] + \frac{1}{r} C_j^T A_j^{-1} C_j \right]    (41)

where C_j is defined in (47) below.

Proof 4
Since

\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} I_W(X_i) = P(X \in W)   w.p. 1    (42)

we can again apply Slutsky's Theorem and derive the asymptotic distribution of \hat q_{j,W}. Toward that end, define

v_{j,W} = E[ Y^{(j)} I_W(X) ]    (43)

and concentrate for the moment on estimating v_{j,W}, using

\hat v_{j,W} = \frac{1}{n} \sum_{i=1}^{n} g(X_i, \hat θ_j) I_W(X_i)    (44)

The analog of (12) is then

\sqrt{n}\,(\hat v_{j,W} - v_{j,W}) = n^{-1/2} \sum_{i=1}^{n} [ g(X_i, θ_j) I_W(X_i) - v_{j,W} ]
  + \left[ n^{-1} \sum_{i=1}^{n} g'(X_i, \tilde θ_j) I_W(X_i) \right]^T \sqrt{n}\,(\hat θ_j - θ_j)    (45)

for some \tilde θ_j between θ_j and \hat θ_j.

Proceeding as in (10), we have that \sqrt{n}\,(\hat v_{j,W} - v_{j,W}) is asymptotically equivalent to

n^{-1/2} \left[ \sum_{i=1}^{n} \{ g(X_i, θ_j) I_W(X_i) - v_{j,W} \} + γ^{-1} C_j^T A_j^{-1} \sum_{i=1}^{r} w(X_i, θ_j) \{ Y_i^{(j)} - g(X_i, θ_j) \} g'(X_i, θ_j) \right]    (46)

where

C_j = E[ g'(X, θ_j) I_W(X) ]    (47)

Thus \sqrt{n}\,(\hat q_{j,W} - q_{j,W}) is asymptotically equivalent to

\frac{n^{-1/2}}{P(X \in W)} \left[ \sum_{i=1}^{n} \{ g(X_i, θ_j) I_W(X_i) - v_{j,W} \} + γ^{-1} C_j^T A_j^{-1} \sum_{i=1}^{r} w(X_i, θ_j) \{ Y_i^{(j)} - g(X_i, θ_j) \} g'(X_i, θ_j) \right]    (48)

Continuing as before, the desired result follows:

AVar(\hat q_{j,W}) = \frac{1}{P(X \in W)^2} \left[ \frac{1}{n} Var[ g(X, θ_j) I_W(X) ] + \frac{1}{r} C_j^T A_j^{-1} C_j \right]    (49)

Corollary 2
The asymptotic variance of the estimator based on both the labeled and unlabeled data, \hat q_{j,W}, is less than or equal to that of the estimator based only on labeled data, \bar Y^{(j,W)}. The inequality is strict as long as the random variable I_W(X) g(X, θ_j) is not constant.

Proof 5
Define h(t, θ) = g(t, θ) I_W(t). Theorem 1 and Corollary 1, applied to h and I_W(X) Y^{(j)} rather than to g and Y^{(j)}, show that

\frac{1}{n} \sum_{i=1}^{n} g(X_i, \hat θ_j) I_W(X_i)    (50)

has asymptotic variance less than or equal to that of

\frac{1}{r} \sum_{i=1}^{r} I_W(X_i) Y_i^{(j)}    (51)

The result then follows by applying Slutsky's Theorem to the denominators in (39) and (40), and noting (42).
Suppose X is discrete, taking on b vector values v_1, ..., v_b. We assume here that our labeled data is extensive enough that for each k there is some i ≤ r with X_i = v_k. Then we can find direct estimates of the q_j without resorting to using a parametric model or smoothing methods such as nearest-neighbor.

We can write the prior probabilities as

q_j = \sum_{k=1}^{b} p_k d_{jk}    (52)

where p_k = P(X = v_k) and d_{jk} = P(Y^{(j)} = 1 | X = v_k).

Let U_{ik} be the indicator variable for the event X_i = v_k, for i = 1,...,n and k = 1,...,b. Then denote the counts of labeled and unlabeled observations taking the value v_k by

M_k = \sum_{i=1}^{r} U_{ik}    (53)

and

N_k = \sum_{i=r+1}^{n} U_{ik}    (54)

Also define the count of class-j observations among the labeled data having feature value v_k:

T_{jk} = \sum_{i=1}^{r} U_{ik} Y_i^{(j)}    (55)

We can estimate the quantities d_{jk} and p_k by

\hat d_{jk} = \frac{ T_{jk} }{ M_k }    (56)

and

\hat p_k = \frac{ M_k + N_k }{ n }    (57)

Our estimates of the class prior probabilities are then, in analogy to (52),

\hat q_j = \sum_{k=1}^{b} \hat p_k \hat d_{jk}    (58)

Note that if we do not use unlabeled data, this estimate reduces to the standard estimate of q_j = EY^{(j)}, the class sample proportion \bar Y^{(j)} in (7).

The quantities M_k, N_k and T_{jk} have an asymptotically multivariate normal distribution. Thus, by the multivariate version of Slutsky's Theorem, (58) is asymptotically equivalent to

\sum_{k=1}^{b} d_{jk} \hat p_k    (59)

This is a key point in addressing the question as to whether the unlabeled data provide our estimators with smaller asymptotic variance. Since the only randomness in (59) is in the \hat p_k, the larger the sample used to calculate those quantities, the smaller the variance of (59) will be. Using the unlabeled data provides us with that larger sample, and thus (59) will be superior to (7).

Stating this more precisely, the variance of (59) is

\frac{1}{n} \left( \sum_{k=1}^{b} d_{jk}^2 \, p_k (1 - p_k) - 2 \sum_{k=1}^{b} \sum_{s=k+1}^{b} d_{jk} d_{js} \, p_k p_s \right)    (60)

Without the unlabeled data, this expression would be the same but with a factor of 1/r instead of 1/n. So, again, it is demonstrated that use of the unlabeled data is advantageous.

To form confidence intervals and hypothesis tests from (58), we obtain a standard error by substituting \hat d_{jk} and \hat p_k in (60), and taking the square root.
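A minimal R sketch of the purely discrete estimator (56)-(58) follows; it is illustrative only, with assumed argument names (xcat is a vector or factor giving the cell v_k of each observation, and y again has NA for unlabeled rows).

# Nonparametric estimator (58) of q = P(Y = 1) when X is purely discrete.
# xcat: vector (or factor) giving the cell v_k of each observation,
# y: 0/1 labels, NA for unlabeled observations.
estimate_prior_discrete <- function(xcat, y) {
  xcat <- factor(xcat)
  labeled <- !is.na(y)
  n <- length(y)

  p_hat <- table(xcat) / n                          # (57): cell proportions, ALL data
  d_hat <- tapply(y[labeled], xcat[labeled], mean)  # (56): class proportion per cell,
                                                    #       labeled data only
  # (58): combine; assumes every cell contains at least one labeled observation.
  sum(p_hat[names(d_hat)] * d_hat)
}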
This paper has developed methodology for estimating class prior probabilities in situations in which class membership is difficult or expensive to determine. Asymptotic distributions of the estimators were obtained, enabling users to form confidence intervals and perform hypothesis tests. It was proven that use of unlabeled data does bring an improvement in asymptotic variance, compared to using only the labeled data.

In the parametric cases, the usual considerations for model fitting hold. One should first assess the goodness of fit of the model. Though many formal test procedures exist, with the large samples often encountered in classification problems it may be preferable to use informal assessment. One can, for instance, estimate the function E(Y^{(j)} | X = t) nonparametrically, using say kernel or nearest-neighbor methods, and then compare the estimates to those resulting from a parametric fit.

The paper first built on the semiparametric framework of [9], and then also derived a fully nonparametric estimator for the case of purely discrete features. The latter methodology requires that the training set be large enough that there is at least one observation for every possible value of X. If there are many such values, or if the training set is not large enough, one might turn to smoothing methods, such as kernel-based approaches. Another possibility in the purely discrete case is to use the log-linear model [1], which is fully parametric.

One might pursue possible further improvement by using the unlabeled data not just to attain a better estimate for the marginal distribution of X, but also to enhance the accuracy of the estimate of the conditional distribution of Y given X. As noted in Section 1, much work has been done on this problem in the context of estimating classification functions. It would be of interest to investigate the asymptotic behavior in our present context of estimation of class prior probabilities. It would also be useful to investigate whether the methods of Section 4 can be applied to the classification function problem.

References
[1] A. Agresti. Categorical Data Analysis. Wiley, 2002.

[2] J. Blackard and D. Dean. Comparative Accuracies of Artificial Neural Networks and Discriminant Analysis in Predicting Forest Cover Types from Cartographic Variables. Computers and Electronics in Agriculture, 1999, 24, 131-151.

[3] H. Bunke and O. Bunke (eds.). Statistical Inference in Linear Models. Statistical Methods of Model Building, Vol. I. Wiley, Chichester, 1986.

[4] J. Chen, J. Fan, K. Li and H. Zhou. Local Quasi-Likelihood Estimation with Data Missing at Random. Statistica Sinica, 2006, 16, 1071-1100.

[5] A. Davison and D. Hinkley. Bootstrap Methods and Their Application. Cambridge University Press, 1997.

[6] D. Hunter. Asymptotic Tools, 2002.

[7] R.I. Jennrich. Asymptotic Properties of Non-Linear Least Squares Estimators. Ann. Math. Statist., 1969, 40, 633-643.

[8] W.Z. Liu, A.P. White, S.G. Thompson and M.A. Bramer. Techniques for Dealing with Missing Values in Classification. Lecture Notes in Computer Science, Springer, 1997.

[9] N. Matloff. Use of Regression Functions for Improved Estimation of Means. Biometrika, 1981, 68(3), 685-689.

[10] U. Muller, A. Schick and W. Wefelmeyer. Imputing Responses That Are Not Missing. Probability, Statistics and Modelling in Public Health (M. Nikulin, D. Commenges and C. Huber, eds.), 2006, 350-363, Springer.

[11] N.J.D. Nagelkerke. A Note on a General Definition of the Coefficient of Determination. Biometrika, 1991, 78(3), 691-692.

[12] F. Narin, K. Hamilton and D. Olivastro. The Increasing Linkage Between US Technology and Public Science. Research Policy, 1997, 26(3), 317-330.

[13] K. Nigam, A. McCallum, S. Thrun and T. Mitchell. Text Classification from Labeled and Unlabeled Documents Using EM. Machine Learning, 2000, 39(2/3), 103-134.

[14] University of California, Irvine. UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/datasets.html.

[15] U.S. Census Bureau. Public-Use Microdata Samples (PUMS).

[16] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2004.

[Figure 1: Pima MSEs. Estimated MSE ratio versus n-r, for r = 50, 100, 150.]
[Figure 2: Abalone MSEs. Estimated MSE ratio versus n-r, for r = 100, 200, 300, 400, 500.]
[Figure 3: PUMS MSEs. Estimated MSE ratio versus n-r, for r = 500, 1000, 1500.]