Surprise sampling: improving and extending the local case-control sampling
Xinwei Shen, Kani Chen
Department of Mathematics, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong

Wen Yu*
Department of Statistics, School of Management, Fudan University, Shanghai 200433, P.R. China

*Corresponding author: [email protected]

Abstract

Fithian and Hastie (2014) proposed a new sampling scheme called local case-control (LCC) sampling that achieves stability and efficiency by utilizing a clever adjustment tailored to the logistic model. It is particularly useful for classification with large and imbalanced data. This paper proposes a more general sampling scheme based on a working principle that data points deserve higher sampling probability if they contain more information or appear "surprising" in the sense of, for example, a large pilot prediction error or a large absolute score. Compared with the relevant existing sampling schemes, as reported in Fithian and Hastie (2014) and Ai et al. (2018), the proposed one has several advantages. It adaptively yields the optimal form for a variety of objectives, including the LCC sampling and Ai et al. (2018)'s sampling as special cases. Under the same model specifications, the proposed estimator performs no worse than those in the literature. The estimation procedure is valid even if the model is misspecified and/or the pilot estimator is inconsistent or dependent on the full data. We present theoretical justifications of the claimed advantages and of the optimality of the estimation and the sampling design. Different from Ai et al. (2018), our large sample theory is population-wise rather than data-wise. Moreover, the proposed approach can be applied to unsupervised learning studies, since it essentially only requires a specific loss function and no response-covariate structure of the data is needed. Numerical studies are carried out and show evidence in support of the theory.
Some key words: Generalized linear models, Horvitz-Thompson estimator, Local case-control sampling, Model mis-specification, Subsampling.
1 Introduction
Nowadays, with the rapid development of data capturing and storage techniques, people encounter huge amounts of data in various fields. With the growth of the data size, computational capacity becomes more and more crucial for efficient data analysis. Despite significant progress in computer hardware, computational ability may still become a major constraint when the data size is sufficiently large. For instance, it might be very time and resource consuming to try a variety of competing models instead of only fitting one or two predetermined ones, to refit the model from time to time as new observations arrive continuously, or to apply data partition and reuse techniques such as cross-validation, bootstrapping, bagging, and so on. All these computationally intensive procedures require tremendous computational cost, especially for large data sets.

A simple approach to reducing the computational cost is to draw a subsample from the full data set and then analyze the subsample. The easiest subsampling design draws the data uniformly. However, uniform subsampling can be very inefficient for some data structures. For instance, in classification problems with two classes (one class of positive examples called "cases" and the other of negative examples called "controls"), when the classes are imbalanced, that is, one of the classes (usually the class of cases) is rare and the other is dominant, uniform subsampling is unfavorable because it ignores the unequal importance of the data points. For imbalanced data, case-control sampling is a well-known subsampling design (Mantel and Haenszel, 1959; Miettinen, 1976; Breslow, Day et al., 1980). It draws uniform subsamples from each of the two classes with different sampling percentages. Often comparable numbers of cases and controls are sampled, yielding a subsample with no obvious imbalance, to increase the efficiency of estimation. Anderson (1972) and Prentice and Pyke (1979) showed that by fitting a logistic model on the case-control subsample and then making a simple adjustment, one can still get a consistent estimator of the regression parameters if the logistic model is correctly specified. Thus, the case-control design can help reduce the computational burden and retain satisfactory estimation efficiency under the imbalanced structure. Besides classification, imbalanced data structures also appear for other data types. A survey on predictive modeling under various imbalanced distributions is provided by Branco et al. (2015).

Subsampling designs have also received attention in epidemiological cohort studies. The cohorts to follow up usually involve a great number of subjects, and the occurrence of a certain disease is of interest. In most cases, the occurrence rate of the disease is low during the entire follow-up time, so the cohort is essentially imbalanced. Meanwhile, the collection of covariate information on all involved subjects can be very expensive and time consuming because of the large cohort size. Hence, subsampling designs were developed to save sampling cost and time. Widely studied ones include the case-cohort design (Prentice, 1986) and the nested case-control design (Thomas, 1977). The case-control design can also be used in cohort studies. More generalized case-cohort designs aiming to improve the estimation efficiency were developed later by Chen and Lo (1999), Chen (2001), and, more recently, Yao et al. (2017).
The sampling probabilities of all these subsampling designs depend on the observed follow-up times and the censoring indicators, but not on the covariates. These designs are usually used for covariate ascertainment, that is, the covariates are observed only when the subject is selected into the subsample.

Statistical models are usually imposed for data analysis, but in real applications the models are easily misspecified. For some machine learning approaches, models are just used to derive meaningful loss functions, without concern for whether they are correctly specified. Under these circumstances, the target parameter becomes a certain population risk minimizer corresponding to the loss function used (Huber, 2011). When the model is possibly misspecified, robust analysis procedures are often preferred. However, as Fithian and Hastie (2014) pointed out, in classification problems, when the logistic model is misspecified, standard case-control sampling cannot provide a consistent estimator of the target parameter. To overcome this drawback, Fithian and Hastie (2014) proposed the local case-control (LCC) sampling design, which depends not only on the class label, but also on the predictors and a pilot estimate (a pilot estimate is a "good" guess of the target parameter). Their design provides a clever way to remedy imbalance locally throughout the feature space and possesses potential robustness against model mis-specification. Moreover, they proposed an elegant LCC estimate mimicking the full sample maximum likelihood estimate (MLE). They showed that when the logistic model is correctly specified and the pilot is consistent and independent of the full data, the asymptotic variance of their proposed estimate is twice the variance of the full data MLE; when the logistic model is misspecified, the proposed estimate is still consistent and asymptotically normal as long as the pilot is consistent and independent of the data.

The LCC design aims to simultaneously speed up computation and provide a simple procedure for obtaining a good estimate even under model mis-specification, but Fithian and Hastie (2014) did not discuss the optimality of the design. More recently, Wang et al. (2018) modified the LCC design to obtain an optimal subsampling method that minimizes the conditional asymptotic mean squared error (MSE) of the subsample estimator given the full data. Later on, Ai et al. (2018) extended the method to generalized linear models with canonical links. However, they only considered the optimization criterion regarding the conditional MSE. Some other non-uniform subsampling designs for the linear model include Ma et al. (2015) and Wang et al. (2018).

We propose an improved subsampling design which accommodates various types of statistical learning objectives and includes the LCC sampling and Ai et al. (2018)'s sampling as special cases. For estimating the target parameter based on the subsample, we apply Horvitz-Thompson (HT) type estimation (Horvitz and Thompson, 1952). The new sampling design is derived by optimizing certain well-defined criteria, such as prediction accuracy, estimation accuracy, or MSE. For different criteria, the proposed design adapts its form accordingly. Basically, it draws a data point with a large pilot prediction error or a large score into the subsample with higher probability.
Given a pilot estimate, such a data point is in a certain sense unusual, or "surprising", for that pilot, so we call the proposed design the surprise sampling design. The advantages of the proposed surprise sampling are summarized as follows.

(i) The surprise sampling design is derived from well-defined objectives. For a specific objective, the corresponding design is optimal. Meanwhile, the proposed design flexibly adapts to varying objectives. In contrast, the objective of the LCC design is not clearly defined and its optimality is not discussed in Fithian and Hastie (2014). In Wang et al. (2018) and Ai et al. (2018), only the objective of minimizing the conditional MSE is considered.

(ii) The proposed estimators are always consistent and asymptotically normal, regardless of the correctness of the model specification and the consistency of the pilot estimate. The consistency of Fithian and Hastie (2014)'s estimator does not require the logistic model specification to be correct, but needs the consistency of the pilot. The pilots used by Wang et al. (2018) and Ai et al. (2018) are also consistent.

(iii) If the pilot estimate is consistent and the logistic model is correctly specified, the proposed estimator is no worse than Fithian and Hastie (2014)'s estimator in the sense that they have the same asymptotic efficiency under the LCC sampling.

(iv) The validity of the proposed estimation procedure does not require the pilot estimate to be independent of the full data, while Fithian and Hastie (2014)'s estimator requires this independence. The relaxation is useful in applications, especially when there is no other data source and one needs to get the pilot from the full data.

(v) The large sample properties derived for the proposed estimators are population-wise (unconditional), while the parallel large sample properties in Wang et al. (2018) and Ai et al. (2018) are developed data-wise (conditional on the full data).

(vi) The proposed approach applies not only to supervised learning such as classification and regression but also to unsupervised learning tasks, because it essentially only requires a well-defined loss function with a finite-dimensional parameter. Hence the scope of the approach goes beyond the logistic model and other generalized linear models.

We proceed as follows. Section 2 introduces the notation and describes the problem setting. In Section 3, we present the main idea of the proposed surprise sampling, and the specific forms of the sampling design are derived to reach various objectives, such as best prediction accuracy and estimation accuracy. Section 4 gives out the large sample properties of the HT type estimator under the surprise sampling design. In Section 5, extensive simulation studies are carried out to show the effectiveness of the proposed approach. The results of real data analysis are provided in Section 6. Section 7 concludes. All the technical details are summarized in the Appendix.
2 Notation and problem setting
Suppose the full data consist of $n$ subjects and let $d_i$ be the observed data point for the $i$-th subject. We assume that $d_i$, $i = 1, \ldots, n$, are independent and identically distributed (i.i.d.) copies of a random element $D$ whose distribution stands for the population. For supervised learning problems, $D$ can be decomposed into a response, denoted by $Y$, and a $q$-dimensional predictor or covariate vector, denoted by $X$. Correspondingly, the observed data points $d_i = (y_i, x_i)$, $i = 1, \ldots, n$, are i.i.d. copies of $D = (Y, X)$. We aim to predict $Y$ through $X$, or to learn the regression function $f(x) = E(Y \mid X = x)$, based on the observed data. Usually, a model is imposed on $f(x)$, characterized by a $p$-dimensional parameter denoted by $\theta$ throughout the paper, and a corresponding loss function is minimized to obtain an estimate of $\theta$. Write the loss function as $l(d; \theta)$, where $d = (y, x)$. The loss function can take the form of squared loss, negative log-likelihood, hinge loss, Huber loss, etc. Some classical choices include logistic regression with $l(d; \theta) = -y(\alpha + \beta^\top x) + \log(1 + \exp(\alpha + \beta^\top x))$ for a binary response, where $\theta = (\alpha, \beta^\top)^\top$; the Poisson log-linear model with $l(d; \theta) = -y(\alpha + \beta^\top x) + \exp(\alpha + \beta^\top x)$ for a counting response; the linear model with squared loss $l(d; \theta) = (y - \alpha - \beta^\top x)^2/\sigma^2$ for a continuous response, where $\theta = (\alpha, \beta^\top, \sigma)^\top$; and also nonlinear models such as neural networks with squared loss or cross-entropy loss. For unsupervised learning tasks, there is no response-covariate structure of the data. Typically, we derive a loss function $l(d; \theta)$ based on a certain purpose and aim at minimizing the loss with respect to $\theta$. For instance, in the geometric view of principal component analysis (PCA), the goal is to find a $k$-dimensional affine space in $\mathbb{R}^q$ that best approximates the $n$ examples in terms of Euclidean distance. Parameterizing the affine space by $\alpha + U\beta$, where $U$ consists of $k$ columns of an orthogonal basis of the space, we end up with the loss function $l(d; \theta) = \|d - (\alpha + U\beta)\|^2$, where $\theta = \{\alpha, U, \beta\}$ and $\|\cdot\|$ stands for the Euclidean norm.

The target parameter that we aim to estimate is the so-called population risk minimizer, which minimizes the population risk $R(\theta) = E[l(D; \theta)]$, that is, $\theta^* = \arg\min_\theta R(\theta)$. In supervised learning, when the regression model is correctly specified, that is, there exists some $\theta_0$ satisfying $f(x) = f_{\theta_0}(x)$, it is easy to see that $\theta_0 = \theta^*$. However, as we mention in Section 1, in many real applications $f(x)$ does not satisfy the specified model; this is the so-called model mis-specification. Meanwhile, in some unsupervised learning tasks such as PCA, there is no imposed model. It is well known that whether the model specification holds or not, $\theta^*$ can be well defined under general conditions. The full sample version of the risk minimizer, denoted by $\hat\theta^*$, is given by $\hat\theta^* = \arg\min_\theta \sum_{i=1}^n l(d_i; \theta)$. Under suitable regularity conditions, one can show that $\hat\theta^*$ is consistent for $\theta^*$, that is, $\hat\theta^*$ converges in probability to $\theta^*$ (Huber, 2011).

When $n$ is sufficiently large, subsampling designs are preferred to save computational cost. For each $i = 1, \ldots, n$, let $\Delta_i$ be a 0-1 valued indicator of whether the $i$th data point is sampled into the subsample ($\Delta_i = 1$) or not ($\Delta_i = 0$). Let $\tilde\theta$ be a pilot estimate (i.e., a guess of $\theta^*$). Let $\pi_i$ be the conditional sampling probability of the $i$th data point, i.e., the conditional probability of $\Delta_i = 1$
given all the observed data and the pilot estimate. Thus, $\pi_i$ may depend on the observed data and the pilot. Also, the $\Delta_i$'s are generated independently of each other given the observed data and the pilot.

For a binary response, the LCC design proposed by Fithian and Hastie (2014) sets $\pi_i = |y_i - p(\tilde\alpha + \tilde\beta^\top x_i)|$, where $p(t) = \exp(t)/[1 + \exp(t)]$ and $(\tilde\alpha, \tilde\beta)$ is a pilot estimate of $(\alpha, \beta)$. After obtaining the subsample, they fit a logistic regression to the subsample and then make a simple adjustment to get the estimator of the target parameter. The main advantage of their estimation procedure is that it basically maintains the original form of the maximum likelihood estimation with the full data. However, as we mention in Section 1, the validity of their approach relies heavily on the consistency of the pilot estimate, and it has not been extended to other regression models or more general learning tasks. We apply the HT type estimation. Specifically, the HT type estimator is given by
$$\hat\theta = \arg\min_\theta \sum_{i=1}^n \frac{\Delta_i}{\pi_i}\, l(d_i; \theta). \qquad (1)$$
The HT type estimation is general enough for various kinds of loss functions. The computational complexity of (1) is similar to that of the full data analysis. For instance, if the objective function based on the full data is convex, so is the objective function in (1). Meanwhile, by inverting the sampling probability, it is quite intuitive to expect the consistency of the HT type estimator regardless of the consistency of the pilot estimate. More importantly, based on the HT type estimation, we can derive an optimal form of the subsampling design for a specific objective. The details of the derivation are discussed in the following section.

Before proceeding to the next section, we introduce some more necessary notation. Let $g(d; \theta) = \partial l(d; \theta)/\partial\theta$. When $l$ is the negative log-likelihood function, $g$ becomes the score vector. Let $G(d; \theta) = \partial^2 l(d; \theta)/\partial\theta\partial\theta^\top$ and $A = E[G(D; \theta^*)]$. For any column vector $a$, $a^{\otimes 2}$ stands for $aa^\top$. We use $\|\cdot\|$ to stand for the Euclidean norm.
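To fix ideas, the following is a minimal sketch of the HT type estimation (1) for the logistic loss. The function name `ht_estimate` and the use of numpy/scipy are our illustrative choices under stated assumptions, not part of the original proposal.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def ht_estimate(z, y, delta, pi, theta0):
    """HT type estimate (1) under the logistic loss.

    z: (n, p) design matrix (intercept in the first column); y: (n,) binary
    responses; delta: (n,) 0-1 sampling indicators; pi: (n,) sampling
    probabilities; theta0: (p,) starting value for the optimizer.
    """
    idx = delta == 1                      # only sampled points enter the sum
    zs, ys, ws = z[idx], y[idx], 1.0 / pi[idx]

    def loss(theta):
        eta = zs @ theta
        # inverse-probability-weighted negative log-likelihood
        return np.sum(ws * (np.logaddexp(0.0, eta) - ys * eta))

    def grad(theta):
        return zs.T @ (ws * (expit(zs @ theta) - ys))

    return minimize(loss, theta0, jac=grad, method="L-BFGS-B").x
```

Note that the weighted objective stays convex whenever the unweighted loss is, which is why the full-data solver carries over unchanged.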
3 Surprise sampling

To give out the specific form of the proposed subsampling design, we first heuristically present the asymptotic properties of the HT type estimator $\hat\theta$. By the definition of $\hat\theta$, it is easy to see that $\sum_{i=1}^n \Delta_i g(d_i; \hat\theta)/\pi_i = 0$. Under suitable conditions, we can show that $\hat\theta$ is consistent for $\theta^*$ and that
$$\sqrt{n}(\hat\theta - \theta^*) = -A^{-1}\frac{1}{\sqrt{n}}\sum_{i=1}^n \frac{\Delta_i}{\pi_i} g(d_i; \theta^*) + o_p(1). \qquad (2)$$
Furthermore, we can show that $\sqrt{n}(\hat\theta - \theta^*)$ converges in distribution to a Gaussian vector, denoted by $Z$, with mean zero and variance-covariance matrix $A^{-1} V_\pi A^{-1}$, where $V_\pi = E[g(D; \theta^*)^{\otimes 2}/\pi]$ and $\pi$ is a probability that may depend on $D$. We give out sufficient conditions that guarantee the above results in Appendix A.2.

3.1 Overall prediction accuracy

If the main purpose of the data analysis is prediction, the minimized population prediction error $R(\theta^*) = E[l(D; \theta^*)]$ is a natural criterion to measure the prediction accuracy. Based on a subsampling probability $\pi$ and the corresponding HT type estimator $\hat\theta$, the prediction error can be measured by $R(\hat\theta)$, which is $R(\theta)$ evaluated at $\theta = \hat\theta$. By Taylor expansion, we have that
$$nR(\hat\theta) - nR(\theta^*) = n E[g(D; \theta^*)]^\top(\hat\theta - \theta^*) + \frac{n}{2}(\hat\theta - \theta^*)^\top E[G(D; \theta^*)](\hat\theta - \theta^*) + o_p(1) = \frac{n}{2}(\hat\theta - \theta^*)^\top A(\hat\theta - \theta^*) + o_p(1), \qquad (3)$$
since $E[g(D; \theta^*)] = 0$ at the interior minimizer $\theta^*$. The leading term in (3) converges in distribution to $Z^\top A Z/2$. By some calculation, we have that
$$E(Z^\top A Z) = \mathrm{tr}[E(Z^\top A Z)] = \mathrm{tr}[A\,E(ZZ^\top)] = \mathrm{tr}(V_\pi A^{-1}) = E\left[\mathrm{tr}\left(\frac{A^{-1/2} g(D; \theta^*)^{\otimes 2} A^{-1/2}}{\pi}\right)\right] = E\left(\frac{\|A^{-1/2} g(D; \theta^*)\|^2}{\pi}\right). \qquad (4)$$
It is then natural to select $\pi$ to minimize (4), which can be treated as the average difference between the prediction error based on the subsample and the minimized population prediction error. The following proposition, proved in Appendix A.1, gives out the optimal form of $\pi$.

Proposition 1.
Let $r \in (0, 1)$ be a constant. The optimal $\pi$ to minimize (4), subject to $E(\pi) \leqslant r$, is given by $\pi = (c\|A^{-1/2} g(D; \theta^*)\|) \wedge 1$, where $c$ is the largest constant such that $E(\pi) \leqslant r$, and for constants $a$ and $b$, $a \wedge b = \min\{a, b\}$.

The constant $r$ is used to control the subsampling rate. Note that the optimal $\pi$ depends on some unknown quantities, such as $A$ and $\theta^*$. Given a pilot estimate $\tilde\theta$, we propose the following subsampling design
$$\pi_i = (c\|\tilde A^{-1/2} g(d_i; \tilde\theta)\|) \wedge 1, \qquad (5)$$
$i = 1, \ldots, n$, where $\tilde A = n^{-1}\sum_{i=1}^n G(d_i; \tilde\theta)$ and $c$ is a constant selected to reach the predetermined subsampling rate. For a given subsampling rate $r$, we design an algorithm based on the bisection method to get the largest $c$ such that $n^{-1}\sum_{i=1}^n \pi_i \leqslant r$.

The proposed subsampling probability is proportional to the norm of the score evaluated at the pilot. When this value is relatively large for a data point, it implies that the data point has a large prediction error for the given pilot estimate, that is, it is somewhat more "surprising" than the ones with smaller score values. Under such a design, a data point with a larger score value is more likely to be drawn into the subsample. Thus, we call the proposed subsampling design "surprise" sampling, or score sampling.
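As an illustration of (5), here is a sketch computing the kernel part $\|\tilde A^{-1/2} g(d_i; \tilde\theta)\|$ under the logistic loss, for which $g(d; \theta) = (p(\theta^\top z) - y)z$ and $G(d; \theta) = p(1 - p)zz^\top$; the function name is ours. A Cholesky factorization $\tilde A = LL^\top$ is one convenient way to evaluate the norm, since $\|\tilde A^{-1/2} g\|^2 = g^\top \tilde A^{-1} g = \|L^{-1} g\|^2$.

```python
import numpy as np
from scipy.special import expit

def prediction_kernel(z, y, theta_pilot):
    """Kernel part of design (5) under the logistic loss."""
    p = expit(z @ theta_pilot)
    g = (p - y)[:, None] * z                        # scores g(d_i; theta_pilot)
    a_tilde = (z * (p * (1 - p))[:, None]).T @ z / len(y)
    chol = np.linalg.cholesky(a_tilde)              # a_tilde = L L'
    u = np.linalg.solve(chol, g.T)                  # u_i = L^{-1} g_i
    return np.linalg.norm(u, axis=0)                # = ||a_tilde^{-1/2} g_i||
```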
3.2 Estimation accuracy

If the main concern is the estimation accuracy of $v^\top\hat\theta$ as an estimator of $v^\top\theta^*$, where $v$ is a given $p$-dimensional constant vector, then the objective becomes to give out an efficient subsampling scheme to increase the estimation accuracy. Without loss of generality, assume that $\|v\| = 1$. From (2), we have that $\sqrt{n}\,v^\top(\hat\theta - \theta^*)$ converges in distribution to a Gaussian variable with mean zero and variance $v^\top A^{-1} V_\pi A^{-1} v$. We then select $\pi$ to minimize this asymptotic variance.

Proposition 2.
Let $r \in (0, 1)$ be a constant. The optimal $\pi$ to minimize $v^\top A^{-1} V_\pi A^{-1} v$, subject to $E(\pi) \leqslant r$, is given by $\pi = (c|v^\top A^{-1} g(D; \theta^*)|) \wedge 1$, where $c$ is the largest constant such that $E(\pi) \leqslant r$.

Given a pilot $\tilde\theta$, the proposed subsampling design is given by
$$\pi_i = (c|v^\top \tilde A^{-1} g(d_i; \tilde\theta)|) \wedge 1, \qquad (6)$$
$i = 1, \ldots, n$, where $c$ is a constant selected to reach the predetermined subsampling rate. Again, the proposed design is proportional to the score evaluated at $\tilde\theta$.

To illustrate the relationship between the proposed surprise sampling and the LCC sampling, we discuss the case of generalized linear models with $D = (Y, X)$ and $d_i = (y_i, x_i)$. Here, let $Z = (1, X^\top)^\top$ and $\theta$ be the collection of all the regression parameters (i.e., the coefficients corresponding to $X$ plus the intercept). For a generalized linear model, the conditional probability or density function of $Y$ given $X = x$, denoted by $p(y \mid x; \theta)$, can be written in the form $\psi(y, \theta^\top z)$, where $\psi$ is a known function up to a finite-dimensional parameter and $z = (1, x^\top)^\top$. A natural choice of the loss function is the negative log-likelihood $l(y, x; \theta) = -\log\psi(y, \theta^\top z)$. Define $S(y, t) = \frac{\partial}{\partial t}\psi(y, t)/\psi(y, t)$. Then it is easy to see that $g(y, x; \theta) = -S(y, \theta^\top z)z$. When the model is correctly specified, there exists $\theta_0 = \theta^*$, known as the true parameter value. Let $\sigma_Z^2 = \mathrm{Var}[S(Y, \theta_0^\top Z) \mid Z]$.

Proposition 3.
Set $v = E(\sigma_Z^2 Z)$. Then the optimal $\pi$ to minimize the asymptotic variance of $v^\top\hat\theta$ is given by $\pi = c|S(Y, \theta_0^\top Z)| \wedge 1$, where $c$ is a constant controlling the subsampling rate.

Based on Proposition 3, given a pilot estimate $\tilde\theta$, we would propose the subsampling design $\pi_i = c|S(y_i, \tilde\theta^\top z_i)| \wedge 1$, where $z_i = (1, x_i^\top)^\top$, $i = 1, \ldots, n$. For the logistic model with a binary response, we have $\psi(y, t) = p(t)^y(1 - p(t))^{1-y}$ and $S(y, t) = y - p(t)$. Then the proposed subsampling design becomes $\pi_i = c|y_i - p(\tilde\theta^\top z_i)| \wedge 1$, which is exactly the LCC sampling proposed by Fithian and Hastie (2014). Thus, the LCC sampling can be viewed as a special case of the proposed surprise sampling design, with a somewhat narrow objective to achieve optimality in estimating a certain direction of the regression parameter vector. Surprise sampling, however, is better justified, since if the data analysis objective is altered, the sampling can be modified accordingly. We show in the next section that if the logistic model is correctly specified and the pilot estimate is consistent, the proposed HT type estimator $\hat\theta$ has the same asymptotic efficiency as the LCC estimator of Fithian and Hastie (2014).

Besides optimizing the prediction accuracy and estimation accuracy, other objectives can also be considered. The optimal selection probability $\pi$ adaptive to various objectives can be derived similarly to those in Propositions 1 and 2. For instance, Wang et al. (2018) and Ai et al. (2018) mainly considered minimizing the conditional MSE of $\hat\theta$ given the full data. In our proposal, for minimizing the (unconditional) MSE, the optimal design is given by $\pi = (c\|A^{-1} g(D; \theta^*)\|) \wedge 1$, which has a similar kernel part to the optimal form derived by Wang et al. (2018) under the logistic model and Ai et al. (2018) under generalized linear models.
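A sketch of the GLM designs from Proposition 3, with $S(y, t)$ spelled out for the two models used later in the paper; the helper names are ours and the constant $c$ is left to the caller.

```python
import numpy as np
from scipy.special import expit

def s_logistic(y, t):
    return y - expit(t)            # S(y, t) for the logistic model

def s_poisson(y, t):
    return y - np.exp(t)           # S(y, t) for the Poisson log-linear model

def glm_design(y, z, theta_pilot, s_fun, c=1.0):
    """pi_i = c |S(y_i, theta'z_i)| ∧ 1; with s_logistic and c = 1 this is
    exactly the LCC design of Fithian and Hastie (2014)."""
    return np.minimum(c * np.abs(s_fun(y, z @ theta_pilot)), 1.0)
```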
Let $\tilde\pi_i$ be the kernel part of the proposed surprise sampling design, e.g., $\tilde\pi_i = \|\tilde A^{-1/2} g(d_i; \tilde\theta)\|$ for prediction and $\tilde\pi_i = |v^\top \tilde A^{-1} g(d_i; \tilde\theta)|$ for estimation. The implementation procedure of the surprise sampling is summarized in Algorithm 1.
Algorithm 1 Surprise sampling

Input: data $d_i$, $i = 1, \ldots, n$, a loss function $l(d; \theta)$, and a subsampling rate $r$.
Output: the HT type estimate $\hat\theta$.
1. Get a pilot estimate $\tilde\theta$.
2. For $i = 1, \ldots, n$:
   - evaluate $\tilde\pi_i$ and set $\pi_i = c\tilde\pi_i \wedge 1$, where $c$ is obtained by Algorithm 2;
   - generate independent $\Delta_i \sim \mathrm{Bernoulli}(\pi_i)$.
3. Obtain the subsample $\{d_i : \Delta_i = 1\}$.
4. Compute $\hat\theta$ by minimizing $\sum_{i=1}^n \Delta_i l(d_i; \theta)/\pi_i$.

The constant $c$ in Algorithm 1 is obtained by Algorithm 2 below.

Algorithm 2 Find $c$ to reach the predetermined subsampling rate $r$

Input: $\{\tilde\pi_i, i = 1, \ldots, n\}$ and the subsampling rate $r$.
Output: the constant $c$ to control the subsample size.
1. Rearrange the $\tilde\pi_i$'s in ascending order $\tilde\pi_{(1)} \leqslant \cdots \leqslant \tilde\pi_{(n)}$, and compute $c_0 = nr/\sum_{i=1}^n \tilde\pi_i$.
2. If $c_0\tilde\pi_{(n)} \leqslant 1$, set $c = c_0$.
3. Otherwise, find $m$ such that $f(1/\tilde\pi_{(m+1)}) \leqslant nr \leqslant f(1/\tilde\pi_{(m)})$ by the bisection method, where $f(x) = \sum_{i=1}^n (x\tilde\pi_i \wedge 1)$, and set $c = [nr - (n - m)]/\sum_{i=1}^{m}\tilde\pi_{(i)}$.

4 Large sample properties

In this section we discuss the asymptotic properties of the HT type estimator under the proposed surprise sampling designs. We first give out a general theorem on the large sample theory of $\hat\theta$ defined in (1). The theorem shows that the consistency and asymptotic normality of $\hat\theta$ hold under certain conditions on the sampling design $\pi_i$ and the underlying distribution of the data.

Theorem 1. If conditions A1-A3 and C1-C4 listed in Appendix A.2 hold, then 1) $\hat\theta \xrightarrow{p} \theta^*$, where $\xrightarrow{p}$ means converging in probability (i.e., $\hat\theta$ is consistent), and 2) $\sqrt{n}(\hat\theta - \theta^*) \xrightarrow{d} N(0, A^{-1} V_\pi A^{-1})$, where $\xrightarrow{d}$ means converging in distribution and $N(\mu, \Sigma)$ stands for a normal vector with mean $\mu$ and variance-covariance matrix $\Sigma$.

The key step in obtaining the asymptotic distribution is to establish the expansion (2). We give out the proof in Appendix A.2. It is worthwhile to mention that the consistency and asymptotic normality of $\hat\theta$ require the pilot $\tilde\theta$ neither to be consistent nor to be independent of the full data, both of which are required in Fithian and Hastie (2014). These relaxations can be useful in practice, as we mentioned in Section 1. This is one of the main reasons that we propose to use the HT type estimator.

Based on Theorem 1, the asymptotic properties of the HT type estimator under the proposed surprise sampling designs can be derived immediately. For the surprise sampling, the specific forms of the design $\pi_i$ are given in (5) or (6), depending on the analysis purpose. We first consider the situation where the pilot $\tilde\theta$ is consistent. Here, the probability $\pi$ in $V_\pi$ takes the form $(c\|A^{-1/2} g(D; \theta^*)\|) \wedge 1$ or $(c|v^\top A^{-1} g(D; \theta^*)|) \wedge 1$, matching (5) or (6) with the consistent pilot estimate.
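A minimal sketch of Algorithm 2 in Python, assuming all kernel values $\tilde\pi_i$ are positive and $r < 1$. It bisects on $c$ directly, which is equivalent to the formulation above since the realized rate is nondecreasing in $c$.

```python
import numpy as np

def find_c(pi_tilde, r, iters=100):
    """Largest c with n^{-1} sum_i (c * pi_tilde_i ∧ 1) <= r (Algorithm 2)."""
    s = np.asarray(pi_tilde, dtype=float)
    c = len(s) * r / s.sum()
    if c * s.max() <= 1.0:                # no probability is truncated at 1
        return c
    rate = lambda c: np.minimum(c * s, 1.0).mean()   # nondecreasing in c
    lo, hi = c, 1.0 / s.min()             # rate(lo) <= r and rate(hi) = 1 >= r
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if rate(mid) <= r else (lo, mid)
    return lo
```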
Corollary 1. If conditions C1-C5 listed in Appendix A.2 hold and $\tilde\theta \xrightarrow{p} \theta^*$ (i.e., $\tilde\theta$ is consistent), then 1) $\hat\theta \xrightarrow{p} \theta^*$ (i.e., $\hat\theta$ is consistent) and 2) $\sqrt{n}(\hat\theta - \theta^*) \xrightarrow{d} N(0, A^{-1} V_\pi A^{-1})$.

In Corollary 1, the probability $\pi$ in $V_\pi$ takes exactly the optimal form given in Propositions 1 and 2. It means that when the pilot estimate is consistent, the proposed sampling designs given in (5) or (6) are optimal in the sense that the corresponding HT type estimator $\hat\theta$ asymptotically reaches the best prediction accuracy or estimation accuracy.

When the assumed model is correctly specified, the target parameter $\theta^*$ becomes the true parameter $\theta_0$. For the binary response problem discussed in Fithian and Hastie (2014), we have shown that the LCC sampling is a special case of the proposed surprise sampling. Consider the sampling design $\pi_i = |y_i - p(\tilde\theta^\top z_i)|$, that is, the constant $c = 1$. We have the following result for the corresponding $\hat\theta$.

Corollary 2.
If the logistic regression model $p(\theta_0^\top z) = \exp(\theta_0^\top z)/[1 + \exp(\theta_0^\top z)]$ holds with $\theta_0$ being the true parameter value, $X$ does not concentrate on a hyperplane of dimension smaller than $q$, $E(\|X\|^2/\pi) < \infty$, and $\tilde\theta \xrightarrow{p} \theta_0$, then $\sqrt{n}(\hat\theta - \theta_0) \xrightarrow{d} N(0, 2\Sigma_{\mathrm{full}})$, where $\Sigma_{\mathrm{full}}$ stands for the asymptotic variance of the MLE for the full sample.

Corollary 2 means that under the correctly specified logistic model, the proposed HT type estimator $\hat\theta$ is asymptotically as efficient as the LCC estimator of Fithian and Hastie (2014), as long as the pilot is consistent. Again, our result does not require independence between the pilot and the full data. Both Corollaries 1 and 2 are proved in Appendix A.2.

We then consider the situation with an inconsistent pilot estimate. When the pilot estimate is not consistent, the LCC estimator is no longer consistent. However, based on Theorem 1, the HT type estimator can still be consistent and asymptotically normal as long as the pilot has a certain limit in probability. Specifically, suppose that $\tilde\theta$ converges in probability to a certain limit, denoted by $\bar\theta$. Let $\bar A = E[G(D; \bar\theta)]$ and $V_{\bar\pi} = E[g(D; \theta^*)^{\otimes 2}/\bar\pi]$, where $\bar\pi = (c\|\bar A^{-1/2} g(D; \bar\theta)\|) \wedge 1$ or $\bar\pi = (c|v^\top \bar A^{-1} g(D; \bar\theta)|) \wedge 1$.

Corollary 3.
If conditions C1-C4 and C6 listed in Appendix A.2 hold and $\tilde\theta \xrightarrow{p} \bar\theta$, then 1) $\hat\theta \xrightarrow{p} \theta^*$ (i.e., $\hat\theta$ is consistent) and 2) $\sqrt{n}(\hat\theta - \theta^*) \xrightarrow{d} N(0, A^{-1} V_{\bar\pi} A^{-1})$.

Again, we give out the proof in Appendix A.2. When the pilot estimate is not consistent for $\theta^*$, the surprise sampling design is no longer optimal in the sense of minimizing the overall prediction error or the asymptotic variance. However, $\hat\theta$ is still consistent for $\theta^*$ and asymptotically normally distributed. Meanwhile, when $\bar\theta$ is not far from $\theta^*$, which is the common case since the pilot is by definition a "good" guess of the target parameter, the surprise sampling design is still approximately optimal. Thus, the proposed estimation procedure is robust against model mis-specification as well as an inconsistent pilot.

Finally, a plug-in approach can be applied to estimate the asymptotic variance-covariance matrix of $\hat\theta$. Specifically, we define $\hat A = n^{-1}\sum_{i=1}^n \Delta_i G(d_i; \hat\theta)/\pi_i$ and $\hat V_\pi = n^{-1}\sum_{i=1}^n \Delta_i g(d_i; \hat\theta)^{\otimes 2}/\pi_i^2$. A consistent estimator for the asymptotic variance-covariance matrix is given by $\hat A^{-1}\hat V_\pi\hat A^{-1}$.
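A sketch of this plug-in sandwich estimator, assuming the per-observation scores and Hessians are available as arrays; the function name and array layout are ours.

```python
import numpy as np

def sandwich_variance(g_hat, G_hat, delta, pi):
    """Plug-in estimate of A^{-1} V_pi A^{-1}; g_hat is (n, p) scores and
    G_hat is (n, p, p) Hessians, both evaluated at theta_hat."""
    n = len(pi)
    a_hat = np.einsum("i,ipq->pq", delta / pi, G_hat) / n
    v_hat = (g_hat * (delta / pi**2)[:, None]).T @ g_hat / n
    a_inv = np.linalg.inv(a_hat)
    return a_inv @ v_hat @ a_inv
```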
5 Simulation studies

Here we conduct extensive simulation studies to examine the effectiveness of the proposed approach and to make comparisons with the LCC sampling. The numerical studies mainly focus on various regression type problems, but as already mentioned, the whole procedure is general enough to extend to unsupervised learning tasks. We first consider the binary response, where the proposed HT type estimator can be compared with the LCC estimator of Fithian and Hastie (2014).

Simulation 1: Correctly specified Logistic model
In simulation 1 we consider the scenario where the logistic model is correctly specified. We set $q = 50$ and generate all the predictors $X$ independently from the standard normal distribution. Given $X$, $Y$ is generated from a Bernoulli distribution with success probability $p(\theta_0^\top Z)$, where $\theta_0 = (\alpha, \beta^\top)^\top$ with the first 25 components of $\beta$ being 1, the remaining 25 being 0, and $\alpha$ chosen to make the positive class rare. The full data size $n$ is set to be $10^6$.

For the pilot estimate, we use data drawn from the full sample with size $10^4$. We consider two types of consistent pilot estimates. The first draws a random sample with uniform probabilities and takes the logistic MLE as the pilot; the second draws a 50-50 split case-control sample and applies the weighted case-control approach to get the pilot. For the subsampling design, we apply the LCC sampling $\pi_i = |y_i - p(\tilde\theta^\top z_i)|$, which is a special case of the proposed surprise sampling. Both the LCC estimator and the proposed HT type estimator based on the negative log-likelihood loss are obtained for comparison. The procedure is repeated 1000 times. We record the squared bias and variance of the estimator of $\beta$, denoted by $\hat\beta$, over the 1000 realizations for each of the two methods under the two pilots, respectively. The results are summarized in Table 1.

Table 1. Comparison of LCC and HT estimates under the correctly specified logistic model.

Pilot           Uniform MLE           WCC
Estimation      Bias²      Var        Bias²      Var
LCC             3.888      3.682      3.347      3.613
HT              4.052      3.767      3.477      3.676

LCC: local case-control estimate; HT: Horvitz-Thompson type estimate; Uniform MLE: MLE with uniform sampling as pilot; WCC: weighted case-control estimate as pilot; Bias² = $\|E(\hat\beta) - \beta\|^2$; Var = $\sum_{j=1}^q \mathrm{Var}(\hat\beta_j)$.

From Corollary 2, we know that when the logistic model is correctly specified, the proposed HT estimator is asymptotically as efficient as the LCC estimator under the LCC sampling. From Table 1, we see that the behaviors of the two estimates are very close to each other, which coincides with the finding of the corollary. Also, the pilot method has little impact on the behavior of the proposed estimator, as long as it is consistent.

It is worthwhile to mention that under the LCC sampling, the average size of the subsample is around 6.6% of the entire sample size. Adding the sample size used for the pilot, this means that we use about 7.6% of the full data size to reach half of the estimation efficiency. We also record the computation time. On the computer on which we conduct all the numerical studies (a laptop running macOS 10.14 with an Intel i7 processor and 16GB memory), it takes around 352 seconds to calculate a full sample MLE. The whole subsample estimation procedure, including obtaining the pilot, evaluating the sampling probabilities, drawing the subsample, and calculating the HT type estimate, takes around 19 seconds, only 5.4% of the time for the full sample MLE.
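For concreteness, a sketch of one Simulation 1 replication, reusing `ht_estimate` from Section 2. The sizes are scaled down from the paper's setting, and the value of $\alpha$ and the random seed are our own assumptions, chosen only for illustration.

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(0)
n, q, n_pilot = 10**5, 50, 10**3          # scaled down from the paper's setting
beta = np.r_[np.ones(25), np.zeros(25)]
alpha = -4.0                              # assumed value; makes Y = 1 the rarer class
z = np.column_stack([np.ones(n), rng.standard_normal((n, q))])
theta_true = np.r_[alpha, beta]
y = rng.binomial(1, expit(z @ theta_true))

pilot_idx = rng.choice(n, n_pilot, replace=False)       # uniform pilot sample
ones = np.ones(n_pilot)
theta_pilot = ht_estimate(z[pilot_idx], y[pilot_idx], ones, ones,
                          np.zeros(q + 1))              # plain MLE on the pilot
pi = np.abs(y - expit(z @ theta_pilot))                 # LCC design (c = 1)
delta = rng.binomial(1, pi)
theta_hat = ht_estimate(z, y, delta, pi, theta_pilot)
```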
Simulation 2: Incorrectly specified Logistic model with a consistent pilot
In simulation 2 we turn to the scenario where the logistic model is incorrectly specified. Here $q$ is set to be 5, and again all the predictors are generated independently from the standard normal distribution. Besides the linear combination of the 5 predictors, we add the quadratic term of the first predictor into the success probability of the Bernoulli distribution, but still fit the data by the usual logistic model with linear terms only. Thus, the fitted model is incorrectly specified. The parameters are set to make the positive class rare. We still use the logistic MLE with uniform subsampling and the weighted case-control estimate as the pilot estimate, respectively; it is easy to see that the two pilots are consistent. Similar to simulation 1, the LCC sampling is applied, and the LCC estimator and the HT estimator are obtained. The procedure is repeated 1000 times, and results parallel to Table 1 are summarized in Table 2.

Table 2. Comparison of LCC and HT estimates under the incorrectly specified logistic model with consistent pilot estimates.

Pilot           Uniform MLE           WCC
Estimation      Bias²      Var        Bias²      Var
LCC             3.170      1.643      4.222      1.226
HT              4.629      1.045      3.481      1.040
LCC: local case-control estimate; HT: Horvitz-Thompson type estimate; Uniform MLE: MLE with uniform sampling as pilot; WCC: weighted case-control estimate as pilot; Bias² = $\|E(\hat\beta) - \beta^*\|^2$; Var = $\sum_{j=1}^q \mathrm{Var}(\hat\beta_j)$.

From the results we see that the LCC estimate and the HT estimate are both essentially unbiased for the target parameter $\beta^*$. Meanwhile, in this case the empirical variance of the proposed HT estimate is smaller than that of the LCC estimate. Under the LCC sampling, the average size of the subsample is around 1.9% of the entire sample size.

Simulation 3: Incorrectly specified Logistic model with an inconsistent pilot
In simulation 3 we compare the two estimators under the incorrectly specified logistic model with an inconsistent pilot estimate. The data generation scheme is the same as that in simulation 2. For the pilot estimate, we still draw a uniform subsample of the same size as before; the difference here is that we use the probit MLE as the pilot estimate. The resulting pilot estimate is then no longer consistent for the target parameter. Using this inconsistent pilot, the LCC sampling is applied and the LCC estimator and the HT estimator are obtained. The procedure is repeated 1000 times. Here, besides the results of the two estimates, we also report the squared bias and variance of the pilot estimate. The results are summarized in Table 3.

Table 3. Comparison of LCC and HT estimates under the incorrectly specified logistic model with an inconsistent pilot estimate. (Rows: Pilot_p, LCC, HT; columns: Bias², Var.)

Pilot_p: pilot estimate with the probit fitting; LCC: local case-control estimate; HT: Horvitz-Thompson type estimate; Bias² = $\|E(\hat\beta) - \beta^*\|^2$; Var = $\sum_{j=1}^q \mathrm{Var}(\hat\beta_j)$.

The consistency of the LCC estimator relies on the consistency of the pilot, but that of the proposed HT estimator does not, as Corollary 3 shows. From Table 3, we see that the pilot estimate is clearly biased. Compared with the HT estimator, the bias of the LCC estimator is quite obvious. Although the squared bias of the LCC estimator is not large in absolute value, it is of the same order as its variance. The proposed HT estimator, by contrast, has a much smaller squared bias relative to its variance, providing positive evidence for Corollary 3. This is an advantage of the HT estimator over the LCC one.
Simulation 4: Comparison of LCC sampling and surprise sampling
Proposition 3 shows that the LCC sampling is a special case of the proposed surprise sampling design, and is optimal when the target is to estimate $v^\top\theta^*$ with $v = E(\sigma_Z^2 Z)$. However, when the target direction changes, the LCC sampling is not adaptively optimal, but our surprise sampling design is. In this simulation, we again generate data by the same scheme as that in simulation 2, but suppose that the main concern is to estimate the regression coefficient of the first predictor, i.e., $\beta_1^*$. Equivalently, the target direction $v$ here is a 6-dimensional column vector with the second element being 1 and all the others being 0. For this target, the LCC sampling is not optimal. Following Proposition 2, the optimal design is given by $\pi_i = (c|v^\top \tilde A^{-1} z_i(y_i - p(\tilde\theta^\top z_i))|) \wedge 1$, where $\tilde A = n^{-1}\sum_{i=1}^n p(\tilde\theta^\top z_i)(1 - p(\tilde\theta^\top z_i)) z_i z_i^\top$. The two subsampling designs are compared, with $c$ chosen to yield comparable average subsample sizes between the two designs. We still use the same two pilot estimates as in simulation 2 and consider three estimation approaches: the LCC estimator, the HT estimator under the LCC sampling, and the HT estimator under the optimal surprise sampling. The procedure is repeated 1000 times. For the HT estimators, besides the squared bias and variance, we also report the average of the variance estimates and the empirical coverage percentage of the 95% Wald confidence interval. The results are summarized in Table 4.

Table 4. Estimation of $\beta_1^*$ using local case-control sampling and surprise sampling under the logistic model.

Pilot           Uniform MLE                              WCC
Estimation      Bias²    Var      Var Est.  CP (%)      Bias²    Var      Var Est.  CP (%)
LCC             0.473    7.799    -         -           1.295    3.417    -         -
HT-LCC          1.669    1.684    1.444     94.1        0.139    1.540    1.394     94.4
HT-optimal      0.577    1.134    0.978     92.0        0.065    1.129    0.943     92.5
LCC: local case-control estimate; HT-LCC: Horvitz-Thompson type estimate under the local case-control sampling; HT-optimal: Horvitz-Thompson type estimate under the optimal surprise sampling; Bias²: average of the squared bias; Var: empirical variance; Var Est.: average of the variance estimates; CP: coverage probability of the 95% Wald confidence interval; Uniform MLE: MLE with uniform sampling as pilot; WCC: weighted case-control estimate as pilot.

The variance of the HT estimator under the optimal surprise sampling is much smaller than that of the LCC estimator. With comparable subsample sizes (around 1.9% of the entire sample size), the relative efficiency of the HT estimator under the optimal sampling to the LCC estimator is at least 3 in our setting. Also, the HT estimator under the optimal sampling is more efficient than that under the LCC sampling, as expected from Proposition 2. The plug-in variance estimate and the Wald confidence interval give reasonable performance.
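A sketch of the directional kernel used here, with `coord = 1` selecting the coefficient of the first predictor (the intercept occupies position 0); the function name is ours.

```python
import numpy as np
from scipy.special import expit

def directional_kernel(z, y, theta_pilot, coord=1):
    """Kernel |v' A_tilde^{-1} z_i (y_i - p_i)| for v the coord-th unit vector."""
    p = expit(z @ theta_pilot)
    a_tilde = (z * (p * (1 - p))[:, None]).T @ z / len(y)
    v = np.zeros(z.shape[1])
    v[coord] = 1.0                         # position 0 is the intercept
    u = np.linalg.solve(a_tilde, v)        # u = A_tilde^{-1} v (A_tilde symmetric)
    return np.abs((z @ u) * (y - p))
```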
In the remaining studies we turn to more general regression models with counting and continuous responses, where the LCC approach is no longer available.

Simulation 5: Counting response under the log-linear model
In simulation 5 we consider a counting response. We set $q = 2$, and the two predictors $X = (X_1, X_2)$ are generated independently from the standard normal distribution. Two schemes for generating $Y$ given $X$ are considered. In the first, $Y$ is generated from a Poisson distribution with mean $\exp(\alpha + \beta_1 X_1 + \beta_2 X_2)$, while in the second, $Y$ is generated from a Poisson distribution whose log-mean contains an additional second-order term of the predictors. Thus, the Poisson log-linear model with only the linear terms is correctly specified for the first scheme but misspecified for the second. In both schemes, the parameters are set so that small counts dominate the response distribution. We draw a pilot sample of size 1000 from the entire sample with uniform probabilities and fit the Poisson log-linear MLE to be the pilot, denoted by $\tilde\theta = (\tilde\alpha, \tilde\beta_1, \tilde\beta_2)^\top$. According to Proposition 3, we use the surprise sampling design $\pi_i = (c|y_i - \exp(\tilde\theta^\top z_i)|) \wedge 1$, where $z_i = (1, x_{i1}, x_{i2})^\top$ and $c$ is set to make the subsample size equal to 1000. The proposed HT type estimator based on the negative log-likelihood loss is calculated. For comparison, we draw a uniform subsample of size 2000 to calculate the Poisson log-linear MLE; the size is 2000 since the proposed surprise sampling needs to pay for its pilot sample. The procedure is repeated 1000 times. The squared bias and variance of the two estimators are recorded. For the HT estimator, we also record the average of the variance estimates and the empirical coverage percentage of the 95% Wald confidence interval. The results are summarized in Table 5.

Table 5. Estimation results under the log-linear model. (Rows: $\alpha$, $\beta_1$, $\beta_2$ under the correctly specified and the misspecified model; columns: Sub-MLE Bias² and Var; HT Bias², Var, Var Est., and CP.)

Full MLE: full sample MLE; Sub-MLE: MLE with a uniform subsample of size 2000; HT: Horvitz-Thompson type estimate under the surprise sampling; Bias²: average of the squared bias; Var: empirical variance; Var Est.: average of the variance estimates; CP: coverage probability of the 95% Wald confidence interval.

From the results, we find that no matter whether the model is correctly specified or not, the HT estimator based on the proposed surprise sampling is essentially unbiased for the target parameter, and is much more efficient than the uniform subsample MLE with a comparable sample size. The plug-in variance estimate and the Wald confidence interval give satisfactory performance.
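A sketch of the Simulation 5 sampling-and-fitting step, reusing `find_c` from Algorithm 2's sketch; names are ours.

```python
import numpy as np
from scipy.optimize import minimize

def ht_poisson(z, y, theta_pilot, r, rng):
    """Surprise sampling and HT fitting for the Poisson log-linear model."""
    pi_tilde = np.abs(y - np.exp(z @ theta_pilot))   # surprise kernel
    pi = np.minimum(find_c(pi_tilde, r) * pi_tilde, 1.0)
    delta = rng.binomial(1, pi)
    idx = delta == 1
    zs, ys, ws = z[idx], y[idx], 1.0 / pi[idx]

    def loss(t):                                     # weighted Poisson negative log-likelihood
        eta = zs @ t
        return np.sum(ws * (np.exp(eta) - ys * eta))

    def grad(t):
        return zs.T @ (ws * (np.exp(zs @ t) - ys))

    return minimize(loss, theta_pilot, jac=grad, method="L-BFGS-B").x
```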
Simulation 6: Continuous response under linear model
In the last simulation we consider a continuous response. Again $q$ is set to be 2, and the two predictors $X = (X_1, X_2)$ are generated independently from a normal distribution with zero mean and variance 0.01. Two schemes for generating $Y$ given $X$ are considered. In the first, $Y$ is generated from a normal distribution with mean $\alpha + \beta_1 X_1 + \beta_2 X_2$ and variance 0.01, while in the second, $Y$ is generated from a normal distribution whose mean contains an additional second-order term of the predictors, again with variance 0.01. Thus, the linear regression model with only the linear terms is correctly specified for the first scheme but misspecified for the second. In both schemes, the parameters are set so that the response concentrates in a narrow interval. We draw a uniform pilot sample of size 1000 from the entire sample and fit the normal linear regression MLE to be the pilot $\tilde\theta$. For the surprise sampling, we use $\pi_i = (c|y_i - \tilde\theta^\top z_i|) \wedge 1$, where $c$ is set to make the subsample size equal to 1000. The proposed HT type estimator based on the least squares loss is calculated. As in the last simulation, the MLE based on a uniform subsample of size 2000 is also obtained for comparison. The parallel results are summarized in Table 6.

Table 6. Estimation results under the linear model. (Rows: $\alpha$, $\beta_1$, $\beta_2$ under the correctly specified and the misspecified model; columns: Sub-MLE Bias² and Var; HT Bias², Var, Var Est., and CP.)

Sub-MLE: MLE with a uniform subsample of size 2000; HT: Horvitz-Thompson type estimate under the surprise sampling; Bias²: average of the squared bias; Var: empirical variance; Var Est.: average of the variance estimates; CP: coverage probability of the 95% Wald confidence interval.

The observations are basically similar to those of Table 5, so we omit the details. The last two simulations show that the proposed surprise sampling and the corresponding HT estimator are quite effective for counting and continuous data.
6 Real data analysis

6.1 Web Spam data
We first apply the proposed approach to the Web Spam data available on the LIBSVM website and originally from Webb, Caverlee, and Pu (2006). The same data were also analyzed by Fithian and Hastie (2014) using the LCC approach for spam filtering. The data consist of 350000 web pages, about 60% of which are labeled as "web spam", i.e., pages designed to manipulate search engines rather than display legitimate content. Following Fithian and Hastie (2014), we use the frequencies of the 99 unigrams that appear in at least 200 documents as features, log-transformed with an offset to reduce skew. This data set is marginally balanced, but has considerable conditional imbalance in the sense that for some feature values the label is easy to predict.

We keep the same subsampling procedure as that in Fithian and Hastie (2014). The LCC sampling design is used, and the corresponding selection percentage is about 10% for this data set. To assess the sampling distribution of the estimators, we repeatedly take uniform subsamples of size 100000 and use the same pilot method as in simulation 1 with size 10000 to get the pilot estimate. By the LCC design, the subsample size for parameter estimation is also around 10000. We fit the full sample MLE (of size 100000), the LCC estimate, and the proposed HT estimate. Following Corollary 2, under the logistic model, the asymptotic variance of the proposed HT estimate is twice the variance of the full sample MLE, and is the same as that of the LCC estimate. Since for this real data set the logistic model is very likely incorrectly specified, as Fithian and Hastie (2014) mentioned, the asymptotic variances of the HT estimate and the LCC estimate should be a little more than twice the variance of the MLE. On the other hand, the total subsample size is around 20000 (including the sample for the pilot estimate), i.e., 20% of the full sample size, so a uniform subsample of size 20000 should yield variance roughly 5 times that of the full sample.
Figure 1. Relative variance of the subsampling estimates ($\hat\beta_j$) to the full sample MLE ($\hat\beta_{j,\mathrm{MLE}}$); the vertical axis shows $\mathrm{Var}(\hat\beta_j)/\mathrm{Var}(\hat\beta_{j,\mathrm{MLE}})$. Triangles: HT type estimate; circles: LCC estimate.

Figure 1 shows the relative variances of the subsampling estimates to the full sample MLE. Specifically, the horizontal axis indexes the 100 regression coefficients to fit, and the vertical axis stands for the variance of each estimated coefficient relative to that of the full-sample MLE of the same coefficient. We find that most relative variances of the two subsampling estimates are slightly larger than 2, as expected, but substantially smaller than 5, implying that the LCC sampling is more efficient than uniform subsampling with a comparable sample size. Moreover, for most estimated coefficients, the relative variance of the HT estimate is smaller than that of the LCC estimate, implying that the proposed HT estimator is slightly more efficient than the LCC estimator. This is consistent with the finding of simulation 2.
The average computation time over 100 replications for the full sample MLE is 66.64 seconds. The average computation time for the LCC estimate (including the subsampling design implementation) is 19.32 seconds, and that for the HT estimate is 20.04 seconds.
6.2 Micro-blog data
In the second example we use the proposed method to analyze a data set from Sina micro-blog, a Chinese version of Twitter. We collected 625895 tweets posted on Sina micro-blog during January to March, 2018. For each tweet, the time of posting (in hours) and the numbers of comments, retweets, and likes are recorded. Moreover, we also record the gender, the numbers of followers and fans, and the number of previously posted tweets of the blogger who posted the tweet. We set the number of likes to be the response and the other variables to be the predictors. Since the number of likes is a counting variable, the log-linear model is used for the analysis. The response, ranging from 0 to tens of thousands, is very right skewed, i.e., imbalanced. The mean of the response is 1038 while the median is 72. Around 55% of the tweets have fewer than 100 likes, while 2.11% have more than 10000 likes. Figure 2 shows the histogram of the response.
Figure 2. Histogram of the number of likes in the micro-blog data.

In this analysis we focus on predicting the response from the predictors. Usually, the full data should be randomly separated into a training part for model fitting and a testing part for measuring prediction accuracy. To increase the stability of the results, here we use a ten-fold cross-validation style approach: the full data are randomly split into ten parts, and each time nine parts serve as the training set and the remaining part serves as the testing set. Each time we train the log-linear model on the training set and then calculate the root mean squared error (RMSE) of the predicted values on the testing set, i.e., $\mathrm{RMSE} = \sqrt{n_t^{-1}\sum_{i \in I_t}(y_i - \hat y_i)^2}$, where $I_t$ is the index set of the testing data, $n_t$ is the size of $I_t$, $y_i$ is the response value, and $\hat y_i$ is the predicted value. Finally, the average of the ten RMSEs, called ARMSE, is treated as the measure of prediction accuracy. In the training set, we use the same sampling design as that in simulation 5 to get the subsample, i.e., $\pi_i = (c|y_i - \exp(\tilde\theta^\top z_i)|) \wedge 1$, where $y_i$ is the response, $z_i$ contains the intercept and the standardized predictor vector, and $\tilde\theta$ is the pilot estimate, namely the MLE fitted with a uniform subsample of size 10000. By taking a proper value of $c$, the size of the surprise subsample is set to be 10000. For comparison, we also draw a uniform subsample of size 20000, comparable in total size to the surprise subsample, to train the model by MLE. The prediction result of the full data MLE serves as the benchmark.
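A sketch of the ten-fold evaluation just described; `fit` and `predict` are caller-supplied callables (e.g., a surprise-sampled HT fit such as `ht_poisson` above and the log-linear prediction $\hat y = \exp(z^\top\hat\theta)$), and the fold-splitting scheme is our illustrative choice.

```python
import numpy as np

def armse(z, y, fit, predict, rng, folds=10):
    """Average test RMSE over a ten-fold split."""
    perm = rng.permutation(len(y))
    rmses = []
    for k in range(folds):
        test = perm[k::folds]                        # one fold for testing
        train = np.setdiff1d(perm, test)             # the other nine folds
        theta = fit(z[train], y[train])              # e.g. surprise-sampled HT fit
        resid = y[test] - predict(z[test], theta)
        rmses.append(np.sqrt(np.mean(resid ** 2)))
    return np.mean(rmses)

# for the log-linear model one would use, e.g., predict = lambda zt, t: np.exp(zt @ t)
```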
The ARMSE based on the surprise sampling data is quite close to that of the full data, while the ARMSE based on the uniform subsample data is much greater than that of the surprise sampling. One of the main reasons for this result is the right-skewness of the response. With the subsampling percentage used here, uniform subsampling has little chance to pick out the tweets with extremely large responses, say, larger than 10000. Because of the significant influence of the large responses on the model fitting, the training models based on the uniform subsample in some folds have quite different coefficients for some predictors from the full data models. Consequently, the predictions based on the uniform subsample models are very inaccurate for some responses in the testing set, resulting in extremely large RMSEs. By contrast, the surprise sampling is more likely to select those large responses, in other words the "surprising" responses, into the subsample by its design motivation (the selection probability can be 1 for responses extremely large compared with the pilot prediction). Consequently, the training models based on the surprise sampling data take the large responses into account and eventually have coefficient estimates and prediction performance similar to the full data models. The average computation time over the 10 folds for the full data is 97.45 seconds; for the surprise sample data it is 4.22 seconds, and for the uniform sample data 3.54 seconds.

7 Concluding remarks

We develop an improved version of the LCC sampling, called the surprise sampling design, to reduce the computational burden while retaining good prediction and estimation performance. The proposed sampling design adapts flexibly to different objectives; for a specific objective, the design has the corresponding optimality. The LCC sampling design, in this sense, can be viewed as a special case of the proposed surprise sampling design. For parameter estimation, we propose the HT type estimation approach. The resulting estimator is consistent and asymptotically normally distributed, without requiring the pilot estimator to be consistent or independent of the full data. For a binary response, if the logistic model is correctly specified and the pilot is consistent, the HT estimator is asymptotically as efficient as the estimator of Fithian and Hastie (2014) under the LCC sampling. The surprise sampling design and the HT estimation extend to more general responses such as counting and continuous responses.

The proposed subsampling approach is more of a working principle for efficient data analysis which can be applied to various statistical learning problems. In the numerical studies we mainly focus on regression. However, as we mention several times, the optimal sampling design, the algorithms described in Section 3, and the theory discussed in Section 4 can be readily extended to unsupervised learning tasks.
The main reason is that our setup essentially only requires a specific loss function and a finite-dimensional parameter. One example of unsupervised learning is briefly given in Section 2. More detailed discussion of the approach and more meaningful applications to unsupervised learning are of great interest.

There are several other directions worth further research. Firstly, in our proposed approach, the pilot estimate is a guess of the target parameter, so they are of the same dimension. In some real applications, one may want to use a relatively simpler model and a smaller sample size to get the pilot quickly, and then apply a more complicated model to estimate the target parameter or do prediction. Then the pilot model differs from the fitted model and the pilot is no longer a formal guess of the target parameter. In this case, the surprise sampling design needs adjustment and the optimality of the design requires more investigation. Secondly, the HT type estimator defined in (1) is not semiparametrically efficient, but an immediate variant of it is. The surprise sampling can be defined accordingly with this variant, in which the sampling probability $\pi_i$ is estimated by a kernel smoothing approach; it is possible to further improve the estimation efficiency. Thirdly, when the dimension of the target parameter is high, one may introduce regularization penalties, such as the lasso and ridge, into the objective function in (1). Based on the proposed framework, the regularization can be easily incorporated; the large sample properties of the resulting estimator need more exploration.

Appendix
In the Appendix, we prove the propositions and theorems in Sections 3 and 4.
A.1 Proofs of the propositions
Proof of Proposition 1:
Set $\tilde\pi = \|A^{-1/2} g(D; \theta^*)\|$. By the Cauchy-Schwarz inequality,
$$[E(\tilde\pi)]^2 = \left[E\left(\sqrt{\pi}\cdot\frac{\tilde\pi}{\sqrt{\pi}}\right)\right]^2 \leqslant E(\pi)\cdot E\left(\frac{\tilde\pi^2}{\pi}\right),$$
where the equality holds if and only if $\sqrt{\pi} \propto \tilde\pi/\sqrt{\pi}$, which is equivalent to $\pi = c\cdot\tilde\pi$ with $c$ a constant. From this inequality we get an achievable lower bound of (4), that is,
$$\min_{\pi} E\left(\frac{\tilde\pi^2}{\pi}\right) = \left.\frac{[E(\tilde\pi)]^2}{E(\pi)}\right|_{\pi = c\tilde\pi} = \frac{E(\tilde\pi)}{c}.$$
By the Karush-Kuhn-Tucker (KKT) conditions, with the constraint $0 \leqslant \pi \leqslant 1$, the optimal $\pi$ must be given by $c\tilde\pi \wedge 1$. Then, with the constraint $E(\pi) \leqslant r$, $c$ is determined as the largest constant such that $E(c\tilde\pi \wedge 1) \leqslant r$.

Proof of Proposition 2: Set $\tilde\pi = |v^\top A^{-1} g(D; \theta^*)|$. Note that the asymptotic variance of $\sqrt{n}\,v^\top(\hat\theta - \theta^*)$, which we want to minimize, is
$$v^\top A^{-1} V_\pi A^{-1} v = E\left(\frac{v^\top A^{-1} g(D; \theta^*)^{\otimes 2} A^{-1} v}{\pi}\right) = E\left(\frac{|v^\top A^{-1} g(D; \theta^*)|^2}{\pi}\right).$$
The rest of the proof is exactly the same as that of Proposition 1.
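As a quick numerical sanity check of Proposition 1 (not part of the original paper), one can compare the objective $E(\tilde\pi^2/\pi)$ under the proposed design and under uniform sampling at the same rate, reusing `find_c` from Algorithm 2's sketch; the half-normal kernel values are an arbitrary stand-in.

```python
import numpy as np

rng = np.random.default_rng(1)
s = np.abs(rng.standard_normal(10**5))        # stand-in kernel values pi_tilde
r = 0.1
pi_opt = np.minimum(find_c(s, r) * s, 1.0)    # the design of Proposition 1
pi_unif = np.full_like(s, r)                  # uniform design with the same rate
print(np.mean(s**2 / pi_opt), np.mean(s**2 / pi_unif))   # first value is smaller
```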
Let $e_1$ be the $p$-dimensional unit vector whose first component is one and all the others are zero. We first prove that
$$v^\top A^{-1} = e_1^\top. \qquad (7)$$
When $\theta = \theta_0$, we have that
$$E\{G(Y,X;\theta_0)\mid X\} = -E\Big(\frac{\partial^2}{\partial\theta\partial\theta^\top}\log\psi(Y,\theta_0^\top Z)\,\Big|\,Z\Big) = \mathrm{Cov}\Big(\frac{\partial}{\partial\theta}\log\psi(Y,\theta_0^\top Z)\,\Big|\,Z\Big) = \mathrm{Cov}\big(S(Y,\theta_0^\top Z)Z\mid Z\big)$$
$$= E\big[\{S(Y,\theta_0^\top Z)Z - E(S(Y,\theta_0^\top Z)Z\mid Z)\}^{\otimes 2}\mid Z\big] = E\big[\{S(Y,\theta_0^\top Z) - E(S(Y,\theta_0^\top Z)\mid Z)\}^2\mid Z\big]ZZ^\top = \mathrm{Var}\{S(Y,\theta_0^\top Z)\mid Z\}ZZ^\top,$$
where Cov stands for the variance-covariance matrix of a random vector. Hence, $A = E\{G(Y,X;\theta_0)\} = E[\mathrm{Var}\{S(Y,\theta_0^\top Z)\mid Z\}ZZ^\top] = E(\sigma^2_Z ZZ^\top)$. Since the first component of $Z$ is the constant 1, we have that $e_1^\top Z = 1$. Also, note that $\sigma^2_Z$ is a scalar. Thus,
$$e_1^\top A = E(\sigma^2_Z e_1^\top ZZ^\top) = E(\sigma^2_Z Z^\top) = v^\top,$$
which leads to (7). Next, we have that
$$v^\top A^{-1}g(y,x;\theta_0) = e_1^\top\{-S(y,\theta_0^\top z)z\} = -e_1^\top S(y,\theta_0^\top z)(1,x^\top)^\top = -e_1^\top\big(S(y,\theta_0^\top z), S(y,\theta_0^\top z)x^\top\big)^\top = -S(y,\theta_0^\top z).$$
On one hand, by Proposition 2 we know that the asymptotic variance of $v^\top\hat\theta$ based on the optimal $\pi$ is
$$E\Big[\big(c^{-1}|v^\top A^{-1}g(Y,X;\theta^*)|\big)\vee\big(v^\top A^{-1}g(Y,X;\theta^*)\big)^2\Big], \qquad (8)$$
which equals $E[(|S(Y,\theta_0^\top Z)|/c)\vee S(Y,\theta_0^\top Z)^2]$ because of (7). On the other hand, the asymptotic variance of $v^\top\hat\theta$ based on the $\pi$ defined in the proposition is
$$E\big(\pi^{-1}|v^\top A^{-1}g(Y,X;\theta^*)|^2\big)\big|_{\pi = c|S(Y,\theta_0^\top Z)|\wedge 1} = E\Big[\big(c^{-1}|S(Y,\theta_0^\top Z)|\big)\vee S(Y,\theta_0^\top Z)^2\Big],$$
which is equal to (8). Therefore, $\pi(Y,X;\theta_0) = c|S(Y,\theta_0^\top Z)|\wedge 1$ minimizes the asymptotic variance of $v^\top\hat\theta$. $\square$

A.2 Proofs of the theorem and corollaries

In order to make the presentation more concise, we use some abbreviations. Specifically, for $i = 1,\ldots,n$, let $l_i(\theta) = l(d_i;\theta)$, $g_i(\theta) = g(d_i;\theta)$, and $G_i(\theta) = G(d_i;\theta)$. Let $R_n(\theta) = n^{-1}\sum_{i=1}^n \Delta_i l_i(\theta)/\pi_i$, and denote by $\mathcal F_n$ the $\sigma$-algebra generated by the observed data and the pilot estimate $\tilde\theta$.

Some conditions are needed. The first set of conditions mainly concerns the subsampling probabilities $\pi_i$'s.

A1. $n^{-1}\sum_{i=1}^n \{\sup_{\theta\in\Theta} l_i(\theta)\}^2/\pi_i = O_p(1)$, where $\Theta\subset\mathbb R^p$ is the parameter space.

A2. $n^{-1}\sum_{i=1}^n \|G_i(\theta^*)\|^2/\pi_i = O_p(1)$.

A3. There exists a probability $\pi$ that may depend on $D$ such that $E\{\|g(D;\theta^*)\|^2/\pi\} < \infty$ and $n^{-1}\sum_{i=1}^n (1/\pi_i - 1)g_i(\theta^*)^{\otimes 2}\xrightarrow{p} E\{(1/\pi - 1)g(D;\theta^*)^{\otimes 2}\}$.

The second set consists of regularity conditions on the underlying distribution of the data.

C1. $\Theta$ is compact and contains $\theta^*$ as an interior point.

C2. $R(\theta)$ is continuous in $\theta$ and $\theta^*$ is the unique global minimizer of $R(\theta)$ over $\Theta$.

C3. There exists a function $h(D)$ such that $\sup_{\theta\in\Theta}\|l(D;\theta)\|\leqslant h(D)$, $\sup_{\theta\in\Theta}\|g(D;\theta)\|\leqslant h(D)$, $\sup_{\theta\in\Theta}\|G(D;\theta)\|\leqslant h(D)$, and $E\{h(D)^2\} < \infty$. $G(D;\theta)$ is continuous in $\theta$ with probability one.

C4. The matrix $A$ is positive definite.

We now begin to prove Theorem 1. The following lemmas are needed.

Lemma A.1.
Under conditions A1, C1, and C3, $\sup_{\theta\in\Theta}|R_n(\theta) - R(\theta)|\xrightarrow{p} 0$.

Proof: By the triangle inequality, we have that
$$\sup_{\theta\in\Theta}|R_n(\theta) - R(\theta)| \leqslant \sup_{\theta\in\Theta}|R_n(\theta) - R_{\mathcal F_n}(\theta)| + \sup_{\theta\in\Theta}|R_{\mathcal F_n}(\theta) - R(\theta)|, \qquad (9)$$
where $R_{\mathcal F_n}(\theta) = n^{-1}\sum_{i=1}^n l_i(\theta)$. By C3 and the uniform law of large numbers, it can be shown that $\sup_{\theta\in\Theta}|R_{\mathcal F_n}(\theta) - R(\theta)|\xrightarrow{p} 0$. Next we show the convergence of the first term on the right-hand side of (9). For $i = 1,\ldots,n$, define
$$d_\delta(\Delta_i,\theta_0) = \sup_{\theta\in B(\theta_0,\delta)}\Delta_i l_i(\theta)/\pi_i - \inf_{\theta\in B(\theta_0,\delta)}\Delta_i l_i(\theta)/\pi_i,$$
where $B(\theta_0,\delta)$ is a ball in $\mathbb R^p$ centered at $\theta_0$ with radius $\delta$. By the continuity of $l_i(\theta)$, the measurability of $l_i(\theta)/\pi_i$ with respect to $\mathcal F_n$, and the dominated convergence theorem, $E\{d_\delta(\Delta_i,\theta_0)\mid\mathcal F_n\}\to 0$ as $\delta\to 0$. Thus, for all $\theta_0\in\Theta$ and $\varepsilon > 0$, there exists $\delta_\varepsilon(\theta_0) > 0$ such that $E\{d_{\delta_\varepsilon(\theta_0)}(\Delta_i,\theta_0)\mid\mathcal F_n\} < \varepsilon$. Since $\Theta$ is compact, we can find a finite sequence $\theta_1,\ldots,\theta_K$ such that $\Theta$ is covered by $\bigcup_{k=1}^K B(\theta_k,\delta_\varepsilon(\theta_k))$. Thus, we have that
$$\sup_{\theta\in\Theta}\{R_n(\theta) - R_{\mathcal F_n}(\theta)\} = \sup_{\theta\in\Theta}\frac1n\sum_{i=1}^n\Big[\frac{\Delta_i}{\pi_i}l_i(\theta) - E\Big(\frac{\Delta_i}{\pi_i}l_i(\theta)\,\Big|\,\mathcal F_n\Big)\Big]$$
$$\leqslant \max_{1\leqslant k\leqslant K}\ \sup_{\theta\in B(\theta_k,\delta_\varepsilon(\theta_k))}\frac1n\sum_{i=1}^n\Big[\frac{\Delta_i}{\pi_i}l_i(\theta) - E\Big(\frac{\Delta_i}{\pi_i}l_i(\theta)\,\Big|\,\mathcal F_n\Big)\Big]$$
$$\leqslant \max_{1\leqslant k\leqslant K}\frac1n\sum_{i=1}^n\Big[\sup_{\theta\in B(\theta_k,\delta_\varepsilon(\theta_k))}\frac{\Delta_i}{\pi_i}l_i(\theta) - E\Big(\inf_{\theta\in B(\theta_k,\delta_\varepsilon(\theta_k))}\frac{\Delta_i}{\pi_i}l_i(\theta)\,\Big|\,\mathcal F_n\Big)\Big].$$
From A1, we have that $n^{-2}\sum_{i=1}^n\{\sup_{\theta\in\Theta}l_i(\theta)\}^2/\pi_i\xrightarrow{p} 0$. Thus, by the Chebyshev inequality, C3, and the weak law of large numbers, we can show that
$$\max_{1\leqslant k\leqslant K}\frac1n\sum_{i=1}^n\Big[\sup_{\theta\in B(\theta_k,\delta_\varepsilon(\theta_k))}\frac{\Delta_i}{\pi_i}l_i(\theta) - E\Big(\inf_{\theta\in B(\theta_k,\delta_\varepsilon(\theta_k))}\frac{\Delta_i}{\pi_i}l_i(\theta)\,\Big|\,\mathcal F_n\Big)\Big]$$
$$= \max_{1\leqslant k\leqslant K}\frac1n\sum_{i=1}^n\Big[E\Big(\sup_{\theta\in B(\theta_k,\delta_\varepsilon(\theta_k))}\frac{\Delta_i}{\pi_i}l_i(\theta)\,\Big|\,\mathcal F_n\Big) - E\Big(\inf_{\theta\in B(\theta_k,\delta_\varepsilon(\theta_k))}\frac{\Delta_i}{\pi_i}l_i(\theta)\,\Big|\,\mathcal F_n\Big)\Big] + a_n$$
$$= \max_{1\leqslant k\leqslant K}\frac1n\sum_{i=1}^n E\{d_{\delta_\varepsilon(\theta_k)}(\Delta_i,\theta_k)\mid\mathcal F_n\} + a_n \leqslant \varepsilon + a_n,$$
where $a_n$ satisfies that for all $\varepsilon' > 0$, $P(|a_n| > \varepsilon'\mid\mathcal F_n)\xrightarrow{p} 0$. Similarly, we can show that $\inf_{\theta\in\Theta}\{R_n(\theta) - R_{\mathcal F_n}(\theta)\}\geqslant -\varepsilon + b_n$, where for all $\varepsilon' > 0$, $P(|b_n| > \varepsilon'\mid\mathcal F_n)\xrightarrow{p} 0$. Thus, for arbitrary $\varepsilon > 0$, $P(\sup_{\theta\in\Theta}|R_n(\theta) - R_{\mathcal F_n}(\theta)| > \varepsilon\mid\mathcal F_n)\xrightarrow{p} 0$. By the Helly-Bray theorem, $\sup_{\theta\in\Theta}|R_n(\theta) - R_{\mathcal F_n}(\theta)|\xrightarrow{p} 0$. The desired conclusion follows from (9). $\square$
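The centering used repeatedly in the proof above is the conditional unbiasedness $E\{\Delta_i l_i(\theta)/\pi_i\mid\mathcal F_n\} = l_i(\theta)$. The following is a quick Monte Carlo check of this identity; it is a sketch with arbitrary illustrative losses and inclusion probabilities, not part of the formal argument.

```python
import numpy as np

# Monte Carlo check that the HT-weighted loss Delta_i * l_i / pi_i is
# conditionally unbiased for l_i, i.e. that R_n(theta) is centered at
# R_{F_n}(theta) = n^{-1} sum_i l_i(theta).
rng = np.random.default_rng(1)
n, B = 200, 20000
l = rng.exponential(size=n)                # arbitrary fixed losses l_i(theta)
pi = np.clip(rng.random(n), 0.05, 1.0)     # arbitrary inclusion probabilities

R_Fn = l.mean()
reps = np.empty(B)
for b in range(B):
    delta = rng.random(n) < pi             # independent Bernoulli inclusion indicators
    reps[b] = np.mean(delta * l / pi)      # one draw of R_n(theta)
print(R_Fn, reps.mean())                   # the two agree up to Monte Carlo error
```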
Lemma A.2.
Under conditions A2 and C3, if $\tilde\theta\xrightarrow{p}\theta^*$, then $n^{-1}\sum_{i=1}^n\Delta_i G_i(\theta^*)/\pi_i\xrightarrow{p} A$.

Proof: Conditioning on $\mathcal F_n$, $\Delta_i G_i(\theta^*)/\pi_i$, $i = 1,\ldots,n$, are independent with $E\{\Delta_i G_i(\theta^*)/\pi_i\mid\mathcal F_n\} = G_i(\theta^*)$ and $E\{\|\Delta_i G_i(\theta^*)/\pi_i - G_i(\theta^*)\|^2\mid\mathcal F_n\} = (1/\pi_i - 1)\|G_i(\theta^*)\|^2$. By A2, the Chebyshev inequality, C3, and the weak law of large numbers, for any $\varepsilon > 0$, $P(\|n^{-1}\sum_{i=1}^n\Delta_i G_i(\theta^*)/\pi_i - n^{-1}\sum_{i=1}^n G_i(\theta^*)\| > \varepsilon\mid\mathcal F_n)\xrightarrow{p} 0$. Then, by the Helly-Bray theorem, we have that for any $\varepsilon > 0$, $P(\|n^{-1}\sum_{i=1}^n\Delta_i G_i(\theta^*)/\pi_i - n^{-1}\sum_{i=1}^n G_i(\theta^*)\| > \varepsilon)\to 0$, that is,
$$\Big\|\frac1n\sum_{i=1}^n\frac{\Delta_i}{\pi_i}G_i(\theta^*) - \frac1n\sum_{i=1}^n G_i(\theta^*)\Big\|\xrightarrow{p} 0. \qquad (10)$$
By the law of large numbers,
$$\Big\|\frac1n\sum_{i=1}^n G_i(\theta^*) - A\Big\|\xrightarrow{p} 0. \qquad (11)$$
Thus, by the triangle inequality, (10), and (11), we have $n^{-1}\sum_{i=1}^n\Delta_i G_i(\theta^*)/\pi_i\xrightarrow{p} A$. $\square$

Lemma A.3. Under condition C3, if $\tilde\theta\xrightarrow{p}\theta^*$, then $\tilde A\xrightarrow{p} A$.

Proof: Define $A(\theta) = E\{G(D;\theta)\}$. By the triangle inequality, we have that
$$\|\tilde A - A\|\leqslant\|\tilde A - A(\tilde\theta)\| + \|A(\tilde\theta) - A\|. \qquad (12)$$
Since $\tilde\theta\xrightarrow{p}\theta^*$, $\|\tilde A - A(\tilde\theta)\|\leqslant\sup_{\theta\in\mathcal N(\theta^*)}\|n^{-1}\sum_{i=1}^n G_i(\theta) - A(\theta)\|$ when $n$ is sufficiently large, where $\mathcal N(\theta^*)$ is a neighborhood of $\theta^*$. By C3 and the uniform law of large numbers, $\sup_{\theta\in\mathcal N(\theta^*)}\|n^{-1}\sum_{i=1}^n G_i(\theta) - A(\theta)\|\xrightarrow{p} 0$, and so does $\|\tilde A - A(\tilde\theta)\|$. By the continuity of $A(\theta)$ with respect to $\theta$, $\|A(\tilde\theta) - A\|\xrightarrow{p} 0$. Thus, by (12), $\tilde A\xrightarrow{p} A$. $\square$

Proof of Theorem 1:
1) Consistency: By C2 and the uniform convergence of $R_n(\theta)$ to $R(\theta)$ on $\Theta$ from Lemma A.1, we can show that $\hat\theta\xrightarrow{p}\theta^*$ by arguments similar to those in the proof of Theorem 5 of Fithian and Hastie (2014). The details are omitted here.

2) Asymptotic normality: Since $\sum_{i=1}^n\Delta_i g_i(\hat\theta)/\pi_i = 0$, by Taylor expansion, we have that
$$0 = \frac{1}{\sqrt n}\sum_{i=1}^n\frac{\Delta_i}{\pi_i}g_i(\theta^*) + \frac1n\sum_{i=1}^n\frac{\Delta_i}{\pi_i}G_i(\theta^*)\cdot\sqrt n(\hat\theta - \theta^*) + o_p\Big(\frac1n\sum_{i=1}^n\frac{\Delta_i}{\pi_i}G_i(\theta^*)\cdot\sqrt n(\hat\theta - \theta^*)\Big). \qquad (13)$$
Write the first term on the right-hand side of (13) as $\eta_{1n} + \eta_{2n}$, where
$$\eta_{1n} = \frac{1}{\sqrt n}\sum_{i=1}^n\frac{\Delta_i - \pi_i}{\pi_i}g_i(\theta^*)$$
and $\eta_{2n} = n^{-1/2}\sum_{i=1}^n g_i(\theta^*)$. It is easy to see that, conditioning on $\mathcal F_n$, $\eta_{1n}$ has zero mean and variance-covariance matrix equal to $n^{-1}\sum_{i=1}^n(1/\pi_i - 1)g_i(\theta^*)^{\otimes 2}$. From A3, this variance-covariance matrix converges in probability to $E\{(1/\pi - 1)g(D;\theta^*)^{\otimes 2}\}$, which is independent of $\mathcal F_n$. Therefore, $\eta_{1n}$ is asymptotically normal with mean zero and variance-covariance matrix $E\{(1/\pi - 1)g(D;\theta^*)^{\otimes 2}\}$. On the other hand, it is not difficult to see that $E(\eta_{1n} + \eta_{2n}\mid\mathcal F_n) = \eta_{2n}$, which is asymptotically normal with mean zero and variance-covariance matrix $E\{g(D;\theta^*)^{\otimes 2}\}$. Consequently, $\eta_{1n} + \eta_{2n}$ is asymptotically normal with mean zero and variance-covariance matrix $E\{(1/\pi - 1)g(D;\theta^*)^{\otimes 2}\} + E\{g(D;\theta^*)^{\otimes 2}\} = V_\pi$.

Combining the asymptotic normality of $n^{-1/2}\sum_{i=1}^n\Delta_i g_i(\theta^*)/\pi_i$, Lemma A.2, and (13), we can derive that $\sqrt n(\hat\theta - \theta^*) = O_p(1)$ and
$$\sqrt n(\hat\theta - \theta^*) = -A^{-1}\frac{1}{\sqrt n}\sum_{i=1}^n\frac{\Delta_i}{\pi_i}g_i(\theta^*) + o_p(1),$$
which is exactly (2). Using the Slutsky theorem, it follows that $\sqrt n(\hat\theta - \theta^*)$ is asymptotically normal with mean zero and variance-covariance matrix $A^{-1}V_\pi A^{-1}$. $\square$

Next we turn to prove the corollaries. For Corollary 1, the following condition is needed.

C5. $E\{\sup_{\theta\in\Theta}\|l(D;\theta)\|^2/\pi\} < \infty$ and $E\{\|G(D;\theta^*)\|^2/\pi\} < \infty$, where $\pi$ is defined in Proposition 1 or 2.
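Before turning to the corollaries, we note that, although the paper does not prescribe an implementation, the sandwich matrix $A^{-1}V_\pi A^{-1}$ of Theorem 1 can be estimated by plugging Horvitz-Thompson averages into $A$ and $V_\pi$. A minimal sketch for the least-squares loss, with illustrative names, is as follows.

```python
import numpy as np

def sandwich_variance(X_s, y_s, pi_s, theta_hat, n):
    """Plug-in estimate of the covariance n^{-1} A^{-1} V_pi A^{-1} of theta_hat
    for the least-squares loss l(d; theta) = (y - x'theta)^2 / 2, for which
    g_i = -(y_i - x_i'theta) x_i and G_i = x_i x_i'. Inputs are the subsampled
    rows X_s, responses y_s, inclusion probabilities pi_s, and full-data size n."""
    r = y_s - X_s @ theta_hat
    g = -r[:, None] * X_s                          # scores g_i on the subsample
    A_hat = (X_s / pi_s[:, None]).T @ X_s / n      # HT average: n^{-1} sum_i G_i / pi_i
    V_hat = (g / pi_s[:, None] ** 2).T @ g / n     # HT average of g_i g_i' / pi_i
    A_inv = np.linalg.inv(A_hat)
    return A_inv @ V_hat @ A_inv / n
```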
Proof of Corollary 1: The result can be proved from Theorem 1 if one shows that C1-C5 imply A1-A3. We take $\pi_i = (c\|\tilde A^{-1/2}g_i(\tilde\theta)\|)\wedge 1$ as an illustration; the proof for the other form of $\pi_i$ is of the same spirit. Note that $\pi_i$ can be regarded as a function of $\tilde A$, $\tilde\theta$, and the $i$th data point. We write $\pi_i$ as $\pi(d_i;\tilde A,\tilde\theta)$, where $\pi(d_i;\tilde A,\tilde\theta) = (c\|\tilde A^{-1/2}g_i(\tilde\theta)\|)\wedge 1$; correspondingly, $\pi = (c\|A^{-1/2}g(D;\theta^*)\|)\wedge 1 = \pi(D;A,\theta^*)$. By C3, it is easy to see that $E\{\|g(D;\theta^*)\|^2/\pi\} < \infty$. Let $V_n(\mathcal A,\theta) = n^{-1}\sum_{i=1}^n\{1/\pi(d_i;\mathcal A,\theta) - 1\}g_i(\theta^*)^{\otimes 2}$ and $V(\mathcal A,\theta) = E[\{1/\pi(D;\mathcal A,\theta) - 1\}g(D;\theta^*)^{\otimes 2}]$, where $\mathcal A$ is a $p\times p$ matrix. By the triangle inequality, we have that
$$\|V_n(\tilde A,\tilde\theta) - V(A,\theta^*)\|\leqslant\|V_n(\tilde A,\tilde\theta) - V(\tilde A,\tilde\theta)\| + \|V(\tilde A,\tilde\theta) - V(A,\theta^*)\|. \qquad (14)$$
Since $\tilde\theta\xrightarrow{p}\theta^*$ and $\tilde A\xrightarrow{p} A$ by Lemma A.3, the first term on the right-hand side of (14) is no larger than $\sup_{(\mathcal A,\theta)\in\mathcal N(A,\theta^*)}\|V_n(\mathcal A,\theta) - V(\mathcal A,\theta)\|$ when $n$ is sufficiently large, where $\mathcal N(A,\theta^*)$ is a neighborhood of $(A,\theta^*)$. By C3 and the uniform law of large numbers, $\sup_{(\mathcal A,\theta)\in\mathcal N(A,\theta^*)}\|V_n(\mathcal A,\theta) - V(\mathcal A,\theta)\|\xrightarrow{p} 0$. Moreover, $\|V(\tilde A,\tilde\theta) - V(A,\theta^*)\|\xrightarrow{p} 0$ by the continuity of $V(\mathcal A,\theta)$ with respect to $(\mathcal A,\theta)$. Thus, we have $V_n(\tilde A,\tilde\theta)\xrightarrow{p} V(A,\theta^*)$, which means that A3 holds. By similar arguments, we can show that $n^{-1}\sum_{i=1}^n\{\sup_{\theta\in\Theta}l_i(\theta)\}^2/\pi_i\xrightarrow{p} E\{\sup_{\theta\in\Theta}\|l(D;\theta)\|^2/\pi\}$ and $n^{-1}\sum_{i=1}^n\|G_i(\theta^*)\|^2/\pi_i\xrightarrow{p} E\{\|G(D;\theta^*)\|^2/\pi\}$. Then C5 implies that A1 and A2 hold. $\square$
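For concreteness, under the least-squares loss the probabilities $\pi_i = (c\|\tilde A^{-1/2}g_i(\tilde\theta)\|)\wedge 1$ taken in the proof above can be computed as in the following sketch (illustrative names; the symmetric inverse square root is formed by eigendecomposition).

```python
import numpy as np

def optimal_probs(X, y, theta_tilde, c):
    """Sketch of the sampling probabilities used in the proof above,
    pi_i = (c * ||A_tilde^{-1/2} g_i(theta_tilde)||) ∧ 1, for the
    least-squares loss, where g_i = -(y_i - x_i'theta) x_i and
    A_tilde = n^{-1} sum_i x_i x_i'."""
    A_tilde = X.T @ X / len(y)
    vals, vecs = np.linalg.eigh(A_tilde)                  # A_tilde symmetric positive definite
    A_inv_half = vecs @ np.diag(vals ** -0.5) @ vecs.T    # symmetric inverse square root
    g = -(y - X @ theta_tilde)[:, None] * X
    surprise = np.linalg.norm(g @ A_inv_half, axis=1)     # ||A_tilde^{-1/2} g_i||
    return np.minimum(c * surprise, 1.0)
```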
Proof of Corollary 2: When the covariate vector $X$ does not concentrate on a hyperplane of dimension smaller than $q$ and $E(\|X\|^2/\pi) < \infty$, conditions C3-C5 hold. When the logistic model is correctly specified, it can be shown that $\Sigma_{\mathrm{full}} = \{E[p(\theta_0^\top Z)\{1 - p(\theta_0^\top Z)\}ZZ^\top]\}^{-1}$. From Theorem 1, when $\pi_i = |y_i - p(\tilde\theta^\top z_i)|$ and $\tilde\theta\xrightarrow{p}\theta_0$, $\sqrt n(\hat\theta - \theta_0)\xrightarrow{d} N(0, A^{-1}V_\pi A^{-1})$, where $A = \Sigma_{\mathrm{full}}^{-1}$ and, in $V_\pi$, $g(Y,X;\theta_0) = \{Y - p(\theta_0^\top Z)\}Z$ and $\pi = |Y - p(\theta_0^\top Z)|$. Thus, to obtain the conclusion, it suffices to show that $V_\pi = 2\Sigma_{\mathrm{full}}^{-1}$. It is easy to see that
$$V_\pi = E\Big\{\frac{g(Y,X;\theta_0)^{\otimes 2}}{\pi}\Big\} = E\Big[\frac{\{Y - p(\theta_0^\top Z)\}^2}{|Y - p(\theta_0^\top Z)|}ZZ^\top\Big] = E\{|Y - p(\theta_0^\top Z)|ZZ^\top\}.$$
For a 0-1 binary random variable $Y$ with $E(Y) = p$, it is easy to see that $E(|Y - p|) = 2\mathrm{Var}(Y) = 2p(1 - p)$. Thus, we have that
$$E\{|Y - p(\theta_0^\top Z)|ZZ^\top\} = E\big[E\{|Y - p(\theta_0^\top Z)|\mid Z\}ZZ^\top\big] = 2E[p(\theta_0^\top Z)\{1 - p(\theta_0^\top Z)\}ZZ^\top] = 2\Sigma_{\mathrm{full}}^{-1}.$$
The conclusion follows immediately. $\square$

To prove Corollary 3, we need another condition.

C6. $\bar\theta$ is an interior point of $\Theta$, the matrix $\bar A$ is positive definite, $E\{\sup_{\theta\in\Theta}\|l(D;\theta)\|^2/\bar\pi\} < \infty$, $E\{\|g(D;\theta^*)\|^2/\bar\pi\} < \infty$, and $E\{\|G(D;\theta^*)\|^2/\bar\pi\} < \infty$.

Proof of Corollary 3:
We still take $\pi_i = (c\|\tilde A^{-1/2}g(d_i;\tilde\theta)\|)\wedge 1$ as an illustration.

1) Consistency: The proof of the consistency of $\hat\theta$ follows exactly the same steps as those in Theorem 1, so we skip the details.

2) Asymptotic normality: The proof of the asymptotic normality of $\hat\theta$ also follows steps similar to those in Theorem 1. The main difference lies in that if $\tilde\theta\xrightarrow{p}\bar\theta$, by C6, we can show that $\tilde A\xrightarrow{p}\bar A$ using arguments similar to those in the proof of Lemma A.3. Meanwhile, it can also be shown similarly that $V_n(\tilde A,\tilde\theta)\xrightarrow{p} V(\bar A,\bar\theta) = V_{\bar\pi}$. Thus, $n^{-1/2}\sum_{i=1}^n\Delta_i g_i(\theta^*)/\pi_i$ is asymptotically normal with mean zero and variance-covariance matrix $V_{\bar\pi}$. Lemma A.2 still holds here. It then follows that $\sqrt n(\hat\theta - \theta^*)$ is asymptotically normal with mean zero and variance-covariance matrix $A^{-1}V_{\bar\pi}A^{-1}$. $\square$

Acknowledgment
The authors thank Professor Cheng Zhang and Pengfei Ma for providing the micro-blog data. The research of Wen Yu was supported by the National Natural Science Foundation of China Grant 11671097.
References

[1] Ai, M., Yu, J., Zhang, H., and Wang, H. (2018). Optimal subsampling algorithms for big data generalized linear models. arXiv:1806.06761v1.
[2] Anderson, J. A. (1972). Separate sample logistic discrimination. Biometrika, 19-35.
[3] Breslow, N. E., Day, N. E. et al. (1980). Statistical Methods in Cancer Research. The Analysis of Case-Control Studies 1. Distributed for IARC by WHO, Geneva, Switzerland.
[4] Branco, P., Torgo, L., and Ribeiro, P. R. (2015). A survey of predictive modelling under imbalanced distributions. arXiv:1505.01658v2.
[5] Chen, K. (2001). Generalized case-cohort sampling. Journal of the Royal Statistical Society B, 791-809.
[6] Chen, K. and Lo, S-H. (1999). Case-cohort and case-control analysis with Cox's model. Biometrika, 755-764.
[7] Fithian, W. and Hastie, T. (2014). Local case-control sampling: efficient subsampling in imbalanced data sets. The Annals of Statistics, 1693-1724.
[8] Horvitz, D. G. and Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 663-685.
[9] Huber, P. J. (2011). Robust Statistics. Springer, Berlin.
[10] Ma, P., Mahoney, M., and Yu, B. (2015). A statistical perspective on algorithmic leveraging. Journal of Machine Learning Research, 861-911.
[11] Mantel, N. and Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 719-748.
[12] Miettinen, O. S. (1976). Estimability and estimation in case-referent studies. American Journal of Epidemiology, 226-235.
[13] Prentice, R. L. (1986). A case-cohort design for epidemiologic cohort studies and disease prevention trials. Biometrika, 1-11.
[14] Prentice, R. L. and Pyke, R. (1979). Logistic disease incidence models and case-control studies. Biometrika, 403-411.
[15] Thomas, D. C. (1977). Addendum to "Methods of cohort analysis: appraisal by application to asbestos mining," by Liddell, F. D. K., McDonald, J. C. and Thomas, D. C. Journal of the Royal Statistical Society A, 469-490.
[16] Wang, H., Yang, M., and Stufken, J. (2018). Information-based optimal subdata selection for big data linear regression. Journal of the American Statistical Association, doi: 10.1080/01621459.2017.1408468.
[17] Wang, H., Zhu, R., and Ma, P. (2018). Optimal subsampling for large sample logistic regression. Journal of the American Statistical Association, doi: 10.1080/01621459.2017.1292914.
[18] Webb, S., Caverlee, J., and Pu, C. (2006). Introducing the Webb Spam Corpus: Using email spam to identify web spam automatically. In Proceedings of the Third Conference on Email and Anti-Spam (CEAS). CEAS, Mountain View, CA.
[19] Yao, Y., Yu, W., and Chen, K. (2017). End-point sampling. Statistica Sinica, 27.