Enhanced Cube Implementation For Highly Stratified Population
EEnhanced Cube Implementation
For Highly Stratified Population
Raphaël Jauslin a , Esther Eustache a and Yves Tillé a Abstract
A balanced sampling design should always be the adopted strategies ifauxiliary information is available. Besides, integrating a stratified struc-ture of the population in the sampling process can considerably reduce thevariance of the estimators. We propose here a new method to handle theselection of a balanced sample in a highly stratified population. The methodimproves substantially the commonly used sampling design and reduce thetime-consuming problem that could arise if inclusion probabilities withinstrata do not sum to an integer.
Key words : balanced sampling, clustered sampling, auxiliary information,unequal probability sampling a Institute of statistics, University of Neuchâtel, Bellevaux 51, 2000 Neuchâtel, Switzer-land (E-mail: [email protected]) a r X i v : . [ s t a t . M E ] J a n Introduction
In survey statistics, balanced sampling is a particularly efficient method whenvalues of auxiliary variables are available for all units in the population. The ideais to select the sample so that the totals of the Horvitz-Thompson estimators ofsome auxiliary variables equal the population totals. There are different methodsfor selecting a balanced sample. Deville and Tillé (2004) have proposed the cubemethod which successively transforms the vector of inclusion probabilities into asample. The method has been improved by Chauvet and Tillé (2006) by reducingthe computation time.In many areas, it is very useful to use stratified sampling designs. As alreadyindicated by Neyman (1934), the variance of the Horvitz-Thompson estimatorcan be reduced by constructing strata such that the variables are homogeneouswithin the strata. Besides, Chauvet (2009) proposed a specific algorithm to obtainbalanced samples in the strata of a population. However, this method becomescumbersome when the number of strata is large.A highly stratified population is very common in survey sampling. For exam-ple, it may be necessary to select individuals from a population while requiringthat at most only one individual from each household in a population is taken.Each household is then a stratum. In spatial statistics, one can also constructsmall strata of neighboring units to obtain well-spread samples. Highly stratifiedsampling is also necessary for some donor imputation methods: the objective isto select a respondent for each non-respondent to impute its values. Each non-respondent then defines a stratum from which a respondent is to be selected (Haslerand Tillé, 2016).The balanced and stratified sampling method of Chauvet (2009) has been im-proved by Hasler and Tillé (2014) to partially resolve the disadvantage of the timerequired to process a highly stratified population. When the sum of the inclusionprobabilities in the strata is not an integer, the computation time can becomeproblematic. This problem arises, for example, when the objective is to select lessthan one individual per household. Neither of the two methods already proposedsolves the computational time problem in these situations.In this paper, we propose a new method to obtain a stratified balanced sample.This new method is particularly interesting when the population is highly stratifiedand the inclusion probabilities do not sum to an integer within the strata. We referreaders to Tillé (2020) and Hankin et al. (2019) to have more information on thegeneral settings on stratified balanced sampling design.The document is organized as follows. The section 2 gives the basic notationsand settings. Section 3 present the problem of selecting a balanced sample. Inthe section 4, we review the cube method and how it is used to select a balancedsample. In section 5, we discuss the issue of the highly stratified population and2eview the methods used to select a sample in this case. In the section 6, wepresent the new method and the section 7 is devoted to variance estimation. Inthe section 8, we give the simulation results of the different algorithms on anartificial dataset while the section 9 gives a conclusion on the new method.
Consider a finite population U of size N whose units can be defined by labels k ∈ { , , . . . , N } . Let define a variable of interest y . Suppose that we are tryingto estimate the following unknown total: Y = (cid:88) k ∈ U y k . (1)A sampling design is defined by the probability p ( s ) of selecting each possiblesubset s ⊂ U such that (cid:80) s ⊂ U p ( s ) = 1 . Consider a vector a = ( a , . . . , a N ) (cid:62) thatmaps elements of a subset s to an N vector of 0s and 1s such that: a k = (cid:26) if k ∈ s, otherwise , for k ∈ U . For each unit of the population, the inclusion probability π k , with ≤ π k ≤ , is defined as the probability of selecting k into a sample s : π k = P ( k ∈ s ) = E ( a k ) = (cid:88) s ⊂ U | k ∈ s p ( s ) . Let π = ( π , . . . , π N ) (cid:62) be the vector of all the inclusion probabilities. Letalso π k(cid:96) be the probability of selecting units k and (cid:96) together in the sample, with π kk = π k . Assuming that π k > for all k ∈ U , the total (1) can be estimatedusing the classical unbiased Horvitz-Thompson estimator defined by (cid:98) Y = (cid:88) k ∈ U y k a k π k . (2) Usually, some auxiliary information are available for each unit k ∈ U in a vector x k = ( x k , . . . , x kq ) (cid:62) ∈ R q , with q ∈ N ∗ . A sampling design is said to be bal-anced on the q auxiliary variables if and only if it satisfies the following balancingequation: (cid:98) X = (cid:88) k ∈ s x k π k = (cid:88) k ∈ U x k a k π k = (cid:88) k ∈ U x k = X . ψ ∈ R q such that ψ (cid:62) x k = π k , forall k ∈ U . Indeed, this gives (cid:88) k ∈ s ψ (cid:62) x k π k = (cid:88) k ∈ s π k π k = n. The size of the sample will be fixed only if n is an integer. If it is not the case, thesample size will be equal to the higher or lower integer to n .More generally, the problem of selecting a balanced sample is written as thefollowing linear system : (cid:26) A (cid:62) a = A (cid:62) π , a ∈ { , } N , (3)where A = ( x /π , . . . , x N /π N ) (cid:62) . The aim consists then of obtaining a sample a that satisfies (or approximately satisfies) the constraints.Suppose that the population U is divided into H strata U , . . . , U H , with re-spective sizes of N , . . . , N H . The strata form a partition and respect the followingproperties: U = H (cid:91) h =1 U h , N h > , U h ∩ U (cid:96) = ∅ , for all h, (cid:96) ∈ { , . . . , H } Then, this implies that (cid:80) Hh =1 N h = N . The inclusion probabilities sum to a value n h in each stratum h , i.e. n h = (cid:80) k ∈ U h π k . Let h = ( h , . . . , h N ) (cid:62) be a categoricalvector that specifies the stratum to which each unit belongs. For example, h k = (cid:96) means that unit k belongs to strata U (cid:96) , with k ∈ U and (cid:96) ∈ { , . . . , H } . Anotherway for expressing the stratum of each unit is to use the disjunctive form. Let H be the disjunctive matrix of the corresponding vector h of size N × H , such that: H = (cid:0) ( U ) , . . . , ( U H ) (cid:1) , where ( U h ) ∈ R N is an column vector such that its k th element is equal to 1 ifthe unit k belongs to the stratum U h and 0 otherwise.Obtaining a balanced sample in a stratified population is equivalent to addingstratification constraints to the previous linear system (3). These constraints arecontained in the matrix H , so the modification of the linear problem gives: (cid:26) ( H A ) (cid:62) a = ( H A ) (cid:62) π , a ∈ { , } N . (4)4he number of constraints in the linear problem is then ( q + H ) . In the nextsection, a method to select a balanced sample is presented. Deville and Tillé (2004) developed the cube method that selects a balanced samplerespecting the inclusion probabilities. The method can deal with equal or unequalinclusion probabilities. The algorithm is separated into two phases.• The first phase is called the flight phase. It modifies recursively and randomlythe vector of inclusion probabilities π into a sample by respecting exactly thebalancing constraints of the problem. The subspace induced by the linearsystem (3) could be rewritten using the following notation: A = (cid:8) a ∈ R N | A (cid:62) a = A (cid:62) π (cid:9) = π + Null ( A (cid:62) ) , where Null ( A (cid:62) ) = (cid:8) u ∈ R N | A (cid:62) u = 0 (cid:9) . The idea is then to use a vector u of the null space of A (cid:62) in order to update randomly the vector π . Thewhole procedure of the update can be found in Deville and Tillé (2004). Ateach step, at least one component is set to 0 or 1. Matrix A is updated withthe new inclusion probabilities. This step is repeated until the null space of A (cid:62) is empty. At the end of the flight phase, the final updated vector of π contains at most q elements that are still not equal to 0 or 1.• The second phase is called the landing phase. This phase allows to obtainthe sample a that respects as much as possible the balancing constraints.There are two different ways to achieve it, by relaxing the q constraints oneby one, or by linear programming.In the flight phase, the major computational cost comes from the research of avector in the null space of A (cid:62) . Chauvet and Tillé (2006) have improved this time-consuming inconvenience using a sub-matrix of A rather than the entire matrix.The idea is to consider a submatrix that has one more row than the number ofcolumns to ensure to have at least one vector in its null space. This submatrix,denoted by B , has then a size of ( q +1) × q , with respect to q < N and Rank ( B ) ≤ q .The interest of using this submatrix comes from the following result: a vector u of Null ( B (cid:62) ) completed by ( N − ( q + 1)) zeros is a vector of Null ( A (cid:62) ) . Withthis idea, all the computations can be done using only a submatrix B . Usually, N is much greater than q , the size of B is then much smaller than A . This impliesobviously an important gain of computational time. The method proposed inthis paper uses the same idea. In the next section, the particular case of highlystratified sampling is considered. 5 Highly stratified population
It is always preferable to consider a stratified population in order to estimatethe total (1). Indeed, the variance of the Horvitz-Thompson estimator (2) canbe considerably reduced compared to the non-stratified estimator (1). However,when the population is highly stratified (i.e. H is very large), the selection ofa balanced sample with classical methods becomes difficult due to the too largenumber of constraints in H . In order to decrease the time-consuming problem,different approaches have already been proposed.Chauvet (2009) has developed an algorithm to select a balanced sample in ahighly stratified population. Firstly, a flight phase is applied inside each stratum.This allows modifying the inclusion probabilities such that these are as balanced aspossible in each stratum. Next, a flight phase is applied on the whole population.Finally, a landing phase is carried out on units that are not still selected or rejected.This procedure has the advantage to be simple to implement. Its major deficiencyis when the number of strata H becomes too large, the procedure remains veryslow and often cannot even be used.Hasler and Tillé (2014) have proposed another method to deal with highlystratified population. As the previous method, it begins by applying the flightphase of the cube method to each stratum of the population. Next, it carries outa flight phase on an union of strata by adding another stratum at each step. Bydoing this, strata are managed one after the other and the inclusion probabilitiesof certain strata are set to 0 or 1 during this step. The idea behind this procedureis to reduce the matrix H considered because some strata are removed from thematrix when all its units are selected or rejected. At the end, a landing phase isapplied. However, if n h is not equal to an integer for a stratum U h , this method alsoremains very time-consuming. Indeed, some strata are never completely removedduring the procedure and then the submatrix of H considered becomes too large.The properties of the cube method imply that the inclusion probabilities aresatisfied and that the sample is balanced on the auxiliary variables in these twomethods. However, they still have difficulty to deal with all the situations of highlystratified sampling. In the next following section, a new method is presented inorder to completely resolve these drawbacks. In the fast implementation of the cube method (Chauvet and Tillé, 2006), themain modification was to use a matrix smaller than A to update π . This allows toconsiderably decreasing the computational cost. The idea of our method is similarbut adapted to a stratified population: consider a matrix of constraints B smaller6han ( H A ) during the use of the cube method.The submatrix matrix B must be found at each step of the flight phase ofthe cube method. As explained in Section 3, the number of balancing constraintsdepends on the number of strata H when the population is stratified. By con-sidering a matrix B with fewer rows, or units, the corresponding vector of strata h will be reduced. This subvector of h will contain fewer categories and thenthe corresponding matrix H will have fewer columns. The number of constraintswill therefore depend on the rows of B . This is why obtaining the matrix B withexactly one row more than its number of columns is not as easy as with an unstrat-ified population. Algorithm 1 shows how to find the number of rows to considerto obtain the smaller matrix B such that B has exactly one row more than itsnumber of columns. Algorithm 1
Find the sub-matrix B of ( H A ) Let q be the number of auxiliary variables of A . Initialize q by q . For t =1 , , , . . . repeat the following steps:1. Extract the first q t rows of the vector h and denote it h t .2. Denote H t the number of different strata in h t .3. Update q t +1 = q + H t + 1 .while q t +1 > q t .Finally, B is defined as the q t first rows of the concatenated matrix ( H t A t ) , where A t and H t are the submatrix containing only its q t first rows. Example 6.1
Suppose that q = 2 and that the categorical vector is equal to h = (1 , , , , , , , , (cid:62) . We obtain t = 1 : q = 2 , h = (1 , (cid:62) , H = 1 → q = 2 + 1 + 1 = 4 ,t = 2 : q = 4 , h = (1 , , , (cid:62) , H = 2 → q = 2 + 2 + 1 = 5 ,t = 3 : q = 5 , h = (1 , , , , (cid:62) , H = 3 → q = 2 + 3 + 1 = 6 ,t = 4 : q = 6 , h = (1 , , , , , (cid:62) , H = 3 → q = 2 + 3 + 1 = 6 ,t = 5 : q = q B contains then q = 6 rows and 2 + 3 = 5 columns. So it is a matrix with onlyone more rows than the number of columns as desired. The matrix B is found after having computed its number of rows q t usingAlgorithm 1. The first q t elements of h composed the strata membership vector h t . The disjunctive matrix H t can then be found using h t . The matrix B is equal7o ( H t A t ) , with A t the submatrix of A containing only its q t first rows. Thesame procedure proposed by Chauvet and Tillé (2006) can then be applied. If thepopulation is highly stratified and the number of auxiliary variables is acceptable,our procedure can be very efficient. Moreover, it handles inclusion probabilitiesthat do not sum to an integer inside strata. Algorithm 2 presents the wholemethod. The variance can be approximated using the method proposed by Deville and Tillé(2005). Let the vector z k = ( H A ) k , where ( H A ) k denote the k th row of the matrix ( H A ) . The variance of theHorvitz-Thompson estimator of the total (cid:98) Y can be approximated byvar app ( (cid:98) Y ) = (cid:88) k ∈ U c k (cid:18) y k π k − α (cid:62) z k (cid:19) , (5)where c k = π k (1 − π k ) NN − ( H + q ) and α = (cid:32)(cid:88) (cid:96) ∈ U c (cid:96) z (cid:96) z (cid:62) (cid:96) (cid:33) − (cid:88) (cid:96) ∈ U c (cid:96) z (cid:96) y (cid:62) (cid:96) π (cid:96) . (6)There exists many different ways to express the quantity c k and then thisleads to various approximations of the variance. Value c k can in particular beapproximated by (cid:101) c k = (1 − π k ) nn − ( H + q ) . Equation (5) can be estimated on a sample s using (cid:101) c k instead of c k and by replacingthe sums on U by sums on s in Expressions (5) and (6). In this section, the performance of the method is evaluated on real data producedby the Swiss Federal Statistical Office (2020). The dataset contains information onSwiss establishments. We restrict the study to the Switzerland region called EspaceMittelland (a region of the second degree of the Nomenclature of Territorial Unitsfor Statistics (NUTS) of Switzerland). This region contains 5 cantons (a region ofthe third degree of the NUTS) and 675 municipalities. For confidentiality reasons,8 lgorithm 2
Consider π the N vector of inclusion probabilities such that < π k < , for k ∈ { , . . . , N } .I. Perform a flight phase on each stratum according to the inclusion probabil-ities π and the balancing constraints in A (cid:62) . The vector π is updated by π such that some of its elements are set to 0 or 1. Compute the set ofindices i ⊂ { , . . . , N } containing the unit indices that have an inclusionprobability still not equal to 0 or 1.II. Initialize t by 1. Repeat step 1. to 6. until it is no more possible to find thematrix B or until the vector u is null.1. In A , h and π , consider only units with indices in i t .2. Apply the Algorithm 1 to find the submatrix B of ( H A ) .3. Compute u , a vector of the null space of B completed by 0s to obtaina vector with the same size as i t .4. Compute λ > and λ > , the two greater values such that (cid:54) π tk + λ u k (cid:54) (cid:54) π tk − λ u k (cid:54) , for all k ∈ i t .
5. Update π t by: π t +1 = (cid:26) π t + λ u with probability λ / ( λ + λ ) , π t − λ u with probability λ / ( λ + λ ) .
6. Update t by t + 1 and update i t the set of indices containing the unitthat have an inclusion probability still not equal to 0 or 1.III. It could remain some units that are still not rejected or selected. Perform alanding phase by suppression of variables on the balancing variables ( H A ) on the remaining indices i t . 9he units considered are the hectares of land in which at least one establishment islocated. In order to be able to estimate the variance, only 3 hectares of land permunicipalities are included in the study. This implies that the dataset containsinformation from 2025 hectares including at least one establishment.We stratify the units in two different ways: by cantons and by municipalities.The number of strata is then respectively equal to H c = 5 and H m = 675 . Figure1 shows the dataset with the two proposed stratification. The idea behind thisprocedure is to compare the execution time for a stratified population with alow number of strata versus a high one. To compare the method, we will usebalancing variables x j containing the number of women employed in a sector j ,with j = 1 , , . Each sector represents a type of activity: sector j = 1 involves thenatural environment and agriculture, sector j = 2 is manufacture and sector j = 3 is related to services. Table 1 shows the mean time execution of three methods forhighly stratified sampling: the methods of Hasler and Tillé (2014) and Chauvet(2009), detailed in Section 5, and the new one presented in this article.Inside each stratum, the inclusion probabilities are equal. For each stratifica-tion, we carry out two different sampling: one with inclusion probabilities thatsum to an integer number within each stratum and one with a non-integer sum.Then, we consider n h = 80 and n h = 80 . , h ∈ { , . . . , H c } , for the first stratifi-cation. For the second one, we take n h = 2 and n h = 1 . , h ∈ { , . . . , H m } . Wechoose to deal also with non-integers n h with the aim to compare the impact ofthis situation on the mean sampling time.Chauvet’s method cannot be compared because its execution time is too longand should be avoided for highly stratified population. However, it remains veryefficient if the number of strata is acceptable. If inclusion probabilities in stratumsum to an integer, the Hasler’s method performs very well. However, the executiontime increases strongly when n h is not integer. The proposed method has a verywell behaved and the time is considerably reduced for a highly stratified population.In order to compare the variance of the method with the others, we estimatethe variance using some variables of interest y j that contains the total number ofemployee of the sector j , j = 1 , , . In Table 1, we compare the approximatedvariance (5), the estimated variance and the simulated variance computing usingthe equation: v sim = 1 m (cid:88) s { (cid:98) Y ( s ) − Y } , (7)where m is the number of simulations.For each method, we vary the number of selected units within each stratum bytaking n h equal to 2 for the stratification with H c strata and 80 for the stratificationwith H m strata. This implies sample of size respectively equal to 400 and 1350. Thevariance estimator seems to be unbiased for the approximated variance. However,10e see that the approximated variance and estimator are slightly biased to the v i .This coming from the landing phase of each method. We can conclude that theproposed method is comparable in terms of variance to other methods.Table 1: Results of 1000 simulations on the Swiss establishments dataset. Thepopulation of size is equal to 2025. We compute the mean time execution insecondes of each sampling procedure. We vary the number of strata H and thenumber of unit selected within each strata n h . AlgorithmProposed method Hasler’s method Chauvet’s methodCantons ( H = 5 ) n h = 80 n h = 80 . H = 675 ) n h = 2 n h = 1 . Table 2: Results of 1000 simulations on a population of size is equal to 2025. Thenumber of strata is equal to 5 for Cantons and 675 for the Municipalities. For eachvariable of interest y j , j = 1 , , and for each sampling methods, we compute thevariance approximated by the simulations (7) , approximated variance (5) as wellas the variance estimator (6) AlgorithmProposed method Hasle’s method Chauvet’s method v sim (cid:99) var ( (cid:98) Y ) var app ( (cid:98) Y ) v sim (cid:99) var ( (cid:98) Y ) var app ( (cid:98) Y ) v sim (cid:99) var ( (cid:98) Y ) var app ( (cid:98) Y ) Cantons ( H = 5) y y y ( H = 675) y y y Conclusion
The stratified sampling procedure is a well-known and appropriate procedure toreduce the variance of the Horvitz-Thompson estimator. In this paper, we proposea new method and implementation that provides an excellent executive time andflexibility that the existing methods did not allow. In many surveys where thepopulation is stratified, the sum of inclusion probabilities within each stratum isnot an integer. Other methods are not directly applicable in this case. We haveshown by mean of simulations that the variance of the estimator is not impactedby our method. All of these results indicate that our proposed algorithm is veryefficient to select a sample in a stratified and highly stratified population.
References
Chauvet, G. (2009). Stratified balanced sampling.
Survey Methodology , 35:115–119.Chauvet, G. and Tillé, Y. (2006). A fast algorithm of balanced sampling.
Journalof Computational Statistics , 21:9–31.Deville, J.-C. and Tillé, Y. (2004). Efficient balanced sampling: The cube method.
Biometrika , 91:893–912.Deville, J.-C. and Tillé, Y. (2005). Variance approximation under balanced sam-pling.
Journal of Statistical Planning and Inference , 128:569–591.Hankin, D., Mohr, M., and Newman, K. (2019).
Sampling Theory: For the Eco-logical and Natural Resource Sciences . Oxford University Press, New York.Hasler, C. and Tillé, Y. (2014). Fast balanced sampling for highly stratified pop-ulation.
Computational Statistics and Data Analysis , 74:81–94.Hasler, C. and Tillé, Y. (2016). Balanced k -nearest neighbor imputation. Statistics ,105:11–23.Neyman, J. (1934). On the two different aspects of the representative method: Themethod of stratified sampling and the method of purposive selection.
Journalof the Royal Statistical Society , 97:558–606.Swiss Federal Statistical Office (2020).
Statistique structurelle des entreprises(STATENT) Description des données GEOSTAT . Neuchâtel, Switzerland.Tillé, Y. (2020).
Sampling and estimation from finite populations . Wiley, NewYork. 12 . ◦ N . ◦ N . ◦ N . ◦ N . ◦ N . ◦ N . ◦ E . ◦ E . ◦ E . ◦ E . ◦ E Longitude L a t i t ud e N h
93 159 327 408 1038
Cantons . ◦ N . ◦ N . ◦ N . ◦ N . ◦ N . ◦ N . ◦ E . ◦ E . ◦ E . ◦ E . ◦ E Longitude L a t i t ud e N h Municipalities
Figure 1: Data extract from the Swiss establishement data base of the SwissFederal Statistical Office Swiss Federal Statistical Office (2020). The data arerestricted to the NUTS region 2. The left plot is showing the separation by Canton H c = 5 , the right one the separation by Municipality H m = 675= 675