[PDF] Enhanced Cube Implementation For Highly Stratified Population

Abstract

A balanced sampling design should always be the adopted strategies if auxiliary information is available. Besides, integrating a stratified structure of the population in the sampling process can considerably reduce the variance of the estimators. We propose here a new method to handle the selection of a balanced sample in a highly stratified population. The method improves substantially the commonly used sampling design and reduces the time-consuming problem that could arise if inclusion probabilities within strata do not sum to an integer.

Full PDF

EEnhanced Cube Implementation

For Highly Stratiﬁed Population

Raphaël Jauslin a , Esther Eustache a and Yves Tillé a Abstract

A balanced sampling design should always be the adopted strategies ifauxiliary information is available. Besides, integrating a stratiﬁed struc-ture of the population in the sampling process can considerably reduce thevariance of the estimators. We propose here a new method to handle theselection of a balanced sample in a highly stratiﬁed population. The methodimproves substantially the commonly used sampling design and reduce thetime-consuming problem that could arise if inclusion probabilities withinstrata do not sum to an integer.

Key words : balanced sampling, clustered sampling, auxiliary information,unequal probability sampling a Institute of statistics, University of Neuchâtel, Bellevaux 51, 2000 Neuchâtel, Switzer-land (E-mail: [email protected]) a r X i v : . [ s t a t . M E ] J a n Introduction

In survey statistics, balanced sampling is a particularly eﬃcient method whenvalues of auxiliary variables are available for all units in the population. The ideais to select the sample so that the totals of the Horvitz-Thompson estimators ofsome auxiliary variables equal the population totals. There are diﬀerent methodsfor selecting a balanced sample. Deville and Tillé (2004) have proposed the cubemethod which successively transforms the vector of inclusion probabilities into asample. The method has been improved by Chauvet and Tillé (2006) by reducingthe computation time.In many areas, it is very useful to use stratiﬁed sampling designs. As alreadyindicated by Neyman (1934), the variance of the Horvitz-Thompson estimatorcan be reduced by constructing strata such that the variables are homogeneouswithin the strata. Besides, Chauvet (2009) proposed a speciﬁc algorithm to obtainbalanced samples in the strata of a population. However, this method becomescumbersome when the number of strata is large.A highly stratiﬁed population is very common in survey sampling. For exam-ple, it may be necessary to select individuals from a population while requiringthat at most only one individual from each household in a population is taken.Each household is then a stratum. In spatial statistics, one can also constructsmall strata of neighboring units to obtain well-spread samples. Highly stratiﬁedsampling is also necessary for some donor imputation methods: the objective isto select a respondent for each non-respondent to impute its values. Each non-respondent then deﬁnes a stratum from which a respondent is to be selected (Haslerand Tillé, 2016).The balanced and stratiﬁed sampling method of Chauvet (2009) has been im-proved by Hasler and Tillé (2014) to partially resolve the disadvantage of the timerequired to process a highly stratiﬁed population. When the sum of the inclusionprobabilities in the strata is not an integer, the computation time can becomeproblematic. This problem arises, for example, when the objective is to select lessthan one individual per household. Neither of the two methods already proposedsolves the computational time problem in these situations.In this paper, we propose a new method to obtain a stratiﬁed balanced sample.This new method is particularly interesting when the population is highly stratiﬁedand the inclusion probabilities do not sum to an integer within the strata. We referreaders to Tillé (2020) and Hankin et al. (2019) to have more information on thegeneral settings on stratiﬁed balanced sampling design.The document is organized as follows. The section 2 gives the basic notationsand settings. Section 3 present the problem of selecting a balanced sample. Inthe section 4, we review the cube method and how it is used to select a balancedsample. In section 5, we discuss the issue of the highly stratiﬁed population and2eview the methods used to select a sample in this case. In the section 6, wepresent the new method and the section 7 is devoted to variance estimation. Inthe section 8, we give the simulation results of the diﬀerent algorithms on anartiﬁcial dataset while the section 9 gives a conclusion on the new method.

Consider a ﬁnite population U of size N whose units can be deﬁned by labels k ∈ { , , . . . , N } . Let deﬁne a variable of interest y . Suppose that we are tryingto estimate the following unknown total: Y = (cid:88) k ∈ U y k . (1)A sampling design is deﬁned by the probability p ( s ) of selecting each possiblesubset s ⊂ U such that (cid:80) s ⊂ U p ( s ) = 1 . Consider a vector a = ( a , . . . , a N ) (cid:62) thatmaps elements of a subset s to an N vector of 0s and 1s such that: a k = (cid:26) if k ∈ s, otherwise , for k ∈ U . For each unit of the population, the inclusion probability π k , with ≤ π k ≤ , is deﬁned as the probability of selecting k into a sample s : π k = P ( k ∈ s ) = E ( a k ) = (cid:88) s ⊂ U | k ∈ s p ( s ) . Let π = ( π , . . . , π N ) (cid:62) be the vector of all the inclusion probabilities. Letalso π k(cid:96) be the probability of selecting units k and (cid:96) together in the sample, with π kk = π k . Assuming that π k > for all k ∈ U , the total (1) can be estimatedusing the classical unbiased Horvitz-Thompson estimator deﬁned by (cid:98) Y = (cid:88) k ∈ U y k a k π k . (2) Usually, some auxiliary information are available for each unit k ∈ U in a vector x k = ( x k , . . . , x kq ) (cid:62) ∈ R q , with q ∈ N ∗ . A sampling design is said to be bal-anced on the q auxiliary variables if and only if it satisﬁes the following balancingequation: (cid:98) X = (cid:88) k ∈ s x k π k = (cid:88) k ∈ U x k a k π k = (cid:88) k ∈ U x k = X . ψ ∈ R q such that ψ (cid:62) x k = π k , forall k ∈ U . Indeed, this gives (cid:88) k ∈ s ψ (cid:62) x k π k = (cid:88) k ∈ s π k π k = n. The size of the sample will be ﬁxed only if n is an integer. If it is not the case, thesample size will be equal to the higher or lower integer to n .More generally, the problem of selecting a balanced sample is written as thefollowing linear system : (cid:26) A (cid:62) a = A (cid:62) π , a ∈ { , } N , (3)where A = ( x /π , . . . , x N /π N ) (cid:62) . The aim consists then of obtaining a sample a that satisﬁes (or approximately satisﬁes) the constraints.Suppose that the population U is divided into H strata U , . . . , U H , with re-spective sizes of N , . . . , N H . The strata form a partition and respect the followingproperties: U = H (cid:91) h =1 U h , N h > , U h ∩ U (cid:96) = ∅ , for all h, (cid:96) ∈ { , . . . , H } Then, this implies that (cid:80) Hh =1 N h = N . The inclusion probabilities sum to a value n h in each stratum h , i.e. n h = (cid:80) k ∈ U h π k . Let h = ( h , . . . , h N ) (cid:62) be a categoricalvector that speciﬁes the stratum to which each unit belongs. For example, h k = (cid:96) means that unit k belongs to strata U (cid:96) , with k ∈ U and (cid:96) ∈ { , . . . , H } . Anotherway for expressing the stratum of each unit is to use the disjunctive form. Let H be the disjunctive matrix of the corresponding vector h of size N × H , such that: H = (cid:0) ( U ) , . . . , ( U H ) (cid:1) , where ( U h ) ∈ R N is an column vector such that its k th element is equal to 1 ifthe unit k belongs to the stratum U h and 0 otherwise.Obtaining a balanced sample in a stratiﬁed population is equivalent to addingstratiﬁcation constraints to the previous linear system (3). These constraints arecontained in the matrix H , so the modiﬁcation of the linear problem gives: (cid:26) ( H A ) (cid:62) a = ( H A ) (cid:62) π , a ∈ { , } N . (4)4he number of constraints in the linear problem is then ( q + H ) . In the nextsection, a method to select a balanced sample is presented. Deville and Tillé (2004) developed the cube method that selects a balanced samplerespecting the inclusion probabilities. The method can deal with equal or unequalinclusion probabilities. The algorithm is separated into two phases.• The ﬁrst phase is called the ﬂight phase. It modiﬁes recursively and randomlythe vector of inclusion probabilities π into a sample by respecting exactly thebalancing constraints of the problem. The subspace induced by the linearsystem (3) could be rewritten using the following notation: A = (cid:8) a ∈ R N | A (cid:62) a = A (cid:62) π (cid:9) = π + Null ( A (cid:62) ) , where Null ( A (cid:62) ) = (cid:8) u ∈ R N | A (cid:62) u = 0 (cid:9) . The idea is then to use a vector u of the null space of A (cid:62) in order to update randomly the vector π . Thewhole procedure of the update can be found in Deville and Tillé (2004). Ateach step, at least one component is set to 0 or 1. Matrix A is updated withthe new inclusion probabilities. This step is repeated until the null space of A (cid:62) is empty. At the end of the ﬂight phase, the ﬁnal updated vector of π contains at most q elements that are still not equal to 0 or 1.• The second phase is called the landing phase. This phase allows to obtainthe sample a that respects as much as possible the balancing constraints.There are two diﬀerent ways to achieve it, by relaxing the q constraints oneby one, or by linear programming.In the ﬂight phase, the major computational cost comes from the research of avector in the null space of A (cid:62) . Chauvet and Tillé (2006) have improved this time-consuming inconvenience using a sub-matrix of A rather than the entire matrix.The idea is to consider a submatrix that has one more row than the number ofcolumns to ensure to have at least one vector in its null space. This submatrix,denoted by B , has then a size of ( q +1) × q , with respect to q < N and Rank ( B ) ≤ q .The interest of using this submatrix comes from the following result: a vector u of Null ( B (cid:62) ) completed by ( N − ( q + 1)) zeros is a vector of Null ( A (cid:62) ) . Withthis idea, all the computations can be done using only a submatrix B . Usually, N is much greater than q , the size of B is then much smaller than A . This impliesobviously an important gain of computational time. The method proposed inthis paper uses the same idea. In the next section, the particular case of highlystratiﬁed sampling is considered. 5 Highly stratiﬁed population

It is always preferable to consider a stratiﬁed population in order to estimatethe total (1). Indeed, the variance of the Horvitz-Thompson estimator (2) canbe considerably reduced compared to the non-stratiﬁed estimator (1). However,when the population is highly stratiﬁed (i.e. H is very large), the selection ofa balanced sample with classical methods becomes diﬃcult due to the too largenumber of constraints in H . In order to decrease the time-consuming problem,diﬀerent approaches have already been proposed.Chauvet (2009) has developed an algorithm to select a balanced sample in ahighly stratiﬁed population. Firstly, a ﬂight phase is applied inside each stratum.This allows modifying the inclusion probabilities such that these are as balanced aspossible in each stratum. Next, a ﬂight phase is applied on the whole population.Finally, a landing phase is carried out on units that are not still selected or rejected.This procedure has the advantage to be simple to implement. Its major deﬁciencyis when the number of strata H becomes too large, the procedure remains veryslow and often cannot even be used.Hasler and Tillé (2014) have proposed another method to deal with highlystratiﬁed population. As the previous method, it begins by applying the ﬂightphase of the cube method to each stratum of the population. Next, it carries outa ﬂight phase on an union of strata by adding another stratum at each step. Bydoing this, strata are managed one after the other and the inclusion probabilitiesof certain strata are set to 0 or 1 during this step. The idea behind this procedureis to reduce the matrix H considered because some strata are removed from thematrix when all its units are selected or rejected. At the end, a landing phase isapplied. However, if n h is not equal to an integer for a stratum U h , this method alsoremains very time-consuming. Indeed, some strata are never completely removedduring the procedure and then the submatrix of H considered becomes too large.The properties of the cube method imply that the inclusion probabilities aresatisﬁed and that the sample is balanced on the auxiliary variables in these twomethods. However, they still have diﬃculty to deal with all the situations of highlystratiﬁed sampling. In the next following section, a new method is presented inorder to completely resolve these drawbacks. In the fast implementation of the cube method (Chauvet and Tillé, 2006), themain modiﬁcation was to use a matrix smaller than A to update π . This allows toconsiderably decreasing the computational cost. The idea of our method is similarbut adapted to a stratiﬁed population: consider a matrix of constraints B smaller6han ( H A ) during the use of the cube method.The submatrix matrix B must be found at each step of the ﬂight phase ofthe cube method. As explained in Section 3, the number of balancing constraintsdepends on the number of strata H when the population is stratiﬁed. By con-sidering a matrix B with fewer rows, or units, the corresponding vector of strata h will be reduced. This subvector of h will contain fewer categories and thenthe corresponding matrix H will have fewer columns. The number of constraintswill therefore depend on the rows of B . This is why obtaining the matrix B withexactly one row more than its number of columns is not as easy as with an unstrat-iﬁed population. Algorithm 1 shows how to ﬁnd the number of rows to considerto obtain the smaller matrix B such that B has exactly one row more than itsnumber of columns. Algorithm 1

Find the sub-matrix B of ( H A ) Let q be the number of auxiliary variables of A . Initialize q by q . For t =1 , , , . . . repeat the following steps:1. Extract the ﬁrst q t rows of the vector h and denote it h t .2. Denote H t the number of diﬀerent strata in h t .3. Update q t +1 = q + H t + 1 .while q t +1 > q t .Finally, B is deﬁned as the q t ﬁrst rows of the concatenated matrix ( H t A t ) , where A t and H t are the submatrix containing only its q t ﬁrst rows. Example 6.1

Suppose that q = 2 and that the categorical vector is equal to h = (1 , , , , , , , , (cid:62) . We obtain t = 1 : q = 2 , h = (1 , (cid:62) , H = 1 → q = 2 + 1 + 1 = 4 ,t = 2 : q = 4 , h = (1 , , , (cid:62) , H = 2 → q = 2 + 2 + 1 = 5 ,t = 3 : q = 5 , h = (1 , , , , (cid:62) , H = 3 → q = 2 + 3 + 1 = 6 ,t = 4 : q = 6 , h = (1 , , , , , (cid:62) , H = 3 → q = 2 + 3 + 1 = 6 ,t = 5 : q = q B contains then q = 6 rows and 2 + 3 = 5 columns. So it is a matrix with onlyone more rows than the number of columns as desired. The matrix B is found after having computed its number of rows q t usingAlgorithm 1. The ﬁrst q t elements of h composed the strata membership vector h t . The disjunctive matrix H t can then be found using h t . The matrix B is equal7o ( H t A t ) , with A t the submatrix of A containing only its q t ﬁrst rows. Thesame procedure proposed by Chauvet and Tillé (2006) can then be applied. If thepopulation is highly stratiﬁed and the number of auxiliary variables is acceptable,our procedure can be very eﬃcient. Moreover, it handles inclusion probabilitiesthat do not sum to an integer inside strata. Algorithm 2 presents the wholemethod. The variance can be approximated using the method proposed by Deville and Tillé(2005). Let the vector z k = ( H A ) k , where ( H A ) k denote the k th row of the matrix ( H A ) . The variance of theHorvitz-Thompson estimator of the total (cid:98) Y can be approximated byvar app ( (cid:98) Y ) = (cid:88) k ∈ U c k (cid:18) y k π k − α (cid:62) z k (cid:19) , (5)where c k = π k (1 − π k ) NN − ( H + q ) and α = (cid:32)(cid:88) (cid:96) ∈ U c (cid:96) z (cid:96) z (cid:62) (cid:96) (cid:33) − (cid:88) (cid:96) ∈ U c (cid:96) z (cid:96) y (cid:62) (cid:96) π (cid:96) . (6)There exists many diﬀerent ways to express the quantity c k and then thisleads to various approximations of the variance. Value c k can in particular beapproximated by (cid:101) c k = (1 − π k ) nn − ( H + q ) . Equation (5) can be estimated on a sample s using (cid:101) c k instead of c k and by replacingthe sums on U by sums on s in Expressions (5) and (6). In this section, the performance of the method is evaluated on real data producedby the Swiss Federal Statistical Oﬃce (2020). The dataset contains information onSwiss establishments. We restrict the study to the Switzerland region called EspaceMittelland (a region of the second degree of the Nomenclature of Territorial Unitsfor Statistics (NUTS) of Switzerland). This region contains 5 cantons (a region ofthe third degree of the NUTS) and 675 municipalities. For conﬁdentiality reasons,8 lgorithm 2

Consider π the N vector of inclusion probabilities such that < π k < , for k ∈ { , . . . , N } .I. Perform a ﬂight phase on each stratum according to the inclusion probabil-ities π and the balancing constraints in A (cid:62) . The vector π is updated by π such that some of its elements are set to 0 or 1. Compute the set ofindices i ⊂ { , . . . , N } containing the unit indices that have an inclusionprobability still not equal to 0 or 1.II. Initialize t by 1. Repeat step 1. to 6. until it is no more possible to ﬁnd thematrix B or until the vector u is null.1. In A , h and π , consider only units with indices in i t .2. Apply the Algorithm 1 to ﬁnd the submatrix B of ( H A ) .3. Compute u , a vector of the null space of B completed by 0s to obtaina vector with the same size as i t .4. Compute λ > and λ > , the two greater values such that (cid:54) π tk + λ u k (cid:54) (cid:54) π tk − λ u k (cid:54) , for all k ∈ i t .

5. Update π t by: π t +1 = (cid:26) π t + λ u with probability λ / ( λ + λ ) , π t − λ u with probability λ / ( λ + λ ) .

6. Update t by t + 1 and update i t the set of indices containing the unitthat have an inclusion probability still not equal to 0 or 1.III. It could remain some units that are still not rejected or selected. Perform alanding phase by suppression of variables on the balancing variables ( H A ) on the remaining indices i t . 9he units considered are the hectares of land in which at least one establishment islocated. In order to be able to estimate the variance, only 3 hectares of land permunicipalities are included in the study. This implies that the dataset containsinformation from 2025 hectares including at least one establishment.We stratify the units in two diﬀerent ways: by cantons and by municipalities.The number of strata is then respectively equal to H c = 5 and H m = 675 . Figure1 shows the dataset with the two proposed stratiﬁcation. The idea behind thisprocedure is to compare the execution time for a stratiﬁed population with alow number of strata versus a high one. To compare the method, we will usebalancing variables x j containing the number of women employed in a sector j ,with j = 1 , , . Each sector represents a type of activity: sector j = 1 involves thenatural environment and agriculture, sector j = 2 is manufacture and sector j = 3 is related to services. Table 1 shows the mean time execution of three methods forhighly stratiﬁed sampling: the methods of Hasler and Tillé (2014) and Chauvet(2009), detailed in Section 5, and the new one presented in this article.Inside each stratum, the inclusion probabilities are equal. For each stratiﬁca-tion, we carry out two diﬀerent sampling: one with inclusion probabilities thatsum to an integer number within each stratum and one with a non-integer sum.Then, we consider n h = 80 and n h = 80 . , h ∈ { , . . . , H c } , for the ﬁrst stratiﬁ-cation. For the second one, we take n h = 2 and n h = 1 . , h ∈ { , . . . , H m } . Wechoose to deal also with non-integers n h with the aim to compare the impact ofthis situation on the mean sampling time.Chauvet’s method cannot be compared because its execution time is too longand should be avoided for highly stratiﬁed population. However, it remains veryeﬃcient if the number of strata is acceptable. If inclusion probabilities in stratumsum to an integer, the Hasler’s method performs very well. However, the executiontime increases strongly when n h is not integer. The proposed method has a verywell behaved and the time is considerably reduced for a highly stratiﬁed population.In order to compare the variance of the method with the others, we estimatethe variance using some variables of interest y j that contains the total number ofemployee of the sector j , j = 1 , , . In Table 1, we compare the approximatedvariance (5), the estimated variance and the simulated variance computing usingthe equation: v sim = 1 m (cid:88) s { (cid:98) Y ( s ) − Y } , (7)where m is the number of simulations.For each method, we vary the number of selected units within each stratum bytaking n h equal to 2 for the stratiﬁcation with H c strata and 80 for the stratiﬁcationwith H m strata. This implies sample of size respectively equal to 400 and 1350. Thevariance estimator seems to be unbiased for the approximated variance. However,10e see that the approximated variance and estimator are slightly biased to the v i .This coming from the landing phase of each method. We can conclude that theproposed method is comparable in terms of variance to other methods.Table 1: Results of 1000 simulations on the Swiss establishments dataset. Thepopulation of size is equal to 2025. We compute the mean time execution insecondes of each sampling procedure. We vary the number of strata H and thenumber of unit selected within each strata n h . AlgorithmProposed method Hasler’s method Chauvet’s methodCantons ( H = 5 ) n h = 80 n h = 80 . H = 675 ) n h = 2 n h = 1 . Table 2: Results of 1000 simulations on a population of size is equal to 2025. Thenumber of strata is equal to 5 for Cantons and 675 for the Municipalities. For eachvariable of interest y j , j = 1 , , and for each sampling methods, we compute thevariance approximated by the simulations (7) , approximated variance (5) as wellas the variance estimator (6) AlgorithmProposed method Hasle’s method Chauvet’s method v sim (cid:99) var ( (cid:98) Y ) var app ( (cid:98) Y ) v sim (cid:99) var ( (cid:98) Y ) var app ( (cid:98) Y ) v sim (cid:99) var ( (cid:98) Y ) var app ( (cid:98) Y ) Cantons ( H = 5) y y y ( H = 675) y y y Conclusion

The stratiﬁed sampling procedure is a well-known and appropriate procedure toreduce the variance of the Horvitz-Thompson estimator. In this paper, we proposea new method and implementation that provides an excellent executive time andﬂexibility that the existing methods did not allow. In many surveys where thepopulation is stratiﬁed, the sum of inclusion probabilities within each stratum isnot an integer. Other methods are not directly applicable in this case. We haveshown by mean of simulations that the variance of the estimator is not impactedby our method. All of these results indicate that our proposed algorithm is veryeﬃcient to select a sample in a stratiﬁed and highly stratiﬁed population.

References

Chauvet, G. (2009). Stratiﬁed balanced sampling.

Survey Methodology , 35:115–119.Chauvet, G. and Tillé, Y. (2006). A fast algorithm of balanced sampling.

Journalof Computational Statistics , 21:9–31.Deville, J.-C. and Tillé, Y. (2004). Eﬃcient balanced sampling: The cube method.

Biometrika , 91:893–912.Deville, J.-C. and Tillé, Y. (2005). Variance approximation under balanced sam-pling.

Journal of Statistical Planning and Inference , 128:569–591.Hankin, D., Mohr, M., and Newman, K. (2019).

Sampling Theory: For the Eco-logical and Natural Resource Sciences . Oxford University Press, New York.Hasler, C. and Tillé, Y. (2014). Fast balanced sampling for highly stratiﬁed pop-ulation.

Computational Statistics and Data Analysis , 74:81–94.Hasler, C. and Tillé, Y. (2016). Balanced k -nearest neighbor imputation. Statistics ,105:11–23.Neyman, J. (1934). On the two diﬀerent aspects of the representative method: Themethod of stratiﬁed sampling and the method of purposive selection.

Journalof the Royal Statistical Society , 97:558–606.Swiss Federal Statistical Oﬃce (2020).

Statistique structurelle des entreprises(STATENT) Description des données GEOSTAT . Neuchâtel, Switzerland.Tillé, Y. (2020).

Sampling and estimation from ﬁnite populations . Wiley, NewYork. 12 . ◦ N . ◦ N . ◦ N . ◦ N . ◦ N . ◦ N . ◦ E . ◦ E . ◦ E . ◦ E . ◦ E Longitude L a t i t ud e N h

93 159 327 408 1038

Cantons . ◦ N . ◦ N . ◦ N . ◦ N . ◦ N . ◦ N . ◦ E . ◦ E . ◦ E . ◦ E . ◦ E Longitude L a t i t ud e N h Municipalities

Figure 1: Data extract from the Swiss establishement data base of the SwissFederal Statistical Oﬃce Swiss Federal Statistical Oﬃce (2020). The data arerestricted to the NUTS region 2. The left plot is showing the separation by Canton H c = 5 , the right one the separation by Municipality H m = 675= 675