New randomized response technique for estimating the population total of a quantitative variable

Abstract

In this paper, a new randomized response technique aimed at protecting respondents' privacy is proposed. It is designed for estimating the population total, or the population mean, of a quantitative characteristic. It provides a high degree of protection to the interviewed individuals, hence it may be favorably perceived by them and increase their willingness to cooperate. Instead of revealing the true value of the characteristic under investigation, a respondent only states whether the value is greater (or smaller) than a number which is selected by him/her at random, and is unknown to the interviewer. For each respondent this number, a sort of individual threshold, is generated as a pseudorandom number from the uniform distribution. Further, two modifications of the proposed technique are presented. The first modification assumes that the interviewer also knows the generated random number. The second modification deals with the issue that, for certain variables, such as income, it may be embarrassing for the respondents to report either high or low values. Thus, depending on the value of the pseudorandom lower bound, the respondent is asked different questions to avoid being embarrassed. The suggested approach is applied in detail to simple random sampling without replacement, but it can also be applied to many currently used sampling schemes, including cluster sampling, two-stage sampling, etc. Results of simulations illustrate the behavior of the proposed procedure.


arXiv [stat.ME], Jan

JAROMÍR ANTOCH, FRANCESCO MOLA, AND ONDŘEJ VOZÁR

1. Introduction

A steady decline in response rates has been reported for many surveys in most countries around the world; see, e.g., Stoop (2005), Steeh et al. (2001) or Synodinos and Yamada (2000). This decline is observed regardless of the mode of the survey, e.g., face-to-face survey, paper/electronic questionnaire, Internet survey or telephone interviewing. Furthermore, this trend has continued despite additional procedures aimed at reducing refusal and increasing contact rates (Brick (2013)). For some time we have observed that people are getting more and more suspicious with respect to any kind of sampling surveys, a priori assuming that the other side cheats (or can cheat). It is especially due to the overall spread of the Internet, where we communicate with anonymous computer robots, leaving us no chance to check their trustworthiness.

The growing concern about “invasion of privacy” thus also represents an important challenge for statisticians. Quite naturally, a respondent may be hesitant or even evasive in providing any information which may indicate a deviation from a social or legal norm, and/or which he/she feels might be used against him/her some time later. Therefore, if we ask sensitive or pertinent questions in a survey, conscious reporting of false values would often occur (Särndal et al. (1992), p. 547). Unfortunately, standard techniques such as reweighting or model-based imputation cannot usually be applied; for a thorough discussion, see Särndal et al. (1992) or Särndal and Lundström (2005). On the other hand, this issue can, at least partially, be resolved using randomized response techniques (RRT).

For all of the reasons mentioned above, different RRTs have been developed with the goal to obtain unbiased estimates and to reduce the non-response rate. These techniques started with a seminal paper by Warner (1965), who aimed at estimating the proportion of people in a given population with sensitive characteristics, such as substance abuse, unacceptable behavior, criminal past, controversial opinions, etc. Eriksson (1973) and Chaudhuri (1987) modified Warner's method to estimate the population total of a quantitative variable. However, in our opinion based on personal practical experience, these standard RRTs aimed at estimating the population total are rather complicated and demanding on both respondents and survey statisticians for various real life applications, see also the discussion in Chaudhuri (2017). They require “non-trivial arithmetic operations” from respondents within Chaudhuri's approach, while the survey statistician must expend a lot of effort connected with the design of a suitable deck of cards, or other randomization mechanisms to be used for masking the sensitive variables (such as income, personal wealth) and, at the same time, providing accurate enough estimates.

In this paper, we propose a method which is simpler in comparison with those proposed previously and is practically applicable. The respondent is only asked whether the value of a sensitive variable attains at least a certain random lower bound. This technique, and its modifications, are developed in detail, applied to the simple random sampling without replacement, and illustrated using simulations.

The main advantages of the suggested method include the ease of implementation, simpler use by the respondent, and practically acceptable preciseness. Moreover, respondents' privacy is well protected because they never report the true value of the sensitive variable. Unlike in Chaudhuri's or Eriksson's approach, there is no issue with the cards design. From a certain point of view, a small disadvantage may be a lower degree of confidence in anonymity, due to the extrinsic device/technique used for generating random numbers.

This paper is organized as follows. In sec. 2, selected randomized response techniques for estimation of the population total, or population mean, are concisely described. In sec. 3, a new randomized response technique and its two modifications are proposed, and their properties studied. Sec. 4 illustrates the suggested ideas with the aid of a simulation study. Finally, sec. 5 provides the main conclusions of the paper.

Key words and phrases. Survey sampling, population total, Horvitz-Thompson's estimator, randomized response techniques, simple random sampling, stream data.

2. Selected randomized response techniques for estimating the population total and their properties

Let us consider a finite population $U = \{1, \dots, N\}$ of $N$ identifiable units, where each unit can unambiguously be identified by its label. Let $Y$ be a sensitive quantitative variable; the goal of the survey is to estimate the population total $t_Y = \sum_{i \in U} Y_i$ or, alternatively, the population mean $\overline{t}_Y = t_Y / N$, of the surveyed variable. To that end, we use a random sample $s$ selected with probability $p(s)$, described by a sampling plan with a fixed sample size $n$. Let us denote by $\pi_i$ the probability of inclusion of the $i$th element in the sample, i.e., $\pi_i = \sum_{s \ni i} p(s)$, and by $\xi_i$ the indicator of inclusion of the $i$th element in the sample $s$, i.e., $\xi_i = 1$ if $s \ni i$ and $\xi_i = 0$ otherwise. To keep the length of the paper acceptable, we do not introduce all notions from scratch and refer the reader to Tillé (2006) if needed.

Antoch, Mola, Vozár: arXiv: January 25, 2021, 1:53
RRT for estimating population total

As argued above, in practice it is often impossible to obtain the values of the surveyed variable $Y$ in sufficient quality because of its sensitivity. Therefore, statisticians try to obtain from each respondent at least a randomized response $R$ that is correlated to $Y$. Randomization of the responses is carried out independently for each population unit in the sample.

Note that, in such a case, the survey has two phases. First, a sample $s$ is selected from $U$ and then, given $s$, responses $R_i$ are realized using the selected RRT. We denote the corresponding probability distributions by $p(s)$ and $q(r \mid s)$. In this setting, the notions of the expected values, unbiasedness and variances are tied to a twofold averaging process:
• Over all possible samples $s$ that can be drawn using the selected sampling plan $p(s)$.
• Over all possible response sets $r$ that can be realized given $s$ under the response distribution $q(r \mid s)$.
Below we follow the literature and, where appropriate, denote the expectation operators with respect to these two distributions by $E_p$ and $E_q$, respectively.

In a direct survey, the population total $t_Y$ is usually estimated from the observed values $Y_i$ using a linear estimator $t(s, Y) = \sum_{i \in s} b_{si} Y_i$, where the weights $b_{si}$ follow the unbiasedness constraint $\sum_{s \ni i} p(s)\, b_{si} = 1$, $i = 1, \dots, N$. If $\pi_i > 0$ $\forall i \in U$, then Horvitz-Thompson's estimator
$$t_{HT}(s, Y) = \sum_{i \in s} \frac{Y_i}{\pi_i} \tag{1}$$
is a linear unbiased estimator with the weights $b_{si} = 1/\pi_i$, and $E_p\bigl(t_{HT}(s, Y)\bigr) = t_Y$, see Horvitz and Thompson (1952), or sec. 2.8 in Tillé (2006) for details.

If the survey is conducted by means of RRT, the true values of $Y_i$ for the sample $s$ are unknown and, instead of them, values of random variables $R_i$ correlated to $Y_i$ are collected. The population total is then usually estimated using a Horvitz-Thompson's type estimator
$$t_{HT}(s, R) = \sum_{i \in s} \frac{R_i}{\pi_i}. \tag{2}$$

Suppose now that we have an estimator (a formula, or a computational procedure) for estimating the population total $t_Y$ or population mean $\overline{t}_Y$; we denote it by $\widehat{Y}_R$ and $\widehat{\overline{Y}}_R$, respectively. The subscript $R$ emphasizes that the estimator is based on the values of $R_i$ in the sample, i.e., on randomized responses. Moreover, we assume that the randomized responses $R_i$ follow a model for which it holds $E(R_i) = Y_i$, $\mathrm{var}(R_i) = \phi_i$ $\forall i \in U$, and $\mathrm{cov}(R_i, R_j) = 0$ $\forall i \neq j$, $i, j \in U$. Note that $\phi_i$ is a function of $Y_i$.

Recall that the estimator $\widehat{Y}_R$ of the population total $t_Y$ is conditionally unbiased if the conditional expectation of $\widehat{Y}_R$ given the sample $s$ is equal to the current estimator $\widehat{Y}_s$ that would be obtained if no randomization took place, i.e., if $E_q\bigl(\widehat{Y}_R \mid s\bigr) = \widehat{Y}_s$. The subscript $s$ indicates that the “usual” estimator based on the non-randomized sample, e.g., the Horvitz-Thompson's one, is used, and $E_q\bigl(\widehat{Y}_R \mid s\bigr)$ stands for the conditional expectation of $\widehat{Y}_R$ given the sample $s$ with respect to the distribution induced by the randomization of responses. For the estimator $\widehat{\overline{Y}}_R$ of the population mean, we proceed analogously.

If $\widehat{Y}_R$ is conditionally unbiased and $\widehat{Y}_s$ is unbiased, then $\widehat{Y}_R$ is unbiased as well, since it holds
$$E\bigl(\widehat{Y}_R\bigr) = E_p\bigl(E_q(\widehat{Y}_R \mid s)\bigr) = E_p\bigl(\widehat{Y}_s\bigr) = t_Y.$$
Analogously it holds $E\bigl(\widehat{\overline{Y}}_R\bigr) = \overline{t}_Y$.
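The twofold averaging described above can be made concrete on a toy population. The following is a minimal sketch and not part of the original paper: the population values, the sample size, and the two-point response $R_i \in \{0, 2Y_i\}$ (each with probability 1/2, so that $E_q(R_i) = Y_i$) are assumptions chosen purely for illustration. It enumerates every sample of a simple random sampling plan without replacement and every response pattern, and evaluates the expectations of estimators (1) and (2) exactly.

```python
from fractions import Fraction
from itertools import combinations, product


def ht_estimate(values, pi):
    """Horvitz-Thompson type estimate with a common inclusion probability pi."""
    return sum(Fraction(v) / pi for v in values)


def expectation_direct(y, n):
    """E_p of estimator (1): average over all equally likely samples of size n."""
    N = len(y)
    pi = Fraction(n, N)                       # inclusion probability under SRSWOR
    samples = list(combinations(range(N), n))
    p_s = Fraction(1, len(samples))           # p(s) is uniform over the samples
    return sum(p_s * ht_estimate([y[i] for i in s], pi) for s in samples)


def expectation_randomized(y, n):
    """E_p E_q of estimator (2) for the toy response R_i in {0, 2 * Y_i}."""
    N = len(y)
    pi = Fraction(n, N)
    samples = list(combinations(range(N), n))
    p_s = Fraction(1, len(samples))
    q_r = Fraction(1, 2 ** n)                 # q(r | s): n independent fair coins
    total = Fraction(0)
    for s in samples:
        for coins in product((0, 1), repeat=n):
            responses = [2 * y[i] * c for i, c in zip(s, coins)]
            total += p_s * q_r * ht_estimate(responses, pi)
    return total


y = [2, 5, 7, 11]                             # hypothetical population values
direct = expectation_direct(y, n=2)
randomized = expectation_randomized(y, n=2)
print(direct, randomized, sum(y))
```

Both expectations coincide with the population total, which is the tower-property argument $E_p(E_q(\cdot \mid s))$ written out by brute force. Exact rational arithmetic is used so that the equality holds without rounding error.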

By a standard formula of probability theory, we get the variance of $\widehat{Y}_R$ in the form
$$\mathrm{var}\bigl(\widehat{Y}_R\bigr) = E_p\Bigl(\mathrm{var}_q\bigl(\widehat{Y}_R \mid s\bigr)\Bigr) + \mathrm{var}_p\Bigl(E_q\bigl(\widehat{Y}_R \mid s\bigr)\Bigr) = E_p\Bigl(\mathrm{var}_q\bigl(\widehat{Y}_R \mid s\bigr)\Bigr) + \mathrm{var}_p\bigl(\widehat{Y}_s\bigr). \tag{3}$$
The second term on the right-hand side of (3) is, obviously, the variance of the estimator that would apply if no randomization of responses was deemed necessary, while the first term represents the increase of the variance produced by the randomization. In other words, the two terms on the right-hand side of (3) represent, respectively, “contribution by randomized response technique used” and “contribution by sampling variation” to the total variance of $\widehat{Y}_R$. When treating $\widehat{\overline{Y}}_R$, we proceed analogously.

Because the variances of $\widehat{Y}_s$ are well known for many currently used sampling procedures, it remains to find the contribution by randomization and to suggest methods for its estimation.

For the estimator $t_{HT}(s, R)$ given by (2), we have
$$\begin{aligned}
\mathrm{var}\bigl(t_{HT}(s, R)\bigr) &= E_p\Bigl(\mathrm{var}_q\bigl(t_{HT}(s, R) \mid s\bigr)\Bigr) + \mathrm{var}_p\Bigl(E_q\bigl(t_{HT}(s, R) \mid s\bigr)\Bigr) \\
&= E_p\Bigl(\sum_{i \in U} \frac{\phi_i\, \xi_i}{\pi_i^2}\Bigr) + \mathrm{var}_p\bigl(t_{HT}(s, Y)\bigr) = \sum_{i \in U} \frac{\phi_i}{\pi_i} + \mathrm{var}_p\bigl(t_{HT}(s, Y)\bigr).
\end{aligned} \tag{4}$$
When any RRT is used instead of direct surveying, the variance of the population total estimator is always higher. This increase in variability of $t_{HT}(s, R)$ is described by the first term in (4), which represents additional variability caused by using a randomized response $R$ instead of the directly surveyed variable $Y$.

Let us take a look at two RRT proposals that are recommended in the literature and used in practice.
Note that the subscript $E$, $C$ (respectively) emphasizes Eriksson's, Chaudhuri's (respectively) approach; each of them is concisely revisited below.

Eriksson (1973) proposed a technique in which the respondent randomly draws a card from a deck. The deck contains $100\,C\,\%$, $0 < C < 1$, cards with the text “True value”, while the remaining cards have values $x_1, \dots, x_T$ with relative frequencies $q_1, \dots, q_T$, $\sum_{t=1}^{T} q_t = 1 - C$. The values of cards $x_1, \dots, x_T$ are chosen to mask the true values of the surveyed variable $Y$. Each respondent randomly draws one card from the deck. If a card with the text “True value” is selected, then the true value of $Y$ is reported, otherwise the value $x_t$ shown on the card is given. The respondent then returns the selected card to the deck, and the interviewer does not know which card it was. The answer from the $i$th respondent is thus a random variable
$$Z_{i,E} = \begin{cases} Y_i, & \text{with probability } C, \\ x_t, & \text{with probability } q_t,\ t = 1, \dots, T. \end{cases}$$
The answer $Z_{i,E}$ from the $i$th respondent is then transformed to $R_{i,E} = \bigl(Z_{i,E} - \sum_{t=1}^{T} q_t x_t\bigr) / C$. It follows from the definition of $Z_{i,E}$ that the transformed randomized responses $R_{i,E}$ have the expectation and variance values
$$E\bigl(R_{i,E}\bigr) = Y_i, \qquad \mathrm{var}\bigl(R_{i,E}\bigr) = \frac{C(1 - C)\,Y_i^2 + \sum_{t=1}^{T} q_t x_t^2 - \bigl(\sum_{t=1}^{T} q_t x_t\bigr)^2 - 2\,C\,Y_i \sum_{t=1}^{T} x_t q_t}{C^2},$$
so that the corresponding Horvitz-Thompson's type estimator is unbiased. Unfortunately, if the value reported by a respondent differs from all of the values $x_t$, the interviewer can deduce the true value of the sensitive variable; this fact may decrease the credibility and the willingness of some respondents to cooperate.

Later on, Chaudhuri (1987) suggested that two decks of cards should be used. The first deck contains cards with values $a_1, \dots, a_K$, and the second deck values $b_1, \dots, b_L$. Both decks of cards should mask the behavior of the studied variable $Y$. Moreover, the following relationships must hold:
$$\mu_a = \frac{1}{K} \sum_{k=1}^{K} a_k \neq 0, \qquad \sigma_a^2 = \frac{1}{K} \sum_{k=1}^{K} (a_k - \mu_a)^2 > 0,$$
$$\mu_b = \frac{1}{L} \sum_{l=1}^{L} b_l \neq 0, \qquad \sigma_b^2 = \frac{1}{L} \sum_{l=1}^{L} (b_l - \mu_b)^2 > 0.$$
The respondent randomly draws one card from each deck, say $a_k$ and $b_l$, whereas the interviewer does not know the values on the drawn cards. Then the respondent returns both cards, and instead of the true value $Y_i$ the value of $Z_{i,C} = a_k Y_i + b_l$ is reported. This response is then transformed to the randomized response $R_{i,C} = (Z_{i,C} - \mu_b)/\mu_a$. It follows from the definition of $Z_{i,C}$ that the randomized response $R_{i,C}$ has the expectation and variance
$$E\bigl(R_{i,C}\bigr) = Y_i \quad \text{and} \quad \mathrm{var}\bigl(R_{i,C}\bigr) = Y_i^2\, \frac{\sigma_a^2}{\mu_a^2} + \frac{\sigma_b^2}{\mu_a^2},$$
so that the corresponding Horvitz-Thompson's type estimator is also unbiased.

Both Eriksson's and Chaudhuri's techniques have been further developed and improved by other researchers, see, e.g., the interesting papers by Arnab (1995, 1998), Gjestvang and Singh (2009) or Bose and Dihidar (2018). The ideas and a representative review of further research are presented in a monograph by Chaudhuri (2017). Other types of randomization techniques were suggested in a series of papers by Dalenius and his colleagues, e.g., Bourke and Dalenius (1976) or Dalenius and Vitale (1979). From among the recent papers about dealing with sensitive questions in population surveys, we would like to mention, for example, papers by Trappmann et al. (2014) and Kirchner (2015). In both of them, long lists of relevant references can be found. Finally, recall that probably the most comprehensive account of recent developments in sample survey theory and practice can be found in Handbook of Statistics 29 A, B, edited by Pfeffermann and Rao (2009).

3. New randomized response technique
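As a point of comparison for what follows, the two card-based devices recalled in sec. 2 can be checked numerically. The sketch below is illustrative only: the decks, the value of $C$ and the respondent's true value are hypothetical, and the function names are not taken from the paper. It enumerates the cards exactly and compares the resulting mean and variance of $R_{i,E}$ and $R_{i,C}$ with the closed-form expressions quoted for Eriksson's and Chaudhuri's techniques.

```python
from fractions import Fraction as F


def moments(outcomes):
    """Exact mean and variance of a discrete law given as [(value, prob), ...]."""
    mean = sum(p * v for v, p in outcomes)
    var = sum(p * (v - mean) ** 2 for v, p in outcomes)
    return mean, var


def eriksson_response(y, C, x, q):
    """Law of R_E = (Z - sum_t q_t x_t) / C for a respondent with true value y."""
    shift = sum(qt * xt for qt, xt in zip(q, x))
    z = [(F(y), C)] + [(F(xt), qt) for xt, qt in zip(x, q)]
    return [((v - shift) / C, p) for v, p in z]


def eriksson_variance(y, C, x, q):
    """Closed-form variance of R_E as stated in the text."""
    s1 = sum(qt * xt for qt, xt in zip(q, x))
    s2 = sum(qt * xt ** 2 for qt, xt in zip(q, x))
    return (C * (1 - C) * y ** 2 + s2 - s1 ** 2 - 2 * C * y * s1) / C ** 2


def chaudhuri_response(y, a, b):
    """Law of R_C = (a_k * y + b_l - mu_b) / mu_a, one card drawn from each deck."""
    mu_a, mu_b = F(sum(a), len(a)), F(sum(b), len(b))
    p = F(1, len(a) * len(b))                 # independent uniform draws
    return [((ak * y + bl - mu_b) / mu_a, p) for ak in a for bl in b]


def chaudhuri_variance(y, a, b):
    """Closed-form variance of R_C as stated in the text."""
    mu_a, mu_b = F(sum(a), len(a)), F(sum(b), len(b))
    var_a = sum((ak - mu_a) ** 2 for ak in a) / len(a)
    var_b = sum((bl - mu_b) ** 2 for bl in b) / len(b)
    return (y ** 2 * var_a + var_b) / mu_a ** 2


y = 25                                        # hypothetical true value
C, x, q = F(1, 2), [10, 20, 30], [F(1, 6)] * 3   # q sums to 1 - C
a, b = [1, 2, 3], [-5, 0, 5, 10]              # hypothetical decks
mean_E, var_E = moments(eriksson_response(y, C, x, q))
mean_C, var_C = moments(chaudhuri_response(y, a, b))
print(mean_E, var_E, eriksson_variance(y, C, x, q))
print(mean_C, var_C, chaudhuri_variance(y, a, b))
```

In both cases the mean equals the true value and the enumerated variance agrees with the formula, which also makes visible how strongly the variance depends on the chosen decks.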

In this section, we suggest a completely different approach. Assume that the studied sensitive variable $Y$ is non-negative and bounded from above, i.e., $0 \le Y \le M$. Parameter $M$ should be chosen taking into account both bias and privacy. For that purpose, knowledge of the empirical quantiles of the studied population, or at least reasonably guessing them, is vital. Each respondent carries out, independently of the others, a random experiment generating a pseudorandom number $\Upsilon$ from the uniform distribution on interval $(0, M)$, whereas the interviewer does not know this value. The respondent can generate the pseudorandom number $\Upsilon$ using, for example, a laptop online/offline application; for some other possibilities see sec. 3.4. The respondent then answers a simple question: “Is the value of $Y$ at least $\Upsilon$?” (e.g., “Is your monthly income at least $\Upsilon$?”).

Note that the subscript $AV$ used below indicates that the estimator, as well as random variables used for its construction, are based on the new idea of randomization suggested in this Section.

Answer of the $i$th respondent follows the alternative distribution with the parameter $Y_i/M$, i.e.,
$$Z_{i,AV,(0,M)} = \begin{cases} 1 \ \text{(with probability } Y_i/M\text{)}, & \text{if } \Upsilon_i \le Y_i, \\ 0 \ \text{(with probability } 1 - Y_i/M\text{)}, & \text{otherwise}. \end{cases} \tag{5}$$
Evidently, $E\bigl(Z_{i,AV,(0,M)}\bigr) = P\bigl(\Upsilon_i \le Y_i\bigr) = Y_i/M$ and $\mathrm{var}\bigl(Z_{i,AV,(0,M)}\bigr) = \bigl(Y_i/M\bigr)\bigl(1 - Y_i/M\bigr)$. If we transform the answers $Z_{i,AV,(0,M)}$ to $R_{i,AV,(0,M)} = M\, Z_{i,AV,(0,M)}$, then it holds
$$E\bigl(R_{i,AV,(0,M)}\bigr) = Y_i \quad \text{and} \quad \mathrm{var}\bigl(R_{i,AV,(0,M)}\bigr) = Y_i \bigl(M - Y_i\bigr). \tag{6}$$

For certain sensitive variables, such as the total amount of alcohol consumed within a certain period, it is better to use a question: “Is the value of $Y$ lower than $\Upsilon$?” In such a case we recode the answer $Z_{i,AV,(0,M)}$ to $Z^{\star}_{i,AV,(0,M)} = 1 - Z_{i,AV,(0,M)}$, and apply the suggested RRT to $Z^{\star}_{i,AV,(0,M)}$.

3.1. Application to the simple random sampling.
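Before a specific sampling plan is considered, the response mechanism (5) and the transformation (6) can be sketched for a single respondent. This is a minimal simulation under stated assumptions: the true value, the bound $M$, the seed and the function name `randomized_response` are hypothetical, and Python's `random.Random.uniform` merely stands in for whatever device the respondent would actually use.

```python
import random


def randomized_response(y, M, rng):
    """One interview under (5)-(6): returns R = M * Z."""
    threshold = rng.uniform(0.0, M)     # private threshold, unknown to the interviewer
    z = 1 if threshold <= y else 0      # answer to "Is the value of Y at least the threshold?"
    return M * z                        # transformation to R, cf. (6)


rng = random.Random(2021)               # fixed seed, for reproducibility only
y_true, M = 30.0, 100.0                 # hypothetical true value and upper bound
draws = [randomized_response(y_true, M, rng) for _ in range(100_000)]
mc_mean = sum(draws) / len(draws)
mc_var = sum((r - mc_mean) ** 2 for r in draws) / len(draws)
print(mc_mean, mc_var, y_true * (M - y_true))
```

Each individual response is either $0$ or $M$ and therefore reveals very little, yet the average over many independent repetitions should settle near the true value, with a variance near $Y_i(M - Y_i)$ as in (6).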

Consider now the situation in which the sampling plan $p(s)$ is a simple random sampling without replacement with a fixed sample size $n$. Denote by $\overline{Y} = \frac{1}{N} \sum_{i \in U} Y_i$ the population mean, by $S_Y^2 = \frac{1}{N-1} \sum_{i \in U} \bigl(Y_i - \overline{Y}\bigr)^2$ the population variance, and by $f = n/N$ the corresponding sampling fraction. In this case, the inclusion probabilities are constant, i.e., $\pi_i = P(\xi_i = 1) = n/N$ $\forall i \in U$.

Let the population total $t_Y$ be estimated using the Horvitz-Thompson's type estimator
$$t^{R}_{HT,AV,(0,M)} \equiv t^{R}_{AV,(0,M)} = \frac{N}{n} \sum_{i \in s} R_{i,AV,(0,M)}. \tag{7}$$
This estimator is evidently unbiased, and we calculate its variance. First, taking into account the independence of outcomes of the randomization experiments performed by the respondents, we have
$$\mathrm{var}_q\bigl(t^{R}_{HT,AV,(0,M)} \mid s\bigr) = \frac{N^2}{n^2} \sum_{i \in s} Y_i \bigl(M - Y_i\bigr).$$
Using the well-known identity $\sum_{i \in U} \bigl(Y_i - \overline{Y}\bigr)^2 = \sum_{i \in U} Y_i^2 - N\, \overline{Y}^2$, we can calculate the contribution of the suggested RRT to the variance as
$$\begin{aligned}
E_p\Bigl(\mathrm{var}_q\bigl(t^{R}_{HT,AV,(0,M)} \mid s\bigr)\Bigr) &= E_p\Bigl(\frac{N^2}{n^2} \sum_{i \in s} Y_i \bigl(M - Y_i\bigr)\Bigr) = E_p\Bigl(\frac{N^2}{n^2} \sum_{i \in U} Y_i \bigl(M - Y_i\bigr)\, \xi_i\Bigr) \\
&= \frac{N}{n} \sum_{i \in U} Y_i \bigl(M - Y_i\bigr) = \frac{N^2}{n} \Bigl(\overline{Y}\bigl(M - \overline{Y}\bigr) - \frac{N-1}{N}\, S_Y^2\Bigr).
\end{aligned}$$
Finally, taking into account the variance of the simple random sampling without replacement, see sec. 4.4 in Tillé (2006) for details, we get
$$\mathrm{var}\bigl(t^{R}_{HT,AV,(0,M)}\bigr) = \frac{N^2}{n} \Bigl(\overline{Y}\bigl(M - \overline{Y}\bigr) - \frac{n-1}{N}\, S_Y^2\Bigr). \tag{8}$$

To characterize the variance of the suggested estimators more profoundly, and to get a more transparent insight into the variance of the suggested RRT, we introduce two auxiliary “measures of concentration”. More precisely, let us denote
$$\Gamma_{Y,M} = \frac{1}{N} \sum_{i \in U} \frac{Y_i}{M} \Bigl(1 - \frac{Y_i}{M}\Bigr) = \underbrace{\frac{1}{MN} \sum_{i \in U} Y_i}_{\overline{Y}/M} - \underbrace{\frac{1}{M^2 N} \sum_{i \in U} Y_i^2}_{\overline{Y^2}/M^2} = \frac{\overline{Y}}{M} - \frac{\overline{Y^2}}{M^2} \tag{9}$$
and
$$\Gamma_{\overline{Y},M} = \frac{\overline{Y}}{M} \cdot \frac{M - \overline{Y}}{M} = \frac{\overline{Y}}{M} - \frac{\overline{Y}^2}{M^2}. \tag{10}$$
We call $\Gamma_{Y,M}$ the mean relative concentration measure, and $\Gamma_{\overline{Y},M}$ the proximity measure of the population mean $\overline{Y}$ to $M$.

If $Y_i$ are i.i.d. random variables with a finite variance $\sigma^2$ and an expectation $\mu$, then, by the law of large numbers, both $\Gamma_{Y,M}$ and $\Gamma_{\overline{Y},M}$ converge, as $N \to \infty$, with probability 1 to
$$\Gamma_{Y,M,as} = \frac{\mu}{M} \Bigl(1 - \frac{\mu}{M}\Bigr) - \frac{\sigma^2}{M^2} \quad \text{and} \quad \Gamma_{\overline{Y},M,as} = \frac{\mu}{M} \Bigl(1 - \frac{\mu}{M}\Bigr). \tag{11}$$
We call $\Gamma_{Y,M,as}$ the asymptotic mean relative concentration measure, and $\Gamma_{\overline{Y},M,as}$ the asymptotic proximity measure of the population mean $\overline{Y}$ to $M$. Note that both $\Gamma_{Y,M,as}$ and $\Gamma_{\overline{Y},M,as}$ exist if $0 \le Y_i \le M$ $\forall i \in U$.

Let us focus on these measures in more detail. First, note that in our setting both these measures are population characteristics, not random variables. Second, both $\Gamma_{Y,M}$ and $\Gamma_{\overline{Y},M}$ take on their values in the interval $[0, 1/4]$, and are equal to zero only in the pathological cases when either $Y_i = 0$ $\forall i \in U$ or $Y_i = M$ $\forall i \in U$. The higher these measures, the higher the variance of $t^{R}_{HT,AV,(0,M)}$. The mean relative concentration measure $\Gamma_{Y,M}$ attains its maximum $1/4$ if all values are concentrated at the center of the interval $(0, M)$, i.e., if $Y_i = M/2$ $\forall i \in U$. The measure $\Gamma_{\overline{Y},M}$ of the population mean's proximity to the center of the interval $(0, M)$ attains its maximum $1/4$ if $\overline{Y} = M/2$. This case occurs, e.g., when random variable $Y$ is symmetric around the center $M/2$ of the interval; this feature is certainly true for the uniform distribution on $(0, M)$.
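The variance formula (8) and the two concentration measures can be verified on a very small population. The sketch below is an illustration under stated assumptions, not the simulation design of the paper: the four population values, $M = 100$ and $n = 2$ are hypothetical. It enumerates every sample and every yes/no answer pattern with its exact probability, so the variance of estimator (7) is obtained without Monte Carlo error and can be compared with the right-hand side of (8).

```python
from fractions import Fraction as F
from itertools import combinations, product


def gamma_measures(y, M):
    """Mean relative concentration (9) and proximity measure (10)."""
    N = len(y)
    mean = F(sum(y), N)
    gamma_y = sum(F(v, M) * (1 - F(v, M)) for v in y) / N
    gamma_mean = (mean / M) * (1 - mean / M)
    return gamma_y, gamma_mean


def variance_formula(y, M, n):
    """Right-hand side of (8)."""
    N = len(y)
    mean = F(sum(y), N)
    s2 = sum((v - mean) ** 2 for v in y) / (N - 1)
    return F(N * N, n) * (mean * (M - mean) - F(n - 1, N) * s2)


def variance_by_enumeration(y, M, n):
    """Exact mean and variance of estimator (7) over samples and answer patterns."""
    N = len(y)
    samples = list(combinations(range(N), n))
    first = second = F(0)
    for s in samples:
        for answers in product((0, 1), repeat=n):
            prob = F(1, len(samples))               # p(s) under SRSWOR
            for i, z in zip(s, answers):
                p_yes = F(y[i], M)                  # P(threshold <= Y_i)
                prob *= p_yes if z else 1 - p_yes
            estimate = F(N, n) * M * sum(answers)   # (N/n) * sum of R_i, R_i = M * Z_i
            first += prob * estimate
            second += prob * estimate ** 2
    return first, second - first ** 2


y, M, n = [10, 20, 40, 70], 100, 2                  # hypothetical population
mean_exact, var_exact = variance_by_enumeration(y, M, n)
gamma_y, gamma_mean = gamma_measures(y, M)
print(mean_exact, var_exact, variance_formula(y, M, n))
print(gamma_y, gamma_mean)
```

For this population the estimator has mean 140, equal to the total, and both routes give the variance 16800, while $\Gamma_{Y,M} = 7/40$ and $\Gamma_{\overline{Y},M} = 91/400$ lie below the bound $1/4$.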

For a fixed value of the upper bound $M$, population size $N$ and sample size $n$, the contribution of the suggested RRT to the variance of $t^{R}_{HT,AV,(0,M)}$ depends, up to a multiplicative constant, on $\Gamma_{Y,M}$, because it holds
$$E_p\Bigl(\mathrm{var}_q\bigl(t^{R}_{HT,AV,(0,M)} \mid s\bigr)\Bigr) = \frac{M^2 N^2}{n}\, \underbrace{\frac{1}{N} \sum_{i \in U} \frac{Y_i}{M} \cdot \frac{M - Y_i}{M}}_{\Gamma_{Y,M}} = \frac{M^2 N^2}{n}\, \Gamma_{Y,M}. \tag{12}$$
Analogously, this contribution can also be expressed, up to multiplicative constants, by $\Gamma_{\overline{Y},M}$ and $S_Y^2$, because it holds
$$E_p\Bigl(\mathrm{var}_q\bigl(t^{R}_{HT,AV,(0,M)} \mid s\bigr)\Bigr) = \frac{M^2 N^2}{n}\, \Gamma_{\overline{Y},M} - \frac{N(N-1)}{n}\, S_Y^2. \tag{13}$$
Both $\Gamma_{Y,M}$ and $\Gamma_{\overline{Y},M}$ thus help us explain how the suggested RRT increases the variance of the estimator of the population total $t_Y$ for distributions symmetrical around $M/2$, e.g., distributions on $(0, M)$ that are symmetrical around $M/2$.

Remark 1.

Notice that, if the values of $Y$ are bounded both from below and above, i.e., $0 < m \le Y \le M$, then the variance of $t^{R}_{HT,AV,(0,M)}$ can be significantly reduced by generating pseudorandom numbers $\Upsilon_i$ from the uniform distribution on the interval $(m, M)$ instead of $(0, M)$. Indeed, if this is the case, we replace $Z_{i,AV,(0,M)}$, described by (5), with
$$Z_{i,AV,(m,M)} = \begin{cases} 1 \ \text{(with probability } \tfrac{Y_i - m}{M - m}\text{)}, & \text{if } m \le \Upsilon_i \le Y_i, \\ 0 \ \text{(with probability } 1 - \tfrac{Y_i - m}{M - m}\text{)}, & \text{otherwise}, \end{cases}$$
transform these variables to $R_{i,AV,(m,M)} = m + (M - m)\, Z_{i,AV,(m,M)}$, and estimate the population total $t_Y$ analogously to (7), i.e., using the Horvitz-Thompson's type estimator
$$t^{R}_{HT,AV,(m,M)} \equiv t^{R}_{AV,(m,M)} = \frac{N}{n} \sum_{i \in s} R_{i,AV,(m,M)}. \tag{14}$$
It is easy to show that the variance of $t^{R}_{HT,AV,(m,M)}$ is smaller than that of $t^{R}_{HT,AV,(0,M)}$, namely, by the value $\frac{N^2 m}{n} \bigl(M - \overline{Y}\bigr)$.

When choosing parameters $m$ and $M$, both bias and privacy should be taken into account. While the lower bound $m$ affects mostly bias and is not crucial for respondents' privacy, the choice of $M$ affects both bias and privacy. Thus, the knowledge of empirical quantiles for the studied characteristic, or at least a reasonable guess about them, is vital for setting the values of $m$ and $M$ properly.

An immediate question arises of what happens if the interval $[m, M]$ has not been set correctly. Evidently, if some values of $Y_i$ lie outside of the interval $[m, M]$, then with probability 1 it holds $Z_{i,AV,(m,M)} = 0$ if $Y_i < m$ and $Z_{i,AV,(m,M)} = 1$ if $Y_i > M$. The bias of the suggested estimator then equals
$$\sum_{i \in U \mid Y_i < m} \bigl(m - Y_i\bigr) - \sum_{i \in U \mid Y_i > M} \bigl(Y_i - M\bigr). \tag{15}$$

Let us discuss some advantages of the suggested approach in comparison with other currently used RRTs, including Eriksson's and Chaudhuri's:
• It is simple; this fact increases respondents' confidence and cooperation, and thus reduces the estimation error.
• Respondents' privacy is well protected, because they never report the true value of the sensitive variable.
• One can avoid a demanding task of designing the deck of cards to mask the studied variable.
• It enables us to estimate the population total at an acceptable level of accuracy, see sec. 4 for details.
Due to the device/technique used for generating random numbers, some respondents may feel a lower degree of confidence in preserving their anonymity.

A natural question arises whether we could improve the accuracy of the suggested method. We discuss two modifications of the RRTs suggested above and their properties in the subsections below. The heuristics behind this approach are based on the following observations. All the techniques presented up to now have assumed that the interviewer does not know the outcome of the random mechanism leading to the randomized response, such as the card drawn, the value of the pseudorandom number, etc. It is plausible to ask what would happen if we also knew the outcome of that random experiment on the one hand, while protecting respondents' privacy on the other hand. More precisely: can statisticians improve the accuracy of the proposed estimator, i.e., decrease its variance, if they also know the values of the generated pseudorandom numbers? We surmise it is feasible, and suggest one possible way of reaching this goal. Let us point out, however, that the success of the suggested approach, to a considerable extent, depends on the statistician's insight into the problem. It may be embarrassing to report either high or low values of the variables in question, say, the personal income. Depending on the value of the pseudorandom number $\Upsilon$, a different question is then asked with the aim to reduce the respondent's potential embarrassment.

3.2. Estimators using knowledge of $\Upsilon$.

Assume again that the studied sensitive variable $Y$ is non-negative and bounded from above, i.e., $0 \le Y \le M$. Each respondent carries out, independently of the others, a random experiment generating a pseudorandom number $\Upsilon$ from the uniform distribution on interval $(0, M)$, and informs the interviewer of both its value and whether $\Upsilon \le Y$ or not. For example, the response is that the simulated number has been xxx and the respondent earns more/less. Assume further that the corresponding random response is now described not by (5), but using a dichotomous random variable
$$Z_{i,AV,\alpha} = \begin{cases} 1 - \alpha + 2\alpha\, \dfrac{\Upsilon_i}{M}, & \text{if } \Upsilon_i \le Y_i, \\[1ex] -\alpha + 2\alpha\, \dfrac{\Upsilon_i}{M}, & \text{otherwise}, \end{cases} \qquad 0 \le \alpha < 1. \tag{16}$$
For random responses $Z_{i,AV,\alpha}$ it holds
$$E\bigl(Z_{i,AV,\alpha}\bigr) = P\bigl(\Upsilon_i \le Y_i\bigr) = \frac{1}{M} \int_0^{Y_i} \Bigl(1 - \alpha + 2\alpha\, \frac{u}{M}\Bigr)\, du + \frac{1}{M} \int_{Y_i}^{M} \Bigl(-\alpha + 2\alpha\, \frac{u}{M}\Bigr)\, du = \frac{Y_i}{M},$$
$$\mathrm{var}\bigl(Z_{i,AV,\alpha}\bigr) = \frac{1 - 2\alpha}{M^2}\, Y_i \bigl(M - Y_i\bigr) + \frac{\alpha^2}{3}.$$
The random responses $Z_{i,AV,\alpha}$ are transformed to $R_{i,AV,\alpha} = M\, Z_{i,AV,\alpha}$, and the desired estimator of the population total $t_Y$ can be constructed analogously to (7) and (14). More precisely, we suggest using again the Horvitz-Thompson's type of estimator in the form
$$t^{R}_{HT,AV,\alpha} \equiv t^{R}_{AV,\alpha} = \frac{N}{n} \sum_{i \in s} R_{i,AV,\alpha}. \tag{17}$$
Because $E\bigl(R_{i,AV,\alpha}\bigr) = Y_i$, estimator (17) is unbiased, and the contribution of the randomization to its variance is
$$E_p\Bigl(\mathrm{var}_q\bigl(t^{R}_{HT,AV,\alpha} \mid s\bigr)\Bigr) = \frac{M^2 N^2}{n} \sum_{i \in U} \Bigl(\frac{1}{N} \bigl(1 - 2\alpha\bigr) \frac{Y_i}{M} \Bigl(1 - \frac{Y_i}{M}\Bigr) + \frac{\alpha^2}{3N}\Bigr). \tag{18}$$
An easy calculation shows that (18) takes on its global minimum at $\alpha_{opt} = 3\,\Gamma_{Y,M} \in [0, 3/4]$. Substituting $\alpha_{opt}$ back into (18), we get
$$E_p\Bigl(\mathrm{var}_q\bigl(t^{R}_{HT,AV,\alpha_{opt}} \mid s\bigr)\Bigr) = \frac{M^2 N^2}{n} \sum_{i \in U} \Bigl(\bigl(1 - 6\,\Gamma_{Y,M}\bigr) \frac{1}{N} \frac{Y_i}{M} \Bigl(1 - \frac{Y_i}{M}\Bigr) + \frac{3\,\Gamma_{Y,M}^2}{N}\Bigr) = \frac{M^2 N^2}{n}\, \Gamma_{Y,M} \bigl(1 - 3\,\Gamma_{Y,M}\bigr). \tag{19}$$

We would like to point out that the knowledge of pseudorandom numbers $\Upsilon_i$ and the use of $\alpha_{opt}$ can considerably decrease the variability due to the suggested RRT; compare (19) with (12). Note also that our simulations summarized in sec. 4 confirm these findings.

The value of the parameter $\alpha$, which is a priori set by the interviewer, is fixed and unknown to the respondent. For $\alpha = 0$ we have the original method described in sec. 3.1. The response to $Z_{i,AV,\alpha}$ is transformed not by the respondent, but by the interviewer off-line.

Parameter $\alpha$ should be set to its optimal value $\alpha_{opt} = 3\,\Gamma_{Y,M}$, where the mean relative concentration measure $\Gamma_{Y,M}$ is introduced in sec. 3, formula (9). If the interviewer has some prior information about the mean $\mu$ and variance $\sigma^2$ values for the theoretical distribution of the surveyed variable $Y$, he/she should rather apply asymptotic concentration measure (11), which can be estimated using a plug-in moment estimator. More precisely, the population mean $\overline{Y}$ should be replaced with $\mu$, and the population variance $S_Y^2$ with $\sigma^2$. Since the population second moment $\overline{Y^2}$ can be expressed as $\frac{N-1}{N}\, S_Y^2 + \overline{Y}^2$, it is sufficient to substitute $\mu$ and $\sigma^2$ into this expression. Recall that the prior information is often available for regular surveys in official statistics, such as EU-SILC, because in such a case we can either use results from previous years updated by inflation, or we can rely on the expert opinion. If no prior information is available, we recommend choosing small values of $\alpha$, such as $0.5$, to decrease the negative values of $R_{i,AV,\alpha}$. Note that in such a case the resulting estimator may attain unacceptably low or even negative values for estimates of non-negative variables. However, this issue can, to some extent, be resolved by properly tuning parameter $\alpha$ and increasing the sample size.

Notice that if a non-negative surveyed random variable $Y$ is bounded not only from above, but also from below, i.e., $0 < m \le Y \le M$, we generate $\Upsilon_i$ from the uniform distribution on the interval $(m, M)$, modifying $Z_{i,AV,\alpha}$ given by (16) to
$$Z_{i,AV,\alpha,(m,M)} = \begin{cases} 1 - \alpha + 2\alpha\, \dfrac{\Upsilon_i}{M - m}, & \text{if } \Upsilon_i \le Y_i, \\[1ex] -\alpha + 2\alpha\, \dfrac{\Upsilon_i}{M - m}, & \text{otherwise}, \end{cases} \qquad 0 \le \alpha < 1,$$
transforming $Z_{i,AV,\alpha,(m,M)}$ to $R_{i,AV,\alpha,(m,M)} = (M - m)\, Z_{i,AV,\alpha,(m,M)} + m\,(1 - 2\alpha)$, and forming an estimator of the population total $t_Y$ of the Horvitz-Thompson's type analogously to (17), i.e.,
$$t^{R}_{HT,AV,\alpha,(m,M)} \equiv t^{R}_{AV,\alpha,(m,M)} = \frac{N}{n} \sum_{i \in s} R_{i,AV,\alpha,(m,M)}. \tag{20}$$
Because $E\bigl(R_{i,AV,\alpha,(m,M)}\bigr) = Y_i$, the estimator (20) is again unbiased.

We must firmly emphasize here that neither the value of the pseudorandom number $\Upsilon$ nor the value of $\alpha$ enables us to guess the exact value of the sensitive variable $Y$, except for the case $Y = M$. In other words, knowing them does not intrude on the respondent's privacy.

The heuristics behind the proposed modification are the following:
• If the answer is YES, then a high value of the pseudorandom number $\Upsilon$ implies a high value of the studied variable $Y$, because $Y \ge \Upsilon$, and these observations “considerably” increase the value of the estimator.
• On the other hand, if the answer is NO, then a low value of the pseudorandom number $\Upsilon$ implies a low value of $Y$, because $Y < \Upsilon$, and these observations “considerably” decrease the value of the estimator.
Unfortunately, in both of these situations, i.e., when the value of the response is either (too) low or (too) high, the respondent may be more prone to fabricate his/her answer.

3.3. Estimators using switching questions.

Let us emphasize that for some characteristics, such as the monthly income of a household, it may be sensitive for respondents to report either high or low values. This led us to modify the suggested RRT approach in the following way.

First, we set a proper fixed threshold $T$, $0 < T < M$, unknown to the respondent. Depending on whether the pseudorandom number $\Upsilon$, which is distributed according to the uniform distribution on $(0, M)$, does or does not exceed the fixed threshold $T$, we ask one of the following questions:
(i) If $\Upsilon \le T$: "Is the value of $Y$ at least $\Upsilon$?",
(ii) If $\Upsilon > T$: "Is the value of $Y$ smaller than $\Upsilon$?".

Second, we form the random variables

$$ Z_{i,AV,T} = \begin{cases} 1, & \text{if } \Upsilon_i \le T,\ \Upsilon_i \le Y_i, \\ 0, & \text{if } \Upsilon_i \le T,\ \Upsilon_i > Y_i, \text{ or if } \Upsilon_i > T,\ \Upsilon_i \le Y_i, \\ -1, & \text{if } \Upsilon_i > T,\ \Upsilon_i > Y_i. \end{cases} \tag{21} $$

If we know only the answer concerning the value of $Y$ but not the question asked, i.e., whether $\Upsilon_i \le T$ or not, $Z_{i,AV,T}$ has the expectation $E\big(Z_{i,AV,T}\big) = 1 - |T/M - Y_i/M|$. Unfortunately, in such a case it is impossible to construct an estimator of either the population total $t_Y$ or the population mean $\bar{t}_Y$.

On the other hand, if we know both the answer concerning the value of $Y$ and the question asked, i.e., whether $\Upsilon_i \le T$ or not, then $E\big(Z_{i,AV,T}\big) = Y_i/M + T/M - 1$. In this case, the transformation of $Z_{i,AV,T}$ to $R_{i,AV,T} = M Z_{i,AV,T} + M - T$ enables us to construct an unbiased estimator of the population total $t_Y$ of the Horvitz-Thompson type, which has the form

$$ t^{R}_{HT,AV,T} \equiv t^{R}_{AV,T} = \frac{N}{n} \sum_{i \in s} R_{i,AV,T}. \tag{22} $$

An unbiased estimator of the population mean $\bar{t}_Y$ can be constructed analogously.

As regards the variance of $R_{i,AV,T}$, we must distinguish between $Y_i > T$ and the complementary inequality. Evaluating $M^2 \operatorname{var}\big(Z_{i,AV,T}\big)$ from the three-point distribution in (21) gives

$$ \operatorname{var}\big(R_{i,AV,T}\big) = \begin{cases} Y_i (M - Y_i) + (M - T)(2 Y_i + T), & Y_i \le T, \\ Y_i (M - Y_i) + T (3M - 2 Y_i - T), & Y_i > T. \end{cases} \tag{23} $$

If we compare (23) with (6), we can see that the variance of $R_{i,AV,T}$ is always higher than that of $R_{i,AV,(0,M)}$, because both additional terms are positive for $0 \le Y_i \le M$. Because negative values of $Z_{i,AV,T}$ may occur, this may occasionally lead to negative values of $R_{i,AV,T}$. More precisely, note that $R_{i,AV,T}$ attains only three values, namely $2M - T$, $M - T$, and the negative value $-T$, which is the source of its poor performance. Recall that $t^{R}_{HT,AV,T}$ is intended to estimate a non-negative variable $Y$. Unfortunately, looking at the results of our simulations we observe that $t^{R}_{HT,AV,T}$ quite often returns inadmissibly low, or even negative, values; this is a big drawback.

A simple, but somewhat tedious, analysis of (23) shows that we cannot find an optimal value of the threshold in the open interval $0 < T < M$ minimizing $\operatorname{var}\big(R_{i,AV,T}\big)$. Moreover, numerical experiments show that $\operatorname{var}\big(R_{i,AV,T}\big)$ is acceptable only for very low, or very high, values of the threshold $T$, i.e., for $T$ close to 0 or close to $M$. Summing (23) over the population, the variance contribution of the modified RRT is

$$ E_p\Big( \operatorname{var}_q\big( t^{R}_{HT,AV,T} \,\big|\, s \big) \Big) = \frac{N}{n} \Bigg( \sum_{i \in U} Y_i (M - Y_i) + \sum_{i \in U:\, Y_i \le T} (M - T)(2 Y_i + T) + \sum_{i \in U:\, Y_i > T} T (3M - 2 Y_i - T) \Bigg). $$
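The behaviour of the switching-question variable can be checked numerically for a single respondent. The following is a minimal Python sketch, not the authors' code (the paper's simulations use R): it draws $\Upsilon$ from the uniform distribution on $(0, M)$, forms $Z$ as in (21) and $R = MZ + M - T$, and computes the variance of $R$ directly from the three-point distribution of $Z$. The function names, the seed and the numerical values of $Y$, $M$ and $T$ are illustrative assumptions; the closed form in `variance_closed_form` is obtained here by expanding $M^2 \operatorname{var}(Z)$.

```python
import random


def switching_value(y, M, T, rng):
    """One randomized value R = M*Z + M - T for a respondent with true value y."""
    upsilon = rng.uniform(0.0, M)  # private threshold, Upsilon ~ U(0, M)
    if upsilon <= T:
        # Question (i): "Is the value of Y at least Upsilon?"
        z = 1 if upsilon <= y else 0
    else:
        # Question (ii): "Is the value of Y smaller than Upsilon?"
        z = -1 if upsilon > y else 0
    return M * z + M - T


def variance_from_distribution(y, M, T):
    """Exact var(R), computed from the three-point distribution of Z."""
    p_plus = min(y, T) / M          # P(Z = 1)
    p_minus = (M - max(y, T)) / M   # P(Z = -1)
    mean_z = p_plus - p_minus
    return M * M * (p_plus + p_minus - mean_z ** 2)


def variance_closed_form(y, M, T):
    """The same variance after expanding the expression above (0 <= y <= M)."""
    if y <= T:
        return y * (M - y) + (M - T) * (2 * y + T)
    return y * (M - y) + T * (3 * M - 2 * y - T)


def monte_carlo_mean(y, M, T, draws, seed):
    """Average of many randomized values; should be close to y."""
    rng = random.Random(seed)
    return sum(switching_value(y, M, T, rng) for _ in range(draws)) / draws
```

For instance, `monte_carlo_mean(24000.0, 60000.0, 45000.0, 200000, seed=1)` should land close to 24,000, while `variance_closed_form` exceeds the baseline $Y(M - Y)$ for every admissible $Y$, which illustrates why the switching scheme is less precise.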
Notice that if a non-negative surveyed random variable $Y$ is bounded not only from above, but also from below, i.e., $0 < m \le Y \le M$, we generate $\Upsilon$ from the uniform distribution on $(m, M)$, and, analogously to (21), form the random variables

$$ Z_{i,AV,T,(m,M)} = \begin{cases} 1, & \text{if } \Upsilon_i \le T,\ \Upsilon_i \le Y_i, \\ 0, & \text{if } \Upsilon_i \le T,\ \Upsilon_i > Y_i, \text{ or if } \Upsilon_i > T,\ \Upsilon_i \le Y_i, \\ -1, & \text{if } \Upsilon_i > T,\ \Upsilon_i > Y_i. \end{cases} $$

It is easy to show that $E\big(Z_{i,AV,T,(m,M)}\big) = (T + Y_i - m - M)/(M - m)$, so that if we transform $Z_{i,AV,T,(m,M)}$ to $R_{i,AV,T,(m,M)} = (M - m)\, Z_{i,AV,T,(m,M)} + m + M - T$, then $E\big(R_{i,AV,T,(m,M)}\big) = Y_i$. Now we can form an unbiased estimator of the population total $t_Y$ of the Horvitz-Thompson type, analogously to (17), of the form

$$ t^{R}_{HT,AV,T,(m,M)} \equiv t^{R}_{AV,T,(m,M)} = \frac{N}{n} \sum_{i \in s} R_{i,AV,T,(m,M)}. \tag{24} $$

Let us point out that the modifications described in this section are interesting especially from the theoretical point of view. Although they offer a seemingly nice idea, they cannot be recommended for practical use. Compare also the results of the simulations.

3.4. Random number generation.

In all RRTs we are aware of, the preparation of the random mechanism is probably the trickiest point. For example, it is not clear how to design an acceptably large deck of cards that would sufficiently mask the true values (so that the interviewer cannot guess values very close to the true ones from the respondents' answers and the knowledge of the cards in the deck) and still provide sufficient accuracy. Let us now assume direct face-to-face interviewing and describe several possibilities for generating random numbers.

(1) We allow the respondent to select the random number according to the European ISO 28640:2010(en) Standard, which provides not only methods suitable for generation, but also tables of random numbers and random digits. Recall that equivalents of this Standard, as well as of the tables of random numbers, exist all over the world. We are convinced that the existence of an international standard can increase the credibility of the survey and the willingness of respondents to respond truthfully. The selected random number is then used according to the RRT applied.
(2) To those who feel they are "experts in the field of randomness", the interviewer can offer the option of selecting a random number from the uniform distribution using their own method. The remaining procedure is the same as described above.
(3) Another possibility is, e.g., using a huge deck of cards, for example cards with 100-CZK value steps in our case, but it would require additional calculations to find the bias of such an approach.

On the other hand, we would like to point out that the question of credibility is not only a matter for statisticians, but more and more a task for psychologists. While statisticians must suggest procedures which are "sufficiently random" in their eyes, psychologists must find and offer ways to convince the respondents that they are not being cheated. Unfortunately, a detailed discussion of this topic would go beyond the scope of this paper.

4. Simulation Study

In many countries, income is recognized as a private and (highly) sensitive item of information. The respondents often refuse to respond at all or provide strongly biased answers. This happens in particular if their income is (very) high or (very) low. That leads us to assess the performance of the proposed RRT by a simulation study using Czech wage data from the Average Earnings Information System (IPSV) of the Ministry of Labor and Social Affairs of the Czech Republic.

Based on an extensive analysis of monthly wage data provided by IPSV for the years 2004 – 2014, Vrabec and Marek (2016) recommended modelling wages in the Czech Republic by a three-parameter log-logistic distribution with the density

$$ f(y; \tau, \sigma, \delta) = \begin{cases} \dfrac{\tau}{\sigma} \Big( \dfrac{y - \delta}{\sigma} \Big)^{\tau - 1} \Big( 1 + \Big( \dfrac{y - \delta}{\sigma} \Big)^{\tau} \Big)^{-2}, & y \ge \delta > 0,\ \tau > 0,\ \sigma > 0, \\ 0, & \text{otherwise,} \end{cases} \tag{25} $$

where $\tau > 0$ is a shape parameter, $\sigma > 0$ is a scale parameter, and $\delta$ is a location parameter. We estimate the parameters of (25) using the data from the 2nd quarter of 2014, and obtain

$$ \hat{\tau} = 4.\ldots, \qquad \hat{\sigma} = 21{,}687 \qquad \text{and} \qquad \hat{\delta} = 250. \tag{26} $$

The corresponding estimated average monthly income is 24,290 CZK (approximately 950 EUR). Note that the estimates (26) are based on over $2 \times 10^{6}$ observations, covering practically half of the overall relevant population.

A histogram of the data with the bin width 500 CZK, and the density of the log-logistic distribution (25) with the unknown parameters replaced by their estimates (26), are presented in fig. 1. Moreover, the corresponding sample quantile function of the observed wages is presented in fig. 2. It is interesting to take a look at both the lower and upper sample quantiles of the data used. While 8,000 CZK corresponds to the 0.01 sample quantile, 40,000 CZK corresponds to the 0.91 sample quantile, 60,000 CZK to the 0.97 sample quantile and, finally, 80,000 CZK to the 0.98 sample quantile; compare visually fig. 2.

Figure 1: Probability histogram of monthly wages in the Czech Republic in the 2nd quarter of 2014, and the density (in red) of the approximating model (25) with the parameters estimated by (26).

Figure 2: The sample quantile function of monthly wages in the Czech Republic in the 2nd quarter of 2014.
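Because the distribution function of model (25) can be inverted in closed form, wages can be simulated by the inversion method. The sketch below is a Python illustration only (the paper itself relies on R). The scale and location values are the estimates quoted in (26), whereas `TAU_DEMO = 4.0` is a placeholder assumption for the shape parameter and not the paper's estimate; all function names are hypothetical.

```python
import math
import random

SIGMA_HAT = 21687.0  # scale estimate quoted in (26)
DELTA_HAT = 250.0    # location estimate quoted in (26)
TAU_DEMO = 4.0       # ASSUMPTION: placeholder shape value, not the paper's estimate


def llogis_cdf(y, tau, sigma, delta):
    """Distribution function of the three-parameter log-logistic model."""
    if y <= delta:
        return 0.0
    x = ((y - delta) / sigma) ** tau
    return x / (1.0 + x)


def llogis_quantile(u, tau, sigma, delta):
    """Inverse of llogis_cdf for 0 <= u < 1."""
    return delta + sigma * (u / (1.0 - u)) ** (1.0 / tau)


def llogis_mean(tau, sigma, delta):
    """Theoretical mean; finite only for tau > 1."""
    b = math.pi / tau
    return delta + sigma * b / math.sin(b)


def draw_wages(size, tau, sigma, delta, seed):
    """Simulate wages by plugging uniform numbers into the quantile function."""
    rng = random.Random(seed)
    return [llogis_quantile(rng.random(), tau, sigma, delta) for _ in range(size)]
```

Under this parametrization the median equals $\delta + \sigma$, which gives a quick sanity check of any fitted parameters against fig. 2.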

It is evident from fig. 1 that the original data are highly skewed. Therefore, it is not surprising that the mean relative concentration measure $\Gamma_{Y,M} = 0.198$ is close to its attainable maximum. In such a case, as follows from sec. 3, we can expect a higher variance of the estimators using the suggested RRT than for the Horvitz-Thompson estimator based on non-randomized data. Moreover, the estimator $t^{R}_{HT,AV,\alpha}$ based on the knowledge of the $\Upsilon_i$'s and an "almost-optimal" choice of the parameter $\alpha \approx 3\Gamma_{Y,M}$ should have a smaller variance than $t^{R}_{HT,AV,(m,M)}$ (corresponding to $\alpha = 0$). That conjecture is confirmed by our simulations.

From the "theoretical" wage distribution corresponding to model (25), in which the unknown parameters have been replaced with their estimates (26), 1,000 replications of populations of size $N = 200$ or $N = 400$ are simulated. The simulations are carried out with the aid of the statistical freeware R, version 3.5.1; for details, see R Core Team (2018). Data from the log-logistic distribution are generated using the package flexsurv.

From each replication of the population, we draw, without replacement, 1,000 random samples of size $n = 20$ or $n = 50$. Such population and sample sizes are standard for separate strata in business sampling surveys, and also resemble the usual social statistical surveys, such as the EU Statistics on Income and Living Conditions (EU-SILC). In such a survey for a medium-sized country like the Czech Republic, with a population of 10,000,000 inhabitants and over 4 million households, the samples include approximately 9,500 households surveyed in a two-dimensional stratification (region and size of municipality), giving 78 × … $N$ = 1,…,000 inhabitants per one income group. For a more detailed description of the stratification, strata, sample sizes and sampling design see EU-SILC 2016.

For each sample, both $t_Y$ and $\bar{t}_Y$ are estimated using the techniques described in sec. 3. Estimates of the population means, instead of the population totals, are presented to enable easier comparison between the results obtained for populations of different sizes $N$ and different sample sizes $n$.

In the simulations, we are especially interested in the impact of the "tuning parameters" $m$, $M$, $T$, $\alpha$ and $\alpha_{opt}$ on the estimates. Taking into account the type and nature of the data we are simulating, we set the parameters as described in tab. 1. The values of $\alpha_{opt}$ were set using the formulae for the optimal variance described in sec. 3. The other parameters were chosen with regard to our experience, in particular with regard to which monthly salary can be perceived as high. Because practically all the available data are larger than 7,000 CZK, we set the lower bound of the interval for generating the pseudorandom numbers $\Upsilon_i$ to $m$ = 7,000 CZK.

Table 1: Parameter settings used in the simulations.

  m        M         T         α       α_opt
  7,000    40,000    30,000    0.75    0.…
  7,000    60,000    45,000    0.75    0.59
  7,000    80,000    45,000    0.75    0.53

Note that the simulation results virtually do not change after 100 replications of the population; differences begin at the third significant digit.
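The design just described can be sketched in a few lines. The following Python code is a scaled-down illustration under stated assumptions, not a reproduction of the authors' R code: the population is drawn from the log-logistic model with the scale and location from (26) and a placeholder shape `tau=4.0`; the randomized value is taken to be the $\alpha = 0$ case, $R_i = (M - m)\,\mathbf{1}[\Upsilon_i \le Y_i] + m$ with $\Upsilon_i$ uniform on $(m, M)$; and only one population is simulated. All names are hypothetical.

```python
import random
import statistics


def draw_population(N, rng, tau=4.0, sigma=21687.0, delta=250.0):
    """Draw N wages by inversion; tau=4.0 is a placeholder ASSUMPTION."""
    population = []
    for _ in range(N):
        u = rng.random()
        population.append(delta + sigma * (u / (1.0 - u)) ** (1.0 / tau))
    return population


def rrt_value(y, m, M, rng):
    """Randomized value built from a YES/NO answer to 'Is Y at least Upsilon?'."""
    upsilon = rng.uniform(m, M)   # threshold seen only by the respondent
    answer_yes = upsilon <= y
    return (M - m) * (1.0 if answer_yes else 0.0) + m


def simulate_means(N, n, m, M, n_samples, seed):
    """Repeated SRSWOR samples from one population; returns mean estimates."""
    rng = random.Random(seed)
    population = draw_population(N, rng)
    direct, randomized = [], []
    for _ in range(n_samples):
        sample = rng.sample(population, n)   # simple random sampling without replacement
        direct.append(sum(sample) / n)       # Horvitz-Thompson estimate of the mean
        randomized.append(sum(rrt_value(y, m, M, rng) for y in sample) / n)
    return population, direct, randomized
```

Calling `simulate_means(N=200, n=20, m=7000.0, M=60000.0, n_samples=2000, seed=7)` and comparing `statistics.stdev` of the two lists gives a quick feel for the loss of precision caused by randomization. Note that in this sketch wages outside $[m, M]$ are effectively clipped to the nearest bound, so the randomized estimates centre on the mean of the clipped wages.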

The results are summarized in tab. 2 – 4 and in fig. 3 – 5. They show that for large populations the accuracy of the suggested estimators is acceptable even for the method of the switching questions. The reason for the lower standard deviation of $t^{R}_{AV,\alpha}$, and especially of $t^{R}_{AV,\alpha_{opt}}$, in comparison with $t^{R}_{AV,T}$ and $t^{R}_{AV,(m,M)}$ is that these estimators efficiently use the information on the generated values of $\Upsilon$. Note that we have used the moment plug-in estimate for the optimal value of $\alpha$.

As expected, the variances of our new estimator and its modifications are higher than those of the Horvitz-Thompson estimator based on the non-randomized data. The precision of our basic proposal is practically acceptable, because, according to the simulations, the corresponding sample standard deviation of the estimates has gone up by a mere 60 % in comparison with the Horvitz-Thompson estimate for $M$ = 60,000; this result is quite reasonable, taking into account that $Y$ is a very sensitive variable. Notice, however, that the modification using the knowledge of the values of $\Upsilon_i$ leads to a substantial reduction in variance. Thus, while mildly relaxing respondents' privacy on the one hand, but still keeping the true response secret because the true value of the sensitive variable is never reported, this modification provides estimates whose precision is comparable with direct surveying under zero non-response. On the other hand, the high variability of the estimates, and even the presence of negative estimates of the mean wages, shows that the modification using the switching questions described in sec. 3.3 is only a theoretical exercise and cannot be recommended for practical use. Its improvement remains an open question.

Comparing the contents of all tables, we can see that the mean has practically not changed; however, the expected decrease occurs in the variability of the estimates, of about 9 %, which shows that it pays "to tune up" the procedure and its parameters according to the given problem and potential data.

Both the results of sec. 3.1 and the simulations show that the variance of the estimators can be greatly reduced by the choice of the bounds $m$ and $M$. We see that for the low value of the upper bound $M$ = 40,000 the proposed estimators are competitive even with the Horvitz-Thompson estimator. It follows from the bias formula (15) that approximately unbiased estimators with low variance can be constructed if we use prior information on population quantiles for the choice of the bounds $m$ and $M$. The optimal choice of the bounds with respect to the minimization of the mean square error is a topic for further research.

In tab. 2 – 4 both the sample averages (mean) and the sample standard deviations (sd) of the simulated values are presented. For simplicity, we omit "HT" in the descriptions of the analyzed estimators in all figures and tables because all the estimators we compare here are of the Horvitz-Thompson type.
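The variance reduction obtained when the interviewer also knows $\Upsilon_i$ can be checked at the level of a single respondent. The Python sketch below assumes $\Upsilon_i$ uniform on $(m, M)$ and uses the constants that make the randomized value unbiased, i.e., $Z = 1 - \alpha + 2\alpha\Upsilon/(M-m)$ for a YES answer, $Z = -\alpha + 2\alpha\Upsilon/(M-m)$ for a NO answer, and $R = (M-m)Z + m(1-2\alpha)$. The per-respondent variance $(M-m)^2\,[\,p(1-p)(1-2\alpha) + \alpha^2/3\,]$ with $p = (Y-m)/(M-m)$, and its minimizer $\alpha = 3p(1-p)$, are derived for this illustration and should be read as a sketch rather than as formulas quoted from sec. 3; the function names are hypothetical.

```python
import random


def known_threshold_value(y, m, M, alpha, rng):
    """Randomized value R when both the answer and Upsilon are recorded."""
    upsilon = rng.uniform(m, M)
    base = (1.0 - alpha) if upsilon <= y else -alpha   # YES / NO branch
    z = base + 2.0 * alpha * upsilon / (M - m)
    return (M - m) * z + m * (1.0 - 2.0 * alpha)


def variance_per_respondent(y, m, M, alpha):
    """Variance of R for one respondent (derived for this sketch)."""
    p = (y - m) / (M - m)
    return (M - m) ** 2 * (p * (1.0 - p) * (1.0 - 2.0 * alpha) + alpha ** 2 / 3.0)


def best_alpha(y, m, M):
    """Value of alpha minimizing variance_per_respondent for a given y."""
    p = (y - m) / (M - m)
    return 3.0 * p * (1.0 - p)


def monte_carlo_moments(y, m, M, alpha, draws, seed):
    """Empirical mean and variance of R over many simulated answers."""
    rng = random.Random(seed)
    values = [known_threshold_value(y, m, M, alpha, rng) for _ in range(draws)]
    mean = sum(values) / draws
    var = sum((v - mean) ** 2 for v in values) / (draws - 1)
    return mean, var
```

With $m$ = 7,000, $M$ = 60,000 and $Y$ = 24,000, the variance for $\alpha = 0.75$ is well below the one for $\alpha = 0$, in line with the ordering of the standard deviations in the tables. Averaging $p(1-p)$ over the population instead of using a single respondent would give a population-level counterpart of `best_alpha`.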

Table 2: Numerical results of simulations. The mean estimated salaries (in $10^3$ CZK) and the corresponding sample standard deviations (in $10^3$ CZK) for different population sizes $N$ and sample sizes $n$. Random numbers $\Upsilon_i$ are generated from the uniform distribution on the interval $[m, M]$ = [7,000; 40,000], $T$ = 30,000, $\alpha$ = 0.75, $\alpha_{opt}$ = 0.…, 1,000 simulated populations, 1,000 replications of each.

                              N = 200               N = 400
  Estimator                n = 20    n = 50      n = 20    n = 50
  t_HT              mean   24.270    24.272      24.287    24.288
                    sd      2.782     1.757       2.773     1.758
  t^R_AV,(m,M)      mean   23.189    23.192      23.203    23.205
                    sd      3.687     2.333       3.690     2.336
  t^R_AV,α          mean   23.192    23.194      23.206    23.207
                    sd      3.000     1.897       3.001     1.902
  t^R_AV,α_opt      mean   23.192    23.194      23.206    23.207
                    sd      2.965     1.875       2.966     1.880
  t^R_AV,T          mean   23.185    23.189      23.199    23.202
                    sd      6.066     3.836       6.068     3.837

Table 3: Numerical results of simulations. The mean estimated salaries (in $10^3$ CZK) and the corresponding standard deviations (in $10^3$ CZK) for different population sizes $N$ and sample sizes $n$. Random numbers $\Upsilon_i$ are generated from the uniform distribution on the interval $[m, M]$ = [7,000; 60,000], $T$ = 45,000, $\alpha$ = 0.75, $\alpha_{opt}$ = 0.59, 1,000 simulated populations, 1,000 replications of each.

                              N = 200               N = 400
  Estimator                n = 20    n = 50      n = 20    n = 50
  t_HT              mean   24.297    24.301      24.288    24.290
                    sd      2.773     1.758       2.813     1.779
  t^R_AV,(m,M)      mean   23.983    23.984      23.965    23.974
                    sd      5.530     3.501       5.529     3.495
  t^R_AV,α          mean   23.974    23.976      23.956    23.965
                    sd      4.401     2.786       4.398     2.780
  t^R_AV,α_opt      mean   23.976    23.977      23.958    23.967
                    sd      4.164     2.637       4.161     2.631
  t^R_AV,T          mean   23.991    23.992      23.973    23.982
                    sd      9.066     5.729       9.067     5.726

Table 4: Numerical results of simulations. The mean estimated salaries (in $10^3$ CZK) and the corresponding standard deviations (in $10^3$ CZK) for different population sizes $N$ and sample sizes $n$. Random numbers $\Upsilon_i$ are generated from the uniform distribution on the interval $[m, M]$ = [7,000; 80,000], $T$ = 45,000, $\alpha$ = 0.75, $\alpha_{opt}$ = 0.53, 1,000 simulated populations, 1,000 replications of each.

                              N = 200               N = 400
  Estimator                n = 20    n = 50      n = 20    n = 50
  t_HT              mean   24.275    24.273      24.299    24.299
                    sd      2.765     1.739       2.753     1.737
  t^R_AV,(m,M)      mean   24.138    24.140      24.158    24.168
                    sd      6.911     4.372       6.921     4.378
  t^R_AV,α          mean   24.145    24.146      24.165    24.174
                    sd      5.962     3.770       5.950     3.767
  t^R_AV,α_opt      mean   24.143    24.145      24.163    24.173
                    sd      5.404     3.417       5.398     3.417
  t^R_AV,T          mean   24.136    24.137      24.156    24.165
                    sd     13.018     8.236      13.036     8.244
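To compare the estimators across the tables it can help to express each standard deviation relative to that of the Horvitz-Thompson estimator based on non-randomized data. The short Python sketch below does this for the $N$ = 200, $n$ = 20 column of Table 3; the dictionary keys and the function name are illustrative assumptions, and the numbers are copied from the table.

```python
# Standard deviations copied from the N = 200, n = 20 column of Table 3.
SD_TABLE3 = {
    "t_HT": 2.773,
    "t_AV,(m,M)": 5.530,
    "t_AV,alpha": 4.401,
    "t_AV,alpha_opt": 4.164,
    "t_AV,T": 9.066,
}


def relative_increase(sd_table, baseline="t_HT"):
    """Percentage increase of each sd over the baseline estimator's sd."""
    base = sd_table[baseline]
    return {
        name: 100.0 * (sd / base - 1.0)
        for name, sd in sd_table.items()
        if name != baseline
    }


if __name__ == "__main__":
    for name, pct in relative_increase(SD_TABLE3).items():
        print(f"{name}: +{pct:.1f} %")
```

The same function can be applied to any other column of tab. 2 – 4 by replacing the dictionary.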

Figure 3: Behavior of the considered estimators applied to different population sizes $N$ and sample sizes $n$; $(m, M)$ = (7,000; 40,000), $T$ = 30,000, $\alpha$ = 0.75 and $\alpha_{opt}$ = 0.…

Figure 4: Behavior of the considered estimators applied to different population sizes $N$ and sample sizes $n$; $(m, M)$ = (7,000; 60,000), $T$ = 45,000, $\alpha$ = 0.75 and $\alpha_{opt}$ = 0.59.

Figure 5: Behavior of the considered estimators applied to different population sizes $N$ and sample sizes $n$; $(m, M)$ = (7,000; 80,000), $T$ = 45,000, $\alpha$ = 0.75 and $\alpha_{opt}$ = 0.53.

5. Conclusions

The purpose of this paper is to present a new randomized response technique possessing two attractive properties, namely:
• It is simple to use.
• It provides a high level of anonymity to the respondent.

Though a quantitative estimate is the final goal, the respondent is only asked for a qualitative response. Two modifications are discussed as well. The suggested estimators are based on the values of pseudorandom numbers generated by the respondents, which are used for masking sensitive information.

A small disadvantage of the suggested method may, for some respondents, be a feeling of infringement on their privacy due to an extrinsic device/technique being used for generating the random numbers. This problem is mainly of a psychological nature and can, at least partially, be resolved by a proper explanation of the approach by the interviewer. Unfortunately, all currently used RRT procedures suffer, to a certain extent, from the same problem; see, for example, the thorough discussion in Chaudhuri (2017) and Chaudhuri and Christofides (2013).

The first modification assumes that not only the respondent, but also the interviewer knows the generated random number that masks the true value of the response. The second modification makes use of switching questions with the aim of making the survey less embarrassing for respondents in certain specific situations. For all suggested RRT procedures, we show their unbiasedness, and derive the corresponding variance of the Horvitz-Thompson type estimator under simple random sampling without replacement. The optimal values of the tuning parameters enabling us to minimize the variance of the suggested procedures are also discussed. The first modification seems to be especially promising because we have shown that knowing the random number and properly setting the tuning parameters can substantially increase the precision of the estimator. For the second modification, it is good to know that it would not work in practice. On the other hand, we admit that, for some readers, the suggested modifications may be of interest, even if only from the theoretical point of view.

As a technical tool, two auxiliary measures are proposed, called the mean relative concentration measure of the values of $Y$ around the center of the interval $(0, M)$, and the proximity measure of the population mean to the center of the interval $(0, M)$. With the aid of these measures we can explain why, and especially how, the suggested RRTs increase the variance of the estimators of $t_Y$ and $\bar{t}_Y$ for symmetrical distributions; distributions closely concentrated around their centers; or uniform distributions.

We would like to summarize the merits of the method proposed in this paper. In our opinion, it brings progress in this field. The first advantage is that our method is easy to implement because there exist many more or less easily available online/offline generators of random numbers from the uniform distribution. If the main goal of a survey is to estimate a continuous random variable with a large span, like income or personal wealth, then, for the "classical RRT methods" described in sec. 2, we need to design a very large deck of cards to mask the true values of the surveyed variable. For example, if we assume an income range from 7,000 CZK to 60,000 CZK, as is reasonable in our example, the number of cards needed for Eriksson's RRTs, provided the income values are rounded to 1,000 CZK, is 54. If the rounding step is 500 CZK, then 107 cards are needed. Finally, if the rounding step is 100 CZK, then 531 cards are needed. Manipulating such a large deck of cards can be cumbersome for both the respondent and the interviewer. Even if the span of the surveyed variable is not very large, it is not easy to find precise instructions, or algorithms, concerning how to design the corresponding deck of cards. Difficulties with this design may pose a problem for field survey statisticians, discouraging them from the use of such RRTs.

Our technique also shares the ease of use with Eriksson's technique. Unlike Chaudhuri's approach, which requires quite demanding arithmetic operations from the respondent, each respondent only states whether his/her true income is higher than a certain number. Let us point out that the respondent never reports the true value of the variable. In our original proposal, described in sec. 3, the interviewer moreover does not know the value of $\Upsilon$. Thus, the privacy of the respondent is protected better than in Eriksson's approach, which intrudes on the privacy of the respondents to a certain extent. Indeed, if the value reported by a respondent differs from all of the values $x_t$, the interviewer learns the true value of the sensitive variable.

Finally, note that we find any comparison of our approach with the methods employed by Eriksson or Chaudhuri rather problematic, because their performance strongly depends on the choice of the cards used. In our opinion, it is tricky to design a deck of cards for a continuous variable with a wide range, such as income in the Czech Republic, and a reliable estimator of this type with an acceptably small variance would need an excessively large deck of cards.
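The card counts quoted above for Eriksson's RRT follow from simple arithmetic: one card is needed for every rounded value between the two bounds, inclusive. A minimal Python sketch (the function name is an illustrative assumption):

```python
def cards_needed(low, high, step):
    """Number of distinct card values from low to high (inclusive) in steps of `step`."""
    if step <= 0 or high < low:
        raise ValueError("need step > 0 and high >= low")
    if (high - low) % step != 0:
        raise ValueError("the range must be a whole number of steps")
    return (high - low) // step + 1


if __name__ == "__main__":
    for step in (1000, 500, 100):
        print(step, cards_needed(7000, 60000, step))
```

For the income range from 7,000 CZK to 60,000 CZK this gives 54, 107 and 531 cards for rounding steps of 1,000, 500 and 100 CZK, respectively.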

Acknowledgements:

The work was supported by grant GA ČR P403/19/02773S.

References

[1] Arnab, R. 1995. Optimal estimation of a finite population total under randomized response surveys. Statistics 27: 175 – 180. DOI: https://doi.org/10.1080/02331889508802520
[2] Arnab, R. 1998. Randomized response surveys. Optimum estimation of a finite population total. Statistical Papers 39: 405 – 408. DOI: https://doi.org/10.1007/BF02927102
[3] Bose, M. and K. Dihidar. 2018. Privacy protection measures for randomized response surveys on stigmatizing continuous variables. J. of Applied Statistics 45: 2760 – 2772. DOI: https://doi.org/10.1080/02664763.2018.1440540
[4] Bourke, P.D. and T. Dalenius. 1976. Some new ideas in the realm of randomized inquiries. International Statistical Review 44: 219 – 221. DOI: https://doi.org/10.2307/1403280
[5] Brick, M.J. 2013. Unit nonresponse and weighting adjustments: A critical review. J. of Official Statistics 29: 329 – 353. DOI: https://doi.org/10.2478/jos-2013-0026
[6] Chaudhuri, A. 1987. Randomized response surveys of a finite population. A unified approach with quantitative data. J. Statistical Planning and Inference 15: 157 – 165. DOI: https://doi.org/10.1016/0378-3758(86)90094-7
[7] Chaudhuri, A. 2017. Randomized Response and Indirect Questioning Techniques in Surveys. Chapman and Hall/CRC, New York. ISBN 978-11-3811542-2.
[8] Chaudhuri, A. and T.C. Christofides. 2013. Indirect Questioning in Sample Surveys. Springer, Heidelberg. ISBN 978-3642362750.
[9] Dalenius, T. and R.A. Vitale. 1979. A new randomized response design for estimating the mean of a distribution. In: Contributions to Statistics. Edited by Jurečková, J. Academia, Praha, 54 – 59.
[10] Eriksson, S.A. 1973. A new model for randomized response. International Statistical Review …
[11] – [12] … Journal of Applied Statistics 36: 1361 – 1367. DOI: https://doi.org/10.1080/02664760802684151
[13] Horvitz, D.G. and D.J. Thompson. 1952. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association 47: 663 – 685. DOI: https://doi.org/10.1080/01621459.1952.10483446
[14] ISO 28640:2010 (en). Random variate generation methods …
[15] … J. of Official Statistics 31: 31 – 59. DOI: https://doi.org/10.1515/jos-2015-0002
[16] Kuha, J. and J. Jackson. 2014. The item count method for sensitive survey questions: Modelling criminal behaviour. Applied Statistics 63: 321 – 341. DOI: https://doi.org/10.1111/rssc.12018
[17] R Core Team. 2018. R: A Language and Environment for Statistical Computing. …
[18] Pfeffermann, D. and C.R. Rao, Eds. 2009. Handbook of Statistics 29A: Sample Surveys: Design, Methods and Application. Elsevier BV, Amsterdam. ISBN 978-0-444-53124-7.
[19] Pfeffermann, D. and C.R. Rao, Eds. 2009. Handbook of Statistics 29B: Sample Surveys: Inference and Analysis. Elsevier BV, New York. ISBN 978-0-444-53438-5.
[20] Särndal, C-E., B. Swensson, and J. Wretman. 1992. Model Assisted Survey Sampling. Springer, Heidelberg. ISBN 978-0-387-40620-6.
[21] Särndal, C-E. and S. Lundström. 2005. Estimation in Surveys with Nonresponse. J. Wiley and Sons, Chichester. ISBN 978-0-470-01133-1.
[22] Steeh, C., et al. 2001. Are they really as bad as they seem? Nonresponse rates at the end of the twentieth century. J. of Official Statistics 17: 227 – 247.
[23] Stoop, I.A.L. 2005. The Hunt for the Last Respondent: Nonresponse in Sample Surveys. Social and Cultural Planning Office of the Netherlands, The Hague. ISBN 90-377-0215-5.
[24] Synodinos, N.E. and S. Yamada. 2000. Response rate trends in Japanese surveys. International J. of Public Opinion Research 12: 48 – 72. DOI: https://doi.org/10.1093/ijpor/12.1.48
[25] Tillé, Y. 2006. Sampling Algorithms. Springer, New York. ISBN 978-0387-30814-2.
[26] Trappmann, M., I. Krumpal, A. Kirchner and B. Jann. 2014. A new technique for asking quantitative sensitive questions. J. of Survey Statistics and Methodology 2: 58 – 77. DOI: https://doi.org/10.1093/jssam/smt019
[27] Vrabec, M. and L. Marek. 2016. Model of distribution of wages. In: … Banská Štiavnica, 378 – 396. ISBN 978-80-89438-04-4, ISSN 2453-9902. https://amsesite.wordpress.com
[28] Warner, S.L. 1965. Randomized response: A survey technique for eliminating evasive answer bias. J. American Statistical Association 60: 63 – 69. DOI: https://doi.org/10.2307/2283137

Charles University, Faculty of Mathematics and Physics, Sokolovská 83, CZ – 186 75 Praha 8 – Karlín, Czech Republic

Università di Cagliari, Facoltà di Economia, viale S. Ignazio da Laconi 17, I – 09123 Cagliari, Italy

Prague University of Economics and Business, Faculty of Informatics and Statistics, W. Churchill Sq. 4, CZ – 130 67 Praha 3