Big Data meets Causal Survey Research: Understanding Nonresponse in the Recruitment of a Mixed-mode Online Panel
Barbara Felderer*  Jannis Kueck†  Martin Spindler†

Abstract
Survey scientists increasingly face the problem of high-dimensionality in their research as digitization makes it much easier to construct high-dimensional (or "big") data sets through tools such as online surveys and mobile applications. Machine learning methods are able to handle such data, and they have been successfully applied to solve predictive problems. However, in many situations, survey statisticians want to learn about causal relationships to draw conclusions and be able to transfer the findings of one survey to another. Standard machine learning methods provide biased estimates of such relationships. We introduce into survey statistics the double machine learning approach, which gives approximately unbiased estimators of causal parameters, and show how it can be used to analyze survey nonresponse in a high-dimensional panel setting.

Key words: machine learning, causal inference, survey nonresponse, panel dropout

* GESIS Leibniz Institute for the Social Sciences. Correspondence: Barbara Felderer, GESIS Leibniz Institute for the Social Sciences, B2,1, 68159 Mannheim. Email: [email protected]
† University of Hamburg
1 Introduction

A key attribute of "big data" is the large volume of data that is collected or generated, often for the purpose of statistical analysis (for further attributes see, for example, Japec et al. (2015)). When a large number of observed characteristics are available for only a limited number of observations, however, the high dimensionality of the data sets poses challenges. Moreover, big data comes in a variety of forms, including many sorts of paradata (Kreuter, 2013b) such as call records, time stamps or device-type and questionnaire-navigation data from online surveys (Callegaro, 2013), as well as sensor data from mobile surveys (Struminskaya et al., 2020) and data from outside sources that can augment survey data and be linked to persons or population groups by unique personal or group identifiers. These outside data contain, for example, administrative records (cf. Durrant and Steele (2009) for nonresponse analysis), data from social media (an extensive discussion of the role of social media in public opinion research can be found in Murphy et al. (2014)) or regional information (e.g., Feddersen et al. (2016) study the impact of weather and climate on self-reported life satisfaction). Increasingly, the field of survey analysis is facing the challenges posed by high-dimensional data sets. Long-lasting panel surveys produce big data, for example, by collecting large numbers of variables over many panel waves. Some frequently used methods cannot be employed with big data sets that have comparatively few observations and numerous variables. To deal with problems of high dimensionality, machine learning methods have found their way into survey research modeling (see, for example, Buskirk et al. (2018), Buskirk (2018), Kirchner and Signorino (2018), Eck (2018) and Kern et al.
(2019) for introductions to the use of machine learning techniques for survey methodological questions).

Generally speaking, there are two main kinds of statistical modeling: causal inference (also known as explanatory analysis) and predictive modeling. Both have their own model-building logic and evaluation tools (Breiman, 2001). As Shmueli (2010) states, high predictive power does not necessarily imply high explanatory power, so different tools should be used to explain and to predict. The aim of prediction models is to predict the dependent variable y for individuals who were not among those used to build the model. The best model is found, for example, by minimizing the out-of-sample mean squared error (MSE). Modern machine learning methods have been highly successful at building predictive models. In contrast to predictive modeling, causal inference entails learning the effect of a particular variable on the dependent variable y while holding all other variables constant. Being able to draw ceteris paribus conclusions in this manner, researchers can think about interventions (i.e., changing x will affect y in a known way) and use this to design future studies. Applying modern machine learning methods to gain explanatory insights, however, is more challenging than building predictive models because machine learning methods inevitably introduce some bias in the estimation (Belloni et al., 2014a). In recent years, progress has been made in applying machine learning to causal inference, and tools for doing so, such as the double machine learning framework, have been developed.
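The model-selection logic of predictive modeling, comparing candidate models by their out-of-sample MSE, can be sketched as follows. This is an illustration on simulated data with scikit-learn; the models and parameters are ours, not from the paper:

```python
# Illustration of the predictive model-building logic described above:
# candidate models are compared by their out-of-sample MSE on held-out
# data. Simulated data; the model choices are ours, not from the paper.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 50))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=300)

# Hold out a test set that plays the role of "individuals who were not
# among those used to build the model"
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

mses = {}
for model in (LinearRegression(), Ridge(alpha=10.0)):
    model.fit(X_tr, y_tr)
    mses[type(model).__name__] = mean_squared_error(y_te, model.predict(X_te))
print(mses)  # the model with the smaller out-of-sample MSE is preferred
```
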
In this paper, we demonstrate how survey statistics can benefit from these methods, obtaining insights into dealing with high-dimensional survey data sets by applying the double machine learning method to learn about nonresponse in the recruitment of the GESIS panel.

Survey nonresponse is arguably one of the chief problems in survey research (Kreuter, 2013a), and many decades of study have been invested in developing methods to explain and thereby prevent or adjust for it (for recent examples see Durrant and Steele (2009); Roßmann and Gummer (2016)). With the rise of big data and the increasing number of variables being considered, one of the more recent methods is machine learning. Multiple studies have demonstrated its usefulness in this context: For example, Kern et al. (2019) show that regression trees can effectively be used to predict nonresponse in the German Socio-Economic Panel; Phipps et al. (2012) use trees to analyze nonresponse in an establishment panel; and Buskirk and Kolenikov (2015) use random forest classification models and random forest relative class frequency models to predict response propensities in a simulation study. Other examples are Signorino and Kirchner (2018), who employ adaptive lasso to predict nonresponse in the National Health Interview Survey; Earp et al. (2014), who use an ensemble of classification trees to predict nonresponse in an establishment survey's subsequent wave; Kern et al.
(2019), who apply different machine learning methods to predict nonresponse using information from multiple waves of the GESIS panel; and Zinn and Gnambs (2020), who use Bayesian additive regression trees to predict temporary and permanent dropout in an event history analysis in the German National Educational Panel Study. Finally, Liu (2020) compares the use of random forests, support vector machines and lasso regression to predict response in the second interview of the Surveys of Consumers national telephone survey.

As mentioned above, one must be careful when the results produced by machine learning algorithms are interpreted beyond predictions. While nonresponse prediction can be seen as a goal in its own right, one must be clear about its limitations: the effects of the control variables cannot be interpreted because machine learning algorithms – when applied directly – inevitably introduce bias, and thus no understanding of any causal effects of explanatory variables on the dependent variable of interest can be gained. Nonresponse prediction models help to identify individuals who are most likely to drop out but do not allow us to understand the driving factors, which are, however, key to identifying and developing prevention strategies (Lynn, 2017).

In this paper, we use machine learning methods not only to predict nonresponse, but to analyze explanatory factors in a high-dimensional setting for survey statistics. Recently, double machine learning techniques that deal with high dimensions and deliver unbiased estimates have been developed (cf. Chernozhukov et al. (2015); Belloni et al. (2017); Chernozhukov et al. (2018)). We give an introduction to the double machine learning approach and show how the double lasso can be applied to explain nonresponse in the welcome survey of the GESIS panel.
Using our causal machine learning approach, we find that nonresponse is affected by respondents' socio-demographic characteristics and by interviewers' ratings of both the respondents' cooperativeness during the interview and the respondents' likelihood to participate in the welcome survey. Socio-demographics are additionally found to interact with the chosen mode of participation. Our findings can help survey researchers who design and implement panel surveys to develop targeted strategies to prevent nonresponse.

The rest of the paper is structured as follows: In Section 2 we introduce the basic principles of double machine learning, focusing on double selection for logistic regression models. In Section 3, we describe an application to nonresponse modeling in the GESIS panel. We conclude with a discussion in Section 4.

2 Double Machine Learning

Machine learning methods have been developed mostly for prediction problems, which are based on finding correlations among variables. Often the machine learning algorithm is considered to be a black box that delivers acceptable forecast accuracy but in which the interaction of the variables is not understood. In many situations, however, scientists and practitioners are interested in learning the effect of certain variables, often called treatment variables, on one or more dependent variables, holding all other factors constant. This is more challenging than building a predictive model because here the black box must be opened and the inner mechanism learned.

Almost all machine learning methods, like the lasso, lead to biased estimates of causal relationships and hence invalid inference results, despite their predictive power (Belloni et al., 2014a). In recent years, frameworks for valid post-selection inference have been developed. The double machine learning framework we present in the following section allows for such valid inference and hence learning about parameters and explanatory variables in a high-dimensional setting.
In this section, we introduce the basic ideas behind double machine learning. The goal is to estimate the treatment effect α of a treatment variable D on the dependent variable Y in a high-dimensional setting, namely

Y = γ + αD + g(X) + ε,  E(ε | D, X) = 0,

where γ is the intercept and g(·) a function of the control variables. The set of control variables X = (X_1, ..., X_p) might be high-dimensional. The most common case, which we will focus on here, is a linear approximation g(X) = β_1 X_1 + ... + β_p X_p, with β = (β_1, ..., β_p) as nuisance parameters.

Our goal is to perform valid inference on the treatment parameter α in a high-dimensional setting, i.e., the number of variables p might be larger than the number of observations n. The function g, or in the linear case the parameter vector β, is considered a nuisance parameter and is not part of the model interpretation. For ease of exposition, we consider the case of one treatment variable here, but several treatment variables can just as easily be considered and their effects estimated at the same time. If the number of variables or hypotheses to test becomes large, methods from simultaneous inference may be applied (for a survey on recent developments, we refer to Bach et al. (2018)).

In a high-dimensional setting, standard ordinary least squares (OLS) estimation is not appropriate because of overfitting, which leads to poor estimates and forecasts. A naive approach often employed by empirical researchers is to use the lasso to select the relevant regressors first and then to conduct an OLS regression of the dependent variable Y on the treatment variable D and the regressors selected by the lasso. This procedure, however, leads to biased results because the lasso can fail to select variables that are strongly correlated with the treatment variable but only weakly correlated with the dependent variable.
While this does not harm the predictive performance of the lasso, it leads to omitted variable bias (Belloni et al., 2014b), which biases the inference results. To correct for this problem, a de-biased lasso/double machine learning approach was introduced by Chernozhukov et al. (2018). To understand this approach, we introduce an auxiliary equation for the treatment variable, as follows:

D = γ_1 X_1 + ... + γ_p X_p + ν.

The idea of double machine learning is to run a lasso regression of the auxiliary model in a first step to identify which variables create the omitted variable bias and subsequently include them in the final regression step. It can be shown that this approach leads to estimates of the target parameter that are asymptotically normally distributed (allowing valid post-selection inference). Introducing this auxiliary regression step and including omitted variables in the final regression implicitly creates a moment condition for the target parameter that fulfills the so-called Neyman orthogonality property. This means that the derivative of the corresponding score function with respect to the nuisance parameter is equal to zero at the true parameter values. Intuitively, small errors in the estimation of the nuisance parameter, as they occur under the lasso, do not have a first-order effect on the estimate of the treatment parameter. Despite selection errors in the confounders, valid results are achieved.
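The two-step logic just described (an auxiliary lasso for the treatment equation alongside a lasso for the outcome equation, with the union of selected controls entering the final OLS step) can be sketched on simulated data. This is our illustration with scikit-learn, not the authors' implementation:

```python
# Double selection sketch: lasso on Y ~ X and on the auxiliary equation
# D ~ X, then OLS of Y on D plus the union of selected controls.
# Simulated data; the true treatment effect is alpha = 0.5.
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(0)
n, p, alpha_true = 200, 300, 0.5

X = rng.normal(size=(n, p))
# X[:, 0] is a confounder: strongly related to D, weakly to Y;
# exactly the kind of variable a single lasso on Y can miss
D = 2.0 * X[:, 0] + rng.normal(size=n)
Y = alpha_true * D + 0.25 * X[:, 0] + rng.normal(size=n)

sel_y = np.flatnonzero(LassoCV(cv=5).fit(X, Y).coef_)  # outcome lasso
sel_d = np.flatnonzero(LassoCV(cv=5).fit(X, D).coef_)  # auxiliary lasso
selected = np.union1d(sel_y, sel_d)

# Final step: OLS of Y on D and the union of selected controls
Z = np.column_stack([D, X[:, selected]])
alpha_hat = LinearRegression().fit(Z, Y).coef_[0]
print(round(alpha_hat, 2))  # close to the true value 0.5
```

Running the outcome lasso alone can drop `X[:, 0]` because its direct effect on Y is weak; the auxiliary regression of D on X recovers it, which removes the omitted variable bias in the final OLS step.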
In many survey applications, the dependent outcome variable is binary, and for binary outcome variables, logistic regression is often the approach of choice. For logistic regression, the same arguments as outlined above apply when modern machine learning methods such as the lasso are used to select variables and estimate the coefficients. To enable valid post-selection inference for the logistic regression, the double machine learning approach has to be modified appropriately (cf. Belloni et al. (2013)).
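For intuition, the penalized logistic first stage that the modified approach builds on can be sketched as follows. This is a minimal illustration on simulated data with scikit-learn; the data-generating process and tuning choices are ours:

```python
# Sketch of a first-stage l1-penalized (lasso) logistic regression.
# Illustration only: simulated data, and all variable names are ours.
# On its own this estimator is biased for inference, which is why the
# de-biasing steps described in the appendix algorithm are needed.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n, p = 500, 100
X = rng.normal(size=(n, p))
D = X[:, 0] + rng.normal(size=n)  # treatment, confounded by X[:, 0]
logits = 0.8 * D + 0.5 * X[:, 0]
Y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))

# scikit-learn's C is an inverse penalty level (small C = strong
# l1 penalty, roughly n/lambda in the paper's notation)
lasso_logit = LogisticRegression(penalty="l1", solver="liblinear", C=0.02)
lasso_logit.fit(np.column_stack([D, X]), Y)

# The l1 penalty shrinks most coefficients exactly to zero
n_nonzero = np.count_nonzero(lasso_logit.coef_)
print(f"{n_nonzero} of {p + 1} coefficients are nonzero")
```
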
In the logistic regression, a binary dependent variable Y is related to a scalar treatment D of interest and a p-dimensional control vector X through a link function G:

E[Y | X, D] = G(Dα + X'β).

For logistic regression, the link function is given by G(t) = exp(t)/{1 + exp(t)}. We aim to perform statistical inference on the coefficient α, which represents the impact of the treatment on the dependent variable through the link function. Estimation is usually based on the (negative) log-likelihood function associated with the logistic link function, as follows:

Λ_i(α, β) = log{1 + exp(D_i α + X_i'β)} − Y_i (D_i α + X_i'β).

For estimation in a high-dimensional setting, an ℓ1-penalty term, ||(α, β)||_1 = |α| + Σ_{j=1}^p |β_j|, is added to the minimization problem. The lasso logistic regression estimator is given by:

(α̂, β̂) ∈ arg min_{α,β} E_n[Λ_i(α, β)] + λ/n ||(α, β)||_1,

where λ is the penalty level and E_n denotes the empirical mean. As discussed in the section above, inference on the treatment parameter α is challenging and requires a modified estimation method, e.g., the de-biased lasso estimator, based on a modified moment condition. The algorithm for the de-biased estimation of the treatment parameter α is presented in Algorithm 1 in Appendix B.

3 Application: Nonresponse Modeling for the GESIS Panel
To illustrate the double machine learning lasso, we apply the technique to model nonresponse in the 2013 recruitment to the GESIS panel.
Recruitment to a probability-based panel is arguably the most important and most expensive part of the panel life-cycle. The recruited sample needs to represent the target population in order for valid inferences to be drawn for that population, and the sample size needs to be large enough to obtain precise estimates. The recruitment process usually includes several steps: contacting sampled cases and inviting them to a first recruitment survey, conducting this recruitment interview and, often during it, obtaining consent to proceed in the panel. Consenting respondents are then invited to a welcome survey (or profile survey), and those who complete it are considered to be panel members. The panel members are then surveyed on a regular basis.

Even if the regular panel waves are conducted in a self-administered mode (e.g., by mail questionnaire and/or online), it is common to approach sampled persons and conduct the recruitment interview in an interviewer-administered (face-to-face or telephone) mode (Blom et al., 2016). Respondents to the recruitment survey are then asked to proceed with the subsequent survey using cost-saving self-administered modes. This, however, includes a switch in response mode that may be subject to systematic nonresponse.

For our application, we choose nonresponse in the first interview after this switch of modes. We consider this stage to be very important for several reasons: First, this is when a large number of respondents to the recruitment survey are usually lost (for nonresponse rates in four large-scale scientific (mixed-mode) online surveys, see Blom et al. (2016)), and there is a need to understand nonresponse in order to prevent it, i.e., by tackling likely nonresponse through targeted invitations (Lynn, 2020).
Second, nonresponse among respondents to the face-to-face interview is costly if we consider that they have completed the cost- and labour-intensive personal interview and are no longer available to take part in the less expensive self-administered part of the panel. In addition, refreshment samples are usually planned for panels once the number of respondents has fallen below a certain minimum number. Starting with a smaller sample means that costly new recruitment is needed sooner. Third, nonresponse can introduce bias into the panel. If respondents are not lost at random, analyses of panel data can be severely biased.

While a number of studies have been published on panel attrition, e.g., nonresponse to individual panel waves or dropout from the panel, the literature about nonresponse at the recruitment stages is surprisingly scarce. Sakshaug and Huber (2015) analyze total recruitment error, which they define as error from initial nonresponse plus error from non-consent to be contacted again. In their comparison of a self-administered (mail/web) and a CAPI recruitment, they find, for both modes, nonresponse bias to be larger than non-consent bias and total recruitment bias to be similar in both groups: both recruited samples overrepresent older and more educated population groups, currently employed persons and higher-wage groups and underrepresent foreign-born persons. For the GESIS recruitment panel, Bosnjak et al. (2018) find age, citizenship, marital status, household size, place of birth, education and household income to be distributed differently in the sample of respondents compared to the general population, with the differences tending to be larger for the welcome survey.

In contrast to the initial recruitment survey, in which usually only a few variables from the sample frame are available, the recruitment interview usually generates a lot of information on the respondent, facilitating the study of nonresponse in the welcome survey.
In addition to basic sociodemographic information, the recruitment survey often includes information on attitudes and survey experience. In interviewer-administered surveys, the interviewers often provide information about the interview situation and their expectations of the respondents' future participation in the panel. In particular, interviewers' ratings of a respondent's propensity to participate in a future survey, as well as ratings of cooperativeness and enjoyment, have been found to improve nonresponse models (see, for example, Sinibaldi and Eckman (2015); Plewis et al. (2017)). Understanding the nonresponse process better can help to identify measures to address the problem, for example through targeted invitations (Lynn, 2020).

While having a rich set of factors that potentially influence nonresponse is very helpful for understanding the nonresponse decision, it poses a challenge to nonresponse modeling. Indeed, including a large number of variables, possibly split into multiple dummy variables, and interactions requires big data solutions.

3.2 The GESIS panel data
The GESIS panel (Bosnjak et al., 2018) is a probability-based, mixed-mode online and postal mail panel conducted bimonthly by GESIS – Leibniz Institute for the Social Sciences in Mannheim, Germany. The first cohort of the GESIS panel was recruited in 2013, and refreshment samples were recruited in 2016 and 2018. Recruitment to the GESIS panel in 2013 was based on a random sample of 21,870 German-speaking residents of Germany aged 18 to 70 during the year of recruitment. In the first step, all sampled cases were invited to participate in a face-to-face recruitment survey. During this survey, respondents were asked for their consent to be invited to the GESIS panel by means of the self-administered online mode or the paper-and-pencil mode. Consenting respondents were then invited to participate in the welcome survey in the mode of their choice. Only after completing the welcome survey were respondents considered to be GESIS panel members.

In our study, we analyze nonresponse in the 2013 welcome survey among consenting respondents. We use data from the GESIS panel registration survey in 2013 (GESIS, 2020) to model nonresponse (or drop-out) (yes/no) in the subsequent welcome survey. In total, 7,599 persons participated in the face-to-face registration survey, of whom 6,210 agreed to being invited to the welcome survey and participating in the GESIS panel. Of these individuals, 4,938 responded to the welcome survey and thus became regular panel members (dropout rate: 20.5%). After excluding cases with missing values, our analysis is based on 5,908 respondents from the recruitment survey, of whom 4,720 completed the welcome survey (dropout rate: 20.1%).

We model nonresponse in the welcome survey with the logistic regression

E[Y | X, D] = exp(Dα + X'β) / {1 + exp(Dα + X'β)}.

The binary dependent variable Y indicates nonresponse to the welcome survey. The regressors split up into 303 control variables X and 26 treatment variables D. For the treatment variables, we choose key socio-demographics, the mode the respondents chose for the welcome interview (paper-and-pencil or online questionnaire) and interviewer ratings collected in the recruitment survey. The interviewer ratings include three cooperativeness ratings and one rating of individuals' willingness to participate in the welcome interview. The questions are:

• How would you rate the respondent's willingness to answer the questions? (answer categories: good, moderate, low, good in the beginning but got worse, low in the beginning but got better)

• How difficult or easy was it to persuade the respondent to take part in the interview? (answer categories: very difficult, rather difficult, rather easy, very easy)

• How difficult or easy was it to persuade the respondent to take part in the follow-up interview? (answer categories: very difficult, rather difficult, rather easy, very easy)

• How likely is it that the respondent will take part in the first online or paper questionnaire? (answer categories: very likely, rather likely, rather unlikely, very unlikely)

We combine sparse categories with other categories for our analysis. We recode the answer categories into good vs. bad/all other categories for "willingness to answer the questions" and combine very difficult and rather difficult for the two questions on the difficulty of persuading respondents to take part in the interview and follow-up interview. For the rating of the likelihood of response to the first online or paper questionnaire, we combine rather unlikely and very unlikely. With regard to sociodemographics, we include age, gender, highest educational degree, country of birth and living situation.
We generate the living situation variable from information on marital status, partnership and living in a shared household, leading to the five categories: no partner; partner, not in household; partner, in household; married, living together; married, living apart. An overview of the coding for all treatment variables can be found in Table 3 in the Appendix. We include interactions of the choice of mode for the welcome survey with age, education and living situation to account for differential effects of the choice of mode on nonresponse.

Treatment variables:
gender, age, nationality, education, living situation, invitation mode, willingness to answer the questions, willingness to participate in the interview, willingness to participate in the panel, probability of participating in the survey

Control variables:
migration, employment status, occupational group, life satisfaction, leisure time, country of birth, internet use, technical affinity, survey experience, household size, number of children, income, incentive point, invitation hesitance, interview intervention

Table 1: Extract of Regressors

3.3 Results

In this section, we present the results of our double machine learning approach to the inferential analysis of nonresponse in the GESIS panel. The results of the double lasso for logistic regression are visualized in Figures 1 to 3, and a regression table can be found in Table 2 in the Appendix. We start with the interpretation of the interviewer ratings. The estimated coefficients of the interviewer ratings from the logistic regression, together with the corresponding confidence intervals, are displayed in Figure 1.
Cooperativeness
We find that the interviewer observation of respondents' willingness to answer the survey questions in the recruitment survey had a highly significant negative effect on survey nonresponse. Respondents who were rated as having good willingness to respond to the recruitment survey dropped out of the survey after the recruitment stage to a lesser extent than respondents who were rated as having low willingness. We do not find significant effects for the ease of persuading respondents to participate in the interview nor for the ease of persuading respondents to consent to be contacted again for the follow-up interview. The effects, however, tend in the same direction as the observed willingness to answer the questions: respondents who were rated as being rather easy or very easy to persuade were less likely to drop out.

Figure 1: Regression coefficients of the interviewer ratings in the logistic regression model.
Rated likelihood of participation
Respondents who were rated as being rather or very likely to participate in the welcome survey dropped out after the recruitment survey to a lesser extent than did those who were rated as being rather unlikely or very unlikely to participate. We find, however, that the only significant effect in this regard is for the "very likely" category.

Socio-demographics and chosen survey mode
Next, we discuss the effects of socio-demographics and the chosen survey mode for the welcome survey. The results are found in Figures 2 and 3.
Figure 2: Regression coefficients of the socio-demographic characteristics in the logistic regression model.

We do not find a significant effect for respondents' gender but do find a positive effect for having German citizenship: respondents with German citizenship dropped out after the recruitment survey at a higher rate than respondents without German citizenship.

We find that survey mode interacts with age, education (though only significantly with high education) and living situation (only significant at the 10% level for "married, living together"). We interpret the effects of all variables that
show a significant interaction with chosen survey mode. The online mode has a positive effect and interacts positively with age, which itself has a negative effect: the older the respondents, the lower their likelihood to drop out after the recruitment survey. The effect is much stronger for respondents who chose the paper-and-pencil mode than for those who chose the online mode (see Table 2 in the Appendix for the estimates).

Figure 3: Regression coefficients of the chosen survey mode in the logistic regression model.

4 Discussion

We find that socio-demographic characteristics, survey mode, one measure of cooperativeness, and a rating of willingness to participate in the welcome survey explain survey drop-out after the recruitment survey. Losing respondents at this stage is not only very costly but can, through selective nonresponse, put the validity of panel inference at risk. Thus, the goal of panel recruitment should be to prevent panel drop-out among population groups that are found to be least likely to become panel members. Knowing which population groups are likely to drop out can help in the identification and development of targeted strategies for these groups (Lynn, 2017). Especially interesting in this context is the moderating effect of mode choice. Knowing this, it might be worthwhile to develop targeted interventions that increase response depending on the mode the respondents choose. Further research is needed to determine which interventions are the most successful for different population groups.
In this paper, we introduce double machine learning methods to survey statistics, enabling researchers to study causal relationships between treatment variables and dependent variables in high-dimensional data sets while controlling for large numbers of variables. In an application, we analyze drop-outs in the recruitment to the GESIS panel using double machine learning for logit regression. Classical machine learning is well suited to predicting who will not respond to a survey but leads to biased estimates of causal relationships and invalid inference results, and should therefore not be applied in studies aiming to explain the effects of treatment variables in a high-dimensional setting. By performing valid post-selection inference, de-biased/double machine learning allows the significant variables influencing the dependent variable to be identified. Unbiased estimation is crucial for learning (a) which treatments affect which dependent variables in which ways and (b) which factors to manipulate to achieve better outcomes. Given that survey scientists are confronted with many types of big data, such as paradata from the web or data from sensor tracking or mobile apps, the applications for which survey scientists might benefit from the double machine learning technique are numerous. For future research, we intend to analyze how double machine learning might be used to select and include high numbers of control variables in imputation or weighting models.
A Data and Empirical Results
Variable                                                 Coefficient  p-value    2.5%   97.5%
Age                                                           -2.126    0.000  -3.120  -1.133
Female                                                        -0.069    0.440  -0.246   0.107
Germany                                                        0.489    0.003   0.171   0.807
Good willingness to answer questions                          -0.354    0.013  -0.632  -0.076
Rather easy to persuade respondent (interview)                -0.147    0.178  -0.360   0.067
Very easy to persuade respondent (interview)                  -0.209    0.111  -0.465   0.048
Rather easy to persuade respondent (follow-up interview)      -0.114    0.353  -0.354   0.127
Very easy to persuade respondent (follow-up interview)        -0.158    0.295  -0.454   0.138
Rather likely to participate                                  -0.039    0.821  -0.373   0.296
Very likely to participate                                    -0.436    0.012  -0.775  -0.098
Medium education                                              -0.146    0.289  -0.417   0.124
High education                                                 0.081    0.616  -0.235   0.397
Other education                                               -0.370    0.514  -1.481   0.741
Not married with partner, separate households                  0.014    0.951  -0.430   0.458
Not married with partner, joint household                      0.023    0.910  -0.371   0.416
Married living together                                       -0.634    0.003  -1.047  -0.221
Married living apart                                           0.233    0.562  -0.554   1.019
Online                                                         0.537    0.044   0.015   1.058
Age*online                                                     0.615    0.098  -0.114   1.343
Medium education*online                                       -0.282    0.123  -0.640   0.077
High education*online                                         -0.711    0.002  -1.153  -0.269
Other education*online                                        -0.938    0.294  -2.691   0.814
Not married with partner, separate households*online          -0.126    0.637  -0.652   0.399
Not married with partner, joint household*online              -0.218    0.449  -0.783   0.347
Married living together*online                                 0.334    0.092  -0.054   0.722
Married living apart*online                                   -0.605    0.309  -1.770   0.560
Table 2: Estimated treatment effects of the double lasso for logistic regression, including p-values and confidence intervals.

Variable                                   Answer Categories                                         Code (0 is baseline)
Participation mode                         Offline                                                   0: Offline
                                           Online                                                    1: Online
Willingness of the respondent              Good                                                      1: Good
to answer the question                     Medium                                                    0: Bad
                                           Bad                                                       0: Bad
                                           Good in the beginning but got worse                       0: Bad
                                           Bad in the beginning but got better                       0: Bad
Difficulty to persuade respondent          Very difficult                                            0: Difficult
to take part in interview                  Rather difficult                                          0: Difficult
                                           Rather easy                                               3: Rather easy
                                           Very easy                                                 4: Very easy
Difficulty to persuade respondent          Very difficult                                            0: Difficult
to take part in follow-up interview        Rather difficult                                          0: Difficult
                                           Rather easy                                               3: Rather easy
                                           Very easy                                                 4: Very easy
Likelihood of participation in first       Very likely                                               5: Very likely
online or paper questionnaire              Rather likely                                             2: Rather likely
                                           Rather unlikely                                           0: Unlikely
                                           Very unlikely                                             0: Unlikely
Highest education                          Still in school                                           8: High
                                           Left school without degree                                0: Low
                                           Lower secondary degree                                    0: Low
                                           Secondary degree                                          4: Medium
                                           Polytechnical secondary degree (GDR), 8th or 9th grade    0: Low
                                           Polytechnical secondary degree (GDR), 10th grade          4: Medium
                                           Advanced technical college certificate                    8: High
                                           General qualification for university entrance             8: High
                                           Other degree                                              9: Other
Gender                                     Male                                                      0: Male
                                           Female                                                    2: Female
Citizenship                                Germany                                                   0: Germany
                                           EU28                                                      4: Other
                                           Rest of Europe                                            4: Other
                                           Other                                                     4: Other
Age                                        Year of birth                                             2013 - year of birth
Living situation                           Not married, no partner                                   0: No partner
                                           Not married with partner, separate households             1: Partner not in household
                                           Not married with partner, joint household                 2: Partner in household
                                           Married living together                                   3: Married living together
                                           Married living apart                                      4: Married living apart
Table 3: Answer categories and coding of the treatment variables.
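The p-values and confidence intervals in Table 2 can be cross-checked against each other. As a small sketch, assuming the reported intervals are Wald-type 95% intervals of the form coefficient ± 1.96 · SE (the usual construction for such estimators; the function names below are ours, not from the authors' code), the implied standard error and two-sided p-value can be recovered from the reported bounds:

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def implied_p_value(coef, lower, upper):
    """Two-sided p-value implied by a 95% Wald confidence interval."""
    z_crit = 1.959964                       # Phi^{-1}(0.975)
    se = (upper - lower) / (2.0 * z_crit)   # half-width divided by z_crit
    z = coef / se
    return 2.0 * (1.0 - normal_cdf(abs(z)))

# Two rows from Table 2:
print(implied_p_value(-2.126, -3.120, -1.133))  # Age: below 0.001, reported as 0.000
print(implied_p_value(0.537, 0.015, 1.058))     # Online: close to the reported 0.044
```

Running this for the remaining rows reproduces the p-value column of Table 2 up to rounding, which is a quick sanity check that the intervals and p-values come from the same asymptotic normal approximation.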
B Double Machine Learning for Logistic Regression
The double machine learning approach for logistic regression includes three main steps:

(1) initial estimation of the regression function via post-lasso logistic regression,

(2) estimation of instruments that are orthogonal to the weighted controls via weighted post-selection least squares, and

(3) estimation of $\alpha$ based on the nuisance estimates obtained in (1) and (2).

The estimation procedure for $\alpha$ is summarized in more detail in the following algorithm:

Algorithm 1 (DML for Logistic Regression)

Step 1. Run a post-lasso logistic regression of $Y_i$ on $D_i$ and $X_i$:
$$(\hat\alpha, \hat\beta) \in \arg\min_{\alpha, \beta} \mathbb{E}_n[\Lambda_i(\alpha, \beta)] + \frac{\lambda}{n}\|(\alpha, \beta)\|_1,$$
$$(\tilde\alpha, \tilde\beta) \in \arg\min_{\alpha, \beta} \mathbb{E}_n[\Lambda_i(\alpha, \beta)] : \operatorname{support}(\beta) \subset \operatorname{support}(\hat\beta).$$

Step 2. For $i = 1, \ldots, n$, keep the value $X_i'\tilde\beta$ and weight $\hat f_i := \hat w_i / \hat\sigma_i$, where
$$\hat w_i = G'(D_i\tilde\alpha + X_i'\tilde\beta) \quad \text{and} \quad \hat\sigma_i^2 = \operatorname{Var}(Y_i \mid D_i, X_i) = G(D_i\tilde\alpha + X_i'\tilde\beta)\{1 - G(D_i\tilde\alpha + X_i'\tilde\beta)\}.$$

Step 3. Run a post-lasso OLS regression of $\hat f_i D_i$ on $\hat f_i X_i$:
$$\hat\theta \in \arg\min_\theta \mathbb{E}_n[\hat f_i^2 (D_i - X_i'\theta)^2] + \frac{\lambda}{n}\|\Gamma\theta\|_1,$$
$$\tilde\theta \in \arg\min_\theta \mathbb{E}_n[\hat f_i^2 (D_i - X_i'\theta)^2] : \operatorname{support}(\theta) \subset \operatorname{support}(\hat\theta).$$

Step 4. Keep the residual $\hat v_i := \hat f_i (D_i - X_i'\tilde\theta)$ and instrument $\hat z_i := \hat v_i / \hat\sigma_i$, $i = 1, \ldots, n$.

Step 5. Run an instrumental logistic regression of $Y_i - X_i'\tilde\beta$ on $D_i$, using $\hat z_i$ as the instrument for $D_i$:
$$\check\alpha \in \arg\inf_{\alpha \in \mathcal{A}} L_n(\alpha), \quad \text{where} \quad L_n(\alpha) = \frac{\big|\mathbb{E}_n[\{Y_i - G(D_i\alpha + X_i'\tilde\beta)\}\hat z_i]\big|^2}{\mathbb{E}_n[\{Y_i - G(D_i\alpha + X_i'\tilde\beta)\}^2 \hat z_i^2]}$$
and $\mathcal{A} = \{\alpha \in \mathbb{R} : |\alpha - \tilde\alpha| \le C/\log n\}$.

Step 6. Compute the confidence region with asymptotic coverage $1 - \xi$:
$$\{\alpha \in \mathbb{R} : |\alpha - \check\alpha| \le \hat\Sigma_n \Phi^{-1}(1 - \xi/2)/\sqrt{n}\}.$$

References

Bach, P., V. Chernozhukov, and M. Spindler (2018). Valid Simultaneous Inference in High-Dimensional Settings (with the hdm package for R). Papers 1809.04951, arXiv.org.

Belloni, A., V. Chernozhukov, I. Fernández-Val, and C. Hansen (2017).
Program evaluation and causal inference with high-dimensional data. Econometrica 85(1), 233–298.

Belloni, A., V. Chernozhukov, and C. Hansen (2014a). High-dimensional methods and inference on structural and treatment effects. Journal of Economic Perspectives 28(2), 29–50.

Belloni, A., V. Chernozhukov, and C. Hansen (2014b). Inference on treatment effects after selection amongst high-dimensional controls. Review of Economic Studies 81, 608–650.

Belloni, A., V. Chernozhukov, and Y. Wei (2013). Honest confidence regions for a regression parameter in logistic regression with a large number of controls. cemmap working paper CWP67/13, London.

Belloni, A., V. Chernozhukov, and Y. Wei (2016). Post-selection inference for generalized linear models with many controls. Journal of Business & Economic Statistics 34(4), 606–619.

Blom, A. G., M. Bosnjak, A. Cornilleau, A.-S. Cousteaux, M. Das, S. Douhou, and U. Krieger (2016). A comparison of four probability-based online and mixed-mode panels in Europe. Social Science Computer Review 34(1), 8–25.

Bosnjak, M., T. Dannwolf, T. Enderle, I. Schaurer, B. Struminskaya, A. Tanner, and K. W. Weyandt (2018). Establishing an open probability-based mixed-mode panel of the general population in Germany: The GESIS Panel. Social Science Computer Review 36(1), 103–115.

Breiman, L. (2001). Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical Science 16(3), 199–231.

Buskirk, T. D. (2018). Surveying the forests and sampling the trees: An overview of classification and regression trees and random forests with applications in survey research. Survey Practice 11(1), 1–13.

Buskirk, T. D., A. Kirchner, A. Eck, and C. S. Signorino (2018). An introduction to machine learning methods for survey researchers. Survey Practice 11(1), 1–10.

Buskirk, T. D. and S. Kolenikov (2015). Finding respondents in the forest: A comparison of logistic regression and random forest models for response propensity weighting and stratification. Survey Methods: Insights from the Field, 1–17.

Callegaro, M. (2013). Paradata in web surveys. In F. Kreuter (Ed.), Improving surveys with paradata: Analytic uses of process information, Chapter 11, pp. 261–279. Wiley Online Library.

Chernozhukov, V., D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. Robins (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal 21(1), C1–C68.

Chernozhukov, V., C. Hansen, and M. Spindler (2015). Valid post-selection and post-regularization inference: An elementary, general approach. Annual Review of Economics 7(1), 649–688.

Durrant, G. B. and F. Steele (2009). Multilevel modelling of refusal and non-contact in household surveys: evidence from six UK government surveys. Journal of the Royal Statistical Society: Series A (Statistics in Society) 172(2), 361–381.

Earp, M., M. Mitchell, J. McCarthy, and F. Kreuter (2014). Modeling nonresponse in establishment surveys: Using an ensemble tree model to create nonresponse propensity scores and detect potential bias in an agricultural survey. Journal of Official Statistics 30(4), 701–719.

Eck, A. (2018). Neural networks for survey researchers. Survey Practice 11(1), 1–10.

Feddersen, J., R. Metcalfe, and M. Wooden (2016). Subjective wellbeing: why weather matters. Journal of the Royal Statistical Society: Series A (Statistics in Society) 179(1), 203–228.

GESIS (2020). GESIS Panel - Standard Edition. GESIS Datenarchiv, Köln. ZA5665 Datenfile Version 37.0.0, https://doi.org/10.4232/1.13573.

Japec, L., F. Kreuter, M. Berg, P. Biemer, P. Decker, C. Lampe, J. Lane, C. O'Neil, and A. Usher (2015). Big Data in Survey Research: AAPOR Task Force Report. Public Opinion Quarterly 79(4), 839–880.

Kern, C., T. Klausch, and F. Kreuter (2019). Tree-based machine learning methods for survey research. Survey Research Methods 13(1), 73–93.

Kern, C., B. Weiss, and J.-P. Kolb (2019). A longitudinal framework for predicting nonresponse in panel surveys. Papers 1909.13361, arXiv.org.

Kirchner, A. and C. S. Signorino (2018). Using support vector machines for survey research. Survey Practice 11(1), 1–11.

Kreuter, F. (2013a). Facing the nonresponse challenge. The ANNALS of the American Academy of Political and Social Science 645(1), 23–35.

Kreuter, F. (2013b). Improving surveys with paradata: Analytic uses of process information, Volume 581. John Wiley & Sons.

Liu, M. (2020). Using machine learning models to predict attrition in a survey panel. Big Data Meets Survey Science: A Collection of Innovative Methods, 415–433.

Lynn, P. (2017). From standardised to targeted survey procedures for tackling non-response and attrition. Survey Research Methods 11(1), 93–103.

Lynn, P. (2020). Methods for recruitment and retention. Understanding Society Working Paper 2020-07.

Murphy, J., M. W. Link, J. H. Childs, C. L. Tesfaye, E. Dean, M. Stern, J. Pasek, J. Cohen, M. Callegaro, and P. Harwood (2014). Social Media in Public Opinion Research: Executive Summary of the AAPOR Task Force on Emerging Technologies in Public Opinion Research. Public Opinion Quarterly 78(4), 788–794.

Phipps, P., D. Toth, et al. (2012). Analyzing establishment nonresponse using an interpretable regression tree model with linked administrative data. The Annals of Applied Statistics 6(2), 772–794.

Plewis, I., L. Calderwood, and T. Mostafa (2017). Can interviewer observations of the interview predict future response? Methods, data, analyses 11(1).

Roßmann, J. and T. Gummer (2016). Using paradata to predict and correct for panel attrition. Social Science Computer Review 34(3), 312–332.

Sakshaug, J. W. and M. Huber (2015). An evaluation of panel nonresponse and linkage consent bias in a survey of employees in Germany. Journal of Survey Statistics and Methodology 4(1), 71–93.

Shmueli, G. (2010). To explain or to predict? Statistical Science 25(3), 289–310.

Signorino, C. S. and A. Kirchner (2018). Using lasso to model interactions and nonlinearities in survey data. Survey Practice 11(1), 1–10.

Sinibaldi, J. and S. Eckman (2015). Using call-level interviewer observations to improve response propensity models. Public Opinion Quarterly 79(4), 976–993.

Struminskaya, B., P. Lugtig, F. Keusch, and J. K. Höhne (2020). Augmenting surveys with data from sensors and apps: Opportunities and challenges. Social Science Computer Review, 1–13.

Zinn, S. and T. Gnambs (2020). Analyzing nonresponse in longitudinal surveys using Bayesian additive regression trees: A nonparametric event history analysis.