A Pipeline for Variable Selection and False Discovery Rate Control With an Application in Labor Economics

Abstract

We introduce tools for controlled variable selection to economists. In particular, we apply a recently introduced aggregation scheme for false discovery rate (FDR) control to German administrative data to determine the parts of the individual employment histories that are relevant for the career outcomes of women. Our results suggest that career outcomes can be predicted based on a small set of variables, such as daily earnings, wage increases in combination with a high level of education, employment status, and working experience.

Sophie-Charlotte Klose ∗ Johannes Lederer † June 24, 2020

Keywords: Variable selection, aggregated FDR control, knockoff-filter, inference, penalized logistic regression, female employment careers

∗ Department of Management, University of Duisburg-Essen, Germany, e-mail: [email protected].
† Department of Mathematics, Ruhr-University Bochum, Germany, e-mail: [email protected]. The authors sincerely thank Fang Xie for generously sharing the coding material used for implementing the aggregated FDR control into the knockoff-filter.

Introduction

Titles like “Economics in the age of big data” [EL14], “Big data: new tricks for econometrics” [Var14] and “Beyond prediction: using big data for policy problems” [Ath17] highlight a shift in the scale of economic data towards Big Data. The term Big Data refers to settings where we observe information on a large number of units, or many pieces of information on each unit, or both, often in a complex setting with several cross-sections per unit. Nowadays, empirical research in economics increasingly relies on newly available large-scale administrative data sets, which offer new possibilities for fitting models.

In particular, Big Data applications bring the opportunity to fit high-resolution models. Complex models arise naturally in Big Data, where the total number of measurements is large. In the extreme case, fitting sophisticated models can mean dealing with parameter spaces whose dimension is comparable to the number of samples ($p \approx n$) or even larger ($p \gg n$). Parameter spaces that are this high-dimensional cannot be examined with standard estimation methods and therefore call for tools from high-dimensional statistics. High-dimensional statistics comes into play whenever one fits complex models ($p$ is large). In high-dimensional settings the number of variables is substantial, but there is often a sense that many of the variables are of minor importance or completely irrelevant. The basic concept is to focus on the most relevant parts of the parameter space, which are identified by leveraging prior information. High-dimensional statistics complements classical estimators with so-called penalty or prior functions that formulate mathematically how “likely” or “promising” certain models are. Data-driven tuning parameters weight the prior function and thus determine the degree of regularization, that is, the degree to which the bare measurements are supplemented with additional information.

The notion most commonly imposed on the prior functions in high-dimensional statistics is sparsity. Sparsity means that the data generating process can be modelled accurately by using only a small number of variables even though the actual number of variables at hand is large.
Probably the most discussed and prominent sparsity-inducing penalty function, as in the Lasso literature [Tib96], is the $\ell_1$-prior, which we also focus on. If the data generating process can be approximated by a sparse model, it makes sense to speak of variable selection and false discovery rate (FDR) control [BH95]. Variable selection means that we are interested in teasing apart the “relevant variables” (variables with non-zero-valued entries in the parameter vector) from the “irrelevant variables” (variables with zero-valued entries in the parameter vector). FDR control means that we want to ensure that the expected fraction of false discoveries among all discoveries is not too high, so that most of the selected variables are indeed relevant.

Roughly speaking, sparsity is the main ingredient of variable selection, since we are often interested in finding a few important covariates, that is, in separating the relevant variables from those that contribute little to understanding the dependence between the response variable and the few truly relevant covariates.

The final goal is to draw inferences about the estimates $\hat\beta$ of the selected variables. In most cases, it is impossible to recover the true subset $S$ of variables with no error. Hence, we are naturally interested in procedures that keep the resulting error small. This error is typically measured in terms of false positives and false negatives, also known as type I and type II errors, respectively. False positives are falsely selected variables, and false negatives are falsely omitted variables. A suitable measure of an estimator's variable selection accuracy is, for example, the Hamming distance, the sum of false positives and false negatives. The smaller this number, the better the estimator's performance.

How does variable selection and FDR control work?
In classic hypothesis testing we are commonly interested in controlling the Type I error at a certain significance level $\alpha$ by individually testing $p$ null hypotheses of the form $H_{0,j}: \beta^*_j = 0$ for $j \in \{1, \dots, p\}$. Extending Type I error control to multiple testing, we are able to control the FDR. The FDR is the expected ratio of the number of false discoveries to the total number of discoveries. However, accurate hypothesis testing also requires that the (statistical) power, the ratio of the number of correctly selected hypotheses to the total number of true hypotheses, is large. The power is typically measured in terms of true positives. True positives are correctly selected variables. Accordingly, the number of true positives is $\mathrm{tp} := |\{ j \in \{1, \dots, p\} : \hat\beta_j \neq 0 \text{ and } \beta^*_j \neq 0 \}|$.

Analogously, the Hamming distance is defined as $\mathrm{hd} := \mathrm{fp} + \mathrm{fn}$, where $\mathrm{fp} := |\{ j \in \{1, \dots, p\} : \hat\beta_j \neq 0 \text{ and } \beta^*_j = 0 \}|$ and $\mathrm{fn} := |\{ j \in \{1, \dots, p\} : \hat\beta_j = 0 \text{ and } \beta^*_j \neq 0 \}|$.

Therefore, in addition to the FDR, we also control the aggregated false discovery rate (AFDR) recently proposed by [XL19]. It maximizes the power while simultaneously guaranteeing FDR control. This aggregation scheme is an improvement on the original FDR and has the advantage of having the […] $X$, in a way that allows for accurate (A)FDR control. The knockoff filter can be adapted for a broad class of models and for a variety of test statistics [CFJL16, RSC19]. Moreover, the knockoff procedure works under any fixed design matrix $X$ and does not require the strong assumptions on the covariates that most variable selection techniques in high-dimensional statistics need [ZY06]. In fact, high-dimensional tools often have to limit the correlations between the covariates. Typically, the irrepresentability condition is also required, which limits the correlations between the “relevant” and “irrelevant” variables.

Why is variable selection and (A)FDR control of interest for economists?
First of all, variable selection and (A)FDR control answer a simple but very fruitful question: which variables does an outcome of interest depend upon? Economists often want to know which demographic or socioeconomic variables affect future economic outcomes such as income. Answering this question is not easy, as economic data sources become more and more detailed and provide a flood of potential explanatory variables, even though the outcome of interest is known to depend on only a small fraction of them. Especially in administrative data, the main workhorse in labor economics, the longitudinal nature of the data produces a deluge of explanatory variables that possibly interact in many different ways. So far, many economic papers have shied away from including many explanatory variables. They implicitly assumed that the data generating process is characterized by only a few variables that matter for the outcome of interest. Important covariates are included based on economic reasoning. As a consequence, relevant explanatory variables that enter the model with a complex and very flexible functional form are usually not accounted for. Even though economists may believe that an economic outcome depends on a small set of variables, they have a priori little or no clue about which ones are relevant. Therefore, modern high-dimensional tools aimed at controlled variable selection can help economists to tease apart the relevant from the irrelevant variables and to achieve valid inference.

So far, in empirical research, economists limit the number of explanatory variables by hand rather than choosing them in a data-driven manner. Only for prediction tasks do we already have many insights from the economic literature on data-driven techniques. In recent years, scholars have adopted machine learning (ML) tools in economics. Indeed, the most recent influential reviews of ML methods aimed at economists (for example [MS17, AI19]) introduce ML as a powerful tool to solve problems around prediction.
There have been successful applications of the prediction methodology to policy problems, where ML tools have been embedded in the context of economic decision-making [KLMO15, KLL+ […] $\ell_1$-penalized logistic regression. To achieve valid inference in our setting, we apply the MX knockoffs to a data set from the Sample of Integrated Employment Biographies (SIAB). Correct inference in our controlled variable selection problem means that we effectively control the (A)FDR even in logistic regression with a large set of covariates. Thus, we can be reasonably confident that the MX knockoff procedure correctly teases apart important from irrelevant variables while maximizing power and guaranteeing FDR control.

It is well known from the economic literature that a higher educational level is associated with better career outcomes in terms of higher earnings. There is a large economic literature showing the positive labor market returns of high educational attainment (for example [BHJ19, Car99, HHV18]). There is also some indication in the literature that occupational choice is related to career outcomes [AA70], but there seems to be no extensive analysis of which of the several hundred explanatory variables from individual employment and wage histories are truly associated with future professional careers of women.

Our results suggest, first, that individual employment histories have predictive power with regard to professional careers and, second, that only a relatively small subset of the information gained from employment records is crucial.

The remainder of this paper is organized as follows. Section 2 introduces the model set-up and the methodology. Section 3 describes the data, after which Section 4 presents the main empirical results. The final section concludes the paper.

Methodology

In our empirical application, we use a large-scale model that links a large set of potential explanatory variables to the response variable in a nonlinear fashion. We want to discover which variables from individual employment histories are truly associated with future professional careers of women. The data consist of vector-valued samples, each of them describing the employment and wage situation of a woman over the past five years. The design matrix $X \in \mathbb{R}^{79\,782 \times p}$ is preprocessed and scaled such that it follows a multivariate normal distribution. The outcome variable $y \in \mathbb{R}^{79\,782}$ is an indicator that takes the value 1 if the woman makes a professional career in 2010, and 0 otherwise.

We consider data in the form of a real-valued deterministic design matrix $X \in \mathbb{R}^{n \times p}$ and a binary response $y \in \mathbb{R}^n$. We denote the rows of $X$ by $x_1, \dots, x_n \in \mathbb{R}^p$ and the columns of $X$ by $x^1, \dots, x^p \in \mathbb{R}^n$. The design matrix $X$ is assumed to be scaled such that $(X^\top X)_{jj} = n$ for $j \in \{1, \dots, p\}$.

We define the vector of residuals as $u = (u_1, \dots, u_n)^\top$ with entries $u_i = y_i - P(y_i = 1 \mid x_i)$ for $i \in [n]$. The vector $u$ is random noise with mean zero, i.e. $E(y_i - P(y_i = 1 \mid x_i)) = 0$. Further, we assume $E(u_i \mid x_i) = 0$ and $E(u_i u_j) = 0$ for all $i \neq j$, i.e. our models contain only exogenous variables and the error terms are assumed to be independent and identically distributed (i.i.d.).

The design matrix $X$ and the response vector $y$ are linked by the standard logistic regression model

$$P(y_i = 1 \mid x_i) = \frac{\exp(x_i^\top \beta^*)}{1 + \exp(x_i^\top \beta^*)} \qquad (i = 1, \dots, n), \tag{1}$$

where $\beta^* \in \mathbb{R}^p$ is the unknown regression vector.

We assume that our data generating process is sparse, that is, only a small and a priori unknown number of predictors is relevant for predicting the careers of women ($|\{ j : \beta^*_j \neq 0 \}| \ll \min\{n, p\}$). Sparsity can be motivated on economic grounds in situations where a researcher believes that the economic outcome can be modeled accurately by using only a small number of variables (relative to the sample size) but is unsure about the identity of the relevant variables. Traditionally, economists in the empirical literature assumed that the model of interest is characterized by a small number of variables, and they limited the number of explanatory variables by hand rather than choosing them in a data-driven manner. However, recent advances in high-dimensional modeling have prompted economists to take advantage of the big data revolution and to deal with large-dimensional data sets in a way that maintains the interpretability of economic models by searching for the truly important variables (for example [BCH11, FLQ11]).
The sparsity assumption imposed there allows the effective use of a large set of covariates while at the same time maintaining the spirit of parsimonious models in the economic discipline. For this reason, we follow this strand of literature on the sparse modeling of economic processes.

We use penalized techniques to solve the logistic regression and choose the $\ell_1$ penalty throughout our estimations. The $\ell_1$ penalty is suitable for sparsity because it forms a diamond-shaped constraint region for the parameter vector, and the contours of the loss function are likely to touch the constraint region at one of its corners, such that certain coordinates of the parameter vector are set to zero. Moreover, the $\ell_1$ penalty is a convex function, which makes it convenient to optimize over.

The goal is to estimate the support set $S = \operatorname{supp}(\beta^*)$ for the model in eq. (1) with the family of regularized estimators

$$\hat\beta[r, X, y] \in \arg\min_{\beta^* \in \mathbb{R}^p} \{ \mathcal{L}[\beta^*, X, y] + r\, h[\beta^*] \}, \tag{2}$$

where $\mathcal{L}[\beta^*, X, y] = \sum_{i=1}^n (\log(1 + \exp(x_i^\top \beta^*)) - y_i x_i^\top \beta^*)/n$ is the negative log-likelihood function, $r \in [0, \infty)$ is a tuning parameter and $h : \mathbb{R}^p \to [0, \infty]$ is a prior function, which we specify as the sparsity-inducing prior function $h[\beta^*] := \|\beta^*\|_1$, where $|\cdot|$ denotes the absolute value and $\|\beta^*\|_1 := \sum_{j=1}^p |\beta^*_j|$ is the $\ell_1$-prior.

Next, to avoid an unwanted overall shrinkage of the estimates imposed by the $\ell_1$ penalty, we refit the regularized estimators $\hat\beta$ with a subsequent logistic regression estimation on the support $\hat S := \operatorname{supp}(\hat\beta)$:

$$\hat\beta_{\mathrm{refit}}[X_{\hat S}, y] \in \arg\min_{\beta^* \in \mathbb{R}^p,\ \operatorname{supp}[\beta^*] \subset \hat S} \mathcal{L}[\beta^*, X_{\hat S}, y]. \tag{3}$$

In this paper, we focus on controlling the aggregated false discovery rate (AFDR) [XL19], which we define as follows. Let $S^* := \{ j \in \{1, \dots, p\} : \beta^*_j \neq 0 \}$ be the true support set and $\hat S_q[[X, y], k]$ an estimate of $S^*$ operating on data $[X, y]$. Then, the AFDR at target level $q = \sum_{i=1}^k q_i \in [0, 1]$ satisfies

$$\mathrm{AFDR} := E\left[ \frac{|\hat S_q[[X, y], k] \setminus S^*|}{|\hat S_q[[X, y], k]| \vee 1} \right] \le q \quad \text{with} \quad \hat S_q[[X, y], k] := \bigcup_{i=1}^k \hat S_{q_i}[X, y], \tag{4}$$

where $|\cdot|$ denotes the cardinality of a set. The AFDR scheme amounts to applying the FDR control method [BH95] $k$ times with specific target levels $q_1, \dots, q_k$ and combining the results by taking the union. Consequently, for $k = 1$, we recover the original FDR control scheme. Because the AFDR scheme is equipped with the same guarantees as the original FDR control method (for the proof see [XL19]), the same procedures that are used to control the FDR can be applied here.

Next, we give a quick introduction to the model-X knockoff filter, which we use in our empirical application to control the AFDR. The model-X (MX) knockoff filter is a method for high-dimensional controlled variable selection in any class of generalized linear models (GLM) [CFJL16]. It extends the knockoff procedure that was originally designed for controlling the FDR in low-dimensional linear models [BC15]. The key ingredient of the knockoff filter is the set of generated knockoff copies $\tilde X \in \mathbb{R}^{n \times p}$ for the design matrix $X$, which mimic the correlation structure between the variables and therefore serve as a control group for them to ensure that not too many irrelevant variables are selected. The final goal is to perform FDR control on specific statistics based on both $X$ and $\tilde X$.

Denote $X = (x_1, \dots, x_n)^\top$ and $\tilde X = (\tilde x_1, \dots, \tilde x_n)^\top$. In this paper, we generate MX knockoffs $\tilde X$ from a multivariate normal distribution obeying

$$\tilde X \mid X \overset{d}{\sim} \mathcal{N}_p(\mu, V), \tag{5}$$

where $X \overset{d}{\sim} \mathcal{N}_p(0, \Sigma)$ with $\Sigma \in \mathbb{R}^{p \times p}$ being positive definite, and $\mu$ and $V$ satisfy

$$\mu := X - X \Sigma^{-1} \operatorname{diag}\{s\}, \tag{6}$$

$$V := 2 \operatorname{diag}\{s\} - \operatorname{diag}\{s\} \Sigma^{-1} \operatorname{diag}\{s\}, \tag{7}$$

with $s \in \mathbb{R}^p$ making $V$ positive definite.
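As an illustration of the construction in (5) to (7), the following is a minimal NumPy sketch, not the code used for the results in this paper. It assumes that the rows of $X$ are draws from $\mathcal{N}_p(0, \Sigma)$ with a known covariance matrix $\Sigma$. The function name `sample_gaussian_knockoffs`, the equicorrelated choice of $s$, and the shrinkage factor 0.99 are hypothetical choices made only for this example.

```python
import numpy as np


def sample_gaussian_knockoffs(X, Sigma, rng):
    """Draw Gaussian model-X knockoffs for rows X_i ~ N(0, Sigma), cf. eqs. (5)-(7)."""
    n, p = X.shape
    # Assumption: equicorrelated s, i.e. one common value for all coordinates.
    # s < 2 * lambda_min(Sigma) keeps V positive definite; 0.99 adds a safety margin.
    lam_min = np.linalg.eigvalsh(Sigma)[0]  # eigenvalues come in ascending order
    s = np.full(p, 0.99 * min(2.0 * lam_min, np.min(np.diag(Sigma))))
    D = np.diag(s)
    Sigma_inv_D = np.linalg.solve(Sigma, D)  # Sigma^{-1} diag{s}
    mu = X - X @ Sigma_inv_D                 # eq. (6)
    V = 2.0 * D - D @ Sigma_inv_D            # eq. (7)
    V = (V + V.T) / 2.0                      # remove round-off asymmetry
    L = np.linalg.cholesky(V)                # V = L L^T
    # Each row of the Gaussian noise times L^T has covariance V.
    return mu + rng.standard_normal((n, p)) @ L.T
```

By construction, the knockoffs have the same covariance $\Sigma$ as the original variables, and the cross-covariance between $X$ and $\tilde X$ equals $\Sigma - \operatorname{diag}\{s\}$. The choice of $s$ above is only one option; any $s$ that keeps $V$ positive definite fits the construction in (5) to (7).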
This way of constructing knockoffs is the most common in the literature (for example [BCS19, CFJL16, BC15, XL19]).

We consider the following penalized estimator for logistic regression, which is augmented by the knockoffs:

$$\hat\beta[r, X, \tilde X, y] \in \arg\min_{\beta^* \in \mathbb{R}^{2p}} \{ \mathcal{L}[\beta^*, [X\ \tilde X], y] + r\, h[\beta^*] \}, \tag{8}$$

where $\mathcal{L}[\beta^*, [X\ \tilde X], y] = \sum_{i=1}^n (\log(1 + \exp([x\ \tilde x]_i^\top \beta^*)) - y_i [x\ \tilde x]_i^\top \beta^*)/n$ is the negative log-likelihood function, $[X\ \tilde X] \in \mathbb{R}^{n \times 2p}$ is an augmented matrix, $r \in [0, \infty)$ is a tuning parameter, and $h : \mathbb{R}^{2p} \to [0, \infty]$ is a prior function, which we specify as the sparsity-inducing prior function $h[\beta^*] := \sum_{j=1}^{2p} |\beta^*_j|$.

We use cross-validation (CV) and a recently introduced calibration scheme [LL19] for $\ell_1$-penalized logistic regression to calibrate the tuning parameter in equation (8). The latter calibration scheme is based on simple tests along the tuning parameter path and is equipped with finite-sample guarantees for feature selection.

We consider the Lasso Signed Max (LSM) [BC15] and the Lasso coefficient-difference (LCD) [CFJL16] as knockoff statistics. The LSM statistic is based on the largest value of the tuning parameter at which each variable enters the model in (8). The LCD statistic is the difference between the absolute values of the Lasso coefficients of the original variables and of their knockoff copies. Denote the quantities underlying the LSM and LCD statistics by $(Z_1, \dots, Z_p, \tilde Z_1, \dots, \tilde Z_p)$, that is, for $j \in \{1, \dots, p\}$,

$$Z_j[X, y] := \sup\{ r : \hat\beta_j[r, X, \tilde X, y] \neq 0 \}, \qquad \tilde Z_j[X, y] := \sup\{ r : \hat\beta_{p+j}[r, X, \tilde X, y] \neq 0 \} \qquad \text{(LSM)},$$

$$Z_j[X, y] := |\hat\beta_j[r, X, \tilde X, y]|, \qquad \tilde Z_j[X, y] := |\hat\beta_{p+j}[r, X, \tilde X, y]| \qquad \text{(LCD)}.$$

Then the LSM and LCD statistic vectors $W := (W_1, \dots, W_p)^\top$ are defined by

$$W_j[X, y] := \max\{ Z_j, \tilde Z_j \} \cdot \operatorname{sign}(Z_j - \tilde Z_j) \quad \text{(LSM)} \qquad \text{and} \qquad W_j[X, y] := Z_j[X, y] - \tilde Z_j[X, y] \quad \text{(LCD)},$$

for $j \in \{1, \dots, p\}$.

Then, the data-dependent thresholds of the knockoff and knockoff+ filter methods (two types of knockoff procedures) for a given FDR target $q \in [0, 1]$ are defined by

$$T_q[X, y] := \min\left\{ t \in \mathcal{W} : \frac{|\{ j : W_j[X, y] \le -t \}|}{|\{ j : W_j[X, y] \ge t \}|} \le q \right\} \qquad \text{(knockoff)},$$

$$T^+_q[X, y] := \min\left\{ t \in \mathcal{W} : \frac{1 + |\{ j : W_j[X, y] \le -t \}|}{|\{ j : W_j[X, y] \ge t \}| \vee 1} \le q \right\} \qquad \text{(knockoff+)},$$

where, as in [BC15], $\mathcal{W} := \{ |W_j| : j = 1, \dots, p \} \setminus \{0\}$. Thus, the corresponding estimated supports are defined as

$$\hat S_q[X, y] := \{ j : W_j[X, y] \ge T_q[X, y] \}, \qquad \hat S^+_q[X, y] := \{ j : W_j[X, y] \ge T^+_q[X, y] \}.$$

The estimated support $\hat S^+_q[X, y]$ obtained by the knockoff+ procedure satisfies the inequality in (4), whereas $\hat S_q[X, y]$ satisfies this theoretical bound for an approximate FDR that is less than or equal to the FDR.

We use the (A)FDR control methods for variable selection in our high-dimensional empirical application. To apply (A)FDR control to our data, we use the knockoff filter for logistic regression. In line with the recommendations in [XL19], we simulate $k = 3$ independent knockoffs $\tilde X^{(1)}, \tilde X^{(2)}, \tilde X^{(3)}$ from a Gaussian distribution for the AFDR control method. As target FDR, we choose q = 0.[…]

Data

Our analysis draws on employment records from German administrative data. In particular, we use the Sample of Integrated Employment Biographies (SIAB). It is provided by the Institut für Arbeitsmarkt- und Berufsforschung (IAB) in Nuremberg and is a 2 % random sample from the population of Integrated Employment Biographies (IEB). The data originate from the registration procedure for social insurance. The SIAB has been used in a number of studies on career and wage outcomes of women (for example [ADS17, EK13]).

The sample of the IEB considered here covers all persons in Germany who contributed to the social security system between 1975 and 2014, received transfer payments from the labor agency, or were registered job seekers. Civil servants, including teachers, and the self-employed are not included in the sample. The sample provides detailed daily information on, for example, earnings, occupations, employment, transitions in and out of work, periods of unemployment, unemployment and welfare benefits as well as basic demographics. An important advantage of the data set is that, because of the large sample size and the longitudinal nature of the administrative data, employment and wage histories are measured precisely.

Empirical data preparation

Our goal is to discover which variables from employment records and basic demographics are truly associated with professional careers of women. For methodological reasons, we concentrate on one particular year: 2010. As a robustness check, we do the same analysis for two further years: 2009 and 2011. We focus on West Germany and restrict our sample to women of the birth cohorts between 1965 and 1975. The birth cohort restriction corresponds to the age groups between 35 and 45 years. Before age 35 many women are still climbing the career ladder and have not yet reached their final top salary, while few women reach their career peak at an age older than 45. Because we do not predict when women will reach their professional career for the first time, we also leave women who already had a professional career before 2010 in the sample. Finally, we drop those women who do not have any employment, […] N = 1 567 women with a career in 2010 and N = 78 215 women with no career in 2010.

The data basis of this project is the Scientific Use File (SUF) of the SIAB (version SIAB-Regionalfile 1975 - 2014 [GSBW]). The data was accessed via a guest stay at the Research Data Centre (FDZ) of the German Federal Employment Agency at the Institute for Employment Research (IAB) and then via controlled data processing at the FDZ. The proposed FDR pipeline requires the design matrix $X$ to have independent and identically distributed rows.

To discover which variables from employment records and basic demographics affect future professional careers in 2010, we generate a plethora of separate variables from our SIAB data to capture the available information on the wage and (un-)employment histories in a very detailed and flexible way. In particular, the longitudinal nature of the administrative employment data allows us to capture complex information on individual employment and earning histories (for example types of employment, wages, skill level) for the entire elapsed working life, possibly interacting in many different ways.
To accurately record time-varying information of the elapsed (un-)employment history, we consider five lags for each of the baseline variables. For time-constant information (for example education) we only consider one lag. Consequently, we use information from data spells until 2009. Employment records from 2010 cannot be used for our analysis because of endogeneity issues. Consider a woman making a professional career in November 2010. She has most likely found out about her upcoming promotion in contract negotiations with her employer a few months earlier, and her knowledge of the career peak may induce her to work less.

We consider 5 sociodemographic and 107 (un-)employment and wage history baseline predictor variables. The reason for also including sociodemographic variables that do not explicitly account for the employment and wage histories is that they are strongly associated with professional career outcomes. For instance, highly educated women are more likely to achieve a professional career than low educated women. This means that the skill level might be a powerful predictor for a professional career.

Finally, to select the main effects that are truly associated with professional careers, it is important to allow for a wide range of plausible interactions between the baseline variables. Thus, we interact the lagged terms of the associated baseline predictors with each other in many different ways, so that we end up with 327 sociodemographic, employment and wage history variables. A full list of the included predictor variables can be found in the appendix. We only consider those interactions that make sense from an economic point of view and that do not lead to perfect multicollinearity, that is, $X^\top X$ is invertible. Moreover, our interacted time-varying variables come from the same time period because we want to avoid modeling complex dynamics between different predictors, which eases the interpretation later on.
We also interact time-constant variables with time-varying variables. Further, we assume that some of the predictor variables can be expressed as a linear combination of a number of basis functions of the original measured variables. Indeed, our two wage profile predictor variables can be expressed through the group of the wage variables. We also include non-linear effects for women's age and working experience by using second-order polynomials.

In our empirical application, we consider a large-scale design matrix $X \in \mathbb{R}^{79\,782 \times p}$. Thus, we consider a setting in which both the sample size and the feature size are large, but the feature size is much smaller than the sample size ($n, p \gg 1$, $p \ll n$). Although we do not consider the case where $n, p \gg 1$ and $p \approx n$ or $p \gg n$, in which high-dimensional statistics becomes indispensable, our case ($p$ is large) is within the typical scope of high-dimensional statistics.

The outcome variable $y_i$ is an indicator that takes the value 1 if the woman makes a professional career in 2010, and 0 otherwise. Making a professional career means, according to our definition, achieving a high wage level. A difficulty of the SIAB data is that the wage variable is censored from above to ensure that high earners cannot be identified. Therefore, we define the binary career outcome variable based on the social security contribution ceiling imposed on the SIAB data: whenever the wage variable coincides with the social security contribution ceiling, the career dummy takes the value 1, and 0 otherwise.

Descriptives

Table 4 in the appendix provides detailed descriptive statistics of all baseline variables included in the analysis. Among the 112 baseline predictors, 32 are continuous (for example age in years) and 80 are binary (for example part-time employed). The predictors collect information on age, education, non-German citizenship, employment, part-time employment, marginal employment, daily earnings, different wage profiles, change of establishment, wage growth, experience, unemployment benefits, welfare benefits, registered job search and occupations. Note that the social security contribution ceiling varies for each year and between West and East Germany.

Table 1: Descriptives

| Variable | Career in 2010: Mean | Std.Dev. | Min | Max | No career in 2010: Mean | Std.Dev. | Min | Max |
|---|---|---|---|---|---|---|---|---|
| age 1 | 39.766 | 3.054 | 34 | 44 | 39.579 | 3.098 | 34 | 44 |
| high educated 1 | 0.607 | 0.489 | 0 | 1 | 0.097 | 0.296 | 0 | 1 |
| middle educated 1 | 0.386 | 0.487 | 0 | 1 | 0.792 | 0.405 | 0 | 1 |
| low educated 1 | 0.005 | 0.071 | 0 | 1 | 0.097 | 0.296 | 0 | 1 |
| non-German citizenship 1 | 0.035 | 0.185 | 0 | 1 | 0.046 | 0.210 | 0 | 1 |
| career 1 | 0.743 | 0.437 | 0 | 1 | 0.004 | 0.062 | 0 | 1 |
| employed 1 | 0.982 | 0.133 | 0 | 1 | 0.868 | 0.339 | 0 | 1 |
| full-time employed 1 | 0.919 | 0.272 | 0 | 1 | 0.390 | 0.485 | 0 | 1 |
| part-time employed 1 | 0.063 | 0.242 | 0 | 1 | 0.297 | 0.457 | 0 | 1 |
| marginally employed 1 | 0.000 | 0.000 | 0 | 0 | 0.181 | 0.385 | 0 | 1 |
| daily wage 1 (missing if non-employed) | 163.283 | 27.507 | 13.041 | 177 | 50.080 | 39.624 | 0.029 | 177 |
| daily wage 1 (zero if non-employed) | 160.262 | 35.032 | 0 | 177 | 43.453 | 40.624 | 0 | 177 |
| experience 1 | 15.619 | 5.040 | 0 | 25 | 16.387 | 5.857 | 0 | 25 |

Source: Own calculations based on data of the SIAB.

Notes: N = 1 567 women with a career in 2010 and N = 78 215 women with no career in 2010; N = 79 782 in total.

Table 1 provides the descriptive statistics of some key baseline variables. The table shows the descriptives separately for the women with and without a professional career in 2010. The summary statistics show that the majority of women with no career in 2010 have vocational training (79%), around 10% are tertiary educated, and 10% have no postsecondary degree. In contrast, most of the women with a career in 2010 are highly educated (61%), followed by women with middle education (37%) and low education (0.5%). Our sample of women has an average age of 40 years in 2009. The average employment rate is 87% for women with no career in 2010 and 98% for women with a career in 2010. Almost all women with a professional career in 2010 are full-time employed (92%), followed by part-time working women (6%). None of these women are marginally employed. On the other hand, only 39% of women with no career in 2010 are full-time employed, followed by part-time (30%) and marginal (18%) employment. In our sample, around 4% of the women with a career and 5% of the women without a career have non-German citizenship. The average daily wage among working women with a professional career in 2010 is around 163 EUR, whereas for working women without a professional career in 2010 it is only 50 EUR. Making a career is highly persistent, since 74% of women with a professional career in 2010 also had a career in 2009. Among women without a career in 2010, this share is only 0.4%.

Results

How important is a woman's employment history for her professional career? To approach this question, we establish an FDR pipeline based on [BC15, CFJL16, XL19] to identify, among a plethora of predictors derived from employment records, those that are most predictive of a successful career.

In particular, we want to increase the estimation accuracy by effectively identifying the important predictor variables and to improve the model interpretability. For this purpose, we take into account a large set of variables derived from employment records and apply a new aggregation scheme [XL19] for controlling the FDR to the data of the SIAB. We use this FDR approach beyond multiple testing for controlled variable selection as introduced in [BC15] and [CFJL16]. To achieve correct inference, we apply the knockoff-filter to the aggregation scheme, which solves the controlled variable selection problem in such generality and for such a wide variety of test statistics that it is attractive to introduce it in the field of economics.

We assume that our data generating process can be modeled precisely by using only a small but a priori unknown number of predictor variables. Therefore, we include all predictors to search for the truly important ones with the FDR pipeline.
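The selection step of this pipeline can be sketched in a few lines of plain Python. The sketch below is an illustration under stated assumptions, not the implementation behind the reported results. It takes vectors of knockoff statistics $W$ as given, searches for the threshold over the nonzero values $|W_j|$ (the convention of [BC15], assumed here), applies the knockoff+ rule, and aggregates $k$ runs by taking the union of the selected sets as in (4). All function names are hypothetical, and variable indices are zero-based.

```python
def knockoff_plus_threshold(W, q):
    """Smallest t among the nonzero |W_j| with
    (1 + #{j: W_j <= -t}) / max(#{j: W_j >= t}, 1) <= q."""
    candidates = sorted({abs(w) for w in W if w != 0})
    for t in candidates:
        negatives = sum(1 for w in W if w <= -t)
        positives = sum(1 for w in W if w >= t)
        if (1 + negatives) / max(positives, 1) <= q:
            return t
    return float("inf")  # no feasible threshold, so nothing is selected


def knockoff_plus_select(W, q):
    """Indices j with W_j at or above the knockoff+ threshold."""
    t = knockoff_plus_threshold(W, q)
    return {j for j, w in enumerate(W) if w >= t}


def aggregated_select(W_list, q_list):
    """Union of k knockoff+ selections, run i at its own target level q_i.
    The overall target level is q = sum(q_list)."""
    selected = set()
    for W, q_i in zip(W_list, q_list):
        selected |= knockoff_plus_select(W, q_i)
    return selected
```

In this sketch, each element of `W_list` would come from one independent draw of knockoffs, so that $k = 3$ draws with levels $q_1, q_2, q_3$ correspond to an overall target level $q = q_1 + q_2 + q_3$.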

Our main findings suggest that our data generating process can indeed be described by a sparse representation, since it improves the prediction performance and model interpretability. In particular, the (A)FDR methods show an improvement in terms of model interpretability and estimation accuracy compared to conventional variable selection methods such as the LASSO. In sum, we find that the employment status, working experience, daily wages, and wage increases in combination with a high level of education are truly associated with professional careers.
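Prediction performance of a selected support can be assessed by cross-validation with refit: on each training fold an unpenalized logistic regression is refitted on the selected columns only, and the negative log-likelihood is evaluated on the held-out fold. The sketch below is a minimal illustration under stated assumptions (NumPy, a hand-written Newton solver, random fold assignment, and a per-observation normalization of the error); it is not the authors' implementation, and all function names are hypothetical.

```python
import numpy as np


def fit_logistic(X, y, n_iter=50, ridge=1e-8):
    """Unpenalized logistic regression with intercept, fitted by Newton's method."""
    Z = np.column_stack([np.ones(len(y)), X])
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Z @ beta)))
        w = p * (1.0 - p)
        grad = Z.T @ (y - p)
        # A tiny ridge term keeps the Hessian invertible (an assumption for stability).
        hess = (Z * w[:, None]).T @ Z + ridge * np.eye(Z.shape[1])
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < 1e-10:
            break
    return beta


def neg_log_lik(X, y, beta):
    """Negative Bernoulli log-likelihood, summed over observations."""
    eta = np.column_stack([np.ones(len(y)), X]) @ beta
    return float(np.sum(np.logaddexp(0.0, eta) - y * eta))


def cv_refit_error(X, y, support, n_folds=10, seed=0):
    """Held-out negative log-likelihood after refitting on the selected columns."""
    cols = np.asarray(support, dtype=int)
    Xs = X[:, cols]
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), n_folds)
    total = 0.0
    for j in range(n_folds):
        val = folds[j]
        train = np.concatenate([folds[k] for k in range(n_folds) if k != j])
        beta = fit_logistic(Xs[train], y[train])       # refit on training fold
        total += neg_log_lik(Xs[val], y[val], beta)    # evaluate on validation fold
    return total / len(y)  # per-observation scaling is an assumption
```

Passing an empty support yields the intercept-only benchmark, and passing all column indices yields the full model, which mirrors the two extreme reference models used for comparison.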

Our objective is to establish the FDR pipeline for variable selection in modern big data applications in economics. In doing so, we apply the proposed pipeline to labor market data and identify those variables from individual employment histories that are truly important for professional careers. Since there are no ground truths available for our application, that is, the true support $S := \operatorname{supp}[\beta]$ is unverifiable, we cannot measure variable selection accuracy directly. However, the ground truth is almost never available in empirical research. Therefore, we infer the methods' performance from the number of selected variables and the prediction accuracy. We apply the different methods as described in the methodology section. In addition, we estimate conventional logistic regression with the full set of variables and without any variable (only intercept). These two latter estimations mark the two extreme situations in which all variables or no variable is relevant for predicting professional careers.

For ease of interpretation, we typically seek methods that yield a sparse model with a small number of variables. Moreover, we search for models with small prediction errors (good fit to the data). The model size is crucial here since our primary goal is variable selection. In particular, we are interested in the variable selection and prediction performances of the $\ell_1$-regularized logistic regression with 10-fold cross-validation refit. The prediction error of the 10-fold CV with refit is defined as

$$\text{pred. error} := \sum_{j=1}^{n_v} L\bigl[\, y_{V_j},\; X^{\hat{S}}_{V_j}\, \hat{\beta}_{\text{refit}}[\, y_{T_j}, X^{\hat{S}}_{T_j} ]\,\bigr],$$

where $L[\cdot]$ is the negative log-likelihood function, $X^{\hat{S}}$ is the restricted design matrix after screening the non-zero coordinates with the proposed variable selection methods, $\hat{\beta}_{\text{refit}}$ are the refitted estimators, and $T_j := \{1, \dots, n\} \setminus A_j$ and $V_j := A_j$, $j \in \{1, \dots, 10\}$, are our 10 training and validation sets, with $A_1, \dots, A_{10}$ having equal cardinality of $n/10$.

Table 2: Prediction performance. Our proposed method provides accurate prediction based on only 10 (19) predictors.

Method | Model size | Pred. error (10-fold CV, refit)
LASSO | 42 | 1.47
FDR LSM | 25 | 1.46
AFDR LSM | 19 | 1.46
FDR LCD CV | |
AFDR LCD CV | 10 | 1.43
FDR LCD Testing | – | –
AFDR LCD Testing | – | –
Full | 327 | 1.47
Empty | 0 | 7.71

[...] $\hat{S}$ should be a good approximation of the true support $S$. Below, we focus on the AFDR control methods for two main reasons. First, [XL19] have shown in simulations that the aggregated knockoff filter can simultaneously decrease the FDR and increase the power, while maintaining the original method's theoretical FDR guarantees. Second, we are interested in identifying a medium-sized model: a model with only a few predictor variables leaves too little room for interpretation, whereas a model that is too large is difficult to interpret and harbors the risk of noise variables. Our results in Table 2 show that the AFDR method selects a medium-sized model regardless of the knockoff statistics chosen.

Next, we provide the variables selected by the different applied methods. Table 5 in the appendix contains the results for the six (A)FDR control estimations and Table 8 in the appendix displays the results for the LASSO estimation. The upper panel of Table 5 displays the results for the (A)FDR control method based on the LSM statistic, whereas the two lower panels display the results based on the LCD statistic.

We observe that the variables that appear multiple times across methods are wage profile 1, employed$_i$, marginal employed, daily wage$_i$, strong positive wage growth, change of establishment, high educated × change of establishment, high educated × part-time employed$_j$, high educated × marginal employed$_l$, high educated × daily wage$_i$, and high educated × strong positive wage growth$_i$, with $i \in \{1, \dots, 5\}$ and with $j$ and $l$ each ranging over two lags. The selected variables form two main groups: predictors related to wages and predictors related to the employment status.
On the other hand, predictors related to unemployment benefits and welfare benefits are not selected by any of the methods and therefore do not appear to be associated with professional careers.

This finding is in line with our intuition: daily wages, and daily wages (or positive wage growth) combined with a high level of education, are generally considered important determinants of a professional career. Our proposed FDR pipeline confirms this assumption, since all lags of the above-mentioned variables are selected. That several lags of the employment variable are selected reflects the career prediction signal inherent in this variable: continuous employment is almost indispensable for a successful career. Finally, we observe that in the majority of cases where two baseline variables interact, the high level of education interacts with variables from the employment and wage history. This is not surprising either, given that a high educational level is often associated with a successful career. A closer look at these interactions reveals that the high level of education interacts with the variables that have already been selected as meaningful baseline predictors.

Tables 6 and 7 in the appendix contain the selected variables of the robustness checks. To check the robustness, we carried out the same analysis for the LSM knockoff statistics for the years 2009 and 2011. The results show that, although fewer variables are selected overall, the same variables as in the 2010 sample are selected.

Overall, our results suggest that the proposed FDR pipeline is very useful in practice, especially if the researcher is faced with huge amounts of data. A feature that distinguishes the FDR pipeline from conventional econometric tools is that it teases apart important from irrelevant variables while guaranteeing Type I error control. Thus, the FDR approach enables economists to limit the number of covariates in a data-driven manner rather than by hand, in a way that preserves the interpretability of economic models and achieves correct inference. Therefore, we recommend that economists who handle large amounts of data use modern high-dimensional tools for variable selection. Due to the shift in the scale of economic data towards Big Data, these high-dimensional models will become an important new toolbox in empirical economic research alongside the conventional econometric toolbox. Having said this, a reasonable question is why economists should include so many covariates in the analysis. The answer is twofold. First, because we can, due to the newly available large-scale administrative data sets. Second, and more importantly, even though economists may believe that a certain economic outcome depends on a small fraction of all available variables, we have a priori no idea which ones are the truly important ones.

In the previous paragraph, we showed that (A)FDR control, which was originally designed for multiple testing, is extremely useful for variable selection in empirical applications where a plethora of predictor variables is available. The final goal of variable selection with (A)FDR control is to draw correct inferences about the estimates of the selected variables. Therefore, in this paragraph we focus on the inference and interpretation of the estimates. Since the $\ell_1$ penalty sets a certain fraction of parameters exactly to zero and favors small estimates, it can lead to an overall unwanted shrinkage of the estimates. To remove such biases, we refit the penalized estimators with subsequent logistic regression and least-squares estimation on the support. The advantage of this two-step procedure is that least-squares provides an accurate and unbiased estimator in low-dimensional and correctly specified models, which can be interpreted straightforwardly.
In addition, we calculate marginal effects of the logistic regression estimates to compare them with the respective least-squares estimates.

In Table 3 we provide the estimation results for refitting AFDR LCD CV. Tables 9, 10 and 11 in the appendix report the corresponding estimation results for refitting AFDR LSM, FDR LCD CV and FDR LSM. The respective tables provide the OLS estimates (with refit), the logistic regression estimates (with refit) and the corresponding marginal effects.

A first look at the results shows that the estimates are robust against the respective specification, and that most estimates are statistically significant. The OLS and logistic regression results in columns 3 and 5 are very similar. For the sake of simplicity, we only interpret the statistically significant OLS estimates in the second column of Table 3, since only the continuous variables are standardized in this column, so that a reasonable interpretation of the binary variables is possible. In addition, we only give an exemplary interpretation of one lagged term from the respective variable group.

Table 3: Inference, LCD CV statistic, $\hat{S}_{\mathrm{AFDR}}$

Columns: OLS estimate++ (with refit) | OLS estimate+ (with refit) | Estimate+ (with refit) | Marginal effect+ (with refit)
[Numerical entries not recoverable; the surviving row labels are daily wage and experience × employed.]

+ Variables are standardized. ++ Only continuous baseline variables are standardized. *, **, *** statistically significant at the 10%, 5% and 1% levels, respectively.

Starting with the OLS estimate on daily wage, we find that, holding all other factors constant (h.a.o.f.c.), increasing the daily wage in 2009 by one standard deviation leads on average to a 4.0 standard deviation higher probability of making a professional career in 2010. It is striking that all lagged terms of the daily wage have a significant positive impact on the professional career. Continuing with the next group of selected variables that has statistically significant estimates, we find that being employed in 2008 decreases the probability of making a professional career in 2010 on average by 1.6 percentage points compared with being non-employed in 2008 (h.a.o.f.c.). Moreover, we find that having a strong negative wage growth in 2009 increases the probability of making a successful career in 2010 on average by 1.4 percentage points compared with having a normal or a positive wage growth in 2009. Finally, we interpret the interaction term experience × employed: h.a.o.f.c., women who are employed in 2009 have on average a 0.1 standard deviation lower probability of making a professional career in 2010 than women who are unemployed in 2009 for every additional standard deviation of experience.

Table 9 in the appendix gives further important insights regarding the interpretation of the OLS estimates of $\hat{S}_{\mathrm{AFDR}}$ with the LSM statistic.
We find that being employed in 2009 increases the probability of making a professional career in 2010 on average by 1.2 percentage points compared with being non-employed in 2009 (h.a.o.f.c.). We further find that being marginally employed in 2007 decreases the probability of making a professional career in 2010 on average by 0.86 percentage points compared with being full-time employed, part-time employed, or unemployed in 2007. On the other hand, a steady wage growth over the past five years increases the probability of making a professional career on average by 0.55 percentage points (h.a.o.f.c.).
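To put logistic regression coefficients on the same probability scale as the OLS coefficients, one common choice is the average marginal effect, which rescales each coefficient by the sample mean of p_i(1 − p_i), where p_i is the fitted probability. The text does not state which variant of the marginal effect is reported, so the averaging convention, the function name, and the toy inputs below are assumptions for illustration only.

```python
import math


def average_marginal_effects(X, intercept, coefs):
    """Average marginal effects of a logit model.

    X is a list of covariate rows, coefs the fitted slopes. For each row the
    derivative of P(y = 1 | x) with respect to x_j equals coefs[j] * p * (1 - p);
    the function averages that factor over all rows.
    """
    scale = 0.0
    for row in X:
        eta = intercept + sum(b * x for b, x in zip(coefs, row))
        p = 1.0 / (1.0 + math.exp(-eta))
        scale += p * (1.0 - p)
    scale /= len(X)
    return [b * scale for b in coefs]


# Hypothetical example: two observations at x = 0 with slope 2, so p = 0.5.
ame = average_marginal_effects([[0.0], [0.0]], intercept=0.0, coefs=[2.0])
```

For binary regressors this derivative-based quantity is only an approximation of the discrete change in probability, which is one reason the OLS coefficients are convenient for interpreting dummy variables.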

Conclusion

Based on an empirical application to the labor market, this paper introduced economists to the potential of high-dimensional tools for controlled variable selection. More specifically, we applied a new aggregation scheme for FDR control [XL19] to a high-dimensional logistic regression model, which teases apart important from irrelevant variables while maximizing the (statistical) power and guaranteeing Type I error control. So far, the aggregated FDR control method has only been used in the context of high-dimensional linear regression models [XL19]. However, we were easily able to extend the aggregation scheme to the non-linear case by exploiting the framework of MX knockoffs [CFJL16]. The MX knockoff filter effectively controls the aggregated false discovery rate and performs valid inference in our high-dimensional logistic regression model by mimicking the correlation structure found within the potential covariates.

In particular, we applied the MX knockoff filter to the data from the Sample of Integrated Employment Biographies to discover which variables from individual employment histories are truly associated with female professional careers. We assumed that the binary career outcome variable can be modeled accurately by using a sparse representation of the covariates, but unlike the conventional economic literature, we were unsure about the identity of the relevant covariates and selected them in a data-driven manner. To reach a parsimonious model, we used the $\ell_1$-penalized technique to solve the logistic regression throughout our estimations.

Our main results suggest that our high-resolution logistic regression model can indeed be described by a sparse representation, since it improves the prediction performance and model interpretability. Indeed, the (A)FDR methods, which are designed for variable selection, show an improvement in terms of estimation accuracy and model interpretability compared to conventional variable selection methods such as the LASSO, which are designed for prediction. Overall, the relatively small subset of the employment history variables that are genuinely related to female professional careers includes the working experience, employment status, daily wage and wage increases in combination with a high level of education.

Our results provide new insights for empirical work in economics. First, the high-dimensional tools presented here enable economists to fit high-dimensional models. Due to the newly available large-scale data sets, these high-dimensional models will become an important new workhorse in empirical economic research alongside the conventional econometrics toolbox. Second, based on the empirical application to the labor market, we have shown that data-driven (A)FDR control, which was originally designed for multiple testing, is also useful for variable selection and correct inference in high-dimensional settings. This is a particularly important insight given that there is little knowledge about variable selection, and even less about (A)FDR control, in the economic literature. Third, the tools for controlled variable selection introduced here enable economists to limit the number of explanatory variables in a data-driven fashion, in a way that preserves the interpretability of economic models by searching for a sparse representation of the data generating process.

References

[AA70] Almquist, E. M.;

Angrist, S. S.: Career salience and atypicality of occupational choice among college women. In: Journal of Marriage and Family 32 (1970), No. 2, pp. 242–249

[ADS17] Adda, J.; Dustmann, C.; Stevens, K.: The career costs of children. In: Journal of Political Economy 125 (2017), No. 2, pp. 293–337

[AI19] Athey, S.; Imbens, G.: Machine learning methods economists should know about. In: Annual Review of Economics 11 (2019), pp. 685–725

[Ath17] Athey, S.: Beyond prediction: Using big data for policy problems. In: Science 355 (2017), No. 6324, pp. 483–485

[BC15] Barber, R. F.; Candès, E. J.: Controlling the false discovery rate via knockoffs. In: The Annals of Statistics 43 (2015), No. 5, pp. 2055–2085

[BCH11] Belloni, A.; Chernozhukov, V.; Hansen, C.: Inference for high-dimensional sparse econometric models. In: Advances in Economics and Econometrics. 10th World Congress of Econometric Society (2011)

[BCS19] Barber, R. F.; Candès, E. J.; Samworth, R. J.: Robust inference with knockoffs. In: arXiv:1801.03896 (2019)

[BH95] Benjamini, Y.; Hochberg, Y.: Controlling the false discovery rate: A practical and powerful approach to multiple testing. In: Journal of the Royal Statistical Society: Series B (Methodological) 57 (1995), No. 1, pp. 289–300

[BHJ19] Böckerman, P.; Haapanen, M.; Jepsen, C.: Back to school: Labor-market returns to higher vocational schooling. In: Labour Economics 61 (2019), p. 101758

[Car99] Card, D.: The causal effect of education on earnings. Chap. 30, Vol. 3A. O. C. Ashenfelter and D. Card, 1999

[CFJL16] Candès, E.; Fan, Y.; Janson, L.; Lv, J.: Panning for gold: model-X knockoffs for high-dimensional controlled variable selection. In: Journal of the Royal Statistical Society: Series B 80 (2016), No. 3

[EK13] Ejrnæs, M.; Kunze, A.: Work and wage dynamics around childbirth. In: The Scandinavian Journal of Economics 115 (2013), No. 3, pp. 856–877

[EL14] Einav, L.; Levin, J.: Economics in the age of big data. In: Science 346 (2014), No. 6210, p. 1243089

[FLQ11] Fan, J.; Lv, J.; Qi, L.: Sparse high dimensional models in Economics. In: Annual Review of Economics

Ganzer, A.; Schmucker, A.; Berge, P. vom; Wurdack, A.: Sample of Integrated Labour Market Biographies – Regional File 1975 - 2014 (SIAB-R 7514). FDZ data report (en), 01/2017.

[HHV18] Heckman, J. J.; Humphries, J. E.; Veramendi, G.: Returns to education: the causal effects of education on earnings, health and smoking. In: Journal of Political Economy

Kleinberg, J.; Lakkaraju, H.; Leskovec, J.; Ludwig, J.; Mullainathan, S.: Human decisions and machine predictions. In: The Quarterly Journal of Economics

Kleinberg, J.; Ludwig, J.; Mullainathan, S.; Obermeyer, Z.: Prediction policy problems. In: American Economic Review: Papers & Proceedings 105 (2015), No. 5, pp. 491–495

[LL19] Li, W.; Lederer, J.: Tuning parameter calibration for ℓ1-regularized logistic regression. In: Journal of Statistical Planning and Inference 202 (2019), pp. 80–98

[MS17] Mullainathan, S.; Spiess, J.: Machine learning: an applied econometric approach. In: Journal of Economic Perspectives 31 (2017), No. 2, pp. 87–106

[RSC19] Romano, Y.; Sesia, M.; Candès, E.: Deep knockoffs. In: Journal of the American Statistical Association 00 (2019), No. 0, pp. 1–12

[Tib96] Tibshirani, R.: Regression shrinkage and selection via the lasso. In: Journal of Royal Statistical Society: Series B (Methodological) 58 (1996), No. 1, pp. 267–288

[Var14] Varian, H. R.: Big data: new tricks for econometrics. In: Journal of Economic Perspectives 28 (2014), No. 2, pp. 3–28

[XL19] Xie, F.; Lederer, J.: Aggregated False Discovery Rate Control. In: arXiv:1907.03807 (2019)

[ZY06] Zhao, P.; Yu, B.: On model selection consistency of Lasso. In: Journal of Machine Learning Research

Appendix

Table 4: Descriptives of Baseline Predictors

Columns 2 to 5: women with career in 2010; columns 6 to 9: women with no career in 2010.

Variable | Mean | Std.Dev. | Min | Max | Mean | Std.Dev. | Min | Max
career | 1.000 | 0.000 | 1 | 1 | 0.000 | 0.000 | 0 | 0
[...]
occupation25_1 | 0.477 | 0.500 | 0 | 1 | 0.269 | 0.444 | 0 | 1
occupation26_1 | 0.020 | 0.141 | 0 | 1 | 0.010 | 0.099 | 0 | 1
occupation27_1 | 0.033 | 0.178 | 0 | 1 | 0.010 | 0.101 | 0 | 1
occupation28_1 | 0.108 | 0.311 | 0 | 1 | 0.113 | 0.317 | 0 | 1
occupation29_1 | 0.031 | 0.172 | 0 | 1 | 0.083 | 0.275 | 0 | 1
occupation30_1 | 0.003 | 0.056 | 0 | 1 | 0.139 | 0.346 | 0 | 1

Notes: N = 1 567 career women and N = 78 215 non-career women in 2010.

Table 5: Selected variables with (A)FDR control

LSM statistic
$\hat{S}_{\mathrm{FDR}}$ | $\hat{S}_{\mathrm{AFDR}}$
employed | employed
marginal employed | marginal employed
change of establishment | change of establishment
change of establishment | –
change of establishment | –
strong negative wage growth | –
strong positive wage growth | strong positive wage growth
wage profile 1 | wage profile 1
high educated × part-time employed | high educated × part-time employed
high educated × part-time employed | high educated × part-time employed
high educated × marginal employed | –
high educated × marginal employed | high educated × marginal employed
high educated × marginal employed | –
high educated × strong positive wage growth | high educated × strong positive wage growth
high educated × strong positive wage growth | high educated × strong positive wage growth
high educated × strong positive wage growth | high educated × strong positive wage growth
high educated × strong positive wage growth | high educated × strong positive wage growth
high educated × strong positive wage growth | high educated × strong positive wage growth
– | high educated × employed
high educated × change of establishment | high educated × change of establishment
high educated × daily wage | high educated × daily wage
high educated × daily wage | high educated × daily wage
high educated × daily wage | high educated × daily wage
high educated × daily wage | high educated × daily wage
experience × strong positive wage growth | –
experience × daily wage | –

LCD CV statistic
$\hat{S}_{\mathrm{FDR}}$ | $\hat{S}_{\mathrm{AFDR}}$
daily wage | daily wage
daily wage | daily wage
– | daily wage
– | daily wage
– | employed
– | employed
– | employed
– | strong negative wage growth
– | strong negative wage growth
– | experience × employed

LCD AV statistic
$\hat{S}_{\mathrm{FDR}}$ | $\hat{S}_{\mathrm{AFDR}}$
– | –

Notes: Target level q = 0.10.

Table 6: Robustness check 1: Selected variables with (A)FDR control

LSM statistic
$\hat{S}_{\mathrm{FDR}}$ | $\hat{S}_{\mathrm{AFDR}}$
employed | –
change of establishment | –
strong positive wage growth | –
high educated × part-time employed | high educated × part-time employed
high educated × marginal employed | –
high educated × marginal employed | –
high educated × strong negative wage growth | –
high educated × strong positive wage growth | high educated × strong positive wage growth
high educated × strong positive wage growth | –
high educated × strong positive wage growth | –
high educated × strong positive wage growth | –
high educated × wage profile 1 | high educated × wage profile 1
experience × daily wage | –

Notes: Target level q = 0.10. 2009 Sample

Table 7: Robustness check 2: Selected variables with (A)FDR control

LSM statistic
$\hat{S}_{\mathrm{FDR}}$ | $\hat{S}_{\mathrm{AFDR}}$
high educated × part-time employed | high educated × part-time employed
high educated × part-time employed | high educated × part-time employed
high educated × strong positive wage growth | high educated × strong positive wage growth
high educated × strong positive wage growth | high educated × strong positive wage growth
high educated × strong positive wage growth | high educated × strong positive wage growth
high educated × strong positive wage growth | high educated × strong positive wage growth
high educated × strong positive wage growth | high educated × strong positive wage growth
– | high educated × change of establishment
high educated × daily wage | high educated × daily wage
high educated × daily wage | high educated × daily wage
high educated × daily wage | high educated × daily wage

Notes: Target level q = 0.10. 2011 Sample

Table 8: Selected variables with LASSO

No statistic, $\hat{S}_{\mathrm{LASSO}}$
daily wage 1, daily wage 2, daily wage 3, daily wage 4, daily wage 5
employed 1, employed 3, employed 4, employed 5
marginally employed 1, marginally employed 2, marginally employed 3, marginally employed 4, marginally employed 5
years since last change of establishment 1
change of establishment 1, change of establishment 2
strong positive wage growth 1, strong positive wage growth 2
occupation29 1
wage profile 1
high educated × part-time employed 1, high educated × part-time employed 5
high educated × marginally employed 2, high educated × marginally employed 3
high educated × strong negative wage growth 1
high educated × strong positive wage growth 1, high educated × strong positive wage growth 2, high educated × strong positive wage growth 3, high educated × strong positive wage growth 4, high educated × strong positive wage growth 5
high educated × employed 1, high educated × employed 4, high educated × employed 5
high educated × change of establishment 1
high educated × daily wage 1, high educated × daily wage 2, high educated × daily wage 3, high educated × daily wage 4, high educated × daily wage 5
experience × daily wage 3, experience × daily wage 5

Table 9: Inference, LSM statistic, $\hat{S}_{\mathrm{AFDR}}$

Columns: OLS estimate++ (with refit) | OLS estimate+ (with refit) | Estimate+ (with refit) | Marginal effect+ (with refit)
[Numerical entries and most row labels not recoverable; the first surviving row label is employed.]

+ Variables are standardized. ++ Only continuous baseline variables are standardized. *, **, *** statistically significant at the 10%, 5% and 1% levels, respectively.

Table 10: Inference, LCD CV statistic, $\hat{S}_{\mathrm{FDR}}$

Columns: OLS estimate+ (with refit) | Estimate+ (with refit) | Marginal effect+ (with refit)
[Numerical entries not recoverable; the surviving row label is daily wage.]

+ Variables are standardized. *, **, *** statistically significant at the 10%, 5% and 1% levels, respectively.

Table 11:

InferenceLSM ˆ S FDR

OLS Estimate + Estimate + Marginal eﬀect + (with reﬁt) (with reﬁt) (with reﬁt)employed − . − . − . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . × − . − . − . . . . × − . − . − . . . . × . − . − . . . . × . . . . . . × . . . . . . × − . − . − . . . . able 11: Continued high educated × − . − . − . . . . × − . − . − . . . . × − . − . − . . . . × − . − . − . . . . × − . − . − . . . . × . . . . . . × . . . . . . × . . . . . . × . . . . . . × − . − . − . . . . × . . . . . . + Variables are standardized. ∗ , ∗∗ , ∗∗∗ statistically signiﬁcant at the 10%, 5% and 1% levels.37 able 12: Legend of Baseline PredictorsVariable

Descriptioncareer 1 if woman makes a professional career, 0 otherwiseage Women’s age in yearslow educated 1 if no postsecondary degree, 0 otherwisehigh educated 1 if tertiary educated, 0 otherwisenon-German citizenship 1 if with migration background, 0 otherwiseemployed 1 if woman is employed at least 50% of the year, 0 otherwisepart-time employed 1 if working part-time, 0 otherwisemarginally employed 1 if woman works in a minijob, 0 otherwisedaily wage Daily wage in Euroswage proﬁle 1 1 if the wage level anually increases over the past ﬁve years, 0otherwisewage proﬁle 2 1 if the wage increases by 10% in the last 5 years, 0 otherwiseyears since last change ofestablishment How many years have passed since the last change of businesschange of establishment 1 if change of business, 0 otherwisestrong negative wagegrowth 1 if wage growth < = − .

05, 0 otherwisestrong positive wagegrowth 1 if wage growth > = 0 . able 12: Continued occupation15 1 if carpenter, modeler, 0 otherwiseoccupation16 1 if painter, 0 otherwiseoccupation17 1 if goods inspector, 0 otherwiseoccupation18 1 if auxiliary worker, 0 otherwiseoccupation19 1 if machinists, 0 otherwiseoccupation20 1 if engineers, chemists, physicists, mathematicians, 0 otherwiseoccupation21 1 if technician, 0 otherwiseoccupation22 1 if were merchants, 0 otherwiseoccupation23 1 if services consultants, 0 otherwiseoccupation24 1 if traﬃc professions, 0 otherwiseoccupation25 1 if organizational or administrative oﬃce occupations, 0 otherwiseoccupation26 1 if regulatory or security occupations, 0 otherwiseoccupation27 1 if writers and artistic professions, 0 otherwiseoccupation28 1 if health care professionals, 0 otherwiseoccupation29 1 if social and educational occupations, 0 otherwiseoccupation30 1 if general service occupations, 0 otherwise39 ull list of predictor variables: A full = { career , career , career , career , career , wage , wage , wage , wage , wage , full-time employed , full-time employed , full-time employed , full-time employed , full-time employed , part-time employed , part-time employed , part-time employed , part-time employed , part-time employed , marginally employed , marginally employed , marginally employed , marginally employed , marginally employed , change of establishment , change of establishment , change of establishment , change of establishment , change of establishment , years since last change of establishment , years since last change of establishment , years since last change of establishment , years since last change of establishment , years since last change of establishment , strong negative wage growth , strong negative wage growth , strong negative wage growth , strong negative wage growth , strong negative wage growth , strong positive wage growth , strong positive wage growth , strong positive wage growth , strong 
positive wage growth , strong positive wage growth , experience , experience , experience , experience , experience , experience , unemployment beneﬁt days , unemployment beneﬁt days , unemployment beneﬁt days , unemployment beneﬁt days , unemployment beneﬁt days , registered job search days while not employed , registered job search days while not employed , registered job search days while not employed , unemployment beneﬁt , unemployment beneﬁt , unemployment beneﬁt , unemployment beneﬁt , unemployment beneﬁt , registered job search while not employed , registered job search while not employed , registered job search while not employed , low educated , high educated , non-German citizenship , occupation1 occupation2 , occupation3 , occupation4 , occupation5 , occupation6 , occupation7 occupation8 , occupation9 , occupation10 , occupation11 , occupation12 , occupation13 occupation14 , occupation15 , occupation16 , occupation17 , occupation18 , occupation19 occupation20 , occupation21 , occupation22 , occupation23 , occupation24 , occupation25 occupation26 , occupation27 , occupation28 , occupation29 , occupation30 age , age , wage proﬁle 1 , wage proﬁle 2high educated × experience , high educated × experience , high educated × experience high educated × part-time employed , high educated × part-time employed , high educated × part-time employed , high educated × part-time employed , high educated × part-time employed , high educated × marginally employed , high educated × marginally employed , high educated × marginally employed , high educated × marginally employed , high educated × marginally employed , high educated × strong negative wage growth , high educated × strong negative wage growth , high educated × strong negative wage growth , × strong negative wage growth , high educated × strong negative wage growth , high educated × strong positive wage growth , high educated × strong positive wage growth , high educated × strong positive wage growth , high 
educated × strong positive wage growth, high educated × strong positive wage growth, high educated × full-time employed, high educated × full-time employed, high educated × full-time employed, high educated × full-time employed, high educated × full-time employed, high educated × years since last change of establishment, high educated × years since last change of establishment, high educated × years since last change of establishment, high educated × years since last change of establishment, high educated × years since last change of establishment, high educated × change of establishment, high educated × change of establishment, high educated × change of establishment, high educated × change of establishment, high educated × change of establishment, high educated × daily wage, high educated × daily wage, high educated × daily wage, high educated × daily wage, high educated × daily wage, high educated × unemployment benefit days, high educated × unemployment benefit days, high educated × unemployment benefit days, high educated × unemployment benefit days, high educated × unemployment benefit days, high educated × registered job search days while employed, high educated × registered job search days while employed, high educated × registered job search days while employed, high educated × unemployment benefit, high educated × unemployment benefit, high educated × unemployment benefit, high educated × unemployment benefit, high educated × unemployment benefit, high educated × registered job search days while employed, high educated × registered job search days while employed, high educated × registered job search days while employed,
low educated × experience, low educated × experience, low educated × experience, low educated × part-time employed, low educated × part-time employed, low educated × part-time employed, low educated × part-time employed, low educated × part-time employed, low educated × marginally employed, low educated × marginally employed, low educated × marginally employed, low educated × marginally employed, low educated × marginally employed, low educated × strong negative wage growth, low educated × strong negative wage growth, low educated × strong negative wage growth, low educated × strong negative wage growth, low educated × strong negative wage growth, low educated × strong positive wage growth, low educated × strong positive wage growth, low educated × strong positive wage growth, low educated × strong positive wage growth, low educated × strong positive wage growth, low educated × employed, low educated × employed, low educated × employed, low educated × employed, low educated × employed, low educated × years since last change of establishment, low educated × years since last change of establishment, low educated × years since last change of establishment, low educated × years since last change of establishment, low educated × years since last change of establishment, low educated × change of establishment, low educated × change of establishment, low educated × change of establishment, low educated × change of establishment, low educated × change of establishment, low educated × daily wage, low educated × daily wage, low educated × daily wage, low educated × daily wage, low educated × daily wage, low educated × unemployment benefit days, low educated × unemployment benefit days, low educated × unemployment benefit days, low educated × unemployment benefit days, low educated × unemployment benefit days, low educated × registered job search days while employed, low educated × registered job search days while employed, low educated × registered job search days while employed, low educated × unemployment benefit, low educated × unemployment benefit, low educated × unemployment benefit, low educated × unemployment benefit, low educated × unemployment benefit, low educated × registered job search days while employed, low educated × registered job search days while employed, low educated × registered job search days while employed,
high educated × wage profile 1, high educated × wage profile 2, high educated × non-German citizenship, low educated × wage profile 1, low educated × wage profile 2, low educated × non-German citizenship,
experience × part-time employed, experience × part-time employed, experience × part-time employed, experience × part-time employed, experience × part-time employed, experience × marginally employed, experience × marginally employed, experience × marginally employed, experience × marginally employed, experience × marginally employed, experience × strong positive wage growth, experience × strong positive wage growth, experience × strong positive wage growth, experience × strong positive wage growth, experience × strong positive wage growth, experience × strong negative wage growth, experience × strong negative wage growth, experience × strong negative wage growth, experience × strong negative wage growth, experience × strong negative wage growth, experience × employed, experience × employed, experience × employed, experience × employed, experience × employed, experience × years since last change of establishment, experience × years since last change of establishment, experience × years since last change of establishment, experience × years since last change of establishment, experience × years since last change of establishment, experience × change of establishment, experience × change of establishment, experience × change of establishment, experience × change of establishment, experience × change of establishment, experience × daily wage, experience × daily wage, experience × daily wage, experience × daily wage, experience × daily wage, experience × unemployment benefit days, experience × unemployment benefit days, experience × unemployment benefit days, experience × unemployment benefit days, experience × unemployment benefit days, experience × registered job search days while employed, experience × registered job search days while employed, experience × registered job search days while employed, experience × unemployment benefit, experience × unemployment benefit, experience × unemployment benefit, experience × unemployment benefit, experience × unemployment benefit, experience × registered job search while unemployed, experience × registered job search while unemployed, experience × registered job search while unemployed,
experience × part-time employed, experience × part-time employed, experience × part-time employed, experience × part-time employed, experience × part-time employed, experience × marginally employed, experience × marginally employed, experience × marginally employed, experience × marginally employed, experience × marginally employed, experience × strong positive wage growth, experience × strong positive wage growth, experience × strong positive wage growth, experience × strong positive wage growth, experience × strong positive wage growth, experience × strong negative wage growth, experience × strong negative wage growth, experience × strong negative wage growth, experience × strong negative wage growth, experience × strong negative wage growth, experience × employed, experience × employed, experience × employed, experience × employed, experience × employed, experience × years since last change of establishment, experience × years since last change of establishment, experience × years since last change of establishment, experience × years since last change of establishment, experience × years since last change of establishment, experience × change of establishment, experience × change of establishment, experience × change of establishment, experience × change of establishment, experience × change of establishment, experience × daily wage, experience × daily wage, experience × daily wage, experience × daily wage, experience × daily wage, experience × unemployment benefit days, experience × unemployment benefit days, experience × unemployment benefit days, experience × unemployment benefit days, experience × unemployment benefit days, experience × registered job search days while employed, experience × registered job search days while employed, experience × registered job search days while employed, experience × unemployment benefit, experience × unemployment benefit, experience × unemployment benefit, experience × unemployment benefit, experience × unemployment benefit, experience × registered job search while unemployed, experience × registered job search while unemployed, experience × registered job search while unemployed,
experience × wage profile 1, experience × wage profile 2, experience × wage profile 1, age × high educated, age × high educated, age × low educated, age × low educated. }}