Variational Bayes survival analysis for unemployment modelling
Pavle Boškoski, Matija Perne, Martina Rameša, Biljana Mileva Boshkoska
Jožef Stefan Institute, Jamova cesta 39, 1000 Ljubljana, Slovenia
Faculty of Information Sciences in Novo mesto, Ljubljanska cesta 31a, 8000 Novo mesto, Slovenia
Employment Service of Slovenia, Rožna dolina, Cesta IX/6, 1000 Ljubljana, Slovenia

∗ [email protected] (corresponding author)
Abstract
Mathematical modelling of unemployment dynamics attempts to predict the probability of a job seeker finding a job as a function of time. This is typically achieved by using information in unemployment records. These records are right censored, making survival analysis a suitable approach for parameter estimation. The proposed model uses a deep artificial neural network (ANN) as a non-linear hazard function. Through embedding, high-cardinality categorical features are analysed efficiently. The posterior distribution of the ANN parameters is estimated using a variational Bayes method. The model is evaluated on a time-to-employment data set spanning from 2011 to 2020 provided by the Slovenian public employment service. It is used to determine the employment probability over time for each individual on the record. Similar models could be applied to other questions with multi-dimensional, high-cardinality categorical data including censored records. Such data is often encountered in personal records, for example in medical records.
A reliable time-to-employment estimate is a valuable piece of information both for job seekers and for the public employment service (PES) employment counsellors. Perceived employability has important effects on job seekers [1], so improving its accuracy may be beneficial. Long-term unemployment has been identified as having a significant scarring effect on society and the economy and, more importantly, on the health of the unemployed persons [2, 3, 4]. As a result, the PES counsellors want to focus their attention on those job seekers that need their help. A time-to-employment prediction can help them identify the ones that do not need PES resources as they will get employed soon regardless of the interventions. Algorithmic tools predicting the future of particular job seekers are therefore needed to support the decision making process.

The field of creating such tools is very active. One can track such efforts for over 20 years [5]. The methods used can be roughly separated into four groups. The first and most numerous group consists of approaches based on logit/probit models [6, 7, 8, 9]. The second group consists of so-called in/out models, whose goal is to estimate the probabilities of entering and exiting the labour market [10, 11]. The biggest limitation of these two groups is their inability to incorporate a large number of high-cardinality discrete covariates. The data used are typically yes/no questionnaires, hence the limited efficiency. The third group includes machine learning approaches [12]. These approaches, on the other hand, are capable of handling vast amounts of heterogeneous data. However, the results are almost always point estimates, and the lack of uncertainty assessment can have significant negative practical consequences. Finally, there are the emerging approaches based on labour flow networks with the goal of modelling the labour market as a dynamic system [13, 14].
These models require high quality micro-data that usually have restricted access.

The biggest issue when modelling the labour market, or in other terms, modelling the dynamics of the unemployed part of the population, is discovering the influencing forces. It is well known that it is impossible to infer the overall system dynamics through modelling the behaviour of each individual “actor” in a system [15]. The same is shown to be valid for sociological systems [16]. Therefore, it is important to take various interdependencies between different societal levels into account [17].

As modern society evolves quickly, old information describing socio-economic dynamics bears little importance for current or future behaviour. The amount of historical data that can be effectually used for deriving the current and future dynamics is therefore limited. This inevitably leads to censored data points [18, 19]. The concept of censoring is important because ignoring or otherwise mistreating the censored cases might lead to false conclusions [20].

Survival analysis is capable of properly handling censored time-to-event data [21]. The most prominent example is the Cox proportional hazards model [22]. With a linear risk function in the Cox model, the complete likelihood can be derived in a closed form. However, this limits the applicability of the model for systems with more complex and non-linear dynamics. The issue with more complex functions is that the solution of the Cox model becomes intractable. An alternative is to employ Markov chain Monte Carlo approaches [21]. However, there are two main limitations of such approaches: the immense computational load for multidimensional data and the presence of discrete/categorical data. The latter issue is usually resolved through so-called one-hot encoding, which for large-cardinality covariates causes an explosion of model parameters.

A logical step forward in survival analysis is the use of machine learning [23, 24].
These techniques are capable of handling mixtures of continuous and categorical data through the concept of embedding [25]. Kvamme et al. [26] have derived the loss function for a survival analysis model using a deep neural network as its core. There have been several recent results that build upon Cox proportional hazards models [27, 28, 29]. Additionally, there are proposals in which parametric survival models employ recurrent neural networks (RNN) in order to predict the empirical probability distribution of future events [30, 31]. These models have undisputed applicability with one drawback: they provide only point estimates of the posterior distributions of the survival/hazard functions.

A method harnessing the power of obtaining complete posterior distributions, like with Markov chain Monte Carlo approaches, while preserving the processing power of the machine learning methods, is the variational Bayes (VB) method [32]. It has been successfully applied in various fields, such as Gaussian process modelling [33, 34], deep generative models [35], compressed sensing [36, 37], hidden Markov models [38, 39], reinforcement learning and control [40], etc. It is possible to define a survival model with an arbitrary risk function that is implemented as an artificial neural network (ANN) and estimate the posterior distribution of the ANN model parameters despite the large number of parameters.

All these properties come to use when building survival models using personal records, for instance medical records, PES data, or similar. The records tend to be multi-dimensional and usually contain high-cardinality categorical data [41, 42, 43]. Such data are an ideal candidate for harnessing the capability of the VB-based survival method.

The proposed model estimating time-to-employment is a survival model with an ANN used as a non-linear hazard function. Embedding is used for large-cardinality covariates and the VB method is used to estimate the probability distribution of the parameters.
In this way, it is not limited to linear hazard rates, it provides an estimate of the parameter distribution, and it treats censoring correctly to avoid unnecessary restrictions on the training data.

A basic overview of survival analysis is given in Section 2. Section 3 presents the variational Bayes approach. The proposed VB-based survival analysis solution is presented in Section 4. The application on unemployment records is described in Section 5. The results are presented in Section 6 and discussed in Section 7.

Survival analysis is the study of time-to-event data that initially focused on estimation of lifespans. The majority of the developed methods investigate continuous-time models [44], but there are also developments on discrete-time approaches [45].

Survival analysis explores the probability distribution of an event over time. The probability that an event occurs at time T before a certain time t can be written as

$$\Pr(T \le t) = \int_0^t f(s)\, ds = F(t), \tag{1}$$

where the functions f(t) and F(t) are the probability density and the cumulative probability function, respectively. The opposite case, i.e. the probability that the time of occurrence T will be after a certain time t, is

$$\Pr(T > t) = S(t) = 1 - F(t). \tag{2}$$

The function S(t) is called the survival function. The hazard rate, or hazard function, λ(t) is defined as the event rate at time t if the event has not occurred up until t and is expressed as

$$\lambda(t) = \lim_{\Delta t \to 0} \frac{\Pr(t \le T < t + \Delta t \mid T \ge t)}{\Delta t} = \frac{f(t)}{S(t)}. \tag{3}$$

The survival function can be obtained from the hazard function as

$$S(t) = \exp\left[-\Lambda(t)\right], \qquad \Lambda(t) = \int_0^t \lambda(s)\, ds. \tag{4}$$

Survival models are categorised based on the type of the hazard rate. Often, the multiplicative hazard function is used [44],

$$\lambda(t \mid x) = \lambda_0(t)\, e^{h(x)}. \tag{5}$$

Its two components are the baseline hazard λ_0(t) and a risk function h(x), which depends on directly measurable covariates x that are also known as explanatory variables or features. In the simplest form, i.e. Cox proportional hazards models, the risk function is linear, h_β(x) = β^T x, and needs no constant term as it is included in the baseline term λ_0(t). The parameter set β is identified from the data. This is achieved using various estimation approaches, the most popular being the Kaplan-Meier estimator [46]. For the case of non-linear models, the risk function h(x) can be an arbitrary real function. In such cases, various forms of ANNs have been applied [26, 27, 28].

Another possibility is the use of accelerated failure-time models [44], where the logarithm of the survival time y is approximated with linear regression,

$$\log T = y = \beta^{T} x + \sigma W. \tag{6}$$

The probability distribution of the error W is altered using regression coefficients β, covariates x, and scale factor σ to obtain the probability distribution of the event time T. The choice of the probability distribution of W directly defines the probability distribution of the survival time T. Typical pairs are listed in Table 1.

The model can be extended by replacing the linear function β^T x with an arbitrary real function of the covariates h_z(x), where z is the set of the parameters of the non-linear function.
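The relation (4) between the hazard and the survival function can be checked numerically. The sketch below uses a constant hazard, for which the closed form S(t) = exp(−λ₀t) is known; the grid size and the hazard value are arbitrary illustrative choices.

```python
import numpy as np

# Numerical check of relation (4), S(t) = exp(-Lambda(t)), for a constant
# hazard lambda(t) = lambda_0, where the closed form S(t) = exp(-lambda_0 t)
# is known. The grid and the hazard value are arbitrary choices.
lambda_0 = 0.5
t_grid = np.linspace(0.0, 5.0, 2001)

hazard = np.full_like(t_grid, lambda_0)
# cumulative hazard Lambda(t) by trapezoidal integration of lambda(s)
Lambda = np.concatenate(
    ([0.0], np.cumsum(0.5 * (hazard[1:] + hazard[:-1]) * np.diff(t_grid))))
S_numeric = np.exp(-Lambda)                    # S(t) = exp(-Lambda(t))
S_closed = np.exp(-lambda_0 * t_grid)          # known closed form

max_err = np.max(np.abs(S_numeric - S_closed))
print(max_err)
```

For a non-constant hazard the same trapezoidal accumulation applies; only the `hazard` array changes.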
When the model (6) is extended this way, the equation reads

$$y = h_z(x) + \sigma W. \tag{7}$$

It has been done with the output of an ANN used as h_z(x) [47, 23, 26, 48].

Table 1: Error distributions and corresponding survival time distributions.

    Log-error distribution      Survival time distribution
    Normal                      Log-normal
    Gumbel (extreme value)      Weibull
    Logistic                    Log-logistic

Two particularities of time-to-event data sets that complicate their processing are censoring and truncation [44]. Censoring happens when the time of an event may only be known to be within a particular time interval. Truncation occurs when the cases with event times outside of the observation period are not observed. In the studied example, there is no truncation and only right censoring, also known as Type I censoring. That is, all the subjects are observed, the timing of the event is known if it occurs prior to the end of the observation time, and the exact timing is unknown for the events that occur later.

Censoring might lead to false conclusions if it is not accounted for properly [20]. The distinction between survival analysis and simple regression is in the handling of censored data.
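Concretely, under model (7) with a normal error W (so that y = log T is normal), an exact event contributes its log-density to the likelihood, while a right-censored record contributes the log-survival, i.e. the probability of the event occurring after the censoring time. A minimal sketch, where the helper name and the example numbers are illustrative only:

```python
import numpy as np
from scipy import stats

# Sketch of the two likelihood contributions under right censoring, assuming
# a normal error W so that y = log T is normally distributed. The helper name
# and the numbers below are illustrative, not from the paper.
def censored_loglik(y, c, mu, sigma):
    """y: log event/censoring time, c: censoring label (1 = censored)."""
    dist = stats.norm(mu, sigma)                # distribution of y = log T
    # exact events contribute log f(y); censored records contribute the
    # log-survival log(1 - F(y)), the probability of surviving past y
    return np.where(c == 0, dist.logpdf(y), dist.logsf(y))

y = np.array([0.2, 1.5, 2.0])
c = np.array([0, 0, 1])                         # the last record is censored
loglik = censored_loglik(y, c, mu=1.0, sigma=0.5)
print(loglik)
```

Simple regression would instead treat the censoring time as the event time for all three records, biasing the fit; the branching above is the whole difference.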
The most common survival analysis models are of the types (5) and (7) and limited to such functions h(x) or h_z(x) that the likelihood function exists in closed form [44]. However, the modelling benefits from the use of more general functions that are capable of handling non-linear relationships between the covariates x. For those, the evidence p(x) in equation (8) is intractable. Solving the equation (8) in order to estimate the posterior probability distribution p(θ|x) of the model parameters θ thus requires the use of an approximation. We choose to use an ANN in place of h_z(x) in equation (7) and to estimate the posterior probability distribution p(θ|x) using the VB method [32]. Another issue is the use of high-cardinality discrete covariates, necessitating some form of dimension embedding. Such properties are becoming typical in medical records [41, 42, 43] and are also present in this study.

When performing a stochastic analysis, such as survival analysis, the inference process of estimating the model parameters relies on the Bayes' rule. For a set of observations x, generated by a system with parameters θ, the Bayes' rule reads

$$\underbrace{p(\theta \mid x)}_{\text{Posterior}} = \frac{\overbrace{p(x \mid \theta)}^{\text{Likelihood}} \; \overbrace{p(\theta)}^{\text{Prior}}}{\underbrace{p(x)}_{\text{Evidence}}}. \tag{8}$$

The likelihood p(x|θ) is typically known because it is prescribed by the model structure. The prior p(θ) is typically chosen. The biggest obstacle in computing the posterior probability distribution p(θ|x) is the evidence p(x). It is the solution of the equation

$$p(x) = \int_{\theta} p(x \mid \theta)\, p(\theta)\, d\theta, \tag{9}$$

which in most cases cannot be obtained in a closed form. For multi-dimensional cases, even Monte Carlo integration becomes impractical due to the immense computational load.
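On a toy one-dimensional conjugate model, chosen here purely so that the closed form is available for comparison (it is not the unemployment model), the evidence integral (9) can still be estimated by averaging the likelihood over prior samples:

```python
import numpy as np
from scipy import stats

# Monte Carlo estimate of the evidence integral (9) on a toy conjugate model
# theta ~ N(0,1), x|theta ~ N(theta,1), chosen only so that the closed form
# x ~ N(0, 2) is available for comparison.
rng = np.random.default_rng(4)
x = 1.0

theta = rng.standard_normal(500_000)                  # samples from the prior p(theta)
evidence_mc = np.mean(stats.norm(theta, 1.0).pdf(x))  # average likelihood over the prior
evidence_exact = stats.norm(0.0, np.sqrt(2.0)).pdf(x)
print(evidence_mc, evidence_exact)
```

With one parameter this converges quickly; with the hundreds of parameters of an ANN, prior samples almost never land where the likelihood is large, which is exactly the impracticality noted above.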
The VB method provides an approximative solution to this problem. The idea of the VB method is to sufficiently closely approximate the true posterior p(θ|x) with an approximative probability distribution q_{ω*}(θ), referred to as the variational distribution. It belongs to a variational family of functions q_ω(θ) ∈ Q, ω ∈ Ω, where Ω is the set of all possible values of the latent parameters ω. A typical variational family is the mean-field variational family [49], in which the parameters θ are mutually independent.

The optimal values of the latent parameters ω* are obtained by minimising the Kullback-Leibler (KL) divergence KL(q_ω(θ) || p(θ|x)) between the true posterior and the variational distribution. The optimisation problem

$$\omega^* = \arg\min_{\omega \in \Omega} \mathrm{KL}\left(q_\omega(\theta)\,\|\,p(\theta \mid x)\right) \tag{10}$$

is solved. Since the true posterior is unknown, calculating the KL divergence requires a minor rearrangement. As shown by Šmídl and Quinn [32], KL(q_ω(θ) || p(θ|x)) can be written as

$$\begin{aligned}
\mathrm{KL}\left(q_\omega(\theta)\,\|\,p(\theta \mid x)\right) &= E_q\!\left[\log \frac{q_\omega(\theta)}{p(\theta \mid x)}\right] = E_q[\log q_\omega(\theta)] - E_q[\log p(\theta \mid x)] \\
&= E_q[\log q_\omega(\theta)] - E_q[\log p(x, \theta) - \log p(x)] \\
&= E_q[\log q_\omega(\theta) - \log p(x, \theta)] + \log p(x) \\
&= -\underbrace{E_q[\log p(x, \theta) - \log q_\omega(\theta)]}_{\text{ELBO}} + \log p(x).
\end{aligned} \tag{11}$$

The first term of the final expression is known as the evidence lower bound (ELBO), and maximising it results in minimising the KL divergence between the variational distribution and the true posterior. The second term, log p(x), is constant and is therefore not a part of the optimisation process.

Having an approximative variational distribution instead of the true posterior distribution introduces an inherent bias which depends on the variational family used. The selection of the variational family Q is thus not an ad-hoc decision but is based on prior knowledge such as support, empirical observations or experts' knowledge.
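The identity (11), ELBO = log p(x) − KL(q_ω(θ) || p(θ|x)), can be verified numerically on the same kind of toy conjugate model, where evidence, posterior, and KL divergence are all available in closed form; the variational parameters below are arbitrary illustrative values.

```python
import numpy as np
from scipy import stats

# Numerical check of identity (11) on a toy conjugate model theta ~ N(0,1),
# x|theta ~ N(theta,1), where log p(x), the posterior, and the KL divergence
# are all available in closed form. The variational parameters are arbitrary.
rng = np.random.default_rng(1)
x = 1.0
m, s = 0.3, 0.9                                  # q(theta) = N(m, s^2)

post_mu, post_var = x / 2.0, 0.5                 # exact posterior N(x/2, 1/2)
log_evidence = stats.norm(0.0, np.sqrt(2.0)).logpdf(x)   # x ~ N(0, 2)
kl = (np.log(np.sqrt(post_var) / s)              # closed-form Gaussian KL
      + (s**2 + (m - post_mu)**2) / (2.0 * post_var) - 0.5)

# Monte Carlo ELBO = E_q[log p(x, theta) - log q(theta)]
theta = rng.normal(m, s, size=400_000)
log_joint = stats.norm(0.0, 1.0).logpdf(theta) + stats.norm(theta, 1.0).logpdf(x)
elbo = np.mean(log_joint - stats.norm(m, s).logpdf(theta))

print(elbo, log_evidence - kl)    # the two values should agree
```

Note that the ELBO estimate requires only samples from q and the joint density, never the intractable evidence; that is what makes it usable as a loss function.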
In spite of the inherent bias, the VB method is justified by the substantial increase in computational efficiency compared to alternatives such as Monte Carlo integration. In our analysis, the ADAM optimiser [50, 51], implemented as a part of the PyTorch package [52], is used for the optimisation (10) through Pyro [53]. The overall idea is schematically presented in Figure 1.
Figure 1: Optimisation process of finding the closest variational distribution q_ω(θ) over the set of latent variables ω. The mismatch between the selected variational distribution q_ω(θ) and the unknown true posterior p(θ|x) introduces a bias that remains after ω is optimised through the KL divergence.

The use of the model of the form (7) with an ANN as h_z(x) results in the integral (9) that cannot be calculated analytically. Obtaining the posterior probability density function p(θ|x) thus requires an approximation. ANNs typically have hundreds of parameters, precluding even the use of Monte Carlo methods. Another option are Gaussian processes, which are shown to be equivalent to a fully connected ANN with an infinite number of hidden units in each layer [54]. A variational inference approach provides a computationally efficient solution using the ELBO (11) as the loss function.

With the VB method, one has to select the prior distribution p(θ) and the variational family of the posterior distribution Q. This is followed by the optimisation (10) using the ELBO loss function (11), resulting in the variational distribution latent parameters ω*. In the most general form of the VB method, every model parameter may be treated as a stochastic one.

We implement the function h_z(x) as an ANN. The model parameters are assembled as θ = [σ, z], z = [z_1, …, z_K]. We use a mean-field variational family, so the probability distribution q_ω(θ) can be expressed as

$$q_\omega(\theta) = q(\sigma) \cdot \prod_{k=1}^{K} q_k(z_k), \tag{12}$$

where K is the number of the ANN parameters. For the variational distribution, we use a half-normal distribution for the factor q(σ) and a normal distribution for every q_k(z_k). The vector of latent parameters ω is structured as ω = [σ_σ, ω_z], ω_z = [µ_1, …, µ_K, σ_1, …, σ_K].
The latent parameter σ_σ is used in $q(\sigma) = \frac{2}{\sigma_\sigma}\,\varphi\!\left(\frac{\sigma}{\sigma_\sigma}\right)$ for σ ≥ 0, and the latent parameters ω_z define $q_k(z_k) = \frac{1}{\sigma_k}\,\varphi\!\left(\frac{z_k - \mu_k}{\sigma_k}\right)$, where φ is the probability density function of the standard normal probability distribution.

Models relying on the VB method are described using the concepts of probabilistic graphical models [55, 56] and directed factor graphs [57]. The model (7) is transformed into a VB form that takes censoring of the observations into account. The directed factor graph is shown in Figure 2.
Figure 2: Factor graph describing the devised survival model. The pseudocode of parameter estimation is shown in Algorithm 1. The expression 1 − CDF_W is the predicted survival function and Bernoulli denotes the Bernoulli distribution. The latent variable ω_k stands for ω_k = [µ_k, σ_k]. The model predicts the time-to-event from the covariates and the probability of censoring for a given maximum time-to-event.

The implementation of the VB method used in this study is based on so-called black box variational inference [58, 59]. It relies purely on numerical evaluation and requires only proper specification of the model. That is, only the model likelihood p(x|θ) and the observations for parameter estimation x have to be specified. The prior probability distribution p(θ) is assumed to be constant and the variational family Q is chosen. In our case, the likelihood is given by equation (7) and the variational family Q is as described in equation (12). As derived by Wingate and Weber [58], the gradient of the criterion function is expressed as

$$\nabla_\omega L(\omega) = E_{q_\omega(\theta)}\left[\nabla_\omega \log q_\omega(\theta) \left(\log p(x, \theta) - \log q_\omega(\theta)\right)\right], \tag{13}$$

where L(ω) is the ELBO. This expression is used in stochastic gradient optimisation to find an estimate of ω*.

The likelihood of the i-th training data point for a sampled value of θ is calculated based on equation (7). As illustrated in Figure 2, the evaluation process depends on whether the observation is censored or not. The vector of covariates is labelled with x^(i), y^(i) is the logarithm of the time-to-event or time-to-censoring, and c^(i) is the censoring label. For a complete observation, i.e. c^(i) = 0, the gradient (13) is calculated at the observed y^(i). For censored observations, i.e. c^(i) = 1, the loss is calculated from the predicted probability of censoring at y^(i), which equals the survival function 1 − CDF_W at y^(i).
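A sketch of the score-function estimator (13), restricted for readability to the mean of a single normal variational factor and checked against the closed-form ELBO gradient of a toy conjugate model; all parameter values are illustrative.

```python
import numpy as np
from scipy import stats

# Score-function gradient estimator (13) for the mean m of a normal
# variational factor, verified against the closed-form ELBO gradient of a
# toy conjugate model theta ~ N(0,1), x|theta ~ N(theta,1). All parameter
# values are illustrative.
rng = np.random.default_rng(2)
x, m, s = 1.0, 0.3, 0.9                           # data and q(theta) = N(m, s^2)

theta = rng.normal(m, s, size=1_000_000)
log_joint = stats.norm(0.0, 1.0).logpdf(theta) + stats.norm(theta, 1.0).logpdf(x)
log_q = stats.norm(m, s).logpdf(theta)
score_m = (theta - m) / s**2                      # d/dm log q(theta)

grad_mc = np.mean(score_m * (log_joint - log_q))  # estimator (13) for d ELBO/dm

post_mu, post_var = x / 2.0, 0.5                  # exact posterior N(x/2, 1/2)
grad_exact = -(m - post_mu) / post_var            # closed-form d ELBO/dm
print(grad_mc, grad_exact)
```

The estimator is unbiased but noisy, which is why the stochastic gradient loop of Algorithm 1 re-samples θ at every step rather than relying on a single draw.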
That is, the predicted probability distribution of c^(i) is Bernoulli with the expected value of 1 − CDF_W. The complete procedure is also shown as pseudocode in Algorithm 1. For the initial guess ω^(0), we use σ_σ = 5 and µ_k = 0, σ_k = 1 ∀k ∈ {1, …, K}.

Algorithm 1
VB-based model parameter estimation assuming a normal distribution as the variational family.

Input: N observations, where x^(i) are covariates, y^(i) is the logarithm of time-to-event or time-to-censoring, and c^(i) is the censoring label, i ∈ {1, …, N}. Variational family Q, prior p(θ), model structure p(y | x, θ).
Output: Vector of latent parameters ω, approximating ω*, specifying the variational density q_ω(θ).

Initialize latent parameters ω^(0) ∈ Ω
Initialize rate parameter α                         ▷ using the ADAM optimiser, this is adaptable
while ELBO has not converged do
    draw σ ∼ q(σ) = (2/σ_σ) φ(σ/σ_σ), σ ≥ 0        ▷ half-normal distribution
    for all k ∈ {1, …, K} do                        ▷ loop through the parameters of the ANN
        draw z_k ∼ q_k(z_k) = N(µ_k, σ_k²)          ▷ θ = [σ, z]
    end for
    for all i ∈ {1, …, N} do                        ▷ loop through the observations
        if c^(i) = 0 then                           ▷ non-censored observation
            l^(i) = log p(y^(i) | x^(i), θ)         ▷ term in the stochastic gradient
        else                                        ▷ censored observation
            Pr[c^(i) | y^(i), x^(i), θ] = 1 − ∫_{−∞}^{y^(i)} p(y | x^(i), θ) dy   ▷ probability of surviving beyond y^(i)
            l^(i) = log Pr[c^(i) | y^(i), x^(i), θ] ▷ term in the stochastic gradient
        end if
    end for
    Compute the gradient ∇_ω L(ω) of equation (13)  ▷ log p(x, θ) = Σ_{i=1}^{N} l^(i), and q_ω(θ) is known
    Update ω^(j+1) = ω^(j) + α ∇_ω L(ω)
end while

Solving the survival model in the Bayesian framework, the survival function S(t|x) for a given value of the covariates x and of the model latent parameters ω can be obtained as the posterior predictive distribution using the identified variational distribution q_ω(θ) instead of the true unknown posterior. The probability density function is

$$p_\omega(t \mid x) = \int p(t \mid x, \theta)\, q_\omega(\theta)\, d\theta. \tag{14}$$

The survival function S(t|x) = Pr[T ≥ t] is by definition expressed as

$$S(t \mid x) = 1 - \int_0^t p_\omega(s \mid x)\, ds = 1 - \int_0^t \int p(s \mid x, \theta)\, q_\omega(\theta)\, d\theta\, ds \tag{15}$$

and can be evaluated from (15) using Monte Carlo integration. It is a fairly simple and computationally efficient process. The pseudocode is presented in Algorithm 2.

Algorithm 2
Prediction of the survival function S(t|x).

Input: covariates x, model structure p(y | x, θ), variational distribution Q, latent parameters ω, variational model p_ω(y|x) = ∫ p(y | x, θ) q_ω(θ) dθ, number of samples N_MCMC.
Output: S(t|x) = Pr(y > log t | x) = n(t)/N_MCMC.

for all k ∈ {1, …, N_MCMC} do       ▷ Monte Carlo integration
    draw θ_k ∼ q_ω(θ)
    draw y_k ∼ p(y | x, θ_k)
end for
b_k(t) = 1 if y_k > log t, and 0 otherwise
n(t) = Σ_{k=1}^{N_MCMC} b_k(t)

We have mentioned that Monte Carlo integration of the integral (9) would be too computationally intensive, but the integral (15) is easy to calculate using the Monte Carlo method even though the integration variable θ has the same dimensionality in both cases. The reason is that the probability density function q_ω(θ) is quite different from p(θ). While p(θ) is very wide – we take it to be constant over ℝ₊ × ℝ^K – and the integration would require a lot of samples, the function q_ω(θ) is much more localized, so every sample contributes a lot more to the accuracy of the result and a much lower number is sufficient.

PES records provide a rich description of job seekers in the form of their profiles. One of the most common metrics that PES associates with every job seeker is the so-called probability of exit, referring to “exiting” from their records. It should be noted that not every “exit” is due to employment; there are also events such as retirement, PES determining that someone is not genuinely seeking a job, and many others.

Some of the job seekers entering the records at any point in time have exited at a later known date while the others are still unemployed. From the data perspective, this is a clear example of right censored data. Survival analysis is therefore a viable tool for analysing PES data, where the probability of exit from PES records serves the function of the hazard rate.
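Algorithm 2 can be sketched in a few lines. In this hedged sketch the network output h_z(x) is replaced by a fixed stand-in value and q_ω is a toy mean-field distribution, so all numbers are illustrative only:

```python
import numpy as np

# Sketch of Algorithm 2: Monte Carlo estimate of S(t|x) under the variational
# posterior. The network output h_z(x) is replaced by a fixed stand-in value
# and q_omega is a toy mean-field distribution; all numbers are illustrative.
rng = np.random.default_rng(5)
N_MCMC = 20_000
mu_x = 1.0                   # stand-in for h_z(x) at the given covariates x
sigma_sigma = 0.5            # latent scale of the half-normal q(sigma)

sigma = np.abs(rng.normal(0.0, sigma_sigma, size=N_MCMC))  # sigma_k ~ q(sigma)
y = rng.normal(mu_x, sigma)                                # y_k ~ p(y | x, theta_k)

def survival(t):
    """S(t|x) = n(t) / N_MCMC, counting samples with y_k > log t."""
    return np.mean(y > np.log(t))

print(survival(1.0), survival(10.0))   # S(t|x) decreases with t
```

Because the same sample set {y_k} serves every value of t, a whole survival curve costs a single pass of sampling.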
Every job seeker is described using 19 covariates. Some describe personal characteristics of each job seeker, such as age, gender, education based on the national classification, last work position based on the ESCO classification, municipality of the permanent residence, country of origin, duration of work experience, date of entering the unemployment records, date of employment, and limitations such as disability. The others describe the interventions of the public employment service, for instance courses attended, employment plan, unemployment benefits, social security benefits and active job seeking grade. The covariates represent a mix of continuous and discrete variables with very different cardinality. The complete list of the discrete covariates and their cardinalities is shown in Table 2. It should be noted that the covariates with cardinality 2 are treated as continuous and do not go through the process of embedding.

Table 2: Continuous and discrete covariates; the cardinality of each discrete covariate is given in parentheses.

    Continuous covariates        Discrete covariates (cardinality)
    Day of PES entry             Specific profession category (109)
    Month of PES entry           Profession program (2336)
    Age                          Municipality (215)
    Months of work experience    Employment plan status (5)
    eApplication                 PES Office (61)
    Gender                       Reason for PES entry (10)
    Employment plan ready        Employability assessment (6)
    Social benefits              Education category (22)
    Unemployment benefits        Profession (ESCO) (3772)
                                 Disabilities (17)
When defining the survival model, one of the key decisions is the selection of the probability density function of the error W in (7). Since the probability of finding a job decreases over time [11], the suitable choices of W are the ones that result in a decreasing hazard rate λ(t). The typical choices such as the Gumbel or other extreme value distributions behave in the opposite way. For a Gumbel probability density function of W, the resulting baseline survival time would follow the Weibull distribution and the hazard rate would rise over time. Such behaviour is reasonable in many uses of survival analysis but not in modelling of unemployment.

There are several possible choices of the probability density function of W that result in a time-decreasing hazard rate. The simplest one is the normal distribution. For given values of z and x, the survival time T in (7) then follows the log-normal distribution and the hazard rate is monotonically decreasing after its maximum [44]. The probability density function of the event f(t) in (1) and the corresponding survival function S(t) become

$$f(t) = \frac{1}{\sigma t}\,\varphi\!\left(\frac{\log t - \mu}{\sigma}\right) \quad \text{and} \quad S(t) = 1 - \Phi\!\left(\frac{\log t - \mu}{\sigma}\right), \tag{16}$$

where φ and Φ are the probability density and the cumulative density functions of the standard normal distribution, respectively. Some typical shapes of the hazard functions for various values of µ and σ are shown in Figure 3.

Figure 3: Various shapes of the hazard function λ(t) for various parameters of the log-normal distribution (µ = 0, σ = 1; µ = 0, σ = 2; µ = 1, σ = 1).

The underlying risk function h_z(x) is modelled using a deep ANN of the architecture shown in Figure 4. At the entry level of the network, the dimensions of categorical (discrete) covariates are reduced and the continuous covariates are normalised. The network then follows three repeating groups of linear layers, normalisation, and dropout [60].
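The decreasing-hazard property of the log-normal choice in (16) can be checked numerically; the parameter values and the time grid below are arbitrary:

```python
import numpy as np
from scipy import stats

# Numerical check of the hazard implied by equation (16): with normal W the
# survival time is log-normal and lambda(t) = f(t)/S(t) first rises and then
# decreases monotonically, the behaviour wanted for unemployment durations.
# The parameter values and the time grid are arbitrary.
mu, sigma = 0.0, 1.0
t = np.linspace(0.05, 30.0, 3000)

z = (np.log(t) - mu) / sigma
f = stats.norm.pdf(z) / (sigma * t)       # event density f(t) from (16)
S = stats.norm.sf(z)                      # survival function S(t) from (16)
lam = f / S                               # hazard rate lambda(t)

peak = int(np.argmax(lam))
print(t[peak])                            # location of the hazard maximum
```

Repeating the same computation with a Gumbel `W` (Weibull survival time with shape above 1) gives a hazard that keeps rising, which is why that standard choice is rejected here.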
The cardinality of the categorical covariates ranges from simple boolean (yes/no) up to 3700 categories for a covariate describing occupation. The dimensions are reduced using embedding [25], with the reduced number of dimensions equal to

$$n = \min\left(\,\cdot\,,\ \left\lfloor d/\cdot \right\rfloor + 1\right), \tag{17}$$

where d is the original number of categories.
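A hedged PyTorch sketch of the layer stack of Figure 4 follows. The cardinalities, embedding sizes, hidden width, ReLU activations, and dropout rate are illustrative assumptions, not values from the paper:

```python
import torch
import torch.nn as nn

# Sketch of the risk network h_z(x) from Figure 4: per-covariate embeddings
# for categorical inputs, batch normalisation for continuous inputs, then
# three Linear/BatchNorm/Dropout groups and a final linear layer. The hidden
# width, ReLU activations, and dropout rate are assumptions.
class RiskNet(nn.Module):
    def __init__(self, cardinalities, emb_dims, n_continuous, hidden=64, p_drop=0.1):
        super().__init__()
        self.embeddings = nn.ModuleList(
            nn.Embedding(card, dim) for card, dim in zip(cardinalities, emb_dims))
        self.cont_norm = nn.BatchNorm1d(n_continuous)
        width = sum(emb_dims) + n_continuous
        layers = []
        for _ in range(3):                         # three repeating groups
            layers += [nn.Linear(width, hidden), nn.BatchNorm1d(hidden),
                       nn.ReLU(), nn.Dropout(p_drop)]
            width = hidden
        layers.append(nn.Linear(width, 1))         # scalar risk h_z(x)
        self.body = nn.Sequential(*layers)

    def forward(self, x_cat, x_cont):
        embedded = [emb(x_cat[:, i]) for i, emb in enumerate(self.embeddings)]
        features = torch.cat(embedded + [self.cont_norm(x_cont)], dim=1)
        return self.body(features).squeeze(-1)

# two hypothetical categorical covariates and four continuous ones
net = RiskNet(cardinalities=[109, 215], emb_dims=[12, 22], n_continuous=4)
h = net(torch.randint(0, 100, (8, 2)), torch.randn(8, 4))
print(h.shape)  # torch.Size([8])
```

In the VB setting, every weight of such a network becomes a z_k with its own normal variational factor, as described by equation (12).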
Figure 4: ANN architecture for describing the risk function h_z(x).

All of the ANN parameters z are treated as independent with a mean-field VB approach, and the variational family Q consists of normal distributions q_{ω_z}(z) for them. The latent parameters ω_z thus represent the mean values and the variances of the parameters z. The standard normal distribution is used for W, and the parameter σ is sampled from a half-normal distribution.

The data set contains daily updates on every job seeker in Slovenia from 2011 up to 2020. The network is trained on the data covering 12 months and evaluated on the records from the following 6 months. The presented results show the evaluation on the first 6 months of 2012 using a network trained on the set spanning the whole year of 2011. This is the most dynamic period in the labour market, a consequence of the global financial crisis. The training set, spanning from January 2011 until December 2011, contains 99,139 records. Of those, 55% have censored events, i.e. a potential exit from PES records occurred after the observation window. The evaluation set, spanning from January 2012 until July 2012, contains 43,641 records. It should be noted that exit from PES records can be due to various reasons, for instance employment, retirement, additional schooling, maternity leave, etc.

Figure 5 shows the identified survival curves of two typical job seekers plotted over a period of 1 year starting from the time of entry on the PES records. As expected, the survival probability S(t) for a job seeker unemployed over the next 12 months remains high. Conversely, for the case of a person that was employed, the survival function drops significantly. The results on S(t) at a given value of t for the population can be analysed from two viewpoints: pure classification and assessment of the probability of exit.
Figure 5: Typical shape of the modelled survival functions of the trained model. The pale lines are realisations of S(t) obtained from (14) by sampling the posterior distribution of the model parameters from (15) using Monte Carlo integration with 200 samples, and each bold line is the average of 80 such realisations.

Although the overall goal is not classification of the unemployed persons, viewing the results from a classification standpoint offers an easy way of quantifying the performance of the approach. A simple assessment of the model accuracy can be achieved by analysing the distribution of the survival probabilities of the whole test population at a certain time. In Figure 6, the values of the survival function S(t) at t = 180 days from entry are shown separately for the job seekers that exit the records in under 180 days and for the ones that stay unemployed longer. The value t = 180 days is chosen because more than 180 days of unemployment have a significant negative influence on a job seeker's capability of finding a job [11]. Therefore, this result can be treated as an indicator of a job seeker's capability of finding a job.

The classification accuracy depends on the selected threshold value. At the threshold value of S(t = 180 days) for which the area under the ROC curve reaches its maximum, the classification accuracy is 75.6%, whereas the trivial model has 50.5% accuracy. The trivial model is the case when the whole test set is labelled as either employed or unemployed. The two groups are shown in Figure 6. The blue histogram contains persons that exited PES records prior to 180 days, and the orange histogram contains persons that are either still on the records or exited after 180 days.
180 daysExit t >
180 days
Figure 6: Distribution of the survival probability (of remaining on the PES records) after 180 days for the test data for the period from January to June 2012. Training was performed on data from the year 2011.
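The threshold-based evaluation can be illustrated with a small numpy sketch on synthetic data. The beta-distributed scores and the threshold grid are assumptions made for illustration only; they stand in for the model's S(t = 180 d) outputs on the two groups from Figure 6.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for the test set: S(180) scores for people who exited
# within 180 days (low survival expected) and for those who stayed longer.
s_exited = rng.beta(2, 4, 5000)   # tend to be low
s_stayed = rng.beta(4, 2, 5000)   # tend to be high
scores = np.concatenate([s_exited, s_stayed])
stayed = np.concatenate([np.zeros(5000, bool), np.ones(5000, bool)])

def accuracy(threshold):
    """Classify 'stays over 180 days' when S(180) exceeds the threshold."""
    pred = scores > threshold
    return (pred == stayed).mean()

# Sweep thresholds, as one would when choosing the operating point.
grid = np.linspace(0.05, 0.95, 19)
best = max(grid, key=accuracy)

# Trivial baseline: label everyone with the majority class.
trivial = max(stayed.mean(), 1 - stayed.mean())
```

With well-separated score distributions the thresholded accuracy clearly exceeds the trivial baseline, mirroring the 75.6% versus 50.5% comparison reported above.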
Unlike classification, the estimated survival probability over time t offers better insight. The set of job seekers remaining in the PES records for over 180 days is investigated in more detail. Table 3 lists the most common outcomes for these job seekers, divided into three groups based on the value of S(t = 180 d). We are particularly interested in those for whom the model predicted a survival probability below the threshold, S(t = 180 d) < 0.61.

The false positives – the job seekers to whom the model assigns a low survival probability S(t = 180 days) but who remained for more than 180 days – deserve further attention, as they may not get the necessary PES support if the model is relied upon. In Table 3, we split them into two subgroups based on S(t = 180 d) and compare them to the true positives, for whom the model correctly predicts over 180 days to exit. Let us term a “good” outcome one of: employment within a further 60 days, maternity leave, or one of the less common outcomes lumped together as “other”. An underestimate of the date of exit by under 60 days and the failure to take maternity leave into account are inconsequential for PES and hopefully for the job seeker. The “other” outcomes are of different desirabilities for the job seeker – they range from enrolling in further education to imprisonment – but they tend to be independent of PES actions. Just like the people employed within 60 extra days and the ones on maternity leave, the “other” people are neither failed by PES nor a burden on PES. Therefore, assigning a low survival probability to such a job seeker is not likely to cause much harm.

Table 3: Detailed overview of job seekers with exit time T > 180 days, based on the value of the survival function S(t = 180 d), and their outcome. The job seekers who exit the records through employment are divided further based on the time of the event. The rest tend to end up as not active job seekers, meaning that they quit fulfilling their obligations toward PES.
Most of the remainder are employed by PES in public service, and a significant fraction of the rest go on maternity leave.

                               S(t = 180 d) [%]
                          <0.4   0.4–0.61   >0.61   Sum [%]
Employed, T ≤ 240 days     1.2      6.0      2.5      9.8
Employed, T > 240 days     1.6     13.7     20.0     35.3
Not active job seekers     1.5     10.1     25.0     36.6
Public service             0.1      1.6      6.2      8.0
Maternity leave            0.2      1.2      1.3      2.6
Other                      1.5      2.3      4.0      7.7
Total                      6.1     34.9     59.0    100.0

The outcome is “good” for 30% of the false positives, rising to 47% for those with model-predicted S(t = 180 d) < 0.4. In contrast, only 13.2% of the true positives achieve a “good” outcome. The type II error thus seems more likely for the job seekers for whom it is less damaging, which is fortunate.

The fraction of the long-term job seekers (over 180 days) who lose their status and become “not active” is the highest in the true positive group, S(t = 180 d) ≥ 0.61. For this group, losing the status is the most likely outcome. They lose the status by not fulfilling their obligations toward PES. It can be inferred that part of the group are job seekers whom PES actions did not help. The data do not explain why this outcome is very common for job seekers around 60 years of age; there may be regulatory actions making the status less appealing to them. Detailed age profile results are shown in Appendix A.

The second most frequent outcome for the true positives is getting employed after 240 days, thus using PES resources and suffering the consequences of unemployment for a longer time. Many of the remainder get employed in public service, meaning that they stay in a contractual relationship with PES. The model-predicted S(t = 180 d) is strongly correlated with the probability of the job seeker serving in public service. This is not surprising, as public service is only available to the long-term unemployed, and there may be causal connections between the model inputs and the availability of public service to the job seeker.
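The percentages quoted above can be recomputed directly from the rows of Table 3. A short Python check, with the row and column labels transcribed from the table (the dictionary keys are shorthand names introduced here):

```python
# Columns: S(t = 180 d) < 0.4, 0.4-0.61, > 0.61, as shares [%] of all
# job seekers with exit time T > 180 days (values copied from Table 3).
table = {
    "employed_soon":  (1.2,  6.0,  2.5),   # employed, T <= 240 days
    "employed_late":  (1.6, 13.7, 20.0),   # employed, T > 240 days
    "not_active":     (1.5, 10.1, 25.0),
    "public_service": (0.1,  1.6,  6.2),
    "maternity":      (0.2,  1.2,  1.3),
    "other":          (1.5,  2.3,  4.0),
}
good_rows = ("employed_soon", "maternity", "other")  # the "good" outcomes

def good_fraction(cols):
    """Share [%] of 'good' outcomes within the selected S(t = 180 d) columns."""
    good = sum(table[r][c] for r in good_rows for c in cols)
    total = sum(v[c] for v in table.values() for c in cols)
    return 100 * good / total

print(round(good_fraction((0, 1)), 1))  # false positives, S < 0.61 -> 30.2
print(round(good_fraction((0,)), 1))    # the subset with S < 0.4   -> 47.5
print(round(good_fraction((2,)), 1))    # true positives, S > 0.61  -> 13.2
```

The three values reproduce the 30%, 47%, and 13.2% figures cited in the text.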
Two of the goals of PES are to get as many of the job seekers on their records employed as quickly as possible and to lower the number of long-term (over 1 year) unemployed [61]. Long-term unemployment is significant because it has various negative consequences, some of which decrease the chance of getting employed [62, 11]. The presented model could help PES better estimate which job seekers need more assistance, potentially improving their service.

However, all of the reported accuracy results on implemented systems addressing unemployment records focus solely on classification. As a result, not many such systems became operational, and others were disbanded due to fears of discrimination in the classification process. Most recently this happened with the Austrian AMAS system [63].

Due to the lack of proper profiling model results, the only comparable systems are those that perform classification. Even so, comparing the obtained 76% classification accuracy with other published results has proven challenging. Due to data privacy regulations, it is not feasible to get access to the data sets from the various PES registries used in the literature. The algorithms are not published either, so it is not possible to apply them to the data that is available to us. Model comparison through applying different models to the same data is therefore not possible. We thus compare the reported results with the results of our model. To compare modelling approaches in this way, one has to assume that the data sets of the various PES organisations are of comparable quality and information content, and to neglect the differences between labour markets.

Results of several systems implemented in PES offices are reported by Scoppetta and Buckenleib [64]. Table 4 lists the reported models comparable to the presented one, their underlying modelling methods, and the resulting accuracies.
In most published cases, the accuracy of the model in predicting the correct probability of exit over time is close to 70%.

Table 4: Summary of reported modelling results addressing the problem of profiling unemployed persons.

Reference             Method                                       Accuracy
Australia [65]        Logistic regression                          Not reported
Austria [63]          Logistic regression                          80–85%
Belgium [63]          Random forest                                67%
Croatia [66]          Logistic regression                          69%
Denmark [67, 68, 69]  Logistic regression                          66%
Finland [6]           Statistical model                            89%
France [12]           Logistic/Random forest/Neural networks       70%
Ireland [8]           Probit regression                            69%
Netherlands [7]       Logistic regression                          70%
New Zealand [70]      Random forest and Gradient boosting          63–83%

It should be noted that in many cases the reported accuracy listed in Table 4 refers to a particular subset of data, limited by gender, location, education, etc. Furthermore, there is no information on the balance of the test data, and it is easier to achieve a certain accuracy with a less balanced data set. The proposed VB model is tested on the whole population of job seekers, the data set is balanced (50.5% positive, 49.5% negative outcome), and the achieved accuracy is above the average reported in the literature. It can therefore be concluded that the model works well. The source code is published, and the curves of survival probability for each classification outcome are reported in Figure 6. It will thus be possible to compare future models with the presented VB algorithm.
The study successfully introduces advanced survival analysis methods to the question of unemployment. It demonstrates the use of variational Bayesian methods for performing survival analysis with an arbitrary risk function implemented as an artificial neural network (ANN). The method enables a computationally efficient approximation of the posterior probability distributions. It thus becomes possible to exploit the true statistical nature of survival analysis despite the need to use deep ANNs with a high number of parameters.

The proposed survival model predicts the time until exit from the job seeker records. As the most common reason for exiting the records is employment, the model provides time information regarding the probability of employment. This information is useful both for the job seekers and for the operation of the public employment service (PES). The model can assist the PES employment counsellors in recognising the job seekers who do not need PES resources, as they will get employed soon regardless of the interventions. It thus enables the counsellors to focus their attention on the job seekers who need help.

Data analysis has been used to make predictions of similar kinds before. However, the performance of the proposed model cannot be accurately compared with the historical ones because the published results are incomplete. Based on the available data and the theoretical soundness, we believe that the proposed model performs relatively well.

The resulting model can be used for gamification scenarios allowing both the counsellors and the job seekers to explore various strategies for improving the job prospects, i.e. for reducing the survival probability. A limitation of the study is that it only explores the survival function and not the type of event that results in exiting the unemployment records. Most exits are of a desirable kind, but some are not.
Future research that includes the differences between types of exit and focuses on increasing the likelihood of desirable ones would be highly beneficial.
Ethics statement
The data for this analysis was provided by the public employment service in Slovenia in the framework of the EU H2020 HECAT project. All records were anonymised according to the General Data Protection Regulation of the European Union [71] and respecting all national privacy laws.
A Age distribution for inactive persons
The age distribution of unemployed persons who were identified as inactive by the PES office is shown in Figure 7. It is clear that the majority of the cases identified as not active job seekers are persons older than 50 years. There is no such significant skewness in the age distribution for persons with lower estimated survival probability.
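The kernel-density estimates behind the Figure 7 curves can be reproduced in a few lines. This sketch uses synthetic ages and a hand-rolled Gaussian kernel, so the numbers are illustrative only; it merely shows the kind of density estimate the figure reports.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic ages, skewed towards 50+ as reported for the "not active" group.
ages = np.concatenate([rng.normal(58, 5, 700), rng.normal(35, 10, 300)])

def gaussian_kde(samples, grid, bandwidth=2.0):
    """Plain Gaussian kernel-density estimate evaluated on a grid."""
    diffs = (grid[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * diffs**2) / np.sqrt(2 * np.pi)
    return kernel.sum(axis=1) / (len(samples) * bandwidth)

grid = np.linspace(10, 70, 121)        # age axis, as in Figure 7
density = gaussian_kde(ages, grid)

# Share of the density mass above age 50, reflecting the reported skew.
above_50 = density[grid >= 50].sum() / density.sum()
```

In the real data, this skew appears only in the "not active" group, not in the groups with lower estimated survival probability.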
Figure 7: Age distribution of persons who were identified by PES counsellors as not active job seekers and remained on the PES records longer than 180 days, grouped by S(t) > 0.61, 0.4 < S(t) ≤ 0.61, and S(t) ≤ 0.4. The age is taken at the moment of entry on the PES records. The plots represent kernel-density estimates of the observed number of records.
Acknowledgements
The authors acknowledge the financial support of the Slovenian Research Agency (research core funding No. P2-0001). The authors also acknowledge the funding received from the European Union's Horizon 2020 research and innovation programme project HECAT under grant agreement No. 870702.
References

[1] X. Yizhong, Z. Lin, Y. Baranchenko, C. K. Lau, A. Yukhanaev, H. Lu, Employability and job search behavior: A six-wave longitudinal study of Chinese university graduates, Employee Relations 39 (2017) 223–239.
[2] W. Arulampalam, Is unemployment really scarring? Effects of unemployment experiences on wages, The Economic Journal 111 (2001) F585–F606.
[3] P. Virtanen, U. Janlert, A. Hammarström, Health status and health behaviour as predictors of the occurrence of unemployment and prolonged unemployment, Public Health 127 (2013) 46–52.
[4] M. Strandh, A. Winefield, K. Nilsson, A. Hammarström, Unemployment and mental health scarring during the life course, European Journal of Public Health 24 (2014) 440–445. URL: https://academic.oup.com/eurpub/article-pdf/24/3/440/1290126/cku005.pdf.
[5] J. Grundy, Statistical profiling of the unemployed, Studies in Political Economy 96 (2015) 47–68.
[6] T. Riipinen, Risk profiling of long-term unemployment in Finland, Dialogue Conference, Brussels, 2011. URL: http://ec.europa.eu/social/BlobServlet?docId=7583&langId=en.
[7] M. A. Wijnhoven, H. Havinga, The Work Profiler: A digital instrument for selection and diagnosis of the unemployed, Local Economy: The Journal of the Local Economy Policy Unit 29 (2014) 740–749.
[8] P. J. O'Connell, S. McGuinness, E. Kelly, J. Walsh, National Profiling of the Unemployed in Ireland, number 10 in Research Series, Economic and Social Research Institute, 2009.
[9] A. Loxha, M. Morgandi, Profiling the unemployed: A review of OECD experiences and implications for emerging economies, Technical Report, Social Protection and Labour No 1424, 2014. URL: http://hdl.handle.net/10986/20382.
[10] G. Sengul, Learning about match quality: Information flows and labor market outcomes, Labour Economics 46 (2017) 118–130.
[11] R. Shimer, Reassessing the ins and outs of unemployment, Review of Economic Dynamics 15 (2012) 127–148.
[12] T.
Berthet, C. Bourgeois, Towards 'activation-friendly' integration? Assessing the progress of activation policies in six European countries, International Journal of Social Welfare 23 (2014) S23–S39.
[13] J. Park, I. B. Wood, E. Jing, A. Nematzadeh, S. Ghosh, M. D. Conover, Y.-Y. Ahn, Global labor flow network reveals the hierarchical organization and dynamics of geo-industrial clusters, Nature Communications 10 (2019) 3449.
[14] E. López, O. Guerrero, R. L. Axtell, The network picture of labor flow (2015). arXiv:1507.00248.
[15] P. W. Anderson, More is different, Science 177 (1972) 393–396.
[16] B. Kittel, A crazy methodology?: On the limits of macro-quantitative social science research, International Sociology 21 (2006) 647–677.
[17] M. Erlinghagen, Employment and its institutional contexts, KZfSS Kölner Zeitschrift für Soziologie und Sozialpsychologie 71 (2019) 221–246.
[18] C.-Y. Huang, J. Ning, J. Qin, Semiparametric likelihood inference for left-truncated and right-censored data, Biostatistics 16 (2015) 785–798.
[19] J. M. Robins, An analytic method for randomized trials with informative censoring: Part 1, Lifetime Data Analysis 1 (1995) 241–254.
[20] O. A. Kittaneh, M. A. El-Beltagy, Efficiency estimation of type-I censored sample from the Weibull distribution based on sup-entropy, Communications in Statistics - Simulation and Computation 46 (2017) 2678–2688.
[21] J. G. Ibrahim, M.-H. Chen, D. Sinha, Bayesian Survival Analysis, Springer Series in Statistics, Springer New York, 2001.
[22] D. R. Cox, Regression models and life-tables, Journal of the Royal Statistical Society: Series B (Methodological) 34 (1972) 187–202.
[23] D. Faraggi, R. Simon, A neural network model for survival data, Statistics in Medicine 14 (1995) 73–82.
[24] M. S. Kovalev, L. V. Utkin, E. M. Kasimov, SurvLIME: A method for explaining machine learning survival models, Knowledge-Based Systems 203 (2020) 106164.
[25] C. Guo, F. Berkhahn, Entity embeddings of categorical variables (2016). arXiv:1604.06737v1.
[26] H. Kvamme, O. Borgan, I. Scheel, Time-to-event prediction with neural networks and Cox regression, Journal of Machine Learning Research 20 (2019) 1–30. URL: http://jmlr.org/papers/v20/18-424.html. arXiv:1907.00825.
[27] J. L. Katzman, U. Shaham, A. Cloninger, J. Bates, T. Jiang, Y. Kluger, DeepSurv: personalized treatment recommender system using a Cox proportional hazards deep neural network, BMC Medical Research Methodology 18 (2018) 24.
[28] C. Lee, J. Yoon, M. v. d. Schaar, Dynamic-DeepHit: A deep learning approach for dynamic survival analysis with competing risks based on longitudinal data, IEEE Transactions on Biomedical Engineering 67 (2020) 122–133.
[29] T. Ching, X. Zhu, L. X. Garmire, Cox-nnet: An artificial neural network method for prognosis prediction of high-throughput omics data, PLOS Computational Biology 14 (2018) 1–18. URL: https://doi.org/10.1371/journal.pcbi.1006076.
[30] E. Giunchiglia, A. Nemchenko, M. van der Schaar, RNN-SURV: A deep recurrent model for survival analysis, in: Artificial Neural Networks and Machine Learning – ICANN 2018, Springer International Publishing, 2018, pp. 23–32.
[31] E. Martinsson, WTTE-RNN: Weibull Time To Event Recurrent Neural Network, Master's thesis, Chalmers University of Technology, 2016. URL: http://publications.lib.chalmers.se/records/fulltext/253611/253611.pdf.
[32] V. Šmídl, A. Quinn, The Variational Bayes Method in Signal Processing, Springer-Verlag, Berlin, Heidelberg, 2006.
[33] T. D. Bui, J. Yan, R. E. Turner, A unifying framework for Gaussian process pseudo-point approximations using power expectation propagation, Journal of Machine Learning Research 18 (2017) 1–72. URL: http://jmlr.org/papers/v18/16-603.html. arXiv:1605.07066v3.
[34] J. Hensman, A. Matthews, Z. Ghahramani, Scalable variational Gaussian process classification (2014). arXiv:1411.2005v1.
[35] D. J.
Rezende, S. Mohamed, D. Wierstra, Stochastic backpropagation and approximate inference in deep generative models, in: E. P. Xing, T. Jebara (Eds.), Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, PMLR, Bejing, China, 2014, pp. 1278–1286. URL: http://proceedings.mlr.press/v32/rezende14.html. arXiv:1401.4082v3.
[36] Z. Yang, L. Xie, C. Zhang, Variational Bayesian algorithm for quantized compressed sensing, IEEE Transactions on Signal Processing 61 (2013) 2815–2824.
[37] V. P. Oikonomou, S. Nikolopoulos, I. Kompatsiaris, A novel compressive sensing scheme under the variational Bayesian framework, in: 2019 27th European Signal Processing Conference (EUSIPCO), IEEE, 2019.
[38] C. Gruhl, B. Sick, Variational Bayesian inference for hidden Markov models with multivariate Gaussian output distributions (2016). arXiv:1605.08618v1.
[39] K. P. Panousis, S. Chatzis, S. Theodoridis, Variational conditional-dependence hidden Markov models for human action recognition (2020). arXiv:2002.05809v1.
[40] S. Levine, Reinforcement learning and control as probabilistic inference: Tutorial and review (2018). arXiv:1805.00909v3.
[41] R. Pivovarov, D. J. Albers, J. L. Sepulveda, N. Elhadad, Identifying and mitigating biases in EHR laboratory tests, Journal of Biomedical Informatics 51 (2014) 24–34.
[42] S. Pölsterl, S. Conjeti, N. Navab, A. Katouzian, Survival analysis for high-dimensional, heterogeneous medical data: Exploring feature extraction as an alternative to feature selection, Artificial Intelligence in Medicine 72 (2016) 1–11.
[43] M. Shafiq, M. Atif, R. Viertl, Generalized likelihood ratio test and Cox's F-test based on fuzzy lifetime data, International Journal of Intelligent Systems 32 (2017) 3–16.
[44] J. P. Klein, M. L. Moeschberger, Survival Analysis: Techniques for Censored and Truncated Data, Springer, New York, 2003.
[45] G. Tutz, M.
Schmid, Modeling Discrete Time-to-Event Data, Springer International Publishing, 2016.
[46] E. L. Kaplan, P. Meier, Nonparametric estimation from incomplete observations, Journal of the American Statistical Association 53 (1958) 457–481.
[47] A. Xiang, P. Lapuerta, A. Ryutov, J. Buckley, S. Azen, Comparison of the performance of neural network methods and Cox regression for censored survival data, Computational Statistics & Data Analysis 34 (2000) 243–257.
[48] R. Ranganath, A. Perotte, N. Elhadad, D. Blei, Deep survival analysis, volume 56 of Proceedings of Machine Learning Research, PMLR, Northeastern University, Boston, MA, USA, 2016, pp. 101–114. URL: http://proceedings.mlr.press/v56/Ranganath16.html.
[49] D. M. Blei, A. Kucukelbir, J. D. McAuliffe, Variational inference: A review for statisticians, Journal of the American Statistical Association 112 (2017) 859–877.
[50] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: 3rd International Conference on Learning Representations, May 7-9, 2015, San Diego, 2014. arXiv:1412.6980v9.
[51] S. J. Reddi, S. Kale, S. Kumar, On the convergence of Adam and beyond, in: International Conference on Learning Representations, 2018. URL: https://openreview.net/forum?id=ryQu7f-RZ. arXiv:1904.09237v1.
[52] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, PyTorch: An imperative style, high-performance deep learning library, in: H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, R. Garnett (Eds.), Advances in Neural Information Processing Systems 32, Curran Associates, Inc., 2019, pp. 8024–8035. URL: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
[53] E. Bingham, J. P. Chen, M. Jankowiak, F. Obermeyer, N. Pradhan, T. Karaletsos, R.
Singh, P. Szerlip, P. Horsfall, N. D. Goodman, Pyro: Deep universal probabilistic programming, Journal of Machine Learning Research 20 (2019) 1–6. URL: https://jmlr.csail.mit.edu/papers/v20/18-403.html. arXiv:1810.09538.
[54] J. Lee, Y. Bahri, R. Novak, S. S. Schoenholz, J. Pennington, J. Sohl-Dickstein, Deep neural networks as Gaussian processes, in: Sixth International Conference on Learning Representations (ICLR), 2018. URL: https://iclr.cc/Conferences/2018/Schedule?showEvent=161. arXiv:1711.00165.
[55] C. M. Bishop, Pattern Recognition and Machine Learning, Springer-Verlag, New York, 2006.
[56] D. Koller, N. Friedman, Probabilistic Graphical Models: Principles and Techniques, MIT Press, 2009.
[57] T. Minka, J. Winn, Gates: A Graphical Notation for Mixture Models, Technical Report MSR-TR-2008-185, 2008.
[58] D. Wingate, T. Weber, Automated variational inference in probabilistic programming, 2013. arXiv:1301.1299.
[59] R. Ranganath, S. Gerrish, D. Blei, Black box variational inference, volume 33 of Proceedings of Machine Learning Research, PMLR, Reykjavik, Iceland, 2014, pp. 814–822. URL: http://proceedings.mlr.press/v33/ranganath14.html.
[60] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, Journal of Machine Learning Research 15 (2014) 1929–1958. URL: http://jmlr.org/papers/v15/srivastava14a.html.
[61] Poslovni načrt za leto 2020 Zavoda Republike Slovenije za zaposlovanje [Business plan of the Employment Service of Slovenia for 2020], Zavod Republike Slovenije za zaposlovanje, Ljubljana, 2020.
[62] Dolgotrajno brezposelne osebe [Long-term unemployed persons], Zavod Republike Slovenije za zaposlovanje, Ljubljana, 2015.
[63] S. Desiere, K. Langenbucher, L. Struyven, Statistical profiling in public employment services: an international comparison, in: OECD Social, Employment and Migration Working Papers, OECD Publishing, Paris, 2019.
[64] A. Scoppetta, A. Buckenleib, Tackling long-term unemployment through risk profiling and outreach.
A discussion paper from the employment thematic network, in: European Commission – ESF Transnational Cooperation, Technical Dossier no. 6, May 2018, Publications Office of the European Union, Luxembourg, 2018.
[65] N. Ponomareva, J. Sheen, Australian labor market dynamics across the ages, Economic Modelling 35 (2013) 453–463.
[66] P. I. Pojarski, Implementation Completion and Results Report 8426-HR, Technical Report ICR00004446, The World Bank, 2018.
[67] M. Rosholm, M. Svarer, B. Hammer, A Danish profiling system (November 25, 2004), Univ. of Aarhus economics working paper No. 2004-13, in: Discussion paper series, Aarhus University Economics Department, 2004.
[68] P. K. Madsen, Youth Unemployment and the Skills Mismatch in Denmark, Technical Report, European Parliament, Directorate General for Internal Policies, 2014. URL: https://core.ac.uk/download/pdf/60647777.pdf.
[69] A. Larsen, A. B. Jonsson, Employability profiling system – the Danish experience, Presentation at PES, 2011. URL: http://ec.europa.eu/social/BlobServlet?docId=7584&langId=en.
[70] J. Obben, Towards a formal profiling model to foster active labour market policies in New Zealand, Discussion paper (Massey University, Department of Applied and International Economics) no. 02.07, Dept. of Applied and International Economics, Massey University, Palmerston North, N.Z., 2002. URL: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.195.2652&rep=rep1&type=pdf.
[71] Council of the European Union, Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation), 2016. URL: https://eur-lex.europa.eu/eli/reg/2016/679/oj.