Interpretable Neural Networks for Panel Data Analysis in Economics
Yucheng Yang
Princeton University [email protected]
Zhong Zheng
Penn State University [email protected]
Weinan E
Princeton University [email protected]
November 2020
Abstract
The lack of interpretability and transparency is preventing economists from using advanced tools like neural networks in their empirical research. In this paper, we propose a class of interpretable neural network models that can achieve both high prediction accuracy and interpretability. The model can be written as a simple function of a regularized number of interpretable features, which are outcomes of interpretable functions encoded in the neural network. Researchers can design different forms of interpretable functions based on the nature of their tasks. In particular, we encode a class of interpretable functions named persistent change filters in the neural network to study time series cross-sectional data. We apply the model to predicting individuals' monthly employment status using high-dimensional administrative data. We achieve an accuracy of 94.5% on the test set, which is comparable to the best-performing conventional machine learning methods. Furthermore, the interpretability of the model allows us to understand the mechanism that underlies the prediction: an individual's employment status is closely related to whether she pays different types of insurance. Our work is a useful step towards overcoming the "black box" problem of neural networks, and provides a new tool for economists to study administrative and proprietary big data.
Traditionally, economists have relied on interpretable models like linear or logistic regressions, which provide clear insights into the causal or statistical relationships in small datasets [2, 22]. Recently, the use of administrative and proprietary big data has led to much exciting work in empirical economics [9, 7, 15, 19]. Though such datasets enjoy a richness of variables and large sample sizes, advanced tools like neural networks have not been adopted in their analysis as widely as expected. Methodologically, there have been some attempts to bring machine learning tools into economic analysis. However, most of the successful applications are limited to relatively simple models like Lasso, ridge regressions, or decision trees [16, 6, 13]. Economists remain wary of more advanced tools like neural networks, since these mostly deliver outcomes from a complicated black box without transparency or interpretability [3], even though they fit data better than traditional econometric models.

To address this dilemma, we propose a new class of interpretable neural network models. Our model allows us to take advantage of the high accuracy of neural networks, while remaining interpretable like linear or logistic regressions.

∗ Yang and Zheng contributed equally to this paper. We thank Bin Dong, Guanhua Huang, Dake Li, Chris Sims, Ranran Wang, Lingzhou Xue, Linfeng Zhang, Guang Zeng and the audience in the BIBDR Economics and Big Data Workshop for helpful comments. Preprint. Under review.
To this end, we design a modified version of neural networks: all layers except the last one encode various interpretable functional forms with unknown parameters, while the final layer is a logistic function that only takes a regularized number of neurons as inputs. As is shown in Figure 1, our model can essentially be written as a simple logistic function of a limited number of interpretable features (the red circles in Figure 1), which makes it more interpretable than conventional neural networks. Under this general formulation, researchers can design different interpretable functional forms based on their own tasks. In this paper, we design a class of persistent change filters as part of the network, which turns out to be particularly helpful in time series cross-sectional data analysis.

Figure 1: Comparison of Conventional Neural Networks (panel a, left) and Interpretable Neural Networks (panel b, right). In the interpretable network, the layers g^(1), ..., g^(M-1) produce interpretable features, and the final logistic layer g^(M) performs effective variable selection over them.
The interpretable neural network model we propose satisfies all three desiderata for interpretable machine learning models [17]. First, it has high "predictive accuracy" due to the capability of its multi-layer architecture to fit data well. Second, it has high "descriptive accuracy" since the model structure is designed to be interpretable. Third, it meets the "relevancy" requirement: researchers can design different interpretable functional forms for different tasks, so that the model estimates deliver relevant knowledge. Beyond these three desiderata, our model estimates are robust to missing-data problems, which is crucial if we expect to obtain reliable interpretations of the model.

As an application, we use the interpretable neural network model with persistent change filters to predict individuals' monthly employment status using high-dimensional administrative data in China. We achieve an accuracy of 94.5% on the test set, which is comparable to the best-performing conventional machine learning methods. In addition, we can see clearly how the model predicts an individual's employment status from her payment records for different types of insurance. Both the accuracy and the interpretability are robust to missing-data problems.

This paper contributes to the literature with the following distinct features. First, we expand the machine learning toolbox for economists [16, 6, 13, 21] by introducing a modified neural network model that achieves high accuracy without sacrificing interpretability and transparency. Second, we contribute to the large literature on interpretable machine learning. Previous work in computer vision and natural language processing [8, 25, 14] mostly focuses on interpreting how hidden units of neural networks represent local features of images and texts.
In this paper, we propose an interpretable model to study panel data, to help people understand the mechanisms behind collective human behavior. In the terminology of the overview paper on interpretable machine learning [17], our model delivers "model-based interpretability", which is different from work that delivers "post hoc interpretability" by approximating outcomes from complicated models with simple functions. Last but not least, the application of our model to predicting individual employment status with China's administrative data also contributes to the large literature on the study of China's unemployment in the absence of reliable official statistics [11].

Time series cross-sectional data are "data with a cross section of units with repeated observations on them over time". Such data are also called panel data; in this paper, we use the two names interchangeably. Our paper is also distinct from the recent literature that studies causal inference in panel data with machine learning tools [4, 1]: our focus is to develop an interpretable model that achieves high predictive accuracy on the data.

The rest of the paper is organized as follows. In Section 2, we lay out the general formulation and introduce persistent change filters in the model to study panel data. In Section 3 we exhibit the virtues of the model by applying it to predict individuals' employment status in a large administrative dataset. Finally we conclude with discussions on future work.
2 The Interpretable Neural Network Model

2.1 General Formulation

We first lay down the general formulation of the interpretable neural network model. For observation i, we write the model as y_i = g(x_i, θ), where g shares the multi-layer structure of conventional neural networks:

g = g^(M) ∘ g^(M-1) ∘ ··· ∘ g^(1).

For the first M - 1 layers, g^(M-1) ∘ ··· ∘ g^(1) are multi-dimensional differentiable interpretable functions with unknown parameters. Here the functional forms should be designed based on the nature of the task, since interpretable functions in computer vision may look very different from those in economics. These functions map the original variables into high-dimensional interpretable features. The last layer g^(M) is a logistic function with these interpretable features as inputs. To make the final model more interpretable, we require the final layer to take only a small number of interpretable features as inputs. This is achieved by adding a Lasso [20] or group Lasso [24] type of penalty on the parameters of the final layer to the loss function:

L(θ, φ) = E_{X,Y} [ Loss(g(x, θ), y) ] + λ · penalty(g^(M))    (1)

Here λ is a hyperparameter. The full architecture of the interpretable neural network g is in the right panel of Figure 1. Since all components of the loss function (1) are differentiable, the model can be trained like a conventional neural network with gradient descent algorithms like Adam [12].

2.2 Model Design for Panel Data Analysis

Now we design interpretable functional forms for panel data analysis. Consider a model y_it = g(x_it, θ) = g^(4) ∘ g^(3) ∘ g^(2) ∘ g^(1), where y_it is the outcome for unit i at time t and x_it are high-dimensional inputs. We choose M = 4 and design functional forms for {g^(m)}, m = 1, ..., 4, as below.

1. g^(1): Decision-tree-like splitting Sp(x) = Sigmoid(cx + d). For x ∈ R, a sigmoid function can act as a decision-tree-like splitting operator, i.e. a differentiable indicator function [23]. For example, if c is a large positive number, the output is close to 0 or 1 depending on whether x is smaller or larger than the threshold -d/c. Here c and d are unknown parameters.

2. g^(2): Dimension reduction Re(x). For x ∈ R^m, define Re(x): R^m → R^n (n ≪ m) as a dimension reduction function with unknown parameters. For example, Re(x) could be a linear combination with sparse inputs, where the linear coefficients are unknown parameters.

3. g^(3): Persistent change filter D(x). Now we introduce a class of interpretable functions named persistent change filters, which turn out to be helpful in panel data analysis. D(x) also includes unknown parameters.

Persistent Change Filter
For a particular time series of some individual unit i (the i subscript is omitted here for simplicity), x_τ ∈ [0, 1], τ = 1, ..., t, we define the persistent change filter z_t = D({x_τ}_{τ=1}^t) = p_t - q_t, where:

p_1 = x_1,        p_{τ+1} = [1 + k (x_{τ+1} - 1)] p_τ + x_{τ+1}
q_1 = 1 - x_1,    q_{τ+1} = [1 - k x_{τ+1}] q_τ + (1 - x_{τ+1})

Here k ∈ [0, 1] is an unknown parameter. We can then map the original time series {x_t}, t = 1, 2, ..., T to a persistent change filter time series D({x_τ}_{τ=1}^t), t = 1, 2, ..., T. The definition of the persistent change filter D is motivated by identifying a persistent change in time series data. We illustrate this idea first with two examples and then formalize it with a proposition.
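The filter recursion above can be sketched directly (a minimal version; the function name and interface are our own, and in the full model k is trained by gradient descent rather than fixed by hand):

```python
import numpy as np

def persistent_change_filter(x, k):
    """Compute z_T = p_T - q_T for a series x with entries in [0, 1].

    k in [0, 1] is the smoothing parameter; k = 1 means no smoothing.
    """
    x = np.asarray(x, dtype=float)
    p, q = x[0], 1.0 - x[0]
    for xt in x[1:]:
        p = (1.0 + k * (xt - 1.0)) * p + xt  # builds up while x stays near 1
        q = (1.0 - k * xt) * q + (1.0 - xt)  # builds up while x stays near 0
    return p - q

# On a clean binary series that switches to 1 for its last 20 periods,
# the filter returns the duration of the change:
x = np.concatenate([np.zeros(30), np.ones(20)])
print(persistent_change_filter(x, k=1.0))   # -> 20.0

# With one spurious 0 inside the run of ones, no smoothing (k = 1)
# badly underestimates the duration, while k < 1 recovers most of it:
x[39] = 0.0
print(persistent_change_filter(x, k=1.0))   # large error
print(persistent_change_filter(x, k=0.5))   # much closer to 20
```

A persistent drop produces a symmetric negative value, since the q_t term mirrors the p_t term with the roles of 0 and 1 exchanged.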
Figure 2: Persistent Change Filter Captures the Duration of a Persistent Change. Panel (a): data without missing problems; panel (b): data with missing problems.
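Returning to the general formulation, the Lasso-penalized loss in equation (1) can be sketched as follows (a minimal numpy version with an L1 penalty on the weights of the final logistic layer; the function names and interface are our own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def penalized_loss(features, y, w, b, lam):
    """Cross-entropy loss of the final logistic layer g^(M) plus an L1
    (Lasso) penalty on its weights, as in equation (1). `features` are
    the interpretable features produced by the earlier layers."""
    p = sigmoid(features @ w + b)
    eps = 1e-12  # guards the logs against p = 0 or 1
    ce = -np.mean(y * np.log(p + eps) + (1.0 - y) * np.log(1.0 - p + eps))
    return ce + lam * np.sum(np.abs(w))
```

Minimizing this loss with a gradient method such as Adam drives many entries of w toward zero, so the final layer effectively selects a small number of interpretable features.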
Example I
Consider a binary time series x_t ∈ {0, 1}, t = 1, ..., T = 1000, which begins at 0 and persistently switches to 1 for the last t_1 periods: x_t = 1[t > T - t_1]. We plot the persistent change filter on the entire time series, D({x_τ}_{τ=1}^T), against the number of periods t_1 between the transition and the terminal period in the left panel of Figure 2. We find that no matter what smoothing parameter k we choose, the persistent change filter fits t_1 on the 45° line. This implies that the persistent change filter captures the duration of a persistent change in such a binary time series (0, ..., 0, 1, 1, ..., 1).

Example II
In the real world, such perfect binary data may not exist. Consider a time series x_t where everything is the same as in Example I, except that a random 5% of the observations with t > T - t_1 are replaced with x_t = 0 to mimic abnormal data or a data-missing problem. We then plot the persistent change filter D({x_τ}_{τ=1}^T) against t_1 for different choices of k in the right panel of Figure 2. For the persistent change filter without any smoothing (i.e. k = 1), the plot deviates from the 45° line significantly. When k decreases, the plot gets closer to the 45° line. So an optimal choice of k helps capture the duration of a persistent change in a time series with potential data issues. Note that k is left as an unknown parameter that our model learns from the data.

We formalize the examples above with the following proposition.

Proposition 2.1
When k → 1, p_T uniformly converges to D^(0), defined as:

D^(0)(x_T, x_{T-1}, ..., x_1) = Σ_{i=1}^T Π_{j=1}^i x_{T-j+1}.

What does D^(0) mean for the time series? Consider a binary time series x_t ∈ {0, 1}, t = 1, ..., T: D^(0) is the number of periods for which the most recent x_t's have been 1. In other words,

D^(0)(x) = m ⟺ x_T = x_{T-1} = ... = x_{T-m+1} = 1, x_{T-m} = 0.

Thus D^(0) gives the elapsed time since the most recent period when the binary time series persistently switched to 1. For continuous x_t in [0, 1], D^(0) is differentiable at any point, which means it can be part of a trainable model. However, D^(0) only captures jumps from low values to high values, so we extend D^(0) to D so that it captures both persistent jumps and drops in the time series. This is the q_t term in the formulation. As discussed above, we also add the trainable smoothing parameter k to adjust for missing or abnormal data. For a more elaborate discussion and the origins of the persistent change filter, please refer to Supplement D.

3 Application: Predicting Individual Employment Status

In this section, we apply the interpretable neural network model for panel data to predict individuals' monthly employment status with administrative data.

3.1 Data

The data come from a city in China with a population of four million, and include basic demographic information (age, family relations, gender, education, etc.), as well as individual-level monthly payments to six different kinds of social insurance and the Housing Provident Fund (HPF). With all these features, together with employment/unemployment labels on part of the sample (about 400,000 individuals are labelled every month), we construct an interpretable neural network model to predict the employment status of every individual in the population each month. With the prediction results, we can calculate the unemployment rate for the whole population, an important economic indicator for policy makers.
This application is novel and important given the unreliability of China's official unemployment statistics [10].

3.2 Model Setup and Comparison

For individual i in calendar month t, her employment status is denoted y_it ∈ {0, 1}, where 1 and 0 correspond to being employed and unemployed respectively. Her payment amounts for different insurances, among other individual features, are denoted x_ijt, where j = 1, ..., m indexes the features. To predict y_it, we stack the individual's features over the past s periods as model inputs, denoted by the ms × 1 vector x̃_it = (x_ijτ), where j = 1, ..., m, τ = t, t-1, ..., t-s+1.

Following Section 2.2, we write the model as P(y_it = 1 | x̃_it) = g(x̃_it, θ), where g = g^(4) ∘ g^(3) ∘ g^(2) ∘ g^(1). Here we use monthly payment data on the various social insurances and the HPF from July 2016 to December 2017 to construct x̃_it, so m = 7 is the number of insurances to pay and s = 6 is the number of lagged periods in the model inputs. To construct a balanced sample from the administrative data in Section 3.1, we randomly select 20,000 employed observations and 20,000 unemployed observations. Both positive and negative samples are evenly divided into two parts to form the training set and the test set.

To evaluate our model, we compare it with other interpretable models (logistic regressions) as well as more complicated models (random forests and neural networks). These models require handcrafted features as inputs. Based on the descriptive statistics and domain knowledge on this problem (see the details in Supplement A), we construct the following features:

1. "Insurance count" (IC). The number of types of insurance the individual pays in period t.

Some might find the idea of the persistent change filter related to the structural break literature in time series analysis [5]. However, structural break detection methods are not applicable for feature construction: they mostly focus on detecting break points with formal statistical tests, while we need an explicit functional transform as part of the interpretable model.

For confidentiality, we do not release any city-specific information, including the city name, in this paper.

In China, there are five major types of social insurance: endowment insurance, basic medical insurance, unemployment insurance, employment injury insurance, and maternity insurance. For urban residents, basic medical insurance consists of "basic medical insurance for working urban residents" and "basic medical insurance for non-working urban residents". Together with the Housing Provident Fund (HPF), which we treat as one type of insurance from now on, our administrative data include detailed individual-level payments to seven types of insurance in total. Note that those who pay employment injury insurance or unemployment insurance are not necessarily employed or unemployed, and vice versa.
2. "Naive persistent change" (NPC). For each kind of insurance, we construct the "naive persistent change" D^(0)(x_t, x_{t-1}, ..., x_{t-s+1}) following the definition in Proposition 2.1, where x_τ is a binary variable indicating whether the individual does not pay that insurance in period τ. Different from the persistent change filter, these "naive" measures do not incorporate any trainable parameters, and come directly from feature engineering.

3. "Payment change" (PC). We construct another "naive persistent change" variable with the same expression, based on the following binary time series x_τ: whether the number of types of insurance the individual has paid is larger than 2. The threshold of 2 is chosen based on insights from the descriptive statistics (see Supplement A).

With these features in hand, we compare the performance of the following 11 models:

1. The interpretable neural network (IntNN) model we propose in this paper (model 1). No handcrafted features are needed in this model.

2. Conventional interpretable models: logistic regression (Logistic) with (2) the 7 "naive persistent changes", "payment change" and insurance count; (3) the 7 "naive persistent changes"; (4) "payment change" and insurance count; (5) "payment change"; (6) insurance count.

3. Conventional uninterpretable models: random forests (RF) and neural networks (NN) with (7) the 7 "naive persistent changes", "payment change" and insurance count; (8) the 7 "naive persistent changes"; (9) "payment change" and insurance count; (10) "payment change"; (11) insurance count.

The predictive accuracy of the best-performing models in each category is reported in Table 1.

Table 1: Model Comparisons
(Columns: Index, Model, Inputs, Test Accuracy. Model 1 is IntNN, which takes no handcrafted inputs and attains 94.5% test accuracy, comparable to the best-performing conventional methods.)
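The handcrafted features above can be sketched as follows (the function names and array interface are our own; a payment of 0 is read as "does not pay"):

```python
import numpy as np

def insurance_count(payments_t):
    """IC: number of insurance types with a positive payment in period t."""
    return int(np.sum(np.asarray(payments_t) > 0))

def naive_persistent_change(x):
    """NPC: D^(0)(x_t, ..., x_{t-s+1}) = sum_i prod_{j<=i} of the i most
    recent observations; for a binary series this is the length of the
    most recent run of 1's."""
    total, prod = 0.0, 1.0
    for xt in reversed(list(x)):
        prod *= xt
        total += prod
    return total

# x_tau = 1 if the individual does NOT pay a given insurance in period tau;
# a run of trailing 1's means she recently stopped paying:
print(naive_persistent_change([0, 0, 0, 1, 1]))                  # -> 2.0
print(insurance_count([0.0, 120.0, 0.0, 80.0, 0.0, 0.0, 45.0]))  # -> 3
```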
3.3 Model Interpretation

The interpretable neural network model delivers a clear mechanism behind the model outcome. We now look into the estimation results for each component of g = g^(4) ∘ g^(3) ∘ g^(2) ∘ g^(1), and interpret the model we obtain.

1. Decision-tree-like splitting g^(1): for all j, τ, x^(1)_ijτ = Sigmoid(c · x_ijτ + d). The estimated c is positive (≈ 2) and d is negative. As discussed in Section 2.2, with such c and d, g^(1) transforms small payment values to near 0, while keeping large payment values close to 1. It thus approximates an indicator function of whether the individual pays each type of insurance.

2. Dimension reduction g^(2): for all τ, x^(2)_iτ = Sigmoid(w^T (x^(1)_ijτ)_{j=1}^m + b). The estimated m-dimensional weight vector is w = (-1.73, 1.84, -4, 0.64, 0.62, -0.66, 4.19)^T (see also Table 3 in Supplement B), where the elements correspond to endowment insurance, urban working medical insurance, unemployment insurance, employment injury insurance, maternity insurance, urban non-working medical insurance, and the Housing Provident Fund (HPF) respectively. The estimated intercept is b ≈ 3.

As discussed in Proposition 2.1, D^(0) only works for increasing time series ..., 0, 0, ..., 0, 1, ..., 1. So here we define x_τ as whether the individual does NOT pay, rather than pays, a specific insurance in period τ.
In this layer, we obtain a one-dimensional variable x^(2)_iτ from a linear combination of the 7-dimensional payment records after the decision-tree-like splitting. The output is positively correlated with payments of urban working medical insurance, employment injury insurance, maternity insurance, and the Housing Provident Fund (HPF), and negatively correlated with payments of endowment insurance, unemployment insurance, and urban non-working medical insurance.

3. Persistent change filter g^(3): x̃^(3)_it = D((x^(2)_iτ)_{τ=t-s+1}^t, k). The optimal smoothing parameter learnt from the original data is k = 0.999999, i.e. essentially no smoothing. As a persistent change filter on x^(2)_iτ, a large positive x̃^(3)_it corresponds to a persistent jump of x^(2)_iτ, while a large negative value corresponds to a persistent drop of x^(2)_iτ.

4. Logistic regression g^(4): P(y_it = 1) = Sigmoid(u · x̃^(3)_it + v). The estimated u is close to 1 and v is negative, so this layer is a differentiable version of a near-identity linear transform.

To summarize, our interpretable neural network model predicts an individual's employment status simply with the persistent change filter of a composite variable: a linear combination of indicators of whether the individual pays each type of insurance. The weights of the linear combination imply the relative importance of each insurance in predicting employment status. From the model, we learn that when an individual who used to consistently pay urban working medical insurance, employment injury insurance, maternity insurance, or the Housing Provident Fund (HPF) suddenly drops out of those insurance programs, there is a higher probability that she becomes unemployed. Similarly, for an individual who begins to enroll in endowment insurance, unemployment insurance, or urban non-working medical insurance, the probability of being unemployed is also higher.
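Putting the four layers together, the fitted predictor can be sketched end to end. A minimal sketch: the g^(2) weights below are taken from Table 3 in Supplement B, while the other parameter values are illustrative placeholders in the spirit of the reported estimates, not the actual fitted values; the function name and interface are our own.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

c, d = 2.0, -1.0     # g1: splitting parameters (placeholder values)
w = np.array([-1.73, 1.84, -4.0, 0.64, 0.62, -0.66, 4.19])  # g2 (Table 3)
b = 3.0              # g2 intercept (placeholder value)
k = 0.999999         # g3 smoothing parameter (baseline estimate)
u, v = 1.0, -0.5     # g4 logistic layer (placeholder values)

def predict(payments):
    """payments: (s, m) array of payment records over the last s months
    for m = 7 insurance types, oldest row first."""
    x1 = sigmoid(c * payments + d)   # g1: approximate pay / no-pay indicator
    x2 = sigmoid(x1 @ w + b)         # g2: one composite value per month
    p, q = x2[0], 1.0 - x2[0]        # g3: persistent change filter
    for xt in x2[1:]:
        p = (1.0 + k * (xt - 1.0)) * p + xt
        q = (1.0 - k * xt) * q + (1.0 - xt)
    x3 = p - q
    return sigmoid(u * x3 + v)       # g4: P(employed)

print(predict(np.zeros((6, 7))))     # six months of no payments at all
```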
These model interpretations are helpful for policymakers who hope to understand accurate model predictions and design policies based on machine learning models.

3.4 Robustness to Missing Data

As discussed in [17], an interpretable model should be robust to missing or abnormal data. To mimic such scenarios, we randomly set 10% of all payment records to 0, and train the same 11 models as in Section 3.2. The accuracy comparison results still hold, and the parameter estimates deliver qualitatively equivalent interpretations. The value of the parameter k in the persistent change filter becomes smaller to offset the missing data. Please refer to Supplement C for more details.

4 Conclusion

In this paper, we propose a class of interpretable neural network models that can achieve both high prediction accuracy and interpretability. We first propose a general formulation of such networks, in which interpretable functional forms are encoded in the first several layers, followed by a final layer that takes a regularized number of interpretable features as input. The model satisfies all three desiderata for interpretable machine learning models [17]. Researchers can design different forms of interpretable functions based on the nature of their tasks; in this paper, we incorporate a class of interpretable functions named persistent change filters as part of the neural network to study panel data in economics.

As an application, we use the model to predict individuals' monthly employment status using high-dimensional administrative data in China. We achieve a high accuracy of 94.5% on the test set, which is comparable to the best-performing conventional machine learning methods. Furthermore, we interpret the model to understand the mechanisms behind its predictions: it accurately predicts an individual's employment status simply with the persistent change filter of a composite variable, a linear combination of indicators of whether the individual pays each type of insurance.
The model delivers robust interpretations subject to potential data issues, and should be helpful for researchers and policymakers who want to understand the information contained in big administrative data.

We also compare w with the parameters of logistic regression models, and find they share qualitatively the same interpretation while our model attains higher accuracy. See Supplement B for more details.

References

[1] Alberto Abadie. Using synthetic controls: Feasibility, data requirements, and methodological aspects.
Journal of Economic Literature, 2019.

[2] Joshua D. Angrist and Jörn-Steffen Pischke. Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press, 2008.

[3] Susan Athey. The impact of machine learning on economics. In The Economics of Artificial Intelligence: An Agenda, pages 507–547. University of Chicago Press, 2018.

[4] Susan Athey, Mohsen Bayati, Nikolay Doudchenko, Guido Imbens, and Khashayar Khosravi. Matrix completion methods for causal panel data models. Technical report, National Bureau of Economic Research, 2018.

[5] Alessandro Casini and Pierre Perron. Structural breaks in time series. arXiv preprint arXiv:1805.03807, 2018.

[6] Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. Double/debiased machine learning for treatment and structural parameters, 2018.

[7] Raj Chetty, Nathaniel Hendren, Patrick Kline, and Emmanuel Saez. Where is the land of opportunity? The geography of intergenerational mobility in the United States. The Quarterly Journal of Economics, 129(4):1553–1623, 2014.

[8] Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.

[9] Liran Einav and Jonathan Levin. Economics in the age of big data. Science, 346(6210), 2014.

[10] Shuaizhang Feng, Yingyao Hu, and Robert Moffitt. Long run trends in unemployment and labor force participation in urban China. Journal of Comparative Economics, 45(2):304–324, 2017.

[11] John Giles, Albert Park, and Juwei Zhang. What is China's true unemployment rate? China Economic Review, 16(2):149–170, 2005.

[12] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[13] Jon Kleinberg, Himabindu Lakkaraju, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan. Human decisions and machine predictions. The Quarterly Journal of Economics, 133(1):237–293, 2018.

[14] Tao Lei. Interpretable Neural Models for Natural Language Processing. PhD thesis, Massachusetts Institute of Technology, 2017.

[15] Atif Mian and Amir Sufi. House of Debt: How They (and You) Caused the Great Recession, and How We Can Prevent It from Happening Again. University of Chicago Press, 2015.

[16] Sendhil Mullainathan and Jann Spiess. Machine learning: An applied econometric approach. Journal of Economic Perspectives, 31(2):87–106, 2017.

[17] W. James Murdoch, Chandan Singh, Karl Kumbier, Reza Abbasi-Asl, and Bin Yu. Definitions, methods, and applications in interpretable machine learning. Proceedings of the National Academy of Sciences, 116(44):22071–22080, 2019.

[18] Laura Rieger, Chandan Singh, W. James Murdoch, and Bin Yu. Interpretations are useful: Penalizing explanations to align neural networks with prior knowledge. arXiv preprint arXiv:1909.13584, 2019.

[19] Emmanuel Saez and Gabriel Zucman. Wealth inequality in the United States since 1913: Evidence from capitalized income tax data. The Quarterly Journal of Economics, 131(2):519–578, 2016.

[20] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.

[21] Stefan Wager and Susan Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523):1228–1242, 2018.

[22] Jeffrey M. Wooldridge. Introductory Econometrics: A Modern Approach. Nelson Education, 2016.

[23] Yongxin Yang, Irene Garcia Morillo, and Timothy M. Hospedales. Deep neural decision trees. arXiv preprint arXiv:1806.06988, 2018.

[24] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.

[25] Quanshi Zhang, Ying Nian Wu, and Song-Chun Zhu. Interpretable convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8827–8836, 2018.

Supplementary Materials

A Descriptive Statistics
Table 2 presents descriptive statistics on the insurance payment behaviors of both the employed and the unemployed samples. The upper panel reports the number of employed and unemployed individuals.

Table 2: Descriptive Statistics
B Interpretation Comparison: Our Model vs Logistic Regression
We compare our model with the logistic regression model that takes the "naive persistent changes" of the seven kinds of insurance as inputs. The definition of "naive persistent change" is in Section 3.2. The linear combination weights in g^(2) of our interpretable neural network model and the logistic coefficients of each variable are in Table 3.

Table 3: Comparison: Interpretable Neural Networks vs. Conventional Interpretable Models

          Endowment  Working medical  Unemploy  Injury  Maternity  Non-work medical  HPF
IntNN     -1.73      1.84             -4        0.64    0.62       -0.66             4.19
Logistic  2.02       -0.61            0.32      -0.52   -1.16      1.53              -0.12

From Table 3, we find that the interpretations of our model and the logistic regression model are qualitatively the same. As discussed in the definition of the "naive persistent change" in Section 3.2, a larger naive persistent change means the individual switched from paying to not paying a specific type of insurance. A negative coefficient in the logistic regression therefore implies a positive relationship between a jump in payments and the probability of being employed, which is the same interpretation as a positive coefficient in our model. Indeed, the signs of all seven coefficients in the interpretable neural network are exactly opposite to those in the logistic regression, which means the two models share exactly the same qualitative interpretation.
C.1 Model Accuracy with Data Missing
The model we propose is robust to missing or abnormal data. To check this robustness, we randomly set 10% of all payment records to 0, mimicking missing data, and rerun the same 11 models as in Section 3.2. The model accuracy results are in Table 4.

Table 4: Robust Model Comparisons: Our Model vs. Logistic Regression vs. Random Forest (columns: Index, Model, Inputs, Test Accuracy; model 1 is IntNN)
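The perturbation used in this check can be sketched as follows (the function name and interface are our own):

```python
import numpy as np

def mask_payments(payments, frac=0.10, seed=0):
    """Randomly set a fraction of payment records to 0 to mimic missing data."""
    rng = np.random.default_rng(seed)
    masked = payments.copy()
    masked[rng.random(masked.shape) < frac] = 0.0
    return masked

X = np.ones((1000, 7))    # stand-in for a payment matrix
Xm = mask_payments(X)
print((Xm == 0).mean())   # roughly 0.10 of the records are zeroed
```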
C.2 Model Interpretation with Data Missing
In Section 3.3, we interpret our model through its structure. Besides transparency of the model structure, robustness is another important aspect of model interpretability. In this section, we illustrate the robustness of our model with the parameters estimated on the data with 10% missing records from Section C.1.

1. Decision-tree-like splitting g^(1): for all j, τ, x^(1)_ijτ = Sigmoid(c · x_ijτ + d). The estimates of c and d are both close to those in the baseline model.

2. Dimension reduction g^(2): for all τ, x^(2)_iτ = Sigmoid(w^T (x^(1)_ijτ)_{j=1}^m + b). The estimates of the m-dimensional weight vector w and the intercept b are all close to those in the baseline model.

3. Persistent change filter g^(3): x̃^(3)_it = D((x^(2)_iτ)_{τ=t-s+1}^t, k). The estimated smoothing parameter k is smaller than the baseline estimate of 0.999999. This is sensible: due to the missing data, some smoothing is needed to obtain a persistent change filter series that predicts the outcomes well. It also confirms the rationale of the persistent change filter definition, as discussed in detail in Supplement D.

4. Logistic regression g^(4): P(y_it = 1) = Sigmoid(u · x̃^(3)_it + v). The estimates of u and v are both close to those in the baseline model.

To summarize, our model is robust to missing-data problems, and delivers essentially the same interpretations of the model outcomes.

D Motivation and Variants of the Persistent Change Filter
The persistent change filter is a monotone reduction designed for a univariate time-series input $x = (x_1, x_2, \ldots, x_T)$, $x_t \in [0,1]$, and mainly captures persistent jumps or drops in it.

D.1 Persistent Change $D^{(0)}$ for Binary Inputs

Consider a binary time series $x_t \in \{0,1\}$, $t = 1, \ldots, T$. The persistent change is defined as the number of consecutive periods for which the most recent $x_t$'s have equaled 1. In other words,
$$D^{(0)}(x) = m \iff x_T = x_{T-1} = \cdots = x_{T-m+1} = 1, \quad x_{T-m} = 0.$$
This variable gives information on the moment the input turns into 1 and how long it has stayed there.

D.2 Continuous Persistent Change $D^{(1)}$ for Continuous Inputs

When we have a continuous input $x_t \in [0,1]$, $t = 1, \ldots, T$, we use a compatible expression for persistent change that reduces to the original version when the input is restricted to be binary:
$$D^{(1)} = \sum_{t=1}^{T} \prod_{j=1}^{t} x_{T-j+1}.$$
Equivalently, we can define a series $p_t$, $t = 1, \ldots, T$, iteratively by $p_1 = x_1$ and $p_{t+1} = x_{t+1} + x_{t+1} p_t$; then the continuous persistent change measure is $D^{(1)} = p_T$. To understand the iterative definition, note that when $x_{t+1}$ gets close to 0, it almost cleans up the historical accumulation $p_t$; when $x_{t+1}$ remains close to 1, it keeps the historical accumulation $p_t$, and $p_{t+1}$ grows by adding $x_{t+1}$.

D.3 Persistent Change $D^{(2)}$: Symmetric to Jumps and Drops

The measures $D^{(0)}$ and $D^{(1)}$ can only capture changes from small values (like 0) to large values (like 1). To treat small and large values equally, we define a new measure $D^{(2)}$ on a continuous-valued time series $x_t \in [0,1]$, $t = 1, \ldots, T$:
$$p_1 = x_1, \quad p_{t+1} = x_{t+1} + x_{t+1} p_t; \qquad q_1 = 1 - x_1, \quad q_{t+1} = (1 - x_{t+1}) + (1 - x_{t+1}) q_t;$$
$$D^{(2)} = p_T - q_T.$$
In this definition, both jumps and drops in $x_t$ are addressed.
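The recursions above are short enough to implement directly. The following is a minimal sketch; the function names are ours, not from the paper's code:

```python
def d0(x):
    """D^(0): length of the trailing run of 1s in a binary sequence."""
    m = 0
    for v in reversed(x):
        if v != 1:
            break
        m += 1
    return m

def d1(x):
    """D^(1): continuous persistent change via p_{t+1} = x_{t+1} + x_{t+1} * p_t."""
    p = 0.0
    for v in x:
        p = v + v * p
    return p

def d2(x):
    """D^(2): symmetric version, D^(2) = p_T - q_T."""
    p, q = 0.0, 0.0
    for v in x:
        p = v + v * p
        q = (1 - v) + (1 - v) * q
    return p - q
```

On a binary input the two measures coincide: `d1([0, 0, 1, 1, 1])` equals `d0([0, 0, 1, 1, 1])`, both giving 3, while `d1` also accepts continuous inputs.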
D.4 Smoothed Persistent Change $D^{(3)}$: Adding a Trainable Smoothing Parameter $k$

To weaken the cleaning-up effect against abnormal changes in the time series caused by missing data or data errors, we introduce a smoothing parameter $k \in [0,1]$ and define a new measure $D^{(3)}$ on a continuous-valued time series $x_t \in [0,1]$, $t = 1, \ldots, T$:
$$p^s_1 = x_1, \quad p^s_{t+1} = x_{t+1} + k\,x_{t+1}\,p^s_t + (1-k)\,p^s_t;$$
$$q^s_1 = 1 - x_1, \quad q^s_{t+1} = (1 - x_{t+1}) + k\,(1 - x_{t+1})\,q^s_t + (1-k)\,q^s_t;$$
$$D^{(3)} = p^s_T - q^s_T.$$
Compared to $D^{(2)}$, the multiplication term $x_{t+1} p_t$ is replaced by $k\,x_{t+1}\,p^s_t + (1-k)\,p^s_t$. In other words, a larger $k$ means a larger cleaning-up effect, while the accumulation effect is unaffected. When $k = 1$ we have $D^{(3)} = D^{(2)}$. We leave $k$ trainable to let the data find the best value. Finally, we define the persistent change filter as $D = D^{(3)}$.

D.5 Why Do We Call It a Persistent Change Filter?

Consider four pairs of inputs $x^{(j)}_1$ and $x^{(j)}_2$, $j = 1, \ldots, 4$, each of length 15: the first two pairs are binary series, and the last two are continuous series in $[0,1]$. The persistent change measures defined above are reported in Table 5.

Table 5: Persistent Change Measures of Example Time Series

              D^(0)   D^(1)    D^(2)    D^(3)
x^(1)_1       12      12       12       11.90
x^(2)_1
x^(1)_2
x^(2)_2
x^(1)_3       –       6.49     6.37     9.06
x^(2)_3       –       0.11     −6.37    −9.06
x^(1)_4       –       3.47     3.36     6.90
x^(2)_4       –       0.111    −3.36    −6.90

From Table 5, we have the following intuitive findings:
1. As shown in the first two rows, $D^{(1)}$ is equivalent to $D^{(0)}$ for binary inputs, while it can also handle continuous inputs, as in the remaining rows.
2. $D^{(2)}$ captures the persistent change idea, but is vulnerable to abnormal data points, as in the second and fourth rows.
3. $D^{(3)}$ still captures the persistent change idea even in the presence of abnormal data points. Here the smoothing parameter $k$ is selected arbitrarily, but it can be trained to an optimal value when we use $D^{(3)}$ in the full model.
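The smoothed filter can be sketched the same way as the other measures. This is an illustrative implementation, not the authors' code; at $k = 1$ it reduces exactly to $D^{(2)}$:

```python
def d3(x, k):
    """D^(3): smoothed persistent change with smoothing parameter k in [0, 1].

    p^s_{t+1} = x_{t+1} + k * x_{t+1} * p^s_t + (1 - k) * p^s_t
    q^s_{t+1} = (1 - x_{t+1}) + k * (1 - x_{t+1}) * q^s_t + (1 - k) * q^s_t
    D^(3) = p^s_T - q^s_T
    """
    p, q = 0.0, 0.0
    for v in x:
        p = v + k * v * p + (1 - k) * p
        q = (1 - v) + k * (1 - v) * q + (1 - k) * q
    return p - q
```

With $k = 1$, `d3` coincides with $D^{(2)}$; with smaller $k$, a single abnormal point (such as the 0 in `[1, 1, 1, 0, 1, 1, 1]`) cleans up less of the accumulated history, so more of the persistent run survives.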