Assessing Sensitivity of Machine Learning Predictions: A Novel Toolbox with an Application to Financial Literacy
Falco J. Bargagli-Stoffi∗ Kenneth De Beckker† Joana Elisa Maldonado‡ Kristof De Witte§

Abstract
Despite their popularity, machine learning predictions are sensitive to potential unobserved predictors. This paper proposes a general algorithm that assesses how the omission of an unobserved variable with high explanatory power could affect the predictions of the model. Moreover, the algorithm extends the usage of machine learning from pointwise predictions to inference and sensitivity analysis. In the application, we show how the framework can be applied to data with inherent uncertainty, such as students' scores in a standardized assessment on financial literacy. First, using Bayesian Additive Regression Trees (BART), we predict students' financial literacy scores (FLS) for a subgroup of students with missing FLS. Then, we assess the sensitivity of predictions by comparing the predictions and performance of models with and without a highly explanatory synthetic predictor. We find no significant difference in the predictions and performances of the augmented (i.e., the model with the synthetic predictor) and original model. This evidence sheds light on the stability of the predictive model used in the application. The proposed methodology can be used, above and beyond our motivating empirical example, in a wide range of machine learning applications in social and health sciences.
Keywords: Machine Learning; Sensitivity; Bayesian Statistical Learning; Financial Literacy; Predictions; Targeted Policies
JEL Codes: G53; C38; I21; I28
We are grateful to participants at KU Leuven and IMT School for Advanced Studies seminars and at the 2020 Data Science Webinar. We want to thank Tommaso Agasisti, Ilja Corneliz, Chiara Masci, Giorgio Gnecco, Chris Van Klaveren, Kwonsang Lee, Fabrizia Mealli, Rachel Nethery, Massimo Riccaboni, Armando Rungi, and Mike Smet. Falco J. Bargagli-Stoffi acknowledges funding from the Alfred P. Sloan Foundation. Kenneth De Beckker acknowledges the financial support from Wikifin.be. Joana Elisa Maldonado acknowledges support from the Flemish Science Organization through the grant S000617N. Kristof De Witte acknowledges the financial support from Wikifin.be and from the Flemish Science Organization through the grant S000617N.

∗ Corresponding author. Mail to: fbargaglistoffi@hsph.harvard.edu. Harvard University, 655 Huntington Ave, Boston, MA 02115, United States.
† Hasselt University, Agoralaan Building D, 3590 Diepenbeek, Belgium. KU Leuven, Warmoesberg 26, 1000 Brussels, Belgium.
‡ KU Leuven, Naamsestraat 69, 3000 Leuven, Belgium.
§ KU Leuven, Naamsestraat 69, 3000 Leuven, Belgium. UNU-MERIT, Maastricht University, Minderbroedersberg 4, 6211 LK Maastricht, The Netherlands.

1 Introduction
Machine learning is rapidly growing in popularity in the social sciences, as it helps scholars to address a large set of issues regarding prediction, causal inference, theory testing and the development of new data sources (Mullainathan and Spiess; 2017; Athey and Imbens; 2019). This rise has been mainly driven by the staggering performance of machine learning in predictive tasks. However, off-the-shelf machine learning methodologies should be used for prediction in a careful and thoughtful way. To this day, little has been done to explore how sensitive machine learning predictions are to potentially unobserved predictors.

From a methodological point of view, this paper addresses this shortcoming by proposing a general algorithm that assesses how the omission of an unobserved predictor could affect the predictions and the performance of the model. As such, we obtain estimates for what we call sensitivity of prediction analysis. The algorithm that we propose is general enough to be applied to both Bayesian methodologies (i.e., the Bayesian Additive Regression Tree method introduced by Chipman et al.; 2010) and frequentist techniques (i.e., the random forest methodology by Breiman; 2001). In particular, our method aims at assessing the extent to which an unobserved predictor would impact the model's predictions and its performance. To perform this analysis, we generate a synthetic predictor with a high explanatory power (i.e., a high correlation with the outcome) that is uncorrelated with all the predictors in the model. Then, we check how the inclusion of this additional variable in the machine learning model modifies its predictions and accuracy. We assume that the model at hand is able to catch most of the signal from the data if its forecasts and its predictive ability are not extensively modified by the inclusion of the synthetic predictor.
In order to assess this, we check whether: (i) the unit-level predictions of the model with the synthetic predictor, what we call the augmented model, fall within the confidence interval of the unit-level predictions of the model that we are comparing; and (ii) the predictive performance of the augmented model is not statistically different from the performance of the original model. The approach that we develop here is applied in the case of a continuous outcome variable, but it can be easily extended to a binary outcome variable. Moreover, we make this methodology robust by introducing, in a second step, a set of correlations between the synthetic predictor and the most important predictors in the original model. This second case more closely mimics real-world applications where predictors are correlated.

On the applied side, we innovate the literature by introducing uncertainty of predictions, which has various applications as machine learning is increasingly used to perform predictions on sensitive matters: e.g., patients' life expectancy (Kleinberg et al.; 2015), human capital selection (Chalfin et al.; 2016), unemployment insurance effectiveness (Chernozhukov et al.; 2017), students' learning abilities (Kotsiantis et al.; 2004; Delen; 2010; Kotsiantis; 2012), and so on. In this spirit, the proposed methodology is a first step towards a more transparent way to account for predictions' stability and uncertainty in machine learning applications in social and health sciences.

As a motivating application, we are the first to introduce supervised machine learning in the financial literacy literature. First, we use machine learning to predict financial literacy scores (FLS) for students in a region of Belgium where these scores are not observed. We train our model using the PISA data (OECD; 2017a) for the Flemish region of Belgium, where the FLS are observed. Then, we perform the prediction for the missing FLS of students in the Walloon region of Belgium.
We use an extensive set of student- and school-level variables as predictors in our machine learning model. Second, we use machine learning to identify the characteristics of students with outlying predicted test scores. In our case, the main focus is on students with lower predicted FLS. Indeed, machine learning can help researchers and policy-makers to understand the characteristics of students who have low predictions for financial literacy. These insights could enable policy-makers to target interventions to lower performing students in order to improve their performance. Third, we employ the novel sensitivity analysis to assess the robustness of our analyses. Sensitivity is central in the case of our application, as we deal with sensitive data such as students' scores in a standardized assessment on financial literacy. In this setting, it is important to assess how sensitive the predictions of students' FLS are to a potentially unobserved predictor with high explanatory power.

Our main results can be summarized as follows. The performance of the proposed model is good, as we reach an estimated adjusted R² of roughly 73%, and both the root mean squared error (RMSE) and mean absolute error (MAE) of prediction are comparatively small. Moreover, the best performing machine learning technique, Bayesian Additive Regression Trees (Chipman et al.; 2010), allows us to get draws from the posterior distribution of the predicted observations and construct credible intervals for the unit-level predictions. Hence, we are able to detect the predictions that fall outside the 95% credible intervals for the mean posterior predicted values. We use this information to identify a set of outlier predictions. On one side, this analysis shows that the predictive probability of having a low financial literacy score is the largest for students with lower scores for reading and math in the PISA test.
On the other side, the students' background plays a critical role: students with the largest predicted low FLS are often individuals from families where the school language is not spoken at home and the parents have a poor educational background. Moreover, results from the sensitivity of predictions analysis show that the inclusion of an unobserved predictor would not affect the predictions and the predictive ability of the model in a sizable manner. This hints at the fact that our model is already getting enough signal to perform its predictions.

The remainder of the paper is organized as follows. In Section 2 we introduce the machine learning methodologies used and developed in this paper. Section 3 describes the PISA data used to illustrate the proposed methodology. In Section 4 we discuss the results obtained from the application on the PISA data. Section 5 concludes with a discussion of the results.

2 Methodology

This Section discusses in detail the machine learning technique used for prediction (Subsection 2.1) and its extension to the detection of observations with outlying predicted values, which paves the way to targeted policies. In Subsection 2.2, we introduce the novel methodologies for sensitivity analysis that extend the usage of machine learning algorithms from pointwise predictions to robustness analysis regarding these predictions.

These performance measures result from 10-fold cross-validation. K-fold cross-validation is a well-known way to assess the performance of a predictive model in machine learning (James et al.; 2013). The R and Stata codes used for the analysis, together with the data and the functions for the machine learning analysis, are publicly available on the GitHub page of the corresponding author (https://github.com/fbargaglistoffi/sensitivity-analysis-machine-learning).
2.1 Machine learning for predictions and targeted policies

We rely on the Bayesian Additive Regression Trees (BART) algorithm (Chipman et al.; 2010), a Bayesian ensemble-of-trees methodology, as a predictive model. BART was shown to have an excellent performance both in prediction tasks (Murray; 2017; Linero and Yang; 2018; Linero; 2018; Hernández et al.; 2018; Bargagli-Stoffi, Riccaboni and Rungi; 2020; Bargagli-Stoffi, Niederreiter and Riccaboni; 2020) and in causal inference tasks (Hill; 2011; Logan et al.; 2019; Nethery et al.; 2019; Bargagli-Stoffi et al.; 2019; Bargagli-Stoffi; 2020; Lee et al.; 2020). This is due to the particular flexibility of this method and the fact that it overcomes potential issues connected with other machine learning methodologies such as the Classification And Regression Tree (CART) algorithm (Breiman et al.; 1984) and the Random Forest (RF) algorithm (Breiman; 2001). Moreover, BART allows researchers to get draws from the posterior distribution of the predicted values. In Bayesian statistics, the posterior predictive distribution indicates the distribution of predicted data based on the data one has already seen (Gelman et al.; 2014). Hence, the posterior predictive distribution can be used to predict new data values and to draw inference around the distribution of these predicted values.

BART finds its foundation in the CART algorithm. CART is a widely used algorithm for the construction of trees where each node is split into only two branches (i.e., binary trees). Figure 1 illustrates how binary partitioning works in practice based on a simple case with just two predictors $x_1 \in [0,1]$ and $x_2 \in [0,1]$.

Figure 1: (Left) An example of a binary tree. The internal nodes are labelled by their splitting rules and the terminal nodes are labelled with the corresponding parameters $\mu_i$. (Right) The corresponding partition of the sample space.

Let us denote with $Y$ the outcome vector, with $y_i$ the outcome for a generic unit $i$, and with $X$ the $N \times P$ matrix of covariates or predictors (where $P$ is the number of predictors), with $X_i$ the $P$-dimensional vector of predictors for unit $i$ (with $i = 1, \ldots, N$). Then, the BART model can be expressed as:
$$Y = f(X) + \epsilon \approx T_1(X; D_1, M_1) + \ldots + T_J(X; D_J, M_J) + \epsilon, \qquad \epsilon_i \sim N(0, \sigma^2), \qquad (1)$$
where the $J$ distinct binary trees are denoted by $T_j(X; D_j, M_j)$. $T_j$ is a function that sorts each unit into one of the $m_j$ terminal nodes, associated with mean parameters $M_j = \{\mu_1, \ldots, \mu_{m_j}\}$, based on a set of decision rules, $D_j$. $\epsilon$ is an error term, typically assumed to be independently and identically normally distributed when the outcome is continuous (Chipman et al.; 2010). Using more compact notation, we can re-write the BART model as:
$$Y = \sum_{j=1}^{J} T_j(X; D_j, M_j) + \epsilon. \qquad (2)$$
The Bayesian component of the algorithm is incorporated in a set of three different priors on: (i) the structure of the trees, $D_j$ (this prior is aimed at limiting the complexity of any single tree $T_j$ and works as a regularization device); (ii) the distribution of the outcome in the nodes, $M_j$ (this prior is aimed at shrinking the node predictions towards the center of the distribution of the response variable $Y$); and (iii) the error variance $\sigma^2$ (which bounds $\sigma^2$ away from very small values that would lead the algorithm to overfit the training data). The aim of these priors is to "regularize" the algorithm, preventing single trees from dominating the overall fit of the model (Kapelner and Bleich; 2016).
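To fix ideas, the binary partitioning illustrated in Figure 1 can be mimicked with a shallow regression tree. The following is a minimal sketch using scikit-learn's CART-style estimator; the data, cutoffs and node means are our own illustration, not the paper's:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Two predictors on [0, 1]^2, as in the Figure 1 example: the outcome
# is (roughly) constant within three rectangles of the sample space,
# and a depth-2 tree recovers the partition by recursive binary splits.
rng = np.random.default_rng(0)
X = rng.random((500, 2))
Y = np.where(X[:, 0] < 0.5, 1.0,                  # root split on x1
             np.where(X[:, 1] < 0.5, 2.0, 3.0))   # second split on x2
Y = Y + rng.normal(scale=0.1, size=500)           # small noise around node means

tree = DecisionTreeRegressor(max_depth=2).fit(X, Y)
# each prediction is the mean parameter of the terminal node the point falls into
preds = tree.predict([[0.2, 0.9], [0.8, 0.2], [0.8, 0.9]])
```

A single deep CART tree like this tends to overfit; BART instead sums many small, regularized trees of exactly this form, which is what the priors above are designed to control.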
The choice of the priors and the derivation of the posterior distributions are discussed in depth by Chipman et al. (2010) and Kapelner and Bleich (2016). Namely, (i) the prior on the probability that a node will split at depth $k$ is $\beta(1+k)^{-\eta}$, with $\beta \in (0,1)$ and $\eta \in [0,\infty)$ (these hyper-parameters are generally chosen to be $\eta = 2$ and $\beta = 0.95$); (ii) the prior on the node parameters $\mu$ is $N(0, \sigma_q^2)$, where $\sigma_q = \sigma_\mu/\sqrt{J}$ and $\sigma_\mu$ can be used to calibrate the plausible range of the regression function; (iii) the prior on the error variance is $\sigma^2 \sim \mathrm{InvGamma}(\nu/2, \nu\lambda/2)$, where $\lambda$ is determined from the data in such a way that BART will improve on the RMSE of an OLS model 90% of the time.

Assume that we observe a sample $\{X, Y\}$ for which we observe the set of predictors $X$ for every unit, but the vector of outcomes is observed just for a subsample of the study population, $Y^{obs}$. Hence, we can indicate with $Y^{mis}$ the vector of missing outcomes in the study sample, where $Y = Y^{mis} \cup Y^{obs}$. To obtain predictions for these values, the first step is to build a BART model as in (2), regressing $Y^{obs}$ on the observed matrix of predictors. To perform the imputation for $Y^{mis}$, BART collects a sample from the posterior distribution of $\theta = \{\sigma, D_j, M_j\}$, $p(\theta \mid Y^{obs})$, using the Bayesian backfitting algorithm proposed by Chipman et al. (2010). An imputed value for $\hat{Y}^{mis}$ can be obtained by sampling from the posterior predictive distribution:
$$p(Y^{mis} \mid Y^{obs}) = \int p(Y^{mis} \mid Y^{obs}, \theta) \cdot p(\theta \mid Y^{obs}) \, d\theta. \qquad (3)$$
From (3) we obtain draws from the posterior predictive distribution of $Y^{mis}$. Then, we can use standard techniques employed for outlier detection, such as the ones proposed by Miller (1991) (i.e., $K$ standard deviations from the mean of the distribution) and Leys et al. (2013) (i.e., $K$ absolute deviations from the median), to detect the predictions with values that are consistently further away from the mean or the median of the predicted distribution, respectively. The nice feature of such techniques is that they can be manually tuned in order to include units with more or less extreme predicted values.

These analyses can be highly relevant for policy-makers to spot observations with more or less extreme predicted outcome values. Indeed, it could be the case that policy-makers are interested in targeting their intervention to a specific subpopulation, based on the values of an unobserved, yet predictable, outcome (e.g., providing additional learning material to more vulnerable students in a region where financial literacy is not observed).
Moreover, information on the factors related to these "high or low levels" of the predicted outcome is not only useful to identify potential subgroups, but also reveals which factors are associated with specific levels or values of the outcome.

In the case of multivariate predictions, one could implement data-driven methodologies for outlier detection, such as the Isolation Forest (Liu et al.; 2012).
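The outlier-detection step described above can be sketched as follows, assuming a $(B \times N)$ array `draws` of posterior predictive draws for the $N$ units with missing outcomes. The function name and defaults are ours; the two rules follow Miller (1991) and Leys et al. (2013):

```python
import numpy as np

def flag_outlier_predictions(draws, k=2.0):
    """Flag units whose posterior-mean prediction is an outlier.

    draws : (B, N) array of posterior predictive draws
            (B posterior samples for N units with missing outcomes).
    k     : cutoff in standard deviations / scaled median absolute
            deviations (tunable, per Miller 1991 and Leys et al. 2013).
    Returns two boolean masks of length N.
    """
    point = draws.mean(axis=0)              # posterior-mean prediction per unit

    # Rule 1: k standard deviations from the mean of the predictions
    sd_mask = np.abs(point - point.mean()) > k * point.std()

    # Rule 2: k (scaled) median absolute deviations from the median;
    # 1.4826 makes the MAD consistent with the SD under normality
    mad = 1.4826 * np.median(np.abs(point - np.median(point)))
    mad_mask = np.abs(point - np.median(point)) > k * mad

    return sd_mask, mad_mask
```

Lowering `k` widens the flagged subpopulation, which is exactly the manual tuning described above for targeting more or fewer students.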
2.2 Sensitivity of predictions analysis

The ability of machine learning techniques to provide accurate and robust predictions is key. As predictions always hold a degree of uncertainty, we argue that there is a need for specific techniques that enable researchers and practitioners to assess their sensitivity. This is particularly true as the usage of machine learning for prediction policy problems has been increasing over the last few years (Kleinberg et al.; 2015; Mullainathan and Spiess; 2017).

The main objective of the sensitivity of predictions analysis introduced here is to assess how the omission of an unobserved predictor could affect both the forecasts of the model and its performance. Indeed, we argue that a desirable property of any predictive model is to be robust to potentially unobserved predictors. In a predictive scenario, we can define robustness as the model's ability to provide stable predictions and a stable performance upon the inclusion of additional variables in the model. Robustness of predictions is a desirable property, as volatile predictions may be completely overturned by the inclusion of an additional predictor in the model. Stability of predictions per se is not necessarily a desideratum: a model that is completely unable to properly predict a certain outcome may be stable but highly inaccurate. We argue that the stability of the model is strictly connected with its performance and depends on the quality of the additional predictor included. If one includes a new predictor with a high explanatory power in a poorly performing model (i.e., a model with highly inaccurate predictions), the model will show a boost in its predictive power, driven by an increase in the precision of its predictions. Hence, it is intuitive to see why such a sensitivity of prediction analysis is important for at least two reasons.
First, it can show how much an additional predictor could boost the performance of the model, indicating how much the model can be improved by taking into account additional, potentially unobserved predictors. Second, it reveals the variability in the predictions induced by novel predictors: the more stable they are, given a high predictive performance, the more the model can be trusted for predictive policy tasks.

Such a sensitivity of prediction analysis is directly inspired by its causal counterpart (Rosenbaum and Rubin; 1983a; Ichino et al.; 2008). In particular, Ichino et al. (2008) propose a simulation-based sensitivity analysis to assess the robustness of causal estimands to failures of the unconfoundedness assumption (i.e., independence between the treatment and the potential outcomes conditional on the observed covariates). The authors build on the works by Rosenbaum and Rubin (1983b) and Rosenbaum and Rubin (1983a) and simulate an unobserved confounder that is associated with both the potential outcome and the treatment. Then, they introduce this additional confounder in the model for the estimation of the Average Treatment effect on the Treated (ATT). They repeat the estimation procedure many times using a different set of values for the simulated confounder and compare these estimates with the original estimate of the ATT obtained without the inclusion of the simulated confounder. Such a comparison informs scholars on how robust the original causal estimands are to the different configurations used for the construction of the confounder (and, in turn, to different levels of violation of the unconfoundedness assumption).

While departing from a similar intuition, the sensitivity analysis that we implement here is fundamentally different, because we deal with a prediction setting rather than a causal inference one.
In fact, we depart from the conceptual meaning of causal sensitivity analyses as tools to assess the robustness of an estimator to an underlying and untestable assumption. In this sense, the sensitivity analysis that we introduce in this paper is not aimed at testing any identification assumption. Conversely, our aim is to provide a direct assessment of the quality of the model and its predictions. The added value of the proposed approach lies in those situations where one has a well-performing predictive model and wants to assess whether or not it is worthwhile to continue the search for new predictors.

The sensitivity of prediction analysis is performed by generating a synthetic predictor and checking if (and how) its inclusion in the set of observed predictors changes the model's predictions and, in turn, its performance. As we want this predictor to have a good explanatory power with respect to the outcome vector Y, we design it to have a high correlation with the outcome and to be uncorrelated with all the other predictors in the model.

Before going into the nuts and bolts of how to construct the synthetic variable, let us discuss the implications of constructing it in such a way. While a high correlation with the outcome and a zero correlation with the observed predictors guarantee a high explanatory power (i.e., a large share of the variation in the outcome being explained by the synthetic variable, controlling for the other predictors), they do not guarantee that the synthetic variable is also the one with the highest predictive power. Indeed, it may be the case that, in a non-linear setting, the interaction between two variables with lower explanatory power and the outcome could carry more information than a predictor with a higher explanatory power than these two individual variables.
However, such interaction patterns between the predictors and the outcome are hidden in the data (in some sense, the ultimate goal of supervised learning methodologies is to discover such underlying patterns), and it is not possible to reproduce them in any meaningful way. Hence, constructing a variable with a high explanatory power, while it may not lead to the best absolute model performance, is (i) feasible (while it may not be possible to construct a synthetic variable with the highest predictive power, due to the complex, non-linear patterns of interactions between the predictors and the outcome) and (ii) meaningful: the explanatory power of a variable is relevant also in a prediction setting (especially when the ultimate users are policy-makers).

Let us define with $R$ an $N$-dimensional vector for the synthetic predictor. We start by sampling $N$ independent realizations from a normally distributed continuous variable: $\tilde{R} \sim N(0, 1)$. Let $\tilde{\mathbf{R}}$ be the $N \times (P+2)$ matrix that stacks the output vector $Y$, $\tilde{R}$ and the matrix of observed variables $X$ such that:
$$\tilde{\mathbf{R}}_{(N \times (P+2))} = \big[\, Y_{(N \times 1)} \;\; \tilde{R}_{(N \times 1)} \;\; X_{(N \times P)} \,\big]. \qquad (4)$$
We define with $\hat{K}$ the estimated correlation matrix of $\tilde{\mathbf{R}}$,
$$\hat{K}_{((P+2) \times (P+2))} =
\begin{pmatrix}
1 & \hat{\rho}(Y, \tilde{R}) & \ldots & \hat{\rho}(Y, X_P) \\
\hat{\rho}(\tilde{R}, Y) & 1 & \ldots & \hat{\rho}(\tilde{R}, X_P) \\
\vdots & \vdots & \ddots & \vdots \\
\hat{\rho}(X_P, Y) & \hat{\rho}(X_P, \tilde{R}) & \ldots & 1
\end{pmatrix} \qquad (5)$$
and with $\tilde{\Sigma} = \mathrm{diag}(\hat{\sigma}_Y, \hat{\sigma}_R, \hat{\sigma}_{X_1}, \ldots, \hat{\sigma}_{X_P})$ the $(P+2) \times (P+2)$ diagonal matrix of estimated standard deviations, where $\hat{\sigma}_Y$ is the estimated standard deviation of the output, $\hat{\sigma}_R$ is the estimated standard deviation of $\tilde{R}$ and $\hat{\sigma}_{X_j}$ is the estimated standard deviation of the $j$-th variable in $X$. Moreover, we let $C$ be the symmetric, idempotent centering $N \times N$ matrix:
$$C = I - \frac{1}{N}\mathbf{1} \qquad (6)$$
where $I$ is the identity matrix and $\mathbf{1}$ is an $N \times N$ matrix of ones. We center and scale the $\tilde{\mathbf{R}}$ matrix as follows:
$$\dot{\mathbf{R}} = C \tilde{\mathbf{R}} \tilde{\Sigma}^{-1} \qquad (7)$$
where $\dot{\mathbf{R}}$ is a multivariate normal matrix.
This is similar to the type of scaling used, for example, in standard normal variate correction, and corresponds to a variation of the whitening (or sphering) transformation. Then, we remove the correlation between the columns of $\dot{\mathbf{R}}$ by multiplying:
$$L^{-1} \dot{\mathbf{R}} \qquad (8)$$
where $L$ is the lower triangular matrix obtained through the Cholesky decomposition of $\hat{K}$: $\hat{K} = L L^T$. The matrix that we obtain in (8) is a set of realizations of uncorrelated multivariate normal random variables. At this point, we can reintroduce the correlation between the columns as follows:
$$U L^{-1} \dot{\mathbf{R}} \qquad (9)$$
where $U$ is obtained through the Cholesky decomposition of $K$, $K = U U^T$, and $K$ is a newly defined correlation matrix:
$$K_{((P+2) \times (P+2))} =
\begin{pmatrix}
1 & \widetilde{\rho}(Y, R) & \ldots & \hat{\rho}(Y, X_P) \\
\widetilde{\rho}(R, Y) & 1 & \ldots & \widetilde{\rho}(R, X_P) \\
\vdots & \vdots & \ddots & \vdots \\
\hat{\rho}(X_P, Y) & \widetilde{\rho}(X_P, R) & \ldots & 1
\end{pmatrix} \qquad (10)$$
where $\widetilde{\rho}(Y, R)$ is the correlation between the synthetic predictor $R$ and the outcome, which is set directly by the researcher, and $\widetilde{\rho}(X_j, R)$ is the correlation between the synthetic predictor and the $j$-th regressor. These correlations are set to be equal to zero in our first setting:
$$K_{((P+2) \times (P+2))} =
\begin{pmatrix}
1 & \widetilde{\rho}(Y, R) & \ldots & \hat{\rho}(Y, X_P) \\
\widetilde{\rho}(R, Y) & 1 & \ldots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
\hat{\rho}(X_P, Y) & 0 & \ldots & 1
\end{pmatrix} \qquad (11)$$
In a further step, we make this algorithm robust by introducing correlations between the synthetic variable and the best predictors, to closely mimic the case of real-world applications where predictors are correlated. The choice of the correlation pattern is left to the researcher, and can be informed by previous analyses on the relative importance and correlation of the predictors. We can now rework the model in (2) to include the set of stacked predictors $G$:
$$Y = \sum_{l=1}^{L} T_l(G; D_l, M_l) + \psi \qquad (12)$$
where $G$ is:
$$G_{(N \times (P+1))} = \big[\, R_{(N \times 1)} \;\; X_{(N \times P)} \,\big]. \qquad (13)$$
This is possible as long as $K$ is a symmetric, positive-definite matrix.
In the case of a positive semi-definite matrix, one can employ a global shift $\alpha > 0$, $K = S U U^T S + \alpha I$, where $S$ is a diagonal scaling matrix. This shift is known as Tikhonov regularization.

We can then transform the matrix back to the original scale,
$$\mathbf{R} = C^{-1} U L^{-1} \dot{\mathbf{R}} \tilde{\Sigma}, \qquad (14)$$
and "extract" the newly generated synthetic predictor $R$ from the $\mathbf{R}$ matrix. In this specific case, the synthetic predictor corresponds to the second column of the matrix $\mathbf{R}$. However, its relative position depends on how the matrix $\tilde{\mathbf{R}}$ is generated.

Finally, we compare the predictions of the new augmented model in (12) with those of the original BART model in (2). To do so, we run the two models on $b$ bootstrapped samples, $b = 1, \ldots, B$, to get a set of predicted values at each iteration:
$$\sum_{j=1}^{J} T_{j,b}(X; D_j, M_j) + \epsilon = \hat{Y}_b(X); \qquad (15)$$
$$\sum_{l=1}^{L} T_{l,b}(G; D_l, M_l) + \psi = \hat{Y}_b(G). \qquad (16)$$
Once we get the unit-level predictions for these models at each iteration step $b$ ($\hat{y}_b(x_i)$ and $\hat{y}_b(g_i)$, respectively), we can run a series of comparisons between both the distribution of the predictions and the performance of the augmented model as compared to the original model. To assess the statistical difference between the unit-level predictions' distributions, we assess the difference between the mean of the unit-level predictions of the original model, $\bar{\hat{y}}(x_i)$, and the mean of the unit-level predictions of the augmented model, $\bar{\hat{y}}(g_i)$:
$$\bar{\hat{y}}(x_i) = \frac{1}{B} \sum_{b=1}^{B} \hat{y}_b(X = x_i) \quad \text{and} \quad \bar{\hat{y}}(g_i) = \frac{1}{B} \sum_{b=1}^{B} \hat{y}_b(G = g_i).$$
To assess whether or not there is an improvement in the performance of the augmented model, we compare the RMSE of prediction and the adjusted R² of the original and augmented models. As the correlation coefficients between the synthetic predictor and the outcome are subjectively defined, these comparisons can be drawn for different correlation settings.
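As a concrete illustration, the synthetic-predictor construction in equations (4)-(11) and the bootstrap comparison in (15)-(16) can be sketched in Python. This is a hedged stand-in for the authors' released R/Stata code: a random forest replaces BART (the algorithm explicitly covers both), and all function and variable names are our own:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def synthetic_predictor(Y, X, rho, seed=0):
    """Whitening-plus-Cholesky construction of eqs. (4)-(11): returns a
    synthetic predictor R with empirical correlation `rho` with Y and a
    target zero correlation with the predictors (first setting)."""
    rng = np.random.default_rng(seed)
    R_tilde = rng.normal(size=len(Y))                 # independent normal draws
    M = np.column_stack([Y, R_tilde, X])              # stacked matrix [Y, R~, X], eq. (4)
    M = (M - M.mean(axis=0)) / M.std(axis=0)          # center and scale, eq. (7)

    K_hat = np.corrcoef(M, rowvar=False)              # empirical correlations, eq. (5)
    L = np.linalg.cholesky(K_hat)
    W = M @ np.linalg.inv(L).T                        # whiten: uncorrelated columns, eq. (8)

    K = K_hat.copy()                                  # target correlation matrix, eq. (11)
    K[0, 1] = K[1, 0] = rho                           # corr(Y, R) set by the researcher
    K[1, 2:] = K[2:, 1] = 0.0                         # corr(R, X_j) = 0 in the first setting
    U = np.linalg.cholesky(K)
    Z = W @ U.T                                       # re-introduce correlations, eq. (9)
    return Z[:, 1]                                    # second column is the synthetic R

def compare_models(Y, X, R, B=20, seed=0):
    """Bootstrap comparison of eqs. (15)-(16): mean unit-level predictions
    of the original model (X) and the augmented model ([R, X])."""
    rng = np.random.default_rng(seed)
    G = np.column_stack([R, X])                       # augmented predictor set, eq. (13)
    N = len(Y)
    preds_x, preds_g = np.empty((B, N)), np.empty((B, N))
    for b in range(B):
        idx = rng.integers(0, N, size=N)              # bootstrap sample b
        m_x = RandomForestRegressor(n_estimators=50, random_state=b).fit(X[idx], Y[idx])
        m_g = RandomForestRegressor(n_estimators=50, random_state=b).fit(G[idx], Y[idx])
        preds_x[b], preds_g[b] = m_x.predict(X), m_g.predict(G)
    return preds_x.mean(axis=0), preds_g.mean(axis=0)
```

The two returned vectors of mean unit-level predictions can then be compared against each other (and against the interval estimates of the original model), and the RMSE and adjusted R² of the two models computed on held-out data, for each correlation setting of interest.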
The more similar the unit-level predictions and performance of the original and the augmented model are, the more one can argue that the available signal is enough for stable and precise predictions.

Algorithm 1 summarizes the algorithm introduced in this Subsection and its main steps.

Algorithm 1
Overview of the sensitivity analysis

The steps of the algorithm:
1. Create a new matrix of predictors G stacking the observed predictors X and a synthetic predictor R, correlated with the outcome Y and uncorrelated with all the variables in X;
2. Generate two models, one including only the observed predictors X (original model) and one including both the observed predictors and the synthetic predictor (augmented model);
3. Run the two models on b bootstrapped samples (b = 1, ..., B) to get a set of predicted values at each iteration for Ŷ_b(X) and Ŷ_b(G);
4. Compare the original and the augmented model with respect to:
(a) their unit-level predictions;
(b) their performance.

3 Data

To illustrate the proposed methodologies for prediction and sensitivity, we apply them to the 2015 data of the OECD's Programme for International Student Assessment (PISA) for Belgium. PISA is a worldwide study aimed at evaluating educational systems by measuring students' scholastic performance on reading, mathematics, and science. These assessments are intended to provide comparable data that can, in turn, enable countries to improve their education policies and outcomes.

The structure of the Belgian PISA 2015 data offers an interesting and relevant example of forecasting missing scores on a standardised student assessment. All regions of the country participated in the general assessment of PISA, while only selected regions participated in the financial literacy assessment.

Here, we introduce our algorithm in its most general version. However, running the machine learning model at hand B times can be computationally intensive, especially for less scalable machine learning methodologies. In the case of BART, to reduce such computational burden, one can directly sample the unit-level predictions from the posterior predictive distribution.
For the RF algorithm, we argue that researchers can invoke the statistical properties introduced by Wager and Athey (2018) to construct reliable confidence intervals and statistical tests for each unit-level prediction.

This structure of the data allows us to use a common set of predictors for all regions from the general assessment in order to predict financial literacy outcomes for students in the region that did not participate in this part of PISA. We choose the Belgian case to illustrate the forecasting methodology, since the regions of the country have many similarities and common policies that justify the assumption of similar underlying relationships between the predictors and the outcome variable. At the same time, the Belgian regions differ in population characteristics, which suggests that the missing scores in Wallonia are unlikely to be identical to those in Flanders. In the next two Subsections, we describe the set of predictors (Subsection 3.1) and the outcome (Subsection 3.2) in more detail.
3.1 Predictors

In contrast to regression analyses, in which multicollinearity of regressors needs to be avoided, a large number of (potentially correlated) predictors can be used for forecasting in machine learning (Vaughan and Berry; 2005; Makridakis et al.; 2008; Shmueli et al.; 2010). We therefore select a broad set of predictors from the general assessment of PISA 2015. Table A1 provides an overview of the variables used as predictors. The set of predictors includes: (i) students' background characteristics; (ii) proxies of the socioeconomic status of students; (iii) indicators of student achievement and attitudes; and (iv) school characteristics.

To account for the students' background, we include gender, grade, age, language and study track. Existing studies commonly find FLS to be highly correlated in particular with math and reading performance (e.g., Mancebón et al.; 2019; Riitsalu and Põder; 2016). In the case of Flanders, the correlation of FLS with math and reading scores amounts to 0.8, which is slightly higher than the OECD average of 0.75 (Universiteit Gent; 2017). As such, we include math and reading scores in the set of predictors. To approximate the socioeconomic status of a student, we use the indicators of educational and economic possessions of the family, the number of books in the home, the immigration background of the student, the mother's and father's education and job, as well as a variable capturing perceived parental emotional support.

Finally, studies of PISA financial literacy outcomes commonly include school-level variables (e.g., Pesando; 2018; Cordero and Pedraja; 2018). We include a number of school characteristics from the questionnaire for school principals, such as the school's location, size and autonomy.

To simplify, we use Flanders to denote both the Flemish and Brussels regions. Among the eight OECD countries participating in the PISA financial literacy assessment 2015, Belgium and Canada are the ones with the most remarkable differences in regional assessments.
As indicators of teaching quality, we use the student-teacher ratio, class size and teacher professional development. We also use school-level indicators of socioeconomic status, such as the share of students with a different home language, special needs or a socioeconomically disadvantaged home, the number of available computers and the shortage of educational material.

Figure A1 in Appendix A shows the missing observations in the variables used in the analysis. Apart from the FLS, which are, as described above, only available for Flemish students, no clear patterns of missingness appear across the predictors. We can therefore assume the observations to be Missing-Completely-at-Random (Little and Rubin; 2019), and we proceed with multiple imputation using the Fully Conditional Specification (FCS) developed by Buuren and Groothuis-Oudshoorn (2010) and implemented in the R package MICE.

As shown in Table A2 in Appendix A, a series of T-tests reveals significant differences in means for several background characteristics of students in Flanders and Wallonia. The Belgian sample is balanced across regions in terms of gender, age and parents' characteristics. However, potentially important variables with respect to financial literacy, such as math and reading scores, as well as the socioeconomic status, are significantly different in the two regional samples. Figure 2 shows that the distributions of math scores in Flanders and Wallonia overlap; however, the distribution is shifted to the right for Flanders compared to Wallonia. A similar pattern is observed regarding the socioeconomic status of students in the two samples, as approximated by the PISA wealth indicator, which summarises economic possessions of the family.
Figure 2 shows that, similarly, the distribution of wealth overlaps, but is slightly shifted to the right in Flanders compared to Wallonia.

Figure 2: (Left) Distribution of PISA Math Scores in Flanders (blue) and Wallonia (yellow). (Right) Distribution of PISA Wealth Index in Flanders (blue) and Wallonia (yellow).
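The FCS imputation step described above can be sketched roughly as follows, using scikit-learn's IterativeImputer as a Python stand-in for the R package MICE used in the paper; the toy data are hypothetical, with PISA-style column names.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# FCS-style chained-equations imputation with scikit-learn's
# IterativeImputer, as a stand-in for the R package MICE used in the
# paper. The toy data below are hypothetical.
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(200, 4)),
                  columns=["PV1MATH", "PV1READ", "WEALTH", "HEDRES"])
df = df.mask(rng.random(df.shape) < 0.1)      # ~10% of values set missing

imp = IterativeImputer(max_iter=10, sample_posterior=True, random_state=0)
completed = pd.DataFrame(imp.fit_transform(df), columns=df.columns)
```

Multiple imputation proper would repeat this step with different random seeds and pool the resulting completed datasets, which is what MICE automates.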
Even though there are obvious differences in the univariate distributions of the predictors, we find that their multivariate distributions largely overlap. In particular, we develop an overlap score aimed at assessing whether there are regions of the feature space where there is no overlap between the multivariate distributions of the predictors in Flanders and Wallonia. Additional details on the proposed methodology can be found in Appendix B. Since we find that the multivariate distributions of the predictors largely overlap for the two regions, our predictive model is unlikely to rely extensively on extrapolation when extending its predictions from Flanders to Wallonia.

3.2 Outcome Variable
The FLS in PISA 2015 are reported as a set of ten plausible values. The financial literacy score is based on a test of financial knowledge with 43 items in four content categories: (i) money and transactions; (ii) risk and reward; (iii) planning and managing finances; and (iv) the financial landscape (OECD; 2017a).

In the case of Belgium, there are 9,651 observations on the general assessment from all Belgian regions, but the FLS are only available for the 5,675 students from Flanders, while the FLS of the 3,976 students from Wallonia are missing. Flanders thus represents our "study" population, in which we observe the FLS Y^obs. Wallonia represents the "target" population, for which we predict the missing FLS, Y^mis, based on the common set of predictors from the general assessment, X. Table A3 in Appendix A provides summary statistics of the outcome variable. On average, Flemish students score 541 points, which corresponds to PISA proficiency level 3 out of 5 levels. Figure A2 in Appendix A shows the distribution of FLS in the Flemish data. 12% of Flemish students fail to reach the baseline level 2 of 400 points or more, which the OECD defines as the level necessary to participate in society (OECD; 2017a). Figure A3 in Appendix A provides an overview of the required knowledge corresponding to the five PISA proficiency levels.

For each participating country or economy, 33% of the students who completed the general PISA assessment were tested on financial literacy (OECD; 2017b). The plausible values were then constructed for all students participating in the general PISA assessment based on item response theory and latent regression. In the case of Belgium, this was only done for Flanders. As a result, the plausible values are available for all participating students in the Flemish region, but missing for students from Wallonia. In the following, we use the first plausible value (PV1FLIT) for the analysis, since any bias caused by using a single plausible value instead of all ten simultaneously is arguably negligible (Gramațki; 2017).

4 Results

In this Section, we discuss the performance of BART in the prediction task at hand. Then, we provide the results and some descriptive evidence regarding the observations with posterior predicted values lower than the average (vulnerable students). Finally, we depict and discuss the results from the sensitivity analysis.

4.1 Results for BART Predictions

Many recent studies have shown that BART performs excellently in various predictive tasks across different scenarios (Murray; 2017; Linero and Yang; 2018; Linero; 2018; Hernández et al.; 2018; Bargagli-Stoffi, Riccaboni and Rungi; 2020; Bargagli-Stoffi, Niederreiter and Riccaboni; 2020). In our particular scenario, we assess the performance of BART through a 10-fold cross-validation and compare it with the performance of the random forest (RF) algorithm (Breiman; 2001). A generic k-fold cross-validation technique consists of dividing the overall sample (in our specific case, we draw this sample from the set of observations for which the outcome variable is present) into k different subsamples: the model is trained on k-1 subsamples and tested on the remaining one, repeating this procedure over all k subsamples. Table 1 reports the resulting root mean squared error (RMSE), mean absolute error (MAE) and R^2. BART demonstrates a very good performance (as well as a series of other qualities that were described briefly in Section 2), as it outperforms RF with respect to all the selected performance measures (i.e., smaller RMSE and MAE, higher R^2). Hence, we use this technique to predict the missing FLS for students in Wallonia.

Table 1: Summary Statistics of the Performance of the ML Techniques

                 RMSE      MAE       R^2
BART             58.1799   45.7505   0.7306
Random Forest    58.8402   46.1451   0.7251
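The cross-validation loop can be sketched as follows on synthetic data. For brevity, only the random forest is scored; in the paper, BART is evaluated in the same way (e.g., via the R package bartMachine cited in the references).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import KFold

# 10-fold cross-validation computing RMSE, MAE and R^2 on synthetic data.
# Only the random forest is scored here; BART would be scored analogously.
X, y = make_regression(n_samples=500, n_features=15, noise=20.0, random_state=1)

scores = {"rmse": [], "mae": [], "r2": []}
for train, test in KFold(n_splits=10, shuffle=True, random_state=1).split(X):
    model = RandomForestRegressor(n_estimators=200, random_state=1)
    model.fit(X[train], y[train])
    pred = model.predict(X[test])
    scores["rmse"].append(mean_squared_error(y[test], pred) ** 0.5)
    scores["mae"].append(mean_absolute_error(y[test], pred))
    scores["r2"].append(r2_score(y[test], pred))

summary = {k: float(np.mean(v)) for k, v in scores.items()}
print(summary)
```

Averaging the fold-level metrics, as done here, yields the single RMSE, MAE and R^2 numbers of the kind reported in Table 1.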
Figure 3 depicts the posterior predicted values of FLS for students in Flanders (light blue) and Wallonia (orange), while Table 2 reports summary statistics for the same predictions.

Figure 3: Predicted FLS for Flanders (light blue) and Wallonia (orange). The red line indicates the threshold of the baseline level of proficiency in financial literacy: the OECD suggests that students above this threshold of 400 points have financial literacy levels that are sufficient to participate in society (OECD; 2017a).

From Table 2, we observe that the mean posterior predicted FLS for students in Flanders is higher than for those in Wallonia: 541.4 compared to 516.6. Considering the minimum and maximum values for the two regions, we note that the scores of the students in Wallonia are more centered around the mean than those of the students in Flanders. The higher mean for Flanders is also visible in Figure 3, where the peak of the light blue area (Flanders) is to the right of the peak of the orange area (Wallonia). As the OECD provides an interpretable scale for differences in FLS, we can interpret the different results for Flanders and Wallonia. In particular, based on OECD computations, a difference of 40 points equals an entire school year of learning. Hence, we can state that, on average, students in Flanders are roughly half a school year ahead of students in Wallonia with respect to their standardized knowledge in financial literacy.

Table 2: Summary Statistics of the Predicted FLS

                         Mean    SD     Minimum   Median   Maximum   N
Predicted FLS Flanders   541.4   96.8   157.3     559.9    750.0     5675
Predicted FLS Wallonia   516.6   95.7   188.3     530.3    731.7     3976
The quality of the predictions of FLS in Wallonia is not directly testable, as the outcome is unobserved in this region. However, the quality of these predictions is likely to be very similar to the one obtained through cross-validation if the underlying assumption holds that the 'technology' and the efficiency with which teaching and environmental inputs are transformed into learning output are comparable between Flanders and Wallonia. As we argued in Section 3, this assumption is likely to hold.

To investigate this assumption more thoroughly, we run a series of robustness checks using different outcome variables. Given the high correlation between the outcome and math and reading scores, and given the standardized way in which all these outcomes are recorded, the most straightforward way to implement this analysis is to use these two variables as outcomes and to assess whether the BART model built using Flemish data performs well when tested on Walloon data. The results of these additional analyses are reported in Appendix C. We find no relevant differences between the performance of the model built using Flemish data for both training and testing and that of the model built using Flemish data for training and Walloon data for testing. This robustness check seems to confirm that the 'technology' and the efficiency in Flanders and Wallonia are comparable.
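This cross-region robustness check can be illustrated schematically: train a model in one "region" and test it both in-region and on covariate-shifted data generated by the same outcome 'technology'. The data below are entirely synthetic; the region labels are only illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Schematic robustness check: same outcome 'technology' y = f(X) + noise
# in both regions, but shifted covariate distributions. Synthetic data.
rng = np.random.default_rng(7)

def simulate(n, shift):
    X = rng.normal(loc=shift, size=(n, 5))            # shifted covariates
    y = X @ np.array([3.0, -2.0, 1.0, 0.5, 0.0]) + rng.normal(size=n)
    return X, y

X_a, y_a = simulate(2000, shift=0.0)                  # "Flanders" (training)
X_b, y_b = simulate(1000, shift=0.3)                  # "Wallonia" (shifted)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_a[:1500], y_a[:1500])
r2_in = r2_score(y_a[1500:], model.predict(X_a[1500:]))      # in-region test
r2_cross = r2_score(y_b, model.predict(X_b))                 # cross-region test
```

If the two R^2 values are comparable, the model transfers well despite the covariate shift, which is the logic of the Appendix C checks using math and reading scores as observable stand-in outcomes.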
Policy-makers often want to target only those who are really in need of an intervention, as they often face budget constraints that prevent them from enacting policies for all constituents. These groups can be identified, in cases where the values of a certain outcome of interest are observed, by simply looking at the outcome's distribution. However, it is often the case that such outcomes may be hidden or not observed for a part of the study population. In such scenarios, machine learning predictions may furnish a valuable tool to policy-makers. In the case of our application, we face a prediction problem: imagine that policy-makers want to provide additional learning material to more vulnerable students in a region where FLS are not observed. Machine learning can provide a useful tool to draw such predictions.

In order to detect the most vulnerable students (i.e., those with low predicted FLS), we use the procedure introduced in Section 2.1. Uncovering subpopulations of students that differ in terms of the distribution of a certain outcome is central to boosting schools' and teachers' effectiveness, and various strategies have been applied to this task (Masci et al.; 2019). Interestingly, Bayesian machine learning has been applied to discover the relationship between financial literacy and self-reported economic outcomes (Puelz and Puelz; 2020). However, to the extent of our knowledge, this is the first time that a supervised machine learning technique is used to predict students' FLS.

In a previous study on OECD data, De Beckker et al. (2019) identify four groups of adults with different financial literacy levels through an unsupervised machine learning algorithm (i.e., k-means). The advantage of the use of machine learning is clear, as it allows researchers to partition a large heterogeneous group of people into subgroups according to their FLS.
However, while simple unsupervised machine learning algorithms such as k-means are useful to get a first impression of the different subgroups and their socioeconomic characteristics, they do not provide immediate information on the robustness of the outcomes. Moreover, these simple algorithms are not capable of making predictions for out-of-sample regions. This is particularly relevant when dealing with standardized tests, such as PISA, which often only cover subpopulations of a country, such as specific regions.

First, we train BART on the sample of Flemish students for which we have data on FLS. Next, we predict the FLS for the samples of students in both Flanders and Wallonia. After that, we compare the posterior predicted values for each student with the mean of the predicted posterior values. Finally, we detect the students for which these values lie outside the credibility intervals for the mean values. Here, we define outliers as those observations with predicted FLS more than two standard deviations below the mean, following Miller (1991). In Appendix D, we check the results we would obtain when defining the outliers as observations with predicted FLS two absolute deviations smaller than the median, as suggested by Leys et al. (2013).

To get a better understanding of which variables best explain low predicted FLS, we generate a dummy variable that assumes value 1 if the student has a low predicted FLS and 0 otherwise. Finally, we build a series of conditional inference trees (Hothorn et al.; 2006) to get a descriptive sense of the drivers of low financial literacy.
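The outlier-detection step can be sketched as follows; the posterior draws are simulated here, whereas in the paper they come from BART's posterior predictive distribution, and all numbers are hypothetical.

```python
import numpy as np

# Outlier-detection sketch: flag students whose posterior mean prediction
# lies more than two standard deviations below the grand mean, following
# Miller (1991). Posterior draws are simulated (rows = draws, cols = units).
rng = np.random.default_rng(3)
n_draws, n_units = 1000, 500
true_scores = rng.normal(540, 95, n_units)            # hypothetical FLS
draws = true_scores + rng.normal(0, 30, (n_draws, n_units))

unit_means = draws.mean(axis=0)                       # posterior mean per student
grand_mean, sd = unit_means.mean(), unit_means.std(ddof=1)
vulnerable = unit_means < grand_mean - 2 * sd         # low-FLS flag
print("flagged students:", int(vulnerable.sum()))
```

The median-based alternative of Leys et al. (2013) would instead flag units below `median - 2 * MAD`, with `MAD = np.median(np.abs(unit_means - np.median(unit_means)))`.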
This fit-the-fit procedure has been widely applied in recent years to obtain interpretable results on the main drivers of heterogeneity in an outcome (Lee et al.; 2018; Bargagli-Stoffi et al.; 2019; Bargagli-Stoffi and Gnecco; 2020; Lee et al.; 2020).

Figure 4 depicts the conditional tree for the overall sample, Figure 5 shows the conditional trees for Flanders and Wallonia respectively, Figure 6 depicts the trees for students in grades 7 to 9 and 10 to 12, and Figure 7 shows the trees for students in general and vocational education.

As shown in Figure 4, reading (PV1READ) and math (PV1MATH) scores, the language spoken at home (BELANGN), study track (ISCEDD) and grade (ST001D01T) are important variables distinguishing groups of students in our entire sample. For students from the lower grades (grades 7 to 9) with reading and math scores below the split thresholds reported in Figure 4, the predictive probability of low FLS is 95 percent, while a higher PISA math score is associated with a markedly lower predictive probability of low FLS. (The results obtained with the alternative outlier definition of Appendix D are not significantly different from the ones depicted in this Section; thus, we argue that our results are robust to different definitions of outlying predictions.)

In Flanders, the immigration background (IMMIG) also seems to matter: among first-generation migrants, the predictive probability of having low FLS is 82 percent. In Wallonia, the proportion of students with low FLS is again the largest (94 percent) among those with low reading and math scores. Moreover, in Wallonia, mother's education (MISCED) seems to play a distinctive role: students with better reading and math scores and with a highly educated mother only have a predictive probability of low FLS of 1 percent.

In Figure 6, we split by grade, grouping grades 7 to 9 (left panel) and grades 10 to 12 (right panel). Indeed, in Flanders, certain conditions change after grade 9 (teacher qualification requirements, etc.); hence, it is interesting to investigate the drivers of heterogeneity by grade. In grades 7 to 9, the largest predictive proportion of students with low FLS (95 percent) is again found among those with lower reading and math scores. In grades 10 to 12, for students with a low reading score, father's education (FISCED) seems to have a significant influence on FLS: for students whose fathers have a higher-education background, the predictive probability of low FLS lies between 1 and 20 percent, depending on whether the student has a math score above or below 399.714.

In Figure 7, we split by track. In the case of general education (left panel), students are first split by grade (grades 7, 8 and 12 versus grades 9 to 11) and then by their reading score; the corresponding tree for vocational education is shown in the right panel.

Figure 4: Conditional tree for the entire sample (n = 9651). Within each leaf, the histogram depicts in red the percentage of units that have a low financial literacy score and, next to it, the percentage of units with a not-low financial literacy score within the same leaf.

Figure 5: (Left) Conditional tree for Flanders (n = 5675). (Right) The corresponding tree for Wallonia (n = 3976).

Figure 6: (Left) Conditional tree for students in grades 7 to 9 (n = 3394). (Right) The corresponding tree for students in grades 10 to 12 (n = 6257).

Figure 7: (Left) Conditional tree for general education. (Right) The corresponding tree for vocational education.
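The fit-the-fit step can be sketched as follows. A shallow CART tree from scikit-learn stands in for the conditional inference trees of Hothorn et al. (2006) used in the paper; the data, coefficients and low-FLS rule are synthetic, with PISA-style variable names.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit-the-fit sketch: a shallow classification tree describes which
# predictors drive the low-FLS dummy. Synthetic data; sklearn's CART
# tree is a stand-in for conditional inference trees (R package partykit).
rng = np.random.default_rng(11)
n = 2000
X = rng.normal(size=(n, 3))
names = ["PV1READ", "PV1MATH", "WEALTH"]
pred_fls = 540 + 60 * X[:, 0] + 40 * X[:, 1] + rng.normal(0, 20, n)
low_fls = (pred_fls < pred_fls.mean() - 2 * pred_fls.std()).astype(int)

tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=30, random_state=0)
tree.fit(X, low_fls)
print(export_text(tree, feature_names=names))         # human-readable splits
```

The printed rules play the role of Figures 4 to 7: each leaf reports the share of flagged students among those satisfying the split conditions.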
In this Section, we report the results of the sensitivity-of-predictions analysis. In Section 2.2, we introduced the methodology that we employ here to assess how an unobserved predictor, uncorrelated with the observed ones, could impact the predicted FLS and the performance of the model. In particular, the sensitivity function that we developed in the statistical software R identifies how many unit-level predictions are statistically different between the original model and the model with the synthetic predictor, and how much the synthetic predictor would improve the overall performance of the model.

Starting from the unit-level predictions that we introduced in Section 2.2, we can get a sense of how much the synthetic predictor would affect the model predictions over a varying range of correlations between the synthetic predictor and the outcome; in the case of our application, this range extends up to a correlation of 0.5. Here, we test how many of the predicted values of the augmented model would be statistically different from the predicted values of the original model. We find no evidence of statistically significantly different predictions up to a correlation of 0.5, where just 0.01% of the unit-level predictions of the augmented model are significantly different.

Moreover, Figure 8 depicts the results for both the RMSE and the R^2 of the original and augmented models, with their confidence intervals, as the correlation between the synthetic predictor R and the outcome Y increases. In both cases, even when the synthetic predictor has a correlation of 0.5 with the outcome, there is no significant difference between the RMSE of the original and the augmented model. The R^2 is significantly improved only when the correlation reaches 0.5. This evidence can be interpreted as follows: in order to obtain a significant increase in the R^2, we would need an explanatory variable (completely unrelated to the observed ones) with at least a 0.5 correlation with the outcome. To put things into perspective, a predictor with such a correlation would rank among the three best predictors in the model. We argue that, in the context of our application, such an unobserved variable would be very hard, if not impossible, to find.
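One way to construct a synthetic predictor with a target correlation to the outcome and (exactly) zero sample correlation with the observed predictors is via residualization. The sketch below is an illustrative reconstruction, not the authors' R implementation; the function name and data are hypothetical.

```python
import numpy as np

def synthetic_predictor(y, X, rho, rng):
    """Return a vector r with sample Corr(r, y) = rho and exactly zero
    sample correlation with every column of X (illustrative sketch)."""
    n = len(y)
    Z = np.column_stack([np.ones(n), X])          # intercept + predictors
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    e_y = y - Z @ beta                            # part of y orthogonal to X
    e_y = e_y / np.linalg.norm(e_y)
    z = rng.standard_normal(n)                    # random direction ...
    B = np.column_stack([Z, e_y])
    gamma, *_ = np.linalg.lstsq(B, z, rcond=None)
    z = z - B @ gamma                             # ... made orthogonal to X, e_y
    z = z / np.linalg.norm(z)
    c = np.corrcoef(e_y, y)[0, 1]                 # max attainable correlation
    a = rho / c                                   # feasible only if rho <= c
    return a * e_y + np.sqrt(max(0.0, 1.0 - a**2)) * z

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
y = X @ np.array([1.0, 1.0, 1.0]) + rng.normal(scale=np.sqrt(3), size=200)
r = synthetic_predictor(y, X, rho=0.3, rng=rng)
```

Note that the attainable range of rho is bounded by the correlation between y and its residual from X, consistent with the paper's remark that the feasible range of correlations depends on the data at hand.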
Figure 8: (Left) Estimated RMSE for the augmented model with confidence intervals (blue) and for the original model (red). (Right) Estimated R^2 for the augmented model with confidence intervals (blue) and for the original model (red). The 99% confidence intervals were constructed following Cohen et al. (2014).

As argued before, the interpretation of the sensitivity-of-predictions analysis is that, even if we could introduce into the model a new, unobserved predictor with high explanatory power, the unit-level predictions as well as the overall performance of the model would remain fairly stable. (The range of correlations considered depends on the data at hand, as we need to guarantee that R is a symmetric, positive definite matrix.)

Conclusion

In this paper, we introduce a novel sensitivity analysis to assess the robustness of our preferred predictive model in terms of the stability of its predictions and performance. The method that we propose is general enough to be applied to both Bayesian techniques (such as the Bayesian Additive Regression Trees algorithm) and frequentist techniques (such as the Random Forest algorithm). In particular, we propose a novel way to partially answer policy-relevant questions such as: "how much would an unobserved predictor impact the model's predictions and its performance?". To do so, we develop a novel methodology to construct a synthetic predictor that is correlated with the outcome but uncorrelated with the observed predictors. By generating the synthetic predictor in this way, we are able to interpret the results from the sensitivity analysis as an indication of how much an exogenous predictor with high explanatory value would add to the model. If the addition of the synthetic predictor does not result in significant differences, then we can safely assume that our predictive model is capturing enough signal from the training data.
Moreover, through our novel methodology, we are able to use a wide range of correlations between the outcome and the synthetic predictor. This enables researchers to check whether or not there is a break-point, given by the explanatory power of the synthetic predictor R, with respect to FLS predictions. (The best predictors are chosen according to the built-in variable importance measure of BART. For a review of various tree-based variable importance measures and their pitfalls, we refer the reader to James et al. (2013) and the recent contribution by Gottard et al. (2020).)

We find that the predicted values of FLS for students in Flanders are, on average, somewhat larger than those for students in Wallonia. Nevertheless, the distribution of FLS in Flanders is more extreme (i.e., lower minimum, larger maximum) compared to Wallonia.

Next, we estimate the posterior predicted observations available for each student, which also allows us to identify outliers (i.e., students for which the FLS are situated outside the 95% credibility intervals for the mean values). Our results show that the predictive probability of having low FLS is the largest for students with lower scores on reading and math in the PISA test. This corroborates the findings of Mancebón et al. (2019), who argue that the development of financial abilities of students is mediated by their mathematical skills. Another interesting observation is that family background also has a key influence. Students with the lowest predicted FLS are often students from families where the school language is not spoken at home. Particularly in Flanders, being a first-generation immigrant or not is key for having low FLS. The educational background of parents is another important predictor.
These observations are in line with the existing literature suggesting that the groups most vulnerable to low FLS are often situated among those with a lower socioeconomic background (Gramațki; 2017; Riitsalu and Põder; 2016; De Beckker et al.; 2019).

Finally, we confirm that our novel approach is robust with respect to potential unobserved variables. The inclusion of a synthetic predictor highly correlated with the outcome does not result in a significant difference between the performance measures of the original model and those of the augmented model. From a policy perspective, our novel approach is highly interesting, as it not only allows us to predict outcomes for certain countries (or regions within countries, as in the case of our application) with missing data, but also introduces the possibility of detecting observations that fall well outside the mean outcomes. In this respect, policy-makers can detect which factors drive low results. We applied our model to PISA financial literacy data; however, the same approach can also be applied to other large administrative datasets.

References
Athey, S. (2019). The Impact of Machine Learning on Economics, University of Chicago Press, pp. 507-547.

Athey, S. and Imbens, G. W. (2019). Machine learning methods that economists should know about, Annual Review of Economics.

Bargagli-Stoffi, F. J. (2020). Essays on applied machine learning.

Bargagli-Stoffi, F. J., De Witte, K. and Gnecco, G. (2019). Heterogeneous causal effects with imperfect compliance: a novel Bayesian machine learning approach, arXiv preprint arXiv:1905.12707.

Bargagli-Stoffi, F. J. and Gnecco, G. (2020). Causal tree with instrumental variable: An extension of the causal tree framework to irregular assignment mechanisms, International Journal of Data Science and Analytics (3): 315-337.

Bargagli-Stoffi, F. J., Niederreiter, J. and Riccaboni, M. (2020). Supervised learning for the prediction of firm dynamics, arXiv preprint arXiv:2009.06413.

Bargagli-Stoffi, F. J., Riccaboni, M. and Rungi, A. (2020). Machine learning for zombie hunting. Firms' failures, financial constraints, and misallocation, FEB Research Report Department of Economics DPS20.06.

Breiman, L. (2001). Random forests, Machine Learning (1): 5-32.

Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984). Classification and regression trees, Belmont, CA: Wadsworth & Brooks.

Buuren, S. v. and Groothuis-Oudshoorn, K. (2010). mice: Multivariate imputation by chained equations in R, Journal of Statistical Software pp. 1-68.

Chalfin, A., Danieli, O., Hillis, A., Jelveh, Z., Luca, M., Ludwig, J. and Mullainathan, S. (2016). Productivity and selection of human capital with machine learning, American Economic Review (5): 124-27.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C. and Newey, W. (2017). Double/debiased/Neyman machine learning of treatment effects, American Economic Review (5): 261-65.

Chipman, H. A., George, E. I., McCulloch, R. E. et al. (2010). BART: Bayesian additive regression trees, The Annals of Applied Statistics (1): 266-298.

Cohen, P., West, S. G. and Aiken, L. S. (2014). Applied multiple regression/correlation analysis for the behavioral sciences, Psychology Press.

Cordero, J. M. and Pedraja, F. (2018). The effect of financial education training on the financial literacy of Spanish students in PISA, Applied Economics (16): 1679-1693.

De Beckker, K., De Witte, K. and Van Campenhout, G. (2019). Identifying financially illiterate groups: an international comparison, International Journal of Consumer Studies (5): 490-501.

Delen, D. (2010). A comparative analysis of machine learning techniques for student retention management, Decision Support Systems (4): 498-506.

Gelman, A., Carlin, J. B., Stern, H. S. and Rubin, D. B. (2014). Bayesian data analysis (vol. 2), Boca Raton, FL: Chapman.

Gottard, A., Vannucci, G. and Marchetti, G. M. (2020). A note on the interpretation of tree-based regression models, Biometrical Journal.

Gramațki, I. (2017). A comparison of financial literacy between native and immigrant school students, Education Economics (3): 304-322.

Hernández, B., Raftery, A. E., Pennington, S. R. and Parnell, A. C. (2018). Bayesian additive regression trees using Bayesian model averaging, Statistics and Computing (4): 869-890.

Hill, J. L. (2011). Bayesian nonparametric modeling for causal inference, Journal of Computational and Graphical Statistics (1): 217-240.

Hooker, G. (2004). Diagnostics and extrapolation in machine learning, Stanford University Press.

Hothorn, T., Hornik, K. and Zeileis, A. (2006). Unbiased recursive partitioning: A conditional inference framework, Journal of Computational and Graphical Statistics (3): 651-674.

Ichino, A., Mealli, F. and Nannicini, T. (2008). From temporary help jobs to permanent employment: what can we learn from matching estimators and their sensitivity?, Journal of Applied Econometrics (3): 305-327.

James, G., Witten, D., Hastie, T. and Tibshirani, R. (2013). Tree-based methods, An Introduction to Statistical Learning, Springer, pp. 303-335.

Kapelner, A. and Bleich, J. (2013). bartMachine: Machine learning with Bayesian additive regression trees, arXiv preprint arXiv:1312.2171.

Kapelner, A. and Bleich, J. (2016). bartMachine: Machine learning with Bayesian additive regression trees, Journal of Statistical Software, Articles (4): 1-40.

Kleinberg, J., Ludwig, J., Mullainathan, S. and Obermeyer, Z. (2015). Prediction policy problems, American Economic Review (5): 491-95.

Kotsiantis, S. B. (2012). Use of machine learning techniques for educational proposes: a decision support system for forecasting students' grades, Artificial Intelligence Review (4): 331-344.

Kotsiantis, S., Pierrakeas, C. and Pintelas, P. (2004). Predicting students' performance in distance learning using machine learning techniques, Applied Artificial Intelligence (5): 411-426.

Lee, K., Bargagli-Stoffi, F. J. and Dominici, F. (2020). Causal rule ensemble: Interpretable inference of heterogeneous treatment effects, arXiv preprint arXiv:2009.09036.

Lee, K., Small, D. S. and Dominici, F. (2018). Discovering effect modification and randomization inference in air pollution studies, arXiv preprint arXiv:1802.06710.

Leys, C., Ley, C., Klein, O., Bernard, P. and Licata, L. (2013). Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median, Journal of Experimental Social Psychology (4): 764-766.

Linero, A. R. (2018). Bayesian regression trees for high-dimensional prediction and variable selection, Journal of the American Statistical Association (522): 626-636.

Linero, A. R. and Yang, Y. (2018). Bayesian regression tree ensembles that adapt to smoothness and sparsity, Journal of the Royal Statistical Society: Series B (Statistical Methodology) (5): 1087-1110.

Little, R. J. and Rubin, D. B. (2019). Statistical analysis with missing data, Vol. 793, John Wiley & Sons.

Liu, F. T., Ting, K. M. and Zhou, Z.-H. (2012). Isolation-based anomaly detection, ACM Transactions on Knowledge Discovery from Data (TKDD) (1): 1-39.

Logan, B. R., Sparapani, R., McCulloch, R. E. and Laud, P. W. (2019). Decision making and uncertainty quantification for individualized treatments using Bayesian additive regression trees, Statistical Methods in Medical Research (4): 1079-1093.

Longobardi, S., Pagliuca, M. M. and Regoli, A. (2018). Can problem-solving attitudes explain the gender gap in financial literacy? Evidence from Italian students' data, Quality & Quantity: 1677-1705.

Makridakis, S., Wheelwright, S. C. and Hyndman, R. J. (2008). Forecasting methods and applications, John Wiley & Sons.

Mancebón, M.-J., Ximénez-de-Embún, D. P., Mediavilla, M. and Gómez-Sancho, J.-M. (2019). Factors that influence the financial literacy of young Spanish consumers, International Journal of Consumer Studies: 227-235.

Masci, C., Paganoni, A. M. and Ieva, F. (2019). Semiparametric mixed effects models for unsupervised classification of Italian schools, Journal of the Royal Statistical Society: Series A (Statistics in Society) (4): 1313-1342.

Miller, J. (1991). Reaction time analysis with outlier exclusion: Bias varies with sample size, The Quarterly Journal of Experimental Psychology (4): 907-912.

Mullainathan, S. and Spiess, J. (2017). Machine learning: an applied econometric approach, Journal of Economic Perspectives (2): 87-106.

Murray, J. S. (2017). Log-linear Bayesian additive regression trees for categorical and count responses, arXiv preprint arXiv:1701.01503.

Nethery, R. C., Mealli, F., Dominici, F. et al. (2019). Estimating population average causal effects in the presence of non-overlap: The effect of natural gas compressor station exposure on cancer mortality, The Annals of Applied Statistics (2): 1242-1267.

OECD (2017a). PISA 2015 Results (Volume IV): Students' Financial Literacy, Technical report.

OECD (2017b). PISA 2015 Technical Report, Technical report.

Pesando, L. M. (2018). Does financial literacy increase students' perceived value of schooling?, Education Economics (5): 488-515.

Puelz, D. and Puelz, R. (2020). Financial literacy and perceived economic outcomes.

Riitsalu, L. and Põder, K. (2016). A glimpse of the complexity of factors that influence financial literacy, International Journal of Consumer Studies (6): 722-731.

Rosenbaum, P. R. and Rubin, D. B. (1983a). Assessing sensitivity to an unobserved binary covariate in an observational study with binary outcome, Journal of the Royal Statistical Society: Series B (Methodological) (2): 212-218.

Rosenbaum, P. R. and Rubin, D. B. (1983b). The central role of the propensity score in observational studies for causal effects, Biometrika (1): 41-55.

Shmueli, G. et al. (2010). To explain or to predict?, Statistical Science (3): 289-310.

Universiteit Gent (2017). Financial literacy in 15-year-olds. Vlaams Rapport PISA 2015, Technical report.

Vaughan, T. S. and Berry, K. E. (2005). Using Monte Carlo techniques to demonstrate the meaning and implications of multicollinearity, Journal of Statistics Education (1).

Wager, S. and Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests, Journal of the American Statistical Association (523): 1228-1242.

Online Appendix
A Data
Table A1: Variables used from the PISA data

PISA Code    Variable

Student Characteristics
ST001D01T    International Grade
ST004D01T    Gender
AGE          Age
ISCEDD       Study Track: ISCED Designation
ISCEDO       Study Track: ISCED Orientation
BELANGN      Speaks Belgian Language at Home

Socioeconomic Status
HEDRES       Educational Resources at Home
WEALTH       Family Wealth Index (Economic Possessions)
ST013Q01TA   Number of Books at Home
IMMIG        Immigration Status
MISCED       Mother's Education (ISCED)
FISCED       Father's Education (ISCED)
BMMJ1        Mother's Job (ISEI)
BFMJ2        Father's Job (ISEI)
EMOSUPS      Parents Emotional Support

Achievement and Attitude
PV1MATH      Plausible Value 1 in Mathematics
PV1READ      Plausible Value 1 in Reading
REPEAT       Grade Repetition
OUTHOURS     Out-of-School Study Time per Week
MMINS        Mathematics Learning Time at School
LMINS        Language Learning Time at School
ANXTEST      Personality: Test Anxiety
MOTIVAT      Achievement Motivation

School Characteristics
SC001Q01TA   School Community (Location)
SC048Q01NA   Share of Students With a Different Heritage Language
SC048Q02NA   Share of Students With Special Needs
SC048Q03NA   Share of Socioeconomically Disadvantaged Students
SCHSIZE      School Size
CLSIZE       Class Size
RATCMP1      Number of Available Computers per Student
LEADPD       Teacher Professional Development
SCHAUT       School Autonomy
EDUSHORT     Shortage of Educational Material
STRATIO      Student-Teacher Ratio

Table A2: Summary Statistics of the Predictors
                                                   Flanders                 Wallonia          Difference in Means
                                              Mean    SD     N         Mean    SD     N            p-value
Student Characteristics
International Grade                            9.72    0.52   5675      9.44    0.75   3857        0.000
Gender                                         1.51    0.50   5675      1.51    0.50   3976        0.976
Age                                           15.85    0.29   5675     15.84    0.29   3976        0.725
Study Track: ISCED Designation                 1.43    0.81   5675      1.24    0.64   3976        0.000
Study Track: ISCED Orientation                 2.09    1.00   5675      1.55    0.89   3976        0.000
Speaks Belgian Language at Home                0.91    0.29   5595      0.87    0.34   3929        0.000
Socioeconomic Status
Educational Resources at Home                  0.27    0.91   5585     -0.15    0.90   3930        0.000
Family Wealth Index (Economic Possessions)     0.27    0.74   5606     -0.04    0.85   3936        0.000
Number of Books at Home                        3.02    1.51   5557      3.32    1.53   3905        0.000
Immigration Status                             1.20    0.54   5512      1.32    0.66   3851        0.000
Mother's Education (ISCED)                     4.62    1.39   5376      4.63    1.53   3774        0.882
Father's Education (ISCED)                     4.53    1.44   5221      4.52    1.58   3654        0.901
Mother's Job (ISEI)                           47.05   22.75   4763     47.30   22.10   3159        0.635
Father's Job (ISEI)                           46.21   21.38   4713     46.21   22.16   3318        0.996
Parents Emotional Support                     -0.01    0.96   5458      0.01    0.96   3801        0.302
Achievement and Attitude
Plausible Value 1 in Mathematics             521.61   97.89   5675    494.78   90.71   3976        0.000
Plausible Value 1 in Reading                 510.95  100.67   5675    490.46   95.81   3976        0.000
Grade Repetition                               0.24    0.43   5457      0.42    0.49   3804        0.000
Out-of-School Study Time per Week             14.24   10.21   4245     16.27   11.61   3216        0.000
Mathematics Learning Time at School          191.28   86.43   5273    219.74   81.86   3668        0.000
Language Learning Time at School             187.24   84.63   5275    227.10   83.49   3665        0.000
Personality: Test Anxiety                     -0.30    0.97   5416     -0.01    1.01   3775        0.000
Achievement Motivation                        -0.64    0.83   5417     -0.28    0.87   3767        0.000
School Characteristics
School Community (Location)                    2.87    0.81   5556      3.24    1.11   3681        0.000
Share of Students With a Different
  Heritage Language                           16.41   23.39   5406     26.19   30.29   2812        0.000
Share of Students With Special Needs          19.58   20.14   5154     18.97   21.94   3005        0.208
Share of Socioeconomically
  Disadvantaged Students                      20.02   22.11   5248     33.06   30.38   3120        0.000
School Size                                  695.86  331.42   5405    771.27  327.30   3413        0.000
Class Size                                    18.89    5.14   5592     21.01    3.77   3578        0.000
Number of Available Computers per Student      1.25    0.89   5270      0.47    0.33   3251        0.000
Teacher Professional Development               0.10    0.90   5150      0.11    1.07   3433        0.685
School Autonomy                                0.77    0.18   5675      0.59    0.21   3626        0.000
Shortage of Educational Material               0.02    0.87   5488      0.21    0.87   3429        0.000
Student-Teacher Ratio                          9.09    3.21   5325      9.14    2.66   2847        0.454
Note:
Summary statistics of all predictors used in the analysis from the PISA 2015 data for Flanders and Wallonia. SD stands for the standard deviation of the variable. The last column shows the p-value of a two-sample t-test for the equality of means across the regions of Flanders and Wallonia. Language has been grouped for Belgium (1=Dutch, 2=French, 3=German, 4=Other).
                     Mean     SD      Minimum  Median  Maximum  N
Financial Literacy   541.43   112.16  51.81    555.14  901.64   5675
Note:
Summary statistics of Plausible Value 1 in Financial Literacy from PISA 2015 for the Flemish region in Belgium. The OECD constructs ten plausible values for financial literacy using item response theory and latent regression (OECD; 2017b). The analysis in this paper is based on Plausible Value 1. SD stands for the standard deviation of the variable.
Figure A1: Missingness Map

Figure A2: Histogram of the Outcome Variable (PISA Financial Literacy Score). The red line indicates the threshold of the baseline level of proficiency in financial literacy. The OECD suggests that students above this threshold of 400 points have financial literacy levels that are sufficient to participate in society (OECD; 2017a).

Figure A3: PISA Proficiency Levels for Financial Literacy (OECD; 2017a)
B Overlap
Researchers often face potential pitfalls of machine learning predictions when the data on which they train their algorithm are substantially different from the data on which they perform the predictions. We provide a novel methodology to investigate whether there are regions of the predictors' space in the out-of-sample set where the algorithm relies on extrapolation.

In real-world scenarios, one may face the problem of generalizing the results from one sample – for which one observes the distribution of a certain outcome – to a novel sample for which one has no information on the outcome. Generalizing the predictions from the first sample (in-sample predictions) to the new sample for which the outcome is not observed (out-of-sample predictions) can lead to less reliable predictions if there is scarce overlap in the observed predictors in the two samples. A lack of overlap in the support of the in-sample and out-of-sample predictors forces the prediction model to rely heavily on extrapolation in the non-overlap areas.

As highlighted by Hooker (2004), extrapolation can create unrealistic predictions, even at reasonable points of the predictors' space. Hence, we argue that it is critical to get a clear sense of how much the predictive model at hand relies on extrapolation. As a rule of thumb, one should rely less on predictions obtained in subsamples for which the overlap is small.

To evaluate the performance of an algorithm on a new sample different from the sample on which the algorithm is trained, a simple random sampling is commonly performed, which divides the study dataset Ω_s into two samples. One sample, the training sample Ω_{s,tr}, is used for the construction of the algorithm. The other sample, the test sample Ω_{s,te}, is used to assess the performance of the algorithm. Performing random sampling guarantees that the observable predictors in Ω_{s,tr} and Ω_{s,te} are fairly balanced, i.e., the distributions of the predictors X_{Ω_tr} and X_{Ω_te} are similar in both samples.

In real-world scenarios, however, we may face the problem of generalizing the results from one sample for which we observe the distribution of a certain outcome (Ω_s) to a target sample for which we have no information on the outcome (Ω_t). The predictive performance of the algorithm may not carry over when the characteristics of the new sample Ω_t are fairly different from the characteristics of Ω_s. For example, this can be the case if Ω_t and Ω_s are not random subsamples from the same super-population. To tackle this problem, we propose to rely on what we call the overlap score.

B.1 Overlap Score
We define the overlap score as the probability of being in the study dataset conditional on a certain set of predictors. In mathematical terms, this probability is:

$$o(X) = \Pr\big(\mathbb{I}(S) = 1 \mid X = x\big), \qquad \text{where } \mathbb{I}(S) = \begin{cases} 1 & \text{if } i \in \Omega_s, \\ 0 & \text{otherwise.} \end{cases}$$

The estimated overlap score $\hat{o}(X)$ can be obtained using generalized linear models for binary outcomes, such as logistic or probit regressions, or machine learning methods for classification, such as classification trees (Breiman et al.; 1984) and random forests (Breiman; 2001).

We argue that the overlap score provides a useful insight into how much the machine learning technique relies on extrapolation to extend the predictions from the study sample to the target sample. In principle, the idea is to check whether, for any observation in the study population, there is a similar counterpart in the target population. This may be done in various ways by defining different distance measures (e.g., the Mahalanobis distance). The nice feature of the overlap score is that it furnishes an easy and computationally scalable way of performing such an analysis. Indeed, the overlap score can be thought of as a measure of the distance between two units in the study and in the target population. In this respect, the overlap score is similar to its causal inference counterpart, the propensity score (Rosenbaum and Rubin; 1983b).

B.2 Results for multivariate overlap

Here, we propose to use the overlap score to assess whether there are regions of the feature space where there is no overlap between students in Flanders and in Wallonia. This can be easily done by checking whether there are units in the study population with an estimated overlap score lower (greater) than the lowest (greatest) overlap score in the target population. The algorithm that we propose is implemented in such a way that it gives the researcher the possibility to trim away these non-overlapping units.

Figure A4 depicts the results for the overlap score obtained by fitting a logistic regression.
The predicted overlap scores for students in Flanders are shown in light blue, while the complement of the predicted overlap scores for students in Wallonia is shown in orange. As one can see, there is wide overlap between the two distributions, suggesting that there are no regions of the feature space where the predictive algorithm will rely on extrapolation to perform out-of-sample predictions.
Figure A4: Predicted overlap scores for Flanders (light blue) and Wallonia (orange).
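As an illustrative sketch of this procedure, the snippet below estimates the overlap score with a plain logistic regression fitted by gradient ascent and then applies the trimming rule described above: flagging study units whose estimated score falls outside the [min, max] range of the target sample's scores. The function names (`fit_logistic`, `overlap_scores`, `trim_non_overlap`) are our own and not part of the paper's toolbox; in practice any classifier returning calibrated probabilities could replace the hand-rolled regression.

```python
import math

def fit_logistic(X, s, lr=0.1, n_iter=2000):
    """Fit a logistic regression of sample membership s (1 = study
    sample, 0 = target sample) on predictors X by gradient ascent
    on the log-likelihood."""
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)                      # intercept followed by slopes
    for _ in range(n_iter):
        grad = [0.0] * (p + 1)
        for xi, si in zip(X, s):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            err = si - 1.0 / (1.0 + math.exp(-z))
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj + lr * gj / n for wj, gj in zip(w, grad)]
    return w

def overlap_scores(X, w):
    """Estimated overlap score o(x): Pr(unit is in the study sample | x)."""
    return [
        1.0 / (1.0 + math.exp(-(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))))
        for xi in X
    ]

def trim_non_overlap(study_scores, target_scores):
    """Indices of study units whose overlap score lies outside the
    [min, max] range of the target sample's scores (trimming rule)."""
    lo, hi = min(target_scores), max(target_scores)
    return [i for i, o in enumerate(study_scores) if o < lo or o > hi]
```

In the application, `fit_logistic` would be run on the pooled Flemish (S = 1) and Walloon (S = 0) observations, and the trimmed indices would be removed before prediction.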
C Robustness check
To perform this robustness check, we use a reworked version of the 10-fold cross-validation used in the previous section. This procedure assigns nine random folds from the Flemish data to the training sample and one random fold from the Walloon data to the testing sample at each split. The performance of the method is evaluated at each split, and the results are then aggregated as in a usual 10-fold cross-validation.

The results are depicted in Table A4. The adjusted R² for the BARTs built on the different outcomes are comparable to the ones in Table 1 and suggest that there are no relevant differences between the performance of the model built using the Flemish data for both training and testing and the one built using Flemish data for training and Walloon data for testing. In turn, this robustness check seems to confirm that the 'technology' and the efficiency in Flanders and Wallonia are comparable.

Table A4: Summary statistics of the performance of the BART on different outcomes

                        RMSE   MAE   R²
Math Score Outcome
Reading Score Outcome
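The reworked cross-validation described above can be sketched as follows. This is a minimal, hypothetical implementation: `fit` and `score` stand in for the BART fitting and evaluation steps of the paper, which are not shown here, and the function name `cross_region_cv` is ours.

```python
import random
import statistics

def cross_region_cv(flemish, walloon, fit, score, k=10, seed=0):
    """Reworked k-fold CV: at each split, train on k-1 random folds of
    the Flemish data and evaluate on one random fold of the Walloon
    data; results are aggregated as in a usual k-fold CV."""
    rng = random.Random(seed)
    fl, wa = flemish[:], walloon[:]
    rng.shuffle(fl)
    rng.shuffle(wa)
    fl_folds = [fl[i::k] for i in range(k)]   # k folds of the training region
    wa_folds = [wa[i::k] for i in range(k)]   # k folds of the testing region
    results = []
    for i in range(k):
        train = [obs for j in range(k) if j != i for obs in fl_folds[j]]
        model = fit(train)                    # e.g., a BART fit in the paper
        results.append(score(model, wa_folds[i]))
    return statistics.mean(results)
```

For example, with `fit` returning the training-sample mean and `score` computing the RMSE on the held-out Walloon fold, the function returns the cross-validated RMSE of a mean predictor.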
D Changing outliers definition
Below we depict the results of the conditional tree analysis when we define outliers as observations with predicted values more than two absolute deviations below the median. This is a more restrictive definition of outliers.
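Under one plausible reading of this definition – taking the median absolute deviation (MAD) as the deviation measure, which is our assumption rather than something the text specifies – the outlier flag can be computed as:

```python
import statistics

def mad_outliers(predicted, k=2.0):
    """Flag predictions lying more than k (median) absolute deviations
    below the median of the predicted values.

    NOTE: the use of the MAD here is an assumption; the paper only
    says 'two absolute deviations from the median'."""
    med = statistics.median(predicted)
    mad = statistics.median(abs(p - med) for p in predicted)
    return [p < med - k * mad for p in predicted]
```

Only the lower tail is flagged, matching the focus on low financial literacy scores.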
Figure A5: Conditional tree for the entire sample. Within each leaf are depicted in red the histogram of the percentage of units that have a low financial literacy score, and next to it the percentage of units with a not-low financial literacy score within the same leaf.

Figure A6: (Left) Conditional tree for Flanders. (Right) The corresponding tree for Wallonia.

Figure A8: (Left) Conditional tree for general education. (Right) The corresponding tree for vocational education.