Quantifying Model Complexity via Functional Decomposition for Better Post-Hoc Interpretability
Christoph Molnar, Giuseppe Casalicchio, and Bernd Bischl
Department of Statistics, LMU Munich, Ludwigstr. 33, 80539 Munich, Germany
[email protected]
Abstract.
Post-hoc model-agnostic interpretation methods such as partial dependence plots can be employed to interpret complex machine learning models. While these interpretation methods can be applied regardless of model complexity, they can produce misleading and verbose results if the model is too complex, especially w.r.t. feature interactions. To quantify the complexity of arbitrary machine learning models, we propose model-agnostic complexity measures based on functional decomposition: number of features used, interaction strength and main effect complexity. We show that post-hoc interpretation of models that minimize the three measures is more reliable and compact. Furthermore, we demonstrate the application of these measures in a multi-objective optimization approach which simultaneously minimizes loss and complexity.
Keywords:
Model Complexity · Interpretable Machine Learning · Explainable AI · Accumulated Local Effects · Multi-Objective Optimization
Machine learning models are optimized for predictive performance, but it is often required to understand models, e.g., to debug them, gain trust in the predictions, or satisfy regulatory requirements. Many post-hoc interpretation methods either quantify effects of features on predictions, compute feature importances, or explain individual predictions; see [17, 24] for more comprehensive overviews. While model-agnostic post-hoc interpretation methods can be applied regardless of model complexity [30], their reliability and compactness deteriorate when models use a high number of features, have strong feature interactions, and have complex feature main effects. Therefore, model complexity and interpretability are deeply intertwined, and reducing complexity can help to make model interpretation more reliable and compact. Model-agnostic complexity measures are needed to strike a balance between interpretability and predictive performance [4, 31].
Contributions.
We propose and implement three model-agnostic measures of machine learning model complexity which are related to post-hoc interpretability. To the best of our knowledge, these are the first model-agnostic measures that describe the global interaction strength, the complexity of main effects, and the number of features. We apply the measures to different datasets and machine learning models. We argue that minimizing these three measures improves the reliability and compactness of post-hoc interpretation methods. Finally, we illustrate the use of our proposed measures in multi-objective optimization.
In this section, we introduce the notation, review related work, and describe the functional decomposition on which we base the proposed complexity measures.
Notation:
We consider machine learning prediction functions f : ℝ^p → ℝ, where f(x) is a prediction (e.g., regression output or a classification score). For the decomposition of f, we write f_S : ℝ^|S| → ℝ, S ⊆ {1, …, p}, to denote a function that maps a vector x_S ∈ ℝ^|S| with a subset of features to a marginal prediction. If subset S contains a single feature j, we write f_j. We refer to the training data of the machine learning model with the tuples D = {(x^(i), y^(i))}_{i=1}^n and refer to the value of the j-th feature from the i-th instance as x_j^(i). We write X_j to refer to the j-th feature as a random variable.

Complexity and Interpretability Measures:
In the literature, model complexity and (lack of) model interpretability are often equated. Many complexity measures are model-specific, i.e., only models of the same class can be compared (e.g., decision trees). Model size is often used as a measure for interpretability (e.g., number of decision rules, tree depth, number of non-zero coefficients) [3, 16, 20, 22, 31–34]. Akaike's Information Criterion (AIC) and the Bayesian Information Criterion (BIC) are more widely applicable measures for the trade-off between goodness of fit and degrees of freedom. In [26], the authors propose model-agnostic measures of model stability. In [27], the authors propose explanation fidelity and stability of local explanation models. Further approaches measure interpretability based on experimental studies with humans, e.g., whether humans can predict the outcome of the model [8, 13, 20, 28, 35].
Functional Decomposition:
Any high-dimensional prediction function can be decomposed into a sum of components with increasing dimensionality:

    f(x) = f_0 + Σ_{j=1}^p f_j(x_j) + Σ_{j<k} f_{j,k}(x_j, x_k) + … + f_{1,…,p}(x_1, …, x_p),

where f_0 is the intercept, the first sum contains the main effects f_j, and the remaining sums contain interaction effects of increasing order.

To estimate the number of features used (NF), we perturb each feature for a sample of instances (by sampling x_j from other instances from D) and observe whether the predictions change. If the prediction of any sample changes, the feature was used.

Algorithm 1: Number of Features Used (NF)
Input: number of samples M, data D
NF = 0
for j ∈ {1, …, p} do
    Draw M instances {x^(m)}_{m=1}^M from dataset D
    Create {x^(m)*}_{m=1}^M as a copy of {x^(m)}_{m=1}^M
    for m ∈ {1, …, M} do
        Sample x_j^(new) from {x_j^(i)}_{i=1}^n with the constraint that x_j^(new) ≠ x_j^(m)
        Set x_j^(m)* = x_j^(new)
    if f(x^(m)*) ≠ f(x^(m)) for any m ∈ {1, …, M} then NF = NF + 1
return NF

We tested the NF heuristic with the Boston Housing data. We trained decision trees (CART) with varying maximum depths, leading to 1, 2 and 4 features used, and an L1-regularized linear model with varying penalty λ, leading to 0, 2, 3, 4, 11 and 13 features used. For each model, we estimated NF with sample sizes M ∈ {10, 50, 500} and repeated each estimation 100 times. For the elastic net models, NF was always equal to the number of non-zero weights. For CART, the mean absolute differences between NF and the number of features used in the trees were 0.280 (M = 10), 0.020 (M = 50) and 0.000 (M = 500).

Interactions between features mean that the prediction cannot be expressed as a sum of independent feature effects; rather, the effect of a feature depends on the values of other features [24]. We propose to measure interaction strength as the scaled approximation error between the ALE main effect model and the prediction function f. Based on the ALE decomposition, the ALE main effect model is defined as the sum of first order ALE effects:

    f_ALE,1st(x) = f_0 + f_1,ALE(x_1) + … + f_p,ALE(x_p).

We define interaction strength (IAS) as the approximation error measured with loss L:

    IAS = E[L(f, f_ALE,1st)] / E[L(f, f_0)] ≥ 0,

where f_0 is the mean of the predictions and can be interpreted as the functional decomposition with all feature effects set to zero. With the L2 loss, IAS equals one minus the R² measure in which the y^(i) are replaced with f(x^(i)):

    IAS = Σ_{i=1}^n (f(x^(i)) − f_ALE,1st(x^(i)))² / Σ_{i=1}^n (f(x^(i)) − f_0)² = 1 − R².

If IAS = 0, then L(f, f_ALE,1st) = 0, which means that the first order ALE model perfectly approximates f and the model has no interactions.

To determine the average shape complexity of the ALE main effects f_j,ALE, we propose the main effect complexity (MEC) measure. For a single ALE main effect, we define MEC_j as the number of parameters needed to approximate the curve with piece-wise linear models. For the entire model, MEC is the average MEC_j over all main effects, weighted with their variance. Figure 1 shows an ALE plot (= main effect) and its approximation with two linear segments.

Fig. 1. ALE curve (solid line) approximated by two linear segments (dotted line).

We use piece-wise linear regression to approximate the ALE curve. Within the segments, linear models are estimated with ordinary least squares. The breakpoints that define the segments are found by a greedy, exhaustive search along the interval boundaries of the ALE curve. Greedy here means that we first optimize the first breakpoint, then the second breakpoint with the first breakpoint fixed, and so on. We measure the degrees of freedom as the number of non-zero coefficients for intercepts and slopes of the linear models. The approximation allows some error; e.g., an almost linear main effect may have MEC_j = 1 even if dozens of parameters would be needed to describe it perfectly.
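Both measures can be sketched compactly. The IAS ratio is computed directly from model predictions and first-order (main-effect) predictions; the MEC_j routine below chooses breakpoints greedily by data index rather than along ALE interval boundaries and omits the slope-zeroing post-processing and categorical handling. Both are illustrative Python sketches, not the authors' R implementation; in practice the main-effect predictions would come from estimated ALE curves.

```python
def interaction_strength(f_pred, f_main_pred):
    """IAS with L2 loss: squared error of the first-order (main effect)
    model, scaled by the squared error of the constant mean model."""
    f0 = sum(f_pred) / len(f_pred)  # mean prediction = intercept-only model
    sse_main = sum((y - m) ** 2 for y, m in zip(f_pred, f_main_pred))
    sse_mean = sum((y - f0) ** 2 for y in f_pred)
    return sse_main / sse_mean

def ols(xs, ys):
    """Closed-form simple linear regression: (intercept, slope, SSE)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var if var else 0.0
    b0 = my - slope * mx
    return b0, slope, sum((y - b0 - slope * x) ** 2 for x, y in zip(xs, ys))

def mec_single(xs, ys, eps=0.05, max_seg=5):
    """Approximate a curve with piece-wise linear segments (greedy breakpoints
    by index) and return MEC_j = segments + non-zero slopes - 1."""
    pts = sorted(zip(xs, ys))
    xs, ys = [p[0] for p in pts], [p[1] for p in pts]
    n, my = len(xs), sum(ys) / len(ys)
    sst = sum((y - my) ** 2 for y in ys) or 1.0
    def evaluate(breaks):
        idx = [0] + sorted(breaks) + [n]
        sse, slopes = 0.0, 0
        for a, b in zip(idx[:-1], idx[1:]):
            _, slope, s = ols(xs[a:b], ys[a:b])
            sse += s
            slopes += int(abs(slope) > 1e-10)
        return sse, slopes
    breaks = []
    sse, slopes = evaluate(breaks)
    while 1 - sse / sst < 1 - eps and len(breaks) + 1 < max_seg:
        # Greedy: keep earlier breakpoints fixed, exhaustively try the next one.
        best = min((i for i in range(2, n - 1) if i not in breaks),
                   key=lambda i: evaluate(breaks + [i])[0])
        breaks.append(best)
        sse, slopes = evaluate(breaks)
    return len(breaks) + 1 + slopes - 1  # intercepts + slopes - first intercept

# An additive model is reproduced exactly by its main effects: IAS = 0.
preds = [1.0, 2.0, 3.0, 4.0]
print(interaction_strength(preds, preds))  # 0.0
# A pure interaction f(x1,x2) = x1*x2 on {-1,1}^2 has zero main effects,
# so the first-order model collapses to the mean prediction: IAS = 1.
print(interaction_strength([1.0, -1.0, -1.0, 1.0], [0.0] * 4))  # 1.0

xs = [i / 49 for i in range(50)]
print(mec_single(xs, [2 * x + 1 for x in xs]))     # 1: linear, one slope
print(mec_single(xs, [abs(x - 0.5) for x in xs]))  # 3: V-shape, two sloped segments
```

The two MEC_j examples mirror Figure 1: a linear curve needs a single slope parameter, while a V-shaped curve needs two sloped segments (two slopes plus one extra intercept).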
The approximation quality is measured with R-squared (R²), i.e., the proportion of variance of f_j,ALE that is explained by the approximation with linear segments. An approximation has to reach an R² ≥ 1 − ε, where ε is the user-defined maximum approximation error. We also introduce the parameter max_seg, the maximum number of segments. In the case that an approximation cannot reach an R² ≥ 1 − ε with a given max_seg, MEC_j is computed with the maximum number of segments. The selected maximum approximation error ε should be small, but not too small. We found ε between 0.01 and 0.1 to work well and use ε = 0.05 throughout the paper. We apply a post-processing step that greedily sets slopes of the linear segments to zero, as long as R² ≥ 1 − ε. This post-processing potentially decreases MEC_j, especially for models with constant segments like decision trees. MEC_j is averaged over all features to obtain the global main effect complexity. Each MEC_j is weighted with the variance of the corresponding ALE main effect to give more weight to features that contribute more to the prediction. Algorithm 2 describes the MEC computation in detail.

Algorithm 2: Main Effect Complexity (MEC)
Input: model f, approximation error ε, max. segments max_seg, data D
Define R²(g_j, f_j,ALE) := 1 − Σ_{i=1}^n (g_j(x_j^(i)) − f_j,ALE(x_j^(i)))² / Σ_{i=1}^n (f_j,ALE(x_j^(i)))²
for j ∈ {1, …, p} do
    Estimate f_j,ALE
    // Approximate ALE with a linear model
    Fit g_j(x_j) = β_0 + β_1 x_j predicting f_j,ALE(x_j^(i)) from x_j^(i), i ∈ {1, …, n}
    Set K = 1
    // Increase the number of segments until the approximation is good enough
    while K < max_seg AND R²(g_j, f_j,ALE) < (1 − ε) do
        // Find intervals Z_k through exhaustive search along ALE curve breakpoints
        // For categorical features, set slopes β_1,k to zero
        g_j(x_j) = Σ_{k=1}^{K+1} I(x_j ∈ Z_k) · (β_0,k + β_1,k x_j)
        Set K = K + 1
    Greedily set slopes β_1,k to zero while R² ≥ 1 − ε
    // Sum of non-zero coefficients minus the first intercept
    MEC_j = K + Σ_{k=1}^K I(β_1,k ≠ 0) − 1
    V_j = (1/n) Σ_{i=1}^n f_j,ALE(x_j^(i))²
return MEC = (Σ_{j=1}^p V_j · MEC_j) / (Σ_{j=1}^p V_j)

In the following experiment, we train various machine learning models on different prediction tasks and compute the model complexities. The goal is to analyze how the complexity measures behave across different datasets and models. The datasets are: Bike Rentals [10] (n = 731; 3 numerical, 6 categorical features), Boston Housing (n = 506; 12 numerical, 1 categorical feature), (downsampled) Superconductivity [18] (n = 2000; 81 numerical, 0 categorical features) and Abalone [9] (n = 4177; 7 numerical, 1 categorical feature).

Table 1 shows performance and complexity of the models. As desired, the main effect complexity for linear models is 1 (except when categorical features with 2+ categories are present, as in the bike data), and higher for more flexible methods like random forests. The interaction strength (IAS) is zero for additive models (boosted GAM, (regularized) linear models). Across datasets we observe
Across datasets we observe uantifying Model Complexity 7 bike Boston Housing superconductivity abalonelearner MSE MEC IAS NF MSE MEC IAS NF MSE MEC IAS NF MSE MEC IAS NFcart 923035 1.1 0.07 6 23.7 1.9 0.12 4 325.0 1.0 0.23 8 6.0 2.8 0.09 3cart2 1245105 1.0 0.01 2 29.8 1.7 0.02 2 417.6 1.0 0.22 3 6.7 3.0 0.02 1cvglmnet 667291 1.1 0.00 9 27.4 1.0 0.00 8 351.1 1.0 0.00 50 5.1 1.0 0.00 6gamboost 539538 1.6 0.00 8 17.7 2.5 0.00 10 360.3 1.7 0.00 14 5.3 1.1 0.00 4ksvm 424184 1.6 0.04 8 13.7 1.7 0.09 13 256.0 2.2 0.25 81 4.6 1.0 0.12 8lm 629144 1.5 0.00 9 23.4 1.0 0.00 13 337.4 1.0 0.00 81 4.9 1.0 0.00 8rf 478115 1.8 0.06 9 13.2 2.5 0.10 13 167.4 3.0 0.25 81 4.6 1.7 0.30 8 Table 1. Model performance and complexity on 4 regression tasks for various learners:linear models (lm), cross-validated regularized linear models (cvglmnet), kernel supportvector machine (ksvm), random forest (rf), gradient boosted generalized additive model(gamboost), decision tree (cart) and decision tree with depth 2 (cart2). that the underlying complexity measured as the range of MEC and IAS acrossthe models varies. The bike dataset seems to be adequately described by onlyadditive effects, since even random forests, which often model strong interactionsshow low interaction strength here. In contrast, the superconductivity datasetis better explained by models with more interactions. For the abalone datasetthere are two models with low MSE: the support vector machine and the randomforest. We might prefer the SVM, since main effects can be described with singlenumbers ( M EC = 1) and interaction strength is low. Minimizing the number of features (NF), the interaction strength (IAS), andthe main effect complexity (MEC) improves reliability and compactness of post-hoc interpretation methods such as partial dependence plots, ALE plots, featureimportance, interaction effects and local surrogate models. Fewer features, more compact interpretations. 
Minimizing the number of features improves the readability of post-hoc analysis results. The computational complexity and output size of most interpretation methods scale with O(NF), like feature effect plots [1, 14] or feature importance [6, 11]. As demonstrated in Table 2, a model with fewer features has a more compact representation. If additionally IAS = 0, the ALE main effects fully characterize the prediction function. Interpretation methods that analyze 2-way feature interactions scale with O(NF²). A complete functional decomposition requires estimating Σ_{k=1}^{NF} (NF choose k) components, which has a computational complexity of O(2^NF).

Less interaction, more reliable feature effects. Feature effect plots such as partial dependence plots and ALE plots visualize the marginal relationship between a feature and the prediction. The estimated effects are averages across instances. The effects can vary greatly for individual instances and even have opposite directions when the model includes feature interactions.

In the following simulation, we trained three models with different capabilities of modeling interactions between features: a linear regression model, a support vector machine (radial basis kernel, C = 0.05), and gradient boosted trees. We simulated 500 data points with 4 features and a continuous target based on [15]. Figure 2 shows an increasing interaction strength depending on the model used. More interaction means that the feature effect curves become a less reliable summary of the model behavior.

Fig. 2. The higher the interaction strength in a model (IAS increases from left to right), the less representative the partial dependence plot (light thick line) becomes for individual instances, represented by their individual conditional expectation curves (dark thin lines).

The less complex the main effects, the better they can be summarized. In linear models, a feature effect can be expressed by a single number, the regression coefficient.
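The phenomenon in Fig. 2 can be reproduced with a toy model: a flat partial dependence curve hiding opposite-slope individual conditional expectation (ICE) curves. The function f, the grid, and the balanced groups below are illustrative assumptions, not the simulation from [15].

```python
# A pure interaction: the effect of x1 on f flips sign with x2.
f = lambda x1, x2: x1 * x2

data_x2 = [-1.0, 1.0] * 100    # two balanced groups of instances
grid = [-1.0, -0.5, 0.5, 1.0]  # evaluation grid for feature x1

# ICE curve per instance: vary x1 on the grid, hold the instance's x2 fixed.
ice = [[f(g, x2) for g in grid] for x2 in data_x2]
# The partial dependence curve is the pointwise average of all ICE curves.
pdp = [sum(curve[i] for curve in ice) / len(ice) for i in range(len(grid))]

print(pdp)     # [0.0, 0.0, 0.0, 0.0] -- the flat average
print(ice[0])  # [1.0, 0.5, -0.5, -1.0] -- decreasing for instances with x2 = -1
print(ice[1])  # [-1.0, -0.5, 0.5, 1.0] -- increasing for instances with x2 = +1
```

The average curve suggests feature x1 has no effect, while every individual instance is strongly affected; exactly the failure mode that a low IAS rules out.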
If effects are non-linear, the method of choice is visualization [1, 14]. Summarizing the effects with a single number (e.g., using average marginal effects [23]) can be misleading; e.g., the average effect might be zero for U-shaped feature effects. As a by-product of MEC, there is a third option: instead of reporting a single number, the coefficients of the segmented linear model can be reported. Minimizing MEC means preferring models with main effects that can be described with fewer coefficients, offering a more compact model description.

We demonstrate model selection for performance and complexity in a multi-objective optimization approach. For this example, we predict wine quality (scale from 0 to 10) [7] from the wines' physico-chemical properties, such as alcohol and residual sugar, of 4870 white wines. It is difficult to know the desired compromise between model complexity and performance before modeling the data. A solution is multi-objective optimization [12]. We suggest searching over a wide spectrum of model classes and hyperparameter settings, which allows selecting a suitable compromise between model complexity and performance.

We used the mlrMBO model-based optimization framework [19] with ParEGO [21] (500 iterations) to find the best models based on four objectives: number of features used (NF), main effect complexity (MEC), interaction strength (IAS) and 5-fold cross-validated mean absolute error (MAE). We optimized over the space of the following model classes (and hyperparameters): CART (maximum tree depth and complexity parameter cp), support vector machine (cost C and inverse kernel width sigma), elastic net regression (regularization alpha and penalization lambda), gradient boosted trees (maximum depth, number of iterations), gradient boosted generalized additive model (number of iterations nrounds) and random forest (number of split features mtry).

Results.
The multi-objective optimization resulted in 27 models. The measures had the following ranges: MAE 0.41 – 0.63, number of features 1 – 11, main effect complexity 1 – 9 and interaction strength 0 – 0.71. For a more informative visualization, we propose to visualize the main effects together with the measures, as in Table 2. The selected models show different trade-offs between the measures.

Table 2. A selection of four models from the Pareto optimal set, along with their ALE main effect curves. From left to right, the columns show models with 1) lowest MAE, 2) lowest MAE with MEC = 1, 3) lowest MAE with IAS ≤ 0.2, and 4) lowest MAE with NF ≤ 7.

      gbt (maxdepth: 8,  svm (C: 23.6979,  gbt (maxdepth: 3,  CART (maxdepth: 14,
      nrounds: 269)      sigma: 0.0003)    nrounds: 98)       cp: 0.0074)
MAE   0.41               0.58              0.52               0.59
MEC   4.2                1                 4.5                2
IAS   0.64               0                 0.2                0.2
NF    11                 11                11                 4

(The original table additionally shows ALE main effect curves for the features fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates and alcohol.)

We proposed three measures for machine learning model complexity based on functional decomposition: number of features used, interaction strength and main effect complexity. Due to their model-agnostic nature, the measures allow model selection and comparison across different types of models, and they can be used as objectives in automated machine learning frameworks. This also includes "white-box" models: for example, the interaction strength of interaction terms in a linear model or the complexity of smooth effects in generalized additive models can be quantified and compared across models. We argued that minimizing these measures for a machine learning model improves its post-hoc interpretation. We demonstrated that the measures can be optimized directly with multi-objective optimization to make the trade-off between performance and post-hoc interpretability explicit.

Limitations. The proposed decomposition of the prediction function and definition of the complexity measures will not be appropriate in every situation.
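As an aside, the models in Table 2 come from the Pareto-optimal set, and such a set can be filtered with a generic dominance check over the objective vectors (MAE, MEC, IAS, NF). The sketch below uses the four objective vectors from Table 2 (model names abbreviated) and is only an illustration of Pareto filtering, not the mlrMBO/ParEGO machinery used in the experiment.

```python
def dominates(a, b):
    """a dominates b if a is <= in every objective and < in at least one
    (all objectives are minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(models):
    """Keep the models that no other model dominates."""
    return {name: obj for name, obj in models.items()
            if not any(dominates(other, obj)
                       for other_name, other in models.items()
                       if other_name != name)}

# Objective vectors (MAE, MEC, IAS, NF) of the four models shown in Table 2.
models = {
    "gbt_deep":  (0.41, 4.2, 0.64, 11),
    "svm":       (0.58, 1.0, 0.00, 11),
    "gbt_small": (0.52, 4.5, 0.20, 11),
    "cart":      (0.59, 2.0, 0.20, 4),
}
print(sorted(pareto_front(models)))
# ['cart', 'gbt_deep', 'gbt_small', 'svm'] -- none dominates another,
# so all four represent distinct performance/complexity trade-offs.
```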
For example, all higher order effects are combined into a single interaction strength measure that does not distinguish between two-way interactions and higher order interactions. However, the framework of accumulated local effect decomposition allows estimating higher order effects and constructing different interaction measures. The main effect complexity measure only considers linear segments, but not, e.g., seasonal components or other structures. Furthermore, the complexity measures quantify machine learning models from a functional point of view and ignore the structure of the model (e.g., whether it can be represented by a tree). For example, the main effect complexity and interaction strength measures can be large for short decision trees (e.g., in Table 1).

Implementation. The code for this paper is available at https://github.com/compstat-lmu/paper_2019_iml_measures. For the examples and experiments we relied on the mlr package [5] in R [29].

Acknowledgements. This work is funded by the Bavarian State Ministry of Science and the Arts in the framework of the Centre Digitisation.Bavaria (ZD.B) and supported by the German Federal Ministry of Education and Research (BMBF) under Grant No. 01IS18036A. The authors of this work take full responsibility for its content.

References

[1] Apley, D.: Visualizing the effects of predictor variables in black box supervised learning models. arXiv preprint arXiv:1612.08468 (2016)
[2] Apley, D.: ALEPlot: Accumulated local effects (ALE) plots and partial dependence (PD) plots. CRAN (2017)
[3] Askira-Gelman, I.: Knowledge discovery: comprehensibility of the results. In: Proceedings of the thirty-first Hawaii international conference on system sciences. vol. 5, pp. 247–255. IEEE (1998)
[4] Bibal, A., Frénay, B.: Interpretability of machine learning models and representations: an introduction. In: Proceedings on ESANN. pp.
77–82 (2016)
[5] Bischl, B., Lang, M., Kotthoff, L., Schiffner, J., Richter, J., Studerus, E., Casalicchio, G., Jones, Z.M.: mlr: Machine learning in R. Journal of Machine Learning Research (170), 1–5 (2016)
[6] Casalicchio, G., Molnar, C., Bischl, B.: Visualizing the feature importance for black box models. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. pp. 655–670. Springer (2018)
[7] Cortez, P., Cerdeira, A., Almeida, F., Matos, T., Reis, J.: Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems (4), 547–553 (2009)
[8] Dhurandhar, A., Iyengar, V., Luss, R., Shanmugam, K.: TIP: typifying the interpretability of procedures. arXiv preprint arXiv:1706.02952 (2017)
[9] Dua, D., Graff, C.: UCI machine learning repository (2017), http://archive.ics.uci.edu/ml
[10] Fanaee-T, H., Gama, J.: Event labeling combining ensemble detectors and background knowledge. Progress in Artificial Intelligence pp. 1–15 (2013)
[11] Fisher, A., Rudin, C., Dominici, F.: All models are wrong but many are useful: Variable importance for black-box, proprietary, or misspecified prediction models, using model class reliance. arXiv preprint arXiv:1801.01489 (2018)
[12] Freitas, A.A.: Comprehensible classification models: a position paper. ACM SIGKDD Explorations Newsletter (1), 1–10 (2014)
[13] Friedler, S.A., Roy, C.D., Scheidegger, C., Slack, D.: Assessing the local interpretability of machine learning models. arXiv preprint arXiv:1902.03501 (2019)
[14] Friedman, J.H.: Greedy function approximation: a gradient boosting machine. Annals of Statistics pp. 1189–1232 (2001)
[15] Friedman, J.H., et al.: Multivariate adaptive regression splines.
The Annals of Statistics (1), 1–67 (1991)
[16] Fürnkranz, J., Gamberger, D., Lavrač, N.: Foundations of rule learning. Springer Science & Business Media (2012)
[17] Guidotti, R., Monreale, A., Ruggieri, S., Turini, F., Giannotti, F., Pedreschi, D.: A survey of methods for explaining black box models. ACM Computing Surveys (CSUR) (5), 93 (2018)
[18] Hamidieh, K.: A data-driven statistical model for predicting the critical temperature of a superconductor. Computational Materials Science, 346–354 (2018)
[19] Horn, D., Bischl, B.: Multi-objective parameter configuration of machine learning algorithms using model-based optimization. In: 2016 IEEE Symposium Series on Computational Intelligence (SSCI). pp. 1–8. IEEE (2016)
[20] Huysmans, J., Dejaeger, K., Mues, C., Vanthienen, J., Baesens, B.: An empirical evaluation of the comprehensibility of decision table, tree and rule based predictive models. Decision Support Systems (1), 141–154 (2011)
[21] Knowles, J.: ParEGO: a hybrid algorithm with on-line landscape approximation for expensive multiobjective optimization problems. IEEE Transactions on Evolutionary Computation (1), 50–66 (2006)
[22] Lakkaraju, H., Kamar, E., Caruana, R., Leskovec, J.: Interpretable & explorable approximations of black box models. arXiv preprint arXiv:1707.01154 (2017)
[23] Leeper, T.J.: Interpreting regression results using average marginal effects with R's margins. CRAN (2017)
[24] Molnar, C.: Interpretable Machine Learning (2019), https://christophm.github.io/interpretable-ml-book/
[25] Molnar, C., Bischl, B., Casalicchio, G.: iml: An R package for interpretable machine learning. JOSS (26), 786 (2018)
[26] Philipp, M., Rusch, T., Hornik, K., Strobl, C.: Measuring the stability of results from supervised statistical learning. Journal of Computational and Graphical Statistics (4), 685–700 (2018)
[27] Plumb, G., Al-Shedivat, M., Xing, E., Talwalkar, A.: Regularizing black-box models for improved interpretability.
arXiv preprint arXiv:1902.06787 (2019)
[28] Poursabzi-Sangdeh, F., Goldstein, D.G., Hofman, J.M., Vaughan, J.W., Wallach, H.: Manipulating and measuring model interpretability. arXiv preprint arXiv:1802.07810 (2018)
[29] R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2018)
[30] Ribeiro, M.T., Singh, S., Guestrin, C.: Model-agnostic interpretability of machine learning. arXiv preprint arXiv:1606.05386 (2016)
[31] Rüping, S., et al.: Learning interpretable models. Univ. Dortmund (2006), http://d-nb.info/997491736
[32] Schielzeth, H.: Simple means to improve the interpretability of regression coefficients. Methods in Ecology and Evolution (2), 103–113 (2010)
[33] Ustun, B., Rudin, C.: Supersparse linear integer models for optimized medical scoring systems. Machine Learning 102