Distributional Anchor Regression
Lucas Kook∗, Beate Sick, Peter Bühlmann
University of Zurich, Switzerland; Zurich University of Applied Sciences, Switzerland; ETH Zurich, Switzerland
Abstract
Prediction models often fail if train and test data do not stem from the same distribution. Out-of-distribution (OOD) generalization to unseen, perturbed test data is a desirable but difficult-to-achieve property for prediction models and in general requires strong assumptions on the data generating process (DGP). In a causally inspired perspective on OOD generalization, the test data arise from a specific class of interventions on exogenous random variables of the DGP, called anchors. Anchor regression models, introduced by Rothenhäusler et al. (2018), protect against distributional shifts in the test data by employing causal regularization. However, so far anchor regression has only been used with a squared-error loss, which is inapplicable to common responses such as censored continuous or ordinal data. Here, we propose a distributional version of anchor regression which generalizes the method to potentially censored responses with at least an ordered sample space. To this end, we combine a flexible class of parametric transformation models for distributional regression with an appropriate causal regularizer under a more general notion of residuals. In an exemplary application and several simulation scenarios we demonstrate the extent to which OOD generalization is possible.
Keywords: anchor regression, covariate shift, diluted causality, distributional regression, transformation models, out-of-distribution generalization
Common methods in supervised statistical learning assume the test data to follow the same distribution as the training data. This is implicitly exploited, e.g., in cross-validation or by randomly splitting a dataset into a training and a test set, which has been demonstrated to be potentially flawed (Efron, 2020) due to concept drift or domain shift, where new (test) data do not follow the same distribution as the training data. More recently, the problem has been referred to as out-of-distribution (OOD) generalization (Sun et al., 2019). The desire to achieve reliable test predictions under distributional shifts is ubiquitous in many fields of machine learning and statistics, such as transfer learning (Pan and Yang, 2009; Rojas-Carulla et al., 2018), domain adaptation (Magliacane et al., 2018; Redko et al., 2020), multi-task learning (Caruana, 1997), representation learning (Mitrovic et al., 2020) or prediction models in medical statistics (Subbaswamy and Saria, 2019). Accordingly, many different formulations of the problem of OOD generalization exist in the literature (a detailed overview can be found in Chen and Bühlmann, 2020). We will frame OOD generalization as the problem of robustly predicting an outcome in novel, unseen environments, based on data from a few observed environments, and extend on the idea of anchor regression and causal regularization (Rothenhäusler et al., 2018; Bühlmann, 2020; Bühlmann and Ćevid, 2020) to develop distributional anchor regression. In such a framework, training a model on heterogeneous data is not a disadvantage but rather a necessity.

∗ Corresponding author, email: [email protected]

It has been known for decades that a causal model is robust towards arbitrarily strong perturbations on components other than the response (Haavelmo, 1943).
However, identifying causal structures is not only difficult but often leads to sub-par prediction performance when the test data contain perturbations of bounded strength (Rothenhäusler et al., 2018). Rothenhäusler et al. introduce linear anchor regression, which allows a trade-off between prediction performance and robustness against shift perturbations of a certain size. The framework of linear anchor regression was extended to deal with nonlinear regression between the response and covariates (Bühlmann, 2020). Furthermore, Christiansen et al. (2020) provide a causal framework to decide which assumptions are needed for, and to what extent, OOD generalization is possible.

Anchor regression is related to Instrumental Variables (IV) regression. However, the main IV assumption that the instrument A does not directly affect some hidden confounding variables H is dropped, at the price of non-identifiability of the causal parameter (Angrist et al., 1996). A graphical description of the issue is given in Figure 1.

Figure 1: Graphical models for the response variable Y, covariates X and hidden confounders H: IV regression with instruments A (left) and anchor regression with anchor A (right). In anchor regression, A is only required to be a source node but is allowed to directly influence response, covariates and hidden confounders.

In this work we develop a framework for distributional anchor regression in the broad class of transformation models (TMs, Hothorn et al., 2014). The resulting class of anchor TMs generalizes (non-)linear anchor regression to potentially censored responses and characterizes the full conditional distribution of Y | X = x instead of estimating solely the conditional mean function.
While the $L_2$ anchor loss can be decomposed into a squared-error term and a causal-regularization term penalizing correlation between anchors and residuals, we propose a distributional anchor loss based on the negative log-likelihood, replacing the least-squares residuals by the more general score residuals. The proposed causal regularizer induces uncorrelatedness between the anchors and these score residuals. The resulting procedure is tailored towards protecting against distributional shifts induced by the anchors and naturally interpolates between the unpenalized maximum-likelihood estimate and a solution for which anchors and residuals are strictly uncorrelated. The latter may be thought of as a distributional IV-like objective, but it generally does not estimate the causal model, because the anchor A can also directly influence H and Y (see Figure 1). It leads to some invariance of the score residuals across the values of the anchors A, and such an invariance property has been referred to as "diluted causality" (Bühlmann, 2020).

We implement all methods and algorithms in the R language for statistical computing (R Core Team, 2020) and the code is available on GitHub. In the appendix we present further details on notation, computation, and score residuals.

First, we introduce structural equation models (SEMs) before recapping linear anchor regression. In Section 2.3, we switch perspectives from modelling the conditional expectation to transformation models, which enable capturing entire conditional distributions. The notation used in this work is described in Appendix A.
Let $Y$ be a response which takes values in $\mathbb{R}$, let $X$ be a random vector of covariates taking values in $\mathbb{R}^p$, let $H$ denote hidden confounders with sample space $\mathbb{R}^d$, and let $A$ be exogenous variables (called anchors, due to exogeneity; source node in the graph in Figure 1) taking values in $\mathbb{R}^q$. The SEM governing linear anchor regression is given by

$$\begin{pmatrix} Y \\ X \\ H \end{pmatrix} \leftarrow B \begin{pmatrix} Y \\ X \\ H \end{pmatrix} + M A + \varepsilon, \qquad (1)$$

with the $(1+p+d) \times (1+p+d)$-matrix $B$, which corresponds to the structure of the SEM in terms of a directed acyclic graph (DAG); the effect of $A$ enters linearly via the $(1+p+d) \times q$-matrix $M$, and $\varepsilon$ denotes the error term with mutually independent components. The "$\leftarrow$" symbol is algebraically a distributional equality sign. It emphasizes the structural character of the SEM, saying that, e.g., $Y$ is only a function of the parents of the node $Y$ in the structural DAG and the additive component $(MA + \varepsilon)$.

The anchors $A$ may be continuous or discrete. In the special case of discrete anchors, each level can be viewed as an "environment". We define perturbations as intervening on $A$, e.g., by $\mathrm{do}(A = a)$, which replaces $A$ by $a$ in the SEM while leaving the underlying mechanism, i.e., the coefficients in the SEM, unchanged. In this work we restrict ourselves to do- (Pearl, 2009) and push-interventions (Markowetz et al., 2005) on $A$, which in turn lead to shifts in the distribution of $X$. Since $A$ is exogenous and a source node in the graph, the specific type of intervention does not play a major role. Christiansen et al. (2020) show that under the above conditions OOD generalization is possible in linear models.
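As a toy illustration of the SEM in eq. (1) and of shift interventions on the anchors, consider the following sketch (all numerical values and the helper `simulate` are ours, chosen for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative SEM with one covariate X, one hidden confounder H and a
# scalar anchor A. Variable order in the system: s = (Y, X, H).
B = np.array([[0.0, 2.0, 1.0],   # Y <- 2*X + 1*H
              [0.0, 0.0, 1.0],   # X <- 1*H (plus the anchor effect below)
              [0.0, 0.0, 0.0]])  # H is exogenous apart from A
M = np.array([[0.0], [1.0], [0.3]])  # A shifts X directly and H weakly

def simulate(n, a_shift=0.0):
    """Draw n i.i.d. samples from the SEM; a_shift acts as a shift
    intervention on the anchor A."""
    A = rng.normal(size=(n, 1)) + a_shift
    eps = rng.normal(size=(n, 3))
    # Solve s = B s + M A + eps  =>  s = (I - B)^{-1} (M A + eps)
    s = (np.linalg.inv(np.eye(3) - B) @ (A @ M.T + eps).T).T
    return s[:, 0], s[:, 1:2], A  # Y, X, A

Y, X, A = simulate(5000)
Y_shift, X_shift, _ = simulate(5000, a_shift=5.0)
# Intervening on A shifts the distribution of X (and hence of Y).
```

Because $A$ is a source node, the intervention only changes the input distribution while all structural coefficients stay fixed, which is exactly the class of perturbations anchor regression protects against.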
Linear anchor regression with its corresponding causal regularization estimates the linear regression parameter $\beta$ as

$$\hat{\beta} = \operatorname*{arg\,min}_{\beta} \left\{ \|(\mathrm{Id} - \Pi_A)(y - X\beta)\|_2^2 / n + \gamma \, \|\Pi_A (y - X\beta)\|_2^2 / n \right\},$$

where $0 \le \gamma \le \infty$ is a regularization parameter and $\Pi_A = A(A^\top A)^{-1} A^\top$ denotes the orthogonal projection onto the column space of the anchors (Rothenhäusler et al., 2018). For $\gamma = 1$ one obtains ordinary least squares, $\gamma \to \infty$ corresponds to two-stage least squares as in Instrumental Variables regression, and $\gamma = 0$ is adjusting for the anchor variables $A$ (being equivalent to ordinary least squares when regressing $Y$ on $X$ and $A$). Causal regularization encourages, for large values of $\gamma$, uncorrelatedness of the anchors $A$ and the residuals. As a procedure, causal regularization does not depend at all on the SEM in eq. (1). However, as described below, the method inherits a distributional robustness property, whose formulation depends on the SEM in eq. (1).

Rothenhäusler et al. (2018) establish the duality between the $L_2$ loss in linear anchor regression and optimizing a worst-case risk over specific shift perturbations. The authors consider shift perturbations $\nu$, which are confined to be in the set

$$C_\gamma := \left\{ \nu : \nu = M\delta, \ \delta \text{ independent of } \varepsilon, \ \mathbb{E}[\delta\delta^\top] \preceq \gamma \, \mathbb{E}[AA^\top] \right\},$$

and which generate the perturbed response $Y^\nu$ and covariates $X^\nu$ via

$$\begin{pmatrix} Y^\nu \\ X^\nu \\ H^\nu \end{pmatrix} \leftarrow B \begin{pmatrix} Y^\nu \\ X^\nu \\ H^\nu \end{pmatrix} + \nu + \varepsilon.$$

The set $C_\gamma$ contains all vectors which lie in the span of the columns of $M$ and thus in the same direction as the exogenous contribution $MA$ of the anchor variables.
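Because $\mathrm{Id} - \Pi_A$ and $\Pi_A$ project onto orthogonal subspaces, the anchor loss equals $\|(\mathrm{Id} - (1-\sqrt{\gamma})\Pi_A)(y - X\beta)\|_2^2/n$, so the estimator can be computed by transforming the data once and running ordinary least squares. A minimal sketch (the function name and the simulated data are ours):

```python
import numpy as np

def anchor_regression(y, X, A, gamma):
    """Linear anchor regression via the data transformation
    W = Id - (1 - sqrt(gamma)) * Pi_A followed by ordinary least squares.
    gamma = 1 recovers OLS; large gamma approaches the IV-like solution."""
    n = len(y)
    Pi_A = A @ np.linalg.solve(A.T @ A, A.T)        # projection onto col(A)
    W = np.eye(n) - (1.0 - np.sqrt(gamma)) * Pi_A
    Xt, yt = W @ X, W @ y
    return np.linalg.solve(Xt.T @ Xt, Xt.T @ yt)

# Simulated example (values are ours): X is partly driven by the anchors A.
rng = np.random.default_rng(1)
n = 200
A = rng.normal(size=(n, 2))
X = A @ np.array([[1.0], [0.5]]) + rng.normal(size=(n, 1))
y = X @ np.array([2.0]) + rng.normal(size=n)

beta_ols = anchor_regression(y, X, A, gamma=1.0)      # plain least squares
beta_robust = anchor_regression(y, X, A, gamma=100.0) # strong regularization
```

For $\gamma = 1$ the transformation $W$ is the identity, so the result coincides exactly with the ordinary least-squares fit.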
The average squared size of $\delta$ is limited to $\gamma$ times the smallest eigenvalue of the centered anchor's variance-covariance matrix. Now, the explicit duality between the worst-case risk over all shift perturbations of limited size and the $L_2$ anchor loss is given by

$$\sup_{\nu \in C_\gamma} \mathbb{E}\left[ \left( Y^\nu - (X^\nu)^\top \beta \right)^2 \right] = \mathbb{E}\left[ \left( (\mathrm{Id} - P_A)(Y - X^\top \beta) \right)^2 \right] + \gamma \, \mathbb{E}\left[ \left( P_A (Y - X^\top \beta) \right)^2 \right], \qquad (2)$$

where $P_A = \mathbb{E}[\,\cdot \mid A]$ is the population analogue of $\Pi_A$. We note that the right-hand side is the population analogue of the objective function in anchor regression. Hence, causal regularization in anchor regression provides guarantees for optimizing worst-case risk across a class of shift perturbations. The details are provided in Rothenhäusler et al. (2018).

We now switch perspective from models for the conditional mean to the conditional distribution. Specifically, we consider transformation models (Hothorn et al., 2014). TMs decompose the conditional distribution of $Y \mid x$ into a pre-defined simple distribution function $F_Z$, with log-concave density $f_Z$, and a (semi-)parametric transformation function $h(y \mid x)$, which is monotone non-decreasing in $y$:

$$F_{Y|x}(y \mid x) = F_Z(h(y \mid x)).$$

This way, the problem of estimating a conditional distribution simplifies to estimating the parameters of the transformation function $h = F_Z^{-1} \circ F_{Y|x}$ (since $F_Z$ is pre-specified and parameter-free). Depending on the complexity of $h$, very flexible conditional distributions can be modelled. Hothorn et al. (2018) give theoretical guarantees for the existence and uniqueness of the transformation function $h$ for absolutely continuous, countably infinite and ordered discrete random variables. For the sake of generality, $h$ is parametrized in terms of a basis expansion in the argument $y$, which can be as simple as a linear function in $y$ or as complex as a basis of splines to approximate a smooth function in $y$.
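The decomposition $h = F_Z^{-1} \circ F_{Y|x}$ can be checked numerically in a small sketch (our choice of a normal target distribution and a standard logistic $F_Z$, for illustration only):

```python
from math import erf, exp, log, sqrt

# Sketch of the transformation-model decomposition F_Y(y) = F_Z(h(y)),
# with h = F_Z^{-1} o F_Y. Here F_Y = N(2, 3^2) is the target and F_Z is
# the standard logistic distribution (both choices are ours).
Phi = lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0)))  # standard normal CDF
F_Y = lambda y: Phi((y - 2.0) / 3.0)              # distribution of Y
F_Z = lambda z: 1.0 / (1.0 + exp(-z))             # standard logistic CDF
F_Z_inv = lambda p: log(p / (1.0 - p))            # logistic quantile function

h = lambda y: F_Z_inv(F_Y(y))                     # transformation function

# F_Z(h(y)) reproduces F_Y(y) for every y, and h is monotone increasing.
ys = [-5.0 + 0.35 * i for i in range(41)]
vals = [F_Z(h(y)) for y in ys]
```

Since $F_Z$ is fixed in advance, all model flexibility lives in $h$, which is what the parametric bases below estimate.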
In this work, we assume the transformation function for a continuous response can be additively decomposed into a linear predictor in $x$ and a smooth function in $y$, which is modelled as a Bernstein polynomial of order $P$ with parameters $\theta \in \mathbb{R}^{P+1}$ (Hothorn et al., 2018), such that $h(y \mid x) = b_{\mathrm{Bs},P}(y)^\top \theta + x^\top \beta$. Monotonicity of $b_{\mathrm{Bs},P}(y)^\top \theta$, and thereby of $h(y \mid x)$, can then be enforced via the $P$ linear constraints $\theta_1 \le \theta_2 \le \dots \le \theta_{P+1}$. In case of an ordinal response taking values in $\{y_1, y_2, \dots, y_K\}$, the transformation function is a monotone increasing step function, $h(y_k \mid x) = \theta_k + x^\top \beta$, for $k = 1, \dots, K-1$, with $\theta_K = +\infty$. We summarize a transformation model based on its simple distribution function $F_Z$, basis $b$, which may include covariates, and parameters $\vartheta$, such that $F_{Y|x}(y \mid x) = F_Z(b(y, x)^\top \vartheta)$. For instance, for a transformation model with continuous response and explanatory variables $x$ we thus use $b(y, x) = (b_{\mathrm{Bs},P}(y)^\top, x^\top)^\top$ and $\vartheta = (\theta^\top, \beta^\top)^\top$, yielding $h(y \mid x) = b_{\mathrm{Bs},P}(y)^\top \theta + x^\top \beta$. For a TM with ordinal response we substitute the Bernstein basis with a dummy encoding of the response, which we denote by $\tilde{y}$ (e.g., Kook et al., 2020). Also note that the unconditional case is covered by the above formulation as well, by omitting all explanatory variables from the TM's basis.
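The Bernstein parametrization and its monotonicity constraint can be sketched as follows (the order, interval, and coefficient values are ours; recall that a Bernstein polynomial interpolates its first and last coefficients at the interval endpoints):

```python
import numpy as np
from math import comb

def bernstein_basis(y, P, low, high):
    """Bernstein basis of order P on [low, high], evaluated at scalar y."""
    t = (y - low) / (high - low)
    return np.array([comb(P, k) * t**k * (1 - t)**(P - k)
                     for k in range(P + 1)])

# Non-decreasing coefficients theta_1 <= ... <= theta_{P+1} yield a
# monotone non-decreasing transformation h(y) = b_Bs,P(y)^T theta
# (theta values are ours, for illustration).
P = 6
theta = np.array([-2.0, -1.0, -0.5, 0.0, 0.8, 1.5, 3.0])
h = lambda y: bernstein_basis(y, P, 50.0, 90.0) @ theta

grid = np.linspace(50.0, 90.0, 200)
hvals = np.array([h(y) for y in grid])   # monotone on the whole interval
```

The $P$ ordering constraints on $\theta$ are linear, so maximum-likelihood estimation remains a smooth constrained optimization problem.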
Figure 2: Illustration of an unconditional transformation model (1 − exp(−exp(·)), b_{Bs,6}, ϑ) for the Old Faithful Geyser data (Azzalini and Bowman, 1990) using a Bernstein polynomial basis expansion of order 6 for the transformation function, h(y) = b_{Bs,6}(y)⊤ϑ. The colored regions indicate the transport of probability mass from P_Y (lower right) to P_Z (upper left) via the transformation function h(y) = b(y)⊤ϑ (upper right). If h is continuously differentiable, the density of Y is given by f_Y(y) = f_Z(h(y)) h′(y).

Figure 2 illustrates the intuition behind transformation models. The transformation function (upper right panel) transforms the complex, bimodal distribution of Y (lower panel) to F_Z = F_MEV, the standard minimum extreme value distribution (upper left panel).
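To make the change-of-variables identity f_Y(y) = f_Z(h(y)) h′(y) concrete, here is a minimal numerical sketch (our own illustrative choice, not the Old Faithful model): taking F_Z = Φ and h(y) = log(y) implies that Y follows the standard log-normal distribution.

```python
import numpy as np
from scipy import stats

def tm_density(y, f_z, h, h_prime):
    """Density implied by a transformation model: f_Y(y) = f_Z(h(y)) * h'(y)."""
    return f_z(h(y)) * h_prime(y)

# With F_Z the standard normal and h(y) = log(y), Y is standard log-normal.
y = np.linspace(0.05, 5.0, 200)
dens = tm_density(y, stats.norm.pdf, np.log, lambda t: 1.0 / t)
assert np.allclose(dens, stats.lognorm.pdf(y, s=1.0))
```

The same helper applies to any strictly increasing, differentiable h; the monotonicity constraint on the Bernstein coefficients in the paper guarantees exactly this property.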
Definition 1 (Transformation model, Definition 4 in Hothorn et al. (2018))
The triple (F_Z, b, ϑ) is called a transformation model.

Example 1 (Linear regression)
The normal linear regression model (Lm) is commonly formulated as Y = β0 + x⊤β̃ + ε, ε ∼ N(0, σ²), or Y | x ∼ N(β0 + x⊤β̃, σ²). For a distributional treatment we write the above expression as

F_{Y|x}(y | x) = Φ((y − β0 − x⊤β̃)/σ) = Φ(ϑ1 + ϑ2 y − x⊤β),   (3)

which can be understood as a transformation model by letting ϑ1 = −β0/σ, ϑ2 = 1/σ and β = β̃/σ. Formally, it corresponds to the model (F_Z, b, ϑ) = (Φ, (1, y, x⊤)⊤, (ϑ1, ϑ2, −β⊤)⊤). Note that the baseline transformation, h(y | X = 0), is constrained to be linear with constant slope ϑ2. Due to the linearity of h and the choice F_Z = Φ, the modeled distribution of Y | x will always be normal with constant variance. By parametrizing h in a smooth way, we arrive at much more flexible conditional distributions for Y | x.

The parameters of a TM can be jointly estimated using maximum likelihood. The likelihood can be written in terms of the simple distribution function F_Z, which makes its evaluation computationally more convenient. For a single datum {(y̲, ȳ], x} with potentially censored response the log-likelihood contribution is given by (Lindsey et al., 1996)

ℓ(ϑ; y, x) =
  log f_Z(b(y, x)⊤ϑ) + log(b′(y, x)⊤ϑ),        y = (y̲ + ȳ)/2, exact,
  log F_Z(b(ȳ, x)⊤ϑ),                           y ∈ (−∞, ȳ], left,
  log(1 − F_Z(b(y̲, x)⊤ϑ)),                     y ∈ (y̲, +∞), right,
  log(F_Z(b(ȳ, x)⊤ϑ) − F_Z(b(y̲, x)⊤ϑ)),       y ∈ (y̲, ȳ], interval.

The likelihood is always understood as conditional on X when viewing the covariables as random. Allowing for censored observations is of practical importance, because in many applications the response of interest is not continuous or suffers from inaccuracies, which can be taken into account via uninformative censoring.

Example 2 (Lm, cont'd)
For an exact datum {y, x} the log-likelihood in the normal linear regression model is given by

ℓ(ϑ1, ϑ2, β; y, x) = log φ(ϑ1 + ϑ2 y − x⊤β) + log(ϑ2),

using the density approximation to the likelihood (Lindsey et al., 1996). Here, φ denotes the standard normal density, and b′(y, x)⊤ϑ = ∂(b(y, x)⊤ϑ)/∂y = ϑ2.

Now that we have established TMs and the log-likelihood function to estimate their parameters, we also need a more general notion of residuals to formulate a causal regularizer for a distributional anchor loss. Most importantly, these residuals have to fulfill the same requirements as least-squares residuals in the L2 anchor loss. That is, they have to have zero expectation and a positive definite covariance matrix (e.g., Theorem 3 in Rothenhäusler et al., 2018). In the survival analysis literature, score residuals have received considerable attention and fulfill the above requirements at least asymptotically (Lagakos, 1981; Barlow and Prentice, 1988; Therneau et al., 1990; Farrington, 2000). We now define score residuals for the general class of transformation models.
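Before doing so, the claim in Example 2 can be verified numerically. The sketch below (with hypothetical parameter values of our own) checks that the transformation-model log-likelihood coincides with the usual normal log-density for an exact datum, and that a left-censored contribution reduces to a normal log-CDF:

```python
import numpy as np
from scipy import stats

# Hypothetical parameters of Y | x ~ N(beta0 + x'beta_tilde, sigma^2).
beta0, beta_tilde, sigma = 1.5, np.array([2.0, -1.0]), 0.7
x = np.array([0.3, -0.6])
y = 3.1

# Transformation-model parametrization (Example 1):
# theta1 = -beta0/sigma, theta2 = 1/sigma, beta = beta_tilde/sigma.
theta1, theta2, beta = -beta0 / sigma, 1.0 / sigma, beta_tilde / sigma
z = theta1 + theta2 * y - x @ beta

# Exact datum: log f_Z(b(y,x)'theta) + log(b'(y,x)'theta) = log phi(z) + log(theta2).
loglik_tm = stats.norm.logpdf(z) + np.log(theta2)
loglik_lm = stats.norm.logpdf(y, loc=beta0 + x @ beta_tilde, scale=sigma)
assert np.isclose(loglik_tm, loglik_lm)

# Left-censored datum y in (-inf, ybar]: contribution log F_Z(b(ybar,x)'theta).
ybar = 2.0
z_bar = theta1 + theta2 * ybar - x @ beta
assert np.isclose(stats.norm.logcdf(z_bar),
                  stats.norm.logcdf(ybar, loc=beta0 + x @ beta_tilde, scale=sigma))
```

The censored contributions only involve F_Z, not the density, which is what makes the likelihood in the display above convenient to evaluate.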
Definition 2 (Score residuals)
Let (F_Z, b, ϑ) be a fully specified TM. On the scale of the transformation function, add an additional parameter, α, to arrive at the TM (F_Z, (b⊤, 1)⊤, (ϑ⊤, −α)⊤) with distribution function F_{Y|x}(y | x) = F_Z(b(y, x)⊤ϑ − α). Because the model is fully specified, α is constrained to 0. The score residual for a single datum y ∈ (y̲, ȳ] is now defined as

r := ∂ℓ(ϑ, α; y, x)/∂α |_{ϑ̂, α ≡ 0},   (4)

which can be understood as the score contribution of a single observation to test α = 0 for a covariate which is not included in the model. When viewed as a random variable, the vector of score residuals has mean zero asymptotically and its components are asymptotically uncorrelated (Farrington, 2000). The score residuals can be derived in closed form for a transformation model and observations under any form of uninformative censoring:

r =
  −f′_Z(b(y, x)⊤ϑ̂)/f_Z(b(y, x)⊤ϑ̂),                                        y = (y̲ + ȳ)/2, exact,
  −f_Z(b(ȳ, x)⊤ϑ̂)/F_Z(b(ȳ, x)⊤ϑ̂),                                         y ∈ (−∞, ȳ], left,
  f_Z(b(y̲, x)⊤ϑ̂)/(1 − F_Z(b(y̲, x)⊤ϑ̂)),                                   y ∈ (y̲, +∞), right,
  (f_Z(b(y̲, x)⊤ϑ̂) − f_Z(b(ȳ, x)⊤ϑ̂))/(F_Z(b(ȳ, x)⊤ϑ̂) − F_Z(b(y̲, x)⊤ϑ̂)), y ∈ (y̲, ȳ], interval.

Example 3 (Lm, cont'd)
By including the additional intercept parameter in the normal linear model in eq. (3), the score residuals are given by

r = ∂ℓ(ϑ1, ϑ2, β, α; y, x)/∂α |_{ϑ̂1, ϑ̂2, β̂, α ≡ 0}
  = ∂/∂α {log φ(ϑ1 + ϑ2 y − x⊤β − α) + log(ϑ2)} |_{ϑ̂1, ϑ̂2, β̂, α ≡ 0}
  = ϑ̂1 + ϑ̂2 y − x⊤β̂ = (y − β̂0 − x⊤β̃̂)/σ̂.

In this simple case the score residuals are equivalent to scaled least-squares residuals, which underlines the more general nature of score residuals.
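A small numerical check of Example 3 (simulated data, our own choice of coefficients): differentiating the log-likelihood in the artificial shift parameter α by central differences recovers the scaled least-squares residuals.

```python
import numpy as np
from scipy import stats

def score_residual_numeric(z_hat, eps=1e-6):
    """r = d/d(alpha) log phi(z_hat - alpha) at alpha = 0, via central differences."""
    ll = lambda a: stats.norm.logpdf(z_hat - a)
    return (ll(eps) - ll(-eps)) / (2 * eps)

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
y = 0.5 + 1.2 * x + rng.normal(scale=0.8, size=n)

# Fit the plain normal linear model by least squares.
X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat
sigma_hat = np.sqrt(np.mean(resid ** 2))

# On the TM scale, z_hat = theta1_hat + theta2_hat * y - x * beta_hat.
z_hat = resid / sigma_hat
r = score_residual_numeric(z_hat)

# For the normal linear TM, score residuals are the scaled least-squares residuals.
assert np.allclose(r, resid / sigma_hat, atol=1e-4)
```

Because log φ is quadratic, the central difference is exact up to floating-point error; for other choices of F_Z the closed-form expressions above would be used instead.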
We are now ready to cast transformation models into the framework of SEMs. In the definition below, we define linear SEMs for TMs in terms of the transformed random variable Z := h(Y | x, h, a) and transform it back to the scale of Y. This strategy is motivated by SEMs for generalized linear models (GLMs), which parametrize the natural parameter instead of modelling the response directly (e.g., Skrondal and Rabe-Hesketh, 2004).
Definition 3 (Linear structural equation transformation model)
Let the conditional distribution of Y | x, h, a be given by the transformation model cdf F_{Y|x,h,a} = F_Z ∘ h. We set up a SEM for the transformed response

h(Y | x, h, a) = F_Z⁻¹(F_{Y|x,h,a}(Y | x, h, a)),

which is distributed according to F_Z. By Corollary 1 in Hothorn et al. (2018), the transformation function exists, is unique and monotone non-decreasing in its argument. The following distributional equalities and structural equations summarize the data generating process:

F_{Y|x,h,a}(y | x, h, a) = F_Z(h(y | x, h, a))
h(Y) ← b(Y)⊤θ + B_{YX} X + B_{YH} H + M_Y A
X ← B_{XX} X + B_{XH} H + M_X A + ε_X
H ← B_{HH} H + M_H A + ε_H
A ← ε_A.

In contrast to the linear SEM in eq. (1), the SEM is set up for the transformed response. The term b(y)⊤θ can be viewed as an intercept function, which fixes the overall shape of the transformation function. The remaining additive components of the transformation function, in turn, solely shift the transformation up- or downwards with the covariates. This may seem restrictive at first; however, all covariates influence not only the conditional mean, but all higher conditional moments of F_{Y|x,h,a}. A graphical representation of the SEM from Definition 3 is given in Figure 3. However, we do not display the possibility that some components of X directly influence each other, and likewise for H. In fact, in the simulations in Section 4, the coefficients B_{XX} = B_{HH} = 0.

Figure 3: Linear structural equation model for a transformation model. Instead of setting up the SEM on the scale of Y, it is set up on the scale of the transformation function h. The conditional distribution of Y is still fully determined by h and F_Z.

Next, we will present our main proposal on distributional anchor regression to achieve robust TMs with respect to perturbations on the anchor variables A.
We now formulate distributional anchor regression, for which we consider a distributional loss function, i.e., the negative log-likelihood, which can take into account censored observations and captures the conditional distribution of Y | X, and a causal regularizer involving score residuals. We first give some intuition on how to arrive at the distributional anchor loss, starting from the L2 anchor loss. One can decompose the L2 anchor loss

L(β; y, X, A) = (1/2)(‖(Id − Π_A)(y − Xβ)‖₂²/n + γ‖Π_A(y − Xβ)‖₂²/n)

into a squared-error and a pure penalty term. We rewrite

L(β; y, X, A) ∝ ‖(Id − Π_A)(y − Xβ)‖₂² + γ‖Π_A(y − Xβ)‖₂² = ‖y − Xβ‖₂² + (γ − 1)‖Π_A(y − Xβ)‖₂²,

which is a sum of the squared-error loss and a causal regularizer involving the L2 norm of the residuals (y − Xβ) projected linearly onto the space spanned by the columns of A, to encourage uncorrelatedness between the residuals and the anchor variables. The cross-terms when expanding the L2 norm vanish because Π_A is an orthogonal projection. Now an intuitive choice for the distributional anchor regression loss would be

L(β; y, X, A) ∝ −ℓ(β; y, X) + (γ − 1)‖Π_A r‖₂²,

where the negative log-likelihood induced by a transformation model, ℓ(·), replaces the squared-error loss and, most importantly, r denotes the vector of score residuals as defined in Section 2.3. Thus, the loss encourages uncorrelatedness between the anchor variables and the score residuals, particularly for large values of γ. The origin and importance of score residuals is outlined in Appendix B. We now give a definition of the distributional anchor loss.

Definition 4 (Distributional anchor loss)
Consider a linear TM and its SEM, as in Definition 3. Then, the distributional anchor loss is defined as

L(ϑ; y, X, A, ξ) = −ℓ(ϑ; y, X)/n + ξ‖Π_A r‖₂²/n,

where ℓ(·) denotes the log-likelihood induced by a TM, r denotes the vector of score residuals and ξ ∈ [0, +∞) controls the extent of causal regularization. As mentioned earlier, the log-likelihood is conditional on X.

Example 4 (Lm, cont'd)
For normal linear regression with constant variance, the L2 anchor loss and the distributional anchor loss are equivalent. This is because

L(ϑ1, ϑ2, β; y, X, A, ξ) = −(1/n) Σᵢ log φ(ϑ1 + ϑ2 yᵢ − xᵢ⊤β) − log(ϑ2) + ξ‖Π_A(ϑ1 + ϑ2 y − Xβ)‖₂²/n
 = ‖y − β0 − Xβ̃‖₂²/(2σ²n) + ξ‖Π_A(y − β0 − Xβ̃)‖₂²/(σ²n) + C.

Absorbing all additive constants into C and factoring out the variance renders the above objective equivalent to the L2 anchor loss for ξ = (γ − 1)/2.

In the following section, we will empirically evaluate the prediction performance of transformation models estimated under the distributional anchor loss. Computational details for fitting TMs using the distributional anchor loss are given in Appendix C.
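The key algebraic step in Section 3, namely that the cross-terms vanish because Π_A is an orthogonal projection, can be verified numerically; a minimal sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, q = 100, 3, 2
A = rng.normal(size=(n, q))   # anchors
X = rng.normal(size=(n, p))   # covariates
y = rng.normal(size=n)
beta = rng.normal(size=p)
gamma = 5.0

# Orthogonal projection onto the column space of the anchors A.
Pi_A = A @ np.linalg.solve(A.T @ A, A.T)
r = y - X @ beta

# ||(Id - Pi_A) r||^2 + gamma ||Pi_A r||^2  ==  ||r||^2 + (gamma - 1) ||Pi_A r||^2
lhs = np.sum(((np.eye(n) - Pi_A) @ r) ** 2) + gamma * np.sum((Pi_A @ r) ** 2)
rhs = np.sum(r ** 2) + (gamma - 1.0) * np.sum((Pi_A @ r) ** 2)
assert np.isclose(lhs, rhs)
```

The same identity is what allows the distributional anchor loss to be written as negative log-likelihood plus a pure penalty on the projected score residuals.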
We begin the section by describing the experimental setup in the application and simulation studies and then present the results in Sections 4.1 and 4.2. We consider median housing prices in the BostonHousing2 dataset (Harrison and Rubinfeld, 1978) to illustrate an application of anchor TMs in normal linear regression (Lm), which assumes normality and equal variances. To lift these assumptions, a continuous probit (c-probit) regression is used to model more complex, skewed distributions, which are typical for housing prices. Lastly, we use a continuous logit (c-logit) model, which now takes into account the censored observations in the BostonHousing2 dataset and enables more easily interpretable shift effects on the log-odds scale. Furthermore, the proposed distributional framework for anchor regression is evaluated in simulation studies for Lm, c-probit and ordered logistic regression (o-logit). A summary of the models used to empirically evaluate anchor TMs is given in Table 1.
For the BostonHousing2 data we wish to predict corrected median housing prices (cmedv) from several socio-economical and environmental factors. These include per capita crime rates (crim), average number of rooms (rm), and nitric oxide concentration (nox), among others. Each observation corresponds to a single census tract in Boston. Individual cities will serve as anchors in this example because they are plausibly exogenous factors that induce heterogeneity in the observed covariates and housing prices. "Leave-one-environment-out" (LOEO) cross-validation is used to demonstrate the change in estimated regression coefficients and NLL, comparing a plain model without causal regularization (ξ = 0) to three different anchor TMs over a large range of causal regularization (Figure 4).

Table 1: Transformation models used to illustrate the distributional anchor loss. F_SL = expit denotes the standard logistic distribution. By ỹ we denote the dummy-encoded response, ỹ = e(k), for Y taking class y_k, k = 1, ..., K. Here, e(k) denotes the kth unit vector. In the experiments, the basis functions for y are Bernstein polynomials with maximum order P, b_{Bs,P}(y) ∈ ℝ^{P+1}. Because the transformation function h(y) = b(y)⊤ϑ must be monotone non-decreasing, we require some constraints on the parameters of the transformation function.

Name      Transformation model                         Constraints                     Response
Lm        (Φ, (1, y, x⊤)⊤, (θ1, θ2, −β⊤)⊤)             θ2 > 0                          Continuous
c-probit  (Φ, (b_{Bs,P}⊤, x⊤)⊤, (θ⊤, −β⊤)⊤)            θ1 ≤ ··· ≤ θ_{P+1}              Continuous
c-logit   (F_SL, (b_{Bs,P}⊤, x⊤)⊤, (θ⊤, −β⊤)⊤)         θ1 ≤ ··· ≤ θ_{P+1}              Continuous
o-logit   (F_SL, (ỹ⊤, x⊤)⊤, (θ⊤, −β⊤)⊤)                θ1 < ··· < θ_{K−1}, θ_K = +∞    Ordinal
For some of the left-out towns the conditional distribution of cmedv will differ from the training distribution and contain unseen perturbations. In this case, the town will be harder to predict, leading to a worse cross-validated NLL compared to the environments which are not perturbed. We hypothesize that an anchor TM should improve prediction performance for the census tracts in these hard-to-predict towns, in analogy to the distributional robustness results described in eq. (2), whereas it should perform worse than a plain TM for environments with only mild perturbations.

First, a linear model assuming homoscedasticity and conditional normality is used to estimate the conditional distribution of cmedv depending on the socio-economic factors described above. A notable reduction in the observed worst-case loss is already observed for mild causal regularization, without losing too much predictive performance for the other environments (Figure 4 A). For stronger causal regularization, the cross-validated NLL becomes gradually worse. However, assuming a symmetric distribution for prices ignores the typical skewness of these outcomes. The c-probit model allows a non-linear basis expansion in the response and thus relaxes the homoscedasticity and conditional normality assumption. Essentially the same gain in terms of worst-case CV NLL is observed for the c-probit model compared to Lm. Figure 5 shows the predicted conditional densities for the three observations in Boston Beacon Hill and emphasizes the importance of modelling cmedv using a right-skewed distribution. A disadvantage of switching from a linear probit to a non-linear probit model is the loss of interpretability of the individual regression coefficients (e.g., Fahrmeir et al., 2007). However, also this disadvantage can be overcome in the framework of anchor TMs by specifying a different distribution function, F_Z, while keeping the basis expansion in the outcome equally flexible.
In addition, the housing prices above $50'000 (cmedv = 50) are right-censored in the BostonHousing2 data, which is commonly ignored in analyses, but crucial to capture the uncertainty in predicting the skewed outcome (Gilley and Kelley Pace, 1996). The c-logit model now takes into account right censoring of the observations and yields regression coefficients that are interpretable as log odds-ratios (Lohse et al., 2017). Indeed, for census tract Loc 1 the c-logit model puts more probability mass on prices of cmedv beyond $50'000 relative to the density at its mode compared to the c-probit model, which treats the censored observations as exact (Figure 5). Taking into account right censoring apparently made out-of-sample prediction for Boston Beacon Hill, Back Bay and North End easier, but the improvement through causal regularization diminished slightly (Figure 4 A).

Figure 4: Leave-one-environment-out cross-validation under increasing causal regularization for the BostonHousing2 data, with town as anchors. A linear (Lm), continuous probit (c-probit) and continuous logit (c-logit) model is fitted on 91 towns and used to predict the left-out town. A: Out-of-sample NLL for the left-out census tracts. Beacon Hill, Back Bay and North End are consistently hardest to predict. Consequently, for these towns the cross-validated NLL improves with increasing causal regularization up to a certain point. For the majority of the remaining towns prediction performance decreases. We thus indeed improve worst-case prediction, in analogy to eq. (2). Note that log ξ = −∞ corresponds to the unpenalized model. B: Estimated regression coefficients, which are interpretable as differences in means (Lm), differences in transformed means (c-probit) and log odds-ratios (c-logit). Solely the c-logit model accounts for right-censored observations. With increasing causal regularization the estimates shrink towards zero, indicating that town may be a weak instrument (see Appendix D).

Coefficient estimates for all three models are shown in Figure 4 B. With increasing amounts of causal regularization, all estimates shrink towards 0, which indicates town may be a weak instrument (Imbens and Rosenbaum, 2005); for more details see Appendix D. However, intermediate amounts of causal regularization yield estimates for which anchors and score residuals are somewhat de-correlated and still lead to the desired robustness against perturbations in unseen environments.
Figure 5: Density estimates for the three census tracts (Loc 1, Loc 2, Loc 3) in Boston Beacon Hill, the hardest-to-predict town in terms of LOEO cross-validated NLL for ξ = 10 (cf. Figure 4). Lm assumes equal variances and conditional normality, whereas c-probit loosens this assumption, leading to more accurate, skewed distributions. Only c-logit takes into account right censoring in the data.
In this section, the results of the simulation scenarios are presented. The parameters for the SEMs used to simulate from the models in Table 1 are summarized in Table 2.
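Before turning to the individual scenarios, the following sketch illustrates plain L2 anchor regression on a toy anchor SEM of our own (an illustrative example, not one of the scenarios in Table 2). It uses the fact that the L2 anchor estimator is ordinary least squares after transforming the data with W = Id + (√γ − 1)Π_A (Rothenhäusler et al., 2018).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000

# Toy SEM: anchor A -> X, hidden confounder H -> (X, Y), causal effect X -> Y of 0.5.
A = rng.choice([-1.0, 1.0], size=n)
H = rng.normal(size=n)
X = 2.0 * A + H + rng.normal(size=n)
y = 0.5 * X + H + rng.normal(size=n)

Amat = A[:, None]
Pi_A = Amat @ np.linalg.solve(Amat.T @ Amat, Amat.T)

def anchor_regression(X, y, Pi_A, gamma):
    """L2 anchor regression as OLS on data transformed by
    W = Id + (sqrt(gamma) - 1) * Pi_A."""
    W = np.eye(len(y)) + (np.sqrt(gamma) - 1.0) * Pi_A
    return np.linalg.lstsq(W @ X, W @ y, rcond=None)[0]

Xmat = X[:, None]
b_ols = anchor_regression(Xmat, y, Pi_A, gamma=1.0)       # gamma = 1 recovers OLS
b_anchor = anchor_regression(Xmat, y, Pi_A, gamma=100.0)  # strong regularization
b_check = np.linalg.lstsq(Xmat, y, rcond=None)[0]
assert np.allclose(b_ols, b_check)
# Since A is a valid instrument here, strong regularization pulls the estimate
# toward the causal coefficient 0.5 and away from the confounded OLS solution.
assert abs(b_anchor[0] - 0.5) < abs(b_check[0] - 0.5)
```

Distributional anchor regression replaces the squared-error part of this objective by a negative log-likelihood and the least-squares residuals by score residuals, so no closed form is available and the loss is minimized numerically (cf. Appendix C).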
We begin with a comparison of linear anchor regression and the distributional version of linear anchor regression in scenario la, which was first used to study anchor regression in Bühlmann (2020). The non-linear scenario nla also stems from Bühlmann (2020), which we use to show how shift transformation models can estimate non-linear conditional expectation functions, albeit for their linear model formulation in the covariates. For the last two scenarios iv1 and iv2, the IV assumptions hold, i.e., the anchors influence only X. Scenario iv1 showcases discrete anchors, a continuous response and a non-linear transformation function, which we model by a continuous probit regression. Scenario iv2 features an ordinal response and a more detailed simulation, including various shift strengths. In scenarios la, iv1 and iv2, the effect from X to Y is denoted by β, whereas the non-linear f is used in scenario nla. For the data generating processes that involve transformation models, the transformation function h is specified. For ordinal responses the number of classes, K, and for continuous outcomes the maximum order of the Bernstein basis, P, determines the number of parameters for the baseline transformation. The parameters of the Bernstein basis are fixed by applying the transformation function h to a (P + 1)-vector of evenly spaced values in the desired support of Y. In turn, such a basis approximation leads to a distribution approximation for the true distribution of Y which improves as P increases. However, the transformation function is constrained to be monotone non-decreasing, which makes a parsimonious parametrization sufficient.

Table 2: Simulation scenarios used to empirically evaluate distributional anchor regression. Scenarios la and nla are adapted from Bühlmann (2020) and will be used to evaluate linear and continuous probit anchor TMs. The seven covariates omitted in the table in both scenarios are noise covariates, i.e., β_j = 0, j ≠ 2, 3. In nla, f(X) is a non-linear function of X1 and X2 composed of linear and indicator terms. Both scenarios use n_train = 300 and n_test = 2000 observations. In the iv scenarios the instrumental variables assumptions hold, because the anchor neither influences the hidden confounders nor the response. The scenarios generalize Example 1 in Rothenhäusler et al. (2018) to anchor TMs with a continuous outcome (iv1) and an ordinal outcome (iv2). Both use n_train = 1000 and n_test = 2000 observations. [Table body: for each scenario (la with Lm, nla with c-probit, iv1 with c-probit with h = Φ⁻¹ ∘ F_χ² and P = 6, iv2 with o-logit), the error distributions, the SEM parameters M_Y, M_X, M_H, B_YH, B_XH and the test-time perturbations on A.]

la

The linear anchor scenario la was first presented in Bühlmann (2020) for the L2 anchor loss. The performance gain of using anchor regression compared to a plain linear model is shown in Figure 6 for the L2 anchor loss (left) and the distributional anchor loss (right). A performance gain across all quantiles of the log-likelihood contributions can be observed. However, the larger the quantile, the higher the performance gain. The extent of causal regularization was chosen based on the theoretical insight that, in a multivariate normal model, γ can be interpreted as the quantile of a χ² distribution, which relates the expected size of the unobserved perturbations to the conditional mean squared error given the anchors (Lemma 1 in Rothenhäusler et al., 2018).

Figure 6: Test performance averaged over 100 simulations for scenario la for n_train = 300 and n_test = 2000. The α-quantiles of the test absolute prediction error APE := |y − ŷ|, where ŷ denotes the conditional median, are shown for linear anchor regression (A) and the negative log-likelihood contributions for distributional (conditionally Gaussian) linear L2 anchor regression (B). The two models are equivalent up to estimating the residual variance via maximum likelihood in the distributional anchor TM.
The change in perspective from an L2 to a distributional loss requires different evaluation metrics, of which the log-likelihood, being a proper scoring rule (Good, 1952), is the most natural choice.

nla

In scenario nla, which features non-linear anchor regression, a continuous probit model is fitted. Figure 7 shows a vast gain in performance over all quantiles, comparable to what was observed in Bühlmann (2020) with a different nonlinear anchor boosting method.

Figure 7: Test performance over 100 simulations for scenario nla with n_train = 300 and n_test = 2000. NLL (A) and α-quantiles of the negative log-likelihood contributions (B) for the c-probit anchor TM. The test data are generated under strong push-interventions on the distribution of the anchors (cf. Table 2).

This gain in performance can be explained in the causal generalization framework of Christiansen et al., because the causal function linearly extends outside the support of X_train.

Scenario iv1

In scenario la the anchors influence the response and hidden confounders, violating the instrumental variable assumptions. Scenario iv1 explores binary anchors as valid instruments, while the baseline transformation b_{Bs,6}(y)⊤θ is non-linear. Note that although the model formulation is linear in β, the conditional expectation function may be non-linear, because of the non-linear transformation function. This is inspired by Example 1 in Rothenhäusler et al. (2018) and translates it into a transformation model SEM from Definition 3 for continuous but non-normal outcomes.
Figure 8: Test performance over 100 simulations for scenario iv1 with n_train = 1000 and n_test = 2000. Quantiles of the individual log-likelihood contributions (A) and estimates of β (B) for increasingly strong causal regularization. The ground truth is indicated by a dashed line. The test data are generated under a strong point intervention on A (cf. Table 2).

Although A is a valid instrument, the causal parameter seems to be slightly biased in the finite-sample case for larger ξ (Figure 8 B).

iv2

The instrumental variable assumptions hold also in the last scenario iv2. However, the response's distribution is now ordered categorical and more varying parameters are considered, including the number of classes of the response, the strength of the instruments and the perturbations in the test data (cf. Table 2).

Figure 9 depicts test NLL alongside the estimated shift parameter, β̂. Also here, in case of strong perturbations anchor regression protects against unseen perturbations for larger ξ (e.g., do(A = 1.8) and do(A = 3) for M_A = −0.5), resulting in improved test predictions. However, if the shift perturbations are not innovative, test prediction suffers with increasing amounts of causal regularization (e.g., do(A = 1) for M_X = 2). Note the interplay between the strength of the anchors, M_X, and the strength of the shift interventions. For larger M_X, the training data become more heterogeneous and the larger shifts are not as innovative, resulting in a weaker performance of anchor TMs for increasing ξ. Again, the estimated shift parameter seems to be slightly biased (Figure 9 B).

Figure 9: Test performance and coefficient estimates over 200 simulations for scenario iv2. Because the results are comparable for differing sample sizes and numbers of classes, solely the results for n_train = 1000 and K = 10 are displayed. A: Test log-likelihood contributions for varyingly strong instruments (columns) and perturbation sizes (rows). B: Parameter estimates β̂ for all intervention scenarios together, since they do not influence estimation. The simulated ground truth value of β is indicated.
The proposed method of distributional anchor regression generalizes (non-)linear anchor regression beyond the assumptions of normality and homoscedasticity and beyond estimating solely a conditional mean.

In an exemplary analysis of the BostonHousing2 data we have illustrated the flexibility of anchor TMs and demonstrated a gain in prediction performance in terms of worst-case cross-validated log-likelihood, while preserving interpretability and appropriately accounting for censored observations. The simulations show comparable results to established linear and non-linear anchor regression models under both IV and invalid IV scenarios and extend the notion of invariance between residuals and environments to other than continuous responses. Although anchor TMs are generally unable to recover the causal parameter, we argue that the "diluted causal" (Bühlmann, 2020) parameter, β_{→∞} := lim_{ξ→∞} β̂(ξ), is interesting in its own right, for it induces invariance between anchors and score residuals and allows robust test predictions in the presence of distributional shifts. Much like the causal parameter, the diluted causal parameter leads to (aspects of) invariant conditional distributions across environments. Even though the powerful causal interpretation is lost, distributional anchor regression yields models that allow causally flavored interpretations in terms of stabilization and robustification across environments.

Anchor TMs estimate the whole conditional distribution and thus enable robust prediction of a multitude of responses, which we demonstrated for (censored) continuous and ordered categorical responses. Possible extensions of anchor TMs are numerous. For instance, other types of responses include count and time-to-event data. The framework of anchor TMs contains a fully parametric version of the Cox proportional hazards model (Hothorn et al., 2018), although an extension to classical survival models is also possible.
For instance, the Cox proportional hazards model (Cox, 1972) can be fitted by substituting the likelihood for the partial likelihood (Cox, 1975) in the distributional anchor loss, while the score residuals are equivalent to martingale residuals (cf. Appendix B; Barlow and Prentice, 1988; Therneau et al., 1990). As in high-dimensional linear and non-linear anchor regression, anchor TMs could be fitted under a lasso penalty (Tibshirani, 1996). The idea of using a different class of residuals can also be translated to other model classes, such as deviance residuals for GLMs, as long as the theoretical requirements discussed in Section 3 are met. In terms of future work, further theoretical investigation of the distributional anchor loss, such as bounds on the generalization error, is warranted. So far we restricted distributional regression to linear (in x) TMs because of their already highly flexible nature and simplicity in the considered DGPs. However, more complex experimental designs require, for instance, random effects or time-varying effects of covariates in time-to-event data. Taken together, anchor TMs lay the foundation for future work on distributional extensions of anchor regression.

Acknowledgements.
The research of L. Kook and B. Sick was supported by the Novartis Research Foundation (FreeNovation 2019). The research of P. Bühlmann was supported by the European Research Council under the Grant Agreement No 786461 (CausalStats - ERC-2017-ADG). We thank Torsten Hothorn for fruitful discussions and input on the exemplary application. We also thank Malgorzata Roos for her helpful comments on the manuscript.
References
Martín Abadi et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems, 2015. URL https://tensorflow.org/. Software available from tensorflow.org.
Joshua D. Angrist, Guido W. Imbens, and Donald B. Rubin. Identification of Causal Effects Using Instrumental Variables. Journal of the American Statistical Association, 91(434):444–455, 1996. doi: 10.1080/01621459.1996.10476902.
A. Azzalini and A. W. Bowman. A Look at Some Data on the Old Faithful Geyser. Applied Statistics, 39(3):357, 1990. doi: 10.2307/2347385.
William E. Barlow and Ross L. Prentice. Residuals for relative risk regression. Biometrika, 75(1):65–74, 1988. doi: 10.1093/biomet/75.1.65.
Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004. doi: 10.1017/CBO9780511804441.
Peter Bühlmann. Invariance, Causality and Robustness. Statistical Science, 35(3):404–426, 2020. doi: 10.1214/19-STS721.
Peter Bühlmann and Domagoj Ćevid. Deconfounding and Causal Regularisation for Stability and External Validity. International Statistical Review, 88(1):114–134, 2020. doi: 10.1111/insr.12426.
Rich Caruana. Multitask Learning. Machine Learning, 28(1):41–75, 1997. doi: 10.1023/A:1007379606734.
Yuansi Chen and Peter Bühlmann. Domain adaptation under structural causal models. arXiv preprint arXiv:2010.15764, 2020.
Rune Christiansen, Niklas Pfister, Martin Emil Jakobsen, Nicola Gnecco, and Jonas Peters. A causal framework for distribution generalization. arXiv preprint arXiv:2006.07433, 2020.
D. R. Cox. Regression Models and Life-Tables. Journal of the Royal Statistical Society: Series B (Methodological), 34(2):187–202, 1972. doi: 10.1111/j.2517-6161.1972.tb00899.x.
D. R. Cox. Partial likelihood. Biometrika, 62(2):269–276, 1975. doi: 10.1093/biomet/62.2.269.
D. R. Cox and E. J. Snell. A General Definition of Residuals. Journal of the Royal Statistical Society: Series B (Methodological), 30(2):248–265, 1968. doi: 10.1111/j.2517-6161.1968.tb00724.x.
Bradley Efron. Prediction, Estimation, and Attribution. Journal of the American Statistical Association, 115(530):636–655, 2020. doi: 10.1080/01621459.2020.1762613.
Ludwig Fahrmeir, Thomas Kneib, Stefan Lang, and Brian Marx. Regression. Springer, 2007.
C. P. Farrington. Residuals for Proportional Hazards Models with Interval-Censored Survival Data. Biometrics, 56(2):473–482, 2000. doi: 10.1111/j.0006-341X.2000.00473.x.
Anqi Fu, Balasubramanian Narasimhan, and Stephen Boyd. CVXR: An R package for disciplined convex optimization. Journal of Statistical Software, 2020. Accepted for publication.
Otis W. Gilley and R. Kelley Pace. On the Harrison and Rubinfeld data. Journal of Environmental Economics and Management, 31(3):403–405, 1996. doi: 10.1006/jeem.1996.0052.
I. J. Good. Rational decisions. Journal of the Royal Statistical Society: Series B (Methodological), 14(1):107–114, 1952. doi: 10.1111/j.2517-6161.1952.tb00104.x.
Trygve Haavelmo. The Statistical Implications of a System of Simultaneous Equations. Econometrica, 11(1):1, 1943. doi: 10.2307/1905714.
David Harrison and Daniel L. Rubinfeld. Hedonic housing prices and the demand for clean air. Journal of Environmental Economics and Management, 5(1):81–102, 1978. doi: 10.1016/0095-0696(78)90006-2.
Torsten Hothorn. Most Likely Transformations: The mlt Package. Journal of Statistical Software, 92(1), 2020. doi: 10.18637/jss.v092.i01.
Torsten Hothorn, Thomas Kneib, and Peter Bühlmann. Conditional transformation models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):3–27, 2014. doi: 10.1111/rssb.12017.
Torsten Hothorn, Lisa Möst, and Peter Bühlmann. Most Likely Transformations. Scandinavian Journal of Statistics, 45(1):110–134, 2018. doi: 10.1111/sjos.12291.
Guido W. Imbens and Paul R. Rosenbaum. Robust, accurate confidence intervals with a weak instrument: quarter of birth and education. Journal of the Royal Statistical Society: Series A (Statistics in Society), 168(1):109–126, 2005. doi: 10.1111/j.1467-985X.2004.00339.x.
Diederik P. Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
Lucas Kook and Torsten Hothorn. Regularized Transformation Models: The tramnet Package, 2020. URL https://cran.r-project.org/package=tramnet.
Lucas Kook, Lisa Herzog, Torsten Hothorn, Oliver Dürr, and Beate Sick. Ordinal Neural Network Transformation Models: Deep and interpretable regression models for ordinal outcomes. arXiv preprint arXiv:2010.08376, 2020.
S. W. Lagakos. The graphical evaluation of explanatory variables in proportional hazard regression models. Biometrika, 68(1):93–98, 1981. doi: 10.1093/biomet/68.1.93.
James K. Lindsey et al. Parametric statistical inference. Oxford University Press, 1996.
Tina Lohse, Sabine Rohrmann, David Faeh, and Torsten Hothorn. Continuous outcome logistic regression for analyzing body mass index distributions. F1000Research, 6:1933, 2017. doi: 10.12688/f1000research.12934.1.
Sara Magliacane, Thijs van Ommen, Tom Claassen, Stephan Bongers, Philip Versteeg, and Joris M. Mooij. Domain adaptation by using causal inference to predict invariant conditional distributions. In Advances in Neural Information Processing Systems, pages 10846–10856, 2018.
Florian Markowetz, Steffen Grossmann, and Rainer Spang. Probabilistic soft interventions in conditional Gaussian networks. In Tenth International Workshop on Artificial Intelligence and Statistics, pages 214–221. Society for Artificial Intelligence and Statistics, 2005.
Jovana Mitrovic, Brian McWilliams, Jacob Walker, Lars Buesing, and Charles Blundell. Representation learning via invariant causal mechanisms. arXiv preprint arXiv:2010.07922, 2020.
Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2009.
Judea Pearl. Causality. Cambridge University Press, 2009.
R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2020.
Ievgen Redko, Emilie Morvant, Amaury Habrard, Marc Sebban, and Younès Bennani. A survey on domain adaptation theory. arXiv preprint arXiv:2004.11829, 2020.
Mateo Rojas-Carulla, Bernhard Schölkopf, Richard Turner, and Jonas Peters. Invariant Models for Causal Transfer Learning. Journal of Machine Learning Research, 19:1–34, 2018. doi: 10.5555/3291125.3291161.
Dominik Rothenhäusler, Nicolai Meinshausen, Peter Bühlmann, and Jonas Peters. Anchor regression: heterogeneous data meets causality. arXiv preprint arXiv:1801.06229, 2018. To appear in Journal of the Royal Statistical Society: Series B (Statistical Methodology).
Anders Skrondal and Sophia Rabe-Hesketh. Generalized latent variable modeling: Multilevel, longitudinal, and structural equation models. CRC Press, 2004.
Stephen M. Stigler. The seven pillars of statistical wisdom. Harvard University Press, 2016.
Adarsh Subbaswamy and Suchi Saria. From development to deployment: dataset shift, causality, and shift-stable models in health AI. Biostatistics, 21(2):345–352, 2019. doi: 10.1093/biostatistics/kxz041.
Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei A. Efros, and Moritz Hardt. Test-time training for out-of-distribution generalization. arXiv preprint arXiv:1909.13231, 2019.
T. M. Therneau, P. M. Grambsch, and T. R. Fleming. Martingale-based residuals for survival models. Biometrika, 77(1):147–160, 1990. doi: 10.1093/biomet/77.1.147.
Robert Tibshirani. Regression Shrinkage and Selection Via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996. doi: 10.1111/j.2517-6161.1996.tb02080.x.
A Notation
Random variables are written in uppercase italic, Y, and realizations thereof in lowercase, y. When stacked to a vector of n observations, we write $y = (y_1, \dots, y_n)^\top$. Random vectors are written like random variables, but in bold, X, and realizations thereof in lowercase bold, x. Stacked to a matrix for n observations, we write $X = (x_1^\top, \dots, x_n^\top)^\top \in \mathbb{R}^{n \times p}$. Matrices are written in bold, normal uppercase, A, vectors in bold italic lowercase, a. The probability measure of a random variable Y is denoted by $P_Y$. Coefficient matrices in the SEMs are denoted by M for the anchors A and by B for all other components. When restricting the coefficient matrix B to a single component, e.g., the effect of X on Y, we write $B_{YX}$. Because for M it is clear from context, we omit the A in the index.

B Background on Score Residuals
Stigler (2016) considers residuals the seventh "pillar of statistical wisdom", which highlights their conceptual importance. Here, we briefly justify the use of the score residual in transformation models based on the work of Lagakos (1981), who introduced martingale residuals for survival analysis under interval censoring. Farrington (2000) presents a summary of the different types of residuals used in survival analysis. The purpose of score residuals in transformation models is to generalize the notion of a residual to a wider class of models, namely TMs, and to allow for all kinds of uninformative censoring. In Definition 2, we note that the score residual can be interpreted as the score contribution of a newly introduced intercept parameter, which is constrained to zero. This is equivalent to formulating a score test for α = 0 for a covariate that is not yet included in the model. Since α is constrained to zero throughout, we do not need to evaluate the model under the alternative hypothesis. Note also that we restrict α to an intercept term on the scale of the transformation function; Farrington gives a compelling argument why one should do so. This connection is drawn next. As an adjustment of Cox–Snell residuals (Cox and Snell, 1968), Lagakos proposes a centered version thereof. Barlow and Prentice (1988) later drew the connection to stochastic calculus and coined the term "martingale residual" (see also Therneau et al., 1990). For interval censored and right censored, as well as exact responses, Lagakos residuals have a direct connection to score statistics (Farrington, 2000). The setting is rooted in survival analysis, but the connection to transformation models will become apparent.
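As a numerical aside (not part of the original derivation), the connection between score and martingale residuals is easy to check in the simplest survival model: for an exponential baseline hazard, $\Lambda_0(t) = \lambda t$, the martingale residual is $r = \delta - \Lambda(t \mid x)$, with $\delta$ the event indicator, and it has mean zero at the true parameters. The following is a minimal sketch in Python/NumPy (the paper's own implementation is in R; the DGP and all parameter values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, lam, beta = 100_000, 0.5, 0.8
x = rng.normal(size=n)

# Exponential survival times with hazard lam * exp(x * beta);
# NumPy parameterizes the exponential by its scale = 1 / hazard.
t_event = rng.exponential(1.0 / (lam * np.exp(x * beta)))
t_cens = rng.exponential(2.0, size=n)            # independent right censoring
t = np.minimum(t_event, t_cens)
delta = (t_event <= t_cens).astype(float)        # 1 = exact, 0 = right censored

# Martingale (Lagakos) residual: delta - Lambda(t | x)
cum_haz = lam * t * np.exp(x * beta)
r = delta - cum_haz

print(abs(r.mean()))  # close to 0 at the true parameters
```

In parametric proportional hazards models the score for β is the covariate-weighted sum of these residuals, which is the sense in which score and martingale residuals coincide.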
Lagakos starts from a general proportional hazards model, where
$$\lambda(t \mid x) = \lambda_0(t) \exp(x^\top \beta)$$
denotes the hazard function depending on time t and covariates x, which decomposes into the product of a baseline hazard function $\lambda_0(t)$ and the exponentiated linear predictor $x^\top \beta$. The cumulative hazard function is then defined as
$$\Lambda(t \mid x) = \int_0^t \lambda(u \mid x) \, du.$$
From there, the cdf can be recovered as
$$F_T(t \mid x) = 1 - \exp(-\Lambda_0(t) \exp(x^\top \beta)).$$
By definition $\Lambda_0(t) > 0$ for all $t$, hence
$$F_T(t \mid x) = 1 - \exp(-\exp(\log \Lambda_0(t) + x^\top \beta)),$$
which is a transformation model with minimum extreme value inverse link, $F_Z = F_{\mathrm{MEV}}$, and transformation function $h(t) = \log \Lambda_0(t)$, the log cumulative baseline hazard. Farrington now assumes that the baseline hazard function belongs to a family $\mathcal{F}$ that is closed under scaling by a factor $\gamma \in \mathbb{R}_{++}$, i.e., $\Lambda_0(t) \in \mathcal{F} \Rightarrow \gamma \Lambda_0(t) \in \mathcal{F}$. The Lagakos residual is now defined as
$$\hat{r}_L = \frac{\partial \ell}{\partial \alpha} \bigg|_{\hat{\beta}, \hat{S}_0}$$
for $\alpha = \log \gamma$ and $\hat{S}_0$ being the estimated survivor curve under $x = 0$, i.e., $\hat{S}_0 = \exp(-\hat{\Lambda}(t \mid x = 0))$. This translates to the family of log (cumulative) hazard functions being closed under addition of $\alpha = \log \gamma \in \mathbb{R}$, which is exactly the case for transformation models. This step corresponds to adding a new intercept term on the scale of the transformation function and constraining it to zero. Lagakos residuals behave like the usual score statistic, i.e., they have mean zero asymptotically and are asymptotically uncorrelated.

C Computational Details
Anchor TMs, simulation scenarios and visualizations are written in the R language for statistical computing (Version 3.6.3, R Core Team, 2020). To implement distributional anchor regression, note that the transformation model log-likelihood is concave w.r.t. ϑ, unless all observations are right censored. The penalty resulting from the quadratic form r⊤Π_A r is convex if r, from Definition 2, is affine or convex (Boyd and Vandenberghe, 2004). In the case of exact observations in Lm and c-probit, the resulting penalty is convex and thus solvable by a constrained convex optimizer. Kook and Hothorn (2020) implement regularized transformation models under elastic net penalties in tramnet, using the domain-specific language optimizer CVXR (Fu et al., 2020). From tramnet, we use the TM likelihood implementation. For the interval censored or right censored models (o-logit, c-logit) we fit the models using (stochastic) gradient descent (SGD) from the package TensorFlow (Abadi et al., 2015). The implementation of the interval censored log-likelihood for ordinal TMs was taken from Kook et al. (2020) and we used SGD with the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 10−, a batch size of 250 and 200 epochs. Parameters were initialized with the maximum likelihood estimate for ξ = 0 obtained via tram::Polr() (Hothorn, 2020).

D Invalid Instruments and Shrinkage of Estimates
In the linear IV setting an instrument is called weak if it has little impact on the explanatory variables X. Consequently, a two-stage least squares procedure will yield homogeneous predictions for X in the first stage, because X ⊥ A implies Π_A X ≡ const. This leads to regressing the response on a matrix with constant columns, making it impossible to separate β from an intercept. Thus, in the case of weak instruments, the resulting estimates β̂ will shrink towards 0 as ξ → ∞ when using a causal regularizer. This explains the effect seen for the BostonHousing2 data in Section 4.1 and makes the diluted causal parameter impossible to study, because it is constant at 0. However, intermediary amounts of causal regularization yield a benefit in terms of worst-case LOEO CV (cf.
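The interplay between causal regularization and IV estimation described above can be made concrete in the plain linear case. Rothenhäusler et al. (2018) show that linear anchor regression is ordinary least squares on the transformed data $\tilde{X} = X + (\sqrt{\gamma} - 1)\Pi_A X$ (and analogously for $y$): $\gamma = 1$ recovers OLS, and $\gamma \to \infty$ recovers the two-stage least squares (IV) solution when the instrument is strong. The NumPy sketch below is illustrative only — it is not the paper's code (which is in R), and the confounded DGP is hypothetical:

```python
import numpy as np

def anchor_lm(X, y, P, gamma):
    """Linear anchor regression via OLS on the transformed data
    X + (sqrt(gamma) - 1) * Pi_A X (Rothenhausler et al., 2018)."""
    a = np.sqrt(gamma) - 1.0
    return np.linalg.lstsq(X + a * (P @ X), y + a * (P @ y), rcond=None)[0]

rng = np.random.default_rng(0)
n = 2000
A = rng.normal(size=(n, 1))              # strong anchor/instrument
H = rng.normal(size=n)                   # hidden confounder
x = 2.0 * A[:, 0] + H + rng.normal(size=n)
y = 1.5 * x + H + rng.normal(size=n)     # causal effect of x on y is 1.5
X = x[:, None]

P = A @ np.linalg.solve(A.T @ A, A.T)    # projection onto the anchor space

beta_ols = anchor_lm(X, y, P, 1.0)[0]    # gamma = 1: plain OLS, confounded
beta_iv = np.linalg.lstsq(P @ X, P @ y, rcond=None)[0][0]   # 2SLS estimate
beta_inf = anchor_lm(X, y, P, 1e8)[0]    # heavy causal regularization

print(beta_ols, beta_iv, beta_inf)       # beta_inf approaches the IV solution
```

With a weak anchor instead (coefficient of A on x near zero), Π_A X is nearly constant and this limit degenerates, which is the shrinkage phenomenon discussed in this appendix.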