Conditional Distributional Treatment Effect with Kernel Conditional Mean Embeddings and U-Statistic Regression
Junhyung Park, Uri Shalit, Bernhard Schölkopf, Krikamol Muandet
Junhyung Park∗
Max Planck Institute for Intelligent Systems, Tübingen, Germany

Uri Shalit
Technion, Israel Institute of Technology

Bernhard Schölkopf
Max Planck Institute for Intelligent Systems, Tübingen, Germany

Krikamol Muandet
Max Planck Institute for Intelligent Systems, Tübingen, Germany

Abstract
We propose to analyse the conditional distributional treatment effect (CoDiTE), which, in contrast to the more common conditional average treatment effect (CATE), is designed to encode a treatment's distributional aspects beyond the mean. We first introduce a formal definition of the CoDiTE associated with a distance function between probability measures. Then we discuss the CoDiTE associated with the maximum mean discrepancy via kernel conditional mean embeddings, which, coupled with a hypothesis test, tells us whether there is any conditional distributional effect of the treatment. Finally, we investigate what kind of conditional distributional effect the treatment has, both in an exploratory manner via the conditional witness function, and in a quantitative manner via U-statistic regression, generalising the CATE to higher-order moments. Experiments on synthetic, semi-synthetic and real datasets demonstrate the merits of our approach.
1 Introduction

Analysing the effect of a treatment (medical drug, economic programme, etc.) has long been a problem of great importance, and has attracted researchers from diverse domains, including econometrics [Imbens and Wooldridge, 2009], political science [Künzel et al., 2019], healthcare [Foster et al., 2011] and social science [Imbens and Rubin, 2015]. The field has naturally received much attention from statisticians over the years [Rosenbaum, 2002, Rubin, 2005, Imbens and Rubin, 2015], and in the past few years, the machine learning community has started applying its own armoury to this problem; see Section 1.2 for a succinct review.

Traditional methods for treatment effect evaluation focus on the analysis of the average treatment effect (ATE), such as an increase or decrease in average income, inequality or poverty, aggregated over the population. However, the ATE is not informative about the individual responses to the intervention and how the treatment impact varies across individuals (known as treatment effect heterogeneity). The study of the conditional average treatment effect (CATE) has been proposed to analyse such heterogeneity in the mean treatment effect (see Section 1.1 for the definitions of the ATE and CATE). Although sufficient in many cases, the CATE is still an average. As such, it fails to capture information about distributional aspects of the treatment beyond the mean. A significant amount of interest exists in developing methods that can analyse distributional treatment effects conditioned on the covariates [Chang et al., 2015, Bitler et al., 2017, Shen, 2019, Chernozhukov et al., 2020, Hohberg et al., 2020, Briseño Sanchez et al., 2020].

∗ Corresponding author: [email protected]
Figure 1: Toy illustration of higher-order heterogeneity that cannot be captured by the CATE. (a) Data. X is drawn uniformly, and the potential outcomes Y_0 and Y_1 are linear functions of X plus Gaussian noise whose standard deviation is constant for small values of X and increases with X thereafter; in particular, the CATE is increasing with X. (b) Hypothesis test (Section 4.2). Each of the hypotheses P_{Y_0|X} ≡ P_{Y_0|X}, P_{Y_1|X} ≡ P_{Y_1|X} and P_{Y_0|X} ≡ P_{Y_1|X} is tested 100 times. The last (false) hypothesis is rejected in most tests, while the first two (true) hypotheses are not rejected in most tests, meaning that both Type I and Type II errors are low. (c) Conditional witness function (Section 5.1). The conditional witness function is close to zero for all Y at larger values of X, demonstrating that P_{Y_0|X} and P_{Y_1|X} are similar in this region of X. For smaller values of X, the witness function is positive in regions where the density of Y_0 is higher than that of Y_1, and negative in regions where the density of Y_1 is higher than that of Y_0. (d) U-statistic regression (Section 5.2). The true conditional standard deviation (in black) is estimated (in red and blue for the control and treatment groups respectively) as a function of X via U-statistic regression (since the variance is a U-statistic) and the square-root operation. We see that the standard deviation increases linearly for larger values of X.

Our contributions are as follows. Firstly, we formally define the conditional distributional treatment effect (CoDiTE) associated with a chosen distance function between distributions. Then we use kernel conditional mean embeddings [Song et al., 2013, Park and Muandet, 2020a] to analyse the CoDiTE associated with the maximum mean discrepancy [Gretton et al., 2012]. Coupled with a statistical hypothesis test, this can determine whether there exists any effect of the treatment, conditioned on a set of covariates. Finally, we use conditional witness functions and U-statistic regression to investigate what kind of effect the treatment has.
1.1 Problem Setup

Throughout this paper, we take (Ω, F, P) as the underlying probability space, X as the input space and Y ⊆ R as the output space. Let Z : Ω → {0, 1}, X : Ω → X and Y_0, Y_1, Y : Ω → Y be random variables representing, respectively, the treatment assignment, the covariates, the potential outcomes under control and treatment, and the observed outcome, i.e. Y = Y_0(1 − Z) + Y_1 Z. For example, Z may indicate whether a subject is administered a medical treatment (Z = 1) or not (Z = 0). The potential outcomes Y_1, Y_0 respectively correspond to the subject's responses had they received the treatment or not. The covariates X correspond to the subject's characteristics, such as age, gender and race, that could influence both the potential outcomes and the choice of treatment. We denote the distributions of random variables by subscripting P, e.g. P_X for the distribution of X. Throughout, we impose the mild condition that the conditional distribution P(· | X) admits a regular version [Çınlar, 2011, p.150, Definition 2.4, Proposition 2.5].

Each unit i = 1, ..., n is associated with an independent copy (X_i, Z_i, Y_{0i}, Y_{1i}) of (X, Z, Y_0, Y_1). However, for each i = 1, ..., n, we observe either Y_{0i} or Y_{1i}; this missing value problem is known as the fundamental problem of causal inference [Holland, 1986], preventing us from directly computing the difference in the outcomes under treatment and control for each unit. As a result, we only have access to samples {(x_i, z_i, y_i)}_{i=1}^n of (X, Z, Y). We write n_0 = Σ_{i=1}^n 1{z_i = 0} and n_1 = Σ_{i=1}^n 1{z_i = 1} for the control and treatment sample sizes, and denote the control and treatment samples by {(x_i^0, y_i^0)}_{i=1}^{n_0} and {(x_i^1, y_i^1)}_{i=1}^{n_1}.

We assume strong ignorability [Rosenbaum and Rubin, 1983]: unconfoundedness, Z ⊥⊥ (Y_0, Y_1) | X, and overlap, 0 < e(X) = P(Z = 1 | X) = E[Z | X] < 1. Causal treatment effects are then identifiable from observational data, since P_{Y_0|X} = P_{Y_0|X,Z=0} = P_{Y|X,Z=0}, and similarly for P_{Y_1|X}. The quantity e(X) is the propensity score. In a randomised experiment, e(X) is known and controlled [Imbens and Rubin, 2015, p.40, Definition 3.10].

The usual objects of interest in the treatment effect literature are the average treatment effect (ATE), E[Y_1 − Y_0], and the conditional average treatment effect (CATE), T(x) = E[Y_1 − Y_0 | X = x]. In this paper, we propose to extend the analysis to compare other aspects of the conditional distributions P_{Y_0|X} and P_{Y_1|X}.
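To make the notation concrete, here is a minimal Python sketch (with hypothetical functional forms and constants, not the data-generating processes used in our experiments) that simulates covariates, a propensity score, potential outcomes and the observed outcome, illustrating that only one potential outcome per unit is ever observed.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Covariates and propensity score e(X) = P(Z = 1 | X), satisfying overlap 0 < e(X) < 1
X = rng.uniform(0, 1, size=n)
e = 1 / (1 + np.exp(-(X - 0.5)))
Z = rng.binomial(1, e)                     # treatment assignment

# Potential outcomes; unconfoundedness holds since the noise is independent of Z given X
Y0 = 3 + 5 * X + rng.normal(0, 1, size=n)
Y1 = 3 + 9 * X + rng.normal(0, 1, size=n)  # CATE T(x) = 4x under these hypothetical surfaces

# Fundamental problem of causal inference: only Y = Y0*(1 - Z) + Y1*Z is observed
Y = Y0 * (1 - Z) + Y1 * Z

# Control and treatment samples, of sizes n0 and n1
x0, y0 = X[Z == 0], Y[Z == 0]
x1, y1 = X[Z == 1], Y[Z == 1]
print(len(x0), len(x1))
```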
One compelling reason to do this is that estimating the CATE is inherently a problem of comparing two means, and as such, is only meaningful if the corresponding variances are given. Consider the toy example in Figure 1. The CATE is constructed to be increasing with X, but taking into account the variance, the treatment effect is clearly more pronounced for small values of X. For example, the probability of Y_1 being greater than Y_0 is much higher for smaller values of X.

Beyond the mean and variance, researchers may also be interested in other higher-moment treatment effect heterogeneity, such as Gini's mean difference or skewness, or indeed in how the entire conditional densities of the control and treatment groups differ given the covariates, in an exploratory fashion. Panels (b), (c) and (d) in Figure 1 demonstrate each of the steps we propose in this paper applied to this toy dataset: hypothesis testing of equality of conditional distributions, the conditional witness function and U-statistic regression (of the variance, in this instance), respectively.

1.2 Related Work

In the past few years the machine learning community has focused much effort on models for estimating the CATE function. Some approaches include Gaussian processes [Alaa and van der Schaar, 2017, Alaa and Schaar, 2018], Bayesian regression trees [Hill, 2011, Hahn et al., 2020], random forests [Wager and Athey, 2018], neural networks [Johansson et al., 2016, Shalit et al., 2017, Louizos et al., 2017, Atan et al., 2018, Shi et al., 2019], GANs [Yoon et al., 2018], boosting and adaptive regression splines [Powers et al., 2018] and kernel mean embeddings [Singh et al., 2020].

Distributional extensions of the ATE have been considered by many authors. Abadie [2002] tested the hypotheses of equality and stochastic dominance of the marginal outcome distributions P_{Y_0} and P_{Y_1}, whereas Kim et al. [2018], Muandet et al. [2018], Singh et al. [2020, Appendix E] focus on estimating P_{Y_0} and P_{Y_1}, or some distance between them. These works do not consider treatment effect heterogeneity.

The CoDiTE incorporates both distributional considerations of treatment effects and treatment effect heterogeneity. Interest has been growing, especially in the econometrics literature, for such analyses; indeed, Bitler et al. [2017] provided concrete evidence that in some settings, the CATE does not suffice. Existing works that analyse the CoDiTE can be split into three categories, depending on how distributions are characterised: (i) quantiles, (ii) cumulative distribution functions, and (iii) specific distributional parameters, such as the mean, variance, skewness, etc. In category (i), quantile regression is a powerful tool [Koenker, 2005]; however, in order to get a distributional picture via quantiles, one needs to estimate a large number of quantiles, and issues of crossing quantiles arise, whereby estimated quantiles are non-monotone. In category (ii), Chernozhukov et al. [2013, 2020] propose splitting Y into a grid and regressing for the cumulative distribution function at each point in the grid, but this also brings issues of non-monotonicity of the cumulative distribution function, similar to crossing quantiles. Shen [2019] estimates the cumulative distribution functions P(Y_0 < y*) and P(Y_1 < y*) for each y* ∈ Y given each value of X = x by essentially applying the Nadaraya-Watson conditional U-statistic of Stute [1991] to the U-kernel h(y) = 1(y ≤ y*). In category (iii), generalised additive models for location, scale and shape (GAMLSS) [Stasinopoulos et al., 2017] have been applied for CoDiTE analysis [Hohberg et al., 2020, Briseño Sanchez et al., 2020], but being a parametric model, despite its flexibility, the researcher has to choose a model beforehand to proceed, and issues of model misspecification are unavoidable.

Interest has also always existed for hypothesis tests in the context of treatment effect analysis, especially in econometrics [Imbens and Wooldridge, 2009, Sections 3.3 and 5.12]. Abadie [2002] tested the equality between the marginal distributions of Y_0 and Y_1, while Crump et al. [2008] tested for the equality of E[Y_0 | X] and E[Y_1 | X]. Lee and Whang [2009], Lee [2009], Chang et al. [2015], Shen [2019] were interested, among others, in the hypothesis of the equality of P_{Y_0|X} and P_{Y_1|X}, which we consider in Section 4.2.

Summary of Contributions
We characterise distributions in two ways: first as elements in a reproducing kernel Hilbert space via kernel conditional mean embeddings, which, to the best of our knowledge, is a novel attempt in the treatment effect literature, and secondly via specific distributional parameters, as in category (iii). The former characterisation gives us a novel way of testing for the equality of conditional distributions, as well as an exploratory tool for density comparison between the groups via conditional witness functions. For the latter characterisation, we provide, to the best of our knowledge, a novel U-statistic regression technique by generalising kernel ridge regression, which, in contrast to GAMLSS, is fully nonparametric. Neither characterisation requires the estimation of a large number of quantities, unlike characterisations via quantiles or cumulative distribution functions.
2 Preliminaries

In this section, we briefly review reproducing kernel Hilbert space embeddings and U-statistics. A more complete introduction can be found in Appendix A.
2.1 Reproducing Kernel Hilbert Space Embeddings

Let l : Y × Y → R be a (scalar) positive definite kernel on Y with reproducing kernel Hilbert space (RKHS) H_l [Berlinet and Thomas-Agnan, 2004, p.7, Definition 1]. Given a random variable Y on Y satisfying E[√(l(Y, Y))] < ∞, the kernel mean embedding of Y is defined as µ_Y(·) = E[l(Y, ·)] [Smola et al., 2007, Eqn. (2a)]. Given two random variables Y and Y′, the maximum mean discrepancy (MMD) between them is defined as ‖µ_Y − µ_{Y′}‖_{H_l} [Gretton et al., 2012, Lemma 4], where µ_Y − µ_{Y′} is the (unnormalised) witness function (Gretton et al., 2012, Section 2.3; Lloyd and Ghahramani, 2015, Eqn. (3.2)). If the embedding is injective from the space of probability measures on Y to H_l, then we say that l is characteristic [Fukumizu et al., 2008, Section 2.2], in which case the MMD is a proper metric. Given another random variable X on X, the conditional mean embedding (CME) of Y given X is defined as µ_{Y|X} = E[l(Y, ·) | X] [Park and Muandet, 2020a, Definition 3.1].

Denote by L²(X, P_X; H_l) the Hilbert space of (equivalence classes of) measurable functions F : X → H_l such that ‖F(·)‖²_{H_l} is P_X-integrable, with inner product ⟨F_1, F_2⟩ = ∫_X ⟨F_1(x), F_2(x)⟩_{H_l} dP_X(x). Given an operator-valued kernel Γ : X × X → L(H_l), where L(H_l) is the Banach space of bounded linear operators H_l → H_l, there exists an associated vector-valued RKHS of functions X → H_l [Carmeli et al., 2006, Definition 2.1, Definition 2.2, Proposition 2.3].
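As a concrete illustration of these embeddings, the following sketch (assuming a Gaussian kernel, which is characteristic, and two small synthetic samples) computes a plug-in estimate of the squared MMD by replacing the expectations defining µ_Y and µ_{Y′} with sample averages.

```python
import numpy as np

def gauss(a, b, sigma=1.0):
    """Gram matrix [l(a_i, b_j)] of the Gaussian kernel."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma ** 2))

def mmd_squared(y, y_prime, sigma=1.0):
    """Plug-in estimate of MMD^2 = ||mu_Y - mu_Y'||^2, using sample means for the embeddings."""
    return (gauss(y, y, sigma).mean()
            - 2 * gauss(y, y_prime, sigma).mean()
            + gauss(y_prime, y_prime, sigma).mean())

rng = np.random.default_rng(0)
y = rng.normal(0.0, 1.0, 200)        # sample from P_Y
y_prime = rng.normal(0.5, 1.5, 200)  # sample from P_Y'
print(mmd_squared(y, y_prime))       # positive here, since the two distributions differ
```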
2.2 U-Statistics

Let Y_1, ..., Y_r be independent copies of Y, and let h : Y^r → R be a symmetric function, i.e. for any permutation π of (1, ..., r), h(y_1, ..., y_r) = h(y_{π(1)}, ..., y_{π(r)}), such that h(Y_1, ..., Y_r) is integrable. Given i.i.d. copies {Y_i}_{i=1}^n of Y, the U-statistic [Hoeffding, 1948; Serfling, 1980, p.172] for an unbiased estimation of θ(P_Y) = E[h(Y_1, ..., Y_r)] is θ̂(Y_1, ..., Y_n) = C(n, r)^{-1} Σ h(Y_{i_1}, ..., Y_{i_r}), where C(n, r) is the binomial coefficient and the summation is over the C(n, r) combinations of r distinct elements {i_1, ..., i_r} from {1, ..., n}.

This has been extended to the conditional case by Stute [1991]. Given another random variable X on X and independent copies X_1, ..., X_r of it, we can consider the estimation of θ(P_{Y|X}) = E[h(Y_1, ..., Y_r) | X_1, ..., X_r]. Stute [1991], Derumigny [2019] extend the Nadaraya-Watson regressor [Nadaraya, 1964, Watson, 1964] to estimate θ(P_{Y|X}).
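For example, with r = 2, the U-kernel h(y_1, y_2) = ½(y_1 − y_2)² has expectation Var(Y), and h(y_1, y_2) = |y_1 − y_2| has expectation Gini's mean difference. Below is a minimal sketch of the (unconditional) U-statistic, assuming n is small enough that enumerating all combinations is feasible.

```python
import numpy as np
from itertools import combinations

def u_statistic(y, h, r):
    """Average of h over all combinations of r distinct sample points (Hoeffding, 1948)."""
    combos = list(combinations(range(len(y)), r))
    return sum(h(*y[list(c)]) for c in combos) / len(combos)

h_var = lambda y1, y2: 0.5 * (y1 - y2) ** 2    # E[h] = Var(Y)
h_gini = lambda y1, y2: abs(y1 - y2)           # E[h] = Gini's mean difference

rng = np.random.default_rng(0)
y = rng.normal(0.0, 2.0, 50)
print(u_statistic(y, h_var, 2), y.var(ddof=1))  # the two values agree exactly
print(u_statistic(y, h_gini, 2))
```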
3 Conditional Distributional Treatment Effect

In this section, we generalise the notion of the CATE to account for distributional differences between treatment and control groups, rather than just the mean difference.

Definition 3.1. Let D be some distance function between probability measures. We define the conditional distributional treatment effect (CoDiTE) associated with D as

U_D(x) = D(P_{Y_0|X=x}, P_{Y_1|X=x}).

Here, the choice of D depends on what characterisation of distributions is used (c.f. Section 1.2). For example, if D(P_{Y_0|X=x}, P_{Y_1|X=x}) = E[Y_1 | X = x] − E[Y_0 | X = x], we recover the CATE, i.e. U_D(x) = T(x), thereby showing that the CoDiTE is a strict generalisation of the CATE. Different choices of D will require different estimators.

The usual performance metric of a CATE estimator T̂ is the precision of estimating heterogeneous effects (PEHE) (first proposed in sample form by Hill [2011, Section 4.3]; we report the population-level definition, found in, for example, Alaa and Van Der Schaar [2019, Eqn. (5)]):

‖T̂ − T‖² = E[|T̂(X) − T(X)|²].

We propose a performance metric of an estimator of the CoDiTE in an exactly analogous manner.
Definition 3.2. Given a distance function D, for an estimator Û_D of U_D, we define the precision of estimating heterogeneous distributional effects (PEHDE) as

ψ_D(Û_D) = ‖Û_D − U_D‖² = E[|Û_D(X) − U_D(X)|²].

Again, if D measures the difference in expectations, then the associated PEHDE ψ_D reduces to the usual PEHE. Henceforth, we explore different choices of the distance function D, as well as methods of estimating the corresponding CoDiTE U_D, to answer the following questions.

(We use the conditional expectation interpretation of the CME. An interpretation of the CME as an operator from an RKHS on X to H_l also exists [Song et al., 2009, 2013, Fukumizu et al., 2013].)
Q1. Are P_{Y_0|X} and P_{Y_1|X} different? In other words, is there any distributional effect of the treatment? (Section 4)

Q2. If so, how does the distribution of the treatment group differ from that of the control group? (Section 5)
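Before turning to these questions, note that the PEHDE of Definition 3.2 can be approximated by simple Monte Carlo whenever the true CoDiTE is known, as in synthetic experiments; a minimal sketch with hypothetical true and estimated functions:

```python
import numpy as np

def pehde(U_hat, U_true, x_samples):
    """Monte Carlo approximation of E[|U_hat(X) - U(X)|^2] over the covariate distribution."""
    return np.mean((U_hat(x_samples) - U_true(x_samples)) ** 2)

rng = np.random.default_rng(0)
x_samples = rng.uniform(0, 1, 10_000)          # draws from P_X
U_true = lambda x: 4 * x                       # a known, hypothetical CoDiTE
U_hat = lambda x: 4 * x + 0.1 * np.sin(5 * x)  # a hypothetical estimate of it
print(pehde(U_hat, U_true, x_samples))
```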
4 CoDiTE Associated with the MMD

In this section, we answer Q1, i.e. we investigate whether the treatment has any effect at all. To this end we choose D to be the MMD with the associated kernel l being characteristic. Then, writing µ_{Y_0|X} and µ_{Y_1|X} for the CMEs of Y_0 and Y_1 given X respectively (c.f. Section 2.1), we have

U_MMD(x) = MMD(P_{Y_0|X=x}, P_{Y_1|X=x}) = ‖µ_{Y_0|X=x} − µ_{Y_1|X=x}‖_{H_l}.   (1)

Since l is characteristic, P_{Y_0|X=x} and P_{Y_1|X=x} are equal if and only if MMD(P_{Y_0|X=x}, P_{Y_1|X=x}) = 0. What makes the MMD a particularly convenient choice is that for each x ∈ X, P_{Y_0|X=x} and P_{Y_1|X=x} are represented by individual elements µ_{Y_0|X=x} and µ_{Y_1|X=x} in the RKHS H_l, which means that we can estimate the associated CoDiTE simply by performing regression with X as the input space and H_l as the output space, as will be shown in the next section.

4.1 Estimation

We now discuss how to obtain empirical estimates of U_MMD(x). Recall that, by the unconfoundedness assumption, we can estimate µ_{Y_0|X} and µ_{Y_1|X} separately from the control and treatment samples respectively. We perform operator-valued kernel regression [Micchelli and Pontil, 2005, Kadri et al., 2016] in separate vector-valued RKHSs G_0 and G_1, endowed with kernels Γ_0(·,·) = k_0(·,·) Id and Γ_1(·,·) = k_1(·,·) Id, where k_0, k_1 : X × X → R are scalar-valued kernels and Id : H_l → H_l is the identity operator. Following Park and Muandet [2020a, Eqn. (4)], the empirical estimates µ̂_{Y_0|X} and µ̂_{Y_1|X} of µ_{Y_0|X} and µ_{Y_1|X} are constructed, for each x ∈ X, as

µ̂_{Y_0|X=x} = k_0^T(x) W_0 l_0 ∈ G_0  and  µ̂_{Y_1|X=x} = k_1^T(x) W_1 l_1 ∈ G_1,   (2)

where W_0 = (K_0 + n_0 λ_{n_0} I_{n_0})^{-1}, W_1 = (K_1 + n_1 λ_{n_1} I_{n_1})^{-1}, [K_0]_{1≤i,j≤n_0} = k_0(x_i^0, x_j^0), [K_1]_{1≤i,j≤n_1} = k_1(x_i^1, x_j^1), λ_{n_0}, λ_{n_1} > 0 are regularisation parameters, I_{n_0} and I_{n_1} are identity matrices, k_0(x) = (k_0(x_1^0, x), ..., k_0(x_{n_0}^0, x))^T, k_1(x) = (k_1(x_1^1, x), ..., k_1(x_{n_1}^1, x))^T, l_0 = (l(y_1^0, ·), ..., l(y_{n_0}^0, ·))^T and l_1 = (l(y_1^1, ·), ..., l(y_{n_1}^1, ·))^T.

By plugging the estimates (2) into the expression (1) for U_MMD, we can construct Û_MMD as Û_MMD(x) = ‖µ̂_{Y_0|X=x} − µ̂_{Y_1|X=x}‖_{H_l}. The next lemma establishes a closed-form expression for Û_MMD based on the control and treatment samples.
Lemma 4.1. For each x ∈ X, we have

Û_MMD(x) = ( k_0^T(x) W_0 L_0 W_0^T k_0(x) − 2 k_0^T(x) W_0 L_{01} W_1^T k_1(x) + k_1^T(x) W_1 L_1 W_1^T k_1(x) )^{1/2},

where [L_0]_{1≤i,j≤n_0} = l(y_i^0, y_j^0), [L_{01}]_{1≤i≤n_0, 1≤j≤n_1} = l(y_i^0, y_j^1) and [L_1]_{1≤i,j≤n_1} = l(y_i^1, y_j^1).

The proofs of this and all other results are deferred to Appendix C.
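The estimator of Lemma 4.1 is straightforward to implement with standard linear algebra. The sketch below assumes one-dimensional covariates and outcomes, Gaussian kernels for k_0, k_1 and l, and fixed regularisation parameters chosen for illustration rather than by cross-validation; it evaluates Û_MMD on a grid of covariate values.

```python
import numpy as np

def gauss(a, b, s=1.0):
    """Gram matrix of the Gaussian kernel between 1-d arrays a and b."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * s ** 2))

def cme_weights(x, lam):
    """W = (K + n * lambda * I)^{-1} for the conditional mean embedding estimate."""
    n = len(x)
    return np.linalg.inv(gauss(x, x) + n * lam * np.eye(n))

def u_mmd(x_grid, x0, y0, x1, y1, lam0=1e-2, lam1=1e-2):
    """Closed-form estimate of U_MMD(x) from Lemma 4.1, for each x in x_grid."""
    W0, W1 = cme_weights(x0, lam0), cme_weights(x1, lam1)
    L0, L1, L01 = gauss(y0, y0), gauss(y1, y1), gauss(y0, y1)
    k0 = gauss(x0, x_grid)           # column g is k_0(x) for grid point g
    k1 = gauss(x1, x_grid)
    a = np.einsum('ig,ij,jk,kg->g', k0, W0, L0 @ W0.T, k0)
    b = np.einsum('ig,ij,jk,kg->g', k0, W0, L01 @ W1.T, k1)
    c = np.einsum('ig,ij,jk,kg->g', k1, W1, L1 @ W1.T, k1)
    return np.sqrt(np.clip(a - 2 * b + c, 0, None))

rng = np.random.default_rng(0)
x0 = rng.uniform(0, 1, 100); y0 = 3 + 5 * x0 + rng.normal(0, 1, 100)
x1 = rng.uniform(0, 1, 100); y1 = 3 + 9 * x1 + rng.normal(0, 1, 100)
print(u_mmd(np.linspace(0, 1, 5), x0, y0, x1, y1))
```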
The next theorem shows that, using universal kernels Γ_0, Γ_1 [Carmeli et al., 2010, Definition 4.1], Û_MMD is universally consistent with respect to the PEHDE.

Theorem 4.2 (Universal consistency). Suppose that k_0, k_1 and l are bounded, that Γ_0 and Γ_1 are universal, and that λ_{n_0} and λ_{n_1} decay at slower rates than O(n_0^{-1/2}) and O(n_1^{-1/2}) respectively. Then as n_0, n_1 → ∞,

ψ_MMD(Û_MMD) = E[|Û_MMD(X) − U_MMD(X)|²] → 0 in probability.

4.2 Hypothesis Test

We are interested in whether or not the two conditional distributions P_{Y_0|X} and P_{Y_1|X}, corresponding to control and treatment, are equal. The hypotheses are then

H_0: P_{Y_0|X=x}(·) = P_{Y_1|X=x}(·) P_X-almost everywhere.
H_1: There exists A ⊆ X with positive measure such that P_{Y_0|X=x}(·) ≠ P_{Y_1|X=x}(·) for all x ∈ A.
Algorithm 1: Kernel conditional discrepancy (KCD) test of conditional distributional treatment effect

Input: data {(x_i, z_i, y_i)}_{i=1}^n, significance level α, kernels k_0, k_1, l, regularisation parameters λ_{n_0}, λ_{n_1}, number of permutations m.
Calculate t̂ using Lemma 4.4 based on the input data.
Fit kernel logistic regression (KLR) of {z_i}_{i=1}^n against {x_i}_{i=1}^n to obtain ê(x_i) for each i.
for k = 1 to m do
    For each i = 1, ..., n, sample z̃_i ∼ Bernoulli(ê(x_i)).
    Calculate t̂_k from the new dataset {(x_i, z̃_i, y_i)}_{i=1}^n.
end for
Calculate the p-value as p = (1/m) Σ_{k=1}^m 1{t̂_k > t̂}.
if p < α then reject H_0. end if

The null hypothesis H_0 means that the treatment has no distributional effect for any value of the covariates, whereas the alternative hypothesis H_1 means that the treatment has some distributional effect for a set of covariate values of positive measure. For notational simplicity, we write P_{Y_0|X} ≡ P_{Y_1|X} if H_0 holds.

We use the following criterion for P_{Y_0|X} ≡ P_{Y_1|X}, which we call the kernel conditional discrepancy (KCD):

t = E[‖µ_{Y_0|X} − µ_{Y_1|X}‖²_{H_l}].

The following lemma tells us that t can indeed be used as a criterion for P_{Y_0|X} ≡ P_{Y_1|X}.

Lemma 4.3. If l is a characteristic kernel, then P_{Y_0|X} ≡ P_{Y_1|X} if and only if t = 0.

Next, we define a plug-in estimate t̂ of t, which we will use as the test statistic of our hypothesis test:

t̂ = (1/n) Σ_{i=1}^n ‖µ̂_{Y_0|X=x_i} − µ̂_{Y_1|X=x_i}‖²_{H_l}.

Then we have a closed-form expression for t̂ as follows.
Lemma 4.4. We have

t̂ = (1/n) Tr(K̃_0 W_0 L_0 W_0^T K̃_0^T) − (2/n) Tr(K̃_0 W_0 L_{01} W_1^T K̃_1^T) + (1/n) Tr(K̃_1 W_1 L_1 W_1^T K̃_1^T),

where L_0, L_{01} and L_1 are as defined in Lemma 4.1, and [K̃_0]_{1≤i≤n, 1≤j≤n_0} = k_0(x_i, x_j^0) and [K̃_1]_{1≤i≤n, 1≤j≤n_1} = k_1(x_i, x_j^1).
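A sketch of the test statistic t̂ and the permutation scheme of Algorithm 1 is given below; for simplicity the propensity model is a plain logistic regression on a one-dimensional covariate, fitted by gradient descent, as a stand-in for kernel logistic regression, and Gaussian kernels with illustrative bandwidths and regularisation are assumed throughout.

```python
import numpy as np

def gauss(a, b, s=1.0):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * s ** 2))

def kcd_statistic(x, z, y, lam=1e-2):
    """Plug-in KCD statistic t_hat = (1/n) sum_i ||mu0_hat(x_i) - mu1_hat(x_i)||^2 (Lemma 4.4)."""
    x0, y0 = x[z == 0], y[z == 0]
    x1, y1 = x[z == 1], y[z == 1]
    n, n0, n1 = len(x), len(x0), len(x1)
    W0 = np.linalg.inv(gauss(x0, x0) + n0 * lam * np.eye(n0))
    W1 = np.linalg.inv(gauss(x1, x1) + n1 * lam * np.eye(n1))
    Kt0, Kt1 = gauss(x, x0), gauss(x, x1)             # n x n0 and n x n1
    L0, L1, L01 = gauss(y0, y0), gauss(y1, y1), gauss(y0, y1)
    t = np.trace(Kt0 @ W0 @ L0 @ W0.T @ Kt0.T)
    t -= 2 * np.trace(Kt0 @ W0 @ L01 @ W1.T @ Kt1.T)
    t += np.trace(Kt1 @ W1 @ L1 @ W1.T @ Kt1.T)
    return t / n

def kcd_test(x, z, y, m=200, seed=0):
    """Permutation p-value: resample labels from an estimated propensity score (logistic in x)."""
    rng = np.random.default_rng(seed)
    w, b = 0.0, 0.0
    for _ in range(500):                               # crude logistic-regression fit
        p = 1 / (1 + np.exp(-(w * x + b)))
        w -= 0.1 * np.mean((p - z) * x)
        b -= 0.1 * np.mean(p - z)
    e_hat = 1 / (1 + np.exp(-(w * x + b)))
    t_obs = kcd_statistic(x, z, y)
    t_null = [kcd_statistic(x, rng.binomial(1, e_hat), y) for _ in range(m)]
    return np.mean(np.array(t_null) > t_obs)           # p-value

rng0 = np.random.default_rng(1)
x = rng0.uniform(0, 1, 150); z = rng0.binomial(1, 0.5, 150)
y = 2 * x + z * x + rng0.normal(0, 0.3, 150)           # treatment shifts the conditional mean
print(kcd_test(x, z, y, m=100))                        # a small p-value suggests rejecting H_0
```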
The consistency of t̂ in the limit of infinite data is shown in the following theorem.

Theorem 4.5. Under the same assumptions as in Theorem 4.2, we have t̂ → t in probability as n_0, n_1 → ∞.

Unfortunately, it is extremely difficult to compute the (asymptotic) null distribution of t̂ analytically, so we resort to resampling the treatment labels to simulate the null distribution. To ensure that our resampling scheme respects the control and treatment covariate distributions P_{X|Z=0} and P_{X|Z=1}, we follow the conditional resampling scheme of Rosenbaum [1984]. We first estimate the propensity score e(x_i) for each datapoint x_i (e.g. using kernel logistic regression (KLR) [Zhu and Hastie, 2005, Marteau-Ferey et al., 2019]), and then resample each treatment label from this estimated propensity score. By repeating this resampling procedure and computing the test statistic on each resampled dataset, we can simulate from the null distribution of the test statistic. Finally, the test statistic computed from the original dataset is compared to this simulated null distribution, and the null hypothesis is rejected or not rejected accordingly. The exact procedure is summarised in Algorithm 1.

5 What Kind of Effect Does the Treatment Have?

After determining whether P_{Y_0|X} and P_{Y_1|X} are different via the MMD-associated CoDiTE and hypothesis testing, we now turn to Q2, i.e. we investigate how they are different.
Table 1: Root mean square error in estimating the conditional standard deviation, with standard error over 100 simulations, for GAMLSS (implemented via the R package gamlss [Rigby and Stasinopoulos, 2005]) and our U-statistic regression via generalised kernel ridge regression (U-regression KRR; implemented via the Falkon library in Python [Rudi et al., 2017, Meanti et al., 2020]). Lower is better. Rows: GAMLSS and U-regression KRR; columns: Control and Treatment groups under each of Settings SN, LN and HN.
5.1 Conditional Witness Functions

For two real-valued random variables, the witness function between them is a useful tool for visualising where their densities differ, without explicitly estimating the densities (Gretton et al., 2012, Figure 1; Lloyd and Ghahramani, 2015, Figure 1). We extend this to the conditional case with the (unnormalised) conditional witness function µ_{Y_0|X} − µ_{Y_1|X}.

Let us fix x ∈ X. The witness function between P_{Y_0|X=x} and P_{Y_1|X=x} is µ_{Y_0|X=x} − µ_{Y_1|X=x} : Y → R. For y ∈ Y in regions where the density of P_{Y_0|X=x} is greater than that of P_{Y_1|X=x}, we have µ_{Y_0|X=x}(y) − µ_{Y_1|X=x}(y) > 0. For y in regions where the converse is true, we similarly have µ_{Y_0|X=x}(y) − µ_{Y_1|X=x}(y) < 0. The greater the difference in density, the greater the magnitude of the witness function. For each y ∈ Y, the associated CoDiTE is

U_{witness,y}(x) = µ_{Y_0|X=x}(y) − µ_{Y_1|X=x}(y).

The estimates in (2) can be plugged in to obtain the estimate Û_{witness,y}(x) = µ̂_{Y_0|X=x}(y) − µ̂_{Y_1|X=x}(y). Since convergence in the RKHS norm implies pointwise convergence [Berlinet and Thomas-Agnan, 2004, p.10, Corollary 1], Theorem 4.2 implies the consistency of Û_{witness,y} with respect to the corresponding PEHDE. Clearly, if X is more than 1-dimensional, heat maps as in Figure 1(c) cannot be plotted; however, fixing a particular x ∈ X, µ̂_{Y_0|X=x} − µ̂_{Y_1|X=x} can be plotted against y, since Y ⊆ R. Such plots are informative of where the density of P_{Y_0|X=x} is greater than that of P_{Y_1|X=x} and vice versa.

5.2 U-Statistic Regression

Next, we consider CoDiTE on specific distributional quantities, such as the mean, variance or skewness, or some function thereof. For example, Briseño Sanchez et al. [2020, Eqn. (2)] were interested, in addition to the CATE, in the treatment effect on the standard deviation, U_D(x) = std(Y_1 | X = x) − std(Y_0 | X = x). Our motivating example in Figure 1 could inspire a "standardised" version of the CATE:

U_D(x) = (E[Y_1 | X = x] − E[Y_0 | X = x]) / sqrt(Var(Y_1 | X = x) + Var(Y_0 | X = x)).   (3)

Many of these quantities can be represented as the expectation of a U-kernel, i.e. E[h(Y_1, ..., Y_r)] (c.f. Section 2.2). For example, h(y) = y gives the mean, h(y_1, y_2) = ½(y_1 − y_2)² gives the variance and h(y_1, y_2) = |y_1 − y_2| gives Gini's mean difference. We consider their conditional counterparts, i.e. θ(P_{Y_0|X}) = E[h(Y_{0,1}, ..., Y_{0,r}) | X_1, ..., X_r] and θ(P_{Y_1|X}) = E[h(Y_{1,1}, ..., Y_{1,r}) | X_1, ..., X_r] (c.f. Section 2.2), where (X_j, Y_{0,j}, Y_{1,j}), j = 1, ..., r, are independent copies of (X, Y_0, Y_1). By Çınlar [2011, p.146, Theorem 1.17], there exist functions F_0, F_1 : X^r → R such that F_0(X_1, ..., X_r) = θ(P_{Y_0|X}) and F_1(X_1, ..., X_r) = θ(P_{Y_1|X}).

Estimation of F_0 and F_1 can be done via U-statistic regression, by generalising kernel ridge regression as follows. As in Section 4.1, let k : X × X → R be a kernel on X with RKHS H_k. Then, if we define k^r : X^r × X^r → R as k^r((x_1, ..., x_r), (x'_1, ..., x'_r)) = k(x_1, x'_1) ... k(x_r, x'_r), Berlinet and Thomas-Agnan [2004, p.31, Theorem 13] tells us that k^r is a reproducing kernel on X^r with RKHS H_k^r = H_k ⊗ ... ⊗ H_k, the r-times tensor product of H_k, whose elements are functions X^r → R. We estimate F_0 in H_k^r. Given any F ∈ H_k^r, the natural least-squares risk is

E(F) = E[(F(X_1, ..., X_r) − h(Y_{0,1}, ..., Y_{0,r}))²].
In practice, if the CoDiTE involves ratios of estimated quantities, we do not recommend plugging the estimates directly into the ratio, since, if the denominator is small, a small error in the estimation of the denominator will result in a large error in the overall CoDiTE estimate. Instead, we recommend that the practitioner estimate the numerator and the denominator separately and interpret the results directly from the raw estimates.
Figure 2: Hypothesis testing and witness functions on the IHDP dataset. (a) The hypothesis test is conducted on 100 simulations for each setting, with the bar chart showing the proportion of tests rejected in each setting. In setting "LN", where the variance overwhelms the CATE, the test does not reject the hypothesis P_{Y_0|X} ≡ P_{Y_1|X}, whereas in the other two settings, the hypothesis is rejected. (b) At both X = a and X = b, the density of the control group is larger than that of the treatment group around Y = 0, and the reverse is true around Y = 4, showing the marked effect of the treatment. (c) At both X = a and X = b, the densities of the control and treatment groups are roughly equal for all Y. (d) At X = a, where the variance engulfs the CATE, the densities of the control and treatment groups are roughly equal for all Y, whereas at X = b, the witness function clearly shows where the density of one group dominates the other. The juxtaposition of witness functions at different points in the covariate space is an exploratory tool for comparing the relative strength of the treatment effect.

Recalling the control sample {(x_i^0, y_i^0)}_{i=1}^{n_0}, we solve the following regularised least-squares problem:

F̂_0 = argmin_{F ∈ H_k^r} { Ê(F) + λ_{n_0} ‖F‖²_{H_k^r} },   (4)

where the empirical least-squares risk Ê is defined as

Ê(F) = C(n_0, r)^{-1} Σ ( F(x_{i_1}^0, ..., x_{i_r}^0) − h(y_{i_1}^0, ..., y_{i_r}^0) )²,

with the summation over the C(n_0, r) combinations of r distinct elements {i_1, ..., i_r} from {1, ..., n_0}. Note that Ê(F) is itself a U-statistic for the estimation of E(F). The following is a representer theorem for the problem in (4).
Theorem 5.1. The solution F̂_0 to the problem in (4) is

F̂_0(x_1, ..., x_r) = Σ_{i_1,...,i_r=1}^{n_0} k(x_{i_1}^0, x_1) ... k(x_{i_r}^0, x_r) c_{i_1,...,i_r},

where the coefficients c_{i_1,...,i_r} ∈ R are the unique solution of the n_0^r linear equations

Σ_{j_1,...,j_r=1}^{n_0} ( k(x_{i_1}^0, x_{j_1}^0) ... k(x_{i_r}^0, x_{j_r}^0) + C(n_0, r) λ_{n_0} δ_{i_1 j_1} ... δ_{i_r j_r} ) c_{j_1,...,j_r} = h(y_{i_1}^0, ..., y_{i_r}^0).

Note that if r = 1 and h(y) = y, we recover the usual kernel ridge regression.
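For r = 2 and the variance U-kernel h(y_1, y_2) = ½(y_1 − y_2)², the linear system of Theorem 5.1 can be solved directly when n_0 is small. The sketch below is a naive O(n^4) implementation with a Gaussian kernel and an arbitrary bandwidth and regularisation parameter, intended only to illustrate the estimator; it returns the estimated conditional variance F̂(x, x), from which the conditional standard deviation follows by a square root.

```python
import numpy as np
from math import comb

def gauss(a, b, s=0.3):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * s ** 2))

def u_regression_variance(x, y, lam=1e-3):
    """U-statistic regression for r = 2 and h(y1, y2) = (y1 - y2)^2 / 2 (Theorem 5.1).

    Returns a function evaluating the estimated conditional variance F_hat(x*, x*).
    Naive solve of the full n^2 x n^2 system; for small n and illustration only."""
    n = len(x)
    K = gauss(x, x)
    H = 0.5 * (y[:, None] - y[None, :]) ** 2            # h(y_i, y_j)
    A = np.kron(K, K) + comb(n, 2) * lam * np.eye(n * n)
    c = np.linalg.solve(A, H.reshape(-1)).reshape(n, n)  # coefficients c_{i1, i2}
    def predict(x_new):
        Kx = gauss(x, x_new)                             # n x m
        return np.einsum('im,ij,jm->m', Kx, c, Kx)       # F_hat(x*, x*) for each new point
    return predict

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 40)
y = 2 * x + (0.5 + 2 * x) * rng.normal(size=40)          # true conditional sd = 0.5 + 2x
var_hat = u_regression_variance(x, y)
print(np.sqrt(np.clip(var_hat(np.array([0.2, 0.8])), 0, None)))   # estimated conditional sd
```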
The following result shows that this estimation procedure is universally consistent.

Theorem 5.2. Suppose that k^r is a bounded and universal kernel, that h(Y_1, ..., Y_r) is almost surely bounded and that λ_{n_0} decays at a slower rate than O(n_0^{-1/2}). Then as n_0 → ∞,

E[(F̂_0(X_1, ..., X_r) − F_0(X_1, ..., X_r))²] → 0 in probability.

A consistent estimate F̂_1 of F_1 is obtained by exactly the same procedure, using the treatment sample {(x_i^1, y_i^1)}_{i=1}^{n_1}.

6 Experiments

6.1 Semi-Synthetic Data: IHDP

We demonstrate the use of our methods on the Infant Health and Development Program (IHDP) dataset [Hill, 2011, Section 4]. The covariates are taken from a randomised control trial, from which a non-random portion is removed to imitate an observational study.
Figure 3: Witness functions for Black, unmarried participants up to the age of 25, unemployed in both 1974 and 1975. Each curve (witness function) corresponds to an individual in this subset.

The reason for the IHDP dataset's popularity in the CATE literature is that, for each datapoint, the outcome is simulated for both treatment and control, enabling cross-validation and evaluation, which is usually not possible in observational studies due to the missing counterfactuals. Existing works first define the noiseless response surfaces for the control and treatment groups, and generate realisations of the potential outcomes by applying Gaussian noise with constant variance across the whole dataset.

This last assumption of constant variance is somewhat unrealistic, but of little importance in evaluating CATE estimators. In our experiments, we modify the data generating process in three different ways, all of which have the same parallel linear mean response surfaces, with a CATE of 4 ("response surface A" in Hill [2011]). In setting "SN" ("small noise"), the standard deviation of the noise is constant at 1, so that the CATE of 4 translates to a meaningful treatment effect. In setting "LN" ("large noise"), the standard deviation of the noise is constant at 20, meaning that the mean difference in the response surfaces is negligible in comparison. In this case, our test does not reject the hypothesis that the two conditional distributions are the same, and there is no case for further investigation (see the middle bar in Figure 2(a)). In setting "HN" ("heterogeneous noise"), the standard deviation is heterogeneous across the dataset, so that the standard deviation is 1 for some data points while others have a standard deviation of 20. The exact data generating process is detailed in Appendix B.

In setting "HN", let us consider points a, b ∈ X with sd(Y | X = a) = 20 and sd(Y | X = b) = 1. Then even though the CATEs at a and b are equal at 4, we have std(Y_1 − Y_0 | X = a) ≫ std(Y_1 − Y_0 | X = b), so that there is a pronounced treatment effect at b, while the variance engulfs the treatment effect at a. The comparative magnitudes of the witness functions conditioned on a and b confirm this heterogeneity (see Figure 2(d)). In Table 1, the quality of estimation of the standard deviation via our U-statistic regression is compared with GAMLSS [Stasinopoulos et al., 2017] estimation for each setting.

An immediate benefit is a better understanding of the treatment. Even a perfect CATE estimator cannot capture such heterogeneity in the distributional treatment effect (variance, in this case). As argued in Section 1.1, any method that involves comparing mean values (of which the CATE is one) should also take into account the variance for it to be meaningful. This will give a clearer picture of the subpopulations on which there is a marked treatment effect, and those on which it is weaker, than relying on the CATE alone. Such knowledge should in turn influence policy decisions, in terms of which subpopulations should be targeted. We note that recently Jesson et al. [2020] considered CATE uncertainty on IHDP in the context of a different task: making or deferring treatment recommendations using Bayesian neural networks, focusing on cases where overlap fails or under covariate shift; however, distributional considerations can be important even when overlap is satisfied and no covariate shift takes place.
6.2 Real Data: LaLonde's NSW Dataset

In this section, we apply the proposed methods to LaLonde's well-known National Supported Work (NSW) dataset [LaLonde, 1986, Dehejia and Wahba, 1999], which has been widely used to evaluate estimators of treatment effects. The outcome of interest Y is the real earnings in 1978, with the treatment Z being the job training. We refer the interested reader to Dehejia and Wahba [1999, Sec. 2.1] for a detailed description of the dataset. As income distributions are known to be skewed to the right, it may be interesting to investigate not only the CATE, but the entire outcome distributions.

The test rejects the hypothesis P_{Y_0|X} ≡ P_{Y_1|X} with a p-value of 0.013. As a demonstration of the kind of exploratory analysis that can be conducted using the conditional witness functions, we focus our attention on a subset of the data on which the overlap condition is satisfied: Black, unmarried participants up to the age of 25, who were unemployed in both 1974 and 1975. Figure 3 shows the witness function for each individual in this subset, with the colour of the curve indicating whether the corresponding individual has a high school diploma.
We can see clearly that for those without a high school diploma, the treatment effect is not so pronounced, whereas there is a marked treatment effect for those with it. Negative values of the witness function for small income values mean that we are more likely to observe small income values in the control group than in the treatment group, whereas larger income values are more likely to come from the treatment group, as indicated by the positive values of the witness functions. In particular, the tail of the blue curves to the right implies a skewness of the density of the treated group relative to the control group, and the treatment group continues to have larger density than the control group for high income values, albeit to a lesser extent. Such comparison of densities in different regions of Y is not possible with the CATE, which is a simple difference of the means between the control and treated groups.

7 Conclusion

In this paper, we discussed the analysis of the conditional distributional treatment effect (CoDiTE). We first proposed a new kernel-based hypothesis test via kernel conditional mean embeddings to see whether there exists any CoDiTE. We then proceeded to investigate the nature of the treatment effect via conditional witness functions, revealing where and how much the conditional densities differ, and via U-statistic regression, which is informative about differences in specific conditional distributional quantities.

We foresee that much of the work done by the machine learning community on treatment effect analysis, although cast mostly in the context of the CATE, applies to the CoDiTE. Examples include meta-learners [Künzel et al., 2019], model validation [Alaa and Van Der Schaar, 2019], subgroup analysis [Su et al., 2009, Lee et al., 2020] and covariate balancing [Gretton et al., 2009, Kallus, 2018]. A major obstacle in any covariate-conditional analysis of treatment effects is this: when the covariate space is high-dimensional, the accuracy and reliability of the estimates deteriorate significantly due to the curse of dimensionality, and we rely heavily on the treatment effect varying smoothly across the covariate space. This limitation is present not only in the methods presented in this paper, but in any CATE or CoDiTE analysis. While out of scope for the present paper, it is of interest to investigate how to mitigate this problem.

Last but not least, we argue that the conditional distributional treatment effect can play an important role in making fair and explainable decisions, as it provides a more complete picture of the treatment effect. On the one hand, policymakers can use the tools that we develop to identify the groups of individuals for which the outcome distributions differ most through the effect modifiers. On the other hand, the presence of effect modification that is associated with sensitive attributes such as race, ethnicity and gender creates challenges for decision makers. If they knew that there is effect modification by race, for example, certain groups of individuals may be treated unfairly. In practice, our tools can potentially be used to detect the discrepancy between outcome distributions conditioned on these sensitive attributes, which is also an interesting avenue for future work.
References
A. Abadie. Bootstrap Tests for Distributional Treatment Effects in Instrumental Variable Models. Journal of the American Statistical Association, 97(457):284–292, 2002.
A. Alaa and M. Schaar. Limits of Estimating Heterogeneous Treatment Effects: Guidelines for Practical Algorithm Design. In International Conference on Machine Learning, pages 129–138, 2018.
A. Alaa and M. Van Der Schaar. Validating Causal Inference Models via Influence Functions. In International Conference on Machine Learning, pages 191–201, 2019.
A. M. Alaa and M. van der Schaar. Bayesian Inference of Individualized Treatment Effects using Multi-Task Gaussian Processes. In Advances in Neural Information Processing Systems, pages 3424–3432, 2017.
N. Aronszajn. Theory of Reproducing Kernels. Transactions of the American Mathematical Society, 68(3):337–404, 1950.
O. Atan, J. Jordon, and M. van der Schaar. Deep-Treat: Learning Optimal Personalized Treatments from Observational Data using Neural Networks. In AAAI, pages 2071–2078, 2018.
A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers, 2004.
M. P. Bitler, J. B. Gelbach, and H. W. Hoynes. Can Variation in Subgroups' Average Treatment Effects Explain Treatment Effect Heterogeneity? Evidence from a Social Experiment. Review of Economics and Statistics, 99(4):683–697, 2017.
G. Briseño Sanchez, M. Hohberg, A. Groll, and T. Kneib. Flexible Instrumental Variable Distributional Regression. Journal of the Royal Statistical Society: Series A (Statistics in Society), 183(4):1553–1574, 2020.
C. Carmeli, E. De Vito, and A. Toigo. Vector Valued Reproducing Kernel Hilbert Spaces of Integrable Functions and Mercer Theorem. Analysis and Applications, 4(04):377–408, 2006.
C. Carmeli, E. De Vito, A. Toigo, and V. Umanità. Vector Valued Reproducing Kernel Hilbert Spaces and Universality. Analysis and Applications, 8(01):19–61, 2010.
M. Chang, S. Lee, and Y.-J. Whang. Nonparametric Tests of Conditional Treatment Effects with an Application to Single-Sex Schooling on Academic Achievements. The Econometrics Journal, 18(3):307–346, 2015.
V. Chernozhukov, I. Fernández-Val, and B. Melly. Inference on Counterfactual Distributions. Econometrica, 81(6):2205–2268, 2013.
V. Chernozhukov, I. Fernández-Val, and M. Weidner. Network and Panel Quantile Effects via Distribution Regression. Journal of Econometrics, 2020.
E. Çınlar. Probability and Stochastics, volume 261. Springer Science & Business Media, 2011.
R. K. Crump, V. J. Hotz, G. W. Imbens, and O. A. Mitnik. Nonparametric Tests for Treatment Effect Heterogeneity. The Review of Economics and Statistics, 90(3):389–405, 2008.
F. Cucker and S. Smale. On the Mathematical Foundations of Learning. Bulletin of the American Mathematical Society, 39(1):1–49, 2002.
R. H. Dehejia and S. Wahba. Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs. Journal of the American Statistical Association, 94(448):1053–1062, 1999.
A. Derumigny. Estimation of a Regular Conditional Functional by Conditional U-Statistics Regression. arXiv preprint arXiv:1903.10914, 2019.
N. Dinculeanu. Vector Integration and Stochastic Integration in Banach Spaces, volume 48. John Wiley & Sons, 2000.
J. C. Foster, J. M. Taylor, and S. J. Ruberg. Subgroup Identification from Randomized Clinical Trial Data. Statistics in Medicine, 30(24):2867–2880, 2011.
K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf. Kernel Measures of Conditional Dependence. In Advances in Neural Information Processing Systems, pages 489–496, 2008.
K. Fukumizu, L. Song, and A. Gretton. Kernel Bayes' Rule: Bayesian Inference with Positive Definite Kernels. The Journal of Machine Learning Research, 14(1):3753–3783, 2013.
A. Gretton, A. Smola, J. Huang, M. Schmittfull, K. Borgwardt, and B. Schölkopf. Covariate Shift by Kernel Mean Matching. Dataset Shift in Machine Learning, 3(4):5, 2009.
A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A Kernel Two-Sample Test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
P. R. Hahn, J. S. Murray, and C. M. Carvalho. Bayesian Regression Tree Models for Causal Inference: Regularisation, Confounding, and Heterogeneous Effects (with Discussion). Bayesian Analysis, 15(3):965–1056, 2020.
J. L. Hill. Bayesian Nonparametric Modeling for Causal Inference. Journal of Computational and Graphical Statistics, 20(1):217–240, 2011.
W. Hoeffding. A Class of Statistics with Asymptotically Normal Distribution. The Annals of Mathematical Statistics, pages 293–325, 1948.
M. Hohberg, P. Pütz, and T. Kneib. Treatment Effects Beyond the Mean Using Distributional Regression: Methods and Guidance. PLoS One, 15(2):e0226514, 2020.
P. W. Holland. Statistics and Causal Inference. Journal of the American Statistical Association, 81(396):945–960, 1986.
G. W. Imbens and D. B. Rubin. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.
G. W. Imbens and J. M. Wooldridge. Recent Developments in the Econometrics of Program Evaluation. Journal of Economic Literature, 47(1):5–86, 2009.
A. Jesson, S. Mindermann, U. Shalit, and Y. Gal. Identifying Causal-Effect Inference Failure with Uncertainty-Aware Models. Advances in Neural Information Processing Systems, 33, 2020.
F. Johansson, U. Shalit, and D. Sontag. Learning Representations for Counterfactual Inference. In International Conference on Machine Learning, pages 3020–3029, 2016.
H. Kadri, E. Duflos, P. Preux, S. Canu, A. Rakotomamonjy, and J. Audiffren. Operator-Valued Kernels for Learning from Functional Response Data. The Journal of Machine Learning Research, 17(1):613–666, 2016.
N. Kallus. Optimal A Priori Balance in the Design of Controlled Experiments. Journal of the Royal Statistical Society Series B, 80(1):85–112, 2018.
K. Kim, J. Kim, and E. H. Kennedy. Causal Effects Based on Distributional Distances. arXiv preprint arXiv:1806.02935, 2018.
A. Klenke. Probability Theory: A Comprehensive Course. Springer Science & Business Media, 2013.
R. Koenker. Quantile Regression. Cambridge University Press, 2005.
S. R. Künzel, J. S. Sekhon, P. J. Bickel, and B. Yu. Metalearners for Estimating Heterogeneous Treatment Effects using Machine Learning. Proceedings of the National Academy of Sciences, 116(10):4156–4165, 2019.
R. J. LaLonde. Evaluating the Econometric Evaluations of Training Programs with Experimental Data. The American Economic Review, 76(4):604–620, 1986.
H.-S. Lee, Y. Zhang, W. Zame, C. Shen, J.-W. Lee, and M. van der Schaar. Robust Recursive Partitioning for Heterogeneous Treatment Effects with Uncertainty Quantification. Advances in Neural Information Processing Systems, 33, 2020.
M.-J. Lee. Non-parametric Tests for Distributional Treatment Effect for Randomly Censored Responses. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(1):243–264, 2009.
S. S. Lee and Y.-J. Whang. Nonparametric Tests of Conditional Treatment Effects. Technical report, Cowles Foundation for Research in Economics, Yale University, 2009.
J. R. Lloyd and Z. Ghahramani. Statistical Model Criticism using Kernel Two Sample Tests. Advances in Neural Information Processing Systems, 28:829–837, 2015.
C. Louizos, U. Shalit, J. M. Mooij, D. Sontag, R. Zemel, and M. Welling. Causal Effect Inference with Deep Latent-Variable Models. In Advances in Neural Information Processing Systems, pages 6446–6456, 2017.
U. Marteau-Ferey, F. Bach, and A. Rudi. Globally Convergent Newton Methods for Ill-Conditioned Generalized Self-Concordant Losses. In Advances in Neural Information Processing Systems, 2019.
G. Meanti, L. Carratino, L. Rosasco, and A. Rudi. Kernel Methods Through the Roof: Handling Billions of Points Efficiently. Advances in Neural Information Processing Systems, 33, 2020.
C. A. Micchelli and M. Pontil. On Learning Vector-Valued Functions. Neural Computation, 17(1):177–204, 2005.
K. Muandet, K. Fukumizu, B. Sriperumbudur, B. Schölkopf, et al. Kernel Mean Embedding of Distributions: A Review and Beyond. Foundations and Trends in Machine Learning, 10(1-2):1–141, 2017.
K. Muandet, M. Kanagawa, S. Saengkyongam, and S. Marukatat. Counterfactual Mean Embedding. arXiv preprint arXiv:1805.08845, 2018.
E. A. Nadaraya. On Estimating Regression. Theory of Probability & Its Applications, 9(1):141–142, 1964.
J. Park and K. Muandet. A Measure-Theoretic Approach to Kernel Conditional Mean Embeddings. In Advances in Neural Information Processing Systems, 2020a.
J. Park and K. Muandet. Regularised Least-Squares Regression with Infinite-Dimensional Output Space. arXiv preprint arXiv:2010.10973, 2020b.
S. Powers, J. Qian, K. Jung, A. Schuler, N. H. Shah, T. Hastie, and R. Tibshirani. Some Methods for Heterogeneous Treatment Effect Estimation in High Dimensions. Statistics in Medicine, 37(11):1767–1787, 2018.
R. A. Rigby and D. M. Stasinopoulos. Generalized Additive Models for Location, Scale and Shape (with Discussion). Applied Statistics, 54:507–554, 2005.
P. R. Rosenbaum. Conditional Permutation Tests and the Propensity Score in Observational Studies. Journal of the American Statistical Association, 79(387):565–574, 1984.
P. R. Rosenbaum. Observational Studies. Springer Science & Business Media, 2002.
P. R. Rosenbaum and D. B. Rubin. The Central Role of the Propensity Score in Observational Studies for Causal Effects. Biometrika, 70(1):41–55, 1983.
D. B. Rubin. Causal Inference using Potential Outcomes: Design, Modeling, Decisions. Journal of the American Statistical Association, 100(469):322–331, 2005.
A. Rudi, L. Carratino, and L. Rosasco. Falkon: An Optimal Large Scale Kernel Method. In Advances in Neural Information Processing Systems, pages 3888–3898, 2017.
B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.
S. Schwabik and G. Ye. Topics in Banach Space Integration, volume 10. World Scientific, 2005.
R. J. Serfling. Approximation Theorems of Mathematical Statistics. John Wiley & Sons, 1980.
U. Shalit, F. D. Johansson, and D. Sontag. Estimating Individual Treatment Effect: Generalization Bounds and Algorithms. In International Conference on Machine Learning, pages 3076–3085. PMLR, 2017.
S. Shen. Estimation and Inference of Distributional Partial Effects: Theory and Application. Journal of Business & Economic Statistics, 37(1):54–66, 2019.
C. Shi, D. Blei, and V. Veitch. Adapting Neural Networks for the Estimation of Treatment Effects. In Advances in Neural Information Processing Systems, pages 2507–2517, 2019.
C.-J. Simon-Gabriel and B. Schölkopf. Kernel Distribution Embeddings: Universal Kernels, Characteristic Kernels and Kernel Metrics on Distributions. The Journal of Machine Learning Research, 19(1):1708–1736, 2018.
R. Singh, L. Xu, and A. Gretton. Kernel Methods for Policy Evaluation: Treatment Effects, Mediation Analysis, and Off-Policy Planning. arXiv preprint arXiv:2010.04855, 2020.
A. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert Space Embedding for Distributions. In International Conference on Algorithmic Learning Theory, pages 13–31. Springer, 2007.
L. Song, J. Huang, A. Smola, and K. Fukumizu. Hilbert Space Embeddings of Conditional Distributions with Applications to Dynamical Systems. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 961–968, 2009.
L. Song, K. Fukumizu, and A. Gretton. Kernel Embeddings of Conditional Distributions: A Unified Kernel Framework for Nonparametric Inference in Graphical Models. IEEE Signal Processing Magazine, 30(4):98–111, 2013.
B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G. R. Lanckriet. Hilbert Space Embeddings and Metrics on Probability Measures. Journal of Machine Learning Research, 11(Apr):1517–1561, 2010.
B. K. Sriperumbudur, K. Fukumizu, and G. R. Lanckriet. Universality, Characteristic Kernels and RKHS Embedding of Measures. Journal of Machine Learning Research, 12(Jul):2389–2410, 2011.
M. D. Stasinopoulos, R. A. Rigby, G. Z. Heller, V. Voudouris, and F. De Bastiani. Flexible Regression and Smoothing: Using GAMLSS in R. CRC Press, 2017.
W. Stute. Conditional U-Statistics. The Annals of Probability, 19(2):812–825, 1991.
X. Su, C.-L. Tsai, H. Wang, D. M. Nickerson, and B. Li. Subgroup Analysis via Recursive Partitioning. Journal of Machine Learning Research, 10(2), 2009.
S. Wager and S. Athey. Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. Journal of the American Statistical Association, 113(523):1228–1242, 2018.
G. S. Watson. Smooth Regression Analysis. Sankhyā: The Indian Journal of Statistics, Series A, pages 359–372, 1964.
J. Yoon, J. Jordon, and M. van der Schaar. GANITE: Estimation of Individualized Treatment Effects using Generative Adversarial Nets. In International Conference on Learning Representations, 2018.
J. Zhu and T. Hastie. Kernel Logistic Regression and the Import Vector Machine. Journal of Computational and Graphical Statistics, 14(1):185–205, 2005.
A Background Material
In this section, we give a more detailed review of the background on reproducing kernel Hilbert space embeddings and U-statistics. Interested readers can refer to Berlinet and Thomas-Agnan [2004], Muandet et al. [2017] for the former, and Serfling [1980, Chapter 5] for the latter.
A.1 Reproducing Kernel Hilbert Space Embeddings
Let H be a vector space of real-valued functions on Y, endowed with the structure of a Hilbert space via an inner product ⟨·, ·⟩_H. Let ‖·‖_H be the associated norm, i.e. ‖f‖²_H = ⟨f, f⟩_H for f ∈ H.

Definition A.1 (Berlinet and Thomas-Agnan [2004, p.7, Definition 1]). A function l : Y × Y → R is a reproducing kernel of the Hilbert space H if and only if
(i) for all y ∈ Y, l(y, ·) ∈ H;
(ii) for all y ∈ Y and for all f ∈ H, ⟨f, l(y, ·)⟩_H = f(y) (the reproducing property).

A Hilbert space of functions Y → R which possesses a reproducing kernel is called a reproducing kernel Hilbert space (RKHS).

For any y ∈ Y, denote by e_y : H → R the evaluation functional at y, i.e. e_y(f) = f(y) for f ∈ H. The Riesz representation theorem can be used to prove the following lemma.

Lemma A.2 (Berlinet and Thomas-Agnan [2004, p.9, Theorem 1]). A Hilbert space of functions Y → R has a reproducing kernel if and only if all evaluation functionals e_y, y ∈ Y, are continuous on H.

Next, we characterise reproducing kernels.
Definition A.3 (Berlinet and Thomas-Agnan [2004, p.10, Definition 2]). A function l : Y × Y → R is called a positive definite function if, for all n ≥ 1, any a_1, ..., a_n ∈ R and any y_1, ..., y_n ∈ Y,

Σ_{i,j=1}^n a_i a_j l(y_i, y_j) ≥ 0.

A reproducing kernel is a positive definite function, since, by the reproducing property,

Σ_{i,j=1}^n a_i a_j l(y_i, y_j) = ‖ Σ_{i=1}^n a_i l(y_i, ·) ‖²_H ≥ 0

(see Berlinet and Thomas-Agnan [2004, p.13, Lemma 2]). The Moore-Aronszajn Theorem [Aronszajn, 1950] shows that the set of positive definite functions and the set of reproducing kernels on Y × Y are identical.
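As a quick numerical illustration of Definition A.3, positive definiteness can be checked empirically on any finite set of points by confirming that the corresponding Gram matrix has no negative eigenvalues (up to floating-point error); a sketch with the Gaussian kernel:

```python
import numpy as np

def gaussian_gram(y, sigma=1.0):
    """Gram matrix [l(y_i, y_j)] for the Gaussian kernel."""
    d = y[:, None] - y[None, :]
    return np.exp(-(d ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
y = rng.normal(size=30)
L = gaussian_gram(y)

# Quadratic form a^T L a = sum_{i,j} a_i a_j l(y_i, y_j) is non-negative for any a,
# equivalently all eigenvalues of L are >= 0 (up to numerical error).
a = rng.normal(size=30)
print(a @ L @ a >= -1e-10)
print(np.linalg.eigvalsh(L).min())
```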
Theorem A.4 (Berlinet and Thomas-Agnan [2004, p.19, Theorem 3]). Let l be a positive definite function on Y × Y. Then there exists a unique Hilbert space of functions Y → R with l as its reproducing kernel. The subspace H̃ of H spanned by {l(y, ·) : y ∈ Y} is dense in H, and H is the set of functions Y → R which are pointwise limits of Cauchy sequences in H̃ with the inner product

⟨f, g⟩_{H̃} = Σ_{i=1}^n Σ_{j=1}^m α_i β_j l(y_i, y_j),

where f = Σ_{i=1}^n α_i l(y_i, ·) and g = Σ_{j=1}^m β_j l(y_j, ·).

Examples of commonly used kernels in Euclidean spaces include the linear kernel l(y, y′) = y · y′, the monomial kernel l(y, y′) = (y · y′)^p, the polynomial kernel l(y, y′) = (y · y′ + 1)^p, the Gaussian kernel l(y, y′) = e^{−σ‖y−y′‖²} and the Laplacian kernel l(y, y′) = e^{−σ‖y−y′‖}.

Kernel methods in machine learning turn linear methods into non-linear ones using the so-called "kernel trick", whereby individual datapoints y ∈ Y are "embedded" into an RKHS H with reproducing kernel l via the mapping y ↦ l(y, ·). The RKHS is high- (and often infinite-)dimensional, and performing a linear method (e.g. linear regression, support vector machine, principal component analysis, etc.) in H with the datapoints l(y_i, ·), i = 1, ..., n, instead of
REPRINT the original space Y with datapoints y i , i = 1 , ..., n , results in a nonlinear method in the original space. Please seeScholkopf and Smola [2001] for more details.Recently, this idea of RKHS embeddings has been extended to embed entire (conditional) distributions, rather thanindividual datapoints, via the expectation. Suppose Y is a random variable taking values in Y , with distribution P Y .Assuming the integrability condition (cid:82) Y (cid:112) l ( y, y ) dP Y ( y ) < ∞ , we define the kernel mean embedding µ P Y ∈ H of themeasure P Y , or the random variable Y , as µ P Y ( · ) = E (cid:2) l ( Y, · ) (cid:3) = (cid:90) Y l ( y, · ) dP Y ( y ) = (cid:90) Ω l ( Y ( ω ) , · ) dP ( ω ) . Note that the integrand l ( Y, · ) is an element in a Hilbert space (and therefore a Banach space), so the integral is notthe usual Lebesgue integral on R . There are a number of ways in which one can define integration on a Banach space[Schwabik and Ye, 2005]. Among those, the Bochner integral [Dinculeanu, 2000, p.15, Definition 35] is the simplestand most intuitive one, and suffices for our purposes. Riesz representation theorem is again used to prove the followingmean embedding version of the reproducing property. Lemma A.5 (Smola et al. [2007]) . For each f ∈ Y , E (cid:2) f ( Y ) (cid:3) = (cid:90) Y f ( y ) dP Y ( y ) = (cid:10) f, µ P Y (cid:11) H . Using the kernel mean embedding, we can define a distance function, called the maximum mean discrepancy [Grettonet al., 2012], between two random variables Y and Y (cid:48) on Y , or equivalently, two probability measures P Y and P Y (cid:48) , asMMD ( Y, Y (cid:48) ) = (cid:13)(cid:13) µ P Y − µ P Y (cid:48) (cid:13)(cid:13) H . The name maximum mean discrepancy comes from the following lemma.
Lemma A.6 (Gretton et al. [2012, Lemma 4]). We have
$$\mathrm{MMD}(Y, Y') = \sup_{f \in \mathcal{H},\, \|f\|_{\mathcal{H}} \leq 1} \Big\{ \mathbb{E}\big[f(Y)\big] - \mathbb{E}\big[f(Y')\big] \Big\}.$$

In this alternative definition of the MMD, the function in the unit ball of $\mathcal{H}$ that maximises $\mathbb{E}[f(Y)] - \mathbb{E}[f(Y')]$ is called the witness function [Gretton et al., 2012, Section 2.3]. It can easily be seen that the witness function is in fact
$$\frac{\mu_{P_Y} - \mu_{P_{Y'}}}{\big\|\mu_{P_Y} - \mu_{P_{Y'}}\big\|_{\mathcal{H}}}.$$
Lloyd and Ghahramani [2015] use the unnormalised witness function $\mu_{P_Y} - \mu_{P_{Y'}}$ for model criticism.

The MMD is not in general a proper metric, since $Y$ and $Y'$ may be distinct and still give $\mathrm{MMD}(Y, Y') = 0$, depending on the kernel $l$ that is used. The notion of characteristic kernels is therefore essential, since it tells us whether the associated RKHS is rich enough to distinguish distinct distributions based on their embeddings.

Definition A.7 (Fukumizu et al. [2008, Section 2.2]). Denote by $\mathcal{P}$ the set of all probability measures on $\mathcal{Y}$. A positive definite kernel $l$ is characteristic if the kernel mean embedding map $\mathcal{P} \to \mathcal{H}: P_Y \mapsto \mu_{P_Y}$ is injective.

For example, of the aforementioned kernels, the Gaussian and Laplacian kernels are characteristic, whereas the linear, monomial and polynomial kernels are not. The MMD associated with a characteristic kernel is then a proper metric between probability measures on $\mathcal{Y}$. See Sriperumbudur et al. [2010, 2011], Simon-Gabriel and Schölkopf [2018] for various characterisations of characteristic kernels.

Now we discuss conditional embedding of distributions into RKHSs. Suppose $X$ is a random variable on a space $\mathcal{X}$.

Definition A.8 (Park and Muandet [2020a, Definition 3.1]). The conditional mean embedding of the random variable $Y$ given $X$, or equivalently, of the conditional distribution $P_{Y|X}$, is the Bochner conditional expectation (as defined in Dinculeanu [2000, p.45, Definition 38])
$$\mu_{P_{Y|X}} = \mathbb{E}\big[l(Y,\cdot) \mid X\big].$$

Notice that this is a straightforward extension of the kernel mean embedding $\mu_{P_Y} = \mathbb{E}[l(Y,\cdot)]$ to the conditional case.
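As an unconditional illustration of the quantities in this subsection, the following sketch (our addition, not part of the original text) estimates the squared MMD and the unnormalised witness function by replacing the embeddings $\mu_{P_Y}$ and $\mu_{P_{Y'}}$ with empirical averages of kernel features; the Gaussian kernel, its bandwidth and the toy data are illustrative assumptions.

```python
import numpy as np

def gram(a, b, sigma=1.0):
    # Gaussian kernel matrix l(a_i, b_j) = exp(-sigma * (a_i - b_j)^2) for scalar data
    d = a[:, None] - b[None, :]
    return np.exp(-sigma * d ** 2)

def mmd_squared(y, y_prime, sigma=1.0):
    # Plug-in estimate of ||mu_{P_Y} - mu_{P_Y'}||_H^2 using empirical mean embeddings
    return (gram(y, y, sigma).mean()
            - 2.0 * gram(y, y_prime, sigma).mean()
            + gram(y_prime, y_prime, sigma).mean())

def witness(t, y, y_prime, sigma=1.0):
    # Unnormalised witness function at points t:
    # (mu_{P_Y} - mu_{P_Y'})(t) ~ mean_i l(y_i, t) - mean_j l(y'_j, t)
    return gram(t, y, sigma).mean(axis=1) - gram(t, y_prime, sigma).mean(axis=1)

rng = np.random.default_rng(1)
y = rng.normal(0.0, 1.0, size=500)          # sample from P_Y
y_prime = rng.normal(0.5, 1.5, size=500)    # sample from P_Y'
print(mmd_squared(y, y_prime))
print(witness(np.linspace(-4, 4, 9), y, y_prime))
```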
A.2 U-Statistics
Suppose $Y_1, Y_2, \dots, Y_r$ are independent copies of the random variable $Y$, i.e. they are independent and all have distribution $P_Y$. Let $h: \mathcal{Y}^r \to \mathbb{R}$ be a symmetric function (called a kernel in the U-statistics literature; it should not be confused with the reproducing kernels used throughout this paper), i.e. for any permutation $\pi$ of $\{1,\dots,r\}$, we have $h(y_1,\dots,y_r) = h(y_{\pi(1)},\dots,y_{\pi(r)})$. Suppose we would like to estimate a quantity of the form
$$\theta(P_Y) = \mathbb{E}\big[h(Y_1,\dots,Y_r)\big] = \int_{\mathcal{Y}}\cdots\int_{\mathcal{Y}} h(y_1,\dots,y_r)\, dP_Y(y_1)\cdots dP_Y(y_r).$$
The corresponding U-statistic for an unbiased estimation of $\theta(P_Y)$ based on a sample $Y_1,\dots,Y_n$ of size $n \geq r$ is given by
$$\hat{\theta}(P_Y) = \frac{1}{\binom{n}{r}} \sum h(Y_{i_1},\dots,Y_{i_r}),$$
where $\binom{n}{r}$ is the binomial coefficient and the summation is over the $\binom{n}{r}$ combinations of $r$ distinct elements $\{i_1,\dots,i_r\}$ from $\{1,\dots,n\}$. Clearly, since the expectation of each summand is $\theta(P_Y)$, we have $\mathbb{E}[\hat{\theta}(P_Y)] = \theta(P_Y)$, so U-statistics are unbiased estimators.

Some examples of $h$ and the corresponding estimators include the sample mean $h(y) = y$, the sample variance $h(y_1,y_2) = \tfrac{1}{2}(y_1 - y_2)^2$, the sample cumulative distribution function up to $y^*$, $h(y) = \mathbb{1}(y \leq y^*)$, the $k$-th sample raw moment $h(y) = y^k$ and Gini's mean difference $h(y_1,y_2) = |y_1 - y_2|$. (A sketch of these estimators is given after this paragraph block.)

To the best of our knowledge, Stute [1991] was the first to consider a conditional counterpart of U-statistics. Let $X_1,\dots,X_r$ be independent copies of the random variable $X$. We are now interested in the estimation of the following quantity:
$$\theta\big(P_{Y|X}\big) = \mathbb{E}\big[h(Y_1,\dots,Y_r) \mid X_1,\dots,X_r\big].$$
By Çınlar [2011, p.146, Theorem 1.17], $\theta(P_{Y|X})$ can be considered as a function $\mathcal{X}^r \to \mathbb{R}$, such that for each $r$-tuple $(x_1,\dots,x_r)$, we have
$$\theta\big(P_{Y|X}\big)(x_1,\dots,x_r) = \mathbb{E}\big[h(Y_1,\dots,Y_r) \mid X_1 = x_1,\dots,X_r = x_r\big].$$
The simplest case is when $r = 1$ and $h(y) = y$. In this case, the estimand reduces to $f(X) = \mathbb{E}[Y \mid X]$, which is the usual regression problem, for which a plethora of methods exist. Suppose we have a sample $\{(X_i, Y_i)\}_{i=1}^{n}$. One such regression method is the Nadaraya–Watson kernel smoother:
$$\hat{f}(x) = \frac{\sum_{i=1}^{n} Y_i\, K\!\left(\frac{x - X_i}{a}\right)}{\sum_{i=1}^{n} K\!\left(\frac{x - X_i}{a}\right)},$$
where $K$ is the so-called "smoothing kernel" and $a$ is the bandwidth. This was extended by Stute [1991] to $r \geq 1$ and more general $h$:
$$\hat{\theta}\big(P_{Y|X}\big)(x_1,\dots,x_r) = \frac{\sum h(Y_{i_1},\dots,Y_{i_r}) \prod_{j=1}^{r} K\!\left(\frac{x_j - X_{i_j}}{a}\right)}{\sum \prod_{j=1}^{r} K\!\left(\frac{x_j - X_{i_j}}{a}\right)},$$
where the sums are over the $\binom{n}{r}$ combinations of $r$ distinct elements $\{i_1,\dots,i_r\}$ from $\{1,\dots,n\}$ as before. Derumigny [2019] considers a parametric model of the form
$$\Lambda\Big(\theta\big(P_{Y|X}\big)(x_1,\dots,x_r)\Big) = \psi(x_1,\dots,x_r)^T \beta^*,$$
where $\Lambda$ is a strictly increasing and continuously differentiable "link function" such that the range of $\Lambda \circ \theta$ is exactly $\mathbb{R}$, $\beta^* \in \mathbb{R}^s$ is the true parameter and $\psi(\cdot) = \big(\psi_1(\cdot),\dots,\psi_s(\cdot)\big)^T \in \mathbb{R}^s$ is some basis, such as polynomials, exponentials, indicator functions, etc. However, the estimation of $\beta^*$ still makes use of the Nadaraya–Watson kernel smoothers considered above.

Of course, Nadaraya–Watson kernel smoothers are far from being the only regression method that can be extended to estimate conditional U-statistics, and in the main body of the paper (Section 5.2), we extend kernel ridge regression for this purpose.
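The sketch below (our addition, not part of the original text) computes the variance U-statistic with $h(y_1,y_2) = \tfrac{1}{2}(y_1-y_2)^2$ and Stute's Nadaraya–Watson-weighted conditional U-statistic for $r = 2$; the smoothing kernel, bandwidth and toy data are illustrative assumptions.

```python
import numpy as np
from itertools import combinations

def h_var(y1, y2):
    # U-statistic kernel for the variance: h(y1, y2) = (y1 - y2)^2 / 2
    return 0.5 * (y1 - y2) ** 2

def u_statistic(y, h):
    # Unbiased estimator: average of h over all pairs of distinct indices
    return np.mean([h(y[i], y[j]) for i, j in combinations(range(len(y)), 2)])

def conditional_u_statistic(x_query, x, y, h, bandwidth=0.2):
    # Stute's estimator at (x_query, x_query): Nadaraya-Watson weights on each pair
    K = lambda u: np.exp(-0.5 * u ** 2)                  # Gaussian smoothing kernel
    num, den = 0.0, 0.0
    for i, j in combinations(range(len(y)), 2):
        w = K((x_query - x[i]) / bandwidth) * K((x_query - x[j]) / bandwidth)
        num += h(y[i], y[j]) * w
        den += w
    return num / den

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, size=300)
y = 3 + 5 * x + (1 + x) * rng.normal(size=300)           # heteroscedastic toy data
print(u_statistic(y, h_var))                              # overall variance estimate
print(conditional_u_statistic(0.2, x, y, h_var))          # conditional variance near x = 0.2
print(conditional_u_statistic(0.8, x, y, h_var))          # larger near x = 0.8
```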
B More Details on IHDP Dataset
In this section, we give more details on the data-generating process of the semi-synthetic IHDP (Infant Health and Development Program) dataset, which was first used in the treatment effect literature by Hill [2011].

The data consist of 25 covariates: birth weight, head circumference, weeks born preterm, birth order, first-born status, neonatal health index, sex, twin status, whether or not the mother smoked during pregnancy, whether or not the mother drank alcohol during pregnancy, whether or not the mother took drugs during pregnancy, the mother's age, marital status, educational attainment, whether or not the mother worked during pregnancy, whether she received prenatal care, and 7 dummy variables for the 8 sites in which the family resided at the start of the intervention.

These covariates are originally taken from a randomised experiment, and included information about the ethnicity of the mothers. Hill [2011] removed all children with nonwhite mothers from the treatment group, which is clearly a non-random (biased) portion of the data, thereby imitating an observational study. This leaves 608 children in the control group and 139 in the treatment group. The overlap condition is now only satisfied for the treatment group.

In creating the parallel linear response surfaces, which are used in all three of the settings "SN", "LN" and "HN", we let $\mathbb{E}[Y_0 \mid X] = \beta X$ and $\mathbb{E}[Y_1 \mid X] = \beta X + 4$, where the 25-dimensional coefficient vector $\beta$ is generated in the same way as in Alaa and Schaar [2018]: for the 6 continuous variables (birth weight, head circumference, weeks born preterm, birth order, neonatal health index, mother's age), each coefficient is sampled from one fixed set of five values with fixed probabilities, whereas for the other 19 binary variables, each coefficient is sampled from another fixed set of five values with its own fixed probabilities.

Finally, we generate realisations of the potential outcomes by adding noise to the mean response surfaces. We let $Y_0 = \beta X + \epsilon(X)$ and $Y_1 = \beta X + 4 + \epsilon(X)$, where $\epsilon(X) = \epsilon_{SN}$ in setting "SN", $\epsilon(X) = \epsilon_{LN}$ in setting "LN" and $\epsilon(X) = X_{\mathrm{sex}}\,\epsilon_{SN} + (1 - X_{\mathrm{sex}})\,\epsilon_{LN}$ in setting "HN", with $\epsilon_{SN} \sim \mathcal{N}(0, \sigma_{SN}^2)$ and $\epsilon_{LN} \sim \mathcal{N}(0, \sigma_{LN}^2)$. The covariate $X_{\mathrm{sex}}$ corresponds to the sex of the child, and was chosen because there are roughly the same number of each sex in both the control and the treatment groups.

C Proofs
Lemma 4.1.
For each $x \in \mathcal{X}$, we have
$$\hat{U}_{\mathrm{MMD}}^2(x) = k_1^T(x)\, W_1 L_1 W_1^T\, k_1(x) - 2\, k_1^T(x)\, W_1 L_{10} W_0^T\, k_0(x) + k_0^T(x)\, W_0 L_0 W_0^T\, k_0(x),$$
where $[L_1]_{1 \leq i,j \leq n_1} = l(y_i^1, y_j^1)$, $[L_{10}]_{1 \leq i \leq n_1,\, 1 \leq j \leq n_0} = l(y_i^1, y_j^0)$ and $[L_0]_{1 \leq i,j \leq n_0} = l(y_i^0, y_j^0)$.

Proof.
We use the reproducing property of $\mathcal{H}$ and (2) to see that, for any $x \in \mathcal{X}$, writing $\mathbf{l}_a = \big(l(y_1^a,\cdot),\dots,l(y_{n_a}^a,\cdot)\big)^T$ for $a \in \{0,1\}$,
$$
\begin{aligned}
\hat{U}_{\mathrm{MMD}}^2(x) &= \big\| \hat{\mu}_{Y_1|X=x} - \hat{\mu}_{Y_0|X=x} \big\|_{\mathcal{H}}^2 = \big\| k_1^T(x) W_1 \mathbf{l}_1 - k_0^T(x) W_0 \mathbf{l}_0 \big\|_{\mathcal{H}}^2 \\
&= \Big\langle \sum_{i,j=1}^{n_1} k_1(x, x_i^1) W_{1,ij}\, l(y_j^1,\cdot),\ \sum_{p,q=1}^{n_1} k_1(x, x_p^1) W_{1,pq}\, l(y_q^1,\cdot) \Big\rangle_{\mathcal{H}} - 2\Big\langle \sum_{i,j=1}^{n_1} k_1(x, x_i^1) W_{1,ij}\, l(y_j^1,\cdot),\ \sum_{p,q=1}^{n_0} k_0(x, x_p^0) W_{0,pq}\, l(y_q^0,\cdot) \Big\rangle_{\mathcal{H}} \\
&\qquad + \Big\langle \sum_{i,j=1}^{n_0} k_0(x, x_i^0) W_{0,ij}\, l(y_j^0,\cdot),\ \sum_{p,q=1}^{n_0} k_0(x, x_p^0) W_{0,pq}\, l(y_q^0,\cdot) \Big\rangle_{\mathcal{H}} \\
&= \sum_{i,j,p,q=1}^{n_1} k_1(x, x_i^1) W_{1,ij}\, l(y_j^1, y_q^1)\, W_{1,qp}^T\, k_1(x_p^1, x) - 2 \sum_{i,j=1}^{n_1} \sum_{p,q=1}^{n_0} k_1(x, x_i^1) W_{1,ij}\, l(y_j^1, y_q^0)\, W_{0,qp}^T\, k_0(x_p^0, x) \\
&\qquad + \sum_{i,j,p,q=1}^{n_0} k_0(x, x_i^0) W_{0,ij}\, l(y_j^0, y_q^0)\, W_{0,qp}^T\, k_0(x_p^0, x) \\
&= k_1^T(x)\, W_1 L_1 W_1^T\, k_1(x) - 2\, k_1^T(x)\, W_1 L_{10} W_0^T\, k_0(x) + k_0^T(x)\, W_0 L_0 W_0^T\, k_0(x). \qquad \square
\end{aligned}
$$
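The closed form in Lemma 4.1 can be computed directly with kernel matrices. The sketch below is our addition, not part of the original text; it assumes the standard kernel-ridge form of the weight matrices, $W_a = (K_a + n_a \lambda_{n_a} I)^{-1}$ with $[K_a]_{ij} = k_a(x_i^a, x_j^a)$, which we take to correspond to (2) in the main text; the Gaussian kernels, regularisation values and toy data are illustrative.

```python
import numpy as np

def gauss(A, B, sigma=1.0):
    # Gaussian kernel matrix between the rows of A and the rows of B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sigma * d2)

def codite_mmd_sq(Xq, X0, Y0, X1, Y1, lam0=1e-2, lam1=1e-2):
    """Plug-in estimate of U_MMD^2(x) (Lemma 4.1) at every row x of Xq."""
    n0, n1 = len(X0), len(X1)
    W0 = np.linalg.inv(gauss(X0, X0) + n0 * lam0 * np.eye(n0))   # assumed kernel-ridge form of W_0
    W1 = np.linalg.inv(gauss(X1, X1) + n1 * lam1 * np.eye(n1))   # assumed kernel-ridge form of W_1
    L0, L1, L10 = gauss(Y0, Y0), gauss(Y1, Y1), gauss(Y1, Y0)    # outcome Gram matrices
    A0, A1 = gauss(Xq, X0) @ W0, gauss(Xq, X1) @ W1              # rows: k_a^T(x) W_a
    return (np.einsum('ij,jk,ik->i', A1, L1, A1)
            - 2.0 * np.einsum('ij,jk,ik->i', A1, L10, A0)
            + np.einsum('ij,jk,ik->i', A0, L0, A0))

rng = np.random.default_rng(3)
X0, X1 = rng.uniform(0, 1, (200, 1)), rng.uniform(0, 1, (150, 1))
Y0 = 3 + 5 * X0 + rng.normal(size=(200, 1))
Y1 = 4 + 5 * X1 + (1 + X1) * rng.normal(size=(150, 1))
print(codite_mmd_sq(np.linspace(0, 1, 5)[:, None], X0, Y0, X1, Y1))
```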
Theorem 4.2. Suppose that $k_0$, $k_1$ and $l$ are bounded, that $\Gamma_0$ and $\Gamma_1$ are universal, and that $\lambda_{n_0}$ and $\lambda_{n_1}$ decay at slower rates than $O(n_0^{-1/2})$ and $O(n_1^{-1/2})$ respectively. Then as $n_0, n_1 \to \infty$,
$$\psi_{\mathrm{MMD}}\big(\hat{U}_{\mathrm{MMD}}\big) = \mathbb{E}\Big[\big|\hat{U}_{\mathrm{MMD}}(X) - U_{\mathrm{MMD}}(X)\big|^2\Big] \xrightarrow{p} 0.$$

Proof.
See that
$$
\begin{aligned}
\psi_{\mathrm{MMD}}\big(\hat{U}_{\mathrm{MMD}}\big) &= \mathbb{E}\Big[\big|\hat{U}_{\mathrm{MMD}}(X) - U_{\mathrm{MMD}}(X)\big|^2\Big] = \mathbb{E}\Big[\big| \big\|\hat{\mu}_{Y_1|X} - \hat{\mu}_{Y_0|X}\big\|_{\mathcal{H}} - \big\|\mu_{Y_1|X} - \mu_{Y_0|X}\big\|_{\mathcal{H}} \big|^2\Big] \\
&\leq \mathbb{E}\Big[\big\|\hat{\mu}_{Y_1|X} - \mu_{Y_1|X} - \hat{\mu}_{Y_0|X} + \mu_{Y_0|X}\big\|_{\mathcal{H}}^2\Big] && \text{by the reverse triangle inequality} \\
&\leq \mathbb{E}\Big[\big(\big\|\hat{\mu}_{Y_1|X} - \mu_{Y_1|X}\big\|_{\mathcal{H}} + \big\|\hat{\mu}_{Y_0|X} - \mu_{Y_0|X}\big\|_{\mathcal{H}}\big)^2\Big] && \text{by the triangle inequality} \\
&= \mathbb{E}\Big[\big\|\hat{\mu}_{Y_1|X} - \mu_{Y_1|X}\big\|_{\mathcal{H}}^2 + \big\|\hat{\mu}_{Y_0|X} - \mu_{Y_0|X}\big\|_{\mathcal{H}}^2\Big] + 2\,\mathbb{E}\Big[\big\|\hat{\mu}_{Y_1|X} - \mu_{Y_1|X}\big\|_{\mathcal{H}} \big\|\hat{\mu}_{Y_0|X} - \mu_{Y_0|X}\big\|_{\mathcal{H}}\Big] \\
&\leq \mathbb{E}\Big[\big\|\hat{\mu}_{Y_1|X} - \mu_{Y_1|X}\big\|_{\mathcal{H}}^2\Big] + \mathbb{E}\Big[\big\|\hat{\mu}_{Y_0|X} - \mu_{Y_0|X}\big\|_{\mathcal{H}}^2\Big] + 2\sqrt{\mathbb{E}\Big[\big\|\hat{\mu}_{Y_1|X} - \mu_{Y_1|X}\big\|_{\mathcal{H}}^2\Big] \mathbb{E}\Big[\big\|\hat{\mu}_{Y_0|X} - \mu_{Y_0|X}\big\|_{\mathcal{H}}^2\Big]}
\end{aligned}
$$
by the Cauchy–Schwarz inequality. Hence, it suffices to know that
$$\mathbb{E}\Big[\big\|\hat{\mu}_{Y_1|X} - \mu_{Y_1|X}\big\|_{\mathcal{H}}^2\Big] \xrightarrow{p} 0 \quad\text{and}\quad \mathbb{E}\Big[\big\|\hat{\mu}_{Y_0|X} - \mu_{Y_0|X}\big\|_{\mathcal{H}}^2\Big] \xrightarrow{p} 0.$$
But this follows immediately from [Park and Muandet, 2020b, Lemma 2.1, Theorem 2.3], so the proof is complete. $\square$
Lemma. Suppose $l$ is a characteristic kernel. Then $P_{Y_0|X} \equiv P_{Y_1|X}$ if and only if $t = 0$.

Proof.
We can assume without loss of generality that $P_{Y_0|X}$ and $P_{Y_1|X}$ are obtained from regular versions of the conditional distributions. Then by [Park and Muandet, 2020a, Theorem 2.9], there exist $C_0, C_1 \in \mathcal{F}$ with $P(C_0) = P(C_1) = 1$ such that for all $\omega \in C_0$, $\mu_{Y_0|X}(\omega) = \int_{\mathcal{Y}} l(y,\cdot)\, dP_{Y_0|X}(\omega)(y)$, and for all $\omega' \in C_1$, $\mu_{Y_1|X}(\omega') = \int_{\mathcal{Y}} l(y,\cdot)\, dP_{Y_1|X}(\omega')(y)$.

Suppose for contradiction that there exists some measurable $A \subseteq \mathcal{X}$ with $P_X(A) > 0$ such that for all $x \in A$, $\mu_{Y_0|X=x} \neq \int_{\mathcal{Y}} l(y,\cdot)\, dP_{Y_0|X=x}(y)$. Then $P(X^{-1}(A)) = P_X(A) > 0$, and hence $P(X^{-1}(A) \cap C_0) > 0$. For all $\omega \in X^{-1}(A) \cap C_0$, we have $X(\omega) \in A$, and hence
$$\mu_{Y_0|X}(\omega) \neq \int_{\mathcal{Y}} l(y,\cdot)\, dP_{Y_0|X=X(\omega)}(y) = \int_{\mathcal{Y}} l(y,\cdot)\, P_{Y_0|X}(\omega)(dy) = \mu_{Y_0|X}(\omega).$$
This is a contradiction, hence there does not exist a measurable $A \subseteq \mathcal{X}$ with $P_X(A) > 0$ such that for all $x \in A$, $\mu_{Y_0|X=x} \neq \int_{\mathcal{Y}} l(y,\cdot)\, dP_{Y_0|X=x}(y)$. Therefore, there must exist some measurable $A_0 \subseteq \mathcal{X}$ with $P_X(A_0) = 1$ such that for all $x \in A_0$, $\mu_{Y_0|X=x} = \int_{\mathcal{Y}} l(y,\cdot)\, dP_{Y_0|X=x}(y)$. Similarly, there must exist some measurable $A_1 \subseteq \mathcal{X}$ with $P_X(A_1) = 1$ such that for all $x \in A_1$, $\mu_{Y_1|X=x} = \int_{\mathcal{Y}} l(y,\cdot)\, dP_{Y_1|X=x}(y)$.

($\Rightarrow$) Suppose that $P_{Y_0|X} \equiv P_{Y_1|X}$. This means that there exists a measurable $A_2 \subseteq \mathcal{X}$ with $P_X(A_2) = 1$ such that for all $x \in A_2$, the measures $P_{Y_0|X=x}(\cdot)$ and $P_{Y_1|X=x}(\cdot)$ are the same. Then for all $x \in A_0 \cap A_1 \cap A_2$,
$$\mu_{Y_0|X=x} = \int_{\mathcal{Y}} l(y,\cdot)\, dP_{Y_0|X=x}(y) \ \ \text{since } x \in A_0, \qquad = \int_{\mathcal{Y}} l(y,\cdot)\, dP_{Y_1|X=x}(y) \ \ \text{since } x \in A_2, \qquad = \mu_{Y_1|X=x} \ \ \text{since } x \in A_1.$$
Now, we have $P_X(A_0) = P_X(A_1) = P_X(A_2) = 1$, so $P_X(A_0 \cap A_1 \cap A_2) = 1$. Since $\mu_{Y_0|X=x} = \mu_{Y_1|X=x}$ for all $x \in A_0 \cap A_1 \cap A_2$, we have $\mu_{Y_0|X=\cdot} = \mu_{Y_1|X=\cdot}$ $P_X$-almost everywhere. Hence,
$$t = \mathbb{E}\Big[\big\|\mu_{Y_1|X} - \mu_{Y_0|X}\big\|_{\mathcal{H}}^2\Big] = 0.$$

($\Leftarrow$) Now suppose that $t = 0$, i.e. $\mu_{Y_0|X=\cdot} = \mu_{Y_1|X=\cdot}$ $P_X$-almost everywhere, say on a measurable set $A_3 \subseteq \mathcal{X}$ with $P_X(A_3) = 1$. Suppose $x \in A_0 \cap A_1 \cap A_3$. Then
$$\int_{\mathcal{Y}} l(y,\cdot)\, dP_{Y_0|X=x}(y) = \mu_{Y_0|X=x} \ \ \text{since } x \in A_0, \qquad = \mu_{Y_1|X=x} \ \ \text{since } x \in A_3, \qquad = \int_{\mathcal{Y}} l(y,\cdot)\, dP_{Y_1|X=x}(y) \ \ \text{since } x \in A_1.$$
Since $l$ is characteristic, this means that $P_{Y_0|X=x}$ and $P_{Y_1|X=x}$ are the same measure. As before, we have $P_X(A_0 \cap A_1 \cap A_3) = 1$, hence $P_{Y_0|X} \equiv P_{Y_1|X}$. $\square$
Lemma 4.4. We have
$$\hat{t} = \frac{1}{n}\mathrm{Tr}\Big(\tilde{K}_1 W_1 L_1 W_1^T \tilde{K}_1^T\Big) - \frac{2}{n}\mathrm{Tr}\Big(\tilde{K}_1 W_1 L_{10} W_0^T \tilde{K}_0^T\Big) + \frac{1}{n}\mathrm{Tr}\Big(\tilde{K}_0 W_0 L_0 W_0^T \tilde{K}_0^T\Big),$$
where $L_0$, $L_{10}$ and $L_1$ are as defined in Lemma 4.1, $[\tilde{K}_1]_{1 \leq i \leq n,\, 1 \leq j \leq n_1} = k_1(x_i, x_j^1)$ and $[\tilde{K}_0]_{1 \leq i \leq n,\, 1 \leq j \leq n_0} = k_0(x_i, x_j^0)$.

Proof.
See that, using the reproducing property in $\mathcal{H}$ again,
$$
\begin{aligned}
\hat{t} &= \frac{1}{n}\sum_{i=1}^{n} \big\| \hat{\mu}_{Y_1|X=x_i} - \hat{\mu}_{Y_0|X=x_i} \big\|_{\mathcal{H}}^2 \\
&= \frac{1}{n}\sum_{i=1}^{n} \Big\{ \big\|\hat{\mu}_{Y_1|X=x_i}\big\|_{\mathcal{H}}^2 - 2\big\langle \hat{\mu}_{Y_1|X=x_i}, \hat{\mu}_{Y_0|X=x_i} \big\rangle_{\mathcal{H}} + \big\|\hat{\mu}_{Y_0|X=x_i}\big\|_{\mathcal{H}}^2 \Big\} \\
&= \frac{1}{n}\sum_{i=1}^{n} \Big\{ \big\| k_1^T(x_i) W_1 \mathbf{l}_1 \big\|_{\mathcal{H}}^2 - 2\big\langle k_1^T(x_i) W_1 \mathbf{l}_1, k_0^T(x_i) W_0 \mathbf{l}_0 \big\rangle_{\mathcal{H}} + \big\| k_0^T(x_i) W_0 \mathbf{l}_0 \big\|_{\mathcal{H}}^2 \Big\} \\
&= \frac{1}{n}\sum_{i=1}^{n} \Big\langle \sum_{p,q=1}^{n_1} k_1(x_p^1, x_i) W_{1,pq}\, l(y_q^1,\cdot), \sum_{r,s=1}^{n_1} k_1(x_r^1, x_i) W_{1,rs}\, l(y_s^1,\cdot) \Big\rangle_{\mathcal{H}} - \frac{2}{n}\sum_{i=1}^{n} \Big\langle \sum_{p,q=1}^{n_1} k_1(x_p^1, x_i) W_{1,pq}\, l(y_q^1,\cdot), \sum_{r,s=1}^{n_0} k_0(x_r^0, x_i) W_{0,rs}\, l(y_s^0,\cdot) \Big\rangle_{\mathcal{H}} \\
&\qquad + \frac{1}{n}\sum_{i=1}^{n} \Big\langle \sum_{p,q=1}^{n_0} k_0(x_p^0, x_i) W_{0,pq}\, l(y_q^0,\cdot), \sum_{r,s=1}^{n_0} k_0(x_r^0, x_i) W_{0,rs}\, l(y_s^0,\cdot) \Big\rangle_{\mathcal{H}} \\
&= \frac{1}{n}\sum_{i=1}^{n} \sum_{p,q,r,s=1}^{n_1} k_1(x_i, x_p^1) W_{1,pq}\, l(y_q^1, y_s^1)\, W_{1,sr}^T\, k_1(x_r^1, x_i) - \frac{2}{n}\sum_{i=1}^{n} \sum_{p,q=1}^{n_1}\sum_{r,s=1}^{n_0} k_1(x_i, x_p^1) W_{1,pq}\, l(y_q^1, y_s^0)\, W_{0,sr}^T\, k_0(x_r^0, x_i) \\
&\qquad + \frac{1}{n}\sum_{i=1}^{n} \sum_{p,q,r,s=1}^{n_0} k_0(x_i, x_p^0) W_{0,pq}\, l(y_q^0, y_s^0)\, W_{0,sr}^T\, k_0(x_r^0, x_i) \\
&= \frac{1}{n}\Big\{ \mathrm{Tr}\Big(\tilde{K}_1 W_1 L_1 W_1^T \tilde{K}_1^T\Big) - 2\,\mathrm{Tr}\Big(\tilde{K}_1 W_1 L_{10} W_0^T \tilde{K}_0^T\Big) + \mathrm{Tr}\Big(\tilde{K}_0 W_0 L_0 W_0^T \tilde{K}_0^T\Big) \Big\}. \qquad \square
\end{aligned}
$$
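The trace expression of Lemma 4.4 admits an equally direct implementation. The sketch below is our addition, not part of the original text; the evaluation points are taken to be the pooled covariates of both groups, the weight matrices again use the assumed kernel-ridge form $W_a = (K_a + n_a \lambda_{n_a} I)^{-1}$, and kernels, regularisation values and toy data are illustrative.

```python
import numpy as np

def gauss(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sigma * d2)

def t_hat(Xq, X0, Y0, X1, Y1, lam0=1e-2, lam1=1e-2):
    """Test statistic of Lemma 4.4: squared conditional MMD averaged over the rows of Xq."""
    n0, n1 = len(X0), len(X1)
    W0 = np.linalg.inv(gauss(X0, X0) + n0 * lam0 * np.eye(n0))   # assumed kernel-ridge form of W_0
    W1 = np.linalg.inv(gauss(X1, X1) + n1 * lam1 * np.eye(n1))   # assumed kernel-ridge form of W_1
    L0, L1, L10 = gauss(Y0, Y0), gauss(Y1, Y1), gauss(Y1, Y0)
    Kt0, Kt1 = gauss(Xq, X0), gauss(Xq, X1)                      # \tilde K_0 and \tilde K_1
    return (np.trace(Kt1 @ W1 @ L1 @ W1.T @ Kt1.T)
            - 2.0 * np.trace(Kt1 @ W1 @ L10 @ W0.T @ Kt0.T)
            + np.trace(Kt0 @ W0 @ L0 @ W0.T @ Kt0.T)) / len(Xq)

rng = np.random.default_rng(5)
X0, X1 = rng.uniform(0, 1, (200, 1)), rng.uniform(0, 1, (150, 1))
Y0 = 3 + 5 * X0 + rng.normal(size=(200, 1))
Y1 = 3 + 5 * X1 + rng.normal(size=(150, 1))   # same conditional law, so t_hat should be near 0
print(t_hat(np.vstack([X0, X1]), X0, Y0, X1, Y1))
```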
Theorem 4.5. Under the same assumptions as in Theorem 4.2, we have $\hat{t} \xrightarrow{p} t$ as $n_0, n_1 \to \infty$.

Proof.
We decompose $|\hat{t} - t|$ as follows, using the triangle inequality:
$$
\begin{aligned}
\big|\hat{t} - t\big| &= \bigg| \frac{1}{n}\sum_{i=1}^{n} \big\|\hat{\mu}_{Y_1|X=x_i} - \hat{\mu}_{Y_0|X=x_i}\big\|_{\mathcal{H}}^2 - \mathbb{E}\Big[\big\|\mu_{Y_1|X} - \mu_{Y_0|X}\big\|_{\mathcal{H}}^2\Big] \bigg| \\
&\leq \bigg| \frac{1}{n}\sum_{i=1}^{n} \big\|\hat{\mu}_{Y_1|X=x_i} - \hat{\mu}_{Y_0|X=x_i}\big\|_{\mathcal{H}}^2 - \mathbb{E}\Big[\big\|\hat{\mu}_{Y_1|X} - \hat{\mu}_{Y_0|X}\big\|_{\mathcal{H}}^2\Big] \bigg| + \bigg| \mathbb{E}\Big[\big\|\hat{\mu}_{Y_1|X} - \hat{\mu}_{Y_0|X}\big\|_{\mathcal{H}}^2\Big] - \mathbb{E}\Big[\big\|\mu_{Y_1|X} - \mu_{Y_0|X}\big\|_{\mathcal{H}}^2\Big] \bigg|.
\end{aligned}
$$
We show that the two terms converge in probability to 0.

For the first term, take any $\epsilon, \delta > 0$ and define the real-valued random variables $\zeta$ and $\zeta_i$, $i = 1,\dots,n$, by
$$\zeta := \big\|\hat{\mu}_{Y_1|X} - \hat{\mu}_{Y_0|X}\big\|_{\mathcal{H}}^2, \qquad \zeta_i := \big\|\hat{\mu}_{Y_1|X_i} - \hat{\mu}_{Y_0|X_i}\big\|_{\mathcal{H}}^2.$$
Here, we have $\|\hat{\mu}_{Y_1|X=\cdot}\|_{\mathcal{G}_1} \leq \sqrt{B/\lambda_{n_1}}$ and $\|\hat{\mu}_{Y_0|X=\cdot}\|_{\mathcal{G}_0} \leq \sqrt{B/\lambda_{n_0}}$, since otherwise $0 \in \mathcal{G}_1$ and $0 \in \mathcal{G}_0$ would have smaller values of the regularised least-squares objectives. Using this and the triangle inequality, we have the following almost sure bound:
$$|\zeta| = \big\|\hat{\mu}_{Y_1|X} - \hat{\mu}_{Y_0|X}\big\|_{\mathcal{H}}^2 \leq \Big(\big\|\hat{\mu}_{Y_1|X}\big\|_{\mathcal{H}} + \big\|\hat{\mu}_{Y_0|X}\big\|_{\mathcal{H}}\Big)^2 \leq B\Big(\big\|\hat{\mu}_{Y_1|X=\cdot}\big\|_{\mathcal{G}_1} + \big\|\hat{\mu}_{Y_0|X=\cdot}\big\|_{\mathcal{G}_0}\Big)^2 \leq \bigg(\frac{B}{\sqrt{\lambda_{n_1}}} + \frac{B}{\sqrt{\lambda_{n_0}}}\bigg)^2,$$
and subsequently, by the triangle inequality again,
$$\big|\zeta - \mathbb{E}[\zeta]\big| \leq 2\bigg(\frac{B}{\sqrt{\lambda_{n_1}}} + \frac{B}{\sqrt{\lambda_{n_0}}}\bigg)^2 \quad \text{almost surely}.$$
Define $\sigma^2 = \mathrm{Var}(\zeta)$. We have an obvious bound for $\sigma^2$:
$$\sigma^2 \leq \bigg(\frac{B}{\sqrt{\lambda_{n_1}}} + \frac{B}{\sqrt{\lambda_{n_0}}}\bigg)^4.$$
Then by Bernstein's inequality [Cucker and Smale, 2002, p.7, Proposition 2], we have the following bound:
$$P\Bigg(\bigg|\frac{1}{n}\sum_{i=1}^{n}\zeta_i - \mathbb{E}[\zeta]\bigg| > \epsilon\Bigg) = P\Bigg(\bigg|\frac{1}{n}\sum_{i=1}^{n} \big\|\hat{\mu}_{Y_1|X_i} - \hat{\mu}_{Y_0|X_i}\big\|_{\mathcal{H}}^2 - \mathbb{E}\Big[\big\|\hat{\mu}_{Y_1|X} - \hat{\mu}_{Y_0|X}\big\|_{\mathcal{H}}^2\Big]\bigg| > \epsilon\Bigg) \leq 2\exp\Bigg(-\frac{n\epsilon^2}{2\Big(\big(\frac{B}{\sqrt{\lambda_{n_1}}} + \frac{B}{\sqrt{\lambda_{n_0}}}\big)^4 + \frac{2}{3}\big(\frac{B}{\sqrt{\lambda_{n_1}}} + \frac{B}{\sqrt{\lambda_{n_0}}}\big)^2 \epsilon\Big)}\Bigg).$$
Now, since $\lambda_{n_0}$ and $\lambda_{n_1}$ converge to 0 at slower rates than $O(n_0^{-1/2})$ and $O(n_1^{-1/2})$, we can ensure that $N$ is large enough such that for all $n_0, n_1 \geq N$, the right-hand side is at most $\delta$. This means that we have
$$\bigg|\frac{1}{n}\sum_{i=1}^{n} \big\|\hat{\mu}_{Y_1|X_i} - \hat{\mu}_{Y_0|X_i}\big\|_{\mathcal{H}}^2 - \mathbb{E}\Big[\big\|\hat{\mu}_{Y_1|X} - \hat{\mu}_{Y_0|X}\big\|_{\mathcal{H}}^2\Big]\bigg| \xrightarrow{p} 0.$$
For the second term, see that
$$
\begin{aligned}
&\bigg| \mathbb{E}\Big[\big\|\hat{\mu}_{Y_1|X} - \hat{\mu}_{Y_0|X}\big\|_{\mathcal{H}}^2\Big] - \mathbb{E}\Big[\big\|\mu_{Y_1|X} - \mu_{Y_0|X}\big\|_{\mathcal{H}}^2\Big] \bigg| \\
&= \bigg| \mathbb{E}\Big[\Big(\big\|\hat{\mu}_{Y_1|X} - \hat{\mu}_{Y_0|X}\big\|_{\mathcal{H}} - \big\|\mu_{Y_1|X} - \mu_{Y_0|X}\big\|_{\mathcal{H}}\Big)\Big(\big\|\hat{\mu}_{Y_1|X} - \hat{\mu}_{Y_0|X}\big\|_{\mathcal{H}} + \big\|\mu_{Y_1|X} - \mu_{Y_0|X}\big\|_{\mathcal{H}}\Big)\Big] \bigg| \\
&\leq \mathbb{E}\Big[\Big|\big\|\hat{\mu}_{Y_1|X} - \hat{\mu}_{Y_0|X}\big\|_{\mathcal{H}} - \big\|\mu_{Y_1|X} - \mu_{Y_0|X}\big\|_{\mathcal{H}}\Big| \Big(\big\|\hat{\mu}_{Y_1|X} - \hat{\mu}_{Y_0|X}\big\|_{\mathcal{H}} + \big\|\mu_{Y_1|X} - \mu_{Y_0|X}\big\|_{\mathcal{H}}\Big)\Big] \\
&\leq \mathbb{E}\Big[\big\|\hat{\mu}_{Y_1|X} - \mu_{Y_1|X} - \hat{\mu}_{Y_0|X} + \mu_{Y_0|X}\big\|_{\mathcal{H}} \Big(\big\|\hat{\mu}_{Y_1|X} - \hat{\mu}_{Y_0|X}\big\|_{\mathcal{H}} + \big\|\mu_{Y_1|X} - \mu_{Y_0|X}\big\|_{\mathcal{H}}\Big)\Big] \\
&\leq \mathbb{E}\Big[\Big(\big\|\hat{\mu}_{Y_1|X} - \mu_{Y_1|X}\big\|_{\mathcal{H}} + \big\|\hat{\mu}_{Y_0|X} - \mu_{Y_0|X}\big\|_{\mathcal{H}}\Big) \Big(\big\|\hat{\mu}_{Y_1|X} - \hat{\mu}_{Y_0|X}\big\|_{\mathcal{H}} + \big\|\mu_{Y_1|X} - \mu_{Y_0|X}\big\|_{\mathcal{H}}\Big)\Big] \\
&\leq \sqrt{\mathbb{E}\Big[\Big(\big\|\hat{\mu}_{Y_1|X} - \mu_{Y_1|X}\big\|_{\mathcal{H}} + \big\|\hat{\mu}_{Y_0|X} - \mu_{Y_0|X}\big\|_{\mathcal{H}}\Big)^2\Big] \mathbb{E}\Big[\Big(\big\|\hat{\mu}_{Y_1|X} - \hat{\mu}_{Y_0|X}\big\|_{\mathcal{H}} + \big\|\mu_{Y_1|X} - \mu_{Y_0|X}\big\|_{\mathcal{H}}\Big)^2\Big]}
\end{aligned}
$$
by the Cauchy–Schwarz inequality. Here, we look at the first and second factors separately.

The first factor converges to 0 in probability:
$$
\begin{aligned}
\mathbb{E}\Big[\Big(\big\|\hat{\mu}_{Y_1|X} - \mu_{Y_1|X}\big\|_{\mathcal{H}} + \big\|\hat{\mu}_{Y_0|X} - \mu_{Y_0|X}\big\|_{\mathcal{H}}\Big)^2\Big] &= \mathbb{E}\Big[\big\|\hat{\mu}_{Y_1|X} - \mu_{Y_1|X}\big\|_{\mathcal{H}}^2 + \big\|\hat{\mu}_{Y_0|X} - \mu_{Y_0|X}\big\|_{\mathcal{H}}^2 + 2\big\|\hat{\mu}_{Y_1|X} - \mu_{Y_1|X}\big\|_{\mathcal{H}}\big\|\hat{\mu}_{Y_0|X} - \mu_{Y_0|X}\big\|_{\mathcal{H}}\Big] \\
&\leq \mathbb{E}\Big[\big\|\hat{\mu}_{Y_1|X} - \mu_{Y_1|X}\big\|_{\mathcal{H}}^2 + \big\|\hat{\mu}_{Y_0|X} - \mu_{Y_0|X}\big\|_{\mathcal{H}}^2\Big] + 2\sqrt{\mathbb{E}\Big[\big\|\hat{\mu}_{Y_1|X} - \mu_{Y_1|X}\big\|_{\mathcal{H}}^2\Big]\mathbb{E}\Big[\big\|\hat{\mu}_{Y_0|X} - \mu_{Y_0|X}\big\|_{\mathcal{H}}^2\Big]};
\end{aligned}
$$
here, as in the proof of Theorem 4.2, we have $\mathbb{E}\big[\|\hat{\mu}_{Y_1|X} - \mu_{Y_1|X}\|_{\mathcal{H}}^2\big] \xrightarrow{p} 0$ and $\mathbb{E}\big[\|\hat{\mu}_{Y_0|X} - \mu_{Y_0|X}\|_{\mathcal{H}}^2\big] \xrightarrow{p} 0$. The second factor is clearly bounded, so we have that the whole expression converges in probability to 0. $\square$
Theorem 5.1.
The solution $\hat{F}$ to the problem in (4) is
$$\hat{F}(x_1,\dots,x_r) = \sum_{i_1,\dots,i_r=1}^{n} k(x_{i_1}, x_1)\cdots k(x_{i_r}, x_r)\, c_{i_1,\dots,i_r},$$
where the coefficients $c_{i_1,\dots,i_r} \in \mathbb{R}$ are the unique solution of the $n^r$ linear equations
$$\sum_{j_1,\dots,j_r=1}^{n} \bigg( k(x_{i_1}, x_{j_1})\cdots k(x_{i_r}, x_{j_r}) + \binom{n}{r}\lambda_n\, \delta_{i_1 j_1}\cdots\delta_{i_r j_r} \bigg) c_{j_1,\dots,j_r} = h(y_{i_1},\dots,y_{i_r}).$$

Proof.
Recall from (4) that
$$\hat{F} = \underset{F \in \mathcal{H}^r}{\arg\min}\Bigg\{ \frac{1}{\binom{n}{r}} \sum \Big( F(x_{i_1},\dots,x_{i_r}) - h(y_{i_1},\dots,y_{i_r}) \Big)^2 + \lambda_n \|F\|_{\mathcal{H}^r}^2 \Bigg\},$$
where the summation is over the $\binom{n}{r}$ combinations of $r$ distinct elements $\{i_1,\dots,i_r\}$ from $\{1,\dots,n\}$. Write
$$\hat{F}'(x_1,\dots,x_r) = \sum_{i_1,\dots,i_r=1}^{n} k(x_{i_1}, x_1)\cdots k(x_{i_r}, x_r)\, c_{i_1,\dots,i_r},$$
where the coefficients $c_{i_1,\dots,i_r} \in \mathbb{R}$ are the unique solution of the $n^r$ linear equations
$$\sum_{j_1,\dots,j_r=1}^{n} \bigg( k(x_{i_1}, x_{j_1})\cdots k(x_{i_r}, x_{j_r}) + \binom{n}{r}\lambda_n\, \delta_{i_1 j_1}\cdots\delta_{i_r j_r} \bigg) c_{j_1,\dots,j_r} = h(y_{i_1},\dots,y_{i_r}).$$
Also, for any $F \in \mathcal{H}^r$, write $\hat{\mathcal{E}}_{\mathrm{reg}}(F)$ for the empirical regularised least-squares risk of $F$:
$$\hat{\mathcal{E}}_{\mathrm{reg}}(F) = \frac{1}{\binom{n}{r}} \sum \Big( F(x_{i_1},\dots,x_{i_r}) - h(y_{i_1},\dots,y_{i_r}) \Big)^2 + \lambda_n \|F\|_{\mathcal{H}^r}^2,$$
so that $\hat{F} = \arg\min_{F \in \mathcal{H}^r} \hat{\mathcal{E}}_{\mathrm{reg}}(F)$. We will show that $\hat{F}' = \hat{F}$. For any $F \in \mathcal{H}^r$, write $G = F - \hat{F}'$. Then
$$
\begin{aligned}
\hat{\mathcal{E}}_{\mathrm{reg}}(F) &= \frac{1}{\binom{n}{r}} \sum \Big( F(x_{i_1},\dots,x_{i_r}) - h(y_{i_1},\dots,y_{i_r}) \Big)^2 + \lambda_n \|F\|_{\mathcal{H}^r}^2 \\
&= \frac{1}{\binom{n}{r}} \sum \Big( F(x_{i_1},\dots,x_{i_r}) - \hat{F}'(x_{i_1},\dots,x_{i_r}) + \hat{F}'(x_{i_1},\dots,x_{i_r}) - h(y_{i_1},\dots,y_{i_r}) \Big)^2 + \lambda_n \|F\|_{\mathcal{H}^r}^2 \\
&= \hat{\mathcal{E}}_{\mathrm{reg}}\big(\hat{F}'\big) + \frac{1}{\binom{n}{r}} \sum G(x_{i_1},\dots,x_{i_r})^2 + \frac{2}{\binom{n}{r}} \sum G(x_{i_1},\dots,x_{i_r}) \Big( \hat{F}'(x_{i_1},\dots,x_{i_r}) - h(y_{i_1},\dots,y_{i_r}) \Big) + \lambda_n \|G\|_{\mathcal{H}^r}^2 + 2\lambda_n \big\langle G, \hat{F}' \big\rangle_{\mathcal{H}^r} \\
&\geq \hat{\mathcal{E}}_{\mathrm{reg}}\big(\hat{F}'\big) - \frac{2}{\binom{n}{r}} \sum G(x_{i_1},\dots,x_{i_r}) \Big( h(y_{i_1},\dots,y_{i_r}) - \hat{F}'(x_{i_1},\dots,x_{i_r}) \Big) + 2\lambda_n \big\langle G, \hat{F}' \big\rangle_{\mathcal{H}^r} \\
&= \hat{\mathcal{E}}_{\mathrm{reg}}\big(\hat{F}'\big) - 2\lambda_n \sum G(x_{i_1},\dots,x_{i_r})\, c_{i_1,\dots,i_r} + 2\lambda_n \sum_{i_1,\dots,i_r=1}^{n} G(x_{i_1},\dots,x_{i_r})\, c_{i_1,\dots,i_r} && \text{by the reproducing property and the definition of } c_{i_1,\dots,i_r} \\
&= \hat{\mathcal{E}}_{\mathrm{reg}}\big(\hat{F}'\big).
\end{aligned}
$$
Hence, $\hat{F}'$ minimises $\hat{\mathcal{E}}_{\mathrm{reg}}$ in $\mathcal{H}^r$, and so $\hat{F}' = \hat{F}$ as required. $\square$
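For $r = 2$ (e.g. the conditional-variance kernel $h(y_1,y_2) = \tfrac{1}{2}(y_1-y_2)^2$), the $n^2$ linear equations of Theorem 5.1 can be written with a Kronecker product and solved directly for small $n$. The sketch below is our addition, not part of the original text; the Gaussian kernel, the regularisation value and the toy data are illustrative assumptions.

```python
import numpy as np
from math import comb

def gauss(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sigma * d2)

def fit_u_regression_r2(X, y, lam=1e-2):
    """Coefficients c_{i1,i2} of Theorem 5.1 for r = 2 and h(y1, y2) = (y1 - y2)^2 / 2."""
    n = len(X)
    K = gauss(X, X)                                   # [K]_{ij} = k(x_i, x_j)
    H = 0.5 * (y[:, None] - y[None, :]) ** 2          # [H]_{i1,i2} = h(y_{i1}, y_{i2})
    # (K kron K + C(n,2) * lam * I) vec(C) = vec(H), row-major vectorisation
    A = np.kron(K, K) + comb(n, 2) * lam * np.eye(n * n)
    return np.linalg.solve(A, H.reshape(-1)).reshape(n, n)

def predict_cond_variance(x_query, X, C):
    # F_hat(x, x) = sum_{i1,i2} k(x_{i1}, x) k(x_{i2}, x) c_{i1,i2}: conditional variance at x
    kq = gauss(x_query, X)
    return np.einsum('qi,ij,qj->q', kq, C, kq)

rng = np.random.default_rng(4)
X = rng.uniform(0, 1, (40, 1))
y = 3 + 5 * X[:, 0] + (1 + X[:, 0]) * rng.normal(size=40)
C = fit_u_regression_r2(X, y)
print(predict_cond_variance(np.array([[0.2], [0.8]]), X, C))   # variance grows with x
```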
Theorem 5.2. Suppose that $k^r$ is a bounded and universal kernel, that $h(Y_1,\dots,Y_r)$ is almost surely bounded, and that $\lambda_n$ decays at a slower rate than $O(n^{-1/2})$. Then as $n \to \infty$,
$$\mathbb{E}\Big[\big(\hat{F}(X_1,\dots,X_r) - F(X_1,\dots,X_r)\big)^2\Big] \xrightarrow{p} 0.$$

Proof.
Recall that we have the population-version risks
$$\mathcal{E}(\hat{F}) = \mathbb{E}\Big[\big(\hat{F}(X_1,\dots,X_r) - h(Y_1,\dots,Y_r)\big)^2\Big], \qquad \mathcal{E}(F) = \mathbb{E}\Big[\big(F(X_1,\dots,X_r) - h(Y_1,\dots,Y_r)\big)^2\Big].$$
See that
$$
\begin{aligned}
&\mathbb{E}\Big[\big(\hat{F}(X_1,\dots,X_r) - F(X_1,\dots,X_r)\big)^2\Big] - \Big(\mathcal{E}(\hat{F}) - \mathcal{E}(F)\Big) \\
&= \mathbb{E}\Big[\big(\hat{F}(X_1,\dots,X_r) - F(X_1,\dots,X_r)\big)^2\Big] - \mathbb{E}\Big[\big(\hat{F}(X_1,\dots,X_r) - F(X_1,\dots,X_r) + F(X_1,\dots,X_r) - h(Y_1,\dots,Y_r)\big)^2\Big] + \mathbb{E}\Big[\big(F(X_1,\dots,X_r) - h(Y_1,\dots,Y_r)\big)^2\Big] \\
&= -2\,\mathbb{E}\Big[\big(\hat{F}(X_1,\dots,X_r) - F(X_1,\dots,X_r)\big)\big(F(X_1,\dots,X_r) - h(Y_1,\dots,Y_r)\big)\Big] \\
&= -2\,\mathbb{E}\Big[\big(\hat{F}(X_1,\dots,X_r) - F(X_1,\dots,X_r)\big)\big(F(X_1,\dots,X_r) - \mathbb{E}\big[h(Y_1,\dots,Y_r) \mid X_1,\dots,X_r\big]\big)\Big] = 0,
\end{aligned}
$$
so if $\mathcal{E}(\hat{F}) - \mathcal{E}(F) \xrightarrow{p} 0$, then we have $\mathbb{E}\big[\big(\hat{F}(X_1,\dots,X_r) - F(X_1,\dots,X_r)\big)^2\big] \xrightarrow{p} 0$.

Now we prove that $\mathcal{E}(\hat{F}) - \mathcal{E}(F) \xrightarrow{p} 0$. Firstly, see that, by the boundedness of $k^r$, there exists some $B > 0$ such that $k(x_1,x_1)\cdots k(x_r,x_r) \leq B$ for any $r$-tuple $(x_1,\dots,x_r) \in \mathcal{X}\times\cdots\times\mathcal{X}$. This allows us to bound any fixed $F' \in \mathcal{H}^r$ uniformly over $\mathcal{X}\times\cdots\times\mathcal{X}$. We use the reproducing property and the Cauchy–Schwarz inequality to obtain:
$$\big|F'(x_1,\dots,x_r)\big| = \Big|\big\langle F', k(x_1,\cdot)\cdots k(x_r,\cdot)\big\rangle_{\mathcal{H}^r}\Big| \leq \|F'\|_{\mathcal{H}^r}\,\big\|k(x_1,\cdot)\cdots k(x_r,\cdot)\big\|_{\mathcal{H}^r} = \|F'\|_{\mathcal{H}^r}\,\sqrt{k(x_1,x_1)\cdots k(x_r,x_r)} \leq \|F'\|_{\mathcal{H}^r}\sqrt{B}. \tag{5}$$
Fix arbitrary $\epsilon > 0$ and $\delta > 0$. By the universality of $k^r$, there exists some $F_\epsilon \in \mathcal{H}^r$ with
$$\mathbb{E}\Big[\big(F(X_1,\dots,X_r) - F_\epsilon(X_1,\dots,X_r)\big)^2\Big] \leq \frac{\epsilon}{3} + 2\mathcal{E}(F) - 2\sqrt{\frac{\epsilon}{3}\mathcal{E}(F) + \mathcal{E}(F)^2},$$
noting that $\frac{\epsilon}{3} + 2\mathcal{E}(F) > 2\sqrt{\frac{\epsilon}{3}\mathcal{E}(F) + \mathcal{E}(F)^2}$.

We want to show that there exists some $N \in \mathbb{N}$ such that for all $n \geq N$, $P\big(\mathcal{E}(\hat{F}) - \mathcal{E}(F) > \epsilon\big) \leq \delta$. We decompose this probability using the union bound as follows:
$$P\Big(\mathcal{E}(\hat{F}) - \mathcal{E}(F) > \epsilon\Big) \leq \underbrace{P\Big(\mathcal{E}(\hat{F}) - \hat{\mathcal{E}}(\hat{F}) > \tfrac{\epsilon}{3}\Big)}_{\text{(a)}} + \underbrace{P\Big(\hat{\mathcal{E}}(\hat{F}) - \mathcal{E}(F_\epsilon) > \tfrac{\epsilon}{3}\Big)}_{\text{(b)}} + \underbrace{P\Big(\mathcal{E}(F_\epsilon) - \mathcal{E}(F) > \tfrac{\epsilon}{3}\Big)}_{\text{(c)}}.$$
The proof is thus complete if we show that there exists some $N \in \mathbb{N}$ such that for all $n \geq N$, (a) $P\big(\mathcal{E}(\hat{F}) - \hat{\mathcal{E}}(\hat{F}) > \tfrac{\epsilon}{3}\big) \leq \tfrac{\delta}{2}$, (b) $P\big(\hat{\mathcal{E}}(\hat{F}) - \mathcal{E}(F_\epsilon) > \tfrac{\epsilon}{3}\big) \leq \tfrac{\delta}{2}$ and (c) $P\big(\mathcal{E}(F_\epsilon) - \mathcal{E}(F) > \tfrac{\epsilon}{3}\big) = 0$.

(a) Define the real-valued random variables $\zeta$ and $\zeta_{i_1,\dots,i_r}$ by
$$\zeta := \big(\hat{F}(X_1,\dots,X_r) - h(Y_1,\dots,Y_r)\big)^2, \qquad \zeta_{i_1,\dots,i_r} := \big(\hat{F}(X_{i_1},\dots,X_{i_r}) - h(Y_{i_1},\dots,Y_{i_r})\big)^2,$$
where $\{i_1,\dots,i_r\}$ runs over all $\binom{n}{r}$ combinations of $r$ distinct elements from $\{1,\dots,n\}$. Then we have
$$\hat{\mathcal{E}}(\hat{F}) = \frac{1}{\binom{n}{r}}\sum \zeta_{i_1,\dots,i_r}, \qquad \mathcal{E}(\hat{F}) = \mathbb{E}[\zeta].$$
Since $h(Y_1,\dots,Y_r)$ is almost surely bounded, there exists some $M > 0$ such that $|h(Y_1,\dots,Y_r)| \leq M$ almost surely. Note then that since $\hat{\mathcal{E}}_{\mathrm{reg}}(0) \leq M^2$, we cannot have $\|\hat{F}\|_{\mathcal{H}^r}^2 > \frac{M^2}{\lambda_n}$, as otherwise $\hat{\mathcal{E}}_{\mathrm{reg}}(\hat{F}) > M^2 \geq \hat{\mathcal{E}}_{\mathrm{reg}}(0)$, contradicting the fact that $\hat{F}$ minimises $\hat{\mathcal{E}}_{\mathrm{reg}}$ in $\mathcal{H}^r$ (c.f. (4)). Hence, $\|\hat{F}\|_{\mathcal{H}^r} \leq \frac{M}{\sqrt{\lambda_n}}$. Using this, the triangle inequality and the bound in (5), we have the following almost sure bound:
$$|\zeta| = \big(\hat{F}(X_1,\dots,X_r) - h(Y_1,\dots,Y_r)\big)^2 \leq \Big(\big|\hat{F}(X_1,\dots,X_r)\big| + \big|h(Y_1,\dots,Y_r)\big|\Big)^2 \leq \Big(\sqrt{B}\,\big\|\hat{F}\big\|_{\mathcal{H}^r} + M\Big)^2 \leq \bigg(\frac{\sqrt{B}M}{\sqrt{\lambda_n}} + M\bigg)^2,$$
and subsequently, defining $\sigma^2 = \mathrm{Var}(\zeta)$, we have an obvious bound for $\sigma^2$:
$$\sigma^2 \leq \bigg(\frac{\sqrt{B}M}{\sqrt{\lambda_n}} + M\bigg)^4.$$
Then by Chebyshev's inequality [Klenke, 2013, p.108, Theorem 5.11], we have the following bound:
$$P\bigg(\Big|\mathcal{E}(\hat{F}) - \hat{\mathcal{E}}(\hat{F})\Big| > \frac{\epsilon}{3}\bigg) \leq \frac{9}{\epsilon^2 \binom{n}{r}}\bigg(\frac{\sqrt{B}M}{\sqrt{\lambda_n}} + M\bigg)^4.$$
Now, since $\lambda_n$ converges to 0 at a slower rate than $O(n^{-1/2})$, we can ensure that $N$ is large enough such that for all $n \geq N$, the right-hand side is at most $\frac{\delta}{2}$. Then for all $n \geq N$,
$$P\Big(\mathcal{E}(\hat{F}) - \hat{\mathcal{E}}(\hat{F}) > \tfrac{\epsilon}{3}\Big) \leq P\Big(\big|\mathcal{E}(\hat{F}) - \hat{\mathcal{E}}(\hat{F})\big| > \tfrac{\epsilon}{3}\Big) \leq \frac{\delta}{2}.$$

(b) Now we define the real-valued random variables $\xi$ and $\xi_{i_1,\dots,i_r}$ by
$$\xi := \big(F_\epsilon(X_1,\dots,X_r) - h(Y_1,\dots,Y_r)\big)^2, \qquad \xi_{i_1,\dots,i_r} := \big(F_\epsilon(X_{i_1},\dots,X_{i_r}) - h(Y_{i_1},\dots,Y_{i_r})\big)^2,$$
where $\{i_1,\dots,i_r\}$ runs over all $\binom{n}{r}$ combinations of $r$ distinct elements from $\{1,\dots,n\}$. Then we have
$$\hat{\mathcal{E}}(F_\epsilon) = \frac{1}{\binom{n}{r}}\sum \xi_{i_1,\dots,i_r}, \qquad \mathcal{E}(F_\epsilon) = \mathbb{E}[\xi].$$
Then, using the triangle inequality and the bound in (5), we have the following almost sure bound:
$$|\xi| = \big(F_\epsilon(X_1,\dots,X_r) - h(Y_1,\dots,Y_r)\big)^2 \leq \Big(\big|F_\epsilon(X_1,\dots,X_r)\big| + \big|h(Y_1,\dots,Y_r)\big|\Big)^2 \leq \Big(\sqrt{B}\,\|F_\epsilon\|_{\mathcal{H}^r} + M\Big)^2 \quad\text{almost surely},$$
and subsequently, defining $\sigma^2 = \mathrm{Var}(\xi)$, we have an obvious bound for $\sigma^2$:
$$\sigma^2 \leq \Big(\sqrt{B}\,\|F_\epsilon\|_{\mathcal{H}^r} + M\Big)^4.$$
Then by Chebyshev's inequality again, we have the following bound:
$$P\bigg(\Big|\hat{\mathcal{E}}(F_\epsilon) - \mathcal{E}(F_\epsilon)\Big| > \frac{\epsilon}{6}\bigg) \leq \frac{36}{\epsilon^2 \binom{n}{r}}\Big(\sqrt{B}\,\|F_\epsilon\|_{\mathcal{H}^r} + M\Big)^4.$$
Now, we can ensure that $N$ is large enough such that for all $n \geq N$, the right-hand side is at most $\frac{\delta}{2}$, and further, $\lambda_n \|F_\epsilon\|_{\mathcal{H}^r}^2 \leq \frac{\epsilon}{6}$. Then see that
$$
\begin{aligned}
P\Big(\hat{\mathcal{E}}(\hat{F}) - \mathcal{E}(F_\epsilon) > \tfrac{\epsilon}{3}\Big) &\leq P\Big(\hat{\mathcal{E}}_{\mathrm{reg}}(\hat{F}) - \mathcal{E}(F_\epsilon) > \tfrac{\epsilon}{3}\Big) && \text{as } \hat{\mathcal{E}}_{\mathrm{reg}}(\hat{F}) \geq \hat{\mathcal{E}}(\hat{F}) \\
&\leq P\Big(\hat{\mathcal{E}}_{\mathrm{reg}}(F_\epsilon) - \mathcal{E}(F_\epsilon) > \tfrac{\epsilon}{3}\Big) && \text{as } \hat{F} \text{ minimises } \hat{\mathcal{E}}_{\mathrm{reg}} \\
&\leq P\Big(\hat{\mathcal{E}}(F_\epsilon) - \mathcal{E}(F_\epsilon) > \tfrac{\epsilon}{6}\Big) && \text{as } \lambda_n \|F_\epsilon\|_{\mathcal{H}^r}^2 \leq \tfrac{\epsilon}{6} \\
&\leq P\Big(\big|\hat{\mathcal{E}}(F_\epsilon) - \mathcal{E}(F_\epsilon)\big| > \tfrac{\epsilon}{6}\Big) \leq \frac{\delta}{2} && \text{by the above.}
\end{aligned}
$$

(c) By the triangle and Cauchy–Schwarz inequalities,
$$
\begin{aligned}
\mathcal{E}(F_\epsilon) - \mathcal{E}(F) &= \mathbb{E}\Big[\big(F_\epsilon(X_1,\dots,X_r) - h(Y_1,\dots,Y_r)\big)^2\Big] - \mathcal{E}(F) \\
&= \mathbb{E}\Big[\big(F_\epsilon(X_1,\dots,X_r) - F(X_1,\dots,X_r) + F(X_1,\dots,X_r) - h(Y_1,\dots,Y_r)\big)^2\Big] - \mathcal{E}(F) \\
&= \mathbb{E}\Big[\big(F_\epsilon(X_1,\dots,X_r) - F(X_1,\dots,X_r)\big)^2 + 2\big(F_\epsilon(X_1,\dots,X_r) - F(X_1,\dots,X_r)\big)\big(F(X_1,\dots,X_r) - h(Y_1,\dots,Y_r)\big)\Big] \\
&\leq \mathbb{E}\Big[\big(F_\epsilon(X_1,\dots,X_r) - F(X_1,\dots,X_r)\big)^2\Big] + 2\sqrt{\mathbb{E}\Big[\big(F_\epsilon(X_1,\dots,X_r) - F(X_1,\dots,X_r)\big)^2\Big]\,\mathcal{E}(F)} \\
&\leq \frac{\epsilon}{3} + 2\mathcal{E}(F) - 2\sqrt{\frac{\epsilon}{3}\mathcal{E}(F) + \mathcal{E}(F)^2} + 2\sqrt{\bigg(\frac{\epsilon}{3} + 2\mathcal{E}(F) - 2\sqrt{\frac{\epsilon}{3}\mathcal{E}(F) + \mathcal{E}(F)^2}\bigg)\mathcal{E}(F)} \\
&= \frac{\epsilon}{3} + 2\mathcal{E}(F) - 2\sqrt{\frac{\epsilon}{3}\mathcal{E}(F) + \mathcal{E}(F)^2} + 2\sqrt{\bigg(\sqrt{\frac{\epsilon}{3}\mathcal{E}(F) + \mathcal{E}(F)^2} - \mathcal{E}(F)\bigg)^2} \\
&= \frac{\epsilon}{3} + 2\mathcal{E}(F) - 2\sqrt{\frac{\epsilon}{3}\mathcal{E}(F) + \mathcal{E}(F)^2} + 2\sqrt{\frac{\epsilon}{3}\mathcal{E}(F) + \mathcal{E}(F)^2} - 2\mathcal{E}(F) = \frac{\epsilon}{3},
\end{aligned}
$$
where, in the penultimate equality, we noted that $\mathcal{E}(F) \leq \sqrt{\frac{\epsilon}{3}\mathcal{E}(F) + \mathcal{E}(F)^2}$. This is not a probabilistic statement, so trivially, $P\big(\mathcal{E}(F_\epsilon) - \mathcal{E}(F) > \tfrac{\epsilon}{3}\big) = 0$. $\square$