Adaptive Doubly Robust Estimator from Non-stationary Logging Policy under a Convergence of Average Probability
Counterfactual Inference of the Mean Outcome under a Convergence of Average Logging Probability
Masahiro Kato
CyberAgent, Inc., Shibuya, Tokyo
E-mail: masahiro [email protected]
Received: February 2021
Summary
Adaptive experiments, including efficient average treatment effect estimation and multi-armed bandit algorithms, have garnered attention in various applications, such as social experiments, clinical trials, and online advertisement optimization. This paper considers estimating the mean outcome of an action from samples obtained in adaptive experiments. In causal inference, the mean outcome of an action plays a crucial role, and its estimation is an essential task; average treatment effect estimation and off-policy value estimation are its variants. In adaptive experiments, the probability of choosing an action (the logging probability) may be sequentially updated based on past observations. Because the logging probability depends on past observations, the samples are often not independent and identically distributed (i.i.d.), which makes developing an asymptotically normal estimator difficult. A typical approach to this problem is to assume that the logging probability converges to a time-invariant function. However, this assumption is restrictive in various applications, such as when the logging probability fluctuates or becomes zero at some periods. To mitigate this limitation, we propose an alternative assumption that the average logging probability converges to a time-invariant function and show the doubly robust (DR) estimator's asymptotic normality under it. Under this assumption, the logging probability itself can fluctuate or be zero for some actions. We also examine the empirical properties of the proposed estimators by simulations.
1. INTRODUCTION

Estimating the mean outcome of an action is an essential task in statistical inference under the Neyman-Rubin potential outcomes model (Luedtke and van der Laan, 2016). Average treatment effect (ATE) estimation (Holland, 1986; Rubin, 1987; Robins et al., 1994; Hirano et al., 2003; Imai and Ratkovic, 2014; Imbens and Rubin, 2015) and off-policy value (OPV) estimation for multi-armed bandit (MAB) algorithms (Precup et al., 2000; Dudík et al., 2011; Mahmood et al., 2014; Li et al., 2015; Jiang and Li, 2016; Wang et al., 2017; Bibaut et al., 2019) are its special cases. We consider mean outcome estimation from dependent samples obtained in adaptive experiments (Chow and Chang, 2011; Hahn et al., 2011; Kasy and Sautmann, 2021), including MAB algorithms (Villar, 2018) and treatment regimes (TR) (Zhang et al., 2012; Zhao et al., 2012; Chakraborty and Moodie, 2013). In adaptive experiments, we gather samples via a logging probability (the probability of choosing an action), which is sequentially updated based on past observations. For instance, van der Laan (2008) considered a situation where a research subject visits at each period $t = 1, 2, \ldots, T$, and we select a treatment following a logging probability sequentially updated based on past observations to minimize the variance of an estimator.
Figure 1. The blue line denotes a logging probability converging to 0.5. The orange line denotes a logging probability that does not converge, but whose average converges to 0.5.

Owing to this sequential updating, the samples are not independent and identically distributed (i.i.d.). For statistical inference on the mean outcome, we aim to construct an asymptotically normal estimator, which also implies $\sqrt{T}$-consistency for a sample size $T$. Under the dependency, we cannot apply the standard central limit theorem (CLT). To mitigate this problem, existing studies have proposed various approaches, and one of the main approaches is to apply the martingale CLT, which requires that the variance of the target random variable converge to a time-invariant one. Existing studies have proposed the following three strategies for satisfying this requirement. The first strategy is to assume that the logging probability converges to a time-invariant function in probability (van der Laan, 2008; Hadad et al., 2019; Kato et al., 2020). The second strategy is to assume the existence of batched samples, where there are infinitely many samples in each batch (Hahn et al., 2011; van der Laan and Lendle, 2014; Zhang et al., 2020; Kato and Kaneko, 2020). The third strategy is to standardize the score function to equalize the variance of each period (Luedtke and van der Laan, 2016).

However, these strategies are often restrictive in practice. For instance, there are the following two situations where the first strategy is not applicable: (I) the logging probability fluctuates, and (II) the logging probability is 0 or 1 at some period. In best arm identification (BAI), Kaufmann et al. (2016) and Garivier and Kaufmann (2016) showed that pulling arms with a specific ratio achieves the lower bound of the sample complexity. Their methods deterministically select an arm at each period to keep the ratio. Here, the logging probability does not converge to a time-invariant function: the value is 1 for one arm and 0 for the others.
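The contrast between the two convergence notions can be checked numerically. The following sketch, with an illustrative alternating tracking rule (not taken from the paper), builds a deterministic logging probability that takes only the values 0 and 1, so $\pi_t$ itself has no limit, while its running average converges:

```python
import numpy as np

# Illustrative BAI-style tracking rule: the arm is chosen deterministically,
# so pi_t alternates between 1 and 0 and never converges, but the running
# average (1/t) sum_s pi_s converges to 0.5.
T = 100_000
pi_t = np.where(np.arange(1, T + 1) % 2 == 1, 1.0, 0.0)
running_avg = np.cumsum(pi_t) / np.arange(1, T + 1)

print(running_avg[-1])  # the running average converges to 0.5
```

This is exactly situation (II) above: the logging probability is 0 or 1 at every period, yet the average logging probability converges to a time-invariant value.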
Therefore, we cannot apply an existing mean outcome estimator in these situations. When the logging probability fluctuates, the martingale CLT is not applicable owing to the time-variant variance. To mitigate this problem, Luedtke and van der Laan (2016) proposed standardizing the score function by its estimated variance, which is applicable to many cases. However, the estimator of Luedtke and van der Laan (2016) does not achieve $\sqrt{T}$-consistency because it splits the samples to estimate the variance for the standardization. To overcome these problems, instead of the conventional strategies, this paper proposes a new strategy based on the assumption that the average logging probability converges to a time-invariant function in probability. This assumption generalizes the first strategy, which assumes that the logging probability itself converges to a time-invariant function in probability: when the logging probability itself converges, the average logging probability also converges. The new assumption is greatly useful in practice. For instance, we can apply our method to cases where the logging probability fluctuates, or is 0 or 1, as long as the average logging probability converges. Figure 1 illustrates an example where the average logging probability converges, in contrast to a case where the logging probability itself converges.

Organization of this paper.
In Section 2, we introduce our problem setting and the parameter that we want to estimate. In Section 3, we review preliminaries of mean outcome estimation. In Section 4, we propose two DR-type estimators and show their asymptotic normality under the assumption that the average logging probability converges in probability. The first estimator is more natural and empirically performs well but requires conditions that are not easy to verify (Theorem 4.1). The second estimator does not empirically perform as well as the first one, but we can show its asymptotic normality under an assumption that is easier to verify (Theorem 4.2). In Section 5, we numerically investigate the performance of the proposed estimators. In Section 6, we discuss the remaining problems.

2. PROBLEM SETTING

In this section, we describe our problem setting.
Consider a time series $t = 1, 2, \ldots, T$. For each period $t$, let $A_t$ be an action in $\mathcal{A} = \{1, 2, \ldots, K\}$, let $X_t \in \mathcal{X}$ be a covariate observed by the decision maker when choosing an action, and let $\mathcal{X}$ be the space of covariates. Let the random variable denoting the outcome at period $t$ be $Y_t = \sum_{a=1}^K \mathbb{1}[A_t = a]\, Y_t(a)$, where $Y_t(a) \in \mathbb{R}$ is a random variable denoting the potential (random) outcome of an action $a \in \mathcal{A}$. We have a dataset $\{(X_t, A_t, Y_t)\}_{t=1}^T$. The DGP is described as follows:
$$(X_t, A_t, Y_t(A_t)) \sim P_t = p(x)\, p_t(a \mid x)\, p(y_a \mid x),$$
where $X_t$ is generated from $p(x)$, $A_t$ is generated from $p_t(a \mid x)$ at period $t$, and $Y_t(a)$ is generated from $p(y_a \mid x)$. While $p(x)$ and $p(y_a \mid x)$ are invariant across periods, $p_t(a \mid x)$ can differ across periods based on past observations. In this case, the samples $\{(X_t, A_t, Y_t)\}_{t=1}^T$ are correlated over time; that is, the samples are not i.i.d. Let $\Omega_{t-1} = \{X_{t-1}, A_{t-1}, Y_{t-1}, \ldots, X_1, A_1, Y_1\}$ be the history with the space $\mathcal{M}_{t-1}$. The probability $p_t(a \mid x)$ is determined by a logging probability $\pi_t : \mathcal{A} \times \mathcal{X} \times \mathcal{M}_{t-1} \to (0, 1)$. The logging probability $\pi_t$ is conditionally independent of $Y_t(a)$ to satisfy unconfoundedness (Remark 2.2).

Remark 2.1. (Stable unit treatment value assumption)
The DGP also implies
the stable unit treatment value assumption; that is, $p(y(a) \mid x)$ is invariant for any $p_t(a \mid x)$ (Rubin, 1986).

Remark 2.2. (Unconfoundedness)
In this paper, unconfoundedness refers to independence between $(Y_t(1), \ldots, Y_t(K))$ and $A_t$ conditioned on $X_t$ and $\Omega_{t-1}$, which is required for identification of the mean outcome.

2.2. Parameter of Interest in Mean Outcome Estimation

Let a function $\pi^e : \mathcal{A} \times \mathcal{X} \to \mathbb{R}$ be an evaluation weight. We consider estimating the mean outcome weighted by an evaluation weight $\pi^e(a \mid x)$, defined as
$$R(\pi^e) := \mathbb{E}\left[\sum_{a=1}^K \pi^e(a \mid x)\, Y_t(a)\right].$$
Dudík et al. (2011) regarded the weight as a policy that we want to evaluate by limiting its range to $(0, 1)$ and its sum to 1. The ATE is also a special case of the mean outcome for two actions $\mathcal{A} = \{1, 2\}$, where $\pi^e(1 \mid x) = 1$ and $\pi^e(2 \mid x) = -1$. To identify $R(\pi^e)$, we assume boundedness of the potential outcome.

Assumption 2.1.
For all $a \in \mathcal{A}$ and $t \in \{1, 2, \ldots, T\}$, there exists a constant $C_Y$ such that $|Y_t(a)| \leq C_Y$.

Notations.
Let us denote $\mathbb{E}[Y_t(a) \mid x]$ and $\mathrm{Var}(Y_t(a) \mid x)$ by $f^*(a, x)$ and $v^*(a, x)$, respectively. Let $\hat{f}_t(a, x)$ be an estimator of $f^*(a, x)$ constructed from $\Omega_t$. Let $\mathcal{N}(\mu, \mathrm{var})$ be the normal distribution with mean $\mu$ and variance $\mathrm{var}$. For a random variable $Z$ and a function $\mu$, let $\|\mu(Z)\| = \int |\mu(z)|\, p(z)\, dz$ be the $L_1$-norm.

3. PRELIMINARIES OF MEAN OUTCOME ESTIMATION

We introduce well-known estimators for the standard mean outcome estimation problem with i.i.d. samples, where $\pi_1(a \mid x, \Omega_0) = \pi_2(a \mid x, \Omega_1) = \cdots = \pi(a \mid x)$. One of the standard estimators is the inverse probability weighting (IPW) estimator
$$\widehat{R}^{\mathrm{IPW}}_T(\pi^e) = \frac{1}{T} \sum_{t=1}^T \sum_{a=1}^K \frac{\pi^e(a \mid X_t)\, \mathbb{1}[A_t = a]\, Y_t}{\pi(a \mid X_t)},$$
which is also called importance sampling (Horvitz and Thompson, 1952). If $\hat{f}$ is a consistent estimator of $f^*$, the direct method (DM) estimator, defined as $\frac{1}{T}\sum_{t=1}^T \sum_{a=1}^K \pi^e(a \mid X_t)\, \hat{f}(a, X_t)$, is known to be consistent for the policy value $R(\pi^e)$. By extending the IPW estimator, Robins et al. (1994), Scharfstein et al. (1999), and Robins (1999) proposed the augmented IPW (AIPW) estimator
$$\widehat{R}^{\mathrm{AIPW}}_T(\pi^e) = \frac{1}{T} \sum_{t=1}^T \sum_{a=1}^K \left\{ \frac{\pi^e(a \mid X_t)\, \mathbb{1}[A_t = a] \left(Y_t - \hat{f}(a, X_t)\right)}{\pi(a \mid X_t)} + \pi^e(a \mid X_t)\, \hat{f}(a, X_t) \right\},$$
where $\hat{f}$ is a consistent estimator of $f^*$. In addition, the doubly robust (DR) estimator is also a standard choice (Scharfstein et al., 1999; Bang and Robins, 2005), defined as
$$\widehat{R}^{\mathrm{DR}}_T(\pi^e) = \frac{1}{T} \sum_{t=1}^T \sum_{a=1}^K \left\{ \frac{\pi^e(a \mid X_t)\, \mathbb{1}[A_t = a] \left(Y_t - \hat{f}(a, X_t)\right)}{\hat{g}(a \mid X_t)} + \pi^e(a \mid X_t)\, \hat{f}(a, X_t) \right\},$$
where $\hat{g}$ is a consistent estimator of $\pi$.

Semiparametric efficiency bound.
In many cases, we are interested in the asymptotic efficiency of the estimators. The lower bound of the asymptotic variance is defined for an estimator under some posited model of the DGP. If this posited model is a parametric model, then the lower bound is equal to the Cramér-Rao lower bound. When this posited model is a non- or semiparametric model, the corresponding lower bound can still be defined (Bickel et al., 1998). For the OPV estimation setting, Narita et al. (2019) shows that the semiparametric lower bound of the DGP under $p_1(a \mid x) = \cdots = p_T(a \mid x) = p(a \mid x)$ is
$$\Psi(\pi^e) = \mathbb{E}\left[\sum_{a=1}^K \frac{\left(\pi^e(a \mid X_t)\right)^2 v^*(a, X_t)}{p(a \mid X_t)} + \left(\sum_{a=1}^K \pi^e(a \mid X_t)\, f^*(a, X_t) - R(\pi^e)\right)^2\right].$$
The asymptotic variance of the asymptotic distribution is also known as the asymptotic mean squared error (MSE). By constructing a mean outcome estimator achieving the semiparametric lower bound, we not only obtain a tight confidence interval but also minimize the MSE between the estimator and the true value $R(\pi^e)$.

There are mainly three approaches for deriving the asymptotic normality of a mean outcome estimator from dependent samples: (i) assuming the convergence of the logging probability $\pi_t(a \mid x, \Omega_{t-1})$ to a time-invariant probability (van der Laan and Lendle, 2014; Hadad et al., 2019; Kato et al., 2020); (ii) assuming the presence of batched samples (Hahn et al., 2011; van der Laan and Lendle, 2014; Zhang et al., 2020); (iii) standardizing the score functions. Under the first approach, van der Laan (2008) and Kato et al. (2020) put the following assumption.

Assumption 3.1.
For all $a \in \mathcal{A}$ and $x \in \mathcal{X}$,
$$\hat{f}_{t-1}(a, x) \xrightarrow{p} f^*(a, x) \quad \text{and} \quad \pi_t(a \mid x, \Omega_{t-1}) \xrightarrow{p} \alpha(a \mid x),$$
where $\alpha : \mathcal{A} \times \mathcal{X} \to (0, 1)$ is a time-invariant function such that $\sum_{a'=1}^K \alpha(a' \mid x) = 1$, and there exists a constant $C_\pi$ satisfying $\left|\frac{\pi^e(a \mid x)}{\pi_t(a \mid x, \Omega_{t-1})}\right| \leq C_\pi$ for all $a \in \mathcal{A}$, $x \in \mathcal{X}$, and $\Omega_{t-1} \in \mathcal{M}_{t-1}$.

Then, van der Laan (2008) proposed the adaptive version of the IPW (AdaIPW) estimator
$$\widehat{R}^{\mathrm{AdaIPW}}_T(\pi^e) = \frac{1}{T} \sum_{t=1}^T \frac{\pi^e(A_t \mid X_t)\, Y_t}{\pi_t(A_t \mid X_t, \Omega_{t-1})},$$
and van der Laan (2008) and Kato et al. (2020) proposed estimators based on the Adaptive AIPW (A2IPW) estimator $\widehat{R}^{\mathrm{A2IPW}}_T(\pi^e)$ defined as
$$\frac{1}{T} \sum_{t=1}^T \sum_{a=1}^K \left\{ \frac{\pi^e(a \mid X_t)\, \mathbb{1}[A_t = a] \left(Y_t - \hat{f}_{t-1}(a, X_t)\right)}{\pi_t(a \mid X_t, \Omega_{t-1})} + \pi^e(a \mid X_t)\, \hat{f}_{t-1}(a, X_t) \right\},$$
where $\hat{f}_{t-1}$ is a consistent estimator of $f^*$ constructed only using $\Omega_{t-1}$. Under Assumption 3.1, Kato et al. (2020) showed the asymptotic normality of the A2IPW estimator.

Proposition 3.1. (Asymptotic distribution of the A2IPW estimator)
Under Assumptions 2.1 and 3.1, $\sqrt{T}\left(\widehat{R}^{\mathrm{A2IPW}}_T(\pi^e) - R(\pi^e)\right) \xrightarrow{d} \mathcal{N}(0, \Psi(\alpha))$.

Besides, by replacing the true logging probability $\pi_t$ with its estimator $\hat{g}_{t-1}$, Kato (2020) proposed the ADR estimator $\widehat{R}^{\mathrm{ADR}}_T(\pi^e)$ defined as
$$\frac{1}{T} \sum_{t=1}^T \sum_{a=1}^K \left\{ \frac{\pi^e(a \mid X_t)\, \mathbb{1}[A_t = a] \left(Y_t - \hat{f}_{t-1}(a, X_t)\right)}{\hat{g}_{t-1}(a \mid X_t)} + \pi^e(a \mid X_t)\, \hat{f}_{t-1}(a, X_t) \right\}.$$
Kato (2020) showed Proposition 3.2 using adaptive-fitting, which is also used in van der Laan and Lendle (2014). First, we assume boundedness and consistency of the nuisance estimators $\hat{f}_{t-1}$ and $\hat{g}_{t-1}$.

Assumption 3.2.
There exist constants $C_f, C_g > 0$ such that $|\hat{f}_{t-1}(a, x)| \leq C_f$ and $\left|\frac{\pi^e(a \mid x)}{\hat{g}_{t-1}(a \mid x)}\right| \leq C_g$ for all $a \in \mathcal{A}$, $x \in \mathcal{X}$, and $t \in \{1, 2, \ldots, T\}$.

Assumption 3.3.
For all $a \in \mathcal{A}$, $\left\| \hat{g}_{t-1}(a \mid X_t) - \alpha(a \mid X_t) \right\| = o_p(1)$ and $\left\| \hat{f}_{t-1}(a, X_t) - f^*(a, X_t) \right\| = o_p(1)$.
Assumption 3.4.
For all $a \in \mathcal{A}$, $\left\| \hat{g}_{t-1}(a \mid X_t) - \alpha(a \mid X_t) \right\| \left\| \hat{f}_{t-1}(a, X_t) - f^*(a, X_t) \right\| = o_p(t^{-1/2})$.
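Assumption 3.4 is a product-rate condition: neither nuisance estimator needs to converge at rate $t^{-1/2}$ on its own; only the product of their errors must be $o_p(t^{-1/2})$. A small numerical sketch with hypothetical error rates (the exponents below are illustrative assumptions, not the paper's):

```python
# Hypothetical nuisance error rates: ||g_hat - alpha|| ~ t^{-1/4} and
# ||f_hat - f*|| ~ t^{-1/3}.  Each is slower than t^{-1/2}, but their
# product is t^{-7/12} = o(t^{-1/2}), so t^{1/2} * (product) -> 0.
t = 10 ** 6
g_err = t ** -0.25
f_err = t ** (-1 / 3)
scaled = g_err * f_err * t ** 0.5   # = t^{-1/12}, shrinking as t grows
print(scaled)
```

This is the usual double-robustness trade-off: a fast outcome model can compensate for a slow logging-probability estimator, and vice versa.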
Proposition 3.2. (Asymptotic normality of an ADR estimator)
Under Assumptions 2.1–3.4, for the ADR estimator, $\sqrt{T}\left(\widehat{R}^{\mathrm{ADR}}_T(\pi^e) - R(\pi^e)\right) \xrightarrow{d} \mathcal{N}(0, \Psi(\alpha))$.

Sample splitting and Donsker's condition.
When estimating $f^*$ and $\frac{1}{t}\sum_{s=1}^t \pi_s$, we only use $\Omega_{t-1}$. Owing to this construction, we can derive the asymptotic normality of the semiparametric estimator without Donsker's condition. This technique is a variant of sample-splitting (Klaassen, 1987; Zheng and van der Laan, 2011; Chernozhukov et al., 2018). See van der Laan and Lendle (2014) and Kato (2020) for more details.

We consider a class of OPV estimators $\hat{R}_T$ such that there exists a function $\phi : \mathcal{X} \times \mathcal{A} \times \mathbb{R} \to \mathbb{R}$ satisfying
$$\sqrt{T}\left(\hat{R}_T(\pi^e) - R(\pi^e)\right) = \frac{1}{\sqrt{T}} \sum_{t=1}^T \phi(X_t, A_t, Y_t) + o_p(1).$$
Such an estimator $\hat{R}_T$ and function $\phi$ are called an asymptotically linear estimator and an influence function, respectively. If the samples are i.i.d., an asymptotically linear estimator is asymptotically normal: $\sqrt{T}\left(\hat{R}_T(\pi^e) - R(\pi^e)\right) \xrightarrow{d} \mathcal{N}\left(0, \mathbb{E}\left[\phi(X_t, A_t, Y_t)^2\right]\right)$. However, when the samples are dependent, we need to carefully consider the conditions for asymptotic normality of $\frac{1}{\sqrt{T}}\sum_{t=1}^T \phi(X_t, A_t, Y_t)$. A standard strategy is to apply the martingale CLT to $\frac{1}{\sqrt{T}}\sum_{t=1}^T \phi(X_t, A_t, Y_t)$. For a martingale difference sequence (MDS), the martingale CLT is given as follows.

Proposition 3.3. (CLT for a martingale difference sequence; Hamilton (1994), Proposition 7.9, p. 194) Let $\{R_t\}_{t=1}^\infty$ be a scalar martingale difference sequence with $\bar{R}_T = \frac{1}{T}\sum_{t=1}^T R_t$. Suppose that (a) $\mathbb{E}[R_t^2] = \sigma_t^2$, a positive value, with $\frac{1}{T}\sum_{t=1}^T \sigma_t^2 \to \sigma^2$, a positive value; (b) $\mathbb{E}[|R_t|^r] < \infty$ for some $r > 2$; (c) $\frac{1}{T}\sum_{t=1}^T R_t^2 \xrightarrow{p} \sigma^2$. Then $\sqrt{T}\,\bar{R}_T \xrightarrow{d} \mathcal{N}(0, \sigma^2)$.

Here, the martingale CLT requires the mean outcome estimator to have an asymptotically constant variance, and there are several directions for constructing estimators satisfying the conditions. For instance, convergence of the logging probability is one such direction. In this paper, because that assumption is too restrictive, we consider more practical assumptions.
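A quick numerical illustration of Proposition 3.3 (a sketch with illustrative numbers, not from the paper): an MDS whose conditional variance fluctuates deterministically still obeys the CLT as long as the average variance converges.

```python
import numpy as np

# R_t = s_t * eps_t with eps_t i.i.d. N(0, 1) is an MDS; its conditional
# variance s_t^2 alternates between 1 and 4, so sigma_t^2 never settles,
# but the average variance converges to sigma^2 = (1 + 4) / 2 = 2.5.
rng = np.random.default_rng(2)
T, n_rep = 2_000, 5_000
s = np.where(np.arange(T) % 2 == 0, 1.0, 2.0)

paths = rng.normal(size=(n_rep, T)) * s        # n_rep independent MDS paths
stats = np.sqrt(T) * paths.mean(axis=1)        # sqrt(T) * bar(R)_T per path

print(np.var(stats))  # close to sigma^2 = 2.5, matching N(0, sigma^2)
```

The same mechanism underlies the paper's strategy: it is the convergence of the averaged variance, not of the per-period variance, that the martingale CLT actually needs.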
Compared with studies on mean outcome estimation for i.i.d. samples (Horvitz and Thompson, 1952; Hahn, 1998; Hirano et al., 2003; Bang and Robins, 2005; Dudík et al., 2011; Narita et al., 2019; Bibaut et al., 2019), there are fewer studies on mean outcome estimation for non-i.i.d. samples. When the logging probability converges, van der Laan (2008), van der Laan and Lendle (2014), and Luedtke and van der Laan (2016) proposed IPW-, AIPW-, and DR-type estimators, mainly for ATE estimation. van der Laan (2008) proposed the A2IPW estimator. van der Laan and Lendle (2014) only suggested the possibility of an ADR estimator, and Kato (2020) established it. Without the convergence assumption, asymptotic normality can still be derived based on batched samples (Hahn et al., 2011; van der Laan and Lendle, 2014) and standardization (Luedtke and van der Laan, 2016). Because the A2IPW estimator is unstable, Hadad et al. (2019) proposed a stabilization method for it, and Kato (2020) empirically showed that the ADR estimator is more stable than the A2IPW estimator, while having the same asymptotic distribution.

A semiparametric estimator usually requires Donsker's condition for its $\sqrt{N}$-consistency, where $N$ is the sample size (Bickel et al., 1998). For semiparametric inference without Donsker's condition, sample-splitting is a typical approach (Klaassen, 1987; Zheng and van der Laan, 2011; Chernozhukov et al., 2018), which is also referred to as cross-fitting. As a variant of sample-splitting for time series, van der Laan and Lendle (2014) and Kato (2020) proposed adaptive-fitting. Kallus and Uehara (2019) also proposed mixingale-based sample-splitting.
Finally, we introduce existing studies on other related topics. Adaptive importance sampling is a sample selection framework for efficient Monte Carlo simulation, similar to adaptive experiments (Kloek and van Dijk, 1978; Naylor and Smith, 1988; Evans, 1988; Oh and Berger, 1992; Cappé et al., 2008; Portier and Delyon, 2018). In causal inference, the conditional mean outcome is also a standard target, and van der Laan (2008) and Zhang et al. (2020) proposed methods for estimating it. The method of Zhang et al. (2020) is a variant of the generalized method of moments for martingales (Hayashi, 2000), which is also applied in van der Laan and Lendle (2014). Li et al. (2010, 2011) proposed off-policy evaluation of (adaptive) MAB algorithms using i.i.d. samples generated from a random policy.

4. ADR ESTIMATOR WHEN THE AVERAGE LOGGING PROBABILITY CONVERGES

Let us consider a situation where $\frac{1}{t}\sum_{s=1}^t \pi_s(a \mid x, \Omega_{s-1})$ converges to $\bar{\alpha}(a \mid x)$ in probability for all $x \in \mathcal{X}$ and $\{\Omega_{s-1}\}_{s=1}^t$. For this case, we consider the adaptive DR (ADR) estimator $\widehat{R}^{\mathrm{ADR}}_T(\pi^e)$ defined as
$$\frac{1}{T} \sum_{t=1}^T \sum_{a=1}^K \left\{ \frac{\pi^e(a \mid X_t)\, \mathbb{1}[A_t = a] \left(Y_t - \hat{f}_{t-1}(a, X_t)\right)}{\hat{g}_{t-1}(a \mid X_t)} + \pi^e(a \mid X_t)\, \hat{f}_{t-1}(a, X_t) \right\},$$
where $\hat{g}_{t-1}(a \mid x)$ is an estimator of $\frac{1}{t}\sum_{s=1}^t \pi_s(a \mid x, \Omega_{s-1})$ or $\bar{\alpha}(a \mid x)$, constructed only from $\Omega_{t-1}$. For instance, when minimizing a risk with the logistic loss, we can show that the minimizer is $\frac{1}{t}\sum_{s=1}^t \pi_s$. Let us consider the following risk of a binary classification problem:
$$-\int \frac{1}{t} \sum_{s=1}^t \Big( p(a = 1) \log(h(x))\, p_s(x \mid a = 1) + p(a = 2) \log(1 - h(x))\, p_s(x \mid a = 2) \Big)\, dx,$$
where $p_s(x \mid a = 1)$ is the conditional density of $x$ at period $s$. By taking the derivative and applying the first-order condition, the minimizer is given as
$$h^*(x) = \frac{\frac{1}{t}\sum_{s=1}^t p_s(x \mid a = 1)\, p(a = 1)}{p(x)} = \frac{1}{t}\sum_{s=1}^t \pi_s(a = 1 \mid x, \Omega_{s-1}).$$
By the law of large numbers for martingales, the risk can be approximated by
$$-\frac{1}{t} \sum_{s=1}^t \Big( \mathbb{1}[A_s = 1] \log(h(X_s)) + \mathbb{1}[A_s = 2] \log(1 - h(X_s)) \Big).$$
Therefore, by naively applying logistic regression to $\{(X_s, A_s)\}_{s=1}^t$, we can obtain a consistent estimator $\hat{g}$.

In this paper, we show that even though the convergence assumption on the logging probability itself does not hold, we can derive the asymptotic normality of a mean outcome estimator under the following alternative assumption that the average logging probability converges in probability.

Assumption 4.1.
For all $a \in \mathcal{A}$, as $t \to \infty$,
$$\left\| \frac{1}{t}\sum_{s=1}^t \pi_s(a \mid X, \Omega_{s-1}) - \bar{\alpha}(a \mid X) \right\| \left\| f^*(a, X) - \hat{f}_{t-1}(a, X) \right\| = o_p(t^{-1/2}),$$
$$\left\| \hat{g}_{t-1}(a \mid X_t) - \bar{\alpha}(a \mid X_t) \right\| \left\| f^*(a, X_t) - \hat{f}_{t-1}(a, X_t) \right\| = o_p(t^{-1/2}),$$
where $\bar{\alpha} : \mathcal{A} \times \mathcal{X} \to (0, 1)$ is a time-invariant function such that $\sum_{a'=1}^K \bar{\alpha}(a' \mid x) = 1$ for all $x \in \mathcal{X}$, there exists a constant $C_\alpha$ satisfying $\left|\frac{\pi^e(a \mid x)}{\bar{\alpha}(a \mid x)}\right| < C_\alpha$, and the expectation in the norm is taken over $X_t$.

Assumption 4.2.
For all $a \in \mathcal{A}$, as $T \to \infty$,
$$\left| \frac{1}{T} \sum_{t=1}^T \mathbb{E}\left[ \pi^e(a \mid X) \left( \frac{\pi_t(a \mid X, \Omega_{t-1})}{\frac{1}{t}\sum_{s=1}^t \pi_s(a \mid X, \Omega_{s-1})} - 1 \right) \left( f^*(a, X) - \hat{f}_{t-1}(a, X) \right) \,\middle|\, \Omega_{t-1} \right] \right| = o_p(T^{-1/2}).$$

Assumption 4.1 is weaker than Assumption 3.1 because Assumption 4.1 holds under Assumption 3.1. Note that under Assumption 4.1, the logging probability $\pi_t$ can be deficient; that is, $\pi_t$ can be 0 at some period $t$. Kato (2020) derived the asymptotic normality of the ADR estimator under Assumption 3.1. In this section, we show the asymptotic normality under Assumption 4.1.
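To make the mechanics concrete, the following minimal simulation (covariate-free, with illustrative constants; not the paper's experiment) runs the ADR estimator when $\pi_t$ alternates between 0.2 and 0.8, so the logging probability fluctuates forever while its average converges to $\bar{\alpha} = 0.5$. Here $\hat{g}_{t-1}$ is the empirical action frequency, a natural estimator of the average logging probability, and $\hat{f}_{t-1}$ is a running outcome mean:

```python
import numpy as np

rng = np.random.default_rng(4)
T, K = 40_000, 2
f_true = np.array([1.0, 2.0])     # f*(a); illustrative constants
pi_e = np.array([0.5, 0.5])       # evaluation weight pi^e
R_true = pi_e @ f_true            # = 1.5

sums, counts = np.zeros(K), np.ones(K)   # running outcome model f_hat_{t-1}
n_sel = np.zeros(K)                      # action counts for g_hat_{t-1}
terms = []
for t in range(T):
    f_hat = sums / counts
    # g_hat_{t-1}: empirical action frequencies estimate the *average*
    # logging probability; clipped away from 0 for numerical stability.
    g_hat = np.clip(n_sel / t, 0.05, 0.95) if t > 0 else np.full(K, 0.5)
    pi_t = 0.2 if t % 2 == 0 else 0.8    # fluctuates forever; average -> 0.5
    p = np.array([pi_t, 1.0 - pi_t])
    a = rng.choice(K, p=p)
    y = f_true[a] + rng.normal(0.0, 0.1)
    terms.append(pi_e[a] * (y - f_hat[a]) / g_hat[a] + pi_e @ f_hat)
    sums[a] += y; counts[a] += 1.0; n_sel[a] += 1.0

R_adr = np.mean(terms)
print(R_adr)  # concentrates around R_true = 1.5
```

Despite the non-convergent $\pi_t$, the estimate concentrates around the true value $R(\pi^e) = 1.5$, consistent with the average-convergence assumption driving Theorem 4.1.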
Theorem 4.1. (Asymptotic normality of the ADR estimator when the average logging probability converges)
Under Assumptions 2.1, 3.2–3.3 and 4.1–4.2, for the ADR estimator, $\sqrt{T}\left(\widehat{R}^{\mathrm{ADR}}_T(\pi^e) - R(\pi^e)\right) \xrightarrow{d} \mathcal{N}(0, \Psi(\bar{\alpha}))$.

To show Theorem 4.1, we decompose $\sqrt{T}\left(\widehat{R}^{\mathrm{ADR}}_T(\pi^e) - R(\pi^e)\right)$ as
$$\sqrt{T}\left(\widehat{R}^{\mathrm{ADR}}_T(\pi^e) - \ddot{R}_T(\pi^e) + \ddot{R}_T(\pi^e) - R(\pi^e)\right),$$
where $\ddot{R}_T(\pi^e)$ is defined as
$$\frac{1}{T} \sum_{t=1}^T \sum_{a=1}^K \left\{ \frac{\pi^e(a \mid X_t)\, \mathbb{1}[A_t = a] \left(Y_t - f^*(a, X_t)\right)}{\bar{\alpha}(a \mid X_t)} + \pi^e(a \mid X_t)\, f^*(a, X_t) \right\}.$$
The remaining problems are to show that
$$\sqrt{T}\left(\widehat{R}^{\mathrm{ADR}}_T(\pi^e) - \ddot{R}_T(\pi^e)\right) = o_p(1) \tag{4.1}$$
and
$$\sqrt{T}\left(\ddot{R}_T(\pi^e) - R(\pi^e)\right) \xrightarrow{d} \mathcal{N}\left(0, \bar{\sigma}^2\right). \tag{4.2}$$
We separately show (4.1) and (4.2) in Lemmas 4.1 and 4.2, respectively. First, we show Lemma 4.1.
Lemma 4.1.
Under Assumptions 2.1, 3.2–3.3 and 4.1–4.2, $\sqrt{T}\left(\widehat{R}^{\mathrm{ADR}}_T(\pi^e) - \ddot{R}_T(\pi^e)\right) = o_p(1)$.

Lemma 4.1 is proved by a technique based on sample-splitting, as in van der Laan and Lendle (2014) and Kato (2020). Here, we show a sketch of the proof. The full proof is given in Appendix B of the supplementary material.

Proof. (Sketch of proof)
Let us define
$$\phi_1(X_t, A_t, Y_t; g, f) = \sum_{a=1}^K \frac{\pi^e(a \mid X_t)\, \mathbb{1}[A_t = a] \left(Y_t - f(a, X_t)\right)}{g(a \mid X_t)}, \qquad \phi_2(X_t; f) = \sum_{a=1}^K \pi^e(a \mid X_t)\, f(a, X_t).$$
We decompose $\widehat{R}^{\mathrm{ADR}}_T(\pi^e) - \ddot{R}_T(\pi^e)$ as
$$\begin{aligned}
\widehat{R}^{\mathrm{ADR}}_T(\pi^e) - \ddot{R}_T(\pi^e) = \frac{1}{T} \sum_{t=1}^T \Bigg\{ &\phi_1(X_t, A_t, Y_t; \hat{g}_{t-1}, \hat{f}_{t-1}) - \phi_1(X_t, A_t, Y_t; \bar{\alpha}, f^*) \\
&- \mathbb{E}\left[\phi_1(X_t, A_t, Y_t; \hat{g}_{t-1}, \hat{f}_{t-1}) - \phi_1(X_t, A_t, Y_t; \bar{\alpha}, f^*) \mid \Omega_{t-1}\right] \\
&+ \phi_2(X_t; \hat{f}_{t-1}) - \phi_2(X_t; f^*) - \mathbb{E}\left[\phi_2(X_t; \hat{f}_{t-1}) - \phi_2(X_t; f^*) \mid \Omega_{t-1}\right] \Bigg\} \\
+ \frac{1}{T} \sum_{t=1}^T &\mathbb{E}\left[\phi_1(X_t, A_t, Y_t; \hat{g}_{t-1}, \hat{f}_{t-1}) \mid \Omega_{t-1}\right] + \frac{1}{T} \sum_{t=1}^T \mathbb{E}\left[\phi_2(X_t; \hat{f}_{t-1}) \mid \Omega_{t-1}\right] \\
- \frac{1}{T} \sum_{t=1}^T &\mathbb{E}\left[\phi_1(X_t, A_t, Y_t; \bar{\alpha}, f^*) \mid \Omega_{t-1}\right] - \frac{1}{T} \sum_{t=1}^T \mathbb{E}\left[\phi_2(X_t; f^*) \mid \Omega_{t-1}\right].
\end{aligned}$$
In the following parts, we separately show that
$$\frac{\sqrt{T}}{T} \sum_{t=1}^T \Bigg\{ \phi_1(X_t, A_t, Y_t; \hat{g}_{t-1}, \hat{f}_{t-1}) - \phi_1(X_t, A_t, Y_t; \bar{\alpha}, f^*) - \mathbb{E}\left[\phi_1(X_t, A_t, Y_t; \hat{g}_{t-1}, \hat{f}_{t-1}) - \phi_1(X_t, A_t, Y_t; \bar{\alpha}, f^*) \mid \Omega_{t-1}\right] + \phi_2(X_t; \hat{f}_{t-1}) - \phi_2(X_t; f^*) - \mathbb{E}\left[\phi_2(X_t; \hat{f}_{t-1}) - \phi_2(X_t; f^*) \mid \Omega_{t-1}\right] \Bigg\} = o_p(1) \tag{4.3}$$
and
$$\frac{1}{T} \sum_{t=1}^T \mathbb{E}\left[\phi_1(X_t, A_t, Y_t; \hat{g}_{t-1}, \hat{f}_{t-1}) \mid \Omega_{t-1}\right] + \frac{1}{T} \sum_{t=1}^T \mathbb{E}\left[\phi_2(X_t; \hat{f}_{t-1}) \mid \Omega_{t-1}\right] - \frac{1}{T} \sum_{t=1}^T \mathbb{E}\left[\phi_1(X_t, A_t, Y_t; \bar{\alpha}, f^*) \mid \Omega_{t-1}\right] - \frac{1}{T} \sum_{t=1}^T \mathbb{E}\left[\phi_2(X_t; f^*) \mid \Omega_{t-1}\right] = o_p(1/\sqrt{T}). \tag{4.4}$$
To show (4.3), we show that the mean of the LHS of (4.3) is 0 and that its variance converges to 0 in probability. To show (4.4), we use Assumption 4.1.

Next, Lemma 4.2 provides the asymptotic normality of $\ddot{R}_T(\pi^e)$.

Lemma 4.2.
Under Assumptions 2.1 and 4.1, $\sqrt{T}\left(\ddot{R}_T(\pi^e) - R(\pi^e)\right) \xrightarrow{d} \mathcal{N}(0, \Psi(\bar{\alpha}))$.

Here, we show a sketch of the proof. The full proof is given in Appendix C of the supplementary material.

Proof. (Sketch of proof)
The proof procedure follows Kato et al. (2020). Let $\Gamma_t(a; \pi^e)$ be
$$\Gamma_t(a; \pi^e) = \frac{\pi^e(a \mid X_t)\, \mathbb{1}[A_t = a] \left(Y_t(a) - f^*(a, X_t)\right)}{\bar{\alpha}(a \mid X_t)} + \pi^e(a \mid X_t)\, f^*(a, X_t).$$
Note that $\ddot{R}_T(\pi^e) = \frac{1}{T}\sum_{t=1}^T \sum_{a=1}^K \Gamma_t(a; \pi^e)$. Then, for $Z_t = \sum_{a=1}^K \Gamma_t(a; \pi^e) - R(\pi^e)$, we want to show that
$$\sqrt{T}\left(\ddot{R}_T(\pi^e) - R(\pi^e)\right) = \sqrt{T}\left(\frac{1}{T}\sum_{t=1}^T Z_t\right) \xrightarrow{d} \mathcal{N}\left(0, \bar{\sigma}^2\right).$$
The sequence $\{Z_t\}_{t=1}^T$ is an MDS; that is,
$$\begin{aligned}
\mathbb{E}\left[Z_t \mid \Omega_{t-1}\right] &= \mathbb{E}\left[\sum_{a=1}^K \Gamma_t(a; \pi^e) - R(\pi^e) \,\middle|\, \Omega_{t-1}\right] \\
&= \mathbb{E}\left[\sum_{a=1}^K \pi^e(a \mid X_t)\, f^*(a, X_t) - R(\pi^e) \,\middle|\, \Omega_{t-1}\right] + \mathbb{E}\left[\sum_{a=1}^K \frac{\pi^e(a \mid X_t)\, \mathbb{1}[A_t = a] \left(Y_t(a) - f^*(a, X_t)\right)}{\bar{\alpha}(a \mid X_t)} \,\middle|\, \Omega_{t-1}\right] \\
&= 0 + \mathbb{E}\left[\mathbb{E}\left[\sum_{a=1}^K \frac{\pi^e(a \mid X_t)\, \pi_t(a \mid X_t, \Omega_{t-1}) \left(f^*(a, X_t) - f^*(a, X_t)\right)}{\bar{\alpha}(a \mid X_t)} \,\middle|\, X_t, \Omega_{t-1}\right] \,\middle|\, \Omega_{t-1}\right] = 0.
\end{aligned}$$
Therefore, to derive the asymptotic distribution, we consider applying the CLT for an MDS introduced in Proposition 3.3. There are the following three conditions in the statement:
(a) $\mathbb{E}[Z_t^2] = \nu_t^2 > 0$ with $(1/T)\sum_{t=1}^T \nu_t^2 \to \nu^2 > 0$;
(b) $\mathbb{E}[|Z_t|^r] < \infty$ for some $r > 2$;
(c) $(1/T)\sum_{t=1}^T Z_t^2 \xrightarrow{p} \nu^2$.
Because we assumed the boundedness of $Z_t$ by assuming the boundedness of $Y_t$, $f^*$, and $\pi^e / \bar{\alpha}$, condition (b) holds. Therefore, the remaining task is to show that conditions (a) and (c) hold. Here, the convergence of the average logging probability plays an important role by making the variance asymptotically time-invariant.

Then, from Lemmas 4.1 and 4.2, we can show Theorem 4.1.

To show the asymptotic normality of the ADR estimator, we need to check Assumption 4.2, but this is not easy in practice. In this section, we modify the ADR estimator to guarantee the asymptotic normality more easily.
We define the Modified ADR (MADR) estimator $\widetilde{R}^{\mathrm{MADR}}_T(\pi^e)$ as
$$\frac{1}{T} \sum_{t=1}^T \sum_{a=1}^K \left\{ \frac{\pi^e(a \mid X_t)\, \mathbb{1}[A_t = a] \left(Y_t - \tilde{f}_{t-1, u(T)}(a, X_t)\right)}{\hat{g}_{t-1}(a \mid X_t)} + \pi^e(a \mid X_t)\, \tilde{f}_{t-1, u(T)}(a, X_t) \right\},$$
where $u(T)$ is a function of $T$ and
$$\tilde{f}_{t-1, u(T)}(a, X_t) = \begin{cases} \hat{f}_{t-1}(a, X_t) & \text{if } t \leq u(T), \\ \hat{f}_{u(T)}(a, X_t) & \text{otherwise.} \end{cases}$$

Then, we put the following assumption.
Assumption 4.3.
There exists a function $u(T) > 0$ such that for all $a \in \mathcal{A}$, $u(T)/\sqrt{T} \to 0$ as $T \to \infty$, and for $T > u(T)$,
$$\left\| \frac{1}{T - u(T)} \sum_{t = u(T)+1}^T \frac{\pi_t(a \mid X, \Omega_{t-1})}{\hat{g}_{t-1}(a \mid X)} - 1 \right\| \left\| f^*(a, X) - \hat{f}_{u(T)}(a, X) \right\| = o_p(T^{-1/2}), \tag{4.5}$$
$$\frac{1}{T} \sum_{t = u(T)+1}^T \left\| \frac{1}{t}\sum_{s=1}^t \pi_s(a \mid X, \Omega_{s-1}) - \hat{g}_{t-1}(a \mid X) \right\| \left\| f^*(a, X) - \hat{f}_{u(T)}(a, X) \right\| = o_p(T^{-1/2}), \tag{4.6}$$
where $\bar{\alpha} : \mathcal{A} \times \mathcal{X} \to (0, 1)$ is a time-invariant function such that $\sum_{a'=1}^K \bar{\alpha}(a' \mid x) = 1$, there exists a constant $C_\alpha$ satisfying $\left|\frac{\pi^e(a \mid x)}{\bar{\alpha}(a \mid x)}\right| < C_\alpha$, and the expectation in the norm is taken over $X_t$.
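The freezing construction can be written down directly. In the sketch below, `u(T) = floor(T**0.4)` is an illustrative choice consistent with the (reconstructed) growth condition $u(T)/\sqrt{T} \to 0$, and `f_hats` indices stand for the outcome models $\hat{f}_k$ fitted on the first $k$ periods; none of these concrete choices come from the paper:

```python
import math

def u(T: int) -> int:
    # Illustrative cutoff: u(T) = floor(T^0.4), so u(T) / sqrt(T) -> 0.
    return int(math.floor(T ** 0.4))

def f_tilde_index(t: int, T: int) -> int:
    """Index k of the outcome model f_hat_k used at period t (1-based):
    the sequentially updated f_hat_{t-1} while t <= u(T), and the frozen
    f_hat_{u(T)} afterwards."""
    return t - 1 if t <= u(T) else u(T)

T = 10_000                 # u(T) = 39, so u(T) / sqrt(T) = 0.39
indices = [f_tilde_index(t, T) for t in range(1, T + 1)]
print(indices[:3], indices[-1])  # early periods update; late periods frozen
```

Only periods after $u(T)$ use a fixed outcome model, which is what allows the easier-to-verify Assumption 4.3 to play the role of Assumption 4.2.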
Lemma 4.3.
Under Assumptions 2.1, 3.2–3.3 and 4.3, $\sqrt{T}\left(\widetilde{R}^{\mathrm{MADR}}_T(\pi^e) - \ddot{R}_T(\pi^e)\right) = o_p(1)$.
By using $\phi_1(X_t, A_t, Y_t; g, f)$ and $\phi_2(X_t; f) = \sum_{a=1}^K \pi^e(a \mid X_t)\, f(a, X_t)$ defined in the proof of Lemma 4.1, we decompose $\widetilde{R}^{\mathrm{MADR}}_T(\pi^e) - \ddot{R}_T(\pi^e)$ as
$$\begin{aligned}
\widetilde{R}^{\mathrm{MADR}}_T(\pi^e) - \ddot{R}_T(\pi^e) = \frac{1}{T} \sum_{t=1}^T \Bigg\{ &\phi_1(X_t, A_t, Y_t; \hat{g}_{t-1}, \tilde{f}_{t-1,u(T)}) - \phi_1(X_t, A_t, Y_t; \bar{\alpha}, f^*) \\
&- \mathbb{E}\left[\phi_1(X_t, A_t, Y_t; \hat{g}_{t-1}, \tilde{f}_{t-1,u(T)}) - \phi_1(X_t, A_t, Y_t; \bar{\alpha}, f^*) \mid \Omega_{t-1}\right] \\
&+ \phi_2(X_t; \tilde{f}_{t-1,u(T)}) - \phi_2(X_t; f^*) - \mathbb{E}\left[\phi_2(X_t; \tilde{f}_{t-1,u(T)}) - \phi_2(X_t; f^*) \mid \Omega_{t-1}\right] \Bigg\} \\
+ \frac{1}{T} \sum_{t=1}^T &\mathbb{E}\left[\phi_1(X_t, A_t, Y_t; \hat{g}_{t-1}, \tilde{f}_{t-1,u(T)}) \mid \Omega_{t-1}\right] + \frac{1}{T} \sum_{t=1}^T \mathbb{E}\left[\phi_2(X_t; \tilde{f}_{t-1,u(T)}) \mid \Omega_{t-1}\right] \\
- \frac{1}{T} \sum_{t=1}^T &\mathbb{E}\left[\phi_1(X_t, A_t, Y_t; \bar{\alpha}, f^*) \mid \Omega_{t-1}\right] - \frac{1}{T} \sum_{t=1}^T \mathbb{E}\left[\phi_2(X_t; f^*) \mid \Omega_{t-1}\right].
\end{aligned}$$
Following almost the same process as the proof of Lemma 4.1, we can show that
$$\frac{\sqrt{T}}{T} \sum_{t=1}^T \Bigg\{ \phi_1(X_t, A_t, Y_t; \hat{g}_{t-1}, \tilde{f}_{t-1,u(T)}) - \phi_1(X_t, A_t, Y_t; \bar{\alpha}, f^*) - \mathbb{E}\left[\phi_1(X_t, A_t, Y_t; \hat{g}_{t-1}, \tilde{f}_{t-1,u(T)}) - \phi_1(X_t, A_t, Y_t; \bar{\alpha}, f^*) \mid \Omega_{t-1}\right] + \phi_2(X_t; \tilde{f}_{t-1,u(T)}) - \phi_2(X_t; f^*) - \mathbb{E}\left[\phi_2(X_t; \tilde{f}_{t-1,u(T)}) - \phi_2(X_t; f^*) \mid \Omega_{t-1}\right] \Bigg\} = o_p(1).$$
Therefore, we consider showing
$$\frac{1}{T} \sum_{t=1}^T \mathbb{E}\left[\phi_1(X_t, A_t, Y_t; \hat{g}_{t-1}, \tilde{f}_{t-1,u(T)}) \mid \Omega_{t-1}\right] + \frac{1}{T} \sum_{t=1}^T \mathbb{E}\left[\phi_2(X_t; \tilde{f}_{t-1,u(T)}) \mid \Omega_{t-1}\right] - \frac{1}{T} \sum_{t=1}^T \mathbb{E}\left[\phi_1(X_t, A_t, Y_t; \bar{\alpha}, f^*) \mid \Omega_{t-1}\right] - \frac{1}{T} \sum_{t=1}^T \mathbb{E}\left[\phi_2(X_t; f^*) \mid \Omega_{t-1}\right] = o_p(1/\sqrt{T}). \tag{4.7}$$
If (4.7) holds, then we can prove the statement.
First, as shown in the proof of Lemma 4.1, we bound the LHS of (4.7)–(4.9) as follows:
\begin{align}
&\frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\Big[\phi\big(X_t, A_t, Y_t; \hat{g}_{t-1}, \tilde{f}_{t-1,u(T)}\big) \,\Big|\, \Omega_{t-1}\Big] + \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\Big[\phi\big(X_t; \tilde{f}_{t-1,u(T)}\big) \,\Big|\, \Omega_{t-1}\Big] \nonumber\\
&\quad - \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\Big[\phi\big(X_t, A_t, Y_t; \bar{\alpha}, f^*\big) \,\Big|\, \Omega_{t-1}\Big] - \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\Big[\phi\big(X_t; f^*\big) \,\Big|\, \Omega_{t-1}\Big] \nonumber\\
&\le \left| \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\left[ \frac{\pi^e(a \mid X)\big(\pi_t(a \mid X, \Omega_{t-1}) - \hat{g}_{t-1}(a \mid X)\big)\big(f^*(a, X) - \tilde{f}_{t-1,u(T)}(a, X)\big)}{\hat{g}_{t-1}(a \mid X)} \,\middle|\, \Omega_{t-1} \right] \right| \nonumber\\
&\le \left| \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\left[ \frac{\pi^e(a \mid X)\big(\frac{1}{t}\sum_{s=1}^{t}\pi_s(a \mid X, \Omega_{s-1}) - \hat{g}_{t-1}(a \mid X)\big)\big(f^*(a, X) - \tilde{f}_{t-1,u(T)}(a, X)\big)}{\hat{g}_{t-1}(a \mid X)} \,\middle|\, \Omega_{t-1} \right] \right| \nonumber\\
&\quad + \left| \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\left[ \frac{\pi^e(a \mid X)\big(\pi_t(a \mid X, \Omega_{t-1}) - \frac{1}{t}\sum_{s=1}^{t}\pi_s(a \mid X, \Omega_{s-1})\big)\big(f^*(a, X) - \tilde{f}_{t-1,u(T)}(a, X)\big)}{\hat{g}_{t-1}(a \mid X)} \,\middle|\, \Omega_{t-1} \right] \right| \nonumber\\
&\le \frac{C}{T}\sum_{t=1}^{T} \left| \mathbb{E}\left[ \left( \frac{1}{t}\sum_{s=1}^{t}\pi_s(a \mid X, \Omega_{s-1}) - \hat{g}_{t-1}(a \mid X) \right)\big(f^*(a, X) - \tilde{f}_{t-1,u(T)}(a, X)\big) \,\middle|\, \Omega_{t-1} \right] \right| \tag{4.10}\\
&\quad + \left| \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\left[ \pi^e(a \mid X)\left( \frac{\pi_t(a \mid X, \Omega_{t-1})}{\hat{g}_{t-1}(a \mid X)} - \frac{\frac{1}{t}\sum_{s=1}^{t}\pi_s(a \mid X, \Omega_{s-1})}{\hat{g}_{t-1}(a \mid X)} \right)\big(f^*(a, X) - \tilde{f}_{t-1,u(T)}(a, X)\big) \,\middle|\, \Omega_{t-1} \right] \right|, \tag{4.11}
\end{align}
where $C > 0$ is a constant.
By Hölder's inequality, $\|\mu\nu\|_1 \le \|\mu\|_2\|\nu\|_2$,
\begin{align*}
&\frac{1}{T}\sum_{t=1}^{T} \left| \mathbb{E}\left[ \left( \frac{1}{t}\sum_{s=1}^{t}\pi_s(a \mid X, \Omega_{s-1}) - \hat{g}_{t-1}(a \mid X) \right)\big(f^*(a, X) - \tilde{f}_{t-1,u(T)}(a, X)\big) \,\middle|\, \Omega_{t-1} \right] \right| \\
&\le \frac{1}{T}\sum_{t=1}^{T} \left\| \frac{1}{t}\sum_{s=1}^{t}\pi_s(a \mid X, \Omega_{s-1}) - \hat{g}_{t-1}(a \mid X) \right\|_2 \left\| f^*(a, X) - \tilde{f}_{t-1,u(T)}(a, X) \right\|_2 \\
&\le \frac{1}{T}\sum_{t=1}^{T} \left( \left\| \frac{1}{t}\sum_{s=1}^{t}\pi_s(a \mid X, \Omega_{s-1}) - \bar{\alpha}(a \mid X) \right\|_2 + \left\| \bar{\alpha}(a \mid X) - \hat{g}_{t-1}(a \mid X) \right\|_2 \right) \left\| f^*(a, X) - \tilde{f}_{t-1,u(T)}(a, X) \right\|_2 \\
&= \frac{1}{T}\sum_{t=1}^{T} \left( \left\| \frac{1}{t}\sum_{s=1}^{t}\pi_s(a \mid X, \Omega_{s-1}) - \bar{\alpha}(a \mid X) \right\|_2 \left\| f^*(a, X) - \tilde{f}_{t-1,u(T)}(a, X) \right\|_2 + \left\| \bar{\alpha}(a \mid X) - \hat{g}_{t-1}(a \mid X) \right\|_2 \left\| f^*(a, X) - \tilde{f}_{t-1,u(T)}(a, X) \right\|_2 \right).
\end{align*}
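As a quick numerical illustration of the inequality invoked here, $\|\mu\nu\|_1 \le \|\mu\|_2\|\nu\|_2$ (the Cauchy–Schwarz case of Hölder's inequality), the following sketch checks it by Monte Carlo with arbitrary stand-ins for the two factors; the distributions are illustrative only and play no role in the proof:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
mu = rng.normal(size=n)          # stand-in for (1/t) sum_s pi_s - g_hat
nu = rng.uniform(-2, 2, size=n)  # stand-in for f* - f_tilde

lhs = np.mean(np.abs(mu * nu))                               # ||mu nu||_1
rhs = np.sqrt(np.mean(mu ** 2)) * np.sqrt(np.mean(nu ** 2))  # ||mu||_2 ||nu||_2
assert lhs <= rhs
print(lhs, rhs)
```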
From the definition of $u(T)$, we can bound it as
\begin{align*}
&\frac{1}{T}\sum_{t=1}^{u(T)} \left( \left\| \frac{1}{t}\sum_{s=1}^{t}\pi_s(a \mid X, \Omega_{s-1}) - \bar{\alpha}(a \mid X) \right\|_2 \left\| f^*(a, X) - \hat{f}_{t-1}(a, X) \right\|_2 + \left\| \bar{\alpha}(a \mid X) - \hat{g}_{t-1}(a \mid X) \right\|_2 \left\| f^*(a, X) - \hat{f}_{t-1}(a, X) \right\|_2 \right) \\
&\quad + \frac{1}{T}\sum_{t=u(T)+1}^{T} \left( \left\| \frac{1}{t}\sum_{s=1}^{t}\pi_s(a \mid X, \Omega_{s-1}) - \bar{\alpha}(a \mid X) \right\|_2 \left\| f^*(a, X) - \hat{f}_{u(T)}(a, X) \right\|_2 + \left\| \bar{\alpha}(a \mid X) - \hat{g}_{t-1}(a \mid X) \right\|_2 \left\| f^*(a, X) - \hat{f}_{u(T)}(a, X) \right\|_2 \right) \\
&\le C\,\frac{u(T)}{T} + \frac{1}{T}\sum_{t=u(T)+1}^{T} o_p(T^{-1/2}) + \frac{1}{T}\sum_{t=u(T)+1}^{T} o_p(T^{-1/2}) = o_p(T^{-1/2}) + o_p(T^{-1/2}) = o_p(T^{-1/2}),
\end{align*}
where $C > 0$ is a constant.
By using the $u(T)$ of the statement, we have
\begin{align}
&\left| \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\left[ \pi^e(a \mid X)\left( \frac{\pi_t(a \mid X, \Omega_{t-1})}{\hat{g}_{t-1}(a \mid X)} - \frac{\frac{1}{t}\sum_{s=1}^{t}\pi_s(a \mid X, \Omega_{s-1})}{\hat{g}_{t-1}(a \mid X)} \right)\big(f^*(a, X) - \tilde{f}_{t-1,u(T)}(a, X)\big) \,\middle|\, \Omega_{t-1} \right] \right| \nonumber\\
&\le \left| \frac{1}{T}\sum_{t=1}^{u(T)} \mathbb{E}\left[ \pi^e(a \mid X)\left( \frac{\pi_t(a \mid X, \Omega_{t-1})}{\hat{g}_{t-1}(a \mid X)} - \frac{\frac{1}{t}\sum_{s=1}^{t}\pi_s(a \mid X, \Omega_{s-1})}{\hat{g}_{t-1}(a \mid X)} \right)\big(f^*(a, X) - \tilde{f}_{t-1,u(T)}(a, X)\big) \,\middle|\, \Omega_{t-1} \right] \right| \tag{4.12}\\
&\quad + \left| \frac{1}{T}\sum_{t=u(T)+1}^{T} \mathbb{E}\left[ \pi^e(a \mid X)\left( \frac{\pi_t(a \mid X, \Omega_{t-1})}{\hat{g}_{t-1}(a \mid X)} - \frac{\frac{1}{t}\sum_{s=1}^{t}\pi_s(a \mid X, \Omega_{s-1})}{\hat{g}_{t-1}(a \mid X)} \right)\big(f^*(a, X) - \tilde{f}_{t-1,u(T)}(a, X)\big) \,\middle|\, \Omega_{t-1} \right] \right|. \tag{4.13}
\end{align}
Because all variables are bounded, for a constant $C > 0$, the first term (4.12) is bounded as
\[
(4.12) \le C\,\frac{u(T)}{T}.
\]
Here, from the definition of $u(T)$, $u(T)/\sqrt{T} \to 0$ as $T \to \infty$. Then, we consider bounding the second term (4.13). First, since $\tilde{f}_{t-1,u(T)} = \hat{f}_{u(T)}$ for $t > u(T)$, we bound it as
\begin{align}
&\left| \frac{1}{T}\sum_{t=u(T)+1}^{T} \mathbb{E}\left[ \pi^e(a \mid X)\left( \frac{\pi_t(a \mid X, \Omega_{t-1})}{\hat{g}_{t-1}(a \mid X)} - \frac{\frac{1}{t}\sum_{s=1}^{t}\pi_s(a \mid X, \Omega_{s-1})}{\hat{g}_{t-1}(a \mid X)} \right)\big(f^*(a, X) - \tilde{f}_{t-1,u(T)}(a, X)\big) \,\middle|\, \Omega_{t-1} \right] \right| \nonumber\\
&= \left| \frac{1}{T}\sum_{t=u(T)+1}^{T} \mathbb{E}\left[ \pi^e(a \mid X)\left( \frac{\pi_t(a \mid X, \Omega_{t-1})}{\hat{g}_{t-1}(a \mid X)} - \frac{\frac{1}{t}\sum_{s=1}^{t}\pi_s(a \mid X, \Omega_{s-1})}{\hat{g}_{t-1}(a \mid X)} \right)\big(f^*(a, X) - \hat{f}_{u(T)}(a, X)\big) \,\middle|\, \Omega_{t-1} \right] \right| \nonumber\\
&\le \left| \frac{1}{T}\sum_{t=u(T)+1}^{T} \mathbb{E}\left[ \pi^e(a \mid X)\left( \frac{\pi_t(a \mid X, \Omega_{t-1})}{\hat{g}_{t-1}(a \mid X)} - 1 \right)\big(f^*(a, X) - \hat{f}_{u(T)}(a, X)\big) \,\middle|\, \Omega_{t-1} \right] \right| \tag{4.14}\\
&\quad + \frac{1}{T}\sum_{t=u(T)+1}^{T} \left| \mathbb{E}\left[ \pi^e(a \mid X)\left( 1 - \frac{\frac{1}{t}\sum_{s=1}^{t}\pi_s(a \mid X, \Omega_{s-1})}{\hat{g}_{t-1}(a \mid X)} \right)\big(f^*(a, X) - \hat{f}_{u(T)}(a, X)\big) \,\middle|\, \Omega_{t-1} \right] \right|. \tag{4.15}
\end{align}
We separately bound (4.14) and (4.15). First, we bound (4.14).
By Hölder's inequality, $\|\mu\nu\|_1 \le \|\mu\|_2\|\nu\|_2$,
\begin{align*}
&\left| \frac{1}{T}\sum_{t=u(T)+1}^{T} \mathbb{E}\left[ \pi^e(a \mid X)\left( \frac{\pi_t(a \mid X, \Omega_{t-1})}{\hat{g}_{t-1}(a \mid X)} - 1 \right)\big(f^*(a, X) - \hat{f}_{u(T)}(a, X)\big) \,\middle|\, \Omega_{t-1} \right] \right| \\
&= \left| \frac{1}{T}\sum_{t=u(T)+1}^{T} \mathbb{E}_X\left[ \pi^e(a \mid X)\left( \frac{\pi_t(a \mid X, \Omega_{t-1})}{\hat{g}_{t-1}(a \mid X)} - 1 \right)\big(f^*(a, X) - \hat{f}_{u(T)}(a, X)\big) \right] \right| \\
&= \left| \mathbb{E}_X\left[ \pi^e(a \mid X)\big(f^*(a, X) - \hat{f}_{u(T)}(a, X)\big)\,\frac{1}{T}\sum_{t=u(T)+1}^{T}\left( \frac{\pi_t(a \mid X, \Omega_{t-1})}{\hat{g}_{t-1}(a \mid X)} - 1 \right) \right] \right| \\
&\le \frac{T - u(T)}{T} \left\| \pi^e(a \mid X)\big(f^*(a, X) - \hat{f}_{u(T)}(a, X)\big) \right\|_2 \left\| \frac{1}{T - u(T)}\sum_{t=u(T)+1}^{T}\left( \frac{\pi_t(a \mid X, \Omega_{t-1})}{\hat{g}_{t-1}(a \mid X)} - 1 \right) \right\|_2 \\
&\le C \left\| f^*(a, X) - \hat{f}_{u(T)}(a, X) \right\|_2 \left\| \frac{1}{T - u(T)}\sum_{t=u(T)+1}^{T}\left( \frac{\pi_t(a \mid X, \Omega_{t-1})}{\hat{g}_{t-1}(a \mid X)} - 1 \right) \right\|_2,
\end{align*}
where $\mathbb{E}_X$ denotes the expectation over $X$ and $C > 0$ is a constant. Then, $(4.14) = o_p(T^{-1/2})$ from Assumption 4.3.

Second, by using Assumption 4.3, we show that (4.15) is $o_p(T^{-1/2})$ as
\begin{align*}
&\frac{1}{T}\sum_{t=u(T)+1}^{T} \left| \mathbb{E}\left[ \pi^e(a \mid X)\left( 1 - \frac{\frac{1}{t}\sum_{s=1}^{t}\pi_s(a \mid X, \Omega_{s-1})}{\hat{g}_{t-1}(a \mid X)} \right)\big(f^*(a, X) - \hat{f}_{u(T)}(a, X)\big) \,\middle|\, \Omega_{t-1} \right] \right| \\
&\le \frac{C}{T}\sum_{t=u(T)+1}^{T} \left| \mathbb{E}\left[ \left( \hat{g}_{t-1}(a \mid X) - \frac{1}{t}\sum_{s=1}^{t}\pi_s(a \mid X, \Omega_{s-1}) \right)\big(f^*(a, X) - \hat{f}_{u(T)}(a, X)\big) \,\middle|\, \Omega_{t-1} \right] \right| \\
&\le \frac{C}{T}\sum_{t=u(T)+1}^{T} \left\| \hat{g}_{t-1}(a \mid X) - \frac{1}{t}\sum_{s=1}^{t}\pi_s(a \mid X, \Omega_{s-1}) \right\|_2 \left\| f^*(a, X) - \hat{f}_{u(T)}(a, X) \right\|_2.
\end{align*}
Then, the term is $o_p(T^{-1/2})$ from Assumption 4.3.

By using Lemmas 4.2 and 4.3, we can show the following theorem.

Theorem 4.2. (Asymptotic normality of $\tilde{R}^{\mathrm{MADR}}_T(\pi^e)$) For $u(T) > 0$ such that $u(T)/\sqrt{T} \to 0$, under Assumptions 2.1, 3.2–3.3, and 4.3, the MADR estimator is asymptotically normal:
\[
\sqrt{T}\left(\tilde{R}^{\mathrm{MADR}}_T(\pi^e) - R(\pi^e)\right) \xrightarrow{d} \mathcal{N}\big(0, \Psi(\bar{\alpha})\big).
\]
Unlike the ADR estimator, the MADR estimator does not require Assumption 4.2. This property is an advantage from the theoretical viewpoint. However, as shown in the experiments, the ADR estimator exhibits better empirical performance. The remaining problem is to check that standard estimators satisfy Assumption 4.3. Here, we show an example where nuisance estimators satisfy the requirement. For instance, we consider $u(T) = T^{1/3}$, which satisfies $u(T)/\sqrt{T} \to 0$ as $T \to \infty$. Under some conditions, sample averages and regression estimators have an $O_p(T^{-1/2})$ convergence rate.
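The exponent bookkeeping used in this example can be checked mechanically. The sketch below takes $u(T) = T^{1/3}$ (so that $o_p(u(T)^{-q}) = o_p(T^{-q/3})$) and the rates $p$, $q$, $r$ discussed in the text, and verifies with exact rational arithmetic that the combined rates equal $1/2$; it is a check of the arithmetic only, not part of the proof:

```python
from fractions import Fraction

p = Fraction(7, 18)  # rate for the averaged logging-probability term
q = Fraction(1, 3)   # rate for the outcome-regression term
r = Fraction(7, 18)  # rate for the logging-probability estimator term

# u(T) = T^(1/3) turns o_p(u(T)^(-q)) into o_p(T^(-q/3))
assert p + q / 3 == Fraction(1, 2)  # combined rate o_p(T^(-1/2))
assert r + q / 3 == Fraction(1, 2)  # combined rate o_p(T^(-1/2))
assert max(p, q, r) < Fraction(1, 2)
print("rates check out:", p, q, r)
```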
Therefore, we can assume that there exist $p, q, r < 1/2$ such that
\begin{align}
&\left\| \frac{1}{T - u(T)}\sum_{t=u(T)+1}^{T}\left( \frac{\pi_t(a \mid X, \Omega_{t-1})}{\hat{g}_{t-1}(a \mid X)} - 1 \right) \right\|_2 = o_p(T^{-p}), \tag{4.16}\\
&\left\| f^*(a, X_t) - \hat{f}_{t-1}(a, X_t) \right\|_2 = o_p(t^{-q}), \qquad \left\| \bar{\alpha}(a \mid X) - \hat{g}_{t-1}(a \mid X) \right\|_2 = o_p(t^{-r}). \nonumber
\end{align}
Here, note that $\left\| \frac{1}{T-u(T)}\sum_{t=u(T)+1}^{T}\big( \frac{\pi_t(a \mid X, \Omega_{t-1})}{\hat{g}_{t-1}(a \mid X)} - 1 \big) \right\|_2$ is bounded by
\[
\left\| \frac{1}{\bar{\alpha}(a \mid X)}\,\frac{1}{T-u(T)}\sum_{t=u(T)+1}^{T}\big( \pi_t(a \mid X, \Omega_{t-1}) - \bar{\alpha}(a \mid X) \big) \right\|_2 + \left\| \frac{1}{T-u(T)}\sum_{t=u(T)+1}^{T}\frac{\pi_t(a \mid X, \Omega_{t-1})\big\{ \bar{\alpha}(a \mid X) - \hat{g}_{t-1}(a \mid X) \big\}}{\hat{g}_{t-1}(a \mid X)\,\bar{\alpha}(a \mid X)} \right\|_2.
\]
Therefore, we can assume (4.16) by assuming $\big\| \frac{1}{T-u(T)}\sum_{t=u(T)+1}^{T}\big(\pi_t(a \mid X, \Omega_{t-1}) - \bar{\alpha}(a \mid X)\big) \big\|_2 = o_p(T^{-p})$ and $\big\| \bar{\alpha}(a \mid X) - \hat{g}_{t-1}(a \mid X) \big\|_2 = o_p(T^{-p})$. Then, we have $(4.5) = o_p(T^{-p})\,o_p\big(u(T)^{-q}\big) = o_p(T^{-p})\,o_p\big((T^{1/3})^{-q}\big) = o_p\big(T^{-p}T^{-q/3}\big) = o_p\big(T^{-(p+q/3)}\big)$. In this case, for instance, if $p = 7/18\ (< 1/2)$ and $q = 1/3$, we obtain $(4.5) = o_p(T^{-1/2})$. Similarly, we have $(4.6) = o_p\big(u(T)^{-q}\big)\,o_p(t^{-r}) = o_p(T^{-q/3})\,\frac{1}{T}\sum_{t=1}^{T} o_p(t^{-r})$. By a property of the Riemann zeta function, $\frac{1}{T}\sum_{t=1}^{T} o_p(t^{-r}) = o_p(T^{-r})$. Hence, if $r = 7/18$ and $q = 1/3$, $(4.6) = o_p(T^{-r-q/3}) = o_p(T^{-1/2})$.

5. MONTE CARLO EXPERIMENTS

To investigate the empirical properties of the ADR and Modified ADR (MADR) estimators, we simulate two situations based on whether the logging probability converges; the average logging probability converges in all experiments. We compare the ADR and MADR estimators with an IPW estimator with the true logging probability (IPW), an IPW estimator with an estimated logging probability (EIPW), an AIPW estimator without cross-fitting (AIPW), a DM estimator (DM), a DR estimator without cross-fitting (DR), and an A2IPW estimator with the true logging probability (A2IPW). We also consider estimators with the following form:
\[
\frac{1}{T}\sum_{t=1}^{T}\sum_{a=1}^{K} \left\{ \frac{\pi^e(a \mid X_t)\,\mathbb{1}[A_t = a]\big(Y_t - \hat{f}(a, X_t)\big)}{\frac{1}{t}\sum_{s=1}^{t} \pi_s(a \mid X_t, \Omega_{s-1})} + \pi^e(a \mid X_t)\hat{f}(a, X_t) \right\}.
\]
When using $\hat{f} = \hat{f}_{t-1}$, we call it the Average A2IPW (A3IPW) estimator; when using $\hat{f} = \tilde{f}_{t-1,u(T)}$, we call it the Modified A3IPW (MA3IPW) estimator. These estimators are special cases of the ADR and MADR estimators in which $\frac{1}{t}\sum_{s=1}^{t} \pi_s(a \mid X_t, \Omega_{s-1})$ is used for $\hat{g}_{t-1}$. Note that, among these estimators, only the ADR, MADR, EIPW, and DM estimators are applicable even when the true logging probability is unknown. In addition, to the best of our knowledge, the EIPW, AIPW, and DR estimators have not been shown to be asymptotically normal. When the logging probability does not converge to a time-invariant function, the A2IPW estimator also does not have asymptotic normality. When the average logging probability converges and the convergence rate conditions hold, the MADR and MA3IPW estimators have asymptotic normality. For the asymptotic normality of the ADR and MADR estimators, we need Assumption 4.1, which is not easy to verify.

Figure 2. This figure illustrates the error distributions of estimators in Section 5.1. We smoothed the error distributions using kernel density estimation.

Table 1. Experimental results for dependent samples in adaptive efficient ATE estimation. We show the RMSEs, SDs, and coverage ratios (CRs) of the confidence intervals.

Sample size T          250      500      750
DM       RMSE        0.164    0.132    0.076
         SD          0.030    0.025    0.008
         CR          0.000    0.000    0.000
EIPW     RMSE        1.332    0.514    1.334
         SD          0.147    0.120    0.074
         CR          0.000    0.039    0.000
ADR      RMSE        0.245    0.138    0.095
         SD          0.037    0.027    0.013
         CR          0.125    0.926    0.452
MADR     RMSE        1.215    0.148    0.146
         SD          0.151    0.031    0.030
         CR          0.000    0.908    0.332
A3IPW    RMSE        0.173    0.141    0.077
         SD          0.028    0.028    0.009
         CR          1.000    0.973    0.952
MA3IPW   RMSE        0.180    0.158    0.077
         SD          0.022    0.036    0.009
         CR          1.000    0.949    0.952
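The averaged-probability weighting displayed above, which replaces $\hat{g}_{t-1}$ by the running average $\frac{1}{t}\sum_{s=1}^{t}\pi_s(a \mid X_t, \Omega_{s-1})$, can be sketched numerically as follows. The data-generating process, policies, and regression model here are toy stand-ins, not the experimental designs of Section 5; with a well-specified $\hat{f}$, the estimate centers on the evaluation policy's value even though the logging probability itself never converges:

```python
import numpy as np

def a3ipw_estimate(A, Y, pi_log, pi_eval, f_hat):
    """A3IPW-style estimate: inverse-weight by the running average
    (1/t) * sum_{s<=t} pi_s(a) instead of pi_t(a) itself.
    pi_log[t, a] : logging probability at round t (may fluctuate)
    pi_eval[t, a]: evaluation-policy probability at round t
    f_hat[t, a]  : outcome-regression prediction at round t
    """
    T, K = pi_log.shape
    avg_prob = np.cumsum(pi_log, axis=0) / np.arange(1, T + 1)[:, None]
    est = 0.0
    for t in range(T):
        for a in range(K):
            ipw = (A[t] == a) * (Y[t] - f_hat[t, a]) / avg_prob[t, a]
            est += pi_eval[t, a] * (ipw + f_hat[t, a])
    return est / T

# Toy data: the logging probability alternates and never converges,
# but its running average converges to (0.5, 0.5).
rng = np.random.default_rng(1)
T, K = 5000, 2
pi_log = np.tile([[0.9, 0.1], [0.1, 0.9]], (T // 2, 1))
A = np.array([rng.choice(K, p=row) for row in pi_log])
Y = A.astype(float) + rng.normal(scale=0.1, size=T)  # E[Y | action a] = a
pi_eval = np.tile([[0.0, 1.0]], (T, 1))              # evaluate "always action 1"
f_hat = np.tile([[0.0, 1.0]], (T, 1))                # well-specified regression
print(a3ipw_estimate(A, Y, pi_log, pi_eval, f_hat))  # close to E[Y(1)] = 1
```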
In this section, we conduct adaptive experiments for efficient ATE estimation following van der Laan (2008) and Hahn et al. (2011). For brevity, we consider a situation in which there are two actions and no covariates; that is, there is no sample selection bias based on $p_t(a \mid x)$. We generate a pair of potential outcomes $(Y_t(1), Y_t(2))$, where $Y_t(a)$ is generated from the normal distribution $\mathcal{N}(a, a)$. We define an ATE by setting the evaluation weights as $\pi^e(1) = -\pi^e(2) = 1$. van der Laan (2008) and Hahn et al. (2011) showed that we can achieve the minimum asymptotic variance by choosing action 1 with probability
\[
\pi^*(1) = \frac{\sqrt{\mathrm{Var}(Y_t(1))}}{\sqrt{\mathrm{Var}(Y_t(1))} + \sqrt{\mathrm{Var}(Y_t(2))}}
\]
and the other action with probability $\pi^*(2) = 1 - \pi^*(1)$. However, because we do not know $\mathrm{Var}(Y_t(a))$, we need to consider obtaining an estimator with the same asymptotic distribution as the one obtained under the optimal logging probability $\pi^*$. In this paper, we select an action with probability $1$ so that the ratio of $\sum_{t=1}^{T}\mathbb{1}[A_t = 1]$ to $\sum_{t=1}^{T}\mathbb{1}[A_t = 2]$ is
\[
\frac{\sqrt{\mathrm{Var}(Y_t(1))}}{\sqrt{\mathrm{Var}(Y_t(1))} + \sqrt{\mathrm{Var}(Y_t(2))}} : \frac{\sqrt{\mathrm{Var}(Y_t(2))}}{\sqrt{\mathrm{Var}(Y_t(1))} + \sqrt{\mathrm{Var}(Y_t(2))}}.
\]
If the average logging probability converges to $\tilde{\alpha}(a) = \pi^*(a)$, the asymptotic distribution of the ADR estimator is the same as that of an estimator obtained when choosing an action $a$ with probability $\pi^*(a)$. To keep the desirable ratio, at each period $t$, we estimate the standard deviation $\sqrt{\mathrm{Var}(Y_t(a))}$ using $\Omega_{t-1}$ and construct an estimator $\hat{\pi}^*(a)$ of $\pi^*(a)$. If the empirical share of action 1 up to period $t$ does not exceed $\hat{\pi}^*(1)$, we choose $A_t = 1$; otherwise, we choose $A_t = 2$. We conduct this procedure for three cases with different sample sizes $T = 250, 500, 750$.

Next, we investigate the performances of the estimators for logging probabilities that converge to a time-invariant function.
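The allocation rule of the first experiment (Section 5.1), which deterministically tracks the Neyman ratio $\sqrt{\mathrm{Var}(Y(1))} : \sqrt{\mathrm{Var}(Y(2))}$, can be sketched as follows; the outcome distributions and the fallback variance estimate are illustrative assumptions, not necessarily those of the paper:

```python
import numpy as np

rng = np.random.default_rng(2)
T = 1000
true_sd = {1: 1.0, 2: 2.0}              # assumed true standard deviations
counts, obs = {1: 0, 2: 0}, {1: [], 2: []}

for t in range(1, T + 1):
    # estimate sqrt(Var(Y(a))) from past observations (fallback: 1.0)
    s1 = np.std(obs[1]) if len(obs[1]) > 1 else 1.0
    s2 = np.std(obs[2]) if len(obs[2]) > 1 else 1.0
    target = s1 / (s1 + s2)             # estimated optimal share of action 1
    # deterministic tracking: the logging probability is 0 or 1 each round,
    # but the average probability converges to the Neyman share
    a = 1 if counts[1] / t <= target else 2
    y = rng.normal(loc=a, scale=true_sd[a])
    counts[a] += 1
    obs[a].append(y)

print(counts[1] / T)  # empirical share of action 1, near 1.0/(1.0 + 2.0) = 1/3
```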
We generate an artificial tuple of a covariate and potential outcomes $(X_t, Y_t(1), Y_t(2), Y_t(3))$. The covariate $X_t$ is a 10-dimensional vector generated from the standard normal distribution. For $a \in \{1, 2, 3\}$, the potential outcome $Y_t(a)$ takes the value $1$ with probability $p(a \mid x) = \frac{\exp(g(a, x))}{\sum_{a'} \exp(g(a', x))}$, where $g(1, x) = \sum_{d=1}^{10} X_{t,d}$, $g(2, x) = \sum_{d=1}^{10} W_d X_{t,d}$, and $g(3, x) = \sum_{d=1}^{10} W_d |X_{t,d}|$, and each $W_d$ is drawn uniformly at random from $\{-1, 1\}$. We generate three datasets, $S^{(1)}_{T^{(1)}}$, $S^{(2)}_{T^{(2)}}$, and $S^{(3)}_{T^{(3)}}$, where $S^{(m)}_{T^{(m)}} = \{(X^{(m)}_t, Y^{(m)}_t(1), Y^{(m)}_t(2), Y^{(m)}_t(3))\}_{t=1}^{T^{(m)}}$. First, we train an evaluation policy $\pi^e$ by solving a prediction problem between $X^{(1)}_t$ and $(Y^{(1)}_t(1), Y^{(1)}_t(2), Y^{(1)}_t(3))$ using the dataset $S^{(1)}_{T^{(1)}}$. Then, we apply the evaluation policy $\pi^e$ to the independent dataset $S^{(2)}_{T^{(2)}}$ and artificially construct bandit data $\{(X'_t, A'_t, Y'_t)\}_{t=1}^{T^{(2)}}$, where $A'_t$ is the action chosen by the evaluation policy and $Y'_t = \sum_{a=1}^{3}\mathbb{1}[A'_t = a]\,Y'_t(a)$. Then, we set the true policy value $R(\pi^e)$ as $\frac{1}{T^{(2)}}\sum_{t=1}^{T^{(2)}} Y'_t$. Next, using the dataset $S^{(3)}_{T^{(3)}}$ and a MAB algorithm, we generate a bandit dataset $S = \{(X_t, A_t, Y_t)\}_{t=1}^{T^{(3)}}$. For the dataset $S$, we apply the IPW estimator with the true logging probability, the IPW estimator with an estimated logging probability, the AIPW estimator with cross-fitting, the DM estimator, the DR estimator with cross-fitting, the A2IPW estimator, and the ADR estimator. For estimating $\hat{f}$ and $\hat{g}$, we use kernelized Ridge least squares and kernelized Ridge logistic regression, respectively.
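The data-generating process described above can be sketched as follows; the random seed and sample size are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
d, K, T = 10, 3, 1000
W = rng.choice([-1.0, 1.0], size=d)  # uniform random signs, as in the text

def g(a, x):
    if a == 1:
        return x.sum()
    if a == 2:
        return (W * x).sum()
    return (W * np.abs(x)).sum()     # a == 3

X = rng.standard_normal((T, d))
logits = np.array([[g(a, x) for a in (1, 2, 3)] for x in X])
p = np.exp(logits - logits.max(axis=1, keepdims=True))
p /= p.sum(axis=1, keepdims=True)    # p(a | x) = softmax over g(a, x)
# binary potential outcomes: Y_t(a) = 1 with probability p(a | X_t)
Y_pot = (rng.uniform(size=(T, K)) < p).astype(int)
print(p.shape, Y_pot.mean())
```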
We use the Gaussian kernel, and the hyperparameters of the regularization and the kernel are chosen from a predefined candidate grid. We define the estimation error as $R(\pi^e) - \hat{R}(\pi^e)$. We conduct six experiments by changing the sample size and the MAB algorithm. For the sample size $T^{(3)}$, we use 250, 500, and 750. For each sample size, we apply the LinUCB and LinTS algorithms. For the sample sizes $T^{(1)}$ and $T^{(2)}$, we use 1,000 and 100,000, respectively. Note that, in the previous experiment, there is no covariate and no sample selection bias caused by $p_t(a \mid x)$; hence, in that case, it is easy to estimate $f^*$. Note also that the A3IPW and MA3IPW estimators require the true logging probability, unlike the ADR and MADR estimators.

6. DISCUSSION

We discuss the remaining problems. First, we consider a paradox of using an estimated logging probability. Hadad et al. (2019) pointed out the A2IPW estimator's unstable behavior on samples obtained from a MAB algorithm. On the other hand, Kato (2020) pointed out that the ADR estimator experimentally shows better performance than the A2IPW estimator even though their asymptotic properties are the same. This paper points out that estimating the logging probability is equivalent to estimating the average logging probability. In our experiments, directly using the average logging probability in an A2IPW estimator also improves the performance. Therefore, we conjecture that the ADR estimator's stabilization effect comes from the stability of the average logging probability.

Unlike an A2IPW estimator with the true logging probability, an ADR estimator does not suffer from the deficient support problem (Sachdeva et al., 2020). In many MAB algorithms, the logging probability $\pi_t$ often becomes $0$. However, even if $\pi_t$ becomes $0$, we can show asymptotic normality under Assumption 4.1.

Next, in addition to the double robustness of the consistency, we explain the importance of the DR-type estimators' form. We showed asymptotic normality only for the DR-type estimator. Readers may wonder whether we can show asymptotic normality for other types of estimators, such as an IPW-type estimator. However, it is not obvious how to apply this paper's inference strategy to such estimators. As Chernozhukov et al. (2018) discussed, the DR-type estimators relax the condition for asymptotic normality when
using sample-splitting. The asymptotic normalities of our proposed estimators are also based on this property. Therefore, the form of the DR-type estimators is also essential.

Finally, Luedtke and van der Laan (2016) pointed out that it is difficult to show asymptotic normality when using a non-unique optimal treatment strategy; that is, when $\pi_t$ fluctuates and does not converge. This problem is also partially solved by our proposed method if Assumption 4.1 holds. For instance, in BAI, Kaufmann et al. (2016) proposed an algorithm that deterministically chooses an arm with probability $1$ to keep some optimal selection ratio of arms. In this case, in addition to the deficient support problem, there is no unique treatment strategy. However, because the algorithm attempts to keep some desirable ratio, we can apply our method under Assumption 4.1.

Figure 3. This figure illustrates the error distributions of estimators from dependent samples generated from the LinUCB and LinTS policies with sample sizes 250 (left graphs), 500 (center graphs), and 750 (right graphs). The upper graphs show the error distributions of the LinUCB policy; the lower graphs show those of the LinTS policy. We smoothed the error distributions using kernel density estimation.

7.
CONCLUSION

We derived the asymptotically normal mean outcome estimator for dependent samples based on a new assumption that the average logging probability converges to a time-invariant function in probability. Under this setting, we can regard the average logging probability as a propensity score for inverse probability weighting. In contrast, existing studies need to assume that the logging probability itself converges and use the logging probability as a propensity score. We also experimentally confirmed that inverse weighting using the average logging probability is more stable than using the logging probability itself.

Table 2. Experimental results of mean outcome estimation from dependent samples generated from the LinUCB and LinTS policies. We show the RMSEs, SDs, and CRs.

Sample size T          250              500              750
MAB algorithm     LinUCB  LinTS    LinUCB  LinTS    LinUCB  LinTS
IPW      RMSE     0.080   0.095    0.064   0.073    0.043   0.050
         SD       0.006   0.009    0.004   0.005    0.002   0.003
         CR       0.920   0.910    0.880   0.860    0.980   0.950
DM       RMSE     0.068   0.069    0.051   0.048    0.038   0.036
         SD       0.005   0.005    0.003   0.002    0.002   0.001
         CR       0.160   0.200    0.120   0.190    0.220   0.170
AIPW     RMSE     0.056   0.067    0.048   0.054    0.036   0.033
         SD       0.004   0.006    0.003   0.004    0.002   0.002
         CR       0.940   0.910    0.930   0.870    0.910   0.980
A2IPW    RMSE     0.066   0.082    0.052   0.061    0.038   0.037
         SD       0.006   0.010    0.004   0.004    0.002   0.002
         CR       0.920   0.930    0.930   0.880    0.960   0.930
EIPW     RMSE     0.093   0.082    0.118   0.098    0.131   0.107
         SD       0.010   0.010    0.013   0.009    0.012   0.007
         CR       0.770   0.860    0.330   0.530    0.130   0.170
DR       RMSE     0.046   0.045    0.038   0.033    0.031   0.023
         SD       0.003   0.004    0.002   0.001    0.001   0.001
         CR       0.930   0.910    0.890   0.960    0.920   0.960
ADR      RMSE     0.052   0.052    0.039   0.034    0.033   0.025
         SD       0.004   0.004    0.002   0.001    0.002   0.001
         CR       0.980   0.950    0.930   0.970    0.940   0.990
MADR     RMSE     0.100   0.080    0.117   0.114    0.119   0.119
         SD       0.016   0.009    0.015   0.018    0.015   0.014
         CR       0.800   0.890    0.470   0.540    0.360   0.290
A3IPW    RMSE     0.055   0.053    0.042   0.035    0.032   0.023
         SD       0.004   0.004    0.002   0.002    0.001   0.001
         CR       0.980   0.940    0.930   0.970    0.940   0.970
MA3IPW   RMSE     0.059   0.061    0.048   0.054    0.039   0.044
         SD       0.005   0.005    0.004   0.005    0.002   0.002
         CR       0.970   0.940    0.930   0.870    0.930   0.850

REFERENCES

Bang, H. and J. M. Robins (2005). Doubly robust estimation in missing data and causal inference models.
Biometrics.
Bibaut, A., I. Malenica, N. Vlassis, and M. van der Laan (2019). More efficient off-policy evaluation through regularized targeted learning. In Proceedings of the 36th International Conference on Machine Learning.
Bickel, P. J., C. A. J. Klaassen, Y. Ritov, and J. A. Wellner (1998). Efficient and Adaptive Estimation for Semiparametric Models. Springer.
Cappé, O., R. Douc, A. Guillin, J.-M. Marin, and C. P. Robert (2008). Adaptive importance sampling in general mixture classes. Statistics and Computing 18(4).
Chakraborty, B. and E. E. M. Moodie (2013). Statistical Methods for Dynamic Treatment Regimes: Reinforcement Learning, Causal Inference, and Personalized Medicine. New York: Springer.
Chernozhukov, V., D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. Robins (2018). Double/debiased machine learning for treatment and structural parameters. Econometrics Journal 21, C1–C68.
Chow, S.-C. and M. Chang (2011). Adaptive Design Methods in Clinical Trials (2nd ed.). Chapman and Hall/CRC.
Dudík, M., J. Langford, and L. Li (2011). Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on Machine Learning.
Evans, M. (1988). Monte Carlo computation of marginal posterior quantiles.
Garivier, A. and E. Kaufmann (2016). Optimal best arm identification with fixed confidence. In Proceedings of the 29th Annual Conference on Learning Theory.
Hadad, V., D. A. Hirshberg, R. Zhan, S. Wager, and S. Athey (2019). Confidence intervals for policy evaluation in adaptive experiments.
Hahn, J. (1998). On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica 66, 315–331.
Hahn, J., K. Hirano, and D. Karlan (2011). Adaptive experimental design using the propensity score. Journal of Business and Economic Statistics 29(1), 96–108.
Hall, P., C. Heyde, Z. Birnbaum, and E. Lukacs (2014). Martingale Limit Theory and Its Application. Communication and Behavior. Elsevier Science.
Hamilton, J. (1994). Time Series Analysis. Princeton Univ. Press.
Hayashi, F. (2000). Econometrics. Princeton Univ. Press.
Hirano, K., G. W. Imbens, and G. Ridder (2003). Efficient estimation of average treatment effects using the estimated propensity score. Econometrica 71(4), 1161–1189.
Holland, P. W. (1986). Statistics and causal inference. Journal of the American Statistical Association.
Horvitz, D. G. and D. J. Thompson (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association 47(260), 663–685.
Imai, K. and M. Ratkovic (2014). Covariate balancing propensity score. Journal of the Royal Statistical Society: Series B 76(1), 243–263.
Imbens, G. W. and D. B. Rubin (2015). Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press.
Jiang, N. and L. Li (2016). Doubly robust off-policy value evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, pp. 652–661.
Kallus, N. and M. Uehara (2019). Intrinsically efficient, stable, and bounded off-policy evaluation for reinforcement learning. In Advances in Neural Information Processing Systems 32, pp. 3320–3329.
Kasy, M. and A. Sautmann (2021). Adaptive treatment assignment in experiments for policy choice. Econometrica 89(1), 113–132.
Kato, M. (2020). Theoretical and experimental comparison of off-policy evaluation from dependent samples.
Kato, M., T. Ishihara, J. Honda, and Y. Narita (2020). Adaptive experimental design for efficient treatment effect estimation: Randomized allocation via contextual bandit algorithm.
Kato, M. and Y. Kaneko (2020). Off-policy evaluation of bandit algorithm from dependent samples under batch update policy.
Kaufmann, E., O. Cappé, and A. Garivier (2016). On the complexity of best-arm identification in multi-armed bandit models. Journal of Machine Learning Research.
Klaassen, C. A. J. (1987). Consistent estimation of the influence function of locally asymptotically linear estimators. Annals of Statistics.
Kloek, T. and H. van Dijk (1978). Bayesian estimates of equation system parameters: An application of integration by Monte Carlo. Econometrica 46(1), 1–19.
Li, L., W. Chu, J. Langford, and R. E. Schapire (2010). A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pp. 661–670.
Li, L., W. Chu, J. Langford, and X. Wang (2011). Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pp. 297–306. Association for Computing Machinery.
Li, L., R. Munos, and C. Szepesvári (2015). Toward minimax off-policy value estimation. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, pp. 608–616.
Loève, M. (1977). Probability Theory. Graduate Texts in Mathematics. Springer.
Luedtke, A. R. and M. J. van der Laan (2016). Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy. Annals of Statistics 44(2), 713–742.
Mahmood, A. R., H. P. van Hasselt, and R. S. Sutton (2014). Weighted importance sampling for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems 27, pp. 3014–3022.
Narita, Y., S. Yasui, and K. Yata (2019). Efficient counterfactual learning from bandit feedback. In AAAI.
Naylor, J. and A. Smith (1988). Econometric illustrations of novel numerical integration strategies for Bayesian inference. Journal of Econometrics 38(1), 103–125.
Oh, M.-S. and J. O. Berger (1992). Adaptive importance sampling in Monte Carlo integration. Journal of Statistical Computation and Simulation 41(3-4), 143–168.
Portier, F. and B. Delyon (2018). Asymptotic optimality of adaptive importance sampling. In Advances in Neural Information Processing Systems 31.
Precup, D., R. Sutton, and S. Singh (2000). Eligibility traces for off-policy policy evaluation. In Proceedings of the 17th International Conference on Machine Learning, pp. 759–766.
Robins, J. M. (1999). Robust estimation in sequentially ignorable missing data and causal inference models. In Proceedings of the American Statistical Association Section on Bayesian Statistical Science.
Robins, J. M., A. Rotnitzky, and L. P. Zhao (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association 89, 846–866.
Rubin, D. B. (1986). Statistics and causal inference: Comment: Which ifs have causal answers. Journal of the American Statistical Association.
Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: Wiley.
Sachdeva, N., Y. Su, and T. Joachims (2020). Off-policy bandits with deficient support. In KDD.
Scharfstein, D. O., A. Rotnitzky, and J. M. Robins (1999). Adjusting for nonignorable drop-out using semiparametric nonresponse models. Journal of the American Statistical Association.
van der Laan, M. J. (2008). The construction and analysis of adaptive group sequential designs.
van der Laan, M. J. and S. D. Lendle (2014). Online targeted learning.
Villar, S. S. (2018). Bandit strategies evaluated in the context of clinical trials in rare life-threatening diseases. Probability in the Engineering and Informational Sciences 32(2), 229–245.
Wang, Y.-X., A. Agarwal, and M. Dudík (2017). Optimal and adaptive off-policy evaluation in contextual bandits. In Proceedings of the 34th International Conference on Machine Learning.
Zhang, B., A. A. Tsiatis, M. Davidian, M. Zhang, and E. Laber (2012). Estimating optimal treatment regimes from a classification perspective.
Stat 1 (1), 103–114.Zhang, K., L. Janson, and S. Murphy (2020). Inference for batched bandits. In
Advancesin Neural Information Processing Systems 33 .Zhao, Y., D. Zeng, A. Rush, and M. Kosorok (2012, 09). Estimating individualizedtreatment rules using outcome weighted learning.
Journal of the American StatisticalAssociation 107 , 1106–1118.Zheng, W. and M. J. van der Laan (2011). Cross-validated targeted minimum-loss-basedestimation. In
Targeted Learning: Causal Inference for Observational and ExperimentalData , Springer Series in Statistics. ean Outcome Estimation under a Convergence of Average Logging Probability A. PRELIMINARIES
Definition A.1 (Uniformly Integrable; Hamilton (1994), p. 191). A sequence $\{A_t\}$ is said to be uniformly integrable if for every $\epsilon > 0$ there exists a number $c > 0$ such that $\mathbb{E}\bigl[|A_t| \cdot \mathbb{1}[|A_t| \geq c]\bigr] < \epsilon$ for all $t$.

Proposition A.1 (Sufficient Conditions for Uniform Integrability; Hamilton (1994), Proposition 7.7, p. 191). (a) Suppose there exist $r > 1$ and $M < \infty$ such that $\mathbb{E}[|A_t|^r] < M$ for all $t$. Then $\{A_t\}$ is uniformly integrable. (b) Suppose there exist $r > 1$ and $M < \infty$ such that $\mathbb{E}[|b_t|^r] < M$ for all $t$. If $A_t = \sum_{j=-\infty}^{\infty} h_j b_{t-j}$ with $\sum_{j=-\infty}^{\infty} |h_j| < \infty$, then $\{A_t\}$ is uniformly integrable.

Proposition A.2 ($L^r$ Convergence Theorem; Loeve (1977)). Let $0 < r < \infty$, suppose that $\mathbb{E}\bigl[|a_n|^r\bigr] < \infty$ for all $n$, and that $a_n \xrightarrow{p} a$ as $n \to \infty$. The following are equivalent:
(i) $a_n \to a$ in $L^r$ as $n \to \infty$;
(ii) $\mathbb{E}\bigl[|a_n|^r\bigr] \to \mathbb{E}\bigl[|a|^r\bigr] < \infty$ as $n \to \infty$;
(iii) $\bigl\{|a_n|^r,\ n \geq 1\bigr\}$ is uniformly integrable.

Proposition A.3 (Weak Law of Large Numbers for Martingales; Hall et al. (2014)). Let $\bigl\{S_n = \sum_{i=1}^n X_i,\ \mathcal{H}_i\bigr\}$ be a martingale and $\{b_n\}$ a sequence of positive constants with $b_n \to \infty$ as $n \to \infty$. Then, writing $X_{ni} = X_i \mathbb{1}[|X_i| \leq b_n]$, $1 \leq i \leq n$, we have that $b_n^{-1} S_n \xrightarrow{p} 0$ as $n \to \infty$ if
(i) $\sum_{i=1}^n P(|X_i| > b_n) \to 0$;
(ii) $b_n^{-1} \sum_{i=1}^n \mathbb{E}[X_{ni} \mid \mathcal{H}_{i-1}] \xrightarrow{p} 0$; and
(iii) $b_n^{-2} \sum_{i=1}^n \bigl\{\mathbb{E}[X_{ni}^2] - \mathbb{E}\bigl[(\mathbb{E}[X_{ni} \mid \mathcal{H}_{i-1}])^2\bigr]\bigr\} \to 0$.

Remark A.1. In particular, the weak law of large numbers for martingales holds when the random variables are bounded by a constant.
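The martingale weak law of large numbers can be illustrated numerically. The following is a minimal sketch (not from the paper, and all names are illustrative): the increments are bounded and have conditional mean zero given the past, so the sample mean vanishes even though the sequence is history-dependent.

```python
import numpy as np

# Minimal numerical sketch of the martingale WLLN (Proposition A.3).
# The sign of each increment depends on the running sum S_{t-1}, so the
# sequence is dependent, but E[x_t | past] = 0 still holds (an MDS),
# and S_n / n should shrink toward 0 as n grows.
def mean_of_mds(n, rng):
    s = 0.0
    for _ in range(n):
        # A fair +/-1 coin, flipped through a past-dependent sign:
        # conditionally mean-zero and bounded by 1.
        x = rng.choice([-1.0, 1.0]) * (1.0 if s >= 0 else -1.0)
        s += x
    return s / n

rng = np.random.default_rng(0)
print(abs(mean_of_mds(100, rng)), abs(mean_of_mds(100_000, rng)))
```

The second printed value is much smaller than the first, in line with the $b_n^{-1} S_n \xrightarrow{p} 0$ conclusion for $b_n = n$.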
B. PROOF OF LEMMA 4.1
We prove Lemma 4.1 by using a technique similar to those of van der Laan and Lendle (2014) and Theorem 1 of Kato (2020).
Proof.
Let us define
\[
\phi(X_t, A_t, Y_t; g, f) = \sum_{a=1}^K \frac{\pi^e(a \mid X_t)\,\mathbb{1}[A_t = a]\,\bigl(Y_t - f(a, X_t)\bigr)}{g(a \mid X_t)}, \qquad
\phi(X_t; f) = \sum_{a=1}^K \pi^e(a \mid X_t)\, f(a, X_t).
\]
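For concreteness, the two score components above can be sketched in code. This is an illustrative implementation, not the paper's own; the array names `pi_e`, `g`, and `f` stand for $\pi^e(\cdot \mid X_t)$, $g(\cdot \mid X_t)$, and $f(\cdot, X_t)$ evaluated at the observed context.

```python
import numpy as np

# Sketch of the two score components for K discrete actions.
# pi_e[a] = pi^e(a|X_t), g[a] = g(a|X_t), f[a] = f(a, X_t); names illustrative.
def phi_ipw(pi_e, g, f, action, y):
    """sum_a pi_e(a|X) 1[A_t = a] (Y_t - f(a, X)) / g(a|X):
    only the observed action contributes because of the indicator."""
    return pi_e[action] * (y - f[action]) / g[action]

def phi_dm(pi_e, f):
    """sum_a pi_e(a|X) f(a, X): the direct-method term."""
    return float(np.dot(pi_e, f))

# One observation's doubly robust score combines the two terms.
def dr_score(pi_e, g, f, action, y):
    return phi_ipw(pi_e, g, f, action, y) + phi_dm(pi_e, f)

pi_e = np.array([0.5, 0.5])
g = np.array([0.3, 0.7])
f = np.array([1.0, 2.0])
print(dr_score(pi_e, g, f, action=0, y=1.0))  # 0.5*(1.0-1.0)/0.3 + 1.5 = 1.5
```

When the regression `f` is correct, the IPW residual term has conditional mean zero, which is the property exploited repeatedly in the proof below.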
We decompose $\sqrt{T}\bigl(\widehat{R}^{\mathrm{ADR}}_T(\pi^e) - \ddot{R}_T(\pi^e)\bigr)$ as follows. Write
\begin{align*}
D_t &= \phi(X_t, A_t, Y_t; \hat{g}_{t-1}, \hat{f}_{t-1}) - \phi(X_t, A_t, Y_t; \bar{\alpha}, f^*) - \mathbb{E}\bigl[\phi(X_t, A_t, Y_t; \hat{g}_{t-1}, \hat{f}_{t-1}) - \phi(X_t, A_t, Y_t; \bar{\alpha}, f^*) \mid \Omega_{t-1}\bigr] \\
&\quad + \phi(X_t; \hat{f}_{t-1}) - \phi(X_t; f^*) - \mathbb{E}\bigl[\phi(X_t; \hat{f}_{t-1}) - \phi(X_t; f^*) \mid \Omega_{t-1}\bigr].
\end{align*}
Then
\begin{align*}
\widehat{R}^{\mathrm{ADR}}_T(\pi^e) - \ddot{R}_T(\pi^e)
&= \frac{1}{T}\sum_{t=1}^T D_t
+ \frac{1}{T}\sum_{t=1}^T \mathbb{E}\bigl[\phi(X_t, A_t, Y_t; \hat{g}_{t-1}, \hat{f}_{t-1}) \mid \Omega_{t-1}\bigr]
+ \frac{1}{T}\sum_{t=1}^T \mathbb{E}\bigl[\phi(X_t; \hat{f}_{t-1}) \mid \Omega_{t-1}\bigr] \\
&\quad - \frac{1}{T}\sum_{t=1}^T \mathbb{E}\bigl[\phi(X_t, A_t, Y_t; \bar{\alpha}, f^*) \mid \Omega_{t-1}\bigr]
- \frac{1}{T}\sum_{t=1}^T \mathbb{E}\bigl[\phi(X_t; f^*) \mid \Omega_{t-1}\bigr].
\end{align*}
In the following parts, we separately show that
\[
\frac{\sqrt{T}}{T}\sum_{t=1}^T D_t = o_p(1), \tag{B.17}
\]
and
\[
\frac{1}{T}\sum_{t=1}^T \mathbb{E}\bigl[\phi(X_t, A_t, Y_t; \hat{g}_{t-1}, \hat{f}_{t-1}) \mid \Omega_{t-1}\bigr]
+ \frac{1}{T}\sum_{t=1}^T \mathbb{E}\bigl[\phi(X_t; \hat{f}_{t-1}) \mid \Omega_{t-1}\bigr]
- \frac{1}{T}\sum_{t=1}^T \mathbb{E}\bigl[\phi(X_t, A_t, Y_t; \bar{\alpha}, f^*) \mid \Omega_{t-1}\bigr]
- \frac{1}{T}\sum_{t=1}^T \mathbb{E}\bigl[\phi(X_t; f^*) \mid \Omega_{t-1}\bigr] = o_p(1/\sqrt{T}). \tag{B.18}
\]

Step 1: Proof of (B.17). For any $\varepsilon > 0$, to show that
\[
P\Biggl(\biggl|\frac{\sqrt{T}}{T}\sum_{t=1}^T D_t\biggr| > \varepsilon\Biggr) \to 0,
\]
we show that the mean of $\frac{\sqrt{T}}{T}\sum_{t=1}^T D_t$ is $0$ and that its variance converges to $0$; Chebyshev's inequality then yields the statement. The mean is
\[
\frac{\sqrt{T}}{T}\sum_{t=1}^T \mathbb{E}[D_t] = \frac{\sqrt{T}}{T}\sum_{t=1}^T \mathbb{E}\bigl[\mathbb{E}[D_t \mid \Omega_{t-1}]\bigr] = 0,
\]
since each $D_t$ is centered by its conditional expectation given $\Omega_{t-1}$. Because the mean is $0$, the variance is
\[
\mathrm{Var}\Biggl(\frac{\sqrt{T}}{T}\sum_{t=1}^T D_t\Biggr)
= \frac{1}{T}\,\mathbb{E}\Biggl[\Biggl(\sum_{t=1}^T D_t\Biggr)^2\Biggr]
= \frac{1}{T}\sum_{t=1}^T \mathbb{E}\bigl[D_t^2\bigr] + \frac{2}{T}\sum_{t=1}^{T-1}\sum_{s=t+1}^{T} \mathbb{E}[D_t D_s].
\]
For $s > t$, the cross terms vanish:
\[
\mathbb{E}[D_t D_s] = \mathbb{E}\bigl[D_t\, \mathbb{E}[D_s \mid \Omega_{s-1}]\bigr] = 0,
\]
because $D_t$ is $\Omega_{s-1}$-measurable and $\mathbb{E}[D_s \mid \Omega_{s-1}] = 0$. Therefore, the variance is calculated as
\begin{align*}
\mathrm{Var}\Biggl(\frac{\sqrt{T}}{T}\sum_{t=1}^T D_t\Biggr)
&= \frac{1}{T}\sum_{t=1}^T \mathbb{E}\bigl[\mathbb{E}[D_t^2 \mid \Omega_{t-1}]\bigr] \\
&= \frac{1}{T}\sum_{t=1}^T \mathbb{E}\Bigl[\mathrm{Var}\bigl(\phi(X_t, A_t, Y_t; \hat{g}_{t-1}, \hat{f}_{t-1}) - \phi(X_t, A_t, Y_t; \bar{\alpha}, f^*) \mid \Omega_{t-1}\bigr)\Bigr] \\
&\quad + \frac{1}{T}\sum_{t=1}^T \mathbb{E}\Bigl[\mathrm{Var}\bigl(\phi(X_t; \hat{f}_{t-1}) - \phi(X_t; f^*) \mid \Omega_{t-1}\bigr)\Bigr] \\
&\quad + \frac{2}{T}\sum_{t=1}^T \mathbb{E}\Bigl[\mathrm{Cov}\bigl(\phi(X_t, A_t, Y_t; \hat{g}_{t-1}, \hat{f}_{t-1}) - \phi(X_t, A_t, Y_t; \bar{\alpha}, f^*),\ \phi(X_t; \hat{f}_{t-1}) - \phi(X_t; f^*) \mid \Omega_{t-1}\bigr)\Bigr].
\end{align*}
Then, we want to show that
\begin{align}
&\frac{1}{T}\sum_{t=1}^T \mathbb{E}\Bigl[\mathrm{Var}\bigl(\phi(X_t, A_t, Y_t; \hat{g}_{t-1}, \hat{f}_{t-1}) - \phi(X_t, A_t, Y_t; \bar{\alpha}, f^*) \mid \Omega_{t-1}\bigr)\Bigr] \to 0, \tag{B.19} \\
&\frac{1}{T}\sum_{t=1}^T \mathbb{E}\Bigl[\mathrm{Var}\bigl(\phi(X_t; \hat{f}_{t-1}) - \phi(X_t; f^*) \mid \Omega_{t-1}\bigr)\Bigr] \to 0, \tag{B.20} \\
&\frac{2}{T}\sum_{t=1}^T \mathbb{E}\Bigl[\mathrm{Cov}\bigl(\phi(X_t, A_t, Y_t; \hat{g}_{t-1}, \hat{f}_{t-1}) - \phi(X_t, A_t, Y_t; \bar{\alpha}, f^*),\ \phi(X_t; \hat{f}_{t-1}) - \phi(X_t; f^*) \mid \Omega_{t-1}\bigr)\Bigr] \to 0. \tag{B.21}
\end{align}
To this end, we first show that the conditional quantities are $o_p(1)$:
\begin{align}
&\mathrm{Var}\bigl(\phi(X_t, A_t, Y_t; \hat{g}_{t-1}, \hat{f}_{t-1}) - \phi(X_t, A_t, Y_t; \bar{\alpha}, f^*) \mid \Omega_{t-1}\bigr) = o_p(1), \tag{B.23} \\
&\mathrm{Var}\bigl(\phi(X_t; \hat{f}_{t-1}) - \phi(X_t; f^*) \mid \Omega_{t-1}\bigr) = o_p(1), \tag{B.24} \\
&\mathrm{Cov}\bigl(\phi(X_t, A_t, Y_t; \hat{g}_{t-1}, \hat{f}_{t-1}) - \phi(X_t, A_t, Y_t; \bar{\alpha}, f^*),\ \phi(X_t; \hat{f}_{t-1}) - \phi(X_t; f^*) \mid \Omega_{t-1}\bigr) = o_p(1). \tag{B.25}
\end{align}
The first equation (B.23) is shown as
\begin{align*}
&\mathrm{Var}\bigl(\phi(X_t, A_t, Y_t; \hat{g}_{t-1}, \hat{f}_{t-1}) - \phi(X_t, A_t, Y_t; \bar{\alpha}, f^*) \mid \Omega_{t-1}\bigr) \\
&\leq \mathbb{E}\Biggl[\Biggl\{\sum_{a=1}^K \frac{\pi^e(a \mid X_t)\mathbb{1}[A_t = a]\bigl(Y_t - \hat{f}_{t-1}(a, X_t)\bigr)}{\hat{g}_{t-1}(a \mid X_t)} - \sum_{a=1}^K \frac{\pi^e(a \mid X_t)\mathbb{1}[A_t = a]\bigl(Y_t - f^*(a, X_t)\bigr)}{\bar{\alpha}(a \mid X_t)}\Biggr\}^2 \,\Bigl|\, \Omega_{t-1}\Biggr] \\
&= \mathbb{E}\Biggl[\Biggl\{\sum_{a=1}^K \frac{\pi^e(a \mid X_t)\mathbb{1}[A_t = a]\bigl(Y_t - \hat{f}_{t-1}(a, X_t)\bigr)}{\hat{g}_{t-1}(a \mid X_t)} - \sum_{a=1}^K \frac{\pi^e(a \mid X_t)\mathbb{1}[A_t = a]\bigl(Y_t - f^*(a, X_t)\bigr)}{\hat{g}_{t-1}(a \mid X_t)} \\
&\qquad + \sum_{a=1}^K \frac{\pi^e(a \mid X_t)\mathbb{1}[A_t = a]\bigl(Y_t - f^*(a, X_t)\bigr)}{\hat{g}_{t-1}(a \mid X_t)} - \sum_{a=1}^K \frac{\pi^e(a \mid X_t)\mathbb{1}[A_t = a]\bigl(Y_t - f^*(a, X_t)\bigr)}{\bar{\alpha}(a \mid X_t)}\Biggr\}^2 \,\Bigl|\, \Omega_{t-1}\Biggr] \\
&\leq 2\,\mathbb{E}\Biggl[\Biggl\{\sum_{a=1}^K \frac{\pi^e(a \mid X_t)\mathbb{1}[A_t = a]\bigl(Y_t - \hat{f}_{t-1}(a, X_t)\bigr)}{\hat{g}_{t-1}(a \mid X_t)} - \sum_{a=1}^K \frac{\pi^e(a \mid X_t)\mathbb{1}[A_t = a]\bigl(Y_t - f^*(a, X_t)\bigr)}{\hat{g}_{t-1}(a \mid X_t)}\Biggr\}^2 \,\Bigl|\, \Omega_{t-1}\Biggr] \\
&\quad + 2\,\mathbb{E}\Biggl[\Biggl\{\sum_{a=1}^K \frac{\pi^e(a \mid X_t)\mathbb{1}[A_t = a]\bigl(Y_t - f^*(a, X_t)\bigr)}{\hat{g}_{t-1}(a \mid X_t)} - \sum_{a=1}^K \frac{\pi^e(a \mid X_t)\mathbb{1}[A_t = a]\bigl(Y_t - f^*(a, X_t)\bigr)}{\bar{\alpha}(a \mid X_t)}\Biggr\}^2 \,\Bigl|\, \Omega_{t-1}\Biggr] \\
&\leq 2C\bigl\|f^* - \hat{f}_{t-1}\bigr\|^2 + 2C\bigl\|\hat{g}_{t-1} - \bar{\alpha}\bigr\|^2 = o_p(1),
\end{align*}
where $C > 0$ is a constant that exists by the boundedness conditions $|\hat{f}_{t-1}| < C_f$ and $0 < \pi^e/\bar{\alpha} < C_{\bar{\alpha}}$ and Assumption 3.3. Here, we have used a parallelogram-type inequality, $(x + y)^2 \leq 2x^2 + 2y^2$, from the second to the third step. Then, from the $L^r$ convergence theorem (Proposition A.2) and the boundedness of the random variables, we can show that, as $t \to \infty$,
\[
\mathbb{E}\Bigl[\mathrm{Var}\bigl(\phi(X_t, A_t, Y_t; \hat{g}_{t-1}, \hat{f}_{t-1}) - \phi(X_t, A_t, Y_t; \bar{\alpha}, f^*) \mid \Omega_{t-1}\bigr)\Bigr]
\leq \mathbb{E}\Bigl[\Bigl|\mathrm{Var}\bigl(\phi(X_t, A_t, Y_t; \hat{g}_{t-1}, \hat{f}_{t-1}) - \phi(X_t, A_t, Y_t; \bar{\alpha}, f^*) \mid \Omega_{t-1}\bigr)\Bigr|\Bigr] \to 0.
\]
Therefore, for any $\epsilon > 0$, there exists a constant $C > 0$ such that
\[
\frac{1}{T}\sum_{t=1}^T \mathbb{E}\Bigl[\mathrm{Var}\bigl(\phi(X_t, A_t, Y_t; \hat{g}_{t-1}, \hat{f}_{t-1}) - \phi(X_t, A_t, Y_t; \bar{\alpha}, f^*) \mid \Omega_{t-1}\bigr)\Bigr] \leq C/T + \epsilon,
\]
which yields (B.19). The second equation (B.24) is derived by Jensen's inequality, and we then show (B.20), as we did (B.19), by using the $L^r$ convergence theorem.

Next, we bound the LHS of (B.25) as
\begin{align*}
&\mathrm{Cov}\bigl(\phi(X_t, A_t, Y_t; \hat{g}_{t-1}, \hat{f}_{t-1}) - \phi(X_t, A_t, Y_t; \bar{\alpha}, f^*),\ \phi(X_t; \hat{f}_{t-1}) - \phi(X_t; f^*) \mid \Omega_{t-1}\bigr) \\
&\leq \Biggl|\mathbb{E}\Bigl[\Bigl(\phi(X_t, A_t, Y_t; \hat{g}_{t-1}, \hat{f}_{t-1}) - \phi(X_t, A_t, Y_t; \bar{\alpha}, f^*) - \mathbb{E}\bigl[\phi(X_t, A_t, Y_t; \hat{g}_{t-1}, \hat{f}_{t-1}) - \phi(X_t, A_t, Y_t; \bar{\alpha}, f^*) \mid \Omega_{t-1}\bigr]\Bigr) \\
&\qquad \times \Bigl(\phi(X_t; \hat{f}_{t-1}) - \phi(X_t; f^*) - \mathbb{E}\bigl[\phi(X_t; \hat{f}_{t-1}) - \phi(X_t; f^*) \mid \Omega_{t-1}\bigr]\Bigr) \,\Bigl|\, \Omega_{t-1}\Bigr]\Biggr|.
\end{align*}
Then, by using Jensen's inequality and the boundedness of $\hat{f}_{t-1}$ and $\hat{g}_{t-1}$, there exists a constant $C > 0$ such that the above is bounded by
\[
C\,\mathbb{E}\Bigl[\Bigl|\phi(X_t, A_t, Y_t; \hat{g}_{t-1}, \hat{f}_{t-1}) - \phi(X_t, A_t, Y_t; \bar{\alpha}, f^*) - \mathbb{E}\bigl[\phi(X_t, A_t, Y_t; \hat{g}_{t-1}, \hat{f}_{t-1}) - \phi(X_t, A_t, Y_t; \bar{\alpha}, f^*) \mid \Omega_{t-1}\bigr]\Bigr| \,\Bigl|\, \Omega_{t-1}\Bigr] = o_p(1).
\]
The $o_p(1)$ follows because, for all $X_t \in \mathcal{X}$,
\begin{align*}
&\phi(X_t, A_t, Y_t; \hat{g}_{t-1}, \hat{f}_{t-1}) - \phi(X_t, A_t, Y_t; \bar{\alpha}, f^*) \\
&= \sum_{a=1}^K \Biggl(\frac{\pi^e(a \mid X_t)\mathbb{1}[A_t = a]\bigl(Y_t - \hat{f}_{t-1}(a, X_t)\bigr)}{\hat{g}_{t-1}(a \mid X_t)} - \frac{\pi^e(a \mid X_t)\mathbb{1}[A_t = a]\bigl(Y_t - f^*(a, X_t)\bigr)}{\bar{\alpha}(a \mid X_t)}\Biggr) \\
&\leq \sum_{a=1}^K \Biggl|\frac{\pi^e(a \mid X_t)\mathbb{1}[A_t = a]\bigl(Y_t - \hat{f}_{t-1}(a, X_t)\bigr)}{\hat{g}_{t-1}(a \mid X_t)} - \frac{\pi^e(a \mid X_t)\mathbb{1}[A_t = a]\bigl(Y_t - f^*(a, X_t)\bigr)}{\bar{\alpha}(a \mid X_t)}\Biggr| \\
&\leq C_1 \sum_{a=1}^K \Bigl|\bar{\alpha}(a \mid X_t)\bigl(Y_t - \hat{f}_{t-1}(a, X_t)\bigr) - \hat{g}_{t-1}(a \mid X_t)\bigl(Y_t - f^*(a, X_t)\bigr)\Bigr| \\
&\leq C_2 \sum_{a=1}^K \bigl|\bar{\alpha}(a \mid X_t) - \hat{g}_{t-1}(a \mid X_t)\bigr| + C_3 \sum_{a=1}^K \bigl|\hat{f}_{t-1}(a, X_t) - f^*(a, X_t)\bigr| = o_p(1),
\end{align*}
where $C_1, C_2, C_3 > 0$ are constants; the last step inserts $\pm\,\hat{g}_{t-1}(a \mid X_t)\hat{f}_{t-1}(a, X_t)$ and uses the boundedness of $Y_t$, $\hat{f}_{t-1}$, and $\hat{g}_{t-1}$. Hence (B.21) follows in the same way as (B.19). Finally, by Chebyshev's inequality,
\[
P\Biggl(\biggl|\frac{\sqrt{T}}{T}\sum_{t=1}^T D_t\biggr| > \varepsilon\Biggr)
\leq \mathrm{Var}\Biggl(\frac{\sqrt{T}}{T}\sum_{t=1}^T D_t\Biggr)\Big/\varepsilon^2 \to 0.
\]

Step 2: Proof of (B.18)
We can calculate the LHS of (B.18) as
\begin{align*}
&\frac{1}{T}\sum_{t=1}^T \mathbb{E}\bigl[\phi(X_t, A_t, Y_t; \hat{g}_{t-1}, \hat{f}_{t-1}) \mid \Omega_{t-1}\bigr] + \frac{1}{T}\sum_{t=1}^T \mathbb{E}\bigl[\phi(X_t; \hat{f}_{t-1}) \mid \Omega_{t-1}\bigr]
- \frac{1}{T}\sum_{t=1}^T \mathbb{E}\bigl[\phi(X_t, A_t, Y_t; \bar{\alpha}, f^*) \mid \Omega_{t-1}\bigr] - \frac{1}{T}\sum_{t=1}^T \mathbb{E}\bigl[\phi(X_t; f^*) \mid \Omega_{t-1}\bigr] \\
&= \frac{1}{T}\sum_{t=1}^T \mathbb{E}\Biggl[\sum_{a=1}^K \frac{\pi^e(a \mid X_t)\mathbb{1}[A_t = a]\bigl(Y_t(a) - \hat{f}_{t-1}(a, X_t)\bigr)}{\hat{g}_{t-1}(a \mid X_t)} \,\Bigl|\, \Omega_{t-1}\Biggr]
+ \frac{1}{T}\sum_{t=1}^T \mathbb{E}\Biggl[\sum_{a=1}^K \pi^e(a \mid X_t)\hat{f}_{t-1}(a, X_t) \,\Bigl|\, \Omega_{t-1}\Biggr] \\
&\quad - \frac{1}{T}\sum_{t=1}^T \mathbb{E}\Biggl[\sum_{a=1}^K \frac{\pi^e(a \mid X_t)\mathbb{1}[A_t = a]\bigl(Y_t(a) - f^*(a, X_t)\bigr)}{\bar{\alpha}(a \mid X_t)} \,\Bigl|\, \Omega_{t-1}\Biggr] \tag{B.26} \\
&\quad - \frac{1}{T}\sum_{t=1}^T \mathbb{E}\Biggl[\sum_{a=1}^K \pi^e(a \mid X_t) f^*(a, X_t) \,\Bigl|\, \Omega_{t-1}\Biggr].
\end{align*}
Here, (B.26) is $0$ because
\begin{align*}
\frac{1}{T}\sum_{t=1}^T \mathbb{E}\Biggl[\sum_{a=1}^K \frac{\pi^e(a \mid X_t)\mathbb{1}[A_t = a]\bigl(Y_t(a) - f^*(a, X_t)\bigr)}{\bar{\alpha}(a \mid X_t)} \,\Bigl|\, \Omega_{t-1}\Biggr]
&= \frac{1}{T}\sum_{t=1}^T \mathbb{E}\Biggl[\mathbb{E}\Biggl[\sum_{a=1}^K \frac{\pi^e(a \mid X_t)\mathbb{1}[A_t = a]\bigl(Y_t(a) - f^*(a, X_t)\bigr)}{\bar{\alpha}(a \mid X_t)} \,\Bigl|\, X_t, \Omega_{t-1}\Biggr] \,\Bigl|\, \Omega_{t-1}\Biggr] \\
&= \frac{1}{T}\sum_{t=1}^T \mathbb{E}\Biggl[\sum_{a=1}^K \frac{\pi^e(a \mid X_t)\pi_t(a \mid X_t, \Omega_{t-1})}{\bar{\alpha}(a \mid X_t)}\,\mathbb{E}\bigl[f^*(a, X_t) - f^*(a, X_t) \mid X_t, \Omega_{t-1}\bigr] \,\Bigl|\, \Omega_{t-1}\Biggr] = 0.
\end{align*}
We used the law of iterated expectations, $\mathbb{E}\bigl[\mathbb{1}[A_t = a] \mid X_t, \Omega_{t-1}\bigr] = \pi_t(a \mid X_t, \Omega_{t-1})$, and $\mathbb{E}[Y_t(a) \mid X_t, \Omega_{t-1}] = f^*(a, X_t)$. Therefore, we have
\begin{align*}
&\frac{1}{T}\sum_{t=1}^T \mathbb{E}\bigl[\phi(X_t, A_t, Y_t; \hat{g}_{t-1}, \hat{f}_{t-1}) \mid \Omega_{t-1}\bigr] + \frac{1}{T}\sum_{t=1}^T \mathbb{E}\bigl[\phi(X_t; \hat{f}_{t-1}) \mid \Omega_{t-1}\bigr]
- \frac{1}{T}\sum_{t=1}^T \mathbb{E}\bigl[\phi(X_t, A_t, Y_t; \bar{\alpha}, f^*) \mid \Omega_{t-1}\bigr] - \frac{1}{T}\sum_{t=1}^T \mathbb{E}\bigl[\phi(X_t; f^*) \mid \Omega_{t-1}\bigr] \\
&= \frac{1}{T}\sum_{t=1}^T \sum_{a=1}^K \mathbb{E}\Biggl[\mathbb{E}\Biggl[\frac{\pi^e(a \mid X_t)\mathbb{1}[A_t = a]\bigl(Y_t(a) - \hat{f}_{t-1}(a, X_t)\bigr)}{\hat{g}_{t-1}(a \mid X_t)} - \pi^e(a \mid X_t)\bigl(f^*(a, X_t) - \hat{f}_{t-1}(a, X_t)\bigr) \,\Bigl|\, X_t, \Omega_{t-1}\Biggr] \,\Bigl|\, \Omega_{t-1}\Biggr] \\
&= \sum_{a=1}^K \frac{1}{T}\sum_{t=1}^T \mathbb{E}\Biggl[\frac{\pi^e(a \mid X_t)\bigl(\pi_t(a \mid X_t, \Omega_{t-1}) - \hat{g}_{t-1}(a \mid X_t)\bigr)\bigl(f^*(a, X_t) - \hat{f}_{t-1}(a, X_t)\bigr)}{\hat{g}_{t-1}(a \mid X_t)} \,\Bigl|\, \Omega_{t-1}\Biggr].
\end{align*}
From here, we drop the subscript $t$ of $X_t$ because the distribution of $X_t$ does not depend on the period in the expectation conditioned on $\Omega_{t-1}$. Then, the sum of the expectations is bounded as
\begin{align*}
&\frac{1}{T}\sum_{t=1}^T \mathbb{E}\Biggl[\frac{\pi^e(a \mid X)\bigl(\pi_t(a \mid X, \Omega_{t-1}) - \hat{g}_{t-1}(a \mid X)\bigr)\bigl(f^*(a, X) - \hat{f}_{t-1}(a, X)\bigr)}{\hat{g}_{t-1}(a \mid X)} \,\Bigl|\, \Omega_{t-1}\Biggr] \\
&\leq \Biggl|\frac{1}{T}\sum_{t=1}^T \mathbb{E}\Biggl[\pi^e(a \mid X)\Biggl(\frac{\pi_t(a \mid X, \Omega_{t-1})}{\hat{g}_{t-1}(a \mid X)} - 1\Biggr)\bigl(f^*(a, X) - \hat{f}_{t-1}(a, X)\bigr) \,\Bigl|\, \Omega_{t-1}\Biggr]\Biggr|.
\end{align*}
This is decomposed and bounded as
\begin{align*}
&\leq \Biggl|\frac{1}{T}\sum_{t=1}^T \mathbb{E}\Biggl[\pi^e(a \mid X)\Biggl(\frac{\pi_t(a \mid X, \Omega_{t-1})}{\hat{g}_{t-1}(a \mid X)} - \frac{\pi_t(a \mid X, \Omega_{t-1})}{\frac{1}{t}\sum_{s=1}^t \pi_s(a \mid X, \Omega_{s-1})}\Biggr)\bigl(f^*(a, X) - \hat{f}_{t-1}(a, X)\bigr) \,\Bigl|\, \Omega_{t-1}\Biggr]\Biggr| \\
&\quad + \Biggl|\frac{1}{T}\sum_{t=1}^T \mathbb{E}\Biggl[\pi^e(a \mid X)\Biggl(\frac{\pi_t(a \mid X, \Omega_{t-1})}{\frac{1}{t}\sum_{s=1}^t \pi_s(a \mid X, \Omega_{s-1})} - 1\Biggr)\bigl(f^*(a, X) - \hat{f}_{t-1}(a, X)\bigr) \,\Bigl|\, \Omega_{t-1}\Biggr]\Biggr| \\
&= \Biggl|\frac{1}{T}\sum_{t=1}^T \mathbb{E}\Biggl[\frac{\pi^e(a \mid X)\,\pi_t(a \mid X, \Omega_{t-1})}{\hat{g}_{t-1}(a \mid X) \cdot \frac{1}{t}\sum_{s=1}^t \pi_s(a \mid X, \Omega_{s-1})}\Biggl(\frac{1}{t}\sum_{s=1}^t \pi_s(a \mid X, \Omega_{s-1}) - \hat{g}_{t-1}(a \mid X)\Biggr)\bigl(f^*(a, X) - \hat{f}_{t-1}(a, X)\bigr) \,\Bigl|\, \Omega_{t-1}\Biggr]\Biggr| \\
&\quad + \Biggl|\frac{1}{T}\sum_{t=1}^T \mathbb{E}\Biggl[\pi^e(a \mid X)\Biggl(\frac{\pi_t(a \mid X, \Omega_{t-1})}{\frac{1}{t}\sum_{s=1}^t \pi_s(a \mid X, \Omega_{s-1})} - 1\Biggr)\bigl(f^*(a, X) - \hat{f}_{t-1}(a, X)\bigr) \,\Bigl|\, \Omega_{t-1}\Biggr]\Biggr|.
\end{align*}
Then, we want to show that
\begin{align}
&\Biggl|\frac{1}{T}\sum_{t=1}^T \mathbb{E}\Biggl[\frac{\pi^e(a \mid X)\,\pi_t(a \mid X, \Omega_{t-1})}{\hat{g}_{t-1}(a \mid X) \cdot \frac{1}{t}\sum_{s=1}^t \pi_s(a \mid X, \Omega_{s-1})}\Biggl(\frac{1}{t}\sum_{s=1}^t \pi_s(a \mid X, \Omega_{s-1}) - \hat{g}_{t-1}(a \mid X)\Biggr)\bigl(f^*(a, X) - \hat{f}_{t-1}(a, X)\bigr) \,\Bigl|\, \Omega_{t-1}\Biggr]\Biggr| \nonumber \\
&\qquad \leq \frac{C}{T}\sum_{t=1}^T \Biggl|\mathbb{E}\Biggl[\Biggl(\frac{1}{t}\sum_{s=1}^t \pi_s(a \mid X, \Omega_{s-1}) - \hat{g}_{t-1}(a \mid X)\Biggr)\bigl(f^*(a, X) - \hat{f}_{t-1}(a, X)\bigr) \,\Bigl|\, \Omega_{t-1}\Biggr]\Biggr| = o_p(T^{-1/2}) \tag{B.27}
\end{align}
and
\begin{align}
\Biggl|\frac{1}{T}\sum_{t=1}^T \mathbb{E}\Biggl[\pi^e(a \mid X)\Biggl(\frac{\pi_t(a \mid X, \Omega_{t-1})}{\frac{1}{t}\sum_{s=1}^t \pi_s(a \mid X, \Omega_{s-1})} - 1\Biggr)\bigl(f^*(a, X) - \hat{f}_{t-1}(a, X)\bigr) \,\Bigl|\, \Omega_{t-1}\Biggr]\Biggr| = o_p(T^{-1/2}). \tag{B.28}
\end{align}
We show (B.27) by using Assumption 4.1 as
\begin{align*}
&\frac{1}{T}\sum_{t=1}^T \Biggl|\mathbb{E}\Biggl[\Biggl(\frac{1}{t}\sum_{s=1}^t \pi_s(a \mid X, \Omega_{s-1}) - \hat{g}_{t-1}(a \mid X)\Biggr)\bigl(f^*(a, X) - \hat{f}_{t-1}(a, X)\bigr) \,\Bigl|\, \Omega_{t-1}\Biggr]\Biggr| \\
&\leq \frac{1}{T}\sum_{t=1}^T \Biggl\|\frac{1}{t}\sum_{s=1}^t \pi_s(a \mid X, \Omega_{s-1}) - \hat{g}_{t-1}(a \mid X)\Biggr\|\,\Bigl\|f^*(a, X) - \hat{f}_{t-1}(a, X)\Bigr\| \\
&\leq \frac{1}{T}\sum_{t=1}^T \Biggl(\Biggl\|\frac{1}{t}\sum_{s=1}^t \pi_s(a \mid X, \Omega_{s-1}) - \bar{\alpha}(a \mid X)\Biggr\| + \bigl\|\bar{\alpha}(a \mid X) - \hat{g}_{t-1}(a \mid X)\bigr\|\Biggr)\Bigl\|f^*(a, X) - \hat{f}_{t-1}(a, X)\Bigr\| \\
&= \frac{1}{T}\sum_{t=1}^T \Biggl\|\frac{1}{t}\sum_{s=1}^t \pi_s(a \mid X, \Omega_{s-1}) - \bar{\alpha}(a \mid X)\Biggr\|\,\Bigl\|f^*(a, X) - \hat{f}_{t-1}(a, X)\Bigr\|
+ \frac{1}{T}\sum_{t=1}^T \bigl\|\bar{\alpha}(a \mid X) - \hat{g}_{t-1}(a \mid X)\bigr\|\,\Bigl\|f^*(a, X) - \hat{f}_{t-1}(a, X)\Bigr\| \\
&= \frac{1}{T}\sum_{t=1}^T o_p(t^{-1/2}) + \frac{1}{T}\sum_{t=1}^T o_p(t^{-1/2}).
\end{align*}
The equation (B.28) holds from Assumption 4.2. Then, because $\sum_{t=1}^T t^{-1/2} = O(\sqrt{T})$ (a Riemann zeta-type bound on the partial sums),
\[
\frac{1}{T}\sum_{t=1}^T o_p(t^{-1/2}) + \frac{1}{T}\sum_{t=1}^T o_p(t^{-1/2}) = o_p(T^{-1/2}) + o_p(T^{-1/2}) = o_p(T^{-1/2}).
\]
Therefore,
\[
\frac{1}{T}\sum_{t=1}^T \mathbb{E}\bigl[\phi(X_t, A_t, Y_t; \hat{g}_{t-1}, \hat{f}_{t-1}) \mid \Omega_{t-1}\bigr] + \frac{1}{T}\sum_{t=1}^T \mathbb{E}\bigl[\phi(X_t; \hat{f}_{t-1}) \mid \Omega_{t-1}\bigr]
- \frac{1}{T}\sum_{t=1}^T \mathbb{E}\bigl[\phi(X_t, A_t, Y_t; \bar{\alpha}, f^*) \mid \Omega_{t-1}\bigr] - \frac{1}{T}\sum_{t=1}^T \mathbb{E}\bigl[\phi(X_t; f^*) \mid \Omega_{t-1}\bigr] = o_p(T^{-1/2}).
\]

C. PROOF OF LEMMA 4.2
The proof procedure follows Kato et al. (2020). Proof.
Let $\Gamma_t(a; \pi^e)$ be
\[
\Gamma_t(a; \pi^e) = \frac{\pi^e(a \mid X_t)\mathbb{1}[A_t = a]\bigl(Y_t - f^*(a, X_t)\bigr)}{\bar{\alpha}(a \mid X_t)} + \pi^e(a \mid X_t) f^*(a, X_t)
= \frac{\pi^e(a \mid X_t)\mathbb{1}[A_t = a]\bigl(Y_t(a) - f^*(a, X_t)\bigr)}{\bar{\alpha}(a \mid X_t)} + \pi^e(a \mid X_t) f^*(a, X_t).
\]
Here, we used $\mathbb{1}[A_t = a] Y_t = \mathbb{1}[A_t = a] \sum_{a'=1}^K \mathbb{1}[A_t = a'] Y_t(a') = \mathbb{1}[A_t = a] Y_t(a)$. Note that $\ddot{R}_T(\pi^e) = \frac{1}{T}\sum_{t=1}^T \sum_{a=1}^K \Gamma_t(a; \pi^e)$. Then, for $Z_t = \sum_{a=1}^K \Gamma_t(a; \pi^e) - R(\pi^e)$, we want to show that
\[
\sqrt{T}\bigl(\ddot{R}_T(\pi^e) - R(\pi^e)\bigr) = \sqrt{T}\Biggl(\frac{1}{T}\sum_{t=1}^T Z_t\Biggr) \xrightarrow{d} \mathcal{N}\bigl(0, \sigma^2\bigr).
\]
The sequence $\{Z_t\}_{t=1}^T$ is a martingale difference sequence (MDS); that is,
\begin{align*}
\mathbb{E}\bigl[Z_t \mid \Omega_{t-1}\bigr]
&= \mathbb{E}\Biggl[\sum_{a=1}^K \Gamma_t(a; \pi^e) - R(\pi^e) \,\Bigl|\, \Omega_{t-1}\Biggr] \\
&= \mathbb{E}\Biggl[\sum_{a=1}^K \pi^e(a \mid X_t) f^*(a, X_t) - R(\pi^e) \,\Bigl|\, \Omega_{t-1}\Biggr]
+ \mathbb{E}\Biggl[\sum_{a=1}^K \frac{\pi^e(a \mid X_t)\mathbb{1}[A_t = a]\bigl(Y_t(a) - f^*(a, X_t)\bigr)}{\bar{\alpha}(a \mid X_t)} \,\Bigl|\, \Omega_{t-1}\Biggr] \\
&= 0 + \mathbb{E}\Biggl[\mathbb{E}\Biggl[\sum_{a=1}^K \frac{\pi^e(a \mid X_t)\mathbb{1}[A_t = a]\bigl(Y_t(a) - f^*(a, X_t)\bigr)}{\bar{\alpha}(a \mid X_t)} \,\Bigl|\, X_t, \Omega_{t-1}\Biggr] \,\Bigl|\, \Omega_{t-1}\Biggr] \\
&= \mathbb{E}\Biggl[\mathbb{E}\Biggl[\sum_{a=1}^K \frac{\pi^e(a \mid X_t)\,\pi_t(a \mid X_t, \Omega_{t-1})\bigl(f^*(a, X_t) - f^*(a, X_t)\bigr)}{\bar{\alpha}(a \mid X_t)} \,\Bigl|\, X_t, \Omega_{t-1}\Biggr] \,\Bigl|\, \Omega_{t-1}\Biggr] = 0.
\end{align*}
Therefore, to derive the asymptotic distribution, we consider applying the CLT for martingale difference sequences introduced in Proposition 3.3. Its statement involves the following three conditions:
(a) $\mathbb{E}\bigl[Z_t^2\bigr] = \nu_t > 0$ with $\frac{1}{T}\sum_{t=1}^T \nu_t \to \nu > 0$;
(b) $\mathbb{E}\bigl[|Z_t|^r\bigr] < \infty$ for some $r > 2$;
(c) $\frac{1}{T}\sum_{t=1}^T Z_t^2 \xrightarrow{p} \nu$.
Because we assumed the boundedness of $Z_t$ (through the boundedness of $Y_t$, $f^*$, and $\pi^e/\bar{\alpha}$), condition (b) holds. Therefore, the remaining task is to show that conditions (a) and (c) hold.
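Before checking the conditions, the MDS property of $Z_t$ can be checked by simulation. The following context-free two-action sketch is illustrative and not from the paper; `f_star`, `alpha_bar`, and `pi_e` are treated as known, and the logging probability adapts to past pulls.

```python
import numpy as np

# Monte Carlo check of the MDS property of Z_t = sum_a Gamma_t(a) - R(pi_e)
# in a context-free two-action setting. The logging probability pi_t
# adapts to the past, yet the sample mean of Z_t concentrates near 0.
f_star = np.array([0.3, 0.6])      # true mean outcomes E[Y(a)] (assumed known)
pi_e = np.array([0.5, 0.5])        # evaluation policy
alpha_bar = np.array([0.5, 0.5])   # limit of the average logging probability
R = float(np.dot(pi_e, f_star))    # target value R(pi_e)

def z_bar(T, rng):
    counts = np.ones(2)            # past pulls drive pi_t (adaptive logging)
    total = 0.0
    for _ in range(T):
        pi_t = counts[::-1] / counts.sum()   # favor the less-pulled arm
        a = rng.choice(2, p=pi_t)
        y = float(rng.random() < f_star[a])  # Bernoulli outcome
        # Z_t: IPW residual plus direct term, centered by R(pi_e)
        # (the direct term cancels R exactly in this context-free case).
        total += pi_e[a] * (y - f_star[a]) / alpha_bar[a] + np.dot(pi_e, f_star) - R
        counts[a] += 1
    return total / T

rng = np.random.default_rng(0)
vals = [z_bar(500, rng) for _ in range(200)]
print(np.mean(vals))  # concentrates near 0, as the MDS property predicts
```

The conditional mean of each increment is zero regardless of how `pi_t` was formed, which is exactly the argument that makes $\{Z_t\}$ an MDS.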
Step 1: Check of Condition (a)
For E (cid:2) Z t (cid:3) , we have E (cid:2) Z t (cid:3) = E (cid:32) K (cid:88) a =1 (cid:18) π e ( a | X t ) [ A t = a ]( Y t ( a ) − f ∗ ( a, X t ))¯ α ( a | X t ) + π e ( a | X t ) f ∗ ( a, X t ) (cid:19) − R ( π e ) (cid:33) = E (cid:32) K (cid:88) a =1 π e ( a | X t ) [ A t = a ]( Y t ( a ) − f ∗ ( a, X t ))¯ α ( a | X t ) + K (cid:88) a =1 π e ( a | X t ) f ∗ ( a, X t ) − R ( π e ) (cid:33) = E (cid:34) (cid:32) K (cid:88) a =1 π e ( a | X t ) [ A t = a ]( Y t ( a ) − f ∗ ( a, X t ))¯ α ( a | X t ) (cid:33) (C.29)+ 2 (cid:32) K (cid:88) a =1 π e ( a | X t ) [ A t = a ]( Y t ( a ) − f ∗ ( a, X t ))¯ α ( a | X t ) (cid:33) (cid:32) K (cid:88) a =1 π e ( a | X t ) f ∗ ( a, X t ) − R ( π e ) (cid:33) (C.30)+ (cid:32) K (cid:88) a =1 π e ( a | X t ) f ∗ ( a, X t ) − R ( π e ) (cid:33) (cid:35) . For the first term (C.29), we have E (cid:34) (cid:32) K (cid:88) a =1 π e ( a | X t ) [ A t = a ]( Y t ( a ) − f ∗ ( a, X t ))¯ α ( a | X t ) (cid:33) (cid:35) = K (cid:88) a =1 E (cid:34) (cid:18) π e ( a | X t ) [ A t = a ]( Y t ( a ) − f ∗ ( a, X t ))¯ α ( a | X t ) (cid:19) (cid:35) = K (cid:88) a =1 E (cid:34) (cid:0) π e ( a | X t ) (cid:1) [ A t = a ]( Y t ( a ) − f ∗ ( a, X t )) ¯ α ( a | X t ) (cid:35) = K (cid:88) a =1 E (cid:34) (cid:0) π e ( a | X t ) (cid:1) π t ( a | X t , Ω t − )Var( Y t ( a ) | X t )¯ α ( a | X t ) (cid:35) . From the first to second line, we used [ A t = a ] [ A t = a (cid:48) ] = 0 for a (cid:54) = a (cid:48) . Fromthe third to fourth line, we used the conditional independence between [ A t = a ] and ean Outcome Estimation under a Convergence of Average Logging Probability ( Y t − f ∗ ( a, X t )) . 
The second term (C.30) is 0 because E (cid:34) (cid:32) K (cid:88) a =1 π e ( a | X t ) [ A t = a ]( Y t ( a ) − f ∗ ( a, X t ))¯ α ( a | X t ) (cid:33) (cid:32) K (cid:88) a =1 π e ( a | X t ) f ∗ ( a, X t ) − R ( π e ) (cid:33) (cid:35) = E (cid:34) (cid:32) K (cid:88) a =1 π e ( a | X t ) [ A t = a ]( Y t ( a ) − f ∗ ( a, X t ))¯ α ( a | X t ) (cid:33) (cid:32) K (cid:88) a =1 π e ( a | X t ) f ∗ ( a, X t ) − R ( π e ) (cid:33) (cid:35) = E (cid:34) (cid:32) K (cid:88) a =1 π e ( a | X t ) f ∗ ( a, X t ) − R ( π e ) (cid:33) E (cid:34) K (cid:88) a =1 π e ( a | X t ) [ A t = a ]( Y t ( a ) − f ∗ ( a, X t ))¯ α ( a | X t ) | X t , Ω t − (cid:35)(cid:35) = E (cid:34) (cid:32) K (cid:88) a =1 π e ( a | X t ) f ∗ ( a, X t ) − R ( π e ) (cid:33) (cid:32) K (cid:88) a =1 π e ( a | X t ) π t ( a | X t , Ω t − )( f ∗ ( a, X t ) − f ∗ ( a, X t ))¯ α ( a | X t ) (cid:33)(cid:35) = 0 . In conclusion, we have E (cid:2) Z t (cid:3) = E (cid:34) K (cid:88) a =1 (cid:0) π e ( a | X t ) (cid:1) π t ( a | X t , Ω t − )Var( Y t ( a ) | X t )¯ α ( a | X t ) + (cid:32) K (cid:88) a =1 π e ( a | X t ) f ∗ ( a, X t ) − R ( π e ) (cid:33) (cid:35) . Because X t and Y t ( a ) does not depend on the period in the expectation, by droppingtheir subscripts, we represent ν t = E (cid:2) Z t (cid:3) as ν t = E (cid:34) K (cid:88) a =1 (cid:0) π e ( a | X ) (cid:1) π t ( a | X, Ω t − )Var( Y ( a ) | X )¯ α ( a | X ) + (cid:32) K (cid:88) a =1 π e ( a | X ) f ∗ ( a, X ) − R ( π e ) (cid:33) (cid:35) . 
Next, we show that for
\[
\nu = \mathbb{E}\Bigg[\sum_{a=1}^K \frac{\big(\pi^e(a\mid X)\big)^2\,\bar\alpha(a\mid X)\,\mathrm{Var}(Y(a)\mid X)}{\bar\alpha^2(a\mid X)} + \Bigg(\sum_{a=1}^K \pi^e(a\mid X)f^*(a,X) - R(\pi^e)\Bigg)^2\Bigg],
\]
we have $\frac{1}{T}\sum_{t=1}^T \nu_t - \nu \to 0$ as $T \to \infty$. This convergence is equivalent to
\[
\begin{aligned}
&\frac{1}{T}\sum_{t=1}^T \mathbb{E}\Bigg[\sum_{a=1}^K \frac{\big(\pi^e(a\mid X)\big)^2\,\pi_t(a\mid X,\Omega_{t-1})\,\mathrm{Var}(Y(a)\mid X)}{\bar\alpha^2(a\mid X)} + \Bigg(\sum_{a=1}^K \pi^e(a\mid X)f^*(a,X) - R(\pi^e)\Bigg)^2\Bigg] \\
&\qquad - \mathbb{E}\Bigg[\sum_{a=1}^K \frac{\big(\pi^e(a\mid X)\big)^2\,\bar\alpha(a\mid X)\,\mathrm{Var}(Y(a)\mid X)}{\bar\alpha^2(a\mid X)} + \Bigg(\sum_{a=1}^K \pi^e(a\mid X)f^*(a,X) - R(\pi^e)\Bigg)^2\Bigg] \to 0 \\
&\Leftrightarrow\ \sum_{a=1}^K \mathbb{E}\Bigg[\frac{1}{T}\sum_{t=1}^T \frac{\big(\pi^e(a\mid X)\big)^2\,\pi_t(a\mid X,\Omega_{t-1})\,\mathrm{Var}(Y(a)\mid X)}{\bar\alpha^2(a\mid X)} - \frac{\big(\pi^e(a\mid X)\big)^2\,\bar\alpha(a\mid X)\,\mathrm{Var}(Y(a)\mid X)}{\bar\alpha^2(a\mid X)}\Bigg] \to 0 \\
&\Leftrightarrow\ \sum_{a=1}^K \mathbb{E}\Bigg[\frac{\big(\pi^e(a\mid X)\big)^2\,\mathrm{Var}(Y(a)\mid X)}{\bar\alpha^2(a\mid X)}\Bigg(\frac{1}{T}\sum_{t=1}^T \pi_t(a\mid X,\Omega_{t-1}) - \bar\alpha(a\mid X)\Bigg)\Bigg] \to 0.
\end{aligned}
\]
Because $\frac{(\pi^e(a\mid X))^2\,\mathrm{Var}(Y(a)\mid X)}{\bar\alpha^2(a\mid X)}$ is upper bounded, there is a constant $C > 0$ such that
\[
\mathbb{E}\Bigg[\frac{\big(\pi^e(a\mid X)\big)^2\,\mathrm{Var}(Y(a)\mid X)}{\bar\alpha^2(a\mid X)}\Bigg(\frac{1}{T}\sum_{t=1}^T \pi_t(a\mid X,\Omega_{t-1}) - \bar\alpha(a\mid X)\Bigg)\Bigg] \le C\,\mathbb{E}\Bigg[\Bigg|\frac{1}{T}\sum_{t=1}^T \pi_t(a\mid X,\Omega_{t-1}) - \bar\alpha(a\mid X)\Bigg|\Bigg].
\]
We assumed the point-wise convergence of $\frac{1}{T}\sum_{t=1}^T \pi_t(a\mid x,\Omega_{t-1})$; that is, for all $x \in \mathcal{X}$, $a \in \mathcal{A}$, and $\Omega_{t-1} \in \mathcal{M}_{t-1}$, $\frac{1}{T}\sum_{t=1}^T \pi_t(a\mid x,\Omega_{t-1}) \xrightarrow{d} \bar\alpha(a\mid x)$. From this assumption, if $\frac{1}{T}\sum_{t=1}^T \pi_t(a\mid x,\Omega_{t-1})$ is uniformly integrable, we can show that
\[
\mathbb{E}\Bigg[\Bigg|\frac{1}{T}\sum_{t=1}^T \pi_t(a\mid X,\Omega_{t-1}) - \bar\alpha(a\mid x)\Bigg|\,\Bigg|\, X = x\Bigg] = \mathbb{E}\Bigg[\Bigg|\frac{1}{T}\sum_{t=1}^T \pi_t(a\mid x,\Omega_{t-1}) - \bar\alpha(a\mid x)\Bigg|\Bigg] \to 0
\]
as $T \to \infty$ by the $L^r$-convergence theorem (Proposition A.2). Note that $X_t$ is independent of $\Omega_{t-1}$. Here, for a fixed $x$, $\frac{1}{T}\sum_{t=1}^T \pi_t(a\mid x,\Omega_{t-1})$ is uniformly integrable because it is bounded (Proposition A.1). From the point-wise convergence of $\mathbb{E}\big[\big|\frac{1}{T}\sum_{t=1}^T \pi_t(a\mid x,\Omega_{t-1}) - \bar\alpha(a\mid x)\big|\,\big|\, X = x\big]$, Lebesgue's dominated convergence theorem yields
\[
\mathbb{E}\Bigg[\mathbb{E}\Bigg[\Bigg|\frac{1}{T}\sum_{t=1}^T \pi_t(a\mid X,\Omega_{t-1}) - \bar\alpha(a\mid X)\Bigg|\,\Bigg|\, X\Bigg]\Bigg] \to 0.
\]
In conclusion, as $T \to \infty$,
\[
\frac{1}{T}\sum_{t=1}^T \mathbb{E}\big[Z_t^2\big] - \mathbb{E}\Bigg[\sum_{a=1}^K \frac{\big(\pi^e(a\mid X)\big)^2\,\bar\alpha(a\mid X)\,\mathrm{Var}(Y(a)\mid X)}{\bar\alpha^2(a\mid X)} + \Bigg(\sum_{a=1}^K \pi^e(a\mid X)f^*(a,X) - R(\pi^e)\Bigg)^2\Bigg] \to 0.
\]
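As a concrete illustration of the assumption used above (this sketch is not part of the proof, and the alternating 0.9/0.1 schedule is an illustrative choice), a logging probability may have no pointwise limit while its Cesàro average $\frac{1}{T}\sum_{t=1}^T \pi_t(a\mid x,\Omega_{t-1})$ still converges to $\bar\alpha(a\mid x)$:

```python
# Illustrative sketch (not from the paper): a logging probability with no
# pointwise limit whose time average converges, matching the assumption
# (1/T) * sum_t pi_t(a|x, Omega_{t-1}) -> alpha_bar(a|x).
import numpy as np

def logging_prob(t: int) -> float:
    """Probability of choosing a fixed action at period t; alternates forever."""
    return 0.9 if t % 2 == 1 else 0.1

T = 10_000
pis = np.array([logging_prob(t) for t in range(1, T + 1)])
averages = np.cumsum(pis) / np.arange(1, T + 1)

# pi_t itself oscillates between 0.9 and 0.1 ...
print(pis[-4:])       # [0.9 0.1 0.9 0.1]
# ... but the average logging probability settles at alpha_bar = 0.5
print(averages[-1])   # 0.5
```

Under the weaker average-convergence assumption, such a policy is admissible even though $\pi_t$ never converges.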
Step 2: Check of Condition (c)
Let $U_t$ be an MDS such that
\[
\begin{aligned}
U_t = Z_t^2 - \mathbb{E}\big[Z_t^2 \mid \Omega_{t-1}\big]
&= \Bigg(\sum_{a=1}^K \bigg(\frac{\pi^e(a\mid X_t)\,\mathbb{1}[A_t=a]\,(Y_t(a)-f^*(a,X_t))}{\bar\alpha(a\mid X_t)} + \pi^e(a\mid X_t)f^*(a,X_t)\bigg) - R(\pi^e)\Bigg)^2 \\
&\quad - \mathbb{E}\Bigg[\Bigg(\sum_{a=1}^K \bigg(\frac{\pi^e(a\mid X_t)\,\mathbb{1}[A_t=a]\,(Y_t(a)-f^*(a,X_t))}{\bar\alpha(a\mid X_t)} + \pi^e(a\mid X_t)f^*(a,X_t)\bigg) - R(\pi^e)\Bigg)^2\,\Bigg|\,\Omega_{t-1}\Bigg].
\end{aligned}
\]
From the boundedness of each variable in $Z_t$, we can apply the weak law of large numbers for an MDS (Proposition A.3 in Appendix A). Then, we have
\[
\frac{1}{T}\sum_{t=1}^T U_t = \frac{1}{T}\sum_{t=1}^T \Big(Z_t^2 - \mathbb{E}\big[Z_t^2 \mid \Omega_{t-1}\big]\Big) \xrightarrow{p} 0.
\]
Next, we show that $\frac{1}{T}\sum_{t=1}^T \mathbb{E}\big[Z_t^2 \mid \Omega_{t-1}\big] - \sigma^2 \xrightarrow{p} 0$. From Markov's inequality, for any $\varepsilon > 0$, we have
\[
P\Bigg(\Bigg|\frac{1}{T}\sum_{t=1}^T \mathbb{E}\big[Z_t^2 \mid \Omega_{t-1}\big] - \sigma^2\Bigg| \ge \varepsilon\Bigg) \le \frac{\mathbb{E}\Big[\Big|\frac{1}{T}\sum_{t=1}^T \mathbb{E}\big[Z_t^2 \mid \Omega_{t-1}\big] - \sigma^2\Big|\Big]}{\varepsilon}.
\]
Then, we consider showing $\mathbb{E}\Big[\Big|\frac{1}{T}\sum_{t=1}^T \mathbb{E}\big[Z_t^2 \mid \Omega_{t-1}\big] - \sigma^2\Big|\Big] \to 0$. As in Step 1, we have
\[
\mathbb{E}\big[Z_t^2 \mid \Omega_{t-1}\big] = \mathbb{E}\Bigg[\sum_{a=1}^K \frac{\big(\pi^e(a\mid X)\big)^2\,\pi_t(a\mid X,\Omega_{t-1})\,\mathrm{Var}(Y(a)\mid X)}{\bar\alpha^2(a\mid X)} + \Bigg(\sum_{a=1}^K \pi^e(a\mid X)f^*(a,X) - R(\pi^e)\Bigg)^2\,\Bigg|\,\Omega_{t-1}\Bigg],
\]
where, as in Step 1, we drop the subscript $t$ from $X_t$ and $Y_t(a)$. Then,
\[
\begin{aligned}
&\mathbb{E}\Bigg[\Bigg|\frac{1}{T}\sum_{t=1}^T \mathbb{E}\big[Z_t^2 \mid \Omega_{t-1}\big] - \sigma^2\Bigg|\Bigg] \\
&= \mathbb{E}\Bigg[\Bigg|\frac{1}{T}\sum_{t=1}^T \mathbb{E}\Bigg[\sum_{a=1}^K \frac{\big(\pi^e(a\mid X)\big)^2\,\pi_t(a\mid X,\Omega_{t-1})\,\mathrm{Var}(Y(a)\mid X)}{\bar\alpha^2(a\mid X)} + \Big(\sum_{a=1}^K \pi^e(a\mid X)f^*(a,X) - R(\pi^e)\Big)^2\,\Bigg|\,\Omega_{t-1}\Bigg] \\
&\qquad\qquad - \mathbb{E}\Bigg[\sum_{a=1}^K \frac{\big(\pi^e(a\mid X)\big)^2\,\bar\alpha(a\mid X)\,\mathrm{Var}(Y(a)\mid X)}{\bar\alpha^2(a\mid X)} + \Big(\sum_{a=1}^K \pi^e(a\mid X)f^*(a,X) - R(\pi^e)\Big)^2\Bigg]\Bigg|\Bigg] \\
&= \mathbb{E}\Bigg[\Bigg|\frac{1}{T}\sum_{t=1}^T \mathbb{E}\Bigg[\sum_{a=1}^K \frac{\big(\pi^e(a\mid X)\big)^2\,\pi_t(a\mid X,\Omega_{t-1})\,\mathrm{Var}(Y(a)\mid X)}{\bar\alpha^2(a\mid X)}\,\Bigg|\,\Omega_{t-1}\Bigg] - \mathbb{E}\Bigg[\sum_{a=1}^K \frac{\big(\pi^e(a\mid X)\big)^2\,\bar\alpha(a\mid X)\,\mathrm{Var}(Y(a)\mid X)}{\bar\alpha^2(a\mid X)}\Bigg]\Bigg|\Bigg] \\
&= \mathbb{E}\Bigg[\Bigg|\frac{1}{T}\sum_{t=1}^T \mathbb{E}\Bigg[\sum_{a=1}^K \frac{\big(\pi^e(a\mid X)\big)^2\,\pi_t(a\mid X,\Omega_{t-1})\,\mathrm{Var}(Y(a)\mid X)}{\bar\alpha^2(a\mid X)}\,\Bigg|\,\Omega_{t-1}\Bigg] - \mathbb{E}\Bigg[\sum_{a=1}^K \frac{\big(\pi^e(a\mid X)\big)^2\,\bar\alpha(a\mid X)\,\mathrm{Var}(Y(a)\mid X)}{\bar\alpha^2(a\mid X)}\,\Bigg|\,\Omega_{t-1}\Bigg]\Bigg|\Bigg] \\
&= \mathbb{E}\Bigg[\Bigg|\frac{1}{T}\sum_{t=1}^T \mathbb{E}\Bigg[\sum_{a=1}^K \frac{\big(\pi^e(a\mid X)\big)^2\,\mathrm{Var}(Y(a)\mid X)}{\bar\alpha^2(a\mid X)}\Big(\pi_t(a\mid X,\Omega_{t-1}) - \bar\alpha(a\mid X)\Big)\,\Bigg|\,\Omega_{t-1}\Bigg]\Bigg|\Bigg] \\
&\le \sum_{a=1}^K \mathbb{E}\Bigg[\Bigg|\frac{1}{T}\sum_{t=1}^T \mathbb{E}\Bigg[\frac{\big(\pi^e(a\mid X)\big)^2\,\mathrm{Var}(Y(a)\mid X)}{\bar\alpha^2(a\mid X)}\Big(\pi_t(a\mid X,\Omega_{t-1}) - \bar\alpha(a\mid X)\Big)\,\Bigg|\,\Omega_{t-1}\Bigg]\Bigg|\Bigg] \\
&= \sum_{a=1}^K \mathbb{E}\Bigg[\Bigg|\mathbb{E}\Bigg[\frac{\big(\pi^e(a\mid X)\big)^2\,\mathrm{Var}(Y(a)\mid X)}{\bar\alpha^2(a\mid X)}\,\frac{1}{T}\sum_{t=1}^T \Big(\pi_t(a\mid X,\Omega_{t-1}) - \bar\alpha(a\mid X)\Big)\,\Bigg|\,\Omega_{T-1}\Bigg]\Bigg|\Bigg] \\
&= \sum_{a=1}^K \mathbb{E}\Bigg[\Bigg|\mathbb{E}\Bigg[\frac{\big(\pi^e(a\mid X)\big)^2\,\mathrm{Var}(Y(a)\mid X)}{\bar\alpha^2(a\mid X)}\,\mathbb{E}\Bigg[\frac{1}{T}\sum_{t=1}^T \pi_t(a\mid X,\Omega_{t-1}) - \bar\alpha(a\mid X)\,\Bigg|\, X,\Omega_{T-1}\Bigg]\,\Bigg|\,\Omega_{T-1}\Bigg]\Bigg|\Bigg].
\end{aligned}
\]
Then, by using Jensen's inequality,
\[
\begin{aligned}
\mathbb{E}\Bigg[\Bigg|\frac{1}{T}\sum_{t=1}^T \mathbb{E}\big[Z_t^2 \mid \Omega_{t-1}\big] - \sigma^2\Bigg|\Bigg]
&\le \sum_{a=1}^K \mathbb{E}\Bigg[\mathbb{E}\Bigg[\Bigg|\frac{\big(\pi^e(a\mid X)\big)^2\,\mathrm{Var}(Y(a)\mid X)}{\bar\alpha^2(a\mid X)}\,\mathbb{E}\Bigg[\frac{1}{T}\sum_{t=1}^T \pi_t(a\mid X,\Omega_{t-1}) - \bar\alpha(a\mid X)\,\Bigg|\, X,\Omega_{T-1}\Bigg]\Bigg|\,\Bigg|\,\Omega_{T-1}\Bigg]\Bigg] \\
&= \sum_{a=1}^K \mathbb{E}\Bigg[\frac{\big(\pi^e(a\mid X)\big)^2\,\mathrm{Var}(Y(a)\mid X)}{\bar\alpha^2(a\mid X)}\,\Bigg|\mathbb{E}\Bigg[\frac{1}{T}\sum_{t=1}^T \pi_t(a\mid X,\Omega_{t-1}) - \bar\alpha(a\mid X)\,\Bigg|\, X,\Omega_{T-1}\Bigg]\Bigg|\Bigg].
\end{aligned}
\]
Then, from the $L^r$-convergence theorem, by using the point-wise convergence of $\frac{1}{T}\sum_{t=1}^T \pi_t(a\mid x,\Omega_{t-1})$ and the boundedness of $Z_t$, we have
\[
\mathbb{E}\Bigg[\Bigg|\frac{1}{T}\sum_{t=1}^T \mathbb{E}\big[Z_t^2 \mid \Omega_{t-1}\big] - \sigma^2\Bigg|\Bigg] \to 0.
\]
Therefore,
\[
P\Bigg(\Bigg|\frac{1}{T}\sum_{t=1}^T \mathbb{E}\big[Z_t^2 \mid \Omega_{t-1}\big] - \sigma^2\Bigg| \ge \varepsilon\Bigg) \le \frac{\mathbb{E}\Big[\Big|\frac{1}{T}\sum_{t=1}^T \mathbb{E}\big[Z_t^2 \mid \Omega_{t-1}\big] - \sigma^2\Big|\Big]}{\varepsilon} \to 0.
\]
In conclusion,
\[
\frac{1}{T}\sum_{t=1}^T Z_t^2 - \sigma^2 = \frac{1}{T}\sum_{t=1}^T \Big(Z_t^2 - \mathbb{E}\big[Z_t^2 \mid \Omega_{t-1}\big] + \mathbb{E}\big[Z_t^2 \mid \Omega_{t-1}\big] - \sigma^2\Big) \xrightarrow{p} 0.
\]
Step 3: Conclusion
We can use the CLT for an MDS. Hence, we have
\[
\sqrt{T}\Big(\tilde{R}(\pi^e) - R(\pi^e)\Big) \xrightarrow{d} \mathcal{N}\big(0, \sigma^2\big),
\]
where
\[
\sigma^2 = \mathbb{E}\Bigg[\sum_{a=1}^K \frac{\big(\pi^e(a\mid X)\big)^2\,\mathrm{Var}(Y(a)\mid X)}{\bar\alpha(a\mid X)} + \Bigg(\sum_{a=1}^K \pi^e(a\mid X)f^*(a,X) - R(\pi^e)\Bigg)^2\Bigg].
\]
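The asymptotic normality can be sanity-checked by a Monte Carlo sketch (illustrative, not the paper's experiments). Below we assume a two-armed setting with no covariates, known $f^*$, unit outcome variance, an alternating 0.9/0.1 logging policy whose average probability is $\bar\alpha(a) = 0.5$, and a uniform evaluation policy, so the formula above gives $\sigma^2 = \sum_a (\pi^e(a))^2 \cdot 1 / \bar\alpha(a) = 1$:

```python
# Monte Carlo sketch (illustrative assumptions, not the paper's experiments):
# check that sqrt(T) * (R_tilde - R) behaves like N(0, sigma^2) with sigma^2 = 1.
import numpy as np

rng = np.random.default_rng(0)
T, n_rep = 2_000, 2_000
mu = np.array([1.0, 0.0])         # f*(a) = E[Y(a)]; Var(Y(a)) = 1
pi_e = np.array([0.5, 0.5])       # evaluation policy
alpha_bar = np.array([0.5, 0.5])  # Cesaro limit of the alternating logging policy
R = float(pi_e @ mu)              # true mean outcome under pi_e

t_idx = np.arange(1, T + 1)
p_arm0 = np.where(t_idx % 2 == 1, 0.9, 0.1)  # oscillating probability of arm 0

def dr_estimate() -> float:
    a = (rng.random(T) >= p_arm0).astype(int)  # arm 0 w.p. p_arm0, else arm 1
    y = mu[a] + rng.standard_normal(T)         # outcomes with unit variance
    # DR score: residual weighted by the *average* probability + model term
    z = pi_e[a] * (y - mu[a]) / alpha_bar[a] + R
    return float(z.mean())

stats = np.sqrt(T) * (np.array([dr_estimate() for _ in range(n_rep)]) - R)
print(stats.mean(), stats.std())  # roughly 0 and sigma = 1
```

Despite the logging probability never converging pointwise, the normalized error has mean close to 0 and standard deviation close to $\sigma = 1$, consistent with the theorem.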