Optimal designs for the development of personalized treatment rules
David Azriel^a, Yosef Rinott^b and Martin Posch^c∗

February 16, 2021

^a Faculty of Industrial Engineering and Management, The Technion
^b Department of Statistics and Federmann Center for the Study of Rationality, The Hebrew University
^c Center for Medical Statistics, Informatics and Intelligent Systems, Medical University of Vienna
Abstract
In the present paper, personalized treatment means choosing the best treatment for a patient while taking into account certain relevant personal covariate values. We study the design of trials whose goal is to find the best treatment for a given patient with given covariates. We assume that the subjects in the trial represent a random sample from the population, and consider the allocation, possibly with randomization, of these subjects to the different treatment groups in a way that depends on their covariates. We derive approximately optimal allocations, aiming to minimize expected regret, assuming that future patients will arrive from the same population as the trial subjects. We find that, for the case of two treatments, an approximately optimal allocation design does not depend on the values of the covariates but only on the variances of the responses. In contrast, for the case of three treatments the optimal allocation design does depend on the covariates, as we show for specific scenarios. Another finding is that the optimal allocation can vary considerably as a function of the sample size; randomized allocations are relevant for relatively small samples, and may not be needed for very large studies.
Key words: personalized medicine; minimal regret; optimal allocation; experimental design.
This work concerns the study of optimal designs of clinical trials with the goal of estimating personalized treatment rules. We consider multi-armed clinical trials where the primary outcome may depend on patients' covariates. Modeling the outcome as a function of the covariates does not only reduce the error variance, but also allows one to estimate "personalized" treatment rules. These rules assign the treatment with the best estimated outcome for a patient having given values of the covariates.

In the related context of bandit problems with covariates, Goldenshluger and Zeevi (2013) studied a sequential allocation scheme where at each stage one out of two treatments needs to be assigned under a minimax framework. A high-dimensional version of this problem was studied in Bastani and Bayati (2015). The goal in the bandit formulation is to determine the best treatment while minimizing some loss or regret function for subjects in the trial sample. This is a different setting than the one considered here: rather than being concerned with optimal treatment for subjects in the trial sample, we look for designs that allocate patients to treatments efficiently for the goal of finding the best treatment for future patients who may require one of the treatments under study. Personalized design as a function of covariates appears in many papers, such as Qian and Murphy (2011), Tian et al. (2014), Ballarini et al. (2018), or Pan and Zhao (2020). However, these papers are concerned with inference about the optimal treatment rather than the design of the trial, which is the goal of this paper.

∗ Corresponding author: Martin Posch; email: [email protected].

Problem statement:
Consider K possible treatments T_1, ..., T_K. Let Y be a one-dimensional continuous response variable and let X ∈ R^p be a vector of a subject's covariates. We assume that the joint distribution of the covariates is continuous with density denoted by f(x). The expected response is E(Y | X, T_k) = g_k(X), where g_k is an unknown function. The optimal treatment for a subject with covariate x is δ*(x) = arg max_{k ∈ {1,...,K}} g_k(x) (assuming that a higher response is better). If the above arg max contains more than one k, one arbitrary treatment is selected.

Suppose that a clinical trial with n subjects is performed in order to estimate δ*. Let X_1, ..., X_n denote a sample according to f of the covariates of the n subjects. The design we study here consists of allocating subjects to treatments, taking account of their covariates. Each subject i is allocated independently to treatment k with probability π_k(X_i), where π_1(x), ..., π_K(x) are non-negative functions satisfying Σ_{k=1}^K π_k(x) = 1 for each x. The allocation functions define densities for the covariates in the K treatment groups. Specifically, in treatment k, X is sampled from the density f_k(x) := f(x) π_k(x) / ν_k, where ν_k := ∫ f(x) π_k(x) dx. Our purpose is to find the optimal allocation functions π_1(x), ..., π_K(x) that minimize the regret, which is defined next.

Let δ̂ denote the final estimate of the optimal rule δ* based on the covariates, responses, and allocations of the subjects in the trial. Then the regret for a randomly chosen future patient is defined as

R(π_1, ..., π_K) := E[ g_{δ*(X̃)}(X̃) − g_{δ̂(X̃)}(X̃) ],    (1)

where X̃ denotes the covariate vector of the chosen patient and the expectation is over X̃, δ̂, and the randomizations in the allocations.
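The covariate-dependent allocation just described induces the group densities f_k = f·π_k/ν_k, which can be checked by a small Monte Carlo sketch. The allocation function below (π_1(x) = x for X ~ U[0,1] and K = 2) is an arbitrary illustration, not a design from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical allocation for K = 2 and a scalar covariate X ~ U[0, 1]:
# pi_1(x) = x (so pi_2(x) = 1 - x), chosen only for illustration.
def pi1(x):
    return x

n = 200_000
x = rng.uniform(0.0, 1.0, size=n)               # covariates, density f = 1 on [0, 1]
t = rng.uniform(size=n) < pi1(x)                # True if allocated to T_1

# nu_1 = int f(x) pi_1(x) dx = 1/2 for this choice
nu1_hat = t.mean()

# The covariate density in group 1 should be f_1(x) = f(x) pi_1(x) / nu_1 = 2x,
# so E[X | T_1] = int x * 2x dx = 2/3.
mean_in_group1 = x[t].mean()
print(nu1_hat, mean_in_group1)  # ≈ 0.5 and ≈ 0.667
```

The empirical group mean matching ∫ x f_1(x) dx confirms that allocating with probability π_1(X) indeed tilts the covariate distribution in group 1 as claimed.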
In words, the regret is the difference between the expected response to the optimal treatment δ* and to its estimate δ̂ for any independent future subject arising from the same population as those in the trial. A similar criterion appears in Qian and Murphy (2011). Our goal is to minimize the regret over all feasible allocations, that is, non-negative functions π_k(·) that satisfy Σ_{k=1}^K π_k(x) = 1 for every x.

In this study we assume that we have some preliminary knowledge about the parameters of the problem, such as the functions g_k, the variances of Y given X, T_k, and the density f of the covariates. This approach concurs with existing literature on locally optimal designs; see, e.g., Chernoff (1953) and Silvey (2013, Chapter 6), where optimality is achieved for a given set of parameters. Our goal is to find optimal allocation functions relative to the assumed parameter values, and obtain a treatment rule δ̂ based on an experiment according to our design. Knowledge about the parameters can arise from previous experience or theory, from a pilot study, or in some situations from earlier phases of the study. Other approaches, like sequential or Bayesian designs, will not be discussed here. However, our approach suggests a natural adaptive formulation where, after observing an initial part of the sample, one estimates the parameters and aims at an optimal design relative to these estimates; the process can be repeated.

A trial of the kind we study ends with recommendations on treatments, amounting to a claim of causality. Here we allow the allocation to treatments to be a function of the covariate vector X only, and thus, conditioned on X, the allocation and the response are obviously independent. In this paper we shall assume that a linear model holds true (see Section 2); however, assuming any tractable parametric model on E(Y | X) can be handled in the same way, using non-linear least squares.
Assuming a model, linear or more complex, rather than taking the approach that "all models are wrong", is necessary if one wants to avoid a separate trial for each level of X, which seems inefficient. A linear model assumption may be reasonable if the range of X is limited, which is often very natural. Another possibility is to consider a polynomial linear model of some degree rather than a simple linear model (see Section 3.3). The parameters of any model for E(Y | X) obviously depend only on the conditional distribution of Y given X, allowing us to use them for causal inference given X; see, e.g., Hernán and Robins (2010). This remains true even if the allocation functions are "deterministic", taking only the values 0 and 1. On the other hand, suppose we use the best linear predictors without assuming a linear model. In this case, the parameters of the best linear predictor (projection coefficients) depend on the distribution of X, and causality cannot be claimed. Starting in Section 2 we shall assume a linear model.

Main results:
We study the regret via an approximation that assumes "ideal" conditions, such as normality instead of asymptotic normality of linear regression coefficient estimators. We show that this yields a useful approximation for quite small sample sizes. We also study an asymptotic regret. It turns out that when there are three or more treatments, optimal designs for moderate sample sizes may be totally different from the asymptotic design, and in this sense one should beware of asymptotics.

For the case of two treatments, we show that the optimal trial design allocates patients in proportion to the standard deviations of the response under each treatment, and the optimal allocation probabilities do not depend on the covariates. This is shown for the case where the outcome in each treatment group depends on the covariates according to a p-dimensional linear regression model, as well as for the case of a single covariate and a polynomial regression model.

For the case of three or more treatments and a linear regression model in a single covariate, we combine theoretical arguments and numerical calculations to study some examples of optimal designs. This includes a study comparing three treatments (see Ebbeling et al. (2018)) based on data that are reconstructed from a published paper; we compare the design in that paper to an optimal design obtained by our approach. We show that the optimal allocation in general depends on the covariate. For specific cases, we show that the asymptotically optimal allocation is deterministic, i.e., the range of the covariate is partitioned into intervals and in each interval the allocation probability to a particular treatment is 1. Such allocation rules arise typically in very large studies.

Let (Y_1, X_1), ..., (Y_n, X_n) be a sample obtained by given allocation functions π_1, ..., π_K, along with the treatment allocated for each X_i.
Assume K ≥ 2 and X ∈ R^p, and under treatment T_k we have Y = g_k(X) + ε, where g_k(X) = E(Y | X, T_k) = α_k + β_k^t X, and σ_k² := Var(ε | X, T_k). Let α̂_k, β̂_k denote the OLS estimators based on data from treatment T_k, ĝ_k(x) the corresponding estimated regression functions, and δ̂(x) = arg max_k ĝ_k(x) the decision rule. Then,

R(π_1, ..., π_K) = Σ_{k=1}^K ∫_{R^p} P(δ̂(x) = k) [g_{δ*(x)}(x) − g_k(x)] f(x) dx.    (2)

For k = 1, ..., K, let

Q_k := ∫_{R^p} (1, x^t)^t (1, x^t) f_k(x) dx,    Σ_k := (1/ν_k) σ_k² Q_k^{-1},    ξ_k(x) := (1, x^t) Σ_k (1, x^t)^t.    (3)

It is well known that √n [(α̂_k, β̂_k) − (α_k, β_k)] → N(0, Σ_k), and in particular Σ_k is the asymptotic variance of the OLS estimators.

In order to approximate the probability P(δ̂(x) = k), we first impose "ideal" conditions, to be dropped later: the asymptotic variance Σ_k is the exact variance matrix of the OLS estimators, and thus ξ_k(x)/n = Var(ĝ_k(x)); and the error ε | X, T_k ~ N(0, σ_k²). Under the ideal conditions the estimators ĝ_k(x) are jointly normal, conditionally on X_1, ..., X_n, T_1, ..., T_K, with constant expectation α_k + β_k^t x, and by normality they are independent.
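The ideal-conditions approximation ξ_k(x)/n ≈ Var(ĝ_k(x)) can be illustrated by simulation. The sketch below uses assumed parameter values and the simplest setting: a single group receiving all subjects (ν_k = 1, so f_k = f) with X ~ U[0,1].

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed illustrative setup: g(x) = 1 + 2x, error SD sigma = 0.5, X ~ U[0, 1].
alpha, beta, sigma = 1.0, 2.0, 0.5
n, reps, x0 = 400, 4000, 0.3

preds = np.empty(reps)
for r in range(reps):
    x = rng.uniform(size=n)
    y = alpha + beta * x + sigma * rng.standard_normal(n)
    X = np.column_stack([np.ones(n), x])
    coef = np.linalg.lstsq(X, y, rcond=None)[0]      # OLS (alpha_hat, beta_hat)
    preds[r] = coef[0] + coef[1] * x0                # g_hat(x0)

# xi(x) = (1, x) Sigma (1, x)^t with Sigma = sigma^2 Q^{-1} and
# Q = E[(1, X)^t (1, X)] = [[1, 1/2], [1/2, 1/3]] for X ~ U[0, 1].
Q = np.array([[1.0, 0.5], [0.5, 1.0 / 3.0]])
v = np.array([1.0, x0])
xi = sigma**2 * (v @ np.linalg.inv(Q) @ v)
print(preds.var(), xi / n)  # the empirical variance of g_hat(x0) vs. xi(x0)/n
```

At n = 400 the empirical variance of ĝ(x_0) over replications is already close to ξ(x_0)/n, consistent with the claim that the ideal approximation is useful for moderate samples.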
Therefore, with P_I and E_I denoting the probability and expectation under the ideal conditions, we have

P_I(δ̂(x) = k) = P_I( max_{l=1,...,K, l≠k} ĝ_l(x) < ĝ_k(x) )
= E_I ∏_{l≠k} P_I( √n [ĝ_l(x) − g_l(x)] / √ξ_l(x) < √n [ĝ_k(x) − g_l(x)] / √ξ_l(x) | ĝ_k(x) )
= E_I ∏_{l≠k} P( Z̃ ≤ √n [ĝ_k(x) − g_l(x)] / √ξ_l(x) | ĝ_k(x) )
= E ∏_{l≠k} P( Z̃ ≤ ( Z √ξ_k(x) + √n [g_k(x) − g_l(x)] ) / √ξ_l(x) | Z )
= ∫ ∏_{l=1,...,K, l≠k} Φ( ( z √ξ_k(x) + √n [g_k(x) − g_l(x)] ) / √ξ_l(x) ) φ(z) dz,    (4)

where the E and P in the penultimate expression are with respect to Z and Z̃, which are independent N(0, 1); Z represents √n [ĝ_k(x) − g_k(x)] / √ξ_k(x), and Z̃ a generic standardized ĝ_l(x); φ and Φ denote the standard normal density and cumulative distribution function. The product in the second line was justified above by independence under the ideal conditions, which are used also in both equalities of the third line. We define the ideal regret, denoted by R_I, by

R_I(π_1, ..., π_K) := Σ_{k=1}^K ∫_{R^p} P_I(δ̂(x) = k) [g_{δ*(x)}(x) − g_k(x)] f(x) dx,    (5)

where P_I(δ̂(x) = k) is given in (4).

We assume that K = 2 and Y = g_k(X) + ε under treatment T_k for k = 1, 2, where g_k(X) denotes the conditional mean of Y given X, T_k, X ∈ R^p, and

g_1(X) = α_1 + β_1^t X,  g_2(X) = α_2 + β_2^t X,  Var(ε | X, T_1) = σ_1²,  Var(ε | X, T_2) = σ_2².    (6)

We do not assume normality of the errors. However, we assume the existence of the moment generating function of ε when conditioned on X, T_k. This assumption is needed for some large deviation exponential bounds used below, and could be relaxed to assuming finiteness of some high order moments instead. We also assume that X is continuous with a bounded density f(x) supported on [0, 1]^p.

Under the ideal conditions of Section 2 with K = 2, Equation (4) becomes

P_I(δ̂(x) = k) = Φ( √n [g_k(x) − g_l(x)] / √V(x) ),  where  V(x) := ξ_1(x) + ξ_2(x) = (1, x^t) (Σ_1 + Σ_2) (1, x^t)^t    (7)

for 1 ≤ k ≠ l ≤ 2. The ideal regret (5) is

R_I(π_1, π_2) = ∫_{x: g_1(x) > g_2(x)} Φ( √n [g_2(x) − g_1(x)] / √V(x) ) [g_1(x) − g_2(x)] f(x) dx
+ ∫_{x: g_2(x) > g_1(x)} Φ( √n [g_1(x) − g_2(x)] / √V(x) ) [g_2(x) − g_1(x)] f(x) dx.    (8)

Theorem 3.1. The allocation functions π_1*(x) := σ_1/(σ_1 + σ_2) and π_2*(x) := σ_2/(σ_1 + σ_2) minimize both V(x), uniformly over x, and R_I(π_1, π_2).

Recalling that V(x) can be written as a quadratic form, see (7), Theorem 3.1 follows immediately from Lemma 3.2 below, whose proof, as all other proofs, is given in the Appendix. Given matrices A and B, we write A ⪰ B if A − B is positive semi-definite. Let Q := ∫_{[0,1]^p} (1, x^t)^t (1, x^t) f(x) dx, and recall the notation of (3). We have

Lemma 3.2. With the above definitions,

Σ_1 + Σ_2 ⪰ (σ_1 + σ_2)² Q^{-1}.    (9)

Moreover, the lower bound is attained when π_1(x) = π_1*(x) and π_2(x) = π_2*(x).

To see the latter statement, note that when the π_k(x) do not depend on x, both matrices Q_k are proportional to Q and therefore both matrices Σ_k are proportional to Q^{-1}.

We now discuss the regret (2). In Theorem 3.3 we approximate the regret by the ideal regret, and then we use the above results for the ideal regret to find an asymptotically optimal design for the regret itself. We later demonstrate numerically that the latter design provides a good approximation for finite sample sizes. The regret is given by

R(π_1, π_2) = ∫_{x: g_1(x) > g_2(x)} P(ĝ_2(x) > ĝ_1(x)) (g_1(x) − g_2(x)) f(x) dx
+ ∫_{x: g_2(x) > g_1(x)} P(ĝ_1(x) > ĝ_2(x)) (g_2(x) − g_1(x)) f(x) dx.    (10)

Theorem 3.3. Under model (6) with ν_1, ν_2 > 0, we have for any ε > 0,

n^{3/2−ε} |R(π_1, π_2) − R_I(π_1, π_2)| → 0  as n → ∞.

To facilitate the presentation, we state the next result in the one-dimensional case first. In this case g_1 and g_2 are functions of a single variable, and we assume that their intersection point θ := (α_2 − α_1)/(β_1 − β_2) is in [0, 1] and β_1 > β_2.
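Theorem 3.1 can be checked numerically. For constant allocations π_1(x) = ν and X ~ U[0,1], both Σ_k are proportional to Q^{-1}, so V(x) is proportional to c(ν) = σ_1²/ν + σ_2²/(1 − ν) for every x; the sketch below (with assumed σ_1, σ_2) minimizes c over a grid and also checks the scalar version of the lower bound (9).

```python
import numpy as np

sigma1, sigma2 = 0.3, 0.6          # assumed response SDs under T_1 and T_2
nus = np.linspace(0.01, 0.99, 9801)

# For constant allocations pi_1(x) = nu, V(x) is proportional to
# c(nu) = sigma1^2 / nu + sigma2^2 / (1 - nu), uniformly in x.
c = sigma1**2 / nus + sigma2**2 / (1 - nus)

nu_star = nus[np.argmin(c)]
print(nu_star, sigma1 / (sigma1 + sigma2))   # both ≈ 1/3: allocation ∝ SDs

# Scalar analogue of (9): c(nu) >= (sigma1 + sigma2)^2, with equality at nu_star.
print(c.min(), (sigma1 + sigma2)**2)
```

The grid minimizer matches σ_1/(σ_1 + σ_2), and the minimal value of c equals (σ_1 + σ_2)², the scalar counterpart of the bound in Lemma 3.2.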
It follows that for x < θ we have g_2(x) > g_1(x), that is, T_2 is the better treatment, and when x > θ, T_1 is better. In the case p = 1 the limit of the regret is simple, given by the following theorem.

Theorem 3.4. Under model (6) with p = 1, if ν_1, ν_2 > 0, then

lim_{n→∞} n R(π_1, π_2) = V(θ) f(θ) / (2(β_1 − β_2)).

It follows that in order to minimize lim_{n→∞} n R(π_1, π_2) we have to minimize V(θ), which by Theorem 3.1 is achieved by taking π_1(x) = π_1*(x) and π_2(x) = π_2*(x).

We remark that by a delta method calculation, the asymptotic variance of θ̂ = (α̂_2 − α̂_1)/(β̂_1 − β̂_2) is V(θ)/(β_1 − β_2)², which is proportional to the limit of n R(π_1, π_2). Thus, minimizing this limit is equivalent to minimizing the asymptotic variance of the estimator of the intersection point θ.

We return to general dimension p. Set β_k^t := (β_{k,1}, ..., β_{k,p}), k = 1, 2, β_{k,−1}^t := (β_{k,2}, ..., β_{k,p}), and without loss of generality let β_{1,1} > β_{2,1}. For given x_{−1} = (x_2, ..., x_p), define

θ := θ(x_{−1}) := (α_2 − α_1 + (β_{2,−1} − β_{1,−1})^t x_{−1}) / (β_{1,1} − β_{2,1}).

If θ ∈ [0, 1], T_2 is better for a covariate vector x satisfying x_1 < θ = θ(x_{−1}), and otherwise T_1 is better. The limit of the regret is given in the following theorem.

Theorem 3.5. Assume model (6) and ν_1, ν_2 > 0; then

lim_{n→∞} n R(π_1, π_2) = (1 / (2(β_{1,1} − β_{2,1}))) ∫_{[0,1]^{p−1}} V(θ, x_{−1}) f(θ, x_{−1}) I{θ ∈ [0, 1]} dx_{−1}.    (11)

Thus, asymptotically, n R(π_1, π_2) is proportional to the integral of the variance of the estimate of the intersection curve θ(x_{−1}), with weights proportional to the density at the intersection points, and by (11) lim_{n→∞} n R(π_1, π_2) is minimized by the allocation π_1*(x), π_2*(x), since this allocation minimizes V uniformly. The following corollary of Theorem 3.3 shows that π_1*, π_2*, which minimize the ideal regret, are also approximately optimal for the regret R(π_1, π_2) itself.

Corollary 3.1. For any ε > 0 there exists C > 0 such that for any allocation functions π_1, π_2 with ν_1, ν_2 > 0,

R(π_1, π_2) ≥ R(π_1*, π_2*) − C/n^{3/2−ε}.

In particular, the optimal design that minimizes R(π_1, π_2) can improve on the allocation π_1*, π_2* by at most C/n^{3/2−ε}. The error term is meaningful as it is of smaller order than the regret, which is of order 1/n according to Theorem 3.5.

To illustrate the asymptotic approximations of the regret and the optimal allocations derived in Section 3.1, consider an example of model (6) with p = 1, where α_2 = 0, β_2 = 1, and α_1, β_1 are such that the intersection point is θ = 0.4; X ~ U[0, 1], and the σ_k are chosen such that R² in both regressions is about 0.2. Figure 1(a) shows the regression lines. For x smaller than θ = 0.4, treatment 1 is better, and otherwise treatment 2 is preferred. The regret (10) is evaluated using simulations and is compared to the ideal regret (8). Two scenarios are considered for the residuals in the regression model: normal and centered exponential. Figures 1(b) and (c) show the regret and ideal regret for n = 100 under the allocation function π_1(x) = ν_1, where ν_1 varies from 0.2 to 0.6. The allocation optimizing the ideal regret is marked with a vertical line. While there is a slight deviation between the ideal and actual regret, the optimal allocations with respect to both are very close. Thus, the approximation of Theorem 3.3 works well also for small n in this example. Figure 1(d) shows n R_I(π_1, π_2) for the balanced design (π_1 = π_2 = 1/2) for large n together with its limit, see Theorem 3.4. The regret converges slowly to its limit, and is close to the limit only when n is quite large.

Figure 1: Figure (a): the regression lines.
Figures (b) and (c) show the regret (computed by a simulation with 95% confidence intervals) and the ideal regret for n = 100 and different allocation ratios, where in (b) the residual is normal and in (c) it is exponential (centered). The regret is calculated for the allocation function π_1(x) = ν_1, where ν_1 varies from 0.2 to 0.6. The optimal allocation is marked by the vertical line. Figure (d) shows n times the ideal regret (blue line) and its limit (gray line).

3.3 Polynomial Regression

We now consider the model of (6) where the functions g_k(x) are assumed to be polynomials of a single continuous covariate X with density f, i.e., g_k(X) = α_k + Σ_{j=1}^J β_{jk} X^j, where J denotes the degree of the polynomial. Let θ_1 < ... < θ_L, where L ≤ J, be the crossing points of g_1(x) and g_2(x), and define θ_0 = 0 and θ_{L+1} = 1. Assume without loss of generality that g_1(x) > g_2(x) for x ∈ (θ_ℓ, θ_{ℓ+1}) if ℓ is odd, and the reverse inequality holds when ℓ is even. Therefore, the regret is

R(π_1, π_2) = Σ_{ℓ odd} ∫_{θ_ℓ}^{θ_{ℓ+1}} P(ĝ_2(x) > ĝ_1(x)) (g_1(x) − g_2(x)) f(x) dx
+ Σ_{ℓ even} ∫_{θ_ℓ}^{θ_{ℓ+1}} P(ĝ_1(x) > ĝ_2(x)) (g_2(x) − g_1(x)) f(x) dx.

Notice that Theorems 3.3 and 3.5, which provide results for multivariate X, do not apply in this case because the random vector X = (X, X², ..., X^J) does not have a joint density. However, using the arguments of Theorem 3.3 (under similar notation and regularity conditions) it can be shown that n^{3/2−ε} |R(π_1, π_2) − R_I(π_1, π_2)| → 0 for any ε > 0, where

R_I(π_1, π_2) = ∫ Φ( −√n |g_1(x) − g_2(x)| / √V(x) ) |g_1(x) − g_2(x)| f(x) dx,    (12)

and now x^t = (1, x, ..., x^J), Q_k = ∫ x x^t f_k(x) dx and Σ_k = (1/ν_k) σ_k² Q_k^{-1}, k = 1, 2, and V(x) = x^t (Σ_1 + Σ_2) x, which is the asymptotic variance of √n(ĝ_1(x) − ĝ_2(x)).

Lemma 3.2 implies that the design with π_1(x) = σ_1/(σ_1 + σ_2) minimizes V(x) uniformly over all x's. It follows that this design is asymptotically optimal also for this problem. By Proposition A.1 (in the Appendix) we have

lim_{n→∞} n R(π_1, π_2) = Σ_{ℓ=1}^L f(θ_ℓ) V(θ_ℓ) / (2 |(β_1 − β_2)^t ζ_ℓ|),  where  ζ_ℓ := (1, 2θ_ℓ, ..., J θ_ℓ^{J−1})^t.

A careful inspection of the proofs of the above results shows that the optimality of the design with π_1(x) = σ_1/(σ_1 + σ_2) generalizes to regression functions of the form g_k(X) = Σ_j β_{jk} h_j(X) for any functions h_j having bounded derivatives.

4 K Treatments and one Covariate

We consider the case p = 1 and g_k(X) = α_k + β_k X, Var(Y | X, T_k) = σ_k², k = 1, ..., K. As in Section 3, we assume the existence of the moment generating function of ε when conditioned on (X, T_k). We also assume that X is continuous with density f(x) supported on [0, 1]. Let θ_k, k = 0, ..., K, be in increasing order, and such that treatment k is best in the interval (θ_{k−1}, θ_k); we assume that each treatment is best in some open interval, or equivalently that the intervals are nonempty. Equations (2) and (5) for the regret and ideal regret imply immediately

R(π_1, ..., π_K) = Σ_{k=1}^K Σ_{m=1}^K ∫_{θ_{m−1}}^{θ_m} P( ĝ_k(x) > max_{ℓ≠k} ĝ_ℓ(x) ) [g_m(x) − g_k(x)] f(x) dx

and the ideal regret is

R_I(π_1, ..., π_K) = Σ_{k=1}^K Σ_{m=1}^K ∫_{θ_{m−1}}^{θ_m} ∫ ∏_{l=1,...,K, l≠k} Φ( ( z √ξ_k(x) + √n [g_k(x) − g_l(x)] ) / √ξ_l(x) ) φ(z) dz [g_m(x) − g_k(x)] f(x) dx.    (13)

Similar to the results in Section 3, we give asymptotic results on the rate of convergence of the regret to the ideal regret R_I, as well as the limit of the regret. Given allocation functions π_k(x), the distribution of the estimated regression coefficients (α̂_k, β̂_k) is approximately bivariate normal with mean (α_k, β_k) and covariance

Σ_k = (σ_k² / (ν_k τ_k²)) [[τ_k² + μ_k², −μ_k], [−μ_k, 1]],    (14)

where we assume positivity of ν_k := ∫ f(x) π_k(x) dx, and denote the mean and variance of the covariate in group k by μ_k := ∫ x f_k(x) dx and τ_k² := ∫ (x − μ_k)² f_k(x) dx, respectively, k = 1, ..., K. Parallel to Theorems 3.3 and 3.4 we have

Theorem 4.1. Under the assumptions in the beginning of Section 4.1, we have for any ε > 0,

lim_{n→∞} n^{3/2−ε} |R(π_1, ..., π_K) − R_I(π_1, ..., π_K)| = 0.

Theorem 4.2. Under the assumptions in the beginning of Section 4.1, we have

lim_{n→∞} n R(π_1, ..., π_K) = Σ_{m=1}^{K−1} V_m(θ_m) f(θ_m) / (2 |β_{m+1} − β_m|),    (15)

where V_m(x) = (1, x) (Σ_m + Σ_{m+1}) (1, x)^t. Using (14) it is possible to write V_m(θ_m) explicitly as follows:

V_m(θ_m) = (σ_m²/ν_m) [1 + (θ_m − μ_m)²/τ_m²] + (σ_{m+1}²/ν_{m+1}) [1 + (θ_m − μ_{m+1})²/τ_{m+1}²].

Theorem 4.2 implies that asymptotically the optimal allocation problem reduces to minimizing a weighted average of the variances of θ̂_1, ..., θ̂_{K−1} (see the discussion after Theorem 3.4), which are the estimates of the intersection points. The weights depend on the β's and on the density f at the intersection points. Note also that V_m(θ_m) is the sum of the asymptotic variances of α̂_m + β̂_m θ_m and α̂_{m+1} + β̂_{m+1} θ_m. Corollary 3.1 continues to hold for K treatments and thus, optimizing the ideal regret approximately optimizes the regret itself.
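The explicit covariance in (14) is just the familiar simple-linear-regression asymptotic covariance (σ_k²/ν_k) Q_k^{-1} written in terms of the group mean μ_k and variance τ_k². A quick numeric check, with assumed illustrative values:

```python
import numpy as np

# Assumed illustrative values for sigma_k^2, nu_k, mu_k, tau_k^2:
sigma2, nu, mu, tau2 = 0.25, 0.4, 0.7, 0.05

# Q_k = E[(1, X)^t (1, X)] under f_k, whose inverse gives the OLS covariance.
Q = np.array([[1.0, mu], [mu, tau2 + mu**2]])
Sigma = (sigma2 / nu) * np.linalg.inv(Q)

# The displayed matrix (14):
Sigma14 = sigma2 / (nu * tau2) * np.array([[tau2 + mu**2, -mu], [-mu, 1.0]])
print(np.abs(Sigma - Sigma14).max())  # ≈ 0: the two expressions agree
```

The agreement is exact up to floating point, since det(Q_k) = τ_k², which is how the 1/τ_k² factor in (14) arises.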
By (13), (14), and (15) (see also (3)), both the ideal and the asymptotic regret depend on the functions π_k(x) only via the quantities ν_k, μ_k, τ_k², k = 1, ..., K. Therefore, we can minimize (13) and (15) in these parameters to obtain a lower bound for the regret. Let μ := ∫ x f(x) dx, τ² := ∫ (x − μ)² f(x) dx, and recall that f(x) = Σ_{k=1}^K ν_k f_k(x). Then, by the relations for central moments of mixture distributions, we have

Σ_{k=1}^K ν_k = 1,  Σ_{k=1}^K ν_k μ_k = μ,  Σ_{k=1}^K ν_k (τ_k² + μ_k²) = τ² + μ².    (16)

Minimizing (13) or (15) in ν_k, μ_k, τ_k², k = 1, ..., K, subject to the constraints (16) and ν_k, τ_k² ≥ 0, yields a lower bound for the achievable regret. To obtain the minimum regret (instead of a lower bound), one needs to add additional constraints on μ_k, τ_k², k = 1, ..., K, to restrict optimization to values for which there exist mixture components f_k with weighted sum f that assume these moments.

We now consider an example with uniform f and K = 3, and minimize the asymptotic regret (15). Since ν_k f_k(x) = f(x) π_k(x) ≤ f(x) = 1 on [0, 1], we have f_k(x) ≤ 1/ν_k, k = 1, 2, 3. For given values of (α_1, α_2, α_3) and (β_1, β_2, β_3), and equal σ_k, k = 1, 2, 3, we minimized (15) in ν_k, μ_k, τ_k², k = 1, ..., K, subject to the constraints (16) as well as τ_k² ≥ ν_k²/12, k = 1, 2, 3. In the resulting solution, τ_k² = ν_k²/12 for k = 1, 3. Applying Lemma 1 with c = 1/ν_k for k = 1, 3, the only density f_k(x) with mean μ_k and variance τ_k², and satisfying the constraint f_k(x) ≤ 1/ν_k, x ∈ [0, 1], is the uniform distribution on [μ_k − ν_k/2, μ_k + ν_k/2], with density f_k(x) = 1_{[μ_k − ν_k/2, μ_k + ν_k/2]}(x)/ν_k. Furthermore, we set

f_2(x) = (1 − ν_1 f_1(x) − ν_3 f_3(x)) / ν_2.

Because the supports of f_1, f_3 are disjoint and f_k(x) ≤ 1/ν_k, k = 1, 3, f_2(x) is a valid density. By construction, the mean and variance of f_2(x) are μ_2, τ_2², and the three densities satisfy f(x) = Σ_{k=1}^3 ν_k f_k(x). Thus, the resulting allocation probabilities π_k(x) = ν_k f_k(x) achieve the lower bound computed above, and hence, assuming the minimization algorithm yielded the correct minimum (which we verified in several ways, including the fact that it is unique), it is an optimal allocation (see Figure 2 for a plot). It follows that in this case the above constraints suffice. Furthermore, as f_1, f_3 are unique, they define a unique optimal allocation (under the assumption that the minimization in ν_k, μ_k, τ_k², k = 1, ..., K, gives a unique solution). It also follows that the optimal solution leads to a deterministic allocation.

Figure 2: The functions g_k for the three treatments (left) and the optimal allocation probabilities minimizing (15) (as a stacked area chart, right) for the example in Section 4.2. For instance, for x's under the black area, treatment 1 is assigned with probability π_1(x) = 1.

Some comments: (1) To implement the numerical optimization of (15) subject to the constraints, we use the R command optim with the following reparametrization, which allows one to apply optimization algorithms for unrestricted optimization. Let A_i = (a_{i,1}, a_{i,2}) ∈ R², i = 1, 2, 3. First we transform the vectors A_i into proportions, setting a'_{ik} = exp(a_{i,k}) / Σ_{m=1}^3 exp(a_{i,m}), i = 1, 2, 3, k = 1, 2, 3, where a_{i,3} ≡ 0, and set

ν_k = a'_{1k},  μ_k = a'_{2k} μ / ν_k,  τ_k² = ν_k²/12 + a'_{3k} ( μ² + τ² − Σ_{m=1}^3 (ν_m μ_m² + ν_m³/12) ) / ν_k,  k = 1, 2, 3.

As long as the second summand remains positive, this parametrization ensures that τ_k² ≥ ν_k²/12 and that the constraints of (16) hold.
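A sketch of this softmax reparametrization, checking that the moment constraints (16) hold by construction for arbitrary free parameters (the random A below is only an illustration, not a fitted design):

```python
import numpy as np

rng = np.random.default_rng(2)

# Overall covariate mean and variance for f uniform on [0, 1]:
mu, tau2 = 0.5, 1.0 / 12.0

A = rng.normal(size=(3, 2))   # free parameters A_i = (a_{i,1}, a_{i,2}), i = 1, 2, 3

def props(a):
    # proportions via softmax with the third coordinate fixed at 0
    e = np.exp(np.append(a, 0.0))
    return e / e.sum()

nu = props(A[0])
mu_k = props(A[1]) * mu / nu
slack = mu**2 + tau2 - np.sum(nu * mu_k**2 + nu**3 / 12.0)
tau2_k = nu**2 / 12.0 + props(A[2]) * slack / nu

# Constraints (16): weights, mixture mean, mixture second moment
print(nu.sum(), np.sum(nu * mu_k), np.sum(nu * (tau2_k + mu_k**2)))
```

Each softmax allocates one of the three mixture budgets (mass, mean, excess second moment) across the groups, which is why (16) holds identically in the free parameters; when the slack term is positive, τ_k² ≥ ν_k²/12 holds as well.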
To confirm the robustness of the solution, the optimization is performed repeatedly with randomly chosen initial conditions. (2) If the intersection points θ_{k,k+1} of the lines g_k move closer together, so do the intervals of x's where all patients are allocated to Treatment 1 or Treatment 3 (the blue and black regions in Figure 2). For scenarios where these intervals overlap, we can no longer use the above approach to obtain an optimal solution, and the solutions seem to become more complex; this is the case, for example, for other values of α_k and β_k, k = 1, 2, 3.

Lemma 1. Let c > 0. Among all continuous distributions f with (i) mean μ and (ii) f(x) ≤ c for all x ∈ R, the uniform distribution on [μ − 1/(2c), μ + 1/(2c)] has the smallest variance. This minimum variance is 1/(12c²).

To demonstrate our approach, we consider a study by Ebbeling et al. (2018) in which three diets were compared in a parallel group design. We fitted a model to the data of that trial and derived an alternative optimal design based on the estimated model parameters. Patients were randomly assigned to three diet groups. The diets differ in their carbohydrate content, high (60%), moderate (40%), or low (20%), and were given for 20 weeks. The total number of subjects allocated to these three treatment groups was 54, 53 and 57, respectively, so the design is (almost) balanced. The main covariate is insulin secretion (insulin concentration 30 minutes after oral glucose). The primary outcome in the original trial was averaged total energy expenditure over two measurements, in the middle and at the end of the trial. Total energy expenditure is measured by doubly labeled water; for details on this measure see Hills et al. (2014). Of the 1685 subjects initially screened, 234 participated in a run-in phase which preceded the trial itself.
Of these, 164 achieved a target weight loss of 12% and were randomized to the three diets. The model we fit for treatment T_k, k = 1, 2, 3, is

Y = α_k + β_k X + ε_k − cost_k,    (17)

where cost_k = 0, 150, 300 for k = 1, 2, 3, respectively, X is insulin secretion and its distribution is Gamma(3.12, 0.02), and the standard deviations of ε_1, ε_2, ε_3 are 190, 150, 130, respectively. The design we propose is based on assumed values of α_k, which are 40, −80, −240, and of β_k, which are −0.8, 0.1, and 0.8, for k = 1, 2, 3, respectively. These values, the parameters of the gamma distribution of X, and the standard deviations of the ε's are all based on Figure 4 of Ebbeling et al. (2018), which plots the energy expenditure as a function of the insulin secretion for the three diets in the data. In particular, the parameters of the Gamma distribution are chosen to match empirical quantiles of the X's.

In order to numerically find the optimal allocation probabilities π_k(x) that minimize the regret, we use a certain parameterization. Let a = (a_0, ..., a_m), and set h_a(x) = exp(Σ_{j=0}^m a_j x^j). Furthermore, let A denote a 2 × (m + 1) matrix of real numbers with rows A_1, A_2, and set h_{A_3}(x) ≡ 1. Let

π_k(x) = h_{A_k}(x) / Σ_{j=1}^3 h_{A_j}(x),  k = 1, 2, 3.

Now we can approximately minimize the (ideal) regret by plugging π_k(x) into (13) and minimizing with respect to A. In this example we chose m = 4 and use the R function optim for the numerical optimization.

Figure 3(a) plots the regression lines (17) for the three diets and the assumed distribution of the covariate x. The high, moderate, and low carbohydrate diets are optimal when X belongs to the intervals (0, 133), (133, 229), (229, ∞), respectively. The difference between the treatments is more pronounced for extreme values of X.
Hence, people with very small or very large $X$ would benefit more from a personalized treatment choice than those with medium values of $X$.

Figures 3(b)-(d) present the optimal allocation probabilities, as a stacked area chart, for $n = 164, 1000, \infty$. When the sample size is larger, the optimal allocation rule is closer to being deterministic. For $n = 164$, which was the actual trial size, the optimal trial allocation rule tends to allocate subjects to $T_3$ (respectively, $T_1$) for large (respectively, small) values of $X$. Compared to equal allocation, which was the actual design used, there is a reduction of about 11% in the regret. A saving of 11% may be quite significant if the treatments are applied to a large number of future patients. Furthermore, the regret of the balanced design with $n = 164$ can be achieved under the optimal design with $n = 143$, which represents a reduction of 13% in the sample size. Taking into account that in this experiment only about 10% of recruited patients passed the screening process, a saving of about 20 subjects amounts to a reduction of about 200 subjects that would have to be recruited and pretested.

To investigate whether the reduction in the regret is due to the unbalanced allocation or to the covariate-dependent allocation, we also computed the regret for the optimal unbalanced allocation, which is restricted to allocation probabilities that do not depend on the covariate. We refer to this design as unbalanced. It can be obtained as above by setting $m = 0$. We found numerically that the optimal unbalanced design for $n = 164$ leads to a reduction of the regret by about 8%, whereas optimizing with allocation functions that depend on the covariate leads to a reduction by an additional 3%. For larger sample sizes, it appears that the gain from optimized covariate-dependent allocation probabilities increases (see Table 1).
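The parameterization of the allocation probabilities can be sketched in code. The following is our illustration only (the paper carried out the actual optimization with the R function optim): $\pi_k(x)$ is a softmax of polynomial logits with $h_{A_1} \equiv 1$, and $m = 0$ gives covariate-independent ("unbalanced") probabilities. The helper `allocation_probs` and the matrix `A0` are hypothetical; `A0` is chosen so that the $m = 0$ probabilities match the roughly (45.3, 36.7, 18.3)% allocation reported for $n = 164$ in Table 1:

```python
import numpy as np

def allocation_probs(A, x):
    """Allocation probabilities pi_1, pi_2, pi_3 at covariate value x,
    with pi_k(x) = h_{A_k}(x) / sum_j h_{A_j}(x), h_a(x) = exp(sum_j a_j x^j),
    and h_{A_1}(x) = 1 by convention."""
    A = np.asarray(A, dtype=float)                   # shape (2, m+1): rows A_2, A_3
    m = A.shape[1] - 1
    powers = np.array([x**j for j in range(m + 1)])  # basis 1, x, ..., x^m
    logits = np.concatenate(([0.0], A @ powers))     # h_{A_1} = exp(0) = 1
    w = np.exp(logits - logits.max())                # numerically stable softmax
    return w / w.sum()

# m = 0: allocation probabilities that do not depend on the covariate.
A0 = np.log(np.array([[0.367], [0.183]]) / 0.453)
p = allocation_probs(A0, x=150.0)
print(p)   # approximately [0.453, 0.367, 0.183]
```

For $m \ge 1$ the same function yields covariate-dependent probabilities, and minimizing the regret over the entries of $A$ with a general-purpose optimizer gives designs of the kind shown in Figure 3.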
As the algorithm only gives an approximately optimal solution, we also used the approach in Section 4.2 to compute a lower bound for the limit of $nR_I$ as $n \to \infty$. The obtained lower bound of 735.1 is close to 738.2, the corresponding regret obtained by the derived allocation function depicted in Figure 3(d). It should be noticed that all these designs are based on the true values of the parameters and therefore require a preliminary study, or an adaptive design.

                         % reduction            optimized allocation
  n        nR_I      optimal   unbalanced    high    moderate    low
  164      842.8     10.9      7.9           45.3    36.7        18.3
  1000     829.6     13.1      4.4           36.8    39.8        23.7
  ∞        738.2     ...       ...           ...     ...         ...

Table 1: $nR_I$, the reduction of the regret of the optimal design compared to a balanced design, the reduction of the regret of a design with optimized allocation ratios only (where allocation probabilities do not depend on $X$) compared to the balanced design, and the optimized allocation probabilities in percent (for the design where allocation probabilities do not depend on $X$).

Figure 3: Figure (a) plots the regression models (17) and the assumed density of the covariate. Figures (b)-(d) present the optimal allocation probabilities (as a stacked area chart) when $n = 164, 1000, \infty$.

Appendix: Proofs

Proof of Lemma 3.2

We use the convexity relation $\lambda A^{-1} + (1-\lambda) B^{-1} \succeq [\lambda A + (1-\lambda) B]^{-1}$ (Moore, 1973), where $A$ and $B$ are positive definite matrices and $0 \le \lambda \le 1$. We have
$$\Sigma_1 + \Sigma_2 = \sigma_1 \left(\frac{\nu_1 Q_1}{\sigma_1}\right)^{-1} + \sigma_2 \left(\frac{\nu_2 Q_2}{\sigma_2}\right)^{-1} \succeq (\sigma_1 + \sigma_2)\left[\frac{\sigma_1}{\sigma_1 + \sigma_2}\left(\frac{\nu_1 Q_1}{\sigma_1}\right) + \frac{\sigma_2}{\sigma_1 + \sigma_2}\left(\frac{\nu_2 Q_2}{\sigma_2}\right)\right]^{-1} = (\sigma_1 + \sigma_2)^2\, Q^{-1},$$
where in the last equality we used the fact that $\nu_1 Q_1 + \nu_2 Q_2 = Q$. Equality holds when the two matrices to the left of the $\succeq$ sign are equal, which happens when $\pi_1(x) = \frac{\sigma_1}{\sigma_1 + \sigma_2}$ and $\pi_2(x) = \frac{\sigma_2}{\sigma_1 + \sigma_2}$.

Proof of Theorem 3.3

We start with the case $p = 1$.
It is enough to show (as the other part of the integral, from $\theta$ to 1, is symmetric) that
$$n^{1/2-\varepsilon}\int_0^{\theta}\left|\, P(\hat g_2(x) > \hat g_1(x)) - \Phi\left(\frac{-\sqrt n\,[g_1(x)-g_2(x)]}{\sqrt{V(x)}}\right)\right| [g_1(x)-g_2(x)]\, f(x)\,dx \to 0. \qquad (18)$$
The idea of the proof is quite simple. For $x$ within order $1/\sqrt n$ of $\theta$, the quantity in absolute value converges to zero, and the range of the integral and $g_1(x) - g_2(x)$ are each of order $1/\sqrt n$. For other $x$'s, both the probability and the $\Phi$ term in the integrand are exponentially small in $n$.

We will prove (18) by showing that
$$n^{1/2-\varepsilon}\int_0^{\theta}\left|\, P(\hat g_2(x) > \hat g_1(x)) - E\,\Phi\left(\frac{-\sqrt n\,(\beta_2-\beta_1)(\theta-x)}{\sqrt{S(x)}}\right)\right| [g_1(x)-g_2(x)]\, f(x)\,dx \to 0, \qquad (19)$$
and
$$n^{1/2-\varepsilon}\int_0^{\theta}\left|\, E\,\Phi\left(\frac{-\sqrt n\,(\beta_2-\beta_1)(\theta-x)}{\sqrt{S(x)}}\right) - \Phi\left(\frac{-\sqrt n\,[g_1(x)-g_2(x)]}{\sqrt{V(x)}}\right)\right| [g_1(x)-g_2(x)]\, f(x)\,dx \to 0, \qquad (20)$$
where
$$S(x) = n\,(1,x)\left[\sigma_1^2\Big(\sum_{i\in T_1}\binom{1}{X_i}(1,X_i)\Big)^{-1} + \sigma_2^2\Big(\sum_{i\in T_2}\binom{1}{X_i}(1,X_i)\Big)^{-1}\right]\binom{1}{x}, \qquad (21)$$
and the expectation in (19) and (20) is with respect to $X_1, \ldots, X_n$, which appear in $S(x)$ and the random sets $T_k$. We start with (20). Since $\theta$ is the intersection point, $\alpha_1 + \beta_1\theta = \alpha_2 + \beta_2\theta$; therefore, $g_1(x) - g_2(x) = \alpha_1 + \beta_1 x - (\alpha_2 + \beta_2 x) = (\beta_2-\beta_1)(\theta-x)$. Hence, (20) can be written as
$$n^{1/2-\varepsilon}\int_0^{\theta}\left|\, E\,\Phi\left(\frac{-\sqrt n\,(\beta_2-\beta_1)(\theta-x)}{\sqrt{S(x)}}\right) - \Phi\left(\frac{-\sqrt n\,(\beta_2-\beta_1)(\theta-x)}{\sqrt{V(x)}}\right)\right| (\beta_2-\beta_1)(\theta-x) f(x)\,dx \to 0.$$
Fix $c_n = n^{\varepsilon/2}$.
We will show that
$$n^{1/2-\varepsilon}\int_0^{\theta - c_n/\sqrt n}\left|\, E\,\Phi\left(\frac{-\sqrt n\,(\beta_2-\beta_1)(\theta-x)}{\sqrt{S(x)}}\right) - \Phi\left(\frac{-\sqrt n\,(\beta_2-\beta_1)(\theta-x)}{\sqrt{V(x)}}\right)\right| (\beta_2-\beta_1)(\theta-x) f(x)\,dx \to 0, \qquad (22)$$
and that
$$n^{1/2-\varepsilon}\int_{\theta - c_n/\sqrt n}^{\theta}\left|\, E\,\Phi\left(\frac{-\sqrt n\,(\beta_2-\beta_1)(\theta-x)}{\sqrt{S(x)}}\right) - \Phi\left(\frac{-\sqrt n\,(\beta_2-\beta_1)(\theta-x)}{\sqrt{V(x)}}\right)\right| (\beta_2-\beta_1)(\theta-x) f(x)\,dx \to 0. \qquad (23)$$
We show (22) by arguing that the probabilities in (22) are exponentially small. With the notation defined in Section 2 we have that
$$V(x) \le (1+x^2)\left(\frac{\sigma_1^2}{\nu_1\,\lambda_{\min}(Q_1)} + \frac{\sigma_2^2}{\nu_2\,\lambda_{\min}(Q_2)}\right) \le C,$$
for a constant $C > 0$ (recall that $x$ is bounded). With the bound $\Phi(-t) \le \exp(-t^2/2)$ for $t > 0$, and since $(\theta-x)\sqrt n \ge c_n$ in the range of (22), we have
$$\Phi\left(\frac{-(\beta_2-\beta_1)(\theta-x)\sqrt n}{\sqrt{V(x)}}\right) \le \exp\left(\frac{-(\beta_2-\beta_1)^2 c_n^2}{2C}\right),$$
which shows that the second normal term $\Phi(\cdot)$ in (22) is exponentially small. We now argue that the expectation in (22) is also exponentially small. Consider the event
$$A_n := \left\{\lambda_{\min}\Big(\frac1n\sum_{i\in T_1}\binom{1}{X_i}(1,X_i)\Big) \le \frac{\nu_1\lambda_{\min}(Q_1)}{2}\right\} \cup \left\{\lambda_{\min}\Big(\frac1n\sum_{i\in T_2}\binom{1}{X_i}(1,X_i)\Big) \le \frac{\nu_2\lambda_{\min}(Q_2)}{2}\right\}. \qquad (24)$$
Writing $\frac1n\sum_{i\in T_k}\binom{1}{X_i}(1,X_i) = \frac1n\sum_{i=1}^n \binom{1}{X_i}(1,X_i)\,I(i\in T_k)$, a sum of iid terms, we can apply Theorem 5.1 of Tropp (2012) and the union bound to conclude that $P(A_n) \le C\exp(-Cn)$ for some constant $C > 0$. Now,
$$E\,\Phi\left(\frac{-\sqrt n\,(\beta_2-\beta_1)(\theta-x)}{\sqrt{S(x)}}\right) = E\,\Phi(\cdot)\,I(A_n) + E\,\Phi(\cdot)\,I(A_n^c) \le P(A_n) + E\,\Phi\left(\frac{-\sqrt n\,(\beta_2-\beta_1)(\theta-x)}{\sqrt{S(x)}}\right) I(A_n^c).$$
When $A_n^c$ occurs, the denominator $\sqrt{S(x)}$ is bounded, and the same argument as above shows that this probability is exponentially small.
We conclude that both terms in (22) are exponentially small, which implies that (22) holds. We now prove (23). By the mean value theorem, for $x \in (\theta - c_n/\sqrt n,\ \theta)$ and for some $V^*(x)$ between $S(x)$ and $V(x)$, we have for some $C > 0$
$$\left|\,\Phi\left(\frac{-\sqrt n\,(\beta_2-\beta_1)(\theta-x)}{\sqrt{S(x)}}\right) - \Phi\left(\frac{-\sqrt n\,(\beta_2-\beta_1)(\theta-x)}{\sqrt{V(x)}}\right)\right| = |S(x)-V(x)|\;\varphi\left(\frac{-\sqrt n\,(\beta_2-\beta_1)(\theta-x)}{\sqrt{V^*(x)}}\right)\frac{\sqrt n\,(\beta_2-\beta_1)(\theta-x)}{2}\,(V^*(x))^{-3/2} \le C\,|S(x)-V(x)|\;c_n\,(\min\{S(x), V(x)\})^{-3/2},$$
and
$$S(x) - V(x) = (1,x)\left\{\frac{\sigma_1^2}{\nu_1}\big(M_1^{-1} - Q_1^{-1}\big) + \frac{\sigma_2^2}{\nu_2}\big(M_2^{-1} - Q_2^{-1}\big)\right\}\binom{1}{x}, \qquad (25)$$
where $M_k = \frac{1}{n\nu_k}\sum_{i\in T_k}\binom{1}{X_i}(1,X_i)$, $k = 1, 2$. We have that $M_k^{-1} - Q_k^{-1} = M_k^{-1}(Q_k - M_k)Q_k^{-1}$, implying that the difference of the inverses is small when the difference of the matrices is small and the inverses are bounded. Define the event
$$\widetilde A_n := \big\{\max\{\|Q_1 - M_1\|_\infty,\ \|Q_2 - M_2\|_\infty\} \ge c_n/\sqrt n\big\},$$
where $\|\cdot\|_\infty$ denotes the maximum (entry-wise) norm. By Hoeffding's inequality (since the random variables are bounded), $P(\widetilde A_n) \le C\exp(-C c_n^2)$, which is exponentially small. Therefore, as before, this event can be ignored. On the complement of $\widetilde A_n$, the matrices $M_1^{-1}$, $M_2^{-1}$, and hence $S(x)$, are all bounded, and by similar arguments $S(x)$ and $V(x)$ are bounded below. Hence, when $\widetilde A_n^c$ occurs,
$$\left|\,\Phi\left(\frac{-\sqrt n\,(\beta_2-\beta_1)(\theta-x)}{\sqrt{S(x)}}\right) - \Phi\left(\frac{-\sqrt n\,(\beta_2-\beta_1)(\theta-x)}{\sqrt{V(x)}}\right)\right| \le C\,\frac{c_n^2}{\sqrt n}.$$
Therefore
$$n^{1/2-\varepsilon}\,C\,\frac{c_n^2}{\sqrt n}\int_{\theta - c_n/\sqrt n}^{\theta}(\beta_2-\beta_1)(\theta-x) f(x)\,dx \le C n^{-\varepsilon} c_n^2\,\frac{c_n}{\sqrt n}\int_{\theta - c_n/\sqrt n}^{\theta} f(x)\,dx \le C\,\frac{c_n^4}{n^{1+\varepsilon}} \to 0,$$
which completes the proof of (23) and hence of (20).

We now show (19) by dividing the integral into two ranges: (a) $[\theta - c_n/\sqrt n,\ \theta]$, and (b) $[0,\ \theta - c_n/\sqrt n]$. We have that
$$P(\hat g_2(x) > \hat g_1(x)) = P\left(\left[\binom{\hat\alpha_2 - \alpha_2}{\hat\beta_2 - \beta_2} - \binom{\hat\alpha_1 - \alpha_1}{\hat\beta_1 - \beta_1}\right]^t \binom{1}{x} > (\beta_2-\beta_1)(\theta-x)\right) = P\left(\left[M_2^{-1}\,\frac{1}{n\nu_2}\sum_{i\in T_2}\binom{\varepsilon_i}{X_i\varepsilon_i} - M_1^{-1}\,\frac{1}{n\nu_1}\sum_{i\in T_1}\binom{\varepsilon_i}{X_i\varepsilon_i}\right]^t \binom{1}{x} > (\beta_2-\beta_1)(\theta-x)\right) = P\left(\sum_{i=1}^n a_i\varepsilon_i > (\beta_2-\beta_1)(\theta-x)\right),$$
where
$$a_i = (1,X_i)\,\frac{M_2^{-1}}{n\nu_2}\binom{1}{x} I(i\in T_2) - (1,X_i)\,\frac{M_1^{-1}}{n\nu_1}\binom{1}{x} I(i\in T_1),$$
$M_1, M_2$ are defined in (25), and the $\varepsilon_i$ are the regression error terms. Notice that $Var(\varepsilon_i)$ is equal to $\sigma_1^2$ if $i\in T_1$ and to $\sigma_2^2$ if $i\in T_2$. We are going to condition on $X_1, \ldots, X_n$ obtained in the experiment and their allocation, denoted together by $D$. Then $Var\big(\sum_{i=1}^n a_i\varepsilon_i \mid D\big) = S(x)/n$, where $S(x)$ is defined in (21).

We aim to apply the Berry-Esseen theorem, for which we need some bounds. The values $n a_i$ are bounded below since the matrices $M_k$ are bounded above. In order to obtain upper bounds, recall the event $A_n$ defined in (24). On the set $A_n^c$ the $n a_i$ are bounded above. Since $P(A_n)$ is exponentially small, we can ignore it, and assume that the $n a_i$ are bounded with probability one. We can apply the Berry-Esseen theorem for the non-identically distributed case (for a convenient reference see Chen et al.
(2010), (3.27)) to $\sum_{i=1}^n \frac{\sqrt n\, a_i}{\sqrt{S(x)}}\,\varepsilon_i \equiv \sum_{i=1}^n V_i$, where $\sum_{i=1}^n Var(V_i \mid D) = 1$ and $E(|V_i|^3 \mid D) \le \max\big\{|n a_i|^3\, S(x)^{-3/2}\, E(|\varepsilon_i|^3 \mid D) : 1 \le i \le n\big\}\, n^{-3/2}$, to obtain, by unconditioning on $D$,
$$\left|\, P\left(\sum_{i=1}^n a_i\varepsilon_i > (\beta_2-\beta_1)(\theta-x)\right) - E\,\Phi\left(\frac{-\sqrt n\,(\beta_2-\beta_1)(\theta-x)}{\sqrt{S(x)}}\right)\right| \le C/\sqrt n.$$
Hence
$$n^{1/2-\varepsilon}\int_{\theta - c_n/\sqrt n}^{\theta}\left|\, P(\hat g_2(x) > \hat g_1(x)) - E\,\Phi\left(\frac{-\sqrt n\,(\beta_2-\beta_1)(\theta-x)}{\sqrt{S(x)}}\right)\right| (\beta_2-\beta_1)(\theta-x) f(x)\,dx \le n^{1/2-\varepsilon}\,\frac{C}{\sqrt n}\int_{\theta - c_n/\sqrt n}^{\theta}(\beta_2-\beta_1)(\theta-x) f(x)\,dx \le n^{1/2-\varepsilon}\,\frac{C}{\sqrt n}\cdot\frac{c_n}{\sqrt n}\int_{\theta - c_n/\sqrt n}^{\theta} f(x)\,dx \le \frac{C c_n}{n^{\varepsilon}\sqrt n} \le C n^{-\varepsilon/2} \to 0. \qquad (26)$$
By a standard large deviation bound, the integral in (19) from 0 to $\theta - c_n/\sqrt n$, multiplied by $n^{1/2-\varepsilon}$, is easily shown to be of order $n^{1/2-\varepsilon}\, O(\exp(-C c_n^2)) \to 0$. This proves the theorem for covariates in $\mathbb{R}$ ($p = 1$).

We now consider the case of covariates in $\mathbb{R}^p$ for $p \ge 2$. Suppose, without loss of generality, that $\beta_{1,1} > \beta_{2,1}$. For given $x_- = (x_2, \ldots, x_p)$ define
$$\theta(x_-) := \frac{\alpha_2 - \alpha_1 + (\beta_{2,-1} - \beta_{1,-1})^t x_-}{\beta_{1,1} - \beta_{2,1}}.$$
For $x$ such that $\theta(x_-) \notin [0,1]$, the probability of making a mistake (i.e., of $\hat g_1(x) - \hat g_2(x)$ having the wrong sign) is exponentially small. Therefore,
$$R(\pi_1, \pi_2) = \int_{[0,1]^{p-1}}\int_0^{\theta(x_-)} P(\hat g_1(x) > \hat g_2(x))\,(g_2(x)-g_1(x)) f(x)\,dx_1\; I(\theta(x_-)\in[0,1])\,dx_- + \int_{[0,1]^{p-1}}\int_{\theta(x_-)}^1 P(\hat g_2(x) > \hat g_1(x))\,(g_1(x)-g_2(x)) f(x)\,dx_1\; I(\theta(x_-)\in[0,1])\,dx_- + a_n,$$
where $a_n$ is exponentially small. By a slight variation on the above one-dimensional case, the inner integral equals
$$\int_0^1 \Phi\left(\frac{-\sqrt n\,|g_1(x)-g_2(x)|}{\sqrt{V(x)}}\right)|g_1(x)-g_2(x)|\, f(x)\,dx_1\; I(\theta(x_-)\in[0,1]) + o(1/n^{1/2-\varepsilon}), \qquad (27)$$
where the error term $o(\cdot)$ is uniform in $x_-$.
Taking the outer integral, the theorem follows.

Proof of Theorem 3.4

As computed two lines below (21), $g_1(x) - g_2(x) = (\beta_2-\beta_1)(\theta-x)$. We will show that, with $t$ a continuous variable,
$$\lim_{t\to\infty}\; t\int_0^{\theta}\Phi\left(\frac{-\sqrt t\,(\beta_2-\beta_1)(\theta-x)}{\sqrt{V(x)}}\right)(\beta_2-\beta_1)(\theta-x) f(x)\,dx = \frac{V(\theta) f(\theta)}{4(\beta_2-\beta_1)}. \qquad (28)$$
The integral from $\theta$ to 1 is similar, yielding the desired limit. By L'Hôpital's rule the limit in (28) equals
$$\lim_{t\to\infty}\; t^2\int_0^{\theta}\varphi\left(\frac{\sqrt t\,(\beta_2-\beta_1)(\theta-x)}{\sqrt{V(x)}}\right)\frac{(\beta_2-\beta_1)(\theta-x)}{2\sqrt t\,\sqrt{V(x)}}\,(\beta_2-\beta_1)(\theta-x) f(x)\,dx.$$
Replacing $x$ by $y = (\theta-x)\sqrt t$ in the integral yields
$$\lim_{t\to\infty}\;\frac12\int_0^{\theta\sqrt t}\varphi\left(\frac{(\beta_2-\beta_1)\,y}{\sqrt{V(\theta - y/\sqrt t)}}\right)\frac{(\beta_2-\beta_1)^2\, y^2}{\sqrt{V(\theta - y/\sqrt t)}}\, f(\theta - y/\sqrt t)\,dy.$$
The limit is
$$\frac12\int_0^{\infty}\varphi\left(\frac{(\beta_2-\beta_1)\,y}{\sqrt{V(\theta)}}\right)\frac{(\beta_2-\beta_1)^2\, y^2}{\sqrt{V(\theta)}}\, f(\theta)\,dy.$$
Substitution of $y$ with $z = (\beta_2-\beta_1)\, y/\sqrt{V(\theta)}$ in the integral yields the claimed limit
$$\frac{f(\theta) V(\theta)}{2(\beta_2-\beta_1)}\int_0^{\infty}\varphi(z) z^2\,dz = \frac{f(\theta) V(\theta)}{4(\beta_2-\beta_1)},$$
using $\int_0^{\infty}\varphi(z) z^2\,dz = 1/2$.

Proof of Theorem 3.5

Consider the inner integral in (27). By Theorem 3.4, the limit of the inner integral times $n$ is
$$\frac{V(\theta, x_-)\, f(\theta, x_-)}{2(\beta_{1,1} - \beta_{2,1})}.$$
Taking the outer integral, the result follows.

The limit of the regret for polynomial regression

Suppose that $g_1(x) = \alpha_1 + (x, \ldots, x^J)\,\beta_1$ and $g_2(x) = \alpha_2 + (x, \ldots, x^J)\,\beta_2$. Let $\theta$ be an intersection point, i.e., $g_1(\theta) = g_2(\theta)$. We have
$$g_1(x) - g_2(x) = g_1(x) - g_1(\theta) - [g_2(x) - g_2(\theta)] = \eta^t(\beta_1 - \beta_2),$$
where $\eta := (x-\theta,\ x^2-\theta^2,\ \ldots,\ x^J-\theta^J)^t$. The following proposition is parallel to Theorem 3.4 (in the one-dimensional case) and provides the limit of the regret; see (12). The proof is similar, although some modifications are required.

Proposition .1.
As $t \to \infty$ through continuous values,
$$\lim_{t\to\infty}\; t\int_0^{\theta}\Phi\left(\frac{-\sqrt t\,(\beta_1-\beta_2)^t\eta}{\sqrt{V(x)}}\right)(\beta_1-\beta_2)^t\eta\; f(x)\,dx = \frac{f(\theta) V(\theta)}{4\,|(\beta_1-\beta_2)^t\zeta|}, \qquad (29)$$
where $\zeta := (1,\ 2\theta,\ \ldots,\ J\theta^{J-1})^t$.

Proof. By L'Hôpital's rule the limit in (29) equals
$$\lim_{t\to\infty}\; t^2\int_0^{\theta}\varphi\left(\frac{\sqrt t\,(\beta_1-\beta_2)^t\eta}{\sqrt{V(x)}}\right)\frac{(\beta_1-\beta_2)^t\eta}{2\sqrt t\,\sqrt{V(x)}}\,(\beta_1-\beta_2)^t\eta\; f(x)\,dx.$$
Substituting $y = (\theta - x)\sqrt t$ and writing
$$\sqrt t\,\eta = \sqrt t\begin{pmatrix} x-\theta \\ x^2-\theta^2 \\ \vdots \\ x^J-\theta^J \end{pmatrix} = \sqrt t\begin{pmatrix} x-\theta \\ (x-\theta)(x+\theta) \\ \vdots \\ (x-\theta)(x^{J-1}+x^{J-2}\theta+\cdots+\theta^{J-1}) \end{pmatrix} = -y\,\widetilde\eta,$$
where
$$\widetilde\eta := \big(1,\ x+\theta,\ \ldots,\ x^{J-1}+x^{J-2}\theta+\cdots+\theta^{J-1}\big)^t \quad\text{evaluated at } x = \theta - y/\sqrt t,$$
the integral reads
$$\lim_{t\to\infty}\;\frac12\int_0^{\theta\sqrt t}\varphi\left(\frac{y\,(\beta_2-\beta_1)^t\widetilde\eta}{\sqrt{V(\theta - y/\sqrt t)}}\right)\frac{y^2\,\big[(\beta_2-\beta_1)^t\widetilde\eta\big]^2}{\sqrt{V(\theta - y/\sqrt t)}}\, f(\theta - y/\sqrt t)\,dy.$$
The limit is
$$\frac12\int_0^{\infty}\varphi\left(\frac{y\,(\beta_2-\beta_1)^t\zeta}{\sqrt{V(\theta)}}\right)\frac{y^2\,\big[(\beta_2-\beta_1)^t\zeta\big]^2}{\sqrt{V(\theta)}}\, f(\theta)\,dy.$$
Substitution of $y$ with $z = y\,(\beta_2-\beta_1)^t\zeta/\sqrt{V(\theta)}$ yields
$$\frac{f(\theta) V(\theta)}{2\,|(\beta_2-\beta_1)^t\zeta|}\int_0^{\infty}\varphi(z) z^2\,dz = \frac{f(\theta) V(\theta)}{4\,|(\beta_2-\beta_1)^t\zeta|},$$
which completes the proof.

For the proofs of Theorems 4.1 and 4.2 we need the lemma below and some notation. For every $m$ and $k$, let
$$I_{m,k} := \int_{\theta_{m-1}}^{\theta_m} P\Big(\hat g_k(x) > \max_{\ell\ne k}\hat g_\ell(x)\Big)\,[g_m(x) - g_k(x)]\, f(x)\,dx.$$
We then have $R(\pi_1, \ldots, \pi_K) = \sum_{k=1}^K\sum_{m=1}^K I_{m,k}$.

Lemma .2.
Under the assumptions in the beginning of Section 4.1, the integral $I_{m,k}$ is exponentially small for $k \ne m-1, m, m+1$ (and $I_{m,m} = 0$), and for every $\varepsilon > 0$,
$$\lim_{n\to\infty} n^{1/2-\varepsilon}\left|\, I_{m,m-1} - \int_{\theta_{m-1}}^{\theta_{m-1}+\frac{c_n}{\sqrt n}}\Phi\left(\frac{-\sqrt n\,\{g_m(x) - g_{m-1}(x)\}}{\sqrt{V_{m-1}(x)}}\right)[g_m(x) - g_{m-1}(x)]\, f(x)\,dx\right| = 0, \qquad (30)$$
$$\lim_{n\to\infty} n^{1/2-\varepsilon}\left|\, I_{m,m+1} - \int_{\theta_m - \frac{c_n}{\sqrt n}}^{\theta_m}\Phi\left(\frac{-\sqrt n\,\{g_m(x) - g_{m+1}(x)\}}{\sqrt{V_m(x)}}\right)[g_m(x) - g_{m+1}(x)]\, f(x)\,dx\right| = 0,$$
where $c_n = n^{\varepsilon/2}$ and $V_m(x) = (1,x)\,(\Sigma_m + \Sigma_{m+1})\binom{1}{x}$. Moreover, for every $m$ and $k$,
$$n^{1/2-\varepsilon}\left|\, I_{m,k} - \int_{\theta_{m-1}}^{\theta_m}\int\;\prod_{l=1,\ldots,K,\; l\ne k}\Phi\left(\frac{z\,\xi_k(x) + \sqrt n\,[g_k(x) - g_l(x)]}{\xi_l(x)}\right)\varphi(z)\,dz\;[g_m(x) - g_k(x)]\, f(x)\,dx\right| \to 0,$$
as $n\to\infty$.

Proof of Lemma .2

The proof of the first claim in Lemma .2 is similar to that of Theorem 3.3. Here is a sketch. For $x \in (\theta_{m-1}, \theta_m)$ bounded away from $\theta_{m-1}$ and $\theta_m$, and $k \ne m$, both the quantities
$$P\Big(\hat g_k(x) > \max_{\ell\ne k}\hat g_\ell(x)\Big) \quad\text{and}\quad \int\;\prod_{l=1,\ldots,K,\; l\ne k}\Phi\left(\frac{z\,\xi_k(x) + \sqrt n\,[g_k(x) - g_l(x)]}{\xi_l(x)}\right)\varphi(z)\,dz$$
are exponentially small (for $k = m$ the regret term vanishes). The same holds true for every $x \in (\theta_{m-1}, \theta_m)$ when $k \ne m-1, m, m+1$.

We now prove (30), concerning $I_{m,m-1}$ (where $m > 1$); the proof for $I_{m,m+1}$ is similar. Fix $c_n = n^{\varepsilon/2}$ and consider $x$ such that $x \in (\theta_{m-1},\ \theta_{m-1} + c_n/\sqrt n)$.
For such $x$ we have, by a standard large deviations argument, that
$$P\Big(\hat g_{m-1}(x) > \max_{\ell\ne m-1}\hat g_\ell(x)\Big) = P(\hat g_{m-1}(x) > \hat g_m(x)) + a_n,$$
and
$$\int\;\prod_{l=1,\ldots,K,\; l\ne m-1}\Phi\left(\frac{z\,\xi_{m-1}(x) + \sqrt n\,[g_{m-1}(x) - g_l(x)]}{\xi_l(x)}\right)\varphi(z)\,dz = \Phi\left(\frac{-\sqrt n\,[g_m(x) - g_{m-1}(x)]}{\sqrt{V_{m-1}(x)}}\right) + b_n, \qquad (31)$$
where $a_n$ and $b_n$ are exponentially small (uniformly in $x$). By a Berry-Esseen type bound applied to the first part of the integrand, the smallness of the other part for $x$ in the given range of the integral, and the smallness of the range itself, as in (26), we obtain
$$\int_{\theta_{m-1}}^{\theta_{m-1}+\frac{c_n}{\sqrt n}}\left[P(\hat g_{m-1}(x) > \hat g_m(x)) - \Phi\left(\frac{-\sqrt n\,[g_m(x) - g_{m-1}(x)]}{\sqrt{V_{m-1}(x)}}\right)\right][g_m(x) - g_{m-1}(x)]\, f(x)\,dx = o(1/n^{1/2-\varepsilon}),$$
which proves (30).

Proof of Theorem 4.1

This follows directly from the last part of Lemma .2 and the fact that $R(\pi_1, \ldots, \pi_K) = \sum_{k,m=1}^K I_{m,k}$. Note that we can replace the latter sum by $\sum_{m=1}^K (I_{m,m-1} + I_{m,m+1})$, since $I_{m,k}$ is exponentially small for $k \ne m-1, m, m+1$.

Proof of Theorem 4.2

The first part of Lemma .2 implies that
$$\sum_{k=1}^K\sum_{m=1}^K I_{m,k} = \sum_{m=1}^{K-1}\int_{\theta_m - \frac{c_n}{\sqrt n}}^{\theta_m + \frac{c_n}{\sqrt n}}\Phi\left(\frac{-\sqrt n\,|g_m(x) - g_{m+1}(x)|}{\sqrt{V_m(x)}}\right)|g_m(x) - g_{m+1}(x)|\, f(x)\,dx + o(1/n).$$
The limit of the latter integrals can be calculated using Theorem 3.4, implying the result.

References

Ballarini, N. M., G. K. Rosenkranz, T. Jaki, F. König, and M. Posch (2018). Subgroup identification in clinical trials via the predicted individual treatment effect. PloS One 13(10), e0205971.

Bastani, H. and M. Bayati (2015). Online decision-making with high-dimensional covariates. Available at SSRN 2661896.

Chen, L. H., L. Goldstein, and Q.-M. Shao (2010).
Normal approximation by Stein's method. Springer Science & Business Media.

Chernoff, H. (1953). Locally optimal designs for estimating parameters. Ann. Math. Statist. 24(4), 586-602.

Ebbeling, C. B., H. A. Feldman, G. L. Klein, J. M. W. Wong, L. Bielak, S. K. Steltz, P. K. Luoto, R. R. Wolfe, W. W. Wong, and D. S. Ludwig (2018). Effects of a low carbohydrate diet on energy expenditure during weight loss maintenance: randomized trial. BMJ 363.

Goldenshluger, A. and A. Zeevi (2013). A linear response bandit problem. Stochastic Systems 3(1), 230-261.

Hagberg, L., A. Winkvist, H. K. Brekke, F. Bertz, E. H. Johansson, and E. Huseinovic (2019). Cost-effectiveness and quality of life of a diet intervention postpartum: 2-year results from a randomized controlled trial. BMC Public Health 19(1), 38.

Hernán, M. A. and J. M. Robins (2010). Causal inference.

Hills, A. P., N. Mokhtar, and N. M. Byrne (2014). Assessment of physical activity and energy expenditure: an overview of objective measures. Frontiers in Nutrition 1, 5.

Moore, M. (1973). A convex matrix function. The American Mathematical Monthly 80(4), 408-409.

Pan, Y. and Y.-Q. Zhao (2020). Improved doubly robust estimation in learning optimal individualized treatment rules. Journal of the American Statistical Association 0(0), 1-12.

Qian, M. and S. A. Murphy (2011). Performance guarantees for individualized treatment rules. Annals of Statistics 39(2), 1180.

Silvey, S. (2013). Optimal design: an introduction to the theory for parameter estimation, Volume 1. Springer Science & Business Media.

Tian, L., A. A. Alizadeh, A. J. Gentles, and R. Tibshirani (2014). A simple method for estimating interactions between a treatment and a large number of covariates. Journal of the American Statistical Association 109(508), 1517-1532.

Tropp, J. A. (2012). User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics 12(4), 389-434.