Minimax rates without the fixed sample size assumption
arXiv preprint [math.ST]

BY ALISA KIRICHENKO AND PETER GRÜNWALD

Department of Statistics, University of Oxford, [email protected]
CWI and Mathematical Institute, Leiden University, [email protected]
We generalize the notion of minimax convergence rate. In contrast to the standard definition, we do not assume that the sample size is fixed in advance. Allowing for varying sample size results in time-robust minimax rates and estimators. These can be either strongly adversarial, based on the worst-case over all sample sizes, or weakly adversarial, based on the worst-case over all stopping times. We show that standard and time-robust rates usually differ by at most a logarithmic factor, and that for some (and we conjecture for all) exponential families, they differ by exactly an iterated logarithmic factor. In many situations, time-robust rates are arguably more natural to consider. For example, they allow us to simultaneously obtain strong model selection consistency and optimal estimation rates, thus avoiding the “AIC-BIC dilemma”.
1. Introduction.
Minimax rates are an essential tool for evaluation and comparison of estimators in a wide variety of applications. Classic references on the topic include, among many others, Tsybakov (2009), Wasserman (2006) and Van der Vaart (1998). For a fixed sample size n, the standard minimax rate is computed by first taking the supremum of the expected loss over all parameters (distributions) in the model for each estimator, and then minimizing this value over all possible estimators. Here, we consider a natural extension of this setting in which data comes in sequentially and one does not know n in advance: instead of considering n ≥ 1 fixed, we include it in the worst-case analysis.

At first it may seem that such time-robustness trivializes the problem: a naive approach would be to take n as a parameter just like the distribution and compute the supremum of the expected loss over all sample sizes and all distributions in the model. In most cases the supremum would then be trivially attained for sample size one, since the precision of an estimator tends to improve as the sample size increases. Therefore, another approach has to be taken. We manage to give meaningful definitions by rewriting the standard definition in terms of a ratio. The precise new definitions, given in Section 2.3, come in two forms: weakly adversarial, in which we take the sup (worst-case) over all stopping times; and strongly adversarial, in which we take the sup over all sample sizes. In general, the weakly adversarial minimax rate cannot be larger than the strongly adversarial one. The weakly adversarial setting corresponds to what has recently been called the always valid (sometimes also “anytime-valid”) setting for confidence intervals and testing (Howard et al., 2018): at any point in time n, Nature can decide whether or not to stop generating data and present the data so far for analysis, using a rule that can take into account both past data and the true distribution.
This can be seen as a form of minimax analysis under ‘optional stopping’. Note however that in the standard interpretation (e.g. in the Bayesian literature) of optional stopping, stopping rules are assumed independent of the underlying distribution P_θ, whereas here Nature is more powerful, her stopping rule being allowed to depend on θ. Nature is even more powerful in the second, strongly adversarial setting that we consider. Here, Nature can be thought of as generating a very large sample of data and then simply producing the n so that the initial sub-sample up to size n is as misleading as possible. While one may argue which of these two settings is more appropriate, our initial results show that in some cases they lead to the same rates, and we conjecture that the rates coincide more generally.
Motivation.
One advantage of the new definitions is that, in some contexts, they may be more natural. Of course, minimax approaches are truly optimal in the zero-sum game setting in which Statistician plays against Nature, Nature being a player that actively determines the parameters of the problem in an adversarial manner. In practice, one is interested in minimax estimators and rates not because one really thinks that Nature will actually be adversarial in this way, but simply because one wants to be robust against whatever might happen. But if one wants to be robust against whatever might happen, then it seems natural to be robust not just for all parameters, but also for all sample sizes: in modern practice, the data analyst is often presented with a fixed sample of a particular size, and she has no control whatsoever over how exactly that sample size was determined. Time-robust minimax optimal estimators are robust in this situation. One might of course argue ‘time will not be determined by an adversary!’, but this is no different from arguing ‘the true θ will not be determined by an adversary!’: once one takes a worst-case approach at all, it makes sense to include time as well. Moreover, even in the setting of controlled experiments such as clinical trials, where the statistician is normally supposed to determine the sample size in advance, early stopping and the like might happen for reasons outside of the statistician’s control; see e.g. Molenberghs et al. (2014) and references therein.
As such, the time-robust minimax setting fits nicely with recent work promoting always-valid confidence intervals (Howard et al., 2018; Pace and Salvan, 2019) and testing safe under optional continuation (Grünwald, de Heide and Koolen, 2019) as a generic, more robust replacement of traditional testing and confidence.

Given that time-robustness is a natural mode of analysis, it is perhaps not so surprising that the somewhat disturbing conflict between consistency and rate optimality in standard estimation theory, known as the AIC-BIC dilemma (Yang, 2005; Van Erven, Rooij and Grünwald, 2008; Van der Pas and Grünwald, 2018), quite simply disappears under the novel definition of minimax rate. We discuss this motivating application at length towards the end of the paper, in Section 3.
Results.
We provide several results comparing the time-robust to the standard minimax rates. First, in Theorem 3.1 we show that for most estimation problems the strongly adversarial minimax rate goes up by at most a logarithmic factor. A natural question arises: is there an estimation problem for which time-robust minimax rates and standard minimax rates do not coincide? The answer is positive: in Theorem 3.2 we show that, under the standard squared error loss, both the weakly and the strongly adversarial time-robust rates for estimating a parameter in the Gaussian location family are equal to n⁻¹ log log n, while the standard minimax rate for this problem is n⁻¹. The proof for the upper bound easily extends to most standard multivariate exponential families, as we show in Theorem 5.3, and we conjecture the lower bound extends as well.

These results originate from the law of the iterated logarithm. To get an intuition, consider the maximum likelihood estimator for the mean of a one-dimensional Gaussian distribution with known variance. The estimator is simply the sample average. By the law of the iterated logarithm (see, for instance, Hartman and Wintner (1941)) the squared distance between the sample average and the truth is of order n⁻¹ log log n infinitely often with probability one. Therefore, for a suitable stopping rule the expected loss will also be at least of the same order. While a lower bound on the rate for the MLE is thus easy to determine, it turns out to be considerably more difficult to show this lower bound for arbitrary estimators: despite the simple Gaussian location setting, this required new techniques. One reason is that we must allow for arbitrary estimators, and these can depend on the data in tricky ways. For example, one might change one’s estimate if the empirical average on the first half of the data is more than a constant times √(n⁻¹ log log n) away from the empirical average on the second half.
Since we show that no estimator (decision rule) at all can beat n⁻¹ log log n, we may think of Theorem 3.2 as a decision-theoretic law of the iterated logarithm.

The proofs for the upper bound on the strongly adversarial rate are based on finite-time laws of the iterated logarithm based on nonnegative supermartingales, a technique initially proposed by Darling and Robbins (1968), and recently extended by e.g. Balsubramani (2014); Howard et al. (2018). To adapt these techniques to our strongly adversarial setting, we use two fundamental results from Shafer et al. (2011) that link nonnegative supermartingales to p-values and so-called E-values (Vovk and Wang, 2019; Grünwald, de Heide and Koolen, 2019).

The remainder of the paper is organized as follows. We give the necessary measure-theoretic background in Section 2.1. Section 2.2 recalls the standard definition of minimax rates. Section 2.3 extends this definition to time-robust minimax rates. Section 3 contains the main results of the paper, and the AIC-BIC example showing how the new definitions can be used in the context of combined model selection and estimation. We provide a short discussion in Section 4. All proofs are given in Section 5, with some details deferred to the appendix.
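The iterated-logarithm phenomenon described above is easy to see in simulation (an illustrative experiment of ours, not part of the paper): along a single path of the sample mean, the normalized squared error n(µ̂ₙ − µ)²/log log n repeatedly spikes to a constant of order one, even though at any fixed n its expectation is only 1/log log n.

```python
import numpy as np

# Illustrative simulation: for the Gaussian MLE (the sample mean), track
# the ratio  n * (mean_n - mu)^2 / log(log(n))  along one path.  By the
# law of the iterated logarithm this ratio comes close to a constant
# infinitely often, which is what a mu-dependent stopping rule exploits.
rng = np.random.default_rng(0)
mu = 0.0
n_max = 200_000
x = rng.normal(mu, 1.0, size=n_max)
n = np.arange(1, n_max + 1)
means = np.cumsum(x) / n

valid = n >= 3                       # log log n is defined and positive here
ratio = n[valid] * (means[valid] - mu) ** 2 / np.log(np.log(n[valid]))

print("sup of the ratio over the path:", ratio.max())
print("ratio at the final sample size:", ratio[-1])
```

The supremum over the path is of constant order, while the value at any single large n is typically much smaller; a stopping rule that waits for a spike therefore inflates the expected loss relative to the fixed-n rate.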
2. Basic definitions.
2.1. Background on measure theory; notation.
Let 𝒳 be a topological space endowed with Borel sigma-algebra ℬ. Consider a probability space (Ω, 𝒜, P). We say that a random variable X: Ω → 𝒳 is measurable on (Ω, 𝒜) if X⁻¹(B) = {ω ∈ Ω : X(ω) ∈ B} ∈ 𝒜 for every Borel set B ∈ ℬ. For a random variable X: Ω → 𝒳, let σ(X) be the sigma-algebra generated by X, defined as the smallest sigma-algebra such that X is measurable on (Ω, σ(X)). Similarly, for a sequence of random variables X₁, ..., Xₙ: Ω → 𝒳, denote the sigma-algebra generated by X₁, ..., Xₙ by σ(X₁, ..., Xₙ). A filtration F = (Fₙ)_{n∈ℕ} is defined as a non-decreasing family of sigma-algebras. We say that a random variable τ: Ω → ℕ is a stopping time with respect to F if {ω : τ(ω) ≤ n} ∈ Fₙ for all n ∈ ℕ. For more background on measure theory and stopping times see, for instance, Kallenberg (2002) (Chapters 1, 2, and 7).

We write aₙ ≲ bₙ when there exists a constant c > 0 such that aₙ ≤ c bₙ holds for all n ∈ ℕ; and aₙ ≍ bₙ when there exist constants c₁, c₂ > 0 such that c₁aₙ ≤ bₙ ≤ c₂aₙ holds for all n ∈ ℕ.

2.2. Standard definition of convergence rates.
Suppose we observe a random i.i.d. sample X₁, ..., Xₙ ∈ 𝒳 from a distribution P_θ indexed by a parameter θ ∈ Θ, where Θ is potentially infinite dimensional. Consider the problem of estimating the parameter θ from the available data Xⁿ = (X₁, ..., Xₙ). We measure the estimation error with respect to some metric d: Θ × Θ → ℝ⁺₀. In order to choose an estimator for a particular setting, it is important to have a way of comparing the performance of estimators. The minimax paradigm offers a classic solution for performance evaluation. It judges the performance of an estimator by its rate of convergence, which is defined by taking the worst-case scenario over all elements in the given parameter space. More precisely, we define an estimator θ̂ to be a collection {θ̂ₙ}_{n∈ℕ} such that for each n, θ̂ₙ = θ̂(Xⁿ): 𝒳ⁿ → Θ is a function from samples of size n to Θ. We say that θ̂ has a rate of convergence f_θ̂: ℕ → ℝ⁺ if

(i) there exists C > 0 such that for every sample size n ∈ ℕ

    sup_{θ∈Θ} E_{Xⁿ∼P_θ} [ d(θ, θ̂ₙ) / f_θ̂(n) ] ≤ C;

(ii) for any function f̃: ℕ → ℝ⁺ such that f_θ̂(n)/f̃(n) → ∞,

    sup_{θ∈Θ} E_{Xⁿ∼P_θ} [ d(θ, θ̂ₙ) / f̃(n) ] = ∞.
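To make conditions (i) and (ii) concrete, here is a small Monte Carlo sanity check (our illustration, not from the paper): for the Gaussian location family under squared error loss, the sample mean has risk exactly 1/n for every µ, so with f(n) = 1/n the ratio in condition (i) equals the constant C = 1 at every sample size.

```python
import numpy as np

# Illustrative check of condition (i) for the Gaussian location family:
# the sample mean mu_hat_n has E[(mu_hat_n - mu)^2] = 1/n, so dividing
# the risk by f(n) = 1/n should give a value near 1 for every n.
rng = np.random.default_rng(1)
mu = 2.5
ratios = []
for n in [1, 10, 100, 500]:
    sample_means = rng.normal(mu, 1.0, size=(10_000, n)).mean(axis=1)
    risk = np.mean((sample_means - mu) ** 2)   # Monte Carlo risk estimate
    ratios.append(risk / (1.0 / n))            # risk divided by f(n) = 1/n
    print(f"n={n:4d}  risk={risk:.5f}  risk/f(n)={ratios[-1]:.3f}")
```

The printed ratios hover around 1 at every n, which is exactly the bounded-ratio form of the definition; condition (ii) then rules out pretending to a faster rate f̃.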
An estimator θ̂ is called minimax optimal (up to a constant factor) if

(2.1)    f_θ̂(n) ≍ inf_{θ̃} f_θ̃(n),

where the infimum is taken over all estimators that can be defined on the domain.

Here we expressed the minimax rate in terms of a supremum over a ratio. It is perhaps more common to express θ̂ being minimax optimal, i.e. (2.1), without using ratios, but directly (yet equivalently) as

    sup_{θ∈Θ} E_{Xⁿ∼P_θ} [ d(θ, θ̂ₙ) ] ≤ C inf_{θ̃} sup_{θ∈Θ} E_{Xⁿ∼P_θ} [ d(θ, θ̃ₙ) ].

Under this formulation, the straightforward extension to taking a worst case over time trivializes the problem: if we take the supremum on the right not just over θ ∈ Θ but also over n, it will be achieved for n = 1 (or other small sample sizes), since Nature would always choose the smallest possible sample size and the problem would become uninteresting. By rephrasing minimax optimality in terms of ratios, and taking a supremum over stopping times/rules, we do get a useful extension, as we now show.

2.3. Time-Robust Convergence rates.
The classic definitions of minimax rates assume the sample size is fixed and known in advance. We now propose generalized definitions that account for not knowing the sample size in advance.

Let 𝒯 be the collection of all almost surely finite stopping times with respect to the sequence of sigma-algebras Fₙ = σ(X₁, ..., Xₙ) generated by the data Xⁿ. We say that an estimator θ̂ (with θ̂ₙ = θ̂(Xⁿ)) has a weakly adversarial time-robust rate of convergence f_θ̂: ℕ → ℝ⁺ if

    sup_{θ∈Θ} sup_{τ∈𝒯} E_{X^∞∼P_θ} [ d(θ, θ̂_τ) / f_θ̂(τ) ] ≤ C

and for any function f̃: ℕ → ℝ⁺ such that f_θ̂(n)/f̃(n) → ∞,

    sup_{θ∈Θ} sup_{τ∈𝒯} E_{X^∞∼P_θ} [ d(θ, θ̂_τ) / f̃(τ) ] = ∞.

An estimator θ̂ is weakly adversarial time-robust minimax optimal if its weakly adversarial time-robust rate of convergence f_θ̂ satisfies

(2.2)    f_θ̂(n) ≍ inf_{θ̃} f_θ̃(n),

where the infimum is taken over all estimators. Then the function f_θ̂ is called the weakly adversarial time-robust minimax rate for the given statistical problem.

We say that an estimator θ̂ (with θ̂ₙ = θ̂(Xⁿ)) has a strongly adversarial time-robust rate of convergence g_θ̂: ℕ → ℝ⁺ if

    sup_{θ∈Θ} E_{X^∞∼P_θ} [ sup_{n∈ℕ} d(θ, θ̂ₙ) / g_θ̂(n) ] ≤ C

and for any function g̃: ℕ → ℝ⁺ such that g_θ̂(n)/g̃(n) → ∞,

    sup_{θ∈Θ} E_{X^∞∼P_θ} [ sup_{n∈ℕ} d(θ, θ̂ₙ) / g̃(n) ] = ∞.

An estimator θ̂ is strongly adversarial time-robust minimax optimal if its strongly adversarial time-robust rate of convergence g_θ̂ satisfies (2.2) with f replaced by g, where again the infimum is taken over all estimators.
Then the function g_θ̂ is called the strongly adversarial time-robust minimax rate for the given statistical problem.

We may also call the weakly adversarial time-robust rate of convergence the always-valid convergence rate, since the freedom in when to stop is exactly the same as in the recent papers on always-valid (also known as ‘anytime-valid’) confidence intervals and p-values. The strongly adversarial time-robust rate may also be called the worst-case-sample-size convergence rate. Statistical estimation in which the stopping time τ is not known in advance is often referred to as estimation with optional stopping. However, in e.g. the Bayesian literature this is usually interpreted as ‘the stopping rule may be unknown, but it is chosen independently of θ’. We may think of the weakly time-robust or “always-valid” rate as the rate obtained in a setting with a stronger form of optional stopping, in which Nature jointly chooses θ and the stopping time, which can then be chosen as a function of θ. Note that choosing a stopping time is equivalent to choosing a stopping rule, which, at each sample size n, decides, based on θ and all past data, whether to stop or not. In contrast, the strongly adversarial time-robust rate corresponds to deciding to stop at the worst n, a rule that does not depend on the true θ but instead, unlike a stopping time, requires a look into the future.

Clearly any estimator θ̂, if it has strongly adversarial time-robust rate f(n), has weakly adversarial time-robust rate and standard minimax rate that are at most f(n). Similarly, any standard minimax rate can be no larger, up to a constant factor, than any weakly adversarial time-robust minimax rate, which in turn can be no larger, up to a constant factor, than the corresponding strongly adversarial time-robust minimax rate. In the next section we study the relationship between these three quantities more closely.
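The gap between the fixed-n and the strongly adversarial settings can be seen numerically (an illustrative simulation of ours, not from the paper): for the sample mean with f(n) = 1/n, the ratio d(µ, µ̂ₙ)/f(n) = n(µ̂ₙ − µ)² has expectation exactly 1 at every fixed n, yet an adversary who picks the sample size in hindsight, after seeing the whole path, obtains a much larger expected ratio.

```python
import numpy as np

# Illustrative simulation of the strongly adversarial choice of n: per
# path, Nature looks at the whole trajectory and reports the sample
# size where  n * (mean_n - mu)^2  (the loss divided by f(n) = 1/n) is
# largest, requiring a look into the future as described in the text.
rng = np.random.default_rng(3)
mu, n_max, reps = 0.0, 50_000, 200
worst = []
for _ in range(reps):
    x = rng.normal(mu, 1.0, size=n_max)
    n = np.arange(1, n_max + 1)
    dev = n * (np.cumsum(x) / n - mu) ** 2   # loss / f(n), expectation 1 at fixed n
    worst.append(dev[2:].max())              # adversarial n >= 3, chosen in hindsight
print("expected ratio at any fixed n :", 1.0)
print("expected ratio, adversarial n :", np.mean(worst))
```

The adversarial expectation comes out well above 1 and grows with the horizon, which is why the standard rate 1/n cannot be a strongly adversarial time-robust rate for this problem.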
3. Main results.
Our first result gives a general upper bound on the strongly adversarial time-robust minimax rate, and hence also on the weakly adversarial time-robust minimax rate, as compared to the usual minimax rate. This result holds under very weak conditions for any parameter estimation setting. Then we consider estimating the mean parameter in the Gaussian location family with the usual Euclidean distance. It turns out that for this problem both the weakly and the strongly adversarial time-robust minimax rates are equal to n⁻¹ log log n, while the usual minimax rate is n⁻¹.

3.1. Time-robust rates are never much worse than standard rates.
In the theorem below we show that the strongly (and hence the weakly) adversarial time-robust minimax rate differs from the usual minimax rate by at most a logarithmic factor, under a very mild assumption on the decay of the usual minimax rate function. The result makes no assumptions about the metric d.

THEOREM 3.1.
Let f: ℕ → ℝ⁺ be a minimax rate for some given statistical estimation problem, such that f is non-increasing and

(3.1)    f(2n)/f(n) ≥ C

for some C > 0. Then the strongly adversarial time-robust minimax rate g(n) for the same problem satisfies

    g(n) ≲ f(n) log n.

Notice that assumption (3.1) holds for f(n) ≍ n^{−γ}(log n)^β with 0 < γ ≤ 1 and β ≥ 0, which covers the minimax rate for most standard parametric and nonparametric estimation problems, and under most standard metrics; see e.g. Tsybakov (2009). The proof of
Theorem 3.1 involves constructing an estimator that uses only part of the available data. We then show that the standard minimax rate for this estimator is f(n) log n and that it remains unaffected if we include the supremum over n.

3.2. The time-robust rate can be different from the standard rate.
Now we present a problem for which the time-robust minimax rates, while equal to each other, do not coincide with the usual minimax rate. Consider a Gaussian location family with fixed variance {P_µ, µ ∈ ℝ}, where each P_µ is a Gaussian distribution with mean µ and variance one. Let d(µ, µ′) = (µ − µ′)² be the squared Euclidean distance. The following theorem shows that the strongly adversarial time-robust minimax rate for estimating µ is upper bounded by n⁻¹ log log n and the weakly adversarial time-robust minimax rate is lower bounded by n⁻¹ log log n, so that both rates coincide and are equal to n⁻¹ log log n. Furthermore, it shows that the rate is attained by the maximum likelihood estimator (MLE). To avoid taking the logarithm of a nonpositive number we set f(n) = 1 for n = 1, 2, and

    f(n) = n⁻¹ log log n   for n ≥ 3.

THEOREM 3.2.
Let {P_µ, µ ∈ ℝ} represent the Gaussian location family, i.e. under P_µ the X₁, X₂, ... are i.i.d. ∼ N(µ, 1). Then:

(i) Upper bound. There exists a constant
C > 0 such that

(3.2)    sup_{µ∈ℝ} E_{X^∞∼P_µ} [ sup_{n∈ℕ} (µ − µ̂ₙ^MLE)² / f(n) ] ≤ C,

where µ̂ₙ^MLE = µ̂^MLE(Xⁿ) is the maximum likelihood estimator.

(ii) Lower bound. Let 𝒯 be the collection of all (a.s. finite) stopping times w.r.t. the filtration F = {σ(Xⁿ), n ∈ ℕ}. There is a C > 0 such that for any estimator µ̂ with µ̂ₙ = µ̂(Xⁿ),

(3.3)    sup_{µ∈ℝ} sup_{τ∈𝒯} E_{X^∞∼P_µ} [ (µ − µ̂_τ)² / f(τ) ] ≥ C,

and

(3.4)    sup_{µ∈ℝ} sup_{τ∈𝒯} E_{X^∞∼P_µ} [ (µ − µ̂_τ)² / g(τ) ] = ∞

for all non-increasing g: ℕ → ℝ⁺ such that f(n)/g(n) → ∞.

REMARK. The proof of the upper bound relies on nonnegative supermartingales and their connection to p-values and so-called E-values.

The proof of the lower bound relies on a number of steps. We first show that the bound must hold for estimators that are ‘MLE-like’: they are sufficiently ‘close’, in a particular sense, to the MLE and hence also to Bayes optimal estimators based on standard priors. In the second step, we relate the problem of bounding the minimax risk to the problem of bounding the Bayes risk. This idea is not new in itself and is widely used in minimax theory, see e.g. Tsybakov (2009). The difficulty we face is that this standard argument does not give anything useful if directly applied to MLE-like (and standard Bayes-like) estimators. In the final step of the proof, we thus construct a stopping rule for each ‘non-MLE-like’ estimator that stops when the estimator is far away from the MLE, and give a lower bound for the Bayes risk, i.e. the risk of the non-MLE-like estimator under the Bayesian posterior. The complete proofs are given in Section 5.
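The stopping-rule mechanism behind the lower bound for the MLE can be sketched in code (illustrative constants: the threshold 1.8 and the restriction n ≥ 10 are our choices, not the paper's): under µ = 0, stop the first time n(µ̂ₙ^MLE)² ≥ 1.8 log log n. By the law of the iterated logarithm this eventually happens on almost every path, and at the stopping time the loss-to-f(τ) ratio is at least 1.8 by construction, whereas at any fixed n the same ratio has expectation 1/log log n, which tends to zero.

```python
import numpy as np

# Illustrative sketch of the lower-bound stopping rule for the MLE:
# stop at tau = first n >= 10 with  n * mean_n^2 >= 1.8 * log log n.
# At tau the ratio  (mean_tau - mu)^2 / (tau^{-1} log log tau)  is
# >= 1.8, far above its fixed-n expectation 1 / log log n.
rng = np.random.default_rng(4)
mu, n_max, reps = 0.0, 200_000, 100
ratios = []
for _ in range(reps):
    x = rng.normal(mu, 1.0, size=n_max)
    n = np.arange(1, n_max + 1)
    loglog = np.log(np.log(np.maximum(n, 3)))            # matches f(n) for n >= 3
    ratio = n * (np.cumsum(x) / n - mu) ** 2 / loglog    # loss / f(n)
    hit = np.where((n >= 10) & (ratio >= 1.8))[0]
    if hit.size:                                         # path stopped in the horizon
        ratios.append(ratio[hit[0]])
print("fraction of paths stopped:", len(ratios) / reps)
print("mean ratio at tau        :", float(np.mean(ratios)))
```

The theorem's real content is much stronger than this sketch: not just the MLE but every estimator admits some µ-dependent stopping time on which the ratio stays bounded away from zero.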
3.3. Application: avoiding the AIC-BIC Dilemma in Model Selection and Post-Selection Inference.
Consider a simple model selection problem. Data Xⁿ are used to select between two nested exponential family models M₀ = {p_µ, µ ∈ M₀} and M₁ = {p_µ, µ ∈ M₁}, where M₀ ⊂ M₁, and M₁ is an exponential family with mean-value parameter set M₁ ⊂ ℝᵏ. For simplicity let M₀ = {µ₀}, so that M₀ is a singleton. Examples include testing whether a coin is fair (using Bernoulli distributions) or whether a treatment has an effect (using the Gaussian location family). We consider combined model selection and estimation: first, a model is selected using some model selection method such as, for example, AIC, BIC, cross-validation, Bayes factor model selection, or one of its many variations. Then, based on the chosen model, a parameter within the model is estimated using some estimator such as the MLE or a Bayes predictive distribution (note that if the singleton model M₀ is selected, then the estimator must return µ₀).

Two desirable properties for such combined procedures are (i) consistency and (ii) rate optimality of the post-model selection estimation. Yang (2005) shows that at least for some settings having both (i) and (ii) at the same time is impossible: any combination of a consistent model selection and subsequent estimation method misses the standard minimax rate (equal to n⁻¹ in our case) by a factor g(n) such that g(n) → ∞ as n → ∞. Yang’s setting can be adjusted to include the exponential family setting presented here; see Van der Pas and Grünwald (2018) for more details. Yang also shows that a similar problem occurs if we average over the models (in a Bayesian or any other way) rather than select one of the models. This has been called the AIC-BIC dilemma.
While for the mathematician it may not be so surprising that there is no procedure which is optimal under two different definitions of optimality, it has been argued (by Yang and many others) that for the practitioner there really is a dilemma: she simply wants to get an initial idea of which model best explains her data, indicating how to focus her subsequent research, and views consistency and rate optimality as desirable properties, both indicating that her procedure will do something reasonable in idealized situations, but neither one being the ultimate goal. Which of the two properties is more important is then often not clear.

However, if one accepts the novel definition of minimax rate presented here, one has a way out of the dilemma after all: there exist model selection procedures that are strongly consistent while, if combined with the MLE, having a standard convergence rate equal to n⁻¹ log log n under the squared error loss. Van der Pas and Grünwald (2018) showed this explicitly for model selection based on the switch distribution introduced by Van Erven, Rooij and Grünwald (2008). The switch distribution was specifically designed for this purpose, but some other methods achieve this as well. For example, while Bayesian model selection based on standard priors achieves only an n⁻¹ log n rate (Van Erven, Rooij and Grünwald, 2008), it seems quite likely that if M₁ is equipped with the quite special stitching priors (Howard et al., 2018) which asymptote at µ₀, one can also get strongly consistent model selection and an n⁻¹ log log n estimation rate by Bayes factor model selection. Since we have no explicit proof of this, we continue the discussion with switching rather than stitching. The estimation rate for the switch distribution (at least for the
Gaussian location family, but we conjecture for general exponential families) is equal to the time-robust minimax convergence rate derived in Theorem 3.2. Thus the switch procedure is both strongly consistent and minimax optimal in the new, time-robust, sense. We see that using time-robust definitions of minimax optimality, the gap between (i) and (ii) above can be bridged, whereas, by Yang’s theorem, this is impossible under the standard definition of minimax rate. Hence, by redefining minimax optimality so as to be robust with respect to all parameters (including n) that we as statisticians do not have under control, the minimax rate slightly changes and the AIC-BIC dilemma simply disappears: combined consistency and estimation optimality can be achieved by, for example, the switch distribution. As an aside, neither AIC nor BIC itself ‘solves’ the dilemma under the time-robust definitions: AIC is still inconsistent, whereas BIC, when combined with efficient post-selection estimation, achieves a standard estimation rate of order n⁻¹ log n; since the time-robust rate is at least the standard rate, it must still be rate-suboptimal under the time-robust definition of minimax rate.
4. Discussion.
In this paper we suggested a generalization of minimax theory enabling it to deal with unknown and data-dependent sample sizes. We introduced two notions of time-robust minimax rates and compared them to the standard notion of minimax rates. We showed that for most problems the rates differ by at most a logarithmic factor. We also provided an example of a (parametric) setting for which the weak and the strong rates are the same, yet they differ from the standard rates by an iterated logarithmic factor. However, it is not yet clear under what circumstances the logarithmic upper bound on the difference, derived in Theorem 3.1, is tight: for example, it might be possible that in some standard (e.g. nonparametric) problems the gap vanishes (the strongly adversarial time-robust rate is within a constant factor of the standard rate), while in others it may even be larger than order log log n. Similarly, in some settings the weak and strong time-robust rates may coincide, and in others they may differ. A major goal for future research is thus to sort out more generally when the three rates coincide and when they differ, and if so, by how much.
5. Proofs.
5.1. Proof of Theorem 3.1.
Let θ̂ be an estimator that achieves the standard minimax rate, i.e. there exists C′ > 0 such that for every n ∈ ℕ

    sup_{θ∈Θ} E_{Xⁿ∼P_θ} [ d(θ̂ₙ, θ) / f(n) ] ≤ C′,

where θ̂ₙ = θ̂(Xⁿ). For k ≥ 0 define θ̂⁽ᵏ⁾ = θ̂(X₁, ..., X_{2ᵏ}). Let ⌊x⌋ = max{z ∈ ℤ, z ≤ x}. Consider

    θ̂′ₙ = θ̂′(Xⁿ) = θ̂⁽⌊log₂ n⌋⁾,

the function of the data Xⁿ that outputs θ̂⁽⌊log₂ n⌋⁾. In what follows we show that θ̂′ achieves a time-robust minimax rate satisfying the claim of the theorem.

Consider a probability mass function π: ℕ₀ → [0, 1] with Σ_{j≥0} π(j) = 1. Denote E_θ[·] = E_{X^∞∼P_θ}[·]. Because of assumption (3.1), for any θ ∈ Θ we have

    E_θ [ sup_{n∈ℕ} π(⌊log₂ n⌋) · d(θ̂′ₙ, θ) / f(n) ] ≤ C⁻¹ E_θ [ sup_{j≥0} π(j) · d(θ̂⁽ʲ⁾, θ) / f(2ʲ) ]
        ≤ C⁻¹ E_θ [ Σ_{j≥0} π(j) · d(θ̂⁽ʲ⁾, θ) / f(2ʲ) ] ≤ C⁻¹ sup_{j≥0} E_θ [ d(θ̂⁽ʲ⁾, θ) / f(2ʲ) ].

The last inequality is due to Σ_{j≥0} π(j) = 1. Since θ̂ achieves the standard minimax rate we have

    sup_{j≥0} E_θ [ d(θ̂⁽ʲ⁾, θ) / f(2ʲ) ] ≤ sup_{n∈ℕ} E_θ [ d(θ̂ₙ, θ) / f(n) ] ≤ C′.

Putting everything together we arrive at

    E_θ [ sup_{n∈ℕ} π(⌊log₂ n⌋) · d(θ̂′ₙ, θ) / f(n) ] ≤ C⁻¹C′.

Let π(j) ≍ (j + 1)^{−1−α}. Then for every α > 0 there exists a constant C′′ > 0 such that for any θ ∈ Θ

    E_θ [ sup_{n∈ℕ} (⌊log₂ n⌋ + 1)^{−1−α} · d(θ̂′ₙ, θ) / f(n) ] ≤ C′′.

This finishes the proof of the theorem.
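The construction above can be sketched in code (a minimal sketch: the base estimator is taken to be the sample mean and logarithms are base 2, both our illustrative choices). The key structural property is that θ̂′ₙ reuses the base estimator computed on the first 2^⌊log₂ n⌋ observations, so it is refreshed only when n crosses a power of two; this discretization of time is what allows the π-weighted sum over dyadic scales to control the supremum over all n at the price of a logarithmic factor.

```python
import numpy as np

def dyadic_estimator_path(x):
    """theta'_n for n = 1..len(x): the base estimator (here the sample
    mean, our illustrative choice) applied to the first 2^floor(log2 n)
    observations, so the estimate changes only at powers of two."""
    x = np.asarray(x, dtype=float)
    out = np.empty(len(x))
    for i in range(len(x)):
        k = int(np.floor(np.log2(i + 1)))  # dyadic scale of n = i + 1
        out[i] = x[: 2 ** k].mean()        # estimate frozen within a scale
    return out

rng = np.random.default_rng(5)
x = rng.normal(1.0, 1.0, size=64)
est = dyadic_estimator_path(x)
# Within one dyadic block the estimate is constant:
# n = 5, 6, 7 all use only the first 4 observations.
assert est[4] == est[5] == est[6]
print(est[[0, 1, 3, 7, 31, 63]])
```

At sample size n the estimator discards at most half the data, which is why its standard rate is within a constant of f(n) under assumption (3.1).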
5.2. Proof of (i) in Theorem 3.2.
We will prove a more general result that holds for all exponential families.
5.2.1. Preliminaries.
Consider an exponential family P_Θ̄ = {P_θ, θ ∈ Θ̄}, Θ̄ ⊂ ℝᵏ, defined as the family of densities

    p_θ(x) = r(x) e^{θᵀφ(x) − ψ(θ)},   x ∈ 𝒳, θ ∈ Θ̄.

Here φ(x) is a sufficient statistic for θ. When we write X^∞ ∼ P_θ we mean that X₁, X₂, ... are i.i.d. with each Xᵢ ∼ P_θ. We use the mean-value parametrization of the exponential family and set P_M̄ = {P_µ, µ ∈ M̄}, M̄ ⊂ ℝᵏ, with the link function

(5.1)    µ(θ) = E_{X∼P_θ}[φ(X)].

We let θ(·), the inverse of µ(·), be the transformation function from the mean-value parametrization to the canonical one. θ(·) exists for all exponential families, see e.g. Brown (1986).

We assume the parameter space M̄ is such that the maximum likelihood estimator lies in M̄ and is unique. That means we potentially have to extend the original family {P_θ, θ ∈ Θ̄} by including distributions ‘on the boundary’. For example, in the Bernoulli model, the natural parameter ranges from −∞ to ∞, corresponding to {P_µ, µ ∈ (0, 1)}, excluding the degenerate distributions P₀ and P₁. We then simply set M̄ = [0, 1] to include these distributions.

More formally, the assumption is as follows.

ASSUMPTION 1. M̄ is such that the maximum likelihood estimator µ̂ₙ^MLE = µ̂^MLE(xⁿ) satisfies

    µ̂ₙ^MLE = (1/n) Σᵢ₌₁ⁿ φ(xᵢ) ∈ M̄

for all x₁, ..., xₙ ∈ 𝒳.

This assumption is needed since we are using the properties of the average of i.i.d. random variables to prove a statement about the MLE. However, the assumption is rather weak: most standard exponential families either satisfy Assumption 1 or can be extended to satisfy it; see Chapter 5 of Brown (1986).

Furthermore, we introduce the definition of a CINECSI subset of M̄.
DEFINITION 5.1. A CINECSI subset of M̄ is a connected subset of the interior of M̄ that is itself compact and has nonempty interior.

For discussion of CINECSI subsets see Grünwald (2007).

Finally, we introduce an additional assumption on the set of true parameters M ⊆ M̄, from which the data is assumed to be generated and over which we are taking the supremum.

ASSUMPTION 2. M ⊆ M̄ is such that there exist constants σ > 0 and δ > 0 such that for all η ∈ ℝᵏ with ‖η‖ ≤ δ, where ‖·‖ is the usual Euclidean norm, and all µ ∈ M,

(5.2)    E_{X∼P_µ} [ e^{ηᵀ(φ(X)−µ)} ] ≤ e^{σηᵀη/2}.

In the proposition below we show that Assumption 2 is satisfied for the Gaussian location family with M = M̄ and for other exponential families when M is a CINECSI subset of M̄. This condition is required in our proofs for bounding the Fisher information, but might potentially be relaxed if one uses different proof techniques.

PROPOSITION 5.2.
Assumption 2 is satisfied for the following settings: (i)
When P_M̄ = {P_µ, µ ∈ M̄} is the Gaussian location family, i.e. P_µ represents a N(µ, 1) distribution, and M = M̄ = ℝ.

(ii) When P_M̄ is any exponential family and M is a CINECSI subset of M̄.

PROOF. (i) For the Gaussian location family φ(X) = X. Inspecting the definition of the moment generating function, we immediately find that for all η ∈ ℝ,

    E_{X∼P_µ} [ e^{η(X−µ)} ] = e^{η²/2}.

Then (5.2) is satisfied with σ = 1 and any δ > 0.

(ii) Let M be a CINECSI subset of M̄. Consider the canonical parametrization of the exponential family with Θ = θ(M) (where θ(·) is as defined underneath (5.1)). By a Taylor expansion we have for every η ∈ ℝᵏ and every θ ∈ Θ,

    E_{X∼P_θ} [ e^{ηᵀφ(X)} ] = e^{ψ(θ+η) − ψ(θ)} = e^{ηᵀµ(θ) + ηᵀI(θ′)η/2},

where I(·) is the Fisher information and θ′ is between θ and θ + η. Now we construct a set B_δ(0) = {η ∈ ℝᵏ : ‖η‖ ≤ δ} such that for all η ∈ B_δ(0) the Fisher information at θ′ (located between θ and θ + η) is bounded.

Notice that since µ(·) = θ⁻¹(·) is continuous, Θ is a CINECSI subset of Θ̄ = θ(M̄). Hence, there exist δ > 0 and Θ_δ such that Θ ⊂ Θ_δ, Θ_δ is a CINECSI subset of Θ̄, and

    inf_{θ∈Θ, θ′∈∂Θ_δ} ‖θ − θ′‖ ≥ δ.

Then for all η ∈ B_δ(0) we have θ′ ∈ Θ_δ. Since Θ_δ is a CINECSI subset of Θ̄, the Fisher information is bounded on Θ_δ. Therefore, there exists σ = sup_{θ′∈Θ_δ} I(θ′) > 0 such that

    E_{X∼P_µ} [ e^{ηᵀ(φ(X)−µ)} ] ≤ e^{σηᵀη/2}

for all η ∈ B_δ(0) and all µ ∈ M.

5.2.2. General theorem.
5.2.2. General theorem. In the following theorem we show that under Assumptions 1 and 2 the strongly adversarial time-robust minimax rate for the MLE is at most $n^{-1}\log\log n$. Note that below, the MLE $\hat\mu^{\rm MLE}$ is defined relative to the full set $\bar M$, not the potentially restricted set $M$. Also, observe that (3.2) directly follows from Theorem 5.3, since for the Gaussian location family, Assumption 1 is satisfied for $\bar M = \mathbb{R}$.

THEOREM 5.3. Let $\bar M$ be such that Assumption 1 is satisfied. Let $M \subseteq \bar M$ be such that Assumption 2 is satisfied. Then there exists a constant $C > 0$ such that
\[ \sup_{\mu \in M} E_{X^\infty \sim P_\mu}\Big[\sup_{n \in \mathbb{N}} \frac{\|\mu - \hat\mu^{\rm MLE}_n\|^2}{f(n)}\Big] \ \le\ C, \]
where $f(n) = n^{-1}\log\log n$ for $n \ge 3$ and $f(n) = 1$ for $n = 1, 2$.
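As a numerical aside (not part of the formal argument; $\mu$, the horizon, and the number of paths are our own choices), the time-uniform ratio bounded by Theorem 5.3 can be simulated for the Gaussian location family:

```python
import numpy as np

# Simulation sketch of Theorem 5.3 (illustration only): for the Gaussian
# location family the time-uniform squared error
# sup_n (mu - MLE_n)^2 / f(n), with f(n) = log log n / n for n >= 3,
# remains bounded along typical sample paths.
rng = np.random.default_rng(1)
mu, n_max, paths = 0.7, 20_000, 50

n = np.arange(3, n_max + 1)
f = np.log(np.log(n)) / n                         # f(n) = n^{-1} log log n

sups = []
for _ in range(paths):
    X = rng.normal(mu, 1.0, size=n_max)
    mle = np.cumsum(X) / np.arange(1, n_max + 1)  # running-mean MLE
    sups.append(np.max((mu - mle[2:]) ** 2 / f))  # sup over n = 3..n_max

print(f"time-uniform ratio over {paths} paths: "
      f"mean {np.mean(sups):.1f}, max {np.max(sups):.1f}")
```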
5.2.3. Proof of Theorem 5.3. Let $S_n = \sum_{i=1}^n (\phi(X_i) - \mu(\theta))$; since the MLE in the mean-value parametrization is $\hat\mu^{\rm MLE}_n = \frac{1}{n}\sum_{i=1}^n \phi(X_i)$, we have $\mu - \hat\mu^{\rm MLE}_n = -S_n/n$. Notice that for $n \ge 3$ we have $\log\log n \ge \log\log 3 > 0$. Furthermore, there exists a constant $C'$ such that
\[ \sum_{m=1}^{2} E\,\|S_m\|^2 \ \le\ C'. \]
Also, for every $\mu \in M$,
\[ E_{X^\infty \sim P_\mu}\Big[\sup_{n \in \mathbb{N}} \frac{\|\mu - \hat\mu^{\rm MLE}_n\|^2}{f(n)}\Big] \ \le\ E_{X^\infty \sim P_\mu}\Big[\sum_{m=1}^{2} \|S_m\|^2 + \sup_{n>2} \frac{\|S_n\|^2}{n \log\log n}\Big]. \]
It is then sufficient to show that there exists a constant
$C > 0$ such that for every parameter $\theta \in \Theta = \theta(M)$ (where $\theta(\cdot)$ is as defined underneath (5.1)),
\[ E_{X^\infty \sim P_\theta}\Big[\sup_{n>2} \frac{\|S_n\|^2}{n \log\log n}\Big] \ \le\ C. \]
First, following Shafer et al. (2011), we define a test supermartingale $(U_n)_{n \in \mathbb{N}}$ relative to a filtration $(\mathcal{F}_n)_{n \in \mathbb{N}}$ and a distribution $P$ to be a nonnegative supermartingale relative to $(\mathcal{F}_n)_{n \in \mathbb{N}}$ with starting value bounded by $1$; that is, $(U_n)_{n \in \mathbb{N}}$ is a test supermartingale iff for all $n \in \mathbb{N}$: $U_n \ge 0$ a.s., $E_P[U_n \mid \mathcal{F}_{n-1}] \le U_{n-1}$, and $E[U_0] \le 1$. The following lemma is an immediate consequence of combining two of Shafer et al. (2011)'s fundamental results.

LEMMA 5.1. Suppose that $(U_n)_{n \in \mathbb{N}}$ is a test supermartingale under distribution $P$. Then
\[ E_P\Big[\sup_{n \in \mathbb{N}} \sqrt{U_n}\Big] \ \le\ 2. \]

PROOF. Let $V = 1/\sup_{n \in \mathbb{N}} U_n$. From Theorem 2, part (1) of Shafer et al. (2011) we have that, for all $0 \le \alpha \le 1$, $P(V \le \alpha) \le \alpha$, i.e. (the value taken by) $V$ can be interpreted as a $p$-value. Now Theorem 3, part (1) of Shafer et al. (2011), together with (8) in that paper instantiated to $\alpha = 1/2$, gives that $1/(2\sqrt{V})$ is an E-variable, i.e. $E[1/(2\sqrt{V})] \le 1$, and the result above follows. [In the terminology of Shafer et al. (2011), a random variable $W$ with $E[1/W] \le 1$ is called a "Bayes factor". In more recent publications, the terminology has changed to calling $1/W$ an E-variable and its value an E-value (Vovk and Wang, 2019; Grünwald, de Heide and Koolen, 2019).]
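A minimal empirical sketch of Lemma 5.1, using a concrete test supermartingale of our own choosing (a Gaussian likelihood-ratio martingale; $\eta$, horizon, and path count are arbitrary):

```python
import numpy as np

# Our own instance: U_n = prod_{i<=n} exp(eta*X_i - eta^2/2) is a nonnegative
# martingale with U_0 = 1 under i.i.d. N(0,1) data, hence a test
# supermartingale.  Lemma 5.1 predicts E[sup_n sqrt(U_n)] <= 2; we
# approximate the sup over a finite horizon.
rng = np.random.default_rng(2)
eta, n_max, paths = 0.5, 2_000, 2_000

X = rng.normal(0.0, 1.0, size=(paths, n_max))
log_U = np.cumsum(eta * X - eta ** 2 / 2, axis=1)   # log U_n along each path
sup_sqrt_U = np.exp(np.max(log_U, axis=1) / 2)      # sup_{n>=1} sqrt(U_n)
sup_sqrt_U = np.maximum(sup_sqrt_U, 1.0)            # include U_0 = 1 in the sup

print(f"E[sup_n sqrt(U_n)] approx {sup_sqrt_U.mean():.2f} (Lemma 5.1 bound: 2)")
```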
Consider $\delta > 0$ and $\sigma > 0$ such that Assumption 2 is satisfied. Let $B_\delta(0) = \{\eta \in \mathbb{R}^k : \|\eta\| \le \delta\}$. For a probability distribution on $B_\delta(0)$ with density function $\gamma : B_\delta(0) \to \mathbb{R}$, define
\[ Z_n = \int_{B_\delta(0)} \gamma(\eta)\, e^{\eta^T S_n - n\sigma\eta^T\eta/2}\, d\eta. \]
Additionally, let $Z_0 = 1$. Due to the properties of conditional expectation and Assumption 2, we know that $(Z_n)_{n \in \mathbb{N}}$ is a test supermartingale relative to the filtration $(\sigma(X^n))_{n \in \mathbb{N}}$.
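The supermartingale property of the mixture $Z_n$ can be checked empirically with a discrete mixing distribution (our own toy instance with $\phi(X) = X$ and $\sigma = 1$; the $\eta$-grid, horizon, and path count are arbitrary):

```python
import numpy as np

# Sanity-check sketch: with Gaussian data (phi(X) = X, sigma = 1) and a finite
# mixing distribution gamma over a few eta values, Z_n is a test
# supermartingale, so E[Z_n] <= 1 for every n.
rng = np.random.default_rng(3)
paths, n_max = 50_000, 30
etas = np.array([-0.3, -0.1, 0.1, 0.3])
gammas = np.full(4, 0.25)                      # mixing weights, sum to 1

X = rng.normal(0.0, 1.0, size=(paths, n_max))  # mu = 0, so S_n = X_1 + ... + X_n
S = np.cumsum(X, axis=1)
n = np.arange(1, n_max + 1)

# Z_n = sum_i gamma_i * exp(eta_i * S_n - n * sigma * eta_i^2 / 2), sigma = 1
Z = sum(g * np.exp(e * S - n * e ** 2 / 2) for g, e in zip(gammas, etas))
mean_Z = Z.mean(axis=0)                        # approximates E[Z_n] per n
assert np.all(mean_Z < 1.05)
```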
Lemma 5.1 then has the following corollary.

COROLLARY 5.4. For any distribution $\gamma$ on $B_\delta(0)$,
(5.3) \[ E_{X^\infty \sim P_\theta}\Big[\sup_{n \in \mathbb{N}} \sqrt{Z_n}\Big] \ \le\ 2. \]

By choosing the right distribution on $\eta$, i.e. the right $\gamma$, we can show that (5.3) implies the following lemma (we provide its proof in the next subsection).

LEMMA 5.2. Let $S_n = (S_{1n}, \dots, S_{kn})^T$ and $T_n = |S_{1n}| + \dots + |S_{kn}|$. For every $c < \delta/\sqrt{2k}$ and all $\theta \in \Theta$,
\[ E_{X^\infty \sim P_\theta}\Big[1_{A_c} \sup_{n>2} e^{\frac{c T_n}{\sqrt{n\log\log n}}}\Big] \ \le\ e^{K_1 + K_2 c^2}, \]
where $K_1 = 1.5 + (k+1)\log 2$, $K_2 = 18\sigma k$, and
\[ A_c = \Big\{\sup_{n>2} \frac{T_n}{\sqrt{n\log\log n}} \ \ge\ c^{-1}\big(K_2 c^2 + 3\big)\Big\}. \]

By Markov's inequality, for any $a > 0$ and any $c > 0$,
\[ P\Big[1_{A_c} \sup_{n>2} \frac{T_n^2}{n\log\log n} \ge a\Big] = P\Big[1_{A_c} \sup_{n>2} e^{\frac{c T_n}{\sqrt{n\log\log n}}} \ge e^{c\sqrt a}\Big] \ \le\ e^{-c\sqrt a}\, E\Big[1_{A_c} \sup_{n>2} e^{\frac{c T_n}{\sqrt{n\log\log n}}}\Big]. \]
Combining this with Lemma 5.2, we get that for all $c < \delta/\sqrt{2k}$,
\[ E\Big[1_{A_c} \sup_{n>2} \frac{T_n^2}{n\log\log n}\Big] = \int_0^\infty P\Big[1_{A_c} \sup_{n>2} \frac{T_n^2}{n\log\log n} \ge a\Big]\, da \ \le\ \int_0^\infty e^{-c\sqrt a + K_1 + K_2 c^2}\, da = \frac{2}{c^2}\, e^{K_1 + K_2 c^2}. \]
The minimum of the right-hand side is achieved for $c = \min\{\delta/\sqrt{2k},\, 1/\sqrt{K_2}\}$. Also,
\[ \|S_n\|^2 = (S_{1n})^2 + \dots + (S_{kn})^2 \ \le\ \big(|S_{1n}| + \dots + |S_{kn}|\big)^2 = T_n^2. \]
Therefore, for any $\theta \in \Theta$ and for $c = \min\{\delta/\sqrt{2k},\, 1/\sqrt{K_2}\}$, we have
\[ E_{X^\infty \sim P_\theta}\Big[\sup_{n>2} \frac{\|S_n\|^2}{n\log\log n}\Big] \ \le\ E_{X^\infty \sim P_\theta}\Big[1_{A_c} \sup_{n>2} \frac{T_n^2}{n\log\log n}\Big] + E\Big[1_{\{\sup_{n>2} \frac{T_n}{\sqrt{n\log\log n}} < c^{-1}(K_2 c^2 + 3)\}} \sup_{n>2} \frac{T_n^2}{n\log\log n}\Big] \ \le\ \frac{2}{c^2} e^{K_1 + K_2 c^2} + \frac{(K_2 c^2 + 3)^2}{c^2}. \]
This finishes the proof of the theorem.
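The integral evaluation used above, $\int_0^\infty e^{-c\sqrt a}\, da = 2/c^2$ (substitute $a = t^2$, giving $\int_0^\infty 2t\, e^{-ct}\, dt$), can be confirmed numerically; the integration grid below is an arbitrary choice:

```python
import numpy as np

# Numerical confirmation of: integral_0^infty exp(-c sqrt(a)) da = 2 / c^2.
for c in (0.5, 1.0, 2.0):
    a = np.linspace(0.0, 300.0 / c ** 2, 1_000_001)
    fa = np.exp(-c * np.sqrt(a))
    da = a[1] - a[0]
    integral = (fa.sum() - 0.5 * (fa[0] + fa[-1])) * da   # trapezoid rule
    assert abs(integral - 2.0 / c ** 2) < 1e-3 * (2.0 / c ** 2)
```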
5.2.4. Proof of Lemma 5.2.
For simplicity of exposition we only provide the proof for $k = 1$; the proof for $k > 1$ can be found in the Appendix.

Consider a discrete probability measure on $B_\delta(0)$ with density
\[ \gamma(\eta) = \sum_{i \in \mathbb{N}} \gamma_i\, 1_{\eta = \eta_i}, \]
where $\gamma_i = \frac{1}{i(i+1)}$ and $\eta_i = c_0 \sqrt{\frac{-\log\gamma_i}{e^i}}$ for a constant $c_0 > 0$ such that $\eta_i \in B_\delta(0)$ for all $i \in \mathbb{N}$. Notice that the above holds for all $c_0 < \delta$. Then for any fixed $n > 2$,
\[ Z_n = \sum_{i=1}^\infty \gamma_i\, e^{\eta_i S_n - n\sigma\eta_i^2/2} \ \ge\ \max_{i \in \mathbb{N}} \gamma_i\, e^{\eta_i S_n - n\sigma\eta_i^2/2}. \]
Let $i_0 = \lfloor \log n \rfloor$, where $\lfloor x \rfloor = \max\{m \in \mathbb{N} : m \le x\}$. Then we have
(5.4) \[ Z_n \ \ge\ \gamma_{i_0} e^{\eta_{i_0} S_n - n\sigma\eta_{i_0}^2/2} \ \ge\ e^{-\log(\lfloor\log n\rfloor(\lfloor\log n\rfloor+1))} \cdot e^{-\frac{n\sigma c_0^2 \log(\lfloor\log n\rfloor(\lfloor\log n\rfloor+1))}{2 e^{\lfloor\log n\rfloor}}} \cdot e^{c_0 \sqrt{\frac{\log(\lfloor\log n\rfloor(\lfloor\log n\rfloor+1))}{e^{\lfloor\log n\rfloor}}}\, S_n} \cdot 1_{S_n \ge 0}. \]
Note that for $n \ge 2$ we have $\log n \le n - 1$; also $\log\log n > 0$ when $n > e$. Since $\lfloor\log n\rfloor(\lfloor\log n\rfloor+1) \ge \log n$ for all $n > 2$,
(5.5) \[ \log\big(\lfloor\log n\rfloor(\lfloor\log n\rfloor+1)\big) \ \ge\ \log\log n. \]
On the other hand,
(5.6) \[ \log\big(\lfloor\log n\rfloor(\lfloor\log n\rfloor+1)\big) \ \le\ \log\log n + 2\log\log n \ =\ 3\log\log n. \]
Additionally,
(5.7) \[ n/e \ =\ e^{\log n - 1} \ \le\ e^{\lfloor\log n\rfloor} \ \le\ e^{\log n} \ =\ n. \]
We use (5.6) to bound the first factor in (5.4), and (5.5) and (5.7) to bound the second and third factors. Then for any $n > 2$,
\[ Z_n \ \ge\ e^{-3\log\log n - \frac{9\sigma c_0^2}{2}\log\log n + c_0\sqrt{\frac{\log\log n}{n}}\, S_n} \cdot 1_{S_n \ge 0} \ =\ e^{u(n, S_n)} \cdot 1_{S_n \ge 0}, \]
where
\[ u(n, s) = c_0 \log\log n \left(\frac{s}{\sqrt{n\log\log n}} - \frac{9\sigma c_0^2 + 6}{2 c_0}\right). \]
Similarly, by defining $Z'_n$ just as $Z_n$ but now relative to a distribution with the same $\gamma_i$ but $\eta_i = -c_0\sqrt{\frac{-\log\gamma_i}{e^i}}$ (rather than $\eta_i = c_0\sqrt{\frac{-\log\gamma_i}{e^i}}$ as before), for any $n > 2$ we get
\[ Z'_n \ \ge\ e^{u(n, -S_n)} \cdot 1_{S_n < 0}. \]
This equation and the previous one can be re-expressed as
\[ \sqrt{Z_n} \ \ge\ e^{u(n, S_n)/2} \cdot 1_{S_n \ge 0} \qquad\text{and}\qquad \sqrt{Z'_n} \ \ge\ e^{u(n, -S_n)/2} \cdot 1_{S_n < 0}. \]
Since $T_n = |S_n|$, we combine the previous two bounds, taking suprema, to get
\[ \sup_{n>2} e^{u(n, T_n)/2} \ \le\ \sup_{n>2} e^{u(n, S_n)/2} \cdot 1_{S_n \ge 0} + \sup_{n>2} e^{u(n, -S_n)/2} \cdot 1_{S_n < 0} \ \le\ \sup_{n>2} \sqrt{Z_n} + \sup_{n>2} \sqrt{Z'_n}. \]
We can now invoke Corollary 5.4, which gives us
\[ E\Big[\sup_{n>2} e^{u(n, T_n)/2}\Big] \ \le\ 4. \]
Observe that on the set
\[ A_{c_0} = \Big\{\sup_{n>2} \frac{T_n}{\sqrt{n\log\log n}} \ \ge\ \frac{\sqrt{2}}{c_0}\big(9\sigma c_0^2 + 6\big)\Big\} \]
the expression inside the brackets in $u(n, T_n)$ is positive. Since $\log\log n > 0$ when $n > 2$, by writing out $u(\cdot,\cdot)$ in full we get
\[ E\Big[1_{A_{c_0}} \sup_{n>2} e^{\frac{c_0}{2}\log\log n\big(\frac{T_n}{\sqrt{n\log\log n}} - \frac{9\sigma c_0^2 + 6}{2 c_0}\big)}\Big] \ \le\ 4. \]
The desired result follows by taking $c = c_0/\sqrt{2}$.

5.3. Proof of (ii) in Theorem 3.2.
To prove the lower bound, we first show in Lemma 5.3 that, by the law of the iterated logarithm, there is a constant $c > 0$ such that the distance between the (slightly modified) MLE and the truth is at least $c \cdot f(n)$ infinitely often. Then, in Lemma 5.4, we show that for each estimator either (3.3) and (3.4) hold with some constant depending on $c$ and the probability of being 'close' to the MLE, or the estimator is 'far' from the MLE infinitely often. Therefore, we only need to consider the latter estimators. For those estimators, in Section 5.3.2 we lower bound the minimax risk by the Bayes risk (a standard trick in minimax theory). Finally, in Section 5.3.3, for each estimator we introduce a suitable stopping time such that the Bayes risk is bounded from below by a constant that again depends on the $c$ introduced above.

5.3.1. Reducing the number of considered estimators. Consider a standard Gaussian prior $W$ (i.e. $N(0,1)$) on the parameter $\mu \in \mathbb{R}$. Let $\tilde\mu(x^n) = E_{\mu^\star \sim W \mid X^n = x^n}[\mu^\star]$ be the posterior mean based on data $X^n = x^n$. Notice that
\[ \tilde\mu(X^n) = \frac{n}{n+1}\, \hat\mu^{\rm MLE}(X^n) = \frac{1}{n+1} \sum_{i=1}^n X_i. \]
Let $\tilde\mu_n = \tilde\mu(X^n)$ and define the events $\{F_{\mu,n}\}_{n \in \mathbb{N}}$ by
\[ F_{\mu,n} \ \Leftrightarrow\ (\mu - \tilde\mu_n)^2 \ \ge\ c \cdot f(n) \]
for a fixed $c > 0$. In the next lemma we use the law of the iterated logarithm to show that $F_{\mu,n}$ happens infinitely often with probability one.
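The conjugate-prior identity above can be verified by brute force (our own sketch; the data draw and the $\mu$-grid are arbitrary choices):

```python
import numpy as np

# Under the N(0,1) prior and N(mu, 1) observations, the posterior mean equals
# (n/(n+1)) * MLE = sum(X) / (n+1).  We compare the closed form against
# numerical integration of the posterior over a mu-grid.
rng = np.random.default_rng(4)
X = rng.normal(1.2, 1.0, size=10)
n = len(X)

mu = np.linspace(-10.0, 10.0, 200_001)
log_post = -mu ** 2 / 2 - 0.5 * ((X[:, None] - mu) ** 2).sum(axis=0)
w = np.exp(log_post - log_post.max())     # unnormalized posterior density
numeric = (mu * w).sum() / w.sum()        # posterior mean by quadrature
closed_form = X.sum() / (n + 1)           # = (n/(n+1)) * MLE

assert abs(numeric - closed_form) < 1e-6
```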
LEMMA 5.3. There exists $c > 0$ such that for all $\mu \in \mathbb{R}$,
(5.8) \[ P_\mu[F_{\mu,n}\ \text{i.o.}] = 1. \]

PROOF. Suppose the statement does not hold. Then for every $c > 0$ there exist $\mu \in \mathbb{R}$ and $\delta > 0$ such that
(5.9) \[ P_\mu\big[\exists N > 0\ \text{s.t.}\ \forall n > N:\ (\mu - \tilde\mu_n)^2 \le c f(n)\big] \ \ge\ \delta. \]
Fix $c > 0$ and consider $\mu$ and $\delta$ such that (5.9) is satisfied. Notice that
\[ |\mu - \hat\mu^{\rm MLE}_n| \ \le\ \Big|\mu - \hat\mu^{\rm MLE}_n + \frac{1}{n}\mu\Big| + \frac{1}{n}|\mu| \ =\ \frac{n+1}{n}\, |\mu - \tilde\mu_n| + \frac{1}{n}|\mu|. \]
When $(\mu - \tilde\mu_n)^2 \le c f(n)$ and $n > \max\{16, \mu^2/c\}$, we have
\[ |\mu - \hat\mu^{\rm MLE}_n| \ \le\ \frac{n+1}{n}\sqrt{c f(n)} + \frac{1}{n}|\mu| \ \le\ 3\sqrt{c f(n)}. \]
Therefore,
\[ P_\mu\big[\exists N > 0\ \text{s.t.}\ \forall n > \max\{16, \mu^2/c, N\}:\ (\mu - \hat\mu^{\rm MLE}_n)^2 \le 9 c f(n)\big] \ \ge\ \delta. \]
However, since $\hat\mu^{\rm MLE}_n = \frac{1}{n}\sum_{i=1}^n X_i$, we can apply the law of the iterated logarithm (see e.g. Hartman and Wintner (1941)), according to which, for any $c' \in (0, 2)$,
\[ P_\mu\big[\forall N > 0\ \exists n > N\ \text{s.t.}\ (\mu - \hat\mu^{\rm MLE}_n)^2 \ge c' f(n)\big] = 1. \]
Choosing $c \in (0, 2/9)$ leads to a contradiction, which proves the result.

Let $c$ be such that (5.8) holds. Furthermore, let $g : \mathbb{N} \to \mathbb{R}^+$ be any non-increasing function with $f(n)/g(n) \to \infty$. Finally, for a fixed estimator $\hat\mu$ define the events $G_n$ by
\[ G_n \ \Leftrightarrow\ (\tilde\mu(X^n) - \hat\mu(X^n))^2 \ \ge\ \tfrac{c}{4} \cdot f(n). \]
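The iterated-logarithm behaviour underlying the lemma can be illustrated by simulation (the constant $c' = 0.5$ and all horizons are our own choices): on most paths of a standard Gaussian random walk, the event $(S_n/n)^2 \ge c' f(n)$ recurs even at large $n$:

```python
import numpy as np

# For a standard Gaussian random walk, count the paths on which
# (S_n / n)^2 >= c' * f(n), with f(n) = log log n / n, still occurs
# somewhere in 10^4 < n <= 10^5 (a proxy for "infinitely often").
rng = np.random.default_rng(5)
paths, n_max, c_prime = 200, 100_000, 0.5

n = np.arange(10_001, n_max + 1)
threshold = c_prime * np.log(np.log(n)) / n

late_hits = 0
for _ in range(paths):
    S = np.cumsum(rng.normal(0.0, 1.0, size=n_max))[10_000:]  # S_n, n > 10^4
    if np.any((S / n) ** 2 >= threshold):
        late_hits += 1

frac = late_hits / paths
print(f"fraction of paths with a late exceedance: {frac:.2f}")
```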
In the following lemma we show that we only need to consider estimators for which $P_\mu(G_n\ \text{i.o.}) = 1$ holds for all $\mu \in \mathbb{R}$.

LEMMA 5.4. Let $\hat\mu$ be an arbitrary estimator. Then either (3.3) and (3.4) hold, or
(5.10) \[ \text{for all } \mu \in \mathbb{R}:\quad P_\mu[G_n\ \text{i.o.}] = 1. \]

PROOF. Fix an arbitrary $\mu$ and let $\bar G_n$ be the complement of $G_n$. We consider two cases, depending on whether $P_\mu[F_{\mu,n} \cap \bar G_n\ \text{i.o.}] > 0$ or not.

Case 1: $P_\mu[F_{\mu,n} \cap \bar G_n\ \text{i.o.}] > 0$. In this case, there exists $\epsilon > 0$ such that for all $n_0 \in \mathbb{N}$ there is an $n_1 > n_0$ such that
\[ P_\mu\big[F_{\mu,n} \cap \bar G_n\ \text{holds for some } n \text{ with } n_0 < n < n_1\big] \ \ge\ \epsilon. \]
Define the stopping time $\tau = \min\{n_1, n_2\}$ with $n_1$ as above and $n_2$ the smallest $n > n_0$ such that $F_{\mu,n} \cap \bar G_n$ holds. Then $\tau$ is finite and we must have
\[ E_{X^\infty \sim P_\mu}\Big[\frac{(\hat\mu(X^\tau) - \mu)^2}{g(\tau)}\Big] = E\Big[\frac{(\hat\mu(X^\tau) - \mu)^2}{f(\tau)} \cdot \frac{f(\tau)}{g(\tau)}\Big] \ \ge\ \epsilon \cdot \frac{c}{4} \cdot \min_{n_0 < n \le n_1} \frac{f(n)}{g(n)}. \]

5.3.2. Translating the problem into bounding the Bayes risk. First, we state a general measure-theoretic result on the existence of the joint density of $(\tau, X^\tau)$. The proof can be found in the Appendix.

PROPOSITION 5.5. Let $\{P_\mu,\ \mu \in M \subseteq \mathbb{R}\}$ be a set of probability measures on some space $(\mathcal{X}, \mathcal{B}, \nu)$ such that for all $\mu \in M$ there exists a conditional density function $p(\cdot \mid \mu) : \mathcal{X} \to \mathbb{R}^+$ with respect to $\nu$. For each $\mu \in M$ consider $X^n = (X_1, \dots, X_n)$, a vector of i.i.d. random variables distributed according to $P_\mu$. Let $\tau$ be an a.s.-finite stopping time with respect to the filtration $\mathcal{F} = (\mathcal{F}_n = \sigma(X_1, \dots, X_n))_{n \in \mathbb{N}}$. Let $Y = (\tau, X^\tau)$ be a random variable taking values in $\mathcal{Y} = \bigcup_{n=1}^\infty \{n\} \times \mathcal{X}^n$. Then there exist a $\sigma$-algebra $\Sigma$ and a measure $\nu_Y$ on $\mathcal{Y}$ such that for each $\mu \in M$ the random variable $Y$ is $\Sigma$-measurable and has a density function $p(\cdot \mid \mu) : \mathcal{Y} \to \mathbb{R}^+$ with respect to $\nu_Y$.

Then, using Bayes' theorem (see e.g. Theorem 1.16 in Schervish (1995)), we get the following.

COROLLARY 5.6. Consider any prior $W$ on $\mathbb{R}$ with density $w(\mu)$. Let $W_y(\cdot \mid y)$ denote the conditional distribution of $\mu$ given $Y = y$.
Then $W_y \ll W$ and there exists a conditional density $w(\mu \mid n, x^n)$ such that for any $Y = (n, x^n)$,
\[ w(\mu \mid n, x^n) = \frac{w(\mu)\, p(n, x^n \mid \mu)}{\int_M p(n, x^n \mid \tilde\mu)\, w(\tilde\mu)\, d\tilde\mu}. \]

Notice that for any estimator $\hat\mu_n = \hat\mu(X^n)$, stopping time $\tau^\star \in \mathcal{T}$, prior $W$ on $\mathbb{R}$, and function $g : \mathbb{N} \to \mathbb{R}^+$ we have
(5.11) \[ \sup_{\mu \in \mathbb{R}} \sup_{\tau \in \mathcal{T}} E_{X^\infty \sim P_\mu}\Big[\frac{(\mu - \hat\mu_\tau)^2}{g(\tau)}\Big] \ \ge\ E_{\mu \sim W} E_{X^\infty \sim P_\mu}\Big[\frac{(\mu - \hat\mu_{\tau^\star})^2}{g(\tau^\star)}\Big]. \]
Denote $Y = (\tau^\star, X^{\tau^\star}) \in \mathcal{Y}$ and $h(\mu, Y) = \frac{(\mu - \hat\mu_{\tau^\star})^2}{g(\tau^\star)}$. Let $w(\mu)$ be the density function of the prior $W = N(0,1)$. Then, using the results of Corollary 5.6, we have
\[ E_{\mu \sim W} E_{X^\infty \sim P_\mu}[h(\mu, Y)] = \int_{\mathbb{R}} \int_{\mathcal{Y}} h(\mu, Y)\, p(Y \mid \mu)\, w(\mu)\, dY\, d\mu = \int_{\mathbb{R}} \int_{\mathcal{Y}} h(\mu, Y)\, w(\mu \mid Y) \int_{\mathbb{R}} p(Y \mid \mu^\star)\, w(\mu^\star)\, d\mu^\star\, dY\, d\mu \]
\[ = \int_{\mathbb{R}} \int_{\mathcal{Y}} \int_{\mathbb{R}} h(\mu, Y)\, w(\mu \mid Y)\, d\mu\; p(Y \mid \mu^\star)\, dY\; w(\mu^\star)\, d\mu^\star = E_{\mu^\star \sim W} E_{Y \sim P_{\mu^\star}} E_{\mu \sim W \mid Y}[h(\mu, Y)] = E_{Y \sim \bar P} E_{\mu \sim W \mid Y}[h(\mu, Y)], \]
where $\bar P$ is the Bayes marginal distribution based on the prior $W$.

Therefore, in order to prove the theorem we only need to show that for all estimators $\hat\mu$ that satisfy (5.10) there exists a stopping time $\tau^\star$ such that, for some $C > 0$ (which will depend on $c$):

(i) for the function $f(n) = n^{-1}\log\log n$ (with $f(n) = 1$ when $n = 1, 2$),
\[ E_{Y \sim \bar P} E_{\mu \sim W \mid Y}\Big[\frac{(\mu - \hat\mu_{\tau^\star})^2}{f(\tau^\star)}\Big] \ \ge\ C; \]
(ii) for all non-increasing functions $g : \mathbb{N} \to \mathbb{R}^+$ such that $f(n)/g(n) \to \infty$,
\[ E_{Y \sim \bar P} E_{\mu \sim W \mid Y}\Big[\frac{(\mu - \hat\mu_{\tau^\star})^2}{g(\tau^\star)}\Big] = \infty. \]

5.3.3. Defining the suitable stopping time. For a fixed estimator $\hat\mu$ that satisfies (5.10) and a fixed $n_0 > 0$, define the stopping time $\tau^\star$ as
\[ \tau^\star = \min\{n \in \mathbb{N} :\ n > n_0\ \text{and}\ G_n\ \text{holds}\}. \]
By (5.10) this stopping time is $P_\mu$-a.s. finite for all $\mu$. Let $Y = (\tau^\star, X^{\tau^\star})$.
For every $Y = (n, x^n)$ we have
\[ E_{\mu \sim W \mid Y = (n, x^n)}[h(\mu, Y)] = E_{\mu \sim W \mid X^n = x^n}\Big[\frac{(\mu - \hat\mu_n)^2}{g(n)}\Big] = E_{\mu \sim W \mid X^n = x^n}\Big[\frac{(\mu - \tilde\mu_n)^2 + (\tilde\mu_n - \hat\mu_n)^2}{g(n)}\Big] \ \ge\ \frac{(\tilde\mu_n - \hat\mu_n)^2}{g(n)}. \]
The first equality holds because the event $\{\tau^\star = n\}$ is completely determined by the event $\{X^n = x^n\}$. The second equality holds because $\tilde\mu_n = \tilde\mu(X^n)$ is the posterior mean given $X^n$ based on the prior $W$. Furthermore, by the definition of $\tau^\star$, the vector $X^{\tau^\star}$ satisfies $(\tilde\mu(X^{\tau^\star}) - \hat\mu(X^{\tau^\star}))^2 \ge \frac{c}{4} \cdot f(\tau^\star)$. Then for every $Y = (n, x^n)$,
\[ E_{\mu \sim W \mid Y = (n, x^n)}\Big[\frac{(\mu - \hat\mu_n)^2}{g(n)}\Big] \ \ge\ \frac{c}{4} \cdot \min_{n > n_0} \frac{f(n)}{g(n)}. \]
Since we can choose $n_0$ arbitrarily large and $f(n)/g(n) \to \infty$, the desired results follow.

APPENDIX: REMAINING PROOFS

Proof of Lemma 5.2: $k > 1$. Let $\gamma_i = \frac{1}{i(i+1)}$ and $\eta_i = c_0\sqrt{\frac{-\log\gamma_i}{e^i}}$ for some constant $c_0 > 0$ such that $(\eta_i)^2 \le \delta^2/k$ for all $i \in \mathbb{N}$. The above holds for all positive $c_0 < \delta/\sqrt{k}$. Furthermore, let
\[ P = \{\rho = (\rho_1, \dots, \rho_k) \in \{-1, 1\}^k\ \text{such that}\ \rho_1 = 1\}. \]
Notice that $|P| = 2^{k-1}$. For a fixed $\rho \in P$ consider a discrete probability measure on $B_\delta(0)$ with density
\[ \gamma(\eta) = \sum_{i \in \mathbb{N}} \gamma_i\, 1_{\eta = \eta_{i,\rho}} \]
with $\eta_{i,\rho} = (\eta_i \rho_1, \dots, \eta_i \rho_k)^T$. Then, since $\eta_{i,\rho}^T \eta_{i,\rho} = k(\eta_i)^2$, we have
\[ Z_n = \sum_{i=1}^\infty \gamma_i\, e^{\eta_{i,\rho}^T S_n - n\sigma\eta_{i,\rho}^T\eta_{i,\rho}/2} \ \ge\ \max_{i \in \mathbb{N}} \gamma_i\, e^{\eta_i T_{n,\rho} - n\sigma k(\eta_i)^2/2}, \]
where $T_{n,\rho} = S_{1n}\rho_1 + \dots + S_{kn}\rho_k$. Now we apply the argument from the one-dimensional case and get, for any $\rho \in P$,
\[ E\Big[\sup_{n>2} e^{\frac{c_0}{2}\log\log n\big(\frac{|T_{n,\rho}|}{\sqrt{n\log\log n}} - \frac{9\sigma k c_0^2 + 6}{2 c_0}\big)}\Big] \ \le\ 4. \]
Notice that $T_n = |S_{1n}| + \dots + |S_{kn}| = \max_{\rho \in P} |T_{n,\rho}|$.
Since $|P| = 2^{k-1}$, we have
\[ E\Big[\sup_{n>2} e^{\frac{c_0}{2}\log\log n\big(\frac{T_n}{\sqrt{n\log\log n}} - \frac{9\sigma k c_0^2 + 6}{2 c_0}\big)}\Big] = E\Big[\sup_{n>2} \max_{\rho \in P} e^{\frac{c_0}{2}\log\log n\big(\frac{|T_{n,\rho}|}{\sqrt{n\log\log n}} - \frac{9\sigma k c_0^2 + 6}{2 c_0}\big)}\Big] \]
\[ \le\ \sum_{\rho \in P} E\Big[\sup_{n>2} e^{\frac{c_0}{2}\log\log n\big(\frac{|T_{n,\rho}|}{\sqrt{n\log\log n}} - \frac{9\sigma k c_0^2 + 6}{2 c_0}\big)}\Big] \ \le\ 2^{k+1}. \]
The desired result follows by taking $c = c_0/\sqrt{2}$ and the fact that $\log\log n > 0$ when $n > 2$.

Proof of Proposition 5.5. Consider $\mathcal{Y} = \bigcup_{n=1}^\infty \{n\} \times \mathcal{X}^n$. For each $A \subset \mathcal{Y}$ let $A[n] = \{a \in \mathcal{X}^n : (n, a) \in A\}$. We endow $\mathcal{Y}$ with a $\sigma$-algebra $\Sigma$ by setting $A \in \Sigma$ iff $A[n] \in \mathcal{B}^n$ for all $n$. Here $\mathcal{B}^n$ is the product $\sigma$-algebra on $\mathcal{X}^n$. Clearly $\Sigma$ is a $\sigma$-algebra. For $A \in \Sigma$ let
\[ \nu_Y(A) = \sum_{n=1}^\infty \nu^n(A[n]), \]
where $\nu^n = \nu^{\otimes n}$. Notice that $\nu_Y$ is a $\sigma$-finite measure on $\mathcal{Y}$.

Consider any $\mu \in M$. For $A \in \Sigma$ let $P_Y[A \mid \mu] = P[(\tau, X^\tau) \in A \mid \mu]$. Notice that $P_Y[\emptyset \mid \mu] = 0$, $P_Y[\mathcal{Y} \mid \mu] = 1$, and $P_Y[A \mid \mu] \in [0, 1]$ for every $A \in \Sigma$. Also, for any countable collection $A_i \in \Sigma$ of pairwise disjoint sets we have
\[ P_Y\Big[\bigcup_{i=1}^\infty A_i \ \Big|\ \mu\Big] = \sum_i P_Y[A_i \mid \mu]. \]
Therefore $P_Y[\cdot \mid \mu]$ is a probability measure. Furthermore, consider $A \in \Sigma$ such that $\nu_Y(A) = 0$, where it suffices to consider product sets $A[n] = A_1[n] \times \dots \times A_n[n]$ with $A_i[n] \subset \mathcal{X}$. Then
\[ \nu_Y(A) = \sum_{n=1}^\infty \prod_{i=1}^n \nu(A_i[n]) = 0. \]
Thus, for each $n$ there is an $i \in \{1, \dots, n\}$ with $\nu(A_i[n]) = 0$. Since $P_\mu$ is absolutely continuous with respect to $\nu$, we have $P[X_i \in A_i[n] \mid \mu] = 0$ for that $i$. Therefore, for each $n$,
\[ P_Y[\{n\} \times A[n] \mid \mu] = P[\tau = n,\ X^n \in A[n] \mid \mu] \ \le\ P[X^n \in A[n] \mid \mu] = \prod_{i=1}^n P[X_i \in A_i[n] \mid \mu] = 0. \]
Then $P_Y[A \mid \mu] = \sum_{n=1}^\infty P_Y[\{n\} \times A[n] \mid \mu] = 0$. Therefore, $P_Y[\cdot \mid \mu]$ is absolutely continuous with respect to $\nu_Y$ for every $\mu \in M$. By the Radon-Nikodym theorem there exists a density $p_Y(n, x^n \mid \mu)$ with respect to the measure $\nu_Y$.

Acknowledgements. This work is part of the research programme Safe Bayesian Inference with project number 617.001.651, which is financed by the Dutch Research Council (NWO). We thank Wouter Koolen for several useful conversations.

REFERENCES

BALSUBRAMANI, A. (2014). Sharp finite-time iterated-logarithm martingale concentration. arXiv preprint arXiv:1405.2639.

BROWN, L. D. (1986). Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory. Institute of Mathematical Statistics Lecture Notes-Monograph Series. Institute of Mathematical Statistics, Hayward, CA. MR882001

DARLING, D. A. and ROBBINS, H. (1968). Some further remarks on inequalities for sample sums. Proc. Nat. Acad. Sci. U.S.A.

GRÜNWALD, P. D. (2007). The Minimum Description Length Principle. MIT Press.

GRÜNWALD, P., DE HEIDE, R. and KOOLEN, W. (2019). Safe Testing. arXiv preprint arXiv:1906.07801.

HARTMAN, P. and WINTNER, A. (1941). On the law of the iterated logarithm. Amer. J. Math.

HOWARD, S. R., RAMDAS, A., MCAULIFFE, J. and SEKHON, J. (2018). Uniform, nonparametric, non-asymptotic confidence sequences. arXiv preprint arXiv:1810.08240.

KALLENBERG, O. (2002). Foundations of Modern Probability, second ed. Probability and Its Applications (New York). Springer-Verlag, New York. MR1876169

MOLENBERGHS, G., KENWARD, M. G., AERTS, M., VERBEKE, G., TSIATIS, A. A., DAVIDIAN, M. and RIZOPOULOS, D. (2014). On random sample size, ignorability, ancillarity, completeness, separability, and degeneracy: sequential trials, random sample sizes, and missing data. Stat. Methods Med. Res.

PACE, L. and SALVAN, A. (2019). Likelihood, replicability and Robbins' confidence sequences. International Statistical Review.

SCHERVISH, M. J. (1995). Theory of Statistics. Springer Series in Statistics. Springer-Verlag, New York. MR1354146

SHAFER, G., SHEN, A., VERESHCHAGIN, N. and VOVK, V. (2011). Test martingales, Bayes factors and p-values. Statist. Sci.

TSYBAKOV, A. B. (2009). Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer, New York. Revised and extended from the 2004 French original, translated by Vladimir Zaiats.

VAN DER PAS, S. and GRÜNWALD, P. (2018). Almost the best of three worlds: risk, consistency and optional stopping for the switch criterion in nested model selection. Statist. Sinica.

VAN DER VAART, A. W. (1998). Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge. MR1652247

VAN ERVEN, T., DE ROOIJ, S. and GRÜNWALD, P. (2008). Catching up faster in Bayesian model selection and model averaging. In Advances in Neural Information Processing Systems.

VOVK, V. and WANG, R. (2019). Combining e-values and p-values. Available at SSRN.

WASSERMAN, L. (2006). All of Nonparametric Statistics. Springer Texts in Statistics. Springer, New York. MR2172729

YANG, Y. (2005). Can the strengths of AIC and BIC be shared? A conflict between model identification and regression estimation. Biometrika 92.