Admissible anytime-valid sequential inference must rely on nonnegative martingales
Aaditya Ramdas, Johannes Ruf, Martin Larsson, Wouter M. Koolen
Departments of Statistics and Machine Learning, Carnegie Mellon University;
Department of Mathematics, London School of Economics;
Department of Mathematics, Carnegie Mellon University;
Machine Learning Group, CWI Amsterdam
[email protected], [email protected], [email protected]@andrew.cmu.edu, [email protected]
September 24, 2020
Abstract
Wald's anytime-valid p-values and Robbins' confidence sequences enable sequential inference for composite and nonparametric classes of distributions at arbitrary stopping times, as do more recent proposals involving Vovk's 'e-values' or Shafer's 'betting scores'. Examining the literature, one finds that at the heart of all these (quite different) approaches has been the identification of composite nonnegative (super)martingales. Thus, informally, nonnegative (super)martingales are known to be sufficient for valid sequential inference. Our central contribution is to show that martingales are also universal: all admissible constructions of (composite) anytime p-values, confidence sequences, or e-values must necessarily utilize nonnegative martingales (or so-called max-martingales in the case of p-values). Sufficient conditions for composite admissibility are also provided. Our proofs utilize a plethora of modern mathematical tools for composite testing and estimation problems: max-martingales, Snell envelopes, and new Doob-Lévy martingales make appearances in previously unencountered ways. Informally, if one wishes to perform anytime-valid sequential inference, then any existing approach can be recovered or dominated using martingales. We provide several sophisticated examples, with special focus on the nonparametric problem of testing if a distribution is symmetric, where our new constructions render past methods inadmissible.

Keywords:
Admissibility; anytime p-value; composite nonnegative supermartingale; confidence sequence; Doob-Lévy martingale; e-value; max-martingale; optional stopping; sequential inference; Snell envelope; symmetric distribution; Ville's inequality.

Contents

1 Introduction
2 Wald's anytime p-values and Robbins' confidence sequences
4.1 $\mathcal{Q}$-valid p-values, $\mathcal{Q}$-safe e-values, and $(\mathcal{Q}, \alpha)$-sequential tests
4.2 $(\phi, \mathcal{P}, \alpha)$-confidence sequences
4.3 Reductions between the four instruments
4.4 Some basic closure properties
6.1 Necessary and sufficient conditions for p-values (proof)
6.2 Necessary and sufficient conditions for e-values (proof)
6.3 Necessary and sufficient conditions for sequential tests (proof)
A.1 Reference measures and local absolute continuity
A.2 Essential supremum and infimum
A.3 On the choice of filtration
B Omitted proofs
C Auxiliary examples

1 Introduction
Our fairly mathematical treatment will be immensely helped by a concrete hypothetical example. Consider a sociology (or your favorite discipline) laboratory that wishes to understand if a particular intervention ('treatment') has any positive effect whatsoever on a prespecified outcome of interest. Without getting too bogged down by the details, suppose the 'average treatment effect' of the intervention over the relevant population is denoted by $\theta$. Suppose they want to test $H_0: \theta \le 0$ against $H_1: \theta > 0$, or to estimate $\theta$ using a confidence interval. The lab believes that there is an effect, but has no idea how many subjects to collect data from: a larger sample size means more power, but also more time and money. So they conduct their experiment sequentially: subjects enter the study one at a time and are assigned to treatment or control completely at random; denote the data from subject $t$ as $X_t$. After observing $X_t$, they analyze the data $X_1, \ldots, X_t$ they have so far, and decide if they wish to collect more data, or whether what they already have suffices to demonstrate an effect (to themselves, or to a journal, or to the world). Thus, the lab stops their experiment at a data-dependent stopping time $\tau$: maybe time ran out, or they used the money up faster than expected; maybe the effect was sufficiently large, or perhaps they lowered their sights by being satisfied with a smaller effect, or became more optimistic and kept the experiment running longer in the hope of a narrower confidence interval around a (hopefully) large effect.
In other words, the stopping criterion used may itself have changed over time with funding coming in or drying up, or with initial results being more or less promising than anticipated. In any case, the experiment was stopped at time $\tau$ and not earlier or later, and there could be multiple data-dependent reasons for stopping at $\tau$ that were impossible to anticipate in advance.

On adopting the testing approach, they may hope to construct a sequence of p-values $(p_t)_{t \in \mathbb{N}}$ that satisfies

$$\text{for any arbitrary stopping time } \tau: \quad \Pr_{H_0}(p_\tau \le a) \le a, \text{ for all } a \in [0,1], \tag{1}$$

which is simply the definition of a p-value at stopping time $\tau$. Unfortunately, naively using a t-test, a chi-squared test, or permutations does not yield a p-value with this property. Indeed, these types of 'standard' non-sequential p-values only satisfy the weaker property

$$\text{for any data-independent time } t: \quad \Pr_{H_0}(p_t \le a) \le a, \text{ for all } a \in [0,1].$$

The most straightforward way to construct 'anytime-valid p-values' that satisfy property (1) is to employ bona fide sequential tests like Wald's sequential likelihood ratio test [25, 26], but extensions to nonparametric settings have also been explored recently [10, 27]. These are also called sequentially-adjusted p-values [5], or always-valid p-values [12].

If instead the lab had adopted the confidence interval approach, then, specifying an error tolerance $\alpha \in [0,1]$, they may hope to construct a sequence of confidence sets (often intervals) $(C_t(\alpha))_{t \in \mathbb{N}}$ which satisfies

$$\text{for any arbitrary stopping time } \tau: \quad \Pr(\theta \in C_\tau(\alpha)) \ge 1 - \alpha. \tag{2}$$

Once more, unfortunately, a naive confidence interval based on the central limit theorem or the bootstrap does not satisfy the desired property. As before, these 'standard' constructions instead satisfy the above property only at fixed data-independent times $t$.
To satisfy property (2), one could employ 'confidence sequences' proposed by Robbins and collaborators like Darling, Siegmund, and Lai [3, 18, 14], which have regained interest in recent years [17, 11]; this is a central topic of this paper and we return to it later.

Recently, another set of highly interrelated ideas has been put forward under a variety of names by authors such as Shafer, Vovk, Grünwald, and their collaborators: test martingales [20] (since the test statistics are sometimes martingales), e-values or sequential e-values [23, 24] (e for expectation), betting scores [19] (since they have roots in gambling), or safe e-values [9] (safe under optional stopping). Even though these concepts often have origins in parametric settings, the ideas have been extended to complicated nonparametric settings involving composite irregular models [27, 10]. We try to strike a balance of terminologies: we adopt the term 'e-value' and denote it by $e$. Importantly, our use of the term 'safe' does not in any way imply that the other two concepts (anytime p-values and confidence sequences) are unsafe; indeed they are also safe against optional stopping and continuation of experiments. For this paper, a 'safe e-value' is a nonnegative sequence $(e_t)_{t \in \mathbb{N}}$ that satisfies

$$\text{for any arbitrary stopping time } \tau: \quad \mathbb{E}_{H_0}[e_\tau] \le 1. \tag{3}$$

As before, a standard e-value (that is not anytime-valid) exhibits the above property only at fixed times $t$, which does not suffice for our sequentially motivated applications.

More generally, each of the above modes of inference is often used to perform sequential testing of $H_0$, but they are not necessarily exhaustive. We may instead want to directly consider a level-$\alpha$ sequential test, which is a decision rule that maps the data (and $\alpha$) onto $\{0, 1\}$, and stop when the test first outputs one (rejection of the null).
Formally, a level-$\alpha$ sequential test is a binary sequence $(\psi_t)_{t \in \mathbb{N}}$ that satisfies

$$\text{for any arbitrary stopping time } \tau: \quad \Pr_{H_0}(\psi_\tau = 1) \le \alpha. \tag{4}$$

Once more, standard nonsequential tests only satisfy such a type-I error guarantee at fixed times $t$. Instead, anytime-valid p-values, e-values, or confidence sets can each be used to derive a level-$\alpha$ sequential test.

We formally define all of the above concepts for composite nulls in Section 4, but the above semi-formal description suffices for the moment. One common theme amongst all the aforementioned works over the decades is the repeated appearance of various, often sophisticated, nonnegative supermartingales as the central object that enables all four types of anytime-valid inference, no matter what name they go under. In the rest of this paper, we further examine this central role of nonnegative martingales in constructing p-values, confidence sets, and e-values with the desired robustness to optional stopping (and continuation) of experiments. Specifically, we show that all admissible constructions of these objects must employ nonnegative martingales (either explicitly, or implicitly under the hood).

We provide a single example here of condition (2) for the reader to have a concrete instance in mind. In the above setup, suppose the $X_s$ are i.i.d. Gaussian with mean $\theta$ and unit variance; then it can be shown [11] that for any finite stopping time $\tau$,

$$\Pr\left(\theta \in \left[\frac{\sum_{s \le \tau} X_s}{\tau} \pm \sqrt{\frac{(1 + 1/\tau)\log((\tau+1)/\alpha^2)}{\tau}}\,\right]\right) \ge 1 - \alpha. \tag{5}$$

Of course, at any fixed time $t$, with $z_q$ denoting the $q$-quantile of a standard Gaussian $Z$, we could have used a width of $z_{1-\alpha/2}/\sqrt{t}$. To approximate $z_{1-\alpha/2}$, note that the Gaussian tail inequality yields that for $t \ge 1$, we have $\Pr(Z > t) \le (\sqrt{2\pi})^{-1} \exp(-t^2/2)$. Setting the right hand side to $\alpha/2$, we get that $z_{1-\alpha/2} \le \sqrt{2\log(2/(\sqrt{2\pi}\,\alpha))}$, which is known to be reasonably tight for small $\alpha$. Thus the main difference in the time-uniform bound is the presence of an additional $\approx \sqrt{\log t}$ factor. The above inequality is proved by applying Ville's inequality (see (7a) below) to an exponential Gaussian-mixture martingale. These are tools we encounter later in this paper, so we do not elaborate on them further here.

Admissibility.
Naturally, the desire for methods satisfying properties like (1), (2), (3), or (4) comes with an implicit wish for efficiency. In other words, setting $p_\tau = 1$, $e_\tau = 1$, $\psi_\tau = 0$, or $C_\tau = \mathbb{R}$ (or $C_\tau = \Theta$ for a more general parameter space) trivially satisfies those requirements, but is clearly uninformative. We want $p_\tau$ to be as small as possible, $e_\tau$ and $\psi_\tau$ to be as large as possible, and $C_\tau$ to be as narrow as possible, while still being statistically valid measures of uncertainty. We use the term 'dominates' to compare pairs of these objects (in order to avoid using case-by-case adjectives like small/large/narrow): if $p' \le p$ then $p'$ dominates $p$. Similarly, if $e' \ge e$ then $e'$ dominates $e$; if $\psi' \ge \psi$ then $\psi'$ dominates $\psi$; and if $C' \subseteq C$ then $C'$ dominates $C$. In this paper, we use the notion of admissibility to capture this idea: informally, a p-value (or e-value, test, confidence set) is inadmissible if it is strictly dominated by another p-value (or e-value, test, confidence set). We define admissibility more formally in Section 4.

Paper outline.
Sections 2 and 3 lay out the formal definitions of several of the basic tools: nonnegative (super)martingales, max-martingales, Doob's optional stopping theorem, and Ville's inequality. Section 4 introduces the four central tools of anytime-valid sequential inference: Wald's anytime p-values, safe e-values, sequential tests, and Robbins' confidence sequences. Section 5 provides two simple examples: Gaussian and symmetric (super)martingales. Then Section 6 and Section 7 summarize this paper's central message about the centrality of nonnegative martingales in constructing the aforementioned tools. Section 6 formalizes the necessary and sufficient conditions for admissibility in the point null setting; it uses a Doob-Lévy max-martingale construction to show the necessary conditions for p-values, a Doob-Lévy martingale for sequential tests, and uses the Doob decomposition of an appropriate Snell envelope to prove that admissible e-values must also (explicitly or implicitly) employ nonnegative martingales. Section 7 develops several novel reductions of admissibility in the composite null setting to the point null case, and presents extensions to estimation (confidence sequences). Section 8 presents deeper investigations on admissibility, including anti-concentration bounds and the role of randomization. Section 9 utilizes the learnt lessons to produce admissible tests for symmetry. Appendix A recaps certain technical concepts like local domination and essential suprema. Appendix B details all proofs that are not in the main paper. Finally, Appendix C contains examples and counterexamples to support several claims made in the paper.

2 Wald's anytime p-values and Robbins' confidence sequences

The following lemma is quite central to the construction and interpretation of p-values, confidence sets, and sequential tests that are valid at arbitrary stopping times.

Lemma 1 (Equivalence lemma).
Let $(A_t)_{t \in \mathbb{N}}$ be an adapted sequence of events in some filtered probability space and let $A_\infty := \limsup_{t \to \infty} A_t := \bigcap_{t \in \mathbb{N}} \bigcup_{s \ge t} A_s$. The following statements are then equivalent:

(i) $\Pr(\bigcup_{t \in \mathbb{N}} A_t) \le \alpha$.

(ii) $\sup_T \Pr(A_T) \le \alpha$, where $T$ ranges over random times, possibly infinite, not necessarily stopping times.

(iii) $\sup_\tau \Pr(A_\tau) \le \alpha$, where $\tau$ ranges over stopping times, possibly infinite.

The proof can be found in Appendix B. If the event $A_t$ is associated with making an erroneous claim at time $t$, we interpret the aforementioned three statements as follows:

(i) The probability of ever making an erroneous claim, from time one to infinity, is at most $\alpha$.

(ii) The probability of making an erroneous claim at an arbitrary data-dependent time $T$, perhaps chosen post-hoc as a past time after an experiment is stopped, is at most $\alpha$.

(iii) When we stop an experiment at an arbitrary stopping time $\tau$, the probability of making an erroneous claim at that time is at most $\alpha$.

Intuitively, it is clear that (i) implies (ii), which in turn implies (iii), but the aforementioned lemma establishes that all three properties are actually equivalent: if you want one of them, you get all of them for free. This lemma gives the first hint of the centrality of martingales: the third statement is very directly about optional stopping, even though this fact is somewhat masked in the first two ways of framing the desired error control. While (iii) enables inferences at stopping times as initially motivated, property (ii) allows further introspection at previous times, enabling statistically valid answers to questions like 'what was the estimate of the treatment effect at time $\tau/2$?' (where $\tau$ was the stopping time). The above lemma first appeared recently in Howard et al. [11].
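To make the time-uniform property (i) concrete, here is a small Monte Carlo sketch (our own illustration, not from the paper): under i.i.d. $N(\theta, 1)$ data we estimate the probability that the fixed-time 95% CLT interval ever misses $\theta$ over a long horizon, versus the same probability for the Gaussian-mixture confidence sequence of display (5), whose boundary we take to be $\sqrt{(1 + 1/t)\log((t+1)/\alpha^2)/t}$. The horizon and sample counts are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, n_paths, horizon = 0.05, 2000, 1000
theta = 0.0                                    # true mean; data are i.i.d. N(theta, 1)

t = np.arange(1, horizon + 1)
x = rng.standard_normal((n_paths, horizon)) + theta
mean = np.cumsum(x, axis=1) / t                # running sample mean at each time t

z = 1.959964                                   # z_{1 - alpha/2} for alpha = 0.05
clt_half = z / np.sqrt(t)                      # fixed-time 95% CLT half-width
cs_half = np.sqrt((1 + 1 / t) * np.log((t + 1) / alpha**2) / t)  # boundary in (5)

clt_miss = np.any(np.abs(mean - theta) > clt_half, axis=1).mean()
cs_miss = np.any(np.abs(mean - theta) > cs_half, axis=1).mean()
print(f"P(CLT interval ever misses theta):  ~{clt_miss:.2f}  (far above alpha)")
print(f"P(conf. sequence ever misses theta): ~{cs_miss:.3f}  (at most alpha)")
```

The simulation illustrates exactly the gap described above: pointwise coverage at each fixed $t$ does not imply coverage uniformly over time, while the confidence sequence pays only an extra $\approx \sqrt{\log t}$ factor in width for the uniform guarantee.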
While Lemma 1 did not motivate its original definition in 1967, Darling and Robbins [3] first defined a 'confidence sequence' for a parameter $\theta$ as an infinite sequence of confidence sets $(C_t)_{t \in \mathbb{N}}$ such that

$$\Pr(\exists t \in \mathbb{N} : \theta \notin C_t) \le \alpha.$$

In other words, the aforementioned confidence sets satisfy property (i) for $A_t := \{\theta \notin C_t\}$.

Since its inception 75 years ago, the field of sequential analysis has devoted much effort to constructing anytime-valid p-values for testing and confidence sequences for estimation. Underlying the construction of these objects in a variety of works, one often finds the repeated use of Ville's inequality for nonnegative supermartingales (NSMs). We will show that this is not a coincidence: we prove that NSMs underlie all admissible constructions for performing anytime-valid sequential inference.

To make these claims more formal, and especially to handle composite hypothesis testing, we need to clarify further what the probability $\Pr$ means in the above definitions, and we do so next. Let $\mathbb{N}$ represent the natural numbers and $\overline{\mathbb{N}} = \mathbb{N} \cup \{\infty\}$. We use $(B_t)$ to denote a sequence $(B_t)_{t \in \mathbb{N}}$ or $(B_t)_{t \in \overline{\mathbb{N}}}$, where the indexing of $t$ is either implicitly understood from the context or unimportant, but we use $B_t$ without the brackets to denote a particular element from the sequence. Thus, for example, $\mathcal{F}_t$ will denote a sigma-field at time $t$ but $(\mathcal{F}_t)$ denotes a filtration, which is an increasing sequence of sigma-fields. Unless otherwise mentioned, $\mathcal{F}_0 = \sigma(U)$, where $U$ is a $[0,1]$-uniform random variable that is independent of everything else, signifying an external source of randomness, and $\mathcal{F}_t := \sigma(U, X_1, \ldots, X_t)$ will denote the canonical filtration, where $X_t$ is the data observed at time $t$.
We allow $X_t$ to take values in some general space, which we do not need to specify here, e.g., in $\mathbb{R}^d$ equipped with the Borel sigma algebra.

Earlier, we used $\Pr$ to represent the probability taken over all sources of randomness, but in what follows we will use a more explicit notation: we denote the distribution of an infinite sequence of observations by $P$; this means that $P$ is a probability measure on $\mathcal{F}_\infty := \sigma(U, (X_t)_{t \in \mathbb{N}}) = \sigma(\bigcup_{t \in \mathbb{N}} \mathcal{F}_t)$. Expectations with respect to $P$ are denoted $\mathbb{E}_P$. A set consisting of distributions over sequences will be denoted $\mathcal{P}$; so $\mathcal{P} = \{P\}$ is the singleton case, but more generally there may be uncountably many $P \in \mathcal{P}$. In the case of testing, we denote the null set of distributions by $\mathcal{Q} \subset \mathcal{P}$.

Next, $\tau$ will always denote a stopping time, while $t$ denotes a fixed time. A subscript $t$ for $p_t$, $e_t$, $\psi_t$, and $C_t$ means that these objects were constructed using only the data available up to time $t$. In other words, $p_t$, $e_t$, $\psi_t$, and $C_t$ are $\mathcal{F}_t$-measurable, or the sequences $(p_t)$, $(e_t)$, $(\psi_t)$, and $(C_t)$ are adapted to $(\mathcal{F}_t)$. It is also understood that a p-value $p_t$ has range $[0,1]$, an e-value $e_t$ has range $[0,\infty]$, and a sequential test $\psi_t$ has range $\{0,1\}$; the range of the confidence set $C_t$ will be formally specified later.

If $P$ is a probability measure on $\mathcal{F}_\infty$ and $\tau$ is a stopping time, we write $P_\tau$ for the restriction of $P$ to $\mathcal{F}_\tau$. This is simply the probability measure on $\mathcal{F}_\tau$ defined by $P_\tau(A) = P(A)$, for $A \in \mathcal{F}_\tau$. (Think of this as the 'coarsening' of $P$ that only operates on events observable up to time $\tau$.)

We sometimes, but not always, assume that $\mathcal{P}$ is 'locally dominated' by (i.e., locally absolutely continuous with respect to) a fixed reference measure $R$; we review the meaning of this in Appendix A.1. For example, if each observation $X_s$ has a Lebesgue density under all $P \in \mathcal{P}$, one can choose the reference measure to be the distribution of an i.i.d. sequence of standard Gaussians.
Existence of a reference measure is needed to unambiguously interpret conditional expectations like $\mathbb{E}_P[Y \mid \mathcal{F}_t]$ under measures different from $P$, since a priori such conditional expectations are only defined up to $P$-nullsets. For completeness, we elaborate on this in Appendix A.1, but this issue will not actually be visible in the proofs of our results.

One of this paper's central contributions is to characterize admissibility in composite settings. With that motivation, we present the following extension of Lemma 1 using the notation introduced above.
Lemma 2 (Composite equivalence lemma). Let $\mathcal{Q}$ be a family of probability measures. Let $(A_t)_{t \in \mathbb{N}}$ be an adapted sequence of events in some filtered probability space and let $A_\infty := \limsup_{t \to \infty} A_t := \bigcap_{t \in \mathbb{N}} \bigcup_{s \ge t} A_s$. The following statements are equivalent:

(i) $\sup_{Q \in \mathcal{Q}} Q(\bigcup_{t \in \mathbb{N}} A_t) \le \alpha$.

(ii) $\sup_T \sup_{Q \in \mathcal{Q}} Q(A_T) \le \alpha$, where $T$ ranges over all random times, possibly infinite.

(iii) $\sup_\tau \sup_{Q \in \mathcal{Q}} Q(A_\tau) \le \alpha$, where $\tau$ ranges over all stopping times, possibly infinite.

Further, if equality holds for any one, then it holds for the other two.

The proof is exactly the same as in Lemma 1 and is thus omitted. For contrast, we now state a version with expectations instead of probabilities in which the corresponding statements are not equivalent. Such a non-equivalence points to forthcoming differences between p-values and e-values. The following result resembles Lemma 1, but a composite version resembling Lemma 2 can also be easily stated.

Lemma 3 (Non-equivalence lemma). Let $(N_t)_{t \in \mathbb{N}}$ be an adapted sequence of nonnegative integrable random variables in a filtered probability space; let $N_\infty := \limsup_{t \to \infty} N_t$. Consider the following statements:

(i) $\mathbb{E}[\sup_{t \in \mathbb{N}} N_t] \le 1$.

(ii) $\mathbb{E}[N_T] \le 1$ for all random times $T$, possibly infinite and not necessarily stopping times.

(iii) $\mathbb{E}[N_\tau] \le 1$ for all stopping times $\tau$, possibly infinite.

(iv) $\mathbb{E}[g(1) \vee \sup_{t \in \mathbb{N}} g(N_t)] \le 1$ for any nondecreasing function $g$ such that $\int_1^\infty g(y)/y^2 \, \mathrm{d}y = 1$; in particular, $\mathbb{E}[1 \vee \sup_{t \in \mathbb{N}} \sqrt{N_t}] \le 2$.

Then (i) and (ii) are equivalent. Both (i) and (ii) imply (iii), which in turn implies (iv).

The proof can be found in Appendix B. Contrasting the above lemma with Lemma 1 brings out some of the differences between p-values and e-values. To dig deeper at the difference, note that one could have equivalently written Lemma 1 in terms of Bernoulli random variables $B_t := \mathbf{1}_{A_t}$, in which case the formulae above involving $Q(\cdots)$
would be replaced by (i) $\mathbb{E}_Q[\sup_{t \in \mathbb{N}} B_t]$, (ii) $\mathbb{E}_Q[B_T]$, and (iii) $\mathbb{E}_Q[B_\tau]$, respectively. Thus, for these specific nonnegative binary random variables $(B_t)$, the relevant statements are all equivalent, but more generally they are not. This difference later manifests itself in the inability to take running suprema for e-values, and overall a rather different underlying structure.

A martingale is a stochastic process adapted to an underlying filtration, whose value at any time is the conditional expectation of its value at any later time. This is however not the only possible notion of martingale; another interesting notion is obtained by replacing conditional expectations by so-called conditional suprema, leading to max-martingales. Both notions play an important role in this paper. In particular, max-martingales turn out to be particularly suitable for dealing with p-values. We briefly review the definitions and basic properties of martingales and max-martingales.

Given a filtration $(\mathcal{F}_t)$ and a measure $P$ on $\mathcal{F}_\infty$, a process $(M_t)_{t \in \mathbb{N}}$ is called a martingale (with respect to $(\mathcal{F}_t)$) if $M_t$ is $\mathcal{F}_t$-measurable, $P$-integrable, and

$$\mathbb{E}_P[M_t \mid \mathcal{F}_s] = M_s \quad \text{for any } t \text{ and } s \le t.$$

(Sub- and supermartingales are defined by relaxing the martingale property and allowing for inequality, $\ge$ respectively $\le$.) Since we had earlier mentioned that $\mathcal{F}_0$ includes an initial source of independent randomness, $M_0$ is itself allowed to be random. Naturally, we have $\mathbb{E}_P[M_t] = \mathbb{E}_P[M_0]$. Often, in this paper, $(M_t)$ will be nonnegative and the latter quantity equals one, and so when we say 'a nonnegative martingale with initial value one', we implicitly mean with initial expected value one.

Given an $\mathcal{F}_\infty$-measurable integrable random variable $Y$, the process $M_t := \mathbb{E}_P[Y \mid \mathcal{F}_t]$ is known as the Doob (or Doob-Lévy) martingale associated with $Y$.
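As a toy numerical illustration of a Doob-Lévy martingale (our own example, not from the paper): take ten fair coin flips, let $Y$ be the indicator that at least seven heads occur, and compute $M_t = \mathbb{E}[Y \mid \mathcal{F}_t]$ in closed form as a binomial tail probability; the one-step martingale identity can then be checked exactly.

```python
from math import comb

def binom_tail(n, k, p=0.5):
    """P(Binomial(n, p) >= k)."""
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Doob-Levy martingale M_t = E[Y | F_t] for Y = 1{#heads in 10 flips >= 7}:
# after t flips with h heads observed, M_t = P(Binomial(10 - t, 1/2) >= 7 - h).
def M(t, h):
    return binom_tail(10 - t, 7 - h)

# Martingale property: M(t, h) = (1/2) M(t+1, h+1) + (1/2) M(t+1, h) at every node.
for t in range(10):
    for h in range(t + 1):
        assert abs(M(t, h) - 0.5 * (M(t + 1, h + 1) + M(t + 1, h))) < 1e-12
print("martingale identity verified; M(0, 0) =", M(0, 0))
```

Here $M(0,0) = \Pr(\ge 7 \text{ heads in } 10 \text{ flips}) = 176/1024$, and the asserted identity is exactly the tower rule discussed next, specialized to one step.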
The fact that this is a martingale follows from the tower rule of the conditional expectation: $\mathbb{E}_P[M_t \mid \mathcal{F}_s] = \mathbb{E}_P[\mathbb{E}_P[Y \mid \mathcal{F}_t] \mid \mathcal{F}_s] = M_s$ if $s \le t$.

We now generalize these definitions to hold for an entire set of measures. Given a set $\mathcal{P}$ of measures on $\mathcal{F}_\infty$, a process $(M_t)_{t \in \mathbb{N}}$ is called a nonnegative $\mathcal{P}$-supermartingale ($\mathcal{P}$-NSM) (with respect to $(\mathcal{F}_t)$) if $M_t$ is nonnegative, $\mathcal{F}_t$-measurable, and

$$\mathbb{E}_P[M_t \mid \mathcal{F}_s] \le M_s \quad \text{for all } t \in \mathbb{N},\ s \le t,\ \text{and every } P \in \mathcal{P}. \tag{6}$$

If (6) holds with equality, $(M_t)$ is called a nonnegative $\mathcal{P}$-martingale ($\mathcal{P}$-NM). If $\mathcal{P} = \{P\}$ is a singleton, we write '$P$-NSM' instead of '$\{P\}$-NSM'. This notational choice is also applied to other objects. We refer to a $\mathcal{P}$-NM (or NSM) as a 'composite' NM (or NSM), while a $P$-NM is called a 'pointwise' NM (or NSM).

Doob's optional stopping theorem [6] extends the (sub-/super-)martingale property from deterministic times to stopping times. In general, only bounded stopping times are allowed in the optional stopping theorem; however, the nonnegativity of an NSM relieves us of this restriction. In particular, if $(N_t)$ is (upper bounded by) a $\mathcal{P}$-NSM starting in $N_0$, the optional stopping theorem implies that

$$\mathbb{E}_P[N_\tau] \le \mathbb{E}_P[N_0] \quad \text{for all stopping times } \tau, \text{ potentially infinite, and every } P \in \mathcal{P}.$$

In fact, if $(M_t)$ is a $\mathcal{P}$-NSM, then we additionally have $\mathbb{E}_P[M_\tau \mid \mathcal{F}_\rho] \le M_\rho$ for all stopping times $\rho$ and $\tau$ such that $\rho \le \tau$, $P$-almost surely, for each $P \in \mathcal{P}$.
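A quick simulation sanity check of optional stopping (our own toy example, not from the paper): take the NM with i.i.d. multiplicative increments $Z_s + 1/2$, where $Z_s$ is Bernoulli($1/2$), so each multiplier is $1/2$ or $3/2$ with conditional mean one, and stop the first time the process reaches $2$, capped at a bounded horizon; since the stopping time is bounded, $\mathbb{E}[M_\tau] = 1$ holds with equality.

```python
import numpy as np

rng = np.random.default_rng(3)
n_paths, cap = 20000, 50

# Multipliers Z_s + 1/2 take values 1/2 or 3/2 with probability 1/2 each,
# so M_t = prod_{s<=t}(Z_s + 1/2) is a nonnegative martingale with M_0 = 1.
mult = rng.integers(0, 2, size=(n_paths, cap)) + 0.5
paths = np.cumprod(mult, axis=1)

# tau = first time the NM reaches 2, capped at the bounded horizon `cap`.
hit = paths >= 2
first = np.where(hit.any(axis=1), hit.argmax(axis=1), cap - 1)
m_tau = paths[np.arange(n_paths), first]
print(f"E[M_tau] ~ {m_tau.mean():.3f}")  # close to 1, as optional stopping predicts
```

With an unbounded or infinite stopping time only the inequality $\mathbb{E}[M_\tau] \le 1$ is guaranteed in general; the cap here is what buys equality.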
Further, by Doob's supermartingale convergence theorem, we know that if $(M_t)$ is a $\mathcal{P}$-NSM with initial expected value one, then its limit $M_\infty := \lim_{t \to \infty} M_t$ exists $P$-almost surely and $\mathbb{E}_P[M_\infty] \in [0,1]$, for each $P \in \mathcal{P}$.

Stemming from his 1939 PhD thesis [21], Ville's inequality is a time-uniform generalization of Markov's inequality; for our purposes, the relevant version states that if $(M_t)$ is (upper bounded by) a $\mathcal{P}$-NSM with initial expected value one, then the following three equivalent statements hold:

$$P\left(\exists t \in \mathbb{N} : M_t \ge \frac{1}{\alpha}\right) \le \alpha \quad \text{for every } P \in \mathcal{P} \text{ and } \alpha \in [0,1] \tag{7a}$$
$$\iff \sup_{P \in \mathcal{P}} P\left(\sup_{t \in \mathbb{N}} M_t \ge \frac{1}{\alpha}\right) \le \alpha \quad \text{for every } \alpha \in [0,1] \tag{7b}$$
$$\iff \sup_{P \in \mathcal{P},\, \tau \ge 0} P\left(M_\tau \ge \frac{1}{\alpha}\right) \le \alpha \quad \text{for every } \alpha \in [0,1]. \tag{7c}$$

Note that (7b) and (7c) usually only hold with inequality (for example, for the singleton $\mathcal{P} = \{P\}$), but they can hold with equality for larger nontrivial nonparametric classes $\mathcal{P}$, as we shall encounter later in Section 8.2. We also remark that a conditional version of Ville's inequality is also true, though we do not utilize it much in this paper. Specifically, if $(M_t)$ is a $\mathcal{P}$-NSM, then

$$\sup_{P \in \mathcal{P}} P\left(\exists t \ge s : M_t \ge \frac{M_s}{\alpha} \,\Big|\, \mathcal{F}_s\right) \le \alpha \quad \text{for every } \alpha \in [0,1]. \tag{8}$$

The relationship between likelihood ratios and nonnegative martingales.
The simplest NM (beyond the constant process $M_t = 1$) that arises rather naturally is the likelihood ratio; indeed this is at the heart of Wald's sequential likelihood ratio test [25, 26]. To be specific, when testing $H_0 : X_s \sim Q$ versus $H_1 : X_s \sim P$, define the likelihood ratio $M_t := \prod_{s \le t} \frac{\mathrm{d}P}{\mathrm{d}Q}(X_s)$, assuming that the Radon-Nikodym derivative $\mathrm{d}P/\mathrm{d}Q$ exists. If $P, Q$ have densities $p, q$ with respect to a common measure, then each term in the product is just $p(X_s)/q(X_s)$. Let $Q^\infty$ now denote the distribution under which the sequence is i.i.d., each element distributed according to $Q$. Wald effectively proved that $(M_t)$ is a $Q^\infty$-NM and that a test which rejects if $M_t \ge 1/\alpha$ controls the type-I error at level $\alpha$ due to Ville's inequality (Wald proved the result from scratch, but the language of martingales and Ville's thesis was known to Wald [2]). It is also apparent that every nonnegative martingale is a product of nonnegative random variables with conditional mean one, meaning that if $(M_t)$ is a $\mathcal{Q}$-NM, then $M_t = \prod_{s \le t} Y_s$, where $(Y_t)$ is adapted to $(\mathcal{F}_t)$ and $\mathbb{E}_Q[Y_t \mid \mathcal{F}_{t-1}] = 1$ for every $Q \in \mathcal{Q}$; to see this, simply define the multiplicative increment as $Y_t := M_t/M_{t-1}$, with the convention $0/0 := 1$. At first sight, despite having a product form, it may appear like nonnegative martingales are strict generalizations of likelihood ratio processes. However, in fact, a converse statement is also true: not only is every likelihood ratio a martingale (under the null), but every martingale is also implicitly a likelihood ratio; this was discussed by Shafer et al. [20] for point nulls, and we generalize it below to the composite case, borrowing the terminology of 'implied alternative' from Shafer [19].

To make the following result precise we assume that the sequence of observations $(U, (X_t)_{t \in \mathbb{N}})$ is a process on the space $\Omega = \mathbb{R}^{\mathbb{N}}$ of real-valued sequences, and we let $(\mathcal{F}_t)$ be the canonical filtration.

Proposition 4.
Consider any composite null set $\mathcal{Q}$ of probability measures on $\mathcal{F}_\infty$, and let $\mathcal{P}$ consist of all probability measures $P$ that are locally absolutely continuous with respect to some $Q \in \mathcal{Q}$. (Thus in particular, $\mathcal{Q} \subset \mathcal{P}$.) If $(M_t)$ is a $\mathcal{Q}$-NM starting at one, then for every $Q \in \mathcal{Q}$ there exists some 'implied alternative' distribution $P \in \mathcal{P}$ (depending on $Q$) that is locally dominated by $Q$, such that $M_t = \mathrm{d}P_t/\mathrm{d}Q_t$. In other words, a composite nonnegative martingale is a 'composite' likelihood ratio (meaning, it takes the form of a likelihood ratio under every element of the null).

We recognize that the above statement may be known to different researchers in some form, but it does provide useful intuition and we have not seen it stated in the above generality in the statistics literature. The proof is in Appendix B. The informal takeaway message is that nonnegative martingales are implicitly likelihood ratios, but the former are typically easier to identify (or construct) in composite null settings.
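To see the likelihood-ratio NM and the Wald/Ville test in action, here is a minimal simulation sketch with illustrative choices that are ours, not the paper's: point null $Q = N(0,1)$ and implied alternative $P = N(\mu, 1)$ with $\mu = 1/2$, for which $\log \frac{\mathrm{d}P}{\mathrm{d}Q}(x) = \mu x - \mu^2/2$.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, mu, horizon, n_paths = 0.05, 0.5, 500, 4000

# Null Q: X_s i.i.d. N(0,1); implied alternative P: X_s i.i.d. N(mu,1).
# log(dP/dQ)(x) = mu*x - mu^2/2, so log M_t is a cumulative sum of such terms.
x = rng.standard_normal((n_paths, horizon))          # data drawn under the null Q
log_m = np.cumsum(mu * x - mu**2 / 2, axis=1)

# Wald/Ville test: reject if M_t ever reaches 1/alpha. Under Q this has
# probability at most alpha, uniformly over the (possibly infinite) horizon.
rejected = np.any(log_m >= np.log(1 / alpha), axis=1)
print(f"type-I error ~ {rejected.mean():.3f} (guaranteed <= {alpha})")
```

Working on the log scale avoids numerical underflow of the product $\prod_{s \le t} p(X_s)/q(X_s)$ over long horizons; the empirical crossing rate sits below $\alpha$, as (7a) demands.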
Max-martingales are defined by replacing the conditional expectation by the conditional supremum, so we start by reviewing this notion; more information can be found in Barron et al. [1] and Larsson [15]; see also Appendix A.2. For a given probability measure $P$, random variable $Y$, and sub-$\sigma$-algebra $\mathcal{G}$, the $\mathcal{G}$-conditional supremum is defined as the smallest $\mathcal{G}$-measurable almost sure upper bound on $Y$:

$$\mathrm{W}_P[Y \mid \mathcal{G}] := \operatorname{ess\,inf}\{Z : Z \text{ is } \mathcal{G}\text{-measurable and } Z \ge Y,\ P\text{-almost surely}\}.$$

Note that $\mathrm{W}_P[Y \mid \mathcal{G}] \ge Y$ by construction, and it is the smallest $\mathcal{G}$-measurable random variable with that property. (Here and in the rest of this subsection, equalities are understood in the $P$-almost sure sense.) The conditional supremum can be viewed as a nonlinear analog of the conditional expectation, and has similar properties. In particular, one has a 'tower property' which states that for nested sub-$\sigma$-algebras $\mathcal{G} \subset \mathcal{H}$, one has

$$\mathrm{W}_P\big[\mathrm{W}_P[Y \mid \mathcal{H}] \,\big|\, \mathcal{G}\big] = \mathrm{W}_P[Y \mid \mathcal{G}].$$

Given a filtration $(\mathcal{F}_t)$, a process $(Y_t)$ is called a $P$-max-martingale, or $P$-MM for short, if

$$Y_s = \mathrm{W}_P[Y_t \mid \mathcal{F}_s], \quad s \le t.$$

Any max-martingale $(Y_t)$ is almost surely decreasing, which ensures that the limit $\lim_{t \to \infty} Y_t$ exists in $[-\infty, \infty]$. We call a max-martingale closed if $Y_t = \mathrm{W}_P[\lim_{s \to \infty} Y_s \mid \mathcal{F}_t]$ for all $t \in \mathbb{N}$. As earlier, given a set $\mathcal{P}$ of measures on $\mathcal{F}_\infty$, a process $(Y_t)$ is called a (closed) $\mathcal{P}$-max-martingale ($\mathcal{P}$-MM) if $(Y_t)$ is a (closed) $P$-MM for each $P \in \mathcal{P}$.

One can also introduce notions of sub- and supermartingales using the conditional supremum, although we will not use these here. Moreover, one can analogously define conditional infimum martingales using $\mathrm{V}$ instead of $\mathrm{W}$, and all properties stated above and below also hold analogously.

In further analogy with (standard) martingales, max-martingales satisfy an optional stopping theorem [15, Lemma 2.10]. Specifically, consider the process $Y_t := \mathrm{W}_P[Y \mid \mathcal{F}_t]$ for an $\mathcal{F}_\infty$-measurable $Y$.
Thanks to the tower property, $(Y_t)$ is then a max-martingale; we call such a construction a Doob-Lévy MM. We then have $Y_\tau = \mathrm{W}_P[Y \mid \mathcal{F}_\tau]$ and $Y_\rho = \mathrm{W}_P[Y_\tau \mid \mathcal{F}_\rho]$ for all finite stopping times $\rho$ and $\tau$ such that $\rho \le \tau$.

The connection between $P$-MMs and $P$-NMs goes beyond mere analogies. If $(M_t)$ is a $P$-NM with $M_0 > 0$, the following statement easily follows from [15, Proposition 4.1]:

$$\inf_{s \le t} M_s = \mathrm{W}_P\Big[\inf_{s \in \mathbb{N}} M_s \,\Big|\, \mathcal{F}_t\Big], \quad t \in \mathbb{N}. \tag{9}$$

A word of warning: although the conditional supremum and conditional expectation are in some ways similar, they sometimes behave very differently. In particular, the conditional supremum only depends on the underlying measure $P$ through its zero measure sets. Computing the conditional supremum under a different measure $P'$ thus gives the same result whenever the two measures are mutually absolutely continuous. This is in stark contrast to the behavior of the conditional expectation. (On the other hand, if $P$ and $P'$ are mutually singular, as is often the case in our infinite-horizon situations, then one can make no general statements about the relation between the corresponding conditional suprema.)

Although any nonnegative max-martingale $(Y_t)$ is almost surely decreasing, which ensures that the limit $\lim_{t \to \infty} Y_t$ exists, it is possible that $Y_t$ and $\mathrm{W}_P[\lim_{s \to \infty} Y_s \mid \mathcal{F}_t]$ are not the same, i.e., that $(Y_t)$ is not closed. Indeed, we have the following non-uniqueness property. If $(Y_t)$ is a max-martingale then we can usually find another max-martingale $(Y'_t)$ with $Y'_t > Y_t$ for each $t \in \mathbb{N}$, but $(Y'_t)$ and $(Y_t)$ have the same limit, almost surely. For example, assume $(Z_t)$ is i.i.d. Bernoulli($1/2$), independent of $(Y_t)$, and adapted to $(\mathcal{F}_t)$. Define $Y'_t = Y_t + \prod_{s \le t} (Z_s \vee 1/2)$.
Then ( Y ′ t ) is also a max-martingale, with Y ′ t ≥ Y t + 2 − t > Y t for all t ∈ N , but converging almost surely to the same limit as Y ′ t as t → ∞ .A similar phenomenon also holds true for martingales: if ( M t ) is an NM then M ′ t = M t + Q s ≤ t ( Z s +1 / is another NM with M ′ t ≥ M t + (1 / t > M t for all t ∈ N , but lim t →∞ M ′ t = lim t →∞ M t . Hence also fornonnegative martingales, we often do not have that M t and E [lim s →∞ M s |F t ] are the same.We conclude this subsection with an interesting example. Example 5.
Let V denote a uniformly distributed random variable and assume that (X_t) is, conditionally on V, i.i.d. Bernoulli(V). Then V = lim_{t→∞} Σ_{s≤t} X_s / t (the limit frequency of ones), hence V is F_∞-measurable. Moreover, Y_t := W_P[V | F_t] yields a max-martingale. It is now relatively easy to check that Y_t = 1 for each t ∈ N. This yields an instance where lim_{t→∞} Y_t = 1 ≠ V, but (Y_t) is nonetheless a closed max-martingale since Y_t = 1 = W_P[1 | F_t].

We formally introduce the four instruments for anytime-valid sequential inference that play central roles in our paper. We present a rather natural definition of admissibility for each of the instruments below, but recognize that other alternatives may be suitable depending on the context.

Q-valid p-values, Q-safe e-values, and (Q, α)-sequential tests

Let the (unknown) distribution of the observed data sequence be denoted P. Suppose we wish to test the null H_0 : P ∈ Q for some Q ⊂ P, against H_1 : P ∈ P (or against H_1 : P ∈ P \ Q). A sequence (p_t) is called an anytime-valid p-value for H_0 if it satisfies

Q(p_τ ≤ α) ≤ α for arbitrary stopping times τ, every Q ∈ Q, and α ∈ [0, 1].

Above, we have implicitly defined and used p_∞ := lim inf_{t→∞} p_t. For succinctness, we say (p_t) is Q-valid. By Lemma 2, the above condition is equivalent to requiring that

Q(∃ t ∈ N : p_t ≤ α) ≤ α for every Q ∈ Q and α ∈ [0, 1].

Note that (p_t) is Q-valid if and only if the running infimum (inf_{s≤t} p_s) is Q-valid, so it helps to think of (p_t) as a nonincreasing sequence; in this case, we have p_∞ = lim_{t→∞} p_t (the limit exists and equals the lim inf used earlier). When we refer only to the validity of the single random variable p_∞, and not the sequence (p_t), we say p_∞ is 'statically' Q-valid, meaning that its distribution is stochastically larger than uniform under any Q ∈ Q.
The connection to Ville's inequality (7a), and thus to martingales, should be apparent.

Similarly, a sequence (e_t) is called an anytime-valid e-value for H_0—or, in short, (e_t) is Q-safe—if

E_Q[e_τ] ≤ 1 for arbitrary stopping times τ, and every Q ∈ Q.

As in the p-value case above, we have implicitly defined and used e_∞ := lim sup_{t→∞} e_t.

It is worth noting that already in the case of a singleton Q := {Q}, if one additionally desired a conditional safety property to hold for (e_t), namely that E_Q[e_τ | F_s] ≤ e_{s∧τ} for arbitrary times s and stopping times τ, then (e_t) must necessarily be a Q-NSM; indeed, by taking τ = t for t ≥ s, we recover the definition of a Q-NSM. In fact, if one would like such a property to hold for every Q in a composite Q, such a requirement can only be satisfied by a Q-NSM. Despite the fact that we will not require this conditional property, we will see that composite Q-NMs, or pointwise Q-NMs for each Q ∈ Q, play a central, universal role in constructing e-values that may not themselves be martingales.

A p-value or e-value does not directly yield a decision rule for when to reject the null hypothesis; they are real-valued measures of evidence, and need to be coupled with a decision rule in order to yield a test. A binary sequence (ψ_t) is called a (Q, α)-sequential test for H_0—or, in short, (ψ_t) is a (Q, α)-ST—if

E_Q[ψ_τ] = Q(ψ_τ = 1) ≤ α for arbitrary stopping times τ, and every Q ∈ Q.

As with p-values, we may think of (ψ_t) as nondecreasing, and have implicitly defined and used ψ_∞ := lim_{t→∞} ψ_t.

Admissibility:
We follow the standard convention of using the term 'admissible' as a shorthand for 'not inadmissible', so we only define inadmissibility below.

We say that (p_t) is inadmissible if there exists (p′_t) that is Q-valid, and is always at least as good but sometimes strictly better; more formally, Q(p′_t ≤ p_t) = 1 for all t ∈ N and all Q ∈ Q, and Q(p′_t < p_t) > 0 for some t ∈ N and some Q ∈ Q. We say that (e_t) is inadmissible if there exists (e′_t) that is Q-safe, such that Q(e′_t ≥ e_t) = 1 for all t ∈ N and all Q ∈ Q, and also Q(e′_t > e_t) > 0 for some t ∈ N and some Q ∈ Q. Finally, (ψ_t) is inadmissible if there exists a (Q, α)-sequential test (ψ′_t), such that Q(ψ′_t ≥ ψ_t) = 1 for all t ∈ N and all Q ∈ Q, and Q(ψ′_t > ψ_t) > 0 for some t ∈ N and some Q ∈ Q.

Remark 6.
We could have equivalently formulated admissibility in terms of Q(e′_τ > e_τ) > 0 for some stopping time τ, etc.; this is in fact identical to the current definition, because for any discrete-time random sequences (W_t) and (W′_t), the statement 'Q(W′_t = W_t) = 1 for all t' is equivalent to 'Q(W′_t = W_t for all t) = 1', meaning that pointwise equality at fixed times yields simultaneous equality (including stopping times). In case it is useful to the reader, another restatement of inadmissibility is the following:

inf_{Q∈Q} Q(∀ t ∈ N : p′_t ≤ p_t) = 1 and sup_{Q∈Q} Q(∃ t ∈ N : p′_t < p_t) > 0.

The above definition of admissibility does not specify any alternative. What allows us to do so is that the admissibility conditions are stated 'almost surely'. To elaborate, assume there was an alternative, say P*, such that there existed some time t and an event A ∈ F_t with P*(A) > 0 but Q(A) = 0 for all Q ∈ Q. In this case, on the event A we would always set p_t = 0, e_t = ∞, and ψ_t = 1. We could always do so because no Q ∈ Q 'is aware of' the event A, hence modifying a p-value on A would not change any of its Q-distributional properties (like validity). For this reason, and to avoid any notational complication arising, from now on we always assume the following:

If there exists some t ∈ N, some A ∈ F_t, and some P* ∈ P with P*(A) > 0, then there exists also some Q ∈ Q with Q(A) > 0.

This is a very mild assumption! It is, for example, satisfied if there exists some Q* ∈ Q such that every P ∈ P is locally absolutely continuous with respect to Q* (see Appendix A.1 for a review). This previous condition is satisfied, for example, if all measures in P are locally equivalent. In interesting situations a locally dominating measure Q* may not actually exist—for example when testing for symmetry, as we encounter in this paper—but our standing assumption above is still true.
Specifically, the aforementioned assumption also holds if each P ∈ P is locally absolutely continuous with respect to some specific Q ∈ Q (potentially different for each P)—this is often easier to check, and indeed holds in the symmetric example.

Note that validity is subset-proof, meaning that validity for Q implies validity for Q′ ⊂ Q; however, admissibility is neither subset-proof nor superset-proof. It is possible that (p_t) may be admissible for H_0 : P ∈ Q, but not for H_0 : P ∈ Q′ for some Q′ ⊂ Q or Q′ ⊃ Q; indeed, it may not even be valid in the latter case. Thus, to avoid confusion, we should say (p_t) is Q-admissible, but we sometimes drop the additional prefix if it can be inferred from the context. (The same logic applies for (e_t) and (ψ_t).)

A family of sequential tests {(ψ_t(α))}_{α∈[0,1]} is said to be 'nested in α' if ψ_t(α′) ≥ ψ_t(α) for any t and α′ ≥ α. This ensures that it is easier to reject the null with a larger Type-I error budget.

We conclude this subsection with the following observation.

Proposition 7.
Assume that Q is locally dominated. Then any p-value for Q, e-value for Q, or (Q, α)-sequential test can be dominated by an admissible p-value, e-value, or sequential test, respectively.

It is proven in Appendix B using transfinite induction. Next, we switch from testing to estimation.

(φ, P, α)-confidence sequences

Let φ be a map from P to an arbitrary set Z. For every γ ∈ Z, define P_γ := {P ∈ P : φ(P) = γ}, and note that {P_γ}_{γ∈Z} is a partition of P. Of special interest is the i.i.d. case where P = µ^∞. As examples, consider φ(P) = φ_med(µ) denoting the median of µ, or φ(P) = φ_mean(µ) denoting the mean of µ. Another case of interest is where each P ∈ P can be represented as P_θ for some unique θ ∈ Θ (the 'parametric' setting), and then φ(P_θ) = θ. One may also care about fully nonparametric functionals, for example φ_cdf, which maps µ to its cumulative distribution function (or φ_pdf if µ has a Lebesgue density).

We then define a (1 − α)-confidence sequence for a functional φ as an adapted sequence of confidence sets (C_t) such that

sup_{P∈P} P(∃ t ∈ N : φ(P) ∉ C_t) ≤ α.

This error condition can be phrased equivalently as a coverage criterion: P(∀ t ∈ N : φ(P) ∈ C_t) ≥ 1 − α for every P ∈ P. Above, we have suppressed the dependence of C_t on α for notational succinctness. Moreover, we emphasize that the outer probability P matches the inner functional φ(P), as it should. Indeed, the coverage probability for φ(P) should hold given that the data used to construct C_t was drawn according to this P. We then say that (C_t) is a (φ, P, α)-CS. If required, we define C_∞ := lim inf_{t→∞} C_t.

Note that even though we defined (Q, α)-sequential tests without explicit reference to any functional φ, one could associate Q to the subset of P in which φ takes on certain values.
In other words, we can absorb the role of any such φ into the definition of Q to reduce notational overhead.

We say (C_t) is inadmissible if there exists (C′_t) that is a (φ, P, α)-CS such that P(C′_t ⊆ C_t) = 1 for all t and all P ∈ P, and P(C′_t ⊊ C_t) > 0 for some t and some P ∈ P.

Finally, let us mention a minor technical point. A CS (C_t) is by definition an adapted sequence of random sets. In this paper, we take 'adapted' to simply mean that all events of the form {z ∈ C_t} belong to F_t. This is sufficient for our needs, and lets us avoid dealing with σ-algebras on spaces of sets.

Last, as for sequential tests, a family of CSs {(C_t(α))}_{α∈[0,1]} is said to be 'nested in α' if C_t(α′) ⊆ C_t(α) for any t and α′ ≥ α.

Remark 8.
An analogue of Proposition 7 also holds for (φ, P, α)-confidence sequences, where P is locally dominated. This will later follow from Theorem 28, in conjunction with Proposition 7.

We now demonstrate how to transform one tool for sequential inference into another; some of these are well-known or 'obvious' but some are new, especially in the composite setting. These transformations are usually not lossless or bidirectional, meaning that something may be lost in going from one to another and back again. All proofs are relegated to Appendix B.

We start with the most straightforward direction: forming (Q, α)-sequential tests from the other three.

Proposition 9. One can construct (Q, α)-sequential tests in the following ways:

(1) If (p_t) is Q-valid, then {(1{p_t ≤ α})}_{α∈[0,1]} is a nested family of (Q, α)-STs.

(2) If (e_t) is Q-safe, then {(1{e_t ≥ 1/α})}_{α∈[0,1]} is a nested family of (Q, α)-STs.

(3) If (C_t) is a (φ, P, α)-CS, then (1{φ(Q) ∩ C_t = ∅}) is a (Q, α)-ST.

Next, we show how to form a composite anytime p-value from the other three.

Proposition 10.
One can construct p-values for Q in the following ways:

(1) If (e_t) is Q-safe, then (1 ∧ inf_{s≤t} 1/e_s) is Q-valid.

(2) If {(ψ_t(α))}_{α∈[0,1]} is a nested family of (Q, α)-STs, then (inf{α : ψ_t(α) = 1}) is Q-valid.

(3) If {(C_t(α))}_{α∈[0,1]} is a nested family of (φ, P, α)-CSs, then (inf{α : φ(Q) ∩ C_t(α) = ∅}) is Q-valid.

A confidence sequence can be formed by inverting families of tests, as follows.
Proposition 11.
Recall that P_γ := {P ∈ P : φ(P) = γ}. If (ψ_t^γ) is a (P_γ, α)-sequential test for each γ ∈ Z, then ({γ ∈ Z : ψ_t^γ = 0}) is a (φ, P, α)-confidence sequence. To convert a family of p-values or e-values into confidence sequences, we can first convert them to sequential tests using Proposition 9 and then invert those tests as done in the first part of this proposition.

Finally, we can form composite e-values by calibrating p-values, following Shafer et al. [20].

Proposition 12.
One can construct e-values for Q as follows: If (p_t) is Q-valid, then (1/(2√p_t)) is Q-safe. In fact, (f(p_t)) is Q-safe for any nonincreasing 'calibration' function f : [0, 1] → [0, ∞) such that ∫_0^1 f(u) du = 1. To convert a nested family of sequential tests or confidence sequences into e-values, we can first convert them to p-values using Proposition 10 and then apply the aforementioned calibration.

At several points in this paper, we will see that convexity of the class of distributions plays a central role, the first hint of which is provided below.
Proposition 13.
Let conv(Q) denote the convex hull of Q. Then the following statements hold.

(1) An e-value for Q is also automatically an e-value for conv(Q). Such closure under the convex hull also holds for p-values and sequential tests.

(2) An admissible e-value for Q is also automatically an admissible e-value for conv(Q). Such closure under the convex hull also holds for p-values and sequential tests.

The proof can be found in Appendix B. Example 42 in Appendix C shows that the statement of Proposition 13 does not extend to confidence sequences.
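To make the first part of Proposition 13 concrete, here is a minimal numerical sketch in a one-shot discrete setting (the sample space, the two null distributions, and the function e below are all our illustrative choices, not taken from the paper): the e-value constraint E_Q[e] ≤ 1 is linear in the underlying measure, so it is automatically preserved under mixtures.

```python
import numpy as np

# Toy check of Proposition 13(1): sample space {0, 1, 2}, two null
# distributions Q1, Q2, and a nonnegative "e-value" e satisfying
# E_{Q1}[e] <= 1 and E_{Q2}[e] <= 1.  All numbers are illustrative.
Q1 = np.array([0.5, 0.3, 0.2])
Q2 = np.array([0.2, 0.3, 0.5])
e = np.array([1.6, 0.4, 0.2])

assert Q1 @ e <= 1.0 and Q2 @ e <= 1.0

# Expectation is linear in the measure, so every element of
# conv({Q1, Q2}) also satisfies the e-value constraint.
for lam in np.linspace(0.0, 1.0, 11):
    mix = lam * Q1 + (1.0 - lam) * Q2
    assert mix @ e <= 1.0 + 1e-12
```

The same linearity argument fails for confidence sequences, which is exactly the content of Example 42 cited above.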
Remark 14.
Note that stability with respect to taking a running extremum does not hold for e-values, but does hold for the other three. To elaborate, if (p_t) is Q-valid, then so is (min_{s≤t} p_s); if (C_t) is a (φ, P, α)-CS, then so is (∩_{s≤t} C_s); if (ψ_t) is a (Q, α)-ST, then so is (max_{s≤t} ψ_s). However, a running maximum does not usually preserve safety for e-values! Moreover, if (e_t) is P-safe and (f_t) is Q-safe, then min(e_t, f_t) is conv(P ∪ Q)-safe, while (e_t + f_t)/2 is conv(P ∩ Q)-safe. Similarly, if (p_t) is P-valid and (q_t) is Q-valid, then max(p_t, q_t) is conv(P ∪ Q)-valid, while 2 min(p_t, q_t) ∧ 1 is conv(P ∩ Q)-valid. Last, if (C_t) is a (φ, P, α)-CS and (D_t) is a (φ, Q, α)-CS, then C_t ∪ D_t is a (φ, P ∪ Q, α)-CS, while C_t ∩ D_t is a (φ, P ∩ Q, 2α)-CS.

Two instructive examples of exponential (super)martingales
We now illustrate the above concepts with two examples of composite martingales: the Gaussian NM and the symmetric NSM, the former being a simple parametric example, and the latter being a sophisticated nonparametric example (we use the term sophisticated because it includes atomic and nonatomic distributions, and there is no underlying reference measure).

If (X_t) is a sequence of i.i.d. standard Gaussians, then it is well known that exp(Σ_{s≤t} X_s − t/2) is a nonnegative martingale. This is one of the simplest nontrivial pointwise NMs that one can construct. Below, we elaborate on how to construct a simple composite NM in the Gaussian case. We generalize the Gaussian case to an arbitrary fixed mean m and a variance process, where each σ_t is revealed to us on demand. Technically, we suppose the process (X_t) consists of two components, X_t = (Y_t, σ_{t+1}), and let σ_1 be a constant. Note that (σ_t) is a predictable sequence. We now let G^m denote the set of distributions such that the outcome Y_t at time t is conditionally Gaussian with deterministic mean m and variance σ_t², that is,

G^m := {P : Y_t | F_{t−1} ∼ N(m, σ_t²) for all t ∈ N}.

Since the distribution generating the predictable variances is left unspecified, G^m is a highly composite set of measures. Next, define the process (G_t^m) by

G_t^m := exp( Σ_{s≤t} (Y_s − m) − (1/2) Σ_{s≤t} σ_s² ). (10)

It is easy to check that (G_t^m) is a G^m-NM for each m ∈ R, specifically by evaluating the conditional moment generating function of Y_t. As a direct consequence of Ville's inequality, a (1 − α)-CS (C_t^mean) for an unknown mean is given by

C_t^mean := {m ∈ R : G_t^m < 1/α}.

Formally, (C_t^mean) is a (φ_mean, ∪_{m∈R} G^m, α)-CS for the mean, where we define the mean functional φ_mean(P) = m for every P ∈ G^m.

Remark 15.
The confidence sequence (C_t^mean) derived above does not yield interval (5) in the introduction. That CS is obtained by noting that (G_t^m(λ)), given by

G_t^m(λ) := exp( λ Σ_{s≤t} (Y_s − m) − (λ²/2) Σ_{s≤t} σ_s² ),

is a G^m-NM for every λ ∈ R. Fubini's theorem then implies that (∫ G_t^m(λ) dΦ(λ)) is also a G^m-NM for any distribution function Φ. Choosing Φ as a standard Gaussian, for example, yields the normal mixture martingale (dating back at least to Darling and Robbins [3]). Applying Ville's inequality then yields (5).

We now move to a nonparametric example that first appears in de la Peña [4], but the core idea can be traced back to Efron [7], who was interested in the robustness of the t-test to heavy-tailed (but symmetric) distributions; further, it has recently been extended to the matrix setting by Howard et al. [10]. This example is particularly interesting for three reasons: (a) it deals with a nonparametric class of distributions that does not have a common dominating measure (it includes atomic and non-atomic measures), (b) there exists a rather elegant well-known 'nonparametric' NSM, meaning that one single process is a composite NSM, (c) the visual form of the NSM below is reminiscent of the aforementioned Gaussian example, making it intuitively appealing. Unfortunately, we will later demonstrate that this construction is suboptimal, and leads to inadmissible sequential inference.

Consider the convex set of distributions indexed by m ∈ R, where each increment is conditionally symmetric around m, i.e.,

S^m := {P : (X_t − m) ∼ −(X_t − m) | F_{t−1}}. (11)

Consider then the following family of processes (S_t^m), indexed by m ∈ R:

S_t^m := exp( Σ_{s≤t} (X_s − m) − (1/2) Σ_{s≤t} (X_s − m)² ).
(12) It is known [4] that (S_t^m) is an S^m-NSM for each fixed m ∈ R; the proof stems from the observation that cosh(z) ≤ exp(z²/2), and thus for a single symmetric random variable Z, we have

E[e^{Z − Z²/2}] = E[e^{−Z − Z²/2}] = E[ (e^{Z − Z²/2} + e^{−Z − Z²/2}) / 2 ] = E[e^{−Z²/2} cosh(Z)] ≤ 1.

Note that the symmetric distribution could be different at each time point (e.g., Gaussian with mean m at time one, (δ_{m−|X_1|} + δ_{m+|X_1|})/2 at time two, etc., where δ_z denotes the Dirac measure at some z ∈ R). The process (S_t^m) is visually quite similar to the Gaussian process (G_t^m) from the previous subsection. The relaxation from using the true variance in (G_t^m) to using an empirical variance in (S_t^m) allows the NM property to transform to an NSM property for a much larger class of heavy-tailed distributions (such as t and Cauchy distributions). Even when applied to Gaussians, one no longer needs to know the variance.

One can check that the sets

C_t^center := {m ∈ R : S_t^m < 1/α}

together form a (φ_center, ∪_{m∈R} S^m, α)-CS for the functional φ_center, which maps symmetric distributions to their center of symmetry. Said differently, if P ∈ S^m then P(∃ t ∈ N : m ∉ C_t^center) ≤ α.

Above, we have described only the confidence sequences, but one could equally well have defined e-values and anytime p-values. For example, to test the null H_0 : P ∈ S^0, one can use the fact that (S_t^0) is an NSM under the null, and thus (S_t^0) is an e-value for S^0, and (1 ∧ inf_{s≤t} 1/S_s^0) is a p-value for S^0. Of course, all of these in turn define (S^0, α)-sequential tests.

We return to these and other examples later in the paper.

We begin our examination of admissibility via the lens of testing (p-values, e-values, and sequential tests) and leave the results on estimation (confidence sequences) for Subsection 7.4.

Theorem 16 (Necessary and sufficient conditions for pointwise admissibility). Consider a point null Q = {Q}.
The following statements describe necessary and sufficient conditions for pointwise admissibility.

(1) If (p_t) is admissible, then it is a closed Q-MM with F(inf_{t∈N} p_t) = inf_{t∈N} p_t, where F is the distribution function of inf_{t∈N} p_t. In the other direction, if (p_t) is a closed Q-MM and inf_{t∈N} p_t is Q-uniformly distributed, then it is admissible.

(2) (e_t) is admissible if and only if it is a Q-NM with E_Q[e_1] = 1.

(3) (ψ_t) is an admissible (Q, α)-sequential test if and only if ψ_t := 1{sup_{s≤t} M_s ≥ 1/α}, where (M_t) is a Q-NM with no overshoot at 1/α and M_∞ ∈ {0, 1/α}.

The theorem is proven later in this section, with one subsection per statement.

Remark 17.
Similar to max-martingales, introduced in Subsection 3.2, we could have introduced min-martingales by replacing conditional suprema by infima. With such a notion in place, it would be easy to see that (ψ_t) is an admissible (Q, α)-sequential test if and only if (ψ_t) is a closed min-martingale such that sup_{t∈N} ψ_t is {0, 1}-valued and Q(sup_{t∈N} ψ_t = 1) = α.

Unfortunately, a version of the first statement in Theorem 16 that is phrased as follows—'if (p_t) is admissible then p_t = inf_{s≤t} 1/M_s for all t ∈ N, where (M_t) is a Q-NM'—is incorrect. Below is a simple counterexample.

Example 18.
Let U be a uniformly distributed random variable and set p_t = U for all t ∈ N, so that p_1 = p_2 = p_3 = .... Then (p_t) is admissible since it cannot be improved without violating uniformity. However, since the inverse of a uniform random variable is not integrable, we cannot find an NM (M_t) that yields (p_t); indeed, no nonnegative integrable random variable N can yield p_1 = 1/N.

The above example further demonstrates that max-martingales, not the 'usual' martingales, are the right mathematical construct to deal with p-values. We remark that if we are allowed to construct continuous-time processes, then one can work with usual martingales; see Shafer et al. [20, Theorem 2]. However, we do obtain the following corollary of Theorem 16(1).

Corollary 19.
Let (M_t) denote a Q-martingale with M_1 > 0 and let F denote the distribution function of inf_{t∈N} 1/M_t. If F is atomless, then p_t := F(inf_{s≤t} 1/M_s) is an admissible p-value.

Proof. We will make use of the fact that the conditional supremum commutes with continuous nondecreasing functions: W[f(Y) | G] = f(W[Y | G]) for every continuous nondecreasing function f; we will use this with f = F. Combining this with the max-martingale property of the reciprocal of the running supremum of an NM (see (9)), we get

p_t = F(inf_{s≤t} 1/M_s) = F( W_Q[ inf_{s∈N} 1/M_s | F_t ] ) = W_Q[ F(inf_{s∈N} 1/M_s) | F_t ] = W_Q[ inf_{s∈N} p_s | F_t ].

The last equality follows because inf_{s∈N} p_s = F(inf_{s∈N} 1/M_s), which is uniformly distributed since F is atomless. We are now in a position to apply Theorem 16(1) to conclude that (p_t) is admissible.

Let us now provide an example of a closed Q-MM that satisfies F(inf_{t∈N} p_t) = inf_{t∈N} p_t in the notation of Theorem 16(1), but is not admissible.

Example 20 (The gap between sufficient and admissible conditions in Theorem 16(1)). This example shows that simply using F(inf_{s≤t} 1/M_s) for a Q-NM (M_t) and F as in Corollary 19 does not typically yield an admissible p-value. It also provides an example of the gap between the sufficient and admissible conditions in Theorem 16(1). Let U be uniformly distributed and define the martingale (M_t) by M_t := 1 + 1{U ≤ 1/2} for all t ∈ N. Then F(inf_{s≤t} 1/M_s) = inf_{s≤t} 1/M_s is an inadmissible p-value. (It is, however, a p-value despite E_Q[M_1] > 1.) To see this, define a p-value (p_t) by p_t = U for all t ∈ N. Then (p_t) strictly dominates (inf_{s≤t} 1/M_s) because U ≤ (1/2)·1{U ≤ 1/2} + 1{U > 1/2} = inf_{s≤t} 1/M_s.

7.1 Necessary and sufficient conditions for p-values (proof)

Proof of Theorem 16(1).
We start with the necessary conditions for pointwise admissibility of a p-value. To this end, let (p_t) be a Q-admissible p-value, which must necessarily be nonincreasing by Remark 14. First define

p̄ := inf_{t∈N} p_t = lim_{t→∞} p_t,

and let F be the distribution function of p̄. Since (p_t) is valid, p̄ is stochastically larger than uniform, and so F(x) ≤ x for all x ∈ [0, 1]. For later use, let us observe that F is right-continuous, hence

lim_{t→∞} F(p_t) = F(p̄). (13)

We now define

p′_t := W_Q[F(p̄) | F_t].

Since F is a nondecreasing function, we have F(p̄) ≤ F(p_t) for all t ∈ N. By definition of the conditional supremum, p′_t is the smallest F_t-measurable random variable with this property; therefore, p′_t ≤ F(p_t) for all t ∈ N. Since we also have F(x) ≤ x for all x ∈ [0, 1], we get p′_t ≤ p_t for all t ∈ N. But (p_t) is admissible by assumption, so we must in fact have the equality p′_t = p_t for all t ∈ N.

We have now argued F(p̄) ≤ p′_t = p_t ≤ F(p_t). Taking now limits in t and recalling (13), we get F(p̄) ≤ p̄ ≤ lim_{t→∞} F(p_t) = F(p̄), thus allowing us to conclude that F(p̄) = p̄ and that (p_t) is closed. This yields the necessary conditions of the theorem.

Let us now discuss the sufficient conditions for pointwise admissibility of a p-value. To this end, let (p_t) be a closed MM such that inf_{t∈N} p_t is uniformly distributed. It is then clear that (p_t) is valid. Consider now an arbitrary p-value (p′_t) with p′_t ≤ p_t for all t ∈ N. We must argue that we have equality. We clearly have inf_{t∈N} p′_t ≤ inf_{t∈N} p_t. The validity of (p′_t) implies moreover that inf_{t∈N} p′_t stochastically dominates a uniform. This now directly yields that we indeed have inf_{t∈N} p′_t = inf_{t∈N} p_t =: p̄, which is uniform. By definition of max-martingales, p_t is the smallest F_t-measurable upper bound on p̄; since p′_t is another such bound, we must have p′_t ≥ p_t for all t ∈ N.
This proves that (p_t) is admissible.

7.2 Necessary and sufficient conditions for e-values (proof)

Proof of Theorem 16(2). Again, let us start with the necessary conditions for pointwise admissibility of an e-value. To this end, let us fix an admissible (e_t). Next, let us define the 'Snell envelope' of (e_t) as the process (L_t) given by

L_t := ess sup_{τ≥t} E_Q[e_τ | F_t],

where τ ranges over all finite stopping times. First, observe that E_Q[L_1] ≤ 1 because (e_t) is Q-safe, so that E_Q[e_τ] ≤ 1 for every finite stopping time τ. It is self-evident that (L_t) inherits the nonnegativity property directly from (e_t). Moreover, it is clear that L_t ≥ e_t since τ = t is a valid stopping time, for all t ∈ N. Next, we claim that (L_t) is a supermartingale. This is a well-known result, but for the convenience of the reader we include the short proof. It uses properties of the essential supremum reviewed in Appendix A.2, in particular Proposition 39.

For each fixed t ∈ N, L_t is the essential supremum of the family consisting of all E_Q[e_τ | F_t] where τ ≥ t is a finite stopping time. This family is closed under maxima. To see this, let τ and τ′ be given, define A = {E_Q[e_τ | F_t] > E_Q[e_τ′ | F_t]}, and set τ″ = τ 1_A + τ′ 1_{A^c}. Since A lies in F_t and τ, τ′ ≥ t, we have that τ″ is a stopping time, and we obtain

E_Q[e_τ″ | F_t] = 1_A E_Q[e_τ | F_t] + 1_{A^c} E_Q[e_τ′ | F_t] = max{ E_Q[e_τ | F_t], E_Q[e_τ′ | F_t] }.

This demonstrates closure under maxima. Consequently, we can apply Proposition 39 to obtain finite stopping times {τ_n}_{n∈N} with τ_n ≥ t such that E_Q[e_{τ_n} | F_t] ↑ L_t almost surely. Therefore, by the conditional version of the monotone convergence theorem, the tower rule, and the definition of L_{t−1}, we get

E_Q[L_t | F_{t−1}] = E_Q[ lim_{n→∞} E_Q[e_{τ_n} | F_t] | F_{t−1} ] = lim_{n→∞} E_Q[ E_Q[e_{τ_n} | F_t] | F_{t−1} ] = lim_{n→∞} E_Q[e_{τ_n} | F_{t−1}] ≤ L_{t−1}.
This shows that (L_t) is a supermartingale.

Since we have established that the Snell envelope (L_t) is a supermartingale, we can write down its Doob decomposition as L_t = M_t − A_t for a unique (nonnegative) integrable martingale (M_t) with M_1 = L_1, and a unique nondecreasing predictable process (A_t) with A_1 = 0. The optional stopping theorem applied to the martingale (M_t) implies that it is an e-value, and moreover, M_t ≥ L_t ≥ e_t for all t ∈ N. Since (e_t) was assumed admissible, we get e_t = M_t for all t ∈ N. Finally, we can assume that E_Q[M_1] = 1, else we can replace (M_t) by (M_t + 1 − E_Q[M_1]), which is again an e-value.

Let us now discuss the sufficient conditions for pointwise admissibility of an NM (e_t) with E_Q[e_1] = 1. First of all, the optional stopping theorem yields that (e_t) is an e-value. Consider now some e-value (e′_t) for Q with e′_t ≥ e_t. Since E_Q[e_t] ≤ E_Q[e′_t] ≤ 1, we then have e_t = e′_t for each t ∈ N, yielding the admissibility of (e_t), hence the assertion.

7.3 Necessary and sufficient conditions for sequential tests (proof)

Proof of Theorem 16(3). Let us start with the necessary conditions for pointwise admissibility of a sequential test. Recall that by assumption, (ψ_t) satisfies

ᾱ := Q(∃ t ∈ N : ψ_t = 1) ≤ α.

Define now

ψ′_t := 1{U ≤ α − ᾱ} + ψ_t 1{U > α − ᾱ}, t ∈ N,

where U is an independent uniform random variable. Note that ψ′_t ≥ ψ_t for all t ∈ N and (ψ′_t) is again a sequential test. If ᾱ < α then indeed (ψ′_t) strictly dominates (ψ_t), in contradiction to the admissibility of (ψ_t). Hence we may assume that ᾱ = α.

Define now the Doob-Lévy martingale (M_t) by

M_t := Q(∃ s ∈ N : ψ_s = 1 | F_t) / Q(∃ s ∈ N : ψ_s = 1) = Q(∃ s ∈ N : ψ_s = 1 | F_t) / α.

Note that E_Q[M_1] = 1, and if there exists a time τ at which ψ_τ = 1, then M_t = 1/α for any t ≥ τ. So, M_∞ ∈ {0, 1/α}. Define next ψ̃_t := 1{M_t ≥ 1/α}. By Ville's inequality, (ψ̃_t) is a sequential test that dominates (ψ_t).
Since the latter was assumed admissible, we have established ψ_t = ψ̃_t for all t ∈ N.

Consider now an NM (M_t) with no overshoot at 1/α and M_∞ ∈ {0, 1/α}, and define ψ_t := 1{sup_{s≤t} M_s ≥ 1/α}. By Ville's inequality, (ψ_t) is a sequential test. Consider next some sequential test (ψ′_t) with ψ′_t ≥ ψ_t and fix some t* ∈ N. Since E_Q[ψ_∞] = α, we know that ψ′_{t*} = 1 only on the event {ψ_∞ = 1} = {τ < ∞}, where τ := inf{t ∈ N : M_t ≥ 1/α}. Hence, ψ′_{t*} = 1 implies that M_{t*} ≥ 1/α; otherwise the martingale property of (M_t) would be contradicted. Thus (ψ_t) is indeed Q-admissible, concluding the proof of the statement.

To build intuition towards composite admissibility, we begin with a basic question on validity: is there a systematic way to construct tools for valid (potentially inadmissible) inference in composite settings?
The following observations are straightforward and arguably well-known in some form or another, butare nevertheless useful to spell out formally in order to lay the path for the admissibility results.
Proposition 21 (Pointwise-to-composite validity). The following statements lay out necessary and sufficient conditions that connect validity in the 'pointwise' setting to the 'composite' setting.

(1) (p_t) is Q-valid if and only if p_t ≥ p_t^Q, Q-a.s., for all t and Q ∈ Q, where (p_t^Q) is some p-value for Q.

(2) (e_t) is Q-safe if and only if e_t ≤ e_t^Q, Q-a.s., for all t and Q ∈ Q, where (e_t^Q) is some e-value for Q.

(3) (ψ_t) is a (Q, α)-ST if and only if ψ_t ≤ ψ_t^Q, Q-a.s., for all t and Q ∈ Q, where (ψ_t^Q) is some (Q, α)-ST.

Proof. We only prove (1); the other two assertions are argued analogously. Suppose we are given that (p_t) is Q-valid. Then, by definition, (p_t) is Q-valid for every Q ∈ Q; so choosing (p_t^Q) := (p_t) itself, we have proved the 'only if' direction. For the other direction, suppose for every Q ∈ Q we are given a p-value (p_t^Q) for Q, and that (p_t) satisfies p_t ≥ p_t^Q, Q-almost surely, for all t. Let Q* ∈ Q be some true (arbitrary) data-generating distribution. We must argue that (p_t) is Q*-valid. Indeed, for any α ∈ [0, 1], we have

Q*(∃ t ∈ N : p_t ≤ α) ≤ Q*(∃ t ∈ N : p_t^{Q*} ≤ α) ≤ α,

where the first inequality follows because p_t ≥ p_t^{Q*}, and the second inequality follows because (p_t^{Q*}) is Q*-valid by assumption. This concludes the proof.

The proposition provides a generic reduction from the composite setting to the pointwise setting for performing valid inference, but we can deduce a similar result for admissible inference, presented later. While Proposition 21 forms a useful building block, it makes no mention of martingales. Nevertheless, we now have the appropriate context in place to summarize some of our central results.
Below, we use the notions of essential supremum and essential infimum, which we review in Appendix A.2, and note that we will need the additional restriction that the family be locally dominated in order for these essential extrema to be well defined.

Corollary 22 (Pointwise supermartingales are sufficient for composite validity). Let Q be locally dominated. Then the following statements demonstrate how supermartingales suffice for sequential inference.

(1) If p_t = ess sup_{Q∈Q} (1 ∧ inf_{s≤t} 1/N^Q_s), where (N^Q_t) is upper bounded by a Q-NSM, then (p_t) is Q-valid.

(2) If e_t = ess inf_{Q∈Q} N^Q_t, where (N^Q_t) is upper bounded by a Q-NSM, then (e_t) is Q-safe.

(3) If ψ_t = ess inf_{Q∈Q} 1{sup_{s≤t} N^Q_s ≥ 1/α}, where (N^Q_t) is upper bounded by a Q-NSM, then (ψ_t) is a (Q, α)-ST.

Above, all Q-NSMs start with initial expected value (at most) one, and (N^Q_t) is assumed nonnegative. This corollary follows directly from Proposition 21 and so its proof is omitted; see also Remark 14.
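To make the corollary concrete, here is a small Monte Carlo sketch (horizon, seed, and sample sizes are arbitrary choices of ours) using the Gaussian likelihood-ratio martingale G_t = exp(Σ_{s≤t} X_s − t/2) from Section 5.1 under the null of i.i.d. standard Gaussians: the process has mean one, Ville's inequality bounds the crossing probability by α, and the induced p-value rejects exactly when the running supremum crosses 1/α.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, T, n_paths = 0.05, 200, 20_000

# Simulate log G_t = sum_{s<=t} X_s - t/2 under the null X_s ~ N(0, 1).
X = rng.standard_normal((n_paths, T))
log_G = np.cumsum(X, axis=1) - 0.5 * np.arange(1, T + 1)

# Martingale property, checked at a small t to tame the lognormal variance.
mean_G4 = np.exp(log_G[:, 3]).mean()               # E[G_4] = 1

# Ville's inequality: Q(sup_t G_t >= 1/alpha) <= alpha.
crossed = (log_G >= -np.log(alpha)).any(axis=1)

# Induced p-value p_T = 1 ∧ inf_{s<=T} 1/G_s rejects iff the sup crossed.
p_T = np.minimum(1.0, np.exp(-log_G.max(axis=1)))
```

In discrete time and with a finite horizon, the crossing frequency is strictly below α, illustrating the conservativeness ('overshoot') discussed later in the paper.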
Not all constructions based on martingales are admissible. We next provide an analog of Proposition 21, now for admissibility.
Proposition 23 (Admissible composite tests must aggregate admissible pointwise tests). Let Q be locally dominated. The following statements show how composite admissible instruments must aggregate (some) pointwise admissible instruments.

(1) If (p_t) is Q-admissible, then p_t = ess sup_{Q∈Q} p^Q_t for all t, where (p^Q_t) is Q-admissible.

(2) If (e_t) is Q-admissible, then e_t = ess inf_{Q∈Q} e^Q_t for all t, where (e^Q_t) is Q-admissible.

(3) If (ψ_t) is Q-admissible, then ψ_t = ess inf_{Q∈Q} ψ^Q_t for all t, where (ψ^Q_t) is Q-admissible.

Proof. Let (p_t) denote a Q-admissible p-value. For each Q ∈ Q, let (p^Q_t) be a Q-admissible p-value that dominates (p_t). Such a (p^Q_t) exists thanks to Proposition 7. Let us now define p'_t := ess sup_{Q∈Q} p^Q_t, which is Q-valid thanks to Proposition 21(1). Clearly, we have p'_t ≥ p_t for all t ∈ N. Since (p_t) is Q-admissible, we indeed have p'_t = p_t for all t ∈ N, yielding the assertion for p-values. The assertions for e-values and sequential tests are shown in the same manner.

The following corollary describes the restrictions that every admissible construction necessarily satisfies.

Corollary 24 (Pointwise martingales are necessary for composite admissibility). Let Q be locally dominated. Then the following statements demonstrate how martingales underpin all admissible constructions.

(1) If (p_t) is Q-admissible, then (p_t) is nonincreasing and p_t = ess sup_{Q∈Q} p^Q_t for all t, where (p^Q_t) is a closed Q-MM.

(2) If (e_t) is Q-admissible, then e_t = ess inf_{Q∈Q} M^Q_t for all t, where (M^Q_t) is a Q-NM with E_Q[M^Q_0] = 1.

(3) If (ψ_t) is Q-admissible, then ψ_t = ess inf_{Q∈Q} 1{sup_{s≤t} M^Q_s ≥ 1/α}, where (M^Q_t) is a Q-NM with E_Q[M^Q_0] = 1, M^Q_∞ ∈ {0, 1/α}, and (M^Q_t) has no overshoot at level 1/α.
The aforementioned three statements are a direct consequence of Proposition 23 and Theorem 16.

Sufficient conditions for admissible (composite) inference
The next proposition argues that it suffices to consider only a subset of Q when constructing Q-admissible p-values, e-values, or sequential tests. Note that we do not require Q to be locally dominated below.

Proposition 25 (Pointwise-to-composite admissibility). Assume there exists a 'reference family' (Q_i)_{i∈I} ⊂ Q such that, for each t and A ∈ F_t,

if there exists Q ∈ Q with Q(A) > 0, then there exists i ∈ I with Q_i(A) > 0.

Then we have the following.

(1) If (p_t) is Q-valid and (p_t) is Q_i-admissible for each i ∈ I, then (p_t) is Q-admissible.

(2) If (e_t) is Q-safe and (e_t) is Q_i-admissible for each i ∈ I, then (e_t) is Q-admissible.

(3) If (ψ_t) is a (Q, α)-ST and (ψ_t) is Q_i-admissible for each i ∈ I, then (ψ_t) is Q-admissible.

Proof. Let us only argue here the case of e-values; the other cases follow in exactly the same manner. Assume that there exists an e-value (e'_t) for Q such that Q(e'_t ≥ e_t) = 1 for all t ∈ N and all Q ∈ Q, and that Q*(e'_t > e_t) > 0 for some t ∈ N and some Q* ∈ Q. By assumption, there exists some i ∈ I such that Q_i(e'_t > e_t) > 0. Since (e_t) is assumed to be Q_i-admissible, we get a contradiction.

Of course, two special cases are found at the extremes: when the reference family is a singleton, it means there is a common reference measure R, and when the reference family is Q itself, the proposition is vacuous. The proposition is particularly useful in the first case; then, to get an admissible e-value, for example, it suffices to construct a Q-NSM (M_t) that is also a Q*-NM for the reference measure Q*, thanks to Proposition 25(2) and Theorem 16(2). The following example demonstrates one such setting.

Example 26.
Recalling notation from Section 5.1, let µ_m ∈ G_m denote the measure under which (X_t) is i.i.d. Gaussian with unit variance and mean m, and consider Q := {µ_m : m ≤ 0}. Then G_t := exp(Σ_{s≤t} X_s − t/2) is not a Q-NM: it is a µ_0-NM when X_t is standard Gaussian, but is a µ_m-NSM for m < 0. Nevertheless, (G_t) is a Q-admissible e-value. The reason is that (G_t), being a µ_0-NM, is immediately µ_0-admissible, and the singleton reference family {µ_0} satisfies the local absolute continuity condition required to invoke Proposition 25(2).

Corollary 27 (Composite martingales are sufficient for composite admissibility). Consider a general composite family Q.

(1) (p_t) is Q-admissible if it is a closed Q-MM and inf_{t∈N} p_t is Q-uniformly distributed for every Q ∈ Q.

(2) (e_t) is Q-admissible if it is a Q-NM with E_Q[e_0] = 1 for all Q ∈ Q.

(3) (ψ_t) is Q-admissible if ψ_t = 1{sup_{s≤t} M_s ≥ 1/α}, where (M_t) is a Q-NM with M_0 = 1, M_∞ ∈ {0, 1/α}, Q-almost surely, for every Q ∈ Q, and no overshoot at 1/α.

This corollary is again a direct consequence of Proposition 25 and Theorem 16. Recall that, as in Corollary 19, a sufficient condition for (p_t) to be a closed Q-MM and inf_{t∈N} p_t to be Q-uniformly distributed, for some fixed Q ∈ Q, is that the p-value has the representation p_t = F_Q(inf_{s≤t} 1/M^Q_s), where (M^Q_t) is a Q-NM and inf_{s∈N} 1/M^Q_s has an atomless distribution function F_Q under Q.

As an immediate consequence of Corollary 27(2), recall the Gaussian example in Section 5.1, and consider testing if the underlying mean is zero. Since (G_t) is a G_0-NM, it is also G_0-safe, and hence a G_0-admissible e-value when testing against, for example, P = ∪_{m∈R} G_m.

Necessary and sufficient conditions for confidence sequences

Proposition 11 already shows that we can construct a confidence sequence by inverting a family of sequential tests. We now show that their admissibility is also tightly linked to that of the underlying tests.
Theorem 28.
Recall that P_γ := {P ∈ P : φ(P) = γ}. If (ψ^γ_t) is an admissible (P_γ, α)-sequential test for each γ ∈ Z, then C_t := {γ ∈ Z : ψ^γ_t = 0} is an admissible (φ, P, α)-confidence sequence. Similarly, if (C_t) is an admissible (φ, P, α)-confidence sequence, then ψ^γ_t := 1{γ ∉ C_t} yields an admissible (P_γ, α)-sequential test for each γ ∈ Z, so that C_t = {γ ∈ Z : ψ^γ_t = 0}. As a result, we can infer the following.

(1) (Validity) If, for each P ∈ P, the process (N^P_t) is upper bounded by a P-NSM with initial expected value one, then C_t := ∪_{P∈P} {φ(P) : sup_{s≤t} N^P_s < 1/α} is a (φ, P, α)-CS.

(2) (Admissibility) If (M^γ_t) is a P_γ-NM with M^γ_0 = 1, M^γ_∞ ∈ {0, 1/α}, P-almost surely, for every P ∈ P_γ, and has no overshoot at 1/α, then C_t := {γ ∈ Z : sup_{s≤t} M^γ_s < 1/α} is P-admissible.

Assume now that P_γ is locally dominated for each γ, and that (C_t) is P-admissible. Then, for all t ∈ N, we have C_t = ∪_{P∈P} {φ(P) : ψ^P_t = 0}, where (ψ^P_t) is P-admissible. Moreover, we can write

C_t = ∪_{P∈P} {φ(P) : sup_{s≤t} M^P_s < 1/α},

where (M^P_t) is a P-NM that has no overshoot at level 1/α, with M^P_0 = 1 and M^P_∞ ∈ {0, 1/α}.

Proof. For the first statement of the theorem, note that C_t = {γ ∈ Z : ψ^γ_t = 0} yields a (φ, P, α)-confidence sequence by Proposition 11. Suppose for contradiction that (C'_t) is another (φ, P, α)-confidence sequence that witnesses the inadmissibility of (C_t). Proposition 9(3) then yields a corresponding family {(η^γ_t)}_{γ∈Z} of sequential tests. The inadmissibility of (C_t) then yields some γ ∈ Z such that (η^γ_t) strictly dominates (ψ^γ_t), a contradiction to the assumption that (ψ^γ_t) is admissible. The second statement follows in exactly the same way, again by an application of Propositions 11 and 9(3).
Statements (1) and (2) are direct corollaries of combining the first part of the theorem with Corollary 22(3) and Corollary 27(3). Assume now that P_γ is locally dominated, for each γ, and that (C_t) is P-admissible. The statement then follows from Proposition 23(3) and Corollary 24(3).

This and the previous section argued in detail that restricting our attention to constructions based on NMs (not NSMs!) does not hurt us: these are universal constructions. Indeed, if one is presented with a p-value, e-value, sequential test, or confidence sequence constructed in some arbitrary fashion, we show that one can always uncover a 'hidden' underlying NM, such that applying Ville's inequality or the optional stopping theorem yields an instrument that is at least as good as the original one.

Recall that we were able to crisply summarize the necessary conditions for admissibility using martingales in Corollary 24, and sufficient conditions in Corollary 27. The following discussion probes at the gap between the necessary and the sufficient conditions for composite admissibility, in order to demonstrate that the gap is real. We begin with two instructive examples that demonstrate that the necessary conditions of Corollary 24 are not actually sufficient for admissibility.
Example 29 (Necessary conditions for Corollary 24(1) are not sufficient). Assume that Q is the set of probability measures under which X_1 is Bernoulli and X_2 = X_3 = ... = 0. Consider now p_1 := 1{X_1 = 0} U + 1{X_1 = 1} √U and p_2 = p_3 = ... = p_1. Note that p_1 ≥ U by construction and hence is valid. Then (p_t) is an anytime-valid p-value and satisfies Corollary 24(1) and Proposition 23(1). Here

p^Q_1 = p^Q_2 = p^Q_3 = ... = F_Q(p_1) = (1 − q) p_1 + q p_1²,

where q := Q(X_1 = 1) ∈ [0, 1], for each Q ∈ Q, and F_Q is the Q-distribution function of p_1. Indeed, by definition p^Q_1 is Q-uniform, and considering small q's it is easy to see that p_1 = ess sup_{Q∈Q} p^Q_1. However, (p_t) is indeed inadmissible, as it is dominated by (U).

Example 30 (Necessary conditions for Corollary 24(2) are not sufficient). Assume that X_t = (Y_t, Z_t), where Z_t is some 'nuisance parameter', and let Q := {Q⁺, Q⁻} consist of two measures: under both Q⁺ and Q⁻, the first components Y_t of the observation process are i.i.d. standard Gaussian. The two measures may be different on the 'nuisance parameter' Z_t. Following the Gaussian example in Section 5.1, define

M⁺_t := exp(Σ_{s≤t} Y_s − t/2)  and  M⁻_t := exp(−Σ_{s≤t} Y_s − t/2).

Then (M⁺_t) is a Q⁺-NM, and (M⁻_t) is a Q⁻-NM. Indeed, both processes are martingales under both measures. The e-value for Q corresponding to Proposition 23(2) is

e_t = min{M⁺_t, M⁻_t} = exp(−|Σ_{s≤t} Y_s| − t/2).

Then (e_t) is the minimum of two martingales, both under Q⁺ and Q⁻, hence a supermartingale under both measures. Moreover, the martingale part of (e_t) is the same under both measures (as both measures agree on the filtration generated by the process (Y_t)) and strictly dominates (e_t) under both measures. This proves that (e_t) is inadmissible, since it is strictly dominated by a Q-NM.
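The algebra behind Example 30's composite e-process can be checked directly: the pointwise minimum of the two exponential martingales collapses to exp(−|Σ_{s≤t} Y_s| − t/2), path by path (the simulation sizes and seed below are arbitrary choices of ours).

```python
import numpy as np

rng = np.random.default_rng(2)
Y = rng.standard_normal((1000, 30))   # the Y-components, i.i.d. N(0, 1)
S = np.cumsum(Y, axis=1)
half_t = 0.5 * np.arange(1, 31)

M_plus = np.exp(S - half_t)           # a Q+-NM
M_minus = np.exp(-S - half_t)         # a Q--NM
e = np.minimum(M_plus, M_minus)       # the composite e-process of Example 30

# min(e^{S - t/2}, e^{-S - t/2}) = e^{-|S| - t/2}, on every path
assert np.allclose(e, np.exp(-np.abs(S) - half_t))
```

The minimum is dominated by either martingale, which is the source of the inadmissibility argued in the example.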
Next, we discuss anti-concentration results, which are, somewhat surprisingly, insufficient for admissibility. Since Ville's inequality has often been used in this paper to demonstrate validity, one would hope that if Ville's inequality is 'essentially' tight (it holds with 'almost' equality), then the corresponding inferential instruments may be close to admissible. We examine this angle next. Below, we derive an anti-concentration (lower) bound to complement the upper bound of Ville's inequality.
Lemma 31 (Anti-concentration for pointwise NMs). Let (M_t) be a Q-NM with E_Q[M_0] = 1, so that the 'multiplicative increment' Y_t := M_t/M_{t−1} (with 0/0 := 1) has unit conditional mean. Assume that the aggregate empirical variance of (Y_t) is Q-almost surely infinite, i.e.,

Q(Σ_{t∈N} (Y_t − 1)² = ∞) = 1. (14)

Then M_∞ = 0, Q-almost surely. Fix now some ε > 0. Assume that for each t ∈ N the multiplicative increment Y_t, with Y_1 := M_1, satisfies a tail condition, namely, for each F_{t−1}-measurable random variable β ≥ 1, where F_0 := {∅, Ω}, we have

E_Q[Y_t | F_{t−1}, Y_t ≥ β] ≤ β(1 + ε). (15)

Then, for any α ∈ (0, 1), we have

α ≥ Q(sup_{t∈N} M_t ≥ 1/α) ≥ α/(1 + ε).

A simple sufficient condition for (15) is that Q(Y_t ≤ 1 + ε) = 1 for all t ∈ N. We note that (14) is easily satisfied when Q is a product measure, and thus (Y_t) is a sequence of i.i.d. random variables, as long as Q(Y_t ≠ 1) > 0. We can now extend the above pointwise result to the composite setting; once more, this result is of independent interest.

Corollary 32 (Anti-concentration for bounded composite NMs). Consider a family Q of probability measures and a Q-NM (M_t) with M_0 = 1. Define Y_t := M_t/M_{t−1} (with 0/0 := 1) and assume that for each ε > 0 there exists some Q ∈ Q such that conditions (14) and (15) hold for each t ∈ N and F_{t−1}-measurable random variable β ≥ 1. Then

sup_{Q∈Q} Q(sup_{t∈N} M_t ≥ 1/α) = α, for any α ∈ (0, 1).

In words, the above result establishes rather simple and interpretable sufficient conditions under which p_t := inf_{s≤t} 1/M_s uses up all of its type-I error budget, meaning that, at least in a worst-case sense, Ville's inequality did not lead to a conservative test.

Unfortunately, Example 44 shows, in the context of conditionally symmetric distributions (see Section 9), that even under the assumptions of the previous corollary, such a (p_t) need not be admissible. This further demonstrates the subtleties of establishing sufficient conditions for admissibility.
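A toy no-overshoot martingale (our own construction, not from the paper) makes the tightness phenomenon exact: if each multiplicative increment is 0 or 2 with equal probability, then M_t can hit 1/α = 2^k exactly, and enumerating all paths shows that Ville's inequality holds with equality.

```python
from fractions import Fraction
from itertools import product

k = 5                                      # test at level alpha = 2^-k = 1/32
alpha = Fraction(1, 2 ** k)

mean_M = Fraction(0)
hit_prob = Fraction(0)
for factors in product([0, 2], repeat=k):  # each factor 0 or 2, prob 1/2 each
    M = sup_M = Fraction(1)
    for y in factors:
        M *= y
        sup_M = max(sup_M, M)
    path_prob = Fraction(1, 2 ** k)
    mean_M += path_prob * M
    hit_prob += path_prob * int(sup_M >= 1 / alpha)

assert mean_M == 1            # martingale property at time k
assert hit_prob == alpha      # Ville's bound attained exactly: no overshoot
```

Only the all-2 path reaches 2^k, with probability exactly 2^{-k} = α; by contrast, increments with overshoot would leave the crossing probability strictly below α.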
Nevertheless, Corollary 32 is of independent interest, even though the fairly intuitive condition it yields does not usually suffice for admissibility.

Throughout this subsection, let us consider Q = {Q}. The admissibility of p-values is subtle, and randomization appears to play a key role in enabling admissible constructions. The key difficulty is in dealing with atomic limiting distributions, and we delve more into this topic here with several examples.

In Corollary 19, inf_{s∈N} 1/M_s is assumed to have an atomless distribution function F. Initial randomization turns out to be necessary for this to hold. To formalize this claim, consider a martingale (M_t). We now argue the following fact:

If M_0 = 1, Q-almost surely, then sup_{t∈N} M_t has an atom at one under Q.

The proof is simple, so we present it immediately. Define Y_t := M_t/M_{t−1} with 0/0 := 1. Without loss of generality, we may assume that Q(Y_1 ≠ 1) > 0. Since E_Q[Y_1] = 1, there exists some η > 0 such that Q(Y_1 ≤ 1 − η) > η. On the event {Y_1 ≤ 1 − η}, the conditional version of Ville's inequality (8) yields that Q(sup_{t≥1} M_t ≥ 1 | F_1) ≤ M_1 ≤ 1 − η. Hence, on this event we have Q(sup_{t≥1} M_t < 1 | F_1) ≥ η, yielding the unconditional bound Q(sup_{t≥1} M_t < 1) ≥ η². Since M_0 = 1, this then gives Q(sup_{t∈N} M_t = 1) ≥ η². Hence sup_{t∈N} M_t has an atom at one, and so does the induced p-value, completing the proof of the aforementioned fact.

In contrast, if we consider the martingale (M'_t) with randomized initial value, M'_t := M_t + εU, where ε > 0, and recall that U is the (independent) F_0-measurable [0, 1]-uniformly distributed random variable, then sup_{t∈N} M'_t = sup_{t∈N} M_t + εU has a density, since it is the convolution of sup_{t∈N} M_t with a random variable that has a density.

Let us consider for the moment a p-value constructed as p_t := F(inf_{s≤t} 1/M_s), where (M_t) is a Q-martingale with M_0 = 1 and F is the distribution function of inf_{s∈N} 1/M_s. (Note that such a p-value always dominates (inf_{s≤t} 1/M_s).)
Then (p_t) is always inadmissible. To see this, define p_∞ := inf_{t∈N} p_t and δ := Q(p_∞ = 1) > 0, where the inequality follows from the fact argued above. Moreover, define the conditional distribution function G by [0, 1] ∋ u ↦ Q(U ≤ u | p_∞ = 1). Let us then define p'_t := p_t ∧ (1 − δ + δG(U)). Then clearly (p'_t) strictly dominates (p_t). Moreover, (p'_t) is a p-value, since p'_∞ := inf_{t∈N} p'_t = p_∞ ∧ (1 − δ + δG(U)) stochastically dominates a uniform. Indeed, for α ∈ (0, 1 − δ)
we have Q(p'_∞ ≤ α) = Q(p_∞ ≤ α) ≤ α, and for α ∈ [1 − δ, 1] we get

Q(p'_∞ ≤ α) = Q(p_∞ ≤ 1 − δ) + Q(p_∞ = 1, 1 − δ + δG(U) ≤ α)
= 1 − δ + δ Q(δG(U) ≤ α − (1 − δ) | p_∞ = 1)
= 1 − δ + δ (α − (1 − δ))/δ = α,

where we used that G(U) is uniformly distributed under the conditional measure Q(· | p_∞ = 1).

We noted above that atoms at one are 'obviously' undesirable for p-values. Quite surprisingly, there do exist admissible anytime p-values with atomic limiting distributions (where the atoms are not at one); see Example 43. In that example, we have Q(p_∞ < 1/2) = 0, and (p_t) is independent of the randomization device U; nevertheless, it is impossible to 'randomize away' the atom.

To end the discussion about atoms in the context of p-values, we remark that atomic limiting distributions occur more often in discrete time than in continuous time. For example, if (B_t)_{t∈[0,∞)} is a standard Brownian motion, then (exp(B_t − t/2))_{t∈[0,∞)} is a martingale, and inf_{t≥0} 1/exp(B_t − t/2) is exactly [0, 1]-uniformly distributed. However, the corresponding standard Gaussian NM from (10) has that inf_{t∈N} 1/G_t is atomic when (X_t) under Q follow the law of i.i.d. standard Gaussians.

In sharp contrast, initial randomization causes sequential tests based on e-values to become inadmissible. Indeed, if the jumps of (e_t) are continuous with positive probability, then the corresponding sequential test is not admissible for any α ∈ (0, 1), due to overshoot. Only e-values that have atomic jumps can possibly lead to admissible tests; however, any such e-value cannot lead to an admissible test for every α (it will overshoot for some and not for others). Example 37 in the next section derives an admissible sequential test for (composite) symmetric distributions.

Considering the results of this paper presented thus far, we demonstrate how they may inform practice. At a high level, this section constructs (using different NMs) admissible versions of all four instruments studied in this paper for the class of conditionally symmetric distributions. We return to the example from Section 5.2, where we had presented a rather elegant, and intuitive, exponential NSM for distributions that yield conditionally symmetric observations. However, the results following the example showed that the tests or confidence sequences stemming from it are inadmissible, since all admissible constructions must use NMs. We will construct such NMs, which appear to be new to the best of our knowledge (but we would not be surprised if they have been discovered before).
Recalling the notation from Section 5.2, let S := S_0 be the set of laws such that X_t, conditional on F_{t−1}, is symmetric around zero. We will demonstrate here that inference based on (S_t) defined in (12) is inadmissible, by explicitly constructing procedures that dominate it.

We note that S is not locally dominated. Indeed, just consider P ∈ S of the form P = U × µ^∞ (where U denotes the uniform measure), and note that there are uncountably many mutually singular choices for µ; take for instance µ = (δ_x + δ_{−x})/2 for x ∈ R. Here δ_x denotes the Dirac measure at x. Nevertheless, despite the lack of a reference measure, it is still possible to construct a family of S-NMs, and thus admissible e-values for S. Indeed, we have the following proposition.

Proposition 33.
An adapted process (M_t) with M_0 bounded and nonnegative is an S-NM if and only if Y_t := M_t/M_{t−1} (with 0/0 := 1) is of the form Y_t = f_t(X_t), where (f_t) is a nonnegative predictable function such that f_t − 1 is odd, or equivalently, f_t(x) + f_t(−x) = 2 for all x ∈ R. Moreover, if M_0 = 1, then (M_t) is an S-admissible e-value by Corollary 27(2).

The proof is in Section B. A similar characterization as above also holds for any S-NSM; in the notation of Proposition 33, (M_t) is an S-NSM if and only if Y_t = f_t(X_t), where (f_t) is a nonnegative predictable function such that

f_t(x) + f_t(−x) ≤ 2 for all x ∈ R and t ∈ N. (16)

Moreover, an S-NSM (M_t) can be converted to an S-NM (M̃_t), with M̃_t = Π_{s≤t} f̃_s(X_s), by the following mirroring operation:

f̃_t(x) = f_t(x) if f_t(x) ≥ f_t(−x);  f̃_t(x) = 2 − f_t(−x) if f_t(x) < f_t(−x).

Indeed, we get that f̃_t ≥ f_t and that equality holds in (16) with f_t replaced by f̃_t.

Proposition 33 demonstrates how to construct admissible e-values for symmetry, and we give two instantiations that we have found (subjectively) elegant. Let h be an odd function and consider f(x) = 1 + (2/π) arctan h(x) or f(x) = 1 + sin h(x). Then (Π_{s≤t} f(X_s)) is an S-NM, and thus an e-value for S, which is admissible by Proposition 33.

Finally, we return to the exponential S-NSM from [4] from Section 5.2, showing that it leads to an inadmissible e-value for S, and improving it to an admissible one by converting the NSM to an NM.

Example 34.
Let P ⊃ S and recall from Section 5.2 that the process (S_t), given by

S_t = Π_{s≤t} g(X_s), where g(x) := exp(x − x²/2),

is an S-NSM. Further, (S_t) is not a martingale unless X_t = 0 for all t ∈ N, and the corresponding S-safe e-value is inadmissible, due to the following argument. Define

f(x) = g(x) for x ≥ 0, and f(x) = 2 − g(−x) for x < 0.

Then f ≥ g, with equality if and only if x ≥ 0. Further, f(−x) − 1 = −(f(x) − 1), and finally f is nonnegative since g ≤ e^{1/2} ≈ 1.65; hence (Π_{s≤t} f(X_s)) is an S-NM by Proposition 33. This also yields that the corresponding e-value is admissible. Even though the original S-NSM is inadmissible, we recognize the aesthetic and analytical advantage in having a simple exponential formula.

Let us now illustrate that the p-values corresponding to the S-NMs fully utilize the available type-I error budget, in the sense of the next proposition.

Proposition 35.
Consider a nonnegative function f, continuous and strictly monotone at zero, such that f − 1 is odd. Then M_t := Π_{s≤t} f(X_s) is an S-NM by Proposition 33, and we have

sup_{Q∈S} Q(sup_{t∈N} M_t ≥ 1/α) = α, for any α ∈ (0, 1). (17)

Thus, defining p_t := inf_{s≤t} 1/M_s, we have that (p_t) is a p-value for S that satisfies

sup_{Q∈S, τ} Q(p_τ ≤ α) = α, for any α ∈ (0, 1), (18)

where τ ranges over all stopping times. The proof is in Section B. An analogous result to the above proposition is known for the class of sub-Gaussian distributions [11, Proposition 4], but had only been conjectured for other nonparametric classes like S. Unfortunately, not every p-value for S constructed as in (18) above is admissible; see Example 44 in Appendix C. Finally, let us construct an S-admissible p-value in the next example.

Example 36.
Define the following subset of symmetric distributions:

S̃ := { P ∈ S : P(Σ_{t∈N} 1{X_t ≠ 0} = ∞) = 1 }. (19)

Next, we define the process (p_t) as p_0 := 1 and

p_t := 1 − Σ_{s≤t} 2^{−N_s} 1{X_s > 0} = p_{t−1} − 2^{−N_t} 1{X_t > 0}, where N_s := Σ_{i≤s} 1{X_i ≠ 0}.

Then it can be checked that (p_t) is a closed S-MM. Moreover, inf_{t∈N} p_t is Q-uniform for each Q ∈ S̃. Assume for the moment that each Q ∈ S \ S̃ can be locally dominated by some Q' ∈ S̃. Proposition 25(1) and Corollary 27(1) then yield that (p_t) is S-admissible.

Let us now fix some Q ∈ S \ S̃ and argue that it can indeed be locally dominated by some Q' ∈ S̃. To do so, define the measure µ := (δ_1 + δ_{−1})/2, with δ_x denoting again the Dirac measure at x ∈ R. Moreover, let H denote the law of a Poisson random variable with expectation one. On the appropriate canonical space, consider the measure Q × µ^∞ × H, and write U, (X_t), (Y_t), H for the canonical random variables. To summarize, U is uniform, (X_t) is our original conditionally symmetric sequence, (Y_t) is an independent sequence of Rademacher random variables, and H is Poisson. Define a new sequence (X'_t) by X'_t := X_t 1{H > t} + Y_t 1{H ≤ t} for all t ∈ N, and let Q' denote the measure induced by U and (X'_t). Then Q' ∈ S̃, and it can be checked that Q' locally dominates Q. This completes the proof of our initial claim.

As a final observation, note that if we replace S by the superset of probability measures for which the conditional laws of X_t have median zero, all statements still hold.

This section has now developed admissible e-values and p-values for testing for symmetry; we move next to admissible sequential tests (and thus confidence sequences).

Example 37.
Consider the null H_0 : P ∈ S from definition (11), and let α = 0.05, so that 1/α = 20. Consider the process (M_t) defined as follows. Let M_0 = 1, and let zero be an absorbing state, meaning that if M_t = 0, then the process stays at zero from then on. If M_t is nonzero (and the process has not yet reached 20), then define M_{t+1} := M_t + sign(X_t) 1{X_t ≠ 0}. It is easy to check that (M_t) is an S-NM. Define τ as the first time M_t reaches 20, and stop the process there. Then M_τ = 20 on {τ < ∞}, and also M_∞ = 0, Q-almost surely, for each Q ∈ S̃ as in (19). Invoking Proposition 25(3) and Corollary 27(3) as in the previous example yields that (1{M_t ≥ 20}) is (S, 0.05)-admissible. More generally, (1{M_t ≥ 1/α}) is (S, α)-admissible whenever 1/α ∈ N.

Of course, there is nothing special about 0.05 and 20; for any other α, the process (M_t) can be altered accordingly to yield an admissible test for that α. The above process (M_t) also delivers admissible tests for several subsets of S, for example if we restrict to only Gaussian distributions with any variance. This is interesting because admissibility is generally not subset-proof or superset-proof, but above we have a single process (M_t) that yields admissible e-values and sequential tests for a variety of subsets of S.

Continuing from Example 37 and using Theorem 28, we can construct an admissible level-α test for any (S_m)_{m∈R}, so the above construction yields an admissible confidence sequence for the center of symmetry. Thus, we have accomplished our goal of constructing admissible versions of all four instruments for sequential inference, for a composite nonparametric class of distributions.
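The constructions of this section lend themselves to exact numerical verification. The sketch below contains our own checks (evaluation points and the constant k are arbitrary): (a) the mirrored factor of Example 34 satisfies the martingale condition of Proposition 33, (b) the limiting p-value of Example 36 is a uniform binary expansion after k nonzero observations, and (c) the test of Example 37 spends exactly α = 1/20 by gambler's ruin.

```python
import math
from fractions import Fraction
from itertools import product

# (a) Example 34: mirror g(x) = exp(x - x^2/2) into f with f(x) + f(-x) = 2.
def g(x):
    return math.exp(x - x * x / 2)

def f(x):
    return g(x) if g(x) >= g(-x) else 2 - g(-x)

for x in [0.0, 0.3, -0.7, 1.5, -2.2]:
    assert f(x) >= g(x) >= 0
    assert abs(f(x) + f(-x) - 2) < 1e-12

# (b) Example 36: over all 2^k sign patterns of the first k nonzero
# observations, p takes each value j/2^k, j = 1..2^k, exactly once.
k = 8
values = set()
for signs in product([0, 1], repeat=k):
    p = Fraction(1) - sum(Fraction(b, 2 ** (s + 1)) for s, b in enumerate(signs))
    values.add(p)
assert values == {Fraction(j, 2 ** k) for j in range(1, 2 ** k + 1)}

# (c) Example 37: random walk from 1, absorbed at 0, stopped at 20.  The
# function h(x) = x/20 is the unique function that is harmonic on the
# interior with h(0) = 0 and h(20) = 1, so the hitting probability from
# M_0 = 1 is exactly 1/20.
target = 20
h = [Fraction(x, target) for x in range(target + 1)]
for x in range(1, target):
    assert h[x] == (h[x - 1] + h[x + 1]) / 2
assert h[1] == Fraction(1, 20)
```

Part (b) makes the uniformity of inf_t p_t transparent: the signs of the nonzero observations are exactly the bits of a uniform binary expansion.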
10 Summary
The central contribution of this work is to identify the fundamental role of nonnegative martingales in anytime-valid sequential inference. As a by-product, we have added several modern mathematical techniques to the toolkit of the methodologist who wishes to design statistically efficient methods for inference at arbitrary stopping times. We end with a few comments.

It is apparent to us that some of our analysis may have been simpler in continuous time. Indeed, some of the difficulty in constructing admissible sequential tests using e-values arises from the 'overshoot', while the difficulty in designing admissible p-values arises because we do not observe the process 'in between' the fixed times, and thus the running infimum is not exactly uniformly distributed in the limit. Several of these problems go away with continuous-time/path martingales. However, continuous-path martingales only represent large-scale approximations of most actual experimental setups, which typically involve discrete events. The accuracy of these approximations would have to be assessed, especially outside very high-frequency settings like finance, and it may not be clear how to do so. We believe that the additional effort to understand admissibility in the discrete-time setup was fruitful.

Following the literature, our sequential inference tools were only required to have marginal guarantees, and not conditional ones. To pick one example, we required that, for each Q ∈ Q, an e-value must satisfy E_Q[e_τ] ≤ 1 at arbitrary stopping times τ, but it need not satisfy E_Q[e_t | F_s] ≤ e_s. This gap between conditional and marginal guarantees is paramount: it allows for the construction of e-values that are not simply Q-NMs, because in several settings of interest one can show that the only Q-NM is the trivial constant process that equals one at all times, while nontrivial e-values with power to detect deviations from Q can still be constructed.
We explore these connections further, using a structural notion called 'fork-convexity', in a separate work.

More broadly, the paper provides a rather general treatment of the inferential tools and problem settings. However, perhaps additional insights could be gained when P or Q have special structure, or when we pay attention to particular classes of stopping times, or restrict ourselves to a bounded horizon; these may all be promising directions to explore. Finally, while we take a step forward in relating the various concepts used for sequential inference, and present a thorough analysis of their validity and admissibility, the question of optimality is unaddressed by our work. Of course, this usually needs to be studied by specifying appropriate alternatives and introducing metrics by which to judge optimality (such as the GROW criterion of Grünwald et al. [9]), and so we leave such considerations for future work.

Acknowledgments
The authors are thankful to the organizers of the International Seminar on Selective Inference, whichstimulated conversations that led to this paper. AR acknowledges NSF DMS grant 1916320.
11 References

[1] E. N. Barron, P. Cardaliaguet, and R. Jensen. Conditional essential suprema with applications.
Appl. Math. Optim. , 48:229–253, 2003. 9[2] Laurent Bienvenu, Glenn Shafer, and Alexander Shen. On the history of martingales in the studyof randomness.
Electronic Journal for History of Probability and Statistics , 5, 2009. 8[3] D. A. Darling and Herbert Robbins. Confidence Sequences for Mean, Variance, and Median.
Proceedings of the National Academy of Sciences, 58(1):66–68, July 1967. ISSN 0027-8424, 1091-6490. 3, 5, 14 [4] Victor H. de la Peña. A General Class of Exponential Inequalities for Martingales and Ratios.
The Annals of Probability, 27(1):537–564, January 1999. ISSN 0091-1798, 2168-894X. 14, 15, 25 [5] William D. Dupont. Sequential stopping rules and sequentially adjusted p-values: Does one require the other? Controlled Clinical Trials, 4(1-2):3–10, 1983. 3 [6] Rick Durrett.
Probability: Theory and Examples. Cambridge University Press, 5th edition, 2017. 8 [7] Bradley Efron. Student's t-Test Under Symmetry Conditions. Journal of the American Statistical Association, 64(328):1278–1302, 1969. ISSN 0162-1459. 14 [8] Hans Föllmer and Alexander Schied.
Stochastic Finance , volume 27 of
De Gruyter Studies in Mathematics. Walter de Gruyter & Co., Berlin, extended edition, 2004. 29 [9] Peter Grünwald, Rianne de Heide, and Wouter Koolen. Safe testing. arXiv:1906.07801, June 2019. 3, 27 [10] Steven R. Howard, Aaditya Ramdas, Jon McAuliffe, and Jasjeet Sekhon. Time-uniform Chernoff bounds via nonnegative supermartingales.
Probability Surveys , 17:257–317, 2020. 3, 14[11] Steven R. Howard, Aaditya Ramdas, Jon McAuliffe, and Jasjeet Sekhon. Time-uniform, nonpara-metric, nonasymptotic confidence sequences.
The Annals of Statistics , forthcoming, 2020. 3, 4, 5,25[12] Ramesh Johari, Leo Pekelis, and David J. Walsh. Always valid inference: Bringing sequential analysisto A/B testing. arXiv preprint arXiv:1512.04922 , 2015. 3[13] Alisa Kirichenko and Peter Grünwald. Minimax rates without the fixed sample size assumption. arXiv preprint arXiv:2006.11170 , 2020. 31[14] Tze Leung Lai. On Confidence Sequences.
The Annals of Statistics , 4(2):265–280, March 1976. ISSN0090-5364, 2168-8966. 3[15] Martin Larsson. Conditional infimum and recovery of monotone processes. arXiv preprintarXiv:1802.08628 , 2018. 9[16] Martin Larsson and Johannes Ruf. Convergence of local supermartingales.
Annales de l’InstitutHenri Poincaré (B) Probabilités et Statistiques , forthcoming, 2020. 33[17] Luigi Pace and Alessandra Salvan. Likelihood, replicability and Robbins’ confidence sequences.
International Statistical Review , 2019. 3[18] Herbert Robbins and David Siegmund. The Expected Sample Size of Some Tests of Power One.
TheAnnals of Statistics , 2(3):415–436, May 1974. ISSN 0090-5364, 2168-8966. 3[19] Glenn Shafer. The language of betting as a strategy for statistical and scientific communication(with discussion).
Journal of the Royal Statistical Society, Series A , 2020. 3, 8[20] Glenn Shafer, Alexander Shen, Nikolai Vereshchagin, and Vladimir Vovk. Test Martingales, BayesFactors and p -Values. Statistical Science , 26(1):84–101, February 2011. ISSN 0883-4237, 2168-8745.3, 8, 13, 16, 31[21] J Ville.
Étude Critique de la Notion de Collectif.
Gauthier-Villars, Paris, 1939. 8[22] Vladimir Vovk. Testing randomness. arXiv:1906.09256 , 2019. 30[23] Vladimir Vovk. Non-algorithmic theory of randomness.
Fields of Logic and Computation III. LectureNotes in Computer Science , 12180, 2020. 3[24] Vladimir Vovk and Ruodu Wang. Combining e-values and p-values. arXiv preprint arXiv:1912.06116 ,2019. 3[25] Abraham Wald. Sequential Tests of Statistical Hypotheses.
Annals of Mathematical Statistics , 16(2):117–186, 1945. 3, 8[26] Abraham Wald.
Sequential Analysis . John Wiley & Sons, New York, 1947. 3, 8[27] Larry Wasserman, Aaditya Ramdas, and Sivaraman Balakrishnan. Universal inference.
Proceedingsof the National Academy of Sciences , 2020. ISSN 0027-8424. 328
A Additional technical concepts and definitions
A.1 Reference measures and local absolute continuity
Consider a probability space with a filtration (F_t)_{t ∈ N}. Let R be a particular probability measure on F_∞; we think of R as a reference measure. We now explain the concept of local domination and how it allows us to unambiguously define conditional expectations.

• P is called locally absolutely continuous with respect to R (or locally dominated by R) if P_t ≪ R_t for all t ∈ N. We write this as P ≪_loc R. More explicitly, this means that R(A) = 0 ⇒ P(A) = 0 for any A ∈ F_t and t ∈ N. Local absolute continuity does not imply that P ≪ R. However, it does imply that P_τ ≪ R_τ for any finite (but possibly unbounded) stopping time τ. Indeed, if A ∈ F_τ and R(A) = 0, then A ∩ {τ ≤ t} ∈ F_t for all t, and hence P(A) = lim_{t→∞} P(A ∩ {τ ≤ t}) = 0.

• A set P of probability measures on F_∞ is called locally dominated by R if every element of P is locally dominated by R.

• Any P ≪_loc R has an associated density process, namely the R-martingale (Z_t) given by Z_t := dP_t/dR_t. Being a nonnegative martingale, once Z_t reaches zero it stays there. Thus, with the convention 0/0 := 0, the ratios Z_τ/Z_t are well-defined for any t ∈ N and any finite stopping time τ ≥ t. Note that each Z_t is defined up to R-nullsets, and therefore also up to P-nullsets.

• If P ≪_loc R has density process (Z_t), the following 'Bayes formula' holds: for any t ∈ N, any finite stopping time τ ≥ t, and any nonnegative F_τ-measurable random variable Y, one has
E_P[Y | F_t] = E_R[(Z_τ / Z_t) Y | F_t], P-almost surely.
The right-hand side is uniquely defined R-almost surely (not just P-almost surely), and therefore provides a 'canonical' version of E_P[Y | F_t]. We always use this version. This allows us to view such conditional expectations under P as being well-defined up to R-nullsets.

One might ask why we work with local domination, rather than a 'global' condition like P ≪ R for all P of interest.
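Before answering, we note that the density process and Bayes formula above can be made concrete in a small discrete setting. The following sketch is our own illustration (not part of the paper's formal development): P and R are taken to be biased- and fair-coin laws, Z_t = dP_t/dR_t is the pathwise likelihood ratio, and the unconditional case of the Bayes formula, E_P[Y] = E_R[Z_t Y], is checked by exact enumeration of all sample paths.

```python
# Illustrative sketch (our own, not from the paper): density process
# Z_t = dP_t/dR_t for coin-flip measures, and an exact check of the
# Bayes formula E_P[Y] = E_R[Z_t Y] by enumerating all paths.
from itertools import product

p_bias, r_bias, t = 0.7, 0.5, 3   # P: bias-0.7 coin law; R: fair-coin reference; horizon t

def prob(seq, q):
    """Probability of a binary path under a coin with bias q."""
    out = 1.0
    for x in seq:
        out *= q if x == 1 else 1.0 - q
    return out

def Z(seq):
    """Density process at time t on this path: dP_t/dR_t."""
    return prob(seq, p_bias) / prob(seq, r_bias)

def Y(seq):
    """A nonnegative F_t-measurable payoff (here: number of heads)."""
    return float(sum(seq))

paths = list(product([0, 1], repeat=t))
lhs = sum(prob(s, p_bias) * Y(s) for s in paths)          # E_P[Y]
rhs = sum(prob(s, r_bias) * Z(s) * Y(s) for s in paths)   # E_R[Z_t Y]
assert abs(lhs - rhs) < 1e-12

# Z is an R-martingale started at 1, so E_R[Z_t] = 1:
assert abs(sum(prob(s, r_bias) * Z(s) for s in paths) - 1.0) < 1e-12
```

Here the choice of payoff Y and the biases are arbitrary; the identity holds for any nonnegative F_t-measurable Y because the measures are mutually absolutely continuous at each finite horizon.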
The answer is that such a condition would be far too restrictive, as we now illustrate. Let (X_t)_{t ∈ N} be a sequence of random variables. For each η ∈ R, let P^η be the distribution under which the X_t are i.i.d. Gaussian with mean η and unit variance. By the strong law of large numbers, P^η assigns probability one to the event A_η := {lim_{t→∞} t^{−1} Σ_{s=1}^t X_s = η}. Moreover, the events A_η are mutually disjoint: A_η ∩ A_ν = ∅ whenever η ≠ ν. Therefore, by definition, the measures {P^η}_{η ∈ R} are all mutually singular. Since there is an uncountable number of them, there cannot exist a measure R such that P^η ≪ R for all η ∈ R. On the other hand, if P^η_t denotes the law of the partial sequence X_1, ..., X_t, then the measures {P^η_t}_{η ∈ R} are all mutually absolutely continuous. In particular, we could (for instance) use R = P^0 as reference measure and obtain P^η_t ≪ R_t for all η ∈ R, that is, P^η ≪_loc R.

A.2 Essential supremum and infimum
We briefly review the notions of essential supremum and infimum. For more information, as well as proofs of the results below, we refer to Section A.5 in [8].

On some probability space, consider a collection {Y_α}_{α ∈ A} of random variables, where A is an arbitrary index set. If A is uncountable, the pointwise supremum sup_{α ∈ A} Y_α might not be measurable (not a random variable). Alternatively, it might happen that Y_α = 0 almost surely for every α ∈ A, but sup_{α ∈ A} Y_α = 1. For this reason, the pointwise supremum is often not useful. Instead, one can use the essential supremum.

Proposition 38.
There exists a [−∞, ∞]-valued random variable Y, called the essential supremum and denoted by ess sup_{α ∈ A} Y_α, such that
1. Y ≥ Y_α almost surely, for every α ∈ A;
2. if Y′ is a random variable that satisfies Y′ ≥ Y_α almost surely for every α ∈ A, then Y′ ≥ Y almost surely.
The essential supremum is almost surely unique.

In words, the essential supremum is the smallest almost sure upper bound on {Y_α}_{α ∈ A}. The proposition guarantees that it always exists. In some cases, more can be said: the essential supremum can be obtained as the limit of an increasing sequence.

Proposition 39.
Suppose {Y_α} is closed under maxima, meaning that for any α, β ∈ A there is some γ ∈ A such that Y_γ = Y_α ∨ Y_β. Then there is a sequence {α_n}_{n ∈ N} such that {Y_{α_n}}_{n ∈ N} is an increasing sequence and ess sup_{α ∈ A} Y_α = lim_{n→∞} Y_{α_n} almost surely.

One can also define the essential infimum by setting
ess inf_{α ∈ A} Y_α := −ess sup_{α ∈ A} (−Y_α).
This is the largest almost sure lower bound on {Y_α}_{α ∈ A}. It satisfies properties analogous to those in the propositions above.

A.3 On the choice of filtration
In the paper, we assume that the filtration (F_t) in use is by default the canonical filtration F_t := σ(U, X_1, ..., X_t). However, there are examples of hypothesis tests for H_0 : Q ∈ Q where the only Q-NMs with respect to (F_t) are almost surely constant. For the purpose of designing more powerful tests, it may make sense to coarsen the filtration.

As a first example, consider the problem of testing whether a sequence is exchangeable: H_0 : X_1, X_2, ... form an exchangeable sequence. Vovk [22] demonstrates that all martingales with respect to (F_t) (under the null) are constant, and hence all derived tests are powerless to reject the null. Nevertheless, Vovk demonstrates that one can derive interesting and nontrivial 'conformal' martingales (M_t) with respect to the restricted filtration G_t := σ(M_1, ..., M_t) ⊂ F_t, which do indeed have power to reject the null (for appropriate deviations from the null). In short, coarsening the filtration is a design tool that could aid in the construction of more powerful sequential tests, p-values, and e-values.

In the following example, we show how the choice of including the external randomization U in F_0 also helps design better p-values. (However, it is not always possible to randomize atoms, as Example 43 illustrates.)

Example 40.
Assume that Q = {Q}, where under Q we have that X_1 is Bernoulli(1/2) and X_2 = X_3 = ... = 0. Consider the canonical filtration (G_t), so that G_∞ = σ(X_1). Then (M_t) with M_0 = 1 and M_t = 2X_1 for all t ∈ N is a Q-NM, and the corresponding p-value (inf_{s ≤ t} 1/M_s) is admissible. (Indeed, any p-value (p_t) has to satisfy Q(p_∞ ≤ α) ∈ {0, 1/2, 1} for each α ∈ [0, 1).) However, by expanding the filtration using external randomization, one can easily derive a strictly smaller p-value (p′_t) such that p′_∞ is uniform. In other words, the original p-value is only admissible under the filtration generated solely by the observations, but is inadmissible under an expanded filtration that includes a randomization device (which is the filtration (F_t) in this paper).

The main idea of the following randomization device is somewhat folklore, but we find the following succinct lemma useful.
Lemma 41 (Randomization device). If Y is a random variable with distribution function F and U is an independent uniformly distributed random variable, then
Y′ := U F(Y) + (1 − U) F(Y−)
is uniformly distributed and satisfies Y′ ≤ F(Y).

Proof. Fix a ∈ [0, 1] and define y := inf{x ∈ R : F(x) ≥ a}. Note that
Pr(Y′ ≤ a | Y) = Pr( U ≤ (a − F(y−)) / (F(y) − F(y−)) | Y ) 1_{Y = y} + Pr( U F(Y) + (1 − U) F(Y−) ≤ a | Y ) 1_{Y ≠ y}.
(If Pr(Y = y) = 0, the first term should be understood as zero.) On {Y > y} we have a < F(Y−), so that the second term equals zero. On {Y < y} we have F(Y) ≤ a, so that the second term instead equals one. Since also U is uniform and independent of Y, we get
Pr(Y′ ≤ a | Y) = ((a − F(y−)) / (F(y) − F(y−))) 1_{Y = y} + 1_{Y < y},
and hence, taking expectations and using Pr(Y = y) = F(y) − F(y−) and Pr(Y < y) = F(y−),
Pr(Y′ ≤ a) = (a − F(y−)) + F(y−) = a,
showing that Y′ is uniformly distributed. Finally, it is clear from the definition of Y′ that Y′ ≤ F(Y).

B Omitted proofs

Proof of Lemma 1. It is clear that (ii) ⇒ (iii). The implication (i) ⇒ (ii) follows from
A_T = (⋃_{t ∈ N} (A_t ∩ {T = t})) ∪ (A_∞ ∩ {T = ∞}) ⊆ ⋃_{t ∈ N} A_t.
For (iii) ⇒ (i), take τ := inf{t ∈ N : A_t occurs}, so that A_τ = ⋃_{t ∈ N} A_t.

Proof of Lemma 3. First, (i) implies (ii) since N_T ≤ sup_{t ∈ N} N_t, hence E[N_T] ≤ E[sup_{t ∈ N} N_t] ≤ 1, for all random times T. Conversely, for any ε > 0 there exists some random time T such that N_T ≥ sup_{t ∈ N} N_t − ε. Thus if (ii) holds, then E[sup_{t ∈ N} N_t] ≤ E[N_T] + ε ≤ 1 + ε. Since ε > 0 was arbitrary, we find that (ii) implies (i). It is clear that (ii) implies (iii).

The fact that (iii) implies (iv) is however not obvious, and is essentially a consequence of a result by Shafer et al.
[20, Theorem 3], as also noted recently by Kirichenko and Grünwald [13, Lemma 5.1]. First, we note that if (N_t) satisfies (iii), then p_t := 1 ∧ inf_{s ≤ t} 1/N_s is a p-value (see also Proposition 10(1)). In particular, p_∞ := inf_{t ∈ N} p_t stochastically dominates a uniform random variable. Therefore, for any nonnegative, nonincreasing function f(u) such that ∫_0^1 f(u) du = 1, we have E[f(p_∞)] ≤ ∫_0^1 f(u) du = 1 (see also the proof of Proposition 12). The function f(u) := g(1/u), with g as in the lemma, satisfies this condition. Consequently, E[g(1 ∨ sup_{s ∈ N} N_s)] = E[f(p_∞)] ≤ 1, as required.

Proof of Proposition 4. For each t ∈ N, define a probability measure P′_t on the Borel sets of R^t by P′_t(A) := E_Q[M_t 1_A]. Because (M_t) is a martingale under Q, the sequence (P′_t)_{t ∈ N} forms a consistent system of finite-dimensional distributions. Therefore, by Kolmogorov's extension theorem, there exists a single probability measure P on the Borel sets of Ω = R^N whose projection onto R^t is exactly P′_t for each t ∈ N. Put differently, P satisfies P_t = P′_t for all t ∈ N, as desired.

Proof of Proposition 7. We prove the statement for p-values; the same argument holds for e-values and sequential tests. The proof is based on transfinite induction. Fix some p-value (p_t). For all countable ordinals β, we now recursively define p-values (p^β_t) as follows. For β = 1, we set (p^β_t) := (p_t). For any successor ordinal γ := β + 1, if (p^β_t) is Q-admissible we set p^γ_t := p^β_t, and otherwise we let (p^γ_t) be any p-value that strictly dominates (p^β_t). For any limit ordinal γ := lim_{n→∞} β_n, we define (p^γ_t) := (lim_{n→∞} p^{β_n}_t).

Let us now use the induction assumption that (p^β_t) is Q-valid for all β < γ, for this limit ordinal γ.
Since (lim_{n→∞} p^{β_n}_t) is a decreasing limit, we have for every ε > 0, α ∈ [0, 1], and Q ∈ Q that
Q(inf_{t ∈ N} p^γ_t ≤ α) ≤ Q(lim_{n→∞} inf_{t ∈ N} p^{β_n}_t < α + ε) = lim_{n→∞} Q(inf_{t ∈ N} p^{β_n}_t < α + ε) ≤ α + ε.
It follows that (p^γ_t) is Q-valid. By transfinite induction, this holds for all countable ordinals β.

Writing R for the reference probability measure, {E_R[Σ_{t ∈ N} 2^{−t} p^β_t]}_β defines a decreasing [0, 1]-valued transfinite sequence. This sequence must eventually become stationary; that is, it becomes constant for all β beyond some countable ordinal β_0. Thus p^β_t = p^{β_0}_t for all β ≥ β_0 and all t ∈ N. By construction, (p^{β_0}_t) must then be admissible and dominate (p_t). This shows that any p-value for Q can be dominated by a Q-admissible p-value.

Let us also remark that in the case of Q being a singleton, the statement for e-values and sequential tests could be proved in a more constructive manner, as in Subsections 6.2 and 6.3.

Proof of Proposition 9. We prove the three statements in order. Let (ψ_t) denote the constructed binary sequence, which we will now show is a (Q, α)-sequential test. Let τ denote an arbitrary stopping time, potentially infinite, and fix Q ∈ Q.
(1) Q(ψ_τ = 1) = Q(p_τ ≤ α) ≤ α, since (p_t) is Q-valid.
(2) Q(ψ_τ = 1) = Q(e_τ ≥ 1/α) ≤ α E_Q[e_τ] ≤ α, where we used Markov's inequality and the fact that (e_t) is Q-safe. In short, e-values satisfy Ville's inequality.
(3) Q(ψ_τ = 1) = Q(φ(Q) ∩ C_τ = ∅) ≤ Q(φ(Q) ∉ C_τ) ≤ α, where the first inequality follows because the event {φ(Q) ∩ C_τ = ∅} implies that C_τ does not contain φ(Q), which is improbable under Q.
The fact that the (Q, α)-sequential tests in (1) and (2) are nested is obvious. This completes the proof.

Proof of Proposition 10. We prove the three statements in order.
Let (p_t) denote the constructed sequence of random variables, which we will now show in each case is a p-value. Let τ denote an arbitrary stopping time, potentially infinite, and fix Q ∈ Q.
(1) Q(1/e_τ ≤ α) = Q(e_τ ≥ 1/α) ≤ E_Q[e_τ] · α ≤ α, where we used Markov's inequality and the fact that (e_t) is Q-safe. Since a p-value remains valid after taking the running infimum, we obtain that (p_t) is valid.
(2) Q(p_τ > α) = Q(ψ_τ(α) = 0) ≥ 1 − α, where the equality follows since the sequential tests are nested, and the inequality because (ψ_t(α)) is a (Q, α)-sequential test.
(3) Q(p_τ > α) = Q(φ(Q) ∩ C_τ(α) ≠ ∅) ≥ Q(φ(Q) ∈ C_τ(α)) ≥ 1 − α, as in (2).
This completes the proof.

Proof of Proposition 11. Let (C_t) denote the constructed sequence of sets, which we will now show is a (φ, P, α)-confidence sequence. To this end, let τ denote an arbitrary stopping time, potentially infinite, and fix P ∈ P; note that P ∈ P_γ for some γ ∈ Z. Then we have P(φ(P) ∉ C_τ) = P(γ ∉ C_τ) = P(ψ^γ_τ = 1) ≤ α, since (ψ^γ_t) is a (P_γ, α)-sequential test. This completes the proof.

Proof of Proposition 12. Define e_t := f(p_t); we must show that (e_t) is Q-safe. If (p_t) is Q-valid, then for any stopping time τ and Q ∈ Q, the distribution of p_τ is stochastically larger than that of a uniform random variable (denoted V). Thus for any calibrator f, we have E_Q[e_τ] = E_Q[f(p_τ)] ≤ E[f(V)] = ∫_0^1 f(v) dv = 1. Since this result holds for any τ and Q ∈ Q, the result follows.

Proof of Proposition 13. To see (1), fix a probability measure Q_0 ∈ conv(Q). Then there exist Q_1, Q_2 ∈ Q and λ ∈ [0, 1] such that Q_0 = λQ_1 + (1 − λ)Q_2. Let now (e_t) denote an e-value for Q. Consider some stopping time τ and note that
E_{Q_0}[e_τ] = λ E_{Q_1}[e_τ] + (1 − λ) E_{Q_2}[e_τ] ≤ λ + 1 − λ = 1,
since (e_t) is Q-safe. This yields that (e_t) is also conv(Q)-safe.
The same argument also applies for valid p-values and sequential tests.

Next, (2) follows in a similar way. Assume that (e_t) is Q-admissible, and consider some conv(Q)-valid e-value (e′_t) that satisfies Q(e′_t ≥ e_t) = 1 for all t ∈ N and Q ∈ conv(Q), and suppose there exist some Q* ∈ conv(Q) and some t ∈ N such that Q*(e′_t > e_t) > 0. Since we can always write Q* = λQ_1 + (1 − λ)Q_2 for some Q_1, Q_2 ∈ Q and λ ∈ [0, 1], we also have Q_1(e′_t > e_t) > 0 or Q_2(e′_t > e_t) > 0, leading to a contradiction. Again, the same argument also applies for admissible p-values and sequential tests.

Proof of Lemma 31. Fix some α ∈ (0, 1], let τ denote the first time t that M_t ≥ 1/α, and let
q := Q(τ < ∞) = Q(sup_{t ∈ N} M_t ≥ 1/α).
Next, (14) yields Q(M_∞ = 0) = 1, for example by [16, Theorem 4.2]. Note that the stopped process M^τ is a uniformly integrable martingale, yielding E_Q[M^τ_∞] = 1. On the event {τ = ∞}, we have M^τ_∞ = 0. With M_{−1} := 1, Y_0 := 0, and F_{−1} := {∅, Ω}, these observations then yield
E_Q[M^τ_∞] = Σ_{t ∈ N} E_Q[M_t 1_{τ = t}] = Σ_{t ∈ N} E_Q[M_{t−1} Y_t 1_{τ = t}] = Σ_{t ∈ N} E_Q[ E_Q[Y_t | F_{t−1}, Y_t ≥ 1/(α M_{t−1})] M_{t−1} 1_{τ = t} ] ≤ ((1 + ε)/α) Σ_{t ∈ N} E_Q[1_{τ = t}] = q (1 + ε)/α.
This then gives q ≥ α/(1 + ε), yielding the claim.

Proof of Proposition 33. First, assume (M_t) is an S-NM and fix a time t. Since (Y_t) is adapted, Y_t is a function of U, X_1, ..., X_t. Hence we may write Y_t = f_t(X_t) for some nonnegative predictable function f_t(·); more explicitly, Y_t = f_t(U, X_1, ..., X_{t−1}; X_t). Now pick any real numbers x_1, ..., x_t. Consider the two-point measures μ_s := (δ_{−x_s} + δ_{x_s})/2 for all s ≤ t, and let P := U × ∏_{s ∈ N} μ_{s ∧ t} be the distribution that makes the data independent with X_s ∼ μ_{s ∧ t}. Then P ∈ S. Moreover,
E_P[Y_t | F_{t−1}] = (1/2)( f_t(U, X_1, ..., X_{t−1}; x_t) + f_t(U, X_1, ..., X_{t−1}; −x_t) ).
Since the event {X_i = x_i, i = 1, ..., t−1} has positive probability, we get
(1/2)( f_t(U, x_1, ..., x_{t−1}; x_t) + f_t(U, x_1, ..., x_{t−1}; −x_t) ) = 1.
But the numbers x_1, ..., x_t were arbitrary, so it follows that the function x ↦ f_t(U, x_1, ..., x_{t−1}; x) − 1 is odd for all x_1, ..., x_{t−1}.

For the reverse direction, fix some P ∈ S and some t ∈ N. Then
E_P[Y_t | F_{t−1}] = E_P[f_t(X_t) | F_{t−1}] =(i) (1/2)( E_P[f_t(X_t) | F_{t−1}] + E_P[f_t(−X_t) | F_{t−1}] ) = 1 + (1/2)( E_P[f_t(X_t) − 1 | F_{t−1}] + E_P[f_t(−X_t) − 1 | F_{t−1}] ) =(ii) 1,
where equality (i) follows by symmetry of P, and equality (ii) follows because f_t − 1 is odd.

Proof of Proposition 35. This follows from an application of Corollary 32. Fix some ε > 0. Since f is continuous at zero and f(0) = 1, there exists some η > 0 such that f(x) ≤ 1 + ε for all x ∈ (−η, η). This implies (15). Moreover, since f is strictly monotone at zero, we may assume that f(η) ≠ 1. Consider now the measure μ_η := (δ_{−η} + δ_η)/2, and note that Q_η := U × μ_η^∞ ∈ S. Hence Q_η(f(X_t) ≤ 1 + ε) = 1 for all t ∈ N. Moreover,
Q_η( Σ_{t ∈ N} (f(X_t) − 1)² = ∞ ) = Q_η( Σ_{t ∈ N} (f(η) − 1)² = ∞ ) = 1,
since f − 1 is odd and f(η) ≠ 1. This shows that (14) holds. Hence Corollary 32 can indeed be applied, and the statement follows.

C Auxiliary examples

The following example shows that Proposition 13 cannot be extended to confidence sequences.

Example 42 (Confidence sequences do not mesh with convex closures). Let μ_+ (respectively, μ_−) denote the law of a Gaussian random variable with unit variance and mean 1 (respectively, −1). Moreover, let P = {μ_+^∞, μ_−^∞} be the family of i.i.d. laws of such distributions. Consider φ_mean, which satisfies φ_mean(μ_−^∞) = −1 and φ_mean(μ_+^∞) = 1. Then (C_t) given by C_t = {−1, +1} is a (trivial) (φ_mean, P, α)-valid confidence sequence for α ∈ [0, 1].
Now consider the measure P_0 = (μ_+^∞ + μ_−^∞)/2 ∈ conv(P), which satisfies φ_mean(P_0) = 0 ∉ C_t. It is clear that (C_t) is not a (φ_mean, conv(P), α)-valid confidence sequence for any α ∈ [0, 1).

The next example also elaborates further on the discussion in Subsection 8.3 by providing an admissible p-value that has an atomic limiting distribution.

Example 43 (Atomic admissible p-values exist even in the presence of an independent F_0-measurable random device). Consider Q under which (X_t) are i.i.d. uniformly distributed. Then an adapted process (p_t) with the following properties can be constructed:
• p_t is supported on {1/2 + k/2^{t+1}}_{k=1,...,2^t};
• Q(p_t = 1/2 + 1/2^{t+1}) = 1/2 + 1/2^{t+1} and Q(p_t = 1/2 + k/2^{t+1}) = 1/2^{t+1} for all k = 2, ..., 2^t;
• (p_t) is a Q-MM with Q( p_{t+1} − 1/2 ∈ {(2k − 1)/2^{t+2}, k/2^{t+1}} | p_t − 1/2 = k/2^{t+1} ) = 1 for all k = 1, ..., 2^t and t ∈ N;
• (p_t) is independent of U.
Note that p_∞ := inf_{t ∈ N} p_t satisfies Q(p_∞ ≤ α) = α 1_{α ≥ 1/2} ≤ α for all α ∈ [0, 1]; in particular, (p_t) is an anytime p-value and its limit p_∞ has an atom at 1/2.

We claim that (p_t) is Q-admissible. Indeed, assume there exists an anytime p-value (p′_t) that dominates (p_t) (we explicitly allow (p′_t) to depend on the randomization device U). Then there exists some t ∈ N such that Q(p′_t < p_t) > 0. Let us first assume that Q(p′_t < 1/2 + 1/2^{t+1}) > 0. In combination with the fact that (p_t) is a Q-MM, there exists some n > t + 2 such that
Q( {p′_t ≤ 1/2 + 1/2^{t+1}} ∩ {p_∞ ≥ 1/2 + 1/2^{t+2}} ) > 0.
Since p′_∞ ≤ p_∞ and since Q(p_∞ ≥ 1/2 + 1/2^{t+2}) = 1/2 − 1/2^{t+2}, we hence obtain Q(p′_∞ ≥ 1/2 + 1/2^{t+2}) < 1/2 − 1/2^{t+2}, a contradiction to the fact that (p′_t) is an anytime p-value. We obtain similar contradictions if we assume Q({p′_t < p_t} ∩ {p_t = 1/2 + k/2^{t+1}}) > 0 for some k = 2, ..., 2^t.
This shows that (p_t) is indeed Q-admissible, despite having an atom and being independent of the randomization device.

The next example illustrates how anti-concentration bounds can be satisfied by NMs that lead to inadmissible p-values.

Example 44 (A p-value for S that satisfies Proposition 35 need not be admissible). Fix the function f : x ↦ ((1 + x) ∧ 2)₊. This function satisfies the criteria of Proposition 35. Hence the process M_t := ∏_{s ≤ t} f(X_s) is an S-martingale, and we also have (17). Consider the p-value (p_t) given by p_t := inf_{s ≤ t} 1/M_s. This is a p-value by (17).

Define next p′_t := p_t for t = 0, 1, and p′_t := p_t − (1/4) 1_{X_1 ∨ X_2 ≤ −1} for t ≥ 2. Clearly we have Q(p′_t ≤ p_t) = 1 for all t ∈ N and Q ∈ S, and there exists Q* ∈ S such that Q*(p′_2 < p_2) > 0. We now claim that (p′_t) is also S-valid; we will prove this claim below. This assertion then yields that (17) is not sufficient for the admissibility of the corresponding p-value in general.

We now prove the claim that (p′_t) is S-valid. To do so, we consider a subset S̃ ⊂ S, namely those measures Q ∈ S that satisfy Q(X_1 ≤ −1) ∈ {0, 1/2}. Note that S = conv(S̃), the convex hull of S̃. Thanks to Proposition 13(1), it suffices to argue that (p′_t) is S̃-valid. To this end, note that on the event {X_1 > −1} we have p_t = p′_t for all t ∈ N. On the other hand, on the event {X_1 ≤ −1} we have p_t = 1 for all t ∈ N. Fix now some α ∈ (0, 1) and Q ∈ S̃. Without loss of generality we can assume that Q(X_1 ≤ −1) = 1/2; otherwise there is nothing to be argued. We now need to show that
Q(p′_∞ ≤ α) ≤ α.    (20)
To make headway, note that
{p′_∞ ≤ α} = ({p′_∞ ≤ α} ∩ {X_1 > −1}) ∪ ({p′_∞ ≤ α} ∩ {X_1 ≤ −1}) ⊂ {p_∞ ≤ α} ∪ {p_∞ − (1/4) 1_{X_1 ∨ X_2 ≤ −1} ≤ α} = {p_∞ ≤ α} ∪ {(1/4) 1_{X_1 ∨ X_2 ≤ −1} ≥ 1 − α}.
Thus, if α < 3/4 then {p′_∞ ≤ α} ⊂ {p_∞ ≤ α} and we have (20). Let us now assume that α ≥ 3/4.
Note that Q(p_∞ = 1) ≥ Q(X_1 ≤ −1) = 1/2, and hence Q(p_∞ ≤ α) ≤ 1/2. This then yields
Q(p′_∞ ≤ α) ≤ Q(p_∞ ≤ α) + Q(X_1 ∨ X_2 ≤ −1) ≤ 1/2 + 1/4 = 3/4 ≤ α,
yielding the S̃-validity of (p′_t), hence also the S-validity of (p′_t).
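To accompany Example 44, the following sketch (our own illustration, not part of the paper) numerically checks that f(x) = ((1 + x) ∧ 2)₊ satisfies the oddness criterion of Proposition 33 (f − 1 odd, equivalently f(x) + f(−x) = 2, with f(0) = 1), and computes the wealth process M_t = ∏_{s ≤ t} f(X_s) together with the running-infimum p-value p_t = 1 ∧ inf_{s ≤ t} 1/M_s on fixed data sequences.

```python
# Illustrative sketch for Example 44 (our own, not from the paper).

def f(x):
    """f(x) = ((1 + x) ∧ 2)_+ from Example 44."""
    return max(0.0, min(1.0 + x, 2.0))

# Oddness criterion of Proposition 33: f - 1 is odd, i.e. f(x) + f(-x) = 2.
assert f(0.0) == 1.0
for k in range(-300, 301):
    x = k / 100.0
    assert abs(f(x) + f(-x) - 2.0) < 1e-12

def p_value_path(xs):
    """Running-infimum p-value p_t = 1 ∧ inf_{s<=t} 1/M_s, M_t = prod_{s<=t} f(X_s)."""
    m, p, ps = 1.0, 1.0, []
    for x in xs:
        m *= f(x)
        if m > 0:
            p = min(p, 1.0 / m)  # once m hits 0, 1/m = +inf never lowers the infimum
        ps.append(p)
    return ps

# Positive observations grow the wealth and shrink the p-value; an observation
# x <= -1 gives f(x) = 0, so the wealth hits 0 and the p-value stays at 1.
assert p_value_path([-1.5, 0.5]) == [1.0, 1.0]
```

This also makes the mechanism behind the modified p-value (p′_t) in Example 44 visible: on {X_1 ≤ −1} the wealth is annihilated, so the unmodified p-value is frozen at 1 and there is slack that (p′_t) exploits.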