Testing exchangeability: fork-convexity, supermartingales, and e-processes
How can one test if a binary sequence is exchangeable? Fork-convex hulls, supermartingales, and Snell envelopes
Aaditya Ramdas, Johannes Ruf, Martin Larsson, Wouter M. Koolen
Departments of Statistics and ML, Carnegie Mellon University; Department of Mathematics, London School of Economics; Department of Mathematics, Carnegie Mellon University; Machine Learning Group, CWI Amsterdam
{aramdas,martinl}@[email protected], [email protected]
February 2, 2021
Abstract
Suppose we observe an infinite sequence of coin flips $X_1, X_2, \dots$, and wish to sequentially test the null that these binary random variables are exchangeable against Markovian alternatives. We utilize a geometric concept called "fork-convexity" (an adapted analog of convexity) that lies at the heart of this problem, and relate it to other concepts like Snell envelopes that are absent in the sequential testing literature. By demonstrating that the alternative lies within the fork-convex hull of the null, we prove that any nonnegative supermartingale under the exchangeable null is necessarily also a supermartingale under the alternative, and thus yields a powerless test. We then combine ideas from universal inference (maximum likelihood under the null) and the method of mixtures (Jeffreys' prior over the alternative) to derive a nonnegative process that is upper bounded by a martingale, but is not itself a supermartingale. We show that this process yields safe e-values, which in turn yield sequential level-$\alpha$ tests that are consistent (power one), using regret bounds from universal coding to demonstrate their rate-optimal power. We present ways to extend these results to any finite alphabet and to Markovian alternatives of any order using a "double mixture" approach. We also discuss their power against change point alternatives, and give general approaches based on betting for unstructured or ill-specified alternatives.
Keywords: anytime-valid sequential inference; betting; calibrator; composite Snell envelope; de Finetti mixing; fork-convexity; Jeffreys' prior; method of mixtures; nonnegative supermartingale; optional stopping; regret bound; safe e-value; testing exchangeability; universal coding.
1 Introduction

Suppose we observe a sequence of binary coin flips $X_1, X_2, \dots$. Consider the problem of testing if our data $(X_t)_{t \geq 1}$ is either an exchangeable sequence, or an i.i.d. Bernoulli sequence, and if not, then to stop collecting data and reject the null as soon as possible. Let the null set $\mathcal{Q}_0$ consist of all product distributions $\mu^\infty$, where $\mu = \mathrm{Ber}(p)$ for some $p \in [0,1]$. Let $(\mathcal{F}_t)_{t \geq 0}$ represent the canonical filtration, where $\mathcal{F}_0$ is the trivial sigma algebra and $\mathcal{F}_t = \sigma(X_1, \dots, X_t)$. All martingale statements in this paper will implicitly refer to this canonical filtration. Furthermore, let $\mathcal{T}$ be the set of all stopping times (potentially infinite) with respect to $(\mathcal{F}_t)$. A level-$\alpha$ sequential test for this problem is any stopping time $\tau_\alpha \in \mathcal{T}$ such that
$$\sup_{Q \in \mathcal{Q}_0} Q(\tau_\alpha < \infty) \leq \alpha, \qquad (1)$$
meaning that with probability $1 - \alpha$, we never stop under the null. Following recent literature on sequential testing [6, 12], we introduce the related notion of a $\mathcal{Q}_0$-safe e-value, which is a nonnegative sequence of adapted random variables $(E_t)_{t \geq 1}$ such that
$$\sup_{Q \in \mathcal{Q}_0} \sup_{\tau \in \mathcal{T}} \mathbb{E}_Q[E_\tau] \leq 1.$$
Above, we interpret $E_\infty := \limsup_{t \to \infty} E_t$ for potentially infinite stopping times. Large e-values encode evidence against the null, and it is easy to check that the stopping time
$$\kappa_\alpha := \inf\left\{ t \geq 1 : E_t \geq \frac{1}{\alpha} \right\} \qquad (2)$$
results in a level-$\alpha$ sequential test by Markov's inequality. More details can be found in Ramdas et al. [12], who also show that the sequence $(p_t)$ defined by $p_t := \inf_{s \leq t} 1/E_s$ is an anytime-valid p-value, meaning:
$$\sup_{Q \in \mathcal{Q}_0} \sup_{\tau \in \mathcal{T}} Q(p_\tau \leq \alpha) \leq \alpha \qquad (3)$$
for all $\alpha \in [0,1]$. Since such $\mathcal{Q}_0$-safe e-values result in both sequential tests and anytime-valid p-values, we focus on constructing e-values for the rest of this paper. As a matter of convention, we always use $\kappa_\alpha$ to denote the above stopping time, that is, the one that thresholds a safe e-value at level $1/\alpha$, while $\tau$ denotes a generic stopping time.

Figure 1: Various classes of distributions over infinite binary sequences encountered in this paper: i.i.d. $\mathrm{Ber}(p)$ sequences ($\mathcal{Q}_0$); exchangeable laws, the convex hull of $\mathcal{Q}_0$ ($\mathcal{Q}$); first-order Markov laws ($\mathcal{P}_1$) through $k$-th order Markov laws ($\mathcal{P}_k$); and all binary laws, the closed fork-convex hull of $\mathcal{Q}_0$ ($\widetilde{\mathcal{Q}}$).

Let $\mathcal{P}$ represent our alternative class, where each $P \in \mathcal{P}$ represents a first-order Markov process with parameters $p_{1|0}$ and $p_{1|1}$. Here we abbreviate $p_{0|0} = 1 - p_{1|0}$ and $p_{0|1} = 1 - p_{1|1}$. For simplicity, assume that the first outcome is equally likely to be zero or one. We further assume that the Markov chain is non-absorbing, meaning that $p_{1|0} > 0$ and $p_{0|1} > 0$. Otherwise, with probability half, when we start at the absorbing state, we may only see a sequence of ones (or a sequence of zeros), which is indistinguishable from a Bernoulli model despite being Markovian. We call an e-value powerful if the corresponding test is powerful, and of course we desire a test that is consistent, meaning that its power goes to one with the sample size. Formally, a level-$\alpha$ test $\tau_\alpha$ has asymptotic power one against $\mathcal{P}$ if
$$\inf_{P \in \mathcal{P} \setminus \mathcal{Q}_0} P(\tau_\alpha < \infty) = 1.$$
(We explicitly exclude $\mathcal{Q}_0$ from $\mathcal{P}$ above because, for example, $p_{1|0} = p_{1|1} = p$ recovers an i.i.d. $\mathrm{Ber}(p)$ sequence, and so $\mathcal{Q}_0 \subset \mathcal{P}$ as stated. Henceforth, it will always be understood that we desire power against
$\mathcal{P} \setminus \mathcal{Q}_0$.) Similarly, a $\mathcal{Q}_0$-safe e-value $(E_t)$ is said to be consistent, or power one, if
$$\text{for all } \alpha \in (0,1), \quad \inf_{P \in \mathcal{P} \setminus \mathcal{Q}_0} P(\kappa_\alpha < \infty) = 1, \qquad \text{(4a)}$$
or equivalently if
$$E_t \to \infty, \ P\text{-almost surely, for every } P \in \mathcal{P} \setminus \mathcal{Q}_0. \qquad \text{(4b)}$$
Let $\mathcal{Q}$ represent the set of all exchangeable distributions over infinite binary sequences. Note that $\mathcal{Q}_0$ is a rich composite class of parametric distributions, whose convex hull is $\mathcal{Q}$ (de Finetti's theorem). Thus, a sequential test for the i.i.d. setting is also valid under the weaker condition of exchangeability, a fact that we record below, proved in Ramdas et al. [12].

Proposition 1.
The properties of type-1 error control (1) and safety are closed under the convex hull, meaning that any $\mathcal{Q}_0$-safe e-value is also $\mathcal{Q}$-safe, and any level-$\alpha$ sequential test for $\mathcal{Q}_0$ is also valid for $\mathcal{Q}$.

As a consequence, we may restrict our attention to developing a $\mathcal{Q}_0$-safe e-value, and invoke the above fact to step from the i.i.d. setting to the exchangeable setting; this will be our approach in the rest of this paper. Another consequence, orthogonal to the scope of this paper, is that testing the null $\mathcal{Q}_0$ against the alternative $\mathcal{Q}$ is futile; safe and consistent e-values do not exist, and neither do valid, power-one tests.
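To make the correspondence between e-values, sequential tests, and anytime-valid p-values concrete, here is a minimal Python sketch (our own naming, not code from the paper) of the thresholding rule (2) and the running p-value of (3); it assumes the e-value trajectory has already been computed by some other routine.

```python
import numpy as np

def sequential_test(e_values, alpha=0.05):
    """Turn an e-value trajectory (E_1, E_2, ...) into the level-alpha
    sequential test (2) and the anytime-valid p-values of (3)."""
    e = np.asarray(e_values, dtype=float)
    p_values = np.minimum.accumulate(1.0 / np.maximum(e, 1e-300))  # p_t = inf_{s<=t} 1/E_s
    crossings = np.nonzero(e >= 1.0 / alpha)[0]
    kappa = int(crossings[0]) + 1 if crossings.size else None      # 1-indexed rejection time, None = never stop
    return kappa, p_values
```

By Markov's inequality, thresholding at $1/\alpha$ as above controls the type-1 error in (1) uniformly over stopping times.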
Remark 2. To avoid confusion, we note that the convex combination of $Q, Q' \in \mathcal{Q}_0$ must be carefully interpreted. For example, if $Q = \mathrm{Ber}(0.3)^\infty$ and $Q' = \mathrm{Ber}(0.7)^\infty$ then a draw from $(Q + Q')/2$ produces either a sequence with 70% zeros or with 70% ones, each with probability half, and produces a sequence with equal number of zeros and ones with probability zero. Contrast this with the fact that a draw from $(\mathrm{Ber}(0.3) + \mathrm{Ber}(0.7))/2$ is equally likely to produce a zero or a one. In other words, one must take care to differentiate between $((\mathrm{Ber}(0.3) + \mathrm{Ber}(0.7))/2)^\infty$, which is not in the convex hull of $Q$ and $Q'$, and $(Q + Q')/2$, which is. Later, we will see that the former lies in the closed fork-convex hull of $Q$ and $Q'$.

Note that it is impossible to have a power one test for $\mathcal{Q}$ against $\mathcal{Q}^c$, since the alternative class is too rich and consists of too many distributions that are too close to $\mathcal{Q}$, meaning that there are too many ways to violate exchangeability. For example, it should be apparent to the reader that if the first coin has bias $p_1$ and every other coin has bias $p_2 \neq p_1$, then the resulting sequence is not exchangeable but we would never be able to reliably detect this deviation. This example relies on ensuring that the information required to detect a deviation from the null is exhausted early on in the sequence. To avoid such pathologies it is necessary to restrict the alternative class in some meaningful way. Markovian alternatives are an attractive choice, balancing the needs of relevant practical motivation, tractable mathematical structure, succinct probabilistic description, and intuitive aesthetic appeal. We focus on the setting of a first-order Markov alternative, and briefly return to address higher order alternatives later.

Let the likelihood under a particular $Q \in \mathcal{Q}_0$, where $Q = \mathrm{Ber}(p)^\infty$, be represented by
$$Q_t \equiv Q_p(X_1, \dots, X_t) := (1-p)^{n_0} p^{n_1},$$
where $n_0 = n_0(t)$ and $n_1 = n_1(t)$ represent the number of zeros and ones seen up to time $t$. The likelihood associated to $P \in \mathcal{P}$ is given by
$$P_t \equiv P_{p_{1|0}, p_{1|1}}(X_1, \dots, X_t) := \frac{1}{2} \prod_{s=2}^t p_{X_s | X_{s-1}} = \frac{1}{2}\, p_{0|0}^{n_{0|0}}\, p_{1|0}^{n_{1|0}}\, p_{0|1}^{n_{0|1}}\, p_{1|1}^{n_{1|1}},$$
where $n_{1|0} = n_{1|0}(t)$ is the total number of ones following zeros up to time $t$, etc.

Naturally, for a point null $Q \in \mathcal{Q}_0$ and point alternative $P \in \mathcal{P}$, Wald's sequential likelihood ratio test (SLRT) [21] yields a power-one test. The likelihood ratio process, i.e., $(P_t/Q_t)$, is a $Q$-martingale starting at one and thus a $Q$-safe e-value, and the resulting test that thresholds at level $1/\alpha$ has optimal power against $P$. For composite nulls and alternatives, the SLRT cannot be directly applied. The mixture SLRT integrates over the alternatives using a "prior" distribution (or, more appropriately, mixture distribution, to avoid any Bayesian interpretations of our frequentist statements), but this only works for composite alternatives, since mixing over the null set does not yield a safe e-value or the desired type-1 error control property in (1). (We note here that, interestingly, the GROW e-values of [6] are ratios of mixtures, though they are safe only for a fixed sample size.) The generalized SLRT maximizes the likelihood under both null and alternative, but this also does not yield a martingale or a safe e-value. In both cases, it is difficult to find a threshold for the resulting process that achieves type-1 error control in (1), since the SLRT's choice of $1/\alpha$ does not suffice.

Despite the above apparent difficulties in generalizing the SLRT to yield an e-value, it has been recently established that nonnegative (super)martingales play a fundamental role in the design of admissible sequential tests (and the construction of admissible safe e-values), even for composite nulls [12]. In anticipation of the results to follow, it is useful to set up some relevant notation. In what follows, a process $(M_t)_{t \geq 0}$ will be called a $Q$-NM if it is a nonnegative martingale with initial value one, that is, $(M_t)$ is adapted to $(\mathcal{F}_t)$, $M_0 = 1$ and $\mathbb{E}_Q[M_t \mid \mathcal{F}_s] = M_s \geq 0$ for any $s \leq t$.
Such processes are called test martingales by Shafer et al. [15]. If $(M_t)$ is a $Q$-NM simultaneously for every $Q \in \mathcal{Q}_0$, then we will call it a $\mathcal{Q}_0$-NM. If the equality above is replaced by an inequality $\leq$, then we will call it a $Q$-NSM or $\mathcal{Q}_0$-NSM (nonnegative supermartingale). An appropriate variant of the optional stopping theorem implies that for any $\mathcal{Q}_0$-NSM $(M_t)$ and any stopping time $\tau$ (potentially infinite), the stopped process has expectation at most one, or in other words
$$\sup_{Q \in \mathcal{Q}_0} \sup_{\tau \in \mathcal{T}} \mathbb{E}_Q[M_\tau] \leq 1.$$
Indeed, it is well known that for any $\mathcal{Q}_0$-NSM, $M_\infty := \lim_{t \to \infty} M_t$ is a well defined random variable, and $\sup_{Q \in \mathcal{Q}_0} \mathbb{E}_Q[M_\infty] \leq 1$. The correspondence with the definition of a safe e-value is not coincidental: to construct a $\mathcal{Q}_0$-safe e-value, it suffices to construct a $\mathcal{Q}_0$-NSM. However, we claim the following.

Proposition 3.
Every $\mathcal{Q}_0$-NSM is also a $\mathcal{P}$-NSM (recall that $\mathcal{Q}_0$ and $\mathcal{P}$ contain all i.i.d., respectively first-order Markov, distributions). In other words, any $\mathcal{Q}_0$-safe e-value with nontrivial power cannot be a $\mathcal{Q}_0$-NSM, since the latter is powerless against $\mathcal{P}$ by virtue of being a $\mathcal{P}$-NSM.

This paper is as much about understanding the above negative result as about providing a positive result. In other words, this result probes at the "gap" between a $\mathcal{Q}_0$-safe e-value and a $\mathcal{Q}_0$-NSM. The former is a much weaker property than the latter. While the latter suffices for the former, it is by no means necessary, as recently observed in a more abstract setup [12]. Indeed, a $\mathcal{Q}_0$-NSM $(N_t)$ satisfies the much stronger "conditional" property that
$$\mathbb{E}_Q[N_\tau \mid \mathcal{F}_s] \leq N_{\tau \wedge s} \quad \text{for every } Q \in \mathcal{Q}_0 \text{ and stopping time } \tau \in \mathcal{T},$$
which implies the earlier mentioned optional stopping result. Indeed, the above property is satisfied if and only if $(N_t)$ is a $\mathcal{Q}_0$-NSM, but the earlier properties can be satisfied even by processes that are upper bounded by NSMs, but are not themselves NSMs. It is exactly this gap that we will exploit. We provide a geometrical characterization of the above phenomenon: essentially, we will show that the above proposition is true because $\mathcal{P}$ lies within the "fork-convex hull" of $\mathcal{Q}_0$, and we prove that this hull preserves the NSM property of a process. Thus, a $\mathcal{Q}_0$-NSM yields a powerless test against $\mathcal{P}$ since it is automatically and unintentionally safe under the alternative as well as the null. Along the way, we will encounter other friends from martingale theory, such as the Snell envelope, which previously has not played a prominent role in the mathematical treatment of sequential testing.

1.1 A $\mathcal{Q}_0$-safe e-value that is not a $\mathcal{Q}_0$-NSM

Ramdas et al. [12] show that any $\mathcal{Q}_0$-safe e-value is dominated by a $\mathcal{Q}_0$-safe e-value of the form $E_t := \inf_{Q \in \mathcal{Q}_0} M^Q_t$, where $(M^Q_t)$ is a $Q$-NM. As mentioned before, each limiting variable $M^Q_\infty$ is well defined and has expectation at most one, and thus, $E_\infty := \limsup_{t \to \infty} E_t$ also has expectation at most one. The safety property immediately holds at infinite times as well, meaning that $\sup_{Q \in \mathcal{Q}_0} \mathbb{E}_Q[E_\infty] \leq 1$ and $\sup_{\tau \in \mathcal{T}} \sup_{Q \in \mathcal{Q}_0} \mathbb{E}_Q[E_\tau] \leq 1$.

To avoid too much suspense before we get into these subtle new concepts, we first present our solution immediately, and delay its derivation to the next section. To this end, define
$$R_t := \frac{\Gamma(n_{0|0}+0.5)\,\Gamma(n_{1|0}+0.5)\,\Gamma(n_{0|1}+0.5)\,\Gamma(n_{1|1}+0.5)}{2\pi^2\,\Gamma(n_{0|0}+n_{1|0}+1)\,\Gamma(n_{0|1}+n_{1|1}+1)} \bigg/ \left(\left(\frac{n_0}{t}\right)^{n_0}\left(\frac{n_1}{t}\right)^{n_1}\right), \qquad (5)$$
where $\Gamma$ denotes the usual gamma function.

Theorem 4. The process $(R_t)$ is a $\mathcal{Q}_0$-safe e-value, and thus thresholding it at level $1/\alpha$ yields a level-$\alpha$ sequential test $\kappa_\alpha$. Furthermore, this test has power one, i.e., (4a) holds.

Above, $(R_t)$ is not itself a $\mathcal{Q}_0$-NSM, but is nevertheless upper bounded by a (different!) $Q$-NM for every $Q \in \mathcal{Q}_0$, resulting in it being a $\mathcal{Q}_0$-safe e-value. This idea is enabled by bringing together the method of mixtures (using Jeffreys' prior) for combining the composite alternative, with the maximum likelihood under the composite null. Beyond showing that it has power one, one can quantify that it has rate-optimal power by utilizing a regret bound from universal coding. The next section essentially proves the above theorem, after which we turn to defining fork-convexity. We end with a discussion about this paper's approach compared to other possible approaches to the problem.
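Here is a minimal sketch (the function name is ours, not the paper's) of how one might evaluate the e-process $(R_t)$ of (5) in log-space along an observed binary sequence. It assumes the constant $1/(2\pi^2)$ displayed in (5), which arises from the Beta(1/2,1/2) normalization of each Jeffreys' mixture and the uniform first symbol, as derived in the next section.

```python
import numpy as np
from scipy.special import gammaln

def log_R(x):
    """Evaluate log R_t from (5) along a binary sequence x: Jeffreys' mixture
    over the first-order Markov alternative, divided by the maximum likelihood
    under the i.i.d. Bernoulli null."""
    x = np.asarray(x, dtype=int)
    trans = np.zeros((2, 2))            # trans[a, b] = #{transitions a -> b}
    sym = np.zeros(2)                   # counts of zeros and ones
    out = []
    for t, xt in enumerate(x, start=1):
        sym[xt] += 1
        if t >= 2:
            trans[x[t - 2], xt] += 1
        # numerator: 1/2 for the first symbol, then per-state Jeffreys mixtures,
        # each contributing Gamma(n_{0|a}+1/2)Gamma(n_{1|a}+1/2)/(pi*Gamma(n_{.|a}+1))
        log_num = (gammaln(trans + 0.5).sum()
                   - gammaln(trans.sum(axis=1) + 1.0).sum()
                   - np.log(2.0) - 2.0 * np.log(np.pi))
        # denominator: maximum Bernoulli likelihood (n0/t)^n0 (n1/t)^n1, with 0^0 = 1
        log_den = sum(c * np.log(c / t) for c in sym if c > 0)
        out.append(log_num - log_den)
    return np.array(out)
```

At $t=1$ this gives $R_1 = 1/2$, and under the null the process drifts slowly downward, consistent with the regret discussion later in the paper.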
2 Jeffreys' mixture meets maximum likelihood

As briefly mentioned earlier, if the null set was a singleton, say corresponding to $\mu = \mathrm{Ber}(p)$, and the alternative was also a singleton, such as when $(X_t)$ deterministically alternates between 0 and 1, then Wald's sequential likelihood ratio test [21] would immediately yield a solution to the problem at hand. To elaborate, let $f_p$ denote the probability mass function of a $\mathrm{Ber}(p)$ random variable and let $L$ denote the likelihood function under the alternative:
$$L(X_1, \dots, X_t) := \begin{cases} 1 & \text{if } (X_1, X_2, \dots) = (0, 1, 0, 1, \dots); \\ 0 & \text{otherwise.} \end{cases}$$
Then, for any point null (indexed by $p$) define the following likelihood ratio:
$$R^p_t := \frac{L(X_1, \dots, X_t)}{\prod_{s=1}^t f_p(X_s)},$$
which equals
$$\begin{cases} p^{-\lfloor t/2 \rfloor}(1-p)^{-\lceil t/2 \rceil} & \text{with } \mu^\infty\text{-probability } p^{\lfloor t/2 \rfloor}(1-p)^{\lceil t/2 \rceil}; \\ 0 & \text{with } \mu^\infty\text{-probability } 1 - p^{\lfloor t/2 \rfloor}(1-p)^{\lceil t/2 \rceil}. \end{cases}$$
It is easy to check that $(R^p_t)$ is a $\mathrm{Ber}(p)$-NM, and thus a $\mathrm{Ber}(p)$-safe e-value, and $\kappa_\alpha$ from (2) yields a valid level-$\alpha$ sequential test that coincides with Wald's original proposal [21]. The question is how to generalize this approach to deal with a composite null and a composite alternative in a computationally tractable and statistically powerful manner.

The following observation deals the first blow: the only process that is a nonnegative martingale under every i.i.d. Bernoulli sequence is one that is almost surely constant. In other words, the only $\mathcal{Q}_0$-NM is such that $M_t = 1$ for all $t \in \mathbb{N}$. This obviously results in a powerless test. So we then turn our attention to constructing a $\mathcal{Q}_0$-NSM, or a test supermartingale. Unfortunately this approach is dealt a fatal blow by Proposition 3. As alluded to in the intro, we cannot employ mixtures in both numerator and denominator because it violates safety by lowering the denominator too much, and we cannot maximize the likelihood in the numerator and denominator because it violates safety by raising the numerator too much. Our proposal combines a suitably chosen mixture in the numerator with maximum likelihood in the denominator, thus avoiding both pitfalls.

To start, let us return to the point alternative described above (alternating 0 and 1), and just handle the composite null using maximum likelihood estimation, as proposed in universal inference [23]. To elaborate, observe that
$$R^{\mathrm{ML}}_t := \inf_{p \in [0,1]} R^p_t = \frac{\text{likelihood under the point alternative}}{\text{maximum likelihood under the null}}$$
is a $\mathcal{Q}_0$-safe e-value. Indeed, suppose the data truly comes from $\mathrm{Ber}(p^*)$ for an unknown $p^*$. Then, it is obvious that $R^{\mathrm{ML}}_t \leq R^{p^*}_t$, where the latter process is a $\mathrm{Ber}(p^*)$-NM. Thus, for any $Q \in \mathcal{Q}_0$ (corresponding to some $p^* \in [0,1]$) and any stopping time $\tau \in \mathcal{T}$, we have
$$\mathbb{E}_Q[R^{\mathrm{ML}}_\tau] \leq \mathbb{E}_Q[R^{p^*}_\tau] \leq 1,$$
where the last inequality uses optional stopping for the $\mathrm{Ber}(p^*)$-NM $(R^{p^*}_t)$. To see that the resulting test has good power, note that under the alternative, $p = 1/2$ uniquely achieves the above infimum at any even time $t \in \mathbb{N}$, in which case the denominator equals $(1/2)^t$ and the numerator equals one. Thus, $R^{\mathrm{ML}}_t = 2^t$ at even times, and the test $\{\sup_{s \leq t} R^{\mathrm{ML}}_s \geq 1/\alpha\}$ (which equals zero until time $\kappa_\alpha$ in (2) and then equals one) is a valid level-$\alpha$ sequential test that stops either at time $\lceil \log(1/\alpha)/\log(2) \rceil$ or at time $\lceil \log(1/\alpha)/\log(2) \rceil + 1$.

To recap, despite the fact that we cannot find a nonconstant $\mathcal{Q}_0$-NM, $(R^{\mathrm{ML}}_t)$ is a powerful $\mathcal{Q}_0$-safe e-value against the considered point alternative.
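As an illustration (our code, not the paper's), the universal-inference e-process above can be computed in a few lines; the helper R_ML below assumes the alternating alternative that starts with a zero.

```python
import numpy as np

def R_ML(x):
    """R^ML_t for the alternating point alternative: indicator that x so far
    alternates 0,1,0,1,..., divided by the maximum Bernoulli likelihood."""
    x = np.asarray(x, dtype=int)
    out = np.zeros(len(x))
    for t in range(1, len(x) + 1):
        if not np.array_equal(x[:t], np.arange(t) % 2):
            return out                     # once the alternative likelihood hits zero it stays zero
        n1 = x[:t].sum(); n0 = t - n1
        ml = (n0 / t) ** n0 * (n1 / t) ** n1   # with the convention 0^0 = 1
        out[t - 1] = 1.0 / ml
    return out

# e.g. R_ML([0, 1, 0, 1]) -> [1., 4., 6.75, 16.]: equals 2^t at even times t
```

This matches the stopping-time calculation above: the process crosses $1/\alpha$ after roughly $\log_2(1/\alpha)$ observations of the alternating pattern.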
This test takes the ratio of the likelihood under the alternative to the maximum likelihood under the null. Next, we detail how to handle the composite alternative $\mathcal{P}$ when testing a point null in $\mathcal{Q}_0$.

Taking independent Jeffreys' priors (with densities $w(\theta) = 1/(\pi\sqrt{\theta(1-\theta)})$) for $p_{1|0}$ and $p_{1|1}$, we obtain the mixture likelihood
$$P^{w\times w}(X_1,\dots,X_t) := \int P_{p_{1|0},p_{1|1}}(X_1,\dots,X_t)\, w(p_{1|0})\, w(p_{1|1})\, \mathrm{d}(p_{1|0},p_{1|1}) = \frac{\Gamma(n_{0|0}+0.5)\,\Gamma(n_{1|0}+0.5)\,\Gamma(n_{0|1}+0.5)\,\Gamma(n_{1|1}+0.5)}{2\pi^2\,\Gamma(n_{0|0}+n_{1|0}+1)\,\Gamma(n_{0|1}+n_{1|1}+1)}.$$
Thus, for any point null represented by $\mathrm{Ber}(p)$, we can define the mixture likelihood ratio
$$R^{\mathrm{JP}}_t := \frac{P^{w\times w}(X_1,\dots,X_t)}{\prod_{s=1}^t f_p(X_s)} = \frac{\text{Jeffreys' mixture over the alternative}}{\text{likelihood under the point null}}.$$
Using Fubini's theorem to swap integrals, it is easy to check that $R^{\mathrm{JP}}_t$ is a $\mathrm{Ber}(p)$-NM, and the corresponding sequential test is Wald's usual mixture SLRT [22]. Note that it is the very particular form of this mixture that yields a closed form expression and thus a computationally feasible test. However, we do not use this mixture just for computational reasons; as we detail soon, combining it with the earlier maximum likelihood idea also yields statistically near-optimal power.

Using, as in the previous example, that the likelihood under the null is maximised at $p = n_1/t$, where it evaluates to $(n_0/t)^{n_0}(n_1/t)^{n_1}$, we find that
$$R_t := \frac{\text{Jeffreys' mixture over the alternative}}{\text{maximum likelihood under the null}}$$
reduces to the expression in (5). This is a $\mathcal{Q}_0$-safe e-value by combining the arguments used for the safety of $(R^{\mathrm{JP}}_t)$ and $(R^{\mathrm{ML}}_t)$: swapping the maximum likelihood with the (unknown) true likelihood, and then employing Fubini's theorem.

We remark that any prior above would have yielded a $\mathcal{Q}_0$-safe e-value, and in fact any Beta prior would have yielded one in closed form, but the Jeffreys' prior above allows us to invoke an appropriate optimal regret bound from the universal coding literature [11] for Markov sources (see [17] for a discussion of the resulting optimality):
$$R_t \geq \frac{\left(\frac{n_{0|0}}{n_{0|0}+n_{1|0}}\right)^{n_{0|0}}\left(\frac{n_{1|0}}{n_{0|0}+n_{1|0}}\right)^{n_{1|0}}\left(\frac{n_{0|1}}{n_{0|1}+n_{1|1}}\right)^{n_{0|1}}\left(\frac{n_{1|1}}{n_{0|1}+n_{1|1}}\right)^{n_{1|1}}}{\left(\frac{n_0}{t}\right)^{n_0}\left(\frac{n_1}{t}\right)^{n_1}} \times e^{-\frac{1}{2}\log(n_{0|0}+n_{1|0}) - \frac{1}{2}\log(n_{0|1}+n_{1|1}) - O(1)} = \frac{\text{maximum likelihood of Markov model}}{\text{maximum likelihood of Bernoulli model}} \times e^{-\log t - O(1)}.$$
That is, $R_t$ starts gathering evidence against the null if the maximum likelihood for the first-order Markov chain outperforms the maximum likelihood for the Bernoulli model by a factor of order $t$. Note that this is a small hurdle to overcome, as the first term grows exponentially fast in $t$ when the data are explained better by a Markov model, as argued next.

Theorem 5.
Under any first-order non-absorbing Markov alternative whose transition probabilities $p_{0|0}, p_{1|0}, p_{0|1}, p_{1|1}$ satisfy $p_{1|0} \neq p_{1|1}$ (and thus also $p_{0|0} \neq p_{0|1}$), we have $R_t \to \infty$ almost surely.

The condition on the transition probabilities means that the Markov chain does not reduce to an i.i.d. Bernoulli sequence. Recall that our definition of $\mathcal{P}$ disallows absorbing states. The latter condition is necessary for a power one test, because if (say) 1 is an absorbing state then there is positive probability of seeing only ones (recall that the Markov chain starts at 0 or 1 with equal probability). This is indistinguishable from a realization of an i.i.d. $\mathrm{Ber}(1)$ sequence.

Proof of Theorem 5.
To simplify notation we define
$$\hat{p}_{1|0} = \frac{n_{1|0}}{n_{0|0}+n_{1|0}}, \qquad \hat{p}_{1|1} = \frac{n_{1|1}}{n_{0|1}+n_{1|1}}, \qquad \hat{p}_1 = \frac{n_1}{t},$$
as well as
$$\hat{q}_{0|0} = \frac{n_{0|0}}{n_0}, \qquad \hat{q}_{0|1} = \frac{n_{0|1}}{n_0}, \qquad \hat{q}_{1|0} = \frac{n_{1|0}}{n_1}, \qquad \hat{q}_{1|1} = \frac{n_{1|1}}{n_1}.$$
These quantities all depend on $t$, although this is suppressed in the notation. Since $p_{1|0} > 0$ and $p_{0|1} > 0$ by assumption, the Markov chain is recurrent and hence the ergodic theorem applies. Then, as $t$ tends to infinity we have $\hat{p}_{1|0} \to p_{1|0}$, $\hat{p}_{1|1} \to p_{1|1}$, and $\hat{p}_1 \to p_1$, where $p_1$ is the asymptotic frequency of ones. An expression for $p_1$ can be obtained by noting that, by definition,
$$p_{1|0} = \frac{P(X_{t+1}=1, X_t=0)}{P(X_t=0)}, \qquad p_{1|1} = \frac{P(X_{t+1}=1, X_t=1)}{P(X_t=1)}.$$
Hence $p_{1|0} P(X_t=0) + p_{1|1} P(X_t=1) = P(X_{t+1}=1)$. Taking the asymptotic time average of this identity yields the equation $p_{1|0}(1-p_1) + p_{1|1} p_1 = p_1$, which can be re-arranged to
$$p_1 = \frac{p_{1|0}}{p_{0|1} + p_{1|0}}. \qquad (6)$$
Next, since $(n_{0|0}+n_{1|0})/n_0 \to 1$ we have
$$\hat{q}_{0|0} = \frac{n_{0|0}}{n_{0|0}+n_{1|0}} \times \frac{n_{0|0}+n_{1|0}}{n_0} = \hat{p}_{0|0} \times \frac{n_{0|0}+n_{1|0}}{n_0} \to p_{0|0},$$
and then, because $\hat{q}_{0|0} + \hat{q}_{0|1} = (n_{0|0}+n_{0|1})/n_0 \to 1$, we get $\hat{q}_{0|1} \to 1 - p_{0|0} = p_{1|0}$. In a similar manner, we get $\hat{q}_{1|1} \to p_{1|1}$ and $\hat{q}_{1|0} \to 1 - p_{1|1} = p_{0|1}$. (The flip from $\hat{q}_{1|0}$ to $p_{0|1}$ is intentional; note also that $n_{0|1} = n_{1|0} \pm 1$, and $n_{0|1}+n_{1|1} \in \{n_1 - 1, n_1\}$, so that $\hat{p}_{0|1} \approx \hat{q}_{1|0}$, and the limiting $q$ matrix is the transpose of the $p$ matrix.)

Let $\ell(t)$ denote the logarithm of the ratio between the maximum likelihood of the Markov model and the maximum likelihood of the Bernoulli model. Using the above notation, this can be written as
$$\ell(t) = n_1\left(\log\frac{1}{\hat{p}_1} - \hat{q}_{1|0}\log\frac{1}{\hat{p}_{1|0}} - \hat{q}_{1|1}\log\frac{1}{\hat{p}_{1|1}}\right) + n_0\left(\log\frac{1}{1-\hat{p}_1} - \hat{q}_{0|0}\log\frac{1}{\hat{p}_{0|0}} - \hat{q}_{0|1}\log\frac{1}{\hat{p}_{0|1}}\right).$$
The first parenthesized expression converges, as $t \to \infty$, to
$$\log\frac{1}{p_1} - p_{0|1}\log\frac{1}{p_{1|0}} - p_{1|1}\log\frac{1}{p_{1|1}},$$
which by Jensen's inequality is greater than or equal to
$$\log\frac{1}{p_1} - \log\left(\frac{p_{0|1}}{p_{1|0}} + \frac{p_{1|1}}{p_{1|1}}\right) = \log\frac{1}{p_1} - \log\left(\frac{p_{0|1}+p_{1|0}}{p_{1|0}}\right) = 0,$$
using (6) in the last step. Since $p_{1|0} \neq p_{1|1}$ by assumption (otherwise the data would be i.i.d. Bernoulli), Jensen's inequality is actually strict. A similar argument applied to the second parenthesized expression shows that it also converges to a strictly positive number. Therefore, there is a small constant $\varepsilon > 0$ such that for all sufficiently large $t$ we have $\ell(t) \geq (n_0 + n_1)\varepsilon = \varepsilon t$. Thus, for sufficiently large $t$, we have $R_t \geq \exp(\varepsilon t - \ln t - O(1)) \to \infty$ almost surely.

Extensions to Markov sources of order $k > 1$ or alphabet sizes $d > 2$ are immediate. We may treat each $k$-th order context $x \in \{1, \dots, d\}^k$ as an independent $d$-ary prediction problem, and by mixing with independent Jeffreys' priors (which are $\mathrm{Dirichlet}(1/2, \dots, 1/2)$ priors), or equivalently, composing independent Krichevsky-Trofimov estimators, we obtain a computationally attractive e-value with regret bounded by $(d^k(d-1)/2)\ln t + O(1)$. In other words, we get a closed-form e-value $R^{k,d}_t$ (whose details are tedious, despite being explicit, and thus omitted) such that
$$R^{k,d}_t \geq \frac{\text{maximum likelihood of order-}k\text{ Markov model}}{\text{maximum likelihood of Bernoulli model}} \cdot \exp\left(-\frac{d^k(d-1)}{2}\ln t - O(1)\right).$$
The (near-)optimality of this approach is discussed in Takeuchi et al. [17]. The e-value $R_t$ from (5) can be interpreted as $R^{1,2}_t$.

Further computationally attractive extensions include alternatives that consist of Markov sources of varying orders $k = 1, 2, \dots$ (see the discussion on the mixture method for unions below). The even more general Context Tree models let the length of the context that should be taken into account depend on that very context [26].

A similar calculation to the $k=1$, $d=2$ case done previously shows that $R^{k,d}_t \to \infty$, $P$-almost surely, for any alternative $P \in \mathcal{P}_k \setminus \mathcal{Q}_0$, where $\mathcal{P}_k$ is the set of Markovian distributions with order at most $k$. A sketch of this per-context construction is given below.
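The following sketch (our naming; not code from the paper) spells out the per-context Krichevsky-Trofimov construction just described, for a general order $k$ and alphabet size $d$, again in log-space; the first $k$ symbols are simply given probability $1/d$ each, mirroring the uniform start used in the $k=1$, $d=2$ case.

```python
import numpy as np
from scipy.special import gammaln

def log_R_kd(x, k=1, d=2):
    """Order-k, d-ary e-process: independent Dirichlet(1/2,...,1/2) (KT)
    mixtures, one per length-k context, divided by the maximum likelihood of
    the i.i.d. (memoryless) model."""
    x = list(x)
    ctx_counts = {}                      # context tuple -> length-d count vector
    sym_counts = np.zeros(d)
    out = []
    for t in range(1, len(x) + 1):
        sym_counts[x[t - 1]] += 1
        if t > k:
            ctx = tuple(x[t - 1 - k:t - 1])
            ctx_counts.setdefault(ctx, np.zeros(d))[x[t - 1]] += 1
        # numerator: initial symbols + product of per-context KT marginal likelihoods
        log_num = -min(t, k) * np.log(d)
        for c in ctx_counts.values():
            log_num += (gammaln(c + 0.5).sum() - d * gammaln(0.5)
                        + gammaln(0.5 * d) - gammaln(c.sum() + 0.5 * d))
        # denominator: maximum likelihood of the memoryless model
        log_den = sum(m * np.log(m / t) for m in sym_counts if m > 0)
        out.append(log_num - log_den)
    return np.array(out)
```

For $k=1$, $d=2$ this reduces to the log_R sketch given earlier, i.e. to $R^{1,2}_t = R_t$ from (5).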
This leads naturally to the following remark.

Remark 6. Let $\mathcal{P}_1, \mathcal{P}_2, \mathcal{P}_3, \dots$ be a countable sequence of alternatives, that may or may not be nested. Suppose for every $k \in \mathbb{N}$ one can design a safe e-value $(E^k_t)$ for testing $\mathcal{Q}_0$ against $\mathcal{P}_k$ such that it has power one, meaning that for any $P \in \mathcal{P}_k \setminus \mathcal{Q}_0$, we have $E^k_t \to \infty$, $P$-almost surely. Then, one can design a safe e-value for $\mathcal{Q}_0$ against $\bigcup_{k \in \mathbb{N}} \mathcal{P}_k$ such that for any $P \in \bigcup_{k \in \mathbb{N}} \mathcal{P}_k \setminus \mathcal{Q}_0$, we have $E_t \to \infty$, $P$-almost surely.

The proof of the above claim is simple. We can, for example, define the "double mixture"
$$E_t := \sum_{k=1}^\infty \pi_k E^k_t,$$
which is a countable mixture over the base e-values (that were already mixed using Jeffreys' prior). It is clear that $(E_t)$ is a safe e-value under $\mathcal{Q}_0$, by linearity of expectation. To analyze its power, once an alternative $P$ has been picked, let $\mathcal{P}_{k^*}$ be the first element of the sequence that contains $P$. Since $E^{k^*}_t \to \infty$, $P$-almost surely, the same property holds for $\pi_{k^*} E^{k^*}_t$, and thus transfers to $E_t$ since e-values are nonnegative. The computational challenge of calculating $E_t$ remains, but this can be avoided by instead calculating the $\mathcal{Q}_0$-safe e-value
$$\widetilde{E}_t := \sum_{k=1}^t \pi_k E^k_t.$$
At the (finite) time $k^*$, $\widetilde{E}_t$ begins to include the required term $E^{k^*}_t$, and thus inherits its property of approaching infinity almost surely (consistency). Replacing the sum $\sum_{k=1}^t$ by $\sum_{k=1}^{f(t)}$ for any increasing function $f$ that grows to infinity, possibly with sublinear growth (such as $\log(\cdot)$), can further save computation without losing the consistency property. A small sketch of this truncated double mixture appears below.
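Here is a minimal sketch (our naming) of the truncated double mixture $\widetilde{E}_t$; it assumes base e-value trajectories such as those produced by log_R_kd above, and uses the weights $\pi_k = 1/(k(k+1))$ as one arbitrary valid choice (any nonnegative weights summing to at most one preserve safety).

```python
import numpy as np

def truncated_double_mixture(base_log_e, weights=None):
    """base_log_e[k-1][t-1] = log E^k_t for k = 1..K. Returns log of the
    truncated double mixture tilde{E}_t = sum_{k <= min(t,K)} pi_k E^k_t."""
    K = len(base_log_e)
    T = len(base_log_e[0])
    if weights is None:
        weights = np.array([1.0 / (k * (k + 1)) for k in range(1, K + 1)])  # sums to < 1
    log_mix = np.full(T, -np.inf)
    for t in range(1, T + 1):
        kmax = min(t, K)
        terms = [np.log(weights[k]) + base_log_e[k][t - 1] for k in range(kmax)]
        log_mix[t - 1] = np.logaddexp.reduce(terms)
    return log_mix
```

Because each term is a nonnegative safe e-value, the mixture inherits safety by linearity of expectation, exactly as argued in Remark 6.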
(Of course, such regret bounds would be available for specific updaterules and specific alternatives, and the online learning literature is rapidly expanding the scope and typesof available regret bounds for individual sequence prediction.)As a final remark, this non-anticipating likelihood is closely related to the “predictable-mixture”approach recently explored by [24], and has its roots in Wald [22, Eq. 10:10]. In this vein, it is also closelyrelated to testing hypotheses by betting, as popularized by Shafer and Vovk [13, 14]; specifically ( g t ) canbe viewed as a sequence of bets on the following outcome. Q -Snell envelopes Forgetting for a moment some of the earlier claims made without proof, one of the main questions we seekto answer in this section is:When is a Q -safe e -value simply a Q -NM or Q -NSM in disguise? In other words, is any Q -safe e -value always improved (or recovered) by some Q -NM or Q -NSM?Such a question was also asked in the latest preprint on safe testing by Grünwald et al. [6]. The necessityand sufficiency results of Ramdas et al. [12] imply that the answer in the singleton Q = { Q } case is: always (via the Doob decomposition of the Snell envelope). The answer in the composite setting is: sometimes .We now qualify the ‘sometimes’ by delving into the rich probabilistic structure underlying safe e -values,examining its relationship to convex null sets, a concept called ‘fork-convexity’, and a process that we calla ‘composite’ Snell envelope, known from the mathematical theory of risk measures [3].Most of this section does not depend on our observations being binary, and we allow the data ( X t ) totake values in a more general space X . Some of the technical notions required below, such as local absolutecontinuity, likelihood ratio (or density) processes, and essential suprema, are reviewed in Appendix A.10 .1 A sequential analog of convexity We first introduce the concept of fork-convexity , which can be viewed as a sequential version of convexity.
3 Q-Snell envelopes

Forgetting for a moment some of the earlier claims made without proof, one of the main questions we seek to answer in this section is:

When is a $\mathcal{Q}$-safe e-value simply a $\mathcal{Q}$-NM or $\mathcal{Q}$-NSM in disguise? In other words, is any $\mathcal{Q}$-safe e-value always improved (or recovered) by some $\mathcal{Q}$-NM or $\mathcal{Q}$-NSM?

Such a question was also asked in the latest preprint on safe testing by Grünwald et al. [6]. The necessity and sufficiency results of Ramdas et al. [12] imply that the answer in the singleton $\mathcal{Q} = \{Q\}$ case is: always (via the Doob decomposition of the Snell envelope). The answer in the composite setting is: sometimes. We now qualify the 'sometimes' by delving into the rich probabilistic structure underlying safe e-values, examining its relationship to convex null sets, a concept called 'fork-convexity', and a process that we call a 'composite' Snell envelope, known from the mathematical theory of risk measures [3].

Most of this section does not depend on our observations being binary, and we allow the data $(X_t)$ to take values in a more general space $\mathcal{X}$. Some of the technical notions required below, such as local absolute continuity, likelihood ratio (or density) processes, and essential suprema, are reviewed in Appendix A.

3.1 A sequential analog of convexity

We first introduce the concept of fork-convexity, which can be viewed as a sequential version of convexity.

Definition 7.
Fix a reference measure $\mathsf{R}$ on the sequence space $\mathcal{X}^{\mathbb{N}}$.

1. A fork-convex combination of two locally dominated laws $Q, Q'$ with likelihood ratio processes $(Z_t), (Z'_t)$ is another law $Q''$ with likelihood ratio process
$$Z''_t = \begin{cases} Z_t, & t \leq s \\ h Z_t + (1-h)\, Z_s \dfrac{Z'_t}{Z'_s}, & t > s \end{cases} \qquad (8)$$
for some $s \in \mathbb{N}$ and some $\mathcal{F}_s$-measurable random variable $h$ in $[0,1]$ with $h = 1$ on $\{Z'_s = 0\}$. The latter condition ensures that $(Z''_t)$ is well-defined and an $\mathsf{R}$-martingale, as required for a likelihood ratio process.

2. A set $\mathcal{Q}$ of probability measures is called fork-convex if every fork-convex combination of elements of $\mathcal{Q}$ still belongs to $\mathcal{Q}$.

Fork-convexity was first introduced by Žitković [20]. It is closely related to a concept in the literature on risk measures called m-stability, due to Delbaen [3]. A similar notion called rectangularity was introduced by Epstein and Schneider [4] to describe intertemporal preferences with multiple priors. Rectangularity has since been used extensively in the operations research literature in connection with robust Markov decision processes; see e.g. [7, 25, 16].

Note that fork-convexity implies convexity. To see this, observe that any (usual) convex combination $a Q + (1-a) Q'$ is also a fork-convex combination; just take $s = 0$ and $h = a$ in (8) to get $Z''_t = a Z_t + (1-a) Z'_t$, which is the likelihood ratio process of $Q'' = a Q + (1-a) Q'$.

A set $\mathcal{Q} = \{Q\}$ that consists of a single law is clearly fork-convex. A set $\mathcal{Q} = \{Q_1, Q_2\}$ consisting of two distinct laws will not be fork-convex; it is not even convex. However, one can form its "fork-convex hull". A small numerical illustration of (8) follows; the general definition is given after it.
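To make (8) concrete, here is a small numerical sketch (ours, with an arbitrary choice of $s$ and $h$) that forms the fork-convex combination of the likelihood ratio processes of two Bernoulli product laws, taking the reference measure $\mathsf{R} = \mathrm{Ber}(1/2)^\infty$:

```python
import numpy as np

def lr_bernoulli(x, p):
    """Likelihood ratio process Z_t of Ber(p)^infinity w.r.t. R = Ber(1/2)^infinity."""
    x = np.asarray(x, dtype=int)
    return np.cumprod(np.where(x == 1, 2.0 * p, 2.0 * (1.0 - p)))

def fork_convex_combination(x, p, q, s, h):
    """Z''_t from (8): follow the Ber(p) law up to time s, then mix in the
    Ber(q) law with F_s-measurable weight 1-h. Here h is a constant for
    simplicity; any [0,1]-valued function of X_1..X_s would also be allowed."""
    Z, Zp = lr_bernoulli(x, p), lr_bernoulli(x, q)
    t = np.arange(1, len(x) + 1)
    Zs, Zps = (Z[s - 1], Zp[s - 1]) if s >= 1 else (1.0, 1.0)
    return np.where(t <= s, Z, h * Z + (1 - h) * Zs * Zp / Zps)

# After s = 3 the combined law continues the observed prefix with Ber(0.3)
# (weight h) or Ber(0.7) (weight 1-h); these biases are purely illustrative.
x = np.random.binomial(1, 0.5, size=10)
Zpp = fork_convex_combination(x, p=0.3, q=0.7, s=3, h=0.4)
```

For Bernoulli laws with parameters in $(0,1)$ the condition $h = 1$ on $\{Z'_s = 0\}$ is vacuous, since the likelihood ratio process never hits zero.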
Definition 8.

1. The intersection of all fork-convex sets that contain a given set $\mathcal{Q}$ is called the fork-convex hull of $\mathcal{Q}$. (Note that there is at least one fork-convex set containing $\mathcal{Q}$, namely the set of all laws.)

2. The closed fork-convex hull of $\mathcal{Q}$ is the closure of the fork-convex hull of $\mathcal{Q}$ with respect to $L^1(\mathsf{R})$ convergence of the likelihood ratio processes at each fixed time $t \in \mathbb{N}$, where we recall $\mathsf{R}$ is the assumed reference measure.

Just as for usual convex hulls, the fork-convex hull of $\mathcal{Q}$ consists of all finite fork-convex combinations of elements in $\mathcal{Q}$. Here a finite fork-convex combination of some distributions $Q_1, \dots, Q_n \in \mathcal{Q}$ is a distribution obtained by iteratively performing (8) a finite number of times on $Q_1, \dots, Q_n$, on their fork-convex combinations, on their fork-convex combinations, and so on. Closed fork-convex hulls play an important role in Theorem 11 below. Let us illustrate these concepts in a particular example.

Example 9.
Here X = R and the references measure R is the law under which the data is i.i.d. standardnormal (this choice is somewhat arbitrary; we could have chosen any other strictly positive density). Let Q , Q be the laws under which ( X t ) is i.i.d. with X t ∼ f (under Q ) and X t ∼ f (under Q ) for someprobability density functions f , f . The likelihood ratio processes of Q and Q are Z t = t (cid:89) s =1 f ϕ ( X s ); Z t = t (cid:89) s =1 f ϕ ( X s ) , where ϕ is the standard normal density. Given some s ∈ N and F s -measurable random variable h in [0 , such that h = 1 on { Z s = 0 } , the corresponding fork-convex combination of Q and Q is the law Q hose density process is Z t = s ∧ t (cid:89) i =1 f ϕ ( X i ) × (cid:32) h t (cid:89) i = s +1 f ϕ ( X i ) + (1 − h ) t (cid:89) i = s +1 f ϕ ( X i ) (cid:33) . For any Borel set A ⊂ R we have Q ( X s +1 ∈ A | F s ) = E R (cid:20) Z s +1 Z s A ( X s +1 ) (cid:12)(cid:12)(cid:12)(cid:12) F s (cid:21) . A brief calculation using the definition of ( Z t ) as well as the fact that the data is i.i.d. standard normalunder Q shows that the right-hand side of the last display evaluates to h (cid:82) A f ( x )d x + (1 − h ) (cid:82) A f ( x )d x .Hence the conditional density of X s +1 is dd x Q ( X s +1 ≤ x | F s ) = h ( X , . . . , X s ) f ( x ) + (1 − h ( X , . . . , X s )) f ( x ) , where we now explicitly indicate that h depends on X , . . . , X s . For s = 0 this simply means that X follows an unconditional mixture, d / (d x ) Q ( X ≤ x ) = hf ( x ) + (1 − h ) f ( x ) for some h ∈ [0 , . Repeatingthe above reasoning a finite number of times for different choices of s and h (and swapping Q and Q )produces the fork-convex hull (cid:101) Q of { Q , Q } . In summary, (cid:101) Q is the set of “finite adapted mixtures” of { Q , Q } . More precisely, (cid:101) Q consists of all probability measures Q such that the conditional density of X t is of the form dd x Q ( X t ≤ x |F t − ) = h t ( X , . . . , X t − ) f ( x ) + (1 − h t ( X , . . . , X t − )) f ( x ) , t ∈ N , (9) for some [0 , -valued functions h t ( x , . . . , x t − ) indexed by t ∈ N such that h t ( x , . . . , x t − ) equals zero(one) if f ( x i ) = 0 ( f ( x i ) = 0 ) for some i = 1 , . . . , t − . Since the fork-convex hull (cid:101) Q only contains finite fork-convex combinations, there is a finite time T (depending on the particular element Q ∈ (cid:101) Q ) beyondwhich the functions h t will either all be equal to zero, or all equal to one.The closed fork-convex hull, as defined earlier, consists of all Q of the form (9) without the restrictionthat h t eventually equals zero or one. To provide some intuition for the definitions as applied to the null Q considered in this paper, onecan imagine a more “algorithmic” process of producing distributions in the closed fork-convex hull. Firstpick any p ∈ [0 , and observe X ∼ Ber ( p ) . Then, after observing X , pick any p , and observe X ∼ Ber ( p ) . Continue this process indefinitely. Then, the ( p i ) sequence is predictable and the resultingbinary sequence has a law that is contained in the closed fork-convex hull of Q .It may be instructive to consider another simple example. For a fixed µ ∈ [0 , , define Q µ as theset of product distributions Q over infinite [0 , -valued sequences such that E Q [ X t |F t − ] = E Q [ X t ] = µ ,and define (cid:101) Q µ as the set of distributions Q (not necessarily of product form) over infinite [0 , -valuedsequences such that E Q [ X t |F t − ] = µ . 
Then Q µ is not fork-convex if µ ∈ (0 , but (cid:101) Q is, and the latter isthe closed fork-convex hull of the former. The problem of sequentially estimating µ in this setup has beenrecently studied by Waudby-Smith and Ramdas [24]. Consider a null set Q locally dominated by a reference measure R . We now establish the interesting factthat e-values based on Q -NSMs are powerless against any alternative in the closed fork-convex hull of Q . We state this formally in Theorem 11 below, but the underlying reason is contained in the followinglemma. Lemma 10. If ( L t ) is a supermartingale under two laws Q , Q (cid:48) , then ( L t ) is also a supermartingale underevery fork-convex combination Q (cid:48)(cid:48) of Q and Q (cid:48) . roof. Note that Q , Q (cid:48) are dominated by R := ( Q + Q (cid:48) ) / . Fix any s ∈ N and F s -measurable randomvariable h in [0 , , and let Q (cid:48)(cid:48) be the fork-convex combination of Q , Q (cid:48) given in (8). In compliance with thedefinition, we restrict h to satisfy h = 1 on { Z (cid:48) s = 0 } . Suppose ( L t ) is a supermartingale under Q and Q (cid:48) .Equivalently, ( Z t L t ) and ( Z (cid:48) t L t ) are supermartingales under R . Thus for t ∈ { , . . . , s } we have E R [ Z (cid:48)(cid:48) t L t | F t − ] = E R [ Z t L t | F t − ] ≤ Z t − L t − = Z (cid:48)(cid:48) t − L t − . For t ≥ s + 1 we have E R [ Z (cid:48)(cid:48) t L t | F t − ] = h E R [ Z t L t | F t − ] + (1 − h ) Z s E R (cid:20) Z (cid:48) t Z (cid:48) s L t (cid:12)(cid:12)(cid:12)(cid:12) F t − (cid:21) ≤ Z (cid:48)(cid:48) t − L t − . Thus ( Z (cid:48)(cid:48) t L t ) is an R -supermartingale, or equivalently, ( L t ) is a Q (cid:48)(cid:48) -supermartingale.The following theorem refers to the closed fork-convex hull of Q . This is the closure of the fork-convexhull of Q , understood in the sense of L ( R ) convergence of the likelihood ratio processes at each fixed time t ∈ N . Theorem 11.
Let $\widetilde{\mathcal{Q}}$ be the closed fork-convex hull of $\mathcal{Q}$. Then every $\mathcal{Q}$-NSM is in fact a $\widetilde{\mathcal{Q}}$-NSM. Thus a test based on a $\mathcal{Q}$-NSM is powerless against $\widetilde{\mathcal{Q}} \setminus \mathcal{Q}$.

Proof.
The fork-convex hull of Q consists of all finite fork-convex combinations of elements of Q . Therefore,thanks to Lemma 10, every Q -NSM remains an NSM under every law Q in the fork-convex hull of Q . Toextend this to the closure, pick any element Q ∈ (cid:101) Q . Then there is a sequence ( Q n ) in the fork-convexhull of Q such that Q n → Q . This means that Z nt → Z t in L ( R ) for all t ∈ N , where ( Z nt ) and ( Z t ) arethe likelihood ratio processes of Q n and Q , respectively. By passing to a subsequence, we may assumethat Z nt → Z t , R -almost surely, for all t ∈ N . Let ( L t ) be any Q -NSM and hence a Q n -NSM for all n .Equivalently, ( Z nt L t ) is an R -NSM for all n . By the R -supermartingale property and the conditional versionof Fatou’s lemma, we get E R [ Z t L t | F t − ] = E R (cid:104) lim n Z nt L t (cid:12)(cid:12)(cid:12) F t − (cid:105) ≤ lim inf n E R [ Z nt L t | F t − ] ≤ lim inf n Z nt − L t − = Z t − L t − . This completes the proof that every Q -NSM is in fact a (cid:101) Q -NSM.The first part of the above theorem asserts that the NSM property is preserved under taking closedfork-convex hulls, but note that this is not true for safe e -values in general. Indeed, ( E t ) being Q -safeimplies that it is conv ( Q ) -safe, but not necessarily (cid:101) Q -safe. For a single law Q ∈ Q and an e-value ( E t ) , the Q -Snell envelope is the smallest Q -NSM that dominates ( E t ) . It is natural to ask whether, in contrast to this pointwise construction, one can directly construct a“composite Q -Snell envelope”, i.e., a smallest Q -NSM that dominates ( E t ) .It turns out that the ability to define such a Q -Snell envelope of an e-value depends heavily on theproperty of fork-convexity. The following result states that if the null set Q is locally dominated andfork-convex, then a Q -Snell envelope of a given e-value ( E t ) exists, is safe, and improves upon ( E t ) . Theorem 12.
Let Q be locally dominated and fork-convex. Let ( E t ) be a Q -safe e-value. Then the process L t := ess sup Q ∈Q , τ ≥ t E Q [ E τ | F t ] , t ∈ N , where τ ranges over all finite stopping times, is the smallest Q -NSM that dominates ( E t ) and satisfies L ≤ . Hence, ( L t ) is the Q -Snell envelope of ( E t ) . In particular, by the optional stopping theorem, ( L t ) is a Q -safe e-value. roof. The proof is essentially a simplified version of an argument due to Delbaen [3, Theorem 11]. Thisresult is argued in continuous time and on a bounded time interval. For the convenience of the reader, weprovide a self-contained proof for this paper’s discrete-time, infinite-horizon setup. We use properties ofthe essential supremum reviewed in Appendix A.2, in particular Proposition 16.For each fixed s ∈ N , L s is defined as the essential supremum of the family consisting of all E Q [ E τ | F s ] ,indexed by all pairs ( Q , τ ) with Q ∈ Q and τ ≥ s a finite stopping time. We claim that this family is closedunder maxima. To prove this claim, let ( Q , τ ) and ( Q (cid:48) , τ (cid:48) ) be given. Let A = { E Q [ E τ | F s ] ≥ E Q (cid:48) [ E τ (cid:48) | F s ] } and set τ (cid:48)(cid:48) = τ A + τ (cid:48) A c and Z (cid:48)(cid:48) t = Z t , t ≤ s A Z t + A c Z s Z (cid:48) t Z (cid:48) s , t > s where ( Z t ) and ( Z (cid:48) t ) are the likelihood ratio processes of Q and Q (cid:48) , respectively. Note that Z (cid:48) s > on A c sothat Z (cid:48)(cid:48) t is well-defined. Since A belongs to F s and τ, τ (cid:48) ≥ s , τ (cid:48)(cid:48) is a (finite) stopping time. Moreover,since Q is fork-convex, ( Z (cid:48)(cid:48) t ) is the likelihood ratio process of some Q (cid:48)(cid:48) ∈ Q . We now compute E Q (cid:48)(cid:48) [ E τ (cid:48)(cid:48) | F s ] = E Q (cid:48)(cid:48) [ A E τ | F s ] + E Q (cid:48)(cid:48) [ A c E τ (cid:48) | F s ]= E R (cid:20) Z (cid:48)(cid:48) τ Z (cid:48)(cid:48) s A E τ (cid:12)(cid:12)(cid:12)(cid:12) F s (cid:21) + E R (cid:20) Z (cid:48)(cid:48) τ (cid:48) Z (cid:48)(cid:48) s A c E τ (cid:48) (cid:12)(cid:12)(cid:12)(cid:12) F s (cid:21) = E R (cid:20) Z τ Z s A E τ (cid:12)(cid:12)(cid:12)(cid:12) F s (cid:21) + E R (cid:20) Z (cid:48) τ (cid:48) Z (cid:48) s A c E τ (cid:48) (cid:12)(cid:12)(cid:12)(cid:12) F s (cid:21) = A E Q [ E τ | F s ] + A c E Q (cid:48) [ E τ (cid:48) | F s ]= max { E Q [ E τ | F s ] , E Q (cid:48) [ E τ (cid:48) | F s ] } . This demonstrates closure under maxima.Now fix any Q ∈ Q and s ∈ N . Thanks to the closure property under maxima, Proposition 16 shows thatthere exist families ( Q n ) of measures in Q and ( τ n ) of finite stopping times taking values in { s, s + 1 , . . . } such that E Q n [ E τ n | F s ] ↑ L s almost surely under R , and hence under Q . Therefore, by the conditionalversion of the monotone convergence theorem, E Q [ L s | F s − ] = E Q (cid:104) lim n E Q n [ E τ n | F s ] (cid:12)(cid:12)(cid:12) F s − (cid:105) = lim n E Q [ E Q n [ E τ n | F s ] | F s − ] . (10)Replacing Q n by (1 − n − ) Q n + n − Q we still have (10) and, in addition, Q absolutely continuous withrespect to Q n . From now on we use this modified choice of Q n . Let ( Z t ) and ( Z nt ) be the likelihood ratioprocesses of Q and Q n , respectively, and define (cid:101) Z nt = Z t , t ≤ s,Z s Z nt Z ns , t > s. By fork-convexity, ( (cid:101) Z nt ) is the likelihood ratio process of some (cid:101) Q n ∈ Q . 
We then get E Q [ E Q n [ E τ n | F s ] | F s − ] = E R (cid:20) Z s Z s − E R (cid:20) Z nτ n Z ns E τ n (cid:12)(cid:12)(cid:12)(cid:12) F s (cid:21)(cid:12)(cid:12)(cid:12)(cid:12) F s − (cid:21) = E R (cid:20) Z s Z s − Z nτ n Z ns E τ n (cid:12)(cid:12)(cid:12)(cid:12) F s − (cid:21) = E R (cid:34) (cid:101) Z nτ n (cid:101) Z ns − E τ n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) F s − (cid:35) = E (cid:101) Q n [ E τ n | F s − ] ≤ L s − . E Q [ L s | F s − ] ≤ L s − . Iterating this inequality and using that F istrivial yields E Q [ L s ] ≤ L . In particular, L s is Q -integrable. Since ( E t ) is a Q -safe e-value, we have L = sup Q ∈Q , τ ≥ E Q [ E τ ] ≤ . Since Q ∈ Q and s ∈ N were arbitrary, this proves that ( L t ) is a Q -NSMwith L ≤ .Let ( L (cid:48) t ) be another Q -NSM that dominates ( E t ) . Then for any Q ∈ Q , any t ∈ { , , . . . } , and anyfinite stopping time τ ≥ t , the optional stopping theorem under Q gives L (cid:48) t ≥ E Q [ L (cid:48) τ | F t ] ≥ E Q [ E τ | F t ] .Therefore L (cid:48) t ≥ L t by the definition of essential supremum.The process ( L t ) above is what we call the Q -Snell envelope. Note that the Q -Snell envelope of ( L t ) isalmost surely equal to ( L t ) itself. In short, the above theorem claims that if Q is fork-convex, then the Q -Snell envelope of any Q -safe e-value exists and is safe.To construct a powerful and valid test that dominates a safe e-value ( E t ) , one might be inherentlyinterested in the largest Q -NSM ( L t ) that dominates ( E t ) and satisfies L ≤ . However, we are not awareof a systematic way to obtain such a process. Nevertheless, even the smallest Q -NSM that dominates ( E t ) , namely the Q -Snell envelope, still tends to improve its power.For a given Q , can there be more than one process that is considered a Q -Snell envelope (of someother process), and amongst these, is there a largest one? In general, the answer is yes for the firstquestion and (typically) no for the second. Every Q -NSM is its own Snell envelope and there alwaysexist uncountably many Q -NSMs, namely the constant and nonnegative decreasing processes startingat one. In particular, the constant process is also a Q -NM albeit a powerless one. In fact, there may beuncountably many Q -NSMs, with none of these processes dominating the others, and at the same timethere may not exist any non-constant Q -NMs (that don’t use independent external randomization, whichinvolves expanding the filtration). For this paper’s choice of Q , we later show that every Q -NSM is almostsurely nonincreasing, and hence the constant process equaling one dominates all Q -NSMs, and indeed theonly Q -NM almost surely equals one.Taken together, Theorems 11 and 12 lead to the following corollary, which tells us that in certainsituations one has to move beyond composite NSMs to achieve powerful tests. We continue to let Q beany locally dominated null set and (cid:101) Q its closed fork-convex hull. Corollary 13.
Let $(E_t)$ be a $\mathcal{Q}$-safe e-value. Then $(E_t)$ is dominated by (or equals) some $\mathcal{Q}$-NSM $(L_t)$ with $L_0 \leq 1$ if and only if $(E_t)$ already happens to be $\widetilde{\mathcal{Q}}$-safe (and therefore powerless against $\widetilde{\mathcal{Q}} \setminus \mathcal{Q}$).

Proof.
To prove the forward implication, assume $(E_t)$ is dominated by some $\mathcal{Q}$-NSM $(L_t)$ with $L_0 \leq 1$. By Theorem 11, $(L_t)$ is in fact a $\widetilde{\mathcal{Q}}$-NSM. It follows that $(E_t)$ is $\widetilde{\mathcal{Q}}$-safe as claimed, because we have $\mathbb{E}_P[E_\tau] \leq \mathbb{E}_P[L_\tau] \leq L_0 \leq 1$ for every $P \in \widetilde{\mathcal{Q}}$ and each finite stopping time $\tau$. To prove the reverse implication, assume that $(E_t)$ is actually $\widetilde{\mathcal{Q}}$-safe. An application of Theorem 12 (with $\mathcal{Q}$ replaced by $\widetilde{\mathcal{Q}}$) then gives a $\widetilde{\mathcal{Q}}$-NSM $(L_t)$ with $L_0 \leq 1$ that dominates $(E_t)$. This completes the proof of the corollary.

The above result suggests that we must look beyond NSMs for designing sequential tests for exchangeability, and we next show that this fact holds regardless of the class of alternatives considered.

We now return to the main focus of this paper, which is binary sequences; thus $\mathcal{X} = \{0,1\}$. In this case, any law $P$ is locally dominated by the i.i.d. Bernoulli(1/2) law $\mathsf{R} := \mathrm{Ber}(1/2)^\infty$ and the likelihood ratio process of $P$ is
$$Z_t = \prod_{s=1}^t \left( 2\, q_s(X_1, \dots, X_{s-1})\, \mathbb{1}_{\{X_s=1\}} + 2\,(1 - q_s(X_1, \dots, X_{s-1}))\, \mathbb{1}_{\{X_s=0\}} \right) \qquad (11)$$
for some functions $q_t : \{0,1\}^{t-1} \to [0,1]$ such that, $\mathsf{R}$-almost surely, $q_t(X_1, \dots, X_{t-1}) = P(X_t = 1 \mid X_1, \dots, X_{t-1})$.

Figure 2: A summary of some of the implications related to $\mathcal{Q}$-NSMs and $\mathcal{Q}$-safety (recall Figure 1 for the definitions of these classes); the figure relates the properties $\mathcal{Q}_0$-safe, $\mathcal{Q}$-safe, $\mathcal{P}$-safe and $\mathcal{Q}_0$-NSM, $\mathcal{Q}$-NSM, $\mathcal{P}$-NSM by implication arrows. We would like to design a $\mathcal{Q}$-safe e-value that is powerful against $\mathcal{P}$. Theorem 14 proves that a $\mathcal{Q}$-NSM is non-viable since it unintentionally results in $\mathcal{P}$-safety and thus no power against $\mathcal{P}$. The single non-implication sign in the figure opens a door to constructing a non-NSM based $\mathcal{Q}$-safe e-value that is consistent against $\mathcal{P}$, and this is precisely the construction in (5).

In particular, taking $q_t = p \in (0,1)$ for all $t$ gives the likelihood ratio process of $\mathrm{Ber}(p)^\infty$ with respect to $\mathsf{R}$. By repeatedly taking fork-convex combinations of such laws we obtain any law $P$ whose likelihood ratio process is of the form
$$Z_t = \prod_{s=1}^t \sum_{k=1}^N h_{s \wedge T, k}(X_1, \dots, X_{s-1}) \left( 2\, p_{s,k}\, \mathbb{1}_{\{X_s=1\}} + 2\,(1 - p_{s,k})\, \mathbb{1}_{\{X_s=0\}} \right) \qquad (12)$$
for some $N, T \in \mathbb{N}$, some functions $h_{t,k} : \{0,1\}^{t-1} \to [0,1]$ with $\sum_{k=1}^N h_{s,k} = 1$, and some $p_{t,k} \in (0,1)$. (The $T$ appears because the fork-convex hull only allows for finitely many fork-convex combinations.)

Theorem 14.
Every law $P$ over the space of binary sequences belongs to the closed fork-convex hull of $\{\mathrm{Ber}(p)^\infty : p \in (0,1)\} \subseteq \mathcal{Q}_0$. Thus, every $\mathcal{Q}_0$-NSM must be almost surely nonincreasing, and thus never exceeds one and always has zero power for any $\alpha$.

Proof. The fork-convex hull of this family consists of all laws $Q$ obtained by taking fork-convex combinations a finite number of times. In particular, it contains all $Q$ whose likelihood ratio process is of the form (12) with $p_{i,k} \in (0,1)$ and $N, T \in \mathbb{N}$. The closed fork-convex hull contains every $Q$ whose likelihood ratio process is of the form (12) for $p_{i,k} \in [0,1]$ and $N = T = \infty$.

Therefore, to prove the theorem it is enough to take an arbitrary law $P$, whose likelihood ratio process is necessarily of the form (11), and show that it can be written in the form (12). To do so, we must for each $t \in \mathbb{N}$ choose $h_{t,k}$ and $p_{t,k}$ such that
$$q_t(x_1, \dots, x_{t-1}) = \sum_{k=1}^\infty h_{t,k}(x_1, \dots, x_{t-1})\, p_{t,k}$$
for all $(x_1, \dots, x_{t-1}) \in \{0,1\}^{t-1}$. This is straightforward: simply let $y_1, \dots, y_{2^{t-1}}$ list all elements of $\{0,1\}^{t-1}$, and set $h_{t,k}(x_1, \dots, x_{t-1}) = \mathbb{1}_{\{y_k\}}(x_1, \dots, x_{t-1})$, $p_{t,k} = q_t(y_k)$ for all $k \leq 2^{t-1}$, and $h_{t,k}(x_1, \dots, x_{t-1}) = 0$, $p_{t,k} = 0$ for all $k > 2^{t-1}$.

In other words, not only are $\mathcal{Q}_0$-NSMs inadequate against Markovian alternatives, they are incapable of detecting any deviation from exchangeability.

4 A simulation: power against a change point alternative
We consider a somewhat counterintuitive example to show that a first-order Markov alternative to exchangeability is perhaps more powerful than one may believe at first sight. Consider a length-$2n$ sequence of coin flips sampled from $\mathrm{Ber}(p)^n\,\mathrm{Ber}(q)^n$ for some $p \neq q$. To match the setup of this example with our initial problem set up, one could potentially extend this to an infinite sequence in an arbitrary way, for example just continuing as $\mathrm{Ber}(q)$ after time $2n$.

This sequence is clearly not exchangeable. It is, however, not clear whether our proposed first-order Markov alternative would detect (much) evidence against the null, as the sequence is not Markov, but is more like a change point alternative. Detecting evidence is not a given; the outcomes are in fact independent (albeit not identically distributed). Hence there is no first-order dependency structure for the Markov model to exploit. And on top of that, there seems to be only one problematic time-point, precisely half-way through the sequence. So even if the Markov model somehow exploited this, how could it gain an amount of evidence growing with the length $n$ of the sequence?

We now show that the above arguments are all misguided, and that the process $(R_t)$ from (5) gains an amount of evidence against the exchangeable null that grows exponentially with $t$ between time $n$ and $2n$. The evolution of $(R_t)$ on a typical run of this process is shown in Figure 3.

Figure 3: The process $(R_t)$ on a sequence sampled from $\mathrm{Ber}(p)^n\,\mathrm{Ber}(q)^n$ with $p \neq q$. On the first half, we see $(R_t)$ decays as $1/t$, which is due to the overhead of Jeffreys' mixture for the Markov model over the maximum likelihood Bernoulli parameter. After the change point, we see $(R_t)$ increasing fast on the exponential scale. Recalling (3) for those more familiar with the p-value scale, the corresponding anytime p-value drops to extremely small values towards the end.

Initially, $(R_t)$ loses steam and tends towards zero at a rate $1/t$ before time $n$, since the null is true and there is a price to pay for the Jeffreys' mixture over the alternative. To calibrate what to expect after the change point, think of $n$ as being relatively large so that we can reason about empirical frequencies of zeros and ones with more ease. Let us compute the maximum likelihood parameters for typical sequences with frequency (tending to) $p$ in the first half and $q$ in the second half. For the Bernoulli model, we find $\hat{p} = (p+q)/2$. For the first-order Markov model we find that
$$\hat{p}_{1|1} = \frac{p^2 + q^2}{p + q} \qquad \text{and} \qquad \hat{p}_{1|0} = \frac{(1-p)p + (1-q)q}{(1-p) + (1-q)}.$$
Note that $\hat{p}_{1|1} = \hat{p}_{1|0} = \hat{p}$ occurs if and only if $p = q$. The fact that an exploitable first-order Markov dependency structure arises can perhaps be best observed in the extreme case $p = 0$ and $q = 1$. As this comparison does not really depend on $n$, we find that for all other parameter settings with $p \neq q$, the Markov model will gain overall evidence exponentially growing with $t$ between time $n$ and $2n$. (Technically, the exponential growth does not start immediately at time $n+1$, but it does so eventually.) However, as $t$ grows even further, well beyond $t = 2n$, $R_t$ will decrease once more towards zero. This is because the sequence eventually is dominated by i.i.d. $\mathrm{Ber}(q)$ coin flips, and the MLE under the null explains the data very well.

Thus, for this example, we do not get a power one test, nor should we expect a single change point away from an i.i.d. model to yield power one for a test designed to be powerful against Markovian alternatives. A short simulation sketch of this example follows.
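This is a small simulation sketch (ours), reusing the log_R function from the earlier code block; the biases 0.3 and 0.7 are purely illustrative and not taken from the paper. Under the null half the process drifts downward at rate $1/t$, and it grows exponentially between $n$ and $2n$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
p_pre, q_post = 0.3, 0.7            # illustrative biases, not from the paper
x = np.concatenate([rng.binomial(1, p_pre, n), rng.binomial(1, q_post, n)])
log_Rt = log_R(x)                    # log e-process from (5), defined in the earlier sketch
print("log R at t = n: ", log_Rt[n - 1])       # roughly -log t + O(1): null looks true
print("log R at t = 2n:", log_Rt[2 * n - 1])   # large and positive: evidence against exchangeability
```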
If the initial pattern repeats itself, meaning that we keep alternating between $\mathrm{Ber}(p)$ and $\mathrm{Ber}(q)$ blocks of length $n$, then $(R_t)$ does have power one, and this is interesting because $(R_t)$ is designed for first-order Markov alternatives, but it is consistent against these higher-order Markov alternatives.

In fact, one can argue that it is information theoretically impossible to design any power one test, including tests that are tuned to detect a single change point. To see why, think of very small $n$, like $n = 2$, to make the reason intuitively transparent. How can a test possibly have enough evidence with just one or two coin flips before the change point, to know with probability one that a change actually did occur? Naturally, the larger the time $n$ of the change point, the higher the power could be of any such test (as it is for our test also), but no test can possibly have power one since there is always some small probability (vanishing with $n$) that the distribution of the first $n$ coin flips looks quite similar to the post-change distribution. Nevertheless, this simple example illustrates the point that our proposed e-value $R_t$ for evidence against exchangeability is actually powerful even in scenarios that are not (close to) Markov.

While $(R_t)$ is a $\mathcal{Q}$-safe e-value, $(\max_{s \leq t} R_s)$ is not. In other words, we are only allowed to measure our performance based on the wealth accumulated thus far and not the highest wealth that we reached at some point in the process. The same is not true for p-values: $(1/R_t)$ is an anytime p-value, and so is $(1/\max_{s \leq t} R_s)$, the latter being the running infimum of the former. In game-theoretic terminology, the gambler can decide to stop playing the game (betting against the null) according to any stopping rule $\tau$, but once they have stopped, only the final wealth $R_\tau$ of the gambler matters, and a nearly bankrupt gambler cannot point to their past wealth as a measure of their proficiency. This subtle point particularly manifests itself in the above example, because with a single change point, $(R_t)$ rises to some amount (as seen in the figure) and then will shrink back to zero, so if we happen to stop too late, then $R_\tau$ could provide only meagre evidence even though it was once astronomically large.

So how can we get around this worrisome issue? We take inspiration from Shafer et al. [15] and use "calibrated p-values" as our e-values. (As a matter of terminology, our use of calibration here can be seen as an E-to-P-to-E process, but if we skip the middle step entirely, the direct E-to-E method has been called "adjustment" by Dawid et al. [2], Koolen and Vovk [10]. We will present it from both angles below to tie some loose ends in the literature together.)

Define $p_t := 1/\max_{s \leq t} R_s$, so that $(p_t)$ is a $\mathcal{Q}$-valid p-value that satisfies (3). Let $f$ be a calibrator [15, 12], which is a nonincreasing function $f$ such that $\int_0^1 f(u)\,\mathrm{d}u = 1$. Then $(f(p_t))$ is a $\mathcal{Q}$-safe e-value. It is not hard to check that $f(p_t) \leq R_t$, so there is some price to pay for being able to take the best possible wealth into account. One possible choice for $f$ is given by
$$f(u) := \frac{1 - u + u\ln u}{u(-\ln u)^2};$$
also see Vovk and Wang [19, Eq. (2)]. In order to do things more directly, let $F$ be an adjuster [15, 2], which is an increasing function $F$ such that $\int_1^\infty F(y)\, y^{-2}\,\mathrm{d}y = 1$. Then $A_t = F(\max_{s \leq t} R_s)$ yields a $\mathcal{Q}$-safe e-value, and indeed as before $A_t \leq R_t$. One possible choice for $F$ is given by
$$F(y) := \frac{y^2 \ln 2}{(1+y)(\ln(1+y))^2}.$$
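A short sketch (ours) of the calibrator $f$ and adjuster $F$ displayed above, together with a numerical check of their defining integrals:

```python
import numpy as np
from scipy.integrate import quad

def f_calibrator(u):
    # f(u) = (1 - u + u ln u) / (u (ln u)^2), a nonincreasing calibrator
    return (1 - u + u * np.log(u)) / (u * np.log(u) ** 2)

def F_adjuster(y):
    # F(y) = y^2 ln 2 / ((1 + y) (ln(1 + y))^2), an increasing adjuster
    return y ** 2 * np.log(2) / ((1 + y) * np.log(1 + y) ** 2)

# The singularity of f at 0 is integrable; both integrals evaluate to ~1.
print(quad(f_calibrator, 0, 1)[0])
print(quad(lambda y: F_adjuster(y) / y ** 2, 1, np.inf)[0])
```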
Thus, even if R_t rises sharply and then decreases to zero eventually, F(max_{s≤t} R_s) does not. In fact, using the F given above in our example with a single change point, and noting that F(y) ≍ y/(ln y)² for large y, we see that A_∞ remains very large even though R_∞ = 0. Of course, if R_t → ∞ then so does F(max_{s≤t} R_s), meaning that the adjusted process does not lose the consistency property against Markovian alternatives. Thus, at a (squared) logarithmic price to the overall capital, one can be protected against future losses, and for this reason we recommend using A_t = F(max_{s≤t} R_s) as an e-value if we are uncertain about how close our alternative might be to the idealized Markovian case studied here.

For those explicitly interested in powerful tests to detect change point alternatives in the setting of this paper, we briefly describe a powerful test (albeit not a power one test, as already explained above). Essentially, one can combine the ideas in Remark 6 with those in Section 2.5. We let P^k denote the alternative in which the change point is hypothesized to occur at time n = 2^k, though other increasing functions of k may also suffice. We will define an e-value E^k_t for each k and then use a countable mixture over k as the final e-value.

Now, we describe an e-value for a fixed k. Define (g^-_s)_{s=1}^{n} to be a smoothed non-anticipating maximum likelihood estimator, calculated using data from time 1 to s − 1. The smoothing step is simple: add a single fake observation worth half a heads (or half a tails) to the counts when determining the MLE. The smoothing leads to a slight regularization that can be viewed as the posterior mean under a Beta(1/2, 1/2) prior, analogous to Krichevsky-Trofimov betting [11]. Similarly, define (g^+_s)_{s=n+1}^{∞} to be the same smoothed non-anticipating maximum likelihood estimator, but calculated using data from time n + 1 to s − 1. In both cases, the smoothing also leads to well-defined functions g^-_1 and g^+_{n+1}, which are effectively a Ber(1/2) model. Finally, define the e-value E^k_t as the ratio of ∏_{s=1}^{n∧t} g^-_s(X_s) ∏_{s=n+1}^{t} g^+_s(X_s) to the maximum likelihood under the i.i.d. null (the second product being empty when t ≤ n). In other words, the denominator is identical to that of R_t, but the numerator has changed because the targeted alternative is now different.

Recalling Section 2.5 (and the final section of Wasserman et al. [23]), it is easy to see that E^k_t is a Q-safe e-value. If a change point occurs at time n^* (let k^* := ⌊log_2 n^*⌋), the e-values E^{k^*}_t and E^{k^*+1}_t will grow exponentially between time n^* and 2n^*. Even with the countable weighting of Remark 6, their exponential growth washes out the inverse polynomial weights, yielding a powerful e-value E_t.

Naturally, many permutations and combinations of these ideas can be used to derive a variety of tests against different kinds of alternatives. We leave further exploration of these variants to future work.
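To make the construction concrete, here is a minimal sketch (our own illustration with hypothetical helper names, not the paper's code) of the smoothed non-anticipating MLE and the resulting change point e-value, together with a countable mixture using weights 1/(k(k+1)) as one assumed instance of the inverse polynomial weights mentioned above. In practice one would compute in log space to avoid underflow on long sequences.

```python
import numpy as np

def kt_predictions(x):
    """Predictive probabilities g_s(x_s), s = 1, ..., len(x): the Bernoulli MLE on
    x_1, ..., x_{s-1}, smoothed by half a fake head and half a fake tail
    (posterior mean under a Beta(1/2, 1/2) prior, i.e. the KT predictor)."""
    probs, ones = [], 0.0
    for s, obs in enumerate(x):
        p1 = (ones + 0.5) / (s + 1.0)   # g_1 is effectively a Ber(1/2) model
        probs.append(p1 if obs == 1 else 1.0 - p1)
        ones += obs
    return np.array(probs)

def changepoint_evalue(x, k):
    """E^k_t for t = len(x): KT predictions restarted at the hypothesized change
    point n = 2**k, divided by the maximum likelihood under the i.i.d. null."""
    x = np.asarray(x)
    n = 2 ** k
    numer = np.prod(kt_predictions(x[:n])) * np.prod(kt_predictions(x[n:]))
    p_hat = x.mean()
    denom = p_hat ** x.sum() * (1.0 - p_hat) ** (len(x) - x.sum())
    return numer / denom

def mixed_evalue(x, k_max=20):
    """Countable mixture over k with weights 1/(k(k+1)) (an assumed concrete choice);
    the weights sum to at most one, so the mixture is again an e-value."""
    return sum(changepoint_evalue(x, k) / (k * (k + 1)) for k in range(1, k_max + 1))

# Example: 64 zeros followed by 64 ones; the e-value for k = 6 (n = 64) is enormous.
x = np.r_[np.zeros(64, dtype=int), np.ones(64, dtype=int)]
print(changepoint_evalue(x, k=6), mixed_evalue(x))
```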
The celebrated theorem of de Finetti, for which many proofs exist, including ones based on elementary arguments [9], states that all exchangeable binary sequences are mixtures of i.i.d. sequences. In fact, for any exchangeable sequence, the empirical measure P_t := (1/t) Σ_{s=1}^t δ_{X_s} converges in distribution to a measure µ supported on [0, 1], and this µ is the so-called “de Finetti mixing measure” alluded to in the previous sentence. The crux of the matter is that the convex hull of all i.i.d. binary sequences is precisely the set of exchangeable binary sequences. Since the convex hull preserves properties like safety, one can develop tests for the i.i.d. setting and invoke de Finetti to extend the result to the exchangeable setting.

In this paper, we go several steps further: we prove that the set of Markovian sequences lies in the “fork-convex hull” of all exchangeable (or i.i.d.) sequences. In fact, Theorem 14 shows that the closed fork-convex hull is so large that every law over binary sequences is contained in it! Theorem 11 shows that the nonnegative supermartingale (NSM) property is preserved not just by taking the convex hull of a set of distributions, but also when taking the (much larger) fork-convex hull, and Corollary 13 shows that any safe test for Q is also safe for its fork-convex hull. Together, these results show that any NSM under exchangeable distributions is also an NSM under Markovian distributions, and in fact it is an NSM under every distribution over binary sequences, meaning that test statistics that are NSMs are powerless to distinguish non-exchangeable distributions from exchangeable ones.

We get around the above hurdles by designing a process (R_t) in (5) that is upper bounded by a nonnegative martingale for every exchangeable distribution, despite not being an NSM itself. This process uses the method of mixtures with Jeffreys' prior to handle the composite alternative, along with the maximum likelihood under the null, to ultimately yield a computationally efficient closed-form e-value. This e-value not only has the desired safety properties at arbitrary stopping times (potentially infinite), but also has power one against any alternative, as implied by a regret bound borrowed from the universal coding literature. Section 2 also presented variations that work for higher-order Markovian alternatives, and finally for even more general, loosely specified alternatives, by combining the method of predictable mixtures [24] with universal inference [23].

An interesting approach towards testing randomness was recently expounded by Vovk [18], based on conformal prediction. It replaces the canonical filtration (F_t) by a poorer filtration G_t = σ(S_1, ..., S_t) formed by conformal scores, where (informally) the score S_t ≡ S(X_t, {X_1, ..., X_{t−1}}) measures how different X_t is from {X_1, ..., X_{t−1}}, in other words how much it does or does not conform to the past. Vovk then produces a sequence of independent p-values under the null, which are converted to e-values by appropriate calibration, which are in turn combined to form a martingale with respect to (G_t). This is particularly interesting because Vovk argues, much like in our setting, that the only martingales with respect to (F_t) are almost surely constant, but he is able to identify nontrivial martingales with respect to an appropriately impoverished filtration (G_t).

Our approaches based on Jeffreys' mixture and the nonanticipating likelihood (or predictable mixture) can be seen as providing two alternatives to Vovk's methodology. In fact, the latter bears some commonalities to Vovk's approach, in that the function S(·, {X_1, ..., X_{t−1}}) mentioned above must be predictable, just like the sequence (g_t) used in (7); both are connected to betting approaches to statistical inference. Indeed, Vovk's methodology seems most powerful for change point alternatives, making it most similar to the extensions discussed in Section 4.
However, in the end, the details appear to be different, and the conceptual principles by which the methods are derived also differ significantly.

A final, alternative approach to this problem could utilize reverse martingales and exchangeable filtrations. To elaborate, the exchangeable filtration is the reverse filtration (E_t)_{t=0}^∞ where E_0 := σ({X_1, X_2, ...}), and for all t ≥ 1, E_t denotes the σ-algebra generated by all real-valued Borel-measurable functions f(X_1, X_2, ...) which are permutation-symmetric in their first t arguments, so that E_0 ⊇ E_1 ⊇ E_2 ⊇ ···. It is known that if the data are exchangeable, then the empirical measure P_t := (1/t) Σ_{s=1}^t δ_{X_s} forms a measure-valued reverse martingale with respect to the exchangeable filtration, in the sense that (∫ g dP_t)_{t≥1} is a reverse martingale for any bounded and Borel-measurable function g [8]. In fact, the converse of this statement also holds true if the sequence (X_t) is stationary [1]. We hope to explore in more detail whether this approach can lead to powerful tests in the future.

Acknowledgments
AR acknowledges NSF DMS grant 1916320.

References

[1] Martin Bladt. Characterisation of exchangeable sequences through empirical distributions. arXiv preprint arXiv:1903.07861, 2019.
[2] A. Philip Dawid, Steven de Rooij, Peter Grünwald, Wouter M. Koolen, Glenn Shafer, Alexander Shen, Nikolai Vereshchagin, and Vladimir Vovk. Probability-free pricing of adjusted American lookbacks. arXiv preprint arXiv:1108.4113, 2011.
[3] Freddy Delbaen. The structure of m-stable sets and in particular of the set of risk neutral measures. In In memoriam Paul-André Meyer: Séminaire de Probabilités XXXIX, volume 1874 of Lecture Notes in Mathematics, pages 215–258. Springer, Berlin, 2006.
[4] Larry G. Epstein and Martin Schneider. Recursive multiple-priors. Journal of Economic Theory, 113(1):1–31, 2003.
[5] Hans Föllmer and Alexander Schied. Stochastic Finance, volume 27 of De Gruyter Studies in Mathematics. Walter de Gruyter & Co., Berlin, extended edition, 2004.
[6] Peter Grünwald, Rianne de Heide, and Wouter Koolen. Safe testing. arXiv preprint arXiv:1906.07801, 2019.
[7] Garud N. Iyengar. Robust dynamic programming. Mathematics of Operations Research, 30(2):257–280, 2005.
[8] Olav Kallenberg. Probabilistic Symmetries and Invariance Principles. Springer Science & Business Media, 2006.
[9] Werner Kirsch. An elementary proof of de Finetti's theorem. arXiv preprint arXiv:1809.00882, 2018.
[10] Wouter M. Koolen and Vladimir Vovk. Buy low, sell high. Theoretical Computer Science, 558:144–158, 2014.
[11] Raphail E. Krichevsky and Victor K. Trofimov. The performance of universal encoding. IEEE Transactions on Information Theory, 27(2):199–207, 1981.
[12] Aaditya Ramdas, Johannes Ruf, Martin Larsson, and Wouter Koolen. Admissible anytime-valid sequential inference must rely on nonnegative martingales, 2020.
[13] Glenn Shafer. The language of betting as a strategy for statistical and scientific communication (with discussion). Journal of the Royal Statistical Society, Series A, 2020.
[14] Glenn Shafer and Vladimir Vovk. Game-Theoretic Foundations for Probability and Finance, volume 455. John Wiley & Sons, 2019.
[15] Glenn Shafer, Alexander Shen, Nikolai Vereshchagin, and Vladimir Vovk. Test martingales, Bayes factors and p-values. Statistical Science, 26(1):84–101, 2011.
[16] Alexander Shapiro. Rectangular sets of probability measures. Operations Research, 64(2):528–541, 2016.
[17] Jun'ichi Takeuchi, Tsutomu Kawabata, and Andrew R. Barron. Properties of Jeffreys mixture for Markov sources. IEEE Transactions on Information Theory, 59(1):438–457, 2013. https://doi.org/10.1109/TIT.2012.2219171.
[18] Vladimir Vovk. Testing randomness online. Statistical Science, 2021.
[19] Vladimir Vovk and Ruodu Wang. E-values: Calibration, combination, and applications. Forthcoming in the Annals of Statistics, 2019.
[20] Gordan Žitković. A filtered version of the bipolar theorem of Brannath and Schachermayer. Journal of Theoretical Probability, 15(1):41–61, 2002.
[21] Abraham Wald. Sequential tests of statistical hypotheses. Annals of Mathematical Statistics, 16(2):117–186, 1945.
[22] Abraham Wald. Sequential Analysis. John Wiley & Sons, New York, 1947.
[23] Larry Wasserman, Aaditya Ramdas, and Sivaraman Balakrishnan. Universal inference. Proceedings of the National Academy of Sciences, 2020.
[24] Ian Waudby-Smith and Aaditya Ramdas. Estimating means of bounded random variables by betting. arXiv preprint arXiv:2010.09686, 2020.
[25] Wolfram Wiesemann, Daniel Kuhn, and Berç Rustem. Robust Markov decision processes. Mathematics of Operations Research, 38(1):153–183, 2013.
[26] F. Willems, Y. Shtarkov, and T. Tjalkens. The context-tree weighting method: basic properties. IEEE Transactions on Information Theory, 41:653–664, 1995.
A Additional technical concepts and definitions
A.1 Reference measures and local absolute continuity
Consider a probability space with a filtration (F_t)_{t∈N}. Let R be a particular probability measure on F_∞; we think of R as a reference measure. We now explain the concept of local domination and how it allows us to unambiguously define conditional expectations.

• If P is a probability measure on F_∞ and τ is a stopping time, we write P|_τ for the restriction of P to F_τ. (This is simply the probability measure on F_τ defined by P|_τ(A) = P(A), A ∈ F_τ. Think of this as the 'coarsening' of P that only operates on events observable up to time τ.)

• P is called locally dominated by R (or locally absolutely continuous with respect to R) if P|_t ≪ R|_t for all t ∈ N. We write this P ≪_loc R. More explicitly, this means that R(A) = 0 ⇒ P(A) = 0 for any A ∈ F_t and t ∈ N. Local absolute continuity does not imply that P ≪ R. However, it does imply that P|_τ ≪ R|_τ for any finite (but possibly unbounded) stopping time τ. Indeed, if A ∈ F_τ and R(A) = 0, then A ∩ {τ ≤ t} ∈ F_t for all t, and hence P(A) = lim_{t→∞} P(A ∩ {τ ≤ t}) = 0.

• A set P of probability measures on F_∞ is called locally dominated by R if every element of P is locally dominated by R.

• Any P ≪_loc R has an associated likelihood ratio process (often also called density process), namely the R-martingale (Z_t) given by Z_t := dP|_t / dR|_t. Being a nonnegative martingale, once Z_t reaches zero it stays there. Thus, with the convention 0/0 := 0, the ratios Z_τ/Z_t are well-defined for any t ∈ N and any finite stopping time τ ≥ t. Note that each Z_t is defined up to R-nullsets, and therefore also up to P-nullsets.

• If P ≪_loc R has likelihood ratio process (Z_t), the following 'Bayes formula' holds: for any t ∈ N, any finite stopping time τ ≥ t, and any nonnegative F_τ-measurable random variable Y, one has

E_P[Y | F_t] = E_R[(Z_τ/Z_t) Y | F_t] · 1{Z_t > 0},   P-almost surely.

The right-hand side is uniquely defined R-almost surely (not just P-almost surely), and therefore provides a 'canonical' version of E_P[Y | F_t]. We always use this version.
This allows us to view such conditional expectations under P as being well-defined up to R-nullsets.

One might ask why we work with local domination, rather than a 'global' condition like P ≪ R for all P of interest. The answer is that such a condition would be far too restrictive, as we now illustrate. Let (X_t)_{t∈N} be a sequence of random variables. For each η ∈ R, let P_η be the distribution under which the X_t become i.i.d. normal with mean η and unit variance. By the strong law of large numbers, P_η assigns probability one to the event A_η := {lim_{t→∞} t^{−1} Σ_{s=1}^t X_s = η}. Moreover, the events A_η are mutually disjoint: A_η ∩ A_ν = ∅ whenever η ≠ ν. This means by definition that the measures P_η are all mutually singular. Since there is an uncountable number of them, there cannot exist a measure R such that P_η ≪ R for all η. On the other hand, if P_η|_t denotes the law of the partial sequence X_1, ..., X_t for some t ∈ N, then the measures P_η|_t, η ∈ R, are all mutually absolutely continuous. In particular, we could (for instance) use R = P_0 as the reference measure and obtain P_η ≪_loc R for all η ∈ R.
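As a small illustration of these definitions (our own sketch, not from the paper), the density process Z_t can be written down explicitly when P and the reference measure R are both i.i.d. Bernoulli laws on binary sequences.

```python
import numpy as np

def density_process(x, p, r=0.5):
    """Z_t = dP|_t / dR|_t along the observed binary sequence x, where
    P = Ber(p)^infinity and the reference measure R = Ber(r)^infinity."""
    x = np.asarray(x)
    steps = np.where(x == 1, p / r, (1.0 - p) / (1.0 - r))  # per-coordinate likelihood ratios
    return np.cumprod(steps)                                 # an R-martingale started at Z_0 = 1

# Each P = Ber(p)^infinity is locally dominated by R = Ber(1/2)^infinity,
# even though the full laws need not be globally absolutely continuous.
print(density_process([1, 0, 1, 1], p=0.7))
```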
A.2 Essential supremum

On some probability space, consider a collection (Y_α)_{α∈A} of random variables, where A is an arbitrary index set. If A is uncountable, the pointwise supremum sup_{α∈A} Y_α might not be measurable (i.e., not a random variable). Alternatively, it might happen that Y_α = 0 almost surely for every α ∈ A, but sup_{α∈A} Y_α = 1. (For instance, if U is uniform on [0, 1] and Y_α := 1{U = α} for α ∈ A = [0, 1], then each Y_α = 0 almost surely, yet sup_{α∈A} Y_α ≡ 1.) For this reason, the pointwise supremum is often not useful. Instead, one can use the essential supremum.
Proposition 15. There exists a [−∞, ∞]-valued random variable Y, called the essential supremum and denoted by ess sup_{α∈A} Y_α, such that

1. Y ≥ Y_α, almost surely, for every α ∈ A;
2. if Y′ is a random variable that satisfies Y′ ≥ Y_α, almost surely, for every α ∈ A, then Y′ ≥ Y, almost surely.

The essential supremum is almost surely unique.

In words, the essential supremum is the smallest almost sure upper bound on (Y_α). The proposition guarantees that it always exists. In some cases, more can be said: the essential supremum can be obtained as the limit of an increasing sequence.
Proposition 16. Suppose (Y_α) is closed under maxima, meaning that for any α, β ∈ A there is some γ ∈ A such that Y_γ = max{Y_α, Y_β}. Then there is a sequence (α_n) such that (Y_{α_n}) is an increasing sequence and ess sup_{α∈A} Y_α = lim_n Y_{α_n}.