A Theory of Universal Learning
Olivier Bousquet, Steve Hanneke, Shay Moran, Ramon van Handel, Amir Yehudayoff
Olivier Bousquet [email protected]
Google, Brain Team
Steve Hanneke [email protected]
Toyota Technological Institute at Chicago
Shay Moran [email protected]
Technion
Ramon van Handel [email protected]
Princeton University
Amir Yehudayoff [email protected]
Technion
Abstract
How quickly can a given class of concepts be learned from examples? It is common to measure the performance of a supervised machine learning algorithm by plotting its “learning curve”, that is, the decay of the error rate as a function of the number of training examples. However, the classical theoretical framework for understanding learnability, the PAC model of Vapnik-Chervonenkis and Valiant, does not explain the behavior of learning curves: the distribution-free PAC model of learning can only bound the upper envelope of the learning curves over all possible data distributions. This does not match the practice of machine learning, where the data source is typically fixed in any given scenario, while the learner may choose the number of training examples on the basis of factors such as computational resources and desired accuracy.

In this paper, we study an alternative learning model that better captures such practical aspects of machine learning, but still gives rise to a complete theory of the learnable in the spirit of the PAC model. More precisely, we consider the problem of universal learning, which aims to understand the performance of learning algorithms on every data distribution, but without requiring uniformity over the distribution. The main result of this paper is a remarkable trichotomy: there are only three possible rates of universal learning. More precisely, we show that the learning curves of any given concept class decay either at an exponential, linear, or arbitrarily slow rate. Moreover, each of these cases is completely characterized by appropriate combinatorial parameters, and we exhibit optimal learning algorithms that achieve the best possible rate in each case.

For concreteness, we consider in this paper only the realizable case, though analogous results are expected to extend to more general learning scenarios.

Contents
A Mathematical background
  A.1 Gale-Stewart games
  A.2 Ordinals
  A.3 Well-founded relations and ranks
  A.4 Polish spaces and analytic sets

B Measurability of Gale-Stewart strategies
  B.1 Preliminaries
  B.2 Game values
  B.3 A winning strategy
  B.4 Measurability

C A nonmeasurable example

1. Introduction
In supervised machine learning, a learning algorithm is presented with labeled examples of a concept, and the objective is to output a classifier which correctly classifies most future examples from the same source. Supervised learning has been successfully applied in a vast number of scenarios, such as image classification and natural language processing. In any given scenario, it is common to consider the performance of an algorithm by plotting its “learning curve”, that is, the error rate (measured on held-out data) as a function of the number of training examples n. A learning algorithm is considered successful if the learning curve approaches zero as n → ∞, and the difficulty of the learning task is reflected by the rate at which this curve approaches zero. One of the main goals of learning theory is to predict what learning rates are achievable in a given learning task.

To this end, the gold standard of learning theory is the celebrated PAC model (Probably Approximately Correct) defined by Vapnik and Chervonenkis (1974) and Valiant (1984). As will be recalled below, the PAC model aims to explain the best worst-case learning rate, over all data distributions that are consistent with a given concept class, that is achievable by a learning algorithm. The fundamental result in this theory exhibits a striking dichotomy: a given learning problem either has a linear worst-case learning rate (i.e., n^{-1}), or is not learnable at all in this sense. These two cases are characterized by a fundamental combinatorial parameter of a learning problem: the VC (Vapnik-Chervonenkis) dimension. Moreover, in the learnable case, PAC theory provides optimal learning algorithms that achieve the linear worst-case rate.

While it gives rise to a clean and compelling mathematical picture, one may argue that the PAC model fails to capture at a fundamental level the true behavior of many practical learning problems.
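As a concrete aside, the VC dimension that governs this dichotomy can be computed by brute force for small finite classes. The following sketch is purely illustrative (the encoding of classifiers as label tuples and the example classes are our own choices, not part of the paper):

```python
from itertools import combinations, product

def vc_dimension(H, X):
    """VC dimension of a finite class H, with each h in H encoded as a
    tuple of 0/1 labels indexed by the points of X: the largest d such
    that some d-point subset of X realizes all 2^d label patterns."""
    d = 0
    for k in range(1, len(X) + 1):
        for S in combinations(X, k):
            patterns = {tuple(h[x] for x in S) for h in H}
            if len(patterns) == 2 ** k:  # S is shattered by H
                d = k
                break
    return d

X = range(4)
# Threshold classifiers h_t(x) = 1[x >= t]: VC dimension 1.
thresholds = {tuple(int(x >= t) for x in X) for t in range(5)}
print(vc_dimension(thresholds, X))   # 1
# All 0/1 functions on 4 points: VC dimension 4.
print(vc_dimension(set(product([0, 1], repeat=4)), X))   # 4
```

The exhaustive search over subsets is exponential in |X|, which is fine for toy examples but not a practical tool; it serves only to make the combinatorial definition concrete.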
A key criticism of the PAC model is that the distribution-independent definition of learnability is too pessimistic to explain practical machine learning: real-world data is rarely worst-case, and experiments show that practical learning rates can be much faster than is predicted by PAC theory (Cohn and Tesauro, 1990, 1992). It therefore appears that the worst-case nature of the PAC model hides key features that are observed in practical learning problems. These considerations motivate the search for alternative learning models that better capture the practice of machine learning, but still give rise to a canonical mathematical theory of learning rates. Moreover, given a theoretical framework capable of expressing these faster learning rates, we can then design new learning strategies to fully exploit this possibility.

The aim of this paper is to put forward one such theory. In the learning model considered here, we will investigate asymptotic rates of convergence of distribution-dependent bounds on the error of a learning algorithm, holding universally for all distributions consistent with a given concept class. Although this is a much weaker (and therefore arguably more realistic) notion, we will nonetheless prove that any learning problem can only exhibit one of three possible universal rates: exponential, linear, and arbitrarily slow. Each of these three cases will be fully characterized by means of combinatorial parameters (the nonexistence of certain infinite trees), and we will exhibit optimal learning algorithms that achieve these rates (based on the theory of infinite games).

Throughout this paper we will be concerned with the following classical learning problem. A classification problem is defined by a distribution P over labelled examples (x, y) ∈ X × {0, 1}. The learner does not know P, but is able to collect a sample of n i.i.d. examples from P. She uses these examples to build a classifier ĥ_n : X → {0, 1}.
The objective of the learner is to achieve small error:

er(ĥ_n) := P{(x, y) : ĥ_n(x) ≠ y}.

While the data distribution P is unknown to the learner, any informative a priori theory of learning must be expressed in terms of some properties of, or restrictions on, P. Following the PAC model, we introduce such a restriction by way of an additional component, namely a concept class H ⊆ {0, 1}^X of classifiers. The concept class H allows the analyst to state assumptions about P. The simplest such assumption is that P is realizable:

inf_{h ∈ H} er(h) = 0,

that is, H contains hypotheses with arbitrarily small error. We will focus on the realizable setting throughout this paper, as it already requires substantial new ideas and provides a clean platform to demonstrate them. We believe that the ideas of this paper can be extended to more general noisy/agnostic settings, and leave this direction to be explored in future work.

In the present context, the aim of learning theory is to provide tools for understanding the best possible rates of convergence of E[er(ĥ_n)] to zero as the sample size n grows to ∞. This rate depends on the quality of the learning algorithm, and on the complexity of the concept class H. The more complex H is, the less information the learner has about P, and thus the slower the convergence.

The classical formalization of the problem of learning in statistical learning theory is given by the
PAC model, which adopts a minimax perspective. More precisely, let us denote by RE(H) the family of distributions P for which the concept class H is realizable. Then the fundamental result of PAC learning theory states that (Vapnik and Chervonenkis, 1974; Ehrenfeucht, Haussler, Kearns, and Valiant, 1989; Haussler, Littlestone, and Warmuth, 1994)

inf_{ĥ_n} sup_{P ∈ RE(H)} E[er(ĥ_n)] ≍ min( vc(H)/n , 1 ),

where vc(H) is the VC dimension of H. In other words, PAC learning theory is concerned with the best worst-case error over all realizable distributions that can be achieved by means of a learning algorithm ĥ_n. The above result immediately implies a fundamental dichotomy for these uniform rates: every concept class H has a uniform rate that is either linear c/n or bounded away from zero, depending on the finiteness of the combinatorial parameter vc(H).

The uniformity over P in the PAC model is very pessimistic, however, as it allows the worst-case distribution to change with the sample size. This arguably does not reflect the practice of machine learning: in a given learning scenario, the data generating mechanism P is fixed, while the learner is allowed to collect an arbitrary amount of data (depending on factors such as the desired accuracy and the available computational resources). Experiments show that the rate at which the error decays for any given P can be much faster than is suggested by PAC theory (Cohn and Tesauro, 1990, 1992): for example, it is possible that the learning curve decays exponentially for every P. Such rates cannot be explained by the PAC model, which can only capture the upper envelope of the learning curves over all realizable P, as is illustrated in Figure 1.

Furthermore, one may argue that it is really the learning curve for given P, rather than the PAC error bound, that is observed in practice.
Indeed, the customary approach to estimate the performance of an algorithm is to measure its empirical learning rate, that is, to train it on several training sets of increasing sizes (obtained from the same data source) and to measure the test error of each of the obtained classifiers. In contrast, to observe the PAC rate, one would have to repeat the above measurements for many different data distributions, and then discard all this data except for the worst-case error over all considered distributions. From this perspective, it is inevitable that the PAC model may fail to reveal the “true” empirical behavior of learning algorithms. More refined theoretical results have been obtained on a case-by-case basis in various practical situations: for example, under margin assumptions, some works established exponentially fast learning rates for popular algorithms such as stochastic gradient descent and kernel methods (Koltchinskii and Beznosova, 2005; Audibert and Tsybakov, 2007; Pillaud-Vivien, Rudi, and Bach, 2018; Nitanda and Suzuki, 2019). Such results rely on additional modelling assumptions, however, and do not provide a fundamental theory of the learnable in the spirit of PAC learning.

Figure 1:
Illustration of the difference between universal and uniform rates. Each red curve shows exponential decay of the error E[er(ĥ_n)] ∼ e^{−c(P)n} for a different data distribution P; but the PAC rate only captures the pointwise supremum of these curves (blue curve, ∼ 1/n), which decays linearly at best.

Our aim in this paper is to propose a mathematical theory that is able to capture some of the above features of practical learning systems, yet provides a complete characterization of achievable learning rates for general learning tasks. Instead of considering uniform learning rates as in the PAC model, we consider instead the problem of universal learning. The term universal means that a given property (such as consistency or rate) holds for every realizable distribution P, but not uniformly over all distributions. For example, a class H is universally learnable at rate R if the following holds:

∃ ĥ_n s.t. ∀ P ∈ RE(H), ∃ C, c > 0 : E[er(ĥ_n)] ≤ C R(cn) for all n.

The crucial difference between this formulation and the PAC model is that here the constants
C, c are allowed to depend on P: thus universal learning is able to capture distribution-dependent learning curves for a given learning task. For example, the illustration in Figure 1 suggests that it is perfectly possible for a concept class H to be universally learnable at an exponential rate, even though its uniform learning rate is only linear. In fact, we will see that there is little connection between universal and uniform learning rates (as is illustrated in Figure 4 of Section 2): a given problem may even be universally learnable at an exponential rate while it is not learnable at all in the PAC sense. These two models of learning reveal fundamentally different features of a given learning problem. The fundamental question that we pose in this paper is:

Question.
Given a class H, what is the fastest rate at which H can be universally learned?

We provide a complete answer to this question, characterize the achievable rates by means of combinatorial parameters, and exhibit learning algorithms that achieve these rates. The universal learning model therefore gives rise to a theory of learning that fully complements the classical PAC theory.
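To make the notion of a distribution-dependent learning curve concrete, here is a toy simulation; all specifics (the four-point domain, the finite class, and the consistent learning rule) are hypothetical choices for illustration. For this fixed P, the empirical error of a consistent learner decays exponentially in n:

```python
import random

random.seed(0)

# A toy finite class on X = {0,1,2,3}: the target labels everything 0,
# and each competing hypothesis flips exactly one point (error 1/4 each).
X = [0, 1, 2, 3]
target = (0, 0, 0, 0)
H = [(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1), target]

def true_error(h):
    return sum(h[x] != target[x] for x in X) / len(X)

def consistent_learner(sample):
    # Return the first h in H consistent with the sample (an ERM rule;
    # the bad hypotheses are listed first, to be adversarial).
    for h in H:
        if all(h[x] == y for x, y in sample):
            return h
    return target  # unreachable in the realizable setting

def avg_error(n, trials=200):
    # Monte Carlo estimate of E[er(h_n)] under the uniform marginal on X.
    total = 0.0
    for _ in range(trials):
        sample = [(x, target[x]) for x in random.choices(X, k=n)]
        total += true_error(consistent_learner(sample))
    return total / trials

for n in [1, 5, 20, 40]:
    print(n, avg_error(n))  # decays roughly like (3/4)^n
```

The decay matches the analytic bound of Example 1.1 below: the probability that any positive-error hypothesis survives n samples is at most |H|(1 − ε)^n, with ε = 1/4 here.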
Before we proceed to the statement of our main results, we aim to develop some initial intuition for what universal learning rates are achievable. To this end, we briefly discuss three basic examples.
Example 1.1.
Any finite class H is universally learnable at an exponential rate (Schuurmans, 1997). Indeed, let ε be the minimal error er(h) among all classifiers h ∈ H with positive error er(h) > 0. The probability that some such classifier correctly classifies n training data points is bounded by |H|(1 − ε)^n. Thus a learning rule that outputs any ĥ_n ∈ H that correctly classifies the training data satisfies E[er(ĥ_n)] ≤ C e^{−cn}, where C, c > 0 depend on H, P. It is easily seen that this is the best possible: as long as H contains at least three functions, a learning curve cannot decay faster than exponentially (see Lemma 4.2 below).

Example 1.2.
The class H = {h_t : t ∈ R} of threshold classifiers on the real line, h_t(x) = 1_{x ≥ t}, is universally learnable at a linear rate. That a linear rate can be achieved already follows in this case from PAC theory, as H is a VC class. However, in this example, a linear rate is the best possible even in the universal setting: for any learning algorithm, there is a realizable distribution P whose learning curve decays no faster than a linear rate (Schuurmans, 1997).

Example 1.3.
The class H of all measurable functions on a space X is universally learnable under mild conditions (Stone, 1977; Hanneke, Kontorovich, Sabato, and Weiss, 2019): that is, there exists a learning algorithm ĥ_n that ensures E[er(ĥ_n)] → 0 as n → ∞ for every realizable distribution P. However, there can be no universal guarantee on the learning rate (Devroye, Györfi, and Lugosi, 1996). That is, for any learning algorithm ĥ_n and any function R(n) that converges to zero arbitrarily slowly, there exists a realizable distribution P such that E[er(ĥ_n)] ≥ R(n) infinitely often.

The three examples above reveal that there are at least three possible universal learning rates. Remarkably, we find that these are the only possibilities. That is, every nontrivial class H is either universally learnable at an exponential rate (but not faster), or is universally learnable at a linear rate (but not faster), or is universally learnable but necessarily with arbitrarily slow rates.

We now summarize the key definitions and main results of the paper. (We refer to Appendix A.4 for the relevant terminology on Polish spaces and measurability.)

To specify the learning problem, we specify a domain X and a concept class H ⊆ {0, 1}^X. We will henceforth assume that X is a Polish space (for example, a Euclidean space, or any countable set) and that H satisfies a minimal measurability assumption specified in Definition 3.3 below. A classifier is a universally measurable function h : X → {0, 1}. Given a probability distribution P on X × {0, 1}, the error rate of a classifier h is defined as

er(h) = er_P(h) := P{(x, y) : h(x) ≠ y}.

The distribution P is called realizable if inf_{h ∈ H} er(h) = 0.

A learning algorithm is a sequence of universally measurable functions

H_n : (X × {0, 1})^n × X → {0, 1},   n ∈ N.
The input data to the learning algorithm is a sequence of independent P-distributed pairs (X_i, Y_i). When acting on this input data, the learning algorithm outputs the data-dependent classifiers

ĥ_n(x) := H_n((X_1, Y_1), . . . , (X_n, Y_n), x).

The objective in the design of a learning algorithm is that the expected error rate E[er(ĥ_n)] of the output concept decays as rapidly as possible as a function of n. The aim of this paper is to characterize what rates of convergence of E[er(ĥ_n)] are achievable. The following definition formalizes this notion of achievable rate in the universal learning model.

Definition 1.4.
Let H be a concept class, and let R : N → [0, 1] with R(n) → 0.

• H is learnable at rate R if there is a learning algorithm ĥ_n such that for every realizable distribution P, there exist C, c > 0 for which E[er(ĥ_n)] ≤ C R(cn) for all n.

• H is not learnable at rate faster than R if for every learning algorithm ĥ_n, there exists a realizable distribution P and C, c > 0 for which E[er(ĥ_n)] ≥ C R(cn) for infinitely many n.
1. For simplicity of exposition, we have stated a definition corresponding to deterministic algorithms, to avoid the notational inconvenience required to formally define randomized algorithms in this context. Our results remain valid when allowing randomized algorithms as well: all algorithms we construct throughout this paper are deterministic, and all lower bounds we prove also hold for randomized algorithms.

Figure 2:
A Littlestone tree of depth 3. Every branch is consistent with a concept h ∈ H. This is illustrated here for one of the branches.

• H is learnable with optimal rate R if H is learnable at rate R and H is not learnable faster than R.

• H requires arbitrarily slow rates if, for every R(n) → 0, H is not learnable faster than R.

Let us emphasize that, unlike in the PAC model, every concept class H is universally learnable in the sense that there exist learning algorithms such that E[er(ĥ_n)] → 0 for every realizable P; see Example 1.3 above. However, a concept class may nonetheless require arbitrarily slow rates, in which case it is impossible for the learner to predict how fast this convergence will take place.

Remark 1.5.
While this is not assumed in the above definition, our lower bound results will in fact prove a stronger claim: namely, that when a given concept class H is not learnable at rate faster than R, the corresponding constants C, c > 0 in the lower bound can be chosen independently of the learning algorithm ĥ_n and concept class H. This is sometimes referred to as a strong minimax lower bound (Antos and Lugosi, 1998).

The following theorem is one of the main results of this work. It expresses a fundamental trichotomy: there are exactly three possibilities for optimal learning rates.

Theorem 1.6.
For every concept class H with |H| ≥ 3, exactly one of the following holds:

• H is learnable with optimal rate e^{−n}.

• H is learnable with optimal rate 1/n.

• H requires arbitrarily slow rates.

A second main result of this work provides a detailed description of which of these three cases any given concept class H satisfies, by specifying complexity measures to distinguish the cases. We begin with the following definition, which is illustrated in Figure 2. Henceforth we define the prefix y_{≤k} := (y_1, . . . , y_k) for any sequence y = (y_1, y_2, . . .).

Definition 1.7. A Littlestone tree for H is a complete binary tree of depth d ≤ ∞ whose internal nodes are labelled by points of X, and whose two edges connecting a node to its children are labelled 0 and 1, such that every finite path emanating from the root is consistent with a concept h ∈ H.

More precisely, a Littlestone tree is a collection

{x_u : 0 ≤ k < d, u ∈ {0, 1}^k} ⊆ X

such that for every y ∈ {0, 1}^d and n < d, there exists h ∈ H so that h(x_{y≤k}) = y_{k+1} for 0 ≤ k ≤ n. We say H has an infinite Littlestone tree if there is a Littlestone tree for H of depth d = ∞.
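For finite classes on a finite domain, the Littlestone dimension associated with these trees (discussed below) can be computed by the textbook recursion over label splits. The sketch below is illustrative only, and the encoding of classifiers as label tuples is an assumption of the example, not part of the paper:

```python
from itertools import product

def ldim(H, X):
    """Littlestone dimension of a finite class H (each h encoded as a
    tuple of 0/1 labels indexed by the points of X), via the recursion
    Ldim(H) = max over x of 1 + min(Ldim(H_x0), Ldim(H_x1)), taken over
    points x on which both label-restrictions of H are nonempty."""
    if len(H) <= 1:
        return 0 if H else -1
    best = 0
    for x in X:
        H0 = frozenset(h for h in H if h[x] == 0)
        H1 = frozenset(h for h in H if h[x] == 1)
        if H0 and H1:  # x is a valid internal node of a Littlestone tree
            best = max(best, 1 + min(ldim(H0, X), ldim(H1, X)))
    return best

# All 0/1 functions on 3 points have Littlestone dimension 3 ...
print(ldim(frozenset(product([0, 1], repeat=3)), range(3)))   # 3
# ... while thresholds 1[x >= t] on the same 3 points give 2.
thresholds = frozenset(tuple(int(x >= t) for x in range(3)) for t in range(4))
print(ldim(thresholds, range(3)))   # 2
```

The threshold value 2 matches the known fact that a chain of N thresholds has Littlestone dimension ⌊log₂ N⌋, even though its VC dimension is 1.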
2. The restriction |H| ≥ 3 rules out certain trivial cases: if |H| = 1 or if H = {h, 1 − h}, then er(ĥ_n) = 0 is trivially achievable for all n. If |H| = 2 but H ≠ {h, 1 − h}, then H is learnable with optimal rate e^{−n} by Example 1.1.

Figure 3:
A VCL tree of depth 3. Every branch is consistent with a concept h ∈ H. This is illustrated here for one of the branches. Due to lack of space, not all external edges are drawn.

The above notion is closely related to the
Littlestone dimension, a fundamentally important quantity in online learning. A concept class H has Littlestone dimension d if it has a Littlestone tree of depth d but not of depth d + 1. When this is the case, classical online learning theory yields a learning algorithm that makes at most d mistakes in classifying any adversarial (as opposed to random) realizable sequence of examples. Along the way to our main results, we will extend the theory of online learning to the following setting: we show in Section 3.1 that the nonexistence of an infinite Littlestone tree characterizes the existence of an algorithm that guarantees a finite (but not necessarily uniformly bounded) number of mistakes for every realizable sequence of examples. Let us emphasize that having an infinite Littlestone tree is not the same as having an unbounded Littlestone dimension: the latter can happen due to the existence of finite Littlestone trees of arbitrarily large depth, which does not imply the existence of any single tree of infinite depth.

Next we introduce a new type of complexity structure, which we term a VC-Littlestone tree. It represents a combination of the structures underlying Littlestone dimension and VC dimension. Though the definition may appear a bit complicated, the intuition is quite simple (see Figure 3).
Definition 1.8. A VCL tree for H of depth d ≤ ∞ is a collection

{x_u ∈ X^{k+1} : 0 ≤ k < d, u ∈ {0, 1}^1 × {0, 1}^2 × · · · × {0, 1}^k}

such that for every n < d and y ∈ {0, 1}^1 × · · · × {0, 1}^{n+1}, there exists a concept h ∈ H so that h(x^i_{y≤k}) = y^i_{k+1} for all 0 ≤ i ≤ k and 0 ≤ k ≤ n, where we denote

y_{≤k} = (y_1, (y^0_2, y^1_2), . . . , (y^0_k, . . . , y^{k−1}_k)),   x_{y≤k} = (x^0_{y≤k}, . . . , x^k_{y≤k}).

We say that H has an infinite VCL tree if it has a VCL tree of depth d = ∞.

A VCL tree resembles a Littlestone tree, except that each node in a VCL tree is labelled by a sequence of k points, where k is the depth of the node (in contrast, every node in a Littlestone tree is labelled by a single point). The branching factor at each node at depth k of a VCL tree is thus 2^k, rather than 2 as in a Littlestone tree. In the language of Vapnik-Chervonenkis theory, this means that along each path in the tree, we encounter shattered sets of size increasing with depth.

With these definitions in hand, we can state our second main result: a complete characterization of the optimal rate achievable for any given concept class H.

Theorem 1.9.
For every concept class H with |H| ≥ 3, the following hold:

• If H does not have an infinite Littlestone tree, then H is learnable with optimal rate e^{−n}.

• If H has an infinite Littlestone tree but does not have an infinite VCL tree, then H is learnable with optimal rate 1/n.

• If H has an infinite VCL tree, then H requires arbitrarily slow rates.

In particular, since Theorem 1.6 follows immediately from Theorem 1.9, the focus of this work will be to prove Theorem 1.9. The proof of this theorem, and many related results, are presented in the remainder of this paper.
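The middle case of the trichotomy can be observed empirically for the threshold class of Example 1.2. In the toy simulation below, the uniform marginal, the target threshold, and the particular ERM rule are all hypothetical choices made for illustration; the measured learning curve decays roughly like 1/n:

```python
import random

random.seed(0)

TARGET = 0.5  # true threshold: label y = 1[x >= 0.5], x uniform on [0, 1]

def erm_threshold(sample):
    # One ERM rule for thresholds: place t at the smallest positive
    # example (any consistent t would do in the realizable setting).
    positives = [x for x, y in sample if y == 1]
    return min(positives) if positives else 1.0

def avg_error(n, trials=500):
    # For t >= TARGET, er(h_t) = P(x in [TARGET, t)) = t - TARGET.
    total = 0.0
    for _ in range(trials):
        xs = [random.random() for _ in range(n)]
        sample = [(x, int(x >= TARGET)) for x in xs]
        total += erm_threshold(sample) - TARGET
    return total / trials

for n in [10, 20, 40, 80]:
    print(n, avg_error(n))  # roughly halves as n doubles
```

For this fixed P the expected error is of order 1/n, and (consistently with Example 1.2) no algorithm can do asymptotically better for all realizable distributions over thresholds.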
We next discuss some technical aspects in the derivation of the trichotomy. We also highlight key differences with the dichotomy of PAC learning theory.
In the uniform setting, the fact that every VC class is PAC learnable is witnessed by any algorithm that outputs a concept h ∈ H that is consistent with the input sample. This is known in the literature as the empirical risk minimization (ERM) principle and follows from the celebrated uniform convergence theorem of Vapnik and Chervonenkis (1971). Moreover, any ERM algorithm achieves the optimal uniform learning rate, up to lower order factors.

In contrast, in the universal setting one has to carefully design the algorithms that achieve the optimal rates. In particular, here the optimal rates are not always achieved by general ERM methods: for example, there are classes where exponential rates are achievable, but where there exist ERM learners with arbitrarily slow rates (see Example 2.6 below). The learning algorithms we propose below are novel in the literature: they are based on the theory of infinite (Gale-Stewart) games, whose connection with learning theory appears to be new in this paper.

As was anticipated in the previous section, a basic building block of our learning algorithms is the solution of analogous problems in adversarial online learning. For example, as a first step towards a statistical learning algorithm that achieves exponential rates, we extend the mistake bound model of Littlestone (1988) to scenarios where it is possible to guarantee a finite number of mistakes for each realizable sequence, but without an a priori bound on the number of mistakes. We show this is possible precisely when H has no infinite Littlestone tree, in which case the resulting online learning algorithm is defined by the winning strategy of an associated Gale-Stewart game.

Unfortunately, while online learning algorithms may be applied directly to random training data, this does not in itself suffice to ensure good learning rates.
The problem is that, although the online learning algorithm is guaranteed to make no mistakes after a finite number of rounds, in the statistical context this number of rounds is a random variable for which we have no control on the variance or tail behavior. We must therefore introduce additional steps to convert such online learning algorithms into statistical learning algorithms. In the case of exponential rates, this will be done by applying the online learning algorithm to several different batches of training examples, which must then be carefully aggregated to yield a classifier that achieves an exponential rate.

The case of linear rates presents additional complications. In this setting, the corresponding online learning algorithm does not eventually stop making mistakes: it is only guaranteed to eventually rule out a finite pattern of labels (which is feasible precisely when H has no infinite VCL tree). Once we have learned to rule out one pattern of labels for every data sequence of length k, the situation becomes essentially analogous to that of a VC class of dimension k − 1. In particular, we can then apply the one-inclusion graph predictor of Haussler, Littlestone, and Warmuth (1994) to classify subsequent data points with a linear rate. When applied to random data, however, both the time it takes for the online algorithm to learn to rule out a pattern, and the length k of that pattern, are random. We must therefore again apply this technique to several different batches of training examples and combine the resulting classifiers with aggregation methods to obtain a statistical learning algorithm that achieves a linear rate.

The proofs of our lower bounds are also significantly more involved than those in PAC learning theory. In contrast to the uniform setting, we are required to produce a single data distribution P for which the given learning algorithm has the claimed lower bound for infinitely many n. To this end, we will apply the probabilistic method by randomizing over both the choice of target labellings for the space, and the marginal distribution on X, coupling these two components of P.

There is a serious technical issue that arises in our theory that gives rise to surprisingly interesting mathematical questions. In order to apply the winning strategies of Gale-Stewart games to random data, we must ensure such strategies are measurable: if this is not the case, our theory may fail spectacularly (see Appendix C). However, nothing appears to be known in the literature about the measurability of Gale-Stewart strategies in nontrivial settings.

That measurability issues arise in learning theory is not surprising, of course; this is also the case in classical PAC learning (Blumer, Ehrenfeucht, Haussler, and Warmuth, 1989; Pestov, 2011). Our basic measurability assumption (Definition 3.3) is also the standard assumption made in this setting (Dudley, 2014).
It turns out, however, that measurability issues in classical learning theory are essentially benign: the only issue that arises there is the measurability of the supremum of the empirical process over H. This can be trivially verified in most practical situations without the need for an abstract theory: for example, measurability of the empirical process is trivial when H is countable, or when H can be pointwise approximated by a countable class. For these reasons, measurability issues in classical learning theory are often considered “a minor nuisance”. The situation in this paper is completely different: it is entirely unclear a priori whether Gale-Stewart strategies are measurable even in apparently trivial cases, such as when H is countable.

We will prove the existence of measurable strategies for a general class of Gale-Stewart games that includes all the ones encountered in this paper. The solution of this problem exploits an interplay between the mathematical and algorithmic aspects of the problem. To construct a measurable strategy, we will explicitly define a strategy by means of a kind of greedy algorithm that aims to minimize in each step a value function that takes values in the ordinal numbers. This construction gives rise to unexpected new notions for learning theory: for example, we will show that the complexity of online learning is characterized by an ordinal notion of Littlestone dimension, which agrees with the classical notion when it is finite. To conclude the proof of measurability, we combine these insights with a deep result of descriptive set theory (the Kunen-Martin theorem) which shows that the Littlestone dimension of a measurable class H is always a countable ordinal.

To conclude the introduction, we briefly review prior work on the subject of universal learning rates.
An extreme notion of learnability in the universal setting is universal consistency: a learning algorithm is universally consistent if E[er(ĥ_n)] → inf_h er(h) for every distribution P. The first proof that universally consistent learning is possible was provided by Stone (1977), using local average estimators, such as those based on k-nearest neighbor predictors, kernel rules, and histogram rules; see (Devroye, Györfi, and Lugosi, 1996) for a thorough discussion of such results. One can also establish universal consistency of learning rules via the technique of structural risk minimization from Vapnik and Chervonenkis (1974). The most general results on universal consistency were recently established by Hanneke (2017) and Hanneke, Kontorovich, Sabato, and Weiss (2019), who proved the existence of universally consistent learning algorithms in any separable metric space. In fact, Hanneke, Kontorovich, Sabato, and Weiss (2019) establish this for even more general spaces, called essentially separable, and prove that the latter property is actually necessary for universal consistency to be possible. An immediate implication of their result is that in such spaces X, choosing H to be the set of all measurable functions, there exists a learning algorithm with E[er(ĥ_n)] → 0 for every realizable P (cf. Example 1.3). In particular, since we assume in this paper that X is Polish (i.e., separably metrizable), this result holds in our setting.

While these results establish that it is always possible to have E[er(ĥ_n)] → 0 for every realizable P, there is a so-called no free lunch theorem showing that it is not generally possible to bound the rate of convergence: that is, the set H of all measurable functions requires arbitrarily slow rates (Devroye, Györfi, and Lugosi, 1996). The proof of this result also extends to more general concept classes: the only property of H that was used in the proof is that it finitely shatters some countably infinite subset of X, that is, there exists X′ = {x_1, x_2, . . .} ⊆ X such that, for every n ∈ N and y_1, . . . , y_n ∈ {0, 1}, there is h ∈ H with h(x_i) = y_i for every i ≤ n. It is natural to wonder whether the existence of such a countable finitely shattered set X′ is also necessary for H to require arbitrarily slow rates. Our main result settles this question in the negative. Indeed, Theorem 1.9 states that the existence of an infinite VCL tree is both necessary and sufficient for a concept class H to require arbitrarily slow rates; but it is possible for a class H to have an infinite VCL tree while it does not finitely shatter any countable set X′ (see Example 2.8 below).

The distinction between exponential and linear rates has been studied by Schuurmans (1997) in some special cases. Specifically, Schuurmans (1997) studied classes H that are concept chains, meaning that every h, h′ ∈ H have either h ≤ h′ everywhere or h′ ≤ h everywhere. For instance, threshold classifiers on the real line (Example 1.2) are a simple example of a concept chain. Since any concept chain H must have VC dimension at most 1, the optimal rates can never be slower than linear (Haussler, Littlestone, and Warmuth, 1994). However, Schuurmans (1997) found that some concept chains are universally learnable at an exponential rate, and gave a precise characterization of when this is the case. Specifically, he established that a concept chain H is learnable at an exponential rate if and only if H is nowhere dense, meaning that there is no infinite subset H′ ⊆ H such that, for every distinct h_1, h_2 ∈ H′ with h_1 ≤ h_2 everywhere, there exists h_3 ∈ H′ \ {h_1, h_2} with h_1 ≤ h_3 ≤ h_2 everywhere.
He also showed that concept chains H failing this property (i.e., that are somewhere dense) are not learnable at rate faster than n^{-(1+ε)} (for any ε > 0). It is not difficult to see that for concept chain classes, the property of being somewhere dense precisely corresponds to the property of having an infinite Littlestone tree, where the above set H′ corresponds to the set of classifiers involved in the definition of the infinite Littlestone tree. Theorem 1.9 therefore recovers the result of Schuurmans (1997) as a very special case, and sharpens his n^{-(1+ε)} general lower bound to a strict linear rate n^{-1}.

Schuurmans (1997) also posed the question of whether his analysis can be extended beyond concept chains: that is, whether there is a general characterization of which classes H are learnable at an exponential rate, versus which classes are not learnable at faster than a linear rate. This question is completely settled by the main results of this paper.

1.6.3 Classes with matching universal and uniform rates

Antos and Lugosi (1998) showed that there exist concept classes for which no improvement on the PAC learning rate is possible in the universal setting. More precisely, they showed that, for any d ∈ ℕ, there exists a concept class H of VC dimension d such that, for any learning algorithm ĥ_n, there exists a realizable distribution P for which E[er(ĥ_n)] ≥ c d/n for infinitely many n, where c is a positive numerical constant. This shows that universal learning rates for some classes tightly match their minimax rates up to a numerical constant factor.

Universal learning rates have also been considered in the context of active learning, under the names true sample complexity or unverifiable sample complexity (Hanneke, 2009, 2012; Balcan, Hanneke, and Vaughan, 2010; Yang and Hanneke, 2013). Active learning is a variant of supervised learning, where the learning algorithm observes only the sequence X_1, X_2, ... of unlabeled examples, and may select which examples X_i to query (which reveals their labels Y_i); this happens sequentially, so that the learner observes the response to a query before selecting its next query point. In this setting, one is interested in characterizing the rate of convergence of E[er(ĥ_n)], where n is the number of queries (i.e., the number of labels observed) as opposed to the sample size. Hanneke (2012) showed that for any VC class H, there is an active learning algorithm ĥ_n such that, for every realizable distribution P, E[er(ĥ_n)] = o(n^{-1}). Note that such a result is certainly not achievable by passive learning algorithms (i.e., the type of learning algorithms discussed in the present work), given the results of Schuurmans (1997) and Antos and Lugosi (1998). The latter also follows from the results of this paper by Example 2.2 below.

Denote by RE(h) the family of distributions P such that er(h) = 0 for a given classifier h ∈ H. Benedek and Itai (1994) considered a partial relaxation of the PAC model, called nonuniform learning, in which the learning rate may depend on h ∈ H but is still uniform over P ∈ RE(h). This setting is intermediate between the PAC setting (where the rate may depend only on n) and the universal learning setting (where the rate may depend fully on P). A concept class H is said to be learnable in the nonuniform learning setting if there exists a learning algorithm ĥ_n such that sup_{P ∈ RE(h)} E[er(ĥ_n)] → 0 as n → ∞ for every h ∈ H. Benedek and Itai (1994) proved that a concept class H is learnable in the nonuniform learning model if and only if H is a countable union of VC classes. In Example 2.7 below, we show that there exist classes H that are universally learnable, even at an exponential rate, but which are not learnable in the nonuniform learning setting. It is also easy to observe that there exist classes H that are countable unions of VC classes (hence nonuniformly learnable) which have an infinite VCL tree (and thus require arbitrarily slow universal learning rates). The universal and nonuniform learning models are therefore incomparable.
2. Examples
In Section 1.3, we introduced three basic examples that illustrate the three possible universal learning rates. In this section we provide further examples. The main aim of this section is to illustrate important distinctions with the uniform setting and other basic concepts in learning theory, which are depicted schematically in Figure 4.
Figure 4:
A Venn diagram depicting the trichotomy and its relation with uniform and universal learnability. While the focus here is on statistical learning, note that this diagram also captures the distinction between uniform and universal online learning; see Section 3.1.
We begin by giving four examples that illustrate that the classical PAC learning model (which is characterized by finite VC dimension) is not comparable to the universal learning model.
Example 2.1 (VC with exponential rate). Consider the class H ⊆ {0,1}^ℕ of all threshold functions h_t(x) = 1[x ≥ t], where t ∈ ℕ. This is a VC class (its VC dimension is 1), which is learnable at an exponential rate (it does not have an infinite Littlestone tree). Note, however, that this class has unbounded Littlestone dimension (it shatters Littlestone trees of arbitrary finite depths), so that it does not admit an online learning algorithm that makes a uniformly bounded number of mistakes.

Example 2.2 (VC with linear rate). Consider the class H ⊆ {0,1}^ℝ of all threshold functions h_t(x) = 1[x ≥ t], where t ∈ ℝ. This is a VC class (its VC dimension is 1) that is not learnable at an exponential rate (it has an infinite Littlestone tree). Thus the optimal rate is linear.

Example 2.3 (Exponential rate but not VC). Let X = ⋃_k X_k be the disjoint union of finite sets with |X_k| = k. For each k, let H_k = {1_S : S ⊆ X_k}, and consider the concept class H = ⋃_k H_k. This class has unbounded VC dimension, yet is universally learnable at an exponential rate. To establish the latter, it suffices to prove that H does not have an infinite Littlestone tree. Indeed, once we fix any root label x ∈ X_k of a Littlestone tree, only h ∈ H_k can satisfy h(x) = 1, and so the hypotheses consistent with the subtree corresponding to h(x) = 1 form a finite class. This subtree can therefore have only finitely many leaves, contradicting the existence of an infinite Littlestone tree.

Example 2.4 (Linear rate but not VC). Consider the disjoint union of the classes of Examples 2.2 and 2.3: that is, X is the disjoint union of ℝ and finite sets X_k with |X_k| = k, and H is the union of the class of all threshold functions on ℝ and the classes H_k = {1_S : S ⊆ X_k}. This class has unbounded VC dimension, yet is universally learnable at a linear rate. To establish the latter, it suffices to note that H has an infinite Littlestone tree as in Example 2.2, but H cannot have an infinite VCL tree. Indeed, once we fix any root label x ∈ X, the class {h ∈ H : h(x) = 1} has finite VC dimension, and thus the corresponding subtree of the VCL tree must be finite.

2.2 Universal learning algorithms versus ERM

The aim of the next two examples is to shed some light on the type of algorithms that can give rise to optimal universal learning rates. Recall that in the PAC model, a concept class is learnable if and only if it can be learned by any ERM (empirical risk minimization) algorithm.
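As a quick sanity check on the VC claims in Examples 2.1–2.4, finite restrictions of these classes can be tested by brute force. The sketch below is ours, not from the paper; `vc_dimension` and the sample classes are purely illustrative.

```python
from itertools import combinations

def vc_dimension(concepts, n_points):
    """Brute-force VC dimension of a finite concept class over the domain
    {0, ..., n_points - 1}.  Each concept is a tuple of labels, with
    concept[x] its value on point x.  Exponential time; illustration only."""
    d = 0
    for size in range(1, n_points + 1):
        shattered = False
        for S in combinations(range(n_points), size):
            patterns = {tuple(h[x] for x in S) for h in concepts}
            if len(patterns) == 2 ** size:   # S is shattered by the class
                shattered = True
                break
        if not shattered:
            return d                         # no larger set can be shattered
        d = size
    return d

# thresholds 1[x >= t] restricted to {0,...,4} (cf. Examples 2.1 and 2.2)
thresholds = [tuple(int(x >= t) for x in range(5)) for t in range(6)]
# the powerset class on a 3-point block X_3 (cf. Example 2.3)
powerset = [tuple(int(x in S) for x in range(3))
            for r in range(4) for S in map(set, combinations(range(3), r))]
```

As expected, the threshold restriction has VC dimension 1 while the powerset block has VC dimension equal to its size, which is why the union in Example 2.3 has unbounded VC dimension.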
The following examples will show that the ERM principle cannot explain the achievable universal learning rates; the algorithms developed in this paper are thus necessarily of a different nature.

An ERM algorithm is any learning rule that outputs a concept in H that minimizes the empirical error. There may in fact be many such hypotheses, and thus there are many inequivalent ERM algorithms. Learnability by means of a general ERM algorithm is equivalent to the Glivenko-Cantelli property: that is, that the empirical errors of all h ∈ H converge simultaneously to the corresponding population errors as n → ∞. The Glivenko-Cantelli property has a uniform variant, in which the convergence rate is uniform over all data distributions P; this property is equivalent to PAC learnability and is characterized by the VC dimension (Vapnik and Chervonenkis, 1971). It also has a universal variant, where the convergence holds for every P but with a distribution-dependent rate; the latter is equivalent to the universal consistency of a general ERM algorithm. A combinatorial characterization of the universal Glivenko-Cantelli property is given by van Handel (2013).

The following example shows that even if a concept class is universally learnable by a general ERM algorithm, this need not yield any control on the learning rate. This is in contrast to the PAC setting, where learnability by means of ERM always implies a linear learning rate.

Example 2.5 (Arbitrarily slow rates but learnable by any ERM). Let X = ℕ and let H be the class of all classifiers on X. This class has an infinite VCL tree and thus requires arbitrarily slow rates; but H is a universal Glivenko-Cantelli class and thus any ERM algorithm is universally consistent.

In contrast, the next example shows that there are scenarios where extremely fast universal learning is achievable, but where a general ERM algorithm can give rise to arbitrarily slow rates.

Example 2.6 (Exponential rate achievable but general ERM arbitrarily slow).
Let X = ⋃_{i∈ℕ} X_i be the disjoint union of finite sets with |X_i| = 2^i. For each i ∈ ℕ, let H_i = {1_I : I ⊆ X_i, |I| ≥ 2^{i−1}}, and consider the concept class H = ⋃_{i∈ℕ} H_i. It follows exactly as in Example 2.3 that H has no infinite Littlestone tree, so that it is universally learnable at an exponential rate.

We claim there exists, for any rate function R(n) → 0, an ERM algorithm that achieves rate slower than R. In the following, we fix any such R, as well as strictly increasing sequences {n_t} and {i_t} satisfying the following: letting p_t = 2^{i_t−2}/n_t, it holds that p_t is decreasing, Σ_{t=1}^∞ p_t ≤ 1, and p_t ≥ 4R(n_t). The reader may verify that such sequences can be constructed by induction on t.

Now consider any ERM with the following property: if the input data (X_1, Y_1), ..., (X_n, Y_n) is such that Y_i = 0 for all i, then the algorithm outputs ĥ_n ∈ H_{i_{T_n}} with

T_n = min{t : there exists h ∈ H_{i_t} such that h(X_1) = ··· = h(X_n) = 0}.

We claim that such an ERM performs poorly on the data distribution P defined by

P{(x, 0)} = 2^{−i_t} p_t for all x ∈ X_{i_t}, t ∈ ℕ,

where we set P{(x′, 0)} = 1 − Σ_{t=1}^∞ p_t for some arbitrary choice of x′ ∉ ⋃_{t∈ℕ} X_{i_t}. Note that all labels equal 0 P-almost surely, so the displayed property of the ERM applies, and that P is realizable, as inf_i er(h_i) ≤ inf_i P{(x, y) : x ∈ X_i} = 0 for any choice of h_i ∈ H_i.

It remains to show that E[er(ĥ_n)] ≥ R(n) for infinitely many n. To this end, note that the expected number of indices j ≤ n_t with X_j ∈ X_{i_t} is n_t p_t = 2^{i_t−2}, so by Markov's inequality, with probability at least 1/2 at most 2^{i_t−1} of the points X_1, ..., X_{n_t} lie in X_{i_t}. On this event, at least 2^{i_t−1} points of X_{i_t} are unobserved, so we must have T_{n_t} ≤ t; as ĥ_{n_t} = 1_I for some I ⊆ X_{i_{T_{n_t}}} with |I| ≥ 2^{i_{T_{n_t}}−1}, and p_t is decreasing,

er(ĥ_{n_t}) ≥ P{(x, 0) : x ∈ I} ≥ (1/2) p_{T_{n_t}} ≥ (1/2) p_t ≥ 2R(n_t).

Thus we have shown that E[er(ĥ_{n_t})] ≥ R(n_t) for all t ∈ ℕ.

2.3 Universal learning versus other learning models

The nonuniform learning model of Benedek and Itai (1994) is intermediate between universal and PAC learning; see Section 1.6.5. Our next example shows that a concept class may not even be learnable in the nonuniform sense, while exhibiting the fastest rate of universal learning.
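Returning to the induction left to the reader in Example 2.6: one concrete way to build the sequences {n_t} and {i_t} is to grow n_t by doubling until the constraints leave room, and then take i_t as small as possible. The sketch below is ours, using the requirement p_t ≥ 4R(n_t) from the error bound and an illustrative rate R(n) = 1/⌈log_2 n⌉ chosen so that all arithmetic stays exact; all names are our own.

```python
from fractions import Fraction

def build_sequences(R, T, i0=2):
    """Inductively construct (n_t, i_t, p_t) for t = 1..T with
    p_t = 2^(i_t - 2)/n_t strictly decreasing, sum p_t <= 1, and
    p_t >= 4 R(n_t), as in Example 2.6.  R maps an int n to a Fraction
    tending to zero.  A sketch of one possible induction, not the paper's."""
    seqs = []
    n_prev, i_prev, p_prev = 1, i0, Fraction(1)
    for t in range(1, T + 1):
        # cap p_t at half the previous value and at 2^-(t+1) for summability
        bound = min(p_prev, Fraction(1, 2 ** t)) / 2
        n = n_prev + 1
        # both conditions are monotone in n, so doubling finds a valid n
        while not (8 * R(n) <= bound and Fraction(2 ** (i_prev - 1), n) <= bound):
            n *= 2
        # smallest admissible i_t guarantees p_t >= 4 R(n_t) and p_t <= bound
        i = i_prev + 1
        while 2 ** (i - 2) < 4 * n * R(n):
            i += 1
        p = Fraction(2 ** (i - 2), n)
        seqs.append((n, i, p))
        n_prev, i_prev, p_prev = n, i, p
    return seqs

# illustrative rate function: R(n) = 1 / ceil(log2 n), exact as a Fraction
R = lambda n: Fraction(1, max(1, n.bit_length()))
seqs = build_sequences(R, 4)
```

The values of n_t grow very quickly (roughly exponentially in 2^t for this R), which reflects how slowly the lower bound E[er(ĥ_{n_t})] ≥ R(n_t) is allowed to decay.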
Example 2.7 (Exponential rate but not nonuniformly learnable). The following class can be learned at an exponential rate, yet it cannot be presented as a countable union of VC classes (and hence it is not learnable in the nonuniform setting by Benedek and Itai, 1994):

X = {S ⊂ ℝ : |S| < ∞},  H = {h_y : y ∈ ℝ},  where h_y(S) = 1[y ∈ S].

We first claim that H has no infinite Littlestone tree: indeed, once we fix a root label S ∈ X of a Littlestone tree, the class {h ∈ H : h(S) = 1} is finite, so the corresponding subtree must be finite. Thus H is universally learnable at an exponential rate. On the other hand, suppose that H were a countable union of VC classes. Then one element of this countable union must contain infinitely many hypotheses (as ℝ is uncountable). This is a contradiction, as any infinite subset {h_y : y ∈ I} ⊆ H with I ⊆ ℝ, |I| = ∞ has unbounded VC dimension (as its dual class is the class of all finite subsets of I).

Our next example is concerned with the characterization of arbitrarily slow rates. As we discussed in Section 1.6.1, a no free lunch theorem of Devroye, Györfi, and Lugosi (1996) shows that a sufficient condition for a class H to require arbitrarily slow rates is that there exists an infinite set X′ ⊆ X finitely shattered by H: that is, there exists X′ = {x_1, x_2, ...} ⊆ X such that, for every n ∈ ℕ and y_1, ..., y_n ∈ {0,1}, there is h ∈ H with h(x_i) = y_i for every i ≤ n. Since our Theorem 1.9 indicates that the existence of an infinite VCL tree is both sufficient and necessary, it is natural to ask how these two conditions relate to each other. It is easy to see that the existence of a finitely shattered infinite set X′ implies the existence of an infinite VCL tree. However, the following example shows that the opposite is not true: that is, there exist classes H with an infinite VCL tree that do not finitely shatter an infinite set X′.
Thus, these conditions are not equivalent, and our Theorem 1.9 provides a strictly weaker condition sufficient for H to require arbitrarily slow rates.

Example 2.8 (No finitely shattered infinite set, but requires arbitrarily slow rates). Consider a countable space X that is itself structured into the nodes of a VCL tree: that is,

X = {x^i_u : k ∈ ℕ ∪ {0}, i ∈ {0, ..., k}, u ∈ {0,1}^1 × {0,1}^2 × ··· × {0,1}^k},

where each x^i_u is a distinct point. Then for each y = (y^0_1, (y^0_2, y^1_2), ..., (y^0_{k+1}, ..., y^k_{k+1}), ...) ∈ {0,1}^1 × {0,1}^2 × ···, define h_y such that every k ∈ ℕ ∪ {0} and i ∈ {0, ..., k} has h_y(x^i_{y≤k}) = y^i_{k+1}, where y≤k denotes the first k components of y, and every x ∈ X \ {x^i_{y≤k} : k ∈ ℕ ∪ {0}, i ∈ {0, ..., k}} has h_y(x) = 0. Then define

H = {h_y : y ∈ {0,1}^1 × {0,1}^2 × ···}.

By construction, this class H has an infinite VCL tree. However, any set S ⊂ X of size at least 2 which is shattered by H must be contained within a single node of the tree. In particular, since any countable set X′ = {x′_1, x′_2, ...} ⊆ X necessarily contains points x′_i, x′_j lying in different nodes of the tree, the set {x′_1, ..., x′_{max{i,j}}} is not shattered by H, so that X′ is not finitely shattered by H.

The previous examples were designed to illustrate the key features of the results of this paper in comparison with other learning models; however, these examples may be viewed as somewhat artificial. To conclude this section, we give two examples of "natural" geometric concept classes that are universally learnable at an exponential rate. This suggests that our theory has direct implications for learning scenarios of the kind that may arise in applications.

Example 2.9 (Nonlinear manifolds). Various practical learning problems are naturally expressed by concepts that indicate whether the data lie on a manifold. The following construction provides one simple way to model classes of nonlinear manifolds.
Let the domain X be any Polish space, and fix a measurable function g : X → ℝ^d with d < ∞. For a given k < ∞, consider the concept class

H = {1[Ag = 0] : A ∈ ℝ^{k×d}}.

The coordinate functions g_1, ..., g_d describe the nonlinear features of the class. For example, if X = ℂ^n and the g_j are polynomials, this model can describe any class of affine algebraic varieties.

We claim that H is universally learnable at an exponential rate. It suffices to show that, in fact, H has finite Littlestone dimension. To see why, fix any Littlestone tree, and consider the branch along which every label is 1; for simplicity, we denote the points along this branch in this example as x_0, x_1, x_2, .... Define

V_j = {A ∈ ℝ^{k×d} : Ag(x_i) = 0 for i = 0, ..., j}.

Each V_j is a finite-dimensional linear space. Now note that if V_j = V_{j−1}, then every h ∈ H such that h(x_i) = 1 for i = 0, ..., j−1 also has h(x_j) = 1; but this is impossible, as the definition of a Littlestone tree requires the existence of h ∈ H such that h(x_i) = 1 for i = 0, ..., j−1 and h(x_j) = 0. Thus the dimension of V_j must decrease strictly in j, so the branch x_0, x_1, x_2, ... must be finite.

Example 2.10 (Positive halfspaces on ℕ^d). It is a classical fact that the class of halfspaces on ℝ^d has finite VC dimension, and it is easy to see that this class has an infinite Littlestone tree. Thus the PAC rate cannot be improved in this setting. The aim of this example is to show that the situation is quite different if one considers positive halfspaces on a lattice ℕ^d: such a class is universally learnable at an exponential rate. This may be viewed as an extension of Example 2.1, which illustrates that some geometric classes on discrete spaces can be universally learned at a much faster rate than geometric classes on continuous spaces (a phenomenon not captured by the PAC model). More precisely, let X = ℕ^d for some d ∈ ℕ, and let H be the class of positive halfspaces:

H = {1[w · x − b ≥ 0] : (w, b) ∈ (0, ∞)^{d+1}}.
We will argue that H is universally learnable at an exponential rate by constructing an explicit learning algorithm guaranteeing a finite number of mistakes on every realizable data sequence. As will be argued in Section 3 below, the existence of such an algorithm immediately implies that H does not have an infinite Littlestone tree. Moreover, we show in Section 4 that such an algorithm can be converted into a learning algorithm achieving exponential rates for all realizable distributions P.

Let S_n ∈ (X × {0,1})^n be any data set consistent with some h ∈ H. If every (x_i, y_i) ∈ S_n has y_i = 0, let ĥ_n(x) = 0 for all x ∈ X. Otherwise, let ĥ_n(x) = 1[x ∈ L({x_i : (x_i, 1) ∈ S_n})], where

L({z_1, ..., z_t}) = { z′ + Σ_{i≤t} α_i z_i : α_i ∈ [0,1], Σ_{i≤t} α_i = 1, z′ ∈ [0,∞)^d }

for any t ∈ ℕ and z_1, ..., z_t ∈ X. In words, L({z_1, ..., z_t}) is the smallest region containing the convex hull of z_1, ..., z_t for which the indicator of the region is non-decreasing in every dimension.

Now consider any sequence {(x_i, y_i)}_{i∈ℕ} in X × {0,1} such that for each n ∈ ℕ, letting S_n = {(x_i, y_i)}_{i=1}^n, there exists h*_n ∈ H with h*_n(x_i) = y_i for all i ≤ n. Since {x : h*_{n+1}(x) = 1} is convex, and h*_{n+1}(x) is non-decreasing in every dimension, we have ĥ_n ≤ h*_{n+1}. This implies that any n ∈ ℕ with ĥ_n(x_{n+1}) ≠ y_{n+1} must have y_{n+1} = 1 and ĥ_n(x_{n+1}) = 0. Therefore, by the definition of L(·), the following must hold for any n with ĥ_n(x_{n+1}) ≠ y_{n+1}: for every i ≤ n such that y_i = 1, there exists a coordinate 1 ≤ j ≤ d such that (x_{n+1})_j < (x_i)_j.

Now suppose, for the sake of obtaining a contradiction, that there is an increasing infinite sequence {n_t}_{t∈ℕ} such that ĥ_{n_t}(x_{n_t+1}) ≠ y_{n_t+1}, and consider a coloring of the infinite complete graph with vertices {x_{n_t+1}}_{t∈ℕ} where every edge {x_{n_t+1}, x_{n_{t′}+1}} with t < t′ is colored with a value min{j : (x_{n_{t′}+1})_j < (x_{n_t+1})_j}. Then the infinite Ramsey theorem implies that there exists an infinite monochromatic clique: that is, a value j ≤ d and an infinite subsequence {n_{t_i}} with (x_{n_{t_i}+1})_j strictly decreasing in i. This is a contradiction, since any strictly decreasing sequence (x_{n_{t_i}+1})_j with x_{n_{t_i}+1} ∈ X = ℕ^d can have length at most (x_{n_{t_1}+1})_j + 1, which is finite. Therefore, the learning algorithm ĥ_n makes at most a finite number of mistakes on any such sequence {(x_i, y_i)}_{i∈ℕ}. Let us note, however, that there can be no uniform bound on the number of mistakes (independent of the specific sequence {(x_i, y_i)}_{i∈ℕ}), since the Littlestone dimension of H is infinite.
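The algorithm of Example 2.10 can be simulated with a simplified region in place of L(·): the union of the coordinate-wise upper sets of the observed positive examples, which is contained in L(·) and hence in the target halfspace (as w > 0), and for which the same finite-mistake argument goes through (every mistake on a positive point is below each recorded positive in some coordinate). The sketch below is ours; class and variable names are illustrative.

```python
from itertools import product

def dominates(x, z):
    """True if x >= z coordinate-wise."""
    return all(xj >= zj for xj, zj in zip(x, z))

class MonotoneHalfspaceLearner:
    """Online learner for positive halfspaces on N^d.

    Predicts 1 iff the point dominates some previously recorded positive
    example; the recorded set grows only on mistakes, and each recorded
    point is coordinate-wise incomparable with the earlier ones -- the
    situation the Ramsey argument of Example 2.10 rules out infinitely
    often."""
    def __init__(self):
        self.positives = []
    def predict(self, x):
        return 1 if any(dominates(x, z) for z in self.positives) else 0
    def update(self, x, y):
        if y == 1 and self.predict(x) == 0:   # mistake on a positive point
            self.positives.append(x)

# demo on the positive halfspace x1 + x2 >= 3 over the grid {0,...,5}^2
learner = MonotoneHalfspaceLearner()
mistakes = 0
stream = list(product(range(6), repeat=2))
for x in stream:
    y = 1 if x[0] + x[1] >= 3 else 0
    if learner.predict(x) != y:
        mistakes += 1
    learner.update(x, y)
```

On this stream the learner errs exactly on the four minimal positive points (0,3), (1,2), (2,1), (3,0), and afterwards classifies every grid point correctly; it never predicts 1 on a negative point, since the upper set of any recorded positive lies inside the halfspace.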
3. The adversarial setting
Before we proceed to the main topic of this paper, we introduce a simpler adversarial analogue of our learning problem. The strategies that arise in this adversarial setting form a key ingredient of the statistical learning algorithms that will appear in our main results. At the same time, it motivates us to introduce a number of important concepts that play a central role in the sequel.
Let X be a set, and let the concept class H be a collection of indicator functions h : X → {0,1}. We consider an online learning problem defined as a game between the learner and an adversary. The game is played in rounds. In each round t ≥ 1:

• The adversary chooses a point x_t ∈ X.
• The learner predicts a label ŷ_t ∈ {0,1}.
• The adversary reveals the true label y_t = h(x_t) for some function h ∈ H that is consistent with the previous label assignments h(x_1) = y_1, ..., h(x_{t−1}) = y_{t−1}.

The learner makes a mistake in round t if ŷ_t ≠ y_t. The goal of the learner is to make as few mistakes as possible, and the goal of the adversary is to cause as many mistakes as possible. The adversary need not choose a target concept h ∈ H in advance, but must ensure that the sequence {(x_t, y_t)}_{t=1}^∞ is realizable by H in the sense that for all T ∈ ℕ there exists h ∈ H such that h(x_t) = y_t for all t ≤ T. That is, each prefix {(x_t, y_t)}_{t=1}^T must be consistent with some h ∈ H.

We say that the concept class H is online learnable if there is a strategy

ŷ_t = ŷ_t(x_1, y_1, ..., x_{t−1}, y_{t−1}, x_t)

that makes only finitely many mistakes, regardless of what realizable sequence {(x_t, y_t)}_{t=1}^∞ is presented by the adversary.

The above notion of learnability may be viewed as a universal analogue of the uniform mistake bound model of Littlestone (1988), which asks when there exists a strategy that is guaranteed to make at most d < ∞ mistakes on any input. Littlestone showed that this is the case if and only if H has no Littlestone tree of depth d + 1. Here we ask only that the strategy makes a finite number of mistakes on any input, without placing a uniform bound on the number of mistakes. The main result of this section shows that this property is fully characterized by the existence of infinite Littlestone trees. Let us recall that Littlestone trees were defined in Definition 1.7.
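For finite classes, the Littlestone dimension appearing in the uniform mistake bound model can be computed directly from its defining recursion. The brute-force sketch below is ours (exponential time, for illustration only; names are our own).

```python
def littlestone_dimension(concepts, n_points):
    """Littlestone dimension of a finite class over {0, ..., n_points - 1},
    via the recursion Ldim(H) = max over points x splitting H of
    1 + min(Ldim(H_{x=0}), Ldim(H_{x=1})), with Ldim(H) = 0 when H has at
    most one concept.  Each concept is a tuple of labels."""
    def ldim(H):
        if len(H) <= 1:
            return 0
        best = 0
        for x in range(n_points):
            H0 = frozenset(h for h in H if h[x] == 0)
            H1 = frozenset(h for h in H if h[x] == 1)
            if H0 and H1:                    # x is a valid root label
                best = max(best, 1 + min(ldim(H0), ldim(H1)))
        return best
    return ldim(frozenset(concepts))

# thresholds 1[x >= t] on {0,...,6}: VC dimension 1, yet Littlestone
# dimension 3, growing with the domain size (the restriction-to-a-finite-
# domain picture behind Example 2.1's unbounded Littlestone dimension)
thresholds = [tuple(int(x >= t) for x in range(7)) for t in range(8)]
# all 2^3 labelings of a 3-point set: a complete mistake tree of depth 3
all_functions = [tuple(int(b) for b in format(m, '03b')) for m in range(8)]
```

For the threshold restriction the value is ⌊log_2(number of thresholds)⌋, reflecting the binary-search structure of the optimal mistake tree.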
Theorem 3.1.
For any concept class H, we have the following dichotomy.

• If H does not have an infinite Littlestone tree, then there is a strategy for the learner that makes only finitely many mistakes against any adversary.
• If H has an infinite Littlestone tree, then there is a strategy for the adversary that forces any learner to make a mistake in every round.

In particular, H is online learnable if and only if it has no infinite Littlestone tree.

A proof of this theorem is given in the next section. The proof uses classical results from the theory of infinite games; see Appendix A.1 for a review of the relevant notions.
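To make the first half of the dichotomy concrete, here is a minimal mistake-driven learner (ours, not from the paper) for the threshold class H = {1[x ≥ t] : t ∈ ℕ} of Example 2.1, which has no infinite Littlestone tree: it predicts 1 only when forced by an example already known to be positive, and updates its state only when it errs.

```python
class ThresholdOnlineLearner:
    """Mistake-driven online learner for H = { 1[x >= t] : t in N } on X = N.

    Predict 1 only when some point already known to be positive lies at or
    below x (every consistent h in H must then label x with 1).  The state
    changes only on mistakes, and each mistake strictly decreases the
    smallest known positive point; since that point is a natural number, any
    realizable sequence produces finitely many mistakes -- though with no
    uniform bound, matching the unbounded Littlestone dimension of H."""
    def __init__(self):
        self.pos = None                      # smallest point known positive
    def predict(self, x):
        return 1 if (self.pos is not None and self.pos <= x) else 0
    def update(self, x, y):
        if self.predict(x) != y:             # on realizable data, here y = 1
            self.pos = x if self.pos is None else min(self.pos, x)

# adversarial demo, realizable for h_5(x) = 1[x >= 5]
learner = ThresholdOnlineLearner()
mistakes = 0
for x, y in [(10, 1), (7, 1), (5, 1), (3, 0), (6, 1), (5, 1)]:
    if learner.predict(x) != y:
        mistakes += 1
    learner.update(x, y)
```

On this sequence the recorded positive point decreases 10 → 7 → 5 across three mistakes, after which every prediction is correct; at most x + 1 mistakes can ever occur once a first positive point x is seen.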
Let us now view the online learning game from a different perspective that fits better into the framework of classical game theory. For x_1, ..., x_t ∈ X and y_1, ..., y_t ∈ {0,1}, consider the class

H_{x_1,y_1,...,x_t,y_t} := {h ∈ H : h(x_1) = y_1, ..., h(x_t) = y_t}

of hypotheses that are consistent with x_1, y_1, ..., x_t, y_t. An adversary who tries to maximize the number of mistakes the learner makes will choose a sequence of x_t, y_t with y_t ≠ ŷ_t for as many initial rounds in a row as possible. In other words, the adversary tries to keep H_{x_1,1−ŷ_1,...,x_t,1−ŷ_t} ≠ ∅ as long as possible. When this set would become empty (for every possible x_t), however, the only consistent choice of label is y_t = ŷ_t, so the learner makes no mistakes from that point onwards.

This motivates defining the following game G. There are two players: P_A and P_L. In each round τ:

• Player P_A chooses a point ξ_τ ∈ X and shows it to Player P_L.
• Then, Player P_L chooses a label η_τ ∈ {0,1}.

Player P_L wins the game in round τ if H_{ξ_1,η_1,...,ξ_τ,η_τ} = ∅. Player P_A wins the game if the game continues indefinitely. In other words, the set of winning sequences for P_L is

W = {(ξ, η) ∈ (X × {0,1})^∞ : H_{ξ_1,η_1,...,ξ_τ,η_τ} = ∅ for some 1 ≤ τ < ∞}.

This set of sequences W is finitely decidable in the sense that the membership of (ξ, η) in W is witnessed by a finite subsequence. Thus the above game is a Gale-Stewart game (cf. Appendix A.1). In particular, by Theorem A.1, exactly one of P_A and P_L has a winning strategy in this game.

The game G is intimately connected to the definition of Littlestone trees: an infinite Littlestone tree is nothing other than a winning strategy for P_A, expressed in a slightly different language.
Player P_A has a winning strategy in the Gale-Stewart game G if and only if H has an infinite Littlestone tree.

Proof
Suppose H has an infinite Littlestone tree, for which we adopt the notation of Definition 1.7. Define a strategy for P_A by ξ_τ(η_1, ..., η_{τ−1}) = x_{η_1,...,η_{τ−1}} (cf. Remark A.4). The definition of a Littlestone tree implies that H_{ξ_1,η_1,...,ξ_τ,η_τ} ≠ ∅ for every η ∈ {0,1}^∞ and τ < ∞; that is, this strategy is winning for P_A. Conversely, suppose P_A has a winning strategy, and define the infinite tree T = {x_u : 0 ≤ k < ∞, u ∈ {0,1}^k} by

x_{η_1,...,η_{τ−1}} := ξ_τ(η_1, ..., η_{τ−1}).

The tree T is an infinite Littlestone tree by the definition of a winning strategy for the game G.

We are now ready to prove Theorem 3.1.

Proof of Theorem 3.1
Assume H has an infinite Littlestone tree {x_u}. The adversary may play the following strategy: in round t, choose x_t = x_{y_1,...,y_{t−1}}, and after the learner reveals her prediction ŷ_t, choose y_t = 1 − ŷ_t. By the definition of a Littlestone tree, y_t is consistent with H regardless of the learner's prediction. This strategy for the adversary in the online learning problem forces any learner to make a mistake in every round.

Now suppose H has no infinite Littlestone tree. Then P_L has a winning strategy η_τ(ξ_1, ..., ξ_τ) in the Gale-Stewart game G (cf. Remark A.4). If we were to know a priori that the adversary always forces an error when possible, then the learner could use this strategy directly with x_t = ξ_t and ŷ_t = 1 − η_t to ensure she only makes finitely many mistakes. To extend this conclusion to an arbitrary adversary, we design our learning algorithm so that the Gale-Stewart game proceeds to the next round only when the learner makes a mistake. More precisely, we introduce the following learning algorithm.

• Initialize τ ← 1 and f(x) ← η_1(x).
• In every round t ≥ 1:
  - Predict ŷ_t = 1 − f(x_t).
  - If ŷ_t ≠ y_t, let ξ_τ ← x_t, f(x) ← η_{τ+1}(ξ_1, ..., ξ_τ, x), and τ ← τ + 1.

This algorithm can only make a finite number of mistakes against any adversary. Indeed, suppose that some adversary forces the learner to make an infinite number of mistakes at times t_1, t_2, .... By the definition of G, however, we would then have H_{x_{t_1},y_{t_1},...,x_{t_k},y_{t_k}} = ∅ for some k < ∞. This violates the rules of the online learning game, because the sequence {(x_t, y_t)}_{t=1}^{t_k} would not be consistent with H.

The learning algorithm from the previous section solves the adversarial online learning problem. It is also a basic ingredient in the algorithm that achieves exponential rates in the probabilistic setting (Section 4 below). However, in passing from the adversarial setting to the probabilistic setting, we encounter nontrivial difficulties. While the existence of winning strategies is guaranteed by the Gale-Stewart theorem, this result does not say anything about the complexity of these strategies. In particular, it is perfectly possible that the learning algorithm of the previous section is nonmeasurable, in which case its naive application in the probabilistic setting can readily yield nonsensical results (cf. Appendix C).

It is therefore essential to impose sufficient regularity assumptions so that the winning strategies in the Gale-Stewart game G are measurable. This issue proves to be surprisingly subtle: almost nothing appears to be known in the literature regarding the measurability of Gale-Stewart strategies. We therefore develop a rather general result of this kind, Theorem B.1 in Appendix B, that suffices for all the purposes of this paper.

Definition 3.3.
A concept class H of indicator functions h : X → {0,1} on a Polish space X is said to be measurable if there is a Polish space Θ and a Borel-measurable map h : Θ × X → {0,1} so that H = {h(θ, ·) : θ ∈ Θ}.

In other words, H is measurable when it can be parameterized in any reasonable way. This is the case for almost any H encountered in practice. The Borel isomorphism theorem (Cohn, 1980, Theorem 8.3.6) implies that we would obtain an identical definition if we required only that Θ is a Borel subset of a Polish space.

Remark 3.4. Definition 3.3 is well-known in the literature: this is the standard measurability assumption made in empirical process theory, where it is usually called the image admissible Suslin property, cf. (Dudley, 2014, Section 5.3).

Our basic measurability result is the following corollary of Theorem B.1.
Corollary 3.5.
Let X be Polish and H be measurable. Then the Gale-Stewart game G of the previous section has a universally measurable winning strategy. In particular, the learning algorithm of Theorem 3.1 is universally measurable.

Proof
The conclusion follows from Theorem B.1 once we verify that the set W of winning sequencesfor P L in G is coanalytic (see Appendix A.4 for the relevant terminology and basic properties ofPolish spaces and analytic sets). To this end, we write its complement as W c = { ( ξ , η ) ∈ ( X × { , } ) ∞ : H ξ ,η ,...,ξ τ ,η τ = ∅ for all τ < ∞} = \ ≤ τ< ∞ [ θ ∈ Θ \ ≤ t ≤ τ { ( ξ , η ) ∈ ( X × { , } ) ∞ : h ( θ, x t ) = η t } . The set { ( θ, ξ , η ) : h ( θ, ξ i ) = η i } is Borel by the measurability assumption. Moreover, both inter-sections in the above expression are countable, while the union corresponds to the projection of aBorel set. The set W c is therefore analytic.That a nontrivial measurability assumption is needed in the first place is not obvious: one mighthope that it suffices to simply require that every concept h ∈ H is measurable. Unfortunately, thisis not the case. In Appendix C, we describe a nonmeasurable concept class on X = [0 ,
1] such that each h ∈ H is the indicator of a countable set. In this example, the set W of winning sequences is nonmeasurable: thus one cannot even give meaning to the probability that the game is won when it is played with random data. In such a situation, the analysis in the following sections does not make sense. Thus Corollary 3.5, while technical, is essential for the theory developed in this paper. It is perhaps not surprising that some measurability issues arise in our setting, as this is already the case in classical PAC learning theory (Blumer, Ehrenfeucht, Haussler, and Warmuth, 1989; Pestov, 2011). Definition 3.3 is the standard assumption that is made in this setting (Dudley, 2014). However, the only issue that arises in the classical setting is the measurability of the supremum of the empirical process over H. This is essentially straightforward: for example, measurability is trivial when H is countable, or can be pointwise approximated by a countable class. The latter already captures many classes encountered in practice. For these reasons, measurability issues in classical learning theory are often considered “a minor nuisance”. The measurability problem for Gale-Stewart strategies is much more subtle, however, and cannot be taken for granted. For example, we do not know of a simpler proof of Theorem B.1 in the setting of Corollary 3.5 even when the class H is countable. Further discussion may be found in Appendix C. In its classical form, the Gale-Stewart theorem (Theorem A.1) is a purely existential statement: it states the existence of winning strategies. To actually implement learning algorithms from such strategies, however, one would need to explicitly describe them. Such an explicit description is constructed as part of the measurability proof of Theorem B.1 on the basis of a refined notion of dimension for concept classes that is of interest in its own right.
The aim of this section is to briefly introduce the relevant ideas in the context of the online learning problem; see the proof of Theorem B.1 for more details. (The content of this section is not used elsewhere in the text.) It is instructive to begin by recalling the classical online learning strategy (Littlestone, 1988). The Littlestone dimension of H is defined as the largest depth of a Littlestone tree for H (if H is empty, then its dimension is −1). If the Littlestone dimension d of H is finite, then there is a strategy for P_L in the game G that wins at the latest in round d + 1. This winning strategy is built using the following observation. Observation 3.6.
Assume that the Littlestone dimension d of H is finite and that H is nonempty. Then for every x ∈ X, there exists y ∈ {0, 1} such that the Littlestone dimension of H_{x,y} is strictly less than that of H. Proof
If both H_{x,0} and H_{x,1} have a Littlestone tree of depth d (say t_0, t_1, respectively), then H has a Littlestone tree of depth d + 1: take x as the root and attach t_0, t_1 as its subtrees. The winning strategy for P_L is now evident: as long as player P_L always chooses y_t so that the Littlestone dimension of H_{x_1,y_1,…,x_t,y_t} is smaller than that of H_{x_1,y_1,…,x_{t−1},y_{t−1}}, then P_L will win in at most d + 1 rounds. At first sight, it appears that this strategy does not make much sense in our setting. Though we assume that H has no infinite Littlestone tree, it may have finite Littlestone trees of arbitrarily large depth. In this case the classical Littlestone dimension is infinite, so a naive implementation of the above strategy fails. Nonetheless, the key idea behind the proof of Theorem B.1 is that an appropriate extension of Littlestone's strategy works in the general setting. The basic observation is that the notion “infinite Littlestone dimension” may be considerably refined: we can extend the classical notion to capture precisely “how infinite” the Littlestone dimension is. With this new definition in hand, the winning strategy for P_L will be exactly the same as in the case of finite Littlestone dimension. The Littlestone dimension may not just be a natural number, but rather an ordinal, which turns out to be precisely the correct way to measure the “number of steps to victory”. A brief introduction to ordinals and their role in game theory is given in Appendix A.2. Our extension of the Littlestone dimension uses the notion of rank, which assigns an ordinal to every finite Littlestone tree. The rank is defined by a partial order ≺: let us write t′ ≺ t if t′ is a Littlestone tree that extends t by one level, namely, t is obtained from t′ by removing its leaves. A Littlestone tree t is minimal if it cannot be extended to a Littlestone tree of larger depth. In this case, we say rank(t) = 0.
For non-minimal trees, we define rank(t) by transfinite recursion:

rank(t) = sup { rank(t′) + 1 : t′ ≺ t }.

If rank(t) = d is finite, then the largest Littlestone tree that extends t has d additional levels. The classical Littlestone dimension is d ∈ N if and only if rank(∅) = d. Rank is well-defined as long as H has no infinite Littlestone tree. The crucial point is that when H has no infinite tree, ≺ is well-founded (i.e., there are no infinite decreasing chains in ≺), so that every finite Littlestone tree t appears in the above recursion. For more details, see Appendix A.3.
The ordinal Littlestone dimension of H is defined as:

LD(H) := −1 if H is empty; Ω if H has an infinite Littlestone tree; rank(∅) otherwise.

When H has no infinite Littlestone tree, we can construct a winning strategy for P_L in the same manner as in the case of finite Littlestone dimension. An extension of Observation 3.6 states that for every x ∈ X, there exists y ∈ {0, 1} so that LD(H_{x,y}) < LD(H). The intuition behind this extension is the same as in the finite case, but its proof is more technical (cf. Proposition B.8). The strategy for P_L is now chosen so that LD(H_{x_1,y_1,…,x_t,y_t}) decreases in every round. This strategy ensures that P_L wins in a finite number of rounds, because ordinals do not admit an infinite decreasing chain. The idea that dimension can be an ordinal may appear a bit unusual. The meaning of this notion is quite intuitive, however, as is best illustrated by means of some simple examples. Recall that we have already shown above that when LD(H) < ω is finite (ω denotes the smallest infinite ordinal), the ordinal Littlestone dimension coincides with the classical Littlestone dimension.

Example 3.8 (Disjoint union of finite-dimensional classes). Partition X = N into disjoint intervals X_1, X_2, X_3, … with |X_k| = k. For each k, let H_k be the class of indicators of all subsets of X_k. Let H = ⋃_k H_k. We claim that LD(H) = ω. Indeed, as soon as we select a root vertex x ∈ X_k for a Littlestone tree, we can only grow the Littlestone tree for k − 1 more levels, so that rank({x}) = k − 1 for x ∈ X_k. By definition, rank(∅) = sup { rank({x}) + 1 : x ∈ X } = ω.

Example 3.9 (Thresholds on N). Let X = N and consider the class of thresholds H = { x ↦ 1_{x ≤ z} : z ∈ N }. As in the previous example, we claim that LD(H) = ω. Indeed, as soon as we select a root vertex x ∈ X for a Littlestone tree, we can grow the Littlestone tree for at most x − 1 more levels (there is no h ∈ H and distinct points y_1, …, y_x such that h(x) = 0 and h(y_1) = ··· = h(y_x) = 1). On the other hand, we can grow a Littlestone tree of depth of order log(x), by repeatedly choosing labels in each level that bisect the intervals between the labels chosen in the previous level. It follows that rank(∅) = sup { rank({x}) + 1 : x ∈ X } = ω.

Example 3.10 (Thresholds on Z). Let X = Z and consider the class of thresholds H = { x ↦ 1_{x ≤ z} : z ∈ Z }. In this case, LD(H) = ω + 1. As soon as we select a root vertex x ∈ X, the class H_{x,1} is essentially the same as the threshold class from the previous example. It follows that rank({x}) = ω for every x ∈ X. Consequently, rank(∅) = ω + 1.

Example 3.11 (Union of partitions). Let X = [0, 1) and, for each k, let H_k be the class of indicators of dyadic intervals of length 2^{−k} (which partition X). Let H = ⋃_k H_k. In this example, LD(H) = ω + 1. Indeed, consider a Littlestone tree t = {x_∅, x_0, x_1} of depth two. The class H_{x_∅,1,x_0,1} consists of indicators of those dyadic intervals that contain both x_∅ and x_0. There is only a finite number of such intervals, because |x_∅ − x_0| > 0. Thus rank(t) < ω for any Littlestone tree of depth two. On the other hand, one may grow a Littlestone tree of arbitrary depth for any choice of root x_∅: the class H_{x_∅,1} is an infinite sequence of nested intervals, which is essentially the same as in Example 3.9; and H_{x_∅,0} has a subclass that is essentially the same as H itself. Thus, rank({x_∅}) = ω for every x_∅ ∈ X. Consequently, rank(∅) = ω + 1.

By inspecting these examples, a common theme emerges. A class of finite Littlestone dimension is one whose Littlestone trees are of bounded depth. A class with LD(H) = ω has arbitrarily large finite Littlestone trees, but the maximal depth of a Littlestone tree is fixed once the root node has been selected.

Footnote 3. It may appear somewhat confusing that t′ ≺ t although t′ is larger than t as a tree. The reason is that we order trees by how far they may be extended, and t′ can be extended less far than t.
Footnote 4. Here we borrow Cantor's notation Ω for the absolute infinite: a number larger than every ordinal number.
Similarly, a class with LD(H) = ω + k for k < ω has arbitrarily large finite Littlestone trees, but the maximal depth of a Littlestone tree is fixed once its first k + 1 levels have been selected. There are also higher ordinals such as LD(H) = ω + ω; this means that the choice of root of the tree determines an arbitrarily large finite number k, such that the maximal depth of the tree is fixed after the next k levels have been selected. For further examples in a more general context, we refer to Appendix A.3 and to the lively discussion in (Evans and Hamkins, 2014) of game values in infinite chess. In any case, the above examples illustrate that the notion of ordinal Littlestone dimension is not only intuitive, but also computable in concrete situations. While only small infinite ordinals appear in the above examples, there exist concept classes such that LD(H) is an arbitrarily large ordinal (as in the proof of Lemma C.3). There is no general
upper bound on the ordinal Littlestone dimension. However, a key part of the proof of Theorem B.1 is the remarkable fact that for measurable classes H in the sense of Definition 3.3, the Littlestone dimension can be at most a countable ordinal: LD(H) < ω_1 (Lemma B.7). Thus any concept class that one is likely to encounter in practice gives rise to a relatively simple learning strategy.

Footnote 5. The results in Appendix B are formulated in the setting of general Gale-Stewart games. When specialized to the game G of Section 3.2, the reader may readily verify that the game value defined in Section B.2 is precisely val(x_1, y_1, …, x_t, y_t) = LD(H_{x_1,y_1,…,x_t,y_t}).
4. Exponential rates
Sections 4 and 5 are devoted to the proof of Theorem 1.9, which is the main result of this paper. The aim of the present section is to characterize when exponential rates do and do not occur; the analogous questions for linear rates will be studied in the next section. Let us recall that the basic definitions of this paper are stated in section 1.4; they will be freely used in the following without further comment. In particular, the following setting and assumptions will be assumed throughout Sections 4 and 5. We fix a Polish space X and a concept class H ⊆ {0, 1}^X satisfying the measurability assumption of Definition 3.3. To avoid trivialities, we always assume that |H| >
2. The learner is presented with an i.i.d. sequence of samples (X_1, Y_1), (X_2, Y_2), … drawn from an unknown distribution P on X × {0, 1}. We will always assume that P is realizable. We start by characterizing which classes H are learnable at an exponential rate. Theorem 4.1. If H does not have an infinite Littlestone tree, then H is learnable with optimal rate e^{−n}. The theorem consists of two parts: we need to prove an upper bound and a lower bound on the rate. The latter (already established by Schuurmans, 1997) is straightforward, so we present it first.
Lemma 4.2 (Schuurmans (1997)). For any learning algorithm ĥ_n, there exists a realizable distribution P such that E[er(ĥ_n)] ≥ 2^{−n−2} for infinitely many n. In particular, this means H is not learnable at rate faster than exponential: R(n) = e^{−n}. Proof As |H| >
2, we can choose h_1, h_2 ∈ H and x, x′ ∈ X such that h_1(x) = h_2(x) =: y and h_1(x′) ≠ h_2(x′). Now fix any learning algorithm ĥ_n. Define two distributions P_0, P_1, where each P_i{(x, y)} = 1/2 and P_i{(x′, i)} = 1/2. Let I ∼ Bernoulli(1/2), and conditioned on I let (X_1, Y_1), (X_2, Y_2), … be i.i.d. P_I, where (X_1, Y_1), …, (X_n, Y_n) are the training set for ĥ_n. Then

E[P(ĥ_n(X_{n+1}) ≠ Y_{n+1} | {(X_t, Y_t)}_{t=1}^n, I)] ≥ (1/2) P(X_1 = ··· = X_n = x, X_{n+1} = x′) = 2^{−n−2}.

Moreover,

E[P(ĥ_n(X_{n+1}) ≠ Y_{n+1} | {(X_t, Y_t)}_{t=1}^n, I)] = (1/2) Σ_{i ∈ {0,1}} E[P(ĥ_n(X_{n+1}) ≠ Y_{n+1} | {(X_t, Y_t)}_{t=1}^n, I = i) | I = i].

Since the average is bounded by the max, we conclude that for each n, there exists i_n ∈ {0, 1} such that for (X_1, Y_1), …, (X_n, Y_n) i.i.d. P_{i_n}, E[er_{P_{i_n}}(ĥ_n)] ≥ 2^{−n−2}. In particular, by the pigeonhole principle, there exists i ∈ {0, 1} such that i_n = i infinitely often, so that E[er_{P_i}(ĥ_n)] ≥ 2^{−n−2} infinitely often. The main challenge in the proof of Theorem 4.1 is constructing a learning algorithm that achieves exponential rate for every realizable P. We assume in the remainder of this section that H has no infinite Littlestone tree. Theorem 3.1 and Corollary 3.5 yield the existence of a sequence of universally measurable functions Ŷ_t : (X × {0,1})^{t−1} × X → {0, 1} that solve the online learning problem from Section 3.1. Define the data-dependent classifier

ŷ_{t−1}(x) := Ŷ_t(X_1, Y_1, …, X_{t−1}, Y_{t−1}, x).

Our first observation is that this adversarial algorithm is also applicable in the probabilistic setting.
Lemma 4.3. P{er(ŷ_t) > 0} → 0 as t → ∞. Proof As P is realizable, we can choose a sequence of hypotheses h_k ∈ H so that er(h_k) ≤ 2^{−k}. For every t ≥
1, a union bound gives

Σ_k P{h_k(X_s) ≠ Y_s for some s ≤ t} ≤ t Σ_k er(h_k) < ∞.

By Borel-Cantelli, with probability one, there exists for every t ≥ 1 some h ∈ H such that h(X_s) = Y_s for all s ≤ t. In other words, with probability one X_1, Y_1, X_2, Y_2, … defines a valid input sequence for the online learning problem of Section 3.1. Because we chose a winning strategy, the time of the last mistake

T = sup{ s ≥ 1 : ŷ_{s−1}(X_s) ≠ Y_s }

is a random variable that is finite with probability one. Now recall from the proof of Theorem 3.1 that the online learning algorithm was chosen so that ŷ_t only changes when a mistake is made. In particular, ŷ_s = ŷ_t for all s ≥ t ≥ T. By the law of large numbers,

P{er(ŷ_t) = 0} = P{ lim_{S→∞} (1/S) Σ_{s=t+1}^{t+S} 1[ŷ_t(X_s) ≠ Y_s] = 0 } ≥ P{ lim_{S→∞} (1/S) Σ_{s=t+1}^{t+S} 1[ŷ_t(X_s) ≠ Y_s] = 0, T ≤ t } = P{T ≤ t}.

It follows that P{er(ŷ_t) > 0} ≤ P{T > t} → 0 as t → ∞. Lemma 4.3 certainly shows that E[er(ŷ_t)] → 0 as t → ∞. Thus the online learning algorithm yields a consistent algorithm in the statistical setting. This, however, does not yield any bound on the learning rate. We presently build a new algorithm on the basis of ŷ_t that guarantees an exponential learning rate. As a first observation, suppose we knew a number t* so that P{er(ŷ_{t*}) > 0} < 1/8. Then we could output ĥ_n with exponential rate as follows. First, break up the data X_1, Y_1, …, X_n, Y_n into ⌊n/t*⌋ batches, each of length t*. Second, compute the classifier ŷ_{t*} separately for each batch. Finally, choose ĥ_n to be the majority vote among these classifiers. Now, by the definition of t* and Hoeffding's inequality, the probability that more than one third of the classifiers has positive error is exponentially small.
It follows that the majority vote ĥ_n has zero error except on an event of exponentially small probability. The problem with this idea is that t* depends on the unknown distribution P, so we cannot assume it is known to the learner. Thus our final algorithm proceeds in two stages: first, we construct an estimate t̂_n for t* from the data; and then we apply the above majority algorithm with batch size t̂_n. Lemma 4.4. There exist universally measurable t̂_n = t̂_n(X_1, Y_1, …, X_n, Y_n), whose definition does not depend on P, so that the following holds. Given t* such that P{er(ŷ_{t*}) > 0} ≤ 1/8, there exist C, c > 0 independent of n (but depending on P, t*) so that P{t̂_n ∈ T_good} ≥ 1 − Ce^{−cn}, where T_good := { 1 ≤ t ≤ t* : P{er(ŷ_t) > 0} ≤ 1/4 }. Proof
For each 1 ≤ t ≤ ⌊n/4⌋ and 1 ≤ i ≤ ⌊n/2t⌋, let

ŷ_t^i(x) := Ŷ_{t+1}(X_{(i−1)t+1}, Y_{(i−1)t+1}, …, X_{it}, Y_{it}, x)

be the learning algorithm from Section 3.1 that is trained on batch i of the data. For each t, the classifiers (ŷ_t^i)_{i ≤ ⌊n/2t⌋} are trained on subsamples of the data that are independent of each other and of the second half (X_s, Y_s)_{s > n/2} of the data. Thus (ŷ_t^i)_{i ≤ ⌊n/2t⌋} may be viewed as independent draws from the distribution of ŷ_t. We now estimate P{er(ŷ_t) > 0} by the fraction of ŷ_t^i that make an error on the second half of the data:

ê_t := (1/⌊n/2t⌋) Σ_{i=1}^{⌊n/2t⌋} 1{ŷ_t^i(X_s) ≠ Y_s for some n/2 < s ≤ n}.

Define t̂_n := inf{ t ≤ ⌊n/4⌋ : ê_t < 3/16 }, with the convention inf ∅ = ∞. Now, fix t* as in the statement of the lemma. By Hoeffding's inequality,

P{t̂_n > t*} ≤ P{ê_{t*} ≥ 3/16} ≤ P{ê_{t*} − E[ê_{t*}] ≥ 1/16} ≤ e^{−⌊n/2t*⌋/128}.

In other words, t̂_n ≤ t* except with exponentially small probability. In addition, by continuity, there exists ε > 0 so that for every 1 ≤ t ≤ t* with P{er(ŷ_t) > 0} > 1/4, we have P{er(ŷ_t) > ε} > 3/16 + 1/32. Fix 1 ≤ t ≤ t* with P{er(ŷ_t) > 0} > 1/4 (if such a t exists). By Hoeffding's inequality,

P{ (1/⌊n/2t⌋) Σ_{i=1}^{⌊n/2t⌋} 1[er(ŷ_t^i) > ε] ≤ 3/16 } ≤ e^{−⌊n/2t*⌋/512}.

Now, if f is any classifier so that er(f) > ε, then

P{f(X_s) ≠ Y_s for some n/2 < s ≤ n} ≥ 1 − (1 − ε)^{n/2}.

Therefore, as (ŷ_t^i)_{i ≤ ⌊n/2t⌋} are independent of (X_s, Y_s)_{s > n/2}, applying a union bound conditionally on (X_s, Y_s)_{s ≤ n/2} shows that, except with probability at most ⌊n/2⌋(1 − ε)^{n/2}, every classifier ŷ_t^i with er(ŷ_t^i) > ε makes an error on the second half of the sample; on this event, ê_t is at least the fraction of classifiers with er(ŷ_t^i) > ε. Combining these bounds with a union bound over t, we obtain

P{t̂_n ∉ T_good} ≤ e^{−⌊n/2t*⌋/128} + t*⌊n/2⌋(1 − ε)^{n/2} + t* e^{−⌊n/2t*⌋/512}.

The right-hand side is bounded by Ce^{−cn} for some C, c > 0.
Corollary 4.5. H has at most exponential learning rate. Proof
We adopt the notations in the proof of Lemma 4.4. The output ĥ_n of our final learning algorithm is the majority vote of the classifiers ŷ_{t̂_n}^i for 1 ≤ i ≤ ⌊n/2t̂_n⌋. We aim to show that E[er(ĥ_n)] ≤ Ce^{−cn} for some constants C, c > 0. Fix t ∈ T_good. By Hoeffding's inequality,

P{ (1/⌊n/2t⌋) Σ_{i=1}^{⌊n/2t⌋} 1[er(ŷ_t^i) > 0] > 1/2 } ≤ e^{−⌊n/2t*⌋/8}.

In other words, except on an event of exponentially small probability, we have er(ŷ_t^i) = 0 for a majority of indices i. By a union bound, we obtain

P{ er(ŷ_{t̂_n}^i) > 0 for a majority of indices i ≤ ⌊n/2t̂_n⌋ } ≤ P{t̂_n ∉ T_good} + P{ for some t ∈ T_good, er(ŷ_t^i) > 0 for a majority of indices i ≤ ⌊n/2t⌋ } ≤ Ce^{−cn} + t* e^{−⌊n/2t*⌋/8}.

In words, except on an event of exponentially small probability, er(ŷ_{t̂_n}^i) = 0 for a majority of indices i. It follows that the majority vote of these classifiers is a.s. correct on a random sample from P. That is, we have shown

P{er(ĥ_n) > 0} ≤ Ce^{−cn} + t* e^{−⌊n/2t*⌋/8}.

The conclusion follows because E[er(ĥ_n)] ≤ P{er(ĥ_n) > 0}. We showed in the previous section that if H has no infinite Littlestone tree, then it can be learned by an algorithm whose rate decays exponentially fast. What is the fastest rate when H has an infinite Littlestone tree? The following result implies a significant drop in the rate: the rate is never faster than linear. Theorem 4.6. If H has an infinite Littlestone tree, then for any learning algorithm ĥ_n, there exists a realizable distribution P such that E[er(ĥ_n)] ≥ 1/(32n) for infinitely many n. In particular, this means H is not learnable at rate faster than 1/n. The proof of Theorem 4.6 uses the probabilistic method. We define a distribution on realizable distributions P with the property that for every learning algorithm, E[er(ĥ_n)] ≥ 1/(32n) infinitely often with positive probability over the choice of P. The main idea of the proof is to concentrate P on a random branch of the infinite Littlestone tree. As any finite set of examples will only explore an initial segment of the chosen branch, the algorithm cannot know whether the random branch continues to the left or to the right after this initial segment. This ensures that the algorithm makes a mistake with probability 1/2 when it is presented with a point that lies deeper along the branch than the training data. The details follow. Proof of Theorem 4.6
Fix any learning algorithm with output ĥ_n, and an infinite Littlestone tree t = { x_u : u ∈ {0,1}^k, 0 ≤ k < ∞ } for H. Let y = (y_1, y_2, …) be an i.i.d. sequence of Bernoulli(1/2) variables. Define the (random) distribution P_y on X × {0,1} by

P_y{(x_{y_{≤k}}, y_{k+1})} = 2^{−k−1} for k ≥ 0.

The map y ↦ P_y is measurable, so no measurability issues arise below. For every n < ∞, there exists h ∈ H so that h(x_{y_{≤k}}) = y_{k+1} for 0 ≤ k ≤ n. Hence

er_y(h) := P_y{ (x, y) ∈ X × {0,1} : h(x) ≠ y } ≤ 2^{−n−1}.

Letting n → ∞, we find that P_y is realizable for every y. Now let (X, Y), (X_1, Y_1), (X_2, Y_2), … be i.i.d. samples drawn from P_y. Then we can write

X = x_{y_{≤T}}, Y = y_{T+1}, X_i = x_{y_{≤T_i}}, Y_i = y_{T_i+1},

where T, T_1, T_2, … are i.i.d. Geometric(1/2) (starting at 0) random variables independent of y. On the event {T = k, max{T_1, …, T_n} < k}, the value ĥ_n(X) is conditionally independent of y_{k+1} given X, (X_1, Y_1), …, (X_n, Y_n), and (again on this event) the corresponding conditional distribution of y_{k+1} is Bernoulli(1/2) (since it is independent from y_1, …, y_k and X, X_1, …, X_n). We therefore have

P{ĥ_n(X) ≠ Y, T = k, max{T_1, …, T_n} < k}
= P{ĥ_n(X) ≠ y_{k+1}, T = k, max{T_1, …, T_n} < k}
= E[ P{ĥ_n(X) ≠ y_{k+1} | X, (X_1, Y_1), …, (X_n, Y_n)} 1_{T = k, max{T_1, …, T_n} < k} ]
= (1/2) P{T = k, max{T_1, …, T_n} < k} = 2^{−k−2} (1 − 2^{−k})^n.

Choose k_n := ⌈log_2 n⌉, so that 2^{−k_n−2} ≥ 1/(8n) and (1 − 2^{−k_n})^n ≥ (1 − 1/n)^n ≥ 1/4 for n ≥ 2. Then

E[ lim sup_{n→∞} n P{ĥ_n(X) ≠ Y, T = k_n | y} ] ≥ lim sup_{n→∞} n P{ĥ_n(X) ≠ Y, T = k_n, max{T_1, …, T_n} < k_n} > 1/32;

Fatou's lemma applies as (almost surely) n P{ĥ_n(X) ≠ Y, T = k_n | y} ≤ n P{T = k_n} = n 2^{−k_n−1} ≤ 1/2. Because

P{ĥ_n(X) ≠ Y, T = k_n | y} ≤ P{ĥ_n(X) ≠ Y | y} = E[er_y(ĥ_n) | y] a.s.,

we have E[lim sup_{n→∞} n E[er_y(ĥ_n) | y]] > 1/32 > 0, which implies there must exist a realization of y such that E[er_y(ĥ_n) | y] > 1/(32n) infinitely often. Choosing P = P_y for this realization of y concludes the proof. The following proposition summarizes some of the main findings of this section.
Proposition 4.7.
The following are equivalent.
1. H is learnable at an exponential rate, but not faster.
2. H does not have an infinite Littlestone tree.
3. There is an “eventually correct” learning algorithm for H, that is, a learning algorithm that outputs ĥ_n so that P{er(ĥ_n) > 0} → 0 as n → ∞.
4. There is an “eventually correct” learning algorithm for H with exponential rate, that is, P{er(ĥ_n) > 0} ≤ Ce^{−cn}, where C, c > 0 may depend on P.
Proof
The implication 2 ⇒ 4 follows from the proof of Theorem 4.1, the implication 4 ⇒ 3 is trivial, and the implication 3 ⇒ 2 follows from the proof of Theorem 4.6. The equivalence of 1 and 2 then follows from Theorem 4.1, Lemma 4.2, and Theorem 4.6.
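The batch-and-majority-vote construction used in the proofs above can be sketched as follows (a minimal illustration under toy assumptions: `train_base` is a memorizing stand-in for the online learner ŷ_t, and the deterministic sample is ours; the actual algorithm additionally estimates the batch size t̂_n from data as in Lemma 4.4):

```python
from collections import Counter

def train_base(batch):
    """Toy stand-in for the per-batch classifier: memorize the batch and
    fall back on the batch's majority label for unseen points."""
    table = dict(batch)
    default = Counter(y for _, y in batch).most_common(1)[0][0]
    return lambda x: table.get(x, default)

def majority_vote(sample, batch_size):
    """Split the sample into disjoint batches, train one classifier per
    batch, and predict by majority vote: the aggregation step that turns
    an 'eventually correct' learner into an exponential-rate one."""
    batches = [sample[i:i + batch_size]
               for i in range(0, len(sample) - batch_size + 1, batch_size)]
    classifiers = [train_base(b) for b in batches]
    return lambda x: int(2 * sum(c(x) for c in classifiers) > len(classifiers))

# deterministic toy data: 6 passes over the domain {0,1,2,3}, target = parity
sample = [(x, x % 2) for _ in range(6) for x in range(4)]
h = majority_vote(sample, batch_size=4)
```

Here every batch classifier happens to be correct, so the vote is correct everywhere; in the analysis above it suffices that a majority of batches have zero error, which holds except with exponentially small probability.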
5. Linear rates
In section 4 we characterized concept classes that have exponential learning rates. We also showed that a concept class that does not have exponential learning rate cannot be learned at a rate faster than linear. The aim of this section is to characterize concept classes that have linear learning rate. Moreover, we show that classes that do not have linear learning rate must have arbitrarily slow rates. This completes our characterization of all possible learning rates. To understand the basic idea behind the characterization of linear rates, it is instructive to revisit the idea that gave rise to exponential rates. First, we showed that it is possible to design an online learning algorithm that achieves perfect prediction after a finite number of rounds. While we do not have a priori control of how fast this “eventually correct” algorithm attains perfect prediction, a modification of the adversarial strategy converges at an exponentially fast rate. To attain a linear rate, we once again design an online algorithm. However, rather than aim for perfect prediction, we now set the more modest goal of learning just to rule out some finite-length patterns in the data. Specifically, we aim to identify a collection of forbidden classification patterns, so that for some finite k, every (x_1, …, x_k) ∈ X^k has some forbidden pattern in {0,1}^k; call this a VC pattern class. If we can identify such a collection of patterns with the property that we will almost surely never observe one of these forbidden patterns in the data sequence, then we can approach the learning problem in a manner analogous to learning with a VC class.
The situation is not quite this simple, since we do not actually have a family of classifiers; fortunately, however, the classical one-inclusion graph prediction strategy of Haussler, Littlestone, and Warmuth (1994) is able to operate purely on the basis of the finite patterns on the data, and hence can be applied to yield the claimed linear rate once the forbidden patterns have been identified. In order to achieve an overall linear learning rate, it then remains to modify the “eventually correct” algorithm so it attains a VC pattern class at an exponentially fast rate when it is trained on random data, using ideas analogous to the ones that were already used in section 4. Throughout this section, we adopt the same setting and assumptions as in section 4.
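As a toy illustration of this definition (our own example, not from the paper): for the threshold class x ↦ 1[x ≤ z], every pair of points already carries a forbidden label pattern, so k = 2 witnesses the VC pattern class property. A sketch of the defining check:

```python
from itertools import product

def is_vc_pattern_class(domain, k, forbidden):
    """Defining property of a VC pattern class: every k-tuple of points
    carries at least one forbidden label pattern in {0,1}^k.
    `forbidden(xs)` is a stand-in returning the patterns ruled out on xs."""
    return all(forbidden(xs) for xs in product(domain, repeat=k))

def forbidden_for_thresholds(pair):
    """For thresholds x -> 1[x <= z]: if a smaller point gets label 0
    (i.e. z is below it), every larger point must get label 0 as well."""
    x1, x2 = pair
    if x1 < x2:
        return {(0, 1)}
    if x1 > x2:
        return {(1, 0)}
    return {(0, 1), (1, 0)}  # equal points cannot receive different labels
```

Since every pair has a forbidden pattern, at most 3 of the 4 label patterns are realizable on any pair, which is exactly the VC-type restriction the learning argument exploits.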
We begin presently by developing the online learning algorithm associated to linear rates. The construction will be quite similar to the one in Section 3.2. However, in the present setting, the notion of a Littlestone tree is replaced by a Vapnik-Chervonenkis-Littlestone (VCL) tree, which was defined in Definition 1.8 (cf. Figure 3). In words, a VCL tree is defined by the following properties. Each vertex of depth k is labelled by a sequence of k + 1 variables in X. Its out-degree is 2^{k+1}, and each of these 2^{k+1} edges is uniquely labeled by an element of {0,1}^{k+1}. A class H has an infinite VCL tree if every finite root-to-vertex path is realized by a function in H. In particular, if H has an infinite VCL tree then it has an infinite Littlestone tree (the other direction does not hold). Remark 5.1. Some features of Definition 1.8 are somewhat arbitrary, and the reader should not read undue meaning into them. We will ultimately be interested in whether or not H has an infinite VCL tree. That the size of the sets x_u grows linearly with the depth of the tree is not important; it would suffice to assume that each x_u is a finite set, and that the sizes of these sets are unbounded along each infinite branch. Thus we have significant freedom in how to define the term “VCL tree”. The present canonical choice was made for concreteness. Just as we have seen for Littlestone trees in Section 3.2, a VCL tree is associated with the following game V. In each round τ:

• Player P_A chooses points ξ_τ = (ξ_τ^0, …, ξ_τ^{τ−1}) ∈ X^τ.
• Player P_L chooses labels η_τ = (η_τ^0, …, η_τ^{τ−1}) ∈ {0,1}^τ.
• Player P_L wins the game in round τ if H_{ξ_1,η_1,…,ξ_τ,η_τ} = ∅.

Here we have naturally extended to the present setting the notation

H_{ξ_1,η_1,…,ξ_τ,η_τ} := { h ∈ H : h(ξ_s^i) = η_s^i for 0 ≤ i < s, 1 ≤ s ≤ τ }

that we used previously in Section 3.2. The game V is a Gale-Stewart game, because the winning condition for P_L is finitely decidable. Lemma 5.2.
If H has no infinite VCL tree, then there is a universally measurable winning strategy for P_L in the game V. Proof
By the same reasoning as in Lemma 3.2, the class H has an infinite VCL tree if and only if P_A has a winning strategy in V. Thus if H has no infinite VCL tree, then P_L has a winning strategy by Theorem A.1. To obtain a universally measurable strategy, it suffices by Theorem B.1 to show that the set of winning sequences for P_L is coanalytic. The proof of this fact is identical to that of Corollary 3.5. When H has no infinite VCL tree, we can use the winning strategy for P_L to design an algorithm that learns to rule out some patterns in the data. We say that a sequence (x_1, y_1, x_2, y_2, …) ∈ (X × {0,1})^∞ is consistent with H if for every t < ∞, there exists h ∈ H such that h(x_s) = y_s for all s ≤ t. Assuming H has no infinite VCL tree, we now use the game V to design an algorithm that learns to rule out some pattern of labels in such a sequence. To this end, denote by η_τ : Π_{σ=1}^τ X^σ → {0,1}^τ the universally measurable winning strategy for P_L provided by Lemma 5.2 (cf. Remark A.4).

• Initialize τ_0 ← 1.
• At every time step t ≥ 1:
  ⊲ If η_{τ_{t−1}}(ξ_1, …, ξ_{τ_{t−1}−1}, (x_{t−τ_{t−1}+1}, …, x_t)) = (y_{t−τ_{t−1}+1}, …, y_t), let ξ_{τ_{t−1}} ← (x_{t−τ_{t−1}+1}, …, x_t) and τ_t ← τ_{t−1} + 1.
  ⊲ Otherwise, let τ_t ← τ_{t−1}.

In words, the algorithm traverses the input sequence (x_1, y_1, x_2, y_2, …) while using the assumed winning strategy η_τ to learn a set of “forbidden patterns” of length τ_t; that is, an assignment which maps every tuple x′ ∈ X^{τ_t} to a pattern y′(x′) ∈ {0,1}^{τ_t} such that after some finite number of steps,
the algorithm never encounters the pattern indicated by y′(x′) when reading the next τ_t examples x′ in the input sequence. Let us denote by

ŷ_{t−1}(z_1, …, z_{τ_{t−1}}) := η_{τ_{t−1}}(ξ_1, …, ξ_{τ_{t−1}−1}, (z_1, …, z_{τ_{t−1}}))

the “pattern avoidance function” defined by this algorithm.

Footnote 6. Given such a tree, we can always engineer a tree as in Definition 1.8 in two steps. First, by passing to a subtree, we can ensure that the cardinalities of x_u are strictly increasing along each branch. Second, we can throw away some points in each set x_u together with the corresponding subtrees to obtain a tree as in Definition 1.8. Lemma 5.3.
For any sequence x_1, y_1, x_2, y_2, … that is consistent with H, the algorithm learns, in a finite number of steps, to successfully rule out patterns in the data. That is,

ŷ_{t−1}(x_{t−τ_{t−1}+1}, …, x_t) ≠ (y_{t−τ_{t−1}+1}, …, y_t), τ_t = τ_{t−1} < ∞, ŷ_t = ŷ_{t−1}

for all sufficiently large t. Proof
Suppose ŷ_{t−1}(x_{t−τ_{t−1}+1}, …, x_t) = (y_{t−τ_{t−1}+1}, …, y_t) occurs at an infinite sequence of times t = t_1, t_2, … Because η_τ is a winning strategy for P_L in the game V, we have H_{ξ_1,η_1,…,ξ_k,η_k} = ∅ for some k < ∞, where ξ_i = (x_{t_i−τ_{t_i−1}+1}, …, x_{t_i}) and η_i = (y_{t_i−τ_{t_i−1}+1}, …, y_{t_i}). But this contradicts the assumption that the input sequence is consistent with H. Remark 5.4.
The strategy τ_t depends in a universally measurable way on x_{≤t}, y_{≤t}. The map ŷ_t(·) is universally measurable jointly as a function of x_{≤t}, y_{≤t} and of its input. More precisely, for each t ≥
0, there exist universally measurable functions

T_t : (X × {0,1})^t → {1, …, t + 1},  Ŷ_t : (X × {0,1})^t × ( ⋃_{s ≤ t+1} X^s ) → ⋃_{s ≤ t+1} {0,1}^s

such that

τ_t = T_t(x_1, y_1, …, x_t, y_t), ŷ_t(z_1, …, z_{τ_t}) = Ŷ_t(x_1, y_1, …, x_t, y_t, z_1, …, z_{τ_t}). Remark 5.5.
The above learning algorithm uses the winning strategy for P_L in the game V. In direct analogy to Section 3.4, one can construct an explicit winning strategy in terms of a notion of “ordinal VCL dimension” whose definition can be read off from the proof of Theorem B.1. Because the details will not be needed for our purposes here, we omit further discussion. In this section we design a learning algorithm with linear learning rate for classes with no infinite VCL trees.
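The control flow of the pattern-avoidance algorithm of the previous subsection can be sketched as follows (a schematic only: `eta` is a stub standing in for the winning strategy η_τ of the game V, and the data stream is a toy):

```python
def pattern_avoidance(stream, eta):
    """Run the pattern-avoidance loop: whenever the currently predicted
    forbidden pattern on the last tau points is realized by the data,
    record the window as the next xi and grow the pattern length by one."""
    xis, tau = [], 1
    xs, ys = [], []
    for x, y in stream:
        xs.append(x)
        ys.append(y)
        if len(xs) < tau:
            continue
        window = tuple(xs[-tau:])
        if eta(tuple(xis), window) == tuple(ys[-tau:]):
            xis.append(window)  # xi_{tau_{t-1}} <- last tau points
            tau += 1            # tau_t <- tau_{t-1} + 1
    return xis, tau

# stub strategy that always forbids the all-ones pattern
eta = lambda xis, window: (1,) * len(window)
xis, tau = pattern_avoidance([(i, 1) for i in range(5)], eta)   # all labels 1
```

With all-ones data the stub's forbidden pattern is hit at every step, so τ grows once per round; on a sequence consistent with H, Lemma 5.3 guarantees the true strategy can be caught out only finitely often, after which the pattern avoidance function is fixed.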
Theorem 5.6. If $\mathcal H$ does not have an infinite VCL tree, then $\mathcal H$ is learnable at rate $\frac1n$.

The proof of this theorem is similar in spirit to that of Theorem 4.1, but requires some additional ingredients. Let us fix a realizable distribution $P$ and let $(X_1,Y_1),(X_2,Y_2),\dots$ be i.i.d. samples from $P$. We assume in the remainder of this section that $\mathcal H$ has no infinite VCL tree, so that we can run the algorithm of the previous section on the random data. We set
\[
\tau_t := T_t(X_1,Y_1,\dots,X_t,Y_t),\qquad
\hat y_t(z_1,\dots,z_{\tau_t}) := \hat Y_t(X_1,Y_1,\dots,X_t,Y_t,z_1,\dots,z_{\tau_t}),
\]
where the universally measurable functions $T_t, \hat Y_t$ are the ones defined in Remark 5.4. For any integer $k \ge 1$ and function $g : \mathcal X^k \to \{0,1\}^k$, define the error
\[
\mathrm{per}(g) = \mathrm{per}_k(g) = P^{\otimes k}\{(x_1,y_1,\dots,x_k,y_k) : g(x_1,\dots,x_k) = (y_1,\dots,y_k)\}
\]
to be the probability that $g$ fails to avoid the pattern of labels realized by the data. (The index $k$ can be understood from the domain of $g$.)

Lemma 5.7. $\mathbf P\{\mathrm{per}(\hat y_t) > 0\} \to 0$ as $t \to \infty$.

Proof
We showed in the proof of Lemma 4.3 that the random data sequence $X_1,Y_1,X_2,Y_2,\dots$ is a.s. consistent with $\mathcal H$. Thus Lemma 5.3 implies that
\[
T = \sup\{s \ge 1 : \hat y_{s-1}(X_{s-\tau_{s-1}+1},\dots,X_s) = (Y_{s-\tau_{s-1}+1},\dots,Y_s)\}
\]
is finite a.s., and that $\hat y_s = \hat y_t$ and $\tau_s = \tau_t$ for all $s \ge t \ge T$. By the law of large numbers for $m$-dependent sequences,
\begin{align*}
\mathbf P\{\mathrm{per}_{\tau_t}(\hat y_t) = 0\}
&= \mathbf P\Big\{\lim_{S\to\infty} \tfrac1S \textstyle\sum_{s=t+1}^{t+S} \mathbf 1_{\hat y_t(X_s,\dots,X_{s+\tau_t-1}) = (Y_s,\dots,Y_{s+\tau_t-1})} = 0\Big\} \\
&\ge \mathbf P\Big\{\lim_{S\to\infty} \tfrac1S \textstyle\sum_{s=t+1}^{t+S} \mathbf 1_{\hat y_t(X_s,\dots,X_{s+\tau_t-1}) = (Y_s,\dots,Y_{s+\tau_t-1})} = 0,\ T \le t\Big\}
= \mathbf P\{T \le t\}.
\end{align*}
As $T$ is finite with probability one, it follows that $\mathbf P\{\mathrm{per}_{\tau_t}(\hat y_t) > 0\} \le \mathbf P\{T > t\} \to 0$ as $t \to \infty$.

Lemma 5.7 ensures that we can learn to rule out patterns in the data. Once we have ruled out patterns in the data, we can learn using the resulting "VC pattern class" using (in a somewhat non-standard manner) the one-inclusion graph prediction algorithm of Haussler, Littlestone, and Warmuth (1994). That algorithm was originally designed for learning with VC classes of classifiers, but fortunately its operations only rely on the projection of the class to the set of finite realizable patterns on the data, and therefore its behavior and analysis are equally well-defined and valid when we have only a VC pattern class, rather than a VC class of functions.
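The notion of a VC pattern class can be made concrete in a tiny finite setting. The following sketch is our own illustration, not from the paper (`pattern_class` and `vc_dim` are hypothetical helper names): given a pattern avoidance function $g$ and points $x_1,\dots,x_n$, it enumerates by brute force the class $F_x$ of labelings on which $g$ successfully avoids the realized pattern, and computes its VC dimension, which Lemma 5.8 below asserts is below $t$.

```python
from itertools import product, permutations, combinations

def pattern_class(g, xs, t):
    """F_x: all labelings f of {0,...,n-1} such that g avoids the pattern
    (f(i_1),...,f(i_t)) on every tuple of pairwise distinct indices."""
    n = len(xs)
    return [f for f in product((0, 1), repeat=n)
            if all(g(tuple(xs[i] for i in idx)) != tuple(f[i] for i in idx)
                   for idx in permutations(range(n), t))]

def vc_dim(F, n):
    """VC dimension of a finite class F of labelings of {0,...,n-1}:
    the largest r such that some r-subset of indices is shattered."""
    d = 0
    for r in range(1, n + 1):
        for S in combinations(range(n), r):
            if len({tuple(f[i] for i in S) for f in F}) == 2 ** r:
                d = r
    return d
```

For instance, the constant avoider $g(z_1,\dots,z_k) = (0,\dots,0)$ with $t = 2$ yields the class of labelings of three points with at most one zero, whose VC dimension is $1 < t$. This is exponential-time in $n$, so it is only a conceptual check, not part of any learner.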
Lemma 5.8.
Let $g : \mathcal X^t \to \{0,1\}^t$ be a universally measurable function for some $t \ge 1$. For every $n \ge t$, there is a universally measurable function $\hat Y^g_n : (\mathcal X \times \{0,1\})^{n-1} \times \mathcal X \to \{0,1\}$ such that, for every $(x_1,y_1,\dots,x_n,y_n) \in (\mathcal X \times \{0,1\})^n$ that satisfies $g(x_{i_1},\dots,x_{i_t}) \ne (y_{i_1},\dots,y_{i_t})$ for all pairwise distinct $1 \le i_1,\dots,i_t \le n$, we have
\[
\frac{1}{n!} \sum_{\sigma \in \mathrm{Sym}(n)} \mathbf 1_{\hat Y^g_n(x_{\sigma(1)},y_{\sigma(1)},\dots,x_{\sigma(n-1)},y_{\sigma(n-1)},x_{\sigma(n)}) \ne y_{\sigma(n)}} < \frac{t}{n},
\]
where $\mathrm{Sym}(n)$ denotes the symmetric group (of permutations of $[n]$).

Proof
Fix $n \ge t$ and let $\bar{\mathcal X} = \{1,\dots,n\}$. In the following, $F \subseteq \{0,1\}^{\bar{\mathcal X}}$ denotes a set of hypotheses $f : \bar{\mathcal X} \to \{0,1\}$. Applying (Haussler, Littlestone, and Warmuth, 1994, Theorem 2.3(ii)) with $\bar x = (1,\dots,n)$ yields a function $A : 2^{\{0,1\}^{\bar{\mathcal X}}} \times (\bar{\mathcal X} \times \{0,1\})^{n-1} \times \bar{\mathcal X} \to \{0,1\}$ such that
\[
\frac{1}{n!} \sum_{\sigma \in \mathrm{Sym}(n)} \mathbf 1_{A(F,\sigma(1),f(\sigma(1)),\dots,\sigma(n-1),f(\sigma(n-1)),\sigma(n)) \ne f(\sigma(n))} \le \frac{\mathrm{vc}(F)}{n}
\]
for any $f \in F$ and $F \subseteq \{0,1\}^{\bar{\mathcal X}}$, where $\mathrm{vc}(F)$ denotes the VC dimension of $F$. Moreover, by construction $A$ is covariant under relabeling of $\bar{\mathcal X}$, that is,
\[
A(F,\sigma(1),y_1,\dots,\sigma(n-1),y_{n-1},\sigma(n)) = A(F \circ \sigma, 1, y_1, \dots, n-1, y_{n-1}, n)
\]
for all permutations $\sigma$, where $F \circ \sigma := \{f \circ \sigma : f \in F\}$. The domain of $A$ is a finite set, so the function $A$ is trivially measurable.

Given any input sequence $(x_1,y_1,\dots,x_n,y_n)$, define the concept class $F_x$ as the collection of all $f \in \{0,1\}^{\bar{\mathcal X}}$ so that $g(x_{i_1},\dots,x_{i_t}) \ne (f(i_1),\dots,f(i_t))$ for all pairwise distinct $1 \le i_1,\dots,i_t \le n$. Define the classifier
\[
\hat Y^g_n(x_1,y_1,\dots,x_{n-1},y_{n-1},x_n) := A(F_x, 1, y_1, \dots, n-1, y_{n-1}, n).
\]
As $g$ is universally measurable, the classifier $\hat Y^g_n$ is also universally measurable. Moreover, as $A$ is covariant and as $F_{x_{\sigma(1)},\dots,x_{\sigma(n)}} = F_{x_1,\dots,x_n} \circ \sigma$, we have
\[
\hat Y^g_n(x_{\sigma(1)},y_{\sigma(1)},\dots,x_{\sigma(n-1)},y_{\sigma(n-1)},x_{\sigma(n)})
= A(F_x, \sigma(1), y_{\sigma(1)}, \dots, \sigma(n-1), y_{\sigma(n-1)}, \sigma(n)).
\]
Now suppose that the input sequence $(x_1,y_1,\dots,x_n,y_n)$ satisfies the assumption of the lemma. The function $y(i) := y_i$ satisfies $y \in F_x$ by the definition of $F_x$. It therefore follows that for any such sequence
\[
\frac{1}{n!} \sum_{\sigma \in \mathrm{Sym}(n)} \mathbf 1_{\hat Y^g_n(x_{\sigma(1)},y_{\sigma(1)},\dots,x_{\sigma(n-1)},y_{\sigma(n-1)},x_{\sigma(n)}) \ne y_{\sigma(n)}} \le \frac{\mathrm{vc}(F_x)}{n}.
\]
Finally, by construction, $\mathrm{vc}(F_x) < t$.

7. If $Z_1, Z_2, \dots$ is an i.i.d. sequence of random variables, then we have $\lim_{n\to\infty} \frac1n \sum_{i=1}^{n} f(Z_{i+1},\dots,Z_{i+m}) = \frac1m \sum_{i=1}^{m} \lim_{n\to\infty} \frac{m}{n} \sum_{j=0}^{\lfloor n/m\rfloor} f(Z_{mj+1+i},\dots,Z_{m(j+1)+i}) + o(1) = \mathbf E[f(Z_1,\dots,Z_m)]$ a.s. by the law of large numbers.

Remark 5.9.
Below we choose the function $g$ in Lemma 5.8 to be the one generated by the algorithm from the previous section. By Remark 5.4, the resulting function is universally measurable jointly in the training data and the function input. It follows from the proof of Lemma 5.8 that in such a situation, $\hat Y^g_n$ is also universally measurable jointly in the training data and the function input.

We are now ready to outline our final learning algorithm. Lemma 5.7 guarantees the existence of some $t^*$ such that $\mathbf P\{\mathrm{per}(\hat y_{t^*}) > 0\} \le \frac18$. Given a finite sample $X_1,Y_1,\dots,X_n,Y_n$, we split it in two parts. Using the first part of the sample, we form an estimate $\hat t_n$ of the index $t^*$. We then construct, still using the first half of the sample, a family of pattern avoidance functions. For each of these pattern avoidance functions, we apply the algorithm from Lemma 5.8 to the second part of the sample to obtain a predictor. This yields a family of predictors, one per pattern avoidance function. Our final classifier is the majority vote among these predictors.

We now proceed to the details. We first prove a variant of Lemma 4.4.

Lemma 5.10.
There exist universally measurable $\hat t_n = \hat t_n(X_1,Y_1,\dots,X_{\lfloor n/2\rfloor},Y_{\lfloor n/2\rfloor})$, whose definition does not depend on $P$, so that the following holds. Given $t^*$ so that $\mathbf P\{\mathrm{per}(\hat y_{t^*}) > 0\} \le \frac18$, there exist $C, c > 0$ independent of $n$ (but depending on $P, t^*$) so that
\[
\mathbf P\{\hat t_n \in T_{\mathrm{good}}\} \ge 1 - Ce^{-cn},\quad\text{where}\quad
T_{\mathrm{good}} := \{1 \le t \le t^* : \mathbf P\{\mathrm{per}(\hat y_t) > 0\} \le \tfrac14\}.
\]

Proof
The proof is almost identical to that of Lemma 4.4. However, for completeness, we spell out the details of the argument in the present setting. For each $1 \le t \le \lfloor n/4\rfloor$ and $1 \le i \le \lfloor n/4t\rfloor$, let
\[
\tau^i_t := T_t(X_{(i-1)t+1},Y_{(i-1)t+1},\dots,X_{it},Y_{it}),\qquad
\hat y^i_t(z_1,\dots,z_{\tau^i_t}) := \hat Y_t(X_{(i-1)t+1},Y_{(i-1)t+1},\dots,X_{it},Y_{it},z_1,\dots,z_{\tau^i_t})
\]
be as defined above for the subsample $X_{(i-1)t+1},Y_{(i-1)t+1},\dots,X_{it},Y_{it}$ of the first quarter of the data. For each $t$, estimate $\mathbf P\{\mathrm{per}(\hat y_t) > 0\}$ by the fraction of the $\hat y^i_t$ that make an error on the second quarter of the data:
\[
\hat e_t := \frac{1}{\lfloor n/4t\rfloor} \sum_{i=1}^{\lfloor n/4t\rfloor} \mathbf 1_{\hat y^i_t(X_{s+1},\dots,X_{s+\tau^i_t}) = (Y_{s+1},\dots,Y_{s+\tau^i_t})\ \text{for some}\ n/4 \le s \le n/2 - \tau^i_t}.
\]
Observe that
\[
\hat e_t \le e_t := \frac{1}{\lfloor n/4t\rfloor} \sum_{i=1}^{\lfloor n/4t\rfloor} \mathbf 1_{\mathrm{per}(\hat y^i_t) > 0}\quad\text{a.s.}
\]
Finally, we define $\hat t_n := \inf\{t \le \lfloor n/4\rfloor : \hat e_t < \frac{3}{16}\}$, with the convention $\inf \varnothing = \infty$.

Let $t^*$ be as in the statement of the lemma. By Hoeffding's inequality,
\[
\mathbf P\{\hat t_n > t^*\} \le \mathbf P\{\hat e_{t^*} \ge \tfrac{3}{16}\} \le \mathbf P\{e_{t^*} - \mathbf E[e_{t^*}] \ge \tfrac1{16}\} \le e^{-\lfloor n/4t^*\rfloor/128}.
\]
In addition, by continuity, there exists $\varepsilon > 0$ so that for every $1 \le t \le t^*$ such that $\mathbf P\{\mathrm{per}(\hat y_t) > 0\} > \frac14$, we have $\mathbf P\{\mathrm{per}(\hat y_t) > \varepsilon\} > \frac{7}{32}$.

Now, fix $1 \le t \le t^*$ such that $\mathbf P\{\mathrm{per}(\hat y_t) > 0\} > \frac14$. By Hoeffding's inequality, and the choice of $\varepsilon$,
\[
\mathbf P\bigg\{\frac{1}{\lfloor n/4t\rfloor} \sum_{i=1}^{\lfloor n/4t\rfloor} \mathbf 1_{\mathrm{per}(\hat y^i_t) > \varepsilon} < \frac{3}{16}\bigg\} \le e^{-\lfloor n/4t^*\rfloor/512}.
\]
Observe that for any $g : \mathcal X^\tau \to \{0,1\}^\tau$ that satisfies $\mathrm{per}(g) > \varepsilon$, we have
\[
\mathbf P\{g(X_{s+1},\dots,X_{s+\tau}) = (Y_{s+1},\dots,Y_{s+\tau})\ \text{for some}\ n/4 \le s \le n/2 - \tau\} \ge 1 - (1-\varepsilon)^{\lfloor (n-2)/4\tau\rfloor},
\]
because there are $\lfloor (n-2)/4\tau\rfloor$ disjoint intervals of length $\tau$ in $[n/4 + 1, n/2] \cap \mathbb N$.

Since $(\tau^i_t, \hat y^i_t)_{i \le \lfloor n/4t\rfloor}$ are independent of $(X_s,Y_s)_{s > n/4}$, applying a union bound conditionally on $(X_s,Y_s)_{s \le n/4}$ shows that the probability that every $\hat y^i_t$ with $\mathrm{per}_{\tau^i_t}(\hat y^i_t) > \varepsilon$ makes an error on the second quarter of the sample is
\[
\mathbf P\big\{\mathbf 1_{\mathrm{per}_{\tau^i_t}(\hat y^i_t) > \varepsilon} \le \mathbf 1_{\hat y^i_t(X_{s+1},\dots,X_{s+\tau^i_t}) = (Y_{s+1},\dots,Y_{s+\tau^i_t})\ \text{for some}\ n/4 \le s \le n/2 - \tau^i_t}\ \text{for all}\ i\big\} \ge 1 - \lfloor n/4t\rfloor (1-\varepsilon)^{\lfloor (n-2)/4(t^*+1)\rfloor},
\]
where we used that $\tau^i_t \le t + 1 \le t^* + 1$. It follows that
\[
\mathbf P\{\hat t_n = t\} \le \mathbf P\{\hat e_t < \tfrac{3}{16}\} \le \lfloor n/4\rfloor (1-\varepsilon)^{\lfloor (n-2)/4(t^*+1)\rfloor} + e^{-\lfloor n/4t^*\rfloor/512}.
\]
Putting together the above estimates and applying a union bound, we have
\[
\mathbf P\{\hat t_n \notin T_{\mathrm{good}}\} \le e^{-\lfloor n/4t^*\rfloor/128} + t^* \lfloor n/4\rfloor (1-\varepsilon)^{\lfloor (n-2)/4(t^*+1)\rfloor} + t^* e^{-\lfloor n/4t^*\rfloor/512}.
\]
The right-hand side is bounded by $Ce^{-cn}$ for some $C, c > 0$ that do not depend on $n$.
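The validation statistic $\hat e_t$ of the preceding lemma is easy to compute in practice. The following sketch is our own simplified illustration (the name `fraction_failing` and its interface are hypothetical, and a single fixed window length stands in for the per-function lengths $\tau^i_t$): it reports the fraction of a family of pattern avoidance functions that realize, rather than avoid, the label pattern on some window of the validation data.

```python
def fraction_failing(avoiders, tau, val_x, val_y):
    """Empirical analogue of the statistic ê_t: the fraction of pattern
    avoidance functions g that reproduce (i.e. fail to avoid) the label
    pattern on some length-tau window of the validation sequence."""
    def fails(g):
        return any(g(tuple(val_x[s:s + tau])) == tuple(val_y[s:s + tau])
                   for s in range(len(val_x) - tau + 1))
    return sum(fails(g) for g in avoiders) / len(avoiders)
```

A function with $\mathrm{per}(g) = 0$ is never counted; a function with $\mathrm{per}(g) > \varepsilon$ is counted with probability approaching one as the validation stretch grows, which is exactly the dichotomy the proof exploits.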
Proof of Theorem 5.6
We adopt the notation of the proof of Lemma 5.10. Our final learning algorithm is constructed as follows. First, we compute $\hat t_n$. Second, we use the first half of the data to construct the pattern avoidance functions $\hat y^i_{\hat t_n}$ for $1 \le i \le \lfloor n/4\hat t_n\rfloor$. Third, we use the second half of the data to construct classifiers $\hat y_i$ by running the algorithm from Lemma 5.8; namely,
\[
\hat y_i(x) := \hat Y^{\hat y^i_{\hat t_n}}_{\lfloor n/2\rfloor + 2}(X_{\lceil n/2\rceil},Y_{\lceil n/2\rceil},\dots,X_n,Y_n,x).
\]
Our final output $\hat h_n$ is the majority vote over $\hat y_i$ for $1 \le i \le \lfloor n/4\hat t_n\rfloor$. We aim to show that $\mathbf E[\mathrm{er}(\hat h_n)] \le \frac{C'}{n}$ for some constant $C'$.

To this end, for every $t \in T_{\mathrm{good}}$, because $\mathbf P\{\mathrm{per}(\hat y_t) > 0\} \le \frac14$, Hoeffding's inequality implies
\[
\mathbf P\bigg\{\frac{1}{\lfloor n/4t\rfloor} \sum_{i=1}^{\lfloor n/4t\rfloor} \mathbf 1_{\mathrm{per}(\hat y^i_t) > 0} > \frac{7}{16}\bigg\} \le e^{-9\lfloor n/4t^*\rfloor/128}.
\]
By a union bound, we obtain
\[
\mathbf P\bigg\{\frac{1}{\lfloor n/4\hat t_n\rfloor} \sum_{i=1}^{\lfloor n/4\hat t_n\rfloor} \mathbf 1_{\mathrm{per}(\hat y^i_{\hat t_n}) > 0} > \frac{7}{16},\ \hat t_n \in T_{\mathrm{good}}\bigg\}
\le \sum_{t \in T_{\mathrm{good}}} \mathbf P\bigg\{\frac{1}{\lfloor n/4t\rfloor} \sum_{i=1}^{\lfloor n/4t\rfloor} \mathbf 1_{\mathrm{per}(\hat y^i_t) > 0} > \frac{7}{16}\bigg\} \le t^* e^{-9\lfloor n/4t^*\rfloor/128}.
\]
Thus, except on an event of exponentially small probability, the pattern avoidance functions $\hat y^i_{\hat t_n}$ have zero error for at least a fraction $\frac{9}{16}$ of the indices $i$.

Now let $(X,Y) \sim P$ be independent of the data $X_1,Y_1,\dots,X_n,Y_n$. Then
\[
\mathbf E[\mathrm{er}(\hat h_n)] = \mathbf P[\hat h_n(X) \ne Y] \le \mathbf P\bigg[\frac{1}{\lfloor n/4\hat t_n\rfloor} \sum_{i=1}^{\lfloor n/4\hat t_n\rfloor} \mathbf 1_{\hat y_i(X) \ne Y} \ge \frac12\bigg].
\]
We can therefore estimate, using Lemma 5.10,
\[
\mathbf E[\mathrm{er}(\hat h_n)] \le Ce^{-cn} + t^* e^{-9\lfloor n/4t^*\rfloor/128}
+ \mathbf P\bigg[\hat t_n \in T_{\mathrm{good}},\ \frac{1}{\lfloor n/4\hat t_n\rfloor} \sum_{i=1}^{\lfloor n/4\hat t_n\rfloor} \mathbf 1_{\hat y_i(X) \ne Y} \ge \frac12,\ \frac{1}{\lfloor n/4\hat t_n\rfloor} \sum_{i=1}^{\lfloor n/4\hat t_n\rfloor} \mathbf 1_{\mathrm{per}(\hat y^i_{\hat t_n}) = 0} \ge \frac{9}{16}\bigg].
\]
Since any two sets, containing at least $\frac12$ and $\frac{9}{16}$ fractions of $\{1,\dots,\lfloor n/4\hat t_n\rfloor\}$, must have at least a $\frac1{16}$ fraction in their intersection (by the union bound for their complements), the last term in the above expression is bounded above by
\[
\mathbf P\bigg[\hat t_n \in T_{\mathrm{good}},\ \frac{1}{\lfloor n/4\hat t_n\rfloor} \sum_{i=1}^{\lfloor n/4\hat t_n\rfloor} \mathbf 1_{\hat y_i(X) \ne Y}\, \mathbf 1_{\mathrm{per}(\hat y^i_{\hat t_n}) = 0} \ge \frac1{16}\bigg]
\le 16\, \mathbf E\bigg[\mathbf 1_{\hat t_n \in T_{\mathrm{good}}}\, \frac{1}{\lfloor n/4\hat t_n\rfloor} \sum_{i=1}^{\lfloor n/4\hat t_n\rfloor} \mathbf 1_{\hat y_i(X) \ne Y}\, \mathbf 1_{\mathrm{per}(\hat y^i_{\hat t_n}) = 0}\bigg],
\]
using Markov's inequality. We can now apply Lemma 5.8 conditionally on the first half of the data to conclude (using exchangeability) that
\[
\mathbf E[\mathrm{er}(\hat h_n)] \le Ce^{-cn} + t^* e^{-9\lfloor n/4t^*\rfloor/128} + 16\, \mathbf E\bigg[\mathbf 1_{\hat t_n \in T_{\mathrm{good}}}\, \frac{1}{\lfloor n/4\hat t_n\rfloor} \sum_{i=1}^{\lfloor n/4\hat t_n\rfloor} \frac{\tau^i_{\hat t_n}}{\lfloor n/2\rfloor + 2}\bigg]
\le Ce^{-cn} + t^* e^{-9\lfloor n/4t^*\rfloor/128} + \frac{16(t^*+1)}{\lfloor n/2\rfloor + 2},
\]
where we used that $\tau^i_{\hat t_n} \le \hat t_n + 1 \le t^* + 1$ for $\hat t_n \in T_{\mathrm{good}}$.

The final step in the proof of our main results is to show that classes with infinite VCL trees have arbitrarily slow rates.
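The learner built in the proof above can be summarized schematically. The sketch below is a minimal illustration of the split-sample, majority-vote structure only; the two nontrivial components (the construction of the pattern avoidance functions and the one-inclusion predictor of Lemma 5.8) are abstracted as black-box callables, and all helper names are our own placeholders, not the paper's.

```python
from collections import Counter

def majority_vote_learner(sample, fit_avoiders, one_inclusion_predict):
    """Split the sample in two halves; build one predictor per pattern
    avoidance function fitted on the first half, each consulting the
    second half; classify by majority vote over the predictors."""
    n = len(sample)
    first, second = sample[: n // 2], sample[n // 2:]
    avoiders = fit_avoiders(first)   # one avoidance function per subsample

    def h(x):
        votes = Counter(one_inclusion_predict(g, second, x) for g in avoiders)
        return votes.most_common(1)[0][0]

    return h
```

The point of the analysis is that the vote is robust: even if a small fraction of the avoidance functions are bad, the majority is driven by predictors whose error is controlled by Lemma 5.8.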
Theorem 5.11. If $\mathcal H$ has an infinite VCL tree, then $\mathcal H$ requires arbitrarily slow rates.

Together with Theorems 4.6 and 5.6, this theorem completes the characterization of classes $\mathcal H$ with linear learning rate: these are precisely the classes that have an infinite Littlestone tree but do not have an infinite VCL tree.

The proof of Theorem 5.11 is similar to that of Theorem 4.6. The details, however, are more involved. We prove, via the probabilistic method, that for any rate function $R(t) \to 0$ and any learning algorithm $\hat h_n$, there is a realizable distribution $P$ so that $\mathbf E[\mathrm{er}(\hat h_n)] \ge \frac{R(n)}{40}$ infinitely often. The construction of the distribution $P$ depends on the rate function $R$ and relies on the following technical lemma.

Lemma 5.12.
Let $R(t) \to 0$ be any rate function. Then there exist probabilities $p_1, p_2, \dots \ge 0$ so that $\sum_{k \ge 1} p_k = 1$, two increasing sequences of integers $(n_i)_{i \ge 1}$ and $(k_i)_{i \ge 1}$, and a constant $\frac12 \le C \le 1$ such that the following hold for all $i \ge 1$: (a) $\sum_{k > k_i} p_k \le \frac{1}{n_i}$. (b) $n_i p_{k_i} \le k_i$. (c) $p_{k_i} = C R(n_i)$.

Proof
We may assume without loss of generality that $R(1) = 1$; otherwise, we can replace $R$ by $\tilde R$ such that $\tilde R(1) = 1$ and $\tilde R(n) = R(n)$ for $n > 1$. We first construct the sequences $(n_i)$ and $(k_i)$. Let $n_1 = 1$ and $k_1 = 1$. For $i > 1$, let
\[
n_i = \inf\Big\{n > n_{i-1} : R(n) \le 2^{j-i} \min\Big(R(n_j), \frac{1}{n_j}\Big)\ \text{for all}\ j < i\Big\},\qquad
k_i = \max\big(k_{i-1} + 1,\ \lceil n_i R(n_i)\rceil\big).
\]
Since $R(t) \to 0$, we have $n_i < \infty$ for all $i$. The sequences are increasing by construction. Finally, we define $p_k = 0$ for $k \notin \{k_i : i \ge 1\}$ and
\[
p_{k_i} = C R(n_i)\quad\text{with}\quad C = \bigg(\sum_{j \ge 1} R(n_j)\bigg)^{-1},
\]
so that $\sum_{k \ge 1} p_k = 1$. As $R(n_j) \le 2^{-j+1}$ for all $j > 1$ and $R(n_1) = 1$, we have $\frac12 \le C \le 1$. Moreover, the definition of $n_j$ yields
\[
R(n_j) \le 2^{i-j} R(n_i),\qquad R(n_j) \le \frac{2^{i-j}}{n_i}\qquad\text{for all}\ i < j.
\]
Therefore, as $C \le 1$, we obtain
\[
\sum_{k > k_i} p_k = \sum_{j > i} p_{k_j} = \sum_{j > i} C R(n_j) \le \frac{1}{n_i} \sum_{j > i} 2^{i-j} \le \frac{1}{n_i},
\]
which proves (a). For (b), note that $n_i p_{k_i} = C n_i R(n_i) \le \lceil n_i R(n_i)\rceil \le k_i$. Finally, (c) holds by construction.

We can now complete the proof of Theorem 5.11.
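Before doing so, we note that the construction just given can be checked numerically for a concrete rate function. The sketch below follows the recursion in the form reconstructed above (an assumption on our part; `slow_rate_support` is our own name), truncating the sequences at a finite depth and normalizing the masses over the chosen indices only.

```python
def slow_rate_support(R, depth=4):
    """Finite-depth sketch of the Lemma 5.12 construction (assumed form):
    build increasing (n_i), (k_i) and masses p_{k_i} = C * R(n_i)."""
    ns, ks = [1], [1]
    for i in range(1, depth):
        n = ns[-1] + 1
        # smallest n with R(n) <= 2^(j-i) * min(R(n_j), 1/n_j) for all j < i
        while not all(R(n) <= 2 ** (j - i) * min(R(ns[j]), 1.0 / ns[j])
                      for j in range(i)):
            n += 1
        ns.append(n)
        ks.append(max(ks[-1] + 1, n))   # guarantees n_i * p_{k_i} <= k_i
    C = 1.0 / sum(R(n) for n in ns)     # normalize over the truncated support
    p = {k: C * R(n) for k, n in zip(ks, ns)}
    return p, ns, ks, C
```

For $R(n) = n^{-1/2}$ and depth 4 this yields $n_i = 1, 4, 64, 16384$, and properties (a)-(c) can be verified directly on the output.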
Proof of Theorem 5.11
We fix throughout the proof a rate function $R(t) \to 0$. Define $C, p_k, k_i, n_i$ as in Lemma 5.12. We also fix any learning algorithm with output $\hat h_n$ and an infinite VCL tree
\[
t = \{x_{\mathbf u} \in \mathcal X^{k+1} : 0 \le k < \infty,\ \mathbf u \in \{0,1\}^1 \times \cdots \times \{0,1\}^k\}
\]
for $\mathcal H$. Let $\mathbf y = (\mathbf y_1, \mathbf y_2, \dots)$ be a sequence of independent random vectors, where $\mathbf y_k = (y^0_k, \dots, y^{k-1}_k)$ is uniformly distributed on $\{0,1\}^k$ for each $k \ge 1$. Define the random distribution $P_{\mathbf y}$ on $\mathcal X \times \{0,1\}$ as
\[
P_{\mathbf y}\{(x^i_{\mathbf y_{\le k-1}}, y^i_k)\} = \frac{p_k}{k}\qquad\text{for}\ 0 \le i \le k-1,\ k \ge 1.
\]
In words, each $\mathbf y$ defines an infinite branch of the tree $t$. Given $\mathbf y$, we choose the vertex on this branch of depth $k-1$ with probability $p_k$. This vertex defines a subset of $\mathcal X$ of size $k$, and the distribution $P_{\mathbf y}$ chooses each element of this subset uniformly at random.

Because $t$ is a VCL tree, for every $n < \infty$, there exists $h \in \mathcal H$ so that $h(x^i_{\mathbf y_{\le k-1}}) = y^i_k$ for $0 \le i \le k-1$, $1 \le k \le n$. Thus
\[
\mathrm{er}_{\mathbf y}(h) := P_{\mathbf y}\{(x,y) \in \mathcal X \times \{0,1\} : h(x) \ne y\} \le \sum_{k > n} p_k.
\]
Letting $n \to \infty$, we find that $P_{\mathbf y}$ is realizable for every realization of $\mathbf y$. Finally, the map $\mathbf y \mapsto P_{\mathbf y}$ is measurable as in the proof of Theorem 4.6.

Now let $(X,Y), (X_1,Y_1), (X_2,Y_2), \dots$ be i.i.d. samples drawn from $P_{\mathbf y}$. That is,
\[
X = x^I_{\mathbf y_{\le T-1}},\quad Y = y^I_T,\qquad X_i = x^{I_i}_{\mathbf y_{\le T_i-1}},\quad Y_i = y^{I_i}_{T_i},
\]
where $(T,I), (T_1,I_1), (T_2,I_2), \dots$ are i.i.d. random variables, independent of $\mathbf y$, with distribution
\[
\mathbf P\{T = k, I = i\} = \frac{p_k}{k}\qquad\text{for}\ 0 \le i \le k-1,\ k \ge 1.
\]
For all $n$ and $k$,
\begin{align*}
\mathbf P\{\hat h_n(X) \ne Y,\ T = k\}
&\ge \sum_{i=0}^{k-1} \mathbf P\{\hat h_n(X) \ne y^i_k,\ T = k,\ I = i,\ T_1,\dots,T_n \le k,\ (T_1,I_1),\dots,(T_n,I_n) \ne (k,i)\} \\
&= \frac12 \sum_{i=0}^{k-1} \mathbf P\{T = k,\ I = i,\ T_1,\dots,T_n \le k,\ (T_1,I_1),\dots,(T_n,I_n) \ne (k,i)\}
= \frac{p_k}{2}\bigg(1 - \sum_{l > k} p_l - \frac{p_k}{k}\bigg)^n,
\end{align*}
where we used that conditionally on $T = k$, $I = i$, $T_1,\dots,T_n \le k$, $(T_1,I_1),\dots,(T_n,I_n) \ne (k,i)$, the prediction $\hat h_n(X)$ is independent of $y^i_k$.

We now choose $k = k_i$ and $n = n_i$. By Lemma 5.12,
\[
\mathbf P\{\hat h_{n_i}(X) \ne Y,\ T = k_i\} \ge \frac{C R(n_i)}{2}\bigg(1 - \frac{2}{n_i}\bigg)^{n_i} \ge \frac{C R(n_i)}{18}
\]
for $i \ge 3$. By Fatou's lemma,
\[
\mathbf E\Big[\limsup_{i\to\infty} \frac{1}{R(n_i)}\, \mathbf P\{\hat h_{n_i}(X) \ne Y,\ T = k_i \mid \mathbf y\}\Big]
\ge \limsup_{i\to\infty} \frac{1}{R(n_i)}\, \mathbf P\{\hat h_{n_i}(X) \ne Y,\ T = k_i\} \ge \frac{C}{18};
\]
Fatou's lemma applies as $\frac{1}{R(n_i)} \mathbf P\{\hat h_{n_i}(X) \ne Y,\ T = k_i \mid \mathbf y\} \le \frac{1}{R(n_i)} \mathbf P\{T = k_i\} = C$ a.s. Because
\[
\mathbf P\{\hat h_{n_i}(X) \ne Y,\ T = k_i \mid \mathbf y\} \le \mathbf P\{\hat h_{n_i}(X) \ne Y \mid \mathbf y\} = \mathbf E[\mathrm{er}_{\mathbf y}(\hat h_{n_i}) \mid \mathbf y]\quad\text{a.s.},
\]
there must exist a realization of $\mathbf y$ such that $\mathbf E[\mathrm{er}_{\mathbf y}(\hat h_n) \mid \mathbf y] > \frac{C R(n)}{20} \ge \frac{R(n)}{40}$ infinitely often. Choosing $P = P_{\mathbf y}$ for this realization of $\mathbf y$ concludes the proof.

Appendix A. Mathematical background
A.1 Gale-Stewart games
The aim of this section is to recall some basic notions from the classical theory of infinite games. Fix sets $\mathcal X_t, \mathcal Y_t$ for $t \ge 1$. We consider infinite games between two players: in each round $t \ge 1$, player $P_A$ selects an element $x_t \in \mathcal X_t$, and then player $P_L$ selects an element $y_t \in \mathcal Y_t$. The rules of the game are determined by specifying a set $W \subseteq \prod_{t \ge 1} (\mathcal X_t \times \mathcal Y_t)$ of winning sequences for $P_L$. That is, after an infinite sequence of consecutive plays $x_1, y_1, x_2, y_2, \dots$, we say that $P_L$ wins if $(x_1, y_1, x_2, y_2, \dots) \in W$; otherwise, $P_A$ is declared the winner of the game.

A strategy is a rule used by a given player to determine the next move given the current position of the game. A strategy for $P_A$ is a sequence of functions $f_t : \prod_{s < t} (\mathcal X_s \times \mathcal Y_s) \to \mathcal X_t$ for $t \ge 1$, so that $P_A$ plays $x_t = f_t(x_1, y_1, \dots, x_{t-1}, y_{t-1})$ in round $t$. Similarly, a strategy for $P_L$ is a sequence of functions $g_t : \prod_{s < t} (\mathcal X_s \times \mathcal Y_s) \times \mathcal X_t \to \mathcal Y_t$ for $t \ge 1$, so that $P_L$ plays $y_t = g_t(x_1, y_1, \dots, x_{t-1}, y_{t-1}, x_t)$ in round $t$. A strategy for $P_A$ is called winning if playing that strategy always makes $P_A$ win the game regardless of what $P_L$ plays; a winning strategy for $P_L$ is defined analogously.

At the present level of generality, it is far from clear whether winning strategies even exist. We introduce an additional assumption in order to be able to develop a meaningful theory. The simplest such assumption was introduced in the classic work of Gale and Stewart (Gale and Stewart, 1953): $W$ is called finitely decidable if for every $(x_1, y_1, x_2, y_2, \dots) \in W$, there exists $n < \infty$ so that
\[
(x_1, y_1, \dots, x_n, y_n, x'_{n+1}, y'_{n+1}, x'_{n+2}, y'_{n+2}, \dots) \in W
\]
for all choices of $x'_{n+1}, y'_{n+1}, x'_{n+2}, y'_{n+2}, \dots$ In other words, that $W$ is finitely decidable means that if $P_L$ wins, then she knows that she won after playing a finite number of rounds. Conversely, in this case $P_A$ wins the game precisely when $P_L$ does not win after any finite number of rounds.

An infinite game whose set $W$ is finitely decidable is called a Gale-Stewart game. The fundamental theorem on Gale-Stewart games is the following.

Theorem A.1.
In a Gale-Stewart game, either $P_A$ or $P_L$ has a winning strategy.

The classical proof of this result is short and intuitive, cf. (Gale and Stewart, 1953) or (Kechris, 1995, Theorem 20.1). For a more constructive approach, see (Hodges, 1993, Corollary 3.4.3).

Remark A.2. If one endows $\mathcal X_t$ and $\mathcal Y_t$ with the discrete topology, then $W$ is finitely decidable if and only if it is an open set for the associated product topology. For this reason, the defining condition of a Gale-Stewart game is usually expressed by saying that the set of winning sequences is open. This terminology is particularly confusing in the setting of this paper, because we endow $\mathcal X_t$ and $\mathcal Y_t$ with a different topology. In order to avoid confusion, we have therefore opted to resort to the nonstandard terminology "finitely decidable".

Remark A.3. In the literature it is sometimes assumed that $\mathcal X_t = \mathcal Y_t = \mathcal X$ for all $t$. However, the more general setting of this section is already contained in this special case. Indeed, given sets $\mathcal X_t, \mathcal Y_t$ for every $t$, let $\mathcal X = \bigsqcup_t (\mathcal X_t \cup \mathcal Y_t)$ be their disjoint union. We may now augment the set $W$ of winning sequences for $P_L$ so that the first player who makes an inadmissible play (that is, $x_t \notin \mathcal X_t$ or $y_t \notin \mathcal Y_t$) loses instantly. This ensures that a winning strategy for either player will only make admissible plays, thus reducing the general case to the special case. Despite this equivalence, we have chosen the more general formulation as this is most natural in applications.

Remark A.4. Even though we have defined a strategy for $P_A$ as a sequence of functions $x_t = f_t(x_1, y_1, \dots, x_{t-1}, y_{t-1})$ of the full game position, it is implicit in this notation that $x_1, \dots, x_{t-1}$ are also played according to the previous rounds of the same strategy ($x_{t-1} = f_{t-1}(x_1, y_1, \dots, x_{t-2}, y_{t-2})$, etc.). Thus we can equivalently view a strategy for $P_A$ as a sequence of functions $x_t = f_t(y_1, \dots, y_{t-1})$ that depend only on the previous plays of $P_L$.
Similarly, a strategy for $P_L$ can be equivalently described by a sequence of functions $y_t = g_t(x_1, \dots, x_t)$.

A.2 Ordinals

The aim of this section is to briefly recall the notion of ordinals, which play an important role in our theory. An excellent introduction to this topic may be found in (Hrbacek and Jech, 1999, Chapter 6), while the classical reference is (Sierpiński, 1965).

A well-ordering of a set $S$ is a linear ordering $<$ with the property that every nonempty subset of $S$ contains a least element. For example, if we consider subsets of $\mathbb R$ with the usual ordering of the reals, then $\{1, \dots, n\}$ and $\mathbb N$ are well-ordered but $\mathbb Z$ and $[0,1]$ are not. We could however choose nonstandard orderings on $\mathbb Z$ and $[0,1]$ so they become well-ordered; in fact, it is a classical consequence of the axiom of choice that any set may be well-ordered.

Two well-ordered sets are said to be isomorphic if there is an order-preserving bijection between them. There is a canonical way to construct a class of well-ordered sets, called ordinals, such that any well-ordered set is isomorphic to exactly one ordinal. Ordinals uniquely encode well-ordered sets up to isomorphism, in the same way that cardinals uniquely encode sets up to bijection. The class of all ordinals is denoted ORD. The specific construction of ordinals is not important for our purposes, and we therefore discuss ordinals somewhat informally. We refer to (Hrbacek and Jech, 1999, Chapter 6) or (Sierpiński, 1965) for a careful treatment.

It is a basic fact that any pair of well-ordered sets is either isomorphic, or one is isomorphic to an initial segment of the other. This induces a natural ordering on ordinals. For $\alpha, \beta \in \mathrm{ORD}$, we write $\alpha < \beta$ if $\alpha$ is isomorphic to an initial segment of $\beta$. The defining property of ordinals is that any ordinal $\beta$ is isomorphic to the set of ordinals $\{\alpha : \alpha < \beta\}$ that precede it.
In particular, $<$ is itself a well-ordering; namely, every nonempty set of ordinals contains a least element, and every nonempty set $S$ of ordinals has a least upper bound, denoted $\sup S$.

Ordinals form a natural set-theoretic extension of the natural numbers. By definition, every ordinal $\beta$ has a successor ordinal $\beta + 1$, which is the smallest ordinal that is larger than $\beta$. We can therefore count ordinals one by one. The smallest ordinals are the finite ordinals $0, 1, 2, 3, 4, \dots$; we naturally identify each number $k$ with the well-ordered set $\{0, \dots, k-1\}$. The smallest infinite ordinal is denoted $\omega$; it may simply be identified with the family of all natural numbers with its usual ordering. With ordinals, however, we can keep counting past infinity: one counts $0, 1, 2, \dots, \omega, \omega+1, \omega+2, \dots, \omega+\omega, \omega+\omega+1, \dots$ and so on. The smallest uncountable ordinal is denoted $\omega_1$.

An important concept defined by ordinals is the principle of transfinite recursion. Informally, it states that if we have a recipe that, given sets of "objects" $O_\alpha$ indexed by all ordinals $\alpha < \beta$, defines a new set of "objects" $O_\beta$, and we are given a base set $\{O_\alpha : \alpha < \alpha_0\}$, then $O_\beta$ is uniquely defined for all $\beta \in \mathrm{ORD}$. As a simple example, let us define the meaning of addition of ordinals $\gamma + \beta$. For the base case, we define $\gamma + 0 = \gamma$ and $\gamma + 1$ to be the successor of $\gamma$. Subsequently, for any $\beta$, we define $\gamma + \beta = \sup\{(\gamma + \alpha) + 1 : \alpha < \beta\}$. Then the principle of transfinite recursion ensures that $\gamma + \beta$ is uniquely defined for all ordinals $\beta$. One can analogously develop a full ordinal arithmetic that defines addition, multiplication, exponentiation, etc. of ordinals just as for natural numbers (Hrbacek and Jech, 1999, Section 6.5).

A.3 Well-founded relations and ranks

In this section we extend the notion of a well-ordering to more general types of orders, and introduce the fundamental notion of rank.
Our reference here is (Kechris, 1995, Appendix B).

A relation $\prec$ on a set $S$ is defined by an arbitrary subset $R_\prec \subseteq S \times S$: we write $x \prec y$ if and only if $(x,y) \in R_\prec$. An element $x$ of $(S, \prec)$ is called minimal if there does not exist $y \prec x$. The relation is called well-founded if every nonempty subset of $S$ has a minimal element. Thus a linear ordering is well-founded precisely when it is a well-ordering; but the notion of well-foundedness extends to any relation.

To any well-founded relation $\prec$ on $S$ we will associate a function $\rho_\prec : S \to \mathrm{ORD}$, called the rank function of $\prec$, that is defined by transfinite recursion. We say that $\rho_\prec(x) = 0$ if and only if $x$ is minimal in $S$, and define for all other $x$
\[
\rho_\prec(x) = \sup\{\rho_\prec(y) + 1 : y \prec x\}.
\]
The rank $\rho_\prec(x)$ quantifies how far $x$ is from being minimal.

Remark A.5. Observe that every element $x \in S$ indeed has a well-defined rank (that is, it appears at some stage in the transfinite recursion). Indeed, the transfinite recursion recipe defines $\rho_\prec(x)$ as soon as $\rho_\prec(y)$ has been defined for all $y \prec x$. If $\rho_\prec(x)$ is undefined, then there must exist $x_1 \prec x$ so that $\rho_\prec(x_1)$ is undefined. Repeating this process constructs an infinite decreasing chain of elements $x_i \in S$. But this contradicts the assumption that $\prec$ is well-founded, as an infinite decreasing chain cannot contain a minimal element.

Let $(S, \prec)$ and $(S', \prec')$ be sets endowed with relations. A map $f : S \to S'$ is called order-preserving if $x \prec y$ implies $f(x) \prec' f(y)$. It is a basic fact that ranks are monotone under order-preserving maps: if $\prec'$ is well-founded and $f : S \to S'$ is order-preserving, then $\prec$ is well-founded and $\rho_\prec(x) \le \rho_{\prec'}(f(x))$ for all $x \in S$ (this follows readily by induction on the value of $\rho_\prec(x)$).

Like ordinals, the rank of a well-founded relation is an intuitive object once one understands its meaning. This is best illustrated by some simple examples.
As explained in Remark A.5, a well-founded relation does not admit an infinite decreasing chain $x \succ x_1 \succ x_2 \succ \cdots$, but it might admit finite decreasing chains of arbitrary length. As the following examples illustrate, the rank $\rho_\prec(x)$ quantifies how long we can keep growing a decreasing chain starting from $x$.

Example A.6. Suppose that $\rho_\prec(x) = k$ for some finite ordinal $0 < k < \omega$. By the definition of rank, $\rho_\prec(y) < k$ for all $y \prec x$, while there exists $x_1 \prec x$ such that $\rho_\prec(x_1) = k - 1$. It follows readily that $\rho_\prec(x) = k$ if and only if the longest decreasing chain that can be grown starting from $x$ has length $k + 1$.

Example A.7. Suppose that $\rho_\prec(x) = \omega$. By the definition of rank, $\rho_\prec(y) < \omega$ is an arbitrarily large finite ordinal for $y \prec x$. We can grow an arbitrarily long decreasing chain starting from $x$, but once we select its first element $x_1 \prec x$, we can grow at most finitely many elements as in the previous example. In other words, the maximal length of the chain is decided by the choice of its first element $x_1$.

Example A.8. Suppose that $\rho_\prec(x) = \omega + k$ for some $k < \omega$. Then we can choose $x \succ x_1 \succ \cdots \succ x_k$ so that $\rho_\prec(x_k) = \omega$. We can still grow arbitrarily long decreasing chains after selecting the first $k$ elements judiciously, but the length of the chain is decided at the latest after we have selected $x_{k+1}$.

Example A.9. Suppose that $\rho_\prec(x) = \omega + \omega$. Then in the first step, we can choose for any $k < \omega$ an element $x_1 \prec x$ so that $\rho_\prec(x_1) = \omega + k$. From that point onward, we proceed as in the previous example. The maximal length of a decreasing chain starting from $x$ is determined by two decisions: the choice of $x_1$ decides a number $k$, so that the maximal length of the chain is decided at the latest after we have selected $x_{k+2}$.

These examples can be further extended.
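Before extending the examples, note that in the finite case the transfinite recursion defining $\rho_\prec$ terminates at finite ordinals, so the preceding examples can be checked mechanically. A minimal sketch (our own illustration; `rank_function` is a hypothetical name, and `rel(y, x)` encodes $y \prec x$):

```python
from functools import lru_cache

def rank_function(elements, rel):
    """Rank of each element of a finite well-founded relation:
    rho(x) = 0 for minimal x, else max{rho(y) + 1 : y < x}.
    (In the finite case, sup over a nonempty finite set is max.)"""
    elements = tuple(elements)

    @lru_cache(maxsize=None)
    def rho(x):
        below = [y for y in elements if rel(y, x)]
        return 0 if not below else max(rho(y) + 1 for y in below)

    return {x: rho(x) for x in elements}
```

On $\{0,\dots,4\}$ with the usual strict order, element $k$ receives rank $k$, matching Example A.6: the longest decreasing chain grown from $k$ has length $k+1$.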
For example, $\rho_\prec(x) = \omega \cdot k + k'$ means that after $k'$ initial steps we can make a sequence of $k$ decisions, each decision being how many steps we can grow the chain before the next decision must be made. Similarly, $\rho_\prec(x) = \omega^2$ means we can decide on arbitrarily large numbers $k, k' < \omega$ in the first step, and then proceed as for $\omega \cdot k + k'$; etc.

A.4 Polish spaces and analytic sets

We finally review the basic notions of measures and probabilities on Polish spaces. We refer to (Cohn, 1980, Chapter 8) for a self-contained introduction, and to (Kechris, 1995) for a comprehensive treatment.

A Polish space is a separable topological space that can be metrized by a complete metric. Many spaces encountered in practice are Polish, including $\mathbb R^n$, any compact metric space, any separable Banach space, etc. Moreover, any finite or countable product or disjoint union of Polish spaces is again Polish.

Let $\mathcal X, \mathcal Y$ be Polish spaces, and let $f : \mathcal X \to \mathcal Y$ be a continuous function. It is shown in any introductory text on probability that $f$ is Borel measurable, that is, $f^{-1}(B)$ is a Borel subset of $\mathcal X$ for any Borel subset $B$ of $\mathcal Y$. However, the forward image $f(\mathcal X)$ is not necessarily Borel measurable in $\mathcal Y$. A subset $B \subseteq \mathcal Y$ of a Polish space is called analytic if it is the image of some Polish space under a continuous map. It turns out that every Borel set is analytic, but not every analytic set is Borel. The family of analytic sets is closed under countable unions and intersections, but not under complements. The complement of an analytic set is called coanalytic. A set is Borel if and only if it is both analytic and coanalytic.

Although analytic sets may not be Borel measurable, such sets are just as good as Borel sets for the purposes of probability theory. Let $\mathcal F$ be the Borel $\sigma$-field on a Polish space $\mathcal X$.
For any probability measure $\mu$, denote by $\mathcal F_\mu$ the completion of $\mathcal F$ with respect to $\mu$, that is, the collection of all subsets of $\mathcal X$ that differ from a Borel set at most on a set of zero probability. A set $B \subseteq \mathcal X$ is called universally measurable if $B \in \mathcal F_\mu$ for every probability measure $\mu$. Similarly, a function $f : \mathcal X \to \mathcal Y$ is called universally measurable if $f^{-1}(B)$ is universally measurable for any universally measurable set $B$. It is clear from these definitions that universally measurable sets and functions on Polish spaces are indistinguishable from Borel sets from a probabilistic perspective.

The following fundamental fact is known as the capacitability theorem.

Theorem A.10. Every analytic (or coanalytic) set is universally measurable.

The importance of analytic sets in probability theory stems from the fact that they make it possible to establish measurability of certain uncountable unions of measurable sets. Indeed, let $\mathcal X$ and $\mathcal Y$ be Polish spaces, and let $A \subseteq \mathcal X \times \mathcal Y$ be an analytic set. The set
\[
B := \bigcup_{y \in \mathcal Y} \{x \in \mathcal X : (x,y) \in A\}
\]
can be written as $B = f(A)$ for the continuous function $f(x,y) := x$. The set $B \subseteq \mathcal X$ is therefore also analytic, and hence universally measurable.

We conclude this section by stating a deep fact about well-founded relations on Polish spaces. Let $\mathcal X$ be a Polish space and let $\prec$ be a well-founded relation on $\mathcal X$. The relation $\prec$ is called analytic if $R_\prec \subseteq \mathcal X \times \mathcal X$ is an analytic set.

8. Our discussion of the intuitive meaning of the rank of a well-founded relation is based on the lively discussion in (Evans and Hamkins, 2014) of game values in infinite chess.

Theorem A.11. Let $\prec$ be an analytic well-founded relation on a Polish space $\mathcal X$. Its rank function satisfies $\sup_{x \in \mathcal X} \rho_\prec(x) < \omega_1$.

This result is known as the Kunen-Martin theorem; see (Kechris, 1995, Theorem 31.1) or (Dellacherie, 1977) for a self-contained proof and historical comments.

Appendix B.
Measurability of Gale-Stewart strategies The fundamental theorem of Gale-Stewart games, Theorem A.1, states that either player P A or P L must have a winning strategy in an infinite game when the set of winning sequences W for P L isfinitely decidable. This existential result provides no information, however, about the complexityof the winning strategies. In particular, it is completely unclear whether winning strategies can bechosen to be measurable. As we use winning strategies to design algorithms that operate on randomdata, non-measurable strategies are may be potentially a serious problem for our purposes. Indeed,lack of measurability can render probabilistic reasoning completely meaningless (cf. Appendix C).Almost nothing appears to be known in the literature regarding the measurability of Gale-Stewartstrategies. The aim of this appendix is to prove a general measurability theorem that captures allthe games that appear in this paper. We adopt the general setting and notations of Appendix A.1. Theorem B.1. Let {X t } t ≥ be Polish spaces and {Y t } t ≥ be countable sets. Consider a Gale-Stewart game whose set W ⊆ Q t ≥ ( X t × Y t ) of winning sequences for P L is finitely decidable andcoanalytic. Then there is a universally measurable winning strategy. A characteristic feature of the games in this paper is the asymmetry between P A and P L .Player P A plays elements of an arbitrary Polish space, while P L can only play elements of a count-able set. Any strategy for P A is automatically measurable, as it may be viewed as a function of theprevious plays of P L only (cf. Remark A.4). The nontrivial content of Theorem B.1 is that if P L hasa winning strategy, such a strategy may be chosen to be universally measurable.To prove Theorem B.1, we construct an explicit winning strategy of the following form. Toevery sequence of plays x , y , . . . 
, x t , y t for which P L has not yet won, we associate an ordinal valuewith the following property: regardless of the next play x t +1 of P A , there exists y t +1 that decreasesthe value. Because there are no infinite decreasing chains of ordinals, P L eventually wins with thisstrategy. To show that this strategy is measurable, we use the coanalyticity assumption of TheoremB.1 in two different ways. On the one hand, we show that the set of game positions of countablevalue is measurable. On the other hand, the Kunen-Martin theorem implies that only countablevalues can appear. Remark B.2. The construction of winning strategies for Gale-Stewart games using game values isnot new; cf. (Hodges, 1993, Section 3.4) or (Evans and Hamkins, 2014). We, however, define thegame value in a different manner than is customary in the literature. While the proof ultimatelyshows that the two definitions are essentially equivalent, our definition enables us to directly applythe Kunen-Martin theorem, and is conceptually much closer to the classical Littlestone dimensionof concept classes (cf. Section 3.4). B.1 Preliminaries In the remainder of this appendix we assume that the assumptions of Theorem B.1 are in force, andthat P L has a winning strategy.Let us begin by introducing some basic notions. A position of the game is a finite sequence ofplays x , y , . . . , x n , y n for some 0 ≤ n < ∞ (the empty sequence ∅ denotes the initial position ofthe game). We denote the set of positions of length n by P n := n Y t =1 ( X t × Y t ) , where P := { ∅ } ), and by P := S ≤ n< ∞ P n the set of all positions. Note that, by our assump-tions, P n and P are Polish spaces.An active position is a sequence of plays x , y , . . . , x n , y n after which P L has not yet won.Namely, there exist x n +1 , y n +1 , x n +2 , y n +2 , . . . so that ( x , y , x , y , . . . ) W . The set of activepositions of length n can be written as A n := [ w ∈ Q ∞ t = n +1 ( X t ×Y t ) { v ∈ P n : ( v , w ) ∈ W c } . 
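Before proceeding, it may help to see these objects in a finite toy setting (our own illustration, not from the paper, with hypothetical names). For a finite concept class, a position of the corresponding Littlestone game is active as long as some concept is consistent with every play made so far, and the largest number of rounds for which $P_A$ can keep the game active is the classical Littlestone dimension. A minimal sketch tracking the version space of still-consistent concepts:

```python
from functools import lru_cache

# Toy domain {0,1,2} and the four threshold concepts h_t(x) = 1 iff x <= t,
# for t in {-1, 0, 1, 2}, each encoded as its tuple of labels on the domain.
X = (0, 1, 2)
H = tuple(tuple(1 if x <= t else 0 for x in X) for t in (-1, 0, 1, 2))

@lru_cache(maxsize=None)
def value(V):
    # Finite analogue of the game value val(v): a position is encoded by its
    # version space V (the concepts consistent with all plays so far), it is
    # active iff V is nonempty, and its value is the largest depth of an
    # active tree, computed by a max-min recursion (cf. Lemma B.10).
    if not V:           # inactive position: P_L has already won
        return -1
    best = 0            # V nonempty: the depth-0 tree is active
    for x in X:
        # P_L answers y in {0,1}; the version space shrinks to V_{x,y}
        V0 = tuple(h for h in V if h[x] == 0)
        V1 = tuple(h for h in V if h[x] == 1)
        if V0 and V1:   # x helps P_A only if both replies stay active
            best = max(best, 1 + min(value(V0), value(V1)))
    return best

# value(H) is the classical Littlestone dimension of H: a class of 4
# thresholds has Littlestone dimension floor(log2(4)) = 2.
print(value(H))  # -> 2
```

For infinite classes this max-min recursion need not terminate, which is precisely why the value is instead defined below through the ordinal rank of a well-founded relation on active trees.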
Because $W$ is coanalytic, $A_n$ is an analytic subset of $P_n$. We denote by $A := \bigcup_{0 \le n < \infty} A_n$ the set of all active positions.

Remark B.3. The notion of active positions is fundamental to the definition of Gale-Stewart games. The fact that $W$ is finitely decidable is nothing other than the property $W = \{(x_1, y_1, x_2, y_2, \ldots) : (x_1, y_1, \ldots, x_n, y_n) \notin A_n \text{ for some } 0 \le n < \infty\}$.

We now introduce the fundamental notion of active trees. By assumption, there is no winning strategy for $P_A$. That is, there is no strategy for $P_A$ that ensures the game remains active forever. However, given any finite number $n < \infty$, there could exist strategies for $P_A$ that force the game to remain active for at least $n$ rounds regardless of what $P_L$ plays. Such a strategy is naturally defined by specifying a decision tree of depth $n$, that is, a rooted tree such that each vertex at depth $t$ is labelled by a point in $\mathcal{X}_t$, and the edges to its children are labelled by $\mathcal{Y}_t$. Such a tree can be described by specifying a set of points $\{x_{\mathbf{y}} \in \mathcal{X}_{t+1} : \mathbf{y} \in \prod_{s=1}^{t} \mathcal{Y}_s,\ 0 \le t < n\}$. This tree keeps the game active for $n$ rounds as long as $(x_\emptyset, y_1, x_{y_1}, y_2, \ldots, x_{y_1, \ldots, y_{n-1}}, y_n) \in A_n$ for all possible plays $y_1, \ldots, y_n$ of $P_L$. This notion is precisely the analogue of a Littlestone tree (Definition 1.7) in the context of Gale-Stewart games.

We need to consider strategies that keep the game active for a finite number of rounds starting from an arbitrary position (in the above discussion we assumed the starting position $\emptyset$).

Definition B.4. Given a position $v \in P_k$ of length $k$:

• A decision tree of depth $n$ with starting position $v$ is a collection of points $\mathbf{t} = \{x_{\mathbf{y}} \in \mathcal{X}_{k+t+1} : \mathbf{y} \in \prod_{s=k+1}^{k+t} \mathcal{Y}_s,\ 0 \le t < n\}$. By convention, we call $\mathbf{t} = \emptyset$ a decision tree of depth 0.

• $\mathbf{t}$ is called active if $(v, x_\emptyset, y_{k+1}, x_{y_{k+1}}, y_{k+2}, \ldots, x_{y_{k+1}, \ldots, y_{k+n-1}}, y_{k+n}) \in A_{k+n}$ for all choices of $(y_{k+1}, \ldots, y_{k+n}) \in \prod_{t=k+1}^{k+n} \mathcal{Y}_t$.
• We denote by $\mathbf{T}_v$ the set of all decision trees with starting position $v$ (and any depth $0 \le n < \infty$), and by $\mathbf{T}^A_v \subseteq \mathbf{T}_v$ the set of all active trees.

As the sets $\mathcal{Y}_t$ are assumed to be countable, any decision tree is described by a countable collection of points. Thus $\mathbf{T}_v$ is a Polish space (it is a countable disjoint union of countable products of the Polish spaces $\mathcal{X}_t$). Moreover, as $A_{k+n}$ is analytic, it follows readily that $\mathbf{T}^A_v$ is analytic (it is a countable disjoint union of countable intersections of analytic sets). The key reason why Theorem B.1 is restricted to the setting where each $\mathcal{Y}_t$ is countable is to ensure these properties hold.

B.2 Game values

We now assign to every position $v \in P$ a value $\mathrm{val}(v)$. Intuitively, the value measures how long we can keep growing an active tree starting from $v$. It will be convenient to adjoin to the ordinals two elements $-1$ and $\Omega$ that are smaller and larger than every ordinal, respectively. We write $\mathrm{ORD}^* := \mathrm{ORD} \cup \{-1, \Omega\}$, and proceed to define the value function $\mathrm{val} : P \to \mathrm{ORD}^*$.

By definition, $\mathbf{T}^A_v$ is empty if and only if the position $v \notin A$ is inactive, that is, if $P_L$ has already won. In this case, we define $\mathrm{val}(v) = -1$. It remains to define the value when $v \in A$ is active.

The definition of value uses a relation $\prec_v$ on $\mathbf{T}^A_v$. In this relation, $\mathbf{t}' \prec_v \mathbf{t}$ if and only if the tree $\mathbf{t}$ is obtained from $\mathbf{t}'$ by removing its leaves (in particular, $\mathrm{depth}(\mathbf{t}') = \mathrm{depth}(\mathbf{t}) + 1$). Let us make two basic observations about this relation:

• An infinite decreasing chain in $(\mathbf{T}^A_v, \prec_v)$ corresponds to an infinite active tree, that is, a winning strategy for $P_A$ starting from $v$. In other words, $\prec_v$ is well-founded if and only if $P_A$ has no winning strategy starting from the position $v$.

• $(\mathbf{T}^A_v, \prec_v)$ has the tree $\emptyset$ of depth 0 as its unique maximal element. Indeed, any active tree remains active if its leaves are removed, so there is an increasing chain from any active tree to $\emptyset$.

The definition of value uses the notion of rank from Section A.3.

Definition B.5. The game value $\mathrm{val} : P \to \mathrm{ORD}^*$ is defined as follows.

• $\mathrm{val}(v) = -1$ if $v \notin A$.
• $\mathrm{val}(v) = \Omega$ if $v \in A$ and $\prec_v$ is not well-founded.
• $\mathrm{val}(v) = \rho_{\prec_v}(\emptyset)$ if $v \in A$ and $\prec_v$ is well-founded.

In words, $\mathrm{val}(v) = -1$ means $P_L$ has already won; $\mathrm{val}(v) = \Omega$ means $P_L$ can no longer win; and otherwise $\mathrm{val}(v)$ is the maximal rank of an active tree in $(\mathbf{T}^A_v, \prec_v)$, which quantifies how long $P_A$ can postpone $P_L$ winning the game (cf. Section A.3).

For future reference, we record some elementary properties of the rank $\rho_{\prec_v}$.

Lemma B.6. Fix $v \in P$ such that $0 \le \mathrm{val}(v) < \Omega$.
(a) $\mathbf{t}' \prec_v \mathbf{t}$ implies $\rho_{\prec_v}(\mathbf{t}') < \rho_{\prec_v}(\mathbf{t})$ for any $\mathbf{t}, \mathbf{t}' \in \mathbf{T}^A_v$.
(b) For any $\mathbf{t}' \in \mathbf{T}^A_v$, $\mathbf{t}' \ne \emptyset$, there is a unique $\mathbf{t} \in \mathbf{T}^A_v$ such that $\mathbf{t}' \prec_v \mathbf{t}$.
(c) For any $\mathbf{t} \in \mathbf{T}^A_v$ and $\kappa < \rho_{\prec_v}(\mathbf{t})$, there exists $\mathbf{t}' \prec_v \mathbf{t}$ so that $\kappa \le \rho_{\prec_v}(\mathbf{t}')$.

Proof. For (a), it suffices to note that $\rho_{\prec_v}(\mathbf{t}') + 1 \le \rho_{\prec_v}(\mathbf{t})$ for any $\mathbf{t}' \prec_v \mathbf{t}$ by the definition of rank. For (b), note that $\mathbf{t}$ is obtained from $\mathbf{t}'$ by removing its leaves. For (c), argue by contradiction: if $\rho_{\prec_v}(\mathbf{t}') < \kappa$ for all $\mathbf{t}' \prec_v \mathbf{t}$, then the definition of rank yields $\rho_{\prec_v}(\mathbf{t}) \le \kappa$, contradicting the assumption $\kappa < \rho_{\prec_v}(\mathbf{t})$.

In the absence of regularity assumptions, game values could be arbitrarily large ordinals (see Appendix C). Remarkably, however, this is not the case in our setting. The assumption that $W$ is coanalytic implies that only countable game values may appear. This fact plays a crucial role in the proof of Theorem B.1.

Lemma B.7. For any $v \in P$, either $\mathrm{val}(v) = \Omega$ or $\mathrm{val}(v) < \omega_1$.

Proof. We may assume without loss of generality that $0 \le \mathrm{val}(v) < \Omega$. There is also no loss in extending the relation $\prec_v$ to $\mathbf{T}_v$ as follows: $\mathbf{t}' \prec_v \mathbf{t}$ is defined as above whenever $\mathbf{t}, \mathbf{t}' \in \mathbf{T}^A_v$, while $\mathbf{t} \notin \mathbf{T}^A_v$ has no relation to any element of $\mathbf{T}_v$.
Then every $\mathbf{t} \notin \mathbf{T}^A_v$ is minimal, while the rank of $\mathbf{t} \in \mathbf{T}^A_v$ is unchanged. With this extension, the relation $\prec_v$ on $\mathbf{T}_v$ is defined by
$$R_{\prec_v} = \{(\mathbf{t}', \mathbf{t}) \in \mathbf{T}_v \times \mathbf{T}_v : \mathbf{t}' \prec_v \mathbf{t},\ \mathbf{t}' \in \mathbf{T}^A_v\};$$
here $\mathbf{t}$ is uniquely obtained from $\mathbf{t}' \in \mathbf{T}^A_v$ by removing its leaves. Because $\mathbf{T}^A_v$ is analytic, it follows that $\prec_v$ is a well-founded analytic relation on the Polish space $\mathbf{T}_v$. The conclusion follows from Theorem A.11.

B.3 A winning strategy

Our aim now is to show that the game values give rise to a winning strategy for $P_L$. The key observation is the following.

Proposition B.8. Fix $0 \le n < \infty$ and $v \in P_n$ such that $0 \le \mathrm{val}(v) < \Omega$. For every $x \in \mathcal{X}_{n+1}$, there exists $y \in \mathcal{Y}_{n+1}$ such that $\mathrm{val}(v, x, y) < \mathrm{val}(v)$.

Before we prove this result, let us first explain the intuition in the particularly simple case that $\mathrm{val}(v) = m < \omega$ is finite. By the definition of value, the maximal depth of an active tree in $\mathbf{T}^A_v$ is $m$ (cf. Example A.6). Now suppose, for sake of contradiction, that there exists $x$ such that $\mathrm{val}(v, x, y) \ge m$ for every $y$. That is, there exists an active tree $\mathbf{t}_y \in \mathbf{T}^A_{v,x,y}$ of depth $m$ for every $y$. Then we can construct an active tree in $\mathbf{T}^A_v$ of depth $m + 1$ by taking $x$ as the root and attaching each $\mathbf{t}_y$ as the subtree of the corresponding child. But this is impossible, as we assumed that the maximal depth of an active tree in $\mathbf{T}^A_v$ is $m$.

We use the same idea of "gluing together trees $\mathbf{t}_y$" in the case that $\mathrm{val}(v)$ is an infinite ordinal, but its implementation in this case is more subtle. The key to the proof is the following lemma.

Lemma B.9. Fix $0 \le n < \infty$, $v \in P_n$, $x \in \mathcal{X}_{n+1}$, and $y, y' \in \mathcal{Y}_{n+1}$ such that $\mathrm{val}(v, x, y) \le \mathrm{val}(v, x, y')$. Then there exists a map $f : \mathbf{T}^A_{v,x,y} \to \mathbf{T}^A_{v,x,y'}$ such that:
(a) $\mathrm{depth}(f(\mathbf{t})) = \mathrm{depth}(\mathbf{t})$ for all $\mathbf{t} \in \mathbf{T}^A_{v,x,y}$.
(b) $\mathbf{t}' \prec_{v,x,y} \mathbf{t}$ implies $f(\mathbf{t}') \prec_{v,x,y'} f(\mathbf{t})$ for all $\mathbf{t}, \mathbf{t}' \in \mathbf{T}^A_{v,x,y}$.

Proof. We first dispose of trivial cases.
If $\mathrm{val}(v, x, y) = -1$, then $\mathbf{T}^A_{v,x,y} = \emptyset$ and there is nothing to prove. If $\mathrm{val}(v, x, y') = \Omega$, there is an infinite decreasing chain
$$\emptyset = \mathbf{t}(0) \succ_{v,x,y'} \mathbf{t}(1) \succ_{v,x,y'} \mathbf{t}(2) \succ_{v,x,y'} \mathbf{t}(3) \succ_{v,x,y'} \cdots$$
in $\mathbf{T}^A_{v,x,y'}$. In this case we may define $f(\mathbf{t}) = \mathbf{t}(k)$ whenever $\mathrm{depth}(\mathbf{t}) = k$, and it is readily verified that the desired properties hold. We therefore assume in the remainder of the proof that $0 \le \mathrm{val}(v, x, y) \le \mathrm{val}(v, x, y') < \Omega$.

We now define $f(\mathbf{t})$ by induction on $\mathrm{depth}(\mathbf{t})$. For the induction to go through, we maintain the following invariants:

• $\mathrm{depth}(f(\mathbf{t})) = \mathrm{depth}(\mathbf{t})$.
• $\rho_{\prec_{v,x,y}}(\mathbf{t}) \le \rho_{\prec_{v,x,y'}}(f(\mathbf{t}))$.

For the base, let $f(\emptyset) = \emptyset$. Because $\mathrm{val}(v, x, y) \le \mathrm{val}(v, x, y')$, we have $\rho_{\prec_{v,x,y}}(\emptyset) \le \rho_{\prec_{v,x,y'}}(f(\emptyset))$. For the step, suppose that $f(\mathbf{t})$ has been defined for all $\mathbf{t} \in \mathbf{T}^A_{v,x,y}$ with $\mathrm{depth}(\mathbf{t}) = k - 1$. Now consider $\mathbf{t}' \in \mathbf{T}^A_{v,x,y}$ with $\mathrm{depth}(\mathbf{t}') = k$, and let $\mathbf{t} \succ_{v,x,y} \mathbf{t}'$ be the tree obtained by removing its leaves. Then we have
$$\rho_{\prec_{v,x,y}}(\mathbf{t}') < \rho_{\prec_{v,x,y}}(\mathbf{t}) \le \rho_{\prec_{v,x,y'}}(f(\mathbf{t}))$$
by Lemma B.6(a) and the induction hypothesis. Therefore, by Lemma B.6(c), we may choose $f(\mathbf{t}') \prec_{v,x,y'} f(\mathbf{t})$ so that $\rho_{\prec_{v,x,y}}(\mathbf{t}') \le \rho_{\prec_{v,x,y'}}(f(\mathbf{t}'))$. In this manner we have defined $f(\mathbf{t}')$ for each $\mathbf{t}' \in \mathbf{T}^A_{v,x,y}$ with $\mathrm{depth}(\mathbf{t}') = k$. It is readily verified that the desired properties of the map $f$ hold by construction.

We can now complete the proof of Proposition B.8.

Proof of Proposition B.8. Fix $x \in \mathcal{X}_{n+1}$ throughout the proof. If there exists $y \in \mathcal{Y}_{n+1}$ so that $\mathrm{val}(v, x, y) = -1$, the conclusion is trivial. We can therefore assume that $\mathrm{val}(v, x, y) \ge 0$ for every $y$. This implies, in particular, that $\{x\} \in \mathbf{T}^A_v$.

Because any collection of ordinals contains a minimal element, we can choose $y^* \in \mathcal{Y}_{n+1}$ such that $\mathrm{val}(v, x, y^*) \le \mathrm{val}(v, x, y)$ for all $y$. The main part of the proof is to construct an order-preserving map $\iota : \mathbf{T}^A_{v,x,y^*} \to \mathbf{T}^A_v$ such that $\iota(\emptyset) = \{x\}$.
Because $\mathrm{val}(v) < \Omega$, we know that $\prec_v$ is well-founded. It follows by monotonicity of rank under order-preserving maps that $\prec_{v,x,y^*}$ is well-founded and
$$\mathrm{val}(v, x, y^*) = \rho_{\prec_{v,x,y^*}}(\emptyset) \le \rho_{\prec_v}(\{x\}) < \rho_{\prec_v}(\emptyset) = \mathrm{val}(v),$$
concluding the proof of the proposition.

It therefore remains to construct the map $\iota$. To this end, we use Lemma B.9 to construct for every $y$ an order-preserving map $f_y : \mathbf{T}^A_{v,x,y^*} \to \mathbf{T}^A_{v,x,y}$ such that $\mathrm{depth}(f_y(\mathbf{t})) = \mathrm{depth}(\mathbf{t})$. Given any $\mathbf{t} \in \mathbf{T}^A_{v,x,y^*}$, we define a decision tree $\iota(\mathbf{t})$ by taking $x$ as its root and attaching $f_y(\mathbf{t})$ as its subtree of the root-to-child edge labelled by $y$, for every $y \in \mathcal{Y}_{n+1}$. By construction $\iota(\mathbf{t}) \in \mathbf{T}^A_v$ is an active tree, $\iota(\emptyset) = \{x\}$, and $\iota$ is order-preserving as each of the maps $f_y$ is order-preserving.

As we assumed at the outset that $P_L$ has a winning strategy, the initial value of the game is an ordinal $\mathrm{val}(\emptyset) < \Omega$. We can now use Proposition B.8 to describe an explicit winning strategy. In each round in which $P_L$ has not yet won, for each point $x_t$ that is played by $P_A$, Proposition B.8 ensures that $P_L$ can choose $y_t$ so that $\mathrm{val}(x_1, y_1, \ldots, x_t, y_t) < \mathrm{val}(x_1, y_1, \ldots, x_{t-1}, y_{t-1})$. This choice of $y_t$ defines a winning strategy for $P_L$, because the ordinals are well-ordered.

B.4 Measurability

We have constructed value-decreasing winning strategies for $P_L$. To conclude the proof of Theorem B.1, it remains to show that it is possible to construct a universally measurable value-decreasing strategy. The main remaining step is to show that the set of positions with any given game value is measurable.

Lemma B.10. For any $0 \le n < \infty$, $v \in P_n$, and $\kappa \in \mathrm{ORD}$, we have $\mathrm{val}(v) > \kappa$ if and only if there exists $x \in \mathcal{X}_{n+1}$ such that $\mathrm{val}(v, x, y) \ge \kappa$ for all $y \in \mathcal{Y}_{n+1}$.

Proof. Suppose first there exists $x$ such that $\mathrm{val}(v, x, y) \ge \kappa$ for all $y$. If $\mathrm{val}(v) < \Omega$, then it follows immediately from Proposition B.8 that $\mathrm{val}(v) > \kappa$.
On the other hand, if $\mathrm{val}(v) = \Omega$, the conclusion is trivial.

In the opposite direction, let $\mathrm{val}(v) > \kappa$. If $\mathrm{val}(v) = \Omega$, then choosing $x$ to be the root label of an infinite active tree yields $\mathrm{val}(v, x, y) = \Omega \ge \kappa$ for all $y$. On the other hand, if $\mathrm{val}(v) < \Omega$, then we have $\rho_{\prec_v}(\emptyset) = \mathrm{val}(v) > \kappa$. By the definition of rank, there exists $x$ such that $\{x\} \in \mathbf{T}^A_v$ and $\rho_{\prec_v}(\{x\}) + 1 > \kappa$ or, equivalently, $\rho_{\prec_v}(\{x\}) \ge \kappa$. Thus it remains to show that $\rho_{\prec_v}(\{x\}) \le \mathrm{val}(v, x, y)$ for every $y$.

To this end, we follow in essence the reverse of the argument used in the proof of Proposition B.8. Denote by $\mathbf{T}^A_{v,x} \subseteq \mathbf{T}^A_v$ the set of active trees with root $x$, and by $\prec_{v,x}$ the induced relation. The definition of rank implies $\rho_{\prec_v}(\{x\}) = \rho_{\prec_{v,x}}(\{x\})$. On the other hand, for any $\mathbf{t} \in \mathbf{T}^A_{v,x}$, denote by $f_y(\mathbf{t}) \in \mathbf{T}^A_{v,x,y}$ its subtree of the root-to-child edge labelled by $y$. Then $f_y : \mathbf{T}^A_{v,x} \to \mathbf{T}^A_{v,x,y}$ is an order-preserving map such that $f_y(\{x\}) = \emptyset$. Therefore, either $\mathrm{val}(v, x, y) = \Omega$, or
$$\rho_{\prec_v}(\{x\}) = \rho_{\prec_{v,x}}(\{x\}) \le \rho_{\prec_{v,x,y}}(\emptyset) = \mathrm{val}(v, x, y)$$
by monotonicity of rank under order-preserving maps.

Corollary B.11. The set $A^\kappa_n := \{v \in A_n : \mathrm{val}(v) > \kappa\}$ is analytic for every $0 \le n < \infty$ and $-1 \le \kappa < \omega_1$.

Proof. The proof is by induction on $\kappa$. First note that $A^{-1}_n = A_n$ is analytic for every $n$. Now for any $0 \le \kappa < \omega_1$, by Lemma B.10,
$$A^\kappa_n = \bigcup_{x \in \mathcal{X}_{n+1}} \bigcap_{y \in \mathcal{Y}_{n+1}} \bigcap_{-1 \le \lambda < \kappa} \{v \in A_n : \mathrm{val}(v, x, y) > \lambda\} = \bigcup_{x \in \mathcal{X}_{n+1}} \bigcap_{y \in \mathcal{Y}_{n+1}} \bigcap_{-1 \le \lambda < \kappa} \{v \in A_n : (v, x, y) \in A^\lambda_{n+1}\}.$$
As $\kappa < \omega_1$, the intersections in this expression are countable. Therefore, as $A^\lambda_{n+1}$ is analytic for $\lambda < \kappa$ by the induction hypothesis, it follows that $A^\kappa_n$ is analytic.

We can now conclude the proof of Theorem B.1.

Proof of Theorem B.1. We assume that $P_L$ has a winning strategy (otherwise the conclusion is trivial).
For any $0 \le n < \infty$, define
$$D_{n+1} := \{(v, x, y) \in P_{n+1} : \mathrm{val}(v, x, y) < \min\{\mathrm{val}(v), \mathrm{val}(\emptyset)\}\} = \bigcup_{-1 \le \kappa < \mathrm{val}(\emptyset)} \{(v, x, y) \in P_{n+1} : \mathrm{val}(v, x, y) \le \kappa < \mathrm{val}(v)\} = \bigcup_{-1 \le \kappa < \mathrm{val}(\emptyset)} \{(v, x, y) \in P_{n+1} : (v, x, y) \in (A^\kappa_{n+1})^c,\ v \in A^\kappa_n\},$$
where $A^\kappa_n$ is defined in Corollary B.11. As $P_L$ has a winning strategy, Lemma B.7 implies that $\mathrm{val}(\emptyset) < \omega_1$. Thus the union in the definition of $D_{n+1}$ is countable, and it follows from Corollary B.11 that $D_{n+1}$ is universally measurable.

Now define for every $t \ge 1$ the function $g_t : P_{t-1} \times \mathcal{X}_t \to \mathcal{Y}_t$ as follows. As $\mathcal{Y}_t$ is countable, we may enumerate it as $\mathcal{Y}_t = \{y^1, y^2, y^3, \ldots\}$. Set
$$g_t(v, x) := \begin{cases} y^i & \text{if } (v, x, y^j) \notin D_t \text{ for } j < i \text{ and } (v, x, y^i) \in D_t, \\ y^1 & \text{if } (v, x, y^j) \notin D_t \text{ for all } j. \end{cases}$$
In words, $g_t(v, x) = y^i$ for the first index $i$ such that $(v, x, y^i) \in D_t$, and we set it arbitrarily to $y^1$ if $(v, x, y^j) \notin D_t$ for all $j$. This defines a universally measurable strategy for $P_L$. It remains to show this strategy is winning.

To this end, suppose that $\mathrm{val}(x_1, y_1, \ldots, x_{t-1}, y_{t-1}) \le \mathrm{val}(\emptyset)$. By Proposition B.8, for every $x_t$ there exists $y_t$ so that $(x_1, y_1, \ldots, x_t, y_t) \in D_t$. Thus playing $y_t = g_t(x_1, y_1, \ldots, x_{t-1}, y_{t-1}, x_t)$ yields, by the definition of $g_t$,
$$\mathrm{val}(x_1, y_1, \ldots, x_t, y_t) < \mathrm{val}(x_1, y_1, \ldots, x_{t-1}, y_{t-1}).$$
The assumption $\mathrm{val}(x_1, y_1, \ldots, x_{t-1}, y_{t-1}) \le \mathrm{val}(\emptyset)$ certainly holds for $t = 1$. It thus remains valid for any $t$ as long as $P_L$ plays the strategy $\{g_t\}$. It follows that $\{g_t\}$ is a value-decreasing strategy, so it is winning for $P_L$.

Appendix C. A nonmeasurable example

To fully appreciate the measurability issues that arise in this paper, it is illuminating to consider what can go wrong if we do not assume measurability in the sense of Definition 3.3. To this end we revisit in this section a standard example from empirical process theory (cf. (Dudley, 2014, Chapter 5) or (Blumer, Ehrenfeucht, Haussler, and Warmuth, 1989, p. 953)) in our setting.

For the purposes of this section, we assume the validity of the continuum hypothesis $\mathrm{card}([0,1]) = \aleph_1$. (This is not assumed anywhere else in the paper.) We may therefore identify $[0,1]$ with $\omega_1$. In particular, this induces a well-ordering of $[0,1]$, which we will denote $\lessdot$ to distinguish it from the usual ordering of the reals.

To construct our example, we let $\mathcal{X} = [0,1]$ and $\mathcal{H} = \{x \mapsto \mathbb{1}_{x \lessdot z} : z \in [0,1]\}$. Every $h \in \mathcal{H}$ is the indicator of a countable set (being an initial segment of $\omega_1$). In particular, each $h \in \mathcal{H}$ is individually measurable. However, measurability in the sense of Definition 3.3 fails for $\mathcal{H}$.

Lemma C.1. For the example of this section, the set
$$S = \{(x_1, x_2) \in \mathcal{X}^2 : \mathcal{H}_{x_1,0,x_2,1} \ne \emptyset\}$$
has inner measure zero and outer measure one with respect to the Lebesgue measure. In particular, $S$ is not Lebesgue measurable.

Proof. By the definition of $\mathcal{H}$, we have $S = \{(x_1, x_2) \in \mathcal{X}^2 : x_2 \lessdot x_1\}$. If $S$ were Lebesgue-measurable, then Fubini's theorem would yield
$$0 = \int \left( \int \mathbb{1}_S(x_1, x_2)\, dx_2 \right) dx_1 = \int \left( \int \mathbb{1}_S(x_1, x_2)\, dx_1 \right) dx_2 = 1,$$
where we used that $x_2 \mapsto \mathbb{1}_S(x_1, x_2)$ is the indicator of a countable set and that $x_1 \mapsto \mathbb{1}_S(x_1, x_2)$ is the indicator of the complement of a countable set. This is evidently absurd, so $S$ cannot be Lebesgue-measurable. That the outer measure of $S$ is one and the inner measure is zero follows readily from the above Fubini identities by bounding $S$ and $S^c$ by a measurable cover, respectively.

Corollary C.2. The class $\mathcal{H}$ is not measurable in the sense of Definition 3.3.

Proof. If $\mathcal{H}$ were measurable in the sense of Definition 3.3, then the same argument as in the proof of Corollary 3.5 would show that $S$ is analytic. But this contradicts Lemma C.1, as analytic sets are universally measurable by Theorem A.10.

Lemma C.1 illustrates the fundamental importance of measurability in our theory. For example, suppose player $P_A$ in the game of Section 3.2 draws i.i.d. random plays $x_1, x_2, \ldots$ from the Lebesgue measure on $[0,1]$. Even if $P_L$ plays the simplest type of strategy—the deterministic strategy $y_1 = 0$, $y_2 = 1$—the fact that $P_L$ wins in the second round is not measurable. Moreover, one can show (see the proof of Lemma C.3 below) that any value-minimizing strategy for $P_L$ in the sense of Section B.3 plays $y_1 = 0$, $y_2 = 1$ for $(x_1, x_2) \in S^c$. So, the same problem arises for the winning strategies constructed by Theorem B.1.

This kind of behavior would undermine any reasonable probabilistic analysis of the learning problems in this paper. Even the definitions of learning rates make no sense when the probabilities of events have no meaning. The above example therefore illustrates that measurability is crucial for learning problems with random data.

It is instructive to check what goes wrong if one attempts to prove the existence of measurable strategies as in Theorem B.1 for the present example. The coanalyticity assumption was used in the proof of Theorem B.1 in two different ways. First, it ensures that the sets of active positions $A_n$ and the super-level sets of the value function $A^\kappa_n$ are measurable for countable $\kappa$ (cf. Corollary B.11). This immediately fails in the present example (Lemma C.1). Secondly, coanalyticity was used to show that only countable game values can appear (cf. Lemma B.7). We presently show that the latter also fails in the present example, so that coanalyticity is really essential for both parts of the proof.

Lemma C.3. In the present example, the game of Section 3.2 satisfies $\mathrm{val}(\emptyset) \ge \omega_1$.

Proof. As in Section 3.4, for this game we denote $\mathrm{LD}(\mathcal{H}) := \mathrm{val}(\emptyset)$, and we recall that $\mathrm{val}(x_1, y_1, \ldots, x_t, y_t) = \mathrm{LD}(\mathcal{H}_{x_1,y_1,\ldots,x_t,y_t})$.

We must recall some facts about ordinals (Sierpiński, 1965, Section XIV.20). An ordinal $\kappa$ is called additively indecomposable if $\xi + \kappa = \kappa$ for every $\xi < \kappa$ or, equivalently, if the ordinal segment $[\xi, \kappa)$ is isomorphic to $\kappa$ for all $\xi < \kappa$.
An ordinal is additively indecomposable if and only if it is of the form $\omega^\beta$ for some ordinal $\beta$. Moreover, $\omega_1 = \omega^{\omega_1}$, so that $\omega_1$ is additively indecomposable.

For every ordinal $\beta$, define the class of indicators $\mathcal{H}^\beta = \{\lambda \mapsto \mathbb{1}_{\lambda \le \kappa} : \kappa \in \omega^\beta\}$ on $\mathcal{X}^\beta = \omega^\beta$. We now prove by induction on $\beta$ that $\mathrm{LD}(\mathcal{H}^\beta) \ge \beta$ for each $\beta$. Choosing $\beta = \omega_1$ then shows that $\mathrm{LD}(\mathcal{H}) \ge \omega_1$.

For the initial step, it suffices to note that $\mathrm{LD}(\mathcal{H}^0) = 0$ because $\mathcal{X}^0 = 1$ and $\mathcal{H}^0$ consists of a single function. Now suppose we have proved that $\mathrm{LD}(\mathcal{H}^\alpha) \ge \alpha$ for all $\alpha < \beta$. Note first that $\mathcal{H}^\beta_{\omega^\alpha, 0} = \mathcal{H}^\alpha$, where we view the latter as functions on $\mathcal{X}^\beta$. However, all functions in $\mathcal{H}^\alpha$ take the same value on points in $\mathcal{X}^\beta \setminus \mathcal{X}^\alpha$, so such points cannot appear in any active tree. It follows immediately that $\mathrm{LD}(\mathcal{H}^\beta_{\omega^\alpha, 0}) = \mathrm{LD}(\mathcal{H}^\alpha)$. By the same reasoning, now using that $[\omega^\alpha, \omega^\beta)$ is isomorphic to $\omega^\beta$, it follows that $\mathrm{LD}(\mathcal{H}^\beta_{\omega^\alpha, 1}) = \mathrm{LD}(\mathcal{H}^\beta)$. Thus $\mathrm{LD}(\mathcal{H}^\beta) > \mathrm{LD}(\mathcal{H}^\alpha) \ge \alpha$ by the induction hypothesis and Lemma B.10. As this holds for any $\alpha < \beta$, we have shown $\mathrm{LD}(\mathcal{H}^\beta) \ge \beta$.

Let us conclude our discussion of measurability by emphasizing that even in the presence of a measurability assumption such as Definition 3.3 or coanalyticity of $W$ in Theorem B.1, the key reason why we are able to construct measurable strategies is that we assumed $P_L$ plays values in countable sets $\mathcal{Y}_t$ (as is the case for all the games encountered in this paper). In general Gale-Stewart games where both $P_A$ and $P_L$ play values in Polish spaces, there is little hope of obtaining measurable strategies in a general setting. Indeed, an inspection of the proof of Corollary B.11 shows that the super-level sets of the value function are constructed by successive unions over $\mathcal{X}_t$ and intersections over $\mathcal{Y}_t$, that is, by alternating projections and complements. However, it is consistent with the axioms of set theory (ZFC) that the projection of a coanalytic set may be Lebesgue-nonmeasurable (Jech, 2003, Corollary 25.28).
Thus it is possible to construct examples of Gale-Stewart games where $\mathcal{X}_t$, $\mathcal{Y}_t$ are Polish, $W$ is closed or open, and the set $A^\kappa_n$ of Corollary B.11 is nonmeasurable already for $\kappa = 0$ or $1$. In contrast, because we assumed the $\mathcal{Y}_t$ are countable, only the unions over $\mathcal{X}_t$ play a nontrivial role in our setting, and analyticity is preserved in the construction.

References

A. Antos and G. Lugosi. Strong minimax lower bounds for learning. Machine Learning, 30:31–56, 1998.

J.-Y. Audibert and A. B. Tsybakov. Fast learning rates for plug-in classifiers. The Annals of Statistics, 35(2):608–633, 2007.

M.-F. Balcan, S. Hanneke, and J. Wortman Vaughan. The true sample complexity of active learning. Machine Learning, 80(2–3):111–139, 2010.

G. M. Benedek and A. Itai. Nonuniform learnability. Journal of Computer and System Sciences, 48:311–323, 1994.

A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the Association for Computing Machinery, 36(4):929–965, 1989.

D. Cohn and G. Tesauro. Can neural networks do better than the Vapnik-Chervonenkis bounds? In Advances in Neural Information Processing Systems, 1990.

D. Cohn and G. Tesauro. How tight are the Vapnik-Chervonenkis bounds? Neural Computation, 4(2):249–269, 1992.

D. L. Cohn. Measure Theory. Birkhäuser, Boston, Mass., 1980.

C. Dellacherie. Les dérivations en théorie descriptive des ensembles et le théorème de la borne. In Séminaire de Probabilités, XI (Univ. Strasbourg, Strasbourg, 1975/1976), pages 34–46. Lecture Notes in Math., Vol. 581. Springer, 1977.

L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York, 1996.

R. M. Dudley. Uniform Central Limit Theorems, volume 142 of Cambridge Studies in Advanced Mathematics. Cambridge University Press, New York, second edition, 2014.

A. Ehrenfeucht, D. Haussler, M. Kearns, and L. Valiant. A general lower bound on the number of examples needed for learning. Information and Computation, 82(3):247–261, 1989.

C. D. A. Evans and J. D. Hamkins. Transfinite game values in infinite chess. Integers, 14: Paper No. G2, 36, 2014.

D. Gale and F. M. Stewart. Infinite games with perfect information. In Contributions to the Theory of Games, Vol. 2, Annals of Mathematics Studies, no. 28, pages 245–266. Princeton University Press, Princeton, N.J., 1953.

S. Hanneke. Theoretical Foundations of Active Learning. PhD thesis, Machine Learning Department, School of Computer Science, Carnegie Mellon University, 2009.

S. Hanneke. Activized learning: Transforming passive to active with improved label complexity. Journal of Machine Learning Research, 13(5):1469–1587, 2012.

S. Hanneke. Learning whenever learning is possible: Universal learning under general stochastic processes. arXiv:1706.01418, 2017.

S. Hanneke, A. Kontorovich, S. Sabato, and R. Weiss. Universal Bayes consistency in metric spaces. arXiv:1705.08184, 2019.

D. Haussler, N. Littlestone, and M. Warmuth. Predicting {0,1}-functions on randomly drawn points. Information and Computation, 115(2):248–292, 1994.

W. Hodges. Model Theory, volume 42 of Encyclopedia of Mathematics and its Applications. Cambridge University Press, Cambridge, 1993.

K. Hrbacek and T. Jech. Introduction to Set Theory, volume 220 of Monographs and Textbooks in Pure and Applied Mathematics. Marcel Dekker, Inc., New York, third edition, 1999.

T. Jech. Set Theory. Springer Monographs in Mathematics. Springer-Verlag, Berlin, 2003. The third millennium edition, revised and expanded.

A. S. Kechris. Classical Descriptive Set Theory, volume 156 of Graduate Texts in Mathematics. Springer-Verlag, New York, 1995.

V. Koltchinskii and O. Beznosova. Exponential convergence rates in classification. In Learning Theory, 18th Annual Conference on Learning Theory, COLT 2005, volume 3559 of Lecture Notes in Computer Science, pages 295–307. Springer, 2005.

N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285–318, 1988.

A. Nitanda and T. Suzuki. Stochastic gradient descent with exponential convergence rates of expected classification errors. In AISTATS, volume 89 of Proceedings of Machine Learning Research, pages 1417–1426. PMLR, 2019.

V. Pestov. PAC learnability versus VC dimension: A footnote to a basic result of statistical learning. In The 2011 International Joint Conference on Neural Networks, pages 1141–1145, 2011.

L. Pillaud-Vivien, A. Rudi, and F. Bach. Exponential convergence of testing error for stochastic gradient methods. In Conference On Learning Theory, COLT 2018, volume 75 of Proceedings of Machine Learning Research, pages 250–296. PMLR, 2018.

D. Schuurmans. Characterizing rational versus exponential learning curves. Journal of Computer and System Sciences, 55(1):140–160, 1997.

W. Sierpiński. Cardinal and Ordinal Numbers. Second revised edition. Monografie Matematyczne, Vol. 34. Państwowe Wydawnictwo Naukowe, Warsaw, 1965.

C. J. Stone. Consistent nonparametric regression. The Annals of Statistics, 5(4):595–620, 1977.

L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.

R. van Handel. The universal Glivenko-Cantelli property. Probability Theory and Related Fields, 155:911–934, 2013.

V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264–280, 1971.

V. Vapnik and A. Chervonenkis. Theory of Pattern Recognition. Nauka, Moscow, 1974.

L. Yang and S. Hanneke. Activized learning with uniform classification noise. In Proceedings of the 30th International Conference on Machine Learning, 2013.