A Theory of Universal Learning
Olivier Bousquet, Steve Hanneke, Shay Moran, Ramon van Handel, Amir Yehudayoff
Olivier Bousquet [email protected]
Google, Brain Team
Steve Hanneke [email protected]
Toyota Technological Institute at Chicago
Shay Moran [email protected]
Technion
Ramon van Handel [email protected]
Princeton University
Amir Yehudayoff [email protected]
Technion
Abstract
How quickly can a given class of concepts be learned from examples? It is common to measure the performance of a supervised machine learning algorithm by plotting its “learning curve”, that is, the decay of the error rate as a function of the number of training examples. However, the classical theoretical framework for understanding learnability, the PAC model of Vapnik-Chervonenkis and Valiant, does not explain the behavior of learning curves: the distribution-free PAC model of learning can only bound the upper envelope of the learning curves over all possible data distributions. This does not match the practice of machine learning, where the data source is typically fixed in any given scenario, while the learner may choose the number of training examples on the basis of factors such as computational resources and desired accuracy.

In this paper, we study an alternative learning model that better captures such practical aspects of machine learning, but still gives rise to a complete theory of the learnable in the spirit of the PAC model. More precisely, we consider the problem of universal learning, which aims to understand the performance of learning algorithms on every data distribution, but without requiring uniformity over the distribution. The main result of this paper is a remarkable trichotomy: there are only three possible rates of universal learning. More precisely, we show that the learning curves of any given concept class decay either at an exponential, linear, or arbitrarily slow rate. Moreover, each of these cases is completely characterized by appropriate combinatorial parameters, and we exhibit optimal learning algorithms that achieve the best possible rate in each case.

For concreteness, we consider in this paper only the realizable case, though analogous results are expected to extend to more general learning scenarios.

Contents
A Mathematical background
  A.1 Gale-Stewart games
  A.2 Ordinals
  A.3 Well-founded relations and ranks
  A.4 Polish spaces and analytic sets

B Measurability of Gale-Stewart strategies
  B.1 Preliminaries
  B.2 Game values
  B.3 A winning strategy
  B.4 Measurability

C A nonmeasurable example

1. Introduction
In supervised machine learning, a learning algorithm is presented with labeled examples of a concept, and the objective is to output a classifier which correctly classifies most future examples from the same source. Supervised learning has been successfully applied in a vast number of scenarios, such as image classification and natural language processing. In any given scenario, it is common to consider the performance of an algorithm by plotting its “learning curve”, that is, the error rate (measured on held-out data) as a function of the number of training examples n. A learning algorithm is considered successful if the learning curve approaches zero as n → ∞, and the difficulty of the learning task is reflected by the rate at which this curve approaches zero. One of the main goals of learning theory is to predict what learning rates are achievable in a given learning task.

To this end, the gold standard of learning theory is the celebrated PAC model (Probably Approximately Correct) defined by Vapnik and Chervonenkis (1974) and Valiant (1984). As will be recalled below, the PAC model aims to explain the best worst-case learning rate, over all data distributions that are consistent with a given concept class, that is achievable by a learning algorithm. The fundamental result in this theory exhibits a striking dichotomy: a given learning problem either has a linear worst-case learning rate (i.e., n^{-1}), or is not learnable at all in this sense. These two cases are characterized by a fundamental combinatorial parameter of a learning problem: the VC (Vapnik-Chervonenkis) dimension. Moreover, in the learnable case, PAC theory provides optimal learning algorithms that achieve the linear worst-case rate.

While it gives rise to a clean and compelling mathematical picture, one may argue that the PAC model fails to capture at a fundamental level the true behavior of many practical learning problems.
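As a concrete aside, the VC dimension that governs this dichotomy can be computed by brute force for small finite classes. The following sketch is purely illustrative (the encoding of classifiers as label tuples and the example classes are our own choices, not part of the paper):

```python
from itertools import combinations, product

def vc_dimension(H, X):
    """VC dimension of a finite class H, with each h in H encoded as a
    tuple of 0/1 labels indexed by the points of X: the largest d such
    that some d-point subset of X realizes all 2^d label patterns."""
    d = 0
    for k in range(1, len(X) + 1):
        for S in combinations(X, k):
            patterns = {tuple(h[x] for x in S) for h in H}
            if len(patterns) == 2 ** k:  # S is shattered by H
                d = k
                break
    return d

X = range(4)
# Threshold classifiers h_t(x) = 1[x >= t]: VC dimension 1.
thresholds = {tuple(int(x >= t) for x in X) for t in range(5)}
print(vc_dimension(thresholds, X))   # 1
# All 0/1 functions on 4 points: VC dimension 4.
print(vc_dimension(set(product([0, 1], repeat=4)), X))   # 4
```

The exhaustive search over subsets is exponential in |X|, which is fine for toy examples but not a practical tool; it serves only to make the combinatorial definition concrete.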
A key criticism of the PAC model is that the distribution-independent definition of learnability is too pessimistic to explain practical machine learning: real-world data is rarely worst-case, and experiments show that practical learning rates can be much faster than is predicted by PAC theory (Cohn and Tesauro, 1990, 1992). It therefore appears that the worst-case nature of the PAC model hides key features that are observed in practical learning problems. These considerations motivate the search for alternative learning models that better capture the practice of machine learning, but still give rise to a canonical mathematical theory of learning rates. Moreover, given a theoretical framework capable of expressing these faster learning rates, we can then design new learning strategies to fully exploit this possibility.

The aim of this paper is to put forward one such theory. In the learning model considered here, we will investigate asymptotic rates of convergence of distribution-dependent bounds on the error of a learning algorithm, holding universally for all distributions consistent with a given concept class. Although this is a much weaker (and therefore arguably more realistic) notion, we will nonetheless prove that any learning problem can only exhibit one of three possible universal rates: exponential, linear, and arbitrarily slow. Each of these three cases will be fully characterized by means of combinatorial parameters (the nonexistence of certain infinite trees), and we will exhibit optimal learning algorithms that achieve these rates (based on the theory of infinite games).

Throughout this paper we will be concerned with the following classical learning problem. A classification problem is defined by a distribution P over labelled examples (x, y) ∈ X × {0, 1}. The learner does not know P, but is able to collect a sample of n i.i.d. examples from P. She uses these examples to build a classifier ĥ_n : X → {0, 1}.
The objective of the learner is to achieve small error:

er(ĥ_n) := P{(x, y) : ĥ_n(x) ≠ y}.

While the data distribution P is unknown to the learner, any informative a priori theory of learning must be expressed in terms of some properties of, or restrictions on, P. Following the PAC model, we introduce such a restriction by way of an additional component, namely a concept class H ⊆ {0, 1}^X of classifiers. The concept class H allows the analyst to state assumptions about P. The simplest such assumption is that P is realizable:

inf_{h ∈ H} er(h) = 0,

that is, H contains hypotheses with arbitrarily small error. We will focus on the realizable setting throughout this paper, as it already requires substantial new ideas and provides a clean platform to demonstrate them. We believe that the ideas of this paper can be extended to more general noisy/agnostic settings, and leave this direction to be explored in future work.

In the present context, the aim of learning theory is to provide tools for understanding the best possible rates of convergence of E[er(ĥ_n)] to zero as the sample size n grows to ∞. This rate depends on the quality of the learning algorithm, and on the complexity of the concept class H. The more complex H is, the less information the learner has about P, and thus the slower the convergence.

The classical formalization of the problem of learning in statistical learning theory is given by the
PAC model, which adopts a minimax perspective. More precisely, let us denote by RE(H) the family of distributions P for which the concept class H is realizable. Then the fundamental result of PAC learning theory states that (Vapnik and Chervonenkis, 1974; Ehrenfeucht, Haussler, Kearns, and Valiant, 1989; Haussler, Littlestone, and Warmuth, 1994)

inf_{ĥ_n} sup_{P ∈ RE(H)} E[er(ĥ_n)] ≍ min( vc(H)/n , 1 ),

where vc(H) is the VC dimension of H. In other words, PAC learning theory is concerned with the best worst-case error over all realizable distributions that can be achieved by means of a learning algorithm ĥ_n. The above result immediately implies a fundamental dichotomy for these uniform rates: every concept class H has a uniform rate that is either linear c/n or bounded away from zero, depending on the finiteness of the combinatorial parameter vc(H).

The uniformity over P in the PAC model is very pessimistic, however, as it allows the worst-case distribution to change with the sample size. This arguably does not reflect the practice of machine learning: in a given learning scenario, the data generating mechanism P is fixed, while the learner is allowed to collect an arbitrary amount of data (depending on factors such as the desired accuracy and the available computational resources). Experiments show that the rate at which the error decays for any given P can be much faster than is suggested by PAC theory (Cohn and Tesauro, 1990, 1992): for example, it is possible that the learning curve decays exponentially for every P. Such rates cannot be explained by the PAC model, which can only capture the upper envelope of the learning curves over all realizable P, as is illustrated in Figure 1.

Furthermore, one may argue that it is really the learning curve for given P, rather than the PAC error bound, that is observed in practice.
Indeed, the customary approach to estimate the performance of an algorithm is to measure its empirical learning rate, that is, to train it on several training sets of increasing sizes (obtained from the same data source) and to measure the test error of each of the obtained classifiers. In contrast, to observe the PAC rate, one would have to repeat the above measurements for many different data distributions, and then discard all this data except for the worst-case error over all considered distributions. From this perspective, it is inevitable that the PAC model may fail to reveal the “true” empirical behavior of learning algorithms. More refined theoretical results have been obtained on a case-by-case basis in various practical situations: for example, under margin assumptions, some works established exponentially fast learning rates for popular algorithms such as stochastic gradient descent and kernel methods (Koltchinskii and Beznosova, 2005; Audibert and Tsybakov, 2007; Pillaud-Vivien, Rudi, and Bach, 2018; Nitanda and Suzuki, 2019). Such results rely on additional modelling assumptions, however, and do not provide a fundamental theory of the learnable in the spirit of PAC learning.

Figure 1:
Illustration of the difference between universal and uniform rates. Each red curve shows exponential decay of the error E[er(ĥ_n)] ∼ e^{−c(P)n} for a different data distribution P; but the PAC rate only captures the pointwise supremum of these curves (blue curve, ∼ 1/n), which decays linearly at best.

Our aim in this paper is to propose a mathematical theory that is able to capture some of the above features of practical learning systems, yet provides a complete characterization of achievable learning rates for general learning tasks. Instead of considering uniform learning rates as in the PAC model, we consider instead the problem of universal learning. The term universal means that a given property (such as consistency or rate) holds for every realizable distribution P, but not uniformly over all distributions. For example, a class H is universally learnable at rate R if the following holds:

∃ ĥ_n s.t. ∀ P ∈ RE(H), ∃ C, c > 0 : E[er(ĥ_n)] ≤ C R(cn) for all n.

The crucial difference between this formulation and the PAC model is that here the constants
C, c are allowed to depend on P: thus universal learning is able to capture distribution-dependent learning curves for a given learning task. For example, the illustration in Figure 1 suggests that it is perfectly possible for a concept class H to be universally learnable at an exponential rate, even though its uniform learning rate is only linear. In fact, we will see that there is little connection between universal and uniform learning rates (as is illustrated in Figure 4 of Section 2): a given problem may even be universally learnable at an exponential rate while it is not learnable at all in the PAC sense. These two models of learning reveal fundamentally different features of a given learning problem. The fundamental question that we pose in this paper is:

Question.
Given a class H, what is the fastest rate at which H can be universally learned?

We provide a complete answer to this question, characterize the achievable rates by means of combinatorial parameters, and exhibit learning algorithms that achieve these rates. The universal learning model therefore gives rise to a theory of learning that fully complements the classical PAC theory.
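To make the notion of a distribution-dependent learning curve concrete, here is a toy simulation; all specifics (the four-point domain, the finite class, and the consistent learning rule) are hypothetical choices for illustration. For this fixed P, the empirical error of a consistent learner decays exponentially in n:

```python
import random

random.seed(0)

# A toy finite class on X = {0,1,2,3}: the target labels everything 0,
# and each competing hypothesis flips exactly one point (error 1/4 each).
X = [0, 1, 2, 3]
target = (0, 0, 0, 0)
H = [(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1), target]

def true_error(h):
    return sum(h[x] != target[x] for x in X) / len(X)

def consistent_learner(sample):
    # Return the first h in H consistent with the sample (an ERM rule;
    # the bad hypotheses are listed first, to be adversarial).
    for h in H:
        if all(h[x] == y for x, y in sample):
            return h
    return target  # unreachable in the realizable setting

def avg_error(n, trials=200):
    # Monte Carlo estimate of E[er(h_n)] under the uniform marginal on X.
    total = 0.0
    for _ in range(trials):
        sample = [(x, target[x]) for x in random.choices(X, k=n)]
        total += true_error(consistent_learner(sample))
    return total / trials

for n in [1, 5, 20, 40]:
    print(n, avg_error(n))  # decays roughly like (3/4)^n
```

The decay matches the analytic bound of Example 1.1 below: the probability that any positive-error hypothesis survives n samples is at most |H|(1 − ε)^n, with ε = 1/4 here.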
Before we proceed to the statement of our main results, we aim to develop some initial intuition for what universal learning rates are achievable. To this end, we briefly discuss three basic examples.
Example 1.1.
Any finite class H is universally learnable at an exponential rate (Schuurmans, 1997). Indeed, let ε be the minimal error er(h) among all classifiers h ∈ H with positive error er(h) > 0. The probability that some such classifier correctly classifies n training data points is bounded by |H|(1 − ε)^n. Thus a learning rule that outputs any ĥ_n ∈ H that correctly classifies the training data satisfies E[er(ĥ_n)] ≤ C e^{−cn}, where C, c > 0 depend on H, P. It is easily seen that this is the best possible: as long as H contains at least three functions, a learning curve cannot decay faster than exponentially (see Lemma 4.2 below).

Example 1.2.
The class H = {h_t : t ∈ R} of threshold classifiers on the real line, h_t(x) = 1_{x ≥ t}, is universally learnable at a linear rate. That a linear rate can be achieved already follows in this case from PAC theory, as H is a VC class. However, in this example, a linear rate is the best possible even in the universal setting: for any learning algorithm, there is a realizable distribution P whose learning curve decays no faster than a linear rate (Schuurmans, 1997).

Example 1.3.
The class H of all measurable functions on a space X is universally learnable under mild conditions (Stone, 1977; Hanneke, Kontorovich, Sabato, and Weiss, 2019): that is, there exists a learning algorithm ĥ_n that ensures E[er(ĥ_n)] → 0 as n → ∞ for every realizable distribution P. However, there can be no universal guarantee on the learning rate (Devroye, Györfi, and Lugosi, 1996). That is, for any learning algorithm ĥ_n and any function R(n) that converges to zero arbitrarily slowly, there exists a realizable distribution P such that E[er(ĥ_n)] ≥ R(n) infinitely often.

The three examples above reveal that there are at least three possible universal learning rates. Remarkably, we find that these are the only possibilities. That is, every nontrivial class H is either universally learnable at an exponential rate (but not faster), or is universally learnable at a linear rate (but not faster), or is universally learnable but necessarily with arbitrarily slow rates.

We now summarize the key definitions and main results of the paper. (We refer to Appendix A.4 for the relevant terminology on Polish spaces and measurability.)

To specify the learning problem, we specify a domain X and a concept class H ⊆ {0, 1}^X. We will henceforth assume that X is a Polish space (for example, a Euclidean space, or any countable set) and that H satisfies a minimal measurability assumption specified in Definition 3.3 below. A classifier is a universally measurable function h : X → {0, 1}. Given a probability distribution P on X × {0, 1}, the error rate of a classifier h is defined as

er(h) = er_P(h) := P{(x, y) : h(x) ≠ y}.

The distribution P is called realizable if inf_{h ∈ H} er(h) = 0.

A learning algorithm is a sequence of universally measurable functions

H_n : (X × {0, 1})^n × X → {0, 1},   n ∈ N.
The input data to the learning algorithm is a sequence of independent P-distributed pairs (X_i, Y_i). When acting on this input data, the learning algorithm outputs the data-dependent classifiers

ĥ_n(x) := H_n((X_1, Y_1), . . . , (X_n, Y_n), x).

The objective in the design of a learning algorithm is that the expected error rate E[er(ĥ_n)] of the output concept decays as rapidly as possible as a function of n. The aim of this paper is to characterize what rates of convergence of E[er(ĥ_n)] are achievable. The following definition formalizes this notion of achievable rate in the universal learning model.

Definition 1.4.
Let H be a concept class, and let R : N → [0, 1] with R(n) → 0.

• H is learnable at rate R if there is a learning algorithm ĥ_n such that for every realizable distribution P, there exist C, c > 0 for which E[er(ĥ_n)] ≤ C R(cn) for all n.

• H is not learnable at rate faster than R if for every learning algorithm ĥ_n, there exists a realizable distribution P and C, c > 0 for which E[er(ĥ_n)] ≥ C R(cn) for infinitely many n.
1. For simplicity of exposition, we have stated a definition corresponding to deterministic algorithms, to avoid the notational inconvenience required to formally define randomized algorithms in this context. Our results remain valid when allowing randomized algorithms as well: all algorithms we construct throughout this paper are deterministic, and all lower bounds we prove also hold for randomized algorithms.

Figure 2:
A Littlestone tree of depth 3. Every branch is consistent with a concept h ∈ H. This is illustrated here for one of the branches.

• H is learnable with optimal rate R if H is learnable at rate R and H is not learnable faster than R.

• H requires arbitrarily slow rates if, for every R(n) → 0, H is not learnable faster than R.

Let us emphasize that, unlike in the PAC model, every concept class H is universally learnable in the sense that there exist learning algorithms such that E[er(ĥ_n)] → 0 for every realizable P; see Example 1.3 above. However, a concept class may nonetheless require arbitrarily slow rates, in which case it is impossible for the learner to predict how fast this convergence will take place.

Remark 1.5.
While this is not assumed in the above definition, our lower bound results will in fact prove a stronger claim: namely, that when a given concept class H is not learnable at rate faster than R, the corresponding constants C, c > 0 in the lower bound can be chosen independently of the learning algorithm ĥ_n and concept class H. This is sometimes referred to as a strong minimax lower bound (Antos and Lugosi, 1998).

The following theorem is one of the main results of this work. It expresses a fundamental trichotomy: there are exactly three possibilities for optimal learning rates.

Theorem 1.6.
For every concept class H with |H| ≥ 3, exactly one of the following holds:

• H is learnable with optimal rate e^{−n}.

• H is learnable with optimal rate 1/n.

• H requires arbitrarily slow rates.

A second main result of this work provides a detailed description of which of these three cases any given concept class H satisfies, by specifying complexity measures to distinguish the cases. We begin with the following definition, which is illustrated in Figure 2. Henceforth we define the prefix y_{≤k} := (y_1, . . . , y_k) for any sequence y = (y_1, y_2, . . .).

Definition 1.7. A Littlestone tree for H is a complete binary tree of depth d ≤ ∞ whose internal nodes are labelled by points of X, and whose two edges connecting a node to its children are labelled 0 and 1, such that every finite path emanating from the root is consistent with a concept h ∈ H.

More precisely, a Littlestone tree is a collection

{x_u : 0 ≤ k < d, u ∈ {0, 1}^k} ⊆ X

such that for every y ∈ {0, 1}^d and n < d, there exists h ∈ H so that h(x_{y≤k}) = y_{k+1} for 0 ≤ k ≤ n. We say H has an infinite Littlestone tree if there is a Littlestone tree for H of depth d = ∞.
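For finite classes on a finite domain, the Littlestone dimension associated with these trees (discussed below) can be computed by the textbook recursion over label splits. The sketch below is illustrative only, and the encoding of classifiers as label tuples is an assumption of the example, not part of the paper:

```python
from itertools import product

def ldim(H, X):
    """Littlestone dimension of a finite class H (each h encoded as a
    tuple of 0/1 labels indexed by the points of X), via the recursion
    Ldim(H) = max over x of 1 + min(Ldim(H_x0), Ldim(H_x1)), taken over
    points x on which both label-restrictions of H are nonempty."""
    if len(H) <= 1:
        return 0 if H else -1
    best = 0
    for x in X:
        H0 = frozenset(h for h in H if h[x] == 0)
        H1 = frozenset(h for h in H if h[x] == 1)
        if H0 and H1:  # x is a valid internal node of a Littlestone tree
            best = max(best, 1 + min(ldim(H0, X), ldim(H1, X)))
    return best

# All 0/1 functions on 3 points have Littlestone dimension 3 ...
print(ldim(frozenset(product([0, 1], repeat=3)), range(3)))   # 3
# ... while thresholds 1[x >= t] on the same 3 points give 2.
thresholds = frozenset(tuple(int(x >= t) for x in range(3)) for t in range(4))
print(ldim(thresholds, range(3)))   # 2
```

The threshold value 2 matches the known fact that a chain of N thresholds has Littlestone dimension ⌊log₂ N⌋, even though its VC dimension is 1.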
2. The restriction |H| ≥ 3 rules out certain trivial cases: if |H| = 1 or if H = {h, 1 − h}, then er(ĥ_n) = 0 is trivially achievable for all n. If |H| = 2 but H ≠ {h, 1 − h}, then H is learnable with optimal rate e^{−n} by Example 1.1.

Figure 3:
A VCL tree of depth 3. Every branch is consistent with a concept h ∈ H. This is illustrated here for one of the branches. Due to lack of space, not all external edges are drawn.

The above notion is closely related to the
Littlestone dimension, a fundamentally important quantity in online learning. A concept class H has Littlestone dimension d if it has a Littlestone tree of depth d but not of depth d + 1. When this is the case, classical online learning theory yields a learning algorithm that makes at most d mistakes in classifying any adversarial (as opposed to random) realizable sequence of examples. Along the way to our main results, we will extend the theory of online learning to the following setting: we show in Section 3.1 that the nonexistence of an infinite Littlestone tree characterizes the existence of an algorithm that guarantees a finite (but not necessarily uniformly bounded) number of mistakes for every realizable sequence of examples. Let us emphasize that having an infinite Littlestone tree is not the same as having an unbounded Littlestone dimension: the latter can happen due to the existence of finite Littlestone trees of arbitrarily large depth, which does not imply the existence of any single tree of infinite depth.

Next we introduce a new type of complexity structure, which we term a VC-Littlestone tree. It represents a combination of the structures underlying Littlestone dimension and VC dimension. Though the definition may appear a bit complicated, the intuition is quite simple (see Figure 3).
Definition 1.8. A VCL tree for H of depth d ≤ ∞ is a collection

{x_u ∈ X^{k+1} : 0 ≤ k < d, u ∈ {0, 1}^1 × {0, 1}^2 × · · · × {0, 1}^k}

such that for every n < d and y ∈ {0, 1}^1 × · · · × {0, 1}^{n+1}, there exists a concept h ∈ H so that h(x^i_{y≤k}) = y^i_{k+1} for all 0 ≤ i ≤ k and 0 ≤ k ≤ n, where we denote

y_{≤k} = (y_1, (y^0_2, y^1_2), . . . , (y^0_k, . . . , y^{k−1}_k)),   x_{y≤k} = (x^0_{y≤k}, . . . , x^k_{y≤k}).

We say that H has an infinite VCL tree if it has a VCL tree of depth d = ∞.

A VCL tree resembles a Littlestone tree, except that each node in a VCL tree is labelled by a sequence of k points, where k is the depth of the node (in contrast, every node in a Littlestone tree is labelled by a single point). The branching factor at each node at depth k of a VCL tree is thus 2^k, rather than 2 as in a Littlestone tree. In the language of Vapnik-Chervonenkis theory, this means that along each path in the tree, we encounter shattered sets of size increasing with depth.

With these definitions in hand, we can state our second main result: a complete characterization of the optimal rate achievable for any given concept class H.

Theorem 1.9.
For every concept class H with |H| ≥ 3, the following hold:

• If H does not have an infinite Littlestone tree, then H is learnable with optimal rate e^{−n}.

• If H has an infinite Littlestone tree but does not have an infinite VCL tree, then H is learnable with optimal rate 1/n.

• If H has an infinite VCL tree, then H requires arbitrarily slow rates.

In particular, since Theorem 1.6 follows immediately from Theorem 1.9, the focus of this work will be to prove Theorem 1.9. The proof of this theorem, and many related results, are presented in the remainder of this paper.
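The middle case of the trichotomy can be observed empirically for the threshold class of Example 1.2. In the toy simulation below, the uniform marginal, the target threshold, and the particular ERM rule are all hypothetical choices made for illustration; the measured learning curve decays roughly like 1/n:

```python
import random

random.seed(0)

TARGET = 0.5  # true threshold: label y = 1[x >= 0.5], x uniform on [0, 1]

def erm_threshold(sample):
    # One ERM rule for thresholds: place t at the smallest positive
    # example (any consistent t would do in the realizable setting).
    positives = [x for x, y in sample if y == 1]
    return min(positives) if positives else 1.0

def avg_error(n, trials=500):
    # For t >= TARGET, er(h_t) = P(x in [TARGET, t)) = t - TARGET.
    total = 0.0
    for _ in range(trials):
        xs = [random.random() for _ in range(n)]
        sample = [(x, int(x >= TARGET)) for x in xs]
        total += erm_threshold(sample) - TARGET
    return total / trials

for n in [10, 20, 40, 80]:
    print(n, avg_error(n))  # roughly halves as n doubles
```

For this fixed P the expected error is of order 1/n, and (consistently with Example 1.2) no algorithm can do asymptotically better for all realizable distributions over thresholds.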
We next discuss some technical aspects in the derivation of the trichotomy. We also highlight key differences with the dichotomy of PAC learning theory.
In the uniform setting, the fact that every VC class is PAC learnable is witnessed by any algorithm that outputs a concept h ∈ H that is consistent with the input sample. This is known in the literature as the empirical risk minimization (ERM) principle and follows from the celebrated uniform convergence theorem of Vapnik and Chervonenkis (1971). Moreover, any ERM algorithm achieves the optimal uniform learning rate, up to lower order factors.

In contrast, in the universal setting one has to carefully design the algorithms that achieve the optimal rates. In particular, here the optimal rates are not always achieved by general ERM methods: for example, there are classes where exponential rates are achievable, but where there exist ERM learners with arbitrarily slow rates (see Example 2.6 below). The learning algorithms we propose below are novel in the literature: they are based on the theory of infinite (Gale-Stewart) games, whose connection with learning theory appears to be new in this paper.

As was anticipated in the previous section, a basic building block of our learning algorithms is the solution of analogous problems in adversarial online learning. For example, as a first step towards a statistical learning algorithm that achieves exponential rates, we extend the mistake bound model of Littlestone (1988) to scenarios where it is possible to guarantee a finite number of mistakes for each realizable sequence, but without an a priori bound on the number of mistakes. We show this is possible precisely when H has no infinite Littlestone tree, in which case the resulting online learning algorithm is defined by the winning strategy of an associated Gale-Stewart game.

Unfortunately, while online learning algorithms may be applied directly to random training data, this does not in itself suffice to ensure good learning rates.
The problem is that, although the online learning algorithm is guaranteed to make no mistakes after a finite number of rounds, in the statistical context this number of rounds is a random variable for which we have no control on the variance or tail behavior. We must therefore introduce additional steps to convert such online learning algorithms into statistical learning algorithms. In the case of exponential rates, this will be done by applying the online learning algorithm to several different batches of training examples, which must then be carefully aggregated to yield a classifier that achieves an exponential rate.

The case of linear rates presents additional complications. In this setting, the corresponding online learning algorithm does not eventually stop making mistakes: it is only guaranteed to eventually rule out a finite pattern of labels (which is feasible precisely when H has no infinite VCL tree). Once we have learned to rule out one pattern of labels for every data sequence of length k, the situation becomes essentially analogous to that of a VC class of dimension k − 1. In particular, we can then apply the one-inclusion graph predictor of Haussler, Littlestone, and Warmuth (1994) to classify subsequent data points with a linear rate. When applied to random data, however, both the time it takes for the online algorithm to learn to rule out a pattern, and the length k of that pattern, are random. We must therefore again apply this technique to several different batches of training examples and combine the resulting classifiers with aggregation methods to obtain a statistical learning algorithm that achieves a linear rate.

The proofs of our lower bounds are also significantly more involved than those in PAC learning theory. In contrast to the uniform setting, we are required to produce a single data distribution P for which the given learning algorithm has the claimed lower bound for infinitely many n. To this end, we will apply the probabilistic method by randomizing over both the choice of target labellings for the space, and the marginal distribution on X, coupling these two components of P.

There is a serious technical issue that arises in our theory that gives rise to surprisingly interesting mathematical questions. In order to apply the winning strategies of Gale-Stewart games to random data, we must ensure such strategies are measurable: if this is not the case, our theory may fail spectacularly (see Appendix C). However, nothing appears to be known in the literature about the measurability of Gale-Stewart strategies in nontrivial settings.

That measurability issues arise in learning theory is not surprising, of course; this is also the case in classical PAC learning (Blumer, Ehrenfeucht, Haussler, and Warmuth, 1989; Pestov, 2011). Our basic measurability assumption (Definition 3.3) is also the standard assumption made in this setting (Dudley, 2014).
It turns out, however, that measurability issues in classical learning theory are essentially benign: the only issue that arises there is the measurability of the supremum of the empirical process over H. This can be trivially verified in most practical situations without the need for an abstract theory: for example, measurability of the empirical process is trivial when H is countable, or when H can be pointwise approximated by a countable class. For these reasons, measurability issues in classical learning theory are often considered “a minor nuisance”. The situation in this paper is completely different: it is entirely unclear a priori whether Gale-Stewart strategies are measurable even in apparently trivial cases, such as when H is countable.

We will prove the existence of measurable strategies for a general class of Gale-Stewart games that includes all the ones encountered in this paper. The solution of this problem exploits an interplay between the mathematical and algorithmic aspects of the problem. To construct a measurable strategy, we will explicitly define a strategy by means of a kind of greedy algorithm that aims to minimize in each step a value function that takes values in the ordinal numbers. This construction gives rise to unexpected new notions for learning theory: for example, we will show that the complexity of online learning is characterized by an ordinal notion of Littlestone dimension, which agrees with the classical notion when it is finite. To conclude the proof of measurability, we combine these insights with a deep result of descriptive set theory (the Kunen-Martin theorem) which shows that the Littlestone dimension of a measurable class H is always a countable ordinal.

To conclude the introduction, we briefly review prior work on the subject of universal learning rates.
An extreme notion of learnability in the universal setting is universal consistency: a learning algorithm is universally consistent if E[er(ĥ_n)] → inf_h er(h) for every distribution P. The first proof that universally consistent learning is possible was provided by Stone (1977), using local average estimators, such as those based on k-nearest neighbor predictors, kernel rules, and histogram rules; see (Devroye, Györfi, and Lugosi, 1996) for a thorough discussion of such results. One can also establish universal consistency of learning rules via the technique of structural risk minimization from Vapnik and Chervonenkis (1974). The most general results on universal consistency were recently established by Hanneke (2017) and Hanneke, Kontorovich, Sabato, and Weiss (2019), who proved the existence of universally consistent learning algorithms in any separable metric space. In fact, Hanneke, Kontorovich, Sabato, and Weiss (2019) establish this for even more general spaces, called essentially separable, and prove that the latter property is actually necessary for universal consistency to be possible. An immediate implication of their result is that in such spaces X, choosing H to be the set of all measurable functions, there exists a learning algorithm with E[er(ĥ_n)] → 0 for every realizable P (cf. Example 1.3). In particular, since we assume in this paper that X is Polish (i.e., separably metrizable), this result holds in our setting.

While these results establish that it is always possible to have E[er(ĥ_n)] → 0 for every realizable P, there is a so-called no free lunch theorem showing that it is not generally possible to bound the rate of convergence: that is, the set H of all measurable functions requires arbitrarily slow rates (Devroye, Györfi, and Lugosi, 1996). The proof of this result also extends to more general concept classes: the only property of H that was used in the proof is that it finitely shatters some countably infinite subset of X, that is, there exists X′ = {x_1, x_2, . . .} ⊆ X such that, for every n ∈ N and y_1, . . . , y_n ∈ {0, 1}, there is h ∈ H with h(x_i) = y_i for every i ≤ n. It is natural to wonder whether the existence of such a countable finitely shattered set X′ is also necessary for H to require arbitrarily slow rates. Our main result settles this question in the negative. Indeed, Theorem 1.9 states that the existence of an infinite VCL tree is both necessary and sufficient for a concept class H to require arbitrarily slow rates; but it is possible for a class H to have an infinite VCL tree while it does not finitely shatter any countable set X′ (see Example 2.8 below).

The distinction between exponential and linear rates has been studied by Schuurmans (1997) in some special cases. Specifically, Schuurmans (1997) studied classes H that are concept chains, meaning that every h, h′ ∈ H have either h ≤ h′ everywhere or h′ ≤ h everywhere. For instance, threshold classifiers on the real line (Example 1.2) are a simple example of a concept chain. Since any concept chain H must have VC dimension at most 1, the optimal rates can never be slower than linear (Haussler, Littlestone, and Warmuth, 1994). However, Schuurmans (1997) found that some concept chains are universally learnable at an exponential rate, and gave a precise characterization of when this is the case. Specifically, he established that a concept chain H is learnable at an exponential rate if and only if H is nowhere dense, meaning that there is no infinite subset H′ ⊆ H such that, for every distinct h_1, h_2 ∈ H′ with h_1 ≤ h_2 everywhere, there exists h_3 ∈ H′ \ {h_1, h_2} with h_1 ≤ h_3 ≤ h_2 everywhere.
He also showed that concept chains H failing this property (i.e., that are somewhere dense) are not learnable at rate faster than n^{-(1+ε)} (for any ε > 0). It is not difficult to see that for concept chain classes, the property of being somewhere dense precisely corresponds to the property of having an infinite Littlestone tree, where the above set H′ corresponds to the set of classifiers involved in the definition of the infinite Littlestone tree. Theorem 1.9 therefore recovers the result of Schuurmans (1997) as a very special case, and sharpens his n^{-(1+ε)} general lower bound to a strict linear rate n^{-1}.

Schuurmans (1997) also posed the question of whether his analysis can be extended beyond concept chains: that is, whether there is a general characterization of which classes H are learnable at an exponential rate, versus which classes are not learnable at faster than a linear rate. This question is completely settled by the main results of this paper.

1.6.3 Classes with matching universal and uniform rates

Antos and Lugosi (1998) showed that there exist concept classes for which no improvement on the PAC learning rate is possible in the universal setting. More precisely, they showed that, for any d ∈ ℕ, there exists a concept class H of VC dimension d such that, for any learning algorithm ĥ_n, there exists a realizable distribution P for which E[er(ĥ_n)] ≥ c d/n for infinitely many n, where c is a positive numerical constant. This shows that universal learning rates for some classes tightly match their minimax rates up to a numerical constant factor.

Universal learning rates have also been considered in the context of active learning, under the names true sample complexity or unverifiable sample complexity (Hanneke, 2009, 2012; Balcan, Hanneke, and Vaughan, 2010; Yang and Hanneke, 2013). Active learning is a variant of supervised learning, where the learning algorithm observes only the sequence X_1, X_2, ... of unlabeled examples, and may select which examples X_i to query (which reveals their labels Y_i); this happens sequentially, so that the learner observes the response to a query before selecting its next query point. In this setting, one is interested in characterizing the rate of convergence of E[er(ĥ_n)], where n is the number of queries (i.e., the number of labels observed) as opposed to the sample size. Hanneke (2012) showed that for any VC class H, there is an active learning algorithm ĥ_n such that, for every realizable distribution P, E[er(ĥ_n)] = o(n^{-1}). Note that such a result is certainly not achievable by passive learning algorithms (i.e., the type of learning algorithms discussed in the present work), given the results of Schuurmans (1997) and Antos and Lugosi (1998). The latter also follows from the results of this paper by Example 2.2 below.

Denote by RE(h) the family of distributions P such that er(h) = 0 for a given classifier h ∈ H. Benedek and Itai (1994) considered a partial relaxation of the PAC model, called nonuniform learning, in which the learning rate may depend on h ∈ H but is still uniform over P ∈ RE(h). This setting is intermediate between the PAC setting (where the rate may depend only on n) and the universal learning setting (where the rate may depend fully on P). A concept class H is said to be learnable in the nonuniform learning setting if there exists a learning algorithm ĥ_n such that sup_{P ∈ RE(h)} E[er(ĥ_n)] → 0 as n → ∞ for every h ∈ H. Benedek and Itai (1994) proved that a concept class H is learnable in the nonuniform learning model if and only if H is a countable union of VC classes. In Example 2.7 below, we show that there exist classes H that are universally learnable, even at an exponential rate, but which are not learnable in the nonuniform learning setting. It is also easy to observe that there exist classes H that are countable unions of VC classes (hence nonuniformly learnable) which have an infinite VCL tree (and thus require arbitrarily slow universal learning rates). The universal and nonuniform learning models are therefore incomparable.
2. Examples
In Section 1.3, we introduced three basic examples that illustrate the three possible universal learning rates. In this section we provide further examples. The main aim of this section is to illustrate important distinctions with the uniform setting and other basic concepts in learning theory, which are depicted schematically in Figure 4.
Figure 4:
A Venn diagram depicting the trichotomy and its relation with uniform and universal learnability. While the focus here is on statistical learning, note that this diagram also captures the distinction between uniform and universal online learning; see Section 3.1.
We begin by giving four examples that illustrate that the classical PAC learning model (which is characterized by finite VC dimension) is not comparable to the universal learning model.
Example 2.1 (VC with exponential rate). Consider the class H ⊆ {0,1}^ℕ of all threshold functions h_t(x) = 1[x ≥ t], where t ∈ ℕ. This is a VC class (its VC dimension is 1), which is learnable at an exponential rate (it does not have an infinite Littlestone tree). Note, however, that this class has unbounded Littlestone dimension (it shatters Littlestone trees of arbitrary finite depths), so that it does not admit an online learning algorithm that makes a uniformly bounded number of mistakes.

Example 2.2 (VC with linear rate). Consider the class H ⊆ {0,1}^ℝ of all threshold functions h_t(x) = 1[x ≥ t], where t ∈ ℝ. This is a VC class (its VC dimension is 1) that is not learnable at an exponential rate (it has an infinite Littlestone tree). Thus the optimal rate is linear.

Example 2.3 (Exponential rate but not VC). Let X = ⋃_k X_k be the disjoint union of finite sets with |X_k| = k. For each k, let H_k = {1_S : S ⊆ X_k}, and consider the concept class H = ⋃_k H_k. This class has unbounded VC dimension, yet is universally learnable at an exponential rate. To establish the latter, it suffices to prove that H does not have an infinite Littlestone tree. Indeed, once we fix any root label x ∈ X_k of a Littlestone tree, only h ∈ H_k can satisfy h(x) = 1, and so the hypotheses consistent with the subtree corresponding to h(x) = 1 form a finite class. This subtree can therefore have only finitely many leaves, contradicting the existence of an infinite Littlestone tree.

Example 2.4 (Linear rate but not VC). Consider the disjoint union of the classes of Examples 2.2 and 2.3: that is, X is the disjoint union of ℝ and finite sets X_k with |X_k| = k, and H is the union of the class of all threshold functions on ℝ and the classes H_k = {1_S : S ⊆ X_k}. This class has unbounded VC dimension, yet is universally learnable at a linear rate. To establish the latter, it suffices to note that H has an infinite Littlestone tree as in Example 2.2, but H cannot have an infinite VCL tree. Indeed, once we fix any root label x ∈ X, the class {h ∈ H : h(x) = 1} has finite VC dimension, and thus the corresponding subtree of the VCL tree must be finite.

2.2 Universal learning algorithms versus ERM

The aim of the next two examples is to shed some light on the type of algorithms that can give rise to optimal universal learning rates. Recall that in the PAC model, a concept class is learnable if and only if it can be learned by any ERM (empirical risk minimization) algorithm.
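As a quick sanity check on the VC claims in Examples 2.1–2.4, finite restrictions of these classes can be tested by brute force. The sketch below is ours, not from the paper; `vc_dimension` and the sample classes are purely illustrative.

```python
from itertools import combinations

def vc_dimension(concepts, n_points):
    """Brute-force VC dimension of a finite concept class over the domain
    {0, ..., n_points - 1}.  Each concept is a tuple of labels, with
    concept[x] its value on point x.  Exponential time; illustration only."""
    d = 0
    for size in range(1, n_points + 1):
        shattered = False
        for S in combinations(range(n_points), size):
            patterns = {tuple(h[x] for x in S) for h in concepts}
            if len(patterns) == 2 ** size:   # S is shattered by the class
                shattered = True
                break
        if not shattered:
            return d                         # no larger set can be shattered
        d = size
    return d

# thresholds 1[x >= t] restricted to {0,...,4} (cf. Examples 2.1 and 2.2)
thresholds = [tuple(int(x >= t) for x in range(5)) for t in range(6)]
# the powerset class on a 3-point block X_3 (cf. Example 2.3)
powerset = [tuple(int(x in S) for x in range(3))
            for r in range(4) for S in map(set, combinations(range(3), r))]
```

As expected, the threshold restriction has VC dimension 1 while the powerset block has VC dimension equal to its size, which is why the union in Example 2.3 has unbounded VC dimension.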
The following examples will show that the ERM principle cannot explain the achievable universal learning rates; the algorithms developed in this paper are thus necessarily of a different nature.

An ERM algorithm is any learning rule that outputs a concept in H that minimizes the empirical error. There may in fact be many such hypotheses, and thus there are many inequivalent ERM algorithms. Learnability by means of a general ERM algorithm is equivalent to the Glivenko-Cantelli property: that is, that the empirical errors of all h ∈ H converge simultaneously to the corresponding population errors as n → ∞. The Glivenko-Cantelli property has a uniform variant, in which the convergence rate is uniform over all data distributions P; this property is equivalent to PAC learnability and is characterized by the VC dimension (Vapnik and Chervonenkis, 1971). It also has a universal variant, where the convergence holds for every P but with a distribution-dependent rate; the latter is equivalent to the universal consistency of a general ERM algorithm. A combinatorial characterization of the universal Glivenko-Cantelli property is given by van Handel (2013).

The following example shows that even if a concept class is universally learnable by a general ERM algorithm, this need not yield any control on the learning rate. This is in contrast to the PAC setting, where learnability by means of ERM always implies a linear learning rate.

Example 2.5 (Arbitrarily slow rates but learnable by any ERM). Let X = ℕ and let H be the class of all classifiers on X. This class has an infinite VCL tree and thus requires arbitrarily slow rates; but H is a universal Glivenko-Cantelli class and thus any ERM algorithm is universally consistent.

In contrast, the next example shows that there are scenarios where extremely fast universal learning is achievable, but where a general ERM algorithm can give rise to arbitrarily slow rates.

Example 2.6 (Exponential rate achievable but general ERM arbitrarily slow).
Let X = ⋃_{i∈ℕ} X_i be the disjoint union of finite sets with |X_i| = 2^i. For each i ∈ ℕ, let H_i = {1_I : I ⊆ X_i, |I| ≥ 2^{i−1}}, and consider the concept class H = ⋃_{i∈ℕ} H_i. It follows exactly as in Example 2.3 that H has no infinite Littlestone tree, so that it is universally learnable at an exponential rate.

We claim there exists, for any rate function R(n) → 0, an ERM algorithm that achieves rate slower than R. In the following, we fix any such R, as well as strictly increasing sequences {n_t} and {i_t} satisfying the following: letting p_t = 2^{i_t−2}/n_t, it holds that p_t is decreasing, Σ_{t=1}^∞ p_t ≤ 1, and p_t ≥ 4R(n_t). The reader may verify that such sequences can be constructed by induction on t.

Now consider any ERM with the following property: if the input data (X_1, Y_1), ..., (X_n, Y_n) is such that Y_i = 0 for all i, then the algorithm outputs ĥ_n ∈ H_{i_{T_n}} with

T_n = min{t : there exists h ∈ H_{i_t} such that h(X_1) = ··· = h(X_n) = 0}.

We claim that such an ERM performs poorly on the data distribution P defined by

P{(x, 0)} = 2^{−i_t} p_t for all x ∈ X_{i_t}, t ∈ ℕ,

where we set P{(x′, 0)} = 1 − Σ_{t=1}^∞ p_t for some arbitrary choice of x′ ∉ ⋃_{t∈ℕ} X_{i_t}. Note that all labels equal 0 P-almost surely, so the displayed property of the ERM applies, and that P is realizable, as inf_i er(h_i) ≤ inf_i P{(x, y) : x ∈ X_i} = 0 for any choice of h_i ∈ H_i.

It remains to show that E[er(ĥ_n)] ≥ R(n) for infinitely many n. To this end, note that the expected number of indices j ≤ n_t with X_j ∈ X_{i_t} is n_t p_t = 2^{i_t−2}, so by Markov's inequality, with probability at least 1/2 at most 2^{i_t−1} of the points X_1, ..., X_{n_t} lie in X_{i_t}. On this event, at least 2^{i_t−1} points of X_{i_t} are unobserved, so we must have T_{n_t} ≤ t; as ĥ_{n_t} = 1_I for some I ⊆ X_{i_{T_{n_t}}} with |I| ≥ 2^{i_{T_{n_t}}−1}, and p_t is decreasing,

er(ĥ_{n_t}) ≥ P{(x, 0) : x ∈ I} ≥ (1/2) p_{T_{n_t}} ≥ (1/2) p_t ≥ 2R(n_t).

Thus we have shown that E[er(ĥ_{n_t})] ≥ R(n_t) for all t ∈ ℕ.

2.3 Universal learning versus other learning models

The nonuniform learning model of Benedek and Itai (1994) is intermediate between universal and PAC learning; see Section 1.6.5. Our next example shows that a concept class may not even be learnable in the nonuniform sense, while exhibiting the fastest rate of universal learning.
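Returning to the induction left to the reader in Example 2.6: one concrete way to build the sequences {n_t} and {i_t} is to grow n_t by doubling until the constraints leave room, and then take i_t as small as possible. The sketch below is ours, using the requirement p_t ≥ 4R(n_t) from the error bound and an illustrative rate R(n) = 1/⌈log_2 n⌉ chosen so that all arithmetic stays exact; all names are our own.

```python
from fractions import Fraction

def build_sequences(R, T, i0=2):
    """Inductively construct (n_t, i_t, p_t) for t = 1..T with
    p_t = 2^(i_t - 2)/n_t strictly decreasing, sum p_t <= 1, and
    p_t >= 4 R(n_t), as in Example 2.6.  R maps an int n to a Fraction
    tending to zero.  A sketch of one possible induction, not the paper's."""
    seqs = []
    n_prev, i_prev, p_prev = 1, i0, Fraction(1)
    for t in range(1, T + 1):
        # cap p_t at half the previous value and at 2^-(t+1) for summability
        bound = min(p_prev, Fraction(1, 2 ** t)) / 2
        n = n_prev + 1
        # both conditions are monotone in n, so doubling finds a valid n
        while not (8 * R(n) <= bound and Fraction(2 ** (i_prev - 1), n) <= bound):
            n *= 2
        # smallest admissible i_t guarantees p_t >= 4 R(n_t) and p_t <= bound
        i = i_prev + 1
        while 2 ** (i - 2) < 4 * n * R(n):
            i += 1
        p = Fraction(2 ** (i - 2), n)
        seqs.append((n, i, p))
        n_prev, i_prev, p_prev = n, i, p
    return seqs

# illustrative rate function: R(n) = 1 / ceil(log2 n), exact as a Fraction
R = lambda n: Fraction(1, max(1, n.bit_length()))
seqs = build_sequences(R, 4)
```

The values of n_t grow very quickly (roughly exponentially in 2^t for this R), which reflects how slowly the lower bound E[er(ĥ_{n_t})] ≥ R(n_t) is allowed to decay.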
Example 2.7 (Exponential rate but not nonuniformly learnable). The following class can be learned at an exponential rate, yet it cannot be presented as a countable union of VC classes (and hence it is not learnable in the nonuniform setting by Benedek and Itai, 1994):

X = {S ⊂ ℝ : |S| < ∞},  H = {h_y : y ∈ ℝ},  where h_y(S) = 1[y ∈ S].

We first claim that H has no infinite Littlestone tree: indeed, once we fix a root label S ∈ X of a Littlestone tree, the class {h ∈ H : h(S) = 1} is finite, so the corresponding subtree must be finite. Thus H is universally learnable at an exponential rate. On the other hand, suppose that H were a countable union of VC classes. Then one element of this countable union must contain infinitely many hypotheses (as ℝ is uncountable). This is a contradiction, as any infinite subset {h_y : y ∈ I} ⊆ H with I ⊆ ℝ, |I| = ∞ has unbounded VC dimension (as its dual class is the class of all finite subsets of I).

Our next example is concerned with the characterization of arbitrarily slow rates. As we discussed in Section 1.6.1, a no free lunch theorem of Devroye, Györfi, and Lugosi (1996) shows that a sufficient condition for a class H to require arbitrarily slow rates is that there exists an infinite set X′ ⊆ X finitely shattered by H: that is, there exists X′ = {x_1, x_2, ...} ⊆ X such that, for every n ∈ ℕ and y_1, ..., y_n ∈ {0,1}, there is h ∈ H with h(x_i) = y_i for every i ≤ n. Since our Theorem 1.9 indicates that the existence of an infinite VCL tree is both sufficient and necessary, it is natural to ask how these two conditions relate to each other. It is easy to see that the existence of a finitely shattered infinite set X′ implies the existence of an infinite VCL tree. However, the following example shows that the opposite is not true: that is, there exist classes H with an infinite VCL tree that do not finitely shatter an infinite set X′.
Thus, these conditions are not equivalent, and our Theorem 1.9 provides a strictly weaker condition sufficient for H to require arbitrarily slow rates.

Example 2.8 (No finitely shattered infinite set, but requires arbitrarily slow rates). Consider a countable space X that is itself structured into the nodes of a VCL tree: that is,

X = {x^i_u : k ∈ ℕ ∪ {0}, i ∈ {0, ..., k}, u ∈ {0,1}^1 × {0,1}^2 × ··· × {0,1}^k},

where each x^i_u is a distinct point. Then for each y = (y^0_1, (y^0_2, y^1_2), ..., (y^0_{k+1}, ..., y^k_{k+1}), ...) ∈ {0,1}^1 × {0,1}^2 × ···, define h_y such that every k ∈ ℕ ∪ {0} and i ∈ {0, ..., k} has h_y(x^i_{y≤k}) = y^i_{k+1}, where y≤k denotes the first k components of y, and every x ∈ X \ {x^i_{y≤k} : k ∈ ℕ ∪ {0}, i ∈ {0, ..., k}} has h_y(x) = 0. Then define

H = {h_y : y ∈ {0,1}^1 × {0,1}^2 × ···}.

By construction, this class H has an infinite VCL tree. However, any set S ⊂ X of size at least 2 which is shattered by H must be contained within a single node of the tree. In particular, since any countable set X′ = {x′_1, x′_2, ...} ⊆ X necessarily contains points x′_i, x′_j lying in different nodes of the tree, the set {x′_1, ..., x′_{max{i,j}}} is not shattered by H, so that X′ is not finitely shattered by H.

The previous examples were designed to illustrate the key features of the results of this paper in comparison with other learning models; however, these examples may be viewed as somewhat artificial. To conclude this section, we give two examples of "natural" geometric concept classes that are universally learnable at an exponential rate. This suggests that our theory has direct implications for learning scenarios of the kind that may arise in applications.

Example 2.9 (Nonlinear manifolds). Various practical learning problems are naturally expressed by concepts that indicate whether the data lie on a manifold. The following construction provides one simple way to model classes of nonlinear manifolds.
Let the domain X be any Polish space, and fix a measurable function g : X → ℝ^d with d < ∞. For a given k < ∞, consider the concept class

H = {1[Ag = 0] : A ∈ ℝ^{k×d}}.

The coordinate functions g_1, ..., g_d describe the nonlinear features of the class. For example, if X = ℂ^n and the g_j are polynomials, this model can describe any class of affine algebraic varieties.

We claim that H is universally learnable at an exponential rate. It suffices to show that, in fact, H has finite Littlestone dimension. To see why, fix any Littlestone tree, and consider the branch along which every label is 1; for simplicity, we denote the points along this branch in this example as x_0, x_1, x_2, .... Define

V_j = {A ∈ ℝ^{k×d} : Ag(x_i) = 0 for i = 0, ..., j}.

Each V_j is a finite-dimensional linear space. Now note that if V_j = V_{j−1}, then every h ∈ H such that h(x_i) = 1 for i = 0, ..., j−1 also has h(x_j) = 1; but this is impossible, as the definition of a Littlestone tree requires the existence of h ∈ H such that h(x_i) = 1 for i = 0, ..., j−1 and h(x_j) = 0. Thus the dimension of V_j must decrease strictly in j, so the branch x_0, x_1, x_2, ... must be finite.

Example 2.10 (Positive halfspaces on ℕ^d). It is a classical fact that the class of halfspaces on ℝ^d has finite VC dimension, and it is easy to see that this class has an infinite Littlestone tree. Thus the PAC rate cannot be improved in this setting. The aim of this example is to show that the situation is quite different if one considers positive halfspaces on a lattice ℕ^d: such a class is universally learnable at an exponential rate. This may be viewed as an extension of Example 2.1, which illustrates that some geometric classes on discrete spaces can be universally learned at a much faster rate than geometric classes on continuous spaces (a phenomenon not captured by the PAC model). More precisely, let X = ℕ^d for some d ∈ ℕ, and let H be the class of positive halfspaces:

H = {1[w · x − b ≥ 0] : (w, b) ∈ (0, ∞)^{d+1}}.
We will argue that H is universally learnable at an exponential rate by constructing an explicit learning algorithm guaranteeing a finite number of mistakes on every realizable data sequence. As will be argued in Section 3 below, the existence of such an algorithm immediately implies that H does not have an infinite Littlestone tree. Moreover, we show in Section 4 that such an algorithm can be converted into a learning algorithm achieving exponential rates for all realizable distributions P.

Let S_n ∈ (X × {0,1})^n be any data set consistent with some h ∈ H. If every (x_i, y_i) ∈ S_n has y_i = 0, let ĥ_n(x) = 0 for all x ∈ X. Otherwise, let ĥ_n(x) = 1[x ∈ L({x_i : (x_i, 1) ∈ S_n})], where

L({z_1, ..., z_t}) = { z′ + Σ_{i≤t} α_i z_i : α_i ∈ [0,1], Σ_{i≤t} α_i = 1, z′ ∈ [0,∞)^d }

for any t ∈ ℕ and z_1, ..., z_t ∈ X. In words, L({z_1, ..., z_t}) is the smallest region containing the convex hull of z_1, ..., z_t for which the indicator of the region is non-decreasing in every dimension.

Now consider any sequence {(x_i, y_i)}_{i∈ℕ} in X × {0,1} such that for each n ∈ ℕ, letting S_n = {(x_i, y_i)}_{i=1}^n, there exists h*_n ∈ H with h*_n(x_i) = y_i for all i ≤ n. Since {x : h*_{n+1}(x) = 1} is convex, and h*_{n+1}(x) is non-decreasing in every dimension, we have ĥ_n ≤ h*_{n+1}. This implies that any n ∈ ℕ with ĥ_n(x_{n+1}) ≠ y_{n+1} must have y_{n+1} = 1 and ĥ_n(x_{n+1}) = 0. Therefore, by the definition of L(·), the following must hold for any n with ĥ_n(x_{n+1}) ≠ y_{n+1}: for every i ≤ n such that y_i = 1, there exists a coordinate 1 ≤ j ≤ d such that (x_{n+1})_j < (x_i)_j.

Now suppose, for the sake of obtaining a contradiction, that there is an increasing infinite sequence {n_t}_{t∈ℕ} such that ĥ_{n_t}(x_{n_t+1}) ≠ y_{n_t+1}, and consider a coloring of the infinite complete graph with vertices {x_{n_t+1}}_{t∈ℕ} where every edge {x_{n_t+1}, x_{n_{t′}+1}} with t < t′ is colored with a value min{j : (x_{n_{t′}+1})_j < (x_{n_t+1})_j}. Then the infinite Ramsey theorem implies that there exists an infinite monochromatic clique: that is, a value j ≤ d and an infinite subsequence {n_{t_i}} with (x_{n_{t_i}+1})_j strictly decreasing in i. This is a contradiction, since any strictly decreasing sequence (x_{n_{t_i}+1})_j with x_{n_{t_i}+1} ∈ X = ℕ^d can have length at most (x_{n_{t_1}+1})_j + 1, which is finite. Therefore, the learning algorithm ĥ_n makes at most a finite number of mistakes on any such sequence {(x_i, y_i)}_{i∈ℕ}. Let us note, however, that there can be no uniform bound on the number of mistakes (independent of the specific sequence {(x_i, y_i)}_{i∈ℕ}), since the Littlestone dimension of H is infinite.
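The algorithm of Example 2.10 can be simulated with a simplified region in place of L(·): the union of the coordinate-wise upper sets of the observed positive examples, which is contained in L(·) and hence in the target halfspace (as w > 0), and for which the same finite-mistake argument goes through (every mistake on a positive point is below each recorded positive in some coordinate). The sketch below is ours; class and variable names are illustrative.

```python
from itertools import product

def dominates(x, z):
    """True if x >= z coordinate-wise."""
    return all(xj >= zj for xj, zj in zip(x, z))

class MonotoneHalfspaceLearner:
    """Online learner for positive halfspaces on N^d.

    Predicts 1 iff the point dominates some previously recorded positive
    example; the recorded set grows only on mistakes, and each recorded
    point is coordinate-wise incomparable with the earlier ones -- the
    situation the Ramsey argument of Example 2.10 rules out infinitely
    often."""
    def __init__(self):
        self.positives = []
    def predict(self, x):
        return 1 if any(dominates(x, z) for z in self.positives) else 0
    def update(self, x, y):
        if y == 1 and self.predict(x) == 0:   # mistake on a positive point
            self.positives.append(x)

# demo on the positive halfspace x1 + x2 >= 3 over the grid {0,...,5}^2
learner = MonotoneHalfspaceLearner()
mistakes = 0
stream = list(product(range(6), repeat=2))
for x in stream:
    y = 1 if x[0] + x[1] >= 3 else 0
    if learner.predict(x) != y:
        mistakes += 1
    learner.update(x, y)
```

On this stream the learner errs exactly on the four minimal positive points (0,3), (1,2), (2,1), (3,0), and afterwards classifies every grid point correctly; it never predicts 1 on a negative point, since the upper set of any recorded positive lies inside the halfspace.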
3. The adversarial setting
Before we proceed to the main topic of this paper, we introduce a simpler adversarial analogue of our learning problem. The strategies that arise in this adversarial setting form a key ingredient of the statistical learning algorithms that will appear in our main results. At the same time, it motivates us to introduce a number of important concepts that play a central role in the sequel.
Let X be a set, and let the concept class H be a collection of indicator functions h : X → {0,1}. We consider an online learning problem defined as a game between the learner and an adversary. The game is played in rounds. In each round t ≥ 1:

• The adversary chooses a point x_t ∈ X.
• The learner predicts a label ŷ_t ∈ {0,1}.
• The adversary reveals the true label y_t = h(x_t) for some function h ∈ H that is consistent with the previous label assignments h(x_1) = y_1, ..., h(x_{t−1}) = y_{t−1}.

The learner makes a mistake in round t if ŷ_t ≠ y_t. The goal of the learner is to make as few mistakes as possible, and the goal of the adversary is to cause as many mistakes as possible. The adversary need not choose a target concept h ∈ H in advance, but must ensure that the sequence {(x_t, y_t)}_{t=1}^∞ is realizable by H in the sense that for all T ∈ ℕ there exists h ∈ H such that h(x_t) = y_t for all t ≤ T. That is, each prefix {(x_t, y_t)}_{t=1}^T must be consistent with some h ∈ H.

We say that the concept class H is online learnable if there is a strategy

ŷ_t = ŷ_t(x_1, y_1, ..., x_{t−1}, y_{t−1}, x_t)

that makes only finitely many mistakes, regardless of what realizable sequence {(x_t, y_t)}_{t=1}^∞ is presented by the adversary.

The above notion of learnability may be viewed as a universal analogue of the uniform mistake bound model of Littlestone (1988), which asks when there exists a strategy that is guaranteed to make at most d < ∞ mistakes on any input. Littlestone showed that this is the case if and only if H has no Littlestone tree of depth d + 1. Here we ask only that the strategy makes a finite number of mistakes on any input, without placing a uniform bound on the number of mistakes. The main result of this section shows that this property is fully characterized by the existence of infinite Littlestone trees. Let us recall that Littlestone trees were defined in Definition 1.7.
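For finite classes, the Littlestone dimension appearing in the uniform mistake bound model can be computed directly from its defining recursion. The brute-force sketch below is ours (exponential time, for illustration only; names are our own).

```python
def littlestone_dimension(concepts, n_points):
    """Littlestone dimension of a finite class over {0, ..., n_points - 1},
    via the recursion Ldim(H) = max over points x splitting H of
    1 + min(Ldim(H_{x=0}), Ldim(H_{x=1})), with Ldim(H) = 0 when H has at
    most one concept.  Each concept is a tuple of labels."""
    def ldim(H):
        if len(H) <= 1:
            return 0
        best = 0
        for x in range(n_points):
            H0 = frozenset(h for h in H if h[x] == 0)
            H1 = frozenset(h for h in H if h[x] == 1)
            if H0 and H1:                    # x is a valid root label
                best = max(best, 1 + min(ldim(H0), ldim(H1)))
        return best
    return ldim(frozenset(concepts))

# thresholds 1[x >= t] on {0,...,6}: VC dimension 1, yet Littlestone
# dimension 3, growing with the domain size (the restriction-to-a-finite-
# domain picture behind Example 2.1's unbounded Littlestone dimension)
thresholds = [tuple(int(x >= t) for x in range(7)) for t in range(8)]
# all 2^3 labelings of a 3-point set: a complete mistake tree of depth 3
all_functions = [tuple(int(b) for b in format(m, '03b')) for m in range(8)]
```

For the threshold restriction the value is ⌊log_2(number of thresholds)⌋, reflecting the binary-search structure of the optimal mistake tree.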
Theorem 3.1.
For any concept class H, we have the following dichotomy.

• If H does not have an infinite Littlestone tree, then there is a strategy for the learner that makes only finitely many mistakes against any adversary.
• If H has an infinite Littlestone tree, then there is a strategy for the adversary that forces any learner to make a mistake in every round.

In particular, H is online learnable if and only if it has no infinite Littlestone tree.

A proof of this theorem is given in the next section. The proof uses classical results from the theory of infinite games; see Appendix A.1 for a review of the relevant notions.
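To make the first half of the dichotomy concrete, here is a minimal mistake-driven learner (ours, not from the paper) for the threshold class H = {1[x ≥ t] : t ∈ ℕ} of Example 2.1, which has no infinite Littlestone tree: it predicts 1 only when forced by an example already known to be positive, and updates its state only when it errs.

```python
class ThresholdOnlineLearner:
    """Mistake-driven online learner for H = { 1[x >= t] : t in N } on X = N.

    Predict 1 only when some point already known to be positive lies at or
    below x (every consistent h in H must then label x with 1).  The state
    changes only on mistakes, and each mistake strictly decreases the
    smallest known positive point; since that point is a natural number, any
    realizable sequence produces finitely many mistakes -- though with no
    uniform bound, matching the unbounded Littlestone dimension of H."""
    def __init__(self):
        self.pos = None                      # smallest point known positive
    def predict(self, x):
        return 1 if (self.pos is not None and self.pos <= x) else 0
    def update(self, x, y):
        if self.predict(x) != y:             # on realizable data, here y = 1
            self.pos = x if self.pos is None else min(self.pos, x)

# adversarial demo, realizable for h_5(x) = 1[x >= 5]
learner = ThresholdOnlineLearner()
mistakes = 0
for x, y in [(10, 1), (7, 1), (5, 1), (3, 0), (6, 1), (5, 1)]:
    if learner.predict(x) != y:
        mistakes += 1
    learner.update(x, y)
```

On this sequence the recorded positive point decreases 10 → 7 → 5 across three mistakes, after which every prediction is correct; at most x + 1 mistakes can ever occur once a first positive point x is seen.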
Let us now view the online learning game from a different perspective that fits better into the framework of classical game theory. For x_1, ..., x_t ∈ X and y_1, ..., y_t ∈ {0,1}, consider the class

H_{x_1,y_1,...,x_t,y_t} := {h ∈ H : h(x_1) = y_1, ..., h(x_t) = y_t}

of hypotheses that are consistent with x_1, y_1, ..., x_t, y_t. An adversary who tries to maximize the number of mistakes the learner makes will choose a sequence of x_t, y_t with y_t ≠ ŷ_t for as many initial rounds in a row as possible. In other words, the adversary tries to keep H_{x_1,1−ŷ_1,...,x_t,1−ŷ_t} ≠ ∅ as long as possible. When this set would become empty (for every possible x_t), however, the only consistent choice of label is y_t = ŷ_t, so the learner makes no mistakes from that point onwards.

This motivates defining the following game G. There are two players: P_A and P_L. In each round τ:

• Player P_A chooses a point ξ_τ ∈ X and shows it to Player P_L.
• Then, Player P_L chooses a label η_τ ∈ {0,1}.

Player P_L wins the game in round τ if H_{ξ_1,η_1,...,ξ_τ,η_τ} = ∅. Player P_A wins the game if the game continues indefinitely. In other words, the set of winning sequences for P_L is

W = {(ξ, η) ∈ (X × {0,1})^∞ : H_{ξ_1,η_1,...,ξ_τ,η_τ} = ∅ for some 1 ≤ τ < ∞}.

This set of sequences W is finitely decidable in the sense that the membership of (ξ, η) in W is witnessed by a finite subsequence. Thus the above game is a Gale-Stewart game (cf. Appendix A.1). In particular, by Theorem A.1, exactly one of P_A and P_L has a winning strategy in this game.

The game G is intimately connected to the definition of Littlestone trees: an infinite Littlestone tree is nothing other than a winning strategy for P_A, expressed in a slightly different language.
Player P_A has a winning strategy in the Gale-Stewart game G if and only if H has an infinite Littlestone tree.

Proof
Suppose H has an infinite Littlestone tree, for which we adopt the notation of Definition 1.7. Define a strategy for P_A by ξ_τ(η_1, ..., η_{τ−1}) = x_{η_1,...,η_{τ−1}} (cf. Remark A.4). The definition of a Littlestone tree implies that H_{ξ_1,η_1,...,ξ_τ,η_τ} ≠ ∅ for every η ∈ {0,1}^∞ and τ < ∞; that is, this strategy is winning for P_A. Conversely, suppose P_A has a winning strategy, and define the infinite tree T = {x_u : 0 ≤ k < ∞, u ∈ {0,1}^k} by

x_{η_1,...,η_{τ−1}} := ξ_τ(η_1, ..., η_{τ−1}).

The tree T is an infinite Littlestone tree by the definition of a winning strategy for the game G.

We are now ready to prove Theorem 3.1.

Proof of Theorem 3.1
Assume H has an infinite Littlestone tree {x_u}. The adversary may play the following strategy: in round t, choose x_t = x_{y_1,...,y_{t−1}}, and after the learner reveals her prediction ŷ_t, choose y_t = 1 − ŷ_t. By the definition of a Littlestone tree, y_t is consistent with H regardless of the learner's prediction. This strategy for the adversary in the online learning problem forces any learner to make a mistake in every round.

Now suppose H has no infinite Littlestone tree. Then P_L has a winning strategy η_τ(ξ_1, ..., ξ_τ) in the Gale-Stewart game G (cf. Remark A.4). If we were to know a priori that the adversary always forces an error when possible, then the learner could use this strategy directly with x_t = ξ_t and ŷ_t = 1 − η_t to ensure she only makes finitely many mistakes. To extend this conclusion to an arbitrary adversary, we design our learning algorithm so that the Gale-Stewart game proceeds to the next round only when the learner makes a mistake. More precisely, we introduce the following learning algorithm.

• Initialize τ ← 1 and f(x) ← η_1(x).
• In every round t ≥ 1:
  - Predict ŷ_t = 1 − f(x_t).
  - If ŷ_t ≠ y_t, let ξ_τ ← x_t, f(x) ← η_{τ+1}(ξ_1, ..., ξ_τ, x), and τ ← τ + 1.

This algorithm can only make a finite number of mistakes against any adversary. Indeed, suppose that some adversary forces the learner to make an infinite number of mistakes at times t_1, t_2, .... By the definition of G, however, we would then have H_{x_{t_1},y_{t_1},...,x_{t_k},y_{t_k}} = ∅ for some k < ∞. This violates the rules of the online learning game, because the sequence {(x_t, y_t)}_{t=1}^{t_k} would not be consistent with H.

The learning algorithm from the previous section solves the adversarial online learning problem. It is also a basic ingredient in the algorithm that achieves exponential rates in the probabilistic setting (Section 4 below). However, in passing from the adversarial setting to the probabilistic setting, we encounter nontrivial difficulties. While the existence of winning strategies is guaranteed by the Gale-Stewart theorem, this result does not say anything about the complexity of these strategies. In particular, it is perfectly possible that the learning algorithm of the previous section is nonmeasurable, in which case its naive application in the probabilistic setting can readily yield nonsensical results (cf. Appendix C).

It is therefore essential to impose sufficient regularity assumptions so that the winning strategies in the Gale-Stewart game G are measurable. This issue proves to be surprisingly subtle: almost nothing appears to be known in the literature regarding the measurability of Gale-Stewart strategies. We therefore develop a rather general result of this kind, Theorem B.1 in Appendix B, that suffices for all the purposes of this paper.

Definition 3.3.
A concept class H of indicator functions h : X → {0,1} on a Polish space X is said to be measurable if there is a Polish space Θ and a Borel-measurable map h : Θ × X → {0,1} so that H = {h(θ, ·) : θ ∈ Θ}.

In other words, H is measurable when it can be parameterized in any reasonable way. This is the case for almost any H encountered in practice. The Borel isomorphism theorem (Cohn, 1980, Theorem 8.3.6) implies that we would obtain an identical definition if we required only that Θ is a Borel subset of a Polish space.

Remark 3.4. Definition 3.3 is well-known in the literature: this is the standard measurability assumption made in empirical process theory, where it is usually called the image admissible Suslin property, cf. (Dudley, 2014, Section 5.3).

Our basic measurability result is the following corollary of Theorem B.1.
Corollary 3.5.
Let X be Polish and H be measurable. Then the Gale-Stewart game G of the previous section has a universally measurable winning strategy. In particular, the learning algorithm of Theorem 3.1 is universally measurable.

Proof
The conclusion follows from Theorem B.1 once we verify that the set W of winning sequencesfor P L in G is coanalytic (see Appendix A.4 for the relevant terminology and basic properties ofPolish spaces and analytic sets). To this end, we write its complement as W c = { ( ξ , η ) ∈ ( X × { , } ) ∞ : H ξ ,η ,...,ξ τ ,η τ = ∅ for all τ < ∞} = \ ≤ τ< ∞ [ θ ∈ Θ \ ≤ t ≤ τ { ( ξ , η ) ∈ ( X × { , } ) ∞ : h ( θ, x t ) = η t } . The set { ( θ, ξ , η ) : h ( θ, ξ i ) = η i } is Borel by the measurability assumption. Moreover, both inter-sections in the above expression are countable, while the union corresponds to the projection of aBorel set. The set W c is therefore analytic.That a nontrivial measurability assumption is needed in the first place is not obvious: one mighthope that it suffices to simply require that every concept h ∈ H is measurable. Unfortunately, thisis not the case. In Appendix C, we describe a nonmeasurable concept class on X = [0 ,
1] such that each h ∈ H is the indicator of a countable set. In this example, the set W of winning sequences is nonmeasurable: thus one cannot even give meaning to the probability that the game is won when it is played with random data. In such a situation, the analysis in the following sections does not make sense. Thus Corollary 3.5, while technical, is essential for the theory developed in this paper. It is perhaps not surprising that some measurability issues arise in our setting, as this is already the case in classical PAC learning theory (Blumer, Ehrenfeucht, Haussler, and Warmuth, 1989; Pestov, 2011). Definition 3.3 is the standard assumption that is made in this setting (Dudley, 2014). However, the only issue that arises in the classical setting is the measurability of the supremum of the empirical process over H. This is essentially straightforward: for example, measurability is trivial when H is countable, or can be pointwise approximated by a countable class. The latter already captures many classes encountered in practice. For these reasons, measurability issues in classical learning theory are often considered “a minor nuisance”. The measurability problem for Gale-Stewart strategies is much more subtle, however, and cannot be taken for granted. For example, we do not know of a simpler proof of Theorem B.1 in the setting of Corollary 3.5 even when the class H is countable. Further discussion may be found in Appendix C. In its classical form, the Gale-Stewart theorem (Theorem A.1) is a purely existential statement: it states the existence of winning strategies. To actually implement learning algorithms from such strategies, however, one would need to explicitly describe them. Such an explicit description is constructed as part of the measurability proof of Theorem B.1 on the basis of a refined notion of dimension for concept classes that is of interest in its own right.
The aim of this section is to briefly introduce the relevant ideas in the context of the online learning problem; see the proof of Theorem B.1 for more details. (The content of this section is not used elsewhere in the text.) It is instructive to begin by recalling the classical online learning strategy (Littlestone, 1988). The Littlestone dimension of H is defined as the largest depth of a Littlestone tree for H (if H is empty, then its dimension is −1). If the Littlestone dimension d of H is finite, then there is a strategy for P_L in the game G that wins at the latest in round d + 1. This winning strategy is built using the following observation. Observation 3.6.
Assume that the Littlestone dimension d of H is finite and that H is nonempty. Then for every x ∈ X, there exists y ∈ {0, 1} such that the Littlestone dimension of H_{x,y} is strictly less than that of H. Proof
If both H_{x,0} and H_{x,1} have a Littlestone tree of depth d (say t_0, t_1, respectively), then H has a Littlestone tree of depth d + 1: take x as the root and attach t_0, t_1 as its subtrees. The winning strategy for P_L is now evident: as long as player P_L always chooses y_t so that the Littlestone dimension of H_{x_1,y_1,…,x_t,y_t} is smaller than that of H_{x_1,y_1,…,x_{t−1},y_{t−1}}, then P_L will win in at most d + 1 rounds. At first sight, it appears that this strategy does not make much sense in our setting. Though we assume that H has no infinite Littlestone tree, it may have finite Littlestone trees of arbitrarily large depth. In this case the classical Littlestone dimension is infinite, so a naive implementation of the above strategy fails. Nonetheless, the key idea behind the proof of Theorem B.1 is that an appropriate extension of Littlestone's strategy works in the general setting. The basic observation is that the notion “infinite Littlestone dimension” may be considerably refined: we can extend the classical notion to capture precisely “how infinite” the Littlestone dimension is. With this new definition in hand, the winning strategy for P_L will be exactly the same as in the case of finite Littlestone dimension. The Littlestone dimension may not just be a natural number, but rather an ordinal, which turns out to be precisely the correct way to measure the “number of steps to victory”. A brief introduction to ordinals and their role in game theory is given in Appendix A.2. Our extension of the Littlestone dimension uses the notion of rank, which assigns an ordinal to every finite Littlestone tree. The rank is defined by a partial order ≺: let us write t′ ≺ t if t′ is a Littlestone tree that extends t by one level, namely, t is obtained from t′ by removing its leaves. A Littlestone tree t is minimal if it cannot be extended to a Littlestone tree of larger depth. In this case, we say rank(t) = 0.
For non-minimal trees, we define rank(t) by transfinite recursion:

rank(t) = sup { rank(t′) + 1 : t′ ≺ t }.

If rank(t) = d is finite, then the largest Littlestone tree that extends t has d additional levels. The classical Littlestone dimension is d ∈ N if and only if rank(∅) = d. Rank is well-defined as long as H has no infinite Littlestone tree. The crucial point is that when H has no infinite tree, ≺ is well-founded (i.e., there are no infinite decreasing chains in ≺), so that every finite Littlestone tree t appears in the above recursion. For more details, see Appendix A.3.
The ordinal Littlestone dimension of H is defined as:

LD(H) := −1 if H is empty; Ω if H has an infinite Littlestone tree; rank(∅) otherwise.

When H has no infinite Littlestone tree, we can construct a winning strategy for P_L in the same manner as in the case of finite Littlestone dimension. An extension of Observation 3.6 states that for every x ∈ X, there exists y ∈ {0, 1} so that LD(H_{x,y}) < LD(H). The intuition behind this extension is the same as in the finite case, but its proof is more technical (cf. Proposition B.8). The strategy for P_L is now chosen so that LD(H_{x_1,y_1,…,x_t,y_t}) decreases in every round. This strategy ensures that P_L wins in a finite number of rounds, because ordinals do not admit an infinite decreasing chain. The idea that dimension can be an ordinal may appear a bit unusual. The meaning of this notion is quite intuitive, however, as is best illustrated by means of some simple examples. Recall that we have already shown above that when LD(H) < ω is finite (ω denotes the smallest infinite ordinal), the ordinal Littlestone dimension coincides with the classical Littlestone dimension.

Example 3.8 (Disjoint union of finite-dimensional classes). Partition X = N into disjoint intervals X_1, X_2, X_3, … with |X_k| = k. For each k, let H_k be the class of indicators of all subsets of X_k. Let H = ⋃_k H_k. We claim that LD(H) = ω. Indeed, as soon as we select a root vertex x ∈ X_k for a Littlestone tree, we can only grow the Littlestone tree for k − 1 more levels, so that rank({x}) = k − 1 for x ∈ X_k. By definition, rank(∅) = sup { rank({x}) + 1 : x ∈ X } = ω.

Example 3.9 (Thresholds on N). Let X = N and consider the class of thresholds H = { x ↦ 1_{x ≤ z} : z ∈ N }. As in the previous example, we claim that LD(H) = ω. Indeed, as soon as we select a root vertex x ∈ X for a Littlestone tree, we can grow the Littlestone tree for at most x − 1 more levels (there is no h ∈ H and distinct points y_1, …, y_x such that h(x) = 0 and h(y_1) = ··· = h(y_x) = 1). On the other hand, we can grow a Littlestone tree of depth of order log(x), by repeatedly choosing labels in each level that bisect the intervals between the labels chosen in the previous level. It follows that rank(∅) = sup { rank({x}) + 1 : x ∈ X } = ω.

Example 3.10 (Thresholds on Z). Let X = Z and consider the class of thresholds H = { x ↦ 1_{x ≤ z} : z ∈ Z }. In this case, LD(H) = ω + 1. As soon as we select a root vertex x ∈ X, the class H_{x,1} is essentially the same as the threshold class from the previous example. It follows that rank({x}) = ω for every x ∈ X. Consequently, rank(∅) = ω + 1.

Example 3.11 (Union of partitions). Let X = [0, 1) and, for each k, let H_k be the class of indicators of dyadic intervals of length 2^{−k} (which partition X). Let H = ⋃_k H_k. In this example, LD(H) = ω + 1. Indeed, consider a Littlestone tree t = {x_∅, x_0, x_1} of depth two. The class H_{x_∅,1,x_0,1} consists of indicators of those dyadic intervals that contain both x_∅ and x_0. There is only a finite number of such intervals, because |x_∅ − x_0| > 0. Thus rank(t) < ω for any Littlestone tree of depth two. On the other hand, one may grow a Littlestone tree of arbitrary depth for any choice of root x_∅: the class H_{x_∅,1} is an infinite sequence of nested intervals, which is essentially the same as in Example 3.9; and H_{x_∅,0} has a subclass that is essentially the same as H itself. Thus, rank({x_∅}) = ω for every x_∅ ∈ X. Consequently, rank(∅) = ω + 1.

By inspecting these examples, a common theme emerges. A class of finite Littlestone dimension is one whose Littlestone trees are of bounded depth. A class with LD(H) = ω has arbitrarily large finite Littlestone trees, but the maximal depth of a Littlestone tree is fixed once the root node has been selected.

Footnote 3. It may appear somewhat confusing that t′ ≺ t although t′ is larger than t as a tree. The reason is that we order trees by how far they may be extended, and t′ can be extended less far than t.
Footnote 4. Here we borrow Cantor's notation Ω for the absolute infinite: a number larger than every ordinal number.
Similarly, a class with LD(H) = ω + k for k < ω has arbitrarily large finite Littlestone trees, but the maximal depth of a Littlestone tree is fixed once its first k + 1 levels have been selected. There are also higher ordinals such as LD(H) = ω + ω; this means that the choice of root of the tree determines an arbitrarily large finite number k, such that the maximal depth of the tree is fixed after the next k levels have been selected. For further examples in a more general context, we refer to Appendix A.3 and to the lively discussion in (Evans and Hamkins, 2014) of game values in infinite chess. In any case, the above examples illustrate that the notion of ordinal Littlestone dimension is not only intuitive, but also computable in concrete situations. While only small infinite ordinals appear in the above examples, there exist concept classes such that LD(H) is an arbitrarily large ordinal (as in the proof of Lemma C.3). There is no general
upper bound on the ordinal Littlestone dimension. However, a key part of the proof of Theorem B.1 is the remarkable fact that for measurable classes H in the sense of Definition 3.3, the Littlestone dimension can be at most a countable ordinal: LD(H) < ω_1 (Lemma B.7). Thus any concept class that one is likely to encounter in practice gives rise to a relatively simple learning strategy.

Footnote 5. The results in Appendix B are formulated in the setting of general Gale-Stewart games. When specialized to the game G of Section 3.2, the reader may readily verify that the game value defined in Section B.2 is precisely val(x_1, y_1, …, x_t, y_t) = LD(H_{x_1,y_1,…,x_t,y_t}).
4. Exponential rates
Sections 4 and 5 are devoted to the proof of Theorem 1.9, which is the main result of this paper. The aim of the present section is to characterize when exponential rates do and do not occur; the analogous questions for linear rates will be studied in the next section. Let us recall that the basic definitions of this paper are stated in section 1.4; they will be freely used in the following without further comment. In particular, the following setting and assumptions will be assumed throughout Sections 4 and 5. We fix a Polish space X and a concept class H ⊆ {0, 1}^X satisfying the measurability assumption of Definition 3.3. To avoid trivialities, we always assume that |H| >
2. The learner is presented with an i.i.d. sequence of samples (X_1, Y_1), (X_2, Y_2), … drawn from an unknown distribution P on X × {0, 1}. We will always assume that P is realizable. We start by characterizing which classes H are learnable at an exponential rate. Theorem 4.1. If H does not have an infinite Littlestone tree, then H is learnable with optimal rate e^{−n}. The theorem consists of two parts: we need to prove an upper bound and a lower bound on the rate. The latter (already established by Schuurmans, 1997) is straightforward, so we present it first.
Lemma 4.2 (Schuurmans (1997)). For any learning algorithm ĥ_n, there exists a realizable distribution P such that E[er(ĥ_n)] ≥ 2^{−n−2} for infinitely many n. In particular, this means H is not learnable at rate faster than exponential: R(n) = e^{−n}. Proof As |H| >
2, we can choose h_1, h_2 ∈ H and x, x′ ∈ X such that h_1(x) = h_2(x) =: y and h_1(x′) ≠ h_2(x′). Now fix any learning algorithm ĥ_n. Define two distributions P_0, P_1, where each P_i{(x, y)} = 1/2 and P_i{(x′, i)} = 1/2. Let I ∼ Bernoulli(1/2), and conditioned on I let (X_1, Y_1), (X_2, Y_2), … be i.i.d. P_I, where (X_1, Y_1), …, (X_n, Y_n) are the training set for ĥ_n. Then

E[P(ĥ_n(X_{n+1}) ≠ Y_{n+1} | {(X_t, Y_t)}_{t=1}^n, I)] ≥ (1/2) P(X_1 = ··· = X_n = x, X_{n+1} = x′) = 2^{−n−2}.

Moreover,

E[P(ĥ_n(X_{n+1}) ≠ Y_{n+1} | {(X_t, Y_t)}_{t=1}^n, I)] = (1/2) Σ_{i ∈ {0,1}} E[P(ĥ_n(X_{n+1}) ≠ Y_{n+1} | {(X_t, Y_t)}_{t=1}^n, I = i) | I = i].

Since the average is bounded by the max, we conclude that for each n, there exists i_n ∈ {0, 1} such that for (X_1, Y_1), …, (X_n, Y_n) i.i.d. P_{i_n}, E[er_{P_{i_n}}(ĥ_n)] ≥ 2^{−n−2}. In particular, by the pigeonhole principle, there exists i ∈ {0, 1} such that i_n = i infinitely often, so that E[er_{P_i}(ĥ_n)] ≥ 2^{−n−2} infinitely often. The main challenge in the proof of Theorem 4.1 is constructing a learning algorithm that achieves exponential rate for every realizable P. We assume in the remainder of this section that H has no infinite Littlestone tree. Theorem 3.1 and Corollary 3.5 yield the existence of a sequence of universally measurable functions Ŷ_t : (X × {0,1})^{t−1} × X → {0, 1} that solve the online learning problem from Section 3.1. Define the data-dependent classifier

ŷ_{t−1}(x) := Ŷ_t(X_1, Y_1, …, X_{t−1}, Y_{t−1}, x).

Our first observation is that this adversarial algorithm is also applicable in the probabilistic setting.
Lemma 4.3. P{er(ŷ_t) > 0} → 0 as t → ∞. Proof As P is realizable, we can choose a sequence of hypotheses h_k ∈ H so that er(h_k) ≤ 2^{−k}. For every t ≥
1, a union bound gives

Σ_k P{h_k(X_s) ≠ Y_s for some s ≤ t} ≤ t Σ_k er(h_k) < ∞.

By Borel-Cantelli, with probability one, there exists for every t ≥ 1 some h ∈ H such that h(X_s) = Y_s for all s ≤ t. In other words, with probability one X_1, Y_1, X_2, Y_2, … defines a valid input sequence for the online learning problem of Section 3.1. Because we chose a winning strategy, the time of the last mistake

T = sup{ s ≥ 1 : ŷ_{s−1}(X_s) ≠ Y_s }

is a random variable that is finite with probability one. Now recall from the proof of Theorem 3.1 that the online learning algorithm was chosen so that ŷ_t only changes when a mistake is made. In particular, ŷ_s = ŷ_t for all s ≥ t ≥ T. By the law of large numbers,

P{er(ŷ_t) = 0} = P{ lim_{S→∞} (1/S) Σ_{s=t+1}^{t+S} 1[ŷ_t(X_s) ≠ Y_s] = 0 } ≥ P{ lim_{S→∞} (1/S) Σ_{s=t+1}^{t+S} 1[ŷ_t(X_s) ≠ Y_s] = 0, T ≤ t } = P{T ≤ t}.

It follows that P{er(ŷ_t) > 0} ≤ P{T > t} → 0 as t → ∞. Lemma 4.3 certainly shows that E[er(ŷ_t)] → 0 as t → ∞. Thus the online learning algorithm yields a consistent algorithm in the statistical setting. This, however, does not yield any bound on the learning rate. We presently build a new algorithm on the basis of ŷ_t that guarantees an exponential learning rate. As a first observation, suppose we knew a number t* so that P{er(ŷ_{t*}) > 0} < 1/8. Then we could output ĥ_n with exponential rate as follows. First, break up the data X_1, Y_1, …, X_n, Y_n into ⌊n/t*⌋ batches, each of length t*. Second, compute the classifier ŷ_{t*} separately for each batch. Finally, choose ĥ_n to be the majority vote among these classifiers. Now, by the definition of t* and Hoeffding's inequality, the probability that more than one third of the classifiers has positive error is exponentially small.
It follows that the majority vote ĥ_n has zero error except on an event of exponentially small probability. The problem with this idea is that t* depends on the unknown distribution P, so we cannot assume it is known to the learner. Thus our final algorithm proceeds in two stages: first, we construct an estimate t̂_n for t* from the data; and then we apply the above majority algorithm with batch size t̂_n. Lemma 4.4. There exist universally measurable t̂_n = t̂_n(X_1, Y_1, …, X_n, Y_n), whose definition does not depend on P, so that the following holds. Given t* such that P{er(ŷ_{t*}) > 0} ≤ 1/8, there exist C, c > 0 independent of n (but depending on P, t*) so that P{t̂_n ∈ T_good} ≥ 1 − Ce^{−cn}, where T_good := { 1 ≤ t ≤ t* : P{er(ŷ_t) > 0} ≤ 1/4 }. Proof
For each 1 ≤ t ≤ ⌊n/4⌋ and 1 ≤ i ≤ ⌊n/2t⌋, let

ŷ_t^i(x) := Ŷ_{t+1}(X_{(i−1)t+1}, Y_{(i−1)t+1}, …, X_{it}, Y_{it}, x)

be the learning algorithm from Section 3.1 that is trained on batch i of the data. For each t, the classifiers (ŷ_t^i)_{i ≤ ⌊n/2t⌋} are trained on subsamples of the data that are independent of each other and of the second half (X_s, Y_s)_{s > n/2} of the data. Thus (ŷ_t^i)_{i ≤ ⌊n/2t⌋} may be viewed as independent draws from the distribution of ŷ_t. We now estimate P{er(ŷ_t) > 0} by the fraction of ŷ_t^i that make an error on the second half of the data:

ê_t := (1/⌊n/2t⌋) Σ_{i=1}^{⌊n/2t⌋} 1{ŷ_t^i(X_s) ≠ Y_s for some n/2 < s ≤ n}.

Define t̂_n := inf{ t ≤ ⌊n/4⌋ : ê_t < 3/16 }, with the convention inf ∅ = ∞. Now, fix t* as in the statement of the lemma. By Hoeffding's inequality,

P{t̂_n > t*} ≤ P{ê_{t*} ≥ 3/16} ≤ P{ê_{t*} − E[ê_{t*}] ≥ 1/16} ≤ e^{−⌊n/2t*⌋/128}.

In other words, t̂_n ≤ t* except with exponentially small probability. In addition, by continuity, there exists ε > 0 so that for every 1 ≤ t ≤ t* with P{er(ŷ_t) > 0} > 1/4, we have P{er(ŷ_t) > ε} > 3/16 + 1/32. Fix 1 ≤ t ≤ t* with P{er(ŷ_t) > 0} > 1/4 (if such a t exists). By Hoeffding's inequality,

P{ (1/⌊n/2t⌋) Σ_{i=1}^{⌊n/2t⌋} 1[er(ŷ_t^i) > ε] ≤ 3/16 } ≤ e^{−⌊n/2t*⌋/512}.

Now, if f is any classifier so that er(f) > ε, then

P{f(X_s) ≠ Y_s for some n/2 < s ≤ n} ≥ 1 − (1 − ε)^{n/2}.

Therefore, as (ŷ_t^i)_{i ≤ ⌊n/2t⌋} are independent of (X_s, Y_s)_{s > n/2}, applying a union bound conditionally on (X_s, Y_s)_{s ≤ n/2} shows that, except with probability at most ⌊n/2⌋(1 − ε)^{n/2}, every classifier ŷ_t^i with er(ŷ_t^i) > ε makes an error on the second half of the sample; on this event, ê_t is at least the fraction of classifiers with er(ŷ_t^i) > ε. Combining these bounds with a union bound over t, we obtain

P{t̂_n ∉ T_good} ≤ e^{−⌊n/2t*⌋/128} + t*⌊n/2⌋(1 − ε)^{n/2} + t* e^{−⌊n/2t*⌋/512}.

The right-hand side is bounded by Ce^{−cn} for some C, c > 0.
Corollary 4.5. H has at most exponential learning rate. Proof
We adopt the notations in the proof of Lemma 4.4. The output ĥ_n of our final learning algorithm is the majority vote of the classifiers ŷ_{t̂_n}^i for 1 ≤ i ≤ ⌊n/2t̂_n⌋. We aim to show that E[er(ĥ_n)] ≤ Ce^{−cn} for some constants C, c > 0. Fix t ∈ T_good. By Hoeffding's inequality,

P{ (1/⌊n/2t⌋) Σ_{i=1}^{⌊n/2t⌋} 1[er(ŷ_t^i) > 0] > 1/2 } ≤ e^{−⌊n/2t*⌋/8}.

In other words, except on an event of exponentially small probability, we have er(ŷ_t^i) = 0 for a majority of indices i. By a union bound, we obtain

P{ er(ŷ_{t̂_n}^i) > 0 for a majority of indices i ≤ ⌊n/2t̂_n⌋ } ≤ P{t̂_n ∉ T_good} + P{ for some t ∈ T_good, er(ŷ_t^i) > 0 for a majority of indices i ≤ ⌊n/2t⌋ } ≤ Ce^{−cn} + t* e^{−⌊n/2t*⌋/8}.

In words, except on an event of exponentially small probability, er(ŷ_{t̂_n}^i) = 0 for a majority of indices i. It follows that the majority vote of these classifiers is a.s. correct on a random sample from P. That is, we have shown

P{er(ĥ_n) > 0} ≤ Ce^{−cn} + t* e^{−⌊n/2t*⌋/8}.

The conclusion follows because E[er(ĥ_n)] ≤ P{er(ĥ_n) > 0}. We showed in the previous section that if H has no infinite Littlestone tree, then it can be learned by an algorithm whose rate decays exponentially fast. What is the fastest rate when H has an infinite Littlestone tree? The following result implies a significant drop in the rate: the rate is never faster than linear. Theorem 4.6. If H has an infinite Littlestone tree, then for any learning algorithm ĥ_n, there exists a realizable distribution P such that E[er(ĥ_n)] ≥ 1/(32n) for infinitely many n. In particular, this means H is not learnable at rate faster than 1/n. The proof of Theorem 4.6 uses the probabilistic method. We define a distribution on realizable distributions P with the property that for every learning algorithm, E[er(ĥ_n)] ≥ 1/(32n) infinitely often with positive probability over the choice of P. The main idea of the proof is to concentrate P on a random branch of the infinite Littlestone tree. As any finite set of examples will only explore an initial segment of the chosen branch, the algorithm cannot know whether the random branch continues to the left or to the right after this initial segment. This ensures that the algorithm makes a mistake with probability 1/2 when it is presented with a point that lies deeper along the branch than the training data. The details follow. Proof of Theorem 4.6
Fix any learning algorithm with output ĥ_n, and an infinite Littlestone tree t = { x_u : u ∈ {0,1}^k, 0 ≤ k < ∞ } for H. Let y = (y_1, y_2, …) be an i.i.d. sequence of Bernoulli(1/2) variables. Define the (random) distribution P_y on X × {0,1} by

P_y{(x_{y_{≤k}}, y_{k+1})} = 2^{−k−1} for k ≥ 0.

The map y ↦ P_y is measurable, so no measurability issues arise below. For every n < ∞, there exists h ∈ H so that h(x_{y_{≤k}}) = y_{k+1} for 0 ≤ k ≤ n. Hence

er_y(h) := P_y{ (x, y) ∈ X × {0,1} : h(x) ≠ y } ≤ 2^{−n−1}.

Letting n → ∞, we find that P_y is realizable for every y. Now let (X, Y), (X_1, Y_1), (X_2, Y_2), … be i.i.d. samples drawn from P_y. Then we can write

X = x_{y_{≤T}}, Y = y_{T+1}, X_i = x_{y_{≤T_i}}, Y_i = y_{T_i+1},

where T, T_1, T_2, … are i.i.d. Geometric(1/2) (starting at 0) random variables independent of y. On the event {T = k, max{T_1, …, T_n} < k}, the value ĥ_n(X) is conditionally independent of y_{k+1} given X, (X_1, Y_1), …, (X_n, Y_n), and (again on this event) the corresponding conditional distribution of y_{k+1} is Bernoulli(1/2) (since it is independent from y_1, …, y_k and X, X_1, …, X_n). We therefore have

P{ĥ_n(X) ≠ Y, T = k, max{T_1, …, T_n} < k}
= P{ĥ_n(X) ≠ y_{k+1}, T = k, max{T_1, …, T_n} < k}
= E[ P{ĥ_n(X) ≠ y_{k+1} | X, (X_1, Y_1), …, (X_n, Y_n)} 1_{T = k, max{T_1, …, T_n} < k} ]
= (1/2) P{T = k, max{T_1, …, T_n} < k} = 2^{−k−2} (1 − 2^{−k})^n.

Choose k_n := ⌈log_2 n⌉, so that 2^{−k_n−2} ≥ 1/(8n) and (1 − 2^{−k_n})^n ≥ (1 − 1/n)^n ≥ 1/4 for n ≥ 2. Then

E[ lim sup_{n→∞} n P{ĥ_n(X) ≠ Y, T = k_n | y} ] ≥ lim sup_{n→∞} n P{ĥ_n(X) ≠ Y, T = k_n, max{T_1, …, T_n} < k_n} > 1/32;

Fatou's lemma applies as (almost surely) n P{ĥ_n(X) ≠ Y, T = k_n | y} ≤ n P{T = k_n} = n 2^{−k_n−1} ≤ 1/2. Because

P{ĥ_n(X) ≠ Y, T = k_n | y} ≤ P{ĥ_n(X) ≠ Y | y} = E[er_y(ĥ_n) | y] a.s.,

we have E[lim sup_{n→∞} n E[er_y(ĥ_n) | y]] > 1/32 > 0, which implies there must exist a realization of y such that E[er_y(ĥ_n) | y] > 1/(32n) infinitely often. Choosing P = P_y for this realization of y concludes the proof. The following proposition summarizes some of the main findings of this section.
Proposition 4.7.
The following are equivalent.
1. H is learnable at an exponential rate, but not faster.
2. H does not have an infinite Littlestone tree.
3. There is an “eventually correct” learning algorithm for H, that is, a learning algorithm that outputs ĥ_n so that P{er(ĥ_n) > 0} → 0 as n → ∞.
4. There is an “eventually correct” learning algorithm for H with exponential rate, that is, P{er(ĥ_n) > 0} ≤ Ce^{−cn}, where C, c > 0 may depend on P.
Proof
The implication 2 ⇒ 4 follows from the proof of Theorem 4.1, the implication 4 ⇒ 3 is trivial, and the implication 3 ⇒ 2 follows from the proof of Theorem 4.6. The equivalence of 1 and 2 then follows from Theorem 4.1, Lemma 4.2, and Theorem 4.6.
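The batch-and-majority-vote construction used in the proofs above can be sketched as follows (a minimal illustration under toy assumptions: `train_base` is a memorizing stand-in for the online learner ŷ_t, and the deterministic sample is ours; the actual algorithm additionally estimates the batch size t̂_n from data as in Lemma 4.4):

```python
from collections import Counter

def train_base(batch):
    """Toy stand-in for the per-batch classifier: memorize the batch and
    fall back on the batch's majority label for unseen points."""
    table = dict(batch)
    default = Counter(y for _, y in batch).most_common(1)[0][0]
    return lambda x: table.get(x, default)

def majority_vote(sample, batch_size):
    """Split the sample into disjoint batches, train one classifier per
    batch, and predict by majority vote: the aggregation step that turns
    an 'eventually correct' learner into an exponential-rate one."""
    batches = [sample[i:i + batch_size]
               for i in range(0, len(sample) - batch_size + 1, batch_size)]
    classifiers = [train_base(b) for b in batches]
    return lambda x: int(2 * sum(c(x) for c in classifiers) > len(classifiers))

# deterministic toy data: 6 passes over the domain {0,1,2,3}, target = parity
sample = [(x, x % 2) for _ in range(6) for x in range(4)]
h = majority_vote(sample, batch_size=4)
```

Here every batch classifier happens to be correct, so the vote is correct everywhere; in the analysis above it suffices that a majority of batches have zero error, which holds except with exponentially small probability.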
5. Linear rates
In section 4 we characterized concept classes that have exponential learning rates. We also showed that a concept class that does not have exponential learning rate cannot be learned at a rate faster than linear. The aim of this section is to characterize concept classes that have linear learning rate. Moreover, we show that classes that do not have linear learning rate must have arbitrarily slow rates. This completes our characterization of all possible learning rates. To understand the basic idea behind the characterization of linear rates, it is instructive to revisit the idea that gave rise to exponential rates. First, we showed that it is possible to design an online learning algorithm that achieves perfect prediction after a finite number of rounds. While we do not have a priori control of how fast this “eventually correct” algorithm attains perfect prediction, a modification of the adversarial strategy converges at an exponentially fast rate. To attain a linear rate, we once again design an online algorithm. However, rather than aim for perfect prediction, we now set the more modest goal of learning just to rule out some finite-length patterns in the data. Specifically, we aim to identify a collection of forbidden classification patterns, so that for some finite k, every (x_1, …, x_k) ∈ X^k has some forbidden pattern in {0,1}^k; call this a VC pattern class. If we can identify such a collection of patterns with the property that we will almost surely never observe one of these forbidden patterns in the data sequence, then we can approach the learning problem in a manner analogous to learning with a VC class.
The situation is not quite this simple, since we do not actually have a family of classifiers; fortunately, however, the classical one-inclusion graph prediction strategy of Haussler, Littlestone, and Warmuth (1994) is able to operate purely on the basis of the finite patterns on the data, and hence can be applied to yield the claimed linear rate once the forbidden patterns have been identified. In order to achieve an overall linear learning rate, it then remains to modify the “eventually correct” algorithm so it attains a VC pattern class at an exponentially fast rate when it is trained on random data, using ideas analogous to the ones that were already used in section 4. Throughout this section, we adopt the same setting and assumptions as in section 4.
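As a toy illustration of this definition (our own example, not from the paper): for the threshold class x ↦ 1[x ≤ z], every pair of points already carries a forbidden label pattern, so k = 2 witnesses the VC pattern class property. A sketch of the defining check:

```python
from itertools import product

def is_vc_pattern_class(domain, k, forbidden):
    """Defining property of a VC pattern class: every k-tuple of points
    carries at least one forbidden label pattern in {0,1}^k.
    `forbidden(xs)` is a stand-in returning the patterns ruled out on xs."""
    return all(forbidden(xs) for xs in product(domain, repeat=k))

def forbidden_for_thresholds(pair):
    """For thresholds x -> 1[x <= z]: if a smaller point gets label 0
    (i.e. z is below it), every larger point must get label 0 as well."""
    x1, x2 = pair
    if x1 < x2:
        return {(0, 1)}
    if x1 > x2:
        return {(1, 0)}
    return {(0, 1), (1, 0)}  # equal points cannot receive different labels
```

Since every pair has a forbidden pattern, at most 3 of the 4 label patterns are realizable on any pair, which is exactly the VC-type restriction the learning argument exploits.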
We begin presently by developing the online learning algorithm associated to linear rates. The construction will be quite similar to the one in Section 3.2. However, in the present setting, the notion of a Littlestone tree is replaced by a Vapnik-Chervonenkis-Littlestone (VCL) tree, which was defined in Definition 1.8 (cf. Figure 3). In words, a VCL tree is defined by the following properties. Each vertex of depth k is labelled by a sequence of k + 1 variables in X. Its out-degree is 2^{k+1}, and each of these 2^{k+1} edges is uniquely labeled by an element of {0,1}^{k+1}. A class H has an infinite VCL tree if every finite root-to-vertex path is realized by a function in H. In particular, if H has an infinite VCL tree then it has an infinite Littlestone tree (the other direction does not hold). Remark 5.1. Some features of Definition 1.8 are somewhat arbitrary, and the reader should not read undue meaning into them. We will ultimately be interested in whether or not H has an infinite VCL tree. That the size of the sets x_u grows linearly with the depth of the tree is not important; it would suffice to assume that each x_u is a finite set, and that the sizes of these sets are unbounded along each infinite branch. Thus we have significant freedom in how to define the term “VCL tree”. The present canonical choice was made for concreteness. Just as we have seen for Littlestone trees in Section 3.2, a VCL tree is associated with the following game V. In each round τ:

• Player P_A chooses points ξ_τ = (ξ_τ^0, …, ξ_τ^{τ−1}) ∈ X^τ.
• Player P_L chooses labels η_τ = (η_τ^0, …, η_τ^{τ−1}) ∈ {0,1}^τ.
• Player P_L wins the game in round τ if H_{ξ_1,η_1,…,ξ_τ,η_τ} = ∅.

Here we have naturally extended to the present setting the notation

H_{ξ_1,η_1,…,ξ_τ,η_τ} := { h ∈ H : h(ξ_s^i) = η_s^i for 0 ≤ i < s, 1 ≤ s ≤ τ }

that we used previously in Section 3.2. The game V is a Gale-Stewart game, because the winning condition for P_L is finitely decidable. Lemma 5.2.
If H has no infinite VCL tree, then there is a universally measurable winning strategy for P_L in the game V. Proof
By the same reasoning as in Lemma 3.2, the class H has an infinite VCL tree if and only if P_A has a winning strategy in V. Thus if H has no infinite VCL tree, then P_L has a winning strategy by Theorem A.1. To obtain a universally measurable strategy, it suffices by Theorem B.1 to show that the set of winning sequences for P_L is coanalytic. The proof of this fact is identical to that of Corollary 3.5. When H has no infinite VCL tree, we can use the winning strategy for P_L to design an algorithm that learns to rule out some patterns in the data. We say that a sequence (x_1, y_1, x_2, y_2, …) ∈ (X × {0,1})^∞ is consistent with H if for every t < ∞, there exists h ∈ H such that h(x_s) = y_s for all s ≤ t. Assuming H has no infinite VCL tree, we now use the game V to design an algorithm that learns to rule out some pattern of labels in such a sequence. To this end, denote by η_τ : Π_{σ=1}^τ X^σ → {0,1}^τ the universally measurable winning strategy for P_L provided by Lemma 5.2 (cf. Remark A.4).

• Initialize τ_0 ← 1.
• At every time step t ≥ 1:
  ⊲ If η_{τ_{t−1}}(ξ_1, …, ξ_{τ_{t−1}−1}, (x_{t−τ_{t−1}+1}, …, x_t)) = (y_{t−τ_{t−1}+1}, …, y_t), let ξ_{τ_{t−1}} ← (x_{t−τ_{t−1}+1}, …, x_t) and τ_t ← τ_{t−1} + 1.
  ⊲ Otherwise, let τ_t ← τ_{t−1}.

In words, the algorithm traverses the input sequence (x_1, y_1, x_2, y_2, …) while using the assumed winning strategy η_τ to learn a set of “forbidden patterns” of length τ_t; that is, an assignment which maps every tuple x′ ∈ X^{τ_t} to a pattern y′(x′) ∈ {0,1}^{τ_t} such that after some finite number of steps,
the algorithm never encounters the pattern indicated by y′(x′) when reading the next τ_t examples x′ in the input sequence. Let us denote by

ŷ_{t−1}(z_1, …, z_{τ_{t−1}}) := η_{τ_{t−1}}(ξ_1, …, ξ_{τ_{t−1}−1}, (z_1, …, z_{τ_{t−1}}))

the “pattern avoidance function” defined by this algorithm.

Footnote 6. Given such a tree, we can always engineer a tree as in Definition 1.8 in two steps. First, by passing to a subtree, we can ensure that the cardinalities of x_u are strictly increasing along each branch. Second, we can throw away some points in each set x_u together with the corresponding subtrees to obtain a tree as in Definition 1.8. Lemma 5.3.
For any sequence x_1, y_1, x_2, y_2, … that is consistent with H, the algorithm learns, in a finite number of steps, to successfully rule out patterns in the data. That is,

ŷ_{t−1}(x_{t−τ_{t−1}+1}, …, x_t) ≠ (y_{t−τ_{t−1}+1}, …, y_t), τ_t = τ_{t−1} < ∞, ŷ_t = ŷ_{t−1}

for all sufficiently large t. Proof
Suppose ŷ_{t−1}(x_{t−τ_{t−1}+1}, …, x_t) = (y_{t−τ_{t−1}+1}, …, y_t) occurs at an infinite sequence of times t = t_1, t_2, … Because η_τ is a winning strategy for P_L in the game V, we have H_{ξ_1,η_1,…,ξ_k,η_k} = ∅ for some k < ∞, where ξ_i = (x_{t_i−τ_{t_i−1}+1}, …, x_{t_i}) and η_i = (y_{t_i−τ_{t_i−1}+1}, …, y_{t_i}). But this contradicts the assumption that the input sequence is consistent with H. Remark 5.4.
The strategy τ_t depends in a universally measurable way on x_{≤t}, y_{≤t}. The map ŷ_t(·) is universally measurable jointly as a function of x_{≤t}, y_{≤t} and of its input. More precisely, for each t ≥
0, there exist universally measurable functions

T_t : (X × {0,1})^t → {1, …, t + 1},  Ŷ_t : (X × {0,1})^t × ( ⋃_{s ≤ t+1} X^s ) → ⋃_{s ≤ t+1} {0,1}^s

such that

τ_t = T_t(x_1, y_1, …, x_t, y_t), ŷ_t(z_1, …, z_{τ_t}) = Ŷ_t(x_1, y_1, …, x_t, y_t, z_1, …, z_{τ_t}). Remark 5.5.
The above learning algorithm uses the winning strategy for P_L in the game V. In direct analogy to Section 3.4, one can construct an explicit winning strategy in terms of a notion of “ordinal VCL dimension” whose definition can be read off from the proof of Theorem B.1. Because the details will not be needed for our purposes here, we omit further discussion. In this section we design a learning algorithm with linear learning rate for classes with no infinite VCL trees.
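The control flow of the pattern-avoidance algorithm of the previous subsection can be sketched as follows (a schematic only: `eta` is a stub standing in for the winning strategy η_τ of the game V, and the data stream is a toy):

```python
def pattern_avoidance(stream, eta):
    """Run the pattern-avoidance loop: whenever the currently predicted
    forbidden pattern on the last tau points is realized by the data,
    record the window as the next xi and grow the pattern length by one."""
    xis, tau = [], 1
    xs, ys = [], []
    for x, y in stream:
        xs.append(x)
        ys.append(y)
        if len(xs) < tau:
            continue
        window = tuple(xs[-tau:])
        if eta(tuple(xis), window) == tuple(ys[-tau:]):
            xis.append(window)  # xi_{tau_{t-1}} <- last tau points
            tau += 1            # tau_t <- tau_{t-1} + 1
    return xis, tau

# stub strategy that always forbids the all-ones pattern
eta = lambda xis, window: (1,) * len(window)
xis, tau = pattern_avoidance([(i, 1) for i in range(5)], eta)   # all labels 1
```

With all-ones data the stub's forbidden pattern is hit at every step, so τ grows once per round; on a sequence consistent with H, Lemma 5.3 guarantees the true strategy can be caught out only finitely often, after which the pattern avoidance function is fixed.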
Theorem 5.6. If $\mathcal H$ does not have an infinite VCL tree, then $\mathcal H$ is learnable at rate $\frac1n$.

The proof of this theorem is similar in spirit to that of Theorem 4.1, but requires some additional ingredients. Let us fix a realizable distribution $P$ and let $(X_1,Y_1),(X_2,Y_2),\dots$ be i.i.d. samples from $P$. We assume in the remainder of this section that $\mathcal H$ has no infinite VCL tree, so that we can run the algorithm of the previous section on the random data. We set
\[
\tau_t := T_t(X_1,Y_1,\dots,X_t,Y_t),\qquad
\hat y_t(z_1,\dots,z_{\tau_t}) := \hat Y_t(X_1,Y_1,\dots,X_t,Y_t,z_1,\dots,z_{\tau_t}),
\]
where the universally measurable functions $T_t, \hat Y_t$ are the ones defined in Remark 5.4. For any integer $k \ge 1$ and function $g : \mathcal X^k \to \{0,1\}^k$, define the error
\[
\mathrm{per}(g) = \mathrm{per}_k(g) = P^{\otimes k}\{(x_1,y_1,\dots,x_k,y_k) : g(x_1,\dots,x_k) = (y_1,\dots,y_k)\}
\]
to be the probability that $g$ fails to avoid the pattern of labels realized by the data. (The index $k$ can be understood from the domain of $g$.)

Lemma 5.7. $\mathbf P\{\mathrm{per}(\hat y_t) > 0\} \to 0$ as $t \to \infty$.

Proof
We showed in the proof of Lemma 4.3 that the random data sequence $X_1,Y_1,X_2,Y_2,\dots$ is a.s. consistent with $\mathcal H$. Thus Lemma 5.3 implies that
\[
T = \sup\{s \ge 1 : \hat y_{s-1}(X_{s-\tau_{s-1}+1},\dots,X_s) = (Y_{s-\tau_{s-1}+1},\dots,Y_s)\}
\]
is finite a.s., and that $\hat y_s = \hat y_t$ and $\tau_s = \tau_t$ for all $s \ge t \ge T$. By the law of large numbers for $m$-dependent sequences,
\begin{align*}
\mathbf P\{\mathrm{per}_{\tau_t}(\hat y_t) = 0\}
&= \mathbf P\Big\{\lim_{S\to\infty} \tfrac1S \textstyle\sum_{s=t+1}^{t+S} \mathbf 1_{\hat y_t(X_s,\dots,X_{s+\tau_t-1}) = (Y_s,\dots,Y_{s+\tau_t-1})} = 0\Big\} \\
&\ge \mathbf P\Big\{\lim_{S\to\infty} \tfrac1S \textstyle\sum_{s=t+1}^{t+S} \mathbf 1_{\hat y_t(X_s,\dots,X_{s+\tau_t-1}) = (Y_s,\dots,Y_{s+\tau_t-1})} = 0,\ T \le t\Big\}
= \mathbf P\{T \le t\}.
\end{align*}
As $T$ is finite with probability one, it follows that $\mathbf P\{\mathrm{per}_{\tau_t}(\hat y_t) > 0\} \le \mathbf P\{T > t\} \to 0$ as $t \to \infty$.

Lemma 5.7 ensures that we can learn to rule out patterns in the data. Once we have ruled out patterns in the data, we can learn using the resulting "VC pattern class" using (in a somewhat non-standard manner) the one-inclusion graph prediction algorithm of Haussler, Littlestone, and Warmuth (1994). That algorithm was originally designed for learning with VC classes of classifiers, but fortunately its operations only rely on the projection of the class to the set of finite realizable patterns on the data, and therefore its behavior and analysis are equally well-defined and valid when we have only a VC pattern class, rather than a VC class of functions.
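The notion of a VC pattern class can be made concrete in a tiny finite setting. The following sketch is our own illustration, not from the paper (`pattern_class` and `vc_dim` are hypothetical helper names): given a pattern avoidance function $g$ and points $x_1,\dots,x_n$, it enumerates by brute force the class $F_x$ of labelings on which $g$ successfully avoids the realized pattern, and computes its VC dimension, which Lemma 5.8 below asserts is below $t$.

```python
from itertools import product, permutations, combinations

def pattern_class(g, xs, t):
    """F_x: all labelings f of {0,...,n-1} such that g avoids the pattern
    (f(i_1),...,f(i_t)) on every tuple of pairwise distinct indices."""
    n = len(xs)
    return [f for f in product((0, 1), repeat=n)
            if all(g(tuple(xs[i] for i in idx)) != tuple(f[i] for i in idx)
                   for idx in permutations(range(n), t))]

def vc_dim(F, n):
    """VC dimension of a finite class F of labelings of {0,...,n-1}:
    the largest r such that some r-subset of indices is shattered."""
    d = 0
    for r in range(1, n + 1):
        for S in combinations(range(n), r):
            if len({tuple(f[i] for i in S) for f in F}) == 2 ** r:
                d = r
    return d
```

For instance, the constant avoider $g(z_1,\dots,z_k) = (0,\dots,0)$ with $t = 2$ yields the class of labelings of three points with at most one zero, whose VC dimension is $1 < t$. This is exponential-time in $n$, so it is only a conceptual check, not part of any learner.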
Lemma 5.8.
Let $g : \mathcal X^t \to \{0,1\}^t$ be a universally measurable function for some $t \ge 1$. For every $n \ge t$, there is a universally measurable function $\hat Y^g_n : (\mathcal X \times \{0,1\})^{n-1} \times \mathcal X \to \{0,1\}$ such that, for every $(x_1,y_1,\dots,x_n,y_n) \in (\mathcal X \times \{0,1\})^n$ that satisfies $g(x_{i_1},\dots,x_{i_t}) \ne (y_{i_1},\dots,y_{i_t})$ for all pairwise distinct $1 \le i_1,\dots,i_t \le n$, we have
\[
\frac{1}{n!} \sum_{\sigma \in \mathrm{Sym}(n)} \mathbf 1_{\hat Y^g_n(x_{\sigma(1)},y_{\sigma(1)},\dots,x_{\sigma(n-1)},y_{\sigma(n-1)},x_{\sigma(n)}) \ne y_{\sigma(n)}} < \frac{t}{n},
\]
where $\mathrm{Sym}(n)$ denotes the symmetric group (of permutations of $[n]$).

Proof
Fix $n \ge t$ and let $\bar{\mathcal X} = \{1,\dots,n\}$. In the following, $F \subseteq \{0,1\}^{\bar{\mathcal X}}$ denotes a set of hypotheses $f : \bar{\mathcal X} \to \{0,1\}$. Applying (Haussler, Littlestone, and Warmuth, 1994, Theorem 2.3(ii)) with $\bar x = (1,\dots,n)$ yields a function $A : 2^{\{0,1\}^{\bar{\mathcal X}}} \times (\bar{\mathcal X} \times \{0,1\})^{n-1} \times \bar{\mathcal X} \to \{0,1\}$ such that
\[
\frac{1}{n!} \sum_{\sigma \in \mathrm{Sym}(n)} \mathbf 1_{A(F,\sigma(1),f(\sigma(1)),\dots,\sigma(n-1),f(\sigma(n-1)),\sigma(n)) \ne f(\sigma(n))} \le \frac{\mathrm{vc}(F)}{n}
\]
for any $f \in F$ and $F \subseteq \{0,1\}^{\bar{\mathcal X}}$, where $\mathrm{vc}(F)$ denotes the VC dimension of $F$. Moreover, by construction $A$ is covariant under relabeling of $\bar{\mathcal X}$, that is,
\[
A(F,\sigma(1),y_1,\dots,\sigma(n-1),y_{n-1},\sigma(n)) = A(F \circ \sigma, 1, y_1, \dots, n-1, y_{n-1}, n)
\]
for all permutations $\sigma$, where $F \circ \sigma := \{f \circ \sigma : f \in F\}$. The domain of $A$ is a finite set, so the function $A$ is trivially measurable.

Given any input sequence $(x_1,y_1,\dots,x_n,y_n)$, define the concept class $F_x$ as the collection of all $f \in \{0,1\}^{\bar{\mathcal X}}$ so that $g(x_{i_1},\dots,x_{i_t}) \ne (f(i_1),\dots,f(i_t))$ for all pairwise distinct $1 \le i_1,\dots,i_t \le n$. Define the classifier
\[
\hat Y^g_n(x_1,y_1,\dots,x_{n-1},y_{n-1},x_n) := A(F_x, 1, y_1, \dots, n-1, y_{n-1}, n).
\]
As $g$ is universally measurable, the classifier $\hat Y^g_n$ is also universally measurable. Moreover, as $A$ is covariant and as $F_{x_{\sigma(1)},\dots,x_{\sigma(n)}} = F_{x_1,\dots,x_n} \circ \sigma$, we have
\[
\hat Y^g_n(x_{\sigma(1)},y_{\sigma(1)},\dots,x_{\sigma(n-1)},y_{\sigma(n-1)},x_{\sigma(n)})
= A(F_x, \sigma(1), y_{\sigma(1)}, \dots, \sigma(n-1), y_{\sigma(n-1)}, \sigma(n)).
\]
Now suppose that the input sequence $(x_1,y_1,\dots,x_n,y_n)$ satisfies the assumption of the lemma. The function $y(i) := y_i$ satisfies $y \in F_x$ by the definition of $F_x$. It therefore follows that for any such sequence
\[
\frac{1}{n!} \sum_{\sigma \in \mathrm{Sym}(n)} \mathbf 1_{\hat Y^g_n(x_{\sigma(1)},y_{\sigma(1)},\dots,x_{\sigma(n-1)},y_{\sigma(n-1)},x_{\sigma(n)}) \ne y_{\sigma(n)}} \le \frac{\mathrm{vc}(F_x)}{n}.
\]
Finally, by construction, $\mathrm{vc}(F_x) < t$.

7. If $Z_1, Z_2, \dots$ is an i.i.d. sequence of random variables, then we have $\lim_{n\to\infty} \frac1n \sum_{i=1}^{n} f(Z_{i+1},\dots,Z_{i+m}) = \frac1m \sum_{i=1}^{m} \lim_{n\to\infty} \frac{m}{n} \sum_{j=0}^{\lfloor n/m\rfloor} f(Z_{mj+1+i},\dots,Z_{m(j+1)+i}) + o(1) = \mathbf E[f(Z_1,\dots,Z_m)]$ a.s. by the law of large numbers.

Remark 5.9.
Below we choose the function $g$ in Lemma 5.8 to be the one generated by the algorithm from the previous section. By Remark 5.4, the resulting function is universally measurable jointly in the training data and the function input. It follows from the proof of Lemma 5.8 that in such a situation, $\hat Y^g_n$ is also universally measurable jointly in the training data and the function input.

We are now ready to outline our final learning algorithm. Lemma 5.7 guarantees the existence of some $t^*$ such that $\mathbf P\{\mathrm{per}(\hat y_{t^*}) > 0\} \le \frac18$. Given a finite sample $X_1,Y_1,\dots,X_n,Y_n$, we split it in two parts. Using the first part of the sample, we form an estimate $\hat t_n$ of the index $t^*$. We then construct, still using the first half of the sample, a family of pattern avoidance functions. For each of these pattern avoidance functions, we apply the algorithm from Lemma 5.8 to the second part of the sample to obtain a predictor. This yields a family of predictors, one per pattern avoidance function. Our final classifier is the majority vote among these predictors.

We now proceed to the details. We first prove a variant of Lemma 4.4.

Lemma 5.10.
There exist universally measurable $\hat t_n = \hat t_n(X_1,Y_1,\dots,X_{\lfloor n/2\rfloor},Y_{\lfloor n/2\rfloor})$, whose definition does not depend on $P$, so that the following holds. Given $t^*$ so that $\mathbf P\{\mathrm{per}(\hat y_{t^*}) > 0\} \le \frac18$, there exist $C, c > 0$ independent of $n$ (but depending on $P, t^*$) so that
\[
\mathbf P\{\hat t_n \in T_{\mathrm{good}}\} \ge 1 - Ce^{-cn},\quad\text{where}\quad
T_{\mathrm{good}} := \{1 \le t \le t^* : \mathbf P\{\mathrm{per}(\hat y_t) > 0\} \le \tfrac14\}.
\]

Proof
The proof is almost identical to that of Lemma 4.4. However, for completeness, we spell out the details of the argument in the present setting. For each $1 \le t \le \lfloor n/4\rfloor$ and $1 \le i \le \lfloor n/4t\rfloor$, let
\[
\tau^i_t := T_t(X_{(i-1)t+1},Y_{(i-1)t+1},\dots,X_{it},Y_{it}),\qquad
\hat y^i_t(z_1,\dots,z_{\tau^i_t}) := \hat Y_t(X_{(i-1)t+1},Y_{(i-1)t+1},\dots,X_{it},Y_{it},z_1,\dots,z_{\tau^i_t})
\]
be as defined above for the subsample $X_{(i-1)t+1},Y_{(i-1)t+1},\dots,X_{it},Y_{it}$ of the first quarter of the data. For each $t$, estimate $\mathbf P\{\mathrm{per}(\hat y_t) > 0\}$ by the fraction of the $\hat y^i_t$ that make an error on the second quarter of the data:
\[
\hat e_t := \frac{1}{\lfloor n/4t\rfloor} \sum_{i=1}^{\lfloor n/4t\rfloor} \mathbf 1_{\hat y^i_t(X_{s+1},\dots,X_{s+\tau^i_t}) = (Y_{s+1},\dots,Y_{s+\tau^i_t})\ \text{for some}\ n/4 \le s \le n/2 - \tau^i_t}.
\]
Observe that
\[
\hat e_t \le e_t := \frac{1}{\lfloor n/4t\rfloor} \sum_{i=1}^{\lfloor n/4t\rfloor} \mathbf 1_{\mathrm{per}(\hat y^i_t) > 0}\quad\text{a.s.}
\]
Finally, we define $\hat t_n := \inf\{t \le \lfloor n/4\rfloor : \hat e_t < \frac{3}{16}\}$, with the convention $\inf \varnothing = \infty$.

Let $t^*$ be as in the statement of the lemma. By Hoeffding's inequality,
\[
\mathbf P\{\hat t_n > t^*\} \le \mathbf P\{\hat e_{t^*} \ge \tfrac{3}{16}\} \le \mathbf P\{e_{t^*} - \mathbf E[e_{t^*}] \ge \tfrac1{16}\} \le e^{-\lfloor n/4t^*\rfloor/128}.
\]
In addition, by continuity, there exists $\varepsilon > 0$ so that for every $1 \le t \le t^*$ such that $\mathbf P\{\mathrm{per}(\hat y_t) > 0\} > \frac14$, we have $\mathbf P\{\mathrm{per}(\hat y_t) > \varepsilon\} > \frac{7}{32}$.

Now, fix $1 \le t \le t^*$ such that $\mathbf P\{\mathrm{per}(\hat y_t) > 0\} > \frac14$. By Hoeffding's inequality, and the choice of $\varepsilon$,
\[
\mathbf P\bigg\{\frac{1}{\lfloor n/4t\rfloor} \sum_{i=1}^{\lfloor n/4t\rfloor} \mathbf 1_{\mathrm{per}(\hat y^i_t) > \varepsilon} < \frac{3}{16}\bigg\} \le e^{-\lfloor n/4t^*\rfloor/512}.
\]
Observe that for any $g : \mathcal X^\tau \to \{0,1\}^\tau$ that satisfies $\mathrm{per}(g) > \varepsilon$, we have
\[
\mathbf P\{g(X_{s+1},\dots,X_{s+\tau}) = (Y_{s+1},\dots,Y_{s+\tau})\ \text{for some}\ n/4 \le s \le n/2 - \tau\} \ge 1 - (1-\varepsilon)^{\lfloor (n-2)/4\tau\rfloor},
\]
because there are $\lfloor (n-2)/4\tau\rfloor$ disjoint intervals of length $\tau$ in $[n/4 + 1, n/2] \cap \mathbb N$.

Since $(\tau^i_t, \hat y^i_t)_{i \le \lfloor n/4t\rfloor}$ are independent of $(X_s,Y_s)_{s > n/4}$, applying a union bound conditionally on $(X_s,Y_s)_{s \le n/4}$ shows that the probability that every $\hat y^i_t$ with $\mathrm{per}_{\tau^i_t}(\hat y^i_t) > \varepsilon$ makes an error on the second quarter of the sample is
\[
\mathbf P\big\{\mathbf 1_{\mathrm{per}_{\tau^i_t}(\hat y^i_t) > \varepsilon} \le \mathbf 1_{\hat y^i_t(X_{s+1},\dots,X_{s+\tau^i_t}) = (Y_{s+1},\dots,Y_{s+\tau^i_t})\ \text{for some}\ n/4 \le s \le n/2 - \tau^i_t}\ \text{for all}\ i\big\} \ge 1 - \lfloor n/4t\rfloor (1-\varepsilon)^{\lfloor (n-2)/4(t^*+1)\rfloor},
\]
where we used that $\tau^i_t \le t + 1 \le t^* + 1$. It follows that
\[
\mathbf P\{\hat t_n = t\} \le \mathbf P\{\hat e_t < \tfrac{3}{16}\} \le \lfloor n/4\rfloor (1-\varepsilon)^{\lfloor (n-2)/4(t^*+1)\rfloor} + e^{-\lfloor n/4t^*\rfloor/512}.
\]
Putting together the above estimates and applying a union bound, we have
\[
\mathbf P\{\hat t_n \notin T_{\mathrm{good}}\} \le e^{-\lfloor n/4t^*\rfloor/128} + t^* \lfloor n/4\rfloor (1-\varepsilon)^{\lfloor (n-2)/4(t^*+1)\rfloor} + t^* e^{-\lfloor n/4t^*\rfloor/512}.
\]
The right-hand side is bounded by $Ce^{-cn}$ for some $C, c > 0$ that do not depend on $n$.
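The validation statistic $\hat e_t$ of the preceding lemma is easy to compute in practice. The following sketch is our own simplified illustration (the name `fraction_failing` and its interface are hypothetical, and a single fixed window length stands in for the per-function lengths $\tau^i_t$): it reports the fraction of a family of pattern avoidance functions that realize, rather than avoid, the label pattern on some window of the validation data.

```python
def fraction_failing(avoiders, tau, val_x, val_y):
    """Empirical analogue of the statistic ê_t: the fraction of pattern
    avoidance functions g that reproduce (i.e. fail to avoid) the label
    pattern on some length-tau window of the validation sequence."""
    def fails(g):
        return any(g(tuple(val_x[s:s + tau])) == tuple(val_y[s:s + tau])
                   for s in range(len(val_x) - tau + 1))
    return sum(fails(g) for g in avoiders) / len(avoiders)
```

A function with $\mathrm{per}(g) = 0$ is never counted; a function with $\mathrm{per}(g) > \varepsilon$ is counted with probability approaching one as the validation stretch grows, which is exactly the dichotomy the proof exploits.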
Proof of Theorem 5.6
We adopt the notation of the proof of Lemma 5.10. Our final learning algorithm is constructed as follows. First, we compute $\hat t_n$. Second, we use the first half of the data to construct the pattern avoidance functions $\hat y^i_{\hat t_n}$ for $1 \le i \le \lfloor n/4\hat t_n\rfloor$. Third, we use the second half of the data to construct classifiers $\hat y_i$ by running the algorithm from Lemma 5.8; namely,
\[
\hat y_i(x) := \hat Y^{\hat y^i_{\hat t_n}}_{\lfloor n/2\rfloor + 2}(X_{\lceil n/2\rceil},Y_{\lceil n/2\rceil},\dots,X_n,Y_n,x).
\]
Our final output $\hat h_n$ is the majority vote over $\hat y_i$ for $1 \le i \le \lfloor n/4\hat t_n\rfloor$. We aim to show that $\mathbf E[\mathrm{er}(\hat h_n)] \le \frac{C'}{n}$ for some constant $C'$.

To this end, for every $t \in T_{\mathrm{good}}$, because $\mathbf P\{\mathrm{per}(\hat y_t) > 0\} \le \frac14$, Hoeffding's inequality implies
\[
\mathbf P\bigg\{\frac{1}{\lfloor n/4t\rfloor} \sum_{i=1}^{\lfloor n/4t\rfloor} \mathbf 1_{\mathrm{per}(\hat y^i_t) > 0} > \frac{7}{16}\bigg\} \le e^{-9\lfloor n/4t^*\rfloor/128}.
\]
By a union bound, we obtain
\[
\mathbf P\bigg\{\frac{1}{\lfloor n/4\hat t_n\rfloor} \sum_{i=1}^{\lfloor n/4\hat t_n\rfloor} \mathbf 1_{\mathrm{per}(\hat y^i_{\hat t_n}) > 0} > \frac{7}{16},\ \hat t_n \in T_{\mathrm{good}}\bigg\}
\le \sum_{t \in T_{\mathrm{good}}} \mathbf P\bigg\{\frac{1}{\lfloor n/4t\rfloor} \sum_{i=1}^{\lfloor n/4t\rfloor} \mathbf 1_{\mathrm{per}(\hat y^i_t) > 0} > \frac{7}{16}\bigg\} \le t^* e^{-9\lfloor n/4t^*\rfloor/128}.
\]
Thus, except on an event of exponentially small probability, the pattern avoidance functions $\hat y^i_{\hat t_n}$ have zero error for at least a fraction $\frac{9}{16}$ of the indices $i$.

Now let $(X,Y) \sim P$ be independent of the data $X_1,Y_1,\dots,X_n,Y_n$. Then
\[
\mathbf E[\mathrm{er}(\hat h_n)] = \mathbf P[\hat h_n(X) \ne Y] \le \mathbf P\bigg[\frac{1}{\lfloor n/4\hat t_n\rfloor} \sum_{i=1}^{\lfloor n/4\hat t_n\rfloor} \mathbf 1_{\hat y_i(X) \ne Y} \ge \frac12\bigg].
\]
We can therefore estimate, using Lemma 5.10,
\[
\mathbf E[\mathrm{er}(\hat h_n)] \le Ce^{-cn} + t^* e^{-9\lfloor n/4t^*\rfloor/128}
+ \mathbf P\bigg[\hat t_n \in T_{\mathrm{good}},\ \frac{1}{\lfloor n/4\hat t_n\rfloor} \sum_{i=1}^{\lfloor n/4\hat t_n\rfloor} \mathbf 1_{\hat y_i(X) \ne Y} \ge \frac12,\ \frac{1}{\lfloor n/4\hat t_n\rfloor} \sum_{i=1}^{\lfloor n/4\hat t_n\rfloor} \mathbf 1_{\mathrm{per}(\hat y^i_{\hat t_n}) = 0} \ge \frac{9}{16}\bigg].
\]
Since any two sets, containing at least $\frac12$ and $\frac{9}{16}$ fractions of $\{1,\dots,\lfloor n/4\hat t_n\rfloor\}$, must have at least a $\frac1{16}$ fraction in their intersection (by the union bound for their complements), the last term in the above expression is bounded above by
\[
\mathbf P\bigg[\hat t_n \in T_{\mathrm{good}},\ \frac{1}{\lfloor n/4\hat t_n\rfloor} \sum_{i=1}^{\lfloor n/4\hat t_n\rfloor} \mathbf 1_{\hat y_i(X) \ne Y}\, \mathbf 1_{\mathrm{per}(\hat y^i_{\hat t_n}) = 0} \ge \frac1{16}\bigg]
\le 16\, \mathbf E\bigg[\mathbf 1_{\hat t_n \in T_{\mathrm{good}}}\, \frac{1}{\lfloor n/4\hat t_n\rfloor} \sum_{i=1}^{\lfloor n/4\hat t_n\rfloor} \mathbf 1_{\hat y_i(X) \ne Y}\, \mathbf 1_{\mathrm{per}(\hat y^i_{\hat t_n}) = 0}\bigg],
\]
using Markov's inequality. We can now apply Lemma 5.8 conditionally on the first half of the data to conclude (using exchangeability) that
\[
\mathbf E[\mathrm{er}(\hat h_n)] \le Ce^{-cn} + t^* e^{-9\lfloor n/4t^*\rfloor/128} + 16\, \mathbf E\bigg[\mathbf 1_{\hat t_n \in T_{\mathrm{good}}}\, \frac{1}{\lfloor n/4\hat t_n\rfloor} \sum_{i=1}^{\lfloor n/4\hat t_n\rfloor} \frac{\tau^i_{\hat t_n}}{\lfloor n/2\rfloor + 2}\bigg]
\le Ce^{-cn} + t^* e^{-9\lfloor n/4t^*\rfloor/128} + \frac{16(t^*+1)}{\lfloor n/2\rfloor + 2},
\]
where we used that $\tau^i_{\hat t_n} \le \hat t_n + 1 \le t^* + 1$ for $\hat t_n \in T_{\mathrm{good}}$.

The final step in the proof of our main results is to show that classes with infinite VCL trees have arbitrarily slow rates.
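The learner built in the proof above can be summarized schematically. The sketch below is a minimal illustration of the split-sample, majority-vote structure only; the two nontrivial components (the construction of the pattern avoidance functions and the one-inclusion predictor of Lemma 5.8) are abstracted as black-box callables, and all helper names are our own placeholders, not the paper's.

```python
from collections import Counter

def majority_vote_learner(sample, fit_avoiders, one_inclusion_predict):
    """Split the sample in two halves; build one predictor per pattern
    avoidance function fitted on the first half, each consulting the
    second half; classify by majority vote over the predictors."""
    n = len(sample)
    first, second = sample[: n // 2], sample[n // 2:]
    avoiders = fit_avoiders(first)   # one avoidance function per subsample

    def h(x):
        votes = Counter(one_inclusion_predict(g, second, x) for g in avoiders)
        return votes.most_common(1)[0][0]

    return h
```

The point of the analysis is that the vote is robust: even if a small fraction of the avoidance functions are bad, the majority is driven by predictors whose error is controlled by Lemma 5.8.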
Theorem 5.11. If $\mathcal H$ has an infinite VCL tree, then $\mathcal H$ requires arbitrarily slow rates.

Together with Theorems 4.6 and 5.6, this theorem completes the characterization of classes $\mathcal H$ with linear learning rate: these are precisely the classes that have an infinite Littlestone tree but do not have an infinite VCL tree.

The proof of Theorem 5.11 is similar to that of Theorem 4.6. The details, however, are more involved. We prove, via the probabilistic method, that for any rate function $R(t) \to 0$ and any learning algorithm $\hat h_n$, there is a realizable distribution $P$ so that $\mathbf E[\mathrm{er}(\hat h_n)] \ge \frac{R(n)}{40}$ infinitely often. The construction of the distribution $P$ depends on the rate function $R$ and relies on the following technical lemma.

Lemma 5.12.
Let $R(t) \to 0$ be any rate function. Then there exist probabilities $p_1, p_2, \dots \ge 0$ so that $\sum_{k \ge 1} p_k = 1$, two increasing sequences of integers $(n_i)_{i \ge 1}$ and $(k_i)_{i \ge 1}$, and a constant $\frac12 \le C \le 1$ such that the following hold for all $i \ge 1$: (a) $\sum_{k > k_i} p_k \le \frac{1}{n_i}$. (b) $n_i p_{k_i} \le k_i$. (c) $p_{k_i} = C R(n_i)$.

Proof
We may assume without loss of generality that $R(1) = 1$; otherwise, we can replace $R$ by $\tilde R$ such that $\tilde R(1) = 1$ and $\tilde R(n) = R(n)$ for $n > 1$. We first construct the sequences $(n_i)$ and $(k_i)$. Let $n_1 = 1$ and $k_1 = 1$. For $i > 1$, let
\[
n_i = \inf\Big\{n > n_{i-1} : R(n) \le 2^{j-i} \min\Big(R(n_j), \frac{1}{n_j}\Big)\ \text{for all}\ j < i\Big\},\qquad
k_i = \max\big(k_{i-1} + 1,\ \lceil n_i R(n_i)\rceil\big).
\]
Since $R(t) \to 0$, we have $n_i < \infty$ for all $i$. The sequences are increasing by construction. Finally, we define $p_k = 0$ for $k \notin \{k_i : i \ge 1\}$ and
\[
p_{k_i} = C R(n_i)\quad\text{with}\quad C = \bigg(\sum_{j \ge 1} R(n_j)\bigg)^{-1},
\]
so that $\sum_{k \ge 1} p_k = 1$. As $R(n_j) \le 2^{-j+1}$ for all $j > 1$ and $R(n_1) = 1$, we have $\frac12 \le C \le 1$. Moreover, the definition of $n_j$ yields
\[
R(n_j) \le 2^{i-j} R(n_i),\qquad R(n_j) \le \frac{2^{i-j}}{n_i}\qquad\text{for all}\ i < j.
\]
Therefore, as $C \le 1$, we obtain
\[
\sum_{k > k_i} p_k = \sum_{j > i} p_{k_j} = \sum_{j > i} C R(n_j) \le \frac{1}{n_i} \sum_{j > i} 2^{i-j} \le \frac{1}{n_i},
\]
which proves (a). For (b), note that $n_i p_{k_i} = C n_i R(n_i) \le \lceil n_i R(n_i)\rceil \le k_i$. Finally, (c) holds by construction.

We can now complete the proof of Theorem 5.11.
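Before doing so, we note that the construction just given can be checked numerically for a concrete rate function. The sketch below follows the recursion in the form reconstructed above (an assumption on our part; `slow_rate_support` is our own name), truncating the sequences at a finite depth and normalizing the masses over the chosen indices only.

```python
def slow_rate_support(R, depth=4):
    """Finite-depth sketch of the Lemma 5.12 construction (assumed form):
    build increasing (n_i), (k_i) and masses p_{k_i} = C * R(n_i)."""
    ns, ks = [1], [1]
    for i in range(1, depth):
        n = ns[-1] + 1
        # smallest n with R(n) <= 2^(j-i) * min(R(n_j), 1/n_j) for all j < i
        while not all(R(n) <= 2 ** (j - i) * min(R(ns[j]), 1.0 / ns[j])
                      for j in range(i)):
            n += 1
        ns.append(n)
        ks.append(max(ks[-1] + 1, n))   # guarantees n_i * p_{k_i} <= k_i
    C = 1.0 / sum(R(n) for n in ns)     # normalize over the truncated support
    p = {k: C * R(n) for k, n in zip(ks, ns)}
    return p, ns, ks, C
```

For $R(n) = n^{-1/2}$ and depth 4 this yields $n_i = 1, 4, 64, 16384$, and properties (a)-(c) can be verified directly on the output.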
Proof of Theorem 5.11
We fix throughout the proof a rate function $R(t) \to 0$. Define $C, p_k, k_i, n_i$ as in Lemma 5.12. We also fix any learning algorithm with output $\hat h_n$ and an infinite VCL tree
\[
t = \{x_{\mathbf u} \in \mathcal X^{k+1} : 0 \le k < \infty,\ \mathbf u \in \{0,1\}^1 \times \cdots \times \{0,1\}^k\}
\]
for $\mathcal H$. Let $\mathbf y = (\mathbf y_1, \mathbf y_2, \dots)$ be a sequence of independent random vectors, where $\mathbf y_k = (y^0_k, \dots, y^{k-1}_k)$ is uniformly distributed on $\{0,1\}^k$ for each $k \ge 1$. Define the random distribution $P_{\mathbf y}$ on $\mathcal X \times \{0,1\}$ as
\[
P_{\mathbf y}\{(x^i_{\mathbf y_{\le k-1}}, y^i_k)\} = \frac{p_k}{k}\qquad\text{for}\ 0 \le i \le k-1,\ k \ge 1.
\]
In words, each $\mathbf y$ defines an infinite branch of the tree $t$. Given $\mathbf y$, we choose the vertex on this branch of depth $k-1$ with probability $p_k$. This vertex defines a subset of $\mathcal X$ of size $k$, and the distribution $P_{\mathbf y}$ chooses each element of this subset uniformly at random.

Because $t$ is a VCL tree, for every $n < \infty$, there exists $h \in \mathcal H$ so that $h(x^i_{\mathbf y_{\le k-1}}) = y^i_k$ for $0 \le i \le k-1$, $1 \le k \le n$. Thus
\[
\mathrm{er}_{\mathbf y}(h) := P_{\mathbf y}\{(x,y) \in \mathcal X \times \{0,1\} : h(x) \ne y\} \le \sum_{k > n} p_k.
\]
Letting $n \to \infty$, we find that $P_{\mathbf y}$ is realizable for every realization of $\mathbf y$. Finally, the map $\mathbf y \mapsto P_{\mathbf y}$ is measurable as in the proof of Theorem 4.6.

Now let $(X,Y), (X_1,Y_1), (X_2,Y_2), \dots$ be i.i.d. samples drawn from $P_{\mathbf y}$. That is,
\[
X = x^I_{\mathbf y_{\le T-1}},\quad Y = y^I_T,\qquad X_i = x^{I_i}_{\mathbf y_{\le T_i-1}},\quad Y_i = y^{I_i}_{T_i},
\]
where $(T,I), (T_1,I_1), (T_2,I_2), \dots$ are i.i.d. random variables, independent of $\mathbf y$, with distribution
\[
\mathbf P\{T = k, I = i\} = \frac{p_k}{k}\qquad\text{for}\ 0 \le i \le k-1,\ k \ge 1.
\]
For all $n$ and $k$,
\begin{align*}
\mathbf P\{\hat h_n(X) \ne Y,\ T = k\}
&\ge \sum_{i=0}^{k-1} \mathbf P\{\hat h_n(X) \ne y^i_k,\ T = k,\ I = i,\ T_1,\dots,T_n \le k,\ (T_1,I_1),\dots,(T_n,I_n) \ne (k,i)\} \\
&= \frac12 \sum_{i=0}^{k-1} \mathbf P\{T = k,\ I = i,\ T_1,\dots,T_n \le k,\ (T_1,I_1),\dots,(T_n,I_n) \ne (k,i)\}
= \frac{p_k}{2}\bigg(1 - \sum_{l > k} p_l - \frac{p_k}{k}\bigg)^n,
\end{align*}
where we used that conditionally on $T = k$, $I = i$, $T_1,\dots,T_n \le k$, $(T_1,I_1),\dots,(T_n,I_n) \ne (k,i)$, the prediction $\hat h_n(X)$ is independent of $y^i_k$.

We now choose $k = k_i$ and $n = n_i$. By Lemma 5.12,
\[
\mathbf P\{\hat h_{n_i}(X) \ne Y,\ T = k_i\} \ge \frac{C R(n_i)}{2}\bigg(1 - \frac{2}{n_i}\bigg)^{n_i} \ge \frac{C R(n_i)}{18}
\]
for $i \ge 3$. By Fatou's lemma,
\[
\mathbf E\Big[\limsup_{i\to\infty} \frac{1}{R(n_i)}\, \mathbf P\{\hat h_{n_i}(X) \ne Y,\ T = k_i \mid \mathbf y\}\Big]
\ge \limsup_{i\to\infty} \frac{1}{R(n_i)}\, \mathbf P\{\hat h_{n_i}(X) \ne Y,\ T = k_i\} \ge \frac{C}{18};
\]
Fatou's lemma applies as $\frac{1}{R(n_i)} \mathbf P\{\hat h_{n_i}(X) \ne Y,\ T = k_i \mid \mathbf y\} \le \frac{1}{R(n_i)} \mathbf P\{T = k_i\} = C$ a.s. Because
\[
\mathbf P\{\hat h_{n_i}(X) \ne Y,\ T = k_i \mid \mathbf y\} \le \mathbf P\{\hat h_{n_i}(X) \ne Y \mid \mathbf y\} = \mathbf E[\mathrm{er}_{\mathbf y}(\hat h_{n_i}) \mid \mathbf y]\quad\text{a.s.},
\]
there must exist a realization of $\mathbf y$ such that $\mathbf E[\mathrm{er}_{\mathbf y}(\hat h_n) \mid \mathbf y] > \frac{C R(n)}{20} \ge \frac{R(n)}{40}$ infinitely often. Choosing $P = P_{\mathbf y}$ for this realization of $\mathbf y$ concludes the proof.

Appendix A. Mathematical background
A.1 Gale-Stewart games
The aim of this section is to recall some basic notions from the classical theory of infinite games. Fix sets $\mathcal X_t, \mathcal Y_t$ for $t \ge 1$. We consider infinite games between two players: in each round $t \ge 1$, player $P_A$ selects an element $x_t \in \mathcal X_t$, and then player $P_L$ selects an element $y_t \in \mathcal Y_t$. The rules of the game are determined by specifying a set $W \subseteq \prod_{t \ge 1} (\mathcal X_t \times \mathcal Y_t)$ of winning sequences for $P_L$. That is, after an infinite sequence of consecutive plays $x_1, y_1, x_2, y_2, \dots$, we say that $P_L$ wins if $(x_1, y_1, x_2, y_2, \dots) \in W$; otherwise, $P_A$ is declared the winner of the game.

A strategy is a rule used by a given player to determine the next move given the current position of the game. A strategy for $P_A$ is a sequence of functions $f_t : \prod_{s < t} (\mathcal X_s \times \mathcal Y_s) \to \mathcal X_t$ for $t \ge 1$, so that $P_A$ plays $x_t = f_t(x_1, y_1, \dots, x_{t-1}, y_{t-1})$ in round $t$. Similarly, a strategy for $P_L$ is a sequence of functions $g_t : \prod_{s < t} (\mathcal X_s \times \mathcal Y_s) \times \mathcal X_t \to \mathcal Y_t$ for $t \ge 1$, so that $P_L$ plays $y_t = g_t(x_1, y_1, \dots, x_{t-1}, y_{t-1}, x_t)$ in round $t$. A strategy for $P_A$ is called winning if playing that strategy always makes $P_A$ win the game regardless of what $P_L$ plays; a winning strategy for $P_L$ is defined analogously.

At the present level of generality, it is far from clear whether winning strategies even exist. We introduce an additional assumption in order to be able to develop a meaningful theory. The simplest such assumption was introduced in the classic work of Gale and Stewart (Gale and Stewart, 1953): $W$ is called finitely decidable if for every $(x_1, y_1, x_2, y_2, \dots) \in W$, there exists $n < \infty$ so that
\[
(x_1, y_1, \dots, x_n, y_n, x'_{n+1}, y'_{n+1}, x'_{n+2}, y'_{n+2}, \dots) \in W
\]
for all choices of $x'_{n+1}, y'_{n+1}, x'_{n+2}, y'_{n+2}, \dots$ In other words, that $W$ is finitely decidable means that if $P_L$ wins, then she knows that she won after playing a finite number of rounds. Conversely, in this case $P_A$ wins the game precisely when $P_L$ does not win after any finite number of rounds.

An infinite game whose set $W$ is finitely decidable is called a Gale-Stewart game. The fundamental theorem on Gale-Stewart games is the following.

Theorem A.1.
In a Gale-Stewart game, either $P_A$ or $P_L$ has a winning strategy.

The classical proof of this result is short and intuitive, cf. (Gale and Stewart, 1953) or (Kechris, 1995, Theorem 20.1). For a more constructive approach, see (Hodges, 1993, Corollary 3.4.3).

Remark A.2. If one endows $\mathcal X_t$ and $\mathcal Y_t$ with the discrete topology, then $W$ is finitely decidable if and only if it is an open set for the associated product topology. For this reason, the defining condition of a Gale-Stewart game is usually expressed by saying that the set of winning sequences is open. This terminology is particularly confusing in the setting of this paper, because we endow $\mathcal X_t$ and $\mathcal Y_t$ with a different topology. In order to avoid confusion, we have therefore opted to resort to the nonstandard terminology "finitely decidable".

Remark A.3. In the literature it is sometimes assumed that $\mathcal X_t = \mathcal Y_t = \mathcal X$ for all $t$. However, the more general setting of this section is already contained in this special case. Indeed, given sets $\mathcal X_t, \mathcal Y_t$ for every $t$, let $\mathcal X = \bigsqcup_t (\mathcal X_t \cup \mathcal Y_t)$ be their disjoint union. We may now augment the set $W$ of winning sequences for $P_L$ so that the first player who makes an inadmissible play (that is, $x_t \notin \mathcal X_t$ or $y_t \notin \mathcal Y_t$) loses instantly. This ensures that a winning strategy for either player will only make admissible plays, thus reducing the general case to the special case. Despite this equivalence, we have chosen the more general formulation as this is most natural in applications.

Remark A.4. Even though we have defined a strategy for $P_A$ as a sequence of functions $x_t = f_t(x_1, y_1, \dots, x_{t-1}, y_{t-1})$ of the full game position, it is implicit in this notation that $x_1, \dots, x_{t-1}$ are also played according to the previous rounds of the same strategy ($x_{t-1} = f_{t-1}(x_1, y_1, \dots, x_{t-2}, y_{t-2})$, etc.). Thus we can equivalently view a strategy for $P_A$ as a sequence of functions $x_t = f_t(y_1, \dots, y_{t-1})$ that depend only on the previous plays of $P_L$.
Similarly, a strategy for $P_L$ can be equivalently described by a sequence of functions $y_t = g_t(x_1, \dots, x_t)$.

A.2 Ordinals

The aim of this section is to briefly recall the notion of ordinals, which play an important role in our theory. An excellent introduction to this topic may be found in (Hrbacek and Jech, 1999, Chapter 6), while the classical reference is (Sierpiński, 1965).

A well-ordering of a set $S$ is a linear ordering $<$ with the property that every nonempty subset of $S$ contains a least element. For example, if we consider subsets of $\mathbb R$ with the usual ordering of the reals, then $\{1, \dots, n\}$ and $\mathbb N$ are well-ordered but $\mathbb Z$ and $[0,1]$ are not. We could however choose nonstandard orderings on $\mathbb Z$ and $[0,1]$ so they become well-ordered; in fact, it is a classical consequence of the axiom of choice that any set may be well-ordered.

Two well-ordered sets are said to be isomorphic if there is an order-preserving bijection between them. There is a canonical way to construct a class of well-ordered sets, called ordinals, such that any well-ordered set is isomorphic to exactly one ordinal. Ordinals uniquely encode well-ordered sets up to isomorphism, in the same way that cardinals uniquely encode sets up to bijection. The class of all ordinals is denoted ORD. The specific construction of ordinals is not important for our purposes, and we therefore discuss ordinals somewhat informally. We refer to (Hrbacek and Jech, 1999, Chapter 6) or (Sierpiński, 1965) for a careful treatment.

It is a basic fact that any pair of well-ordered sets is either isomorphic, or one is isomorphic to an initial segment of the other. This induces a natural ordering on ordinals. For $\alpha, \beta \in \mathrm{ORD}$, we write $\alpha < \beta$ if $\alpha$ is isomorphic to an initial segment of $\beta$. The defining property of ordinals is that any ordinal $\beta$ is isomorphic to the set of ordinals $\{\alpha : \alpha < \beta\}$ that precede it.
In particular, $<$ is itself a well-ordering; namely, every nonempty set of ordinals contains a least element, and every nonempty set $S$ of ordinals has a least upper bound, denoted $\sup S$.

Ordinals form a natural set-theoretic extension of the natural numbers. By definition, every ordinal $\beta$ has a successor ordinal $\beta + 1$, which is the smallest ordinal that is larger than $\beta$. We can therefore count ordinals one by one. The smallest ordinals are the finite ordinals $0, 1, 2, 3, 4, \dots$; we naturally identify each number $k$ with the well-ordered set $\{0, \dots, k-1\}$. The smallest infinite ordinal is denoted $\omega$; it may simply be identified with the family of all natural numbers with its usual ordering. With ordinals, however, we can keep counting past infinity: one counts $0, 1, 2, \dots, \omega, \omega+1, \omega+2, \dots, \omega+\omega, \omega+\omega+1, \dots$ and so on. The smallest uncountable ordinal is denoted $\omega_1$.

An important concept defined by ordinals is the principle of transfinite recursion. Informally, it states that if we have a recipe that, given sets of "objects" $O_\alpha$ indexed by all ordinals $\alpha < \beta$, defines a new set of "objects" $O_\beta$, and we are given a base set $\{O_\alpha : \alpha < \alpha_0\}$, then $O_\beta$ is uniquely defined for all $\beta \in \mathrm{ORD}$. As a simple example, let us define the meaning of addition of ordinals $\gamma + \beta$. For the base case, we define $\gamma + 0 = \gamma$ and $\gamma + 1$ to be the successor of $\gamma$. Subsequently, for any $\beta$, we define $\gamma + \beta = \sup\{(\gamma + \alpha) + 1 : \alpha < \beta\}$. Then the principle of transfinite recursion ensures that $\gamma + \beta$ is uniquely defined for all ordinals $\beta$. One can analogously develop a full ordinal arithmetic that defines addition, multiplication, exponentiation, etc. of ordinals just as for natural numbers (Hrbacek and Jech, 1999, Section 6.5).

A.3 Well-founded relations and ranks

In this section we extend the notion of a well-ordering to more general types of orders, and introduce the fundamental notion of rank.
Our reference here is (Kechris, 1995, Appendix B).

A relation $\prec$ on a set $S$ is defined by an arbitrary subset $R_\prec \subseteq S \times S$: we write $x \prec y$ if and only if $(x,y) \in R_\prec$. An element $x$ of $(S, \prec)$ is called minimal if there does not exist $y \prec x$. The relation is called well-founded if every nonempty subset of $S$ has a minimal element. Thus a linear ordering is well-founded precisely when it is a well-ordering; but the notion of well-foundedness extends to any relation.

To any well-founded relation $\prec$ on $S$ we will associate a function $\rho_\prec : S \to \mathrm{ORD}$, called the rank function of $\prec$, that is defined by transfinite recursion. We say that $\rho_\prec(x) = 0$ if and only if $x$ is minimal in $S$, and define for all other $x$
\[
\rho_\prec(x) = \sup\{\rho_\prec(y) + 1 : y \prec x\}.
\]
The rank $\rho_\prec(x)$ quantifies how far $x$ is from being minimal.

Remark A.5. Observe that every element $x \in S$ indeed has a well-defined rank (that is, it appears at some stage in the transfinite recursion). Indeed, the transfinite recursion recipe defines $\rho_\prec(x)$ as soon as $\rho_\prec(y)$ has been defined for all $y \prec x$. If $\rho_\prec(x)$ is undefined, then there must exist $x_1 \prec x$ so that $\rho_\prec(x_1)$ is undefined. Repeating this process constructs an infinite decreasing chain of elements $x_i \in S$. But this contradicts the assumption that $\prec$ is well-founded, as an infinite decreasing chain cannot contain a minimal element.

Let $(S, \prec)$ and $(S', \prec')$ be sets endowed with relations. A map $f : S \to S'$ is called order-preserving if $x \prec y$ implies $f(x) \prec' f(y)$. It is a basic fact that ranks are monotone under order-preserving maps: if $\prec'$ is well-founded and $f : S \to S'$ is order-preserving, then $\prec$ is well-founded and $\rho_\prec(x) \le \rho_{\prec'}(f(x))$ for all $x \in S$ (this follows readily by induction on the value of $\rho_\prec(x)$).

Like ordinals, the rank of a well-founded relation is an intuitive object once one understands its meaning. This is best illustrated by some simple examples.
As explained in Remark A.5, a well-founded relation does not admit an infinite decreasing chain $x \succ x_1 \succ x_2 \succ \cdots$, but it might admit finite decreasing chains of arbitrary length. As the following examples illustrate, the rank $\rho_\prec(x)$ quantifies how long we can keep growing a decreasing chain starting from $x$.

Example A.6. Suppose that $\rho_\prec(x) = k$ for some finite ordinal $0 < k < \omega$. By the definition of rank, $\rho_\prec(y) < k$ for all $y \prec x$, while there exists $x_1 \prec x$ such that $\rho_\prec(x_1) = k - 1$. It follows readily that $\rho_\prec(x) = k$ if and only if the longest decreasing chain that can be grown starting from $x$ has length $k + 1$.

Example A.7. Suppose that $\rho_\prec(x) = \omega$. By the definition of rank, $\rho_\prec(y) < \omega$ is an arbitrarily large finite ordinal for $y \prec x$. We can grow an arbitrarily long decreasing chain starting from $x$, but once we select its first element $x_1 \prec x$, we can grow at most finitely many elements as in the previous example. In other words, the maximal length of the chain is decided by the choice of its first element $x_1$.

Example A.8. Suppose that $\rho_\prec(x) = \omega + k$ for some $k < \omega$. Then we can choose $x \succ x_1 \succ \cdots \succ x_k$ so that $\rho_\prec(x_k) = \omega$. We can still grow arbitrarily long decreasing chains after selecting the first $k$ elements judiciously, but the length of the chain is decided at the latest after we have selected $x_{k+1}$.

Example A.9. Suppose that $\rho_\prec(x) = \omega + \omega$. Then in the first step, we can choose for any $k < \omega$ an element $x_1 \prec x$ so that $\rho_\prec(x_1) = \omega + k$. From that point onward, we proceed as in the previous example. The maximal length of a decreasing chain starting from $x$ is determined by two decisions: the choice of $x_1$ decides a number $k$, so that the maximal length of the chain is decided at the latest after we have selected $x_{k+2}$.

These examples can be further extended.
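Before extending the examples, note that in the finite case the transfinite recursion defining $\rho_\prec$ terminates at finite ordinals, so the preceding examples can be checked mechanically. A minimal sketch (our own illustration; `rank_function` is a hypothetical name, and `rel(y, x)` encodes $y \prec x$):

```python
from functools import lru_cache

def rank_function(elements, rel):
    """Rank of each element of a finite well-founded relation:
    rho(x) = 0 for minimal x, else max{rho(y) + 1 : y < x}.
    (In the finite case, sup over a nonempty finite set is max.)"""
    elements = tuple(elements)

    @lru_cache(maxsize=None)
    def rho(x):
        below = [y for y in elements if rel(y, x)]
        return 0 if not below else max(rho(y) + 1 for y in below)

    return {x: rho(x) for x in elements}
```

On $\{0,\dots,4\}$ with the usual strict order, element $k$ receives rank $k$, matching Example A.6: the longest decreasing chain grown from $k$ has length $k+1$.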
For example, $\rho_\prec(x) = \omega \cdot k + k'$ means that after $k'$ initial steps we can make a sequence of $k$ decisions, each decision being how many steps we can grow the chain before the next decision must be made. Similarly, $\rho_\prec(x) = \omega^2$ means we can decide on arbitrarily large numbers $k, k' < \omega$ in the first step, and then proceed as for $\omega \cdot k + k'$; etc.

A.4 Polish spaces and analytic sets

We finally review the basic notions of measures and probabilities on Polish spaces. We refer to (Cohn, 1980, Chapter 8) for a self-contained introduction, and to (Kechris, 1995) for a comprehensive treatment.

A Polish space is a separable topological space that can be metrized by a complete metric. Many spaces encountered in practice are Polish, including $\mathbb R^n$, any compact metric space, any separable Banach space, etc. Moreover, any finite or countable product or disjoint union of Polish spaces is again Polish.

Let $\mathcal X, \mathcal Y$ be Polish spaces, and let $f : \mathcal X \to \mathcal Y$ be a continuous function. It is shown in any introductory text on probability that $f$ is Borel measurable, that is, $f^{-1}(B)$ is a Borel subset of $\mathcal X$ for any Borel subset $B$ of $\mathcal Y$. However, the forward image $f(\mathcal X)$ is not necessarily Borel measurable in $\mathcal Y$. A subset $B \subseteq \mathcal Y$ of a Polish space is called analytic if it is the image of some Polish space under a continuous map. It turns out that every Borel set is analytic, but not every analytic set is Borel. The family of analytic sets is closed under countable unions and intersections, but not under complements. The complement of an analytic set is called coanalytic. A set is Borel if and only if it is both analytic and coanalytic.

Although analytic sets may not be Borel measurable, such sets are just as good as Borel sets for the purposes of probability theory. Let $\mathcal F$ be the Borel $\sigma$-field on a Polish space $\mathcal X$.
For any probability measure $\mu$, denote by $\mathcal F_\mu$ the completion of $\mathcal F$ with respect to $\mu$, that is, the collection of all subsets of $\mathcal X$ that differ from a Borel set at most on a set of zero probability. A set $B \subseteq \mathcal X$ is called universally measurable if $B \in \mathcal F_\mu$ for every probability measure $\mu$. Similarly, a function $f : \mathcal X \to \mathcal Y$ is called universally measurable if $f^{-1}(B)$ is universally measurable for any universally measurable set $B$. It is clear from these definitions that universally measurable sets and functions on Polish spaces are indistinguishable from Borel sets from a probabilistic perspective.

The following fundamental fact is known as the capacitability theorem.

Theorem A.10. Every analytic (or coanalytic) set is universally measurable.

The importance of analytic sets in probability theory stems from the fact that they make it possible to establish measurability of certain uncountable unions of measurable sets. Indeed, let $\mathcal X$ and $\mathcal Y$ be Polish spaces, and let $A \subseteq \mathcal X \times \mathcal Y$ be an analytic set. The set
\[
B := \bigcup_{y \in \mathcal Y} \{x \in \mathcal X : (x,y) \in A\}
\]
can be written as $B = f(A)$ for the continuous function $f(x,y) := x$. The set $B \subseteq \mathcal X$ is therefore also analytic, and hence universally measurable.

We conclude this section by stating a deep fact about well-founded relations on Polish spaces. Let $\mathcal X$ be a Polish space and let $\prec$ be a well-founded relation on $\mathcal X$. The relation $\prec$ is called analytic if $R_\prec \subseteq \mathcal X \times \mathcal X$ is an analytic set.

8. Our discussion of the intuitive meaning of the rank of a well-founded relation is based on the lively discussion in (Evans and Hamkins, 2014) of game values in infinite chess.

Theorem A.11. Let $\prec$ be an analytic well-founded relation on a Polish space $\mathcal X$. Its rank function satisfies $\sup_{x \in \mathcal X} \rho_\prec(x) < \omega_1$.

This result is known as the Kunen-Martin theorem; see (Kechris, 1995, Theorem 31.1) or (Dellacherie, 1977) for a self-contained proof and historical comments.

Appendix B.
Measurability of Gale-Stewart strategies The fundamental theorem of Gale-Stewart games, Theorem A.1, states that either player P A or P L must have a winning strategy in an infinite game when the set of winning sequences W for P L isfinitely decidable. This existential result provides no information, however, about the complexityof the winning strategies. In particular, it is completely unclear whether winning strategies can bechosen to be measurable. As we use winning strategies to design algorithms that operate on randomdata, non-measurable strategies are may be potentially a serious problem for our purposes. Indeed,lack of measurability can render probabilistic reasoning completely meaningless (cf. Appendix C).Almost nothing appears to be known in the literature regarding the measurability of Gale-Stewartstrategies. The aim of this appendix is to prove a general measurability theorem that captures allthe games that appear in this paper. We adopt the general setting and notations of Appendix A.1. Theorem B.1. Let {X t } t ≥ be Polish spaces and {Y t } t ≥ be countable sets. Consider a Gale-Stewart game whose set W ⊆ Q t ≥ ( X t × Y t ) of winning sequences for P L is finitely decidable andcoanalytic. Then there is a universally measurable winning strategy. A characteristic feature of the games in this paper is the asymmetry between P A and P L .Player P A plays elements of an arbitrary Polish space, while P L can only play elements of a count-able set. Any strategy for P A is automatically measurable, as it may be viewed as a function of theprevious plays of P L only (cf. Remark A.4). The nontrivial content of Theorem B.1 is that if P L hasa winning strategy, such a strategy may be chosen to be universally measurable.To prove Theorem B.1, we construct an explicit winning strategy of the following form. Toevery sequence of plays x , y , . . . 
, x t , y t for which P L has not yet won, we associate an ordinal valuewith the following property: regardless of the next play x t +1 of P A , there exists y t +1 that decreasesthe value. Because there are no infinite decreasing chains of ordinals, P L eventually wins with thisstrategy. To show that this strategy is measurable, we use the coanalyticity assumption of TheoremB.1 in two different ways. On the one hand, we show that the set of game positions of countablevalue is measurable. On the other hand, the Kunen-Martin theorem implies that only countablevalues can appear. Remark B.2. The construction of winning strategies for Gale-Stewart games using game values isnot new; cf. (Hodges, 1993, Section 3.4) or (Evans and Hamkins, 2014). We, however, define thegame value in a different manner than is customary in the literature. While the proof ultimatelyshows that the two definitions are essentially equivalent, our definition enables us to directly applythe Kunen-Martin theorem, and is conceptually much closer to the classical Littlestone dimensionof concept classes (cf. Section 3.4). B.1 Preliminaries In the remainder of this appendix we assume that the assumptions of Theorem B.1 are in force, andthat P L has a winning strategy.Let us begin by introducing some basic notions. A position of the game is a finite sequence ofplays x , y , . . . , x n , y n for some 0 ≤ n < ∞ (the empty sequence ∅ denotes the initial position ofthe game). We denote the set of positions of length n by P n := n Y t =1 ( X t × Y t ) , where P := { ∅ } ), and by P := S ≤ n< ∞ P n the set of all positions. Note that, by our assump-tions, P n and P are Polish spaces.An active position is a sequence of plays x , y , . . . , x n , y n after which P L has not yet won.Namely, there exist x n +1 , y n +1 , x n +2 , y n +2 , . . . so that ( x , y , x , y , . . . ) W . The set of activepositions of length n can be written as A n := [ w ∈ Q ∞ t = n +1 ( X t ×Y t ) { v ∈ P n : ( v , w ) ∈ W c } . 
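Before proceeding, it may help to see these objects in a finite toy setting (our own illustration, not from the paper, with hypothetical names). For a finite concept class, a position of the corresponding Littlestone game is active as long as some concept is consistent with every play made so far, and the largest number of rounds for which $P_A$ can keep the game active is the classical Littlestone dimension. A minimal sketch tracking the version space of still-consistent concepts:

```python
from functools import lru_cache

# Toy domain {0,1,2} and the four threshold concepts h_t(x) = 1 iff x <= t,
# for t in {-1, 0, 1, 2}, each encoded as its tuple of labels on the domain.
X = (0, 1, 2)
H = tuple(tuple(1 if x <= t else 0 for x in X) for t in (-1, 0, 1, 2))

@lru_cache(maxsize=None)
def value(V):
    # Finite analogue of the game value val(v): a position is encoded by its
    # version space V (the concepts consistent with all plays so far), it is
    # active iff V is nonempty, and its value is the largest depth of an
    # active tree, computed by a max-min recursion (cf. Lemma B.10).
    if not V:           # inactive position: P_L has already won
        return -1
    best = 0            # V nonempty: the depth-0 tree is active
    for x in X:
        # P_L answers y in {0,1}; the version space shrinks to V_{x,y}
        V0 = tuple(h for h in V if h[x] == 0)
        V1 = tuple(h for h in V if h[x] == 1)
        if V0 and V1:   # x helps P_A only if both replies stay active
            best = max(best, 1 + min(value(V0), value(V1)))
    return best

# value(H) is the classical Littlestone dimension of H: a class of 4
# thresholds has Littlestone dimension floor(log2(4)) = 2.
print(value(H))  # -> 2
```

For infinite classes this max-min recursion need not terminate, which is precisely why the value is instead defined below through the ordinal rank of a well-founded relation on active trees.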
Because $W$ is coanalytic, $A_n$ is an analytic subset of $P_n$. We denote by $A := \bigcup_{0 \le n < \infty} A_n$ the set of all active positions.

Remark B.3. The notion of active positions is fundamental to the definition of Gale-Stewart games. The fact that $W$ is finitely decidable is nothing other than the property $W = \{(x_1, y_1, x_2, y_2, \ldots) : (x_1, y_1, \ldots, x_n, y_n) \notin A_n \text{ for some } 0 \le n < \infty\}$.

We now introduce the fundamental notion of active trees. By assumption, there is no winning strategy for $P_A$. That is, there is no strategy for $P_A$ that ensures the game remains active forever. However, given any finite number $n < \infty$, there could exist strategies for $P_A$ that force the game to remain active for at least $n$ rounds regardless of what $P_L$ plays. Such a strategy is naturally defined by specifying a decision tree of depth $n$, that is, a rooted tree such that each vertex at depth $t$ is labelled by a point in $\mathcal{X}_t$, and the edges to its children are labelled by $\mathcal{Y}_t$. Such a tree can be described by specifying a set of points $\{x_{\mathbf{y}} \in \mathcal{X}_{t+1} : \mathbf{y} \in \prod_{s=1}^{t} \mathcal{Y}_s,\ 0 \le t < n\}$. This tree keeps the game active for $n$ rounds as long as $(x_\emptyset, y_1, x_{y_1}, y_2, \ldots, x_{y_1, \ldots, y_{n-1}}, y_n) \in A_n$ for all possible plays $y_1, \ldots, y_n$ of $P_L$. This notion is precisely the analogue of a Littlestone tree (Definition 1.7) in the context of Gale-Stewart games.

We need to consider strategies that keep the game active for a finite number of rounds starting from an arbitrary position (in the above discussion we assumed the starting position $\emptyset$).

Definition B.4. Given a position $v \in P_k$ of length $k$:

• A decision tree of depth $n$ with starting position $v$ is a collection of points $\mathbf{t} = \{x_{\mathbf{y}} \in \mathcal{X}_{k+t+1} : \mathbf{y} \in \prod_{s=k+1}^{k+t} \mathcal{Y}_s,\ 0 \le t < n\}$. By convention, we call $\mathbf{t} = \emptyset$ a decision tree of depth 0.

• $\mathbf{t}$ is called active if $(v, x_\emptyset, y_{k+1}, x_{y_{k+1}}, y_{k+2}, \ldots, x_{y_{k+1}, \ldots, y_{k+n-1}}, y_{k+n}) \in A_{k+n}$ for all choices of $(y_{k+1}, \ldots, y_{k+n}) \in \prod_{t=k+1}^{k+n} \mathcal{Y}_t$.
• We denote by $\mathbf{T}_v$ the set of all decision trees with starting position $v$ (and any depth $0 \le n < \infty$), and by $\mathbf{T}^A_v \subseteq \mathbf{T}_v$ the set of all active trees.

As the sets $\mathcal{Y}_t$ are assumed to be countable, any decision tree is described by a countable collection of points. Thus $\mathbf{T}_v$ is a Polish space (it is a countable disjoint union of countable products of the Polish spaces $\mathcal{X}_t$). Moreover, as $A_{k+n}$ is analytic, it follows readily that $\mathbf{T}^A_v$ is analytic (it is a countable disjoint union of countable intersections of analytic sets). The key reason why Theorem B.1 is restricted to the setting where each $\mathcal{Y}_t$ is countable is to ensure these properties hold.

B.2 Game values

We now assign to every position $v \in P$ a value $\mathrm{val}(v)$. Intuitively, the value measures how long we can keep growing an active tree starting from $v$. It will be convenient to adjoin to the ordinals two elements $-1$ and $\Omega$ that are smaller and larger than every ordinal, respectively. We write $\mathrm{ORD}^* := \mathrm{ORD} \cup \{-1, \Omega\}$, and proceed to define the value function $\mathrm{val} : P \to \mathrm{ORD}^*$.

By definition, $\mathbf{T}^A_v$ is empty if and only if the position $v \notin A$ is inactive, that is, if $P_L$ has already won. In this case, we define $\mathrm{val}(v) = -1$. It remains to define the value when $v \in A$ is active.

The definition of value uses a relation $\prec_v$ on $\mathbf{T}^A_v$. In this relation, $\mathbf{t}' \prec_v \mathbf{t}$ if and only if the tree $\mathbf{t}$ is obtained from $\mathbf{t}'$ by removing its leaves (in particular, $\mathrm{depth}(\mathbf{t}') = \mathrm{depth}(\mathbf{t}) + 1$). Let us make two basic observations about this relation:

• An infinite decreasing chain in $(\mathbf{T}^A_v, \prec_v)$ corresponds to an infinite active tree, that is, a winning strategy for $P_A$ starting from $v$. In other words, $\prec_v$ is well-founded if and only if $P_A$ has no winning strategy starting from the position $v$.

• $(\mathbf{T}^A_v, \prec_v)$ has the tree $\emptyset$ of depth 0 as its unique maximal element. Indeed, any active tree remains active if its leaves are removed, so there is an increasing chain from any active tree to $\emptyset$.

The definition of value uses the notion of rank from Section A.3.

Definition B.5. The game value $\mathrm{val} : P \to \mathrm{ORD}^*$ is defined as follows.

• $\mathrm{val}(v) = -1$ if $v \notin A$.
• $\mathrm{val}(v) = \Omega$ if $v \in A$ and $\prec_v$ is not well-founded.
• $\mathrm{val}(v) = \rho_{\prec_v}(\emptyset)$ if $v \in A$ and $\prec_v$ is well-founded.

In words, $\mathrm{val}(v) = -1$ means $P_L$ has already won; $\mathrm{val}(v) = \Omega$ means $P_L$ can no longer win; and otherwise $\mathrm{val}(v)$ is the maximal rank of an active tree in $(\mathbf{T}^A_v, \prec_v)$, which quantifies how long $P_A$ can postpone $P_L$ winning the game (cf. Section A.3).

For future reference, we record some elementary properties of the rank $\rho_{\prec_v}$.

Lemma B.6. Fix $v \in P$ such that $0 \le \mathrm{val}(v) < \Omega$.
(a) $\mathbf{t}' \prec_v \mathbf{t}$ implies $\rho_{\prec_v}(\mathbf{t}') < \rho_{\prec_v}(\mathbf{t})$ for any $\mathbf{t}, \mathbf{t}' \in \mathbf{T}^A_v$.
(b) For any $\mathbf{t}' \in \mathbf{T}^A_v$, $\mathbf{t}' \ne \emptyset$, there is a unique $\mathbf{t} \in \mathbf{T}^A_v$ such that $\mathbf{t}' \prec_v \mathbf{t}$.
(c) For any $\mathbf{t} \in \mathbf{T}^A_v$ and $\kappa < \rho_{\prec_v}(\mathbf{t})$, there exists $\mathbf{t}' \prec_v \mathbf{t}$ so that $\kappa \le \rho_{\prec_v}(\mathbf{t}')$.

Proof. For (a), it suffices to note that $\rho_{\prec_v}(\mathbf{t}') + 1 \le \rho_{\prec_v}(\mathbf{t})$ for any $\mathbf{t}' \prec_v \mathbf{t}$ by the definition of rank. For (b), note that $\mathbf{t}$ is obtained from $\mathbf{t}'$ by removing its leaves. For (c), argue by contradiction: if $\rho_{\prec_v}(\mathbf{t}') < \kappa$ for all $\mathbf{t}' \prec_v \mathbf{t}$, then the definition of rank yields $\rho_{\prec_v}(\mathbf{t}) \le \kappa$, contradicting the assumption $\kappa < \rho_{\prec_v}(\mathbf{t})$.

In the absence of regularity assumptions, game values could be arbitrarily large ordinals (see Appendix C). Remarkably, however, this is not the case in our setting. The assumption that $W$ is coanalytic implies that only countable game values may appear. This fact plays a crucial role in the proof of Theorem B.1.

Lemma B.7. For any $v \in P$, either $\mathrm{val}(v) = \Omega$ or $\mathrm{val}(v) < \omega_1$.

Proof. We may assume without loss of generality that $0 \le \mathrm{val}(v) < \Omega$. There is also no loss in extending the relation $\prec_v$ to $\mathbf{T}_v$ as follows: $\mathbf{t}' \prec_v \mathbf{t}$ is defined as above whenever $\mathbf{t}, \mathbf{t}' \in \mathbf{T}^A_v$, while $\mathbf{t} \notin \mathbf{T}^A_v$ has no relation to any element of $\mathbf{T}_v$.
Then every $\mathbf{t} \notin \mathbf{T}^A_v$ is minimal, while the rank of $\mathbf{t} \in \mathbf{T}^A_v$ is unchanged. With this extension, the relation $\prec_v$ on $\mathbf{T}_v$ is defined by
$$R_{\prec_v} = \{(\mathbf{t}', \mathbf{t}) \in \mathbf{T}_v \times \mathbf{T}_v : \mathbf{t}' \prec_v \mathbf{t},\ \mathbf{t}' \in \mathbf{T}^A_v\};$$
here $\mathbf{t}$ is uniquely obtained from $\mathbf{t}' \in \mathbf{T}^A_v$ by removing its leaves. Because $\mathbf{T}^A_v$ is analytic, it follows that $\prec_v$ is a well-founded analytic relation on the Polish space $\mathbf{T}_v$. The conclusion follows from Theorem A.11.

B.3 A winning strategy

Our aim now is to show that the game values give rise to a winning strategy for $P_L$. The key observation is the following.

Proposition B.8. Fix $0 \le n < \infty$ and $v \in P_n$ such that $0 \le \mathrm{val}(v) < \Omega$. For every $x \in \mathcal{X}_{n+1}$, there exists $y \in \mathcal{Y}_{n+1}$ such that $\mathrm{val}(v, x, y) < \mathrm{val}(v)$.

Before we prove this result, let us first explain the intuition in the particularly simple case that $\mathrm{val}(v) = m < \omega$ is finite. By the definition of value, the maximal depth of an active tree in $\mathbf{T}^A_v$ is $m$ (cf. Example A.6). Now suppose, for sake of contradiction, that there exists $x$ such that $\mathrm{val}(v, x, y) \ge m$ for every $y$. That is, there exists an active tree $\mathbf{t}_y \in \mathbf{T}^A_{v,x,y}$ of depth $m$ for every $y$. Then we can construct an active tree in $\mathbf{T}^A_v$ of depth $m + 1$ by taking $x$ as the root and attaching each $\mathbf{t}_y$ as the subtree of the corresponding child. But this is impossible, as we assumed that the maximal depth of an active tree in $\mathbf{T}^A_v$ is $m$.

We use the same idea of "gluing together trees $\mathbf{t}_y$" in the case that $\mathrm{val}(v)$ is an infinite ordinal, but its implementation in this case is more subtle. The key to the proof is the following lemma.

Lemma B.9. Fix $0 \le n < \infty$, $v \in P_n$, $x \in \mathcal{X}_{n+1}$, and $y, y' \in \mathcal{Y}_{n+1}$ such that $\mathrm{val}(v, x, y) \le \mathrm{val}(v, x, y')$. Then there exists a map $f : \mathbf{T}^A_{v,x,y} \to \mathbf{T}^A_{v,x,y'}$ such that:
(a) $\mathrm{depth}(f(\mathbf{t})) = \mathrm{depth}(\mathbf{t})$ for all $\mathbf{t} \in \mathbf{T}^A_{v,x,y}$.
(b) $\mathbf{t}' \prec_{v,x,y} \mathbf{t}$ implies $f(\mathbf{t}') \prec_{v,x,y'} f(\mathbf{t})$ for all $\mathbf{t}, \mathbf{t}' \in \mathbf{T}^A_{v,x,y}$.

Proof. We first dispose of trivial cases.
If $\mathrm{val}(v, x, y) = -1$, then $\mathbf{T}^A_{v,x,y} = \emptyset$ and there is nothing to prove. If $\mathrm{val}(v, x, y') = \Omega$, there is an infinite decreasing chain
$$\emptyset = \mathbf{t}(0) \succ_{v,x,y'} \mathbf{t}(1) \succ_{v,x,y'} \mathbf{t}(2) \succ_{v,x,y'} \mathbf{t}(3) \succ_{v,x,y'} \cdots$$
in $\mathbf{T}^A_{v,x,y'}$. In this case we may define $f(\mathbf{t}) = \mathbf{t}(k)$ whenever $\mathrm{depth}(\mathbf{t}) = k$, and it is readily verified that the desired properties hold. We therefore assume in the remainder of the proof that $0 \le \mathrm{val}(v, x, y) \le \mathrm{val}(v, x, y') < \Omega$.

We now define $f(\mathbf{t})$ by induction on $\mathrm{depth}(\mathbf{t})$. For the induction to go through, we maintain the following invariants:

• $\mathrm{depth}(f(\mathbf{t})) = \mathrm{depth}(\mathbf{t})$.
• $\rho_{\prec_{v,x,y}}(\mathbf{t}) \le \rho_{\prec_{v,x,y'}}(f(\mathbf{t}))$.

For the base, let $f(\emptyset) = \emptyset$. Because $\mathrm{val}(v, x, y) \le \mathrm{val}(v, x, y')$, we have $\rho_{\prec_{v,x,y}}(\emptyset) \le \rho_{\prec_{v,x,y'}}(f(\emptyset))$. For the step, suppose that $f(\mathbf{t})$ has been defined for all $\mathbf{t} \in \mathbf{T}^A_{v,x,y}$ with $\mathrm{depth}(\mathbf{t}) = k - 1$. Now consider $\mathbf{t}' \in \mathbf{T}^A_{v,x,y}$ with $\mathrm{depth}(\mathbf{t}') = k$, and let $\mathbf{t} \succ_{v,x,y} \mathbf{t}'$ be the tree obtained by removing its leaves. Then we have
$$\rho_{\prec_{v,x,y}}(\mathbf{t}') < \rho_{\prec_{v,x,y}}(\mathbf{t}) \le \rho_{\prec_{v,x,y'}}(f(\mathbf{t}))$$
by Lemma B.6(a) and the induction hypothesis. Therefore, by Lemma B.6(c), we may choose $f(\mathbf{t}') \prec_{v,x,y'} f(\mathbf{t})$ so that $\rho_{\prec_{v,x,y}}(\mathbf{t}') \le \rho_{\prec_{v,x,y'}}(f(\mathbf{t}'))$. In this manner we have defined $f(\mathbf{t}')$ for each $\mathbf{t}' \in \mathbf{T}^A_{v,x,y}$ with $\mathrm{depth}(\mathbf{t}') = k$. It is readily verified that the desired properties of the map $f$ hold by construction.

We can now complete the proof of Proposition B.8.

Proof of Proposition B.8. Fix $x \in \mathcal{X}_{n+1}$ throughout the proof. If there exists $y \in \mathcal{Y}_{n+1}$ so that $\mathrm{val}(v, x, y) = -1$, the conclusion is trivial. We can therefore assume that $\mathrm{val}(v, x, y) \ge 0$ for every $y$. This implies, in particular, that $\{x\} \in \mathbf{T}^A_v$.

Because any collection of ordinals contains a minimal element, we can choose $y^* \in \mathcal{Y}_{n+1}$ such that $\mathrm{val}(v, x, y^*) \le \mathrm{val}(v, x, y)$ for all $y$. The main part of the proof is to construct an order-preserving map $\iota : \mathbf{T}^A_{v,x,y^*} \to \mathbf{T}^A_v$ such that $\iota(\emptyset) = \{x\}$.
Because $\mathrm{val}(v) < \Omega$, we know that $\prec_v$ is well-founded. It follows by monotonicity of rank under order-preserving maps that $\prec_{v,x,y^*}$ is well-founded and
$$\mathrm{val}(v, x, y^*) = \rho_{\prec_{v,x,y^*}}(\emptyset) \le \rho_{\prec_v}(\{x\}) < \rho_{\prec_v}(\emptyset) = \mathrm{val}(v),$$
concluding the proof of the proposition.

It therefore remains to construct the map $\iota$. To this end, we use Lemma B.9 to construct for every $y$ an order-preserving map $f_y : \mathbf{T}^A_{v,x,y^*} \to \mathbf{T}^A_{v,x,y}$ such that $\mathrm{depth}(f_y(\mathbf{t})) = \mathrm{depth}(\mathbf{t})$. Given any $\mathbf{t} \in \mathbf{T}^A_{v,x,y^*}$, we define a decision tree $\iota(\mathbf{t})$ by taking $x$ as its root and attaching $f_y(\mathbf{t})$ as its subtree of the root-to-child edge labelled by $y$, for every $y \in \mathcal{Y}_{n+1}$. By construction $\iota(\mathbf{t}) \in \mathbf{T}^A_v$ is an active tree, $\iota(\emptyset) = \{x\}$, and $\iota$ is order-preserving as each of the maps $f_y$ is order-preserving.

As we assumed at the outset that $P_L$ has a winning strategy, the initial value of the game is an ordinal $\mathrm{val}(\emptyset) < \Omega$. We can now use Proposition B.8 to describe an explicit winning strategy. In each round in which $P_L$ has not yet won, for each point $x_t$ that is played by $P_A$, Proposition B.8 ensures that $P_L$ can choose $y_t$ so that $\mathrm{val}(x_1, y_1, \ldots, x_t, y_t) < \mathrm{val}(x_1, y_1, \ldots, x_{t-1}, y_{t-1})$. This choice of $y_t$ defines a winning strategy for $P_L$, because the ordinals are well-ordered.

B.4 Measurability

We have constructed value-decreasing winning strategies for $P_L$. To conclude the proof of Theorem B.1, it remains to show that it is possible to construct a universally measurable value-decreasing strategy. The main remaining step is to show that the set of positions with any given game value is measurable.

Lemma B.10. For any $0 \le n < \infty$, $v \in P_n$, and $\kappa \in \mathrm{ORD}$, we have $\mathrm{val}(v) > \kappa$ if and only if there exists $x \in \mathcal{X}_{n+1}$ such that $\mathrm{val}(v, x, y) \ge \kappa$ for all $y \in \mathcal{Y}_{n+1}$.

Proof. Suppose first there exists $x$ such that $\mathrm{val}(v, x, y) \ge \kappa$ for all $y$. If $\mathrm{val}(v) < \Omega$, then it follows immediately from Proposition B.8 that $\mathrm{val}(v) > \kappa$.
On the other hand, if $\mathrm{val}(v) = \Omega$, the conclusion is trivial.

In the opposite direction, let $\mathrm{val}(v) > \kappa$. If $\mathrm{val}(v) = \Omega$, then choosing $x$ to be the root label of an infinite active tree yields $\mathrm{val}(v, x, y) = \Omega \ge \kappa$ for all $y$. On the other hand, if $\mathrm{val}(v) < \Omega$, then we have $\rho_{\prec_v}(\emptyset) = \mathrm{val}(v) > \kappa$. By the definition of rank, there exists $x$ such that $\{x\} \in \mathbf{T}^A_v$ and $\rho_{\prec_v}(\{x\}) + 1 > \kappa$ or, equivalently, $\rho_{\prec_v}(\{x\}) \ge \kappa$. Thus it remains to show that $\rho_{\prec_v}(\{x\}) \le \mathrm{val}(v, x, y)$ for every $y$.

To this end, we follow in essence the reverse of the argument used in the proof of Proposition B.8. Denote by $\mathbf{T}^A_{v,x} \subseteq \mathbf{T}^A_v$ the set of active trees with root $x$, and by $\prec_{v,x}$ the induced relation. The definition of rank implies $\rho_{\prec_v}(\{x\}) = \rho_{\prec_{v,x}}(\{x\})$. On the other hand, for any $\mathbf{t} \in \mathbf{T}^A_{v,x}$, denote by $f_y(\mathbf{t}) \in \mathbf{T}^A_{v,x,y}$ its subtree of the root-to-child edge labelled by $y$. Then $f_y : \mathbf{T}^A_{v,x} \to \mathbf{T}^A_{v,x,y}$ is an order-preserving map such that $f_y(\{x\}) = \emptyset$. Therefore, either $\mathrm{val}(v, x, y) = \Omega$, or
$$\rho_{\prec_v}(\{x\}) = \rho_{\prec_{v,x}}(\{x\}) \le \rho_{\prec_{v,x,y}}(\emptyset) = \mathrm{val}(v, x, y)$$
by monotonicity of rank under order-preserving maps.

Corollary B.11. The set $A^\kappa_n := \{v \in A_n : \mathrm{val}(v) > \kappa\}$ is analytic for every $0 \le n < \infty$ and $-1 \le \kappa < \omega_1$.

Proof. The proof is by induction on $\kappa$. First note that $A^{-1}_n = A_n$ is analytic for every $n$. Now for any $0 \le \kappa < \omega_1$, by Lemma B.10,
$$A^\kappa_n = \bigcup_{x \in \mathcal{X}_{n+1}} \bigcap_{y \in \mathcal{Y}_{n+1}} \bigcap_{-1 \le \lambda < \kappa} \{v \in A_n : \mathrm{val}(v, x, y) > \lambda\} = \bigcup_{x \in \mathcal{X}_{n+1}} \bigcap_{y \in \mathcal{Y}_{n+1}} \bigcap_{-1 \le \lambda < \kappa} \{v \in A_n : (v, x, y) \in A^\lambda_{n+1}\}.$$
As $\kappa < \omega_1$, the intersections in this expression are countable. Therefore, as $A^\lambda_{n+1}$ is analytic for $\lambda < \kappa$ by the induction hypothesis, it follows that $A^\kappa_n$ is analytic.

We can now conclude the proof of Theorem B.1.

Proof of Theorem B.1. We assume that $P_L$ has a winning strategy (otherwise the conclusion is trivial).
For any $0 \le n < \infty$, define
$$D_{n+1} := \{(v, x, y) \in P_{n+1} : \mathrm{val}(v, x, y) < \min\{\mathrm{val}(v), \mathrm{val}(\emptyset)\}\} = \bigcup_{-1 \le \kappa < \mathrm{val}(\emptyset)} \{(v, x, y) \in P_{n+1} : \mathrm{val}(v, x, y) \le \kappa < \mathrm{val}(v)\} = \bigcup_{-1 \le \kappa < \mathrm{val}(\emptyset)} \{(v, x, y) \in P_{n+1} : (v, x, y) \in (A^\kappa_{n+1})^c,\ v \in A^\kappa_n\},$$
where $A^\kappa_n$ is defined in Corollary B.11. As $P_L$ has a winning strategy, Lemma B.7 implies that $\mathrm{val}(\emptyset) < \omega_1$. Thus the union in the definition of $D_{n+1}$ is countable, and it follows from Corollary B.11 that $D_{n+1}$ is universally measurable.

Now define for every $t \ge 1$ the function $g_t : P_{t-1} \times \mathcal{X}_t \to \mathcal{Y}_t$ as follows. As $\mathcal{Y}_t$ is countable, we may enumerate it as $\mathcal{Y}_t = \{y^1, y^2, y^3, \ldots\}$. Set
$$g_t(v, x) := \begin{cases} y^i & \text{if } (v, x, y^j) \notin D_t \text{ for } j < i \text{ and } (v, x, y^i) \in D_t, \\ y^1 & \text{if } (v, x, y^j) \notin D_t \text{ for all } j. \end{cases}$$
In words, $g_t(v, x) = y^i$ for the first index $i$ such that $(v, x, y^i) \in D_t$, and we set it arbitrarily to $y^1$ if $(v, x, y^j) \notin D_t$ for all $j$. This defines a universally measurable strategy for $P_L$. It remains to show this strategy is winning.

To this end, suppose that $\mathrm{val}(x_1, y_1, \ldots, x_{t-1}, y_{t-1}) \le \mathrm{val}(\emptyset)$. By Proposition B.8, for every $x_t$ there exists $y_t$ so that $(x_1, y_1, \ldots, x_t, y_t) \in D_t$. Thus playing $y_t = g_t(x_1, y_1, \ldots, x_{t-1}, y_{t-1}, x_t)$ yields, by the definition of $g_t$,
$$\mathrm{val}(x_1, y_1, \ldots, x_t, y_t) < \mathrm{val}(x_1, y_1, \ldots, x_{t-1}, y_{t-1}).$$
The assumption $\mathrm{val}(x_1, y_1, \ldots, x_{t-1}, y_{t-1}) \le \mathrm{val}(\emptyset)$ certainly holds for $t = 1$. It thus remains valid for any $t$ as long as $P_L$ plays the strategy $\{g_t\}$. It follows that $\{g_t\}$ is a value-decreasing strategy, so it is winning for $P_L$.

Appendix C. A nonmeasurable example

To fully appreciate the measurability issues that arise in this paper, it is illuminating to consider what can go wrong if we do not assume measurability in the sense of Definition 3.3. To this end we revisit in this section a standard example from empirical process theory (cf. (Dudley, 2014, Chapter 5) or (Blumer, Ehrenfeucht, Haussler, and Warmuth, 1989, p. 953)) in our setting.

For the purposes of this section, we assume the validity of the continuum hypothesis $\mathrm{card}([0,1]) = \aleph_1$. (This is not assumed anywhere else in the paper.) We may therefore identify $[0,1]$ with $\omega_1$. In particular, this induces a well-ordering of $[0,1]$, which we will denote $\lessdot$ to distinguish it from the usual ordering of the reals.

To construct our example, we let $\mathcal{X} = [0,1]$ and $\mathcal{H} = \{x \mapsto \mathbb{1}_{x \lessdot z} : z \in [0,1]\}$. Every $h \in \mathcal{H}$ is the indicator of a countable set (being an initial segment of $\omega_1$). In particular, each $h \in \mathcal{H}$ is individually measurable. However, measurability in the sense of Definition 3.3 fails for $\mathcal{H}$.

Lemma C.1. For the example of this section, the set
$$S = \{(x_1, x_2) \in \mathcal{X}^2 : \mathcal{H}_{x_1,0,x_2,1} \ne \emptyset\}$$
has inner measure zero and outer measure one with respect to the Lebesgue measure. In particular, $S$ is not Lebesgue measurable.

Proof. By the definition of $\mathcal{H}$, we have $S = \{(x_1, x_2) \in \mathcal{X}^2 : x_2 \lessdot x_1\}$. If $S$ were Lebesgue-measurable, then Fubini's theorem would yield
$$0 = \int \left( \int \mathbb{1}_S(x_1, x_2)\, dx_2 \right) dx_1 = \int \left( \int \mathbb{1}_S(x_1, x_2)\, dx_1 \right) dx_2 = 1,$$
where we used that $x_2 \mapsto \mathbb{1}_S(x_1, x_2)$ is the indicator of a countable set and that $x_1 \mapsto \mathbb{1}_S(x_1, x_2)$ is the indicator of the complement of a countable set. This is evidently absurd, so $S$ cannot be Lebesgue-measurable. That the outer measure of $S$ is one and the inner measure is zero follows readily from the above Fubini identities by bounding $S$ and $S^c$ by a measurable cover, respectively.

Corollary C.2. The class $\mathcal{H}$ is not measurable in the sense of Definition 3.3.

Proof. If $\mathcal{H}$ were measurable in the sense of Definition 3.3, then the same argument as in the proof of Corollary 3.5 would show that $S$ is analytic. But this contradicts Lemma C.1, as analytic sets are universally measurable by Theorem A.10.

Lemma C.1 illustrates the fundamental importance of measurability in our theory. For example, suppose player $P_A$ in the game of Section 3.2 draws i.i.d. random plays $x_1, x_2, \ldots$ from the Lebesgue measure on $[0,1]$. Even if $P_L$ plays the simplest type of strategy—the deterministic strategy $y_1 = 0$, $y_2 = 1$—the fact that $P_L$ wins in the second round is not measurable. Moreover, one can show (see the proof of Lemma C.3 below) that any value-minimizing strategy for $P_L$ in the sense of Section B.3 plays $y_1 = 0$, $y_2 = 1$ for $(x_1, x_2) \in S^c$. So, the same problem arises for the winning strategies constructed by Theorem B.1.

This kind of behavior would undermine any reasonable probabilistic analysis of the learning problems in this paper. Even the definitions of learning rates make no sense when the probabilities of events have no meaning. The above example therefore illustrates that measurability is crucial for learning problems with random data.

It is instructive to check what goes wrong if one attempts to prove the existence of measurable strategies as in Theorem B.1 for the present example. The coanalyticity assumption was used in the proof of Theorem B.1 in two different ways. First, it ensures that the sets of active positions $A_n$ and the super-level sets of the value function $A^\kappa_n$ are measurable for countable $\kappa$ (cf. Corollary B.11). This immediately fails in the present example (Lemma C.1). Secondly, coanalyticity was used to show that only countable game values can appear (cf. Lemma B.7). We presently show that the latter also fails in the present example, so that coanalyticity is really essential for both parts of the proof.

Lemma C.3. In the present example, the game of Section 3.2 satisfies $\mathrm{val}(\emptyset) \ge \omega_1$.

Proof. As in Section 3.4, for this game we denote $\mathrm{LD}(\mathcal{H}) := \mathrm{val}(\emptyset)$, and we recall that $\mathrm{val}(x_1, y_1, \ldots, x_t, y_t) = \mathrm{LD}(\mathcal{H}_{x_1,y_1,\ldots,x_t,y_t})$.

We must recall some facts about ordinals (Sierpiński, 1965, Section XIV.20). An ordinal $\kappa$ is called additively indecomposable if $\xi + \kappa = \kappa$ for every $\xi < \kappa$ or, equivalently, if the ordinal segment $[\xi, \kappa)$ is isomorphic to $\kappa$ for all $\xi < \kappa$.
An ordinal is additively indecomposable if and only if it is of the form $\omega^\beta$ for some ordinal $\beta$. Moreover, $\omega_1 = \omega^{\omega_1}$, so that $\omega_1$ is additively indecomposable.

For every ordinal $\beta$, define the class of indicators $\mathcal{H}^\beta = \{\lambda \mapsto \mathbb{1}_{\lambda \le \kappa} : \kappa \in \omega^\beta\}$ on $\mathcal{X}^\beta = \omega^\beta$. We now prove by induction on $\beta$ that $\mathrm{LD}(\mathcal{H}^\beta) \ge \beta$ for each $\beta$. Choosing $\beta = \omega_1$ then shows that $\mathrm{LD}(\mathcal{H}) \ge \omega_1$.

For the initial step, it suffices to note that $\mathrm{LD}(\mathcal{H}^0) = 0$ because $\mathcal{X}^0 = 1$ and $\mathcal{H}^0$ consists of a single function. Now suppose we have proved that $\mathrm{LD}(\mathcal{H}^\alpha) \ge \alpha$ for all $\alpha < \beta$. Note first that $\mathcal{H}^\beta_{\omega^\alpha, 0} = \mathcal{H}^\alpha$, where we view the latter as functions on $\mathcal{X}^\beta$. However, all functions in $\mathcal{H}^\alpha$ take the same value on points in $\mathcal{X}^\beta \setminus \mathcal{X}^\alpha$, so such points cannot appear in any active tree. It follows immediately that $\mathrm{LD}(\mathcal{H}^\beta_{\omega^\alpha, 0}) = \mathrm{LD}(\mathcal{H}^\alpha)$. By the same reasoning, now using that $[\omega^\alpha, \omega^\beta)$ is isomorphic to $\omega^\beta$, it follows that $\mathrm{LD}(\mathcal{H}^\beta_{\omega^\alpha, 1}) = \mathrm{LD}(\mathcal{H}^\beta)$. Thus $\mathrm{LD}(\mathcal{H}^\beta) > \mathrm{LD}(\mathcal{H}^\alpha) \ge \alpha$ by the induction hypothesis and Lemma B.10. As this holds for any $\alpha < \beta$, we have shown $\mathrm{LD}(\mathcal{H}^\beta) \ge \beta$.

Let us conclude our discussion of measurability by emphasizing that even in the presence of a measurability assumption such as Definition 3.3 or coanalyticity of $W$ in Theorem B.1, the key reason why we are able to construct measurable strategies is that we assumed $P_L$ plays values in countable sets $\mathcal{Y}_t$ (as is the case for all the games encountered in this paper). In general Gale-Stewart games where both $P_A$ and $P_L$ play values in Polish spaces, there is little hope of obtaining measurable strategies in a general setting. Indeed, an inspection of the proof of Corollary B.11 shows that the super-level sets of the value function are constructed by successive unions over $\mathcal{X}_t$ and intersections over $\mathcal{Y}_t$, that is, by alternating projections and complements. However, it is consistent with the axioms of set theory (ZFC) that the projection of a coanalytic set may be Lebesgue-nonmeasurable (Jech, 2003, Corollary 25.28).
Thus it is possible to construct examples of Gale-Stewart games where $\mathcal{X}_t$, $\mathcal{Y}_t$ are Polish, $W$ is closed or open, and the set $A^\kappa_n$ of Corollary B.11 is nonmeasurable already for $\kappa = 0$ or $1$. In contrast, because we assumed the $\mathcal{Y}_t$ are countable, only the unions over $\mathcal{X}_t$ play a nontrivial role in our setting, and analyticity is preserved in the construction.

References

A. Antos and G. Lugosi. Strong minimax lower bounds for learning. Machine Learning, 30:31–56, 1998.

J.-Y. Audibert and A. B. Tsybakov. Fast learning rates for plug-in classifiers. The Annals of Statistics, 35(2):608–633, 2007.

M.-F. Balcan, S. Hanneke, and J. Wortman Vaughan. The true sample complexity of active learning. Machine Learning, 80(2–3):111–139, 2010.

G. M. Benedek and A. Itai. Nonuniform learnability. Journal of Computer and System Sciences, 48:311–323, 1994.

A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the Association for Computing Machinery, 36(4):929–965, 1989.

D. Cohn and G. Tesauro. Can neural networks do better than the Vapnik-Chervonenkis bounds? In Advances in Neural Information Processing Systems, 1990.

D. Cohn and G. Tesauro. How tight are the Vapnik-Chervonenkis bounds? Neural Computation, 4(2):249–269, 1992.

D. L. Cohn. Measure Theory. Birkhäuser, Boston, Mass., 1980.

C. Dellacherie. Les dérivations en théorie descriptive des ensembles et le théorème de la borne. In Séminaire de Probabilités, XI (Univ. Strasbourg, Strasbourg, 1975/1976), pages 34–46. Lecture Notes in Math., Vol. 581. Springer, 1977.

L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York, 1996.

R. M. Dudley. Uniform Central Limit Theorems, volume 142 of Cambridge Studies in Advanced Mathematics. Cambridge University Press, New York, second edition, 2014.

A. Ehrenfeucht, D. Haussler, M. Kearns, and L. Valiant. A general lower bound on the number of examples needed for learning. Information and Computation, 82(3):247–261, 1989.

C. D. A. Evans and J. D. Hamkins. Transfinite game values in infinite chess. Integers, 14: Paper No. G2, 36, 2014.

D. Gale and F. M. Stewart. Infinite games with perfect information. In Contributions to the Theory of Games, Vol. 2, Annals of Mathematics Studies, no. 28, pages 245–266. Princeton University Press, Princeton, N.J., 1953.

S. Hanneke. Theoretical Foundations of Active Learning. PhD thesis, Machine Learning Department, School of Computer Science, Carnegie Mellon University, 2009.

S. Hanneke. Activized learning: Transforming passive to active with improved label complexity. Journal of Machine Learning Research, 13(5):1469–1587, 2012.

S. Hanneke. Learning whenever learning is possible: Universal learning under general stochastic processes. arXiv:1706.01418, 2017.

S. Hanneke, A. Kontorovich, S. Sabato, and R. Weiss. Universal Bayes consistency in metric spaces. arXiv:1705.08184, 2019.

D. Haussler, N. Littlestone, and M. Warmuth. Predicting {0,1}-functions on randomly drawn points. Information and Computation, 115(2):248–292, 1994.

W. Hodges. Model Theory, volume 42 of Encyclopedia of Mathematics and its Applications. Cambridge University Press, Cambridge, 1993.

K. Hrbacek and T. Jech. Introduction to Set Theory, volume 220 of Monographs and Textbooks in Pure and Applied Mathematics. Marcel Dekker, Inc., New York, third edition, 1999.

T. Jech. Set Theory. Springer Monographs in Mathematics. Springer-Verlag, Berlin, 2003. The third millennium edition, revised and expanded.

A. S. Kechris. Classical Descriptive Set Theory, volume 156 of Graduate Texts in Mathematics. Springer-Verlag, New York, 1995.

V. Koltchinskii and O. Beznosova. Exponential convergence rates in classification. In Learning Theory, 18th Annual Conference on Learning Theory, COLT 2005, volume 3559 of Lecture Notes in Computer Science, pages 295–307. Springer, 2005.

N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285–318, 1988.

A. Nitanda and T. Suzuki. Stochastic gradient descent with exponential convergence rates of expected classification errors. In AISTATS, volume 89 of Proceedings of Machine Learning Research, pages 1417–1426. PMLR, 2019.

V. Pestov. PAC learnability versus VC dimension: A footnote to a basic result of statistical learning. In The 2011 International Joint Conference on Neural Networks, pages 1141–1145, 2011.

L. Pillaud-Vivien, A. Rudi, and F. Bach. Exponential convergence of testing error for stochastic gradient methods. In Conference On Learning Theory, COLT 2018, volume 75 of Proceedings of Machine Learning Research, pages 250–296. PMLR, 2018.

D. Schuurmans. Characterizing rational versus exponential learning curves. Journal of Computer and System Sciences, 55(1):140–160, 1997.

W. Sierpiński. Cardinal and Ordinal Numbers. Second revised edition. Monografie Matematyczne, Vol. 34. Państwowe Wydawnictwo Naukowe, Warsaw, 1965.

C. J. Stone. Consistent nonparametric regression. The Annals of Statistics, 5(4):595–620, 1977.

L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.

R. van Handel. The universal Glivenko-Cantelli property. Probability Theory and Related Fields, 155:911–934, 2013.

V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264–280, 1971.

V. Vapnik and A. Chervonenkis. Theory of Pattern Recognition. Nauka, Moscow, 1974.

L. Yang and S. Hanneke. Activized learning with uniform classification noise. In Proceedings of the 30th International Conference on Machine Learning, 2013.