Pac-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning

Institute of Mathematical Statistics
Lecture Notes–Monograph Series, Volume 56

Olivier Catoni

Institute of Mathematical Statistics, Beachwood, Ohio, USA

Series Editor: Anthony C. Davison

The production of the Institute of Mathematical Statistics Lecture Notes–Monograph Series is managed by the IMS Office: Rong Chen, Treasurer, and Elyse Gustafson, Executive Director.

Library of Congress Control Number: 2007939120
International Standard Book Number (10): 0-940600-72-2
International Standard Book Number (13): 978-0-940600-72-0
International Standard Serial Number: 0749-2170

Contents
Preface
Introduction
1. Inductive PAC-Bayesian learning
2. Comparing posterior distributions to Gibbs priors
3. Transductive PAC-Bayesian learning
4. Support Vector Machines
Appendix: Classification by thresholding
Bibliography

Preface
This monograph deals with adaptive supervised classification, using tools borrowed from statistical mechanics and information theory, stemming from the PAC-Bayesian approach pioneered by David McAllester and applied to a conception of statistical learning theory forged by Vladimir Vapnik. Using convex analysis on the set of posterior probability measures, we show how to get local measures of the complexity of the classification model involving the relative entropy of posterior distributions with respect to Gibbs posterior measures. We then discuss relative bounds, comparing the generalization error of two classification rules, showing how the margin assumption of Mammen and Tsybakov can be replaced with some empirical measure of the covariance structure of the classification model. We show how to associate to any posterior distribution an effective temperature relating it to the Gibbs prior distribution with the same level of expected error rate, and how to estimate this effective temperature from data, resulting in an estimator whose expected error rate converges according to the best possible power of the sample size adaptively under any margin and parametric complexity assumptions. We describe and study an alternative selection scheme based on relative bounds between estimators, and present a two-step localization technique which can handle the selection of a parametric model from a family of those. We show how to extend systematically all the results obtained in the inductive setting to transductive learning, and use this to improve Vapnik's generalization bounds, extending them to the case when the sample is made of independent non-identically distributed pairs of patterns and labels. Finally we review briefly the construction of Support Vector Machines and show how to derive generalization bounds for them, measuring the complexity either through the number of support vectors or through the value of the transductive or inductive margin.
Olivier Catoni
CNRS – Laboratoire de Probabilités et Modèles Aléatoires, Université Paris 6 (site Chevaleret), 4 place Jussieu – Case 188, 75 252 Paris Cedex 05.

To my son Nicolas

Introduction
Among the possible approaches to pattern recognition, statistical learning theory has received a lot of attention in the last few years. Although a realistic pattern recognition scheme involves data pre-processing and post-processing that need a theory of their own, a central role is often played by some kind of supervised learning algorithm. This central building block is the subject we are going to analyse in these notes.

Accordingly, we assume that we have prepared in some way or another a sample of $N$ labelled patterns $(X_i, Y_i)_{i=1}^N$, where $X_i$ ranges in some pattern space $\mathcal{X}$ and $Y_i$ ranges in some finite label set $\mathcal{Y}$. We also assume that we have devised our experiment in such a way that the couples of random variables $(X_i, Y_i)$ are independent (but not necessarily equidistributed). Here, randomness should be understood to come from the way the statistician has planned his experiment. He may for instance have drawn the $X_i$s at random from some larger population of patterns the algorithm is meant to be applied to in a second stage. The labels $Y_i$ may have been set with the help of some external expertise (which may itself be faulty or contain some amount of randomness, so we do not assume that $Y_i$ is a function of $X_i$, and allow the couple of random variables $(X_i, Y_i)$ to follow any kind of joint distribution). In practice, patterns will be extracted from some high dimensional and highly structured data, such as digital images, speech signals, DNA sequences, etc. We will not discuss this pre-processing stage here, although it poses crucial problems dealing with segmentation and the choice of a representation. The aim of supervised classification is to choose some classification rule $f : \mathcal{X} \to \mathcal{Y}$ which predicts $Y$ from $X$ making as few mistakes as possible on average.

The choice of $f$ will be driven by a suitable use of the information provided by the sample $(X_i, Y_i)_{i=1}^N$ on the joint distribution of $X$ and $Y$. Moreover, considering all the possible measurable functions $f$ from $\mathcal{X}$ to $\mathcal{Y}$ would not be feasible in practice and, maybe more importantly, not well founded from a statistical point of view, at least as soon as the pattern space $\mathcal{X}$ is large and little is known in advance about the joint distribution of patterns $X$ and labels $Y$. Therefore, we will consider parametrized subsets of classification rules $\{ f_\theta : \mathcal{X} \to \mathcal{Y} ; \theta \in \Theta_m \}$, $m \in M$, which may be grouped to form a big parameter set $\Theta = \bigcup_{m \in M} \Theta_m$.

The subject of this monograph is to introduce to statistical learning theory, and more precisely to the theory of supervised classification, a number of technical tools akin to statistical mechanics and information theory, dealing with the concepts of entropy and temperature. A central task will in particular be to control the mutual information between an estimated parameter and the observed sample. The focus will not be directly on the description of the data to be classified, but on the description of the classification rules. As we want to deal with high dimensional data, we will be bound to consider high dimensional sets of candidate classification rules, and will analyse them with tools very similar to those used in statistical mechanics to describe particle systems with many degrees of freedom. More specifically, the sets of classification rules will be described by Gibbs measures defined on parameter sets and depending on the observed sample value.
A Gibbs measure is the special kind of probability measure used in statistical mechanics to describe the state of a particle system driven by a given energy function at some given temperature. Here, Gibbs measures will emerge as minimizers of the average loss value under entropy (or mutual information) constraints. Entropy itself, more precisely the Kullback divergence function between probability measures, will emerge in conjunction with the use of exponential deviation inequalities: indeed, the log-Laplace transform may be seen as the Legendre transform of the Kullback divergence function, as will be stated in Lemma 1.1.3 (page 4).

To fix notation, let $(X_i, Y_i)_{i=1}^N$ be the canonical process on $\Omega = (\mathcal{X} \times \mathcal{Y})^N$ (which means the coordinate process). Let the pattern space be provided with a sigma-algebra $\mathcal{B}$ turning it into a measurable space $(\mathcal{X}, \mathcal{B})$. On the finite label space $\mathcal{Y}$, we will consider the trivial algebra $\mathcal{B}'$ made of all its subsets. Let $\mathcal{M}\big[(\mathcal{X} \times \mathcal{Y})^N, (\mathcal{B} \otimes \mathcal{B}')^{\otimes N}\big]$ be our notation for the set of probability measures (i.e. of positive measures of total mass equal to 1) on the measurable space $\big[(\mathcal{X} \times \mathcal{Y})^N, (\mathcal{B} \otimes \mathcal{B}')^{\otimes N}\big]$. Once some probability distribution $P \in \mathcal{M}\big[(\mathcal{X} \times \mathcal{Y})^N, (\mathcal{B} \otimes \mathcal{B}')^{\otimes N}\big]$ is chosen, it turns $(X_i, Y_i)_{i=1}^N$ into the canonical realization of a stochastic process modelling the observed sample (also called the training set). We will assume that $P = \bigotimes_{i=1}^N P_i$, where for each $i = 1, \dots, N$, $P_i \in \mathcal{M}(\mathcal{X} \times \mathcal{Y}, \mathcal{B} \otimes \mathcal{B}')$, to reflect the assumption that we observe independent pairs of patterns and labels. We will also assume that we are provided with some indexed set of possible classification rules $\mathcal{R}_\Theta = \{ f_\theta : \mathcal{X} \to \mathcal{Y} ; \theta \in \Theta \}$, where $(\Theta, \mathcal{T})$ is some measurable index set. Assuming some indexation of the classification rules is just a matter of presentation. Although it leads to heavier notation, it allows us to integrate over the space of classification rules as well as over $\Omega$, using the usual formalism of multiple integrals. For this matter, we will assume that $(\theta, x) \mapsto f_\theta(x) : (\Theta \times \mathcal{X}, \mathcal{T} \otimes \mathcal{B}) \to (\mathcal{Y}, \mathcal{B}')$ is a measurable function.

In many cases, as already mentioned, $\Theta = \bigcup_{m \in M} \Theta_m$ will be a finite (or more generally countable) union of subspaces, dividing the classification model $\mathcal{R}_\Theta = \bigcup_{m \in M} \mathcal{R}_{\Theta_m}$ into a union of sub-models. The importance of introducing such a structure has been put forward by V. Vapnik, as a way to avoid making strong hypotheses on the distribution $P$ of the sample. If neither the distribution of the sample nor the set of classification rules were constrained, it is well known that no kind of statistical inference would be possible. Considering a family of sub-models is a way to provide for adaptive classification where the choice of the model depends on the observed sample.
Restricting the set of classification rules is more realistic than restricting the distribution of patterns, since the classification rules are a processing tool left to the choice of the statistician, whereas the distribution of the patterns is not fully under his control, except for some planning of the learning experiment which may enforce some weak properties like independence, but not the precise shapes of the marginal distributions $P_i$, which are as a rule unknown distributions on some high dimensional space.

In these notes, we will concentrate on general issues concerned with a natural measure of risk, namely the expected error rate of each classification rule $f_\theta$, expressed as
\[
(0.1) \qquad R(\theta) = \frac{1}{N} \sum_{i=1}^N P\big[ f_\theta(X_i) \neq Y_i \big].
\]
As this quantity is unobserved, we will be led to work with the corresponding empirical error rate
\[
(0.2) \qquad r(\theta, \omega) = \frac{1}{N} \sum_{i=1}^N \mathbb{1}\big[ f_\theta(X_i) \neq Y_i \big].
\]
This does not mean that practical learning algorithms will always try to minimize this criterion. They often on the contrary try to minimize some other criterion which is linked with the structure of the problem and has some nice additional properties (like smoothness and convexity, for example). Nevertheless, and independently of the precise form of the estimator $\widehat{\theta} : \Omega \to \Theta$ under study, the analysis of $R(\widehat{\theta})$ is a natural question, and often corresponds to what is required in practice.

Answering this question is not straightforward because, although $R(\theta)$ is the expectation of $r(\theta)$, a sum of independent Bernoulli random variables, $R(\widehat{\theta})$ is not the expectation of $r(\widehat{\theta})$, because of the dependence of $\widehat{\theta}$ on the sample, and neither is $r(\widehat{\theta})$ a sum of independent random variables. To circumvent this unfortunate situation, some uniform control over the deviations of $r$ from $R$ is needed.

We will follow the PAC-Bayesian approach to this problem, originated in the machine learning community and pioneered by McAllester (1998, 1999). It can be seen as some variant of the more classical approach of $M$-estimators relying on empirical process theory — as described for instance in Van de Geer (2000). It is built on some general principles:

• One idea is to embed the set of estimators of the type $\widehat{\theta} : \Omega \to \Theta$ into the larger set of regular conditional probability measures $\rho : \big(\Omega, (\mathcal{B} \otimes \mathcal{B}')^{\otimes N}\big) \to \mathcal{M}(\Theta, \mathcal{T})$. We will call these conditional probability measures posterior distributions, to follow standard terminology.

• A second idea is to measure the fluctuations of $\rho$ with respect to the sample, using some prior distribution $\pi \in \mathcal{M}(\Theta, \mathcal{T})$, and the Kullback divergence function $\mathcal{K}(\rho, \pi)$. The expectation $P\{\mathcal{K}(\rho, \pi)\}$ measures the randomness of $\rho$. The optimal choice of $\pi$ would be $P(\rho)$, resulting in a measure of the randomness of $\rho$ equal to the mutual information between the sample and the estimated parameter drawn from $\rho$. Anyhow, since $P(\rho)$ is usually not better known than $P$, we will have to be content with some less concentrated prior distribution $\pi$, resulting in some looser measure of randomness, as shown by the identity
\[
P\big[\mathcal{K}(\rho, \pi)\big] = P\big\{\mathcal{K}\big[\rho, P(\rho)\big]\big\} + \mathcal{K}\big[P(\rho), \pi\big].
\]
• A third idea is to analyse the fluctuations of the random process $\theta \mapsto r(\theta)$ from its mean process $\theta \mapsto R(\theta)$ through the log-Laplace transform
\[
- \frac{1}{\lambda} \log \bigg\{ \int\!\!\int \exp\big[ - \lambda r(\theta, \omega) \big] \, \pi(d\theta) \, P(d\omega) \bigg\},
\]
as would be done in statistical mechanics, where this is called the free energy. This transform is well suited to relate $\min_{\theta \in \Theta} r(\theta)$ to $\inf_{\theta \in \Theta} R(\theta)$, since for large enough values of the parameter $\lambda$, corresponding to low enough values of the temperature, the system has small fluctuations around its ground state.

• A fourth idea deals with localization. It consists of considering a prior distribution $\pi$ depending on the unknown expected error rate function $R$. Thus some central result of the theory will consist in an empirical upper bound for $\mathcal{K}\big[\rho, \pi_{\exp(-\beta R)}\big]$, where $\pi_{\exp(-\beta R)}$, defined by its density
\[
\frac{d \pi_{\exp(-\beta R)}}{d\pi} = \frac{\exp(-\beta R)}{\pi\big[\exp(-\beta R)\big]},
\]
is a Gibbs distribution built from a known prior distribution $\pi \in \mathcal{M}(\Theta, \mathcal{T})$, some inverse temperature parameter $\beta \in \mathbb{R}_+$ and the expected error rate $R$. This bound will in particular be used when $\rho$ is a posterior Gibbs distribution, of the form $\pi_{\exp(-\beta r)}$. The general idea will be to show that in the case when $\rho$ is not too random, in the sense that it is possible to find a prior (that is non-random) distribution $\pi$ such that $\mathcal{K}(\rho, \pi)$ is small, then $\rho(r)$ can be reliably taken for a good approximation of $\rho(R)$.

This monograph is divided into four chapters. The first deals with the inductive setting presented in these lines. The second is devoted to relative bounds. It shows that it is possible to obtain a tighter estimate of the mutual information between the sample and the estimated parameter by comparing prior and posterior Gibbs distributions. It shows how to use this idea to obtain adaptive model selection schemes under very weak hypotheses.

The third chapter introduces the transductive setting of V. Vapnik (Vapnik, 1998), which consists in comparing the performance of classification rules on the learning sample with their performance on a test sample instead of their average performance. The fourth one is a fast introduction to Support Vector Machines. It is the occasion to show the implications of the general results discussed in the three first chapters when some particular choice is made about the structure of the classification rules.

In the first chapter, two types of bounds are shown. Empirical bounds are useful to build, compare and select estimators. (The toy simulation below illustrates, before we get to any bounds, the gap between $r$ and $R$ that such bounds are designed to control.)
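The following is a minimal sketch, not part of the original text: the data-generating mechanism (noisy labels on the real line) and the one-dimensional threshold rules $f_\theta(x) = \mathbb{1}[x > \theta]$ are arbitrary choices made for illustration. It draws a sample, selects $\widehat{\theta}$ by empirical error minimization, and compares $r(\widehat{\theta})$ of equation (0.2) with a Monte Carlo stand-in for $R(\widehat{\theta})$ of equation (0.1).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic i.i.d. sample (a toy choice): X uniform on [0, 1],
# Y = 1[X > 0.5] with labels flipped with probability 0.3.
N, noise = 1000, 0.3
X = rng.uniform(size=N)
Y = np.where(rng.uniform(size=N) < noise, 1 - (X > 0.5), (X > 0.5)).astype(int)

# Classification rules f_theta(x) = 1[x > theta], theta on a finite grid.
thetas = np.linspace(0.0, 1.0, 201)

def errors(x, y, theta):
    """Misclassification indicators of the rule f_theta on the points (x, y)."""
    return ((x > theta).astype(int) != y).astype(float)

r = np.array([errors(X, Y, t).mean() for t in thetas])   # empirical error rate (0.2)

# R(theta) estimated on a large independent sample standing in for the expectation (0.1)
Xt = rng.uniform(size=200_000)
Yt = np.where(rng.uniform(size=Xt.size) < noise, 1 - (Xt > 0.5), (Xt > 0.5)).astype(int)
R = np.array([errors(Xt, Yt, t).mean() for t in thetas])

k = np.argmin(r)   # empirical error minimizer over the grid
print(f"r(theta_hat) = {r[k]:.3f}  versus  R(theta_hat) ~ {R[k]:.3f}")
# On average over repeated samples, r(theta_hat) underestimates R(theta_hat),
# because theta_hat is chosen using the same sample on which r is evaluated.
```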
Non-random bounds are useful to assess the speed of convergence of estimators, relating this speed to the behaviour of the Gibbs prior expected error rate $\beta \mapsto \pi_{\exp(-\beta R)}(R)$ and to covariance factors related to the margin assumption of Mammen and Tsybakov when a finer analysis is performed. We will proceed from the most straightforward bounds towards more elaborate ones, built to achieve a better asymptotic behaviour. In this course towards more sophisticated inequalities, we will introduce local bounds and relative bounds.

The study of relative bounds is expanded in the second chapter, where tighter comparisons between prior and posterior Gibbs distributions are proved. Theorems 2.1.3 (page 54) and 2.2.4 (page 73) present two ways of selecting some nearly optimal classification rule. They are both proved to be adaptive in all the parameters under Mammen and Tsybakov margin assumptions and parametric complexity assumptions. This is done in Corollary 2.1.17 (page 67) of Theorem 2.1.15 (page 66) and in Theorem 2.2.11 (page 89). In the first approach, the performance of a randomized estimator modelled by a posterior distribution is compared with the performance of a prior Gibbs distribution. In the second approach, posterior distributions are directly compared between themselves (this leads to slightly stronger results, at the price of using a more complex algorithm). When there is more than one parametric model, it is appropriate to use also some doubly localized scheme: two-step localization is presented for both approaches, in Theorems 2.3.2 (page 94) and 2.3.9 (page 108), and provides bounds with a decreased influence of the number of empirically inefficient models included in the selection scheme.

We would not like to induce the reader into thinking that the most sophisticated results presented in these first two chapters are necessarily the most useful ones; they are as a rule only more efficient asymptotically, whereas, being more involved, they use looser constants leading to less precision for small sample sizes. In practice, whether a sample is to be considered small is a question of the ratio between the number of examples and the complexity (roughly speaking the number of parameters) of the model used for classification. Since our aim here is to describe methods appropriate for complex data (images, speech, DNA, ...), we suspect that practitioners wanting to make use of our proposals will often be confronted with small sample sizes; thus we would advise them to try the simplest bounds first and only afterwards see whether the asymptotically better ones can bring some improvement.

We would also like to point out that the results of the first two chapters are not of a purely theoretical nature: posterior parameter distributions can indeed be computed effectively, using Monte Carlo techniques, and there is well-established know-how about these computations in Bayesian statistics. Moreover, non-randomized estimators of the classical form $\widehat{\theta} : \Omega \to \Theta$ can be efficiently approximated by posterior distributions $\rho : \Omega \to \mathcal{M}(\Theta)$ supported by a fairly narrow neighbourhood of $\widehat{\theta}$, more precisely a neighbourhood of the size of the typical fluctuations of $\widehat{\theta}$, so that this randomized approximation of $\widehat{\theta}$ will most of the time provide the same classification as $\widehat{\theta}$ itself, except for a small amount of dubious examples for which the classification provided by $\widehat{\theta}$ would anyway be unreliable.
This is explained on page 7.

As already mentioned, the third chapter is about the transductive setting, that is about comparing the performance of estimators on a training set and on a test set. We show first that this comparison can be based on a set of exponential deviation inequalities which parallels the one used in the inductive case. This gives the opportunity to transport all the results obtained in the inductive case in a systematic way. In the transductive setting, the use of prior distributions can be extended to the use of partially exchangeable posterior distributions depending on the union of training and test patterns, bringing increased possibilities to adapt to the data and giving rise to such crucial notions of complexity as the Vapnik–Cervonenkis dimension.

Having done so, we more specifically focus on the small sample case, where local and relative bounds are not expected to be of great help. Introducing a fictitious (that is unobserved) shadow sample, we study Vapnik-type generalization bounds, showing how to tighten and extend them with some original ideas, like making no Gaussian approximation to the log-Laplace transform of Bernoulli random variables, using a shadow sample of arbitrary size, shrinking from the use of any symmetrization trick, and using a suitable subset of the group of permutations to cover the case of independent non-identically distributed data. The culminating result of the third chapter is Theorem 3.3.3 (page 125), subsequent bounds showing the separate influence of the above ideas and providing an easier comparison with Vapnik's original results. Vapnik-type generalization bounds have a broad applicability, not only through the concept of Vapnik–Cervonenkis dimension, but also through the use of compression schemes (Littlestone and Warmuth, 1986), which are briefly described on page 117.

The beginning of the fourth chapter introduces Support Vector Machines, both in the separable and in the non-separable case (using the box constraint). We then describe different types of bounds. We start with compression scheme bounds, to proceed with margin bounds. We begin with transductive margin bounds, recalling on this occasion in Theorem 4.2.2 (page 144) the growth bound for a family of classification rules with given Vapnik–Cervonenkis dimension. In Theorem 4.2.4 (page 146) we give the usual estimate of the Vapnik–Cervonenkis dimension of a family of separating hyperplanes with a given transductive margin (we mean by this that the margin is computed on the union of the training and test sets). We present an original probabilistic proof inspired by a similar one from Cristianini et al. (2000), whereas other proofs available usually rely on the informal claim that the simplex is the worst case. We end this short review of Support Vector Machines with a discussion of inductive margin bounds. Here the margin is computed on the training set only, and a more involved combinatorial lemma, due to Alon et al. (1997) and recalled in Lemma 4.2.6 (page 149), is used. We use this lemma and the results of the third chapter to establish a bound depending on the margin of the training set alone.

In appendix, we finally discuss the textbook example of classification by thresholding: in this setting, each classification rule is built by thresholding a series of measurements and taking a decision based on these thresholded values.
This relatively simple example (which can be considered as an introduction to the more technical case of classification trees) can be used to give more flesh to the results of the first three chapters.

It is a pleasure to end this introduction with my greatest thanks to Anthony Davison, for his careful reading of the manuscript and his numerous suggestions.

Chapter 1
Inductive PAC-Bayesian learning
The setting of inductive inference (as opposed to transductive inference, to be discussed later) is the one described in the introduction.

When we have to take the expectation of a random variable $Z : \Omega \to \mathbb{R}$, as well as of a function of the parameter $h : \Theta \to \mathbb{R}$, with respect to some probability measure, we will as a rule use short functional notation instead of resorting to the integral sign: thus we will write $P(Z)$ for $\int_\Omega Z(\omega) \, P(d\omega)$ and $\pi(h)$ for $\int_\Theta h(\theta) \, \pi(d\theta)$.

A more traditional statistical approach would focus on estimators $\widehat{\theta} : \Omega \to \Theta$ of the parameter $\theta$ and be interested in the relationship between the empirical error rate $r(\widehat{\theta})$, defined by equation (0.2, page ix), which is the proportion of errors made on the sample, and the expected error rate $R(\widehat{\theta})$, defined by equation (0.1, page viii), which is the expected probability of error on new instances of patterns. The PAC-Bayesian approach instead chooses a broader perspective and allows the estimator $\widehat{\theta}$ to be drawn at random using some auxiliary source of randomness to smooth the dependence of $\widehat{\theta}$ on the sample. One way of representing the supplementary randomness allowed in the choice of $\widehat{\theta}$ is to consider what it is usual to call posterior distributions on the parameter space, that is probability measures $\rho : \Omega \to \mathcal{M}(\Theta, \mathcal{T})$, depending on the sample, or from a technical perspective, regular conditional (or transition) probability measures. Let us recall that we use the model described in the introduction: the training sample is modelled by the canonical process $(X_i, Y_i)_{i=1}^N$ on $\Omega = (\mathcal{X} \times \mathcal{Y})^N$, and a product probability measure $P = \bigotimes_{i=1}^N P_i$ on $\Omega$ is considered to reflect the assumption that the training sample is made of independent pairs of patterns and labels. The transition probability measure $\rho$, along with $P \in \mathcal{M}(\Omega)$, defines a probability distribution on $\Omega \times \Theta$ and describes the conditional distribution of the estimated parameter $\widehat{\theta}$ knowing the sample $(X_i, Y_i)_{i=1}^N$.

The main subject of this broadened theory becomes to investigate the relationship between $\rho(r)$, the average error rate of $\widehat{\theta}$ on the training sample, and $\rho(R)$, the expected error rate of $\widehat{\theta}$ on new samples. The first step towards using some kind of thermodynamics to tackle this question is to consider the Laplace transform of $\rho(R) - \rho(r)$, a well known provider of non-asymptotic deviation bounds. This transform takes the form
\[
P\Big\{ \exp\Big[ \lambda \big( \rho(R) - \rho(r) \big) \Big] \Big\},
\]
where some inverse temperature parameter $\lambda \in \mathbb{R}_+$, as a physicist would call it, is introduced. This Laplace transform would be easy to bound if $\rho$ did not depend on $\omega \in \Omega$ (namely on the sample), because $\rho(R)$ would then be non-random, and
\[
\rho(r) = \frac{1}{N} \sum_{i=1}^N \rho\big[ Y_i \neq f_\theta(X_i) \big]
\]
would be a sum of independent random variables. It turns out, and this will be the subject of the next section, that this annoying dependence of $\rho$ on $\omega$ can be quantified, using the inequality
\[
\rho(R) - \rho(r) \le \lambda^{-1} \log\Big\{ \pi\Big[ \exp\big( \lambda (R - r) \big) \Big] \Big\} + \lambda^{-1} \mathcal{K}(\rho, \pi),
\]
which holds for any probability measure $\pi \in \mathcal{M}(\Theta)$ on the parameter space; for our purpose it will be appropriate to consider a prior distribution $\pi$ that is non-random, as opposed to $\rho$, which depends on the sample.
Here, $\mathcal{K}(\rho, \pi)$ is the Kullback divergence of $\rho$ with respect to $\pi$, whose definition will be recalled when we come to technicalities; it can be seen as an upper bound for the mutual information between $(X_i, Y_i)_{i=1}^N$ and the estimated parameter $\widehat{\theta}$. This inequality will allow us to relate the penalized difference $\rho(R) - \rho(r) - \lambda^{-1} \mathcal{K}(\rho, \pi)$ with the Laplace transform of sums of independent random variables.

Let us now come to the details of the investigation sketched above. The first thing we will do is to study the Laplace transform of $R(\theta) - r(\theta)$, as a starting point for the more general study of $\rho(R) - \rho(r)$: it corresponds to the simple case where $\widehat{\theta}$ is not random at all, and therefore where $\rho$ is a Dirac mass at some deterministic parameter value $\theta$.

In the setting described in the introduction, let us consider the Bernoulli random variables $\sigma_i(\theta) = \mathbb{1}\big[ Y_i \neq f_\theta(X_i) \big]$, which indicate whether the classification rule $f_\theta$ made an error on the $i$th component of the training sample. Using independence and the concavity of the logarithm function, it is readily seen that for any real constant $\lambda$,
\[
\log\Big\{ P\big\{ \exp\big[ -\lambda r(\theta) \big] \big\} \Big\} = \sum_{i=1}^N \log\Big\{ P\Big[ \exp\Big( - \tfrac{\lambda}{N} \sigma_i \Big) \Big] \Big\} \le N \log\bigg\{ \frac{1}{N} \sum_{i=1}^N P\Big[ \exp\Big( - \tfrac{\lambda}{N} \sigma_i \Big) \Big] \bigg\}.
\]
The right-hand side of this inequality is the log-Laplace transform of a Bernoulli distribution with parameter $\frac{1}{N} \sum_{i=1}^N P(\sigma_i) = R(\theta)$. As any Bernoulli distribution is fully defined by its parameter, this log-Laplace transform is necessarily a function of $R(\theta)$. It can be expressed with the help of the family of functions
\[
(1.1) \qquad \Phi_a(p) = - \frac{1}{a} \log\Big\{ 1 - \big[ 1 - \exp(-a) \big] \, p \Big\}, \qquad a \in \mathbb{R}, \ p \in (0, 1).
\]
It is immediately seen that $\Phi_a$ is an increasing one-to-one mapping of the unit interval onto itself, and that it is convex when $a > 0$,
concave when $a < 0$, and equal to the identity when $a = 0$. Moreover the inverse of $\Phi_a$ is given by the formula
\[
\Phi_a^{-1}(q) = \frac{1 - \exp(-aq)}{1 - \exp(-a)}, \qquad a \in \mathbb{R}, \ q \in (0, 1).
\]
This formula may be used to extend $\Phi_a^{-1}$ to $q \in \mathbb{R}$, and we will use this extension without further notice when required. Using this notation, the previous inequality becomes
\[
\log\Big\{ P\big\{ \exp\big[ -\lambda r(\theta) \big] \big\} \Big\} \le - \lambda \, \Phi_{\lambda/N}\big[ R(\theta) \big],
\]
proving

Lemma 1.1.1. For any real constant $\lambda$ and any parameter $\theta \in \Theta$,
\[
P\bigg\{ \exp\Big[ \lambda \Big( \Phi_{\lambda/N}\big[ R(\theta) \big] - r(\theta) \Big) \Big] \bigg\} \le 1.
\]

In previous versions of this study, we had used some Bernstein bound instead of this lemma. Anyhow, as it will turn out, keeping the log-Laplace transform of a Bernoulli instead of approximating it provides simpler and tighter results.

Lemma 1.1.1 implies that for any constants $\lambda \in \mathbb{R}_+$ and $\epsilon \in (0, 1)$,
\[
P\bigg[ \Phi_{\lambda/N}\big[ R(\theta) \big] + \frac{\log(\epsilon)}{\lambda} \le r(\theta) \bigg] \ge 1 - \epsilon.
\]
Choosing $\lambda \in \arg\max_{\mathbb{R}_+} \Big\{ \Phi_{\lambda/N}\big[ R(\theta) \big] + \frac{\log(\epsilon)}{\lambda} \Big\}$, we deduce

Lemma 1.1.2. For any $\epsilon \in (0, 1)$, any $\theta \in \Theta$,
\[
P\bigg\{ R(\theta) \le \inf_{\lambda \in \mathbb{R}_+} \Phi_{\lambda/N}^{-1}\bigg[ r(\theta) - \frac{\log(\epsilon)}{\lambda} \bigg] \bigg\} \ge 1 - \epsilon.
\]

We will illustrate throughout these notes the bounds we prove with a small numerical example: in the case where $N = 1000$, $\epsilon = 0.01$ and $r(\theta) = 0.2$, we get with a confidence level of $0.99$ that $R(\theta)$ is bounded by roughly $0.24$, the infimum in $\lambda$ being reached for $\lambda = 234$.

Now, to proceed towards the analysis of posterior distributions, let us put $U_\lambda(\theta, \omega) = \lambda \big( \Phi_{\lambda/N}\big[ R(\theta) \big] - r(\theta, \omega) \big)$ for short, and let us consider some prior probability distribution $\pi \in \mathcal{M}(\Theta, \mathcal{T})$. A proper choice of $\pi$ will be an important question, underlying much of the material presented in this monograph, so for the time being, let us only say that we will let this choice be as open as possible by writing inequalities which hold for any choice of $\pi$. Let us insist on the fact that when we say that $\pi$ is a prior distribution, we mean that it does not depend on the training sample $(X_i, Y_i)_{i=1}^N$. The quantity of interest to obtain the bound we are looking for is $\log\big\{ P\big[ \pi\big[ \exp(U_\lambda) \big] \big] \big\}$. Using Fubini's theorem for non-negative functions, we see that
\[
\log\Big\{ P\Big[ \pi\big[ \exp(U_\lambda) \big] \Big] \Big\} = \log\Big\{ \pi\Big[ P\big[ \exp(U_\lambda) \big] \Big] \Big\} \le 0.
\]
To relate this quantity to the expectation $\rho(U_\lambda)$ with respect to any posterior distribution $\rho : \Omega \to \mathcal{M}(\Theta)$, we will use the properties of the Kullback divergence $\mathcal{K}(\rho, \pi)$ of $\rho$ with respect to $\pi$, which is defined as
\[
\mathcal{K}(\rho, \pi) =
\begin{cases}
\displaystyle\int \log\Big( \frac{d\rho}{d\pi} \Big) \, d\rho, & \text{when } \rho \text{ is absolutely continuous with respect to } \pi, \\
+\infty, & \text{otherwise.}
\end{cases}
\]
Lemma 1.1.3 below shows in which sense the Kullback divergence function can be thought of as the dual of the log-Laplace transform.
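Before stating the duality lemma, here is a short numerical sketch (not part of the original text) of the bound of Lemma 1.1.2, reproducing the order of magnitude quoted in the example above; the grid of $\lambda$ values is an arbitrary illustration choice.

```python
import numpy as np

def phi_inv(a, q):
    """Inverse of Phi_a: Phi_a^{-1}(q) = (1 - exp(-a*q)) / (1 - exp(-a))."""
    return (1.0 - np.exp(-a * q)) / (1.0 - np.exp(-a))

def lemma_1_1_2_bound(r, N, eps, lam):
    """Upper bound on R(theta) holding with probability at least 1 - eps, for a fixed lambda."""
    return phi_inv(lam / N, r - np.log(eps) / lam)

N, eps, r = 1000, 0.01, 0.2
lambdas = np.arange(1.0, 2000.0)                  # grid over which the infimum is approximated
bounds = lemma_1_1_2_bound(r, N, eps, lambdas)
k = np.argmin(bounds)
print(f"best lambda ~ {lambdas[k]:.0f}, bound on R(theta) ~ {bounds[k]:.4f}")
# Prints a bound close to 0.24, attained for lambda around 234 (the optimum is very flat).
```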
Lemma 1.1.3. For any bounded measurable function $h : \Theta \to \mathbb{R}$, and any probability distribution $\rho \in \mathcal{M}(\Theta)$ such that $\mathcal{K}(\rho, \pi) < \infty$,
\[
\log\big\{ \pi\big[ \exp(h) \big] \big\} = \rho(h) - \mathcal{K}(\rho, \pi) + \mathcal{K}\big( \rho, \pi_{\exp(h)} \big),
\]
where by definition
\[
\frac{d\pi_{\exp(h)}}{d\pi} = \frac{\exp\big[ h(\theta) \big]}{\pi\big[ \exp(h) \big]}.
\]
Consequently
\[
\log\big\{ \pi\big[ \exp(h) \big] \big\} = \sup_{\rho \in \mathcal{M}(\Theta)} \rho(h) - \mathcal{K}(\rho, \pi).
\]

The proof is just a matter of writing down the definition of the quantities involved and using the fact that the Kullback divergence function is non-negative, and can be found in Catoni (2004, page 160). In the duality between measurable functions and probability measures, we thus see that the log-Laplace transform with respect to $\pi$ is the Legendre transform of the Kullback divergence function with respect to $\pi$. Using this, we get
\[
P\Big\{ \exp\Big[ \sup_{\rho \in \mathcal{M}(\Theta)} \rho\big( U_\lambda \big) - \mathcal{K}(\rho, \pi) \Big] \Big\} \le 1,
\]
which, combined with the convexity of $\lambda \Phi_{\lambda/N}$, proves the basic inequality we were looking for.

Theorem 1.1.4. For any real constant $\lambda$,
\[
P\bigg\{ \exp\bigg[ \sup_{\rho \in \mathcal{M}(\Theta)} \lambda \Big( \Phi_{\lambda/N}\big[ \rho(R) \big] - \rho(r) \Big) - \mathcal{K}(\rho, \pi) \bigg] \bigg\} \le P\bigg\{ \exp\bigg[ \sup_{\rho \in \mathcal{M}(\Theta)} \lambda \Big( \rho\big( \Phi_{\lambda/N} \circ R \big) - \rho(r) \Big) - \mathcal{K}(\rho, \pi) \bigg] \bigg\} \le 1.
\]

We insist on the fact that in this theorem, we take a supremum in $\rho \in \mathcal{M}(\Theta)$ inside the expectation with respect to $P$, the sample distribution. This means that the proved inequality holds for any $\rho$ depending on the training sample, that is for any posterior distribution: indeed, measurability questions set aside,
\[
P\Big\{ \exp\Big[ \sup_{\rho \in \mathcal{M}(\Theta)} \rho\big( U_\lambda \big) - \mathcal{K}(\rho, \pi) \Big] \Big\} = \sup_{\rho : \Omega \to \mathcal{M}(\Theta)} P\Big\{ \exp\Big[ \rho\big( U_\lambda \big) - \mathcal{K}(\rho, \pi) \Big] \Big\},
\]
and in particular
\[
\sup_{\rho : \Omega \to \mathcal{M}(\Theta)} P\Big\{ \exp\Big[ \rho\big( U_\lambda \big) - \mathcal{K}(\rho, \pi) \Big] \Big\} \le P\Big\{ \exp\Big[ \sup_{\rho \in \mathcal{M}(\Theta)} \rho\big( U_\lambda \big) - \mathcal{K}(\rho, \pi) \Big] \Big\},
\]
where the supremum in $\rho$ taken in the left-hand side is restricted to regular conditional probability distributions. The following sections will show how to use this theorem.

At least three sorts of bounds can be deduced from Theorem 1.1.4. The most interesting ones with which to build estimators and tune parameters, as well as the first that have been considered in the development of the PAC-Bayesian approach, are deviation bounds. They provide an empirical upper bound for $\rho(R)$ — that is a bound which can be computed from observed data — with some probability $1 - \epsilon$, where $\epsilon$ is a presumably small and tunable parameter setting the desired confidence level.

Anyhow, most of the results about the convergence speed of estimators to be found in the statistical literature are concerned with the expectation $P\big[ \rho(R) \big]$, therefore it is also enlightening to bound this quantity.
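Before turning to these bounds, the following sketch (an illustration added here, with a randomly generated prior and function $h$) checks Lemma 1.1.3 numerically on a finite parameter set: the supremum of $\rho(h) - \mathcal{K}(\rho, \pi)$ is attained at the Gibbs measure $\pi_{\exp(h)}$ and equals $\log \pi[\exp(h)]$.

```python
import numpy as np

rng = np.random.default_rng(3)

# Finite parameter set of size n, arbitrary prior pi and bounded function h (made up here).
n = 50
pi = rng.dirichlet(np.ones(n))
h = rng.normal(size=n)

lhs = np.log(np.sum(pi * np.exp(h)))                 # log pi[exp(h)]

gibbs = pi * np.exp(h) / np.sum(pi * np.exp(h))      # pi_{exp(h)}, the maximizer in Lemma 1.1.3

def legendre_term(rho):
    """rho(h) - K(rho, pi) for a probability vector rho with positive entries."""
    return float(rho @ h - np.sum(rho * np.log(rho / pi)))

print(f"log pi[exp(h)]             = {lhs:.6f}")
print(f"value at the Gibbs measure = {legendre_term(gibbs):.6f}")                       # equals the left-hand side
print(f"value at a random rho      = {legendre_term(rng.dirichlet(np.ones(n))):.6f}")   # never larger
```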
In order to know at which rate $P\big[ \rho(R) \big]$ may be approaching $\inf_\Theta R$, a non-random upper bound is required, which will relate the average of the expected risk $P\big[ \rho(R) \big]$ with the properties of the contrast function $\theta \mapsto R(\theta)$.

Since the values of constants do matter a lot when a bound is to be used to select between various estimators using classification models of various complexities, a third kind of bound, related to the first, may be considered for the sake of its hopefully better constants: we will call them unbiased empirical bounds, to stress the fact that they provide some empirical quantity whose expectation under $P$ can be proved to be an upper bound for $P\big[ \rho(R) \big]$, the average expected risk. The price to pay for these better constants is of course the lack of formal guarantee given by the bound: two random variables whose expectations are ordered in a certain way may very well be ordered in the reverse way with a large probability, so that basing the estimation of parameters or the selection of an estimator on some unbiased empirical bound is a hazardous business. Anyhow, since it is common practice to use the inequalities provided by mathematical statistical theory while replacing the proven constants with smaller values showing a better practical efficiency, considering unbiased empirical bounds as well as deviation bounds provides an indication about how much the constants may be decreased while not violating the theory too much.

Let $\rho : \Omega \to \mathcal{M}(\Theta)$ be some fixed (and arbitrary) posterior distribution, describing some randomized estimator $\widehat{\theta} : \Omega \to \Theta$. As we already mentioned, in these notes a posterior distribution will always be a regular conditional probability measure. By this we mean that
• for any $A \in \mathcal{T}$, the map $\omega \mapsto \rho(\omega, A) : \big(\Omega, (\mathcal{B} \otimes \mathcal{B}')^{\otimes N}\big) \to \mathbb{R}_+$ is assumed to be measurable;
• for any $\omega \in \Omega$, the map $A \mapsto \rho(\omega, A) : \mathcal{T} \to \mathbb{R}_+$ is assumed to be a probability measure.

We will also assume without further notice that the $\sigma$-algebras we deal with are always countably generated. The technical implications of these assumptions are standard and discussed for instance in Catoni (2004, pages 50–54), where, among other things, a detailed proof of the decomposition of the Kullback–Leibler divergence is given.

Let us restrict to the case when the constant $\lambda$ is positive. We get from Theorem 1.1.4 that
\[
(1.2) \qquad \exp\bigg[ \lambda \Big( \Phi_{\lambda/N}\big[ P\big[ \rho(R) \big] \big] - P\big[ \rho(r) \big] \Big) - P\big[ \mathcal{K}(\rho, \pi) \big] \bigg] \le 1,
\]
where we have used the convexity of the exp function and of $\Phi_{\lambda/N}$. Since we have restricted our attention to positive values of the constant $\lambda$, equation (1.2) can also be written
\[
P\big[ \rho(R) \big] \le \Phi_{\lambda/N}^{-1}\Big\{ P\Big[ \rho(r) + \lambda^{-1} \mathcal{K}(\rho, \pi) \Big] \Big\},
\]
leading to

Theorem 1.2.1. For any posterior distribution $\rho : \Omega \to \mathcal{M}(\Theta)$, for any positive parameter $\lambda$,
\[
P\big[ \rho(R) \big] \le \frac{1 - \exp\Big[ - \frac{1}{N} P\big[ \lambda \rho(r) + \mathcal{K}(\rho, \pi) \big] \Big]}{1 - \exp\big( - \frac{\lambda}{N} \big)} \le P\Bigg\{ \frac{\frac{\lambda}{N}}{1 - \exp\big( - \frac{\lambda}{N} \big)} \bigg[ \rho(r) + \frac{\mathcal{K}(\rho, \pi)}{\lambda} \bigg] \Bigg\}.
\]

The last inequality provides the unbiased empirical upper bound for $\rho(R)$ we were looking for, meaning that the expectation of
\[
\frac{\frac{\lambda}{N}}{1 - \exp\big( - \frac{\lambda}{N} \big)} \bigg[ \rho(r) + \frac{\mathcal{K}(\rho, \pi)}{\lambda} \bigg]
\]
is larger than the expectation of $\rho(R)$. Let us notice that
\[
1 \le \frac{\frac{\lambda}{N}}{1 - \exp\big( - \frac{\lambda}{N} \big)} \le \Big( 1 - \frac{\lambda}{2N} \Big)^{-1},
\]
and therefore that this coefficient is close to 1 when $\lambda$ is significantly smaller than $N$.

If we are ready to believe in this bound (although this belief is not mathematically well founded, as we already mentioned), we can use it to optimize $\lambda$ and to choose $\rho$. While the optimal choice of $\rho$ when $\lambda$ is fixed is, according to Lemma 1.1.3 (page 4), to take it equal to $\pi_{\exp(-\lambda r)}$, a Gibbs posterior distribution, as it is sometimes called, we may for computational reasons be more interested in choosing $\rho$ in some other class of posterior distributions.

For instance, our real interest may be to select some non-randomized estimator from a family $\widehat{\theta}_m : \Omega \to \Theta_m$, $m \in M$, of possible ones, where the $\Theta_m$ are measurable subsets of $\Theta$ and where $M$ is an arbitrary (not necessarily countable) index set. We may for instance think of the case when $\widehat{\theta}_m \in \arg\min_{\Theta_m} r$. We may slightly randomize the estimators to start with, considering for any $\theta \in \Theta_m$ and any $m \in M$,
\[
\Delta_m(\theta) = \Big\{ \theta' \in \Theta_m : \big[ f_{\theta'}(X_i) \big]_{i=1}^N = \big[ f_\theta(X_i) \big]_{i=1}^N \Big\},
\]
and defining $\rho_m$ by the formula
\[
\frac{d\rho_m}{d\pi}(\theta) = \frac{\mathbb{1}\big[ \theta \in \Delta_m(\widehat{\theta}_m) \big]}{\pi\big[ \Delta_m(\widehat{\theta}_m) \big]}.
\]
Our posterior minimizes $\mathcal{K}(\rho, \pi)$ among those distributions whose support is restricted to the values of $\theta$ in $\Theta_m$ for which the classification rule $f_\theta$ is identical to the estimated one $f_{\widehat{\theta}_m}$ on the observed sample. Presumably, in many practical situations, $f_\theta(x)$ will be $\rho_m$ almost surely identical to $f_{\widehat{\theta}_m}(x)$ when $\theta$ is drawn from $\rho_m$, for the vast majority of the values of $x \in \mathcal{X}$ and all the sub-models $\Theta_m$ not plagued with too much overfitting (since this is by construction the case when $x \in \{ X_i : i = 1, \dots, N \}$). Therefore replacing $\widehat{\theta}_m$ with $\rho_m$ can be expected to be a minor change in many situations. (A small numerical sketch of the unbiased bound of Theorem 1.2.1, computed for a Gibbs posterior on a toy finite model, is given below.)
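The following sketch is an illustration added here, not part of the original text: it reuses the toy one-dimensional threshold model with synthetic data, a flat prior on a finite grid of parameters, and an arbitrarily chosen $\lambda = 250$, and evaluates the Gibbs posterior $\pi_{\exp(-\lambda r)}$, its divergence $\mathcal{K}(\rho, \pi)$, and the unbiased empirical upper bound appearing in Theorem 1.2.1.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy finite model (assumed for illustration): thresholds theta on a grid, flat prior pi.
N, noise = 1000, 0.3
X = rng.uniform(size=N)
Y = np.where(rng.uniform(size=N) < noise, 1 - (X > 0.5), (X > 0.5)).astype(int)
thetas = np.linspace(0.0, 1.0, 201)
pi = np.full(thetas.size, 1.0 / thetas.size)
r = np.array([((X > t).astype(int) != Y).mean() for t in thetas])   # empirical error rates

lam = 250.0
# Gibbs posterior pi_{exp(-lam * r)}; r is shifted by its minimum for numerical stability.
w = pi * np.exp(-lam * (r - r.min()))
rho = w / w.sum()

rho_r = float(rho @ r)                              # rho(r)
K = float(np.sum(rho * np.log(rho / pi)))           # Kullback divergence K(rho, pi)

coeff = (lam / N) / (1.0 - np.exp(-lam / N))
unbiased_bound = coeff * (rho_r + K / lam)          # right-hand side of Theorem 1.2.1
print(f"rho(r) = {rho_r:.3f},  K(rho, pi) = {K:.2f},  unbiased bound = {unbiased_bound:.3f}")
```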
The change incurred by replacing $\widehat{\theta}_m$ with $\rho_m$ can, by the way, be estimated in the (admittedly not so common) case when the distribution of the patterns $(X_i)_{i=1}^N$ is known. Indeed, introducing the pseudo-distance
\[
(1.3) \qquad D(\theta, \theta') = \frac{1}{N} \sum_{i=1}^N P\big[ f_\theta(X_i) \neq f_{\theta'}(X_i) \big], \qquad \theta, \theta' \in \Theta,
\]
one immediately sees that $R(\theta') \le R(\theta) + D(\theta, \theta')$, for any $\theta, \theta' \in \Theta$, and therefore that
\[
R(\widehat{\theta}_m) \le \rho_m(R) + \rho_m\big[ D(\cdot, \widehat{\theta}_m) \big].
\]
Let us notice also that in the case where $\Theta_m \subset \mathbb{R}^{d_m}$, and $R$ happens to be convex on $\Delta_m(\widehat{\theta}_m)$, then $\rho_m(R) \ge R\big[ \int \theta \, \rho_m(d\theta) \big]$, and we can replace $\widehat{\theta}_m$ with $\widetilde{\theta}_m = \int \theta \, \rho_m(d\theta)$, and obtain bounds for $R(\widetilde{\theta}_m)$. This is not a very heavy assumption about $R$, in the case where we consider $\widehat{\theta}_m \in \arg\min_{\Theta_m} r$. Indeed, $\widehat{\theta}_m$, and therefore $\Delta_m(\widehat{\theta}_m)$, will presumably be close to $\arg\min_{\Theta_m} R$, and requiring a function to be convex in the neighbourhood of its minima is not a very strong assumption.

Since $r(\widehat{\theta}_m) = \rho_m(r)$, and $\mathcal{K}(\rho_m, \pi) = - \log\big\{ \pi\big[ \Delta_m(\widehat{\theta}_m) \big] \big\}$, our unbiased empirical upper bound in this context reads as
\[
\frac{\frac{\lambda}{N}}{1 - \exp\big( - \frac{\lambda}{N} \big)} \Bigg( r(\widehat{\theta}_m) - \frac{\log\big\{ \pi\big[ \Delta_m(\widehat{\theta}_m) \big] \big\}}{\lambda} \Bigg).
\]
Let us notice that we obtain a complexity factor $- \log\big\{ \pi\big[ \Delta_m(\widehat{\theta}_m) \big] \big\}$ which may be compared with the Vapnik–Cervonenkis dimension. Indeed, in the case of binary classification, when using a classification model with Vapnik–Cervonenkis dimension not greater than $h_m$, that is when any subset of $\mathcal{X}$ which can be split in any arbitrary way by some classification rule $f_\theta$ of the model $\Theta_m$ has at most $h_m$ points, then
\[
\big\{ \Delta_m(\theta) : \theta \in \Theta_m \big\}
\]
is a partition of $\Theta_m$ with at most $\big( \frac{eN}{h_m} \big)^{h_m}$ components: these facts, if not already familiar to the reader, will be proved in Theorems 4.2.2 and 4.2.3 (page 144). Therefore
\[
\inf_{\theta \in \Theta_m} - \log\big\{ \pi\big[ \Delta_m(\theta) \big] \big\} \le h_m \log\Big( \frac{eN}{h_m} \Big) - \log\big[ \pi(\Theta_m) \big].
\]
Thus, if the model and prior distribution are well suited to the classification task, in the sense that there is more "room" (where room is measured with $\pi$) between the two clusters defined by $\widehat{\theta}_m$ than between other partitions of the sample of patterns $(X_i)_{i=1}^N$, then we will have
\[
- \log\big\{ \pi\big[ \Delta_m(\widehat{\theta}_m) \big] \big\} \le h_m \log\Big( \frac{eN}{h_m} \Big) - \log\big[ \pi(\Theta_m) \big].
\]
An optimal value $\widehat{m}$ may be selected so that
\[
\widehat{m} \in \arg\min_{m \in M} \Bigg\{ \inf_{\lambda \in \mathbb{R}_+} \frac{\frac{\lambda}{N}}{1 - \exp\big( - \frac{\lambda}{N} \big)} \Bigg( r(\widehat{\theta}_m) - \frac{\log\big\{ \pi\big[ \Delta_m(\widehat{\theta}_m) \big] \big\}}{\lambda} \Bigg) \Bigg\}.
\]
Since $\rho_{\widehat{m}}$ is still another posterior distribution, we can be sure that
\[
P\Big\{ R(\widehat{\theta}_{\widehat{m}}) - \rho_{\widehat{m}}\big[ D(\cdot, \widehat{\theta}_{\widehat{m}}) \big] \Big\} \le P\big[ \rho_{\widehat{m}}(R) \big] \le \inf_{\lambda \in \mathbb{R}_+} P\Bigg\{ \frac{\frac{\lambda}{N}}{1 - \exp\big( - \frac{\lambda}{N} \big)} \Bigg( r(\widehat{\theta}_{\widehat{m}}) - \frac{\log\big\{ \pi\big[ \Delta_{\widehat{m}}(\widehat{\theta}_{\widehat{m}}) \big] \big\}}{\lambda} \Bigg) \Bigg\}.
\]
Taking the infimum in $\lambda$ inside the expectation with respect to $P$ would be possible at the price of some supplementary technicalities and a slight increase of the bound, that we prefer to postpone to the discussion of deviation bounds, since they are the only ones to provide a rigorous mathematical foundation to the adaptive selection of estimators.

In this section we address some technical issues we think helpful to the understanding of Theorem 1.2.1 (page 6): namely to investigate how the upper bound it provides could be optimized, or at least approximately optimized, in $\lambda$.
It turns out that this can be done quite explicitly.

So we will consider in this discussion the posterior distribution $\rho : \Omega \to \mathcal{M}(\Theta)$ to be fixed, and our aim will be to eliminate the constant $\lambda$ from the bound by choosing its value in some nearly optimal way as a function of $P\big[ \rho(r) \big]$, the average of the empirical risk, and of $P\big[ \mathcal{K}(\rho, \pi) \big]$, which controls overfitting. Let the bound be written as
\[
\varphi(\lambda) = \Big[ 1 - \exp\Big( - \frac{\lambda}{N} \Big) \Big]^{-1} \bigg\{ 1 - \exp\Big[ - \frac{\lambda}{N} P\big[ \rho(r) \big] - \frac{1}{N} P\big[ \mathcal{K}(\rho, \pi) \big] \Big] \bigg\}.
\]
We see that
N ∂∂λ log (cid:2) ϕ ( λ ) (cid:3) = P (cid:2) ρ ( r ) (cid:3) exp h λN P (cid:2) ρ ( r ) (cid:3) + N − P (cid:2) K ( ρ, π ) (cid:3)i − − λN ) − . Thus, the optimal value for λ is such that (cid:2) exp( λN ) − (cid:3) P (cid:2) ρ ( r ) (cid:3) = exp h λN P (cid:2) ρ ( r ) (cid:3) + N − P (cid:2) K ( ρ, π ) (cid:3)i − . Assuming that 1 ≫ λN P (cid:2) ρ ( r ) (cid:3) ≫ P [ K ( ρ,π )] N , and keeping only higher order terms, weare led to choose λ = s N P (cid:2) K ( ρ, π ) (cid:3) P (cid:2) ρ ( r ) (cid:3)(cid:8) − P (cid:2) ρ ( r ) (cid:3)(cid:9) , obtaining .2. Non local bounds Theorem 1.2.2 . For any posterior distribution ρ : Ω → M (Θ) , P (cid:2) ρ ( R ) (cid:3) ≤ − exp n − q P [ K ( ρ,π )] P [ ρ ( r )] N { − P [ ρ ( r )] } − P [ K ( ρ,π )] N o − exp n − q P [ K ( ρ,π )] N P [ ρ ( r )] { − P [ ρ ( r )] } o . This result of course is not very useful in itself, since neither of the two quantities P (cid:2) ρ ( r ) (cid:3) and P (cid:2) K ( ρ, π ) (cid:3) are easy to evaluate. Anyhow it gives a hint that replacingthem boldly with ρ ( r ) and K ( ρ, π ) could produce something close to a legitimateempirical upper bound for ρ ( R ). We will see in the subsection about deviationbounds that this is indeed essentially true.Let us remark that in the third chapter of this monograph, we will see anotherway of bounding inf λ ∈ R + Φ − λN (cid:18) q + dλ (cid:19) , leading to Theorem 1.2.3 . For any prior distribution π ∈ M (Θ) , for any posterior distri-bution ρ : Ω → M (Θ) , P (cid:2) ρ ( R ) (cid:3) ≤ P (cid:2) K ( ρ, π ) (cid:3) N ! − ( P (cid:2) ρ ( r ) (cid:3) + P (cid:2) K ( ρ, π ) (cid:3) N + s P (cid:2) K ( ρ, π ) (cid:3) P (cid:2) ρ ( r ) (cid:3)(cid:8) − P (cid:2) ρ ( r ) (cid:3)(cid:9) N + P (cid:2) K ( ρ, π ) (cid:3) N ) , as soon as P (cid:2) ρ ( r ) (cid:3) + s P (cid:2) K ( ρ, π ) (cid:3) N ≤ , and P (cid:2) ρ ( R ) (cid:3) ≤ P (cid:2) ρ ( r ) (cid:3) + s P (cid:2) K ( ρ, π ) (cid:3) N otherwise. This theorem enlightens the influence of three terms on the average expectedrisk: • the average empirical risk, P (cid:2) ρ ( r ) (cid:3) , which as a rule will decrease as the size ofthe classification model increases, acts as a bias term, grasping the ability of themodel to account for the observed sample itself; • a variance term N P (cid:2) ρ ( r ) (cid:3)(cid:8) − P (cid:2) ρ ( r ) (cid:3)(cid:9) is due to the random fluctuations of ρ ( r ); • a complexity term P (cid:2) K ( ρ, π ) (cid:3) , which as a rule will increase with the size ofthe classification model, eventually acts as a multiplier of the variance term.We observed numerically that the bound provided by Theorem 1.2.2 is betterthan the more classical Vapnik-like bound of Theorem 1.2.3. For instance, when N = 1000, P (cid:2) ρ ( r ) (cid:3) = 0 . P (cid:2) K ( ρ, π ) (cid:3) = 10, Theorem 1.2.2 gives a bound lowerthan 0 . . Chapter 1. Inductive PAC-Bayesian learning
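The nearly optimal value of $\lambda$ obtained above, $\lambda = \sqrt{ N \, P[\mathcal{K}(\rho, \pi)] \, / \, \big( P[\rho(r)] \{ 1 - P[\rho(r)] \} \big) }$, is easy to compare with a direct numerical optimization of the bound of Theorem 1.2.1. The sketch below does this for the values quoted in the text ($N = 1000$, $P[\rho(r)] = 0.2$, $P[\mathcal{K}(\rho, \pi)] = 10$); it is an illustration only, and the printed figures are produced by this code rather than taken from the monograph.

```python
import numpy as np

def theorem_1_2_1_bound(r_bar, K_bar, N, lam):
    """First right-hand side of Theorem 1.2.1, with P[rho(r)] = r_bar and P[K(rho, pi)] = K_bar."""
    return (1.0 - np.exp(-lam * r_bar / N - K_bar / N)) / (1.0 - np.exp(-lam / N))

N, r_bar, K_bar = 1000, 0.2, 10.0

# Nearly optimal lambda suggested by the derivation of Theorem 1.2.2
lam_star = np.sqrt(N * K_bar / (r_bar * (1.0 - r_bar)))
print(f"lambda* = {lam_star:.0f}, bound = {theorem_1_2_1_bound(r_bar, K_bar, N, lam_star):.4f}")

# Compare with a brute-force minimization over a lambda grid
lams = np.arange(1.0, 3000.0)
vals = theorem_1_2_1_bound(r_bar, K_bar, N, lams)
k = np.argmin(vals)
print(f"grid optimum: lambda = {lams[k]:.0f}, bound = {vals[k]:.4f}")
# lambda* is close to, but not exactly, the grid minimizer; the two bounds differ only slightly.
```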
It is time now to come to less tentative results and see how far is the averageexpected error rate P (cid:2) ρ ( R ) (cid:3) from its best possible value inf Θ R .Let us notice first that λρ ( r ) + K ( ρ, π ) = K ( ρ, π exp( − λr ) ) − log n π (cid:2) exp( − λr ) (cid:3)o . Let us remark moreover that r log h π (cid:2) exp( − λr ) (cid:3)i is a convex functional, a prop-erty which from a technical point of view can be dealt with in the following way:(1.4) P n log h π (cid:2) exp( − λr ) (cid:3)io = P n sup ρ ∈ M (Θ) − λρ ( r ) − K ( ρ, π ) o ≥ sup ρ ∈ M (Θ) P n − λρ ( r ) − K ( ρ, π ) o = sup ρ ∈ M (Θ) − λρ ( R ) − K ( ρ, π )= log n π (cid:2) exp( − λR ) (cid:3)o = − R λ π exp( − βR ) ( R ) dβ. These remarks applied to Theorem 1.2.1 lead to
Theorem 1.2.4 . For any posterior distribution ρ : Ω → M (Θ) , for any positiveparameter λ , P (cid:2) ρ ( R ) (cid:3) ≤ − exp n − N R λ π exp( − βR ) ( R ) dβ − N P (cid:2) K ( ρ, π exp( − λr ) ) (cid:3)o − exp( − λN ) ≤ N (cid:2) − exp( − λN ) (cid:3) nR λ π exp( − βR ) ( R ) dβ + P (cid:2) K ( ρ, π exp( − λr ) ) (cid:3)o . This theorem is particularly well suited to the case of the Gibbs posterior distri-bution ρ = π exp( − λr ) , where the entropy factor cancels and where P (cid:2) π exp( − λr ) ( R ) (cid:3) is shown to get close to inf Θ R when N goes to + ∞ , as soon as λ/N goes to 0 while λ goes to + ∞ .We can elaborate on Theorem 1.2.4 and define a notion of dimension of (Θ , R ),with margin η ≥ d η (Θ , R ) = sup β ∈ R + β (cid:2) π exp( − βR ) ( R ) − ess inf π R − η (cid:3) ≤ − log n π (cid:2) R ≤ ess inf π R + η (cid:3)o . This last inequality can be established by the chain of inequalities: βπ exp( − βR ) ( R ) ≤ R β π exp( − γR ) ( R ) dγ = − log n π (cid:2) exp( − βR ) (cid:3)o ≤ β (cid:16) ess inf π R + η (cid:17) − log h π (cid:0) R ≤ ess inf π R + η (cid:1)i , where we have used successively the fact that λ π exp( − λR ) ( R ) is decreasing(because it is the derivative of the concave function λ
7→ − log (cid:8) π (cid:2) exp( − λR ) (cid:3)(cid:9) )and the fact that the exponential function takes positive values. .2. Non local bounds d (Θ , R ) will be finite, and in all circumstances d η (Θ , R ) will be finite for any η > Z λ π exp( − βR ) ( R ) dβ ≤ λ (cid:0) ess inf π R + η (cid:1) + Z λ (cid:20) d η β ∧ (1 − ess inf π R − η ) (cid:21) dβ = λ (cid:0) ess inf π R + η (cid:1) + d η (Θ , R ) log (cid:20) eλd η (Θ , R ) (cid:0) − ess inf π R − η (cid:1)(cid:21) . This leads to
Corollary 1.2.5
With the above notation, for any margin η ∈ R + , for any poste-rior distribution ρ : Ω → M (Θ) , P (cid:2) ρ ( R ) (cid:3) ≤ inf λ ∈ R + Φ − λN " ess inf π R + η + d η λ log (cid:18) eλd η (cid:19) + P (cid:8) K (cid:2) ρ, π exp( − λr ) (cid:3)(cid:9) λ . If one wants a posterior distribution with a small support, the theorem can alsobe applied to the case when ρ is obtained by truncating π exp( − λr ) to some levelset to reduce its support: let Θ p = { θ ∈ Θ : r ( θ ) ≤ p } , and let us define for any q ∈ )0 ,
1) the level p q = inf { p : π exp( − λr ) (Θ p ) ≥ q } , let us then define ρ q by itsdensity dρ q dπ exp( − λr ) ( θ ) = ( θ ∈ Θ p q ) π exp( − λr ) (Θ p q ) , then ρ = π exp( − λr ) and for any q ∈ (0 , P (cid:2) ρ q ( R ) (cid:3) ≤ − exp n − N R λ π exp( − βR ) ( R ) dβ − log( q ) N o − exp( − λN ) ≤ N (cid:2) − exp( − λN ) (cid:3) nR λ π exp( − βR ) ( R ) dβ − log( q ) o . They provide results holding under the distribution P of the sample with probabilityat least 1 − ǫ , for any given confidence level, set by the choice of ǫ ∈ )0 , − ǫ ) sure to do the right thing,although this right thing may be over-pessimistic, since deviation upper bounds arelarger than corresponding non-biased bounds.Starting again from Theorem 1.1.4 (page 4), and using Markov’s inequality P (cid:2) exp( h ) ≥ (cid:3) ≤ P (cid:2) exp( h ) (cid:3) , we obtain Theorem 1.2.6 . For any positive parameter λ , with P probability at least − ǫ ,for any posterior distribution ρ : Ω → M (Θ) , ρ ( R ) ≤ Φ − λN (cid:26) ρ ( r ) + K ( ρ, π ) − log( ǫ ) λ (cid:27) Chapter 1. Inductive PAC-Bayesian learning = 1 − exp (cid:26) − λρ ( r ) N − K ( ρ, π ) − log( ǫ ) N (cid:27) − exp (cid:0) − λN (cid:1) ≤ λN (cid:2) − exp (cid:0) − λN (cid:1)(cid:3) (cid:20) ρ ( r ) + K ( ρ, π ) − log( ǫ ) λ (cid:21) . We see that for a fixed value of the parameter λ , the upper bound is optimizedwhen the posterior is chosen to be the Gibbs distribution ρ = π exp( − λr ) .In this theorem, we have bounded ρ ( R ), the average expected risk of an estimator b θ drawn from the posterior ρ . This is what we will do most of the time in this study.This is the error rate we will get if we classify a large number of test patterns,drawing a new b θ for each one. However, we can also be interested in the error ratewe get if we draw only one b θ from ρ and use this single draw of b θ to classify alarge number of test patterns. This error rate is R ( b θ ). To state a result about itsdeviations, we can start back from Lemma 1.1.1 (page 3) and integrate it withrespect to the prior distribution π to get for any real constant λ P (cid:26) π (cid:20) exp n λ h Φ λN (cid:0) R (cid:1) − r io(cid:21)(cid:27) ≤ . For any posterior distribution ρ : Ω → M (Θ), this can be rewritten as P (cid:26) ρ (cid:20) exp n λ h Φ λN (cid:0) R (cid:1) − r i − log (cid:0) dρdπ (cid:1) + log( ǫ ) io(cid:21)(cid:27) ≤ ǫ, proving Theorem 1.2.7
For any positive real parameter λ , for any posterior distribution ρ : Ω → M (Θ) , with P ρ probability at least − ǫ , R ( b θ ) ≤ Φ − λN (cid:26) r ( b θ ) + λ − log (cid:18) ǫ − dρdπ (cid:19)(cid:27) ≤ λN (cid:2) − exp( − λN ) (cid:3) (cid:20) r ( b θ ) + λ − log (cid:18) ǫ − dρdπ (cid:19)(cid:21) . Let us remark that the bound provided here is the exact counterpart of the boundof Theorem 1.2.6, since log (cid:0) dρdπ (cid:1) appears as a disintegrated version of the divergence K ( ρ, π ). The parallel between the two theorems is particularly striking in the specialcase when ρ = π exp( − λr ) . Indeed Theorem 1.2.6 proves that with P probability atleast 1 − ǫ , π exp( − λr ) ( R ) ≤ Φ − λN (cid:26) − log (cid:8) π (cid:2) exp (cid:0) − λr (cid:1)(cid:3)(cid:9) + log( ǫ ) λ (cid:27) , whereas Theorem 1.2.7 proves that with P π exp( − λr ) probability at least 1 − ǫR ( b θ ) ≤ Φ − λN (cid:26) − log (cid:8) π (cid:2) exp (cid:0) − λr (cid:1)(cid:3)(cid:9) + log( ǫ ) λ (cid:27) , showing that we get the same deviation bound for π exp( − λr ) ( R ) under P and for b θ under P π exp( − λr ) . .2. Non local bounds λ the bound givenby Theorem 1.2.6 (the same discussion would apply to Theorem 1.2.7). Let usnotice first that values of λ less than 1 are not interesting (because they provide abound larger than one, at least as soon as ǫ ≤ exp( − α >
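As a sketch of how the deviation bound of Theorem 1.2.6 specializes to a Gibbs posterior, the code below evaluates $\Phi_{\lambda/N}^{-1}\big\{ -\lambda^{-1}\big( \log \pi[\exp(-\lambda r)] + \log \epsilon \big) \big\}$ on the same toy finite threshold model used in the earlier sketches; the synthetic data and the value of $\lambda$ are arbitrary illustration choices, not part of the original text.

```python
import numpy as np

rng = np.random.default_rng(2)

def phi_inv(a, q):
    """Phi_a^{-1}(q) = (1 - exp(-a*q)) / (1 - exp(-a))."""
    return (1.0 - np.exp(-a * q)) / (1.0 - np.exp(-a))

# Toy finite threshold model with a flat prior on the grid, synthetic noisy labels.
N, noise = 1000, 0.3
X = rng.uniform(size=N)
Y = np.where(rng.uniform(size=N) < noise, 1 - (X > 0.5), (X > 0.5)).astype(int)
thetas = np.linspace(0.0, 1.0, 201)
r = np.array([((X > t).astype(int) != Y).mean() for t in thetas])
log_pi = -np.log(thetas.size)                       # flat prior weight on each grid point

lam, eps = 250.0, 0.01
# log pi[exp(-lam r)], computed stably with a log-sum-exp shift
log_Z = log_pi - lam * r.min() + np.log(np.sum(np.exp(-lam * (r - r.min()))))

# Theorem 1.2.6 for the Gibbs posterior pi_{exp(-lam r)}: with probability at least 1 - eps,
#   pi_{exp(-lam r)}(R) <= Phi_{lam/N}^{-1}( -(log pi[exp(-lam r)] + log eps) / lam )
bound = phi_inv(lam / N, -(log_Z + np.log(eps)) / lam)
print(f"deviation bound on the Gibbs posterior error rate: {bound:.3f}")
```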
1, and the set Λ = { α k ; k ∈ N } , on which we put the probabilitymeasure ν ( α k ) = [( k +1)( k +2)] − . Applying Theorem 1.2.6 to λ = α k at confidencelevel 1 − ǫ ( k +1)( k +2) , and using a union bound, we see that with probability at least1 − ǫ , for any posterior distribution ρ , ρ ( R ) ≤ inf λ ′ ∈ Λ Φ − λ ′ N ρ ( r ) + K ( ρ, π ) − log( ǫ ) + 2 log h log( α λ ′ )log( α ) i λ ′ . Now we can remark that for any λ ∈ (1 , + ∞ (, there is λ ′ ∈ Λ such that α − λ ≤ λ ′ ≤ λ . Moreover, for any q ∈ (0 , β Φ − β ( q ) is increasing on R + . Thus withprobability at least 1 − ǫ , for any posterior distribution ρ , ρ ( R ) ≤ inf λ ∈ (1 , ∞ ( Φ − λN n ρ ( r ) + αλ h K ( ρ, π ) − log( ǫ ) + 2 log (cid:16) log( α λ )log( α ) (cid:17)io = inf λ ∈ (1 , ∞ ( − exp n − λN ρ ( r ) − αN h K ( ρ, π ) − log( ǫ ) + 2 log (cid:16) log( α λ )log( α ) (cid:17)io − exp( − λN ) . Taking the approximately optimal value λ = s N α [ K ( ρ, π ) − log( ǫ )] ρ ( r )[1 − ρ ( r )] , we obtain Theorem 1.2.8 . With probability − ǫ , for any posterior distribution ρ : Ω → M (Θ) , putting d ( ρ, ǫ ) = K ( ρ, π ) − log( ǫ ) , ρ ( R ) ≤ inf k ∈ N − exp (cid:26) − α k N ρ ( r ) − N h d ( ρ, ǫ ) + log (cid:2) ( k + 1)( k + 2) (cid:3)i(cid:27) − exp (cid:18) − α k N (cid:19) ≤ − exp − s αρ ( r ) d ( ρ, ǫ ) N [1 − ρ ( r )] − αN " d ( ρ, ǫ ) + 2 log (cid:18) log (cid:16) α q Nαd ( ρ,ǫ ) ρ ( r )[1 − ρ ( r )] (cid:17) log( α ) (cid:19) − exp " − s αd ( ρ, ǫ ) N ρ ( r )[1 − ρ ( r )] . Moreover with probability at least − ǫ , for any posterior distribution ρ such that ρ ( r ) = 0 , ρ ( R ) ≤ − exp (cid:20) − K ( ρ, π ) − log( ǫ ) N (cid:21) . We can also elaborate on the results in an other direction by introducing the empirical dimension (1.6) d e = sup β ∈ R + β (cid:2) π exp( − βr ) ( r ) − ess inf π r (cid:3) ≤ − log (cid:2) π (cid:0) r = ess inf π r (cid:1)(cid:3) . Chapter 1. Inductive PAC-Bayesian learning
There is no need to introduce a margin in this definition, since r takes at most N values, and therefore π (cid:0) r = ess inf π r (cid:1) is strictly positive. This leads to Corollary 1.2.9 . For any positive real constant λ , with P probability at least − ǫ ,for any posterior distribution ρ : Ω → M (Θ) , ρ ( R ) ≤ Φ − λN " ess inf π r + d e λ log (cid:18) eλd e (cid:19) + K (cid:2) ρ, π exp( − λr ) (cid:3) − log( ǫ ) λ . We could then make the bound uniform in λ and optimize this parameter in away similar to what was done to obtain Theorem 1.2.8. In this section, better bounds will be achieved through a better choice of the priordistribution. This better prior distribution turns out to depend on the unknownsample distribution P , and some work is required to circumvent this and obtainempirical bounds. As mentioned in the introduction, if one is willing to minimize the bound in ex-pectation provided by Theorem 1.2.1 (page 6), one is led to consider the optimalchoice π = P ( ρ ). However, this is only an ideal choice, since P is in all conceivablesituations unknown. Nevertheless it shows that it is possible through Theorem 1.2.1to measure the complexity of the classification model with P (cid:8) K (cid:2) ρ, P ( ρ ) (cid:3)(cid:9) , which isnothing but the mutual information between the random sample ( X i , Y i ) Ni =1 andthe estimated parameter ˆ θ , under the joint distribution P ρ .In practice, since we cannot choose π = P ( ρ ), we have to be content with a flat prior π , resulting in a bound measuring complexity according to P (cid:2) K ( ρ, π ) (cid:3) = P (cid:8) K (cid:2) ρ, P ( ρ ) (cid:3)(cid:9) + K (cid:2) P ( ρ ) , π (cid:3) larger by the entropy factor K (cid:2) P ( ρ ) , π (cid:3) than the optimalone (we are still commenting on Theorem 1.2.1).If we want to base the choice of π on Theorem 1.2.4 (page 10), and if we choose ρ = π exp( − λr ) to optimize this bound, we will be inclined to choose some π suchthat 1 λ R λ π exp( − βR ) ( R ) dβ = − λ log n π (cid:2) exp( − λR ) (cid:3)o is as far as possible close to inf θ ∈ Θ R ( θ ) in all circumstances. To give a more specificexample, in the case when the distribution of the design ( X i ) Ni =1 is known, one canintroduce on the parameter space Θ the metric D already defined by equation(1.3, page 7) (or some available upper bound for this distance). In view of the factthat R ( θ ) − R ( θ ′ ) ≤ D ( θ, θ ′ ), for any θ , θ ′ ∈ Θ, it can be meaningful, at leasttheoretically, to choose π as π = ∞ X k =1 k ( k + 1) π k , where π k is the uniform measure on some minimal (or close to minimal) 2 − k -net N (Θ , D, − k ) of the metric space (Θ , D ). With this choice .3. Local bounds − λ log n π (cid:2) exp( − λR ) (cid:3)o ≤ inf θ ∈ Θ R ( θ )+ inf k (cid:26) − k + log( | N (Θ , D, − k ) | ) + log[ k ( k + 1)] λ (cid:27) . Another possibility, when we have to deal with real valued parameters, meaningthat Θ ⊂ R d , is to code each real component θ i ∈ R of θ = ( θ i ) di =1 to someprecision and to use a prior µ which is atomic on dyadic numbers. More preciselylet us parametrize the set of dyadic real numbers as D = ( r (cid:2) s, m, p, ( b j ) pj =1 (cid:3) = s m (cid:18) p X j =1 b j − j (cid:19) : s ∈ {− , +1 } , m ∈ Z , p ∈ N , b j ∈ { , } ) , where, as can be seen, s codes the sign, m the order of magnitude, p the precisionand ( b j ) pj =1 the binary representation of the dyadic number r (cid:2) s, m, p, ( b j ) pj =1 (cid:3) . 
Wecan for instance consider on D the probability distribution(1.7) µ (cid:8) r (cid:2) s, m, p, ( b j ) pj =1 (cid:3)(cid:9) = h | m | + 1)( | m | + 2)( p + 1)( p + 2)2 p i − , and define π ∈ M ( R d ) as π = µ ⊗ d . This kind of “coding” prior distribution canbe used also to define a prior on the integers (by renormalizing the restriction of µ to integers to get a probability distribution). Using µ is somehow equivalent topicking up a representative of each dyadic interval, and makes it possible to restrictto the case when the posterior ρ is a Dirac mass without losing too much (whenΘ = (0 , flat prior seems at first glance to be the only alternativewhen nothing is known about the sample distribution P , the previous discussionshows that this type of choice is lacking proper localisation, and namely that weloose a factor K (cid:8) P (cid:2) π exp( − λr ) (cid:3) , π (cid:9) , the divergence between the bound-optimal prior P (cid:2) π exp( − λr ) (cid:3) , which is concentrated near the minima of R in favourable situations,and the flat prior π . Fortunately, there are technical ways to get around this diffi-culty and to obtain more local empirical bounds. The idea is to start with some flat prior π ∈ M (Θ), and the posterior distribution ρ = π exp( − λr ) minimizing the bound of Theorem 1.2.1 (page 6), when π is used as a6 Chapter 1. Inductive PAC-Bayesian learning prior. To improve the bound, we would like to use P (cid:2) π exp( − λr ) (cid:3) instead of π , and weare going to make the guess that we could approximate it with π exp( − βR ) (we havereplaced the parameter λ with some distinct parameter β to give some more freedomto our investigation, and also because, intuitively, P (cid:2) π exp( − λr ) (cid:3) may be expected tobe less concentrated than each of the π exp( − λr ) it is mixing, which suggests that thebest approximation of P (cid:2) π exp( − λr ) (cid:3) by some π exp( − βR ) may be obtained for someparameter β < λ ). We are then led to look for some empirical upper bound of K (cid:2) ρ, π exp( − βR ) (cid:3) . This is happily provided by the following computation P (cid:8) K (cid:2) ρ, π exp( − βR ) (cid:3)(cid:9) = P (cid:2) K ( ρ, π ) (cid:3) + β P (cid:2) ρ ( R ) (cid:3) + log n π (cid:2) exp( − βR ) (cid:3)o = P (cid:8) K (cid:2) ρ, π exp( − βr ) (cid:3)(cid:9) + β P (cid:2) ρ ( R − r ) (cid:3) + log n π (cid:2) exp( − βR ) (cid:3)o − P n log π (cid:2) exp( − βr ) (cid:3)o . Using the convexity of r log (cid:8) π (cid:2) exp( − βr ) (cid:3)(cid:9) as in equation (1.4) on page 10, weconclude that0 ≤ P (cid:8) K (cid:2) ρ, π exp( − βR ) (cid:3)(cid:9) ≤ β P (cid:2) ρ ( R − r ) (cid:3) + P (cid:8) K (cid:2) ρ, π exp( − βr ) (cid:3)(cid:9) . This inequality has an interest of its own, since it provides a lower bound for P (cid:2) ρ ( R ) (cid:3) . Moreover we can plug it into Theorem 1.2.1 (page 6) applied to the priordistribution π exp( − βR ) and obtain for any posterior distribution ρ and any positiveparameter λ thatΦ λN (cid:8) P (cid:2) ρ ( R ) (cid:3)(cid:9) ≤ P (cid:26) ρ ( r ) + βλ ρ ( R − r ) + 1 λ P n K (cid:2) ρ, π exp( − βr ) (cid:3)o(cid:27) . In view of this, it it convenient to introduce the function e Φ a,b ( p ) = (1 − b ) − (cid:2) Φ a ( p ) − bp (cid:3) = − (1 − b ) − n a − log (cid:8) − p (cid:2) − exp( − a ) (cid:3)(cid:9) + bp o ,p ∈ (0 , , a ∈ )0 , ∞ ( , b ∈ (0 , . 
This is a convex function of $p$; moreover,
\[
\widetilde{\Phi}'_{a,b}(0) = \bigl\{ a^{-1}\bigl[1 - \exp(-a)\bigr] - b \bigr\}\,(1-b)^{-1},
\]
showing that it is an increasing one-to-one convex map of the unit interval onto itself as soon as $b \le a^{-1}\bigl[1 - \exp(-a)\bigr]$. Its convexity, combined with the value of its derivative at the origin, shows that
\[
\widetilde{\Phi}_{a,b}(p) \ge \frac{a^{-1}\bigl[1 - \exp(-a)\bigr] - b}{1-b}\, p.
\]
Using this notation and these remarks, we can state
Theorem 1.3.1 . For any positive real constants β and λ such that ≤ β < N [1 − exp( − λN )] , for any posterior distribution ρ : Ω → M (Θ) , P (cid:26) ρ ( r ) − K (cid:2) ρ, π exp( − βr ) (cid:3) β (cid:27) ≤ P (cid:2) ρ ( R ) (cid:3) .3. Local bounds ≤ e Φ − λN , βλ (cid:26) P (cid:20) ρ ( r ) + K (cid:2) ρ, π exp( − βr ) (cid:3) λ − β (cid:21)(cid:27) ≤ λ − βN [1 − exp( − λN )] − β P (cid:20) ρ ( r ) + K (cid:2) ρ, π exp( − βr ) (cid:3) λ − β (cid:21) . Thus (taking λ = 2 β ), for any β such that ≤ β < N , P (cid:2) ρ ( R ) (cid:3) ≤ − βN P (cid:26) ρ ( r ) + K (cid:2) ρ, π exp( − βr ) (cid:3) β (cid:27) . Note that the last inequality is obtained using the fact that 1 − exp( − x ) ≥ x − x , x ∈ R + . Corollary 1.3.2 . For any β ∈ (0 , N ( , P (cid:2) π exp( − βr ) ( r ) (cid:3) ≤ P (cid:2) π exp( − βr ) ( R ) (cid:3) ≤ inf λ ∈ ( − N log(1 − βN ) , ∞ ( λ − βN [1 − exp( − λN )] − β P (cid:2) π exp( − βr ) ( r ) (cid:3) ≤ − βN P (cid:2) π exp( − βr ) ( r ) (cid:3) , the last inequality holding only when β < N . It is interesting to compare the upper bound provided by this corollary withTheorem 1.2.1 (page 6) when the posterior is a Gibbs measure ρ = π exp( − βr ) . Wesee that we have got rid of the entropy term K (cid:2) π exp( − βr ) , π (cid:3) , but at the price ofan increase of the multiplicative factor, which for small values of βN grows from(1 − β N ) − (when we take λ = β in Theorem 1.2.1), to (1 − βN ) − . Thereforenon-localized bounds have an interest of their own, and are superseded by localizedbounds only in favourable circumstances (presumably when the sample is largeenough when compared with the complexity of the classification model).Corollary 1.3.2 shows that when βN is small, π exp( − βr ) ( r ) is a tight approximationof π exp( − βr ) ( R ) in the mean (since we have an upper bound and a lower bound whichare close together).Another corollary is obtained by optimizing the bound given by Theorem 1.3.1in ρ , which is done by taking ρ = π exp( − λr ) . Corollary 1.3.3 . For any positive real constants β and λ such that ≤ β 2, we obtain from equation (1.8) thatinf λ P (cid:2) π exp( − λr ) ( R ) (cid:3) ≤ . When it comes to deviation bounds, for technical reasons we will choose a slightlymore involved change of prior distribution and apply Theorem 1.2.6 (page 11) tothe prior π exp[ − β Φ − βN ◦ R ] . The advantage of tweaking R with the nonlinear functionΦ − βN will appear in the search for an empirical upper bound of the local entropyterm. Theorem 1.1.4 (page 4), used with the above-mentioned local prior, showsthat(1.9) P ( sup ρ ∈ M (Θ) λ n ρ (cid:0) Φ λN ◦ R (cid:1) − ρ ( r ) o − K (cid:2) ρ, π exp( − β Φ − βN ◦ R ) (cid:3)) ≤ . Moreover(1.10) K (cid:2) ρ, π exp[ − β Φ − βN ◦ R ] (cid:3) = K (cid:2) ρ, π exp( − βr ) (cid:3) + βρ h Φ − βN ◦ R − r i + log n π h exp (cid:0) − β Φ − βN ◦ R (cid:1)io − log n π h exp( − βr ) io , which is an invitation to find an upper bound for log n π h exp (cid:2) − β Φ − λN ◦ R (cid:3)io − log n π (cid:2) exp( − βr ) (cid:3)o . For conciseness, let us call our localized prior distribution π ,thus defined by its density dπdπ ( θ ) = exp n − β Φ − βN (cid:2) R ( θ ) (cid:3)o π n exp (cid:2) − β Φ − βN ◦ R (cid:3)o . 
Applying once again Theorem 1.1.4 (page 4), but this time to − β , we see that(1.11) P (cid:26) exp (cid:20) log n π h exp (cid:0) − β Φ − βN ◦ R (cid:1)io − log n π (cid:2) exp( − βr ) (cid:3)o(cid:21)(cid:27) = P (cid:26) exp (cid:20) log n π h exp (cid:0) − β Φ − βN ◦ R ) (cid:1)io + inf ρ ∈ M (Θ) βρ ( r ) + K ( ρ, π ) (cid:21)(cid:27) ≤ P (cid:26) exp (cid:20) log n π h exp (cid:0) − β Φ − βN ◦ R ) (cid:1)io + βπ ( r ) + K ( π, π ) (cid:21)(cid:27) = P (cid:26) exp (cid:20) β h π ( r ) − π (cid:0) Φ − βN ◦ R (cid:1)i + K ( π, π ) (cid:21)(cid:27) ≤ . Combining equations (1.10) and (1.11) and using the concavity of Φ − βN , we see thatwith P probability at least 1 − ǫ , for any posterior distribution ρ : Ω → M (Θ),0 ≤ K ( ρ, π ) ≤ K (cid:2) ρ, π exp( − βr ) (cid:3) + β h Φ − βN (cid:2) ρ ( R ) (cid:3) − ρ ( r ) i − log( ǫ ) . We have proved a lower deviation bound:0 Chapter 1. Inductive PAC-Bayesian learning Theorem 1.3.5 For any positive real constant β , with P probability at least − ǫ ,for any posterior distribution ρ : Ω → M (Θ) , exp (cid:26) βN (cid:20) ρ ( r ) − K [ ρ, π exp( − βr ) ] − log( ǫ ) β (cid:21)(cid:27) − (cid:0) βN (cid:1) − ≤ ρ ( R ) . We can also obtain a lower deviation bound for b θ . Indeed equation (1.11) canalso be written as P (cid:26) π exp( − βr ) (cid:20) exp n β h r − Φ − βN ◦ R io(cid:21)(cid:27) ≤ . This means that for any posterior distribution ρ : Ω → M (Θ), P n ρ h exp (cid:8) β (cid:2) r − Φ − βN ◦ R (cid:3) − log (cid:0) dρdπ exp( − βr ) (cid:1)(cid:9)io ≤ . We have proved Theorem 1.3.6 For any positive real constant β , for any posterior distribution ρ : Ω → M (Θ) , with P ρ probability at least − ǫ , R ( b θ ) ≥ Φ − − βN (cid:20) r ( b θ ) − log (cid:0) dρdπ exp( − βr ) (cid:1) − log( ǫ ) β (cid:21) = exp (cid:26) βN (cid:20) r ( b θ ) − log (cid:0) dρdπ exp( − βr ) (cid:1) − log( ǫ ) β (cid:21)(cid:27) − (cid:18) βN (cid:19) − . Let us now resume our investigation of the upper deviations of ρ ( R ). Using theCauchy-Schwarz inequality to combine equations (1.9, page 19) and (1.11, page 19),we obtain(1.12) P (cid:26) exp (cid:20) 12 sup ρ ∈ M (Θ) λρ (cid:0) Φ λN ◦ R (cid:1) − βρ (cid:0) Φ − βN ◦ R (cid:1) − ( λ − β ) ρ ( r ) − K (cid:2) ρ, π exp( − βr ) (cid:3)(cid:21)(cid:27) = P (cid:26) exp (cid:20) sup ρ ∈ M (Θ) (cid:18) λ n ρ (cid:0) Φ λN ◦ R (cid:1) − ρ ( r ) o − K ( ρ, π ) (cid:19)(cid:21) × exp (cid:20) (cid:18) log n π h exp (cid:0) − β Φ − βN ◦ R (cid:1)io − log n π h exp( − βr ) io(cid:19)(cid:21)(cid:27) ≤ P (cid:26) exp (cid:20) sup ρ ∈ M (Θ) (cid:18) λ n ρ (cid:0) Φ λN ◦ R (cid:1) − ρ ( r ) o − K ( ρ, π ) (cid:19)(cid:21)(cid:27) / × P (cid:26) exp (cid:20)(cid:18) log n π h exp (cid:0) − β Φ − βN ◦ R (cid:1)io − log n π h exp( − βr ) io(cid:19)(cid:21)(cid:27) / ≤ . Thus with P probability at least 1 − ǫ , for any posterior distribution ρ , λ Φ λN (cid:2) ρ ( R ) (cid:3) − β Φ − βN (cid:2) ρ ( R ) (cid:3) .3. Local bounds ≤ λρ (cid:0) Φ λN ◦ R (cid:1) − βρ (cid:0) Φ − βN ◦ R (cid:1) ≤ ( λ − β ) ρ ( r ) + K ( ρ, π exp( − βr ) ) − ǫ ) . (It would have been more straightforward to use a union bound on deviation in-equalities instead of the Cauchy-Schwarz inequality on exponential moments, any-how, this would have led to replace − ǫ ) with the worse factor 2 log( ǫ ).) 
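The remark following Corollary 1.3.2, namely that π_{exp(−βr)}(r) is a tight approximation of π_{exp(−βr)}(R) in the mean when β/N is small, is easy to visualise by simulation. The sketch below uses a hypothetical distribution (threshold classifiers on the unit interval with label noise 0.3, so that R(θ) = 0.3 + 0.4|θ − 0.5| in closed form) and averages both quantities over repeated samples; all numerical choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n_rep, beta = 1000, 200, 100.0
thetas = np.linspace(0.0, 1.0, 101)            # thresholds of f_theta(x) = 1[x > theta]
R = 0.3 + 0.4 * np.abs(thetas - 0.5)           # exact error rate for this toy distribution

def sample(N):
    X = rng.uniform(0.0, 1.0, N)
    clean = (X > 0.5).astype(int)
    Y = np.where(rng.uniform(size=N) < 0.7, clean, 1 - clean)   # labels flipped with prob. 0.3
    return X, Y

emp_r, true_R = [], []
for _ in range(n_rep):
    X, Y = sample(N)
    preds = (X[None, :] > thetas[:, None]).astype(int)     # one row of predictions per theta
    r = (preds != Y[None, :]).mean(axis=1)                 # empirical error rates r(theta)
    w = np.exp(-beta * (r - r.min()))
    rho = w / w.sum()                                      # Gibbs posterior pi_{exp(-beta r)}
    emp_r.append(rho @ r)
    true_R.append(rho @ R)

print(np.mean(emp_r), np.mean(true_R))   # close when beta/N is small; the first is slightly lower
```

One can check by increasing β that the two averages drift apart, reflecting the fact that a colder Gibbs posterior overfits the empirical error rates more.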
Letus now recall that λ Φ λN ( p ) − β Φ − βN ( p ) = − N log n − (cid:2) − exp (cid:0) − λN (cid:1)(cid:3) p o − N log n (cid:2) exp (cid:0) βN (cid:1) − (cid:3) p o , and let us put B = ( λ − β ) ρ ( r ) + K (cid:2) ρ, π exp( − βr ) (cid:3) − ǫ )= K (cid:2) ρ, π exp( − λr ) (cid:3) + R λβ π exp( − ξr ) ( r ) dξ − ǫ ) . Let us consider moreover the change of variables α = 1 − exp( − λN ) and γ = exp( βN ) − 1. We obtain (cid:2) − αρ ( R ) (cid:3)(cid:2) γρ ( R ) (cid:3) ≥ exp( − BN ) , leading to Theorem 1.3.7 . For any positive constants α , γ , such that ≤ γ < α < , with P probability at least − ǫ , for any posterior distribution ρ : Ω → M (Θ) , the bound M ( ρ ) = − log (cid:2) (1 − α )(1 + γ ) (cid:3) α − γ ρ ( r ) + K ( ρ, π exp[ − N log(1+ γ ) r ] ) − ǫ ) N ( α − γ )= K (cid:2) ρ, π exp[ N log(1 − α ) r ] (cid:3) + Z − N log(1 − α ) N log(1+ γ ) π exp( − ξr ) ( r ) dξ − ǫ ) N ( α − γ ) , is such that ρ ( R ) ≤ α − γ αγ s αγ ( α − γ ) (cid:8) − exp (cid:2) − ( α − γ ) M ( ρ ) (cid:3)(cid:9) − ! ≤ M ( ρ ) , Let us now give an upper bound for R ( b θ ). Equation (1.12 page 20) can also bewritten as P (cid:26)(cid:20) π exp( − βr ) n exp h λ Φ λN ◦ R − β Φ − βN ◦ R − ( λ − β ) r io(cid:21) (cid:27) ≤ . This means that for any posterior distribution ρ : Ω → M (Θ), P (cid:26)(cid:20) ρ n exp h λ Φ λN ◦ R − β Φ − βN ◦ R − ( λ − β ) r − log (cid:0) dρdπ exp( − βr ) (cid:1)io(cid:21) (cid:27) ≤ . Using the concavity of the square root function, this inequality can be weakenedto P (cid:26) ρ (cid:20) exp n h λ Φ λN ◦ R − β Φ − βN ◦ R − ( λ − β ) r − log (cid:0) dρdπ exp( − βr ) (cid:1)io(cid:21)(cid:27) ≤ . We have proved2 Chapter 1. Inductive PAC-Bayesian learning Theorem 1.3.8 . For any positive real constants λ and β and for any posteriordistribution ρ : Ω → M (Θ) , with P ρ probability at least − ǫ , λ Φ λN (cid:2) R ( b θ ) (cid:3) − β Φ − βN (cid:2) R ( b θ ) (cid:3) ≤ ( λ − β ) r ( b θ ) + log h dρdπ exp( − βr ) ( b θ ) i − ǫ ) . Putting α = 1 − exp (cid:0) − λN (cid:1) , γ = exp (cid:0) βN (cid:1) − and M ( θ ) = − log (cid:2) (1 − α )(1 + γ ) (cid:3) α − γ r ( θ ) + log h dρdπ exp[ − N log(1+ γ ) r ] ( θ ) i − ǫ ) N ( α − γ )= log h dρdπ exp[ N log(1 − α ) r ] ( θ ) i + Z − N log(1 − α ) N log(1+ γ ) π exp( − ξr ) ( r ) dξ − ǫ ) N ( α − γ ) , we can also, in the case when γ < α , write this inequality as R ( b θ ) ≤ α − γ αγ s αγ ( α − γ ) n − exp h − ( α − γ ) M ( b θ ) io − ! ≤ M ( b θ ) . It may be enlightening to introduce the empirical dimension d e defined by equa-tion (1.6) on page 13. It provides the upper bound Z λβ π exp( − ξr ) ( r ) dξ ≤ ( λ − β ) ess inf π r + d e log (cid:18) λβ (cid:19) , which shows that in Theorem 1.3.7 (page 21), M ( ρ ) ≤ log (cid:2) (1 + γ )(1 − α ) (cid:3) γ − α ess inf π r + d e log h − log(1 − α )log(1+ γ ) i + K (cid:2) ρ, π exp[ N log(1 − α ) r ] (cid:3) − ǫ ) N ( α − γ ) . Similarly, in Theorem 1.3.8 above, M ( θ ) ≤ log (cid:2) (1 + γ )(1 − α ) (cid:3) γ − α ess inf π r + d e log h − log(1 − α )log(1+ γ ) i + log h dρdπ exp[ N log(1 − α ) r ] ( θ ) i − ǫ ) N ( α − γ )Let us give a little numerical illustration: assuming that d e = 10, N = 1000, andess inf π r = 0 . 2, taking ǫ = 0 . α = 0 . γ = 0 . 1, we obtain from Theorem1.3.7 π exp[ N log(1 − α ) r ] ( R ) ≃ π exp( − r ) ( R ) ≤ . ≤ . α and γ wouldnot have yielded a significantly lower bound.The following corollary is obtained by taking λ = 2 β and keeping only the linearbound; we give it for the sake of its simplicity: .3. Local bounds Corollary 1.3.9 . 
For any positive real constant β such that exp( βN )+ exp( − βN ) < , which is the case when β < . N , with P probability at least − ǫ , for anyposterior distribution ρ : Ω → M (Θ) , ρ ( R ) ≤ βρ ( r ) + K (cid:2) ρ, π exp( − βr ) (cid:3) − ǫ ) N (cid:2) − exp (cid:0) βN (cid:1) − exp (cid:0) − βN (cid:1)(cid:3) = R ββ π exp( − ξr ) ( r ) dξ + K (cid:2) ρ, π exp( − βr ) (cid:3) − ǫ ) N (cid:2) − exp( βN ) − exp( − βN ) (cid:3) . Let us mention that this corollary applied to the above numerical example gives π exp( − r ) ( R ) ≤ . 475 (when we take β = 100, consistently with the choice γ =0 . Local bounds are suitable when the lowest values of the empirical error rate r arereached only on a small part of the parameter set Θ. When Θ is the disjoint unionof sub-models of different complexities, the minimum of r will as a rule not be“localized” in a way that calls for the use of local bounds. Just think for instanceof the case when Θ = F Mm =1 Θ m , where the sets Θ ⊂ Θ ⊂ · · · ⊂ Θ M are nested.In this case we will have inf Θ r ≥ inf Θ r ≥ · · · ≥ inf Θ M r , although Θ M maybe too large to be the right model to use. In this situation, we do not want tolocalize the bound completely. Let us make a more specific fanciful but typicalpseudo computation. Just imagine we have a countable collection (Θ m ) m ∈ M ofsub-models. Let us assume we are interested in choosing between the estimators b θ m ∈ arg min Θ m r , maybe randomizing them (e.g. replacing them with π m exp( − λr ) ).Let us imagine moreover that we are in a typically parametric situation, where,for some priors π m ∈ M (Θ m ), m ∈ M , there is a “dimension” d m such that λ (cid:2) π m exp( − λr ) ( r ) − r ( b θ m ) (cid:3) ≃ d m . Let µ ∈ M ( M ) be some distribution on the indexset M . It is easy to see that ( µπ ) exp( − λr ) will typically not be properly local, in thesense that typically( µπ ) exp( − λr ) ( r ) = µ n π exp( − λr ) ( r ) π (cid:2) exp( − λr ) (cid:3)o µ n π (cid:2) exp( − λr ) (cid:3)o ≃ X m ∈ M (cid:2) (inf Θ m r ) + d m λ (cid:3) exp (cid:2) − λ (inf Θ m r ) − d m log (cid:0) eλd m (cid:1)(cid:3) µ ( m ) X m ∈ M exp h − λ (inf Θ m r ) − d m log (cid:0) eλd m (cid:1)i µ ( m ) ≃ (cid:26) inf m ∈ M (inf Θ m r ) + d m λ log (cid:0) eλd m (cid:1) − λ log[ µ ( m )] (cid:27) + log (cid:26) X m ∈ M exp (cid:2) − d m log( λd m ) (cid:3) µ ( m ) (cid:27) . where we have used the approximations − log n π (cid:2) exp( − λr ) (cid:3)o = Z λ π exp( − βr ) ( r ) dβ Chapter 1. Inductive PAC-Bayesian learning ≃ Z λ (inf Θ m r ) + (cid:2) d m β ∧ (cid:3) dβ ≃ λ (inf Θ m r ) + d m (cid:2) log (cid:0) λd m (cid:1) + 1 (cid:3) , and P m h ( m ) exp[ − h ( m )] ν ( m ) P m exp[ − h ( m )] ν ( m ) ≃ inf m h ( m ) − log[ ν ( m )] , ν ∈ M ( M ), taking ν ( m ) = µ ( m ) exp (cid:2) − d m log (cid:0) λd m (cid:1)(cid:3)P m ′ µ ( m ′ ) exp (cid:2) − d m ′ log (cid:0) λd m ′ (cid:1)(cid:3) .These approximations have no pretension to be rigorous or very accurate, butthey nevertheless give the best order of magnitude we can expect in typical situa-tions, and show that this order of magnitude is not what we are looking for: mixingdifferent models with the help of µ spoils the localization, introducing a multiplierlog (cid:0) λd m (cid:1) to the dimension d m which is precisely what we would have got if we hadnot localized the bound at all. What we would really like to do in such situations isto use a partially localized posterior distribution, such as π b m exp( − λr ) , where b m is anestimator of the best sub-model to be used. 
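The partially localized posterior just mentioned can be sketched as follows: compute, for each sub-model, its minimum empirical risk r⋆(m) and empirical dimension d_e(m), select m̂ by the penalized minimum empirical risk r⋆(m) + log(λ/β)/(λ − β) d_e(m) that reappears later in this section, and form the Gibbs posterior within the selected sub-model. The sub-models, constants and data in this sketch are hypothetical, and the criterion is only one possible instantiation of the idea.

```python
import numpy as np

def gibbs(prior, r, lam):
    w = prior * np.exp(-lam * (r - r.min()))
    return w / w.sum()

def empirical_dimension(prior, r, betas):
    return max(b * (gibbs(prior, r, b) @ r - r.min()) for b in betas)

# Hypothetical nested sub-models: each model m is a finite grid of classifiers,
# summarised by its vector of empirical error rates r_m(theta).
rng = np.random.default_rng(2)
models = {m: np.sort(0.30 - 0.02 * m + rng.uniform(0.0, 0.2, size=10 * 2 ** m))
          for m in range(1, 6)}

lam, beta = 200.0, 50.0
betas = np.linspace(1e-2, 2e3, 500)
scores = {}
for m, r in models.items():
    prior = np.full(r.size, 1.0 / r.size)
    r_star, d_e = r.min(), empirical_dimension(prior, r, betas)
    scores[m] = r_star + d_e * np.log(lam / beta) / (lam - beta)   # penalized empirical risk

m_hat = min(scores, key=scores.get)                                # estimated best sub-model
rho_hat = gibbs(np.full(models[m_hat].size, 1.0 / models[m_hat].size), models[m_hat], lam)
print(m_hat, scores, rho_hat @ models[m_hat])   # selected model and its Gibbs empirical risk
```

Nothing in this selection step requires the sub-models to be nested; only the per-model quantities r⋆(m) and d_e(m) are used.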
While the most straightforward way todo this is to use a union bound on results obtained for each sub-model Θ m , herewe are going to show how to allow arbitrary posterior distributions on the indexset (corresponding to a randomization of the choice of b m ).Let us consider the framework we just mentioned: let the measurable parameterset (Θ , T ) be a union of measurable sub-models, Θ = S m ∈ M Θ m . Let the index set( M, M ) be some measurable space (most of the time it will be a countable set). Let µ ∈ M ( M ) be a prior probability distribution on ( M, M ). Let π : M → M (Θ) bea regular conditional probability measure such that π ( m, Θ m ) = 1, for any m ∈ M .Let µπ ∈ M ( M × Θ) be the product probability measure defined for any boundedmeasurable function h : M × Θ → R by µπ ( h ) = Z m ∈ M (cid:18)Z θ ∈ Θ h ( m, θ ) π ( m, dθ ) (cid:19) µ ( dm ) . For any bounded measurable function h : Ω × M × Θ → R , let π exp( h ) : Ω × M → M (Θ) be the regular conditional posterior probability measure defined by dπ exp( h ) dπ ( m, θ ) = exp (cid:2) h ( m, θ ) (cid:3) π (cid:2) m, exp( h ) (cid:3) , where consistently with previous notation π ( m, h ) = R Θ h ( m, θ ) π ( m, dθ ) (we willalso often use the less explicit notation π ( h )). For short, let U ( θ, ω ) = λ Φ λN (cid:2) R ( θ ) (cid:3) − β Φ − βN (cid:2) R ( θ ) (cid:3) − ( λ − β ) r ( θ, ω ) . Integrating with respect to µ equation (1.12, page 20), written in each sub-modelΘ m using the prior distribution π ( m, · ), we see that P (cid:26) exp (cid:20) sup ν ∈ M ( M ) sup ρ : M → M (Θ) h ( νρ )( U ) − ν (cid:8) K ( (cid:2) ρ, π exp( − βr ) (cid:3)(cid:9)i − K ( ν, µ ) (cid:21)(cid:27) ≤ P (cid:26) exp (cid:20) sup ν ∈ M ( M ) ν (cid:18) sup ρ : M → M (Θ) ρ ( U ) − K ( ρ, π exp( − βr ) ) (cid:19) − K ( ν, µ ) (cid:21)(cid:27) = P (cid:26) µ (cid:20) exp n sup ρ : M → M (Θ) h ρ ( U ) − K (cid:2) ρ, π exp( − βr ) (cid:3)io(cid:21)(cid:27) .3. Local bounds µ (cid:26) P (cid:20) exp n sup ρ : M → M (Θ) h ρ ( U ) − K (cid:2) ρ, π exp( − βr ) (cid:3)io(cid:21)(cid:27) ≤ . This proves that(1.13) P ( exp " 12 sup ν ∈ M ( M ) sup ρ : M → M (Θ) νρ (cid:2) λ Φ λN ( R ) − β Φ − βN ( R ) (cid:3) − ( λ − β ) νρ ( r ) − K ( ν, µ ) − ν (cid:8) K (cid:2) ρ, π exp( − βr ) (cid:3)(cid:9) ≤ . Introducing the optimal value of r on each sub-model r ⋆ ( m ) = ess inf π ( m, · ) r andthe empirical dimensions d e ( m ) = sup ξ ∈ R + ξ (cid:2) π exp( − ξr ) ( m, r ) − r ⋆ ( m ) (cid:3) , we can thus state Theorem 1.3.10 . 
For any positive real constants β < λ , with P probability at least − ǫ , for any posterior distribution ν : Ω → M ( M ) , for any conditional posteriordistribution ρ : Ω × M → M (Θ) , νρ (cid:2) λ Φ λN ( R ) − β Φ − βN ( R ) (cid:3) ≤ λ Φ λN (cid:2) νρ ( R ) (cid:3) − β Φ − βN (cid:2) νρ ( R ) (cid:3) ≤ B ( ν, ρ ) , where B ( ν, ρ ) = ( λ − β ) νρ ( r ) + 2 K ( ν, µ ) + ν (cid:8) K (cid:2) ρ, π exp( − βr ) (cid:3)(cid:9) − ǫ )= ν (cid:20)Z λβ π exp( − αr ) ( r ) dα (cid:21) + 2 K ( ν, µ ) + ν (cid:8) K (cid:2) ρ, π exp( − λr ) (cid:3)(cid:9) − ǫ )= − (cid:26) µ (cid:20) exp (cid:18) − Z λβ π exp( − αr ) ( r ) dα (cid:19)(cid:21)(cid:27) + 2 K (cid:2) ν, µ (cid:0) π [exp( − λr )] π [exp( − βr )] (cid:1) / (cid:3) + ν (cid:8) K (cid:2) ρ, π exp( − λr ) (cid:3)(cid:9) − ǫ ) , and therefore B ( ν, ρ ) ≤ ν h ( λ − β ) r ⋆ + log (cid:16) λβ (cid:17) d e i + 2 K ( ν, µ )+ ν (cid:8) K (cid:2) ρ, π exp( − λr ) (cid:3)(cid:9) − ǫ ) , as well as B ( ν, ρ ) ≤ − (cid:26) µ (cid:20) exp (cid:18) − ( λ − β )2 r ⋆ − log (cid:0) λβ (cid:1) d e (cid:19)(cid:21)(cid:27) + 2 K (cid:2) ν, µ (cid:0) π [exp( − λr )] π [exp( − βr )] (cid:1) / (cid:3) + ν (cid:8) K (cid:2) ρ, π exp( − λr ) (cid:3) − ǫ ) . Thus, for any real constants α and γ such that ≤ γ < α < , with P probabilityat least − ǫ , for any posterior distribution ν : Ω → M ( M ) and any conditionalposterior distribution ρ : Ω × M → M (Θ) , the bound B ( ν, ρ ) = − log (cid:2) (1 − α )(1+ γ ) (cid:3) α − γ νρ ( r ) + K ( ν,µ )+ ν (cid:8) K (cid:2) ρ,π (1+ γ ) − Nr (cid:3)(cid:9) − ǫ ) N ( α − γ ) = 1 N ( α − γ ) ( K (cid:20) ν, µ (cid:16) π [(1 − α ) Nr ] π [(1+ γ ) − Nr ] (cid:17) / (cid:21) + ν n K (cid:2) ρ, π (1 − α ) Nr (cid:3)o) Chapter 1. Inductive PAC-Bayesian learning − N ( α − γ ) log ( µ " exp (cid:20) − Z − N log(1 − α ) N log(1+ γ ) π exp( − ξr ) ( · , r ) dξ (cid:21) − ǫ ) N ( α − γ ) satisfies νρ ( R ) ≤ α − γ αγ s αγ ( α − γ ) n − exp (cid:2) − ( α − γ ) B ( ν, ρ ) (cid:3)o − ! ≤ B ( ν, ρ ) . If one is willing to bound the deviations with respect to P νρ , it is enough toremark that the equation preceding equation (1.13, page 25) can also be written as P ( µ "(cid:26) π exp( − βr ) (cid:20) exp n λ Φ λN ◦ R − β Φ − βN ◦ R − ( λ − β ) r o(cid:21)(cid:27) / ≤ . Thus for any posterior distributions ν : Ω → M ( M ) and ρ : Ω × M → M (Θ), P (cid:26) ν (cid:20)n ρ h exp (cid:8) λ Φ λN ◦ R − β Φ − βN ◦ R − ( λ − β ) r − (cid:0) dνdµ (cid:1) − log (cid:0) dρdπ exp( − βr ) (cid:1)(cid:9)io / (cid:21)(cid:27) ≤ . Using the concavity of the square root function to pull the integration with respectto ρ out of the square root, we get P νρ (cid:26) exp (cid:20) n λ Φ λN ◦ R − β Φ − βN ◦ R − ( λ − β ) r − (cid:0) dνdπ (cid:1) − log (cid:0) dρdπ exp( − βr ) (cid:1)o(cid:21)(cid:27) ≤ . This leads to Theorem 1.3.11 . For any positive real constants β < λ , for any posterior distri-butions ν : Ω → M ( M ) and ρ : Ω × M → M (Θ) , with P νρ probability at least − ǫ , λ Φ λN (cid:2) R ( b m, b θ ) (cid:3) − β Φ − βN (cid:2) R ( b m, b θ ) (cid:3) ≤ ( λ − β ) r ( b m, b θ )+ 2 log (cid:2) dνdµ ( b m ) (cid:3) + log (cid:2) dρdπ exp( − βr ) ( b m, b θ ) (cid:3) − ǫ )= Z λβ π exp( − αr ) ( r ) dα + 2 log (cid:2) dνdµ ( b m ) (cid:3) + log (cid:2) dρdπ exp( − λr ) ( b m, b θ ) (cid:3) − ǫ )= 2 log (cid:26) µ (cid:20) exp (cid:18) − Z λβ π exp( − αr ) ( r ) dα (cid:19)(cid:21)(cid:27) .3. Local bounds 27+ 2 log (cid:2) dνdµ (cid:0) π [exp( − λr )] π [exp( − βr )] (cid:1) / ( b m ) (cid:3) + log (cid:2) dρdπ exp( − λr ) ( b m, b θ ) (cid:3) − ǫ ) . 
Another way to state the same inequality is to say that for any real constants α and γ such that ≤ γ < α < , with P νρ probability at least − ǫ , R ( b m, b θ ) ≤ α − γ αγ (cid:18)s αγ ( α − γ ) n − exp (cid:2) − ( α − γ ) B ( b m, b θ ) (cid:3)o − (cid:19) ≤ B ( b m, b θ ) , where B ( b m, b θ ) = − log (cid:2) (1 − α )(1 + γ ) (cid:3) α − γ r ( b m, b θ )+ 2 log h dνdµ ( b m ) i + log h dρdπ (1+ γ ) − Nr ( b m, b θ ) i − ǫ ) N ( α − γ )= 2 N ( α − γ ) log (cid:20) dνdµ (cid:16) π [(1 − α ) Nr ] π [(1+ γ ) − Nr ] (cid:17) / ( b m ) (cid:21) + log h dρdπ (1 − α ) Nr ( b m, b θ ) i − ǫ ) N ( α − γ )+ 2 N ( α − γ ) log (cid:26) µ (cid:20) exp (cid:18) − Z λβ π exp( − αr ) ( r ) dα (cid:19)(cid:21)(cid:27) . Let us remark that in the case when ν = µ (cid:16) π [(1 − α ) Nr ] π [(1+ γ ) − Nr ] (cid:17) / and ρ = π (1 − α ) Nr , weget as desired a bound that is adaptively local in all the Θ m (at least when M iscountable and µ is atomic): B ( ν, ρ ) ≤ − N ( α − γ ) log ( µ (cid:26) exp (cid:20) N log (cid:2) (1 + γ )(1 − α ) (cid:3) r ⋆ − log (cid:16) − log(1 − α )log(1+ γ ) (cid:17) d e (cid:21)(cid:27)) − ǫ ) N ( α − γ ) ≤ inf m ∈ M (cid:26) − log (cid:2) (1 − α )(1+ γ ) (cid:3) α − γ r ⋆ ( m )+ log (cid:16) − log(1 − α )log(1+ γ ) (cid:17) d e ( m ) N ( α − γ ) − log (cid:2) ǫµ ( m ) (cid:3) N ( α − γ ) (cid:27) . The penalization by the empirical dimension d e ( m ) in each sub-model is as desiredlinear in d e ( m ). Non random partially local bounds could be obtained in a way thatis easy to imagine. We leave this investigation to the reader.8 Chapter 1. Inductive PAC-Bayesian learning We have seen that the bound optimal choice of the posterior distribution ν on theindex set in Theorem 1.3.10 (page 25) is such that dνdµ ( m ) ∼ π (cid:2) exp (cid:0) − λr ( m, · ) (cid:1)(cid:3) π (cid:2) exp (cid:0) − βr ( m, · ) (cid:1)(cid:3) ! = exp (cid:20) − Z λβ π exp( − αr ) ( m, r ) dα (cid:21) . This suggests replacing the prior distribution µ with µ defined by its density(1.14) dµdµ ( m ) = exp (cid:2) − h ( m ) (cid:3) µ (cid:2) exp( − h ) (cid:3) , where h ( m ) = − ξ Z γβ π exp( − α Φ − ηN ◦ R ) (cid:2) Φ − ηN ◦ R ( m, · ) (cid:3) dα. The use of Φ − ηN ◦ R instead of R is motivated by technical reasons which will appearin subsequent computations. Indeed, we will need to bound ν (cid:20)Z λβ π exp( − α Φ − ηN ◦ R ) (cid:0) Φ − ηN ◦ R (cid:1) dα (cid:21) in order to handle K ( ν, µ ). In the spirit of equation (1.9, page 19), starting backfrom Theorem 1.1.4 (page 4), applied in each sub-model Θ m to the prior distribution π exp( − γ Φ − ηN ◦ R ) and integrated with respect to µ , we see that for any positive realconstants λ , γ and η , with P probability at least 1 − ǫ , for any posterior distribution ν : Ω → M ( M ) on the index set and any conditional posterior distribution ρ :Ω × M → M (Θ),(1.15) νρ (cid:0) λ Φ λN ◦ R − γ Φ − ηN ◦ R (cid:1) ≤ λνρ ( r )+ ν K ( ρ, π ) + K ( ν, µ ) + ν n log h π (cid:2) exp (cid:0) − γ Φ − ηN ◦ R (cid:1)(cid:3)io − log( ǫ ) . Since x f ( x ) def = λ Φ λN − γ Φ − ηN ( x ) is a convex function, it is such that f ( x ) ≥ xf ′ (0) = xN n(cid:2) − exp( − λN ) (cid:3) + γη (cid:2) exp( ηN ) − (cid:3)o . Thus if we put(1.16) γ = η (cid:2) − exp( − λN ) (cid:3) exp( ηN ) − , we obtain that f ( x ) ≥ x ∈ R , and therefore that the left-hand side of equation(1.15) is non-negative. We can moreover introduce the prior conditional distribution π defined by dπdπ ( m, θ ) = exp (cid:2) − β Φ − ηN ◦ R ( θ ) (cid:3) π (cid:8) m, exp (cid:2) − β Φ − ηN ◦ R (cid:3)(cid:9) . 
With P probability at least 1 − ǫ , for any posterior distributions ν : Ω → M ( M )and ρ : Ω × M → M (Θ), .3. Local bounds βνρ ( r ) + ν (cid:2) K ( ρ, π ) (cid:3) = ν (cid:8) K (cid:2) ρ, π exp( − βr ) (cid:3)(cid:9) − ν (cid:20) log n π (cid:2) exp( − βr ) (cid:3)o(cid:21) ≤ ν (cid:8) K (cid:2) ρ, π exp( − βr ) (cid:3)(cid:9) + βνπ ( r ) + ν (cid:2) K ( π, π ) (cid:3) ≤ ν (cid:8) K (cid:2) ρ, π exp( − βr ) (cid:3)(cid:9) + βνπ (cid:0) Φ − ηN ◦ R (cid:1) + βη (cid:2) K ( ν, µ ) − log( ǫ ) (cid:3) + ν (cid:2) K ( π, π ) (cid:3) = ν (cid:8) K (cid:2) ρ, π exp( − βr ) (cid:3)(cid:9) − ν n log h π (cid:2) exp (cid:0) − β Φ − ηN ◦ R (cid:1)(cid:3)io + βη (cid:2) K ( ν, µ ) − log( ǫ ) (cid:3) . Thus, coming back to equation (1.15), we see that under condition (1.16), with P probability at least 1 − ǫ ,0 ≤ ( λ − β ) νρ ( r ) + ν (cid:8) K (cid:2) ρ, π exp( − βr ) (cid:3)(cid:9) − ν (cid:20)Z γβ π exp( − α Φ − ηN ◦ R ) (cid:0) Φ − ηN ◦ R (cid:1) dα (cid:21) + (1 + βη ) (cid:2) K ( ν, µ ) + log( ǫ ) (cid:3) . Noticing moreover that( λ − β ) νρ ( r ) + ν (cid:8) K (cid:2) ρ, π exp( − βr ) (cid:3)(cid:9) = ν (cid:8) K (cid:2) ρ, π exp( − λr ) (cid:3)(cid:9) + ν (cid:20)Z λβ π exp( − αr ) ( r ) dα (cid:21) , and choosing ρ = π exp( − λr ) , we have proved Theorem 1.3.12 . For any positive real constants β , γ and η , such that γ < η (cid:2) exp( ηN ) − (cid:3) − , defining λ by condition (1.16) , so that λ = − N log n − γη (cid:2) exp( ηN ) − (cid:3)o , with P probability at least − ǫ , for any posteriordistribution ν : Ω → M ( M ) , any conditional posterior distribution ρ : Ω × M → M (Θ) , ν (cid:20)Z γβ π exp( − α Φ − ηN ◦ R ) (cid:0) Φ − ηN ◦ R (cid:1) dα (cid:21) ≤ ν (cid:20)Z λβ π exp( − αr ) ( r ) dα (cid:21) + (cid:0) βη (cid:1)(cid:2) K ( ν, µ ) + log (cid:0) ǫ (cid:1)(cid:3) . Let us remark that this theorem does not require that β < γ , and thus providesboth an upper and a lower bound for the quantity of interest: Corollary 1.3.13 . For any positive real constants β , γ and η such that max { β,γ } < η (cid:2) exp( ηN ) − (cid:3) − , with P probability at least − ǫ , for any posterior distributions ν : Ω → M ( M ) and ρ : Ω × M → M (Θ) , ν (cid:20)Z γ − N log { − βN [exp( ηN ) − } π exp( − αr ) ( r ) dα (cid:21) − (cid:0) γη (cid:1)(cid:2) K ( ν, µ ) + log (cid:0) ǫ (cid:1)(cid:3) ≤ ν (cid:20)Z γβ π exp( − α Φ − ηN ◦ R ) (cid:0) Φ − ηN ◦ R (cid:1) dα (cid:21) Chapter 1. Inductive PAC-Bayesian learning ≤ ν (cid:20)Z − N log { − γη [exp( ηN ) − } β π exp( − αr ) ( r ) dα (cid:21) + (cid:0) βη (cid:1)(cid:2) K ( ν, µ ) + log (cid:0) ǫ (cid:1)(cid:3) . 
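Corollary 1.3.13 brackets a non-observable integrated Gibbs risk between observable integrals of the form ∫ π_{exp(−αr)}(r) dα. On a finite model such integrals have a closed form, since the derivative of −log π[exp(−αr)] with respect to α is π_{exp(−αr)}(r), so that ∫_β^λ π_{exp(−αr)}(r) dα = log π[exp(−βr)] − log π[exp(−λr)]. The following sketch checks this numerically on hypothetical empirical error rates.

```python
import numpy as np

def logsumexp(v):
    m = v.max()
    return m + np.log(np.exp(v - m).sum())

rng = np.random.default_rng(3)
r = rng.uniform(0.1, 0.6, size=500)            # hypothetical empirical error rates
log_prior = np.full(r.size, -np.log(r.size))   # flat prior pi on a finite grid

def log_pi_exp(alpha):
    return logsumexp(log_prior - alpha * r)    # log pi[exp(-alpha r)]

def gibbs_mean_r(alpha):
    logw = log_prior - alpha * r
    return np.exp(logw - logsumexp(logw)) @ r  # pi_{exp(-alpha r)}(r)

beta, lam = 50.0, 300.0
closed_form = log_pi_exp(beta) - log_pi_exp(lam)
alphas = np.linspace(beta, lam, 2001)
vals = np.array([gibbs_mean_r(a) for a in alphas])
trapezoid = np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(alphas))
print(closed_form, trapezoid)                  # the two evaluations agree up to quadrature error
```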
We can then remember that K ( ν, µ ) = ξ (cid:0) ν − µ (cid:1)(cid:20)Z γβ π exp( − α Φ − ηN ◦ R ) (cid:0) Φ − ηN ◦ R (cid:1) dα (cid:21) + K ( ν, µ ) − K ( µ, µ ) , to conclude that, putting(1.17) G η ( α ) = − N log (cid:8) − αη (cid:2) exp (cid:0) ηN ) − (cid:3)(cid:9) ≥ α, α ∈ R + , and(1.18) d b νdµ ( m ) def = exp (cid:2) − h ( m ) (cid:3) µ (cid:2) exp( − h ) (cid:3) where h ( m ) = ξ Z γG η ( β ) π exp( − αr ) ( m, r ) dα, the divergence of ν with respect to the local prior µ is bounded by (cid:2) − ξ (cid:0) βη (cid:1)(cid:3) K ( ν, µ ) ≤ ξν (cid:20)Z G η ( γ ) β π exp( − αr ) ( r ) dα (cid:21) − ξµ (cid:20)Z γG η ( β ) π exp( − αr ) ( r ) dα (cid:21) + K ( ν, µ ) − K ( µ, µ ) + ξ (cid:0) β + γη (cid:1) log (cid:0) ǫ (cid:1) ≤ ξν (cid:20)Z G η ( γ ) β π exp( − αr ) ( r ) dα (cid:21) + K ( ν, µ )+ log (cid:26) µ (cid:20) exp (cid:18) − ξ Z γG η ( β ) π exp( − αr ) ( r ) dα (cid:19)(cid:21)(cid:27) + ξ (cid:0) β + γη (cid:1) log (cid:0) ǫ (cid:1) = K ( ν, b ν ) + ξν (cid:20)(cid:18)Z G η ( β ) β + Z G η ( γ ) γ (cid:19) π exp( − αr ) ( r ) dα (cid:21) + ξ (cid:0) β + γη (cid:1) log (cid:0) ǫ (cid:1) . We have proved Theorem 1.3.14 . For any positive constants β , γ and η such that max { β, γ } < η (cid:2) exp( ηN ) − (cid:3) − , with P probability at least − ǫ , for any pos-terior distribution ν : Ω → M ( M ) and any conditional posterior distribution ρ : Ω × M → M (Θ) , K ( ν, µ ) ≤ h − ξ (cid:16) βη (cid:17)i − (cid:26) K ( ν, b ν )+ ξν (cid:20)(cid:18)Z G η ( β ) β + Z G η ( γ ) γ (cid:19) π exp( − αr ) ( r ) dα (cid:21) + ξ (cid:0) β + γη (cid:1) log (cid:0) ǫ (cid:1)(cid:27) ≤ h − ξ (cid:16) βη (cid:17)i − (cid:26) K ( ν, b ν ) .3. Local bounds ξν (cid:20)(cid:2) G η ( γ ) − γ + G η ( β ) − β (cid:3) r ⋆ + log (cid:18) G η ( β ) G η ( γ ) βγ (cid:19) d e (cid:21) + ξ (cid:0) β + γη (cid:1) log (cid:0) ǫ (cid:1)(cid:27) , where the local prior µ is defined by equation (1.14, page 28) and the local posterior b ν and the function G η are defined by equation (1.18, page 30). We can then use this theorem to give a local version of Theorem 1.3.10 (page 25).To get something pleasing to read, we can apply Theorem 1.3.14 with constants β ′ , γ ′ and η chosen so that ξ − ξ (1+ β ′ η ) = 1 , G η ( β ′ ) = β and γ ′ = λ , where β and λ arethe constants appearing in Theorem 1.3.10. This gives Theorem 1.3.15 . For any positive real constants β < λ and η such that λ <η (cid:2) exp( ηN ) − (cid:3) − , with P probability at least − ǫ , for any posterior distribution ν : Ω → M ( M ) , for any conditional posterior distribution ρ : Ω × M → M (Θ) , νρ (cid:2) λ Φ λN ( R ) − β Φ − βN ( R ) (cid:3) ≤ λ Φ λN (cid:2) νρ ( R ) (cid:3) − β Φ − βN (cid:2) νρ ( R ) (cid:3) ≤ B ( ν, ρ ) , where B ( ν, ρ ) = ν (cid:20)Z G η ( λ ) G − η ( β ) π exp( − αr ) ( r ) dα (cid:21) + (cid:16) G − η ( β ) η (cid:17) K (cid:2) ν, µ exp (cid:2) − (cid:0) G − η ( β ) η (cid:1) − R λβ π exp( − αr ) ( r ) dα (cid:3)(cid:3) + ν (cid:8) K ( ρ, π exp( − λr ) (cid:3)(cid:9) + (cid:16) G − η ( β )+ λη (cid:17) log (cid:0) ǫ (cid:1) ≤ ν h(cid:2) G η ( λ ) − G − η ( β ) (cid:3) r ⋆ + log (cid:16) G η ( λ ) G − η ( β ) (cid:17) d e i + (cid:16) G − η ( β ) η (cid:17) K (cid:2) ν, µ exp (cid:2) − (cid:0) G − η ( β ) η (cid:1) − R λβ π exp( − αr ) ( r ) dα (cid:3)(cid:3) + ν (cid:8) K ( ρ, π exp( − λr ) (cid:3)(cid:9) + (cid:16) G − η ( β )+ λη (cid:17) log (cid:0) ǫ (cid:1) , and where the function G η is defined by equation (1.17, page 30). 
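The weighting of sub-models by their integrated Gibbs empirical risk, which defines the local priors and posteriors on the index set used in Theorems 1.3.14 and 1.3.15, can also be sketched numerically. The code below forms a posterior ν on the index set proportional to µ(m) exp[−(1/2) ∫_β^λ π_{exp(−αr)}(m, r) dα], the bound-optimal shape noted earlier for Theorem 1.3.10; the localized results above use the same kind of weighting with different constants. Sub-models, data and constants here are hypothetical.

```python
import numpy as np

def logsumexp(v):
    m = v.max()
    return m + np.log(np.exp(v - m).sum())

def integrated_gibbs_risk(r, beta, lam):
    # int_beta^lam pi_{exp(-alpha r)}(r) d alpha = log pi[exp(-beta r)] - log pi[exp(-lam r)]
    flat = -np.log(r.size)                      # flat prior within the sub-model
    return logsumexp(flat - beta * r) - logsumexp(flat - lam * r)

# Hypothetical family of sub-models, each summarised by its empirical error rates.
rng = np.random.default_rng(4)
models = [np.sort(0.35 - 0.03 * m + rng.uniform(0.0, 0.25, size=20 * 2 ** m)) for m in range(6)]
mu = np.full(len(models), 1.0 / len(models))    # flat prior on the index set

beta, lam = 50.0, 300.0
integrated = np.array([integrated_gibbs_risk(r, beta, lam) for r in models])
log_nu = np.log(mu) - 0.5 * integrated          # d nu / d mu (m) prop. to exp(-integral / 2)
nu = np.exp(log_nu - logsumexp(log_nu))
print(np.round(integrated, 2), np.round(nu, 3)) # small integrated risk gets most of the weight
```

Models with a large integrated Gibbs risk are discounted exponentially, which is the behaviour discussed below.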
A first remark: if we had the stamina to use Cauchy Schwarz inequalities (or moregenerally H¨older inequalities) on exponential moments instead of using weightedunion bounds on deviation inequalities, we could have replaced log( ǫ ) with − log( ǫ )in the above inequalities.We see that we have achieved the desired kind of localization of Theorem 1.3.10(page 25), since the new empirical entropy term K [ ν, µ exp[ − ξ R λβ π exp( − αr ) ( r ) dα ] ]cancels for a value of the posterior distribution on the index set ν which is of thesame form as the one minimizing the bound B ( ν, ρ ) of Theorem 1.3.10 (with adecreased constant, as could be expected). In a typical parametric setting, we willhave Z λβ π exp( − αr ) ( r ) dα ≃ ( λ − β ) r ⋆ ( m ) + log (cid:16) λβ (cid:17) d e ( m ) , and therefore, if we choose for ν the Dirac mass at b m ∈ arg min m ∈ M r ⋆ ( m ) + log( λβ ) λ − β d e ( m ),2 Chapter 1. Inductive PAC-Bayesian learning and ρ ( m, · ) = π exp( − λr ) ( m, · ), we will get, in the case when the index set M iscountable, B ( ν, ρ ) . max (cid:2) G η ( λ ) − G − η ( β ) (cid:3) , ( λ − β ) log (cid:2) Gη ( λ ) G − η ( β ) (cid:3) log( λβ ) × h r ⋆ ( b m ) + log( λβ ) λ − β d e ( b m ) i + (cid:16) G − η ( β ) η (cid:17) log ( X m ∈ M µ ( m ) µ ( b m ) exp (cid:20) − (cid:16) G − η ( β ) η (cid:17) − × n ( λ − β ) (cid:2) r ⋆ ( m ) − r ⋆ ( b m ) (cid:3) + log (cid:0) λβ (cid:1)(cid:2) d e ( m ) − d e ( b m ) (cid:3)o(cid:21)) + (cid:16) G − η ( β )+ λη (cid:17) log (cid:0) ǫ (cid:1) . This shows that the impact on the bound of the addition of supplementary modelsdepends on their penalized minimum empirical risk r ⋆ ( m ) + log( λβ ) λ − β d e ( m ). Moreprecisely the adaptive and local complexity factorlog ( X m ∈ M µ ( m ) µ ( b m ) exp (cid:20) − (cid:16) G − η ( β ) η (cid:17) − × n ( λ − β ) (cid:2) r ⋆ ( m ) − r ⋆ ( b m ) (cid:3) + log (cid:0) λβ (cid:1)(cid:2) d e ( m ) − d e ( b m ) (cid:3)o(cid:21)) replaces in this bound the non local factor K ( ν, µ ) = − log (cid:2) µ ( b m ) (cid:3) = log " X m ∈ M µ ( m ) µ ( b m ) which appears when applying Theorem 1.3.10 (page 25) to the Dirac mass ν = δ b m .Thus in the local bound, the influence of models decreases exponentially fast whentheir penalized empirical risk increases.One can deduce a result about the deviations with respect to the posterior νρ from Theorem 1.3.15 (page 31) without much supplementary work: it is enough forthat purpose to remark that with P probability at least 1 − ǫ , for any posteriordistribution ν : Ω → M ( M ), ν (cid:20) log n π exp( − λr ) h exp (cid:8) λ Φ λN ( R ) − β Φ − βN ( R ) (cid:9)io(cid:21) − ν Z G η ( λ ) G − η ( β ) π exp( − αr ) ( r ) dα ! − (cid:16) G − η ( β ) η (cid:17) K (cid:2) ν, µ exp h − (cid:0) G − η ( β ) η (cid:1) − R λβ π exp( − αr ) ( r ) dα i(cid:3) − (cid:16) G − η ( β )+ λη (cid:17) log (cid:16) ǫ (cid:17) ≤ , this inequality being obtained by taking a supremum in ρ in Theorem 1.3.15 (page31). One can then take a supremum in ν , to get, still with P probability at least .3. Local bounds − ǫ ,log ( µ exp h − (cid:0) G − η ( β ) η (cid:17) − R λβ π exp( − αr ) ( r ) dα i"n π exp( − λr ) h exp (cid:8) λ Φ λN ( R ) − β Φ − βN ( R ) (cid:9)io(cid:0) G − η ( β ) η (cid:1) − × exp − (cid:16) G − η ( β ) η (cid:17) − Z G η ( λ ) G − η ( β ) π exp( − αr ) ( r ) dα ! ≤ G − η ( β )+ λη G − η ( β ) η log (cid:0) ǫ (cid:1) . 
Using the fact that x x α is concave when α = (cid:0) G − η ( β ) η (cid:1) − < 1, we get forany posterior conditional distribution ρ : Ω × M → M (Θ), µ exp h − (cid:0) G − η ( β ) η (cid:17) − R λβ π exp( − αr ) ( r ) dα i ρ ( exp "(cid:16) G − η ( β ) η (cid:17) − λ Φ λN ( R ) − β Φ − βN ( R ) − Z G η ( λ ) G − η ( β ) π exp( − αr ) ( r ) dα + log (cid:20) dρdπ exp( − λr ) ( b m, b θ ) (cid:21)! ≤ exp G − η ( β )+ λη G − η ( β ) η log (cid:0) ǫ (cid:1)! . We can thus state Theorem 1.3.16 . For any ǫ ∈ )0 , , with P probability at least − ǫ , for anyposterior distribution ν : Ω → M ( M ) and conditional posterior distribution ρ :Ω × M → M (Θ) , for any ξ ∈ )0 , , with νρ probability at least − ξ , λ Φ λN ( R ) − β Φ − βN ( R ) ≤ Z G η ( λ ) G − η ( β ) π exp( − αr ) ( r ) dα + (cid:16) G − η ( β ) η (cid:17) log dνdµ exp h − (cid:0) G − η ( β ) η (cid:1) − R λβ π exp( − αr ) ( r ) dα i ( b m ) + log (cid:20) dρdπ exp( − λr ) ( b m, b θ ) (cid:21) + (cid:16) G − η ( β )+ λη (cid:17) log (cid:0) ǫ (cid:1) − (cid:16) G − η ( β ) η (cid:17) log( ξ ) . Note that the given bound consequently holds with P νρ probability at least(1 − ǫ )(1 − ξ ) ≥ − ǫ − ξ .4 Chapter 1. Inductive PAC-Bayesian learning The behaviour of the minimum of the empirical process θ r ( θ ) is known todepend on the covariances between pairs (cid:2) r ( θ ) , r ( θ ′ ) (cid:3) , θ, θ ′ ∈ Θ. In this respect,our previous study, based on the analysis of the variance of r ( θ ) (or technicallyon some exponential moment playing quite the same role), loses some accuracy insome circumstances (namely when inf Θ R is not close enough to zero).In this section, instead of bounding the expected risk ρ ( R ) of any posteriordistribution, we are going to upper bound the difference ρ ( R ) − inf Θ R , and moregenerally ρ ( R ) − R ( e θ ), where e θ ∈ Θ is some fixed parameter value.In the next section we will analyse ρ ( R ) − π exp( − βR ) ( R ), allowing us to comparethe expected error rate of a posterior distribution ρ with the error rate of a Gibbsprior distribution. We will also analyse ρ ( R ) − ρ ( R ), where ρ and ρ are twoarbitrary posterior distributions, using comparison with a Gibbs prior distributionas a tool, and in particular as a tool to establish the required Kullback divergencebounds.Relative bounds do not provide the same kind of results as direct bounds onthe error rate: it is not possible to estimate ρ ( R ) with an order of precision higherthan ( ρ ( R ) /N ) / , so that relative bounds cannot of course achieve that, but theyprovide a way to reach a faster rate for ρ ( R ) − inf Θ R , that is for the relativeperformance of the estimator within a restricted model.The study of PAC-Bayesian relative bounds was initiated in the second and thirdparts of J.-Y. Audibert’s dissertation (Audibert, 2004b).In this section and the next, we will suggest a series of possible uses of relativebounds. As usual, we will start with the simplest inequalities and proceed towardsmore sophisticated techniques with better theoretical properties, but at the sametime less precise constants, so that which one is the more fitted will depend on thesize of the training sample.The first thing we will do is to compute for any posterior distribution ρ : Ω → M (Θ) a relative performance bound bearing on ρ ( R ) − inf Θ R . We will also com-pare the classification model indexed by Θ with a sub-model indexed by one ofits measurable subsets Θ ⊂ Θ. 
For this purpose we will form the difference ρ ( R ) − R ( e θ ), where e θ ∈ Θ is some possibly unobservable value of the parame-ter in the sub-model defined by Θ , typically chosen in arg min Θ R . If this is soand ρ ( R ) − R ( e θ ) = ρ ( R ) − inf Θ R , a negative upper bound indicates that it isdefinitely worth using a randomized estimator ρ supported by the larger parameterset Θ instead of using only the classification model defined by the smaller set Θ . Relative bounds in this section are based on the control of r ( θ ) − r ( e θ ), where θ, e θ ∈ Θ.These differences are related to the random variables ψ i ( θ, e θ ) = σ i ( θ ) − σ i ( e θ ) = (cid:2) f θ ( X i ) = Y i (cid:3) − (cid:2) f e θ ( X i ) = Y i (cid:3) . Some supplementary technical difficulties, as compared to the previous sections,come from the fact that ψ i ( θ, e θ ) takes three values, whereas σ i ( θ ) takes only two.Let(1.19) r ′ ( θ, e θ ) = r ( θ ) − r ( e θ ) = 1 N N X i =1 ψ i ( θ, e θ ) , θ, e θ ∈ Θ , .4. Relative bounds R ′ ( θ, e θ ) = R ( θ ) − R ( e θ ) = P (cid:2) r ′ ( θ, e θ ) (cid:3) . We have as usual from independence thatlog n P h exp (cid:2) − λr ′ ( θ, e θ ) (cid:3)io = N X i =1 log n P h exp (cid:2) − λN ψ i ( θ, e θ ) (cid:3)io ≤ N log (cid:26) N N X i =1 P n exp h − λN ψ i ( θ, e θ ) io(cid:27) . Let C i be the distribution of ψ i ( θ, e θ ) under P and let ¯ C = N P Ni =1 C i ∈ M (cid:0) {− , , } (cid:1) . With this notation(1.20) log n P h exp (cid:2) − λr ′ ( θ, e θ ) (cid:3)io ≤ N log (cid:26)Z ψ ∈{− , , } exp (cid:16) − λN ψ (cid:17) ¯ C ( dψ ) (cid:27) . The right-hand side of this inequality is a function of ¯ C . On the other hand, ¯ C being a probability measure on a three point set, is defined by two parameters, thatwe may take equal to R ψ ¯ C ( dψ ) and R ψ ¯ C ( dψ ). To this purpose, let us introduce M ′ ( θ, e θ ) = Z ψ ¯ C ( dψ ) = ¯ C (+1) + ¯ C ( − 1) = 1 N N X i =1 P (cid:2) ψ i ( θ, e θ ) (cid:3) , θ, e θ ∈ Θ . It is a pseudo distance (meaning that it is symmetric and satisfies the triangleinequality), since it can also be written as M ′ ( θ, e θ ) = 1 N N X i =1 P n(cid:12)(cid:12)(cid:12) (cid:2) f θ ( X i ) = Y i (cid:3) − (cid:2) f e θ ( X i ) = Y i (cid:3)(cid:12)(cid:12)(cid:12)o , θ, e θ ∈ Θ . It is readily seen that N log (cid:26)Z exp (cid:18) − λN ψ (cid:19) ¯ C ( dψ ) (cid:27) = − λ Ψ λN (cid:2) R ′ ( θ, e θ ) , M ′ ( θ, e θ ) (cid:3) , where Ψ a ( p, m ) = − a − log h (1 − m ) + m + p − a ) + m − p a ) i = − a − log n − sinh( a ) (cid:2) p − m tanh( a ) (cid:3)o . (1.21)Thus plugging this equality into inequality (1.20, page 35) we get Theorem 1.4.1 . For any real parameter λ , log n P h exp (cid:2) − λr ′ ( θ, e θ ) (cid:3)io ≤ − λ Ψ λN (cid:2) R ′ ( θ, e θ ) , M ′ ( θ, e θ ) (cid:3) , θ, e θ ∈ Θ , where r ′ is defined by equation (1.19, page 34) and Ψ and M ′ are defined just above. To make a link with previous work of Mammen and Tsybakov — see e.g.Mammen et al. (1999) and Tsybakov (2004) — we may consider the pseudo-distance D on Θ defined by equation (1.3, page 7). This distance only depends on the dis-tribution of the patterns. It is often used to formulate margin assumptions, in thesense of Mammen and Tsybakov. Here we are going to work rather with M ′ : as it6 Chapter 1. 
Inductive PAC-Bayesian learning is dominated by D in the sense that M ′ ( θ, e θ ) ≤ D ( θ, e θ ), θ, e θ ∈ Θ, with equality inthe important case of binary classification, hypotheses formulated on D induce hy-potheses on M ′ , and working with M ′ may only sharpen the results when comparedto working with D .Using the same reasoning as in the previous section, we deduce Theorem 1.4.2 . For any real parameter λ , any e θ ∈ Θ , any prior distribution π ∈ M (Θ) , P (cid:26) exp (cid:20) sup ρ ∈ M (Θ) λ h ρ (cid:8) Ψ λN (cid:2) R ′ ( · , e θ ) , M ′ ( · , e θ ) (cid:3)(cid:9) − ρ (cid:2) r ′ ( · , e θ ) (cid:3)i − K ( ρ, π ) (cid:21)(cid:27) ≤ . We are now going to derive some other type of relative exponential inequal-ity. In Theorem 1.4.2 we obtained an inequality comparing one observed quantity ρ (cid:2) r ′ ( · , e θ ) (cid:3) with two unobserved ones, ρ (cid:2) R ′ ( · , e θ ) (cid:3) and ρ (cid:2) M ′ ( · , e θ ) (cid:3) , — indeed, becauseof the convexity of the function λ Ψ λN , λρ (cid:8) Ψ λN (cid:2) R ′ ( · , e θ ) , M ′ ( · , e θ ) (cid:3)(cid:9) ≥ λ Ψ λN (cid:8) ρ (cid:2) R ′ ( · , e θ ) (cid:3) , ρ (cid:2) M ′ ( · , e θ ) (cid:3)(cid:9) . This may be inconvenient when looking for an empirical bound for ρ (cid:2) R ′ ( · , e θ ) (cid:3) ,and we are going now to seek an inequality comparing ρ (cid:2) R ′ ( · , e θ ) (cid:3) with empiricalquantities only.This is possible by considering the log-Laplace transform of some modified ran-dom variable χ i ( θ, e θ ). We may consider more precisely the change of variable definedby the equation exp (cid:18) − λN χ i (cid:19) = 1 − λN ψ i , which is possible when λN ∈ ) − , 1( and leads to define χ i = − Nλ log (cid:18) − λN ψ i (cid:19) . We may then work on the log-Laplace transformlog ( P " exp (cid:26) − λN N X i =1 χ i ( θ, e θ ) (cid:27) = log ( P " N Y i =1 (cid:18) − λN ψ i ( θ, e θ ) (cid:19) = log ( P " exp (cid:26) N X i =1 log (cid:20) − λN ψ i ( θ, e θ ) (cid:21)(cid:27) . We may now follow the same route as previously, writinglog ( P " exp (cid:26) N X i =1 log (cid:20) − λN ψ i ( θ, e θ ) (cid:21)(cid:27) = N X i =1 log (cid:20) − λN P (cid:2) ψ i ( θ, e θ ) (cid:3)(cid:21) ≤ N log h − λN R ′ ( θ, e θ ) i . Let us also introduce the random pseudo distance .4. Relative bounds m ′ ( θ, e θ ) = 1 N N X i =1 ψ i ( θ, e θ ) = 1 N N X i =1 (cid:12)(cid:12)(cid:12) (cid:2) f θ ( X i ) = Y i (cid:3) − (cid:2) f e θ ( X i ) = Y i (cid:3)(cid:12)(cid:12)(cid:12) , θ, e θ ∈ Θ . This is the empirical counterpart of M ′ , implying that P ( m ′ ) = M ′ . Let us noticethat1 N N X i =1 log (cid:2) − λN ψ i ( θ, e θ ) (cid:3) = log(1 − λN ) − log(1 + λN )2 r ′ ( θ, e θ )+ log(1 − λN ) + log(1 + λN )2 m ′ ( θ, e θ )= 12 log − λN λN ! r ′ (cid:0) θ, e θ (cid:1) + 12 log (cid:0) − λ N (cid:1) m ′ (cid:0) θ, e θ (cid:1) . Let us put γ = N log (cid:18) λN − λN (cid:19) , so that λ = N tanh (cid:0) γN (cid:1) and N log (cid:16) − λ N (cid:17) = − N log (cid:2) cosh( γN ) (cid:3) . With this notation, we can conveniently write the previous inequality as P n exp h − N log (cid:2) − tanh (cid:0) γN (cid:1) R ′ ( θ, e θ ) (cid:3) − γr ′ (cid:0) θ, e θ (cid:1) − N log (cid:2) cosh( γN ) (cid:3) m ′ (cid:0) θ, e θ (cid:1)io ≤ . Integrating with respect to a prior probability measure π ∈ M (Θ), we obtain Theorem 1.4.3 . 
For any real parameter γ , for any e θ ∈ Θ , for any prior probabilitydistribution π ∈ M (Θ) , P ( exp " sup ρ ∈ M (Θ) (cid:26) − N ρ n log (cid:2) − tanh (cid:0) γN (cid:1) R ′ ( · , e θ ) (cid:3)o − γρ (cid:2) r ′ ( · , e θ ) (cid:3) − N log (cid:2) cosh( γN ) (cid:3) ρ (cid:2) m ′ ( · , e θ ) (cid:3) − K ( ρ, π ) (cid:27) ≤ . Let us first deduce a non-random bound from Theorem 1.4.2 (page 36). This the-orem can be conveniently taken advantage of by throwing the non-linearity into alocalized prior, considering the prior probability measure µ defined by its density dµdπ ( θ ) = exp (cid:8) − λ Ψ λN (cid:2) R ′ ( θ, e θ ) , M ′ ( θ, e θ ) (cid:3) + βR ′ ( θ, e θ ) (cid:9) π n exp (cid:8) − λ Ψ λN (cid:2) R ′ ( · , e θ ) , M ′ ( · , e θ ) (cid:3) + βR ′ ( · , e θ ) (cid:9)o . Chapter 1. Inductive PAC-Bayesian learning Indeed, for any posterior distribution ρ : Ω → M (Θ), K ( ρ, µ ) = K ( ρ, π ) + λρ n Ψ λN (cid:2) R ′ ( · , e θ ) , M ′ ( · , e θ ) (cid:3)o − βρ (cid:2) R ′ ( · , e θ ) (cid:3) + log n π h exp (cid:8) − λ Ψ λN (cid:2) R ′ ( · , e θ ) , M ′ ( · , e θ ) (cid:3) + βR ′ ( · , e θ ) (cid:3)(cid:9)io . Plugging this into Theorem 1.4.2 (page 36) and using the convexity of the exponen-tial function, we see that for any posterior probability distribution ρ : Ω → M (Θ), β P (cid:8) ρ (cid:2) R ′ ( · , e θ ) (cid:3)(cid:9) ≤ λ P (cid:8) ρ (cid:2) r ′ ( · , e θ ) (cid:3)(cid:9) + P (cid:2) K ( ρ, π ) (cid:3) + log n π h exp (cid:8) − λ Ψ λN (cid:2) R ′ ( · , e θ ) , M ′ ( · , e θ ) (cid:3) + βR ′ ( · , e θ ) (cid:3)(cid:9)io . We can then recall that λρ (cid:2) r ′ ( · , e θ ) (cid:3) + K ( ρ, π ) = K (cid:2) ρ, π exp( − λr ) (cid:3) − log n π h exp (cid:2) − λr ′ ( · , e θ ) (cid:3)io , and notice moreover that − P (cid:26) log n π h exp (cid:2) − λr ′ ( · , e θ ) (cid:3)io(cid:27) ≤ − log n π h exp (cid:2) − λR ′ ( · , e θ ) (cid:3)io , since R ′ = P ( r ′ ) and h log n π (cid:2) exp( h ) (cid:3)o is a convex functional. Putting these tworemarks together, we obtain Theorem 1.4.4 . For any real positive parameter λ , for any prior distribution π ∈ M (Θ) , for any posterior distribution ρ : Ω → M (Θ) , P (cid:8) ρ (cid:2) R ′ ( · , e θ ) (cid:3)(cid:9) ≤ β P (cid:2) K ( ρ, π exp( − λr ) ) (cid:3) + 1 β log n π h exp (cid:8) − λ Ψ λN (cid:2) R ′ ( · , e θ ) , M ′ ( · , e θ ) (cid:3) + βR ′ ( · , e θ ) (cid:3)(cid:9)io − β log n π h exp (cid:2) − λR ′ ( · , e θ ) (cid:3)io ≤ β P (cid:2) K ( ρ, π exp( − λr ) ) (cid:3) + 1 β log n π h exp (cid:8) − (cid:2) N sinh( λN ) − β (cid:3) R ′ ( · , e θ )+ 2 N sinh( λ N ) M ′ ( · , e θ ) (cid:9)io − β log n π h exp (cid:2) − λR ′ ( · , e θ ) (cid:3)io . It may be interesting to derive some more suggestive (but slightly weaker) boundin the important case when Θ = Θ and R ( e θ ) = inf Θ R . In this case, it is convenientto introduce the expected margin function (1.23) ϕ ( x ) = sup θ ∈ Θ M ′ ( θ, e θ ) − xR ′ ( θ, e θ ) , x ∈ R + . We see that ϕ is convex and non-negative on R + . Using the bound M ′ ( θ, e θ ) ≤ xR ′ ( θ, e θ ) + ϕ ( x ), we obtain .4. Relative bounds P (cid:8) ρ (cid:2) R ′ ( · , e θ ) (cid:3)(cid:9) ≤ β P (cid:2) K ( ρ, π exp( − λr ) ) (cid:3) + 1 β log (cid:26) π (cid:20) exp n − (cid:8) N sinh( λN ) (cid:2) − x tanh( λ N ) (cid:3) − β (cid:9) R ′ ( · , e θ ) o(cid:21)(cid:27) + N sinh( λN ) tanh( λ N ) β ϕ ( x ) − β log n π h exp (cid:2) − λR ′ ( · , e θ ) (cid:3)io . Let us make the change of variable γ = N sinh( λN ) (cid:2) − x tanh( λ N ) (cid:3) − β to obtain Corollary 1.4.5 . 
For any real positive parameters x , γ and λ such that x ≤ tanh( λ N ) − and ≤ γ < N sinh( λN ) (cid:2) − x tanh( λ N ) (cid:3) , P (cid:2) ρ ( R ) (cid:3) − inf Θ R ≤ n N sinh( λN ) (cid:2) − x tanh( λ N ) (cid:3) − γ o − × (cid:26)Z λγ (cid:2) π exp( − αR ) ( R ) − inf Θ R (cid:3) dα + N sinh (cid:0) λN (cid:1) tanh (cid:0) λ N (cid:1) ϕ ( x ) + P (cid:2) K ( ρ, π exp( − λr ) ) (cid:3)(cid:27) . Let us remark that these results, although well suited to study Mammen andTsybakov’s margin assumptions, hold in the general case: introducing the convex expected margin function ϕ is a substitute for making hypotheses about the relationsbetween R and D .Using the fact that R ′ ( θ, e θ ) ≥ θ ∈ Θ and that ϕ ( x ) ≥ x ∈ R + , we canweaken and simplify the preceding corollary even more to get Corollary 1.4.6 . For any real parameters β , λ and x such that x ≥ and ≤ β < λ − x λ N , for any posterior distribution ρ : Ω → M (Θ) , P (cid:2) ρ ( R ) (cid:3) ≤ inf Θ R + h λ − x λ N − β i − (cid:26)Z λβ (cid:2) π exp( − αR ) ( R ) − inf Θ R (cid:3) dα + P (cid:8) K (cid:2) ρ, π exp( − λr ) (cid:3)(cid:9) + ϕ ( x ) λ N (cid:27) . Let us apply this bound under the margin assumption first considered by Mam-men and Tsybakov (Mammen et al., 1999; Tsybakov, 2004), which says that forsome real positive constant c and some real exponent κ ≥ R ′ ( θ, e θ ) ≥ cD ( θ, e θ ) κ , θ ∈ Θ . In the case when κ = 1, then ϕ ( c − ) = 0, proving that P (cid:8) π exp( − λr ) (cid:2) R ′ ( · , e θ ) (cid:3)(cid:9) ≤ R λβ π exp( − γR ) (cid:2) R ′ ( · , e θ ) (cid:3) dγN sinh( λN ) (cid:2) − c − tanh( λ N ) (cid:3) − β ≤ R λβ π exp( − γR ) (cid:2) R ′ ( · , e θ ) (cid:3) dγλ − λ cN − β . Chapter 1. Inductive PAC-Bayesian learning Taking for example λ = cN , β = λ = cN , we obtain P (cid:2) π exp( − − cNr ) ( R ) (cid:3) ≤ inf R + 8 cN Z cN cN π exp( − γR ) (cid:2) R ′ ( · , e θ ) (cid:3) dγ ≤ inf R + 2 π exp( − cN R ) (cid:2) R ′ ( · , e θ ) (cid:3) . If moreover the behaviour of the prior distribution π is parametric, meaning that π exp( − βR ) (cid:2) R ′ ( · , e θ ) (cid:3) ≤ dβ , for some positive real constant d linked with the dimensionof the classification model, then P (cid:2) π exp( − cN r ) ( R ) (cid:3) ≤ inf R + 8 log(2) dcN ≤ inf R + 5 . dcN . In the case when κ > ϕ ( x ) ≤ ( κ − κ − κκ − ( cx ) − κ − = (1 − κ − )( κcx ) − κ − , thus P (cid:8) π exp( − λr ) (cid:2) R ′ ( · , e θ ) (cid:3)(cid:9) ≤ R λβ π exp( − γR ) (cid:2) R ′ ( · , e θ ) (cid:3) dγ + (1 − κ − )( κcx ) − κ − λ N λ − xλ N − β . Taking for instance β = λ , x = N λ , and putting b = (1 − κ − )( cκ ) − κ − , we obtain P (cid:2) π exp( − λr ) ( R ) (cid:3) − inf R ≤ λ Z λλ/ π exp( − γR ) (cid:2) R ′ ( · , e θ ) (cid:3) dγ + b (cid:18) λN (cid:19) κκ − . In the parametric case when π exp( − γR ) (cid:2) R ′ ( · , e θ ) (cid:3) ≤ dγ , we get P (cid:2) π exp( − λr ) ( R ) (cid:3) − inf R ≤ dλ + b (cid:18) λN (cid:19) κκ − . Taking λ = 2 − (cid:2) d (cid:3) κ − κ − ( κc ) κ − N κ κ − , we obtain P (cid:2) π exp( − λr ) ( R ) (cid:3) − inf R ≤ (2 − κ − )( κc ) − κ − (cid:18) dN (cid:19) κ κ − . We see that this formula coincides with the result for κ = 1. We can thus reducethe two cases to a single one and state Corollary 1.4.7 . Let us assume that for some e θ ∈ Θ , some positive real constant c , some real exponent κ ≥ and for any θ ∈ Θ , R ( θ ) ≥ R ( e θ ) + cD ( θ, e θ ) κ . Let usalso assume that for some positive real constant d and any positive real parameter γ , π exp( − γR ) ( R ) − inf R ≤ dγ . 
Then P h π exp (cid:8) − − [8 log(2) d ] κ − κ − ( κc ) κ − N κ κ − r (cid:9) ( R ) i ≤ inf R + (2 − κ − )( κc ) − κ − (cid:18) dN (cid:19) κ κ − . .4. Relative bounds N in this corollary is known to be the mini-max exponent under these assumptions: it is unimprovable, whatever estimator isused in place of the Gibbs posterior shown here (at least in the worst case com-patible with the hypotheses). The interest of the corollary is to show not only theminimax exponent in N , but also an explicit non-asymptotic bound with reason-able and simple constants. It is also clear that we could have got slightly betterconstants if we had kept the full strength of Theorem 1.4.4 (page 38) instead ofusing the weaker Corollary 1.4.6 (page 39).We will prove in the following empirical bounds showing how the constant λ canbe estimated from the data instead of being chosen according to some margin andcomplexity assumptions. We are going to define an empirical counterpart for the expected margin function ϕ . It will appear in empirical bounds having otherwise the same structure as thenon-random bound we just proved. Anyhow, we will not launch into trying tocompare the behaviour of our proposed empirical margin function with the expectedmargin function , since the margin function involves taking a supremum which isnot straightforward to handle. When we will touch the issue of building provably adaptive estimators, we will instead formulate another type of bounds based onintegrated quantities, rather than try to analyse the properties of the empiricalmargin function.Let us start as in the previous subsection with the inequality β P n ρ (cid:2) R ′ ( · , e θ ) (cid:3)o ≤ P n λρ (cid:2) r ′ ( · , e θ ) (cid:3) + K ( ρ, π ) o + log n π h exp (cid:8) − λ Ψ λN (cid:2) R ′ ( · , e θ ) , M ′ ( · , e θ ) (cid:3) + βR ′ ( · , e θ ) (cid:9)io . We have already defined by equation (1.22, page 37) the empirical pseudo-distance m ′ ( θ, e θ ) = 1 N N X i =1 ψ i ( θ, e θ ) . Recalling that P (cid:2) m ′ ( θ, e θ ) (cid:3) = M ′ ( θ, e θ ), and using the convexity of h log n π (cid:2) exp( h ) (cid:3)o , leads to the following inequalities:log n π h exp (cid:8) − λ Ψ λN (cid:2) R ′ ( · , e θ ) , M ′ ( · , e θ ) (cid:3) + βR ′ ( · , e θ ) (cid:9)io ≤ log n π h exp (cid:8) − N sinh( λN ) R ′ ( · , e θ )+ N sinh( λN ) tanh( λ N ) M ′ ( · , e θ ) + βR ′ ( · , e θ ) (cid:3)(cid:9)io ≤ P (cid:26) log n π h exp (cid:8) − (cid:2) N sinh( λN ) − β (cid:3) r ′ ( · , e θ )+ N sinh( λN ) tanh( λ N ) m ′ ( · , e θ ) (cid:9)io(cid:27) . We may moreover remark that2 Chapter 1. Inductive PAC-Bayesian learning λρ (cid:2) r ′ ( · , e θ ) (cid:3) + K ( ρ, π ) = (cid:2) β − N sinh( λN ) + λ (cid:3) ρ (cid:2) r ′ ( · , e θ ) (cid:3) + K (cid:2) ρ, π exp {− [ N sinh( λN ) − β ] r } (cid:3) − log n π h exp (cid:8) − (cid:2) N sinh( λN ) − β (cid:3) r ′ ( · , e θ ) (cid:9)io . This establishes Theorem 1.4.8 . For any positive real parameters β and λ , for any posterior dis-tribution ρ : Ω → M (Θ) , P (cid:8) ρ (cid:2) R ′ ( · , e θ ) (cid:3)(cid:9) ≤ P (cid:26)(cid:20) − N sinh( λN ) − λβ (cid:21) ρ (cid:2) r ′ ( · , e θ ) (cid:3) + K (cid:2) ρ, π exp {− [ N sinh( λN ) − β ] r } (cid:3) β + β − log n π exp {− [ N sinh( λN ) − β ] r } h exp (cid:2) N sinh( λN ) tanh( λ N ) m ′ ( · , e θ ) (cid:3)io(cid:27) . Taking β = N sinh( λN ), using the fact that sinh( a ) ≥ a , a ≥ a ) = a − (cid:2)p a ) − (cid:3) and a = log (cid:2)p a ) +sinh( a ) (cid:3) , we deduce Corollary 1.4.9 . 
For any positive real constant β and any posterior distribution ρ : Ω → M (Θ) , P (cid:8) ρ (cid:2) R ′ ( · , e θ ) (cid:3)(cid:9) ≤ P ((cid:20) Nβ log (cid:16)q β N + βN (cid:17) − (cid:21)| {z } ≤ ρ (cid:2) r ′ ( · , e θ ) (cid:3) + 1 β (cid:26) K (cid:2) ρ, π exp( − βr ) (cid:3) + log (cid:20) π exp( − βr ) n exp h N (cid:16)q β N − (cid:17) m ′ ( · , e θ ) io(cid:21)(cid:27)) . This theorem and its corollary are really analogous to Theorem 1.4.4 (page 38),and it could easily be proved that under Mammen and Tsybakov margin assump-tions we obtain an upper bound of the same order as Corollary 1.4.7 (page 40).Anyhow, in order to obtain an empirical bound, we are now going to take a supre-mum over all possible values of e θ , that is over Θ . Although we believe that takingthis supremum will not spoil the bound in cases when over-fitting remains un-der control, we will not try to investigate precisely if and when this is actuallytrue, and provide our empirical bound as such. Let us say only that on qualitativegrounds, the values of the margin function quantify the steepness of the contrastfunction R or its empirical counterpart r , and that the definition of the empiricalmargin function is obtained by substituting P , the true sample distribution, with P = (cid:0) N P Ni =1 δ ( X i ,Y i ) (cid:1) ⊗ N , the empirical sample distribution, in the definition ofthe expected margin function. Therefore, on qualitative grounds, it seems hopelessto presume that R is steep when r is not, or in other words that a classificationmodel that would be inefficient at estimating a bootstrapped sample according toour non-random bound would be by some miracle efficient at estimating the true .4. Relative bounds e θ ∈ arg min Θ R .To obtain an observable bound, let b θ ∈ arg min θ ∈ Θ r ( θ ) and let us introduce the empirical margin functions ϕ ( x ) = sup θ ∈ Θ m ′ ( θ, b θ ) − x (cid:2) r ( θ ) − r ( b θ ) (cid:3) , x ∈ R + , e ϕ ( x ) = sup θ ∈ Θ m ′ ( θ, b θ ) − x (cid:2) r ( θ ) − r ( b θ ) (cid:3) , x ∈ R + . Using the fact that m ′ ( θ, e θ ) ≤ m ′ ( θ, b θ ) + m ′ ( b θ, e θ ), we get Corollary 1.4.10 . For any positive real parameters β and λ , for any posteriordistribution ρ : Ω → M (Θ) , P (cid:2) ρ ( R ) (cid:3) − inf Θ R ≤ P (cid:26)h − N sinh( λN ) − λβ i(cid:2) ρ ( r ) − r ( b θ ) (cid:3) + K (cid:2) ρ, π exp {− [ N sinh( λN ) − β ] r } (cid:3) β + β − log n π exp {− [ N sinh( λN ) − β ] r } h exp (cid:2) N sinh (cid:0) λN (cid:1) tanh (cid:0) λ N (cid:1) m ′ ( · , b θ ) (cid:3)io + β − N sinh( λN ) tanh( λ N ) e ϕ (cid:20) βN sinh( λN ) tanh( λ N ) − N sinh( λN ) − λβ !(cid:21)(cid:27) . Taking β = N sinh( λN ) , we also obtain P (cid:2) ρ ( R ) (cid:3) − inf Θ R ≤ P ((cid:20) Nβ log (cid:16)q β N + βN (cid:17) − (cid:21)| {z } ≤ (cid:2) ρ ( r ) − r ( b θ ) (cid:3) + 1 β (cid:26) K (cid:2) ρ, π exp( − βr ) (cid:3) + log (cid:20) π exp( − βr ) n exp h N (cid:16)q β N − (cid:17) m ′ ( · , b θ ) io(cid:21)(cid:27) + Nβ (cid:16)q β N − (cid:17) e ϕ " log (cid:16)q β N + βN (cid:17) − βN (cid:16)q β N − (cid:17) . Note that we could also use the upper bound m ′ ( θ, b θ ) ≤ x (cid:2) r ( θ ) − r ( b θ ) (cid:3) + ϕ ( x )and put α = N sinh( λN ) (cid:2) − x tanh( λ N ) (cid:3) − β , to obtain Corollary 1.4.11 . For any non-negative real parameters x , α and λ , such that α < N sinh( λN ) (cid:2) − x tanh( λ N ) (cid:3) , for any posterior distribution ρ : Ω → M (Θ) , Chapter 1. 
Inductive PAC-Bayesian learning P (cid:2) ρ ( R ) (cid:3) − inf Θ R ≤ P ((cid:20) − N sinh( λN ) (cid:2) − x tanh( λ N ) (cid:3) − λN sinh( λN ) (cid:2) − x tanh( λ N ) (cid:3) − α (cid:21)(cid:2) ρ ( r ) − r ( b θ ) (cid:3) + K (cid:2) ρ, π exp( − αr ) (cid:3) N sinh( λN ) (cid:2) − x tanh( λ N ) (cid:3) − α + N sinh( λN ) tanh( λ N ) N sinh( λN ) (cid:2) − x tanh( λ N ) (cid:3) − α × (cid:20) ϕ ( x ) + e ϕ (cid:18) λ − αN sinh( λN ) tanh( λ N ) (cid:19)(cid:21)) . Let us notice that in the case when Θ = Θ, the upper bound provided by thiscorollary has the same general form as the upper bound provided by Corollary 1.4.5(page 39), with the sample distribution P replaced with the empirical distributionof the sample P = (cid:0) N P Ni =1 δ ( X i ,Y i ) (cid:1) ⊗ N . Therefore, our empirical bound can be ofa larger order of magnitude than our non-random bound only in the case when ournon-random bound applied to the bootstrapped sample distribution P would be ofa larger order of magnitude than when applied to the true sample distribution P . Inother words, we can say that our empirical bound is close to our non-random boundin every situation where the bootstrapped sample distribution P is not harder tobound than the true sample distribution P . Although this does not prove that ourempirical bound is always of the same order as our non-random bound, this is a goodqualitative hint that this will be the case in most practical situations of interest,since in situations of “under-fitting”, if they exist, it is likely that the choice of theclassification model is inappropriate to the data and should be modified.Another reassuring remark is that the empirical margin functions ϕ and e ϕ behavewell in the case when inf Θ r = 0. Indeed in this case m ′ ( θ, b θ ) = r ′ ( θ, b θ ) = r ( θ ), θ ∈ Θ, and thus ϕ (1) = e ϕ (1) = 0, and e ϕ ( x ) ≤ − ( x − 1) inf Θ r , x ≥ r ( b θ ) = 0, which is another hintthat this may be an accurate bound in many situations. It is natural to make use of Theorem 1.4.3 (page 37) to obtain empirical deviationbounds, since this theorem provides an empirical variance term.Theorem 1.4.3 is written in a way which exploits the fact that ψ i takes only thethree values − 1, 0 and +1. However, it will be more convenient for the followingcomputations to use it in its more general form, which only makes use of the factthat ψ i ∈ ( − , P ( exp " sup ρ ∈ M (Θ) (cid:26) − N ρ n log h − λP ( ψ ) io + N ρ n P h log(1 − λψ ) io − K ( ρ, π ) (cid:27) ≤ . .4. Relative bounds P = 1 N N X i =1 δ ( X i ,Y i ) , so that P is our notation for the empirical distribution of the process( X i , Y i ) Ni =1 . Moreover we have also used P = P ( P ) = 1 N N X i =1 P i , where it should be remembered that the joint distribution of the process ( X i , Y i ) Ni =1 is P = N Ni =1 P i . We have considered ψ ( θ, e θ ) as a function defined on X × Y as ψ ( θ, e θ )( x, y ) = (cid:2) y = f θ ( x ) (cid:3) − (cid:2) y = f e θ ( x ) (cid:3) , ( x, y ) ∈ X × Y so that it should beunderstood that P ( ψ ) = 1 N N X i =1 P (cid:2) ψ i ( θ, e θ ) (cid:3) = 1 N N X i =1 P n (cid:2) Y i = f θ ( X i ) (cid:3) − (cid:2) Y i = f e θ ( X i ) (cid:3)o = R ′ ( θ, e θ ) . In the same way P h log(1 − λψ ) i = 1 N N X i =1 log (cid:2) − λψ i ( θ, e θ ) (cid:3) . 
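To make this notation concrete, here is a minimal numerical sketch (the sample and the two threshold classifiers are invented placeholders, not objects from the text) of the per-example relative variable ψ_i(θ, θ̃) ∈ {−1, 0, +1}, the relative empirical error r′(θ, θ̃) = r(θ) − r(θ̃) and the empirical pseudo-distance m′(θ, θ̃), read here as the empirical disagreement rate (1/N) Σ_i ψ_i(θ, θ̃)²; this reading is consistent with the decomposition of the empirical mean of log(1 − λψ) into r′ and m′ terms recalled a little further on, which the sketch checks numerically.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sample: N labelled points and two invented threshold classifiers
# f_theta and f_theta_tilde (placeholders, not objects from the text).
N = 1000
X = rng.uniform(-1.0, 1.0, size=N)
Y = (X + 0.3 * rng.standard_normal(N) > 0).astype(int)

def f_theta(x):        # candidate rule, parameter theta
    return (x > 0.1).astype(int)

def f_theta_tilde(x):  # reference rule, parameter theta~
    return (x > 0.0).astype(int)

# Per-example relative variable psi_i(theta, theta~) in {-1, 0, +1}.
psi = (Y != f_theta(X)).astype(int) - (Y != f_theta_tilde(X)).astype(int)

r_prime = psi.mean()         # r'(theta, theta~) = r(theta) - r(theta~)
m_prime = (psi ** 2).mean()  # empirical disagreement rate, read here as m'(theta, theta~)

# Numerical check of the decomposition of the empirical mean of log(1 - lambda psi).
lam = 0.4
lhs = np.mean(np.log(1.0 - lam * psi))
rhs = (0.5 * np.log((1 - lam) / (1 + lam)) * r_prime
       + 0.5 * np.log(1 - lam ** 2) * m_prime)
print(f"r' = {r_prime:+.4f}, m' = {m_prime:.4f}")
print(f"empirical mean of log(1 - lambda psi): {lhs:.6f}  decomposition: {rhs:.6f}")
```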
Moreover integration with respect to ρ bears on the index θ , so that ρ n log h − λP ( ψ ) io = Z θ ∈ Θ log (cid:26) − λN N X i =1 P (cid:2) ψ i ( θ, e θ ) (cid:3)(cid:27) ρ ( dθ ) ,ρ n P h log(1 − λψ ) io = Z θ ∈ Θ (cid:26) N N X i =1 log (cid:2) − λψ i ( θ, e θ ) (cid:3)(cid:27) ρ ( dθ ) . We have chosen concise notation, as we did throughout these notes, in order tomake the computations easier to follow.To get an alternate version of empirical relative deviation bounds, we need to findsome convenient way to localize the choice of the prior distribution π in equation(1.25, page 44). Here we propose replacing π with µ = π exp {− N log[1+ βP ( ψ )] } , whichcan also be written π exp {− N log[1+ βR ′ ( · , e θ )] } . Indeed we see that K ( ρ, µ ) = N ρ n log (cid:2) βP ( ψ ) (cid:3)o + K ( ρ, π )+ log n π h exp (cid:8) − N log (cid:2) βP ( ψ ) (cid:3)(cid:9)io . Moreover, we deduce from our deviation inequality applied to − ψ , that (as long as β > − P (cid:26) exp (cid:20) N µ n P (cid:2) log(1 + βψ ) (cid:3)o − N µ n log (cid:2) βP ( ψ ) (cid:3)o(cid:21)(cid:27) ≤ . Chapter 1. Inductive PAC-Bayesian learning Thus P (cid:26) exp (cid:20) log n π h exp (cid:8) − N log (cid:2) βP ( ψ ) (cid:3)(cid:9)io − log n π h exp (cid:8) − N P (cid:2) log(1 + βψ ) (cid:3)(cid:9)io(cid:21)(cid:27) ≤ P (cid:26) exp (cid:20) − N µ n log (cid:2) βP ( ψ ) (cid:3)o − K ( µ, π )+ N µ n P (cid:2) log(1 + βψ ) (cid:3)o + K ( µ, π ) (cid:21)(cid:27) ≤ . This can be used to handle K ( ρ, µ ), making use of the Cauchy–Schwarz inequalityas follows P ( exp " (cid:20) − N log n(cid:16) − λρ (cid:2) P ( ψ ) (cid:3)(cid:17)(cid:16) βρ (cid:2) P ( ψ ) (cid:3)(cid:17)o + N ρ n P h log(1 − λψ ) io − K ( ρ, π ) − log n π h exp (cid:8) − N P (cid:2) log(1 + βψ ) (cid:3)(cid:9)io(cid:21) ≤ P ( exp " − N log n(cid:16) − λρ (cid:2) P ( ψ ) (cid:3)(cid:17)o + N ρ n P h log(1 − λψ ) io − K ( ρ, µ ) / × P ( exp " log n π h exp (cid:8) − N log (cid:2) βP ( ψ ) (cid:3)(cid:9)io − log n π h exp (cid:8) − N P (cid:2) log(1 + βψ ) (cid:3)(cid:9)io / ≤ . This implies that with P probability at least 1 − ǫ , − N log n(cid:16) − λρ (cid:2) P ( ψ ) (cid:3)(cid:17)(cid:16) βρ (cid:2) P ( ψ ) (cid:3)(cid:17)o ≤ − N ρ n P h log(1 − λψ ) io + K ( ρ, π ) + log n π h exp (cid:8) − N P (cid:2) log(1 + βψ ) (cid:3)(cid:9)io − ǫ ) . It is now convenient to remember that P h log(1 − λψ ) i = 12 log (cid:18) − λ λ (cid:19) r ′ ( θ, e θ ) + 12 log(1 − λ ) m ′ ( θ, e θ ) . We thus can write the previous inequality as − N log n(cid:16) − λρ (cid:2) R ′ ( · , e θ ) (cid:3)(cid:17)(cid:16) βρ (cid:2) R ′ ( · , e θ ) (cid:3)(cid:17)o ≤ N (cid:18) λ − λ (cid:19) ρ (cid:2) r ′ ( · , e θ ) (cid:3) − N − λ ) ρ (cid:2) m ′ ( · , e θ ) (cid:3) + K ( ρ, π ) .4. Relative bounds 47+ log (cid:26) π (cid:20) exp n − N (cid:16) β − β (cid:17) r ′ ( · , e θ ) − N − β ) m ′ ( · , e θ ) o(cid:21)(cid:27) − ǫ ) . Let us assume now that e θ ∈ arg min Θ R . Let us introduce b θ ∈ arg min Θ r . 
Decom-posing r ′ ( θ, e θ ) = r ′ ( θ, b θ ) + r ′ ( b θ, e θ ) and considering that m ′ ( θ, e θ ) ≤ m ′ ( θ, b θ ) + m ′ ( b θ, e θ ),we see that with P probability at least 1 − ǫ , for any posterior distribution ρ : Ω → M (Θ), − N log n(cid:16) − λρ (cid:2) R ′ ( · , e θ ) (cid:3)(cid:17)(cid:16) βρ (cid:2) R ′ ( · , e θ ) (cid:17)o ≤ N (cid:18) λ − λ (cid:19) ρ (cid:2) r ′ ( · , b θ ) (cid:3) − N − λ ) ρ (cid:2) m ′ ( · , b θ ) (cid:3) + K ( ρ, π )+ log (cid:26) π (cid:20) exp n − N log (cid:16) β − β (cid:17)(cid:2) r ′ ( · , b θ ) (cid:3) − N log(1 − β ) m ′ ( · , b θ ) o(cid:21)(cid:27) + N log h (1+ λ )(1 − β )(1 − λ )(1+ β ) i(cid:2) r ( b θ ) − r ( e θ ) (cid:3) − N log (cid:2) (1 − λ )(1 − β ) (cid:3) m ′ ( b θ , e θ ) − ǫ ) . Let us now define for simplicity the posterior ν : Ω → M (Θ) by the identity dνdπ ( θ ) = exp n − N log (cid:16) λ − λ (cid:17) r ′ ( θ, b θ ) + N log(1 − λ ) m ′ ( θ, b θ ) o π (cid:20) exp n − N log (cid:16) λ − λ (cid:17) r ′ ( · , b θ ) + N log(1 − λ ) m ′ ( · , b θ ) o(cid:21) . Let us also introduce the random bound B = 1 N log (cid:26) ν (cid:20) exp h N log h (1+ λ )(1 − β )(1 − λ )(1+ β ) i r ′ ( · , b θ ) − N log (cid:2) (1 − λ )(1 − β ) (cid:3) m ′ ( · , b θ ) i(cid:21)(cid:27) + sup θ ∈ Θ 12 log h (1 − λ )(1+ β )(1+ λ )(1 − β ) i r ′ ( θ, b θ ) − 12 log (cid:2) (1 − λ )(1 − β ) (cid:3) m ′ ( θ, b θ ) − N log( ǫ ) . Theorem 1.4.12 . Using the above notation, for any real constants ≤ β < λ < ,for any prior distribution π ∈ M (Θ) , for any subset Θ ⊂ Θ , with P probability atleast − ǫ , for any posterior distribution ρ : Ω → M (Θ) , − log n(cid:16) − λ (cid:2) ρ ( R ) − inf Θ R (cid:3)(cid:17)(cid:16) β (cid:2) ρ ( R ) − inf Θ R (cid:3)(cid:17)o ≤ K ( ρ, ν ) N + B. Therefore, Chapter 1. Inductive PAC-Bayesian learning ρ ( R ) − inf Θ R ≤ λ − β λβ s λβ ( λ − β ) (cid:20) − exp (cid:18) − B − K ( ρ, ν ) N (cid:19)(cid:21) − ! ≤ λ − β (cid:18) B + K ( ρ, ν ) N (cid:19) . Let us define the posterior b ν by the identity d b νdπ ( θ ) = exp h − N log (cid:16) β − β (cid:17) r ′ ( θ, b θ ) − N log(1 − β ) m ′ ( θ, b θ ) i π n exp h − N log (cid:16) β − β (cid:17) r ′ ( · , b θ ) − N log(1 − β ) m ′ ( · , b θ ) io . It is useful to remark that1 N log (cid:26) ν (cid:20) exp h N (cid:16) (1 + λ )(1 − β )(1 − λ )(1 + β ) (cid:17) r ′ ( · , b θ ) − N (cid:2) (1 − λ )(1 − β ) (cid:3) m ′ ( · , b θ ) i(cid:21)(cid:27) ≤ b ν (cid:26) 12 log (cid:16) (1 + λ )(1 − β )(1 − λ )(1 + β ) (cid:17) r ′ ( · , b θ ) − 12 log (cid:2) (1 − λ )(1 − β ) (cid:3) m ′ ( · , b θ ) (cid:27) . This inequality is a special case oflog n π (cid:2) exp( g ) (cid:3)o − log n π (cid:2) exp( h ) (cid:3)o = Z α =0 π exp[ h + α ( g − h )] ( g − h ) dα ≤ π exp( g ) ( g − h ) , which is a consequence of the convexity of α log n π h exp (cid:2) h + α ( g − h ) (cid:3)io .Let us introduce as previously ϕ ( x ) = sup θ ∈ Θ m ′ ( θ, b θ ) − x r ′ ( θ, b θ ), x ∈ R + . Letus moreover consider e ϕ ( x ) = sup θ ∈ Θ m ′ ( θ, b θ ) − x r ′ ( θ, b θ ), x ∈ R + . These functionscan be used to produce a result which is slightly weaker, but maybe easier to readand understand. 
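On a finite, or finitely discretized, parameter set the empirical margin functions are straightforward to evaluate, being suprema over the chosen index set of m′(θ, θ̂) − x r′(θ, θ̂). The sketch below is illustrative only: the sample and the one-parameter family of threshold rules are placeholders, m′ is again read as an empirical disagreement rate, and the index set over which the supremum is taken is left as an argument, so that either the full parameter set or a distinguished sub-model can be plugged in.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy sample and an invented one-parameter family of threshold rules
# f_theta(x) = 1{x > theta}; finite discretization of the parameter set.
N = 500
X = rng.uniform(-1.0, 1.0, size=N)
Y = (X + 0.3 * rng.standard_normal(N) > 0).astype(int)
thetas = np.linspace(-1.0, 1.0, 201)

def err(theta):
    return (Y != (X > theta).astype(int)).astype(int)

r = np.array([err(t).mean() for t in thetas])   # empirical risks r(theta)
i_hat = int(np.argmin(r))                       # index of hat{theta}, the empirical risk minimizer

# r'(theta, hat{theta}) and m'(theta, hat{theta}), the latter read as a disagreement rate.
r_prime = r - r[i_hat]
m_prime = np.array([((X > t) != (X > thetas[i_hat])).mean() for t in thetas])

def empirical_margin(x, subset=slice(None)):
    """phi_bar(x): supremum over the chosen index set of m' - x * r'."""
    return float(np.max(m_prime[subset] - x * r_prime[subset]))

for x in (0.5, 1.0, 2.0, 4.0):
    print(f"phi_bar({x}) = {empirical_margin(x):.4f}")
```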
Indeed, we see that, for any x ∈ R + , with P probability at least1 − ǫ , for any posterior distribution ρ , − N log n(cid:16) − λρ (cid:2) R ′ ( · , e θ ) (cid:3)(cid:17)(cid:16) βρ (cid:2) R ′ ( · , e θ ) (cid:3)(cid:17)o ≤ N (cid:20) (1 + λ )(1 − λ )(1 − λ ) x (cid:21) ρ (cid:2) r ′ ( · , b θ ) (cid:3) − N (cid:2) (1 − λ )(1 − β ) (cid:3) ϕ ( x ) + K ( ρ, π )+ log (cid:26) π (cid:20) exp n − N log h (1+ β )(1 − β )(1 − β ) x i r ′ ( · , b θ ) o(cid:21)(cid:27) − N (cid:2) (1 − λ )(1 − β ) (cid:3) e ϕ log h (1+ λ )(1 − β )(1 − λ )(1+ β ) i − log [(1 − λ )(1 − β )] − ǫ ) .4. Relative bounds Z N log (cid:2) (1+ λ )(1 − λ )(1 − λ x (cid:3) N log (cid:2) (1+ β )(1 − β )(1 − β x (cid:3) π exp( − αr ) (cid:2) r ′ ( · , b θ ) (cid:3) dα + K ( ρ, π exp {− N log[ (1+ λ )(1 − λ )(1 − λ x ] r } ) − ǫ ) − N (cid:2) (1 − λ )(1 − β ) (cid:3) ϕ ( x ) + e ϕ log h (1+ λ )(1 − β )(1 − λ )(1+ β ) i − log[(1 − λ )(1 − β )] . Theorem 1.4.13 . With the previous notation, for any real constants ≤ β < λ < , for any positive real constant x , for any prior probability distribution π ∈ M (Θ) ,for any subset Θ ⊂ Θ , with P probability at least − ǫ , for any posterior distribution ρ : Ω → M (Θ) , putting B ( ρ ) = 1 N ( λ − β ) Z N log (cid:2) (1+ λ )(1 − λ )(1 − λ x (cid:3) N log (cid:2) (1+ β )(1 − β )(1 − β x (cid:3) π exp( − αr ) (cid:2) r ′ ( · , b θ ) (cid:3) dα + K ( ρ, π exp {− N log[ (1+ λ )(1 − λ )(1 − λ x ] r } ) − ǫ ) N ( λ − β ) − λ − β ) log (cid:2) (1 − λ )(1 − β ) (cid:3) ϕ ( x ) + e ϕ log h (1+ λ )(1 − β )(1 − λ )(1+ β ) i − log[(1 − λ )(1 − β )] ≤ N ( λ − β ) d e log log h (1+ λ )(1 − λ )(1 − λ ) x i log (cid:16) (1+ β )(1 − β )(1 − β ) x (cid:17) + K ( ρ, π exp {− N log[ (1+ λ )(1 − λ )(1 − λ x ] r } ) − ǫ ) N ( λ − β ) − λ − β ) log (cid:2) (1 − λ )(1 − β ) (cid:3) ϕ ( x ) + e ϕ log h (1+ λ )(1 − β )(1 − λ )(1+ β ) i − log[(1 − λ )(1 − β )] , the following bounds hold true: ρ ( R ) − inf Θ R ≤ λ − β λβ s λβ ( λ − β ) n − exp (cid:2) − ( λ − β ) B ( ρ ) (cid:3)o − ! ≤ B ( ρ ) . Let us remark that this alternative way of handling relative deviation boundsmade it possible to carry on with non-linear bounds up to the final result. Forinstance, if λ = 0 . β = 0 . B ( ρ ) = 0 . 1, the non-linear bound gives ρ ( R ) − inf Θ R ≤ . Chapter 1. Inductive PAC-Bayesian learning hapter 2 Comparing posteriordistributions to Gibbs priors We now come to an approach to relative bounds whose performance can be analysedwith PAC-Bayesian tools.The empirical bounds at the end of the previous chapter involve taking supremain θ ∈ Θ, and replacing the expected margin function ϕ with some empirical coun-terparts ϕ or e ϕ , which may prove unsafe when using very complex classificationmodels.We are now going to focus on the control of the divergence K (cid:2) ρ, π exp( − βR ) (cid:3) . Itis already obvious, we hope, that controlling this divergence is the crux of thematter, and that it is a way to upper bound the mutual information betweenthe training sample and the parameter, which can be expressed as K (cid:2) ρ, P ( ρ ) (cid:3) = K (cid:2) ρ, π exp( − βR ) (cid:3) − K (cid:2) P ( ρ ) , π exp( − βR ) (cid:3) , as explained on page 14.Through the identity(2.1) K (cid:2) ρ, π exp( − βR ) (cid:3) = β (cid:2) ρ ( R ) − π exp( − βR ) ( R ) (cid:3) + K ( ρ, π ) − K (cid:2) π exp( − βR ) , π (cid:3) , we see that the control of this divergence is related to the control of the difference ρ ( R ) − π exp( − βR ) ( R ). 
This is the route we will follow first.Thus comparing any posterior distribution with a Gibbs prior distribution willprovide a first way to build an estimator which can be proved to reach adaptivelythe best possible asymptotic error rate under Mammen and Tsybakov margin as-sumptions and parametric complexity assumptions (at least as long as orders ofmagnitude are concerned, we will not discuss the question of asymptotically opti-mal constants).Then we will provide an empirical bound for the Kullback divergence K (cid:2) ρ,π exp( − βR ) (cid:3) itself. This will serve to address the question of model selection, whichwill be achieved by comparing the performance of two posterior distributions possi-bly supported by two different models. This will also provide a second way to buildestimators which can be proved to be adaptive under Mammen and Tsybakov mar-gin assumptions and parametric complexity assumptions (somewhat weaker thanwith the first method). 512 Chapter 2. Comparing posterior distributions to Gibbs priors Finally, we will present two-step localization strategies, in which the performanceof the posterior distribution to be analysed is compared with a two-step Gibbs prior. Similarly to Theorem 1.4.3 (page 37) we can prove that for any prior distribution e π ∈ M (Θ),(2.2) P (e π ⊗ e π (cid:26) exp (cid:20) − N log(1 − N tanh (cid:0) γN (cid:1) R ′ ) − γr ′ − N log (cid:2) cosh( γN ) (cid:3) m ′ (cid:21)(cid:27)) ≤ . Replacing e π with π exp( − βR ) and considering the posterior distribution ρ ⊗ π exp( − βR ) ,provides a starting point in the comparison of ρ with π exp( − βR ) ; we can indeed statewith P probability at least 1 − ǫ that(2.3) − N log n − tanh (cid:0) γN (cid:1)h ρ ( R ) − π exp( − βR ) ( R ) io ≤ γ (cid:2) ρ ( r ) − π exp( − βR ) ( r ) (cid:3) + N log (cid:2) cosh( γN ) (cid:3)(cid:2) ρ ⊗ π exp( − βR ) (cid:3) ( m ′ )+ K (cid:2) ρ, π exp( − βR ) (cid:3) − log( ǫ ) . Using equation (2.1, page 51) to handle the entropy term, we get(2.4) − N log n − tanh( γN ) h ρ ( R ) − π exp( − βR ) ( R ) io − β (cid:2) ρ ( R ) − π exp( − βR ) ( R ) (cid:3) ≤ γ (cid:2) ρ ( r ) − π exp( − βR ) ( r ) (cid:3) + N log (cid:2) cosh (cid:0) γN (cid:1)(cid:3) ρ ⊗ π exp( − βR ) ( m ′ )+ K ( ρ, π ) − K (cid:2) π exp( − βR ) , π (cid:3) − log( ǫ ) . We can then decompose in the right-hand side γ (cid:2) ρ ( r ) − π exp( − βR ) ( r ) (cid:3) into ( γ − λ ) (cid:2) ρ ( r ) − π exp( − βR ) ( r ) (cid:3) + λ (cid:2) ρ ( r ) − π exp( − βR ) ( r ) (cid:3) for some parameter λ to be setlater on and use the fact that λ (cid:2) ρ ( r ) − π exp( − βR ) ( r ) (cid:3) + N log (cid:2) cosh( γN ) (cid:3) ρ ⊗ π exp( − βR ) ( m ′ )+ K ( ρ, π ) − K (cid:2) π exp( − βR ) , π (cid:3) ≤ λρ ( r ) + K ( ρ, π ) + log n π h exp (cid:8) − λr + N log (cid:2) cosh( γN ) (cid:3) ρ ( m ′ ) (cid:9)io = K (cid:2) ρ, π exp( − λr ) (cid:3) + log n π exp( − λr ) h exp (cid:8) N log (cid:2) cosh( γN ) (cid:3) ρ ( m ′ ) (cid:9)io , to get rid of the appearance of the unobserved Gibbs prior π exp( − βR ) in most placesof the right-hand side of our inequality, leading to Theorem 2.1.1 . For any real constants β and γ , with P probability at least − ǫ ,for any posterior distribution ρ : Ω → M (Θ) , for any real constant λ , (cid:2) N tanh( γN ) − β (cid:3)(cid:2) ρ ( R ) − π exp( − βR ) ( R ) (cid:3) ≤ − N log n − tanh( γN ) h ρ ( R ) − π exp( − βR ) ( R ) io .1. 
Bounds relative to a Gibbs distribution − β (cid:2) ρ ( R ) − π exp( − βR ) ( R ) (cid:3) ≤ ( γ − λ ) (cid:2) ρ ( r ) − π exp( − βR ) ( r ) (cid:3) + K (cid:2) ρ, π exp( − λr ) (cid:3) + log n π exp( − λr ) h exp (cid:8) N log (cid:2) cosh( γN ) (cid:3) ρ ( m ′ ) (cid:9)io − log( ǫ )= K (cid:2) ρ, π exp( − γr ) (cid:3) + log n π exp( − γr ) h exp (cid:8) ( γ − λ ) r + N log (cid:2) cosh( γN ) (cid:3) ρ ( m ′ ) (cid:9)io − ( γ − λ ) π exp( − βR ) ( r ) − log( ǫ ) . We would like to have a fully empirical upper bound even in the case when λ = γ .This can be done by using the theorem twice. We will need a lemma. Lemma 2.1.2 For any probability distribution π ∈ M (Θ) , for any bounded mea-surable functions g, h : Θ → R , π exp( − g ) ( g ) − π exp( − h ) ( g ) ≤ π exp( − g ) ( h ) − π exp( − h ) ( h ) . Proof. Let us notice that0 ≤ K ( π exp( − g ) , π exp( − h ) ) = π exp( − g ) ( h ) + log (cid:8) π (cid:2) exp( − h ) (cid:3)(cid:9) + K ( π exp( − g ) , π )= π exp( − g ) ( h ) − π exp( − h ) ( h ) − K ( π exp( − h ) , π ) + K ( π exp( − g ) , π )= π exp( − g ) ( h ) − π exp( − h ) ( h ) − K ( π exp( − h ) , π ) − π exp( − g ) ( g ) − log (cid:8) π (cid:2) exp( − g ) (cid:3)(cid:9) . Moreover − log (cid:8) π (cid:2) exp( − g ) (cid:3)(cid:9) ≤ π exp( − h ) ( g ) + K ( π exp( − h ) , π ) , which ends the proof. (cid:3) For any positive real constants β and λ , we can then apply Theorem 2.1.1 to ρ = π exp( − λr ) , and use the inequality(2.5) λβ (cid:2) π exp( − λr ) ( r ) − π exp( − βR ) ( r ) (cid:3) ≤ π exp( − λr ) ( R ) − π exp( − βR ) ( R )provided by the previous lemma. We thus obtain with P probability at least 1 − ǫ − N log n − tanh( γN ) λβ h π exp( − λr ) ( r ) − π exp( − βR ) ( r ) io − γ (cid:2) π exp( − λr ) ( r ) − π exp( − βR ) ( r ) (cid:3) ≤ log n π exp( − λr ) h exp (cid:8) N log (cid:2) cosh( γN ) (cid:3) π exp( − λr ) ( m ′ ) (cid:9)io − log( ǫ ) . Let us introduce the convex function F γ,α ( x ) = − N log (cid:2) − tanh( γN ) x (cid:3) − αx ≥ (cid:2) N tanh( γN ) − α (cid:3) x. With P probability at least 1 − ǫ , − π exp( − βR ) ( r ) ≤ inf λ ∈ R ∗ + (cid:26) − π exp( − λr ) ( r )+ βλ F − γ, βγλ (cid:20) log n π exp( − λr ) h exp (cid:8) N log (cid:2) cosh( γN ) (cid:3) π exp( − λr ) ( m ′ ) (cid:9)io Chapter 2. Comparing posterior distributions to Gibbs priors − log( ǫ ) (cid:21)(cid:27) . Since Theorem 2.1.1 holds uniformly for any posterior distribution ρ , we can applyit again to some arbitrary posterior distribution ρ . We can moreover make the resultuniform in β and γ by considering some atomic measure ν ∈ M ( R ) on the realline and using a union bound. This leads to Theorem 2.1.3 . 
For any atomic probability distribution on the positive real line ν ∈ M ( R + ) , with P probability at least − ǫ , for any posterior distribution ρ :Ω → M (Θ) , for any positive real constants β and γ , (cid:2) N tanh( γN ) − β (cid:3)(cid:2) ρ ( R ) − π exp( − βR ) ( R ) (cid:3) ≤ F γ,β (cid:2) ρ ( R ) − π exp( − βR ) ( R ) (cid:3) ≤ B ( ρ, β, γ ) , where B ( ρ, β, γ ) = inf λ ∈ R + ,λ ≤ γλ ∈ R ,λ > βγN tanh( γN ) − ( K (cid:2) ρ, π exp( − λ r ) (cid:3) + ( γ − λ ) (cid:2) ρ ( r ) − π exp( − λ r ) ( r ) (cid:3) + log n π exp( − λ r ) h exp (cid:8) N log (cid:2) cosh( γN ) (cid:3) ρ ( m ′ ) (cid:9)io − log (cid:2) ǫν ( β ) ν ( γ ) (cid:3) + ( γ − λ ) βλ F − γ, βγλ (cid:20) log n π exp( − λ r ) h exp (cid:8) N log (cid:2) cosh( γN ) (cid:3) π exp( − λ r ) ( m ′ ) (cid:9)io − log (cid:2) ǫν ( β ) ν ( γ ) (cid:3)(cid:21)) ≤ inf λ ∈ R + ,λ ≤ γλ ∈ R ,λ > βγN tanh( γN ) − ( K (cid:2) ρ, π exp( − λ r ) (cid:3) + ( γ − λ ) (cid:2) ρ ( r ) − π exp( − λ r ) ( r ) (cid:3) + log n π exp( − λ r ) h exp (cid:8) N log (cid:2) cosh( γN ) (cid:3) ρ ( m ′ ) (cid:9)io + βλ (1 − λ γ ) (cid:2) Nγ tanh( γN ) − βλ (cid:3) log n π exp( − λ r ) h exp (cid:8) N log (cid:2) cosh( γN ) (cid:3) π exp( − λ r ) ( m ′ ) (cid:9)io − n βλ − λ γ )[ Nγ tanh( γN ) − βλ ] o log (cid:2) ǫν ( β ) ν ( γ ) (cid:3)) , where we have written for short ν ( β ) and ν ( γ ) instead of ν ( { β } ) and ν ( { γ } ) . Let us notice that B ( ρ, β, γ ) = + ∞ when ν ( β ) = 0 or ν ( γ ) = 0, the uniformityin β and γ of the theorem therefore necessarily bears on a countable number ofvalues of these parameters. We can typically choose distributions for ν such as theone used in Theorem 1.2.8 (page 13): namely we can put for some positive real ratio α > ν ( α k ) = 1( k + 1)( k + 2) , k ∈ N , .1. Bounds relative to a Gibbs distribution N , wecan prefer ν ( α k ) = log( α )log( αN ) , ≤ k < log( N )log( α ) . We can also use such a coding distribution on dyadic numbers as the one definedby equation (1.7, page 15).Following the same route as for Theorem 1.3.15 (page 31), we can also prove thefollowing result about the deviations under any posterior distribution ρ : Theorem 2.1.4 For any ǫ ∈ )0 , , with P probability at least − ǫ , for any posteriordistribution ρ : Ω → M (Θ) , with ρ probability at least − ξ , F γ,β (cid:2) R ( b θ ) − π exp( − βR ) ( R ) (cid:3) ≤ inf λ ∈ R + ,λ ≤ γ,λ ∈ R ,λ > βγN tanh( γN ) − ( log " dρdπ exp( − λ r ) ( b θ ) + ( γ − λ ) (cid:2) r ( b θ ) − π exp( − λ r ) ( r ) (cid:3) + log n π exp( − λ r ) h exp (cid:8) N log (cid:2) cosh( γN ) (cid:3) m ′ ( · , b θ ) (cid:9)io − log (cid:2) ǫξν ( β ) ν ( γ ) (cid:3) + ( γ − λ ) βλ F − γ, βγλ (cid:20) log n π exp( − λ r ) h exp (cid:8) N log (cid:2) cosh( γN ) (cid:3) π exp( − λ r ) ( m ′ ) (cid:9)io − log (cid:2) ǫν ( β ) ν ( γ ) (cid:3)(cid:21)) . The only tricky point is to justify that we can still take an infimum in λ withoutusing a union bound. To justify this, we have to notice that the following variant ofTheorem 2.1.1 (page 52) holds: with P probability at least 1 − ǫ , for any posteriordistribution ρ : Ω → M (Θ), for any real constant λ , ρ n F γ,β (cid:2) R − π exp( − βR ) ( R ) (cid:3)o ≤ K (cid:2) ρ, π exp( − γr ) (cid:3) + ρ (cid:20) inf λ ∈ R log n π exp( − γr ) h exp (cid:8) ( γ − λ ) r + N log (cid:2) cosh( γN (cid:1)(cid:3) m ′ ( · , b θ ) (cid:9)io − ( γ − λ ) π exp( − βR ) ( r ) (cid:21) − log( ǫ ) . We leave the details as an exercise. 
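The atomic distributions ν that make the previous two results uniform in β and γ are simple to set up in practice. Here is a minimal sketch using the coding distribution ν(α^k) = 1/[(k+1)(k+2)] mentioned above, whose masses sum to one over k ∈ N, together with the additive penalty −log[ν({β}) ν({γ}) ǫ] that such a choice contributes to the bound; the numerical values of α, β, γ and ǫ are arbitrary placeholders.

```python
import math

def nu_geometric(alpha, k_max=60):
    """Atomic distribution on the geometric grid {alpha**k : k in N} with
    nu(alpha**k) = 1/((k+1)(k+2)); the masses telescope to 1 as k_max -> infinity."""
    return {alpha ** k: 1.0 / ((k + 1) * (k + 2)) for k in range(k_max)}

alpha, eps = 2.0, 0.05
nu = nu_geometric(alpha)
print("total mass kept on the truncated grid:", sum(nu.values()))

# Additive union-bound penalty -log[nu({beta}) nu({gamma}) eps] for two grid points.
beta, gamma = alpha ** 3, alpha ** 7
penalty = -math.log(nu[beta] * nu[gamma] * eps)
print(f"beta = {beta}, gamma = {gamma}, penalty = {penalty:.3f}")
```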
Using the parametric approximation π exp( − αr ) ( r ) − inf Θ r ≃ d e α , we get as an orderof magnitude B ( π exp( − λ r ) , β, γ ) . − ( γ − λ ) d e (cid:2) λ − − λ − (cid:3) Chapter 2. Comparing posterior distributions to Gibbs priors + 2 d e log λ λ − N log (cid:2) cosh( γN ) (cid:3) x + 2 βλ (1 − λ γ ) (cid:2) Nγ tanh( γN ) − βλ (cid:3) d e log λ λ − N log (cid:2) cosh( γN ) (cid:3) x ! + 2 N log (cid:2) cosh( γN ) (cid:3)(cid:20) βλ (1 − λ γ ) (cid:2) Nγ tanh( γN ) − βλ (cid:3) (cid:21) e ϕ ( x ) − n βλ (1 − λ γ )[ Nγ tanh( γN ) − βλ ] o log (cid:2) ν ( β ) ν ( γ ) ǫ (cid:3) . Therefore, if the empirical dimension d e stays bounded when N increases, we aregoing to obtain a negative upper bound for any values of the constants λ > λ > β ,as soon as γ and Nγ are chosen to be large enough. This ability to obtain negativevalues for the bound B ( π exp( − λ r ) , γ, β ), and more generally B ( ρ, γ, β ), leads theway to introducing the new concept of the effective temperature of an estimator. Definition 2.1.1 For any posterior distribution ρ : Ω → M (Θ) we define the effective temperature T ( ρ ) ∈ R ∪ {−∞ , + ∞} of ρ by the equation ρ ( R ) = π exp( − RT ( ρ ) ) ( R ) . Note that β π exp( − βR ) ( R ) : R ∪ {−∞ , + ∞} → (0 , 1) is continuous and strictlydecreasing from ess sup π R to ess inf π R (as soon as these two bounds do not co-incide). This shows that the effective temperature T ( ρ ) is a well-defined randomvariable.Theorem 2.1.3 provides a bound for T ( ρ ), indeed: Proposition 2.1.5 . Let b β ( ρ ) = sup (cid:8) β ∈ R ; inf γ,N tanh( γN ) >β B ( ρ, β, γ ) ≤ (cid:9) , where B ( ρ, β, γ ) is as in Theorem 2.1.3 (page 54). Then with P probability at least − ǫ , for any posterior distribution ρ : Ω → M (Θ) , T ( ρ ) ≤ b β ( ρ ) − , or equivalently ρ ( R ) ≤ π exp[ − b β ( ρ ) R ] ( R ) . This notion of effective temperature of a (randomized) estimator ρ is interestingfor two reasons: • the difference ρ ( R ) − π exp( − βR ) ( R ) can be estimated with better accuracythan ρ ( R ) itself, due to the use of relative deviation inequalities, leading toconvergence rates up to 1 /N in favourable situations, even when inf Θ R is notclose to zero; • and of course π exp( − βR ) ( R ) is a decreasing function of β , thus being able toestimate ρ ( R ) − π exp( − βR ) ( R ) with some given accuracy, means being ableto discriminate between values of ρ ( R ) with the same accuracy, althoughdoing so through the parametrization β π exp( − βR ) ( R ), which can neitherbe observed nor estimated with the same precision! .1. Bounds relative to a Gibbs distribution We are now going to launch into a mathematically rigorous analysis of the bound B ( π exp( − λ r ) ,β,γ ) provided by Theorem 2.1.3 (page 54), to show that inf ρ ∈ M (Θ) π exp[ − b β ( ρ ) R ] ( R ) converges indeed to inf Θ R at some optimal rate in favourable sit-uations.It is more convenient for this purpose to use deviation inequalities involving M ′ rather than m ′ . It is straightforward to extend Theorem 1.4.2 (page 36) to Theorem 2.1.6 . For any real constants β and γ , for any prior distributions π, µ ∈ M (Θ) , with P probability at least − η , for any posterior distribution ρ : Ω → M (Θ) , γρ ⊗ π exp( − βR ) (cid:2) Ψ γN ( R ′ , M ′ ) (cid:3) ≤ γρ ⊗ π exp( − βR ) ( r ′ ) + K ( ρ, µ ) − log( η ) . 
In order to transform the left-hand side into a linear expression and in the sametime localize this theorem, let us choose µ defined by its density dµdπ ( θ ) = C − exp (cid:20) − βR ( θ ) − γ Z Θ n Ψ γN (cid:2) R ′ ( θ , θ ) , M ′ ( θ , θ ) (cid:3) − Nγ sinh( γN ) R ′ ( θ , θ ) o π exp( − βR ) ( dθ ) (cid:21) , where C is such that µ (Θ) = 1. We get K ( ρ, µ ) = βρ ( R ) + γρ ⊗ π exp( − βR ) (cid:2) Ψ γN ( R ′ , M ′ ) − Nγ sinh( γN ) R ′ (cid:3) + K ( ρ, π )+ log (cid:26)Z Θ exp (cid:20) − βR ( θ ) − γ Z Θ n Ψ γN (cid:2) R ′ ( θ , θ ) , M ′ ( θ , θ ) (cid:3) − Nγ sinh( γN ) R ′ ( θ , θ ) o π exp( − βR ) ( dθ ) (cid:21) π ( dθ ) (cid:27) = β (cid:2) ρ ( R ) − π exp( − βR ) ( R ) (cid:3) + γρ ⊗ π exp( − βR ) (cid:2) Ψ γN ( R ′ , M ′ ) − Nγ sinh( γN ) R ′ (cid:3) + K ( ρ, π ) − K ( π exp( − βR ) , π )+ log (cid:26)Z Θ exp (cid:20) − γ Z Θ n Ψ γN (cid:2) R ′ ( θ , θ ) , M ′ ( θ , θ ) (cid:3) − Nγ sinh( γN ) R ′ ( θ , θ ) o π exp( − βR ) ( dθ ) (cid:21) π exp( − βR ) ( dθ ) (cid:27) . Thus with P probability at least 1 − η ,(2.6) (cid:2) N sinh( γN ) − β (cid:3)(cid:2) ρ ( R ) − π exp( − βR ) ( R ) (cid:3) ≤ γ (cid:2) ρ ( r ) − π exp( − βR ) ( r ) (cid:3) + K ( ρ, π ) − K ( π exp( − βR ) , π ) − log( η ) + C ( β, γ )where C ( β, γ ) = log (cid:26)Z Θ exp (cid:20) − γ Z Θ n Ψ γN (cid:2) R ′ ( θ , θ ) , M ′ ( θ , θ ) (cid:3) Chapter 2. Comparing posterior distributions to Gibbs priors − Nγ sinh( γN ) R ′ ( θ , θ ) o π exp( − βR ) ( dθ ) (cid:21) π exp( − βR ) ( dθ ) (cid:27) . Remarking that K (cid:2) ρ, π exp( − βR ) (cid:3) = β (cid:2) ρ ( R ) − π exp( − βR ) ( R ) (cid:3) + K ( ρ, π ) − K ( π exp( − βR ) , π ) , we deduce from the previous inequality Theorem 2.1.7 . For any real constants β and γ , with P probability at least − η ,for any posterior distribution ρ : Ω → M (Θ) , N sinh( γN ) (cid:2) ρ ( R ) − π exp( − βR ) ( R ) (cid:3) ≤ γ (cid:2) ρ ( r ) − π exp( − βR ) ( r ) (cid:3) + K (cid:2) ρ, π exp( − βR ) (cid:3) − log( η ) + C ( β, γ ) . We can also go into a slightly different direction, starting back again from equa-tion (2.6, page 57) and remarking that for any real constant λ , λ (cid:2) ρ ( r ) − π exp( − βR ) ( r ) (cid:3) + K ( ρ, π ) − K ( π exp( − βR ) , π ) ≤ λρ ( r ) + K ( ρ, π ) + log (cid:8) π (cid:2) exp( − λr ) (cid:3)(cid:9) = K (cid:2) ρ, π exp( − λr ) (cid:3) . This leads to Theorem 2.1.8 . For any real constants β and γ , with P probability at least − η ,for any real constant λ , (cid:2) N sinh( γN ) − β (cid:3)(cid:2) ρ ( R ) − π exp( − βR ) ( R ) (cid:3) ≤ ( γ − λ ) (cid:2) ρ ( r ) − π exp( − βR ) ( r ) (cid:3) + K (cid:2) ρ, π exp( − λr ) (cid:3) − log( η ) + C ( β, γ ) , where the definition of C ( β, γ ) is given by equation (2.6, page 57). We can now use this inequality in the case when ρ = π exp( − λr ) and combine itwith Inequality (2.5, page 53) to obtain Theorem 2.1.9 For any real constants β and γ , with P probability at least − η ,for any real constant λ , (cid:2) Nλβ sinh( γN ) − γ (cid:3)(cid:2) π exp( − λr ) ( r ) − π exp( − βR ) ( r ) (cid:3) ≤ C ( β, γ ) − log( η ) . We deduce from this theorem Proposition 2.1.10 For any real positive constants β , β and γ , with P probabil-ity at least − η , for any real constants λ and λ , such that λ < β γN sinh( γN ) − and λ > β γN sinh( γN ) − , π exp( − λ r ) ( r ) − π exp( − λ r ) ( r ) ≤ π exp( − β R ) ( r ) − π exp( − β R ) ( r )+ C ( β , γ ) + log(2 /η ) Nλ β sinh( γN ) − γ + C ( β , γ ) + log(2 /η ) γ − Nλ β sinh( γN ) . .1. 
Bounds relative to a Gibbs distribution π exp( − β R ) and π exp( − β R ) being prior distributions, with P probabilityat least 1 − η , γ (cid:2) π exp( − β R ) ( r ) − π exp( − β R ) ( r ) (cid:3) ≤ γπ exp( − β R ) ⊗ π exp( − β R ) (cid:2) Ψ − γN ( R ′ , M ′ ) (cid:3) − log( η ) . Hence Proposition 2.1.11 For any positive real constants β , β and γ , with P prob-ability at least − η , for any positive real constants λ and λ such that λ <β γN sinh( γN ) − and λ > β γN sinh( γN ) − , π exp( − λ r ) ( r ) − π exp( − λ r ) ( r ) ≤ π exp( − β R ) ⊗ π exp( − β R ) (cid:2) Ψ − γN ( R ′ , M ′ ) (cid:3) + log( η ) γ + C ( β , γ ) + log( η ) Nλ β sinh( γN ) − γ + C ( β , γ ) + log( η ) γ − Nλ β sinh( γN ) . In order to achieve the analysis of the bound B ( π exp( − λ r ) , β, γ ) given by Theo-rem 2.1.3 (page 54), it now remains to bound quantities of the general formlog n π exp( − λr ) h exp (cid:8) N log (cid:2) cosh( γN ) (cid:3) π exp( − λr ) ( m ′ ) (cid:9)io = sup ρ ∈ M (Θ) N log (cid:2) cosh( γN ) (cid:3) ρ ⊗ π exp( − λ ) ( m ′ ) − K (cid:2) ρ, π exp( − λr ) (cid:3) . Let us consider the prior distribution µ ∈ M (Θ × Θ) on couples of parametersdefined by the density dµd ( π ⊗ π ) ( θ , θ ) = C − exp n − βR ( θ ) − βR ( θ ) + α Φ − αN (cid:2) M ′ ( θ , θ ) (cid:3)o , where the normalizing constant C is such that µ (Θ × Θ) = 1. Since for fixed values ofthe parameters θ and θ ′ ∈ Θ, m ′ ( θ, θ ′ ), like r ( θ ), is a sum of independent Bernoullirandom variables, we can easily adapt the proof of Theorem 1.1.4 on page 4, toestablish that with P probability at least 1 − η , for any posterior distribution ρ andany real constant λ , αρ ⊗ π exp( − λr ) ( m ′ ) ≤ αρ ⊗ π exp( − λr ) (cid:2) Φ − αN ( M ′ ) (cid:3) + K ( ρ ⊗ π exp( − λr ) , µ ) − log( η )= K (cid:2) ρ, π exp( − βR ) (cid:3) + K (cid:2) π exp( − λr ) , π exp( − βR ) (cid:3) + log n π exp( − βR ) ⊗ π exp( − βR ) h exp (cid:0) α Φ − αN ◦ M ′ (cid:1)io − log( η ) . Thus for any real constant β and any positive real constants α and γ , with P probability at least 1 − η , for any real constant λ ,(2.7) log n π exp( − λr ) h exp (cid:8) N log (cid:2) cosh( γN ) (cid:3) π exp( − λr ) ( m ′ ) (cid:9)io ≤ sup ρ ∈ M (Θ) (cid:18) Nα log (cid:2) cosh( γN ) (cid:3)n K (cid:2) ρ, π exp( − βR ) (cid:3) + K (cid:2) π exp( − λr ) , π exp( − βR ) (cid:3) Chapter 2. Comparing posterior distributions to Gibbs priors + log (cid:8) π exp( − βR ) ⊗ π exp( − βR ) (cid:2) exp( α Φ − αN ◦ M ′ ) (cid:3)(cid:9) − log( η ) o − K (cid:2) ρ, π exp( − λr ) (cid:3)(cid:19) . To finish, we need some appropriate upper bound for the entropy K (cid:2) ρ, π exp( − βR ) (cid:3) . This question can be handled in the following way: using The-orem 2.1.7 (page 58), we see that for any positive real constants γ and β , with P probability at least 1 − η , for any posterior distribution ρ , K (cid:2) ρ, π exp( − βR ) (cid:3) = β (cid:2) ρ ( R ) − π exp( − βR ) ( R ) (cid:3) + K ( ρ, π ) − K ( π exp( − βR ) , π ) ≤ βN sinh( γN ) (cid:20) γ (cid:2) ρ ( r ) − π exp( − βR ) ( r ) (cid:3) + K (cid:2) ρ, π exp( − βR ) (cid:3) − log( η ) + C ( β, γ ) (cid:21) + K ( ρ, π ) − K ( π exp( − βR ) , π ) ≤ K (cid:2) ρ, π exp( − βγN sinh( γN ) r ) (cid:3) + βN sinh( γN ) n K (cid:2) ρ, π exp( − βR ) (cid:3) + C ( β, γ ) − log( η ) o . In other words, Theorem 2.1.12 . 
For any positive real constants β and γ such that β < N × sinh( γN ) , with P probability at least − η , for any posterior distribution ρ : Ω → M (Θ) , K (cid:2) ρ, π exp( − βR ) (cid:3) ≤ K (cid:2) ρ, π exp[ − β γN sinh( γN ) − r ] (cid:3) − βN sinh( γN ) + C ( β, γ ) − log( η ) N sinh( γN ) β − , where the quantity C ( β, γ ) is defined by equation (2.6, page 57). Equivalently, it willbe in some cases more convenient to use this result in the form: for any positive realconstants λ and γ , with P probability at least − η , for any posterior distribution ρ : Ω → M (Θ) , K (cid:2) ρ, π exp[ − λ Nγ sinh( γN ) R ] (cid:3) ≤ K (cid:2) ρ, π exp( − λr ) (cid:3) − λγ + C ( λ Nγ sinh( γN ) , γ ) − log( η ) λβ − . Choosing in equation (2.7, page 59) α = N log (cid:2) cosh( γN ) (cid:3) − βN sinh( γN ) and β = λ Nγ sinh( γN ),so that α = N log (cid:2) cosh( γN ) (cid:3) − λγ , we obtain with P probability at least 1 − η ,log n π exp( − λr ) h exp (cid:8) N log (cid:2) cosh( γN ) (cid:3) π exp( − λr ) ( m ′ ) (cid:9)io ≤ λγ (cid:2) C ( β, γ ) + log( η ) (cid:3) + (cid:16) − λγ (cid:17)(cid:20) log n π exp( − βR ) ⊗ π exp( − βR ) (cid:2) exp( α Φ − αN ◦ M ′ ) (cid:3)o .1. Bounds relative to a Gibbs distribution 61+ log( η ) (cid:21) . This proves Proposition 2.1.13 . For any positive real constants λ < γ , with P probability atleast − η , log n π exp( − λr ) h exp (cid:8) N log (cid:2) cosh( γN ) (cid:3) π exp( − λr ) ( m ′ ) (cid:9)io ≤ λγ (cid:2) C ( Nλγ sinh( γN ) , γ ) + log( η ) (cid:3) + (cid:16) − λγ (cid:17) log (cid:26) π ⊗ − Nλγ sinh( γN ) R ] (cid:20) exp (cid:18) N log[cosh( γN )]1 − λγ Φ − log[cosh( γN )]1 − λγ ◦ M ′ (cid:19)(cid:21)(cid:27) + (cid:16) − λγ (cid:17) log( η ) . We are now ready to analyse the bound B ( π exp( − λ r ) , β, γ ) of Theorem 2.1.3(page 54). Theorem 2.1.14 . For any positive real constants λ , λ , β , β , β and γ , suchthat λ < γ, β < Nλ γ sinh( γN ) ,λ < γ, β > Nλ γ sinh( γN ) ,β < Nλ γ tanh( γN ) , with P probability − η , the bound B ( π exp( − λ r ) , β, γ ) of Theorem 2.1.3 (page 54)satisfies B ( π exp( − λ r ) , β, γ ) ≤ ( γ − λ ) ( π exp( − β R ) ⊗ π exp( − β R ) (cid:2) Ψ − γN ( R ′ , M ′ ) (cid:3) + log( η ) γ + C ( β , γ ) + log( η ) Nλ β sinh( γN ) − γ + C ( β , γ ) + log( η ) γ − Nλ β sinh( γN ) ) + 2 λ γ h C (cid:0) Nλ γ sinh( γN ) , γ (cid:1) + log( η ) i + (cid:16) − λ γ (cid:17) log (cid:26) π ⊗ − Nλ γ sinh( γN ) R ] (cid:20) exp (cid:18) N log[cosh( γN )]1 − λ γ Φ − log[cosh( γN )]1 − λ γ ◦ M ′ (cid:19)(cid:21)(cid:27) + (cid:16) − λ γ (cid:17) log( η ) − log (cid:2) ν ( { β } ) ν ( { γ } ) ǫ (cid:3) + ( γ − λ ) βλ F − γ, βγλ ( λ γ h C (cid:0) Nλ γ sinh( γN ) , γ (cid:1) + log (cid:0) η (cid:1)i + (cid:16) − λ γ (cid:17) log (cid:26) π ⊗ − Nλ γ sinh( γN ) R ] (cid:20) Chapter 2. Comparing posterior distributions to Gibbs priors exp (cid:18) N log[cosh( γN )]1 − λ γ Φ − log[cosh( γN )]1 − λ γ ◦ M ′ (cid:19)(cid:21)(cid:27) + (cid:16) − λ γ (cid:17) log (cid:0) η (cid:1) − log (cid:2) ν ( { β } ) ν ( { γ } ) ǫ (cid:3)) , where the function C ( β, γ ) is defined by equation (2.6, page 57). 
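All the bounds of this section are assembled from a small number of computational primitives: Gibbs posteriors π_exp(−λr) with respect to a prior π, divergences K(ρ, π_exp(−λr)), and log-Laplace transforms of the form log{π_exp(−λr)[exp(h)]}. On a finite parameter set these can be evaluated stably with the log-sum-exp device, as in the following sketch; the prior, the empirical risks and the function h are arbitrary placeholders rather than the precise quantities appearing in Theorem 2.1.14.

```python
import numpy as np
from scipy.special import logsumexp

def gibbs_log_weights(log_prior, lam, r):
    """Log-weights of the Gibbs posterior pi_{exp(-lam r)} on a finite parameter grid."""
    lw = log_prior - lam * r
    return lw - logsumexp(lw)

def kl(log_rho, log_pi):
    """K(rho, pi) for two distributions given by their log-weights on the same grid."""
    rho = np.exp(log_rho)
    return float(np.sum(rho * (log_rho - log_pi)))

def log_laplace(log_rho, h):
    """log of rho[exp(h)], computed stably."""
    return float(logsumexp(log_rho + h))

# Illustrative finite model: M parameter values, a flat prior, and made-up
# empirical risks r(theta); in practice r would be computed from the sample.
rng = np.random.default_rng(2)
M = 200
log_prior = np.full(M, -np.log(M))
r = np.sort(rng.uniform(0.2, 0.5, size=M))

lam = 50.0
log_post = gibbs_log_weights(log_prior, lam, r)   # pi_{exp(-lam r)}
h = 0.8 * np.linspace(0.0, 0.3, M)                # placeholder for a term like c * rho(m')

print("pi_exp(-lam r)(r)          =", float(np.exp(log_post) @ r))
print("K(pi_exp(-2 lam r), above) =", kl(gibbs_log_weights(log_prior, 2 * lam, r), log_post))
print("log pi_exp(-lam r)[exp(h)] =", log_laplace(log_post, h))
```

Working with log-weights throughout avoids the overflow and underflow that a direct exponentiation of −λr would cause for the large values of λ and γ used in this section.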
To help understand the previous theorem, it may be useful to give linear upper-bounds to the factors appearing in the right-hand side of the previous inequality.Introducing e θ such that R ( e θ ) = inf Θ R (assuming that such a parameter exists)and remembering thatΨ − a ( p, m ) ≤ a − sinh( a ) p + 2 a − sinh( a ) m, a ∈ R + , Φ − a ( p ) ≤ a − (cid:2) exp( a ) − (cid:3) p, a ∈ R + , Ψ a ( p, m ) ≥ a − sinh( a ) p − a − sinh( a ) m, a ∈ R + ,M ′ ( θ , θ ) ≤ M ′ ( θ , e θ ) + M ′ ( θ , e θ ) , θ , θ ∈ Θ ,M ′ ( θ , e θ ) ≤ xR ′ ( θ , e θ ) + ϕ ( x ) , x ∈ R + , θ ∈ Θ , the last inequality being rather a consequence of the definition of ϕ than a propertyof M ′ , we easily see that π exp( − β R ) ⊗ π exp( − β R ) (cid:2) Ψ − γN ( R ′ , M ′ ) (cid:3) ≤ Nγ sinh( γN ) (cid:2) π exp( − β R ) ( R ) − π exp( − β R ) ( R ) (cid:3) + Nγ sinh( γ N ) π exp( − β R ) ⊗ π exp( − β R ) ( M ′ ) ≤ Nγ sinh( γN ) (cid:2) π exp( − β R ) ( R ) − π exp( − β R ) ( R ) (cid:3) + 2 xNγ sinh( γ N ) n π exp( − β R ) (cid:2) R ′ ( · , e θ ) (cid:3) + π exp( − β R ) (cid:2) R ′ ( · , e θ ) (cid:3)o + 4 Nγ sinh( γ N ) ϕ ( x ) , that C ( β, γ ) ≤ log (cid:26) π exp( − βR ) n exp h N sinh (cid:0) γ N (cid:1) π exp( − βR ) ( M ′ ) io(cid:27) ≤ log (cid:26) π exp( − βR ) n exp h N sinh (cid:0) γ N (cid:1) M ′ ( · , e θ ) io(cid:27) + 2 N sinh( γ N ) π exp( − βR ) (cid:2) M ′ ( · , e θ ) (cid:3) ≤ log (cid:26) π exp( − βR ) n exp h xN sinh( γ N ) R ′ ( · , e θ ) io(cid:27) + 2 xN sinh( γ N ) π exp( − βR ) (cid:2) R ′ ( · , e θ ) (cid:3) + 4 N sinh( γ N ) ϕ ( x )= Z ββ − xN sinh( γ N ) π exp( − αR ) (cid:2) R ′ ( · , e θ ) (cid:3) dα .1. Bounds relative to a Gibbs distribution 63+ 2 xN sinh( γ N ) π exp( − βR ) (cid:2) R ′ ( · , e θ ) (cid:3) + 4 N sinh( γ N ) ϕ ( x ) ≤ xN sinh( γ N ) π exp[ − ( β − xN sinh( γ N ) ) R ] (cid:2) R ′ ( · , e θ ) (cid:3) + 4 N sinh( γ N ) ϕ ( x ) , and thatlog n π ⊗ − βR ) h exp (cid:16) N α Φ − α ◦ M ′ (cid:17)io ≤ n π exp( − βR ) h exp (cid:16) N (cid:2) exp( α ) − (cid:3) M ′ ( · , e θ ) (cid:17)io ≤ xN (cid:2) exp( α ) − (cid:3) π exp[ − ( β − xN [exp( α ) − R ] (cid:2) R ′ ( · , e θ ) (cid:3) + 2 xN (cid:2) exp( α ) − (cid:3) ϕ ( x ) . Let us push further the investigation under the parametric assumption that forsome positive real constant d (2.8) lim β → + ∞ βπ exp( − βR ) (cid:2) R ′ ( · , e θ ) (cid:3) = d, This assumption will for instance hold true with d = n when R : Θ → (0 , 1) is asmooth function defined on a compact subset Θ of R n that reaches its minimumvalue on a finite number of non-degenerate (i.e. with a positive definite Hessian)interior points of Θ, and π is absolutely continuous with respect to the Lebesguemeasure on Θ and has a smooth density.In case of assumption (2.8), if we restrict ourselves to sufficiently large values ofthe constants β , β , β , λ , λ and γ (the smaller of which is as a rule β , as wewill see), we can use the fact that for some (small) positive constant δ , and some(large) positive constant A ,(2.9) dα (1 − δ ) ≤ π exp( − αR ) (cid:2) R ′ ( · , e θ ) (cid:3) ≤ dα (1 + δ ) , α ≥ A. Under this assumption, π exp( − β R ) ⊗ π exp( − β R ) (cid:2) Ψ − γN ( R ′ , M ′ ) (cid:3) ≤ Nγ sinh( γN ) (cid:2) dβ (1 + δ ) − dβ (1 − δ ) (cid:3) + xNγ sinh( γ N ) (1 + δ ) (cid:2) dβ + dβ (cid:3) + Nγ sinh( γ N ) ϕ ( x ) .C ( β, γ ) ≤ d (1 + δ ) log (cid:16) ββ − xN sinh( γ N ) (cid:17) + 2 xN sinh( γ N ) δ ) dβ + 4 N sinh( γ N ) ϕ ( x ) . 
log n π ⊗ − βR ) h exp (cid:16) N α Φ − α ◦ M ′ (cid:17)io ≤ xN (cid:2) exp( α ) − (cid:3) d (1 + δ ) β − xN [exp( α ) − 1] + 2 N (cid:2) exp( α ) − (cid:3) ϕ ( x ) . Thus with P probability at least 1 − η , B ( π exp( − λ r ) , β, γ ) ≤ − ( γ − λ ) Nγ sinh( γN ) dβ (1 − δ )+ ( γ − λ ) (cid:26) Nγ sinh( γN ) (1+ δ ) dβ + xNγ sinh( γ N ) (1 + δ ) (cid:2) dβ + dβ (cid:3) + Nγ sinh( γ N ) ϕ ( x ) + log( η ) γ Chapter 2. Comparing posterior distributions to Gibbs priors + 4 xN sinh( γ N ) δ ) dβ − xN sinh( γ N ) + 4 N sinh( γ N ) ϕ ( x ) + log( η ) Nλ β sinh( γN ) − γ + 4 xN sinh( γ N ) δ ) dβ − xN sinh( γ N ) + 4 N sinh( γ N ) ϕ ( x ) + log( η ) γ − Nλ β sinh( γN ) (cid:27) + 2 λ γ (cid:26) xN sinh( γ N ) δ ) dNλ γ sinh( γN ) − xN sinh( γ N ) + 4 N sinh( γ N ) ϕ ( x ) + log( η ) (cid:27) + (cid:16) − λ γ (cid:17)( d (1 + δ ) λ sinh (cid:0) γN (cid:1) xγ h exp (cid:16) log[cosh( γN )]1 − λ γ (cid:17) − i − ! − + 2 N h exp (cid:16) log[cosh( γN )]1 − λ γ (cid:17) − i ϕ ( x ) ) + (cid:16) − λ γ (cid:17) log( η ) − log (cid:2) ν ( { β } ) ν ( { γ } ) ǫ (cid:3) + 1 − λ γNλ βγ tanh( γN ) − ( λ γ (cid:26) xN sinh( γ N ) δ ) dNλ γ sinh( γN ) − xN sinh( γ N ) + 4 N sinh( γ N ) ϕ ( x ) + log( η ) (cid:27) + (cid:16) − λ γ (cid:17)" d (1 + δ ) λ sinh (cid:0) γN (cid:1) xγ h exp (cid:16) log[cosh( γN )]1 − λ γ (cid:17) − i − ! − + 2 N h exp (cid:16) log[cosh( γN )]1 − λ γ (cid:17) − i ϕ ( x ) + (cid:16) − λ γ (cid:17) log( η ) − log (cid:2) ν ( β ) ν ( γ ) ǫ (cid:3)) . Now let us choose for simplicity β = 2 λ = 4 β , β = λ / γ/ 4, and let usintroduce the notation C = Nγ sinh( γN ) ,C = Nγ tanh( γN ) ,C = N γ (cid:2) exp( γ N ) − (cid:3) and C = 2 N (1 − βγ ) γ h exp (cid:16) γ N (1 − βγ ) (cid:17) − i , to obtain B ( π exp( − λ r ) , β, γ ) ≤ − C γ β (1 − δ ) d .1. Bounds relative to a Gibbs distribution C γ (cid:26) δ ) dγ + x γ N (1 + δ ) (cid:2) dγ + d β (cid:3) + γN ϕ ( x ) (cid:27) + log (cid:0) η (cid:1) + 12 C − h (1 + δ ) d (cid:16) N xC γ − (cid:17) − + C γ N ϕ ( x ) + log( η ) i + 12 − C (cid:20) δ ) d (cid:16) NβxC γ − (cid:17) − + C γ N ϕ ( x ) + log( η ) (cid:21) + 2 xγ (1 + δ ) dN − xγ + C γ N ϕ ( x ) + log( η )+ d (1 + δ ) xγN (cid:18) C C − xγN (cid:19) − + γ N C ϕ ( x ) + log( η )2 − log (cid:2) ν ( β ) ν ( γ ) ǫ (cid:3) + (cid:16) C − (cid:17) − ( βγ (cid:26) x γ N C (1 + δ ) d (cid:16) βC − xC γ N (cid:17) − + γ N ϕ ( x ) + log( η ) (cid:27) + (cid:16) − βγ (cid:17)(cid:26) d (1 + δ ) xγN (cid:20) βC γC (cid:18) − βγ (cid:19) − xγN (cid:21) − + γ N (1 − βγ ) C ϕ ( x ) (cid:27) + (cid:16) − βγ (cid:17) log( η ) − log (cid:2) ν ( β ) ν ( γ ) ǫ (cid:3)) . This simplifies to B ( π exp( − λ r ) , β, γ ) ≤ − C − δ ) d γβ + 2 C (1 + δ ) d + log( η ) (cid:20) C (4 C − − C ) + 1 + βγ C − (cid:21) − (cid:0) C − (cid:1) log (cid:2) ν ( β ) ν ( γ ) ǫ (cid:3) + (1 + δ ) dxγN (cid:26) C + C − (cid:16) C − γxN (cid:17) − + 2 (cid:16) − γxN (cid:17) − + (cid:16) C C − γxN (cid:17) − + C βγ (4 C − (cid:27) + (1 + δ ) dxγ N β (cid:26) C + − C (cid:16) C − xγ Nβ (cid:17) − + (cid:16) − βγ (cid:17) C − h C C (cid:16) − βγ (cid:17) − γ xβN i − (cid:27) + γ N ϕ ( x ) (cid:26) C + C C − + C − C + C + βγ (4 C − + C C − (cid:27) . This shows that there exist universal positive real constants A , A , B , B , B ,and B such that as soon as γ max { x, } N ≤ A βγ ≤ A , B ( π exp( − λ r ) , β, γ ) ≤ − B (1 − δ ) d γβ + B (1 + δ ) d Chapter 2. Comparing posterior distributions to Gibbs priors − B log (cid:2) ν ( β ) ν ( γ ) ǫ η (cid:3) + B γ N ϕ ( x ) . 
Thus π exp( − λ r ) ( R ) ≤ π exp( − βR ) ( R ) ≤ inf Θ R + (1+ δ ) dβ as soon as βγ ≤ B B δ )(1 − δ ) + B γ N ϕ ( x ) − B log[ ν ( β ) ν ( γ ) ǫη ](1 − δ ) d . Choosing some real ratio α > 1, we can now make the above result uniform forany(2.10) β, γ ∈ Λ α def = n α k ; k ∈ N , ≤ k < log( N )log( α ) o , by substituting ν ( β ) and ν ( γ ) with log( α )log( αN ) and − log( η ) with − log( η ) + 2 × log h log( αN )log( α ) i .Taking η = ǫ for simplicity, we can summarize our result in Theorem 2.1.15 . There exist positive real universal constants A , B , B , B and B such that for any positive real constants α > , d and δ , for any priordistribution π ∈ M (Θ) , with P probability at least − ǫ , for any β, γ ∈ Λ α (where Λ α is defined by equation (2.10) above) such that sup β ′ ∈ R ,β ′ ≥ β (cid:12)(cid:12)(cid:12)(cid:12) β ′ d (cid:2) π exp( − β ′ R ) ( R ) − inf Θ R (cid:3) − (cid:12)(cid:12)(cid:12)(cid:12) ≤ δ and such that also for some positive real parameter xγ max { x, } N ≤ Aβγ and βγ ≤ B B δ )(1 − δ ) + B γ N ϕ ( x ) − B log( ǫ )+4 B log (cid:2) log( N )log( α ) (cid:3) (1 − δ ) d , the bound B ( π exp( − γ r ) , β, γ ) given by Theorem 2.1.3 on page 54 in the case where wehave chosen ν to be the uniform probability measure on Λ α , satisfies B ( π exp( − γ r ) , β,γ ) ≤ , proving that b β ( π exp( − γ r ) ) ≥ β and therefore that π exp( − γ r ) ( R ) ≤ π exp( − βR ) ( R ) ≤ inf Θ R + (1 + δ ) dβ . What is important in this result is that we do not only bound π exp( − γ r ) ( R ),but also B ( π exp( − γ r ) , β, γ ), and that we do it uniformly on a grid of values of β and γ , showing that we can indeed set the constants β and γ adaptively using theempirical bound B ( π exp( − γ r ) , β, γ ).Let us see what we get under the margin assumption (1.24, page 39). When κ = 1, we have ϕ ( c − ) ≤ 0, leading to Corollary 2.1.16 . Assuming that the margin assumption (1.24, page 39) is sat-isfied for κ = 1 , that R : Θ → (0 , is independent of N (which is the case forinstance when P = P ⊗ N ), and is such that lim β ′ → + ∞ β ′ (cid:2) π exp( − β ′ R ) ( R ) − inf Θ R (cid:3) = d, .1. Bounds relative to a Gibbs distribution there are universal positive real constants B and B and N ∈ N such that for any N ≥ N , with P probability at least − ǫπ exp( − b γ r ) ( R ) ≤ inf Θ R + B dcN (cid:20) B d log (cid:18) log( N ) ǫ (cid:19)(cid:21) , where b γ ∈ arg max γ ∈ Λ max (cid:8) β ∈ Λ ; B ( π exp( − γ r ) , β, γ ) ≤ (cid:9) , where Λ is definedby equation (2.10, page 66), and B is the bound of Theorem 2.1.3 (page 54). When κ > ϕ ( x ) ≤ (1 − κ − ) (cid:0) κcx (cid:1) − κ − , and we can choose γ and x such that γ N ϕ ( x ) ≃ d to prove Corollary 2.1.17 . Assuming that the margin assumption (1.24, page 39) is sat-isfied for some exponent κ > , that R : Θ → (0 , is independent of N (which isfor instance the case when P = P ⊗ N ), and is such that lim β ′ → + ∞ β ′ (cid:2) π exp( − β ′ R ) ( R ) − inf Θ R (cid:3) = d, there are universal positive constants B and B and N ∈ N such that for any N ≥ N , with P probability at least − ǫ , π exp( − b γ r ) ( R ) ≤ inf Θ R + B c − κ − (cid:20) B d log (cid:18) log( N ) ǫ (cid:19)(cid:21) κ κ − (cid:18) dN (cid:19) κ κ − , where b γ ∈ arg max γ ∈ Λ max (cid:8) β ∈ Λ ; B ( π exp( − γ r ) , β, γ ) ≤ (cid:9) , Λ being defined byequation (2.10, page 66) and B by Theorem 2.1.3 (page 54). 
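The adaptive choice of γ̂ described in these two corollaries only requires evaluating the empirical bound B(π_exp(−γr), β, γ) of Theorem 2.1.3 on the finite grid Λ and keeping, for each γ, the largest β at which the bound is non-positive. The following sketch makes this selection loop explicit; the bound itself is abstracted as a user-supplied callable, and the stand-in used here has no statistical meaning.

```python
import math

def grid_Lambda(alpha, N):
    """Geometric grid {alpha**k : 0 <= k < log(N)/log(alpha)} of equation (2.10)."""
    k_max = int(math.log(N) / math.log(alpha))
    return [alpha ** k for k in range(k_max)]

def select_gamma_hat(bound_B, alpha, N):
    """gamma_hat maximizing max{beta in Lambda : B(pi_exp(-gamma r), beta, gamma) <= 0}.

    `bound_B(beta, gamma)` is a user-supplied callable evaluating the empirical
    bound of Theorem 2.1.3 for the Gibbs posterior pi_exp(-gamma r)."""
    Lambda = grid_Lambda(alpha, N)
    best_gamma, best_beta = None, -math.inf
    for gamma in Lambda:
        admissible = [beta for beta in Lambda if bound_B(beta, gamma) <= 0.0]
        if admissible and max(admissible) > best_beta:
            best_beta, best_gamma = max(admissible), gamma
    return best_gamma, best_beta

# Purely illustrative stand-in for the empirical bound (no statistical meaning).
toy_bound = lambda beta, gamma: beta - 0.1 * gamma + 5.0

gamma_hat, beta_hat = select_gamma_hat(toy_bound, alpha=2.0, N=10_000)
print("gamma_hat =", gamma_hat, " largest admissible beta =", beta_hat)
```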
We find the same rate of convergence as in Corollary 1.4.7 (page 40), but thistime, we were able to provide an empirical posterior distribution π exp( − b γ r ) whichachieves this rate adaptively in all the parameters (meaning in particular that we donot need to know d , c or κ ). Moreover, as already mentioned, the power of N in thisrate of convergence is known to be optimal in the worst case (see Mammen et al.(1999); Tsybakov (2004); Tsybakov et al. (2005), and more specifically in Audibert(2004b) — downloadable from its author’s web page — Theorem 3.3, page 132). Another interesting question is to estimate K (cid:2) ρ, π exp( − βR ) (cid:3) using relative deviationinequalities. We follow here an idea to be found first in (Audibert, 2004b, page 93).Indeed, combining equation (2.3, page 52) with equation (2.1, page 51), we see thatfor any positive real parameters β and λ , with P probability at least 1 − ǫ , for anyposterior distribution ρ : Ω → M (Θ), K (cid:2) ρ, π exp( − βR ) (cid:3) ≤ βN tanh( γN ) (cid:26) γ (cid:2) ρ ( r ) − π exp( − βR ) ( r ) (cid:3) + N log (cid:2) cosh( γN ) (cid:3) ρ ⊗ π exp( − βR ) ( m ′ )+ K (cid:2) ρ, π exp( − βR ) (cid:3) − log( ǫ ) (cid:27) + K ( ρ, π ) − K (cid:2) π exp( − βR ) , π (cid:3) Chapter 2. Comparing posterior distributions to Gibbs priors ≤ K (cid:2) ρ, π exp[ − βγN tanh( γN ) r ] (cid:3) + βN tanh( γN ) n K (cid:2) ρ, π exp( − βR ) (cid:3) − log( ǫ ) o + log (cid:20) π exp[ − βγN tanh( γN ) r ] n exp h β tanh( γN ) log (cid:2) cosh( γN ) (cid:3) ρ ( m ′ ) io(cid:21) . We thus obtain Theorem 2.1.18 . For any positive real constants β and γ such that β < N × tanh( γN ) , with P probability at least − ǫ , for any posterior distribution ρ : Ω → M (Θ) , K (cid:2) ρ, π exp( − βR ) (cid:3) ≤ (cid:18) − βN tanh (cid:16) γN (cid:17) − (cid:19) − × ( K (cid:2) ρ, π exp[ − βγN tanh( γN ) − r ] (cid:3) − βN tanh( γN ) log( ǫ )+ log n π exp[ − βγN tanh( γN ) − r ] h exp (cid:8) β tanh( γN ) − log[cosh( γN )] ρ ( m ′ ) (cid:9)io) . This theorem provides another way of measuring over-fitting, since it gives anupper bound for K (cid:2) π exp[ − βγN tanh( γN ) − r ] , π exp( − βR ) (cid:3) . It may be used in combinationwith Theorem 1.2.6 (page 11) as an alternative to Theorem 1.3.7 (page 21). It willalso be used in the next section.An alternative parametrization of the same result providing a simpler right-handside is also useful: Corollary 2.1.19 . For any positive real constants β and γ such that β < γ , with P probability at least − ǫ , for any posterior distribution ρ : Ω → M (Θ) , K (cid:2) ρ, π exp[ − N βγ tanh( γN ) R ] (cid:3) ≤ (cid:18) − βγ (cid:19) − ( K (cid:2) ρ, π exp( − βr ) (cid:3) − βγ log( ǫ )+ log n π exp( − βr ) h exp (cid:8) N βγ log (cid:2) cosh( γN ) (cid:3) ρ ( m ′ ) (cid:9)io) . Estimating the effective temperature of an estimator provides an efficient way totune parameters in a model with parametric behaviour. 
On the other hand, it willnot be fitted to choose between different models, especially when they are nested,because as we already saw in the case when Θ is a union of nested models, the priordistribution π exp( − βR ) does not provide an efficient localization of the parameter inthis case, in the sense that π exp( − βR ) ( R ) does not go down to inf Θ R at the desiredrate when β goes to + ∞ , requiring a resort to partial localization.Once some estimator (in the form of a posterior distribution) has been chosenin each sub-model, these estimators can be compared between themselves with thehelp of the relative bounds that we will establish in this section. It is also possible .2. Playing with two posterior and two local prior distributions π ⊗ π with π ⊗ π ),we easily obtain Theorem 2.2.1 . For any positive real constant λ , for any prior distributions π , π ∈ M (Θ) , with P probability at least − ǫ , for any posterior distributions ρ and ρ : Ω → M (Θ) , − N log n − tanh (cid:0) λN (cid:1)h ρ ( R ) − ρ ( R ) io ≤ λ (cid:2) ρ ( r ) − ρ ( r ) (cid:3) + N log (cid:2) cosh (cid:0) λN (cid:1)(cid:3) ρ ⊗ ρ ( m ′ )+ K (cid:0) ρ , π (cid:1) + K (cid:0) ρ , π (cid:1) − log( ǫ ) . This is where the entropy bound of the previous section enters into the game,providing a localized version of Theorem 2.2.1 (page 69). We will use the notation(2.11) Ξ a ( q ) = tanh( a ) − (cid:2) − exp( − aq ) (cid:3) ≤ a tanh( a ) q, a, q ∈ R . Theorem 2.2.2 . For any ǫ ∈ )0 , , any sequence of prior distributions ( π i ) i ∈ N ∈ M (Θ) N , any probability distribution µ on N , any atomic probability distribution ν on R + , with P probability at least − ǫ , for any posterior distributions ρ , ρ : Ω → M (Θ) , ρ ( R ) − ρ ( R ) ≤ B ( ρ , ρ ) , where B ( ρ , ρ ) = inf λ,β <γ ,β <γ ∈ R + ,i,j ∈ N Ξ λN ((cid:2) ρ ( r ) − ρ ( r ) (cid:3) + Nλ log (cid:2) cosh( λN ) (cid:3) ρ ⊗ ρ ( m ′ )+ 1 λ (cid:16) − β γ (cid:17) (cid:26) K (cid:2) ρ , π i exp( − β r ) (cid:3) + log n π i exp( − β r ) h exp (cid:8) β Nγ log (cid:2) cosh( γ N ) (cid:3) ρ ( m ′ ) (cid:9)io − β γ log (cid:2) ν ( γ ) (cid:3)(cid:27) + 1 λ (cid:16) − β γ (cid:17) (cid:26) K (cid:2) ρ , π j exp( − β r ) (cid:3) + log n π j exp( − β r ) h exp (cid:8) β Nγ log (cid:2) cosh( γ N ) (cid:3) ρ ( m ′ ) (cid:9)io − β γ log (cid:2) ν ( γ ) (cid:3)(cid:27) − h(cid:0) γ β − (cid:1) − + (cid:0) γ β − (cid:1) − + 1 i log (cid:2) − ν ( β ) ν ( β ) ν ( λ ) µ ( i ) µ ( j ) ǫ (cid:3) λ ) . Chapter 2. Comparing posterior distributions to Gibbs priors The sequence of prior distributions ( π i ) i ∈ N should be understood to be typicallysupported by subsets of Θ corresponding to parametric sub-models, that is sub-models for which it is reasonable to expect thatlim β → + ∞ β (cid:2) π i exp( − βR ) ( R ) − ess inf π i R (cid:3) exists and is positive and finite. As there is no reason why the bound B ( ρ , ρ ) pro-vided by the previous theorem should be sub-additive (in the sense that B ( ρ , ρ ) ≤ B ( ρ , ρ )+ B ( ρ , ρ )), it is adequate to consider some workable subset P of posteriordistributions (for instance the distributions of the form π i exp( − βr ) , i ∈ N , β ∈ R + ),and to define the sub-additive chained bound(2.12) e B ( ρ, ρ ′ ) = inf ( n − X k =0 B ( ρ k , ρ k +1 ); n ∈ N ∗ , ( ρ k ) nk =0 ∈ P n +1 ,ρ = ρ, ρ n = ρ ′ ) , ρ, ρ ′ ∈ P . Proposition 2.2.3 . With P probability at least − ǫ , for any posterior distribu-tions ρ , ρ ∈ P , ρ ( R ) − ρ ( R ) ≤ e B ( ρ , ρ ) . 
Moreover for any posterior distribution ρ ∈ P , any posterior distribution ρ ∈ P such that e B ( ρ , ρ ) = inf ρ ∈ P e B ( ρ , ρ ) isunimprovable with the help of e B in P in the sense that inf ρ ∈ P e B ( ρ , ρ ) ≥ . Proof. The first assertion is a direct consequence of the previous theorem, so onlythe second assertion requires a proof: for any ρ ∈ P , we deduce from the optimalityof ρ and the sub-additivity of e B that e B ( ρ , ρ ) ≤ e B ( ρ , ρ ) ≤ e B ( ρ , ρ ) + e B ( ρ , ρ ) . (cid:3) This proposition provides a way to improve a posterior distribution ρ ∈ P bychoosing ρ ∈ arg min ρ ∈ P e B ( ρ , ρ ) whenever e B ( ρ , ρ ) < 0. This improvement isproved by Proposition 2.2.3 to be one-step: the obtained improved posterior ρ cannot be improved again using the same technique.Let us give some examples of possible starting distributions ρ for this improve-ment scheme: ρ may be chosen as the best posterior Gibbs distribution accordingto Proposition 2.1.5 (page 56). More precisely, we may build from the prior distri-butions π i , i ∈ N , a global prior π = P i ∈ N µ ( i ) π i . We can then define the estimatorof the inverse effective temperature as in Proposition 2.1.5 (page 56) and choose ρ ∈ arg min ρ ∈ P b β ( ρ ), where P is as suggested above the set of posterior distribu-tions P = n π i exp( − βr ) ; i ∈ N , β ∈ R + o . This starting point ρ should already be pretty good, at least in an asymptoticperspective, the only gain in the rate of convergence to be expected bearing onspurious log( N ) factors. More elaborate uses of relative bounds are described in the third section of thesecond chapter of Audibert (2004b), where an algorithm is proposed and analysed, .2. Playing with two posterior and two local prior distributions P is finite (so that among other things any ordering of it has a firstelement).It is natural to define the estimated complexity of any given posterior distribution ρ ∈ P in our working set as the bound for inf i ∈ N K ( ρ, π i ) used in Theorem 2.2.1(page 69). This leads to set (given some confidence level 1 − ǫ ) C ( ρ ) = inf β<γ ∈ R + ,i ∈ N (cid:18) − βγ (cid:19) − (cid:26) K (cid:2) ρ, π i exp( − βr ) (cid:3) + log n π i exp( − βr ) h exp (cid:8) β Nγ log (cid:2) cosh( γN ) (cid:3) ρ ( m ′ ) (cid:9)io − βγ log (cid:2) − ν ( γ ) ν ( β ) µ ( i ) ǫ (cid:3)(cid:27) . Let us moreover call γ ( ρ ), β ( ρ ) and i ( ρ ) the values achieving this infimum, ornearly achieving it, which requires a slight change of the definition of C ( ρ ) to takethis modification into account. For the sake of simplicity, we can assume withoutsubstantial loss of generality that the supports of ν and µ are large but finite, andthus that the minimum is reached.To understand how this notion of complexity comes into play, it may be inter-esting to keep in mind that for any posterior distributions ρ and ρ ′ we can writethe bound in Theorem 2.2.2 (page 69) as(2.13) B ( ρ, ρ ′ ) = inf λ ∈ R + Ξ λN (cid:2) ρ ′ ( r ) − ρ ( r ) + S λ ( ρ, ρ ′ ) (cid:3) , where S λ ( ρ, ρ ′ ) = S λ ( ρ ′ , ρ ) ≤ Nλ log (cid:2) cosh( λN ) (cid:3) ρ ⊗ ρ ′ ( m ′ ) + C ( ρ ) + C ( ρ ′ ) λ − log(3 − ǫ ) λ − log (cid:8) ν (cid:2) β ( ρ ) (cid:3) µ (cid:2) i ( ρ ) (cid:3)(cid:9) λ (cid:0) − β ( ρ ′ ) γ ( ρ ′ ) (cid:1) − log (cid:8) ν (cid:2) β ( ρ ′ ) (cid:3) µ (cid:2) i ( ρ ′ ) (cid:3)(cid:9) λ (cid:0) − β ( ρ ) γ ( ρ ) (cid:1) − h(cid:0) γ ( ρ ) β ( ρ ) − (cid:1) − + (cid:0) γ ( ρ ′ ) β ( ρ ′ ) − (cid:1) − + 1 i log (cid:2) ν ( λ ) (cid:3) λ . (Let us recall that the function Ξ is defined by equation (2.11, page 69).) 
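Since the bound B of Theorem 2.2.2 need not be sub-additive, its chained version B̃ of equation (2.12) is, on a finite working set P, an all-pairs shortest-path computation on the complete directed graph whose edge weights are the pairwise bounds. A minimal sketch of this computation follows; the matrix of pairwise bounds is a made-up placeholder, to be replaced in practice by the values produced by the theorem.

```python
import numpy as np

def chained_bound(B):
    """Chained bound B_tilde[i, j] = min over chains i = k_0, ..., k_n = j of the sum
    of B[k_l, k_{l+1}] (Floyd-Warshall on the complete graph; negative entries are
    precisely the interesting case)."""
    Bt = B.astype(float).copy()
    np.fill_diagonal(Bt, 0.0)
    for k in range(Bt.shape[0]):
        Bt = np.minimum(Bt, Bt[:, [k]] + Bt[[k], :])
    return Bt

# Placeholder matrix of pairwise bounds between 4 posterior distributions of P.
B = np.array([
    [ 0.00,  0.03,  0.08,  0.12],
    [-0.01,  0.00,  0.02,  0.09],
    [-0.04, -0.02,  0.00,  0.05],
    [-0.05, -0.03, -0.01,  0.00],
])
print(np.round(chained_bound(B), 3))
```

For a working set of M posterior distributions this costs of the order of M³ operations once the M² pairwise bounds are available.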
Thus forany ρ, ρ ′ such that B ( ρ ′ , ρ ) > 0, we can deduce from the monotonicity of Ξ λN that ρ ′ ( r ) − ρ ( r ) ≤ inf λ ∈ R + S λ ( ρ, ρ ′ ) , proving that the left-hand side is small, and consequently that B ( ρ, ρ ′ ) and itschained counterpart defined by equation (2.12, page 70) are small: e B ( ρ, ρ ′ ) ≤ B ( ρ, ρ ′ ) ≤ inf λ ∈ R + Ξ λN (cid:2) S λ ( ρ, ρ ′ ) (cid:3) . It is also worth noticing that B ( ρ, ρ ′ ) and e B ( ρ, ρ ′ ) are upper bounded in terms ofvariance and complexity only.2 Chapter 2. Comparing posterior distributions to Gibbs priors The presence of the ratios γ ( ρ ) β ( ρ ) should not be obnoxious, since their values shouldbe automatically tamed by the fact that β ( ρ ) and γ ( ρ ) should make the estimateof the complexity of ρ optimal.As an alternative, it is possible to restrict to set of parameter values β and γ such that, for some fixed constant ζ > 1, the ratio γβ is bounded away from 1 bythe inequality γβ ≥ ζ . This leads to an alternative definition of C ( ρ ): C ( ρ ) = inf γ ≥ ζβ ∈ R + ,i ∈ N (cid:18) − βγ (cid:19) − (cid:26) K (cid:2) ρ, π i exp( − βr ) (cid:3) + log n π i exp( − βr ) h exp (cid:8) β Nγ log (cid:2) cosh( γN ) (cid:3) ρ ( m ′ ) (cid:9)io − βγ log (cid:2) − ν ( γ ) ν ( β ) µ ( i ) ǫ (cid:3)(cid:27) − log (cid:2) ν ( β ) µ ( i ) (cid:3) (1 − ζ − ) − log(3 − ǫ )2 . We can even push simplification a step further, postponing the optimization of theratio γβ , and setting it to the fixed value ζ . This leads us to adopt the definition(2.14) C ( ρ ) = inf β ∈ R + ,i ∈ N (cid:0) − ζ − (cid:1) − (cid:26) K (cid:2) ρ, π i exp( − βr ) (cid:3) + log n π i exp( − βr ) h exp (cid:8) Nζ log (cid:2) cosh( ζβN ) (cid:3) ρ ( m ′ ) (cid:9)io(cid:27) − ζ + 1 ζ − (cid:26) log (cid:2) ν ( β ) µ ( i ) (cid:3) + 2 − log(3 − ǫ ) (cid:27) . With either of these modified definitions of the complexity C ( ρ ), we get the upperbound(2.15) S λ ( ρ, ρ ′ ) ≤ e S λ ( ρ, ρ ′ ) def = Nλ log (cid:2) cosh( λN ) (cid:3) ρ ⊗ ρ ′ ( m ′ )+ 1 λ (cid:26) C ( ρ ) + C ( ρ ′ ) − ζ + 1 ζ − (cid:2) ν ( λ ) (cid:3)(cid:27) . With these definitions, we have for any posterior distributions ρ and ρ ′ B ( ρ, ρ ′ ) ≤ inf λ ∈ R + Ξ λN n ρ ′ ( r ) − ρ ( r ) + e S λ ( ρ, ρ ′ ) o . Consequently in the case when B ( ρ ′ , ρ ) > 0, we get e B ( ρ, ρ ′ ) ≤ B ( ρ, ρ ′ ) ≤ inf λ ∈ R + Ξ λN (cid:2) e S λ ( ρ, ρ ′ ) (cid:3) . To select some nearly optimal posterior distribution in P , it is appropriate to or-der the posterior distributions of P according to increasing values of their complex-ity C ( ρ ) and consider some indexation P = { ρ , . . . , ρ M } , where C ( ρ k ) ≤ C ( ρ k +1 ),1 ≤ k < M .Let us now consider for each ρ k ∈ P the first posterior distribution in P whichcannot be proved to be worse than ρ k according to the bound e B :(2.16) t ( k ) = min n j ∈ { , . . . M } : e B ( ρ j , ρ k ) > o . .2. Playing with two posterior and two local prior distributions e B ( ρ, ρ ) = 0, for any posteriordistribution ρ . Let us now define our estimated best ρ ∈ P as ρ b k , where(2.17) b k = min(arg max t ) . Thus we take the posterior with smallest complexity which can be proved to be bet-ter than the largest starting interval of P in terms of estimated relative classificationerror.The following theorem is a simple consequence of the chosen optimisation scheme.It is valid for any arbitrary choice of the complexity function ρ C ( ρ ). Theorem 2.2.4 . Let us put b t = t ( b k ) , where t is defined by equation (2.16) and b k is defined by equation (2.17) . 
With P probability at least − ǫ , ρ b k ( R ) ≤ ρ j ( R ) + , ≤ j < b t, e B ( ρ j , ρ t ( j ) ) , b t ≤ j < b k, e B ( ρ j , ρ b t ) + e B ( ρ b t , ρ b k ) , j ∈ (arg max t ) , e B ( ρ j , ρ b k ) , j ∈ (cid:8)b k + 1 , . . . , M (cid:9) \ (arg max t ) , where the chained bound e B is defined from the bound of Theorem 2.2.2 (page 69)by equation (2.12, page 70). In the mean time, for any j such that b t ≤ j < b k , t ( j ) < b t = max t , because j (arg max t ) . Thus ρ b k ( R ) ≤ ρ t ( j ) ( R ) ≤ ρ j ( R ) + inf λ ∈ R + Ξ λN (cid:2) S λ ( ρ j , ρ t ( j ) ) (cid:3) while ρ t ( j ) ( r ) ≤ ρ j ( r ) + inf λ ∈ R + S λ ( ρ j , ρ t ( j ) ) , where the function Ξ is defined by equation (2.11, page 69) and S λ is defined byequation (2.13, page 71). For any j ∈ (arg max t ) , (including notably b k ), B ( ρ b t , ρ j ) ≥ e B ( ρ b t , ρ j ) > ,B ( ρ j , ρ b t ) ≥ e B ( ρ j , ρ b t ) > , so in this case ρ b k ( R ) ≤ ρ j ( R ) + inf λ ∈ R + Ξ λN h S λ ( ρ j , ρ b t ) + S λ ( ρ b t , ρ b k ) + S λ ( ρ j , ρ b k ) i , while ρ b t ( r ) ≤ ρ j ( r ) + inf λ ∈ R + S λ ( ρ j , ρ b t ) ,ρ b k ( r ) ≤ ρ b t ( r ) + inf λ ∈ R + S λ ( ρ b t , ρ b k ) , and ρ b t ( R ) ≤ ρ j ( R ) + inf λ ∈ R + Ξ λN (cid:2) S λ ( ρ j , ρ b t ) (cid:3) . Finally in the case when j ∈ (cid:8)b k + 1 , . . . , M (cid:9) \ (arg max t ) , due to the fact that inparticular j (arg max t ) , B ( ρ b k , ρ j ) ≥ e B ( ρ b k , ρ j ) > . Thus in this last case ρ b k ( R ) ≤ ρ j ( R ) + inf λ ∈ R + Ξ λN (cid:2) S λ ( ρ j , ρ b k ) (cid:3) , while ρ b k ( r ) ≤ ρ j ( r ) + inf λ ∈ R + S λ ( ρ j , ρ b k ) . Chapter 2. Comparing posterior distributions to Gibbs priorsThus for any j = 1 , . . . , M , ρ b k ( R ) − ρ j ( R ) is bounded from above by an empiricalquantity involving only variance and entropy terms of posterior distributions ρ ℓ suchthat ℓ ≤ j , and therefore such that C ( ρ ℓ ) ≤ C ( ρ j ) . Moreover, these distributions ρ ℓ are such that ρ ℓ ( r ) − ρ j ( r ) and ρ ℓ ( R ) − ρ j ( R ) have an empirical upper bound ofthe same order as the bound stated for ρ b k ( R ) − ρ j ( R ) — namely the bound for ρ ℓ ( r ) − ρ j ( r ) is in all circumstances not greater than Ξ − λN applied to the boundstated for ρ b k ( R ) − ρ j ( R ) , whereas the bound for ρ ℓ ( R ) − ρ j ( R ) is always smallerthan two times the bound stated for ρ b k ( R ) − ρ j ( R ) . This shows that variance termsare between posterior distributions whose empirical as well as expected error ratescannot be much larger than those of ρ j . Let us remark that the estimation scheme described in this theorem is verygeneral, the same method can be used as soon as some confidence interval for therelative expected risks − B ( ρ , ρ ) ≤ ρ ( R ) − ρ ( R ) ≤ B ( ρ , ρ ) with P probability at least 1 − ǫ, is available. The definition of the complexity is arbitrary, and could in an abstractcontext be chosen as C ( ρ ) = inf ρ = ρ B ( ρ , ρ ) + B ( ρ , ρ ) . Proof. The case when 1 ≤ j < b t is straightforward from the definitions: when j < b t , e B ( ρ j , ρ b k ) ≤ ρ b k ( R ) ≤ ρ j ( R ).In the second case, that is when b t ≤ j < b k , j cannot be in arg max t , because ofthe special choice of b k in arg max t . Thus t ( j ) < b t and we deduce from the first casethat ρ b k ( R ) ≤ ρ t ( j ) ( R ) ≤ ρ j ( R ) + e B ( ρ j , ρ t ( j ) ) . 
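The selection rule of equations (2.16) and (2.17), which Theorem 2.2.4 analyses, is easy to spell out once the chained bounds and the empirical complexities are available. A minimal sketch follows (0-based indices; the names are ours, and tilde-B evaluated at (ρ_j, ρ_k) is again read as an upper bound on ρ_k(R) − ρ_j(R)):

```python
import numpy as np

def select_posterior(Bt, complexity):
    """Selection scheme of equations (2.16) and (2.17).

    Bt[j, k] is the chained bound tilde-B(rho_j, rho_k); complexity[k] is
    C(rho_k).  After reindexing P by increasing complexity, t(k) is the first
    index that cannot be proved to be worse than rho_k, and the selected
    posterior is the one of smallest complexity among those maximising t.
    """
    order = np.argsort(complexity, kind="stable")
    Bs = Bt[np.ix_(order, order)]
    M = Bs.shape[0]

    def t(k):
        above = np.flatnonzero(Bs[:, k] > 0.0)   # j with tilde-B(rho_j, rho_k) > 0
        return int(above[0]) if above.size else M

    ts = np.array([t(k) for k in range(M)])
    k_hat = int(np.flatnonzero(ts == ts.max())[0])   # min(arg max t)
    return order[k_hat]                              # index in the original labelling

# Toy data: rho_1 has moderate complexity and is provably at least as good as rho_0.
Bt = np.array([[0.0, -0.2, 0.3],
               [0.4,  0.0, 0.2],
               [0.6,  0.5, 0.0]])
complexity = np.array([1.0, 2.0, 5.0])
print(select_posterior(Bt, complexity))   # 1
```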
Moreover, we see from the defintion of t that e B ( ρ t ( j ) , ρ j ) > 0, implying ρ t ( j ) ( r ) ≤ ρ j ( r ) + inf λ ∈ R + S λ ( ρ j , ρ t ( j ) ) , and therefore that ρ b k ( R ) ≤ ρ j ( R ) + inf λ Ξ λN (cid:2) S λ ( ρ j , ρ t ( j ) ) (cid:3) . In the third case j belongs to arg max t . In this case, we are not sure that e B ( ρ b k , ρ j ) > 0, and it is appropriate to involve b t , which is the index of the firstposterior distribution which cannot be improved by ρ b k , implying notably that e B ( ρ b t , ρ k ) > k ∈ arg max t . On the other hand, ρ b t cannot either improveany posterior distribution ρ k with k ∈ (arg max t ), because this would imply for any ℓ < b t that e B ( ρ ℓ , ρ b t ) ≤ e B ( ρ ℓ , ρ k ) + e B ( ρ k , ρ b t ) ≤ 0, and therefore that t ( b t ) ≥ b t + 1, incontradiction of the fact that b t = max t . Thus e B ( ρ k , ρ b t ) > 0, and these two remarksimply that ρ b t ( r ) ≤ ρ j ( r ) + inf λ ∈ R + S λ ( ρ j , ρ b t ) ,ρ b k ( r ) ≤ ρ b t ( r ) + inf λ ∈ R + S λ ( ρ b t , ρ b k ) ≤ ρ j ( r ) + inf λ ∈ R + S λ ( ρ j , ρ b t ) + inf λ ∈ R + S λ ( ρ b t , ρ b k ) , .2. Playing with two posterior and two local prior distributions ρ b k ( R ) ≤ ρ j ( R ) + e B ( ρ j , ρ b k ) ≤ ρ j ( R ) + inf λ ∈ R + Ξ λN h S λ ( ρ j , ρ b t ) + S λ ( ρ b t , ρ b k ) + S λ ( ρ j , ρ b k ) i and that ρ b t ( R ) ≤ ρ j ( R ) + inf λ ∈ R + Ξ λN (cid:2) S λ ( ρ j , ρ b t ) (cid:3) ≤ ρ j ( R ) + 2 inf λ ∈ R + λN (cid:2) S λ ( ρ j , ρ b t ) (cid:3) , the last inequality being due to the fact that Ξ λN is a concave function. Let usnotice that it may be the case that b k < b t , but that only the case when j ≥ b t is tobe considered, since otherwise we already know that ρ b k ( R ) ≤ ρ j ( R ).In the fourth case, j is greater than b k , and the complexity of ρ j is larger than thecomplexity of ρ b k . Moreover, j is not in arg max t , and thus e B ( ρ b k , ρ j ) > 0, becauseotherwise, the sub-additivity of e B would imply that e B ( ρ ℓ , ρ j ) ≤ ℓ ≤ b t andtherefore that t ( j ) ≥ b t = max t . Therefore ρ b k ( r ) ≤ ρ j ( r ) + inf λ ∈ R + S λ ( ρ j , ρ b k ) , and ρ b k ( R ) ≤ ρ j ( R ) + e B ( ρ j , ρ b k ) ≤ ρ j ( R ) + inf λ ∈ R + Ξ λN (cid:2) S λ ( ρ j , ρ b k ) (cid:3) . (cid:3) Let us start our investigation of the theoretical properties of the algorithm describedin Theorem 2.2.4 (page 73) by computing some non-random upper bounds for B ( ρ, ρ ′ ), the bound of Theorem 2.2.2 (page 69), and C ( ρ ), the complexity factordefined by equation (2.14, page 72), for any ρ, ρ ′ ∈ P .This analysis will be done in the case when P = n π i exp( − βr ) : ν ( β ) > , µ ( i ) > o , in which it will be possible to get some control on the randomness of any ρ ∈ P ,in addition to controlling the other random expressions appearing in the definitionof B ( ρ, ρ ′ ), ρ, ρ ′ ∈ P . We will also use a simpler choice of complexity function,removing from equation (2.14 page 72) the optimization in i and β and usinginstead the definition(2.18) C ( π i exp( − βr ) ) def = (cid:0) − ζ − (cid:1) − log (cid:26) π i exp( − βr ) (cid:20) exp n Nζ log (cid:2) cosh (cid:0) ζβN (cid:1)(cid:3) π i exp( − βr ) ( m ′ ) o(cid:21)(cid:27) + ζ + 1 ζ − (cid:2) ν ( β ) µ ( i ) (cid:3) . With this definition,6 Chapter 2. 
Comparing posterior distributions to Gibbs priors S λ ( π i exp( − βr ) , π j exp( − β ′ r ) ) ≤ Nλ log (cid:2) cosh( λN ) (cid:3) π i exp( − βr ) ⊗ π j exp( − β ′ r ) ( m ′ )+ C (cid:2) π i exp( − βr ) (cid:3) + C (cid:2) π j exp( − β ′ r ) (cid:3) λ + ( ζ + 1)( ζ − λ log (cid:2) − ν ( λ ) ǫ (cid:3) , where S λ is defined by equation (2.13, page 71), so that B (cid:2) π i exp( − βr ) , π j exp( − β ′ r ) (cid:3) = inf λ ∈ R + Ξ λN n π j exp( − β ′ r ) ( r ) − π i exp( − βr ) ( r )+ S λ (cid:2) π i exp( − βr ) , π j exp( − βr ) (cid:3)o . Let us successively bound the various random factors entering into the defini-tion of B (cid:2) π i exp( − βr ) , π j exp( − β ′ r ) (cid:3) . The quantity π j exp( − β ′ r ) ( r ) − π i exp( − βr ) ( r ) can bebounded using a slight adaptation of Proposition 2.1.11 (page 59). Proposition 2.2.5 . For any positive real constants λ, λ ′ and γ , with P probabilityat least − η , for any positive real constants β , β ′ such that β < λ γN sinh( γN ) − and β ′ > λ ′ γN sinh( γN ) − , π j exp( − β ′ r ) ( r ) − π i exp( − βr ) ( r ) ≤ π j exp( − λ ′ R ) ⊗ π i exp( − λR ) (cid:2) Ψ − γN ( R ′ , M ′ ) (cid:3) + log (cid:0) η (cid:1) γ + C j ( λ ′ , γ ) + log( η ) Nβ ′ λ ′ sinh( γN ) − γ + C i ( λ, γ ) + log( η ) γ − Nβλ sinh( γN ) , where C i ( λ, γ ) def = log (cid:26)Z Θ exp (cid:20) − γ Z Θ n Ψ γN (cid:2) R ′ ( θ , θ ) , M ′ ( θ , θ ) (cid:3) − Nγ sinh( γN ) R ′ ( θ , θ ) o π i exp( − λR ) ( dθ ) (cid:21) π i exp( − λR ) ( dθ ) (cid:27) ≤ log (cid:26) π i exp( − λR ) (cid:20) exp n N sinh (cid:0) γ N (cid:1) π i exp( − λR ) (cid:0) M ′ (cid:1)o(cid:21)(cid:27) . As for π i exp( − βr ) ⊗ π j exp( − β ′ r ) ( m ′ ), we can write with P probability at least 1 − η ,for any posterior distributions ρ and ρ ′ : Ω → M (Θ), γρ ⊗ ρ ′ ( m ′ ) ≤ log h π i exp( − λR ) ⊗ π j exp( − λ ′ R ) (cid:8) exp (cid:2) γ Φ − γN ( M ′ ) (cid:3)(cid:9)i + K (cid:2) ρ, π i exp( − λR ) (cid:3) + K (cid:2) ρ ′ , π j exp( − λ ′ R ) (cid:3) − log( η ) . We can then replace λ with β Nλ sinh( λN ) and use Theorem 2.1.12 (page 60) to get Proposition 2.2.6 . For any positive real constants γ , λ , λ ′ , β and β ′ , with P probability − η , γρ ⊗ ρ ′ ( m ′ ) .2. Playing with two posterior and two local prior distributions ≤ log h π i exp[ − β Nλ sinh( λN ) R ] ⊗ π j exp[ − β ′ Nλ ′ sinh( λ ′ N ) R ] (cid:8) exp (cid:2) γ Φ − γN ( M ′ ) (cid:3)(cid:9)i + K (cid:2) ρ, π i exp( − βr ) (cid:3) − βλ + C i (cid:2) β Nλ sinh( λN ) , λ (cid:3) − log( η ) λβ − K (cid:2) ρ ′ , π j exp( − β ′ r ) (cid:3) − β ′ λ ′ + C j (cid:2) β ′ Nλ ′ sinh( λ ′ N ) , λ ′ (cid:3) − log( η ) λβ ′ − − log( η ) . The last random factor in B ( ρ, ρ ′ ) that we need to upper bound islog n π i exp( − βr ) h exp (cid:8) β Nγ log (cid:2) cosh( γN ) (cid:3) π i exp( − βr ) ( m ′ ) (cid:9)io . A slight adaptation of Proposition 2.1.13 (page 61) shows that with P probabilityat least 1 − η ,log n π i exp( − βr ) h exp (cid:8) β Nγ log (cid:2) cosh( γN ) (cid:3) π i exp( − βr ) ( m ′ ) (cid:9)io ≤ βγ C i (cid:2) Nβγ sinh( γN ) , γ (cid:3) + (cid:0) − βγ (cid:1) log (cid:26)(cid:16) π i exp[ − Nβγ sinh( γN ) R ] (cid:17) ⊗ (cid:20) exp (cid:18) N log (cid:2) cosh( γN ) (cid:3) γβ − − log[cosh( γN )] γβ − ◦ M ′ (cid:19)(cid:21)(cid:27) + (cid:0) βγ (cid:1) log( η ) , where as usual Φ is the function defined by equation (1.1, page 2). 
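Throughout these propositions the sample size enters only through the scalar factors N sinh(γ/N), tanh(λ/N) and (N/λ) log[cosh(λ/N)]. For inverse temperatures small compared with N they behave like γ, λ/N and λ/(2N) respectively, and only inflate once λ or γ becomes comparable to N, which is what keeps the various ratios appearing in the bounds under control. A quick numerical look, nothing more than a sanity check of these approximations:

```python
import numpy as np

N = 1000.0
print(" lambda   N*sinh(l/N)   tanh(l/N)     l/N   (N/l)*log cosh(l/N)   l/(2N)")
for lam in (10.0, 100.0, 500.0, 1000.0, 2000.0):
    a = lam / N
    print(f"{lam:7.0f}  {N * np.sinh(a):12.4f}  {np.tanh(a):9.4f}  {a:6.3f}"
          f"  {(N / lam) * np.log(np.cosh(a)):20.5f}  {lam / (2 * N):7.4f}")
```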
This leads us todefine for any i, j ∈ N , any β, β ′ ∈ R + ,(2.19) C ( i, β ) def = 2 ζ − C i h Nζ sinh( ζβN ) , ζβ i + log (cid:26)(cid:0) π i exp[ − Nζ sinh( ζβN ) R ] (cid:1) ⊗ (cid:20) exp (cid:18) N log (cid:2) cosh( ζβN ) (cid:3) ζ − − log[cosh( ζβN )] ζ − ◦ M ′ (cid:19)(cid:21)(cid:27) − ζ + 1 ζ − (cid:26) (cid:2) ν ( β ) µ ( i ) (cid:3) + log (cid:0) η (cid:1)(cid:27) . Recall that the definition of C i ( λ, γ ) is to be found in Proposition 2.2.5, page 76.Let us remark that, sinceexp (cid:2) N a Φ − a ( p ) (cid:3) = exp n N log h (cid:2) exp( a ) − (cid:3) p io ≤ exp n N (cid:2) exp( a ) − (cid:3) p o , p ∈ (0 , , a ∈ R , we have C ( i, β ) ≤ ζ − (cid:26) π i exp[ − Nζ sinh( ζβN ) R ] (cid:20) exp n N sinh (cid:0) ζβ N (cid:1) π i exp[ − Nζ sinh( ζβN ) R ] (cid:0) M ′ (cid:1)o(cid:21)(cid:27) Chapter 2. Comparing posterior distributions to Gibbs priors + log (cid:26)(cid:16) π i exp[ − Nζ sinh( ζβN ) R ] (cid:17) ⊗ (cid:20) exp n N h exp (cid:8) ( ζ − − log (cid:2) cosh (cid:0) ζβN (cid:1)(cid:3)(cid:9) − i M ′ o(cid:21)(cid:27) − ζ + 1 ζ − (cid:26) (cid:2) ν ( β ) µ ( i ) (cid:3) + log (cid:0) η (cid:1)(cid:27) . Let us put S λ (cid:2) ( i, β ) , ( j, β ′ ) (cid:3) def = Nλ log (cid:2) cosh( λN ) (cid:3) inf γ ∈ R + γ − (cid:26) log (cid:20)(cid:16) π i exp[ − Nζ sinh( ζβN ) R ] ⊗ π j exp[ − Nζ sinh( ζβ ′ N ) R ] (cid:17)n exp (cid:2) γ Φ − γN ( M ′ ) (cid:3)o(cid:21) + C i (cid:2) Nζ sinh( ζβN ) , ζβ (cid:3) − log( η ) ζ − C j (cid:2) Nζ sinh( ζβ ′ N ) , ζβ ′ (cid:3) − log( η ) ζ − − log( η ) (cid:27) + 1 λ (cid:20) C ( i, β ) + C ( j, β ′ ) − ζ + 1 ζ − (cid:2) − ν ( λ ) ǫ (cid:3)(cid:21) , where η = ν ( γ ) ν ( β ) ν ( β ′ ) µ ( i ) µ ( j ) η. Let us remark that S λ (cid:2) ( i, β ) , ( j, β ′ ) (cid:3) ≤ inf γ ∈ R + λ N γ log (cid:20)(cid:16) π i exp[ − Nζ sinh( ζβN ) R ] ⊗ π j exp[ − Nζ sinh( ζβ ′ N ) R ] (cid:17)n exp h N (cid:2) exp (cid:0) γN (cid:1) − (cid:3) M ′ io(cid:21) + (cid:18) λ N γ ( ζ − 1) + 2 λ ( ζ − (cid:19) log (cid:26) π i exp[ − Nζ sinh( ζβN ) R ] (cid:20) exp n N sinh (cid:0) ζβ N (cid:1) π i exp[ − Nζ sinh( ζβN ) R ] (cid:0) M ′ (cid:1)o(cid:21)(cid:27) + λ − log (cid:26)(cid:16) π i exp[ − Nζ sinh( ζβN ) R ] (cid:17) ⊗ (cid:20) exp n N h exp (cid:8) ( ζ − − log (cid:2) cosh (cid:0) ζβN (cid:1)(cid:3)(cid:9) − i M ′ o(cid:21)(cid:27) + (cid:18) λ N γ ( ζ − 1) + 2 λ ( ζ − (cid:19) log (cid:26) π j exp[ − Nζ sinh( ζβ ′ N ) R ] (cid:20) exp n N sinh (cid:0) ζβ ′ N (cid:1) π j exp[ − Nζ sinh( ζβ ′ N ) R ] (cid:0) M ′ (cid:1)o(cid:21)(cid:27) + λ − log (cid:26)(cid:16) π j exp[ − Nζ sinh( ζβ ′ N ) R ] (cid:17) ⊗ (cid:20) exp n N h exp (cid:8) ( ζ − − log (cid:2) cosh (cid:0) ζβ ′ N (cid:1)(cid:3)(cid:9) − i M ′ o(cid:21)(cid:27) .2. Playing with two posterior and two local prior distributions − ( ζ + 1) λ N ( ζ − γ log (cid:2) − ν ( γ ) ν ( β ) ν ( β ′ ) µ ( i ) µ ( j ) η (cid:3) − ( ζ + 1)( ζ − λ (cid:18) (cid:2) − ν ( β ) ν ( β ′ ) µ ( i ) µ ( j ) η (cid:3) + log (cid:2) − ν ( λ ) ǫ (cid:3)(cid:19) . Let us define accordingly B (cid:2) ( i, β ) , ( j, β ′ ) (cid:3) def =inf λ Ξ λN ( inf α,γ,α ′ ,γ ′ (cid:20) π j exp( − α ′ R ) ⊗ π i exp( − αR ) (cid:2) Ψ − λN ( R ′ , M ′ ) (cid:3) − log (cid:0) e η (cid:1) λ + C j ( α ′ , γ ′ ) − log (cid:0) e η (cid:1) Nβ ′ α ′ sinh( γ ′ N ) − γ ′ + C i ( α, γ ) − log (cid:0)e η (cid:1) γ − Nβα sinh( γN ) (cid:21) + S λ (cid:2) ( i, β ) , ( j, β ′ ) (cid:3)) , where e η = ν ( λ ) ν ( α ) ν ( γ ) ν ( β ) ν ( α ′ ) ν ( γ ′ ) ν ( β ′ ) µ ( i ) µ ( j ) η. Proposition 2.2.7 . 
• With P probability at least − η , for any β ∈ R + and i ∈ N , C ( π i exp( − βr ) ) ≤ C ( i, β ) ; • With P probability at least − η , for any λ, β, β ′ ∈ R + , any i, j ∈ N , S λ (cid:2) ( i, β ) , ( j, β ′ ) (cid:3) ≤ S λ (cid:2) ( i, β ) , ( j, β ′ ) (cid:3) ; • With P probability at least − η , for any i, j ∈ N , any β, β ′ ∈ R + , B ( π i exp( − βr ) , π j exp( − β ′ r ) ) ≤ B (cid:2) ( i, β ) , ( j, β ′ ) (cid:3) . It is also interesting to find a non-random lower bound for C ( π i exp( − βr ) ). Let usstart from the fact that with P probability at least 1 − η , π i exp( − αR ) ⊗ π i exp( − αR ) (cid:2) Φ γ ′ N ( M ′ ) (cid:3) ≤ π i exp( − αR ) ⊗ π i exp( − αR ) ( m ′ ) − log( η ) γ ′ . On the other hand, we already proved that with P probability at least 1 − η ,0 ≤ (cid:18) − αN tanh( λN ) (cid:19) K (cid:2) ρ, π i exp( − αR ) (cid:3) ≤ αN tanh( λN ) (cid:26) λ (cid:2) ρ ( r ) − π i exp( αR ) ( r ) (cid:3) + N log (cid:2) cosh( λN ) (cid:3) ρ ⊗ π i exp( − αR ) ( m ′ ) − log( η ) (cid:27) + K (cid:0) ρ, π i (cid:1) − K (cid:0) π i exp( − αR ) , π i (cid:1) . Thus for any ξ > 0, putting β = αλN tanh( λN ) , with P probability at least 1 − η , ξπ i exp( − αR ) ⊗ π i exp( − αR ) (cid:2) Φ γ ′ N ( M ′ ) (cid:3) Chapter 2. Comparing posterior distributions to Gibbs priors ≤ π i exp( − αR ) (cid:26) log (cid:20) π i exp( − βr ) n exp h β Nλ log (cid:2) cosh( λN ) (cid:3) π i exp( − βr ) ( m ′ ) + ξm ′ io(cid:21)(cid:27) − (cid:18) βλ + ξγ ′ (cid:19) log (cid:18) η (cid:19) ≤ log (cid:26) π i exp( − βr ) (cid:20) exp n β Nλ log (cid:2) cosh( λN ) (cid:3) π i exp( − βr ) ( m ′ ) o × π i exp( − βr ) n exp h β Nλ log (cid:2) cosh( λN ) (cid:3) π i exp( − βr ) ( m ′ ) + ξm ′ io(cid:21)(cid:27) − (cid:18) βλ + ξγ ′ (cid:19) log (cid:18) η (cid:19) ≤ (cid:26) π i exp( − βr ) (cid:20) exp nh ξ + β Nλ log (cid:2) cosh( λN ) (cid:3)i π i exp( − βr ) ( m ′ ) o(cid:21)(cid:27) − (cid:18) βλ + ξγ ′ (cid:19) log (cid:18) η (cid:19) ≤ (cid:26) π i exp( − βr ) (cid:20) exp nh ξ + βλ N i π i exp( − βr ) ( m ′ ) o(cid:21)(cid:27) − (cid:18) βλ + ξγ ′ (cid:19) log (cid:18) η (cid:19) . Taking ξ = βλ N , we get with P probability at least 1 − ηβλ N (cid:16) π i exp[ − β Nλ tanh( λN ) R ] (cid:17) ⊗ h Φ γ ′ N (cid:0) M ′ (cid:1)i ≤ log (cid:26) π i exp( − βr ) (cid:20) exp n βλN π i exp( − βr ) ( m ′ ) o(cid:21)(cid:27) − (cid:18) βλ + βλ N γ ′ (cid:19) log (cid:18) η (cid:19) . Putting λ = N γ log (cid:2) cosh( γN ) (cid:3) and Υ( γ ) def = γ tanh (cid:8) Nγ log (cid:2) cosh( γN ) (cid:3)(cid:9) N log (cid:2) cosh( γN ) (cid:3) ∼ γ → , this can be rewritten as βN γ log (cid:2) cosh( γN ) (cid:3)(cid:16) π i exp( − β Υ( γ ) R ) (cid:17) ⊗ h Φ γ ′ N (cid:0) M ′ (cid:1)i ≤ log (cid:26) π i exp( − βr ) (cid:20) exp n β Nγ log (cid:2) cosh( γN ) (cid:3) π i exp( − βr ) ( m ′ ) o(cid:21)(cid:27) − (cid:18) βγN log (cid:2) cosh( γN ) (cid:3) + βN log (cid:2) cosh( γN ) (cid:3) γγ ′ (cid:19) log (cid:18) η (cid:19) . It is now tempting to simplify the picture a little bit by setting γ ′ = γ , leading to .2. Playing with two posterior and two local prior distributions Proposition 2.2.8 . With P probability at least − η , for any i ∈ N , any β ∈ R + , C (cid:2) π i exp( − βr ) (cid:3) ≥ C ( i, β ) def = 1 ζ − ( N (cid:2) cosh( ζβN ) (cid:3)(cid:16) π i exp( − β Υ( ζβ ) R ) (cid:17) ⊗ h Φ ζβN (cid:0) M ′ (cid:1)i + ζ β N log (cid:2) cosh( ζβN ) (cid:3) + N log (cid:2) cosh( ζβN ) (cid:3) ζβ ! 
log (cid:2) − ν ( β ) µ ( i ) η (cid:3) − ( ζ + 1) n log (cid:2) ν ( β ) µ ( i ) (cid:3) + 2 − log (cid:0) − ǫ (cid:1)o) , where C (cid:2) π i exp( − βr ) (cid:3) is defined by equation (2.18, page 75). We are now going to analyse Theorem 2.2.4 (page 73). For this, we will also needan upper bound for S λ ( ρ, ρ ′ ), defined by equation (2.13, page 71), using M ′ andempirical complexities, because of the special relations between empirical complex-ities induced by the selection algorithm. To this purpose, a useful alternative toProposition 2.2.6 (page 76) is to write, with P probability at least 1 − η , γρ ⊗ ρ ′ ( m ′ ) ≤ γρ ⊗ ρ ′ (cid:2) Φ − γN (cid:0) M ′ (cid:1)(cid:3) + K (cid:2) ρ, π i exp( − λR ) (cid:3) + K (cid:2) ρ ′ , π j exp( − λ ′ R ) (cid:3) − log( η ) , and thus at least with P probability 1 − η , γρ ⊗ ρ ′ ( m ′ ) ≤ γρ ⊗ ρ ′ (cid:2) Φ − γN (cid:0) M ′ (cid:1)(cid:3) + (1 − ζ − ) − (cid:26) K (cid:2) ρ, π i exp( − βr ) (cid:3) + log n π i exp( − βr ) h exp (cid:8) Nζ log (cid:2) cosh (cid:0) ζβN (cid:1)(cid:3) ρ ( m ′ ) (cid:9)io − ζ − log( η ) (cid:27) + (1 − ζ − ) − (cid:26) K (cid:2) ρ, π j exp( − β ′ r ) (cid:3) + log n π j exp( − β ′ r ) h exp (cid:8) Nζ log (cid:2) cosh (cid:0) ζβ ′ N (cid:1)(cid:3) ρ ( m ′ ) (cid:9)io − ζ − log( η ) (cid:27) − log( η ) . When ρ = π i exp( − βr ) and ρ ′ = π j exp( − β ′ r ) , we get with P probability at least 1 − η ,for any β , β ′ , γ ∈ R + , any i , j ∈ N , γρ ⊗ ρ ′ ( m ′ ) ≤ γρ ⊗ ρ ′ (cid:2) Φ − γN (cid:2)(cid:0) M ′ (cid:1)(cid:3) + C ( ρ ) + C ( ρ ′ ) − ζ + 1 ζ − (cid:20) log (cid:2) − ν ( γ ) η (cid:3)(cid:21) . Proposition 2.2.9 . With P probability at least − η , for any ρ = π i exp( − βr ) , any ρ ′ = π j exp( − β ′ r ) ∈ P , Chapter 2. Comparing posterior distributions to Gibbs priors S λ ( ρ, ρ ′ ) ≤ Nλ log (cid:2) cosh( λN ) (cid:3) ρ ⊗ ρ ′ (cid:2) Φ − γN (cid:0) M ′ (cid:1)(cid:3) + 1 + Nγ log (cid:2) cosh( λN ) (cid:3) λ (cid:2) C ( ρ ) + C ( ρ ′ ) (cid:3) − ( ζ + 1)( ζ − λ (cid:26) log (cid:2) − ν ( λ ) ǫ (cid:3) + Nγ log (cid:2) cosh (cid:0) λN (cid:1)(cid:3) log (cid:2) − ν ( γ ) η (cid:3)(cid:27) . In order to analyse Theorem 2.2.4 (page 73), we need to index P = (cid:8) ρ , . . . , ρ M (cid:9) in order of increasing empirical complexity C ( ρ ). To deal in a convenient way withthis indexation, we will write C ( i, β ) as C (cid:2) π i exp( − βr ) (cid:3) , C ( i, β ) as C (cid:2) π i exp( − βr ) (cid:3) , and S (cid:2) ( i, β ) , ( j, β ′ ) (cid:3) as S (cid:2) π i exp( − βr ) , π j exp( − β ′ r ) (cid:3) .With P probability at least 1 − ǫ , when b t ≤ j < b k , as we already saw, ρ b k ( R ) ≤ ρ i ( R ) ≤ ρ j ( R ) + inf λ ∈ R + Ξ λN (cid:2) S λ ( ρ j , ρ i ) (cid:3) , where i = t ( j ) < b t . Therefore, with P probability at least 1 − ǫ − η , ρ i ( R ) ≤ ρ j ( R ) + inf λ ∈ R + Ξ λN ( Nλ log (cid:2) cosh (cid:0) λN (cid:1)(cid:3) ρ j ⊗ ρ i (cid:2) Φ − γN (cid:0) M ′ (cid:1)(cid:3) + 4 1 + Nγ log (cid:2) cosh (cid:0) λN (cid:1)(cid:3) λ C ( ρ j ) − ( ζ + 1)( ζ − λ (cid:26) log (cid:2) − ν ( λ ) ǫ (cid:3) + Nγ log (cid:2) cosh (cid:0) λN (cid:1)(cid:3) log (cid:2) − ν ( γ ) η (cid:3)(cid:27)) . We can now remark thatΞ a ( p + q ) ≤ Ξ a ( p ) + q Ξ ′ a ( p ) q ≤ Ξ a ( p ) + Ξ ′ a (0) q = Ξ a ( p ) + a tanh( a ) q and that Φ − a ( p + q ) ≤ Φ − a ( p ) + Φ ′− a (0) q = Φ − a ( p ) + exp( a ) − a q. Moreover, assuming as usual without substantial loss of generality that there exists e θ ∈ arg min Θ R , we can split M ′ ( θ, θ ′ ) ≤ M ′ ( θ, e θ ) + M ′ ( e θ, θ ′ ). 
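The two perturbation inequalities just used, Ξ_a(p + q) ≤ Ξ_a(p) + [a/tanh(a)] q and Φ_{−a}(p + q) ≤ Φ_{−a}(p) + [(e^a − 1)/a] q for q ≥ 0, are plain concavity bounds: the derivative at 0 dominates the increment. The check below takes Ξ_a(q) = [1 − exp(−aq)]/tanh(a), as in equation (2.11), and assumes Φ_{−a}(p) = a^{−1} log[1 + (e^a − 1)p], the form consistent with the identities it satisfies above (equation (1.1) itself is not reproduced here, so this expression is an assumption on our part):

```python
import numpy as np

def xi(a, q):
    """Xi_a(q) = [1 - exp(-a q)] / tanh(a), equation (2.11)."""
    return (1.0 - np.exp(-a * q)) / np.tanh(a)

def phi_neg(a, p):
    """Assumed form of Phi_{-a}(p), consistent with its use in the text."""
    return np.log1p((np.exp(a) - 1.0) * p) / a

rng = np.random.default_rng(0)
a = 0.3
for _ in range(1000):
    p, q = rng.uniform(0.0, 0.5, size=2)
    assert xi(a, p + q) <= xi(a, p) + (a / np.tanh(a)) * q + 1e-12
    assert phi_neg(a, p + q) <= phi_neg(a, p) + (np.exp(a) - 1.0) / a * q + 1e-12
print("both first-order perturbation bounds hold on the sampled points")
```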
Let us then considerthe expected margin function defined by ϕ ( y ) = sup θ ∈ Θ M ′ ( θ, e θ ) − yR ′ ( θ, e θ ) , y ∈ R + , and let us write for any y ∈ R + , ρ j ⊗ ρ i (cid:2) Φ − λN (cid:0) M ′ (cid:1)(cid:3) ≤ ρ j ⊗ ρ i (cid:8) Φ − γN (cid:2) M ′ ( ., e θ ) + yR ′ ( ., e θ ) + ϕ ( y ) (cid:3)(cid:9) ≤ ρ j (cid:8) Φ − λN (cid:2) M ′ ( ., e θ ) + ϕ ( y ) (cid:3)(cid:9) + N y (cid:2) exp( γN ) − (cid:3) γ (cid:2) ρ i ( R ) − R ( e θ ) (cid:3) and − yN (cid:2) exp( γN ) − (cid:3) log (cid:2) cosh (cid:0) λN (cid:1)(cid:3) γ tanh (cid:0) λN (cid:1) !(cid:2) ρ i ( R ) − R ( e θ ) (cid:3) .2. Playing with two posterior and two local prior distributions ≤ (cid:2) ρ j ( R ) − R ( e θ ) (cid:3) + Ξ λN ( Nλ log (cid:2) cosh (cid:0) λN (cid:1)(cid:3) ρ j (cid:8) Φ − γN (cid:2) M ′ ( ., e θ ) + ϕ ( y ) (cid:3)(cid:9) + 4 1 + Nγ log (cid:2) cosh (cid:0) λN (cid:1)(cid:3) λ C ( ρ j ) − ζ + 1)( ζ − λ (cid:26) log (cid:2) − ν ( λ ) ǫ (cid:3) + Nγ log (cid:2) cosh (cid:0) λN (cid:1)(cid:3) log (cid:2) − ν ( γ ) η (cid:3)(cid:27)) . With P probability at least 1 − ǫ − η , for any λ , γ , x , y ∈ R + , any j ∈ (cid:8)b t, . . . , b k − (cid:9) , ρ b k ( R ) − R ( e θ ) ≤ ρ i ( R ) − R ( e θ ) ≤ − yN (cid:2) exp( γN ) − (cid:3) log (cid:2) cosh (cid:0) λN (cid:1)(cid:3) γ tanh (cid:0) λN (cid:1) ! − ( xN (cid:2) exp( γN ) − (cid:3) log (cid:2) cosh (cid:0) λN (cid:1)(cid:3) γ tanh (cid:0) λN (cid:1) !(cid:2) ρ j ( R ) − R ( e θ ) (cid:3) + Ξ λN (cid:26) Nλ log (cid:2) cosh (cid:0) λN (cid:1)(cid:3) Φ − γN (cid:2) ϕ ( x ) + ϕ ( y ) (cid:3) + 4 1 + Nγ log (cid:2) cosh (cid:0) λN (cid:1)(cid:3) λ C ( ρ j ) − ζ + 1)( ζ − λ n log (cid:2) − ν ( λ ) ǫ (cid:3) + Nγ log (cid:2) cosh (cid:0) λN (cid:1)(cid:3) log (cid:2) − ν ( γ ) η (cid:3)o(cid:27)) . Now we have to get an upper bound for ρ j ( R ). We can write ρ j = π ℓ exp( − β ′ r ) , as weassumed that all the posterior distributions in P are of this special form. Moreover,we already know from Theorem 2.1.8 (page 58) that with P probability at least1 − η , (cid:2) N sinh (cid:0) β ′ N (cid:1) − β ′ ζ − (cid:3)(cid:2) π ℓ exp( − β ′ r ) ( R ) − π ℓ exp( − β ′ ζ − R ) ( R ) (cid:3) ≤ C ℓ ( β ′ ζ − , β ′ ) − log (cid:2) ν ( β ′ ) µ ( ℓ ) η (cid:3) . This proves that with P probability at least 1 − ǫ − η , ρ b k ( R ) ≤ R ( e θ )+ (cid:18) − yN (cid:2) exp (cid:0) γN (cid:1) − (cid:3) log (cid:2) cosh (cid:0) λN (cid:1)(cid:3) γ tanh (cid:0) λN (cid:1) (cid:19) − ((cid:18) xN (cid:2) exp (cid:0) γN (cid:1) − (cid:3) log (cid:2) cosh (cid:0) λN (cid:1)(cid:3) γ tanh (cid:0) λN (cid:1) (cid:19) × π ℓ exp( − ζ − β ′ R ) ( R ) − R ( e θ ) + C ℓ ( ζ − β ′ , β ′ ) − log (cid:2) ν ( β ′ ) µ ( ℓ ) η (cid:3) N sinh( β ′ N ) − ζ − β ′ ! + Ξ λN (cid:26) Nλ log (cid:2) cosh (cid:0) λN (cid:1)(cid:3) Φ − γN (cid:2) ϕ ( x ) + ϕ ( y ) (cid:3) + 4 1 + Nγ log (cid:2) cosh (cid:0) λN (cid:1)(cid:3) λ C ( ℓ, β ′ )4 Chapter 2. Comparing posterior distributions to Gibbs priors − ζ + 1)( ζ − λ n log (cid:2) − ν ( λ ) ǫ (cid:3) + Nγ log (cid:2) cosh (cid:0) λN (cid:1)(cid:3) log (cid:2) − ν ( γ ) η (cid:3)o(cid:27)) . The case when j ∈ (cid:8)b k + 1 , . . . , M (cid:9) \ (arg max t ) is dealt with exactly in the sameway, with i = t ( j ) replaced directly with b k itself, leading to the same inequality.The case when j ∈ (arg max t ) is dealt with bounding first ρ b k ( R ) − R ( e θ ) in termsof ρ b t ( R ) − R ( e θ ), and this latter in terms of ρ j ( R ) − R ( e θ ). 
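The expected margin function ϕ(y) = sup_{θ∈Θ} [M′(θ, θ̃) − y R′(θ, θ̃)] used in this computation is a Legendre-type transform tying the variance term M′ to the excess risk R′; on a finite parameter set it is a one-line computation. A sketch, with hypothetical arrays standing in for M′(·, θ̃) and R′(·, θ̃):

```python
import numpy as np

def expected_margin(Mp, Rp, y):
    """phi(y) = sup over theta of M'(theta, theta_tilde) - y * R'(theta, theta_tilde)."""
    return float(np.max(Mp - y * Rp))

# Hypothetical values of M'(., theta_tilde) and R'(., theta_tilde) on a finite grid.
Mp = np.array([0.00, 0.04, 0.10, 0.18, 0.30])
Rp = np.array([0.00, 0.01, 0.04, 0.09, 0.20])
for y in (0.5, 1.0, 2.0, 5.0):
    print(y, expected_margin(Mp, Rp, y))
```

Under the margin assumption R′ ≥ c (M′)^κ invoked below, ϕ(y) is bounded by (1 − κ^{−1})(κcy)^{−1/(κ−1)}, which is the form exploited in the rate computation.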
Let us put A ( λ, γ ) = − xN (cid:2) exp (cid:0) γN (cid:1) − (cid:3) log (cid:2) cosh (cid:0) λN (cid:1)(cid:3) γ tanh (cid:0) λN (cid:1) ! ,B ( λ, γ ) = 1 + 2 yN (cid:2) exp (cid:0) γN (cid:1) − (cid:3) log (cid:2) cosh (cid:0) λN (cid:1)(cid:3) γ tanh (cid:0) λN (cid:1) ,D ( λ, γ, ρ j ) = Ξ λN (cid:26) Nλ log (cid:2) cosh (cid:0) λN (cid:1)(cid:3) Φ − γN (cid:2) ϕ ( x ) + ϕ ( y ) (cid:3) +4 1 + Nγ log (cid:2) cosh (cid:0) λN (cid:1)(cid:3) λ C ( ρ j ) − ζ + 1)( ζ − λ n log (cid:2) − ν ( λ ) ǫ (cid:3) + Nγ log (cid:2) cosh (cid:0) λN (cid:1)(cid:3) log (cid:2) − ν ( γ ) η (cid:3)o(cid:27) , (2.20)where C ( ρ j ) = C ( ℓ, β ′ ) is defined, when ρ j = π ℓ exp( − β ′ r ) , by equation (2.19, page77). We obtain, still with P probability 1 − ǫ − η , ρ b k ( R ) − R ( e θ ) ≤ B ( λ, γ ) A ( λ, γ ) (cid:2) ρ b t ( R ) − R ( e θ ) (cid:3) + D ( λ, γ, ρ j ) A ( λ, γ ) ,ρ b t ( R ) − R ( e θ ) ≤ B ( λ, γ ) A ( λ, γ ) (cid:2) ρ j ( R ) − R ( e θ ) (cid:3) + D ( λ, γ, ρ j ) A ( λ, γ ) . The use of the factor D ( λ, γ, ρ j ) in the first of these two inequalities, instead of D ( λ, γ, ρ b t ), is justified by the fact that C ( ρ b t ) ≤ C ( ρ j ). Combining the two we get ρ b k ( R ) ≤ R ( e θ ) + B ( λ, γ ) A ( λ, γ ) (cid:2) ρ j ( R ) − R ( e θ ) (cid:3) + (cid:20) B ( λ, γ ) A ( λ, γ ) + 1 (cid:21) D ( λ, γ, ρ j ) A ( λ, γ ) . Since it is the worst bound of all cases, it holds for any value of j , proving Theorem 2.2.10 . With P probability at least − ǫ − η , ρ b k ( R ) ≤ R ( e θ ) + inf i,β,λ,γ,x,y ( B ( λ, γ ) A ( λ, γ ) h π i exp( − βr ) ( R ) − R ( e θ ) i + (cid:20) B ( λ, γ ) A ( λ, γ ) + 1 (cid:21) D ( λ, γ, π i exp( − βr ) ) A ( λ, γ ) ) .2. Playing with two posterior and two local prior distributions ≤ R ( e θ ) + inf i,β,λ,γ,x,y ( B ( λ, γ ) A ( λ, γ ) π i exp( − ζ − βR ) ( R ) − R ( e θ ) + C i ( ζ − β, β ) − log (cid:2) ν ( β ) µ ( i ) η (cid:3) N sinh (cid:0) βN (cid:1) − ζ − β ! + (cid:20) B ( λ, γ ) A ( λ, γ ) + 1 (cid:21) D ( λ, γ, π i exp( − βr ) ) A ( λ, γ ) ) , where the notation A ( λ, γ ) , B ( λ, γ ) and D ( λ, γ, ρ ) is defined by equation (2.20 page84) and where the notation C i ( β, γ ) is defined in Proposition 2.2.5 (page 76). The bound is a little involved, but as we will prove next, it gives the same rateas Theorem 2.1.15 (page 66) and its corollaries, when we work with a single model(meaning that the support of µ is reduced to one point) and the goal is to chooseadaptively the temperature of the Gibbs posterior, except for the appearance of theunion bound factor − log (cid:2) ν ( β ) (cid:3) which can be made of order log (cid:2) log( N ) (cid:3) withoutspoiling the order of magnitude of the bound.We will encompass the case when one must choose between possibly severalparametric models. Let us assume that each π i is supported by some measurableparameter subset Θ i ( meaning that π i (Θ i ) = 1), let us also assume that thebehaviour of π i is parametric in the sense that there exists a dimension d i ∈ R + such that(2.21) sup β ∈ R + β (cid:2) π i exp( − βR ) ( R ) − inf Θ i R (cid:3) ≤ d i . 
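The parametric complexity assumption (2.21) has a concrete reading: β[π_i exp(−βR)(R) − inf_{Θ_i} R] is an effective dimension of the sub-model, and on a finite sub-model it can be computed directly. A sketch with hypothetical risk values (R is of course not observed in practice; the point is only to illustrate what d_i measures):

```python
import numpy as np

def gibbs_mean_risk(prior, risk, beta):
    """pi_exp(-beta R)(R) on a finite sub-model."""
    w = prior * np.exp(-beta * (risk - risk.min()))
    w /= w.sum()
    return w @ risk

def effective_dimension(prior, risk, betas):
    """sup over beta of beta * [pi_exp(-beta R)(R) - inf R], cf. equation (2.21)."""
    return max(beta * (gibbs_mean_risk(prior, risk, beta) - risk.min()) for beta in betas)

risk = np.array([0.20, 0.21, 0.24, 0.30, 0.40])   # hypothetical values of R on a small sub-model
prior = np.full(risk.size, 1.0 / risk.size)
print(effective_dimension(prior, risk, betas=np.logspace(0, 4, 60)))
```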
Then C i ( λ, γ ) ≤ log (cid:26) π i exp( − λR ) (cid:20) exp n N sinh (cid:0) γ N (cid:1) M ′ ( ., e θ ) o(cid:21)(cid:27) + 2 N sinh (cid:0) γ N (cid:1) π i exp( − λR ) (cid:2) M ′ ( ., e θ ) (cid:3) ≤ log (cid:26) π i exp( − λR ) (cid:20) exp 2 xN sinh (cid:0) γ N (cid:1) (cid:2) R − R ( e θ ) (cid:3)o(cid:21)(cid:27) + 2 xN sinh (cid:0) γ N (cid:1) π i exp( − λR ) (cid:2) R − R ( e θ ) (cid:3) + 4 N sinh (cid:0) γ N (cid:1) ϕ ( x ) ≤ xN sinh (cid:0) γ N (cid:1) π i exp {− [ λ − xN sinh( γ N ) ] R } (cid:2) R − R ( e θ ) (cid:3) + 2 xN sinh (cid:0) γ N (cid:1) π i exp( − λR ) (cid:2) R − R ( e θ ) (cid:3) + 4 N sinh (cid:0) γ N (cid:1) ϕ ( x ) . Thus C i ( λ, γ ) ≤ N sinh (cid:0) γ N (cid:1) x (cid:2) inf Θ i R − R ( e θ ) (cid:3) + ϕ ( x )+ xd i λ + xd i λ − xN sinh (cid:0) γ N (cid:1) ! . In the same way,6 Chapter 2. Comparing posterior distributions to Gibbs priors C ( i, β ) ≤ Nζ − sinh (cid:0) ζβ N (cid:1) " x (cid:2) inf Θ i R − R ( e θ ) (cid:3) + ϕ ( x )+ ζxd i N sinh (cid:0) ζβN (cid:1) (cid:18) − xζ tanh (cid:0) ζβ N (cid:1) (cid:19) + 2 N h exp (cid:16) ζ β N ( ζ − (cid:17) − i ϕ ( x ) + x (cid:2) inf Θ i R − R ( e θ ) (cid:3) + xζd i N sinh (cid:0) ζβN (cid:1) − xζN (cid:2) exp (cid:0) ζ β N ( ζ − (cid:1) − (cid:3) ! − ( ζ + 1)( ζ − (cid:20) (cid:2) ν ( β ) µ ( i ) (cid:3) + log (cid:0) η (cid:1)(cid:21) . In order to keep the right order of magnitude while simplifying the bound, let usconsider(2.22) C = max (cid:26) ζ − , (cid:16) Nζβ max (cid:17) sinh (cid:16) ζβ max N (cid:17) , N ( ζ − ζ β h exp (cid:16) ζ β N ( ζ − (cid:17) − i(cid:27) . Then, for any β ∈ (0 , β max ), C ( i, β ) ≤ inf y ∈ R + C ζ β ( ζ − N " y (cid:2) inf Θ i R − R ( e θ ) (cid:3) + ϕ ( y ) + yd i β (cid:2) − yC ζ β ζ − N (cid:3) − ( ζ + 1)( ζ − (cid:20) (cid:2) ν ( β ) µ ( i ) (cid:3) + log (cid:0) η (cid:1)(cid:21) . Thus D (cid:2) λ, γ, π i exp( − βr ) (cid:3) ≤ λN tanh (cid:0) λN (cid:1) ( λ (cid:2) exp (cid:0) γN (cid:1) − (cid:3) γ (cid:2) ϕ ( x ) + ϕ ( y ) (cid:3) + 4 1 + λ Nγ λ " C ζ β ( ζ − N z (cid:2) inf Θ i R − R ( e θ ) (cid:3) + ϕ ( z ) + zd i β (cid:2) − zC ζ β ζ − N (cid:3) ! − ( ζ + 1)( ζ − (cid:20) (cid:2) ν ( β ) µ ( i ) (cid:3) + log (cid:0) η (cid:1)(cid:21) − ζ + 1)( ζ − λ (cid:20) log (cid:2) − ν ( λ ) ǫ (cid:3) + λ N γ log (cid:2) − ν ( γ ) η (cid:3)(cid:21)) If we are not seeking tight constants, we can take for the sake of simplicity λ = γ = β , x = y and ζ = 2.Let us put(2.23) C = max (cid:26) C , N (cid:2) exp (cid:0) β max N (cid:1) − (cid:3) β max , N log (cid:2) cosh (cid:0) β max N (cid:1)(cid:3) β max tanh (cid:0) β max N (cid:1) , β max N tanh (cid:0) β max N (cid:1) (cid:27) , .2. Playing with two posterior and two local prior distributions A ( β, β ) − ≤ (cid:18) − C xβN (cid:19) − ,B ( β, β ) ≤ C xβN ,D (cid:2) β, β, π i exp( − βr ) (cid:3) ≤ C βN ϕ ( x )+ (cid:16) βN (cid:17) C β " C β N z (cid:2) inf Θ i R − R ( e θ ) (cid:3) + ϕ ( z ) + zd i β (cid:2) − zC βN (cid:3) ! − (cid:2) ν ( β ) µ ( i ) (cid:3) − (cid:0) η (cid:1) − C β (cid:20) log (cid:2) − ν ( β ) ǫ (cid:3) + β N log (cid:2) − ν ( β ) η (cid:3)(cid:21) and C i ( ζ − β, β ) ≤ C β N (cid:18) x (cid:2) inf Θ i R − R ( e θ ) (cid:3) + ϕ ( x ) + 2 xd i β (cid:2) − xβN (cid:3) (cid:19) . This leads to ρ b k ( R ) ≤ R ( e θ ) + inf i,β C xβN − C xβN ! 
( d i β + inf Θ i R − R ( e θ )+ 2 β " C β N (cid:18) x (cid:2) inf Θ i R − R ( e θ (cid:3) + ϕ ( x ) + 2 xd i β (cid:0) − xβN (cid:1) (cid:19) − log (cid:2) ν ( β ) µ ( i ) η (cid:3) + 2 (cid:16) − C xβN (cid:17) ( C βN ϕ ( x )+ (cid:16) βN (cid:17) C β (cid:20) C β N (cid:18) x (cid:2) inf Θ i R − R ( e θ ) (cid:3) + ϕ ( x ) + xd i β (cid:2) − xC βN (cid:3) (cid:19) − (cid:2) ν ( β ) µ ( i ) (cid:3) − (cid:0) η (cid:1)(cid:21) − C β (cid:20) log (cid:2) − ν ( β ) ǫ (cid:3) + β N log (cid:2) − ν ( β ) η (cid:3)(cid:21)) . We see in this expression that, in order to balance the various factors dependingon x it is advisable to choose x such thatinf Θ i R − R ( e θ ) = ϕ ( x ) x , as long as x ≤ N C β .8 Chapter 2. Comparing posterior distributions to Gibbs priors Following Mammen and Tsybakov, let us assume that the usual margin assump-tion holds: for some real constants c > κ ≥ R ( θ ) − R ( e θ ) ≥ c (cid:2) D ( θ, e θ ) (cid:3) κ . As D ( θ, e θ ) ≥ M ′ ( θ, e θ ), this also implies the weaker assumption R ( θ ) − R ( e θ ) ≥ c (cid:2) M ′ ( θ, e θ ) (cid:3) κ , θ ∈ Θ , which we will really need and use. Let us take β max = N and ν = 1 ⌈ log ( N ) ⌉ ⌈ log ( N ) ⌉ X k =1 δ k . Then, as we have already seen, ϕ ( x ) ≤ (1 − κ − ) (cid:0) κcx (cid:1) − κ − . Thus ϕ ( x ) /x ≤ bx − κκ − ,where b = (1 − κ − ) (cid:0) κc (cid:1) − κ − . Let us choose accordingly x = min (cid:26) x = (cid:18) inf Θ i R − R ( e θ ) b (cid:19) − κ − κ , x = N C β (cid:27) . Using the fact that when r ∈ (0 , ), (cid:0) r − r (cid:1) ≤ r ≤ 9, we get with P probabilityat least 1 − ǫ , for any β ∈ supp ν , in the case when x = x ≤ x , ρ b k ( R ) ≤ inf Θ i R + 538 C βN b κ − κ (cid:2) inf Θ i R − R ( e θ ) (cid:3) κ + C β (cid:20) d i + 166 log (cid:2) ( N ) (cid:3) − 134 log (cid:2) µ ( i ) (cid:3) − 102 log( ǫ ) + 724 (cid:21) , and in the case when x = x ≤ x , ρ b k ( R ) ≤ inf Θ i R + 68 C (cid:2) inf Θ i R − R ( e θ ) (cid:3) + 269 C βN ϕ ( x )+ C β (cid:20) d i + 166 log (cid:2) ( N ) (cid:3) − 134 log (cid:2) µ ( i ) (cid:3) − 102 log( ǫ ) + 724 (cid:21) ≤ inf Θ i R + 541 C βN ϕ ( x )+ C β (cid:20) d i + 166 log (cid:2) ( N ) (cid:3) − 134 log (cid:2) µ ( i ) (cid:3) − 102 log( ǫ ) + 724 (cid:21) . Thus with P probability at least 1 − ǫ , ρ b k ( R ) ≤ inf Θ i R + inf β ∈ (1 ,N ) C βN max (cid:26) b κ − κ (cid:2) inf Θ i R − R ( e θ ) (cid:3) κ ,b (cid:18) C βN (cid:19) κ − (cid:27) + C β (cid:20) d i + 166 log (cid:2) ( N ) (cid:3) − 134 log (cid:2) µ ( i ) (cid:3) − 102 log( ǫ ) + 724 (cid:21) . .3. Two step localization Theorem 2.2.11 . With probability at least − ǫ , for any i ∈ N , ρ b k ( R ) ≤ inf Θ i R + max C vuut b κ − κ (cid:2) inf Θ i R − R ( e θ ) (cid:3) κ n d i + log (cid:16) ( N ) ǫµ ( i ) (cid:17) + 5 o N , C (cid:2) b (cid:3) κ − κ − κ − C h d i + log (cid:16) ( N ) ǫµ ( i ) (cid:17) + 5 i N κ κ − , where C , given by equation (2.23 page 86), will in most cases be close to , and inany case less than . . This result gives a bound of the same form as that given in Theorem 2.1.15 (page66) in the special case when there is only one model — that is when µ is a Diracmass, for instance µ (1) = 1, implying that R ( e θ ) − R ( e θ ) = 0. Morover the parametriccomplexity assumption we made for this theorem, given by equation (2.21 page 85),is weaker than the one used in Theorem 2.1.15 and described by equation (2.8, page63). 
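Two quantities drive the statement of Theorem 2.2.11: the balancing point x₁ = {[inf_{Θ_i} R − R(θ̃)]/b}^{−(κ−1)/κ}, obtained by equating inf_{Θ_i} R − R(θ̃) with ϕ(x)/x under the margin bound ϕ(x) ≤ b x^{−1/(κ−1)}, and the exponent κ/(2κ−1) of the sample size, which interpolates between 1 (the κ = 1 case) and 1/2 (no margin assumption). A small sketch, restricted to κ > 1 and with an illustrative value of c:

```python
def margin_rate(kappa, c, delta):
    """Balancing point x1 and rate exponent under the margin assumption.

    b is the constant in phi(x) <= b * x**(-1/(kappa-1)); x1 solves
    delta = phi(x)/x with delta = inf over Theta_i of R minus R(theta_tilde);
    the last value returned is the exponent kappa/(2*kappa - 1).
    """
    b = (1.0 - 1.0 / kappa) * (kappa * c) ** (-1.0 / (kappa - 1.0))
    x1 = (delta / b) ** (-(kappa - 1.0) / kappa)
    return b, x1, kappa / (2.0 * kappa - 1.0)

for kappa in (1.5, 2.0, 4.0):
    print(kappa, margin_rate(kappa, c=1.0, delta=0.05))
```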
When there is more than one model, the bound shows that the estimator makesa trade-off between model accuracy, represented by inf Θ i R − R ( e θ ), and dimension,represented by d i , and that for optimal parametric sub-models, meaning those forwhich inf Θ i R = inf Θ R , the estimator does at least as well as the minimax optimalconvergence speed in the best of these.Another point is that we obtain more explicit constants than in Theorem 2.1.15.It is also clear that a more careful choice of parameters could have brought someimprovement in the value of these constants.These results show that the selection scheme described in this section is a goodcandidate to perform temperature selection of a Gibbs posterior distribution builtwithin a single parametric model in a rate optimal way, as well as a proposal withproven performance bound for model selection. Let us reconsider the case where we want to choose adaptively among a family ofparametric models. Let us thus assume that the parameter set is a disjoint unionof measurable sub-models, so that we can write Θ = ⊔ m ∈ M Θ m , where M is somemeasurable index set. Let us choose some prior probability distribution on theindex set µ ∈ M ( M ), and some regular conditional prior distribution π : M → M (Θ), such that π ( i, Θ i ) = 1, i ∈ M . Let us then study some arbitrary posteriordistributions ν : Ω → M ( M ) and ρ : Ω × M : → M (Θ), such that ρ ( ω, i, Θ i ) = 1, ω ∈ Ω, i ∈ M . We would like to compare νρ ( R ) with some doubly localized priordistribution µ exp[ − β ζ π exp( − βR ) ( R )] (cid:2) π exp( − βR ) (cid:3) ( R ) (where ζ is a positive parameterto be set as needed later on). To ease notation we will define two prior distributions0 Chapter 2. Comparing posterior distributions to Gibbs priors (one being more precisely a conditional distribution) depending on the positive realparameters β and ζ , putting(2.24) π = π exp( − βR ) and µ = µ exp[ − β ζ π ( R )] . Similarly to Theorem 1.4.3 on page 37 we can write for any positive real constants β and γ P (cid:26) ( µ π ) ⊗ ( µ π ) (cid:20) exp h − N log (cid:2) − tanh( γN ) R ′ (cid:3) − γr ′ − N log (cid:2) cosh( γN ) (cid:3) m ′ i(cid:21)(cid:27) ≤ , and deduce, using Lemma 1.1.3 on page 4, that(2.25) P (cid:26) exp (cid:20) sup ν ∈ M ( M ) sup ρ : M → M (Θ) n − N log (cid:2) − tanh( γN )( νρ − µ π )( R ) (cid:3) − γ ( νρ − µ π )( r ) − N log (cid:2) cosh( γN ) (cid:3) ( νρ ) ⊗ ( µ π )( m ′ ) − K ( ν, µ ) − ν (cid:2) K ( ρ, π ) (cid:3)o(cid:21)(cid:27) ≤ . This will be our starting point in comparing νρ ( R ) with µ π ( R ). However, obtainingan empirical bound will require some supplementary efforts. For each index of themodel index set M , we can write in the same way P (cid:26) π ⊗ π (cid:20) exp h − N log (cid:2) − tanh( γN ) R ′ (cid:3) − γr ′ − N log (cid:2) cosh( γN ) (cid:3) m ′ i(cid:21)(cid:27) ≤ . Integrating this inequality with respect to µ and using Fubini’s lemma for positivefunctions, we get P (cid:26) µ ( π ⊗ π ) (cid:20) exp h − N log (cid:2) − tanh( γN ) R ′ (cid:3) − γr ′ − N log (cid:2) cosh( γN ) (cid:3) m ′ i(cid:21)(cid:27) ≤ . Note that µ ( π ⊗ π ) is a probability measure on M × Θ × Θ, whereas ( µ π ) ⊗ ( µ π )considered previously is a probability measure on ( M × Θ) × ( M × Θ). We get aspreviously(2.26) P (cid:26) exp (cid:20) sup ν ∈ M ( M ) sup ρ : M → M (Θ) n − N log (cid:2) − tanh( γN ) ν ( ρ − π )( R ) (cid:3) − γν ( ρ − π )( r ) − N log (cid:2) cosh( γN ) (cid:3) ν ( ρ ⊗ π )( m ′ ) − K ( ν, µ ) − ν (cid:2) K ( ρ, π ) (cid:3)o(cid:21)(cid:27) ≤ . 
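The doubly localized prior of equation (2.24) is built in two stages: a Gibbs measure π₁(m, ·) inside each sub-model at inverse temperature β, then a Gibbs reweighting of the model index by the within-model average risk. On finite grids this takes a few lines. In the sketch below the model-level inverse temperature is kept as a separate argument, since in the text it is tied to β through the parameter ζ, and the toy risks stand in for R, which is not observed (hence the empirical machinery that follows):

```python
import numpy as np

def doubly_localized_prior(mu, pi, risk, beta, beta_model):
    """Two-stage localization in the spirit of equation (2.24), on finite grids.

    mu[m] is the prior weight of model m, pi[m] the prior weights inside
    model m, risk[m] the array of risks on model m's parameter grid.
    Returns (mu_1, pi_1): pi_1[m] is the Gibbs measure pi(m,.)exp(-beta R),
    and mu_1 reweights the model index by exp(-beta_model * pi_1(m,.)(R)).
    """
    pi_1, level_risk = [], []
    for m in range(len(mu)):
        w = pi[m] * np.exp(-beta * (risk[m] - risk[m].min()))
        w /= w.sum()
        pi_1.append(w)
        level_risk.append(w @ risk[m])                  # pi_1(m,.)(R)
    level_risk = np.array(level_risk)
    mu_1 = mu * np.exp(-beta_model * (level_risk - level_risk.min()))
    return mu_1 / mu_1.sum(), pi_1

mu = np.array([0.5, 0.5])
pi = [np.full(3, 1 / 3), np.full(4, 1 / 4)]
risk = [np.array([0.30, 0.28, 0.35]), np.array([0.26, 0.22, 0.24, 0.31])]
mu_1, pi_1 = doubly_localized_prior(mu, pi, risk, beta=40.0, beta_model=20.0)
print(mu_1)
```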
Let us finally recall that K ( ν, µ ) = β ζ ( ν − µ ) π ( R ) + K ( ν, µ ) − K ( µ, µ ) , (2.27) K ( ρ, π ) = β ( ρ − π )( R ) + K ( ρ, π ) − K ( π, π ) . (2.28)From equations (2.25), (2.26) and (2.28) we deduce .3. Two step localization Proposition 2.3.1 . For any positive real constants β , γ and ζ , with P probabilityat least − ǫ , for any posterior distribution ν : Ω → M ( M ) and any conditionalposterior distribution ρ : Ω × M → M (Θ) , − N log (cid:2) − tanh( γN )( νρ − µ π )( R ) (cid:3) − βν ( ρ − π )( R ) ≤ γ ( νρ − µ π )( r ) + N log (cid:2) cosh( γN ) (cid:3) ( νρ ) ⊗ ( µ π )( m ′ )+ K ( ν, µ ) + ν (cid:2) K ( ρ, π ) (cid:3) − ν (cid:2) K ( π, π ) (cid:3) + log (cid:0) ǫ (cid:1) . and − N log (cid:2) − tanh( γN ) ν ( ρ − π )( R ) (cid:3) ≤ γν ( ρ − π )( r ) + N log (cid:2) cosh( γN ) (cid:3) ν ( ρ ⊗ π )( m ′ )+ K ( ν, µ ) + ν (cid:2) K ( ρ, π ) (cid:3) + log (cid:0) ǫ (cid:1) , where the prior distribution µ π is defined by equation (2.24) on page 90 and dependson β and ζ . Let us put for short T = tanh( γN ) and C = N log (cid:2) cosh( γN ) (cid:3) . We will use an entropy compensation strategy for which we need a couple ofentropy bounds. We have according to Proposition 2.3.1, with P probability atleast 1 − ǫ , ν (cid:2) K ( ρ, π ) (cid:3) = βν ( ρ − π )( R ) + ν (cid:2) K ( ρ, π ) − K ( π, π ) (cid:3) ≤ βN T (cid:20) γν ( ρ − π )( r ) + Cν ( ρ ⊗ π )( m ′ )+ K ( ν, µ ) + ν (cid:2) K ( ρ, π ) (cid:3) + log( ǫ ) (cid:21) + ν (cid:2) K ( ρ, π ) − K ( π, π ) (cid:3) . Similarly K ( ν, µ ) = β ζ ( ν − µ ) π ( R ) + K ( ν, µ ) − K ( µ, µ ) ≤ β (1 + ζ ) N T (cid:20) γ ( ν − µ ) π ( r ) + C ( νπ ) ⊗ ( µ π )( m ′ )+ K ( ν, µ ) + log( ǫ ) (cid:21) + K ( ν, µ ) − K ( µ, µ ) . Thus, for any positive real constants β , γ and ζ i , i = 1 , . . . , 5, with P probabilityat least 1 − ǫ , for any posterior distributions ν, ν : Ω → M (Θ), any posteriorconditional distributions ρ, ρ , ρ , ρ , ρ : Ω × M → M (Θ), − N log (cid:2) − T ( νρ − µ π )( R ) (cid:3) − βν ( ρ − π )( R ) ≤ γ ( νρ − µ π )( r ) + C ( νρ ) ⊗ ( µ π )( m ′ )+ K ( ν, µ ) + ν (cid:2) K ( ρ, π ) − K ( π, π ) (cid:3) + log( ǫ ) , Chapter 2. Comparing posterior distributions to Gibbs priors ζ N Tβ µ (cid:2) K ( ρ , π ) (cid:3) ≤ ζ γµ ( ρ − π )( r ) + ζ Cµ ( ρ ⊗ π )( m ′ )+ ζ µ (cid:2) K ( ρ , π ) (cid:3) + ζ log( ǫ ) + ζ N Tβ µ (cid:2) K ( ρ , π ) − K ( π, π ) (cid:3) ,ζ N Tβ ν (cid:2) K ( ρ , π ) (cid:3) ≤ ζ γν ( ρ − π )( r ) + ζ Cν ( ρ ⊗ π )( m ′ )+ ζ K ( ν, µ ) + ζ ν (cid:2) K ( ρ , π ) (cid:3) + ζ log( ǫ )+ ζ N Tβ ν (cid:2) K ( ρ , π ) − K ( π, π ) (cid:3) ,ζ (1 + ζ ) N Tβ K ( ν , µ ) ≤ ζ γ ( ν − µ ) π ( r )+ ζ C (cid:2) ( ν π ) ⊗ ( ν ρ ) + ( ν ρ ) ⊗ ( µ π ) (cid:3) ( m ′ ) + ζ K ( ν , µ ) + ζ log( ǫ )+ ζ (1 + ζ ) N Tβ (cid:2) K ( ν , µ ) − K ( µ, µ ) (cid:3) ,ζ N Tβ ν (cid:2) K ( ρ , π ) (cid:3) ≤ ζ γν ( ρ − π )( r )+ ζ Cν ( ρ ⊗ π )( m ′ ) + ζ K ( ν , µ ) + ζ ν (cid:2) K ( ρ , π ) (cid:3) + ζ log( ǫ )+ ζ N Tβ ν (cid:2) K ( ρ , π ) − K ( π, π ) (cid:3) ,ζ N Tβ µ (cid:2) K ( ρ , π ) (cid:3) ≤ ζ γµ ( ρ − π )( r ) + ζ Cµ ( ρ ⊗ π )( m ′ )+ ζ µ (cid:2) K ( ρ , π ) (cid:3) + ζ log( ǫ ) + ζ N Tβ µ (cid:2) K ( ρ , π ) − K ( π, π ) (cid:3) . 
Adding these six inequalities and assuming that(2.29) ζ ≤ ζ (cid:2) (1 + ζ ) NTβ − (cid:3) , we find − N log (cid:2) − T ( νρ − µ π )( R ) (cid:3) − β ( νρ − µ π )( R ) ≤ − N log (cid:2) − T ( νρ − µ π )( R ) (cid:3) − β ( νρ − µ π )( R )+ ζ (cid:0) NTβ − (cid:1) µ (cid:2) K ( ρ , π ) (cid:3) + ζ (cid:0) NTβ − (cid:1) ν (cid:2) K ( ρ , π ) (cid:3) + (cid:2) ζ (1 + ζ ) NTβ − ζ − ζ (cid:3) K ( ν , µ )+ ζ (cid:0) NTβ − (cid:1) ν (cid:2) K ( ρ , π ) (cid:3) + ζ (cid:0) NTβ − (cid:1) µ (cid:2) K ( ρ , π ) (cid:3) ≤ γ ( νρ − µ π )( r ) + ζ γµ ( ρ − π )( r ) + ζ γν ( ρ − π )( r )+ ζ γ ( ν − µ ) π ( r ) + ζ γν ( ρ − π )( r ) + ζ γµ ( ρ − π )( r )+ C (cid:2) ( νρ ) ⊗ ( µ π ) + ζ µ ( ρ ⊗ π ) + ζ ν ( ρ ⊗ π )+ ζ ( ν π ) ⊗ ( ν ρ ) + ζ ( ν ρ ) ⊗ ( µ π ) + ζ ν ( ρ ⊗ π ) + ζ µ ( ρ ⊗ π ) (cid:3) ( m ′ )+ (1 + ζ ) (cid:2) K ( ν, µ ) − K ( µ, µ ) (cid:3) + ν (cid:2) K ( ρ, π ) − K ( π, π ) (cid:3) + ζ NTβ µ (cid:2) K ( ρ , π ) − K ( π, π ) (cid:3) + ζ NTβ ν (cid:2) K ( ρ , π ) − K ( π, π ) (cid:3) + ζ (1 + ζ ) NTβ (cid:2) K ( ν , µ ) − K ( µ, µ ) (cid:3) + ζ NTβ ν (cid:2) K ( ρ , π ) − K ( π, π ) (cid:3) + ζ NTβ µ (cid:2) K ( ρ , π ) − K ( π, π ) (cid:3) + (1 + ζ + ζ + ζ + ζ + ζ ) log( ǫ ) , .3. Two step localization − β ( νρ − µ π )( R ) + K ( ν, µ ) + ν (cid:2) K ( ρ, π ) (cid:3) ≤ − β ( νρ − µ π )( R ) + (1 + ζ ) K ( ν, µ ) + ν (cid:2) K ( ρ, π ) (cid:3) = (1 + ζ ) (cid:2) K ( ν, µ ) − K ( µ, µ ) (cid:3) + ν (cid:2) K ( ρ, π ) − K ( π, π ) (cid:3) . Let us now apply to π (we shall later do the same with µ ) the following inequalities,holding for any random functions of the sample and the parameters h : Ω × Θ → R and g : Ω × Θ → R , π ( g − h ) − K ( π, π ) ≤ sup ρ :Ω × M → M (Θ) ρ ( g − h ) − K ( ρ, π )= log (cid:8) π (cid:2) exp( g − h ) (cid:3)(cid:9) = log (cid:8) π (cid:2) exp( − h ) (cid:3)(cid:9) + log (cid:8) π exp( − h ) (cid:2) exp( g ) (cid:3)(cid:9) = − π exp( − h ) ( h ) − K ( π exp( − h ) , π ) + log (cid:8) π exp( − h ) (cid:2) exp( g ) (cid:3)(cid:9) . When h and g are observable, and h is not too far from βr ≃ βR , this gives away to replace π with a satisfactory empirical approximation. We will apply thismethod, choosing ρ and ρ such that µ π is replaced either with µρ , when it comesfrom the first two inequalities or with µρ otherwise, choosing ρ such that νπ isreplaced with νρ and ρ such that ν π is replaced with ν ρ . We will do so becauseit leads to a lot of helpful cancellations. For those to happen, we need to choose ρ i = π exp( − λ i r ) , i = 1 , , 4, where λ , λ and λ are such that(1 + ζ ) γ = ζ NTβ λ , (2.30) ζ γ = (cid:0) ζ NTβ (cid:1) λ , (2.31) ( ζ − ζ ) γ = ζ N Tβ λ , (2.32) ζ γ = ζ NTβ λ , (2.33)and to assume that(2.34) ζ > ζ . We obtain that with P probability at least 1 − ǫ , − N log (cid:2) − T ( µρ − µ π )( R ) (cid:3) − β ( νρ − µ π )( R ) ≤ γ ( νρ − µ ρ )( r ) + ζ γ ( ν ρ − µρ )( r )+ ζ NTβ µ ( log " ρ (cid:26) exp (cid:20) C βNT ζ (cid:2) νρ + ζ ρ (cid:3) ( m ′ ) (cid:21)(cid:27) + (cid:0) ζ NTβ (cid:1) ν ( log ( ρ (cid:26) exp (cid:20) C ζ NTβ ζ ρ ( m ′ ) (cid:21)(cid:27) + ζ NTβ ν ( log " ρ (cid:26) exp (cid:20) C βNT ζ (cid:2) ζ ν ρ + ζ ρ (cid:3) ( m ′ ) (cid:21)(cid:27) + ζ NTβ µ ( log " ρ (cid:26) exp (cid:20) C βNT ζ (cid:2) ζ ν ρ + ζ ρ (cid:3) ( m ′ ) (cid:21)(cid:27) Chapter 2. Comparing posterior distributions to Gibbs priors + (1 + ζ ) (cid:2) K ( ν, µ ) − K ( µ, µ ) (cid:3) + ν (cid:2) K ( ρ, π ) − K ( ρ , π ) (cid:3) + ζ (1 + ζ ) NTβ (cid:2) K ( ν , µ ) − K ( µ, µ ) (cid:3) + (cid:18) X i =1 ζ i (cid:19) log (cid:0) ǫ (cid:1) . 
In order to obtain more cancellations while replacing µ by some posterior distri-bution, we will choose the constants such that λ = λ , which can be done bychoosing(2.35) ζ = ζ ζ ζ − ζ . We can now replace µ with µ exp − ξ ρ ( r ) − ξ ρ ( r ) , where ξ = γ (1 + ζ ) (cid:0) NTβ ζ (cid:1) , (2.36) ξ = γζ (1 + ζ ) (cid:0) NTβ ζ (cid:1) . (2.37)Choosing moreover ν = µ exp − ξ ρ ( r ) − ξ ρ ( r ) , to induce some more cancellations,we get Theorem 2.3.2 . Let us use the notation introduced above. For any positive realconstants satisfying equations (2.29, page 92), (2.30, page 93), (2.31, page 93),(2.32, page 93), (2.33, page 93), (2.34, page 93), (2.35, page 94), (2.36, page 94),(2.37, page 94), with P probability at least − ǫ , for any posterior distribution ν : Ω → M ( M ) and any conditional posterior distribution ρ : Ω × M → M (Θ) , − N log (cid:2) − T ( νρ − µ π )( R ) (cid:3) − β ( νρ − µ π )( R ) ≤ B ( ν, ρ, β ) , where B ( ν, ρ, β ) def = γ ( νρ − ν ρ )( r )+ (1 + ζ ) (cid:0) NTβ ζ (cid:1) × log ( ν " ρ (cid:26) exp (cid:20) C βNT ζ (cid:2) νρ + ζ ρ (cid:3) ( m ′ ) (cid:21)(cid:27) ζ NTβ (1+ ζ NTβ ζ × ρ (cid:26) exp (cid:20) C βNT ζ (cid:2) ζ ν ρ + ζ ρ (cid:3) ( m ′ ) (cid:21)(cid:27) ζ NTβ (1+ ζ NTβ ζ + (cid:0) ζ NTβ (cid:1) ν ( log ( ρ (cid:26) exp (cid:20) C ζ NTβ ζ ρ ( m ′ ) (cid:21)(cid:27) + ζ NTβ ν ( log " ρ (cid:26) exp (cid:20) C βNT ζ (cid:2) ζ ν ρ + ζ ρ (cid:3) ( m ′ ) (cid:21)(cid:27) + (1 + ζ ) (cid:2) K ( ν, µ ) − K ( ν , µ ) (cid:3) + ν (cid:2) K ( ρ, π ) − K ( ρ , π ) (cid:3) + (cid:18) X i =1 ζ i (cid:19) log (cid:0) ǫ (cid:1) . This theorem can be used to find the largest value b β ( νρ ) of β such that B ( ν, ρ,β ) ≤ 0, thus providing an estimator for β ( νρ ) defined as νρ ( R ) = µ β ( νρ ) π β ( νρ ) ( R ), .3. Two step localization µ and π in β , the constant ζ staying fixed. The posterior distribution νρ may then be chosen to maximize b β ( νρ ) within some manageable subset of posterior distributions P , thus gainingthe assurance that νρ ( R ) ≤ µ b β ( νρ ) π b β ( νρ ) ( R ), with the largest parameter b β ( νρ )that this approach can provide. Maximizing b β ( νρ ) is supported by the fact thatlim β → + ∞ µ β π β ( R ) = ess inf µπ R . Anyhow, there is no assurance (to our knowledge)that β µ β π β ( R ) will be a decreasing function of β all the way, although this maybe expected to be the case in many practical situations.We can make the bound more explicit in several ways. One point of view is toput forward the optimal values of ρ and ν . We can thus remark that ν (cid:2) γρ ( r ) + K ( ρ, π ) − K ( ρ , π ) (cid:3) + (1 + ζ ) K ( ν, µ )= ν (cid:20) K (cid:2) ρ, π exp( − γr ) (cid:3) + λ ρ ( r ) + Z γλ π exp( − αr ) ( r ) dα (cid:21) + (1 + ζ ) K ( ν, µ )= ν (cid:8) K (cid:2) ρ, π exp( − γr ) (cid:3)(cid:9) + (1 + ζ ) K (cid:2) ν, µ exp (cid:0) − λ ρ r )1+ ζ − ζ R γλ π exp( − αr ) ( r ) dα (cid:1)(cid:3) − (1 + ζ ) log ( µ " exp (cid:26) − λ ζ ρ ( r ) − 11 + ζ Z γλ π exp( − αr ) ( r ) dα (cid:27) . 
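In practice Theorem 2.3.2 is used through β̂(νρ), the largest β for which the empirical quantity B(ν, ρ, β) remains non-positive, and B is computable from the sample for each candidate β. Finding β̂ is therefore a one-dimensional search. A minimal sketch, treating B as a black-box callable (an implementation would evaluate the full expression of the theorem) and assuming a single sign change on the search interval:

```python
def estimate_inverse_temperature(bound, beta_max, tol=1e-3):
    """Largest beta in (0, beta_max] with bound(beta) <= 0, by bisection.

    `bound` stands for beta -> B(nu, rho, beta) of Theorem 2.3.2, with nu and
    rho held fixed.  We return beta_max when the bound never becomes positive.
    """
    if bound(beta_max) <= 0.0:
        return beta_max
    lo, hi = 0.0, beta_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bound(mid) <= 0.0:
            lo = mid
        else:
            hi = mid
    return lo

# Toy stand-in for B(nu, rho, .): negative for small beta, positive afterwards.
print(estimate_inverse_temperature(lambda beta: 0.01 * beta - 0.5, beta_max=200.0))  # about 50
```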
Thus B ( ν, ρ, β ) = (1 + ζ ) h ξ ν ρ ( r ) + ξ ν ρ ( r )+ log (cid:8) µ (cid:2) exp (cid:0) − ξ ρ ( r ) − ξ ρ ( r ) (cid:1)(cid:3)(cid:9)i − (1 + ζ ) log ( µ " exp (cid:26) − λ ζ ρ ( r ) − 11 + ζ Z γλ π exp( − αr ) ( r ) dα (cid:27) − γν ρ ( r ) + (1 + ζ ) (cid:0) NTβ ζ (cid:1) × log ( ν " ρ (cid:26) exp (cid:20) C βNT ζ (cid:2) νρ + ζ ρ (cid:3) ( m ′ ) (cid:21)(cid:27) ζ NTβ (1+ ζ NTβ ζ × ρ (cid:26) exp (cid:20) C βNT ζ (cid:2) ζ ν ρ + ζ ρ (cid:3) ( m ′ ) (cid:21)(cid:27) ζ NTβ (1+ ζ NTβ ζ + (cid:0) ζ NTβ (cid:1) ν ( log ( ρ (cid:26) exp (cid:20) C ζ NTβ ζ ρ ( m ′ ) (cid:21)(cid:27) + ζ NTβ ν ( log " ρ (cid:26) exp (cid:20) C βNT ζ (cid:2) ζ ν ρ + ζ ρ (cid:3) ( m ′ ) (cid:21)(cid:27) + ν (cid:8) K (cid:2) ρ, π exp( − γr ) (cid:3)(cid:9) + (1 + ζ ) K (cid:2) ν, µ exp (cid:0) − λ ρ r )1+ ζ − ζ R γλ π exp( − αr ) ( r ) dα (cid:1)(cid:3) + (cid:18) X i =1 ζ i (cid:19) log (cid:0) ǫ (cid:1) . This formula is better understood when thinking about the following upper boundfor the two first lines in the expression of B ( ν, ρ, β ):6 Chapter 2. Comparing posterior distributions to Gibbs priors (1 + ζ ) h ξ ν ρ ( r ) + ξ ν ρ ( r ) + log (cid:8) µ (cid:2) exp (cid:0) − ξ ρ ( r ) − ξ ρ ( r ) (cid:1)(cid:3)(cid:9)i − (1 + ζ ) log ( µ " exp (cid:26) − λ ζ ρ ( r ) − 11 + ζ Z γλ π exp( − αr ) ( r ) dα (cid:27) − γν ρ ( r ) ≤ ν (cid:20) λ ρ ( r ) + Z γλ π exp( − αr ) ( r ) dα − γρ ( r ) (cid:21) . Another approach to understanding Theorem 2.3.2 is to put forward ρ = π exp( − λ r ) , for some positive real constant λ < γ , noticing that ν (cid:2) K ( ρ , π ) − K ( ρ , π ) (cid:3) = λ ν ( ρ − ρ )( r ) − ν (cid:2) K ( ρ , ρ ) (cid:3) . Thus B ( ν, ρ , β ) ≤ ν (cid:2) ( γ − λ )( ρ − ρ )( r ) + λ ( ρ − ρ )( r ) (cid:3) + (1 + ζ ) (cid:0) NTβ ζ (cid:1) × log ( ν " ρ (cid:26) exp (cid:20) C βNT ζ (cid:2) νρ + ζ ρ (cid:3) ( m ′ ) (cid:21)(cid:27) ζ NTβ (1+ ζ NTβ ζ × ρ (cid:26) exp (cid:20) C βNT ζ (cid:2) ζ ν ρ + ζ ρ (cid:3) ( m ′ ) (cid:21)(cid:27) ζ NTβ (1+ ζ NTβ ζ + (cid:0) ζ NTβ (cid:1) ν ( log ( ρ (cid:26) exp (cid:20) C ζ NTβ ζ ρ ( m ′ ) (cid:21)(cid:27) + ζ NTβ ν ( log " ρ (cid:26) exp (cid:20) C βNT ζ (cid:2) ζ ν ρ + ζ ρ (cid:3) ( m ′ ) (cid:21)(cid:27) + (1 + ζ ) K h ν, µ exp (cid:0) − ( γ − λ ρ r )+ λ ρ r )1+ ζ (cid:1)i − ν (cid:2) K ( ρ , ρ ) (cid:3) + (cid:18) X i =1 ζ i (cid:19) log (cid:0) ǫ (cid:1) . In the case when we want to select a single model b m ( ω ), and therefore to set ν = δ b m , the previous inequality engages us to take b m ∈ arg min m ∈ M ( γ − λ ) ρ ( m, r ) + λ ρ ( m, r ).In parametric situations where π exp( − λr ) ( r ) ≃ r ⋆ ( m ) + d e ( m ) λ , we get ( γ − λ ) ρ ( m, r ) − λ ρ ( m, r ) ≃ γ (cid:2) r ⋆ ( m ) + d e ( m ) (cid:0) λ + λ − λ γλ (cid:1)(cid:3) , resulting in a linear penalization of the empirical dimension of the models. .3. Two step localization We will not state a formal result, but will nevertheless give some hints about howto establish one. This is a rather technical section, which can be skipped at a firstreading , since it will not be used below. We should start from Theorem 1.4.2 (page36), which gives a deterministic variance term. 
From Theorem 1.4.2, after a changeof prior distribution, we obtain for any positive constants α and α , any priordistributions e µ and e µ ∈ M ( M ), for any prior conditional distributions e π and e π : M → M (Θ), with P probability at least 1 − η , for any posterior distributions ν ρ and ν ρ , α ( ν ρ − ν ρ )( R ) ≤ α ( ν ρ − ν ρ )( r )+ K (cid:2) ( ν ρ ) ⊗ ( ν ρ ) , ( e µ e π ) ⊗ ( e µ e π ) (cid:3) + log n ( e µ e π ) ⊗ ( e µ e π ) h exp (cid:8) − α Ψ α N ( R ′ , M ′ ) + α R ′ (cid:9)io − log( η ) . Applying this to α = 0, we get that( νρ − ν ρ )( r ) ≤ α (cid:20) K (cid:2) ( νρ ) ⊗ ( ν ρ ) , ( e µ e π ) ⊗ ( e µ e π ) (cid:3) + log n ( e µ e ν ) ⊗ ( e µ e π ) h exp (cid:8) α Ψ − α N ( R ′ , M ′ ) (cid:9)io − log( η ) (cid:21) . In the same way, to bound quantities of the formlog ( ν " ρ (cid:26) exp (cid:20) C ( νρ + ζ ρ )( m ′ ) (cid:21)(cid:27) p × ρ (cid:26) exp (cid:20) C (cid:2) ζ ν ρ + ζ ρ (cid:3) ( m ′ ) (cid:21)(cid:27) p = sup ν (cid:26) p sup ρ n C (cid:2) ( νρ ) ⊗ ( ν ρ ) + ζ ν ( ρ ⊗ ρ ) (cid:3) ( m ′ ) − K ( ρ , ρ ) o + p sup ρ n C (cid:2) ζ ( ν ρ ) ⊗ ( ν ρ )+ ζ ν ( ρ ⊗ ρ ) (cid:3) ( m ′ ) − K ( ρ , ρ ) o − K ( ν , ν ) (cid:27) , where C , C , p and p are positive constants, and similar terms, we need to useinequalities of the type: for any prior distributions e µ i e π i , i = 1 , 2, with P probabilityat least 1 − η , for any posterior distributions ν i ρ i , i = 1 , α ( ν ρ ) ⊗ ( ν ρ )( m ′ ) ≤ log n ( e µ e π ) ⊗ ( e µ e π ) exp h α Φ − α N ( M ′ ) io + K (cid:2) ( ν ρ ) ⊗ ( ν ρ ) , ( e µ e π ) ⊗ ( e µ e π ) (cid:3) − log( η ) . We need also the variant: with P probability at least 1 − η , for any posterior dis-tribution ν : Ω → M ( M ) and any conditional posterior distributions ρ , ρ :Ω × M → M (Θ), α ν ( ρ ⊗ ρ )( m ′ ) ≤ log ne µ (cid:0)e π ⊗ e π (cid:1) exp h α Φ − α N ( M ′ ) io Chapter 2. Comparing posterior distributions to Gibbs priors + K ( ν , e µ ) + ν (cid:8) K (cid:2) ρ ⊗ ρ , e π ⊗ e π (cid:3)(cid:9) − log( η ) . We deduce thatlog ( ν " ρ (cid:26) exp (cid:20) C ( νρ + ζ ρ )( m ′ ) (cid:21)(cid:27) p × ρ (cid:26) exp (cid:20) C (cid:2) ζ ν ρ + ζ ρ (cid:3) ( m ′ ) (cid:21)(cid:27) p ≤ sup ν ( p sup ρ " C α (cid:26) log n ( e µ e π ) ⊗ ( e µ e π ) exp h α Φ − α N ( M ′ ) io + K (cid:2) ( νρ ) ⊗ ( ν ρ ) , ( e µ e π ⊗ ( e µ e π ) (cid:3) + log( η )+ ζ (cid:20) log ne µ (cid:0)e π ⊗ e π (cid:1) exp h α Φ − α N ( M ′ ) io + K ( ν , e µ ) + ν (cid:8) K (cid:2) ρ ⊗ ρ , e π ⊗ e π (cid:3)(cid:9) + log (cid:0) η (cid:1)(cid:21)(cid:27) − K ( ρ , ρ ) + p sup ρ " C α (cid:26) log n ( e µ e π ) ⊗ ( e µ e π ) exp h α Φ − α N ( M ′ ) io + K (cid:2) ( ν ρ ) ⊗ ( ν ρ ) , ( e µ e π ⊗ ( e µ e π ) (cid:3) + log( η )+ ζ (cid:20) log ne µ (cid:0)e π ⊗ e π (cid:1) exp h α Φ − α N ( M ′ ) io + K ( ν , e µ ) + ν (cid:8) K (cid:2) ρ ⊗ ρ , e π ⊗ e π (cid:3)(cid:9) + log (cid:0) η (cid:1)(cid:21)(cid:27) − K ( ρ , ρ ) − K ( ν , ν ) ) . We are then left with the need to bound entropy terms like K ( ν ρ , e µ e π ), wherewe have the choice of e µ and e π , to obtain a useful bound. As could be expected,we decompose it into K ( ν ρ , e µ e π ) = K ( ν , e µ ) + ν (cid:2) K ( ρ , e π ) (cid:3) . 
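This decomposition is the chain rule for the Kullback-Leibler divergence of a hierarchical (model index, parameter) distribution, and it is applied silently several more times below; a quick numerical confirmation on random discrete distributions (the dimensions are arbitrary):

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence K(p, q) between discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

rng = np.random.default_rng(1)
M, T = 3, 4                                  # number of models, parameters per model
nu = rng.dirichlet(np.ones(M))               # posterior on model indices
mu = rng.dirichlet(np.ones(M))               # prior on model indices
rho = rng.dirichlet(np.ones(T), size=M)      # conditional posteriors rho(m, .)
pi = rng.dirichlet(np.ones(T), size=M)       # conditional priors pi(m, .)

joint_post = (nu[:, None] * rho).ravel()     # nu rho as a distribution on M x Theta
joint_prior = (mu[:, None] * pi).ravel()
lhs = kl(joint_post, joint_prior)
rhs = kl(nu, mu) + sum(nu[m] * kl(rho[m], pi[m]) for m in range(M))
print(lhs, rhs)                              # equal up to rounding
```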
Let us look after the second term first, choosing e π = π exp( − β R ) : ν (cid:2) K ( ρ , e π ) (cid:3) = ν (cid:2) β ( ρ − e π )( R ) + K ( ρ , π ) − K ( e π , π ) (cid:3) ≤ β α (cid:20) α ν ( ρ − e π )( r ) + K ( ν , e µ ) + ν (cid:2) K ( ρ , e π ) (cid:3) + log ne µ (cid:0)e π ⊗ (cid:1)h exp (cid:8) − α Ψ α N ( R ′ , M ′ ) + α R ′ (cid:9)io − log( η ) (cid:21) + ν (cid:2) K ( ρ , π ) − K ( e π , π ) (cid:3) ≤ β α (cid:20) K ( ν , e µ ) + ν (cid:2) K ( ρ , e π ) (cid:3) + log ne µ (cid:0)e π ⊗ (cid:1)h exp (cid:8) − α Ψ α N ( R ′ , M ′ ) + α R ′ (cid:9)io − log( η ) (cid:21) + ν (cid:8) K (cid:2) ρ , π exp( − β α α r ) (cid:3)(cid:9) . .3. Two step localization λ = β α α is satisfied, ν (cid:2) K ( ρ , e π ) (cid:3) ≤ (cid:16) − β α (cid:17) − β α (cid:20) K ( ν , e µ )+ log ne µ (cid:0)e π ⊗ (cid:1)h exp (cid:8) − α Ψ α N ( R ′ , M ′ ) + α R ′ (cid:9)io − log( η ) (cid:21) . We can further specialize the constants, choosing α = N sinh( α N ), so that − α Ψ α N ( R ′ , M ′ ) + α R ′ ≤ N sinh (cid:16) α N (cid:17) M ′ . We can for instance choose α = γ , α = N sinh( γN ) and β = λ Nγ sinh( γN ), leadingto Proposition 2.3.3 . With the notation of Theorem 2.3.2, the constants being setas explained above, putting e π = π exp( − λ Nγ sinh( γN ) R ) , with P probability at least − η , ν (cid:2) K ( ρ , e π ) (cid:3) ≤ (cid:16) − λ γ (cid:17) − λ γ (cid:20) K ( ν , e µ )+ log ne µ (cid:0)e π ⊗ (cid:1)h exp (cid:8) N sinh( γ N ) M ′ (cid:9)io − log( η ) (cid:21) . More generally ν (cid:2) K ( ρ, e π ) (cid:3) ≤ (cid:16) − λ γ (cid:17) − λ γ (cid:20) K ( ν , e µ )+ log ne µ (cid:0)e π ⊗ (cid:1)h exp (cid:8) N sinh( γ N ) M ′ (cid:9)io − log( η ) (cid:21) + (cid:16) − λ γ (cid:17) − ν (cid:2) K ( ρ, ρ ) (cid:3) . In a similar way, let us now choose e µ = µ exp[ − α π ( R )] . We can write K ( ν, e µ ) = α ( ν − e µ ) π ( R ) + K ( ν, µ ) − K ( e µ , µ ) ≤ α α (cid:20) α ( ν − e µ ) π ( r ) + K ( ν, e µ )+ log n ( e µ π ) ⊗ ( e µ π ) h exp (cid:8) − α Ψ α N ( R ′ , M ′ ) + α R ′ (cid:9)io − log( η ) (cid:21) + K ( ν, µ ) − K ( e µ , µ ) . Let us choose α = γ , α = N sinh( γN ), and let us add some other entropy inequal-ities to get rid of π in a suitable way, the approach of entropy compensation beingthe same as that used to obtain the empirical bound of Theorem 2.3.2 (page 94).This results with P probability at least 1 − η in (cid:16) − α α (cid:17) K ( ν, e µ ) ≤ α α (cid:20) γ ( ν − e µ ) π ( r )00 Chapter 2. 
Comparing posterior distributions to Gibbs priors + log n ( e µ π ) ⊗ ( e µ π ) h exp (cid:8) − γ Ψ γN ( R ′ , M ′ ) + α R ′ (cid:9)io + log( η ) (cid:21) + K ( ν, µ ) − K ( e µ , µ ) ,ζ (cid:16) − βα (cid:17)e µ (cid:2) K ( ρ , π ) (cid:3) ≤ ζ βα (cid:20) γ e µ ( ρ − π )( r )+ log ne µ (cid:0) π ⊗ (cid:1)h exp (cid:8) − γ Ψ γN ( R ′ , M ′ ) + α R ′ (cid:9)io + log( η ) (cid:21) + ζ e µ (cid:2) K ( ρ , π ) − K ( π, π ) (cid:3) ,ζ (cid:16) − βα (cid:17)e µ (cid:2) K ( ρ , π ) (cid:3) ≤ ζ βα (cid:20) γ e µ ( ρ − π )( r )+ log ne µ (cid:0) π ⊗ (cid:1)h exp (cid:8) − γ Ψ γN ( R ′ , M ′ ) + α R ′ (cid:9)io + log( η ) (cid:21) + ζ e µ (cid:2) K ( ρ , π ) − K ( π, π ) (cid:3) ,ζ (cid:16) − βα (cid:17) ν (cid:2) K ( ρ , π ) (cid:3) ≤ ζ βα (cid:20) γν ( ρ − π )( r ) + K ( ν, e µ )+ log ne µ (cid:0) π ⊗ (cid:1)h exp (cid:8) − γ Ψ γN ( R ′ , M ′ ) + α R ′ (cid:9)io + log( η ) (cid:21) + ζ ν (cid:2) K ( ρ , π ) − K ( π, π ) (cid:3) ,ζ (cid:16) − βα (cid:17) ν (cid:2) K ( ρ , π ) (cid:3) ≤ ζ βα (cid:20) γν ( ρ − π )( r ) + K ( ν, e µ )+ log ne µ (cid:0) π ⊗ (cid:1)h exp (cid:8) − γ Ψ γN ( R ′ , M ′ ) + α R ′ (cid:9)io + log( η ) (cid:21) + ζ ν (cid:2) K ( ρ , π ) − K ( π, π ) (cid:3) , where we have introduced a bunch of constants, assumed to be positive, that wewill more precisely set to x + x = 1 , ( ζ β + x α ) γα = λ , ( ζ β + x α ) γα = λ , ( ζ β − x α ) γα = λ , ( ζ β − x α ) γα = λ . We get with P probability at least 1 − η , (cid:16) − α α − ( ζ + ζ ) βα (cid:17) K ( ν, e µ ) ≤ α α (cid:20) γ (cid:2) ν ( x ρ + x ρ )( r ) − e µ ( x ρ + x ρ )( r ) (cid:3) + α α log n ( e µ π ) ⊗ ( e µ π ) h exp (cid:8) − γ Ψ γN ( R ′ , M ′ ) + α R ′ (cid:9)io + ( ζ + ζ + ζ + ζ ) βα log ne µ (cid:0) π ⊗ (cid:1)h exp (cid:8) − γ Ψ γN ( R ′ , M ′ ) + α R ′ (cid:9)io .3. Two step localization K ( ν, µ ) − K ( e µ , µ ) + (cid:16) α α + ( ζ + ζ + ζ + ζ ) βα (cid:17) log (cid:0) η (cid:1) . Let us choose the constants so that λ = λ = λ , λ = λ = λ , α x γα = ξ and α x γα = ξ . This is done by setting x = ξ ξ + ξ ,x = ξ ξ + ξ ,α = Nγ sinh( γN )( ξ + ξ ) ,ζ = Nγ sinh( γN ) ( λ − ξ ) β ,ζ = Nγ sinh( γN ) ( λ − ξ ) β ,ζ = Nγ sinh( γN ) ( λ + ξ ) β ,ζ = Nγ sinh( γN ) ( λ + ξ ) β . The inequality λ > ξ is always satisfied. The inequality λ > ξ is required forthe above choice of constants, and will be satisfied for a suitable choice of ζ and ζ .Under these assumptions, we obtain with P probability at least 1 − η (cid:16) − α α − ( ζ + ζ ) βα (cid:17) K ( ν, e µ ) ≤ ( ν − e µ )( ξ ρ + ξ ρ )( r )+ α α log n ( e µ π ) ⊗ ( e µ π ) h exp (cid:8) − γ Ψ γN ( R ′ , M ′ ) + α R ′ (cid:9)io + ( ζ + ζ + ζ + ζ ) βα log ne µ (cid:0) π ⊗ (cid:1)h exp (cid:8) − γ Ψ γN ( R ′ , M ′ ) + α R ′ (cid:9)io + K ( ν, µ ) − K ( e µ , µ ) + (cid:16) α α + ( ζ + ζ + ζ + ζ ) βα (cid:17) log (cid:0) η (cid:1) . This proves Proposition 2.3.4 . The constants being set as explained above, with P probabilityat least − η , for any posterior distribution ν : Ω → M ( M ) , K ( ν, e µ ) ≤ (cid:16) − α α − ( ζ + ζ ) βα (cid:17) − (cid:20) K ( ν, ν )+ α α log n ( e µ π ) ⊗ ( e µ π ) h exp (cid:8) − γ Ψ γN ( R ′ , M ′ ) + α R ′ (cid:9)io + ( ζ + ζ + ζ + ζ ) βα log ne µ (cid:0) π ⊗ (cid:1)h exp (cid:8) − γ Ψ γN ( R ′ , M ′ ) + α R ′ (cid:9)io + (cid:16) α α + ( ζ + ζ + ζ + ζ ) βα (cid:17) log (cid:0) η (cid:1)(cid:21) . Thus02 Chapter 2. 
Comparing posterior distributions to Gibbs priors K ( ν ρ , e µ e π ) ≤ (cid:0) − λ γ (cid:1) − λ γ − α α − ( ζ + ζ ) βα × (cid:20) α α log n ( e µ π ⊗ ( e µ π ) h exp (cid:8) − γ Ψ γN ( R ′ , M ′ ) + α R ′ (cid:9)io + ( ζ + ζ + ζ + ζ ) βα log ne µ (cid:0) π ⊗ (cid:1)h exp (cid:8) − γ Ψ γN ( R ′ , M ′ ) + α R ′ (cid:9)io + (cid:16) α α + ( ζ + ζ + ζ + ζ ) βα (cid:17) log (cid:0) η (cid:1)(cid:21) + (cid:16) − λ γ (cid:17) − λ γ (cid:20) log ne µ (cid:0)e π ⊗ (cid:1)h exp (cid:8) N sinh (cid:0) γ N (cid:1) M ′ (cid:9)io − log( η ) (cid:21) . We will not go further, lest it may become tedious, but we hope we have givensufficient hints to state informally that the bound B ( ν, ρ, β ) of Theorem 2.3.2 (page94) is upper bounded with P probability close to one by a bound of the same flavourwhere the empirical quantities r and m ′ have been replaced with their expectations R and M ′ . Here we work with a family of prior distributions described by a regular conditionalprior distribution π = M → M (Θ), where M is some measurable index set. Thisfamily may typically describe a countable family of parametric models. In this case M = N , and each of the prior distributions π ( i, . ), i ∈ N satisfies some parametriccomplexity assumption of the typelim sup β → + ∞ β (cid:2) π exp( − βR ) ( i, . )( R ) − ess inf π ( i,. ) R (cid:3) = d i < + ∞ , i ∈ M. Let us consider also a prior distribution µ ∈ M ( M ) defined on the index set M .Our aim here will be to compare the performance of two given posterior distri-butions ν ρ and ν ρ , where ν , ν : Ω → M ( M ), and where ρ , ρ : Ω × M → M (Θ). More precisely, we would like to establish a bound for ( ν ρ − ν ρ )( R )which could be a starting point to implement a selection method similar to the onedescribed in Theorem 2.2.4 (page 73). To this purpose, we can start with Theorem2.2.1 (page 69), which says that with P probability at least 1 − ǫ , − N log n − tanh( λN ) (cid:0) ν ρ − ν ρ (cid:1) ( R ) o ≤ λ ( ν ρ − ν ρ )( r )+ N log (cid:2) cosh( λN ) (cid:3) ( ν ρ ) ⊗ ( ν ρ )( m ′ ) + K ( ν , e µ ) + K ( ν , e µ )+ ν (cid:2) K ( ρ , e π ) (cid:3) + ν (cid:2) K ( ρ , e π ) (cid:3) − log( ǫ ) , where e µ ∈ M ( M ) and e π : M → M (Θ) are suitably localized prior distributionsto be chosen later on. To use these localized prior distributions, we need empiricalbounds for the entropy terms K ( ν i , e µ ) and ν i (cid:2) K ( ρ i , e π ) (cid:3) , i = 1 , ν (cid:2) K ( ρ, e π ) (cid:3) can be done using the following generalization of Corollary2.1.19 page 68: Corollary 2.3.5 . For any positive real constants γ and λ such that γ < λ , forany prior distribution µ ∈ M ( M ) and any conditional prior distribution π : M → .3. Two step localization M (Θ) , with P probability at least − ǫ , for any posterior distribution ν : Ω → M ( M ) , and any conditional posterior distribution ρ : Ω × M → M (Θ) , ν n K (cid:2) ρ, π exp[ − N γλ tanh( λN ) R ] (cid:3)o ≤ K ′ ( ν, ρ, γ, λ, ǫ ) + 1 λγ − K ( ν, µ ) , where K ′ ( ν, ρ, γ, λ, ǫ ) def = (cid:16) − γλ (cid:17) − (cid:26) ν (cid:2) K ( ρ, π exp( − γr ) (cid:3) − γλ log( ǫ ) + ν n log h π exp( − γr ) (cid:16) exp (cid:8) N γλ log (cid:2) cosh( λN ) (cid:3) ρ ( m ′ ) (cid:9)(cid:17)io(cid:27) . To apply this corollary to our case, we have to set e π = π exp[ − N γλ tanh( λN ) R ] . Let us also consider for some positive real constant β the conditional prior distri-bution π = π exp( − βR ) and the prior distribution µ = µ exp[ − απ ( R )] . 
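The conditional prior $\pi_{\exp(-\beta R)}$ just introduced is the same Gibbs construction that enters the parametric complexity assumption stated above: for a smooth $d$-dimensional parametric model, $\beta$ times the expected excess risk of the Gibbs measure tends to $d/2$ as $\beta \to +\infty$, which is the kind of finiteness the assumption requires. A toy Monte Carlo sketch (flat prior on a box and a quadratic toy risk, both purely illustrative) shows the normalization at work.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 2, 400_000
theta = rng.uniform(-1.0, 1.0, size=(n, d))   # samples from a flat prior on a box (toy)
risk = np.sum(theta ** 2, axis=1)             # toy risk R(theta) = |theta|^2, ess inf R = 0

for beta in (10.0, 50.0, 200.0):
    weights = np.exp(-beta * (risk - risk.min()))   # Gibbs reweighting exp(-beta R)
    gibbs_excess = np.average(risk, weights=weights) - risk.min()
    print(beta, beta * gibbs_excess)          # approaches d / 2 = 1 as beta grows
```

A genuinely $d_i$-dimensional classification model is expected to behave in the same way, which is what lets $d_i$ play the role of an effective dimension in the bounds of this section.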
Let us see how we can bound, given any posterior distribution ν : Ω → M ( M ),the divergence K ( ν, µ ). We can see that K ( ν, µ ) = α ( ν − µ ) π ( R ) + K ( ν, µ ) − K ( µ, µ ) . Now, let us introduce the conditional posterior distribution b π = π exp( − γr ) and let us decompose( ν − µ ) (cid:2) π ( R ) (cid:3) = ν (cid:2) π ( R ) − b π ( R ) (cid:3) + ( ν − µ ) (cid:2)b π ( R ) (cid:3) + µ (cid:2)b π ( R ) − π ( R ) (cid:3) . Starting from the exponential inequality P (cid:20) µ (cid:2) π ⊗ π (cid:3) exp n − N log (cid:2) − tanh( γN ) R ′ (cid:3) − γr ′ − N log (cid:2) cosh( γN ) (cid:3) m ′ o(cid:21) ≤ , and reasoning in the same way that led to Theorem 2.1.1 (page 52) in the simplecase when we take in this theorem λ = γ , we get with P probability at least 1 − ǫ ,that − N log (cid:8) − tanh( γN ) ν ( π − b π )( R ) (cid:9) + βν ( π − b π )( R ) ≤ ν (cid:20) log nb π h exp (cid:8) N log (cid:2) cosh( γN ) b π ( m ′ ) (cid:9)io(cid:21) + K ( ν, µ ) − log( ǫ ) . − N log (cid:8) − tanh( γN ) µ ( b π − π )( R ) (cid:9) − βµ ( b π − π )( R ) ≤ µ (cid:20) log nb π h exp (cid:8) N log (cid:2) cosh( γN ) b π ( m ′ ) (cid:9)io(cid:21) − log( ǫ ) . Chapter 2. Comparing posterior distributions to Gibbs priors In the meantime, using Theorem 2.2.1 (page 69) and Corollary 2.3.5 above, wesee that with P probability at least 1 − ǫ , for any conditional posterior distribution ρ : Ω × M → M (Θ), − N log n − tanh( λN )( ν − µ ) ρ ( R ) o ≤ λ ( ν − µ ) ρ ( r )+ N log (cid:2) cosh( λN ) (cid:3) ( νρ ) ⊗ ( µρ )( m ′ ) + ( ν + µ ) K ( ρ, e π ) + K ( ν, µ ) − log( ǫ ) ≤ λ ( ν − µ ) ρ ( r ) + N log (cid:2) cosh( λN ) (cid:3) ( νρ ) ⊗ ( µρ )( m ′ ) + K ( ν, µ ) − log( ǫ )+ (cid:16) − γλ (cid:17) − ( ν + µ ) (cid:26) K (cid:0) ρ, b π (cid:1) + log nb π h exp (cid:8) N γλ log (cid:2) cosh( λN ) (cid:3) ρ ( m ′ ) (cid:9)io(cid:27) + (cid:16) λγ − (cid:17) − (cid:2) K ( ν, µ ) − ǫ ) (cid:3) . Putting all this together, we see that with P probability at least 1 − ǫ , for anyposterior distribution ν ∈ M ( M ), (cid:20) − αN tanh( γN ) + β − αN tanh( λN ) (cid:0) − γλ (cid:1) (cid:21) K ( ν, µ ) ≤ α h N tanh( γN ) + β i − (cid:26) ν (cid:20) log nb π h exp (cid:8) N log (cid:2) cosh( γN ) (cid:3)b π ( m ′ ) (cid:9)io(cid:21) − log( ǫ ) (cid:27) + α h N tanh( γN ) − β i − (cid:26) µ (cid:20) log nb π h exp (cid:8) N log (cid:2) cosh( γN ) (cid:3)b π ( m ′ ) (cid:9)io(cid:21) − log( ǫ ) (cid:27) + α (cid:2) N tanh( λN ) (cid:3) − ( λ ( ν − µ ) b π ( r ) + N log (cid:2) cosh( λN ) (cid:3) ( ν b π ) ⊗ ( µ b π )( m ′ )+ (cid:16) − γλ (cid:17) − ( ν + µ ) (cid:20) log nb π h exp (cid:8) N γλ log (cid:2) cosh( λN ) (cid:3)b π ( m ′ ) (cid:9)io(cid:21) − γλ − γλ log( ǫ ) ) + K ( ν, µ ) − K ( µ, µ ) . Replacing in the right-hand side of this inequality the unobserved prior distribution µ with the worst possible posterior distribution, we obtain Theorem 2.3.6 . For any positive real constants α , β , γ and λ , using the notation, π = π exp( − βR ) ,µ = µ exp[ − απ ( R )] , b π = π exp( − γr ) , b µ = µ exp[ − α λN tanh( λN ) − b π ( r )] , with P probability at least − ǫ , for any posterior distribution ν : Ω → M ( M ) , (cid:20) − αN tanh( γN ) + β − αN tanh( λN ) (cid:0) − γλ (cid:1) (cid:21) K ( ν, µ ) ≤ K ( ν, b µ )+ αN tanh( γN ) + β (cid:26) ν (cid:20) log nb π h exp (cid:8) N log (cid:2) cosh( γN ) (cid:3)b π ( m ′ ) (cid:9)io(cid:21)(cid:27) .3. 
Two step localization αN tanh( λN )(1 − γλ ) (cid:26) ν (cid:20) log nb π h exp (cid:8) N γλ log (cid:2) cosh( λN ) (cid:3)b π ( m ′ ) (cid:9)io(cid:21)(cid:27) + log (b µ "(cid:20)b π n exp h N log (cid:2) cosh( γN ) (cid:3)b π ( m ′ ) io(cid:21) αN tanh( γN ) − β × (cid:20)b π n exp h N γλ log (cid:2) cosh( λN ) (cid:3)b π ( m ′ ) io(cid:21) αN tanh( λN )(1 − γλ ) × exp (cid:20) α log[cosh( λN )]tanh( λN ) ( ν b π ) ⊗ b π ( m ′ ) (cid:21) + (cid:20) N tanh( γN ) + β + 1 N tanh( γN ) − β + 1 + γλ N tanh( λN ) (cid:0) − γλ (cid:1) (cid:21) log (cid:0) ǫ (cid:1) . This result is satisfactory, but in the same time hints at some possible improve-ment in the choice of the localized prior µ , which is here somewhat lacking a varianceterm. We will consider in the remainder of this section the use of(2.38) µ = µ exp[ − απ ( R ) − ξ e π ⊗ e π ( M ′ ) , where ξ is some positive real constant and e π = π exp( − e βR ) is some appropriateconditional prior distribution with positive real parameter e β . With this new choice K ( ν, µ ) = α ( ν − µ ) π ( R ) + ξ ( ν − µ )( e π ⊗ e π )( M ′ ) + K ( ν, µ ) − K ( µ, µ ) . We already know how to deal with the first factor α ( ν − µ ) π ( R ), since the com-putations we made to give it an empirical upper bound were valid for any choiceof the localized prior distribution µ . Let us now deal with ξ ( ν − µ )( e π ⊗ e π )( M ′ ).Since m ′ ( θ, θ ′ ) is a sum of independent Bernoulli random variables, we can easilygeneralize the result of Theorem 1.1.4 (page 4) to prove that with P probability atleast 1 − ǫN (cid:2) − exp( − ζN ) (cid:3) ν ( e π ⊗ e π )( M ′ ) ≤ ζ Φ ζN (cid:2) ν ( e π ⊗ e π )( M ′ ) (cid:3) ≤ ζν ( e π ⊗ e π )( m ′ ) + K ( ν, µ ) − log( ǫ ) . In the same way, with P probability at least 1 − ǫ , − N (cid:2) exp( ζN ) − (cid:3) µ ( e π ⊗ e π )( M ′ ) ≤ − ζ Φ − ζN (cid:2) µ ( e π ⊗ e π )( M ′ ) (cid:3) ≤ − ζµ ( e π ⊗ e π )( m ′ ) − log( ǫ ) . We would like now to replace ( e π ⊗ e π )( m ′ ) with an empirical quantity. In order to dothis, we will use an entropy bound. Indeed for any conditional posterior distribution ρ : Ω × M → M (Θ), ν (cid:2) K ( ρ, e π ) (cid:3) = e βν ( ρ − e π )( R ) + ν (cid:2) K ( ρ, π ) − K ( e π, π ) (cid:3) ≤ e βN tanh( γN ) (cid:26) γν ( ρ − e π )( r ) + N log (cid:2) cosh( γN ) (cid:3) ν ( ρ ⊗ e π )( m ′ )+ K ( ν, µ ) + ν (cid:2) K ( ρ, e π ) (cid:3) − log( ǫ ) (cid:27) + ν (cid:2) K ( ρ, π ) − K ( e π, π ) (cid:3) . Chapter 2. Comparing posterior distributions to Gibbs priors Thus choosing e β = N tanh( γN ), γν ( e π − ρ )( r ) + ν (cid:2) K ( e π, π ) − K ( ρ, π ) (cid:3) ≤ N log (cid:2) cosh( γN ) (cid:3) ν ( ρ ⊗ e π )( m ′ ) + K ( ν, µ ) − log( ǫ ) . Choosing ρ = b π , we get ν (cid:2) K ( e π, b π ) (cid:3) ≤ N log (cid:2) cosh( γN ) (cid:3) ν ( b π ⊗ e π )( m ′ ) + K ( ν, µ ) − log( ǫ ) . This implies that ξν ( b π ⊗ e π )( m ′ ) = ν ne π (cid:2) ξ b π ( m ′ ) (cid:3) − K ( e π, b π ) o + ν (cid:2) K ( e π, b π ) (cid:3) ≤ ν n log hb π (cid:8) exp (cid:2) ξ b π ( m ′ ) (cid:3)(cid:9)io + N log (cid:2) cosh( γN ) (cid:3) ν ( b π ⊗ e π )( m ′ ) + K ( ν, µ ) − log( ǫ ) . Thus (cid:8) ξ − N log (cid:2) cosh( γN ) (cid:3)(cid:9) ν ( b π ⊗ e π )( m ′ ) ≤ ν n log hb π (cid:8) exp (cid:2) ξ b π ( m ′ ) (cid:3)(cid:9)io + K ( ν, µ ) − log( ǫ )and ν (cid:2) K ( e π, b π ) (cid:3) ≤ (cid:18) ξN log[cosh( γN )] − (cid:19) − (cid:20) ν n log hb π (cid:8) exp (cid:2) ξ b π ( m ′ ) (cid:3)(cid:9)io + K ( ν, µ ) − log( ǫ ) (cid:21) + K ( ν, µ ) − log( ǫ ) . 
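The localized priors of this section are parametrized through the constants $N\tanh(\gamma/N)$, $N\sinh(\gamma/N)$ and $N\log[\cosh(\gamma/N)]$. A quick numerical check (the value $N = 1000$ is an arbitrary illustration) confirms that the first two stay very close to $\gamma$ while the third behaves like $\gamma^2/(2N)$, as long as $\gamma$ is small compared with $N$, so that the hyperbolic corrections only become noticeable for large values of $\gamma$.

```python
import numpy as np

N = 1000.0
for gamma in (1.0, 10.0, 100.0, 500.0):
    print(gamma,
          N * np.sinh(gamma / N),           # close to gamma when gamma << N
          N * np.tanh(gamma / N),           # close to gamma when gamma << N
          N * np.log(np.cosh(gamma / N)),   # close to gamma**2 / (2 N)
          gamma ** 2 / (2 * N))
```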
Taking for simplicity ξ = 2 N log (cid:2) cosh( γN ) (cid:3) and noticing that2 N log (cid:2) cosh( γN ) (cid:3) = − N log (cid:0) − e β N (cid:1) , we get Theorem 2.3.7 . Let us put e π = π exp( − e βR ) and b π = π exp( − γr ) , where γ is somearbitrary positive real constant and e β = N tanh( γN ) , so that γ = N log (cid:16) e βN − e βN (cid:17) .With P probability at least − ǫ , ν (cid:2) K ( e π, b π ) (cid:3) ≤ ν (cid:20) log nb π h exp (cid:8) N log (cid:2) cosh( γN ) (cid:3)b π ( m ′ ) (cid:9)io(cid:21) + 2 (cid:2) K ( ν, µ ) − log( ǫ ) (cid:3) . As a consequence ζν ( e π ⊗ e π )( m ′ ) = ζν ( e π ⊗ e π )( m ′ ) − ν (cid:2) K ( e π ⊗ e π, b π ⊗ b π ) (cid:3) + 2 ν (cid:2) K ( e π, b π ) (cid:3) ≤ ν n log hb π ⊗ b π (cid:2) exp( ζm ′ ) (cid:3)io + 2 ν (cid:20) log nb π h exp (cid:8) N log (cid:2) cosh( γN ) (cid:3)b π ( m ′ ) (cid:9)io(cid:21) + 4 (cid:2) K ( ν, µ ) − log( ǫ ) (cid:3) . Let us take for the sake of simplicity ζ = 2 N log (cid:2) cosh( γN ) (cid:3) , to get ζν ( e π ⊗ e π )( m ′ ) ≤ ν n log hb π ⊗ b π (cid:2) exp( ζm ′ ) (cid:3)io + 4 (cid:2) K ( ν, µ ) − log( ǫ ) (cid:3) . This proves .3. Two step localization Proposition 2.3.8 . Let us consider some arbitrary prior distribution µ ∈ M ( M ) and some arbitrary conditional prior distribution π : M → M (Θ) . Let e β < N besome positive real constant. Let us put e π = π exp( − e βR ) and b π = π exp( − γr ) , with e β = N tanh( γN ) . Moreover let us put ζ = 2 N log (cid:2) cosh( γN ) (cid:3) . With P probability atleast − ǫ , for any posterior distribution ν ∈ M ( M ) , ν ( e π ⊗ e π )( M ′ ) ≤ ν n log hb π ⊗ b π (cid:2) exp( ζm ′ ) (cid:3)io + 5 (cid:2) K ( ν, µ ) − log( ǫ ) (cid:3) N (cid:2) − exp( − ζN ) (cid:3) = 1 N tanh( γN ) (cid:26) ν (cid:20) log nb π ⊗ b π h exp (cid:8) N log (cid:2) cosh( γN ) (cid:3) m ′ (cid:9)io(cid:21) + 5 (cid:2) K ( ν, µ ) − log( ǫ ) (cid:3)(cid:27) . In the same way, − ζµ ( e π ⊗ e π )( m ′ ) ≤ µ n log hb π ⊗ b π (cid:2) exp( − ζm ′ ) (cid:3)io + 2 µ (cid:20) log nb π h exp (cid:8) N log (cid:2) cosh( γN ) (cid:3)b π ( m ′ ) (cid:9)io(cid:21) − ǫ )and thus − µ ( e π ⊗ e π )( M ′ ) ≤ N (cid:2) exp( ζN ) − (cid:3) (cid:26) µ n log hb π ⊗ b π (cid:2) exp( − ζm ′ ) (cid:3)io + 2 µ (cid:20) log nb π h exp (cid:8) N log (cid:2) cosh( γN ) (cid:3)b π ( m ′ ) (cid:9)io(cid:21) − ǫ ) (cid:27) . Here we have purposely kept ζ as an arbitrary positive real constant, to be tunedlater (in order to be able to strengthen more or less the compensation of varianceterms).We are now properly equipped to estimate the divergence with respect to µ , thechoice of prior distribution made in equation (2.38, page 105). Indeed we can nowwrite (cid:20) − αN tanh( γN ) + β − αN tanh( λN ) (cid:0) − γλ (cid:1) − ξN tanh( γN ) (cid:21) K ( ν, µ ) ≤ αN tanh( γN ) + β (cid:26) ν (cid:20) log nb π h exp (cid:8) N log (cid:2) cosh( γN ) (cid:3)b π ( m ′ ) (cid:9)io(cid:21) − log( ǫ ) (cid:27) + αN tanh( γN ) − β (cid:26) µ (cid:20) log nb π h exp (cid:8) N log (cid:2) cosh( γN ) (cid:3)b π ( m ′ ) (cid:9)io(cid:21) − log( ǫ ) (cid:27) + αN tanh( λN ) ( λ ( ν − µ ) b π ( r ) + N log (cid:2) cosh( λN ) (cid:3) ( ν b π ) ⊗ ( µ b π )( m ′ )+ (cid:16) − γλ (cid:17) − ( ν + µ ) (cid:20) log nb π h exp (cid:8) N γλ log (cid:2) cosh( λN ) (cid:3)b π ( m ′ ) (cid:9)io(cid:21) − γN − γN log( ǫ ) ) Chapter 2. 
Comparing posterior distributions to Gibbs priors + ξN tanh( γN ) (cid:26) ν (cid:20) log nb π ⊗ b π h exp (cid:8) N log (cid:2) cosh( γN ) (cid:3) m ′ (cid:9)io(cid:21) − ǫ ) (cid:27) + ξN (cid:2) exp( ζN ) − (cid:3) (cid:26) µ n log hb π ⊗ b π (cid:2) exp( − ζm ′ ) (cid:3)io + 2 µ (cid:20) log nb π h exp (cid:8) N log (cid:2) cosh( γN ) (cid:3)b π ( m ′ ) (cid:9)io(cid:21) − ǫ ) (cid:27) . + K ( ν, µ ) − K ( µ, µ ) . It remains now only to replace in the right-hand side of this inequality µ withthe worst possible posterior distribution to obtain Theorem 2.3.9 . Let λ > γ > β , ζ , α and ξ be arbitrary positive real constants.Let us use the notation π = π exp( − βR ) , e π = π exp( − N tanh( γN ) R ) , b π = π exp( − γr ) , µ = µ exp[ − απ ( R ) − ξ e π ⊗ e π ( M ′ )] and let us define the posterior distribution b µ : Ω → M ( M ) by d b µdµ ∼ exp (cid:26) − αλN tanh( λN ) b π ( r ) + ξN (cid:2) exp( ζN ) − (cid:3) log nb π ⊗ b π (cid:2) exp( − ζm ′ ) (cid:3)o(cid:27) . Let us assume moreover that αN tanh( γN ) + β + αN tanh( λN )(1 − γλ ) + 5 ξN tanh( γN ) < . With P probability at least − ǫ , for any posterior distribution ν : Ω → M ( M ) , K ( ν, µ ) ≤ (cid:20) − αN tanh( γN ) + β − αN tanh( λN ) (cid:0) − γλ (cid:1) − ξN tanh( γN ) (cid:21) − ( K ( ν, b µ )+ αN tanh( γN ) + β (cid:26) ν (cid:20) log nb π h exp (cid:8) N log (cid:2) cosh( γN ) (cid:3)b π ( m ′ ) (cid:9)io(cid:21)(cid:27) + αN tanh( λN ) (cid:0) − γλ (cid:1) (cid:26) ν (cid:20) log nb π h exp (cid:8) N γλ log (cid:2) cosh( λN ) (cid:3)b π ( m ′ ) (cid:9)io(cid:21)(cid:27) + ξN tanh( γN ) (cid:26) ν (cid:20) log nb π ⊗ b π h exp (cid:8) N log (cid:2) cosh( γN ) (cid:3) m ′ (cid:9)io(cid:21)(cid:27) + ξN (cid:2) exp( ζN ) − (cid:3) (cid:26) ν n log hb π ⊗ b π (cid:2) exp( − ζm ′ ) (cid:3)io(cid:27) + log (cid:26)b µ (cid:20)nb π h exp (cid:8) N log (cid:2) cosh( γN ) (cid:3)b π ( m ′ ) (cid:9)io αN tanh( γN ) − β × nb π h exp (cid:8) N γλ log (cid:2) cosh( λN ) (cid:3)b π ( m ′ ) (cid:9)io αN tanh( λN ) (cid:0) − γλ (cid:1) × nb π h exp (cid:8) N log (cid:2) cosh( γN ) (cid:3)b π ( m ′ ) (cid:9)io ξN (cid:2) exp( ζN ) − (cid:3) .3. Two step localization × exp n N log (cid:2) cosh( λN ) (cid:3)(cid:2) ( ν b π ) ⊗ b π (cid:3) ( m ′ ) o(cid:21)(cid:27) + " αN tanh( γN ) + β + αN tanh( γN ) − β + 2 α (cid:0) γN (cid:1) N tanh( λN ) (cid:0) − γλ (cid:1) + 5 ξN tanh( γN ) + 5 ξN (cid:2) exp( ζN ) − (cid:3) log (cid:0) ǫ (cid:1)) . The interest of this theorem lies in the presence of a variance term in the localizedposterior distribution b µ , which with a suitable choice of parameters seems to be aninteresting option in the case when there are nested models: in this situation theremay be a need to prevent integration with respect to b µ in the right-hand side toput weight on wild oversized models with large variance terms. Moreover, the right-hand side being empirical, parameters can be, as usual, optimized from data usinga union bound on a grid of candidate values.If one is only interested in the general shape of the result, a simplified inequalityas the one below may suffice: Corollary 2.3.10 . For any positive real constants λ > γ > β , ζ , α and ξ , let ususe the same notation as in Theorem 2.3.9 (page 108). 
Let us put moreover A = αN tanh( γN ) + β + αN tanh( λN ) (cid:0) − γλ (cid:1) + 5 ξN tanh( γN ) ,A = αN tanh( γN ) + β + αN tanh( λN ) (cid:0) − γλ (cid:1) + 3 ξN tanh( γN ) A = ξN (cid:2) exp (cid:0) ζN (cid:1) − (cid:3) A = αN tanh( γN ) − β + αN tanh( λN )(1 − γλ ) + 2 ξN [exp( ζN ) − ,A = αN tanh( γN ) + β + αN tanh( γN ) − β + 2 α (cid:0) γN (cid:1) N tanh( λN ) (cid:0) − γλ (cid:1) + 5 ξN tanh( γN ) + 5 ξN (cid:2) exp( ζN ) − (cid:3) ,C = 2 N log (cid:2) cosh (cid:0) λN (cid:1)(cid:3) ,C = N log (cid:2) cosh (cid:0) λN (cid:1)(cid:3) . Let us assume that A < . With P probability at least − ǫ , for any posteriordistribution ν : Ω → M ( M ) , K ( ν, µ ) ≤ K ( ν, α, β, γ, λ, ξ, ζ, ǫ ) def = (cid:0) − A (cid:1) − ( K ( ν, b µ )+ A ν h log (cid:16)b π ⊗ b π (cid:2) exp (cid:0) C m ′ (cid:1)(cid:3)(cid:17)i + A ν h log (cid:16)b π ⊗ b π (cid:2) exp (cid:0) − ζm ′ (cid:1)(cid:3)(cid:17)i + log (cid:26)b µ (cid:20)hb π (cid:16) exp (cid:2) C b π ( m ′ ) (cid:3)(cid:17)i A exp (cid:16) C (cid:2) ( ν b π ) ⊗ b π (cid:3) ( m ′ ) (cid:17)(cid:21)(cid:27) + A log (cid:0) ǫ (cid:1)) . Chapter 2. Comparing posterior distributions to Gibbs priors Putting this corollary together with Corollary 2.3.5 (page 102), we obtain Theorem 2.3.11 . Let us consider the notation introduced in Corollary 2.3.5 (page102) and in Theorem 2.3.9 (page 108) and its Corollary 2.3.10 (page 109). Let usconsider real positive parameters λ , γ ′ < λ ′ and γ ′ < λ ′ . Let us consider also twosets of parameters α i , β i , γ i , λ i , ξ i , ζ i ,where i = 1 , , both satisfying the conditionsstated in Corollary 2.3.10 (page 109). With P probability at least − ǫ , for anyposterior distributions ν , ν : Ω → M ( M ) , any conditional posterior distributions ρ , ρ : Ω × M → M (cid:0) Θ (cid:1) , − N log n − tanh (cid:0) λN (cid:1)(cid:0) ν ρ − ν ρ (cid:1) ( R ) o ≤ λ (cid:0) ν ρ − ν ρ (cid:1) ( r )+ N log (cid:2) cosh (cid:0) λN (cid:1)(cid:3)(cid:0) ν ρ (cid:1) ⊗ (cid:0) ν ρ (cid:1)(cid:0) m ′ (cid:1) + K ′ (cid:0) ν , ρ , γ ′ , λ ′ , ǫ (cid:1) + K ′ (cid:0) ν , ρ , γ ′ , λ ′ , ǫ (cid:1) + 11 − γ ′ λ ′ K (cid:0) ν , α , β , γ , λ , ξ , ζ , ǫ (cid:1) + 11 − γ ′ λ ′ K (cid:0) ν , α , β , γ , λ , ξ , ζ , ǫ (cid:1) − log (cid:0) ǫ (cid:1) . This theorem provides, using a union bound argument to further optimize theparameters, an empirical bound for ν ρ ( R ) − ν ρ ( R ), which can serve to builda selection algorithm exactly in the same way as what was done in Theorem 2.2.4(page 73). This represents the highest degree of sophistication that we will achievein this monograph, as far as model selection is concerned: this theorem shows thatit is indeed possible to derive a selection scheme in which localization is performedin two steps and in which the localization of the model selection itself, as opposedto the localization of the estimation in each model, includes a variance term as wellas a bias term, so that it should be possible to localize the choice of nested mod-els, something that would not have been feasible with the localization techniquesexposed in the previous sections of this study. 
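The theorem just stated yields, for each ordered pair of candidate posterior distributions, an empirical upper bound on their relative expected error rate. The following sketch only records one generic way of turning such pairwise bounds into a selection rule; it is not the precise scheme of Theorem 2.2.4, and the bound matrix is made of arbitrary illustrative numbers.

```python
import numpy as np

def select(bound_matrix):
    """bound_matrix[i][j] is an empirical upper bound on (risk of candidate i) minus
    (risk of candidate j).  Pick the candidate whose worst pairwise guarantee is the
    smallest; this is only the generic shape of a relative-bound selection rule."""
    B = np.array(bound_matrix, dtype=float)
    np.fill_diagonal(B, -np.inf)
    worst = B.max(axis=1)            # worst guaranteed excess risk of each candidate
    return int(np.argmin(worst)), worst

# arbitrary illustrative bound matrix for three candidate posterior distributions
B = [[0.0,  0.03, -0.01],
     [0.06, 0.0,   0.04],
     [0.02, 0.01,  0.0]]
print(select(B))                     # candidate 2 is never much worse than the others
```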
We should point out, however, that more sophisticated does not necessarily mean more efficient: as the reader may have noticed, sophistication comes at a price in terms of the complexity of the estimation schemes, with some possible loss of accuracy in the constants that can mar the benefits of using an asymptotically more efficient method for small sample sizes.

We will do the hurried reader a favour: we will not launch into a study of the theoretical properties of this selection algorithm, although it is clear that all the tools needed are at hand!

We would like, as a conclusion to this chapter, to put forward a simple idea: this approach to model selection revolves around entropy estimates concerned with the divergence of posterior distributions with respect to localized prior distributions. Moreover, this localization of the prior distribution is more effectively done in several steps in some situations, and it is worth mentioning that these situations include the typical case of selection from a family of parametric models. Finally, the whole story relies upon estimating the relative generalization error rate of one posterior distribution with respect to some local prior distribution as well as with respect to another posterior distribution, because these relative rates can be estimated more accurately than absolute generalization error rates, at least as soon as no classification model of reasonable size provides a good match to the training sample, meaning that the classification problem is either difficult or noisy.

Chapter 3

Transductive PAC-Bayesian learning

In this chapter the observed sample $(X_i, Y_i)_{i=1}^{N}$ will be supplemented with a test or shadow sample $(X_i, Y_i)_{i=N+1}^{(k+1)N}$. This point of view, called transductive classification, has been introduced by V. Vapnik. It may be justified in different ways.

On the practical side, one interest of the transductive setting is that it is often a lot easier to collect examples than it is to label them, so that it is not unrealistic to assume that we indeed have two training samples, one labelled and one unlabelled. It also covers the case when a batch of patterns is to be classified and we are allowed to observe the whole batch before issuing the classification.

On the mathematical side, considering a shadow sample proves technically fruitful. Indeed, when introducing the Vapnik–Cervonenkis entropy and Vapnik–Cervonenkis dimension concepts, as well as when dealing with compression schemes, the transductive setting is a useful detour, although the inductive setting is our final concern. In this second scenario, intermediate technical results involving the shadow sample are integrated with respect to unobserved random variables in a second stage of the proofs.

Let us now describe the changes to be made to the previous notation to adapt it to the transductive setting. The distribution $P$ will be a probability measure on the canonical space $\Omega = (\mathcal{X} \times \mathcal{Y})^{(k+1)N}$, and $(X_i, Y_i)_{i=1}^{(k+1)N}$ will be the canonical process on this space (that is, the coordinate process). Unless explicitly mentioned, the parameter $k$ indicating the size of the shadow sample will remain fixed. Assuming the shadow sample size is a multiple of the training sample size is convenient without significantly restricting generality. For a while, we will use a weaker assumption than independence, assuming that $P$ is partially exchangeable, since this is all we need in the proofs.

Definition 3.1.1. For $i = 1, \dots, N$, let $\tau_i : \Omega \to \Omega$ be defined for any
$\omega = (\omega_j)_{j=1}^{(k+1)N} \in \Omega$ by
$$\tau_i(\omega)_{i+jN} = \omega_{i+(j-1)N}, \quad j = 1, \dots, k, \qquad \tau_i(\omega)_i = \omega_{i+kN},$$
and $\tau_i(\omega)_{m+jN} = \omega_{m+jN}$, $m \neq i$, $m = 1, \dots, N$, $j = 0, \dots, k$.

Clearly, if we arrange the $(k+1)N$ samples in an $N \times (k+1)$ array, $\tau_i$ performs a circular permutation of the $k+1$ entries of the $i$th row, leaving the other rows unchanged. Moreover, all the circular permutations of the $i$th row are of the form $\tau_i^j$, $j$ ranging from $0$ to $k$.

The probability distribution $P$ is said to be partially exchangeable if for any $i = 1, \dots, N$, $P \circ \tau_i^{-1} = P$. This means equivalently that for any bounded measurable function $h : \Omega \to \mathbb{R}$, $P(h \circ \tau_i) = P(h)$. In the same way, a function $h$ defined on $\Omega$ will be said to be partially exchangeable if $h \circ \tau_i = h$ for any $i = 1, \dots, N$. Accordingly, a posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta, \mathcal{T})$ will be said to be partially exchangeable when $\rho(\omega, A) = \rho[\tau_i(\omega), A]$, for any $\omega \in \Omega$, any $i = 1, \dots, N$ and any $A \in \mathcal{T}$.

For any bounded measurable function $h$, let us define $T_i(h) = \frac{1}{k+1} \sum_{j=0}^{k} h \circ \tau_i^j$. Let $T(h) = T_N \circ \dots \circ T_1(h)$. For any partially exchangeable probability distribution $P$ and any bounded measurable function $h$, $P[T(h)] = P(h)$.

Let us put
- $\sigma_i(\theta) = \mathbb{1}[f_\theta(X_i) \neq Y_i]$, indicating the success or failure of $f_\theta$ in predicting $Y_i$ from $X_i$,
- $r_1(\theta) = \frac{1}{N} \sum_{i=1}^{N} \sigma_i(\theta)$, the empirical error rate of $f_\theta$ on the observed sample,
- $r_2(\theta) = \frac{1}{kN} \sum_{i=N+1}^{(k+1)N} \sigma_i(\theta)$, the error rate of $f_\theta$ on the shadow sample,
- $\bar{r}(\theta) = \frac{r_1(\theta) + k r_2(\theta)}{k+1} = \frac{1}{(k+1)N} \sum_{i=1}^{(k+1)N} \sigma_i(\theta)$, the global error rate of $f_\theta$,
- $R_i(\theta) = P[f_\theta(X_i) \neq Y_i]$, the expected error rate of $f_\theta$ on the $i$th input,
- $\bar{R}(\theta) = \frac{1}{N} \sum_{i=1}^{N} R_i(\theta) = P[r_1(\theta)] = P[r_2(\theta)]$, the average expected error rate of $f_\theta$ on all inputs.

We will allow for posterior distributions $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$ depending on the shadow sample. The most interesting ones will anyhow be independent of the shadow labels $Y_{N+1}, \dots, Y_{(k+1)N}$. We will be interested in the conditional expected error rate of the randomized classification rule described by $\rho$ on the shadow sample, given the observed sample, that is, $P[\rho(r_2) \mid (X_i, Y_i)_{i=1}^{N}]$. This is a natural extension of the notion of generalization error rate: it is indeed the error rate to be expected when the randomized classification rule described by the posterior distribution $\rho$ is applied to the shadow sample (which should in this case more appropriately be called the test sample).

Let us assume that $P$ is invariant under any permutation of any row, meaning that $P[h(\omega \circ s)] = P[h(\omega)]$ for all $s \in \mathfrak{S}(\{i + jN;\ j = 0, \dots, k\})$ and all $i = 1, \dots, N$, where $\mathfrak{S}(A)$ is the set of permutations of $A$, extended to $\{1, \dots, (k+1)N\}$ so as to be the identity outside of $A$. In other words, $P$ is assumed to be invariant under any permutation which keeps the rows unchanged. In this case, if $\rho$ is invariant under any permutation of any row of the shadow sample, meaning that $\rho(\omega \circ s) = \rho(\omega) \in \mathcal{M}_+^1(\Theta)$, $s \in \mathfrak{S}(\{i + jN;\ j = 1, \dots, k\})$, $i = 1, \dots, N$, then $P[\rho(r_2) \mid (X_i, Y_i)_{i=1}^{N}] = \frac{1}{N} \sum_{i=1}^{N} P[\rho(\sigma_{i+N}) \mid (X_i, Y_i)_{i=1}^{N}]$, meaning that the expectation can be taken on a restricted shadow sample of the same size as the observed sample.
If moreover the rows are equidistributed, meaning that theirmarginal distributions are equal, then P (cid:2) ρ ( r ) | ( X i , Y i ) Ni =1 (cid:3) = P (cid:2) ρ ( σ N +1 ) | ( X i , Y i ) Ni =1 (cid:3) .This means that under these quite commonly fulfilled assumptions, the expectationcan be taken on a single new object to be classified, our study thus covers the casewhen only one of the patterns from the shadow sample is to be labelled and one isinterested in the expected error rate of this single labelling. Of course, in the casewhen P is i.i.d. and ρ depends only on the training sample ( X i , Y i ) Ni =1 , we fall backon the usual criterion of performance P (cid:2) ρ ( r ) | ( Z i ) Ni =1 (cid:3) = ρ ( R ) = ρ ( R ). Using an obvious factorization, and considering for the moment a fixed value of θ and any partially exchangeable positive real measurable function λ : Ω → R + , wecan compute the log-Laplace transform of r under T , which acts like a conditionalprobability distribution:log n T (cid:2) exp( − λr ) (cid:3)o = N X i =1 log n T i (cid:2) exp( − λN σ i ) (cid:3)o ≤ N log (cid:26) N N X i =1 T i h exp (cid:0) − λN σ i (cid:1)i(cid:27) = − λ Φ λN ( r ) , where the function Φ λN was defined by equation (1.1, page 2). Remarking that T n exp h λ (cid:2) Φ λN ( r ) − r (cid:3)io = exp (cid:2) λ Φ λN ( r ) (cid:3) T (cid:2) exp( − λr ) (cid:3) we obtain Lemma 3.1.1 . For any θ ∈ Θ and any partially exchangeable positive real mea-surable function λ : Ω → R + , T n exp h λ (cid:8) Φ λN (cid:2) r ( θ ) (cid:3) − r ( θ ) (cid:9)io ≤ . We deduce from this lemma a result analogous to the inductive case: Theorem 3.1.2 . For any partially exchangeable positive real measurable function λ : Ω × Θ → R + , for any partially exchangeable posterior distribution π : Ω → M (Θ) , P (cid:26) exp (cid:20) sup ρ ∈ M (Θ) ρ h λ (cid:2) Φ λN ( r ) − r (cid:3)i − K ( ρ, π ) (cid:21)(cid:27) ≤ . Chapter 3. Transductive PAC-Bayesian learning The proof is deduced from the previous lemma, using the fact that π is partiallyexchangeable: P (cid:26) exp (cid:20) sup ρ ∈ M (Θ) ρ h λ (cid:2) Φ λN ( r ) − r (cid:3)i − K ( ρ, π ) (cid:21)(cid:27) = P (cid:26) π n exp h λ (cid:2) Φ λN ( r ) − r (cid:3)io(cid:27) = P (cid:26) T π n exp h λ (cid:2) Φ λN ( r ) − r (cid:3)io(cid:27) = P (cid:26) π n T exp h λ (cid:2) Φ λN ( r ) − r (cid:3)io(cid:27) ≤ . Introducing in the same way m ′ ( θ, θ ′ ) = 1 N N X i =1 (cid:12)(cid:12)(cid:12) (cid:2) f θ ( X i ) = Y i (cid:3) − (cid:2) f θ ′ ( X i ) = Y i (cid:3)(cid:12)(cid:12)(cid:12) and m ( θ, θ ′ ) = 1( k + 1) N ( k +1) N X i =1 (cid:12)(cid:12)(cid:12) (cid:2) f θ ( X i ) = Y i (cid:3) − (cid:2) f θ ′ ( X i ) = Y i (cid:3)(cid:12)(cid:12)(cid:12) , we could prove along the same line of reasoning Theorem 3.1.3 . For any real parameter λ , any e θ ∈ Θ , any partially exchangeableposterior distribution π : Ω → M (Θ) , P (cid:26) exp (cid:20) sup ρ ∈ M (Θ) λ h ρ (cid:8) Ψ λN (cid:2) r ( · ) − r ( e θ ) , m ( · , e θ ) (cid:3)(cid:9) − (cid:2) ρ ( r ) − r ( e θ ) (cid:3)i − K ( ρ, π ) (cid:21)(cid:27) ≤ , where the function Ψ λN was defined by equation (1.21, page 35). Theorem 3.1.4 . For any real constant γ , for any e θ ∈ Θ , for any partially ex-changeable posterior distribution π : Ω → M (Θ) , P ( exp " sup ρ ∈ M (Θ) (cid:26) − N ρ n log h − tanh (cid:0) γN (cid:1)(cid:2) r ( · ) − r ( e θ ) (cid:3)io − γ (cid:2) ρ ( r ) − r ( e θ ) (cid:3) − N log (cid:2) cosh (cid:0) γN (cid:1)(cid:3) ρ (cid:2) m ′ ( · , e θ ) (cid:3) − K ( ρ, π ) (cid:27) ≤ . 
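Lemma 3.1.1 lends itself to a brute-force check on a tiny example, since, the rows being handled independently, $T$ acts as the average over the $(k+1)^N$ combinations of circular shifts of the rows of the $N \times (k+1)$ array of Definition 3.1.1. The sketch below assumes, consistently with the factorization used in the proof, that $\Phi_a(p) = -a^{-1}\log[1 - (1 - e^{-a})p]$ (equation 1.1); the array, the constant $\lambda$ and the sizes are arbitrary.

```python
import numpy as np
from itertools import product

def Phi(a, p):
    """Phi_a(p) = -(1/a) log[1 - (1 - exp(-a)) p], the function of equation (1.1)."""
    return -np.log1p(-(1.0 - np.exp(-a)) * p) / a

rng = np.random.default_rng(3)
N, k, lam = 4, 2, 7.0
sigma = rng.integers(0, 2, size=(N, k + 1)).astype(float)  # error indicators, N x (k+1)
r_bar = sigma.mean()                          # global error rate, invariant under shifts

values = []
for shifts in product(range(k + 1), repeat=N):              # all (k+1)**N row shifts
    shifted = np.stack([np.roll(sigma[i], s) for i, s in enumerate(shifts)])
    r_1 = shifted[:, 0].mean()                # training error rate after the shifts
    values.append(np.exp(lam * (Phi(lam / N, r_bar) - r_1)))
print(np.mean(values))                        # never exceeds 1, as Lemma 3.1.1 claims
```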
This last theorem can be generalized to give

Theorem 3.1.5. For any real constant $\gamma$, for any partially exchangeable posterior distributions $\pi_1, \pi_2 : \Omega \to \mathcal{M}_+^1(\Theta)$,
$$
P\Biggl(\exp\Biggl[\sup_{\rho_1, \rho_2 \in \mathcal{M}_+^1(\Theta)} \Bigl\{ -N \log\bigl\{1 - \tanh\bigl(\tfrac{\gamma}{N}\bigr)\bigl[\rho_1(\bar{r}) - \rho_2(\bar{r})\bigr]\bigr\} - \gamma\bigl[\rho_1(r_1) - \rho_2(r_1)\bigr] - N \log\bigl[\cosh\bigl(\tfrac{\gamma}{N}\bigr)\bigr]\,(\rho_1 \otimes \rho_2)(\bar{m}') - K(\rho_1, \pi_1) - K(\rho_2, \pi_2) \Bigr\}\Biggr]\Biggr) \leq 1.
$$

To conclude this section, we see that the basic theorems of transductive PAC-Bayesian classification have exactly the same form as the basic inequalities of inductive classification, Theorems 1.1.4 (page 4), 1.4.2 (page 36) and 1.4.3 (page 37), with $R(\theta)$ replaced with $\bar{r}(\theta)$, $r(\theta)$ replaced with $r_1(\theta)$ and $M'(\theta, \tilde{\theta})$ replaced with $\bar{m}'(\theta, \tilde{\theta})$. Thus all the results of the first two chapters remain true under the hypotheses of transductive classification, with $R(\theta)$ replaced with $\bar{r}(\theta)$, $r(\theta)$ replaced with $r_1(\theta)$ and $M'(\theta, \tilde{\theta})$ replaced with $\bar{m}'(\theta, \tilde{\theta})$.

Consequently, in the case when the unlabelled shadow sample is observed, it is possible to improve on the Vapnik bounds to be discussed hereafter by using an explicit partially exchangeable posterior distribution $\pi$ and resorting to localized or to relative bounds (at least in the case of unlimited computing resources, which of course may still be unrealistic in many real world situations, and with the caveat, to be recalled in the conclusion of this study, that for small sample sizes and comparatively complex classification models the improvement may not be so decisive).

Let us notice also that the transductive setting, when experimentally available, has the advantage that
$$
\bar{d}(\theta, \theta') = \frac{1}{(k+1)N} \sum_{i=1}^{(k+1)N} \mathbb{1}\bigl[f_{\theta'}(X_i) \neq f_\theta(X_i)\bigr] \;\geq\; \bar{m}'(\theta, \theta') \;\geq\; \bar{r}(\theta) - \bar{r}(\theta'), \qquad \theta, \theta' \in \Theta,
$$
is observable in this context, providing an empirical upper bound for the difference $\bar{r}(\hat{\theta}) - \rho(\bar{r})$ for any non-randomized estimator $\hat{\theta}$ and any posterior distribution $\rho$, namely
$$
\bar{r}(\hat{\theta}) \leq \rho(\bar{r}) + \rho\bigl[\bar{d}(\cdot, \hat{\theta})\bigr].
$$
Thus in the setting of transductive statistical experiments, the PAC-Bayesian framework provides fully empirical bounds for the error rate of non-randomized estimators $\hat{\theta} : \Omega \to \Theta$, even when using a non-atomic prior $\pi$ (or more generally a non-atomic partially exchangeable posterior distribution $\pi$), even when $\Theta$ is not a vector space and even when $\theta \mapsto \bar{R}(\theta)$ cannot be proved to be convex on the support of some useful posterior distribution $\rho$.

3.2. Vapnik bounds for transductive classification

In this section, we will stick to plain unlocalized non-relative bounds. As we have already mentioned (and as it was put forward by Vapnik himself in his seminal works), these bounds are not always superseded by the asymptotically better ones when the sample is of small size: they deserve all our attention for this reason. We will start with the general case of a shadow sample of arbitrary size. We will then discuss the case of a shadow sample of equal size to the training set and the case of a fully exchangeable sample distribution, showing how they can be taken advantage of to sharpen the inequalities.

The great thing with the transductive setting is that we are manipulating only $r_1$ and $r_2$, which can take only a finite number of values and therefore are piecewise constant on $\Theta$. This makes it possible to derive inequalities that will hold uniformly for any value of the parameter $\theta \in \Theta$.
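To make this piecewise constancy concrete, and to check the observable bound $\bar{r}(\hat\theta) \leq \rho(\bar{r}) + \rho[\bar{d}(\cdot,\hat\theta)]$ displayed above, here is a minimal sketch on a one-dimensional thresholding model; the sample, its size, the grid of thresholds and the posterior are arbitrary illustrations, and the sets $\Delta(\theta)$ grouping the parameters with identical answers on the extended sample are introduced just below.

```python
import numpy as np

rng = np.random.default_rng(5)
n_ext = 12                                     # extended sample size (k+1)N, toy value
x = rng.normal(size=n_ext)                     # patterns of the extended sample
y = (x + 0.3 * rng.normal(size=n_ext) > 0).astype(int)     # noisy labels (toy)

thetas = np.linspace(-3.0, 3.0, 500)           # a fine grid of threshold classifiers
preds = (x[None, :] >= thetas[:, None]).astype(int)        # f_theta(X_i) = 1[X_i >= theta]

# r_1, r_2 and r_bar are piecewise constant: only finitely many answer vectors occur
traces = {tuple(row) for row in preds}
print(len(traces), "distinct traces for", len(thetas), "parameters (at most n_ext + 1)")

r_bar = (preds != y[None, :]).mean(axis=1)                      # global error rates
d_bar = (preds[:, None, :] != preds[None, :, :]).mean(axis=2)   # observable disagreements

rho = np.full(len(thetas), 1.0 / len(thetas))  # any posterior distribution on the grid
theta_hat = 137                                # any non-randomized estimator (arbitrary)
print(bool(r_bar[theta_hat] <= rho @ r_bar + rho @ d_bar[:, theta_hat] + 1e-12))
```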
To this purpose, let us consider for any value θ ∈ Θ of the parameter the subset ∆( θ ) ⊂ Θ of parameters θ ′ such that theclassification rule f θ ′ answers the same on the extended sample ( X i ) ( k +1) Ni =1 as f θ .Namely, let us put for any θ ∈ Θ∆( θ ) = (cid:8) θ ′ ∈ Θ; f θ ′ ( X i ) = f θ ( X i ) , i = 1 , . . . , ( k + 1) N (cid:9) . We see immediately that ∆( θ ) is an exchangeable parameter subset on which r and r and therefore also r take constant values. Thus for any θ ∈ Θ we may considerthe posterior ρ θ defined by dρ θ dπ ( θ ′ ) = (cid:2) θ ′ ∈ ∆( θ ) (cid:3) π (cid:2) ∆( θ ) (cid:3) − , and use the fact that ρ θ ( r ) = r ( θ ) and ρ θ ( r ) = r ( θ ), to prove that Lemma 3.2.1 . For any partially exchangeable positive real measurable function λ : Ω × Θ → R such that (3.1) λ ( ω, θ ′ ) = λ ( ω, θ ) , θ ∈ Θ , θ ′ ∈ ∆( θ ) , ω ∈ Ω , and any partially exchangeable posterior distribution π : Ω → M (Θ) , with P prob-ability at least − ǫ , for any θ ∈ Θ , Φ λN (cid:2) r ( θ ) (cid:3) + log (cid:8) ǫπ (cid:2) ∆( θ ) (cid:3)(cid:9) λ ( θ ) ≤ r ( θ ) . We can then remark that for any value of λ independent of ω , the left-hand sideof the previous inequality is a partially exchangeable function of ω ∈ Ω. Thus thisleft-hand side is maximized by some partially exchangeable function λ , namelyarg max λ ( Φ λN (cid:2) r ( θ ) (cid:3) + log (cid:8) ǫπ (cid:2) ∆( θ ) (cid:3)(cid:9) λ ) is partially exchangeable as depending only on partially exchangeable quantities.Moreover this choice of λ ( ω, θ ) satisfies also condition (3.1) stated in the previouslemma of being constant on ∆( θ ), proving Lemma 3.2.2 . For any partially exchangeable posterior distribution π : Ω → M (Θ) , with P probability at least − ǫ , for any θ ∈ Θ and any λ ∈ R + , Φ λN (cid:2) r ( θ ) (cid:3) + log (cid:8) ǫπ (cid:2) ∆( θ ) (cid:3)(cid:9) λ ≤ r ( θ ) . Writing r = r + kr k +1 and rearranging terms we obtain .2. Vapnik bounds for transductive classification Theorem 3.2.3 . For any partially exchangeable posterior distribution π : Ω → M (Θ) , with P probability at least − ǫ , for any θ ∈ Θ , r ( θ ) ≤ k + 1 k inf λ ∈ R + − exp − λN r ( θ ) + log (cid:8) ǫπ (cid:2) ∆( θ ) (cid:3)(cid:9) N ! − exp (cid:0) − λN (cid:1) − r ( θ ) k . If we have a set of binary classification rules { f θ ; θ ∈ Θ } whose Vapnik–Cervo-nenkis dimension is not greater than h , we can choose π such that π (cid:2) ∆( θ ) (cid:3) isindependent of θ and not less than (cid:18) he ( k + 1) N (cid:19) h , as will be proved further on inTheorem 4.2.2 (page 144).Another important setting where the complexity term − log (cid:8) π (cid:2) ∆( θ ) (cid:3)(cid:9) can eas-ily be controlled is the case of compression schemes , introduced by Little et al.(1986). It goes as follows: we are given for each labelled sub-sample ( X i , Y i ) i ∈ J , J ⊂ { , . . . , N } , an estimator of the parameter b θ (cid:2) ( X i , Y i ) i ∈ J (cid:3) = b θ J , J ⊂ { , . . . , N } , | J | ≤ h, where b θ : N G k =1 (cid:0) X × Y (cid:1) k → Θis an exchangeable function providing estimators for sub-samples of arbitrary size.Let us assume that b θ is exchangeable, meaning that for any k = 1 , . . . , N and anypermutation σ of { , . . . , k } b θ (cid:2) ( x i , y i ) ki =1 (cid:3) = b θ (cid:2) ( x σ ( i ) , y σ ( i ) ) ki =1 (cid:3) , ( x i , y i ) ki =1 ∈ (cid:0) X × Y (cid:1) k . In this situation, we can introduce the exchangeable subset nb θ J ; J ⊂ { , . . . 
, ( k + 1) N } , | J | ≤ h o ⊂ Θ , which is seen to contain at most h X j =0 (cid:18) ( k + 1) Nj (cid:19) ≤ (cid:18) e ( k + 1) Nh (cid:19) h classification rules — as will be proved later on in Theorem 4.2.3 (page 144). Notethat we had to extend the range of J to all the subsets of the extended sample,although we will use for estimation only those of the training sample, on whichthe labels are observed. Thus in this case also we can find a partially exchangeableposterior distribution π such that π (cid:2) ∆( b θ J ) (cid:3) ≥ (cid:18) he ( k + 1) N (cid:19) h . We see that the size of the compression scheme plays the same role in this complexitybound as the Vapnik–Cervonenkis dimension for Vapnik–Cervonenkis classes.In these two cases of binary classification with Vapnik–Cervonenkis dimensionnot greater than h and compression schemes depending on a compression set withat most h points, we get a bound of18 Chapter 3. Transductive PAC-Bayesian learning r ( θ ) ≤ k + 1 k inf λ ∈ R + − exp − λN r ( θ ) − h log (cid:16) e ( k +1) Nh (cid:17) − log( ǫ ) N − exp (cid:0) − λN (cid:1) − r ( θ ) k . Let us make some numerical application: when N = 1000 , h = 10 , ǫ = 0 . 01, andinf Θ r = r ( b θ ) = 0 . 2, we find that r ( b θ ) ≤ . k between 15 and 17,and values of λ equal respectively to 965, 968 and 971. For k = 1, we find only r ( b θ ) ≤ . k to be larger than 1. In the case when k = 1, we can improve Theorem 3.1.2 by taking advantage ofthe fact that T i ( σ i ) can take only 3 values, namely 0, 0 . T i ( σ i ) − Φ λN (cid:2) T i ( σ i ) (cid:3) can take only two values, 0 and − Φ λN ( ), because Φ λN (0) = 0and Φ λN (1) = 1. Thus T i ( σ i ) − Φ λN (cid:2) T i ( σ i ) (cid:3) = (cid:2) − | − T i ( σ i ) | (cid:3)(cid:2) − Φ λN ( ) (cid:3) . This shows that in the case when k = 1,log n T (cid:2) exp( − λr ) (cid:3)o = − λr + λN N X i =1 T i ( σ i ) − Φ λN (cid:2) T i ( σ i ) (cid:3) = − λr + λN N X i =1 (cid:2) − | − T i ( σ i ) | (cid:3)(cid:2) − Φ λN ( ) (cid:3) ≤ − λr + λ (cid:2) − Φ λN ( ) (cid:3)(cid:2) − | − r | (cid:3) . Noticing that − Φ λN ( ) = Nλ log (cid:2) cosh( λ N ) (cid:3) , we obtain Theorem 3.2.4 . For any partially exchangeable function λ : Ω × Θ → R + , for anypartially exchangeable posterior distribution π : Ω → M (Θ) , P (cid:26) exp (cid:20) sup ρ ∈ M (Θ) ρ h λ ( r − r ) − N log (cid:2) cosh( λ N ) (cid:3)(cid:0) − | − r | (cid:1)i − K ( ρ, π ) (cid:21)(cid:27) ≤ . As a consequence, reasoning as previously, we deduce Theorem 3.2.5 . In the case when k = 1 , for any partially exchangeable posteriordistribution π : Ω → M (Θ) , with P probability at least − ǫ , for any θ ∈ Θ andany λ ∈ R + , r ( θ ) − Nλ log (cid:2) cosh( λ N ) (cid:3)(cid:0) − | − r ( θ ) | (cid:1) + log (cid:8) ǫπ (cid:2) ∆( θ ) (cid:3)(cid:9) λ ≤ r ( θ ); .2. Vapnik bounds for transductive classification and consequently for any θ ∈ Θ , r ( θ ) ≤ λ ∈ R + r ( θ ) − log (cid:8) ǫπ (cid:2) ∆( θ ) (cid:3)(cid:9) λ − Nλ log (cid:2) cosh( λ N ) (cid:3) − r ( θ ) . In the case of binary classification using a Vapnik–Cervonenkis class ofVapnik–Cervonenkis dimension not greater than h , we can choose π such that − log (cid:8) π (cid:2) ∆( θ ) (cid:3)(cid:9) ≤ h log( eNh ) and obtain the following numerical illustration ofthis theorem: for N = 1000, h = 10, ǫ = 0 . 01 and inf Θ r = r ( b θ ) = 0 . 2, we find anupper bound r ( b θ ) ≤ . (achieved by blind random classification). 
This indicatesthat considering shadow samples of arbitrary sizes some noisy situations yields asignificant improvement on bounds obtained with a shadow sample of the same sizeas the training sample. When k = 1 and P is exchangeable meaning that for any bounded measurablefunction h : Ω → R and any permutation s ∈ S (cid:0) { , . . . , N } (cid:1) P (cid:2) h ( ω ◦ s ) (cid:3) = P (cid:2) h ( ω ) (cid:3) ,then we can still improve the bound as follows. Let T ′ ( h ) = 1 N ! X s ∈ S (cid:0) { N +1 ,..., N } (cid:1) h ( ω ◦ s ) . Then we can write1 − | − T i ( σ i ) | = ( σ i − σ i + N ) = σ i + σ i + N − σ i σ i + N . Using this identity, we get for any exchangeable function λ : Ω × Θ → R + , T (cid:26) exp (cid:20) λ ( r − r ) − log (cid:2) cosh( λ N ) (cid:3) N X i =1 (cid:0) σ i + σ i + N − σ i σ i + N (cid:1)(cid:21)(cid:27) ≤ . Let us put A ( λ ) = Nλ log (cid:2) cosh( λ N ) (cid:3) , (3.2) v ( θ ) = 12 N N X i =1 ( σ i + σ i + N − σ i σ i + N ) . (3.3)With this notation T n exp (cid:8) λ (cid:2) r − r − A ( λ ) v (cid:3)(cid:9)o ≤ . Let us notice now that T ′ (cid:2) v ( θ ) (cid:3) = r ( θ ) − r ( θ ) r ( θ ) . Let π : Ω → M (Θ) be any given exchangeable posterior distribution. Using theexchangeability of P and π and the exchangeability of the exponential function, weget P n π h exp (cid:8) λ (cid:2) r − r − A ( r − r r ) (cid:3)(cid:9)io = P n π h exp (cid:8) λ (cid:2) r − r − AT ′ ( v ) (cid:3)(cid:9)io Chapter 3. Transductive PAC-Bayesian learning ≤ P n π h T ′ exp (cid:8) λ (cid:2) r − r − Av (cid:3)(cid:9)io = P n T ′ π h exp (cid:8) λ (cid:2) r − r − Av (cid:3)(cid:9)io = P n π h exp (cid:8) λ (cid:2) r − r − Av (cid:3)(cid:9)io = P n T π h exp (cid:8) λ (cid:2) r − r − Av (cid:3)(cid:9)io = P n π h T exp (cid:8) λ (cid:2) r − r − Av (cid:3)(cid:9)io ≤ . We are thus ready to state Theorem 3.2.6 . In the case when k = 1 , for any exchangeable probability dis-tribution P , for any exchangeable posterior distribution π : Ω → M (Θ) , for anyexchangeable function λ : Ω × Θ → R + , P (cid:26) exp (cid:20) sup ρ ∈ M (Θ) ρ n λ (cid:2) r − r − A ( λ )( r − r r ) (cid:3)o − K ( ρ, π ) (cid:21)(cid:27) ≤ , where A ( λ ) is defined by equation (3.2, page 119). We then deduce as previously Corollary 3.2.7 . For any exchangeable posterior distribution π : Ω → M (Θ) , forany exchangeable probability measure P ∈ M (Ω) , for any measurable exchangeablefunction λ : Ω × Θ → R + , with P probability at least − ǫ , for any θ ∈ Θ , r ( θ ) ≤ r ( θ ) + A ( λ ) (cid:2) r ( θ ) − r ( θ ) r ( θ ) (cid:3) − log (cid:8) ǫπ (cid:2) ∆( θ ) (cid:3)(cid:9) λ , where A ( λ ) is defined by equation (3.2, page 119). In order to deduce an empirical bound from this theorem, we have to makesome choice for λ ( ω, θ ). Fortunately, it is easy to show that the bound holds uni-formly in λ , because the inequality can be rewritten as a function of only onenon-exchangeable quantity, namely r ( θ ). Indeed, since r = 2 r − r , we see thatthe inequality can be written as r ( θ ) ≤ r ( θ ) + A ( λ ) (cid:2) r ( θ ) − r ( θ ) r ( θ ) + r ( θ ) (cid:3) − log (cid:8) ǫπ (cid:2) ∆( θ ) (cid:3) λ . It can be solved in r ( θ ), to get r ( θ ) ≥ f (cid:16) λ, r ( θ ) , − log (cid:8) ǫπ (cid:2) ∆( θ ) (cid:3)(cid:9)(cid:17) , where f ( λ, r, d ) = (cid:2) A ( λ ) (cid:3) − (cid:26) rA ( λ ) − r(cid:2) − rA ( λ ) (cid:3) + 4 A ( λ ) n r (cid:2) − A ( λ ) (cid:3) − dλ o(cid:27) . 
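As an illustration of the numerical application given after Theorem 3.2.3, the optimization over $\lambda$ and $k$ is easy to carry out directly. The sketch below evaluates the displayed bound on the shadow error rate $r_2(\theta)$ with the Vapnik–Cervonenkis complexity term $h\log(e(k+1)N/h) - \log(\epsilon)$ and the illustrative values $N = 1000$, $h = 10$, $\epsilon = 0.01$, $r_1(\hat\theta) = 0.2$ used in the text; the grid of $\lambda$ values is arbitrary.

```python
import numpy as np

def bound_3_2_3(r1, N, h, eps, k, lam):
    """Right-hand side of the Theorem 3.2.3 bound on r_2, with the complexity term
    -log(eps * pi[Delta(theta)]) replaced by h*log(e(k+1)N/h) - log(eps)."""
    d = h * np.log(np.e * (k + 1) * N / h) - np.log(eps)
    numerator = 1.0 - np.exp(-(lam / N) * r1 - d / N)
    denominator = 1.0 - np.exp(-lam / N)
    return (k + 1) / k * numerator / denominator - r1 / k

N, h, eps, r1 = 1000, 10, 0.01, 0.2
lams = np.linspace(1.0, 4000.0, 40000)
best = None
for k in range(1, 60):
    values = bound_3_2_3(r1, N, h, eps, k, lams)
    i = int(values.argmin())
    if best is None or values[i] < best[0]:
        best = (float(values[i]), k, float(lams[i]))
print(best)    # about 0.41, attained for k between 15 and 17 and lambda near 970
```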
Thus we can find some exchangeable function λ ( ω, θ ), such that f (cid:16) λ ( ω, θ ) , r ( θ ) , − log (cid:8) ǫπ (cid:2) ∆( θ ) (cid:3)(cid:9)(cid:17) = sup β ∈ R + f (cid:16) β, r ( θ ) , − log (cid:8) ǫπ (cid:2) ∆( θ ) (cid:3)(cid:9)(cid:17) . Applying Corollary 3.2.7 (page 120) to that choice of λ , we see that .3. Vapnik bounds for inductive classification Theorem 3.2.8 . For any exchangeable probability measure P ∈ M (Ω) , for anyexchangeable posterior probability distribution π : Ω → M (Θ) , with P probabilityat least − ǫ , for any θ ∈ Θ , for any λ ∈ R + , r ( θ ) ≤ r ( θ ) + A ( λ ) (cid:2) r ( θ ) − r ( θ ) r ( θ ) (cid:3) − log (cid:8) ǫπ (cid:2) ∆( θ ) (cid:3)(cid:9) λ , where A ( λ ) is defined by equation (3.2, page 119). Solving the previous inequality in r ( θ ), we get Corollary 3.2.9 . Under the same assumptions as in the previous theorem, with P probability at least − ǫ , for any θ ∈ Θ , r ( θ ) ≤ inf λ ∈ R + r ( θ ) n Nλ log (cid:2) cosh( λ N ) (cid:3)o − (cid:8) ǫπ (cid:2) ∆( θ ) (cid:3)(cid:9) λ − Nλ log (cid:2) cosh( λ N ) (cid:3)(cid:2) − r ( θ ) (cid:3) . Applying this to our usual numerical example of a binary classification modelwith Vapnik–Cervonenkis dimension not greater than h = 10, when N = 1000,inf Θ r = r ( b θ ) = 10 and ǫ = 0 . 01, we obtain that r ( b θ ) ≤ . We assume in this section that P = (cid:18) N O i =1 P i (cid:19) ⊗ ∞ ∈ M n(cid:2)(cid:0) X × Y (cid:1) N (cid:3) N o , where P i ∈ M (cid:0) X × Y (cid:1) : we consider an infinite i.i.d. sequence of independent non -identically distributed samples of size N , the first one only being observed.More precisely, under P each sample ( X i + jN , Y i + jN ) Ni =1 is distributed accordingto N Ni =1 P i , and they are all independent from each other. Only the first sample( X i , Y i ) Ni =1 is assumed to be observed. The shadow samples will only appear in theproofs. The aim of this section is to prove better Vapnik bounds, generalizing themin the same time to the independent non-i.i.d. setting, which to our knowledge hasnot been done before.Let us introduce the notation P ′ (cid:2) h ( ω ) (cid:3) = P (cid:2) h ( ω ) | ( X i , Y i ) Ni =1 (cid:3) , where h may beany suitable (e.g. bounded) random variable, let us also put Ω = (cid:2) ( X × Y ) N (cid:3) N . Definition 3.3.1 . For any subset A ⊂ N of integers, let C ( A ) be the set of circularpermutations of the totally ordered set A , extended to a permutation of N by takingit to be the identity on the complement N \ A of A . We will say that a randomfunction h : Ω → R is k -partially exchangeable if h ( ω ◦ s ) = h ( ω ) , s ∈ C (cid:0) { i + jN ; j = 0 , . . . , k } (cid:1) , i = 1 , . . . , N. In the same way, we will say that a posterior distribution π : Ω → M (Θ) is k -partially exchangeable if π ( ω ◦ s ) = π ( ω ) ∈ M (Θ) , s ∈ C (cid:0) { i + jN ; j = 0 , . . . , k } (cid:1) , i = 1 , . . . , N. Chapter 3. Transductive PAC-Bayesian learning Note that P itself is k -partially exchangeable for any k in the sense that for anybounded measurable function h : Ω → RP (cid:2) h ( ω ◦ s ) (cid:3) = P (cid:2) h ( ω ) (cid:3) , s ∈ C (cid:0) { i + jN ; j = 0 , . . . , k } (cid:1) , i = 1 , . . . , N. Let ∆ k ( θ ) = n θ ′ ∈ Θ ; (cid:2) f θ ′ ( X i ) (cid:3) ( k +1) Ni =1 = (cid:2) f θ ( X i ) (cid:3) ( k +1) Ni =1 o , θ ∈ Θ , k ∈ N ∗ , and letalso r k ( θ ) = 1( k + 1) N ( k +1) N X i =1 (cid:2) f θ ( X i ) = Y i (cid:3) . 
Theorem 3.1.2 shows that for anypositive real parameter λ and any k -partially exchangeable posterior distribution π k : Ω → M (Θ), P (cid:26) exp (cid:20) sup θ ∈ Θ λ (cid:2) Φ λN ( r k ) − r (cid:3) + log (cid:8) ǫπ k (cid:2) ∆ k ( θ ) (cid:3)(cid:9)(cid:21)(cid:27) ≤ ǫ. Using the general fact that P (cid:2) exp( h ) (cid:3) = P n P ′ (cid:2) exp( h ) (cid:3)o ≥ P n exp (cid:2) P ′ ( h ) (cid:3)o , and the fact that the expectation of a supremum is larger than the supremum ofan expectation, we see that with P probability at most 1 − ǫ , for any θ ∈ Θ, P ′ n Φ λN (cid:2) r k ( θ ) (cid:3)o ≤ r ( θ ) − P ′ n log (cid:8) ǫπ k (cid:2) ∆ k ( θ ) (cid:3)(cid:9)o λ . For short let us put ¯ d k ( θ ) = − log (cid:8) ǫπ k (cid:2) ∆ k ( θ ) (cid:3)(cid:9) ,d ′ k ( θ ) = − P ′ n log (cid:8) ǫπ k (cid:2) ∆ k ( θ ) (cid:3)(cid:9)o ,d k ( θ ) = − P n log (cid:8) ǫπ k (cid:2) ∆ k ( θ ) (cid:3)(cid:9)o . We can use the convexity of Φ λN and the fact that P ′ ( r k ) = r + kRk +1 , to establishthat P ′ n Φ λN (cid:2) r k ( θ ) (cid:3)o ≥ Φ λN (cid:20) r ( θ ) + kR ( θ ) k + 1 (cid:21) . We have proved Theorem 3.3.1 . Using the above hypotheses and notation, for any sequence π k :Ω → M (Θ) , where π k is a k -partially exchangeable posterior distribution, for anypositive real constant λ , any positive integer k , with P probability at least − ǫ , forany θ ∈ Θ , Φ λN (cid:20) r ( θ ) + kR ( θ ) k + 1 (cid:21) ≤ r ( θ ) + d ′ k ( θ ) λ . We can make as we did with Theorem 1.2.6 (page 11) the result of this theoremuniform in λ ∈ { α j ; j ∈ N ∗ } and k ∈ N ∗ (considering on k the prior k ( k +1) and on j the prior j ( j +1) ), and obtain .3. Vapnik bounds for inductive classification Theorem 3.3.2 . For any real parameter α > , with P probability at least − ǫ ,for any θ ∈ Θ , R ( θ ) ≤ inf k ∈ N ∗ ,j ∈ N ∗ − exp (cid:26) − α j N r ( θ ) − N n d ′ k ( θ ) + log (cid:2) k ( k + 1) j ( j + 1) (cid:3)o(cid:27) kk +1 h − exp (cid:16) − α j N (cid:17)i − r ( θ ) k . As a special case we can choose π k such that log (cid:8) π k (cid:2) ∆ k ( θ ) (cid:3)(cid:9) is independent of θ and equal to log( N k ), where N k = (cid:12)(cid:12)(cid:8)(cid:2) f θ ( X i ) (cid:3) ( k +1) Ni =1 ; θ ∈ Θ (cid:9)(cid:12)(cid:12) is the size of the trace of the classification model on the extended sample of size( k +1) N . With this choice, we obtain a bound involving a new flavour of conditionalVapnik entropy, namely d ′ k ( θ ) = P (cid:2) log( N k ) | ( Z i ) Ni =1 (cid:3) − log( ǫ ) . In the case of binary classification using a Vapnik–Cervonenkis class of Vapnik–Cervonenkis dimension not greater than h = 10, when N = 1000, inf Θ r = r ( b θ ) =0 . ǫ = 0 . 01, choosing α = 1 . 1, we obtain R ( b θ ) ≤ . λ = 1071 . 8, and an optimal value of k = 16). If we are not pleased with optimizing λ on a discrete subset of the real line, wecan use a slightly different approach. 
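Before turning to that refinement, the grid optimization just carried out is easy to reproduce numerically. The sketch below evaluates the bound of Theorem 3.3.2 with $\lambda = \alpha^j$, the union-bound penalty $\log[k(k+1)j(j+1)]$ added to the complexity term, and $d'_k(\theta)$ bounded by the Vapnik–Cervonenkis value $h\log(e(k+1)N/h) - \log(\epsilon)$, for the illustrative constants used in the text ($N = 1000$, $h = 10$, $\epsilon = 0.01$, $r_1(\hat\theta) = 0.2$, $\alpha = 1.1$).

```python
import numpy as np

def bound_3_3_2(r1, N, h, eps, alpha, k, j):
    """Grid version of the inductive bound: lambda = alpha**j, with the union-bound
    penalty log[k(k+1)j(j+1)] added to d'_k, itself bounded by the VC-type value."""
    lam = alpha ** j
    d = (h * np.log(np.e * (k + 1) * N / h) - np.log(eps)
         + np.log(k * (k + 1) * j * (j + 1)))
    numerator = 1.0 - np.exp(-(lam / N) * r1 - d / N)
    denominator = (k / (k + 1)) * (1.0 - np.exp(-lam / N))
    return numerator / denominator - r1 / k

N, h, eps, r1, alpha = 1000, 10, 0.01, 0.2, 1.1
best = min((bound_3_3_2(r1, N, h, eps, alpha, k, j), k, round(alpha ** j, 1))
           for k in range(1, 60) for j in range(1, 120))
print(best)    # roughly 0.43, with k = 16 and lambda = alpha**j of the order of
               # the value quoted above
```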
From Theorem 3.1.2 (page 113), we see thatfor any positive integer k , for any k -partially exchangeable positive real measurablefunction λ : Ω × Θ → R + satisfying equation (3.1, page 116) — with ∆( θ ) replacedwith ∆ k ( θ ) — for any ǫ ∈ )0 , 1) and η ∈ )0 , P (cid:26) P ′ (cid:20) exp h sup θ λ (cid:2) Φ λN ( r k ) − r (cid:3) + log (cid:8) ǫηπ k (cid:2) ∆ k ( θ ) (cid:3)(cid:9)(cid:21)(cid:27) ≤ ǫη, therefore with P probability at least 1 − ǫ , P ′ (cid:26) exp h sup θ λ (cid:2) Φ λN ( r k ) − r (cid:3) + log (cid:8) ǫηπ k (cid:2) ∆ k ( θ ) (cid:3)(cid:9)i(cid:27) ≤ η, and consequently, with P probability at least 1 − ǫ , with P ′ probability at least 1 − η ,for any θ ∈ Θ, Φ λN ( r k ) + log (cid:8) ǫηπ k (cid:2) ∆ k ( θ ) (cid:3)(cid:9) λ ≤ r . Now we are entitled to choose λ ( ω, θ ) ∈ arg max λ ′ ∈ R + Φ λ ′ N ( r k ) + log (cid:8) ǫηπ k (cid:2) ∆ k ( θ ) (cid:3)(cid:9) λ ′ . Chapter 3. Transductive PAC-Bayesian learning This shows that with P probability at least 1 − ǫ , with P ′ probability at least 1 − η ,for any θ ∈ Θ, sup λ ∈ R + Φ λN ( r k ) − ¯ d k ( θ ) − log( η ) λ ≤ r , which can also be writtenΦ λN ( r k ) − r − ¯ d k ( θ ) λ ≤ − log( η ) λ , λ ∈ R + . Thus with P probability at least 1 − ǫ , for any θ ∈ Θ, any λ ∈ R + , P ′ (cid:20) Φ λN ( r k ) − r − ¯ d k ( θ ) λ (cid:21) ≤ − log( η ) λ + (cid:20) − r + log( η ) λ (cid:21) η. On the other hand, Φ λN being a convex function, P ′ (cid:20) Φ λN ( r k ) − r − ¯ d k ( θ ) λ (cid:21) ≥ Φ λN (cid:2) P ′ ( r k ) (cid:3) − r − d ′ k λ = Φ λN (cid:18) kR + r k + 1 (cid:19) − r − d ′ k λ . Thus with P probability at least 1 − ǫ , for any θ ∈ Θ, kR + r k + 1 ≤ inf λ ∈ R + Φ − λN (cid:20) r (1 − η ) + η + d ′ k − log( η )(1 − η ) λ (cid:21) . We can generalize this approach by considering a finite decreasing sequence η =1 > η > η > · · · > η J > η J +1 = 0, and the corresponding sequence of levels L j = − log( η j ) λ , ≤ j ≤ J,L J +1 = 1 − r − log( J ) − log( ǫ ) λ . Taking a union bound in j , we see that with P probability at least 1 − ǫ , for any θ ∈ Θ, for any λ ∈ R + , P ′ (cid:20) Φ λN ( r k ) − r − ¯ d k + log( J ) λ ≥ L j (cid:21) ≤ η j , j = 0 , . . . , J + 1 , and consequently P ′ (cid:20) Φ λN ( r k ) − r − ¯ d k + log( J ) λ (cid:21) ≤ Z L J +1 P ′ (cid:20) Φ λN ( r k ) − r − ¯ d k + log( J ) λ ≥ α (cid:21) dα ≤ J +1 X j =1 η j − ( L j − L j − )= η J (cid:20) − r − log( J ) − log( ǫ ) − log( η J ) λ (cid:21) − log( η ) λ + J − X j =1 η j λ log (cid:18) η j η j +1 (cid:19) . Let us put .3. Vapnik bounds for inductive classification d ′′ k (cid:2) θ, ( η j ) Jj =1 (cid:3) = d ′ k ( θ ) + log( J ) − log( η ) + J − X j =1 η j log (cid:18) η j η j +1 (cid:19) + log (cid:16) ǫη J J (cid:17) η J . We have proved that for any decreasing sequence ( η j ) Jj =1 , with P probability atleast 1 − ǫ , for any θ ∈ Θ, kR + r k + 1 ≤ inf λ ∈ R + Φ − λN (cid:20) r (1 − η J ) + η J + d ′′ k (cid:2) θ, ( η j ) Jj =1 (cid:3) λ (cid:21) . Remark 3.3.1 . We can for instance choose J = 2 , η = N , η = N ) ,resulting in d ′′ k = d ′ k + log(2) + log log(10 N ) + 1 − log log(10 N )log(10 N ) − log (cid:0) Nǫ (cid:1) N . In the case where N = 1000 and for any ǫ ∈ )0 , , we get d ′′ k ≤ d ′ k + 3 . , in the casewhere N = 10 , we get d ′′ k ≤ d ′ k +4 . , and in the case N = 10 , we get d ′′ k ≤ d ′ k +4 . .Therefore, for any practical purpose we could take d ′′ k = d ′ k + 4 . and η J = N in the above inequality. Taking moreover a weighted union bound in k , we get Theorem 3.3.3 . 
For any ǫ ∈ )0 , , any sequence > η > · · · > η J > , anysequence π k : Ω → M (Θ) , where π k is a k -partially exchangeable posterior distri-bution, with P probability at least − ǫ , for any θ ∈ Θ , R ( θ ) ≤ inf k ∈ N ∗ k + 1 k inf λ ∈ R + Φ − λN (cid:20) r ( θ ) + η J (cid:2) − r ( θ ) (cid:3) + d ′′ k (cid:2) θ, ( η j ) Jj =1 (cid:3) + log (cid:2) k ( k + 1) (cid:3) λ (cid:21) − r ( θ ) k . Corollary 3.3.4 . For any ǫ ∈ )0 , , for any N ≤ , with P probability at least − ǫ , for any θ ∈ Θ , R ( θ ) ≤ inf k ∈ N ∗ inf λ ∈ R + k + 1 k (cid:2) − exp( − λN ) (cid:3) − (cid:26) − exp (cid:20) − λN (cid:2) r ( θ ) + N (cid:3) − P ′ (cid:2) log( N k ) | ( Z i ) Ni =1 (cid:3) − log( ǫ ) + log (cid:2) k ( k + 1) (cid:3) + 4 . N (cid:21)(cid:27) − r ( θ ) k . Let us end this section with a numerical example: in the case of binary classi-fication with a Vapnik–Cervonenkis class of dimension not greater than 10, when N = 1000, inf Θ r = r ( b θ ) = 0 . ǫ = 0 . 01, we get a bound R ( b θ ) ≤ . k = 15 and of λ = 1010). In the case when k = 1, we can use Theorem 3.2.5 (page 118) and replace Φ − λN ( q )with (cid:8) − Nλ × log (cid:2) cosh( λ N ) (cid:3)(cid:9) − q , resulting in26 Chapter 3. Transductive PAC-Bayesian learning Theorem 3.3.5 . For any ǫ ∈ )0 , , any N ≤ , any one-partially exchangeableposterior distribution π : Ω → M (Θ) , with P probability at least − ǫ , for any θ ∈ Θ , R ( θ ) ≤ inf λ ∈ R + n Nλ log (cid:2) cosh( λ N ) (cid:3)o r ( θ ) + 15 N + 2 d ′ ( θ ) + 4 . λ − Nλ log (cid:2) cosh( λ N ) (cid:3) . Finally, in the case when P is i.i.d., meaning that all the P i are equal, we can improvethe previous bound. For any partially exchangeable function λ : Ω × Θ → R + , wesaw in the discussion preceding Theorem 3.2.6 (page 120) that T h exp (cid:2) λ ( r k − r ) − A ( λ ) v (cid:3)i ≤ , with the notation introduced therein. Thus for any partially exchangeable positivereal measurable function λ : Ω × Θ → R + satisfying equation (3.1, page 116), anyone-partially exchangeable posterior distribution π : Ω → M (Θ), P n exp h sup θ ∈ Θ λ (cid:2) r k ( θ ) − r ( θ ) − A ( λ ) v ( θ ) (cid:3) + log (cid:2) ǫπ (cid:2) ∆( θ ) (cid:3)io ≤ . Therefore with P probability at least 1 − ǫ , with P ′ probability 1 − η , r k ( θ ) ≤ r ( θ ) + A ( λ ) v ( θ ) + 1 λ (cid:2) ¯ d ( θ ) − log( η ) (cid:3) . We can then choose λ ( ω, θ ) ∈ arg min λ ′ ∈ R + A ( λ ′ ) v ( θ ) + ¯ d ( θ ) − log( η ) (cid:3) λ ′ , which satis-fies the required conditions, to show that with P probability at least 1 − ǫ , for any θ ∈ Θ, with P ′ probability at least 1 − η , for any λ ∈ R + , r k ( θ ) ≤ r ( θ ) + A ( λ ) v ( θ ) + ¯ d ( θ ) − log( η ) λ . We can then take a union bound on a decreasing sequence of J values η ≥ · · · ≥ η J of η . Weakening the order of quantifiers a little, we then obtain the followingstatement: with P probability at least 1 − ǫ , for any θ ∈ Θ, for any λ ∈ R + , for any j = 1 , . . . , J P ′ (cid:20) r k ( θ ) − r ( θ ) − A ( λ ) v ( θ ) − ¯ d ( θ ) + log( J ) λ ≥ − log( η j ) λ (cid:21) ≤ η j . Consequently for any λ ∈ R + , P ′ (cid:20) r k ( θ ) − r ( θ ) − A ( λ ) v ( θ ) − ¯ d ( θ ) + log( J ) λ (cid:21) ≤ − log( η ) λ + η J (cid:20) − r ( θ ) − log( J ) − log( ǫ ) − log( η J ) λ (cid:21) + J − X j =1 η j λ log (cid:18) η j η j +1 (cid:19) . .4. Gaussian approximation in Vapnik bounds P ′ (cid:2) v ( θ ) (cid:3) = r + R − r R , (this is where we need equidistribution) thusproving that R − r ≤ A ( λ )2 h R + r − r R i + d ′′ (cid:2) θ, ( η j ) Jj =1 (cid:3) λ + η J (cid:2) − r ( θ ) (cid:3) . 
Keeping track of quantifiers, we obtain Theorem 3.3.6 . For any decreasing sequence ( η j ) Jj =1 , any ǫ ∈ )0 , , any one-partially exchangeable posterior distribution π : Ω → M (Θ) , with P probability atleast − ǫ , for any θ ∈ Θ , R ( θ ) ≤ inf λ ∈ R + n Nλ log (cid:2) cosh( λ N ) (cid:3)o r ( θ ) + 2 d ′′ (cid:2) θ, ( η j ) Jj =1 (cid:3) λ + 2 η J (cid:2) − r ( θ ) (cid:3) − Nλ log (cid:2) cosh( λ N ) (cid:3)(cid:2) − r ( θ ) (cid:3) . To obtain formulas which could be easily compared with original Vapnik bounds,we may replace p − Φ a ( p ) with a Gaussian upper bound: Lemma 3.4.1 . For any p ∈ (0 , ) , any a ∈ R + , p − Φ a ( p ) ≤ a p (1 − p ) . For any p ∈ ( , , p − Φ a ( p ) ≤ a . Proof. Let us notice that for any p ∈ (0 , ∂∂a (cid:2) − a Φ a ( p ) (cid:3) = − p exp( − a )1 − p + p exp( − a ) ,∂ ∂ a (cid:2) − a Φ a ( p ) (cid:3) = p exp( − a )1 − p + p exp( − a ) (cid:18) − p exp( − a )1 − p + p exp( − a ) (cid:19) ≤ ( p (1 − p ) p ∈ (0 , ) , p ∈ ( , . Thus taking a Taylor expansion of order one with integral remainder: − a Φ( a ) ≤ − ap + Z a p (1 − p )( a − b ) db = − ap + a p (1 − p ) , p ∈ (0 , ) , − ap + Z a 14 ( a − b ) db = − ap + a , p ∈ ( , . This ends the proof of our lemma. (cid:3) Chapter 3. Transductive PAC-Bayesian learning Lemma 3.4.2 . Let us consider the bound B ( q, d ) = (cid:18) dN (cid:19) − (cid:20) q + dN + r dq (1 − q ) N + d N (cid:21) , q ∈ R + , d ∈ R + . Let us also put ¯ B ( q, d ) = ( B ( q, d ) B ( q, d ) ≤ ,q + q d N otherwise . For any positive real parameters q and d inf λ ∈ R + Φ − λN (cid:18) q + dλ (cid:19) ≤ ¯ B ( q, d ) . Proof. Let p = inf λ Φ − λN (cid:18) q + dλ (cid:19) . For any λ ∈ R + , p − λ N ( p ∧ ) (cid:2) − ( p ∧ ) (cid:3) ≤ Φ λN ( p ) ≤ q + dλ . Thus p ≤ q + inf λ ∈ R + λ N ( p ∧ ) (cid:2) − ( p ∧ ) (cid:3) + dλ = q + s d ( p ∧ ) (cid:2) − ( p ∧ ) (cid:3) N ≤ q + r d N . Then let us remark that B ( q, d ) = sup ( p ′ ∈ R + ; p ′ ≤ q + r dp ′ (1 − p ′ ) N ) . Ifmoreover ≥ B ( q, d ), then according to this remark ≥ q + q d N ≥ p . Therefore p ≤ , and consequently p ≤ q + q dp (1 − p ) N , implying that p ≤ B ( q, d ). (cid:3) The previous lemma combined with Corollary 3.3.4 (page 125) implies Corollary 3.4.3 . Let us use the notation introduced in Lemma 3.4.2 (page 128).For any ǫ ∈ )0 , , any integer N ≤ , with P probability at least − ǫ , for any θ ∈ Θ , R ( θ ) ≤ inf k ∈ N ∗ k + 1 k n ¯ B h r ( θ ) + 110 N , d ′ k ( θ ) + log (cid:2) k ( k + 1) (cid:3) + 4 . io − r ( θ ) k . To make a link with Vapnik’s result, it is useful to state the Gaussian approximationto Theorem 3.3.6 (page 127). Indeed, using the upper bound A ( λ ) ≤ λ N , where A ( λ ) is defined by equation (3.2) on page 119, we get with P probability at least1 − ǫR − r − η J ≤ inf λ ∈ R + λ N (cid:2) R + r − r R (cid:3) + 2 d ′′ λ = r d ′′ ( R + r − r R ) N , which can be solved in R to obtain .4. Gaussian approximation in Vapnik bounds Corollary 3.4.4 . With P probability at least − ǫ , for any θ ∈ Θ , R ( θ ) ≤ r ( θ ) + d ′′ ( θ ) N (cid:2) − r ( θ ) (cid:3) + 2 η J + s d ′′ ( θ ) (cid:2) − r ( θ ) (cid:3) r ( θ ) N + d ′′ ( θ ) N (cid:2) − r ( θ ) (cid:3) + 4 d ′′ ( θ ) N (cid:2) − r ( θ ) (cid:3) η J . This is to be compared with Vapnik’s result, as proved in Vapnik (1998, page138): Theorem 3.4.5 (Vapnik) . For any i.i.d. probability distribution P , with P prob-ability at least − ǫ , for any θ ∈ Θ , putting d V = log (cid:2) P ( N ) (cid:3) + log(4 /ǫ ) ,R ( θ ) ≤ r ( θ ) + 2 d V N + r d V r ( θ ) N + 4 d V N . 
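Before comparing the two statements, let us note that the Gaussian-approximation bound is straightforward to evaluate numerically: as observed in the proof of Lemma 3.4.2, B(q, d) is the largest p satisfying p ≤ q + sqrt(d p (1 − p)/N), so it can be obtained by bisection without appealing to the closed form. The sketch below is ours (the function name and the number of iterations are arbitrary choices) and is meant only as an illustration.

```python
import math

def largest_root_bound(q, d, N, iters=100):
    # Largest p in [q, 1] with p <= q + sqrt(d * p * (1 - p) / N).  By the
    # remark in the proof of Lemma 3.4.2 this is B(q, d), which upper bounds
    # inf_lambda Phi^{-1}_{lambda/N}(q + d / lambda) whenever it stays <= 1/2.
    if q >= 1.0:
        return 1.0
    lo, hi = q, 1.0
    for _ in range(iters):
        p = 0.5 * (lo + hi)
        if p <= q + math.sqrt(d * p * (1.0 - p) / N):
            lo = p
        else:
            hi = p
    return lo

# For instance, with an (arbitrary) complexity value d = 100 at N = 1000:
print(largest_root_bound(q=0.21, d=100.0, N=1000))   # approximately 0.36
```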
Recalling that, following Remark 3.3.1, we can choose the sequence (η_j)_{j=1}^J so that η_J is of order 1/N (and therefore brings a negligible contribution to the bound) and so that, for any practically relevant sample size N, d″(θ) ≤ P[log(N₁) | (Z_i)_{i=1}^N] − log(ǫ) + 4.7, we see that our complexity term is somewhat more satisfactory than Vapnik's, since it is integrated outside the logarithm (that is, we take the expectation of the logarithm of the trace size, rather than the logarithm of its expectation), at the price of a slightly larger additive constant (remember that log 4 ≃ 1.4, which is better than our 4.7; the latter could presumably be improved by working out a better sequence (η_j), but not down to log 4). Our variance term is better, since it involves r(1 − r) instead of r. We also have d″/N instead of 2 d_V/N, because we use no symmetrization trick.

Let us illustrate these bounds on a numerical example corresponding to a situation where the sample is noisy or the classification model is weak. Let us assume that N = 1000, that inf_Θ r = r(θ̂) = 0.2, that we are performing binary classification with a model of Vapnik–Cervonenkis dimension not greater than h = 10, and that we work at confidence level ǫ = 0.01. Vapnik's theorem provides an upper bound for R(θ̂) not smaller than 0. , whereas we get R(θ̂) ≤ 0. (using d″ ≤ d′ + 3. and N = 1000). Now if we go for Theorem 3.3.6 and do not make a Gaussian approximation, we get R(θ̂) ≤ 0. , reached for an optimal value λ = 1195 > N = 1000. This explains why the Gaussian approximation in Vapnik's bound can be improved: for such a large value of λ, λ r(θ) does not behave like a Gaussian random variable.

Let us recall in conclusion that the best bound is provided by Theorem 3.3.3 (page 125), giving R(θ̂) ≤ 0. , reached for an optimal value of k equal to 15 and of λ equal to 1010. This bound can be seen to take advantage of the fact that Bernoulli random variables are not Gaussian (its Gaussian approximation, Corollary 3.4.3, gives a bound R(θ) ≃ 0. for the same value k = 15), and of the fact that the optimal size of the shadow sample is significantly larger than the size of the observed sample. Moreover, Theorem 3.3.3 does not assume that the sample is i.i.d., but only that it is independent, thus generalizing Vapnik's bounds to inhomogeneous data (this will presumably be the case when data are collected from different places where the experimental conditions may not be the same, although they may reasonably be assumed to be independent).

Our little numerical example was chosen to illustrate the case when it is non-trivial to decide whether the chosen classifier does better than the 0.5 error rate of blind random classification. This case is of interest when choosing "weak learners" to be aggregated or combined in some appropriate way in a second stage, in order to reach a better classification rate. This stage of feature selection is unavoidable in many real-world classification tasks. Our little computations are meant to exemplify the fact that Vapnik's bounds, although asymptotically suboptimal, as is obvious by comparison with the first two chapters, can do the job when dealing with moderate sample sizes.

Chapter 4. Support Vector Machines

Support Vector Machines, of wide use and renown, were conceived by V. Vapnik (Vapnik, 1998). Before introducing them, we will study as a prerequisite the separation of points by hyperplanes in a finite dimensional Euclidean space. Support Vector Machines perform the same kind of linear separation after an implicit change of pattern space. The preceding PAC-Bayesian results provide a fitting framework to analyse their generalization properties.

4.1. How to build them

In this section we deal with the classification of points in R^d into two classes.
Let Z = ( x i , y i ) Ni =1 ∈ (cid:0) R d × {− , +1 } (cid:1) N be some set of labelled examples (called thetraining set hereafter). Let us split the set of indices I = { , . . . , N } according tothe labels into two subsets I + = { i ∈ I : y i = +1 } ,I − = { i ∈ I : y i = − } . Let us then consider the set of admissible separating directions A Z = (cid:8) w ∈ R d : sup b ∈ R inf i ∈ I ( h w, x i i − b ) y i ≥ (cid:9) , which can also be written as A Z = (cid:8) w ∈ R d : max i ∈ I − h w, x i i + 2 ≤ min i ∈ I + h w, x i i (cid:9) . As it is easily seen, the optimal value of b for a fixed value of w , in other words thevalue of b which maximizes inf i ∈ I ( h w, x i i − b ) y i , is equal to b w = 12 h max i ∈ I − h w, x i i + min i ∈ I + h w, x i i i . Lemma 4.1.1 . When A Z = ∅ , inf {k w k : w ∈ A Z } is reached for only one value w Z of w . Chapter 4. Support Vector Machines Proof. Let w ∈ A Z . The set A Z ∩ { w ∈ R d : k w k ≤ k w k} is a compact convexset and w 7→ k w k is strictly convex and therefore has a unique minimum on thisset, which is also obviously its minimum on A Z . (cid:3) Definition 4.1.1 . When A Z = ∅ , the training set Z is said to be linearly separa-ble. The hyperplane H = { x ∈ R d : h w Z , x i − b Z = 0 } , where w Z = arg min {k w k : w ∈ A Z } ,b Z = b w Z , is called the canonical separating hyperplane of the training set Z . The quantity k w Z k − is called the margin of the canonical hyperplane. As min i ∈ I + h w Z , x i i − max i ∈ I − h w Z , x i i = 2, the margin is also equal to half thedistance between the projections on the direction w Z of the positive and negativepatterns. Let us consider the convex hulls X + and X − of the positive and negative patterns: X + = n X i ∈ I + λ i x i : (cid:0) λ i (cid:1) i ∈ I + ∈ R I + + , X i ∈ I + λ i = 1 o , X − = n X i ∈ I − λ i x i : (cid:0) λ i (cid:1) i ∈ I − ∈ R I − + , X i ∈ I − λ i = 1 o . Let us introduce the closed convex set V = X + − X − = (cid:8) x + − x − : x + ∈ X + , x − ∈ X − (cid:9) . As v 7→ k v k is strictly convex, with compact lower level sets, there is a uniquevector v ∗ such that k v ∗ k = inf v ∈ V (cid:8) k v k : v ∈ V (cid:9) . Lemma 4.1.2 . The set A Z is non-empty (i.e. the training set Z is linearly sepa-rable) if and only if v ∗ = 0 . In this case w Z = 2 k v ∗ k v ∗ , and the margin of the canonical hyperplane is equal to k v ∗ k . This lemma proves that the distance between the convex hulls of the positiveand negative patterns is equal to twice the margin of the canonical hyperplane. Proof. Let us assume first that v ∗ = 0, or equivalently that X + ∩ X − = ∅ . Forany vector w ∈ R d , min i ∈ I + h w, x i i = min x ∈ X + h w, x i , .1. How to build them i ∈ I − h w, x i i = max x ∈ X − h w, x i , so min i ∈ I + h w, x i i − max i ∈ I − h w, x i i ≤ 0, which shows that w cannot be in A Z andtherefore that A Z is empty.Let us assume now that v ∗ = 0, or equivalently that X + ∩ X − = ∅ . Let us put w ∗ = 2 v ∗ / k v ∗ k . Let us remark first thatmin i ∈ I + h w ∗ , x i i − max i ∈ I − h w ∗ , x i i = inf x ∈ X + h w ∗ , x i − sup x ∈ X − h w ∗ , x i = inf x + ∈ X + ,x − ∈ X − h w ∗ , x + − x − i = 2 k v ∗ k inf v ∈ V h v ∗ , v i . Let us now prove that inf v ∈ V h v ∗ , v i = k v ∗ k . Some arbitrary v ∈ V being fixed,consider the function β 7→ k βv + (1 − β ) v ∗ k : [0 , → R . By definition of v ∗ , it reaches its minimum value for β = 0, and therefore hasa non-negative derivative at this point. Computing this derivative, we find that h v − v ∗ , v ∗ i ≥ 0, as claimed. 
We have proved thatmin i ∈ I + h w ∗ , x i i − max i ∈ I − h w ∗ , x i i = 2 , and therefore that w ∗ ∈ A Z . On the other hand, any w ∈ A Z is such that2 ≤ min i ∈ I + h w, x i i − max i ∈ I − h w, x i i = inf v ∈ V h w, v i ≤ k w k inf v ∈ V k v k = k w k k v ∗ k . This proves that k w ∗ k = inf (cid:8) k w k : w ∈ A Z (cid:9) , and therefore that w ∗ = w Z asclaimed. (cid:3) One way to compute w Z would therefore be to compute v ∗ by minimizing ((cid:13)(cid:13)(cid:13)(cid:13)(cid:13)X i ∈ I λ i y i x i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) : ( λ i ) i ∈ I ∈ R I + , X i ∈ I λ i = 2 , X i ∈ I y i λ i = 0 ) . Although this is a tractable quadratic programming problem, a direct computationof w Z through the following proposition is usually preferred. Proposition 4.1.3 . The canonical direction w Z can be expressed as w Z = N X i =1 α ∗ i y i x i , where ( α ∗ i ) Ni =1 is obtained by minimizing inf (cid:8) F ( α ) : α ∈ A (cid:9) where A = n ( α i ) i ∈ I ∈ R I + , X i ∈ I α i y i = 0 o , and F ( α ) = (cid:13)(cid:13)(cid:13)X i ∈ I α i y i x i (cid:13)(cid:13)(cid:13) − X i ∈ I α i . Chapter 4. Support Vector Machines Proof. Let w ( α ) = P i ∈ I α i y i x i and let S ( α ) = P i ∈ I α i . We can expressthe function F ( α ) as F ( α ) = k w ( α ) k − S ( α ). Moreover it is important to no-tice that for any s ∈ R + , { w ( α ) : α ∈ A , S ( α ) = s } = s V . This shows thatfor any s ∈ R + , inf { F ( α ) : α ∈ A , S ( α ) = s } is reached and that for any α s ∈ { α ∈ A : S ( α ) = s } reaching this infimum, w ( α s ) = sv ∗ . As s s k v ∗ k − s : R + → R reaches its infimum for only one value s ∗ of s , namelyat s ∗ = k v ∗ k , this shows that F ( α ) reaches its infimum on A , and that for any α ∗ ∈ A such that F ( α ∗ ) = inf { F ( α ) : α ∈ A } , w ( α ∗ ) = k v ∗ k v ∗ = w Z . (cid:3) Definition 4.1.2 . The set of support vectors S is defined by S = { x i : h w Z , x i i − b Z = y i } . Proposition 4.1.4 . Any α ∗ minimizing F ( α ) on A is such that { x i : α ∗ i > } ⊂ S . This implies that the representation w Z = w ( α ∗ ) involves in general only a limitednumber of non-zero coefficients and that w Z = w Z ′ , where Z ′ = { ( x i , y i ) : x i ∈ S } . Proof. Let us consider any given i ∈ I + and j ∈ I − , such that α ∗ i > α ∗ j > I − and I + , since the sum of thecomponents of α ∗ on each of these sets are equal and since P k ∈ I α ∗ k > 0. For any t ∈ R , consider α k ( t ) = α ∗ k + t ( k ∈ { i, j } ) , k ∈ I. The vector α ( t ) is in A for any value of t in some neighbourhood of 0, therefore ∂∂t | t =0 F (cid:2) α ( t ) (cid:3) = 0. Computing this derivative, we find that y i h w ( α ∗ ) , x i i + y j h w ( α ∗ ) , x j i = 2 . As y i = − y j , this can also be written as y i (cid:2) h w ( α ∗ ) , x i i − b Z (cid:3) + y j (cid:2) h w ( α ∗ ) , x j i − b Z (cid:3) = 2 . As w ( α ∗ ) ∈ A Z , y k (cid:2) h w ( α ∗ ) , x k i − b Z (cid:3) ≥ , k ∈ I, which implies necessarily as claimed that y i (cid:2) h w ( α ∗ ) , x i i − b Z (cid:3) = y j (cid:2) h w ( α ∗ ) , x j i − b Z (cid:3) = 1 . (cid:3) In the case when the training set Z = ( x i , y i ) Ni =1 is not linearly separable, we candefine a noisy canonical hyperplane as follows: we can choose w ∈ R d and b ∈ R tominimize(4.1) C ( w, b ) = N X i =1 (cid:2) − (cid:0) h w, x i i − b (cid:1) y i (cid:3) + + k w k , where for any real number r , r + = max { r, } is the positive part of r . .1. How to build them Theorem 4.1.5 . 
Let us introduce the dual criterion F ( α ) = N X i =1 α i − (cid:13)(cid:13)(cid:13)(cid:13) N X i =1 y i α i x i (cid:13)(cid:13)(cid:13)(cid:13) and the domain A ′ = (cid:26) α ∈ R N + : α i ≤ , i = 1 , . . . , N, N X i =1 y i α i = 0 (cid:27) . Let α ∗ ∈ A ′ be such that F ( α ∗ ) = sup α ∈ A ′ F ( α ) . Let w ∗ = P Ni =1 y i α ∗ i x i . There is a threshold b ∗ (whose construction will be detailed in the proof ), such that C ( w ∗ , b ∗ ) = inf w ∈ R d ,b ∈ R C ( w, b ) . Corollary 4.1.6 . (scaled criterion) For any positive real parameter λ let usconsider the criterion C λ ( w, b ) = λ N X i =1 (cid:2) − ( h w, x i i − b ) y i (cid:3) + + k w k and the domain A ′ λ = (cid:26) α ∈ R N + : α i ≤ λ , i = 1 , . . . , N, N X i =1 y i α i = 0 (cid:27) . For any solution α ∗ of the minimization problem F ( α ∗ ) = sup α ∈ A ′ λ F ( α ) , the vector w ∗ = P Ni =1 y i α ∗ i x i is such that inf b ∈ R C λ ( w ∗ , b ) = inf w ∈ R d ,b ∈ R C λ ( w, b ) . In the separable case, the scaled criterion is minimized by the canonical hyper-plane for λ large enough. This extension of the canonical hyperplane computationin dual space is often called the box constraint , for obvious reasons. Proof. The corollary is a straightforward consequence of the scale property C λ ( w, b, x ) = λ C ( λ − w, b, λx ), where we have made the dependence of the crite-rion in x ∈ R dN explicit. Let us come now to the proof of the theorem.The minimization of C ( w, b ) can be performed in dual space extending the coupleof parameters ( w, b ) to w = ( w, b, γ ) ∈ R d × R × R N + and introducing the dualmultipliers α ∈ R N + and the criterion G ( α, w ) = N X i =1 γ i + N X i =1 α i (cid:8)(cid:2) − ( h w, x i i − b ) y i (cid:3) − γ i (cid:9) + k w k . We see that C ( w, b ) = inf γ ∈ R N + sup α ∈ R N + G (cid:2) α, ( w, b, γ ) (cid:3) , and therefore, putting W = { ( w, b, γ ) : w ∈ R d , b ∈ R , γ ∈ R N + (cid:9) , we are led to solvethe minimization problem G ( α ∗ , w ∗ ) = inf w ∈ W sup α ∈ R N + G ( α, w ) , Chapter 4. Support Vector Machines whose solution w ∗ = ( w ∗ , b ∗ , γ ∗ ) is such that C ( w ∗ , b ∗ ) = inf ( w,b ) ∈ R d +1 C ( w, b ),according to the preceding identity. As for any value of α ′ ∈ R N + ,inf w ∈ W sup α ∈ R N + G ( α, w ) ≥ inf w ∈ W G ( α ′ , w ) , it is immediately seen thatinf w ∈ W sup α ∈ R N + G ( α, w ) ≥ sup α ∈ R N + inf w ∈ W G ( α, w ) . We are going to show that there is no duality gap, meaning that this inequality isindeed an equality. More importantly, we will do so by exhibiting a saddle point,which, solving the dual minimization problem will also solve the original one.Let us first make explicit the solution of the dual problem (the interest of thisdual problem precisely lies in the fact that it can more easily be solved explicitly).Introducing the admissible set of values of α , A ′ = (cid:8) α ∈ R N : 0 ≤ α i ≤ , i = 1 , . . . , N, N X i =1 y i α i = 0 (cid:9) , it is elementary to check thatinf w ∈ W G ( α, w ) = ( inf w ∈ R d G (cid:2) α, ( w, , (cid:3) , α ∈ A ′ , −∞ , otherwise . As G (cid:2) α, ( w, , (cid:3) = k w k + N X i =1 α i (cid:0) − h w, x i i y i (cid:1) , we see that inf w ∈ R d G (cid:2) α, ( w, , (cid:3) is reached at w α = N X i =1 y i α i x i . This proves that inf w ∈ W G ( α, w ) = F ( α ) . The continuous map α inf w ∈ W G ( α, w ) reaches a maximum α ∗ , not necessarilyunique, on the compact convex set A ′ . We are now going to exhibit a choice of w ∗ ∈ W such that ( α ∗ , w ∗ ) is a saddle point . 
This means that we are going to showthat G ( α ∗ , w ∗ ) = inf w ∈ W G ( α ∗ , w ) = sup α ∈ R N + G ( α, w ∗ ) . It will imply that inf w ∈ W sup α ∈ R d + G ( α, w ) ≤ sup α ∈ R N + G ( α, w ∗ ) = G ( α ∗ , w ∗ )on the one hand and thatinf w ∈ W sup α ∈ R d + G ( α, w ) ≥ inf w ∈ W G ( α ∗ , w ) = G ( α ∗ , w ∗ ) .1. How to build them G ( α ∗ , w ∗ ) = inf w ∈ W sup α ∈ R N + G ( α, w )as required. Construction of w ∗ . • Let us put w ∗ = w α ∗ . • If there is j ∈ { , . . . , N } such that 0 < α ∗ j < 1, let us put b ∗ = h x j , w ∗ i − y j . Otherwise, let us put b ∗ = sup {h x i , w ∗ i − α ∗ i > , y i = +1 , i = 1 , . . . , N } . • Let us then put γ ∗ i = ( , α ∗ i < , − ( h w ∗ , x i i − b ∗ ) y i , α ∗ i = 1 . If we can prove that(4.2) 1 − ( h w ∗ , x i i − b ∗ ) y i ≤ , α ∗ i = 0 , = 0 , < α ∗ i < , ≥ , α ∗ i = 1 , it will show that γ ∗ ∈ R N + and therefore that w ∗ = ( w ∗ , b ∗ , γ ∗ ) ∈ W . It will alsoshow that G ( α, w ∗ ) = N X i =1 γ ∗ i + X i,α ∗ i =0 α i (cid:2) − ( h w ∗ , x i i − b ∗ ) y i (cid:3) + k w ∗ k , proving that G ( α ∗ , w ∗ ) = sup α ∈ R N + G ( α, w ∗ ). As obviously G ( α ∗ , w ∗ ) = G (cid:2) α ∗ , ( w ∗ , , (cid:3) , we already know that G ( α ∗ , w ∗ ) = inf w ∈ W G ( α ∗ , w ). This will show that( α ∗ , w ∗ ) is the saddle point we were looking for, thus ending the proof of the theo-rem. (cid:3) Proof of equation (4.2) . Let us deal first with the case when there is j ∈{ , . . . , N } such that 0 < α ∗ j < i ∈ { , . . . , N } such that 0 < α ∗ i < 1, there is ǫ > t ∈ ( − ǫ, ǫ ), α ∗ + ty i e i − ty j e j ∈ A ′ , where ( e k ) Nk =1 is the canonical base of R N . Thus ∂∂t | t =0 F ( α ∗ + ty i e i − ty j e j ) = 0. Computing this derivative, we obtain ∂∂t | t =0 F ( α ∗ + ty i e i − ty j e j ) = y i − h w ∗ , x i i + h w ∗ , x j i − y j = y i (cid:2) − (cid:0) h w, x i i − b ∗ (cid:1) y i (cid:3) . Thus 1 − (cid:0) h w, x i i − b ∗ (cid:1) y i = 0, as required. This shows also that the definition of b ∗ does not depend on the choice of j such that 0 < α ∗ j < Chapter 4. Support Vector Machines For any i ∈ { , . . . , N } such that α ∗ i = 0, there is ǫ > t ∈ (0 , ǫ ), α ∗ + te i − ty i y j e j ∈ A ′ . Thus ∂∂t | t =0 F ( α ∗ + te i − ty i y j e j ) ≤ 0, showingthat 1 − (cid:0) h w ∗ , x i i − b ∗ (cid:1) y i ≤ i ∈ { , . . . , N } such that α ∗ i = 1, there is ǫ > α ∗ − te i + ty i y j e j ∈ A ′ . Thus ∂∂t | t =0 F ( α ∗ − te i + ty i y j e j ) ≤ 0, showing that 1 − (cid:0) h w ∗ , x i i − b ∗ (cid:1) y i ≥ α ∗ , w ∗ ) is a saddle point in this case.Let us deal now with the case where α ∗ ∈ { , } N . If we are not in the trivial casewhere the vector ( y i ) Ni =1 is constant, the case α ∗ = 0 is ruled out. Indeed, in thiscase, considering α ∗ + te i + te j , where y i y j = − 1, we would get the contradiction2 = ∂∂t | t =0 F ( α ∗ + te i + te j ) ≤ j such that α ∗ j = 1, and since P Ni =1 α i y i = 0, bothclasses are present in the set { j : α ∗ j = 1 } .Now for any i, j ∈ { , . . . , N } such that α ∗ i = α ∗ j = 1 and such that y i = +1 and y j = − ∂∂t | t =0 F ( α ∗ − te i − te j ) = − h w ∗ , x i i − h w ∗ , x j i ≤ 0. Thussup {h w ∗ , x i i − α ∗ i = 1 , y i = +1 } ≤ inf {h w ∗ , x j i + 1 : α ∗ j = 1 , y j = − } , showing that 1 − (cid:0) h w ∗ , x k i − b ∗ (cid:1) y k ≥ , α ∗ k = 1 . Finally, for any i such that α ∗ i = 0, for any j such that α ∗ j = 1 and y j = y i , we have ∂∂t | t =0 F ( α ∗ + te i − te j ) = y i h w ∗ , x i − x j i ≤ , showing that 1 − (cid:0) h w ∗ , x i i − b ∗ (cid:1) y i ≤ 0. 
This shows that ( α ∗ , w ∗ ) is always a saddlepoint. Definition 4.1.3 . The symmetric measurable kernel K : X × X → R is said to bepositive (or more precisely positive semi-definite) if for any n ∈ N , any ( x i ) ni =1 ∈ X n , inf α ∈ R n n X i =1 n X j =1 α i K ( x i , x j ) α j ≥ . Let Z = ( x i , y i ) Ni =1 be some training set. Let us consider as previously A = ( α ∈ R N + : N X i =1 α i y i = 0 ) . Let F ( α ) = N X i =1 N X j =1 α i y i K ( x i , x j ) y j α j − N X i =1 α i . Definition 4.1.4 . Let K be a positive symmetric kernel. The training set Z is saidto be K -separable if inf (cid:8) F ( α ) : α ∈ A (cid:9) > −∞ . .1. How to build them Lemma 4.1.7 . When Z is K -separable, inf { F ( α ) : α ∈ A } is reached. Proof. Consider the training set Z ′ = ( x ′ i , y i ) Ni =1 , where x ′ i = (cid:26)(cid:20)n K ( x k , x ℓ ) o N Nk =1 ,ℓ =1 (cid:21) / ( i, j ) (cid:27) Nj =1 ∈ R N . We see that F ( α ) = k P Ni =1 α i y i x ′ i k − P Ni =1 α i . We proved in the previous sectionthat Z ′ is linearly separable if and only if inf { F ( α ) : α ∈ A } > −∞ , and that theinfimum is reached in this case. (cid:3) Proposition 4.1.8 . Let K be a symmetric positive kernel and let Z = ( x i , y i ) Ni =1 be some K -separable training set. Let α ∗ ∈ A be such that F ( α ∗ ) = inf { F ( α ) : α ∈ A } . Let I ∗− = { i ∈ N : 1 ≤ i ≤ N, y i = − , α ∗ i > } I ∗ + = { i ∈ N : 1 ≤ i ≤ N, y i = +1 , α ∗ i > } b ∗ = 12 n N X j =1 α ∗ j y j K ( x j , x i − ) + N X j =1 α ∗ j y j K ( x j , x i + ) o , i − ∈ I ∗− , i + ∈ I ∗ + , where the value of b ∗ does not depend on the choice of i − and i + . The classificationrule f : X → Y defined by the formula f ( x ) = sign N X i =1 α ∗ i y i K ( x i , x ) − b ∗ ! is independent of the choice of α ∗ and is called the support vector machine definedby K and Z . The set S = { x j : P Ni =1 α ∗ i y i K ( x i , x j ) − b ∗ = y j } is called the set ofsupport vectors. For any choice of α ∗ , { x i : α ∗ i > } ⊂ S . An important consequence of this proposition is that the support vector machinedefined by K and Z is also the support vector machine defined by K and Z ′ = { ( x i , y i ) : α ∗ i > , ≤ i ≤ N } , since this restriction of the index set contains thevalue α ∗ where the minimum of F is reached. Proof. The independence of the choice of α ∗ , which is not necessarily unique,is seen as follows. Let ( x i ) Ni =1 and x ∈ X be fixed. Let us put for ease of notation x N +1 = x . Let M be the ( N + 1) × ( N + 1) symmetric semi-definite matrix definedby M ( i, j ) = K ( x i , x j ), i = 1 , . . . , N + 1, j = 1 , . . . , N + 1. Let us consider themapping Ψ : { x i : i = 1 , . . . , N + 1 } → R N +1 defined by(4.3) Ψ( x i ) = (cid:2) M / ( i, j ) (cid:3) N +1 j =1 ∈ R N +1 . Let us consider the training set Z ′ = (cid:2) Ψ( x i ) , y i (cid:3) Ni =1 . Then Z ′ is linearly separable, F ( α ) = (cid:13)(cid:13)(cid:13) N X i =1 α i y i Ψ( x i ) (cid:13)(cid:13)(cid:13) − N X i =1 α i , and we have proved that for any choice of α ∗ ∈ A minimizing F ( α ), w Z ′ = P Ni =1 α ∗ i y i Ψ( x i ). Thus the support vector machine defined by K and Z can also be expressed by the formula f ( x ) = sign h h w Z ′ , Ψ( x ) i − b Z ′ (cid:3) Chapter 4. Support Vector Machines which does not depend on α ∗ . The definition of S is such that Ψ( S ) is the set ofsupport vectors defined in the linear case, where its stated property has alreadybeen proved. (cid:3) We can in the same way use the box constraint and show that any solution α ∗ ∈ arg min { F ( α ) : α ∈ A , α i ≤ λ , i = 1 , . . 
. , N } minimizes(4.4) inf b ∈ R λ N X i =1 (cid:20) − (cid:18) N X j =1 y j α j K ( x j , x i ) − b (cid:19) y i (cid:21) + + 12 N X i =1 N X j =1 α i α j y i y j K ( x i , x j ) . Except the last, the results of this section are drawn from Cristianini et al. (2000).We have no reference for the last proposition of this section, although we believe itis well known. We include them for the convenience of the reader. Proposition 4.1.9 . Let K and K be positive symmetric kernels on X . Then forany a ∈ R + ( aK + K )( x, x ′ ) def = aK ( x, x ′ ) + K ( x, x ′ ) and ( K · K )( x, x ′ ) def = K ( x, x ′ ) K ( x, x ′ ) are also positive symmetric kernels. Moreover, for any measurable function g : X → R , K g ( x, x ′ ) def = g ( x ) g ( x ′ ) is also a positive symmetric kernel. Proof. It is enough to prove the proposition in the case when X is finite andkernels are just ordinary symmetric matrices. Thus we can assume without loss ofgenerality that X = { , . . . , n } . Then for any α ∈ R N , using usual matrix notation, h α, ( aK + K ) α i = a h α, K α i + h α, K α i ≥ , h α, ( K · K ) α i = X i,j α i K ( i, j ) K ( i, j ) α j = X i,j,k α i K / ( i, k ) K / ( k, j ) K ( i, j ) α j = X k X i,j (cid:2) K / ( k, i ) α i (cid:3) K ( i, j ) (cid:2) K / ( k, j ) α j (cid:3)| {z } ≥ ≥ , h α, K g α i = X i,j α i g ( i ) g ( j ) α j = X i α i g ( i ) ! ≥ . (cid:3) Proposition 4.1.10 . Let K be some positive symmetric kernel on X . Let p : R → R be a polynomial with positive coefficients. Let g : X → R d be a measurable func-tion. Then p ( K )( x, x ′ ) def = p (cid:2) K ( x, x ′ ) (cid:3) , .1. How to build them K )( x, x ′ ) def = exp (cid:2) K ( x, x ′ ) (cid:3) and G g ( x, x ′ ) def = exp (cid:0) −k g ( x ) − g ( x ′ ) k (cid:1) are all positive symmetric kernels. Proof. The first assertion is a direct consequence of the previous proposition. Thesecond comes from the fact that the exponential function is the pointwise limit of asequence of polynomial functions with positive coefficients. The third is seen fromthe second and the decomposition G g ( x, x ′ ) = h exp (cid:0) −k g ( x ) k (cid:1) exp (cid:0) −k g ( x ′ ) k (cid:1)i exp (cid:2) h g ( x ) , g ( x ′ ) i (cid:3) (cid:3) Proposition 4.1.11 . With the notation of the previous proposition, any trainingset Z = ( x i , y i ) Ni =1 ∈ (cid:0) X ×{− , +1 } (cid:1) N is G g -separable as soon as g ( x i ) , i = 1 , . . . , N are distinct points of R d . Proof. It is clearly enough to prove the case when X = R d and g is the identity.Let us consider some other generic point x N +1 ∈ R d and define Ψ as in (4.3). It isenough to prove that Ψ( x ) , . . . , Ψ( x N ) are affine independent, since the simplex,and therefore any affine independent set of points, can be split in any arbitrary wayby affine half-spaces. Let us assume that ( x , . . . , x N ) are affine dependent; thenfor some ( λ , . . . , λ N ) = 0 such that P Ni =1 λ i = 0, N X i =1 N X j =1 λ i G ( x i , x j ) λ j = 0 . Thus, ( λ i ) N +1 i =1 , where we have put λ N +1 = 0 is in the kernel of the symmetricpositive semi-definite matrix G ( x i , x j ) i,j ∈{ ,...,N +1 } . Therefore N X i =1 λ i G ( x i , x N +1 ) = 0 , for any x N +1 ∈ R d . This would mean that the functions x exp( −k x − x i k )are linearly dependent, which can be easily proved to be false. Indeed, let n ∈ R d be such that k n k = 1 and h n, x i i , i = 1 , . . . , N are distinct (such a vector exists,because it has to be outside the union of a finite number of hyperplanes, which isof zero Lebesgue measure on the sphere). 
Let us assume for a while that for some( λ i ) Ni =1 ∈ R N , for any x ∈ R d , N X i =1 λ i exp( −k x − x i k ) = 0 . Considering x = tn , for t ∈ R , we would get N X i =1 λ i exp(2 t h n, x i i − k x i k ) = 0 , t ∈ R . Letting t go to infinity, we see that this is only possible if λ i = 0 for all values of i . (cid:3) Chapter 4. Support Vector Machines We can use Support Vector Machines in the framework of compression schemesand apply Theorem 3.3.3 (page 125). More precisely, given some positive symmetrickernel K on X , we may consider for any training set Z ′ = ( x ′ i , y ′ i ) hi =1 the classifierˆ f Z ′ : X → Y which is equal to the Support Vector Machine defined by K and Z ′ whenever Z ′ is K -separable, and which is equal to some constant classification ruleotherwise; we take this convention to stick to the framework described on page 117,we will only use ˆ f Z ′ in the K -separable case, so this extension of the definition is justa matter of presentation. In the application of Theorem 3.3.3 in the case when theobserved sample ( X i , Y i ) Ni =1 is K -separable, a natural if perhaps sub-optimal choiceof Z ′ is to choose for ( x ′ i ) the set of support vectors defined by Z = ( X i , Y i ) Ni =1 andto choose for ( y ′ i ) the corresponding values of Y . This is justified by the fact thatˆ f Z = ˆ f Z ′ , as shown in Proposition 4.1.8 (page 139). If Z is not K -separable, wecan train a Support Vector Machine with the box constraint, then remove all theerrors to obtain a K -separable sub-sample Z ′ = { ( X i , Y i ) : α ∗ i < λ , ≤ i ≤ N } ,using the same notation as in equation (4.4) on page 140, and then consider itssupport vectors as the compression set. Still using the notation of page 140, thismeans we have to compute successively α ∗ ∈ arg min { F ( α ) : α ∈ A , α i ≤ λ } , and α ∗∗ ∈ arg min { F ( α ) : α ∈ A , α i = 0 when α ∗ i = λ } , to keep the compression setindexed by J = { i : 1 ≤ i ≤ N, α ∗∗ i > } , and the corresponding Support VectorMachine b f J . Different values of λ can be used at this stage, producing differentcandidate compression sets: when λ increases, the number of errors should decrease,on the other hand when λ decreases, the margin k w k − of the separable subset Z ′ increases, supporting the hope for a smaller set of support vectors, thus we can use λ to monitor the number of errors on the training set we accept from the compressionscheme. As we can use whatever heuristic we want while selecting the compressionset, we can also try to threshold in the previous construction α ∗∗ i at different levels η ≥ 0, to produce candidate compression sets J η = { i : 1 ≤ i ≤ N, α ∗∗ i > η } ofvarious sizes.As the size | J | of the compression set is random in this construction, we mustuse a version of Theorem 3.3.3 (page 125) which handles compression sets of arbi-trary sizes. This is done by choosing for each k a k -partially exchangeable posteriordistribution π k which weights the compression sets of all dimensions. We immedi-ately see that we can choose π k such that − log (cid:2) π k (∆ k ( J )) (cid:3) ≤ log (cid:2) | J | ( | J | + 1) (cid:3) + | J | log h ( k +1) eN | J | i .If we observe the shadow sample patterns, and if computer resources permit, wecan of course use more elaborate bounds than Theorem 3.3.3, such as the transduc-tive equivalent for Theorem 1.3.15 (page 31) (where we may consider the submod-els made of all the compression sets of the same size). 
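In algorithmic terms, the compression-scheme construction described in the previous paragraph can be summarized as follows. This is a rough illustration of ours, not the text's own procedure verbatim: it uses scikit-learn's SVC with a precomputed kernel as a stand-in dual solver, its parameter C playing the role of the box-constraint level of equation (4.4) only up to scaling conventions, and the function names are hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

def compression_set(K, y, box_C=1.0, hard_C=1e6, tol=1e-8):
    # K: (N, N) kernel Gram matrix, y: labels in {-1, +1}.
    y = np.asarray(y)
    n = len(y)
    # 1) box-constrained (soft margin) SVM on the whole training set
    soft = SVC(C=box_C, kernel="precomputed").fit(K, y)
    alpha = np.zeros(n)
    alpha[soft.support_] = np.abs(soft.dual_coef_[0])
    # 2) drop the points whose multiplier sits on the box bound
    #    (the retained sub-sample is separable and must keep both classes)
    keep = np.where(alpha < box_C - tol)[0]
    # 3) (nearly) hard-margin SVM on the retained sub-sample
    hard = SVC(C=hard_C, kernel="precomputed").fit(K[np.ix_(keep, keep)], y[keep])
    return keep[hard.support_]        # indices of the compression set J

def complexity_term(J_size, k, N):
    # - log pi_k[Delta_k(J)] <= log(|J| (|J| + 1)) + |J| log((k + 1) e N / |J|)
    return np.log(J_size * (J_size + 1)) + J_size * np.log((k + 1) * np.e * N / J_size)
```

Varying box_C, or thresholding the multipliers of the second fit at some level η as suggested above, produces the family of candidate compression sets among which the bound, being uniform, lets us choose by any heuristic we like.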
Theorems based on relativebounds, such as Theorem 2.2.4 (page 73) or Theorem 2.3.9 (page 108) can also beused. Gibbs distributions can be approximated by Monte Carlo techniques, wherea Markov chain with the proper invariant measure consists in appropriate localperturbations of the compression set.Let us mention also that the use of compression schemes based on Support VectorMachines can be tailored to perform some kind of feature aggregation . Imagine thatthe kernel K is defined as the scalar product in L ( π ), where π ∈ M (Θ). Moreprecisely let us consider for some set of soft classification rules (cid:8) f θ : X → R ; θ ∈ Θ (cid:9) .2. Bounds for Support Vector Machines K ( x, x ′ ) = Z θ ∈ Θ f θ ( x ) f θ ( x ′ ) π ( dθ ) . In this setting, the Support Vector Machine applied to the training set Z = ( x i ,y i ) Ni =1 has the form f Z ( x ) = sign Z θ ∈ Θ f θ ( x ) N X i =1 y i α i f θ ( x i ) π ( dθ ) − b ! and, if this is too burdensome to compute, we can replace it with some finiteapproximation e f Z ( x ) = sign m m X k =1 f θ k ( x ) w k − b ! , where the set { θ k , k = 1 , . . . , m } and the weights { w k , k = 1 , . . . , m } are computedin some suitable way from the set Z ′ = ( x i , y i ) i,α i > of support vectors of f Z . Forinstance, we can draw { θ k , k = 1 , . . . , m } at random according to the probabilitydistribution proportional to (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) N X i =1 y i α i f θ ( x i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) π ( dθ ) , define the weights w k by w k = sign N X i =1 y i α i f θ k ( x i ) ! Z θ ∈ Θ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) N X i =1 y i α i f θ ( x i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) π ( dθ ) , and choose the smallest value of m for which this approximation still classifies Z ′ without errors. Let us remark that we have built e f Z in such a way thatlim m → + ∞ e f Z ( x i ) = f Z ( x i ) = y i , a.s.for any support index i such that α i > Z ′ , we can select a finite set of features Θ ′ ⊂ Θ such that Z ′ is K Θ ′ separable, where K Θ ′ ( x, x ′ ) = P θ ∈ Θ ′ f θ ( x ) f θ ( x ′ ) and consider the SupportVector Machines f Z ′ built with the kernel K Θ ′ . As soon as Θ ′ is chosen as a functionof Z ′ only, Theorem 3.3.3 (page 125) applies and provides some level of confidencefor the risk of f Z ′ . Let us consider some set X and some set S ⊂ { , } X of subsets of X . Let h ( S ) bethe Vapnik–Cervonenkis dimension of S , defined as h ( S ) = max n | A | : A ⊂ X, | A | < ∞ and A ∩ S = { , } A o , where by definition A ∩ S = { A ∩ B : B ∈ S } and | A | is the number of points in A .Let us notice that this definition does not depend on the choice of the reference set X . Indeed X can be chosen to be S S , the union of all the sets in S or any bigger44 Chapter 4. Support Vector Machines set. Let us notice also that for any set B , h ( B ∩ S ) ≤ h ( S ), the reason being that A ∩ ( B ∩ S ) = B ∩ ( A ∩ S ).This notion of Vapnik–Cervonenkis dimension is useful because, as we will seefor Support Vector Machines, it can be computed in some important special cases.Let us prove here as an illustration that h ( S ) = d + 1 when X = R d and S is madeof all the half spaces: S = { A w,b : w ∈ R d , b ∈ R } , where A w,b = { x ∈ X : h w, x i ≥ b } . Proposition 4.2.1 . With the previous notation, h ( S ) = d + 1 . Proof. Let ( e i ) d +1 i =1 be the canonical base of R d +1 , and let X be the affine subspaceit generates, which can be identified with R d . 
For any ( ǫ i ) d +1 i =1 ∈ {− , +1 } d +1 ,let w = P d +1 i =1 ǫ i e i and b = 0. The half space A w,b ∩ X is such that { e i ; i =1 , . . . , d + 1 } ∩ ( A w,b ∩ X ) = { e i ; ǫ i = +1 } . This proves that h ( S ) ≥ d + 1.To prove that h ( S ) ≤ d + 1, we have to show that for any set A ⊂ R d of size | A | = d + 2, there is B ⊂ A such that B ( A ∩ S ). Obviously this will be the case ifthe convex hulls of B and A \ B have a non-empty intersection: indeed if a hyperplaneseparates two sets of points, it also separates their convex hulls. As | A | > d + 1, A is affine dependent: there is ( λ x ) x ∈ A ∈ R d +2 \ { } such that P x ∈ A λ x x = 0 and P x ∈ A λ x = 0. The set B = { x ∈ A : λ x > } and its complement A \ B are non-empty, because P x ∈ A λ x = 0 and λ = 0. Moreover P x ∈ B λ x = P x ∈ A \ B − λ x > P x ∈ B λ x X x ∈ B λ x x = 1 P x ∈ B λ x X x ∈ A \ B − λ x x shows that the convex hulls of B and A \ B have a non-void intersection. (cid:3) Let us introduce the function of two integersΦ hn = h X k =0 (cid:18) nk (cid:19) , which can alternatively be defined by the relationsΦ hn = ( n when n ≤ h, Φ h − n − + Φ hn − when n > h. Theorem 4.2.2 . Whenever S S is finite, | S | ≤ Φ (cid:16)(cid:12)(cid:12)(cid:12)[ S (cid:12)(cid:12)(cid:12) , h ( S ) (cid:17) . Theorem 4.2.3 . For any h ≤ n , Φ hn ≤ exp (cid:2) nH (cid:0) hn (cid:1)(cid:3) ≤ exp (cid:2) h (cid:0) log( nh ) + 1 (cid:1)(cid:3) , where H ( p ) = − p log( p ) − (1 − p ) log(1 − p ) is the Shannon entropy of the Bernoullidistribution with parameter p ..2. Bounds for Support Vector Machines Proof of theorem 4.2.2. Let us prove this theorem by induction on | S S | . Itis easy to check that it holds true when | S S | = 1. Let X = S S , let x ∈ X and X ′ = X \ { x } . Define ( △ denoting the symmetric difference of two sets) S ′ = { A ∈ S : A △ { x } ∈ S } ,S ′′ = { A ∈ S : A △ { x } 6∈ S } . Clearly, ⊔ denoting the disjoint union, S = S ′ ⊔ S ′′ and S ∩ X ′ = ( S ′ ∩ X ′ ) ⊔ ( S ′′ ∩ X ′ ).Moreover | S ′ | = 2 | S ′ ∩ X ′ | and | S ′′ | = | S ′′ ∩ X ′ | . Thus | S | = | S ′ | + | S ′′ | = 2 | S ′ ∩ X ′ | + | S ′′ | = | S ∩ X ′ | + | S ′ ∩ X ′ | . Obviously h ( S ∩ X ′ ) ≤ h ( S ). Moreover h ( S ′ ∩ X ′ ) = h ( S ′ ) − 1, because if A ⊂ X ′ isshattered by S ′ (or equivalently by S ′ ∩ X ′ ), then A ∪ { x } is shattered by S ′ (we saythat A is shattered by S when A ∩ S = { , } A ). Using the induction hypothesis, wethen see that | S ∩ X ′ | ≤ Φ h ( S ) | X ′ | + Φ h ( S ) − | X ′ | . But as | X ′ | = | X | − 1, the right-hand sideof this inequality is equal to Φ h ( S ) | X | , according to the recurrence equation satisfiedby Φ. Proof of theorem 4.2.3: This is the well-known Chernoff bound for thedeviation of sums of Bernoulli random variables: let ( σ , . . . , σ n ) be i.i.d. Bernoullirandom variables with parameter 1 / 2. Let us notice thatΦ hn = 2 n P n X i =1 σ i ≤ h ! . For any positive real number λ , P (cid:18) n X i =1 σ i ≤ h (cid:19) ≤ exp( λh ) E (cid:20) exp (cid:18) − λ n X i =1 σ i (cid:19)(cid:21) = exp n λh + n log (cid:8) E (cid:2) exp (cid:0) − λσ (cid:1)(cid:3)(cid:9)o . Differentiating the right-hand side in λ shows that its minimal value isexp (cid:2) − n K ( hn , ) (cid:3) , where K ( p, q ) = p log( pq ) + (1 − p ) log( − p − q ) is the Kullback diver-gence function between two Bernoulli distributions B p and B q of parameters p and q . Indeed the optimal value λ ∗ of λ is such that h = n E (cid:2) σ exp( − λ ∗ σ ) (cid:3) E (cid:2) exp( − λ ∗ σ ) (cid:3) = nB h/n ( σ ) . 
Therefore, using the fact that two Bernoulli distributions with the same expecta-tions are equal,log (cid:8) E (cid:2) exp( − λ ∗ σ ) (cid:3)(cid:9) = − λ ∗ B h/n ( σ ) − K ( B h/n , B / ) = − λ ∗ hn − K ( hn , ) . The announced result then follows from the identity H ( p ) = log(2) − K ( p, )= p log( p − ) + (1 − p ) log(1 + p − p ) ≤ p (cid:2) log( p − ) + 1 (cid:3) . Chapter 4. Support Vector Machines The proof of the following theorem was suggested to us by a similar proof presentedin Cristianini et al. (2000). Theorem 4.2.4 . Consider a family of points ( x , . . . , x n ) in some Euclidean vec-tor space E and a family of affine functions H = (cid:8) g w,b : E → R ; w ∈ E, k w k = 1 , b ∈ R (cid:9) , where g w,b ( x ) = h w, x i − b, x ∈ E. Assume that there is a set of thresholds ( b i ) ni =1 ∈ R n such that for any ( y i ) ni =1 ∈ {− , +1 } n , there is g w,b ∈ H such that n inf i =1 (cid:0) g w,b ( x i ) − b i (cid:1) y i ≥ γ. Let us also introduce the empirical variance of ( x i ) ni =1 , V ar( x , . . . , x n ) = 1 n n X i =1 (cid:13)(cid:13)(cid:13)(cid:13) x i − n n X j =1 x j (cid:13)(cid:13)(cid:13)(cid:13) . In this case and with this notation, (4.5) V ar( x , . . . , x n ) γ ≥ ( n − when n is even, ( n − n − n when n is odd.Moreover, equality is reached when γ is optimal, b i = 0 , i = 1 , . . . , n and ( x , . . . ,x n ) is a regular simplex (i.e. when γ is the minimum distance between the convexhulls of any two subsets of { x , . . . , x n } and k x i − x j k does not depend on i = j ). Proof. Let ( s i ) ni =1 ∈ R n be such that P ni =1 s i = 0. Let σ be a uniformly distributedrandom variable with values in S n , the set of permutations of the first n integers { , . . . , n } . By assumption, for any value of σ , there is an affine function g w,b ∈ H such that min i =1 ,...,n (cid:2) g w,b ( x i ) − b i (cid:3)(cid:2) ( s σ ( i ) > − (cid:3) ≥ γ. As a consequence * n X i =1 s σ ( i ) x i , w + = n X i =1 s σ ( i ) (cid:0) h x i , w i − b − b i (cid:1) + n X i =1 s σ ( i ) b i ≥ n X i =1 γ | s σ ( i ) | + s σ ( i ) b i . Therefore, using the fact that the map x (cid:16) max (cid:8) , x (cid:9)(cid:17) is convex, E (cid:13)(cid:13)(cid:13)(cid:13) n X i =1 s σ ( i ) x i (cid:13)(cid:13)(cid:13)(cid:13) ! ≥ E max ( , n X i =1 γ | s σ ( i ) | + s σ ( i ) b i )! .2. Bounds for Support Vector Machines ≥ max ( , n X i =1 γ E (cid:0) | s σ ( i ) | (cid:1) + E (cid:0) s σ ( i ) (cid:1) b i )! = γ n X i =1 | s i | ! , where E is the expectation with respect to the random permutation σ . On the otherhand E (cid:13)(cid:13)(cid:13)(cid:13) n X i =1 s σ ( i ) x i (cid:13)(cid:13)(cid:13)(cid:13) ! = n X i =1 E ( s σ ( i ) ) k x i k + X i = j E ( s σ ( i ) s σ ( j ) ) h x i , x j i . Moreover E ( s σ ( i ) ) = 1 n E n X i =1 s σ ( i ) ! = 1 n n X i =1 s i . In the same way, for any i = j , E (cid:0) s σ ( i ) s σ ( j ) (cid:1) = 1 n ( n − E X i = j s σ ( i ) s σ ( j ) = 1 n ( n − X i = j s i s j = 1 n ( n − " n X i =1 s i | {z } =0 ! − n X i =1 s i = − n ( n − n X i =1 s i . Thus E (cid:13)(cid:13)(cid:13)(cid:13) n X i =1 s σ ( i ) x i (cid:13)(cid:13)(cid:13)(cid:13) ! = n X i =1 s i ! n n X i =1 k x i k − n ( n − X i = j h x i , x j i = n X i =1 s i ! "(cid:18) n + 1 n ( n − (cid:19) n X i =1 k x i k − n ( n − (cid:13)(cid:13)(cid:13)(cid:13) n X i =1 x i (cid:13)(cid:13)(cid:13)(cid:13) = nn − n X i =1 s i ! V ar( x , . . . , x n ) . We have proved that V ar( x , . . . , x n ) γ ≥ ( n − (cid:18) n X i =1 | s i | (cid:19) n n X i =1 s i . 
This can be used with s i = ( i ≤ n ) − ( i > n ) in the case when n is even and s i = n − ( i ≤ n − ) − n +1 ( i > n − ) in the case when n is odd, to establish thefirst inequality (4.5) of the theorem.48 Chapter 4. Support Vector Machines Checking that equality is reached for the simplex is an easy computation whenthe simplex ( x i ) ni =1 ∈ ( R n ) n is parametrized in such a way that x i ( j ) = ( i = j, (cid:3) We are going to apply Theorem 4.2.4 (page 146) to Support Vector Machines inthe transductive case. Let ( X i , Y i ) ( k +1) Ni =1 be distributed according to some partiallyexchangeable distribution P and assume that ( X i ) ( k +1) Ni =1 and ( Y i ) Ni =1 are observed.Let us consider some positive kernel K on X . For any K -separable training set ofthe form Z ′ = ( X i , y ′ i ) ( k +1) Ni =1 , where ( y ′ i ) ( k +1) Ni =1 ∈ Y ( k +1) N , let ˆ f Z ′ be the SupportVector Machine defined by K and Z ′ and let γ ( Z ′ ) be its margin. Let R = max i =1 ,..., ( k +1) N K ( X i , X i ) + 1( k + 1) N k +1) N X j =1 ( k +1) N X k =1 K ( X j , X k ) − k + 1) N ( k +1) N X j =1 K ( X i , X j ) . This is an easily computable upper-bound for the radius of some ball containingthe image of ( X , . . . , X ( k +1) N ) in feature space.Let us define for any integer h the margins(4.6) γ h = (2 h − − / and γ h +1 = (cid:20) h (cid:18) − h + 1) (cid:19)(cid:21) − / . Let us consider for any h = 1 , . . . , N the exchangeable model R h = (cid:8) ˆ f Z ′ : Z ′ = ( X i , y ′ i ) ( k +1) Ni =1 is K -separable and γ ( Z ′ ) ≥ Rγ h (cid:9) . The family of models R h , h = 1 , . . . , N is nested, and we know from Theorem 4.2.4(page 146) and Theorems 4.2.2 (page 144) and 4.2.3 (page 144) thatlog (cid:0) | R h | (cid:1) ≤ h log (cid:0) ( k +1) eNh (cid:1) . We can then consider on the large model R = F Nh =1 R h (the disjoint union of thesub-models) an exchangeable prior π which is uniform on each R h and is such that π ( R h ) ≥ h ( h +1) . Applying Theorem 3.2.3 (page 117) we get Proposition 4.2.5 . With P probability at least − ǫ , for any h = 1 , . . . , N , anySupport Vector Machine f ∈ R h , r ( f ) ≤ k + 1 k inf λ ∈ R + − exp h − λN r ( f ) − hN log (cid:16) e ( k +1) Nh (cid:17) − log[ h ( h +1)] − log( ǫ ) N i − exp( − λN ) − r ( f ) k . .2. Bounds for Support Vector Machines R h to optimize the bound may require more computerresources than are available, but any heuristic can be applied to choose f , since thebound is uniform. For instance, a Support Vector Machine f ′ using a box constraintcan be trained from the training set ( X i , Y i ) Ni =1 and then ( y ′ i ) ( k +1) Ni =1 can be set to y ′ i = sign( f ′ ( X i )), i = 1 , . . . , ( k + 1) N . In order to establish inductive margin bounds, we will need a different combinatoriallemma. It is due to Alon et al. (1997). We will reproduce their proof with some tinyimprovements on the values of constants.Let us consider the finite case when X = { , . . . , n } , Y = { , . . . , b } and b ≥ 3. The question we will study would be meaningless when b ≤ 2. Assumeas usual that we are dealing with a prescribed set of classification rules R = (cid:8) f : X → Y (cid:9) . Let us say that a pair ( A, s ), where A ⊂ X is a non-empty set of shapesand s : A → { , . . . , b − } a threshold function, is shattered by the set of func-tions F ⊂ R if for any ( σ x ) x ∈ A ∈ {− , +1 } A , there exists some f ∈ F such thatmin x ∈ A σ x (cid:2) f ( x ) − s ( x ) (cid:3) ≥ Definition 4.2.1 . 
Let the fat shattering dimension of ( X , R ) be the maximal size | A | of the first component of the pairs which are shattered by R . Let us say that a subset of classification rules F ⊂ Y X is separated whenever forany pair ( f, g ) ∈ F such that f = g , k f − g k ∞ = max x ∈ X | f ( x ) − g ( x ) | ≥ 2. Let M ( R ) be the maximum size | F | of separated subsets F of R . Note that if F is aseparated subset of R such that | F | = M ( R ), then it is a 1-net for the L ∞ distance:for any function f ∈ R there exists g ∈ F such that k f − g k ∞ ≤ f could be added to F to create a larger separated set). Lemma 4.2.6 . With the above notation, whenever the fat shattering dimensionof ( X , R ) is not greater than h , log (cid:2) M ( R ) (cid:3) < log (cid:2) ( b − b − n (cid:3)( log (cid:2)P hi =1 (cid:0) ni (cid:1) ( b − i (cid:3) log(2) + 1 ) + log(2) ≤ log (cid:2) ( b − b − n (cid:3)((cid:20) log h ( b − nh i + 1 (cid:21) h log(2) + 1 ) + log(2) . Proof. For any set of functions F ⊂ Y X , let t ( F ) be the number of pairs ( A, s )shattered by F . Let t ( m, n ) be the minimum of t ( F ) over all separated sets offunctions F ⊂ Y X of size | F | = m ( n is here to recall that the shape space X ismade of n shapes). For any m such that t ( m, n ) > P hi =1 (cid:0) ni (cid:1) ( b − i , it is clear thatany separated set of functions of size | F | ≥ m shatters at least one pair ( A, s ) suchthat | A | > h . Indeed, from its definition t ( m, n ) is clearly a non-decreasing functionof m , so that t ( | F | , n ) > P hi =1 (cid:0) ni (cid:1) ( b − i . Moreover there are only P hi =1 (cid:0) ni (cid:1) ( b − i pairs ( A, s ) such that | A | ≤ h . As a consequence, whenever the fat shatteringdimension of ( X , R ) is not greater than h we have M ( R ) < m .It is clear that for any n ≥ t (2 , n ) = 1. Lemma 4.2.7 . For any m ≥ , t (cid:2) mn ( b − b − , n (cid:3) ≥ t (cid:2) m, n − (cid:3) , and therefore t (cid:2) n ( n − · · · ( n − r + 1)( b − r ( b − r , n (cid:3) ≥ r . Chapter 4. Support Vector Machines Proof. Let F = { f , . . . , f mn ( b − b − } be some separated set of functions of size mn ( b − b − f i − , f i ), i = 1 , . . . , mn ( b − b − / 2, there is x i ∈ X such that | f i − ( x i ) − f i ( x i ) | ≥ 2. Since | X | = n , there is x ∈ X such that P mn ( b − b − / i =1 ( x i = x ) ≥ m ( b − b − / 2. Let I = { i : x i = x } . Since thereare ( b − b − / y , y ) ∈ Y such that 1 ≤ y < y − ≤ b − 1, thereis some pair ( y , y ), such that 1 ≤ y < y ≤ b and such that P i ∈ I ( { y , y } = { f i − ( x ) , f i ( x ) } ) ≥ m . Let J = (cid:8) i ∈ I : { f i − ( x ) , f i ( x ) } = { y , y } (cid:9) . Let F = { f i − : i ∈ J, f i − ( x ) = y } ∪ { f i : i ∈ J, f i ( x ) = y } ,F = { f i − : i ∈ J, f i − ( x ) = y } ∪ { f i : i ∈ J, f i ( x ) = y } . Obviously | F | = | F | = | J | = m . Moreover the restrictions of the functions of F to X \ { x } are separated, and it is the same with F . Thus F strongly shatters at least t ( m, n − 1) pairs ( A, s ) such that A ⊂ X \ { x } and it is the same with F . Finally,if the pair ( A, s ) where A ⊂ X \ { x } is both shattered by F and F , then F ∪ F shatters also ( A ∪ { x } , s ′ ) where s ′ ( x ′ ) = s ( x ′ ) for any x ′ ∈ A and s ′ ( x ) = ⌊ y + y ⌋ .Thus F ∪ F , and therefore F , shatters at least 2 t ( m, n − 1) pairs ( A, s ). 
(cid:3) Resuming the proof of lemma 4.2.6, let us choose for r the smallest integer suchthat 2 r > P hi =1 (cid:0) ni (cid:1) ( b − i , which is no greater than (cid:26) log (cid:2)P hi =1 ( ni ) ( b − i (cid:3) log(2) + 1 (cid:27) .In the case when 1 ≤ n ≤ r ,log( M ( R )) < | X | log( | Y | ) = n log( b ) ≤ r log( b ) ≤ r log (cid:2) ( b − b − n (cid:3) + log(2) , which proves the lemma. In the remaining case n > r , t (cid:2) n r ( b − r ( b − r , n (cid:3) ≥ t (cid:2) n ( n − . . . ( n − r + 1)( b − r ( b − r , n (cid:3) > h X i =1 (cid:18) ni (cid:19) ( b − i . Thus | M ( R ) | < h ( b − b − n i r as claimed. (cid:3) In order to apply this combinatorial lemma to Support Vector Machines, let usconsider now the case of separating hyperplanes in R d (the generalization to SupportVector Machines being straightforward). Assume that X = R d and Y = {− , +1 } .For any sample ( X ) ( k +1) Ni =1 , let R ( X ( k +1) N ) = max {k X i k : 1 ≤ i ≤ ( k + 1) N } . Let us consider the set of parametersΘ = (cid:8) ( w, b ) ∈ R d × R : k w k = 1 (cid:9) . For any ( w, b ) ∈ Θ, let g w,b ( x ) = h w, x i − b . Let h be some fixed integer and let γ = R ( X ( k +1) N ) γ h , where γ h is defined by equation (4.6, page 148). .2. Bounds for Support Vector Machines ζ : R → Z by ζ ( r ) = − r ≤ − γ, − − γ Theorem 4.2.8 . Let us consider the sequence ( γ h ) h ∈ N ∗ defined by equation (4.6,page 148). With P probability at least − ǫ , for any ( w, b ) ∈ Θ , Chapter 4. Support Vector Machines kN ( k +1) N X i = N +1 n (cid:2) g w,b ( X i ) ≥ (cid:3) − = Y i o ≤ k + 1 k inf λ ∈ R + ,h ∈ N ∗ (cid:2) − exp( − λN ) (cid:3) − ( − exp " − λN N X i =1 (cid:2) g w,b ( X i ) Y i ≤ Rγ h (cid:3) − log (cid:2) k + 1) N (cid:3)n h log(2) log (cid:16) e ( k +1) Nh (cid:17) + 1 o + log h h ( h +1) ǫ i N − kN N X i =1 (cid:2) g w,b ( X i ) Y i ≤ Rγ h (cid:3) . Properly speaking this theorem is not a margin bound, but more precisely a marginquantile bound, since it covers the case where some fraction of the training samplefalls within the region defined by the margin parameter γ h which optimizes thebound.As a consequence though, we get a true (weaker) margin bound: with P proba-bility at least 1 − ǫ , for any ( w, b ) ∈ Θ such that γ = min i =1 ,...,N g w,b ( X i ) Y i > , kN ( k +1) N X i = N +1 (cid:2) g w,b ( X i ) Y i < (cid:3) ≤ k +1 k (cid:26) − exp (cid:20) − log (cid:2) k +1) N (cid:3) N n R +2 γ log(2) γ log (cid:16) e ( k +1) Nγ R (cid:17) + 1 o + 1 N log( ǫ ) (cid:21)(cid:27) . This inequality compares favourably with similar inequalities in Cristianini et al.(2000), which moreover do not extend to the margin quantile case as this one.Let us also mention that it is easy to circumvent the fact that R is not observedwhen the test set X ( k +1) NN +1 is not observed.Indeed, we can consider the sample obtained by projecting X ( k +1) N on someball of fixed radius R max , putting t R max ( X i ) = min (cid:26) , R max k X i k (cid:27) X i . We can further consider an atomic prior distribution ν ∈ M ( R + ) bearing on R max ,to obtain a uniform result through a union bound. As a consequence of the previoustheorem, we have Corollary 4.2.9 . For any atomic prior ν ∈ M ( R + ) , for any partially exchange-able probability measure P ∈ M (Ω) , with P probability at least − ǫ , for any ( w, b ) ∈ Θ , any R max ∈ R + ,.2. 
Bounds for Support Vector Machines kN ( k +1) N X i = N +1 n (cid:2) g w,b ◦ t R max ( X i ) ≥ (cid:3) − = Y i o ≤ k + 1 k inf λ ∈ R + ,h ∈ N ∗ (cid:2) − exp( − λN ) (cid:3) − ( − exp " − λN N X i =1 (cid:2) g w,b ◦ t R max ( X i ) Y i ≤ R max γ h (cid:3) − log (cid:2) k + 1) N (cid:3)n h log(2) log (cid:16) e ( k +1) Nh (cid:17) + 1 o + log h h ( h +1) ǫν ( R max ) i N − kN N X i =1 (cid:2) g w,b ◦ t R max ( X i ) Y i ≤ R max γ h (cid:3) . Let us remark that t R max ( X i ) = X i , i = N + 1 , . . . , ( k + 1) N , as soon as weconsider only the values of R max not smaller than max i = N +1 ,..., ( k +1) N k X i k in thiscorollary. Thus we obtain a bound on the transductive generalization error of theunthresholded classification rule 2 (cid:2) g w,b ( X i ) ≥ (cid:3) − 1, as well as some incitation toreplace it with a thresholded rule when the value of R max minimizing the boundfalls below max i = N +1 ,..., ( k +1) N k X i k .54 Chapter 4. Support Vector Machines ppendix: Classification bythresholding In this appendix, we show how the bounds given in the first section of this mono-graph can be computed in practice on a simple example: the case when the clas-sification is performed by comparing a series of measurements to threshold values.Let us mention that our description covers the case when the same measurement iscompared to several thresholds, since it is enough to repeat a measurement in thelist of measurements describing a pattern to cover this case. Let us assume that the patterns we want to classify are described through h realvalued measurements normalized in the range (0 , X = (0 , h .Consider the threshold set T = (0 , h and the response set R = Y { , } h . For any t ∈ (0 , h and any a : { , } h → Y , let f ( t,a ) ( x ) = a n(cid:2) ( x j ≥ t j ) (cid:3) hj =1 o , x ∈ X , where x j is the j th coordinate of x ∈ X . Thus our parameter set here is Θ = T × R . Let us consider the Lebesgue measure L on T and the uniform probabilitydistribution U on R . Let our prior distribution be π = L ⊗ U . Let us define for anythreshold sequence t ∈ T ∆ t = n t ′ ∈ T : ( t ′ j , t j ) ∩ { X ji ; i = 1 , . . . , N } = ∅ , j = 1 , . . . , h o , where X ji is the j th coordinate of the sample pattern X i , and where the interval( t ′ j , t j ) of the real line is defined as the convex hull of the two point set { t ′ j , t j } ,whether t ′ j ≤ t j or not. We see that ∆ t is the set of thresholds giving the sameresponse as t on the training patterns. Let us consider for any t ∈ T the middle m (∆ t ) = R ∆ t t ′ L ( dt ′ ) L (∆ t )of ∆ t . The set ∆ t being a product of intervals, its middle is the point whose coordi-nates are the middle of these intervals. Let us introduce the finite set T composedof the middles of the cells ∆ t , which can be defined as T = { t ∈ T : t = m (∆ t ) } . It is easy to see that | T | ≤ ( N + 1) h and that | R | = | Y | h .15556 Appendix For any parameter ( t, a ) ∈ T × R = Θ, let us consider the posterior distributiondefined by its density dρ ( t,a ) dπ ( t ′ , a ′ ) = (cid:0) t ′ ∈ ∆ t (cid:1) (cid:0) a ′ = a (cid:1) π (cid:0) ∆ t × { a } (cid:1) . In fact we are considering a finite number of posterior distributions, since ρ ( t,a ) = ρ ( m (∆ t ) ,a ) , where m (∆ t ) ∈ T . Moreover, for any exchangeable sample distribution P ∈ M (cid:2) ( X × Y ) N +1 (cid:3) and any thresholds t ∈ T , P h ( X jN +1 , t j ) ∩ { X ji , i = 1 , . . . , N } = ∅ i ≤ N + 1 . Thus, for any ( t, a ) ∈ Θ, P n ρ ( t,a ) (cid:2) f . 
For any parameter $(t, a) \in T \times \mathcal{R} = \Theta$, let us consider the posterior distribution defined by its density
\[
\frac{d\rho_{(t,a)}}{d\pi}( t', a' ) = \frac{ \mathbb{1}\bigl( t' \in \Delta_{t} \bigr)\, \mathbb{1}\bigl( a' = a \bigr) }{ \pi\bigl( \Delta_{t} \times \{a\} \bigr) }.
\]
In fact we are considering a finite number of posterior distributions, since $\rho_{(t,a)} = \rho_{(m(\Delta_{t}), a)}$, where $m(\Delta_{t}) \in \overline{T}$. Moreover, for any exchangeable sample distribution $P \in \mathcal{M}_{+}^{1}\bigl[ ( \mathcal{X} \times \mathcal{Y} )^{N+1} \bigr]$ and any thresholds $t \in T$,
\[
P\Bigl[ ( X_{N+1}^{j}, t_{j} ) \cap \{ X_{i}^{j}, i = 1, \dots, N \} = \varnothing \Bigr] \le \frac{1}{N+1}.
\]
Thus, for any $(t, a) \in \Theta$,
\[
P\Bigl\{ \rho_{(t,a)}\bigl[ f_{\cdot}(X_{N+1}) \bigr] \ne f_{(t,a)}(X_{N+1}) \Bigr\} \le \frac{h}{N+1},
\]
showing that the classification produced by $\rho_{(t,a)}$ on new examples is typically non-random; this result is only indicative, since it is concerned with a non-random choice of $(t, a)$.

Let us compute the various quantities needed to apply the results of the first section, focussing our attention on Theorem 2.1.3 (page 54). First note that $\rho_{(t,a)}(r) = r[(t,a)]$. The entropy term is such that
\[
\mathcal{K}( \rho_{(t,a)}, \pi ) = - \log\bigl[ \pi\bigl( \Delta_{t} \times \{a\} \bigr) \bigr] = - \log\bigl[ L( \Delta_{t} ) \bigr] + 2^{h} \log\bigl( |\mathcal{Y}| \bigr).
\]
Let us notice accordingly that
\[
\min_{(t,a) \in \Theta} \mathcal{K}( \rho_{(t,a)}, \pi ) \le h \log( N + 1 ) + 2^{h} \log\bigl( |\mathcal{Y}| \bigr).
\]
Let us introduce the counters
\[
b_{t}^{y}(c) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\Bigl\{ Y_{i} = y \text{ and } \bigl[ \mathbb{1}( X_{i}^{j} \ge t_{j} ) \bigr]_{j=1}^{h} = c \Bigr\}, \qquad t \in T,\ c \in \{0,1\}^{h},\ y \in \mathcal{Y},
\]
\[
b_{t}(c) = \sum_{y \in \mathcal{Y}} b_{t}^{y}(c) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\Bigl\{ \bigl[ \mathbb{1}( X_{i}^{j} \ge t_{j} ) \bigr]_{j=1}^{h} = c \Bigr\}, \qquad t \in T,\ c \in \{0,1\}^{h}.
\]
Since
\[
r[(t,a)] = \sum_{c \in \{0,1\}^{h}} \bigl[ b_{t}(c) - b_{t}^{a(c)}(c) \bigr],
\]
the partition function of the Gibbs estimator can be computed as
\[
\pi\bigl[ \exp( - \lambda r ) \bigr]
= \sum_{t \in \overline{T}} L( \Delta_{t} ) \sum_{a \in \mathcal{R}} |\mathcal{Y}|^{-2^{h}} \exp\Bigl[ - \frac{\lambda}{N} \sum_{i=1}^{N} \mathbb{1}\bigl( Y_{i} \ne f_{(t,a)}( X_{i} ) \bigr) \Bigr]
= \sum_{t \in \overline{T}} L( \Delta_{t} ) \sum_{a \in \mathcal{R}} |\mathcal{Y}|^{-2^{h}} \exp\Bigl[ - \lambda \sum_{c \in \{0,1\}^{h}} \bigl[ b_{t}(c) - b_{t}^{a(c)}(c) \bigr] \Bigr]
= \sum_{t \in \overline{T}} L( \Delta_{t} ) \prod_{c \in \{0,1\}^{h}} \Bigl[ |\mathcal{Y}|^{-1} \sum_{y \in \mathcal{Y}} \exp\Bigl( - \lambda \bigl[ b_{t}(c) - b_{t}^{y}(c) \bigr] \Bigr) \Bigr].
\]
We see that the number of operations needed to compute $\pi\bigl[ \exp( - \lambda r ) \bigr]$ is proportional to $|\overline{T}| \times h \times |\mathcal{Y}| \le ( N + 1 )^{h}\, h\, |\mathcal{Y}|$. An exact computation will therefore be feasible only for small values of $N$ and $h$. For higher values, a Monte Carlo approximation of this sum will have to be performed instead.

If we want to compute the bound provided by Theorem 2.1.3 (page 54) or by Theorem 2.2.2 (page 69), we also need to compute, for any fixed parameter $\theta \in \Theta$, quantities of the type
\[
\pi_{\exp( - \lambda r )}\Bigl\{ \exp\bigl[ \xi\, m'( \cdot, \theta ) \bigr] \Bigr\} = \pi_{\exp( - \lambda r )}\Bigl\{ \exp\bigl[ \xi\, \rho_{\theta}( m' ) \bigr] \Bigr\}, \qquad \lambda, \xi \in \mathbb{R}_{+}.
\]
We need to introduce
\[
b_{t}^{y}( \theta, c ) = \frac{1}{N} \sum_{i=1}^{N} \Bigl| \mathbb{1}\bigl[ f_{\theta}( X_{i} ) = Y_{i} \bigr] - \mathbb{1}\bigl( y = Y_{i} \bigr) \Bigr|\, \mathbb{1}\Bigl\{ \bigl[ \mathbb{1}( X_{i}^{j} \ge t_{j} ) \bigr]_{j=1}^{h} = c \Bigr\}.
\]
Similarly to what has been done previously, we obtain
\[
\pi\Bigl\{ \exp\bigl[ - \lambda r + \xi\, m'( \cdot, \theta ) \bigr] \Bigr\} = \sum_{t \in \overline{T}} L( \Delta_{t} ) \prod_{c \in \{0,1\}^{h}} \Bigl[ |\mathcal{Y}|^{-1} \sum_{y \in \mathcal{Y}} \exp\Bigl( - \lambda \bigl[ b_{t}(c) - b_{t}^{y}(c) \bigr] + \xi\, b_{t}^{y}( \theta, c ) \Bigr) \Bigr].
\]
We can then compute
\[
\pi_{\exp( - \lambda r )}( r ) = - \frac{\partial}{\partial \lambda} \log\Bigl\{ \pi\bigl[ \exp( - \lambda r ) \bigr] \Bigr\},
\]
\[
\pi_{\exp( - \lambda r )}\Bigl\{ \exp\bigl[ \xi\, \rho_{\theta}( m' ) \bigr] \Bigr\} = \frac{ \pi\bigl\{ \exp\bigl[ - \lambda r + \xi\, m'( \cdot, \theta ) \bigr] \bigr\} }{ \pi\bigl[ \exp( - \lambda r ) \bigr] },
\]
\[
\pi_{\exp( - \lambda r )}\bigl[ m'( \cdot, \theta ) \bigr] = \frac{\partial}{\partial \xi}\Big|_{\xi = 0} \log\Bigl[ \pi\bigl\{ \exp\bigl[ - \lambda r + \xi\, m'( \cdot, \theta ) \bigr] \bigr\} \Bigr].
\]
This is all we need to compute $B( \rho_{\theta}, \beta, \gamma )$ (and also $B( \pi_{\exp( - \lambda r )}, \beta, \gamma )$) in Theorem 2.1.3 (page 54), using the approximation
\[
\log\Bigl\{ \pi_{\exp( - \lambda r )}\Bigl[ \exp\bigl\{ \xi\, \pi_{\exp( - \lambda r )}( m' ) \bigr\} \Bigr] \Bigr\}
\le \log\Bigl\{ \pi_{\exp( - \lambda r )}\Bigl[ \exp\bigl\{ \xi\, m'( \cdot, \theta ) \bigr\} \Bigr] \Bigr\} + \xi\, \pi_{\exp( - \lambda r )}\bigl[ m'( \cdot, \theta ) \bigr], \qquad \xi \ge 0.
\]
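The factorized form of the partition function above translates directly into code. The following Python sketch (illustrative names; it can be fed with the cell representatives and volumes produced by the previous sketch) builds the counters $N b_{t}(c)$ and $N b_{t}^{y}(c)$ and takes the product over realized cells only, since empty cells contribute a factor $1$.

```python
import numpy as np
from collections import Counter

def cell_counters(X, Y, t):
    """N * b_t(c) and N * b_t^y(c): counts of training points, and of training
    points with label y, falling in each response cell c of the thresholds t."""
    cells = [tuple(int(xj >= tj) for xj, tj in zip(x, t)) for x in X]
    return Counter(cells), Counter(zip(cells, Y))

def gibbs_partition_function(X, Y, T_bar, volumes, lam, labels):
    """pi[exp(-lam * r)] computed with the factorized formula above.

    `T_bar` lists one representative threshold per cell and `volumes[k]` is
    L(Delta_t) for the k-th representative; empty cells are skipped because
    their factor equals 1."""
    N = len(Y)
    total = 0.0
    for t, vol in zip(T_bar, volumes):
        b, by = cell_counters(X, Y, t)
        prod = 1.0
        for c, nc in b.items():
            prod *= sum(np.exp(-lam * (nc - by[(c, y)]) / N)
                        for y in labels) / len(labels)
        total += vol * prod
    return total
```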
Let us also explain how to apply the posterior distribution $\rho_{(t,a)}$, in other words our randomized estimated classification rule, to a new pattern $X_{N+1}$:
\[
\rho_{(t,a)}\bigl[ f_{\cdot}( X_{N+1} ) = y \bigr]
= L( \Delta_{t} )^{-1} \int_{\Delta_{t}} \mathbb{1}\Bigl[ a\Bigl\{ \bigl[ \mathbb{1}( X_{N+1}^{j} \ge t'_{j} ) \bigr]_{j=1}^{h} \Bigr\} = y \Bigr]\, L( dt' )
= L( \Delta_{t} )^{-1} \sum_{c \in \{0,1\}^{h}} L\Bigl( \Bigl\{ t' \in \Delta_{t} : \bigl[ \mathbb{1}( X_{N+1}^{j} \ge t'_{j} ) \bigr]_{j=1}^{h} = c \Bigr\} \Bigr)\, \mathbb{1}\bigl[ a(c) = y \bigr].
\]
Let us define for short
\[
\Delta_{t}(c) = \Bigl\{ t' \in \Delta_{t} : \bigl[ \mathbb{1}( X_{N+1}^{j} \ge t'_{j} ) \bigr]_{j=1}^{h} = c \Bigr\}, \qquad c \in \{0,1\}^{h}.
\]
With this notation
\[
\rho_{(t,a)}\bigl[ f_{\cdot}( X_{N+1} ) = y \bigr] = L( \Delta_{t} )^{-1} \sum_{c \in \{0,1\}^{h}} L\bigl[ \Delta_{t}(c) \bigr]\, \mathbb{1}\bigl[ a(c) = y \bigr].
\]
We can compute in the same way the probabilities for the label of the new pattern under the Gibbs posterior distribution:
\[
\pi_{\exp( - \lambda r )}\bigl[ f_{\cdot}( X_{N+1} ) = y' \bigr]
= \Biggl\{ \sum_{t \in \overline{T}} \prod_{c \in \{0,1\}^{h}} \Bigl[ |\mathcal{Y}|^{-1} \sum_{y \in \mathcal{Y}} \exp\Bigl( - \lambda \bigl[ b_{t}(c) - b_{t}^{y}(c) \bigr] \Bigr) \Bigr]
\times \sum_{c \in \{0,1\}^{h}} L\bigl[ \Delta_{t}(c) \bigr]\, \frac{ \sum_{y \in \mathcal{Y}} \mathbb{1}( y = y' ) \exp\bigl\{ - \lambda \bigl[ b_{t}(c) - b_{t}^{y}(c) \bigr] \bigr\} }{ \sum_{y \in \mathcal{Y}} \exp\bigl\{ - \lambda \bigl[ b_{t}(c) - b_{t}^{y}(c) \bigr] \bigr\} } \Biggr\}
\times \Biggl\{ \sum_{t \in \overline{T}} L( \Delta_{t} ) \prod_{c \in \{0,1\}^{h}} \Bigl[ |\mathcal{Y}|^{-1} \sum_{y \in \mathcal{Y}} \exp\Bigl( - \lambda \bigl[ b_{t}(c) - b_{t}^{y}(c) \bigr] \Bigr) \Bigr] \Biggr\}^{-1}.
\]

In the case when we observe the patterns of a shadow sample $( X_{i} )_{i=N+1}^{(k+1)N}$ on top of the training sample $( X_{i}, Y_{i} )_{i=1}^{N}$, we can introduce the set of thresholds responding as $t$ on the extended sample $( X_{i} )_{i=1}^{(k+1)N}$,
\[
\Delta_{t} = \Bigl\{ t' \in T : ( t'_{j}, t_{j} ) \cap \bigl\{ X_{i}^{j} ; i = 1, \dots, (k+1)N \bigr\} = \varnothing,\ j = 1, \dots, h \Bigr\},
\]
consider the set $\overline{T} = \bigl\{ t \in T : t = m( \Delta_{t} ) \bigr\}$ of the middle points of the cells $\Delta_{t}$, $t \in T$, and replace the Lebesgue measure $L \in \mathcal{M}_{+}^{1}\bigl[ (0,1)^{h} \bigr]$ of the previous section with the uniform probability measure $L$ on $\overline{T}$. We can then consider $\pi = L \otimes U$, where $U$ is as previously the uniform probability measure on $\mathcal{R}$. This obviously gives an exchangeable posterior distribution and therefore qualifies $\pi$ for transductive bounds. Let us notice that $|\overline{T}| \le \bigl[ (k+1)N + 1 \bigr]^{h}$, and therefore that
\[
\pi( t, a ) \ge \bigl[ (k+1)N + 1 \bigr]^{-h}\, |\mathcal{Y}|^{-2^{h}}, \qquad (t,a) \in \overline{T} \times \mathcal{R}.
\]
For any $(t, a) \in \overline{T} \times \mathcal{R}$ we may, similarly to the inductive case, consider the posterior distribution $\rho_{(t,a)}$ defined by
\[
\frac{d\rho_{(t,a)}}{d\pi}( t', a' ) = \frac{ \mathbb{1}( t' \in \Delta_{t} )\, \mathbb{1}( a' = a ) }{ \pi\bigl( \Delta_{t} \times \{a\} \bigr) },
\]
but we may also consider $\delta_{( m( \Delta_{t} ), a )}$, which is such that $r_{i}\bigl\{ [ m( \Delta_{t} ), a ] \bigr\} = r_{i}\bigl[ (t, a) \bigr]$, $i = 1, 2$, whereas only $\rho_{(t,a)}( r_{1} ) = r_{1}\bigl[ (t, a) \bigr]$, while
\[
\rho_{(t,a)}( r_{2} ) = \frac{1}{ | \overline{T} \cap \Delta_{t} | } \sum_{t' \in \overline{T} \cap \Delta_{t}} r_{2}\bigl[ ( t', a ) \bigr].
\]
We get
\[
\mathcal{K}( \rho_{(t,a)}, \pi ) = - \log\bigl[ L( \Delta_{t} ) \bigr] + 2^{h} \log\bigl( |\mathcal{Y}| \bigr)
\le \log\bigl( | \overline{T} | \bigr) + 2^{h} \log\bigl( |\mathcal{Y}| \bigr) = \mathcal{K}\bigl( \delta_{[ m( \Delta_{t} ), a ]}, \pi \bigr)
\le h \log\bigl[ (k+1)N + 1 \bigr] + 2^{h} \log\bigl( |\mathcal{Y}| \bigr),
\]
whereas we had no such uniform bound in the inductive case. Similarly to the inductive case,
\[
\pi\bigl[ \exp( - \lambda r ) \bigr] = \sum_{t \in \overline{T}} L( \Delta_{t} ) \prod_{c \in \{0,1\}^{h}} \Bigl[ |\mathcal{Y}|^{-1} \sum_{y \in \mathcal{Y}} \exp\Bigl( - \lambda \bigl[ b_{t}(c) - b_{t}^{y}(c) \bigr] \Bigr) \Bigr].
\]
Moreover, for any $\theta \in \Theta$,
\[
\pi\Bigl\{ \exp\bigl[ - \lambda r + \xi\, \rho_{\theta}( m' ) \bigr] \Bigr\} = \pi\Bigl\{ \exp\bigl[ - \lambda r + \xi\, m'( \cdot, \theta ) \bigr] \Bigr\}
= \sum_{t \in \overline{T}} L( \Delta_{t} ) \prod_{c \in \{0,1\}^{h}} \Bigl[ |\mathcal{Y}|^{-1} \sum_{y \in \mathcal{Y}} \exp\Bigl( - \lambda \bigl[ b_{t}(c) - b_{t}^{y}(c) \bigr] + \xi\, b_{t}^{y}( \theta, c ) \Bigr) \Bigr].
\]
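The Gibbs posterior label probabilities written above can likewise be evaluated cell by cell. The Python sketch below is again only illustrative: for each representative threshold it recomputes the common product over realized cells, splits the cell $\Delta_{t}$ into the sub-cells $\Delta_{t}(c)$ determined by the new pattern, and accumulates the numerator and denominator of the displayed ratio.

```python
import itertools
import numpy as np
from collections import Counter

def gibbs_label_probabilities(X, Y, x_new, T_bar, lam, labels):
    """Label probabilities of x_new under the Gibbs posterior pi_{exp(-lam r)},
    following the factorized expression above (a sketch; T_bar lists one
    representative threshold per cell, coordinates normalized in (0, 1))."""
    N, h = X.shape
    num = {yp: 0.0 for yp in labels}
    den = 0.0
    for t in T_bar:
        cells = [tuple(int(xj >= tj) for xj, tj in zip(x, t)) for x in X]
        b, by = Counter(cells), Counter(zip(cells, Y))
        prod = 1.0                       # product over realized cells (others give 1)
        for c, nc in b.items():
            prod *= sum(np.exp(-lam * (nc - by[(c, y)]) / N)
                        for y in labels) / len(labels)
        # lengths of the parts of Delta_t on which the new pattern responds 0 or 1
        options = []
        for j in range(h):
            cuts = np.concatenate(([0.0], np.sort(np.unique(X[:, j])), [1.0]))
            lo, hi = cuts[cuts <= t[j]].max(), cuts[cuts > t[j]].min()
            one = min(max(x_new[j], lo), hi) - lo
            options.append([(0, hi - lo - one), (1, one)])
        for combo in itertools.product(*options):
            c = tuple(bit for bit, _ in combo)
            vol = float(np.prod([length for _, length in combo]))  # L[Delta_t(c)]
            if vol == 0.0:
                continue
            w = {y: np.exp(-lam * (b[c] - by[(c, y)]) / N) for y in labels}
            z = sum(w.values())
            den += vol * prod
            for yp in labels:
                num[yp] += vol * prod * w[yp] / z
    return {yp: num[yp] / den for yp in labels}
```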
The bound for the transductive counterpart to Theorem 2.1.3 (page 54) or Theorem 2.2.2 (page 69), obtained as explained on page 115, can be computed as in the inductive case, from these two partition functions and the above entropy computation. Let us mention finally that, using the same notation as in the inductive case,
\[
\pi_{\exp( - \lambda r )}\bigl[ f_{\cdot}( X_{N+1} ) = y' \bigr]
= \Biggl\{ \sum_{t \in \overline{T}} \prod_{c \in \{0,1\}^{h}} \Bigl[ |\mathcal{Y}|^{-1} \sum_{y \in \mathcal{Y}} \exp\Bigl( - \lambda \bigl[ b_{t}(c) - b_{t}^{y}(c) \bigr] \Bigr) \Bigr]
\times \sum_{c \in \{0,1\}^{h}} L\bigl[ \Delta_{t}(c) \bigr]\, \frac{ \sum_{y \in \mathcal{Y}} \mathbb{1}( y = y' ) \exp\bigl\{ - \lambda \bigl[ b_{t}(c) - b_{t}^{y}(c) \bigr] \bigr\} }{ \sum_{y \in \mathcal{Y}} \exp\bigl\{ - \lambda \bigl[ b_{t}(c) - b_{t}^{y}(c) \bigr] \bigr\} } \Biggr\}
\times \Biggl\{ \sum_{t \in \overline{T}} L( \Delta_{t} ) \prod_{c \in \{0,1\}^{h}} \Bigl[ |\mathcal{Y}|^{-1} \sum_{y \in \mathcal{Y}} \exp\Bigl( - \lambda \bigl[ b_{t}(c) - b_{t}^{y}(c) \bigr] \Bigr) \Bigr] \Biggr\}^{-1}.
\]
To conclude this appendix on classification by thresholding, note that similar factorized computations are feasible in the important case of classification trees. This can be achieved using some variant of the context tree weighting method discovered by Willems et al. (1995) and successfully used in lossless compression theory. The interested reader can find a description of this algorithm applied to classification trees in Catoni (2004, page 62).

Bibliography

Alon, N., Ben-David, S., Cesa-Bianchi, N. and Haussler, D. (1997). Scale-sensitive dimensions, uniform convergence and learnability. J. ACM.
Audibert, J.-Y. (2004a). Aggregated estimators and empirical complexity for least square regression. Ann. Inst. H. Poincaré Probab. Statist.
Audibert, J.-Y. (2004b). PAC-Bayesian statistical learning theory. Ph.D. thesis, Univ. Paris 6. Available at http://cermics.enpc.fr/~audibert/.
Barron, A. (1987). Are Bayes rules consistent in information? In Open Problems in Communication and Computation (T. M. Cover and B. Gopinath, eds.) 85–91. Springer, New York. MR0922073
Barron, A. and Yang, Y. (1999). Information-theoretic determination of minimax rates of convergence. Ann. Statist.
Barron, A., Birgé, L. and Massart, P. (1999). Risk bounds for model selection by penalization. Probab. Theory Related Fields.
Blanchard, G. (1999). The "progressive mixture" estimator for regression trees. Ann. Inst. H. Poincaré Probab. Statist.
Blanchard, G. (2001). Mixture and aggregation of estimators for pattern recognition. Application to decision trees. Ph.D. thesis, Univ. Paris 13. Available at http://ida.first.fraunhofer.de/~blanchard/.
Blanchard, G. (2004). Un algorithme accéléré d'échantillonnage Bayésien pour le modèle CART. Rev. Intell. Artificielle.
Birgé, L. and Massart, P. (1997). From model selection to adaptive estimation. In Festschrift for Lucien Le Cam (D. Pollard, ed.) 55–87. Springer, New York. MR1462939
Birgé, L. and Massart, P. (1998). Minimum contrast estimators on sieves. Bernoulli.
Birgé, L. and Massart, P. (2001a). A generalized Cp criterion for Gaussian model selection. Preprint. Available at ~massart/. MR1848946
Birgé, L. and Massart, P. (2001b). Gaussian model selection. J. Eur. Math. Soc.
Blum, A. and Langford, J. (2003). PAC-MDL bounds. In Computational Learning Theory and Kernel Machines: 16th Annual Conference on Computational Learning Theory and 7th Kernel Workshop, COLT/Kernel 2003, Washington, DC, USA, August 24–27, 2003, Proceedings. Lecture Notes in Comput. Sci.
Catoni, O. (2002). Data compression and adaptive histograms. In Foundations of Computational Mathematics: Proceedings of the Smalefest 2000 (F. Cucker and J. M. Rojas, eds.) 35–60. World Scientific. MR2021977
Catoni, O. (2003). Laplace transform estimates and deviation inequalities. Ann. Inst. H. Poincaré Probab. Statist.
Catoni, O. (2004). Statistical learning theory and stochastic optimization. École d'Été de Probabilités de Saint-Flour XXXI–2001. Lecture Notes in Math.
Cristianini, N. and Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge Univ. Press.
Feder, M. and Merhav, N. (1996). Hierarchical universal coding. IEEE Trans. Inform. Theory.
Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning. Springer, New York. MR1851606
Langford, J. and McAllester, D. (2004). Computable shell decomposition bounds. J. Machine Learning Research.
Langford, J. and Seeger, M. (2001a). Bounds for averaging classifiers. Technical report CMU-CS-01-102, Carnegie Mellon Univ. Available at ~jcl.
Langford, J., Seeger, M. and Megiddo, N. (2001b). An improved predictive accuracy bound for averaging classifiers. In International Conference on Machine Learning.
Littlestone, N. and Warmuth, M. (1986). Relating data compression and learnability. Technical report, Univ. California, Santa Cruz. Available at ~manfred/pubs.html.
McAllester, D. A. (1998). Some PAC-Bayesian theorems. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory (Madison, WI, 1998).
McAllester, D. A. (1999). PAC-Bayesian model averaging. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory (Santa Cruz, CA, 1999).
McDiarmid, C. (1998). Concentration. In Probabilistic Methods for Algorithmic Discrete Mathematics (M. Habib, C. McDiarmid and B. Reed, eds.) 195–248. Springer, New York. MR1678578
Mammen, E. and Tsybakov, A. (1999). Smooth discrimination analysis. Ann. Statist.
Ryabko, B. Y. (1984). Twice-universal coding. Problems Inform. Transmission.
Seeger, M. (2002). PAC-Bayesian generalization error bounds for Gaussian process classification. J. Machine Learning Research.
Shawe-Taylor, J., Bartlett, P. L., Williamson, R. C. and Anthony, M. (1998). Structural risk minimization over data-dependent hierarchies. IEEE Trans. Inform. Theory.
Shawe-Taylor, J. and Cristianini, N. (2002). On the generalization of soft margin algorithms. IEEE Trans. Inform. Theory.
Tsybakov, A. (2004). Optimal aggregation of classifiers in statistical learning. Ann. Statist.
Tsybakov, A. and Van de Geer, S. (2005). Square root penalty: Adaptation to the margin in classification and in edge estimation. Ann. Statist.
Van de Geer, S. (2000). Applications of Empirical Process Theory. Cambridge Univ. Press. MR1739079
Vapnik, V. N. (1998). Statistical Learning Theory. Wiley, New York. MR1641250
Vert, J.-P. (2000). Double mixture and universal inference. Preprint. Available at http://cbio.ensmp.fr/~vert/publi/.
Vert, J.-P. (2001a). Adaptive context trees and text clustering. IEEE Trans. Inform. Theory.
Vert, J.-P. (2001b). Text categorization using adaptive context trees. In Proceedings of the CICLing-2001 Conference (A. Gelbukh, ed.) 423–436. Lecture Notes in Comput. Sci. Springer, New York. MR1888793
Willems, F. M. J., Shtarkov, Y. M. and Tjalkens, T. J. (1995). The context-tree weighting method: Basic properties. IEEE Trans. Inform. Theory.
Willems, F. M. J., Shtarkov, Y. M. and Tjalkens, T. J. (1996). Context weighting for general finite-context sources. IEEE Trans. Inform. Theory.
Zhang, T. (2006a). From ε-entropy to KL-entropy: Analysis of minimum information complexity density estimation. Ann. Statist.
Zhang, T. (2006b). Information-theoretic upper and lower bounds for statistical estimation. IEEE Trans. Inform. Theory.