Optimal Posteriors for Chi-squared Divergence based PAC-Bayesian Bounds and Comparison with KL-divergence based Optimal Posteriors and Cross-Validation Procedure
Puja Sahu and Nandyala Hemachandra
Abstract
We investigate optimal posteriors for the recently introduced [3] chi-squared divergence based PAC-Bayesian bounds in terms of the nature of their distribution, scalability of computations, and test set performance. For a finite classifier set, we deduce bounds for three distance functions: KL-divergence, linear and squared distances. Optimal posterior weights are proportional to deviations of empirical risks, usually with subset support. For the uniform prior, it is sufficient to search among posteriors on classifier subsets ordered by these risks. We show the bound minimization for the linear distance as a convex program and obtain a closed-form expression for its optimal posterior. The minimization for the squared distance is a quasi-convex program under a specific condition, and the one for KL-divergence is a non-convex optimization (a difference of convex functions). To compute such optimal posteriors, we derive fast converging fixed point (FP) equations. We apply these approaches to a finite set of SVM regularization parameter values to yield stochastic SVMs with tight bounds. We perform a comprehensive performance comparison between our optimal posteriors and known KL-divergence based posteriors on a variety of UCI datasets with varying ranges and variances in risk values. Chi-squared divergence based posteriors have weaker bounds and worse test errors, hinting at an underlying regularization by the KL-divergence based posteriors. Our study highlights the impact of the divergence function on the performance of PAC-Bayesian classifiers. We also compare our stochastic classifiers with a cross-validation based deterministic classifier. The latter has better test errors, but ours is more sample robust, has quantifiable generalization guarantees, and is computationally much faster.
Keywords:
Generalization guarantees, divergence measure, convex and non-convex constrained optimization, fixed point equations, sample robustness, SVM regularization parameter
In classification algorithms, the choice of the parameter(s) influences the level of accuracy that the generated classifier can achieve. For example, consider the Support Vector Machine (SVM) algorithm for classification with the regularization parameter λ > 0. This parameter is a user input which trades off between model complexity and training error. The optimal classifier that we get depends heavily on the sample S used for training and on the value of the parameter λ. We can control only this parameter value for obtaining a classifier with low (training) error, but not the given data. For a given training sample, we can choose the best value of the parameter from a prefixed set of values, which yields a classifier with the lowest error. However, this is a long drawn process. Moreover, there is no guarantee that the chosen value will yield a classifier having the low(est) error on another sample from the same distribution. This implies that the best parameter value is sample dependent and that there is no unique value which is best for almost all training samples (please see [21] and Appendix A, pg. 17 in [22] for an illustration on a UCI dataset). However, if we determine the set of λ values with, say, the lowest 30% error rates on each sample, we observe a recurring subset of λ values across these samples (please see [21] and Table 4 in Appendix B in [22] for an illustration on a UCI dataset). Thus, we have an ensemble of λ values to pick from. We can combine multiple base classifiers resulting from different parameter values to build a strong stochastic classifier using the PAC-Bayesian framework.

PAC-Bayesian Bounds and Optimal Posteriors
The PAC-Bayesian approach assumes an arbitrary but fixed prior distribution on the space of classifiers and outputs a posterior distribution on this space, corresponding to a stochastic classifier. This approach provides a probabilistic bound on the difference between the posterior averaged true and empirical risks of a stochastic classifier, as measured by a convex distance function. These bounds on the unknown averaged true risk offer a trade-off between the averaged empirical risk and a term which encompasses the model complexity of the stochastic classifier. The bound is computed based on a single sample, but with a high probability guarantee over different samples (from the same distribution). For a chosen distance function, we are interested in the 'optimal PAC-Bayesian posterior', defined as the posterior distribution which minimizes the corresponding PAC-Bayesian bound. By design, these bounds and the resulting optimal posterior are robust to the choice of sample used for training, addressing the sample bias.
Relevant Work
A well known family of bounds estimating the unknown true risk of a classifier, known as PAC-Bayesian bounds, was proposed by [17, 18, 23] using the idea of Bayesian priors and posteriors on the classifier space, and refined further by [13, 10, 15, 16]. Several authors improved the bounds for the choice of distance function they considered for evaluating the classifiers. While [13] provided a bound for the KL-divergence as the distance function φ, by tightening the threshold with a factor of √m instead of m, [6] generalized the framework of PAC-Bayesian bounds to a broader class of convex φ functions and relaxed the constraints on the tail bounds of the empirical risk of the classifiers under consideration. PAC-Bayesian theory has been used to devise margin bounds for linear classifiers such as SVMs [14, 11]. [1] specialized the PAC-Bayesian bounds using spherical Gaussian distributions on the space of linear classifiers, and the set up was extended to data-dependent priors [19]. More recently, [3] introduced PAC-Bayesian bounds based on the Rényi divergence between the posterior and the prior distributions. We use a specific case of this Rényi divergence which corresponds to the χ²-divergence.

All of the above consider a continuous (SVM) classifier space (an n-dimensional Euclidean space) and continuous prior as well as posterior distributions on it (spherical Gaussian distributions), whereas we consider a finite set of classifiers, such as those generated by a finite set of regularization parameter values for the SVM. Our χ²-divergence based PAC-Bayesian bounds are derived for this set up with a discrete prior distribution, and with three different distance functions between the posterior averaged empirical risk and the posterior averaged true risk. The motivation for choosing a different divergence function in the PAC-Bayesian framework is to achieve a better test set performance and a tighter risk bound. In order to do so, we first need to investigate the nature of these χ²-divergence based PAC-Bayesian bound minimization problems, identify the corresponding optimal PAC-Bayesian posteriors and understand their nature. These posteriors might not be at par with the classical KL-divergence based PAC-Bayesian posteriors or the cross-validation based procedure, but the comparison amongst them brings forth some insightful aspects of the PAC-Bayesian optimal posteriors. We list below the contributions of this paper.

Contributions
We are interested in the optimal PAC-Bayesian posterior which minimizes the χ²-divergence based PAC-Bayesian bound for a given distance function (Section 2). We consider a finite classifier set and three distance functions: the linear distance, the squared distance (a second degree polynomial) and the KL-divergence (an infinite degree polynomial).

• We deduce χ²-divergence based PAC-Bayesian bounds for the above three distance functions and identify the optimal posteriors for them via the respective bound minimization problems.

• The linear distance based bound was considered in [3]; we identify the associated bound minimization as a convex program and obtain a closed form expression for the global optimal posterior (Section 4).

• We also deduce PAC-Bayesian bounds for the squared distance and the KL-divergence, and show that they are non-convex programs (Sections 5 and 6). We further show that the squared distance based bound is quasi-convex under certain conditions. The KL-divergence based bound minimization problem involves a difference of convex (DC) functions and hence is a DC program. Therefore we applied a DC approach known as the Convex-Concave Procedure (CCP) [12] to find its local minimum. In our computations, we observed that the CCP did not work for certain cases, especially when we have almost linearly separable data. (Such cases are illustrated in Table 8 in Appendix D.)

• For deriving optimal posteriors for such non-convex cases, we identify Fixed Point (FP) equations deduced from the partial KKT system with strict positivity constraints. These FP equations converge even when the solver or an alternate approach like CCP fails to identify a solution, and they are much faster than the solver. (Some examples of such cases are in Tables 6, 7 and 8 in the appendix.)

• For any of the above three distance functions, for the uniform prior distribution, we simplify the search for optimal posteriors to the part of the simplex restricted to subsets of classifiers ordered by empirical risk values (Section 3).

• For computational illustration, we consider a comprehensive set of nine UCI datasets [5] with small to moderate numbers of examples and features, balanced and imbalanced classes, and different ranges and variances in the empirical risk values. Using such datasets helps us compare and understand the performance of optimal posteriors due to different distance functions for the χ²-divergence based bounds, and also compare with the known KL-divergence based optimal posteriors [21].

• We use these approaches on the set of SVMs generated by a finite set of regularization parameter values (Section 7). This leads us to the notion of a stochastic SVM characterized by an optimal posterior on the regularization parameter set. Usually small values of the regularization parameter are preferred. Keeping this in mind, we used an arithmetic-geometric series of regularization parameter values λ, with a logarithmic scale for λ ∈ (0, 0.1) and a linear scale for λ ≥ 0.1. We chose a mixture of logarithmically and linearly spaced values of λ so that we cover many different λ's corresponding to distinct SVM classifiers with low test errors.

  – Optimal posteriors for KL-divergence give extremely loose bounds and are computationally expensive, but have test error rates generally better than the linear distance ones. The optimal bound value and test error rate of the squared distance based optimal posterior are remarkably lower than those of the linear or KL-distance based posteriors when the base classifiers have high variation in empirical risk values.

  – This is accompanied by relatively high concentration on low empirical risk values and the sparse nature of the squared distance based posteriors. For almost separable datasets, posteriors due to these three distance functions have comparable PAC-Bayesian bound values and test error rates.

Table 1 outlines theoretical and computational aspects of the optimal posteriors considered in this paper.

• To understand the role of the divergence measure in PAC-Bayesian bounds, we conducted a comparative study of these χ²-divergence based optimal PAC-Bayesian posteriors with the posteriors derived for the classical KL-divergence based PAC-Bayesian bounds [21] (Section 8).

  – We observe that though both classes of posteriors have weights decreasing with increasing empirical risk values of classifiers, the rate at which they decrease differs: KL-divergence based posteriors decrease exponentially, while χ²-divergence based posteriors decrease linearly with the empirical risk values.

  – Another difference is in the size of the support set for the two classes of posteriors: KL-divergence based posteriors take up the full support on the set of base classifiers, whereas those for χ²-divergence usually depend only on a strict subset as their support set.

  – The class of optimal posteriors for χ²-divergence based PAC-Bayesian bounds is observed to have weaker bounds and higher test set errors than the class of KL-divergence based PAC-Bayesian posteriors on a set of SVM classifiers. Such behaviour can be attributed to the χ²-divergence based posteriors overfitting the data by choosing a strict subset support of classifiers with the least empirical risk values.

• We also compared the performance of the stochastic SVM classifier governed by these PAC-Bayesian posteriors with the deterministic SVM classifier obtained via the popular cross-validation procedure for regularization parameter selection (Section 8) as the baseline case. Though the cross-validation procedure gives a classifier with better performance on a test set, the PAC-Bayesian posteriors yield a sample robust classifier with quantifiable guarantees on the unknown true risk. On the computation side, the PAC-Bayesian procedure is more than 10 times faster than the cross-validation procedure.
Classical versions of the PAC-Bayesian theorem are derived using the Donsker-Varadhan inequality [3, 24] for change of measure, which is based on the KL-divergence between the two distributions (see Lemma 1 below). A new version of PAC-Bayesian results was derived by [3] which involves a change of measure guided by the Rényi divergence between the two distributions:

Theorem 1 ([3]). For any data distribution D over the input space X × Y, the following bound holds for any prior P over the set of classifiers H, for any α > 1 and any δ ∈ (0, 1], where the probability is over random i.i.d. samples S_m = {(x_i, y_i) | i = 1, ..., m} of size m drawn from D, for any convex function φ : [0,1] × [0,1] → R:

P_{S_m}\left[ \varphi\left( E_Q[\hat{l}], E_Q[l] \right) \le \left( E_{h \sim P}\left[ \left(\tfrac{Q(h)}{P(h)}\right)^{\alpha} \right] \right)^{1/\alpha} \left( \frac{I_{R\varphi}(m, \alpha')}{\delta} \right)^{1/\alpha'} \right] \ge 1 - \delta, \quad (1)

where α' = α/(α−1) and I_{Rφ}(m, α') := sup_{l ∈ [0,1]} [ Σ_{k=0}^{m} C(m,k) l^k (1−l)^{m−k} φ(k/m, l)^{α'} ]. Here, Q is an arbitrary posterior distribution on H, which may depend on the sample S and on the prior P. E_Q[l̂] := E_{h∼Q} (1/m) Σ_{i=1}^m l(h, x_i, y_i) denotes the averaged empirical risk, and E_Q[l] := E_{h∼Q} E_{(x,y)∼D}[l] denotes the averaged true risk of classifiers in H, computed using a loss function l(h, x, y) : H × X × Y → [a, b) (here, 0 ≤ a < b).

We are interested in identifying the optimal posteriors for different choices of distance functions for the case α = 2, which can be related to the chi-squared divergence measure χ²(Q||P) := E_{h∼P}[ (Q(h)/P(h))² − 1 ] between the distributions Q and P [3].
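As a quick illustration of the α = 2 specialization, the following is a minimal Python sketch (not part of the paper's own code) that computes χ²(Q||P) for discrete distributions and the resulting complexity factor sqrt((χ²(Q||P) + 1) I_Rφ(m, 2)/δ) appearing in the bounds below; the value of I_Rφ(m, 2) is passed in, since it depends on the chosen distance function φ.

```python
import numpy as np

def chi2_divergence(q, p):
    """chi^2(Q||P) = sum_i q_i^2 / p_i - 1 for discrete distributions on a finite set."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    return np.sum(q * q / p) - 1.0

def complexity_term(q, p, I_R_phi, delta):
    """sqrt((chi^2(Q||P) + 1) * I_Rphi(m, 2) / delta): the model-complexity factor for alpha = 2."""
    return np.sqrt((chi2_divergence(q, p) + 1.0) * I_R_phi / delta)

# Example: uniform prior over 10 classifiers, posterior concentrated on 3 of them,
# linear distance (I_Rlin(m, 2) = 1/(4m)) with m = 1000 and delta = 0.05.
# p = np.full(10, 0.1); q = np.array([0.5, 0.3, 0.2] + [0.0] * 7)
# print(complexity_term(q, p, I_R_phi=1.0 / (4 * 1000), delta=0.05))
```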
The Donsker-Varadhan inequality can be stated as below.

Lemma 1 (KL-divergence change of measure [3]). For any set H, for any two distributions P and Q on H, and for any measurable function φ : H → R, we have E_{h∼Q} φ(h) ≤ KL[Q||P] + ln( E_{h∼P} e^{φ(h)} ).

Table 1: An outline of theoretical aspects and computational results for the optimal posteriors Q*_{φ,χ²} = {q*_{i,φ,χ²}}_{i=1}^H minimizing the PAC-Bayesian bound B_{φ,χ²}(Q) based on the chi-squared divergence χ²(Q||P) = Σ_{i=1}^H q_i²/p_i − 1 between a posterior Q and a prior P on the classifier set H. We consider three distance functions φ: the KL-divergence kl(l̂, l) = l̂ ln(l̂/l) + (1−l̂) ln((1−l̂)/(1−l)), the linear distance φ_lin(l̂, l) = l − l̂, and the squared distance φ_sq(l̂, l) = (l − l̂)², for l̂, l ∈ (0,1). H denotes the classifier set size and H* the size of the support set of the optimal posterior Q*_{φ,χ²}. l̂_i denotes the empirical risk of a classifier in H computed on a sample of size m. I_{Rφ}(m) is a sample-size based constant for a distance function φ; it is a component of the bound function B_{φ,χ²}(Q).

Theoretical aspects (for uniform prior P):
• φ_lin: I_{R lin}(m, 2) = 1/(4m) (due to [3]); the bound is convex; the global minimum q*_{i,lin,χ²}(H*) has the closed form given in (9).
• φ_sq: I_{R sq}(m, 2) identified in Lemma 2; the bound is shown to be non-convex, and quasi-convex under a condition; a stationary point q^{FP}_{i,sq,χ²}(H*) is characterized by the fixed point equation (12).
• kl: I_{R kl}(m, 2) computed numerically based on the form given by [3]; the bound is non-convex (a difference of convex functions, DC); a stationary point q^{FP}_{i,kl,χ²}(H*) is characterized by the fixed point equation (15).

Computational aspects (for uniform prior P), solver (Ipopt) output versus the fixed point (FP) scheme:
• φ_lin: the solver identifies the global minimum; the global minimum is also identified analytically, so the FP scheme is not required.
• φ_sq: the solver identifies a unique (local) minimum even with different initializations; a closed form may not exist; the FP scheme matches the solver output.
• kl: the solver identifies multiple local minima with different initializations and throws errors for moderate and large H; a closed form may not exist; the FP scheme identifies a unique stationary point even with different initializations.

Optimal posteriors via PAC-Bayesian bound minimization

The PAC-Bayesian theorem (1) gives a high probability upper bound on the averaged true risk E_Q[l], assuming the distance function φ(E_Q[l̂], ·) is invertible for the given E_Q[l̂]:

B_{\varphi,\chi^2}(Q) \equiv B_{\varphi,\chi^2}(E_Q[\hat{l}], S_m, \delta, P) = f_{\varphi}\left( E_Q[\hat{l}],\; \varphi^{-1}_{E_Q[\hat{l}]}\left( \sqrt{ \frac{(\chi^2(Q\|P) + 1)\, I_{R\varphi}(m, 2)}{\delta} } \right) \right), \quad (2)

where φ^{-1}_{E_Q[l̂]}(K) = b implies φ(E_Q[l̂], b) = K for some b ∈ (0,1) and a given K > 0. Generally f_φ(·,·) is the sum of its arguments, except when φ is the KL-distance function. That is, the bound function B_{φ,χ²}(Q) is the sum of the averaged empirical risk E_Q[l̂] and a model complexity term which depends on the system parameters S_m, δ and P. We are interested in determining an optimal posterior distribution Q*_{φ,χ²} which minimizes the bound B_{φ,χ²}(Q) for a given distance function φ.

To characterize the minimum of B_{φ,χ²}(Q), we make use of the first order KKT conditions, which are necessary for a stationary point of a non-convex problem. These KKT conditions require the objective function and the active constraints to be differentiable at the local minimum. We derive fixed point (FP) equations for the optimal posterior using the partial KKT system. These FP equations use the KKT system with strict positivity constraints, due to which the complementary slackness conditions are automatically satisfied; hence the name 'partial' KKT system. The computations illustrate that these FP equations always converge to a stationary point at a very fast rate, even for a large classifier set when a non-convex solver fails to identify a solution. (Please see Table 6 and Table 7 in Appendix D for an illustration of such cases.)
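As a computational aside, the FP equations derived later (equations (12) and (15)) are of the form q = T(q) on the interior of the simplex and can be iterated directly. A minimal, generic sketch of such an iteration is given below; the update map T is supplied by the specific distance function, and the clipping/renormalization step is a simple safeguard added here for illustration, not part of the derivation.

```python
import numpy as np

def fixed_point_posterior(T, h_prime, tol=1e-10, max_iter=10_000):
    """Iterate q <- T(q), starting from the uniform distribution on h_prime classifiers.

    T: update map implementing a fixed point equation such as (12) or (15);
       it must return a vector of the same length as its input.
    Returns the iterate once successive updates differ by less than tol,
    or None if the iteration does not settle within max_iter steps.
    """
    q = np.full(h_prime, 1.0 / h_prime)           # start at the uniform posterior
    for _ in range(max_iter):
        q_new = T(q)
        q_new = np.clip(q_new, 1e-12, None)       # keep iterates strictly positive
        q_new = q_new / q_new.sum()               # re-project onto the simplex
        if np.max(np.abs(q_new - q)) < tol:
            return q_new
        q = q_new
    return None
```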
Framework

We work with a finite set of classifiers H = {h_i}_{i=1}^H of size H. The prior P = {p_i}_{i=1}^H and the posterior Q = {q_i}_{i=1}^H are discrete distributions on H, where p_i, q_i ≥ 0 for all i = 1, ..., H, with Σ_{i=1}^H p_i = 1 and Σ_{i=1}^H q_i = 1. For the differentiability required by the KKT conditions, our objective function should have an open domain, that is, the interior of the H-dimensional probability simplex: int(∆_H) = { (q_1, ..., q_H) | q_i > 0 ∀ i = 1, ..., H; Σ_{i=1}^H q_i = 1 }. In computations, we consider q_i ≥ ε for all i = 1, ..., H, for some ε > 0, to ensure the existence of a minimizer in int(∆_H). Our FP equations are derived using the partial KKT system on int(∆_H).

Q*_{φ,χ²} for uniform prior

We consider the special case of a uniform prior on the entire set H. We want to identify the optimal posterior Q*_{φ,χ²} with the H-dimensional probability simplex ∆_H as the feasible region. We show below that it is enough to restrict the search space to certain subsets of ∆_H. This reduces the computational complexity of the search from an exponential scale to a linear scale.

Theorem 2.
Consider a uniform prior distribution on the set H of classifiers, and a given set of posterior weights Q = {q_j}_{j=1}^{H'}. We have three choices of distance function, φ ∈ {φ_lin, φ_sq, kl}. Then, among all subsets H' ⊂ H of size H', the smallest bound value B_{φ,χ²}(Q, H') corresponding to the given posterior weights Q is achieved when H' is the subset formed by the first H' elements of the ordered set of classifiers ranked by non-decreasing empirical risk values, l̂_1 ≤ l̂_2 ≤ ... ≤ l̂_H.

Proof. We first consider the case of the linear and squared distance based bounds. Under the given set up, these bound functions are defined as follows:

B_{lin,\chi^2}(Q, H') := \sum_{i \in H'} \hat{l}_i q_i + \sqrt{ \frac{ \left(\sum_{i \in H'} q_i^2\right) H }{ 4 m \delta } }. \quad (3)

B_{sq,\chi^2}(Q, H') := \sum_{i \in H'} \hat{l}_i q_i + \sqrt{ H \left(\sum_{i \in H'} q_i^2\right) \frac{ I_{R\,sq}(m, 2) }{ \delta } }. \quad (4)

For a given set of posterior weights {q_j}_{j=1}^{H'}, the second terms of B_{lin,χ²}(Q, H') and B_{sq,χ²}(Q, H') are invariant of the support set H' as long as its cardinality is H'. Thus the value of the bound depends on the common first term, which is a sum of positive quantities. For given weights {q_j}_{j=1}^{H'}, the bounds (3) and (4) are the smallest when the sum Σ_{i∈H'} l̂_i q_i is minimized. This happens when H' consists of the classifiers with the smallest H' values in the set {l̂_i}_{i=1}^H. Furthermore, if the elements of H' are ordered by non-decreasing empirical risk values, l̂_1 ≤ l̂_2 ≤ ... ≤ l̂_{H'}, the posterior weights should be ordered non-increasingly. Hence, the claim of the theorem holds true.

Now, for the KL-divergence as the distance function, the bound value r is the solution to the following two equations:

kl\left( \sum_{i \in H'} \hat{l}_i q_i,\; r \right) = \sqrt{ \frac{ H \left(\sum_{i \in H'} q_i^2\right) I_{R\,kl}(m, 2) }{ \delta } }, \quad (5)

r \ge \sum_{i \in H'} \hat{l}_i q_i. \quad (6)

The right hand side of (5) is invariant of the support H' as long as it is of size H'. Let L̂ := Σ_{i∈H'} l̂_i q_i; then (5) is an implicit function of the variables L̂ and r. Using the implicit function theorem, we have

\frac{dr}{d\hat{L}} = - \frac{ \partial kl / \partial \hat{L} }{ \partial kl / \partial r } = \frac{ \ln\frac{\hat{L}}{r} - \ln\frac{1 - \hat{L}}{1 - r} }{ \frac{\hat{L}}{r} - \frac{1 - \hat{L}}{1 - r} }. \quad (7)

Using (6) and the strict monotonicity of the natural logarithm function, we can claim that dr/dL̂ > 0. That is, the bound r is a strictly increasing function of L̂ := Σ_{i∈H'} l̂_i q_i under the given set up. To find the least r for a given Q(H') = {q_j}_{j=1}^{H'}, we need to find the least Σ_{i∈H'} l̂_i q_i over all possible subsets H'. This happens when H' is the subset formed by the first ordered H' elements l̂_1 ≤ l̂_2 ≤ ... ≤ l̂_{H'}. Hence proved.

Corollary 1. As a consequence of Theorem 2 above, for determining the (global) optimal posterior Q*_{φ,χ²} it is sufficient to compare the bound values corresponding to the best posteriors on ordered subsets of H, ranked by non-decreasing l̂_i values. These ordered subsets can be uniquely identified by their size: an ordered subset of size 1 is {l̂_1}, of size 2 is {l̂_1, l̂_2}, and so on. Thus there exists an isomorphism between the set {1, ..., H} (which denotes the subset size) and the family of ordered increasing subsets of H.
Algorithm 1: OptQ_{φ-χ²} For Uniform Prior: algorithm for finding the optimal posterior for the PAC-Bayesian bound based on χ²-divergence when the prior is the uniform distribution.

Input: m, δ, H, {l̂_i}_{i=1}^H (sorted in non-decreasing order)
Output: Q*_{φ,χ²}

Define an array B*_{φ,χ²}[1 ... H] of size H
B*_{φ,χ²}[1] ← f_φ( l̂_1, φ^{-1}_{l̂_1}( sqrt( I_{Rφ}(m, 2) / (p_1 δ) ) ) )
for H' = 2, ..., H do
    flag ← 0; identify Q*_{φ,χ²} for H' via (9) or (12) or (15)
    for i = 1, ..., H' do
        if q*_{i,φ,χ²} < 0 then flag ← 1; break
    end
    if flag = 1 then break
    Compute B*_{φ,χ²}[H'] using Q*_{φ,χ²} in (2)
end
H* ← arg min_{H'} B*_{φ,χ²}[H']
Identify Q*_{φ,χ²} for H* via (9) or (12) or (15)
return Q*_{φ,χ²}
Correctness of Algorithm OptQ_{φ-χ²} For Uniform Prior
We want to determine the globally optimal posterior Q*_{φ,χ²} that has the minimum bound value B_{φ,χ²}(Q) over the H-dimensional probability simplex ∆_H. Using the result of Theorem 2, we can confine the search to a much smaller space of posteriors with support on a family of increasing ordered subsets of H. These ordered subsets are defined by their size; for example, an ordered subset of size H' ∈ [H] comprises the lowest H' values in the set {l̂_i}_{i=1}^H. This restricted space of posteriors, say ∆_ord ⊂ ∆_H, is a union of convex sets of posteriors with supports on the ordered subsets defined above. Due to the increasing subset relation between consecutive supports, this union is also a convex set. The search space ∆_ord is a restriction of ∆_H, yet it consists of uncountably many posteriors. We refine the search further by localizing to the optimal posteriors Q*_{φ,χ²}(H') on each increasing ordered subset and comparing their bound values B*_{φ,χ²}(H') to find the minimum. Thus, an exponential search on the restricted posterior space is simplified to a finite linear search on the support size. When using the FP scheme, we also need to verify that Q*_{φ,χ²}(H') satisfies the positivity constraints. We denote the support size of the optimal posterior Q*_{φ,χ²} by H* ∈ [H]. Therefore, for determining Q*_{φ,χ²} in ∆_ord, it is sufficient to search for H* in the set {1, ..., H}.
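As an illustration of this linear search, the following is a minimal Python sketch of Algorithm 1; the callables `optimal_posterior_on_prefix` and `bound_value` are hypothetical placeholders for the distance-specific formulas (9), (12) or (15) and for the bound (2).

```python
import numpy as np

def opt_q_phi_chi2(emp_risks, optimal_posterior_on_prefix, bound_value):
    """Linear search over ordered support sizes (a sketch of Algorithm 1).

    emp_risks: empirical risks of the H base classifiers.
    optimal_posterior_on_prefix(sorted_risks, h): candidate optimal weights on the
        h lowest-risk classifiers; may contain negative entries if the prefix is infeasible.
    bound_value(sorted_risks, q): PAC-Bayesian bound (2) evaluated at the weights q.
    """
    order = np.argsort(emp_risks)            # rank classifiers by empirical risk
    sorted_risks = np.asarray(emp_risks, float)[order]
    H = len(sorted_risks)

    best_bound, best_q, best_h = np.inf, None, None
    for h in range(1, H + 1):
        q = optimal_posterior_on_prefix(sorted_risks, h)
        if np.any(q < 0):                    # infeasible prefix: stop growing the support
            break
        bound = bound_value(sorted_risks, q)
        if bound < best_bound:
            best_bound, best_q, best_h = bound, q, h

    # map the weights on the ordered prefix back to the original classifier indexing
    q_full = np.zeros(H)
    q_full[order[:best_h]] = best_q
    return q_full, best_bound
```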
Optimal PAC-Bayesian Posterior using Linear Distance Function

As a basic case, we can consider the linear distance function φ_lin(l̂, l) = l − l̂ for l̂, l ∈ [0,1]. The optimal posterior Q*_{lin,χ²} that bounds the unknown averaged true risk of a stochastic classifier is obtained via the following minimization problem for the bound B_{lin,χ²}(Q) identified by [3]:

\min_{Q = (q_1, \ldots, q_H) \in \Delta_H} \; B_{lin,\chi^2}(Q) := \sum_{i=1}^{H} \hat{l}_i q_i + \sqrt{ \frac{ \sum_{i=1}^{H} q_i^2 / p_i }{ 4 m \delta } }. \quad (8)

Theorem 3.
The bound function B_{lin,χ²}(Q) (identified by [3]) is a strictly convex function and hence (8) is a convex program with a unique global minimum.

The proof (in Section A.1 in the appendix) uses the first order convexity property.

Q*_{lin,χ²} for uniform prior

We identify the optimal posterior and the optimal bound value for the linear distance function by exploiting the convexity of B_{lin,χ²}(Q). Proofs of Theorem 4 and Theorem 5 stated below are in Sections A.2 and A.3 in the appendix.

Theorem 4 (Optimal posterior on an ordered subset support). When the prior is the uniform distribution on H, among all posteriors with support a subset of H of size exactly H', the best posterior, denoted by Q*_{lin,χ²}(H'), has its support on the ordered subset H'_ord = { {l̂_i}_{i=1}^{H'} | l̂_1 ≤ l̂_2 ≤ ... ≤ l̂_{H'} } consisting of the smallest H' values in H. The optimal posterior weights are determined as follows:

q^*_{i,lin,\chi^2}(H') = \begin{cases} \dfrac{1}{H'} + \dfrac{ \frac{1}{H'}\sum_{j=1}^{H'} \hat{l}_j - \hat{l}_i }{ H' \sqrt{ \frac{H}{4 H' m \delta} - \widehat{var}_{H'}(\hat{l}) } } & i = 1, \ldots, H', \\[4pt] 0 & i = H'+1, \ldots, H, \end{cases} \quad (9)

where \widehat{var}_{H'}(\hat{l}) = \frac{1}{H'}\sum_{i=1}^{H'}\left( \frac{1}{H'}\sum_{j=1}^{H'}\hat{l}_j - \hat{l}_i \right)^2 = \frac{1}{H'}\sum_{i=1}^{H'}\hat{l}_i^2 - \left( \frac{1}{H'}\sum_{i=1}^{H'}\hat{l}_i \right)^2 is the variance of the values in H'_ord. We require H' to be such that H/(4 H' m δ) − var̂_{H'}(l̂) > 0, so that Q*_{lin,χ²}(H') is defined, and, for feasibility, q*_{i,lin,χ²}(H') > 0 for i = 1, ..., H'.

Using the closed form expression (9), we can identify the optimal posterior Q*_{lin,χ²} via Algorithm 1.
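For concreteness, here is a minimal sketch of the closed form (9) and of the corresponding optimal bound value (10) on an ordered prefix of size h, assuming a uniform prior over all H classifiers; it can be plugged into the Algorithm 1 sketch given earlier.

```python
import numpy as np

def linear_posterior_on_prefix(sorted_risks, h, H, m, delta):
    """Closed-form weights (9) for the linear distance on the h lowest-risk classifiers."""
    prefix = sorted_risks[:h]
    mean, var = prefix.mean(), prefix.var()
    gap = H / (4.0 * h * m * delta) - var       # must be positive for (9) to be defined
    if gap <= 0:
        return np.full(h, -1.0)                 # signal infeasibility to the caller
    return 1.0 / h + (mean - prefix) / (h * np.sqrt(gap))

def linear_bound_on_prefix(sorted_risks, h, H, m, delta):
    """Optimal bound value (10) for the ordered prefix of size h."""
    prefix = sorted_risks[:h]
    return prefix.mean() + np.sqrt(H / (4.0 * h * m * delta) - prefix.var())
```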
Remark 1. For given values of H, H', m and var̂_{H'}(l̂), the upper bound on δ is related to the sparseness of the optimal posterior Q*_{lin,χ²}. A higher δ diminishes the effect of the divergence term Σ_{i=1}^H q_i²/p_i and allows sparse solutions.

Theorem 5. The bound value of the best posterior Q*_{lin,χ²}(H') on an ordered subset of size H',

B^*_{lin,\chi^2}(H') := B_{lin,\chi^2}\left( Q^*_{lin,\chi^2}(H') \right) = \frac{ \sum_{i=1}^{H'} \hat{l}_i }{ H' } + \sqrt{ \frac{H}{4 H' m \delta} - \widehat{var}_{H'}(\hat{l}) }, \quad (10)

is a decreasing function of H' ≤ H*, the support size of the globally optimal posterior Q*_{lin,χ²}.

The PAC-Bayesian bound for the squared distance, φ_sq(l̂, l) = (l̂ − l)² for l̂, l ∈ [0,1], is identified below. We first need to identify I_{R sq}(m, 2) for a given sample size m. Details of the derivation are in Appendix B.
Lemma 2. For a given sample size m,

I_{R\,sq}(m, 2) := \sup_{l \in [0,1]} \sum_{k=0}^{m} \binom{m}{k} l^k (1-l)^{m-k} \left( \tfrac{k}{m} - l \right)^4,

which admits a closed-form expression in m (derived in Appendix B).

Theorem 6. For a finite set of classifiers H, the PAC-Bayesian bound B_{sq,χ²}(Q) based on the squared distance function with the χ²-divergence measure is given by:

B_{sq,\chi^2}(Q) := \sum_{i=1}^{H} \hat{l}_i q_i + \sqrt{ \left( \sum_{i=1}^{H} \frac{q_i^2}{p_i} \right) \frac{ I_{R\,sq}(m, 2) }{ \delta } }. \quad (11)

Proof. Using the PAC-Bayesian statement in (1) with φ_sq(l̂, l) = (l̂ − l)² for finite H, we obtain B_{sq,χ²}(Q) in (11) using (2) and the constant I_{R sq}(m, 2) identified via Lemma 2.

We want to determine the optimal posterior Q*_{sq,χ²} which minimizes B_{sq,χ²}(Q) over ∆_H. This bound function turns out to be non-convex in Q.

Theorem 7. The bound function B_{sq,χ²}(Q) = Σ_{i=1}^H l̂_i q_i + sqrt( (Σ_{i=1}^H q_i²/p_i) I_{R sq}(m, 2)/δ ) is non-convex.

We show this non-convexity, even when P ∼ Unif(H), via counter examples violating the first order convexity property in Section B.2 in the appendix.

Remark 2. Computationally, this bound minimization problem for (11) is observed to have a single solution. We used the bordered Hessian test to verify that the solution obtained is a local minimum. This motivates the following: the bound B_{sq,χ²}(Q) is shown to be quasi-convex under a condition on the system parameters.

Proposition 1. The bound function B_{sq,χ²}(Q) is strictly quasi-convex if the following condition holds for any Q, Q' and each α ∈ (0,1):

\sqrt{ \frac{ I_{R\,sq}(m,2) }{ \delta } } \left( \sqrt{ \sum_{i=1}^{H} \frac{ (\alpha q_i + (1-\alpha) q'_i)^2 }{ p_i } } - \sqrt{ \sum_{i=1}^{H} \frac{ q_i^2 }{ p_i } } \right) < (1 - \alpha)\left( E_Q[\hat{l}] - E_{Q'}[\hat{l}] \right),

and hence a local minimum of the bound minimization problem for the bound (11) is also a global minimum.

Q^{FP}_{sq,χ²} for uniform prior

For the uniform prior set up, we derive the FP equation for minimizing (11) on an ordered subset support of size H'.

Theorem 8 (Optimal posterior on an ordered subset support). When the prior is the uniform distribution on H, among all posteriors with support a subset of size exactly H', the best posterior, denoted by Q*_{sq,χ²}(H'), has its support on the ordered subset H'_ord = { {l̂_i}_{i=1}^{H'} | l̂_1 ≤ l̂_2 ≤ ... ≤ l̂_{H'} } consisting of the smallest H' values in H. The optimal posterior weights {q*_{i,sq,χ²}(H')} are determined as the solution to the following fixed point equation in {q_i(H')}_{i=1}^{H'}:

q_i(H') = \begin{cases} \dfrac{1}{H'} + \sqrt{ \sum_{j=1}^{H'} \big(q_j(H')\big)^2 } \, \sqrt{ \dfrac{\delta}{ H \, I_{R\,sq}(m,2) } } \left( \dfrac{1}{H'}\sum_{j=1}^{H'} \hat{l}_j - \hat{l}_i \right) & i = 1, \ldots, H', \\[4pt] 0 & i = H'+1, \ldots, H, \end{cases} \quad (12)

under the feasibility condition that q_i(H') > 0 for i = 1, ..., H'.

The proof of this theorem is in Section B.3 in the appendix. In our computations, the FP iterates in (12) do converge to a solution. Using them, we can identify the optimal posterior Q*_{sq,χ²} via Algorithm 1.

The chi-squared divergence based PAC-Bayesian bound using the distance function kl(l̂, l) = l̂ ln(l̂/l) + (1 − l̂) ln((1 − l̂)/(1 − l)) (for l̂, l ∈ [0,1]) is:

B_{kl,\chi^2}(Q) = \sup_{r \in (0,1)} \left\{ r : kl\left( E_Q[\hat{l}], r \right) \le \sqrt{ \frac{ (\chi^2(Q\|P) + 1)\, I_{R\,kl}(m, 2) }{ \delta } } \right\}, \quad (13)

where I_{R kl}(m, 2) := sup_{l ∈ [0,1]} Σ_{k=0}^{m} C(m,k) l^k (1−l)^{m−k} ( kl(k/m, l) )² should be computed first. For m > 1028, the computation is difficult due to storage limitations in the range of floating point numbers. We notice that I_{R kl}(m, 2) decreases with m, and hence we can use I_{R kl}(1028, 2) as an upper approximation for I_{R kl}(m, 2) for m > 1028. Please refer to Table 5 and Figure 3 in Appendix C for details.

kl(·,·) is not a monotone function and so its inverse does not exist. Thus, B_{kl,χ²}(Q) does not have an explicit form. However, we can employ a numerical root finding algorithm, such as that described in [20] (Algorithm KLroots), to obtain B_{kl,χ²}(Q) for given system parameter values.

For a finite classifier space H = {h_i}_{i=1}^H with empirical risk values {l̂_i}_{i=1}^H, the KL-distance bound minimization problem is:

\min_{ (q_1, \ldots, q_H) \in \Delta_H,\; r \in (0,1) } \; r \quad (14a)

\text{s.t.} \quad \left( \sum_{i=1}^{H} \hat{l}_i q_i \right) \ln \frac{ \sum_{i=1}^{H} \hat{l}_i q_i }{ r } + \left( 1 - \sum_{i=1}^{H} \hat{l}_i q_i \right) \ln \frac{ 1 - \sum_{i=1}^{H} \hat{l}_i q_i }{ 1 - r } = \sqrt{ \frac{ \left( \sum_{i=1}^{H} q_i^2 / p_i \right) I_{R\,kl}(m, 2) }{ \delta } }, \quad (14b)

r \ge \sum_{i=1}^{H} \hat{l}_i q_i. \quad (14c)

Here, r is the right root of (14b) for a given E_Q[l̂]. The above is known to be a non-convex problem with a difference of convex (DC) equality constraint (14b), and it has multiple stationary points. This fact is illustrated in our computations, where different initializations led to different stationary points. Using the bordered Hessian test, we verified that the stationary points computed on our datasets by the solver or by the FP equation (15) given below are either local minima or saddle points. The constraint (14c) is a relaxation of a strict inequality, used to ensure a solution in the closed domain. The iterative root finding algorithm adds to the computational complexity of the bound minimization algorithm.

The objective function and constraints of the above bound minimization (14) are either linear or difference of convex (DC) functions, hence it falls into the category of a DC program. We can make use of the convex-concave procedure (CCP), a powerful heuristic method used to find local solutions to DC programming problems [12]. This procedure makes a linear approximation, via a supporting hyperplane, to the second convex function in the DC decomposition. This converts the DC constraint into a convex constraint, and hence the original DC program is reduced to a convex program which can be solved easily. The details of the CCP for solving (14) are in Appendix C.3 and the related computations are in Table 8 in the appendix. Our general observation is that the fixed point scheme that we derive outperforms CCP.

Q^{FP}_{kl,χ²} for uniform prior

We derive the FP equation for (14) for the uniform prior when the support of Q is an ordered subset of size H'. The proof is in Section C.2 of the appendix. This FP equation is used in Algorithm 1 for determining Q^{FP}_{kl,χ²}.

Theorem 9 (Optimal posterior on an ordered subset support). When the prior is the uniform distribution on H, among all posteriors with support a subset of size exactly H', the best posterior, denoted by Q*_{kl,χ²}(H'), has its support on the ordered subset H'_ord = { {l̂_i}_{i=1}^{H'} | l̂_1 ≤ l̂_2 ≤ ... ≤ l̂_{H'} } consisting of the smallest H' values in H. A stationary point Q^{FP}_{kl,χ²}(H') for (14) can be obtained as the solution to the following fixed point equation in {q_i}_{i=1}^{H'}:

q_i = \frac{1}{Z_{kl,\chi^2}} \cdot \frac{ \left( \sum_{j=1}^{H'} q_j^2 \right) \left( \hat{l}_i - \sum_{j=1}^{H'} \hat{l}_j q_j \right) }{ \sqrt{ \frac{ H \left( \sum_{j=1}^{H'} q_j^2 \right) I_{R\,kl}(m, 2) }{ \delta } } \; \ln\!\left( \frac{ (1 - r) \sum_{j=1}^{H'} \hat{l}_j q_j }{ r \left( 1 - \sum_{j=1}^{H'} \hat{l}_j q_j \right) } \right) }, \quad (15)

for i = 1, ..., H', where Z_{kl,χ²} is a suitable normalization constant and r is the solution to (14b) and (14c) for a given Q = (q_1, ..., q_{H'}) ∈ int(∆_{H'}).
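Since kl(L̂, ·) is not invertible in closed form, the bound value appearing in (13)-(15) has to be found numerically. The following is a minimal sketch (not the KLroots routine of [20], just a plain bisection) that finds the right root r of kl(L̂, r) = c on (L̂, 1), using the fact that kl(L̂, r) is increasing in r on that interval.

```python
import math

def kl_bernoulli(p, q):
    """kl(p, q) = p ln(p/q) + (1-p) ln((1-p)/(1-q)) for p, q in (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_right_root(l_hat, c, tol=1e-12):
    """Right root r in (l_hat, 1) of kl(l_hat, r) = c, via bisection.

    Returns 1.0 (the trivial bound) if even r close to 1 does not reach c.
    Assumes l_hat in (0, 1) and c > 0.
    """
    lo, hi = l_hat, 1.0 - 1e-12
    if kl_bernoulli(l_hat, hi) <= c:
        return 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(l_hat, mid) <= c:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Example: averaged empirical risk 0.1 with complexity term c = 0.05
# print(kl_right_root(0.1, 0.05))   # approximately 0.22
```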
For computations, we included nine datasets from the UCI repository [5] with a small to moderate number of examples (306 to 5463 examples) and a small to moderate number of features (3 to 57 features). The details about the number of features, number of examples and class distribution of these datasets are listed in Table 2 [22]. These datasets span a variety ranging from almost linearly separable (Banknote, Mushroom and Waveform datasets) to moderately inseparable (Wdbc, Mammographic and Ionosphere datasets) to inseparable data (Spambase, Bupa and Haberman datasets). SVMs on these datasets have varying ranges and degrees of variation in their empirical risk values.

We consider a finite set of SVM regularization parameter values Λ = {λ_i}_{i=1}^H, say, between 0 and an upper bound, since small values of λ_i are preferable. The set Λ is an arithmetic-geometric progression (AGP) with a logarithmic scale for λ ∈ (0, 0.1) and a linear scale for λ ≥ 0.1. The logarithmic subset of Λ is a union of 3 geometric series, each with elements truncated between a small positive lower bound and 0.1. We use a lower bound slightly away from 0, since the SVM classifiers become very close (almost indistinguishable) and give similar training and test errors for very small values of λ which are almost zero. For values beyond 0.1, we use an arithmetic progression with a spacing of 0.05. We use an upper bound λ = 5 on this arithmetic series, since on most datasets SVMs generated by λ ≥ 5 do not have good training and test error rates. (A graphical illustration of the ranges and variation of training errors and test errors of the nine UCI datasets we considered on the chosen Λ range is depicted in Section G in [22].)

The SVM QP (with RBF kernels) was implemented using the ksvm function in the kernlab package [9] in R (version 3.1.3 (2015-03-09)). The Gaussian width parameter was estimated by kernlab using the sigest function, which estimates the 0.1 and 0.9 quantiles of the squared distance between the data points.

Each of these datasets was partitioned such that 80% of the examples formed a composite of a training set and a validation set (in equal proportion), used for constructing the set H = {h(λ_i) | λ_i ∈ Λ}_{i=1}^H of SVM classifiers, and the remaining 20% was used for computing their test error rates. The training set size (m), validation set size (v) and test set size (t) are in the ratio m : v : t = 0.4 : 0.4 : 0.2. The role of the validation set is to compute the empirical risk l̂_i of the SVM h(λ_i) ∈ H, which is used for deriving the PAC-Bayesian bound. We follow the scheme provided in [3, 25] to generate the set H: each classifier h(λ_i) ∈ H is trained on m training examples subsampled from this composite set and validated on the remaining v examples. Overlaps between the training sets of different classifiers are allowed; the same is true for their validation sets. (For further details about the dataset categorization, please refer to Section G.1 in [22].)
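A minimal sketch of such a mixed logarithmic/linear grid of regularization parameter values is given below; the lower bound, geometric ratio and counts used in the paper are not reproduced here, so the default values are placeholders for illustration only.

```python
import numpy as np

def lambda_grid(low=1e-4, ratio=2.0, lin_step=0.05, lin_max=5.0):
    """Mixed grid: geometrically spaced values in (0, 0.1), linearly spaced in [0.1, lin_max]."""
    log_part = []
    lam = low
    while lam < 0.1:
        log_part.append(lam)
        lam *= ratio                                        # geometric progression below 0.1
    lin_part = np.arange(0.1, lin_max + 1e-9, lin_step)      # arithmetic progression, spacing 0.05
    return np.unique(np.concatenate([np.array(log_part), lin_part]))

# Example: grid = lambda_grid(); one SVM h(lambda_i) is then trained per grid value.
```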
Dataset (number of features n; number of examples, Pos/Neg; training set size m; validation set size v; test set size t):
Spambase: 57; 4601 (2788/1813); 1840; 1840; 921
Bupa
Mammographic
Wdbc: 30; 569 (357/212); 227; 227; 115
Banknote
Mushroom: 22 (b); 5643 (c)
Ionosphere: 34; 351 (225/126); 140; 140; 71
Waveform: 40; 3308 (d)
Haberman

(b) after one-hot encoding for categorical features. (c) after removing the rows with missing values from the data. (d) number of examples when class '0' is removed.
Table 2: Details of the various UCI datasets used for computational experiments (Table 5 in Section G of [22]). We list the number of features n and the total number of examples, with their distribution into positive and negative classes, for each dataset. We also give the number of examples in the training, validation and test sets, according to the random partition created by the 0.4 : 0.4 : 0.2 split of the total dataset size.

The PAC-Bayesian bound minimization for finding the optimal posterior was implemented through the AMPL interface and solved using the Ipopt software package (version 3.12 (2016-05-01)) [27], a library for large-scale nonlinear optimization (http://projects.coin-or.org/Ipopt). All the computations were done on a machine equipped with 4 Intel Xeon 2.13 GHz cores and 64 GB RAM.
We present a comparative study of the χ²-divergence based optimal posteriors considered in this paper with the optimal posteriors for KL-divergence based PAC-Bayesian bounds considered by [21]. We also analyze the performance of these stochastic SVM classifiers governed by PAC-Bayesian posteriors with respect to the deterministic SVM classifier identified via the cross-validation procedure.

Given a set of base classifiers and having computed their empirical risk values, we observe the following differences and similarities between the optimal PAC-Bayesian posteriors due to the two divergence functions on this classifier set.
Table 3: PAC-Bayesian bounds and averaged test error rates for Q*_{φ,χ²}. We compare the bound values B*_{φ,χ²} and the average test error rates T_{φ,χ²} of the optimal posteriors Q*_{φ,χ²} for three distance functions: KL-divergence kl, linear distance φ_lin and squared distance φ_sq. For large sample size (m ≥ 1028), the constant I_{R kl}(m, 2) cannot be computed due to floating point storage limitations on the machine, so we use an upper approximation: for m > 1028, I_{R kl}(m, 2) ≤ I_{R kl}(1028, 2), since I_{R kl}(m, 2) is decreasing with m. Please see Table 7 in Appendix D for details. ⋆ refers to values obtained using the fixed point equation because the solver Ipopt does not converge to a solution for reasons like local infeasibility, Restoration Phase Failed, etc. Please see Appendix D for more such examples. The lowest 10% bound values and test error rates for each dataset are denoted in bold face. For a given posterior, we measure sparsity by the number of classifiers needed by its cumulative distribution function (CDF) to achieve a certain significance level. Concentration of a posterior is quantified in terms of its ℓ2 norm, which is equivalent to the HHI score used for measuring the market share of a firm in an industry [7, 28]. For almost separable datasets, the CDFs of the three posteriors are close to each other. Q*_{kl,χ²} has almost full support and low concentration. These posteriors give extremely loose bounds and are computationally expensive, but have test error rates better than the linear distance ones, Q*_{lin,χ²}, on most datasets. Squared distance based posteriors, Q*_{sq,χ²}, are sparse with relatively high concentration (relatively high ℓ2 norm) and have the tightest bounds and lowest test error rates. The contrast in test error rates is striking when the dataset yields classifiers with high variation in empirical risk values. See Section D.2 in the appendix for details.

Dataset: PAC-Bayesian bound B*_{lin,χ²}, B*_{sq,χ²}, B*_{kl,χ²}; average test error T_{lin,χ²}, T_{sq,χ²}, T_{kl,χ²}
Spambase 0.38054 ⋆ ⋆
Bupa 0.82183 ⋆ ⋆
Mammographic 0.50276 ⋆ ⋆
Wdbc 0.41631 ⋆ ⋆
Banknote 0.22283 ⋆ ⋆
Mushroom 0.10785 ⋆ ⋆
Ionosphere 0.64273 ⋆ ⋆
Waveform 0.18565
Haberman 0.70477 ⋆ ⋆

(i) Nature of optimal posteriors: Both KL-divergence and χ²-divergence based optimal posteriors exhibit a decreasing trend with respect to the empirical risk values; that is, the higher the empirical risk of a classifier, the lower its optimal posterior weight. However, the rate at which these posterior weights decrease is influenced by the choice of divergence function in the PAC-Bayesian bound. In the case of KL-divergence based PAC-Bayesian bounds, the optimal posterior weights decrease exponentially with the empirical risk values, whereas the optimal posteriors that minimize χ²-divergence based PAC-Bayesian bounds have linearly decreasing weights with respect to the empirical risk values.

When the prior is the uniform distribution, the optimal posterior weights in both cases are determined directly by the empirical risk values (the prior weights play no role). Consider the case of the linear distance function with χ²-divergence, whose optimal posterior weights are determined in Theorem 4 and given by Equation (9). Similarly, when we have the linear distance function with KL-divergence, the optimal posterior weights are given as (Equation (21) in [21]):

q^*_{i,lin,KL} = \frac{ p_i e^{-2 m \hat{l}_i} }{ \sum_{j=1}^{H} p_j e^{-2 m \hat{l}_j} } \quad \forall\, i = 1, \ldots, H. \quad (16)

Using a uniform prior, p_i = 1/H, in the above, we get the following expression for the optimal posterior weights, which depend on the empirical risk values l̂_i on an exponential scale:

q^*_{i,lin,KL} = \frac{ e^{-2 m \hat{l}_i} }{ \sum_{j=1}^{H} e^{-2 m \hat{l}_j} } \quad \forall\, i = 1, \ldots, H. \quad (17)

(ii) Size of the support set: KL-divergence based optimal posteriors take into account all the classifiers in the base set, though the posterior weight associated with a high risk classifier is infinitesimally small. On the other hand, χ²-divergence based optimal posteriors select only a strict subset of the base classifiers, comprising the ones with low empirical risk values.

(iii) Test set performance: Optimal posteriors for the KL-divergence based bounds have relatively lower test error rates than their χ²-divergence based counterparts. This hints at an underlying regularization phenomenon involving the support set and the tightness of the bound. The χ²-divergence based posteriors might be overfitting because they concentrate on a strict subset support of classifiers. In contrast, the KL-divergence based posteriors have the whole classifier set as their support and a better test set performance. This phenomenon is supported by the fact that KL-divergence based optimal posteriors yield tighter bounds than the ones derived for the case of χ²-divergence.
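To see the two decay profiles side by side, here is a small illustrative sketch (uniform prior assumed): the KL-based weights follow the Gibbs form (17), and the χ²-based weights follow the linear form (9). For simplicity the χ²-based sketch uses the largest feasible ordered prefix rather than the bound-optimal support size H* of Algorithm 1.

```python
import numpy as np

def kl_based_weights(emp_risks, m):
    """Gibbs-type weights of (17): exponential decay in the empirical risk."""
    r = np.asarray(emp_risks, float)
    w = np.exp(-2.0 * m * (r - r.min()))        # shift for numerical stability; normalization cancels it
    return w / w.sum()

def chi2_linear_weights(emp_risks, H, m, delta):
    """Weights of (9) on the largest feasible ordered prefix: linear decay in the risk.

    Returned weights are in sorted-risk order, padded with zeros outside the support.
    """
    risks = np.sort(np.asarray(emp_risks, float))
    for h in range(len(risks), 0, -1):          # shrink the support until (9) is feasible
        prefix = risks[:h]
        gap = H / (4.0 * h * m * delta) - prefix.var()
        if gap > 0:
            q = 1.0 / h + (prefix.mean() - prefix) / (h * np.sqrt(gap))
            if np.all(q > 0):
                return np.concatenate([q, np.zeros(len(risks) - h)])
    return None
```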
We performed 5-fold cross-validation (CV) on the datasets by setting aside 20% of the data as a test set for each dataset. The set of λ values is an arithmetic-geometric progression (AGP) with a logarithmic scale for λ ∈ (0, 0.1) and a linear scale for λ ≥ 0.1; this is the same set of λ values that was used for the proposed PAC-Bayesian technique. We report the test error of the "best" λ identified by the CV method and compare it with the PAC-Bayesian method using the (sq, χ²) pair (since this pair gives the lowest test error obtained by using the different distance functions). These values are reported in Table 4 below. In terms of relative test error, the CV method is significantly better than the proposed method on the Spambase, Bupa, Ionosphere and Haberman datasets, while the proposed PAC-Bayesian method is significantly better than the CV method on the Wdbc dataset. The difference in the two test errors is small (less than 20%) on the other datasets.

Thus, the CV method has better test error performance than the χ²-divergence based PAC-Bayesian posterior. But the CV method takes a remarkably longer time (between 2 to 10 hours) to identify the "best" λ than the time taken by the PAC-Bayesian method to identify the optimal posterior on the classifier space (which generally takes about 20 to 300 seconds). Thus, the computational complexity of the CV method is much higher than that of the PAC-Bayesian method.

Dataset: λ*, CV test error, sq-χ² PAC-Bayes test error, ∆ test error, relative test error, sq-χ² PAC-Bayes bound
Spambase ⋆
Bupa ⋆
Mammographic
Wdbc •
Banknote
Mushroom
Ionosphere ⋆
Waveform
Haberman ⋆

Table 4: Comparison of the cross-validation (CV) based deterministic SVM classifier with the sq-χ² PAC-Bayesian stochastic classifier. The set of λ values is an AGP with a logarithmic scale for λ ∈ (0, 0.1) and a linear scale for λ ≥ 0.1. ∆ test error is the amount by which the test error of the CV method is smaller or larger than that of the PAC-Bayesian method for the sq-χ² pair. Relative test error is the ratio of ∆ test error to the CV test error, signifying the relative difference between the test errors of the two methods. • denotes the instances where the PAC-Bayesian method has a significantly lower test error rate than the CV method, while ⋆ denotes the instances where the CV method has a significantly lower test error rate.

Apart from the computational benefits, the proposed PAC-Bayesian method also has statistical advantages over the CV method, as noted below:

(i) Sample robustness: The CV method is not as sample robust as the PAC-Bayesian method, even though it trains on multiple sub-samples (partitions) and reports the averaged training error as the CV error for choosing the best λ. For example, the 5-fold CV that we performed uses multiple training samples with a diminished sample size of 20% of the dataset size, whereas the PAC-Bayesian method uses a single and much larger training sample (60% of the dataset size in our scheme) to give an upper bound on the true risk.

(ii) Point estimate versus interval estimate: The CV method gives a point estimate of the true risk by averaging the CV error over multiple folds, but there are no guarantees associated with it. On the other hand, the PAC-Bayesian method gives an interval estimate of the form [0, Bnd], where Bnd denotes the upper bound given by the PAC-Bayesian theorem, and it intrinsically has a high-probability guarantee associated with it.
(iii) Deterministic versus stochastic classifier: The CV method outputs a deterministic classifier in terms of the "best" λ value, which has good test performance. The PAC-Bayesian technique is a committee method that outputs an optimal distribution on the set of classifiers, which yields a stochastic classifier. The classifier determined by the CV method may have better performance on a single test set, but the stochastic classifier obtained via the PAC-Bayesian technique will have comparable performance when used on multiple test set instances.

(iv) An upper bound on the true risk versus its point estimate: The PAC-Bayesian method gives a tight upper bound on the true risk of the stochastic classifier for all datasets. This upper bound holds even for the "best" deterministic classifier obtained by the CV method, as can be seen in Table 4 above. Thus the high-probability PAC-Bayesian upper bound is a more useful quantity than the estimate of the true risk given by the CV test error, which can be under-biased or over-biased depending on the training sample and the folds created from this sample for cross-validation.

These advantages strengthen the usefulness of the PAC-Bayesian method for constituting a stochastic classifier, and perhaps in tuning hyperparameters of other classification algorithms.
We determine optimal posteriors for the PAC-Bayesian bound minimization problem with bounds derived using the χ²-divergence function. The distance functions that we considered are: the linear distance, the squared distance (a second degree polynomial) and the KL-divergence (an infinite degree polynomial). We first show that, in the uniform prior set up, minimizers of these PAC-Bayesian bounds can be obtained by a restricted search on subsets of the classifier set ordered by empirical risks. The bound minimization problem for the linear distance case is shown to be a convex program, and we also derive a closed form expression for its optimal posterior, while the other two distance functions result in non-convex programs. We further show that the squared distance results in a quasi-convex bound under certain conditions, and it is computationally observed to have a single local minimum. We propose a convergent and computationally cheap fixed point based approach to identify the optimal posteriors for these bound minimization problems.

Our computational exercise is comprehensive. The nine UCI datasets we have considered take into account small to moderate numbers of examples and features, balanced and imbalanced classes, and different ranges and variances in the empirical risk values. Using this set of datasets helps us compare and understand the performance of optimal PAC-Bayesian posteriors due to different distance functions for a given divergence function, and also across the KL- and χ²-divergence functions.

Based on the computations on SVM classifiers, we observe that the squared distance based posteriors perform the best among the three distance functions in terms of bound values as well as average test error rates. The optimal posteriors for linear and squared distances have subset support, especially as the size of the classifier set increases. On the other hand, KL-distance based posteriors usually have full support but do not perform well on the test set. This could be because they overfit the data while training. These chi-squared divergence based optimal posteriors do not have a high measure of concentration, implying less bias towards classifiers with low empirical risks.

Comparing with KL-divergence based optimal PAC-Bayesian posteriors, we observe that both groups of PAC-Bayesian posteriors have weights decreasing with respect to the empirical risk values of the classifiers. The difference lies in the rate of decrease: KL-divergence based posterior weights decrease exponentially, whereas χ²-divergence based ones show a linear decrease. Also, the former have the full classifier set as their support, while the latter usually pick a strict subset of the base classifiers as their support. On test set performance, KL-divergence based posteriors are better than those based on χ²-divergence. This phenomenon hints at an underlying regularization by the KL-divergence based posteriors; perhaps, an implicit regularization.

We also provide a comparison of these PAC-Bayesian posteriors with the widely used cross-validation procedure as the baseline case. While CV has lower test errors, the PAC-Bayesian method has the advantages of sample robustness and a much lower computational cost over the cross-validation method.
Also, it provides a reliable high probability upper bound on the true risk, rather than the point estimate given by the cross-validation method.

The significance of this work is in understanding the importance of choosing a divergence function for the PAC-Bayesian bound and its influence on the resulting optimal posteriors which are used to design stochastic classifiers. The challenges associated with this study are: deducing the form of the bound for a given distance function, identifying the nature of the corresponding bound minimization problem, and obtaining a closed-form expression or a fixed point equation for the optimal posterior which minimizes this bound. Another challenge is identifying the support set of the optimal posterior for an arbitrary prior distribution on the classifier set. Therefore we considered the uniform prior, which provided us with a structure to address the problem of identifying the support set in a linear fashion. On the computational side, we had to carefully choose the set of regularization parameter values to be used for generating the base SVMs, since this greatly influences the performance of the stochastic SVMs built on them. To achieve a right mix of base classifiers with good risks, we considered an arithmetic-geometric progression of the regularization parameter values in the interval (0, 5].

As part of future work, we can extend our results to a non-uniform prior on the classifier space, where we do not have such a structure on the feasible region and may need to do a full simplex search. We would also like to understand the nature of our results on high dimensional datasets such as image segmentation [5], where each image is represented in matrix form rather than as a vector or a record.

References

[1] Amiran Ambroladze, Emilio Parrado-Hernández, and John Shawe-Taylor. Tighter PAC-Bayes bounds. In Neural Information Processing Systems, pages 9-16, 2006.
[2] M. S. Bazaraa, H. D. Sherali, and C. M. Shetty. Nonlinear Programming: Theory and Algorithms. Wiley, 2013.
[3] Luc Bégin, Pascal Germain, François Laviolette, and Jean-Francis Roy. PAC-Bayesian bounds based on the Rényi divergence. In Artificial Intelligence and Statistics, pages 435-444, 2016.
[4] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[5] Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017.
[6] Pascal Germain, Alexandre Lacasse, François Laviolette, and Mario Marchand. PAC-Bayesian learning of linear classifiers. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 353-360, 2009.
[7] Albert O. Hirschman. National Power and the Structure of Foreign Trade. University of California Press, 1945.
[8] Anatoli Juditsky. Lecture notes in convex optimization: Theory and algorithms. https://ljk.imag.fr/membres/Anatoli.Iouditski/, November 2015.
[9] Alexandros Karatzoglou, Alex Smola, Kurt Hornik, and Achim Zeileis. kernlab - An S4 Package for Kernel Methods in R, volume 11, 2004.
[10] John Langford. Tutorial on practical prediction theory for classification. Journal of Machine Learning Research, 6(Mar):273-306, 2005.
[11] John Langford and John Shawe-Taylor. PAC-Bayes & margins. In Neural Information Processing Systems (NIPS), pages 439-446, 2002.
[12] Thomas Lipp and Stephen Boyd. Variations and extension of the convex-concave procedure. Optimization and Engineering, 17(2):263-287, 2016.
[13] Andreas Maurer. A note on the PAC Bayesian theorem. CoRR, cs.LG/0411099, 2004.
[14] David McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51(1):5-21, 2003.
[15] David McAllester. A PAC-Bayesian tutorial with a dropout bound. arXiv preprint arXiv:1307.2118, 2013.
[16] David McAllester and Takintayo Akinbiyi. PAC-Bayesian theory. In Bernhard Schölkopf, Zhiyuan Luo, and Vladimir Vovk, editors, Empirical Inference: Festschrift in Honor of Vladimir N. Vapnik, chapter 10, pages 95-103. Springer Science and Business Media, 2013.
[17] David A. McAllester. Some PAC-Bayesian theorems. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, COLT '98, pages 230-234, New York, NY, USA, 1998. ACM.
[18] David A. McAllester. PAC-Bayesian model averaging. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, COLT '99, pages 164-170, New York, NY, USA, 1999. ACM.
[19] Emilio Parrado-Hernández, Amiran Ambroladze, John Shawe-Taylor, and Shiliang Sun. PAC-Bayes bounds with data dependent priors. Journal of Machine Learning Research, 13:3507-3531, 2012.
[20] Puja Sahu and Nandyala Hemachandra. Some new PAC-Bayesian bounds and their use in selection of regularization parameter for linear SVMs. In Conference on Data Science and Management of Data, pages 240-248, 2018. DOI: 10.1145/3152494.3152514.
[21] Puja Sahu and Nandyala Hemachandra. Optimal PAC-Bayesian posteriors for stochastic classifiers and their use for choice of SVM regularization parameter. In Wee Sun Lee and Taiji Suzuki, editors, The 11th Asian Conference on Machine Learning, volume 101 of Proceedings of Machine Learning Research, pages 268-283, Nagoya, Japan, 17-19 Nov 2019. PMLR.
[22] Puja Sahu and Nandyala Hemachandra. Optimal PAC-Bayesian posteriors for stochastic classifiers and their use for choice of SVM regularization parameter, 2019. https://arxiv.org/abs/1912.06803.
[23] Matthias Seeger. The proof of McAllester's PAC-Bayesian theorem. In Neural Information Processing Systems, 2002.
[24] Yevgeny Seldin, François Laviolette, Nicolò Cesa-Bianchi, John Shawe-Taylor, and Peter Auer. PAC-Bayesian inequalities for martingales. IEEE Transactions on Information Theory, 58(12):7086-7093, 2012.
[25] Niklas Thiemann, Christian Igel, Olivier Wintenberger, and Yevgeny Seldin. A strongly quasiconvex PAC-Bayesian bound. In Algorithmic Learning Theory, pages 466-492, 2017.
[26] Tim van Erven and Peter Harremoës. Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 60(7):3797-3820, 2014.
[27] Andreas Wächter and Lorenz T. Biegler. On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. Mathematical Programming, 106(1):25-57, 2006.
[28] Wikipedia contributors. Herfindahl index, Wikipedia, the free encyclopedia, 2019. [Online; accessed 25-May-2019].
A Optimal PAC-Bayesian Posterior using Linear Distance Function

As a basic case, we can consider the linear distance function, $\phi_{\mathrm{lin}}(\hat l, l) = l - \hat l$ for $\hat l, l \in [0,1]$. The PAC-Bayesian bound in this case takes the following simplified form [3]:

$P_S\Big\{ \mathbb{E}_Q[l] \le \mathbb{E}_Q[\hat l] + \sqrt{\tfrac{\chi^2(Q\Vert P)+1}{4m\delta}} \Big\} \ge 1-\delta.$    (18)

Thus, the upper bound on the true risk of a stochastic classifier governed by a distribution Q, when using the linear distance function with χ²-divergence between prior and posterior, is:

$B_{\mathrm{lin},\chi^2}(Q) = \sum_{i=1}^{H} \hat l_i q_i + \sqrt{\frac{\sum_{i=1}^{H} q_i^2/p_i}{4m\delta}}.$    (19)

A.1 The bound minimization problem
The corresponding bound minimization problem is:

$\min_{Q=(q_1,\ldots,q_H)} \; \sum_{i=1}^{H}\hat l_i q_i + \sqrt{\frac{\sum_{i=1}^{H} q_i^2/p_i}{4m\delta}}$ subject to $\sum_{i=1}^{H} q_i = 1$, $q_i \ge 0$ for all $i = 1,\ldots,H$.    (20)

We are interested in the distribution $Q^*_{\mathrm{lin},\chi^2}$ which is optimal for the above bound minimization problem, since it corresponds to the tightest PAC-Bayesian upper bound on the true risk of a stochastic classifier.

Theorem 10. The bound function $B_{\mathrm{lin},\chi^2}(Q) = \sum_{i=1}^{H}\hat l_i q_i + \sqrt{\frac{\sum_{i=1}^{H} q_i^2/p_i}{4m\delta}}$ is a strictly convex function, and hence the optimization problem (20) is a convex program with a unique global minimum.

Proof. $B_{\mathrm{lin},\chi^2}(Q)$ is a differentiable function of $Q = (q_i)_{i=1}^H$, so it suffices to verify the first order condition $B_{\mathrm{lin},\chi^2}(Q') \ge B_{\mathrm{lin},\chi^2}(Q) + \langle \nabla B_{\mathrm{lin},\chi^2}(Q), Q'-Q\rangle$ for any pair of distributions $Q, Q'$. Writing out both sides, the linear terms $\sum_i \hat l_i q_i'$ cancel and the condition reduces to

$\sqrt{\sum_i q_i'^2/p_i} \;\ge\; \frac{\sum_i q_i q_i'/p_i - \sum_i q_i^2/p_i}{\sqrt{\sum_i q_i^2/p_i}} + \sqrt{\sum_i q_i^2/p_i} \;\;\Longleftrightarrow\;\; \sqrt{\sum_i \frac{q_i'^2}{p_i}}\,\sqrt{\sum_i \frac{q_i^2}{p_i}} \;\ge\; \sum_i \frac{q_i q_i'}{p_i}.$    (21)

By the Cauchy–Schwarz inequality, (21) holds for any pair of distributions $Q, Q'$ and any given prior $P$, with equality if and only if $Q \equiv Q'$. This implies that the bound function $B_{\mathrm{lin},\chi^2}(Q)$ is strictly convex; therefore the optimization problem (20) has a unique global minimum.

A.2 The optimal posterior, $Q^*_{\mathrm{lin},\chi^2}$, via partial KKT system
Theorem 11. The global minimum of the bound minimization problem (20) can be obtained via

$q^*_{i,\mathrm{lin},\chi^2} = p_i\left(1 + \frac{\sum_{j=1}^{H}\hat l_j p_j - \hat l_i}{\sqrt{\frac{1}{4m\delta} - \widehat{\mathrm{var}}_P(\hat l)}}\right), \quad i = 1,\ldots,H,$    (22)

if the following two conditions are satisfied:

$q^*_{i,\mathrm{lin},\chi^2} \ge 0 \quad \forall\, i = 1,\ldots,H,$    (23)

$\frac{1}{4m\delta} > \widehat{\mathrm{var}}_P(\hat l),$    (24)

where $\widehat{\mathrm{var}}_P(\hat l) := \sum_{i=1}^{H} p_i\big(\sum_{j=1}^{H}\hat l_j p_j - \hat l_i\big)^2$ is the variance of the empirical risk values $\hat l_i$ under the prior distribution $P$.

Proof. The Lagrangian function corresponding to the optimization problem (20), ignoring the positivity constraints, is

$\mathcal{L}_{\mathrm{lin},\chi^2}(Q,\mu) := \sum_{i}\hat l_i q_i + \sqrt{\frac{\sum_{i} q_i^2/p_i}{4m\delta}} - \mu\Big(\sum_{i} q_i - 1\Big).$    (25)

At optimality, the posterior $Q$ sets the derivatives of $\mathcal{L}_{\mathrm{lin},\chi^2}(Q,\mu)$ to zero. Setting $\partial\mathcal{L}_{\mathrm{lin},\chi^2}/\partial q_i = 0$ gives

$\hat l_i + \frac{1}{2\sqrt{m\delta}}\,\frac{q_i/p_i}{\sqrt{\sum_j q_j^2/p_j}} - \mu = 0 \;\Longrightarrow\; \frac{q_i}{\sqrt{\sum_j q_j^2/p_j}} = 2\sqrt{m\delta}\,(\mu-\hat l_i)\,p_i.$    (26)

The primal feasibility condition $\partial\mathcal{L}_{\mathrm{lin},\chi^2}/\partial\mu = 0$, i.e. $\sum_i q_i = 1$, together with (26) summed over $i$ gives

$2\sqrt{m\delta}\,\sqrt{\sum_j q_j^2/p_j}\,\Big(\mu - \sum_i\hat l_i p_i\Big) = 1 \;\Longrightarrow\; \mu = \sum_i\hat l_i p_i + \frac{1}{2\sqrt{m\delta}\sqrt{\sum_j q_j^2/p_j}}.$    (27)

Combining (26) and (27), we obtain the relation $q_i = p_i + 2\sqrt{m\delta}\sqrt{\sum_j q_j^2/p_j}\; p_i\big(\sum_j \hat l_j p_j - \hat l_i\big)$ for all $i$. Using the transformation $z_i = q_i/p_i - 1$, and noting that $\sum_j p_j z_j = 0$ and $\sum_j q_j^2/p_j = 1 + \sum_j p_j z_j^2$, this system can be solved explicitly:

$z^*_i = \frac{\sum_{j}\hat l_j p_j - \hat l_i}{\sqrt{\frac{1}{4m\delta} - \widehat{\mathrm{var}}_P(\hat l)}}, \quad i = 1,\ldots,H,$

which implies

$q^*_{i,\mathrm{lin},\chi^2} = p_i\left(1 + \frac{\sum_{j}\hat l_j p_j - \hat l_i}{\sqrt{\frac{1}{4m\delta} - \widehat{\mathrm{var}}_P(\hat l)}}\right), \quad i = 1,\ldots,H.$    (28)

For $q^*_{i,\mathrm{lin},\chi^2}$ to be a real number, we need $\frac{1}{4m\delta} - \widehat{\mathrm{var}}_P(\hat l) > 0$.
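For concreteness, the closed form (22) is directly computable from the empirical risks, the prior weights and the parameters m and δ. The following Python sketch (the function and variable names are ours, not from the paper) evaluates the candidate posterior and signals when the validity conditions (23)-(24) fail.

import numpy as np

def optimal_posterior_lin_chi2(emp_risks, prior, m, delta):
    """Closed-form candidate posterior of Theorem 11 (linear distance, chi^2 divergence).

    Returns None when condition (24) fails (1/(4*m*delta) <= var_P(l_hat)) or
    when some weight would be negative (condition (23)); a sketch only.
    """
    l_hat = np.asarray(emp_risks, dtype=float)
    p = np.asarray(prior, dtype=float)
    mean_p = float(np.dot(p, l_hat))                 # sum_i p_i * l_hat_i
    var_p = float(np.dot(p, (l_hat - mean_p) ** 2))  # variance of risks under the prior
    gap = 1.0 / (4.0 * m * delta) - var_p
    if gap <= 0:                                     # condition (24) violated
        return None
    q = p * (1.0 + (mean_p - l_hat) / np.sqrt(gap))
    if np.any(q < 0):                                # condition (23) violated
        return None
    return q                                         # sums to 1 by construction

Weights come out larger for classifiers whose empirical risk is below the prior-weighted mean, and the vector sums to one automatically because the deviations from the mean cancel under the prior.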
Remark 3. We suspect that this upper bound on δ (via condition (24)) is related to the sparseness of the optimal posterior that minimizes the bound $B_{\mathrm{lin},\chi^2}(Q) = \sum_i\hat l_i q_i + \sqrt{\frac{\sum_i q_i^2/p_i}{4m\delta}}$. A higher δ diminishes the effect of the divergence $\chi^2(Q\Vert P) = \sum_i q_i^2/p_i - 1$; hence it allows sparse solutions (where some components of the posterior $Q$ take value zero), which have a higher divergence from the prior than a non-sparse solution.

A.3 Optimal posterior, $Q^*_{\mathrm{lin},\chi^2}$, for uniform prior

Theorem 12 (Optimal posterior on an ordered subset support). When the prior is the uniform distribution on $\mathcal{H}$, among all the posteriors with support a subset of $\mathcal{H}$ of size exactly $H'$, the optimal/best posterior, denoted $Q^*_{\mathrm{lin},\chi^2}(H')$, has support on the ordered subset $\mathcal{H}'_{\mathrm{ord}} = \{\hat l_1 \le \hat l_2 \le \ldots \le \hat l_{H'}\}$ consisting of the smallest $H'$ values in $\mathcal{H}$. The optimal posterior weights are

$q^*_{i,\mathrm{lin},\chi^2}(H') = \frac{1}{H'}\left(1 + \frac{\frac{1}{H'}\sum_{j=1}^{H'}\hat l_j - \hat l_i}{\sqrt{\frac{H}{4H'm\delta} - \widehat{\mathrm{var}}_{H'}(\hat l)}}\right)$ for $i = 1,\ldots,H'$, and $q^*_{i,\mathrm{lin},\chi^2}(H') = 0$ for $i = H'+1,\ldots,H$,    (29)

where $\widehat{\mathrm{var}}_{H'}(\hat l) = \frac{1}{H'}\sum_{i=1}^{H'}\big(\frac{1}{H'}\sum_{j=1}^{H'}\hat l_j - \hat l_i\big)^2 = \frac{1}{H'}\sum_{i=1}^{H'}\hat l_i^2 - \big(\frac{1}{H'}\sum_{i=1}^{H'}\hat l_i\big)^2$ is the variance of the values in $\mathcal{H}'_{\mathrm{ord}}$. We assume that the subset size $H'$ is such that $\frac{H}{4H'm\delta} - \widehat{\mathrm{var}}_{H'}(\hat l) > 0$, so that $Q^*_{\mathrm{lin},\chi^2}(H')$ is defined, and for feasibility we require $q^*_{i,\mathrm{lin},\chi^2}(H') > 0$ for $i = 1,\ldots,H'$.

Proof. Under the uniform prior setup, that is, when $p_i = 1/H$ for all $i = 1,\ldots,H$, we obtain $Q^*_{\mathrm{lin},\chi^2}(H')$ using the partial KKT system (ignoring the positivity constraints) and the proof technique of Theorem 11 above, restricted to the support $\mathcal{H}'_{\mathrm{ord}}$.

Theorem 13. The bound value of the optimal posterior $Q^*_{\mathrm{lin},\chi^2}(H')$ on an ordered subset of size $H'$,

$B^*_{\mathrm{lin},\chi^2}(H') := \frac{1}{H'}\sum_{i=1}^{H'}\hat l_i + \sqrt{\frac{H}{4H'm\delta} - \widehat{\mathrm{var}}_{H'}(\hat l)},$    (30)

is a decreasing function of $H'$ for $H' \le H^*$. Here $H^*$ is the size of the ordered subset which forms the support of the globally optimal posterior $Q^*_{\mathrm{lin},\chi^2}$.

Proof. Consider an ordered subset of size $H' \in \{1,\ldots,H\}$ such that $\frac{H}{4H'm\delta} - \widehat{\mathrm{var}}_{H'}(\hat l) > 0$, so that the posterior $Q^*_{\mathrm{lin},\chi^2}(H')$ given in (29) is defined. Suppose further that all the elements of this posterior are positive, so that it is feasible and hence the optimal posterior on the considered ordered subset of size $H'$.
The bound value at this optimal posterior can be computed as

$B^*_{\mathrm{lin},\chi^2}(H') := B_{\mathrm{lin},\chi^2}\big(Q^*_{\mathrm{lin},\chi^2}(H')\big) = \sum_{i=1}^{H'}\hat l_i\, q^*_{i,\mathrm{lin},\chi^2}(H') + \sqrt{\frac{H}{4m\delta}\sum_{i=1}^{H'}\big(q^*_{i,\mathrm{lin},\chi^2}(H')\big)^2}.$    (31)

For the first term, substituting (29) and writing $c := \frac{H}{4H'm\delta} - \widehat{\mathrm{var}}_{H'}(\hat l)$,

$\sum_{i=1}^{H'}\hat l_i\, q^*_{i,\mathrm{lin},\chi^2}(H') = \frac{1}{H'}\sum_{i=1}^{H'}\hat l_i + \frac{1}{H'\sqrt{c}}\sum_{i=1}^{H'}\hat l_i\Big(\frac{1}{H'}\sum_{j=1}^{H'}\hat l_j - \hat l_i\Big) = \frac{1}{H'}\sum_{i=1}^{H'}\hat l_i - \frac{\widehat{\mathrm{var}}_{H'}(\hat l)}{\sqrt{c}}.$    (32)

For evaluating the second term, we simplify the χ²-divergence factor:

$\sum_{i=1}^{H'}\big(q^*_{i,\mathrm{lin},\chi^2}(H')\big)^2 = \frac{1}{(H')^2}\sum_{i=1}^{H'}\Big(1 + \frac{\frac{1}{H'}\sum_j\hat l_j - \hat l_i}{\sqrt{c}}\Big)^2 = \frac{1}{H'}\Big(1 + \frac{\widehat{\mathrm{var}}_{H'}(\hat l)}{c}\Big) = \frac{1}{H'}\cdot\frac{\frac{H}{4H'm\delta}}{c}.$    (33)

Substituting (32) and (33) into (31), the bound value becomes

$B^*_{\mathrm{lin},\chi^2}(H') = \frac{1}{H'}\sum_{i=1}^{H'}\hat l_i - \frac{\widehat{\mathrm{var}}_{H'}(\hat l)}{\sqrt{c}} + \frac{\frac{H}{4H'm\delta}}{\sqrt{c}} = \frac{1}{H'}\sum_{i=1}^{H'}\hat l_i + \sqrt{\frac{H}{4H'm\delta} - \widehat{\mathrm{var}}_{H'}(\hat l)}.$    (34)

This bound value $B^*_{\mathrm{lin},\chi^2}(H')$ is the sum of an increasing term, $\frac{1}{H'}\sum_{i=1}^{H'}\hat l_i$, and a decreasing term, $\sqrt{\frac{H}{4H'm\delta} - \widehat{\mathrm{var}}_{H'}(\hat l)}$; it can be shown to be a decreasing function of $H'$ for $H' \le H^*$, as illustrated in Figure 1.

[Figure 1: Nature of $B^*_{\mathrm{lin},\chi^2}(H')$ and its components as the subset size $H'$ varies.]

Algorithm 2: OptQ-lin-χ² For Uniform Prior: algorithm for finding the optimal posterior for the PAC-Bayesian bound with linear distance function and χ²-divergence when the prior is the uniform distribution.
Input: $m, \delta, H, \{\hat l_i\}_{i=1}^H$ (sorted so that $\hat l_1 \le \ldots \le \hat l_H$). Output: $Q^*_{\mathrm{lin},\chi^2}$.
1.  flag ← 0
2.  for H' = 2, ..., H do
3.    $\widehat{\mathrm{var}}_{H'}(\hat l) \leftarrow \frac{1}{H'}\sum_{i=1}^{H'}\big(\frac{1}{H'}\sum_{j=1}^{H'}\hat l_j - \hat l_i\big)^2$
4.    if $\frac{H}{4H'm\delta} - \widehat{\mathrm{var}}_{H'}(\hat l) \le 0$ then flag ← 1; break
5.    DeltaBnd ← $\sqrt{\frac{H}{4H'm\delta} - \widehat{\mathrm{var}}_{H'}(\hat l)}$
6.    for i = 1, ..., H' do
7.      $q^*_{i,\mathrm{lin},\chi^2} \leftarrow \frac{1}{H'}\Big(1 + \frac{\frac{1}{H'}\sum_{j=1}^{H'}\hat l_j - \hat l_i}{\text{DeltaBnd}}\Big)$
8.      if $q^*_{i,\mathrm{lin},\chi^2} < 0$ then flag ← 1; break
9.    end
10.   if flag = 1 then break
11. end
12. if flag = 1 then
13.   $H^* \leftarrow H' - 1$; recompute DeltaBnd and $q^*_{i,\mathrm{lin},\chi^2}$ for $i = 1,\ldots,H^*$ with $H^*$ in place of $H'$; set $q^*_{i,\mathrm{lin},\chi^2} \leftarrow 0$ for $i = H^*+1,\ldots,H$
14. else
15.   $H^* \leftarrow H'$
16. end
17. return $Q^*_{\mathrm{lin},\chi^2} \leftarrow (q^*_{1,\mathrm{lin},\chi^2},\ldots,q^*_{H,\mathrm{lin},\chi^2})$
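The linear search above translates almost line for line into code. The sketch below is our own Python rendering of Algorithm 2 (names are ours; it assumes the risks may be passed unsorted and sorts them first).

import numpy as np

def opt_q_lin_chi2_uniform(emp_risks, m, delta):
    """Ordered-subset search of Algorithm 2 (uniform prior, linear distance, chi^2).

    Returns (Q_star, H_star); Q_star is indexed by increasing empirical risk.
    """
    l_hat = np.sort(np.asarray(emp_risks, dtype=float))
    H = len(l_hat)
    best_q, H_star = None, None
    for H_prime in range(2, H + 1):
        head = l_hat[:H_prime]
        radicand = H / (4.0 * H_prime * m * delta) - head.var()   # H/(4H'm*delta) - var_{H'}
        if radicand <= 0:                                         # bound (30) undefined: stop
            break
        q_head = (1.0 + (head.mean() - head) / np.sqrt(radicand)) / H_prime
        if np.any(q_head < 0):                                    # infeasible posterior: stop
            break
        best_q, H_star = q_head, H_prime                          # keep the largest feasible H'
    Q_star = np.zeros(H)
    if best_q is not None:
        Q_star[:H_star] = best_q
    return Q_star, H_star

Because Theorem 13 makes the per-subset bound decreasing up to H*, keeping the largest feasible H' is equivalent to stopping at the first infeasible subset, which is what the pseudocode does.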
A.3.1 Correctness of Algorithm OptQ-lin-χ² For Uniform Prior

We want to determine the globally optimal posterior $Q^*_{\mathrm{lin},\chi^2}$ that has the minimum bound value $B_{\mathrm{lin},\chi^2}(Q)$ over the $H$-dimensional probability simplex $\Delta_H$. Using the result of Theorem 1 in the main paper, we can confine the search to a much smaller space of posteriors with support on a family of increasing ordered subsets of $\mathcal{H}$. These ordered subsets are defined by their size; for example, an ordered subset of size $H' \in \{1,\ldots,H\}$ comprises the smallest $H'$ values in the set $\{\hat l_i\}_{i=1}^H$. Thus the restricted space of posteriors, say $\Delta_{\mathrm{ord}} \subset \Delta_H$, is a union of convex sets of posteriors with supports on the ordered subsets defined above. Due to the increasing subset relation between consecutive supports, this union is itself a convex set. Therefore, as a consequence of Theorem 10, the bound function $B_{\mathrm{lin},\chi^2}(Q)$ is convex on the set of posteriors $\Delta_{\mathrm{ord}}$ as well, which contains the global minimum. The search space $\Delta_{\mathrm{ord}}$ is a restriction of the simplex $\Delta_H$, yet it consists of uncountably many posteriors on the ordered subsets. We refine the search further by localizing to the optimal posteriors on each of the increasing ordered subsets and comparing their bound values to find the minimum. As identified by (30), these bound values $B^*_{\mathrm{lin},\chi^2}$ are functions of the subset size $H'$. $B^*_{\mathrm{lin},\chi^2}(H')$ is defined only for those values of $H'$ where $\frac{H}{4H'm\delta} - \widehat{\mathrm{var}}_{H'}(\hat l) > 0$; we ignore the $H'$ values where this condition is not met (Line 4 in Algorithm 2). Further, we also need to verify that, for a given $H'$, the optimal posterior $Q^*_{\mathrm{lin},\chi^2}(H')$ in (29) satisfies the positivity constraints, as done in Line 8 of the algorithm. Hence an exponential search over the restricted posterior space is simplified to a finite linear search over the support size. We denote the support size of $Q^*_{\mathrm{lin},\chi^2}$ by $H^* \in [H]$. Therefore, for finding the optimal posterior $Q^*_{\mathrm{lin},\chi^2}$ in the restricted posterior space $\Delta_{\mathrm{ord}}$, it is sufficient to search for $H^*$ in the set $\{1,\ldots,H\}$ of support sizes.

Warm start for searching the optimal support size, $H^*$. We can shorten the sequential search for the optimal support size $H^*$ over $\{1,\ldots,H\}$ by using a warm start value for $H'$. From this value we can reach $H^*$ by moving in a direction of decrease of the bound value $B^*_{\mathrm{lin},\chi^2}(H')$, as long as the corresponding posterior is defined and non-negative. This bound value is the sum of an increasing term, $\frac{1}{H'}\sum_{i=1}^{H'}\hat l_i$, and a decreasing term, $\sqrt{\frac{H}{4H'm\delta} - \widehat{\mathrm{var}}_{H'}(\hat l)}$. We expect $H^*$ to lie in the neighbourhood of the point of intersection of these two components of $B^*_{\mathrm{lin},\chi^2}(H')$, which can be obtained by equating the two terms and solving for $H'$:

$\frac{1}{H'}\sum_{i=1}^{H'}\hat l_i = \sqrt{\frac{H}{4H'm\delta} - \widehat{\mathrm{var}}_{H'}(\hat l)}.$    (35)

The value of $H'$ which satisfies (35) can be used as the warm start point of Algorithm 2. If this $H'$ satisfies the feasibility conditions for $Q^*_{\mathrm{lin},\chi^2}(H')$, we compare its bound value $B^*_{\mathrm{lin},\chi^2}(H')$ with those of its neighbours, $B^*_{\mathrm{lin},\chi^2}(H'-1)$ and $B^*_{\mathrm{lin},\chi^2}(H'+1)$, to identify a descent direction, and follow it until feasibility is violated.
Otherwise, if $B^*_{\mathrm{lin},\chi^2}(H')$ is infeasible or undefined at the warm start point, we keep decrementing $H'$ till we reach a feasible $Q^*_{\mathrm{lin},\chi^2}(H')$.

B Optimal PAC-Bayesian Posterior using Squared Distance Function
The distance function of our interest here is the squared distance function, $\phi_{\mathrm{sq}}(\hat l, l) = (\hat l - l)^2$ for $\hat l, l \in [0,1]$. The PAC-Bayesian bound for the squared distance function with chi-squared divergence can be stated as

$P_S\Big\{ \big(\mathbb{E}_Q[\hat l] - \mathbb{E}_Q[l]\big)^2 \le \sqrt{\big[\chi^2(Q\Vert P)+1\big]\,\frac{I^{sq}_{R}(m,2)}{\delta}} \Big\} \ge 1-\delta.$    (36)

The above statement gives the following probabilistic upper bound on the true risk of a stochastic classifier governed by a distribution $Q$ on $\mathcal{H}$:

$B_{\mathrm{sq},\chi^2}(Q) := \mathbb{E}_Q[\hat l] + \Big(\big[\chi^2(Q\Vert P)+1\big]\,\frac{I^{sq}_{R}(m,2)}{\delta}\Big)^{1/4}.$    (37)

We first need to identify the constant $I^{sq}_{R}(m,2)$ for a given sample size $m$.

Lemma 3. For a given sample size $m$, $l^* = 0.5$ is the maximizer over $l \in [0,1]$ of $I^{sq}_{R}(m,2,l) := \sum_{k=0}^{m}\binom{m}{k} l^k(1-l)^{m-k}\big(\tfrac{k}{m}-l\big)^4$, and hence $I^{sq}_{R}(m,2) := \sup_{l\in[0,1]} I^{sq}_{R}(m,2,l) = \frac{12m-8}{64m^3}$.

Proof. We have

$I^{sq}_{R}(m,2) = \sup_{l\in[0,1]} \sum_{k=0}^{m}\binom{m}{k} l^k(1-l)^{m-k}\Big(\frac{k}{m}-l\Big)^4 = \sup_{l\in[0,1]} \frac{1}{m^4}\sum_{k=0}^{m}\binom{m}{k} l^k(1-l)^{m-k}(k-ml)^4.$

The quantity to be maximized is the fourth central moment of a Binomial$(m,l)$ distribution, which can be expressed in terms of its lower central moments:

$\frac{1}{m^4}\Big\{ m\big[l(1-l)^4 + l^4(1-l)\big] + 3m(m-1)\,l^2(1-l)^2 \Big\} = \frac{1}{m^3}\, l(1-l)\big[(1-l)^3 + l^3 + 3(m-1)\,l(1-l)\big] = \frac{1}{m^3}\, u\,[1 + 3(m-2)\,u],$

where $u := l(1-l)$. $I^{sq}_{R}(m,2,l)$ is a smooth function of $l \in [0,1]$; written in terms of $u$ it is increasing in $u$ for $m \ge 2$, and $u$ is maximized at $l = 1/2$, so the unique maximum is attained at $l^* = 1/2$ for all $m \ge 2$ (see Figure 2). Substituting $u = 1/4$,

$I^{sq}_{R}(m,2) = \frac{1}{m^3}\cdot\frac{1}{4}\Big(1 + \frac{3(m-2)}{4}\Big) = \frac{3m-2}{16m^3} = \frac{12m-8}{64m^3}.$

[Figure 2: Plot of $I^{sq}_{R}(m,2,l)$ for different sample sizes $m \in \{50, 100, 200, 500, 1000, 1020, 1028\}$. The function is symmetric about $l = 0.5$, its maximizer.]

Substituting this constant in (36), the PAC-Bayesian statement becomes

$P_S\Big\{ \big(\mathbb{E}_Q[\hat l] - \mathbb{E}_Q[l]\big)^2 \le \sqrt{\big[\chi^2(Q\Vert P)+1\big]\,\frac{12m-8}{64m^3\delta}} \Big\} \ge 1-\delta.$    (38)
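The closed-form constant above can be sanity-checked numerically; the short Python sketch below (our own helper, not part of the original development) evaluates the binomial expectation on a grid of $l$ values.

import numpy as np
from scipy.stats import binom

def I_sq(m, num_grid=2001):
    """Numerically evaluate sup_l E[(K/m - l)^4], K ~ Binomial(m, l), as in Lemma 3."""
    ls = np.linspace(0.0, 1.0, num_grid)
    k = np.arange(m + 1)
    vals = [np.sum(binom.pmf(k, m, l) * (k / m - l) ** 4) for l in ls]
    idx = int(np.argmax(vals))
    return vals[idx], ls[idx]

# For m = 100 the grid search returns approximately (12*100 - 8)/(64*100**3) = 1.8625e-05,
# attained at l = 0.5, in agreement with the closed form of Lemma 3.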
Theorem 14. For a finite set of classifiers $\mathcal{H}$, the PAC-Bayesian upper bound on the averaged true risk based on the squared distance function, when chi-squared divergence is used as the measure of divergence between the prior and the posterior, is given by

$B_{\mathrm{sq},\chi^2}(Q) = \sum_{i=1}^{H}\hat l_i q_i + \Big[\Big(\sum_{i=1}^{H}\frac{q_i^2}{p_i}\Big)\frac{12m-8}{64m^3\delta}\Big]^{1/4}.$    (39)

Proof. Using the PAC-Bayesian statement in (38) for the case of a finite classifier set, and noting that $\chi^2(Q\Vert P)+1 = \sum_i q_i^2/p_i$, we obtain the above form of $B_{\mathrm{sq},\chi^2}(Q)$, where the constant $\frac{12m-8}{64m^3}$ has been identified via Lemma 3.

B.1 The bound minimization problem
We want to determine the optimal posterior $Q^*_{\mathrm{sq},\chi^2}$ which minimizes the upper bound $B_{\mathrm{sq},\chi^2}(Q)$. When the classifier space $\mathcal{H}$ is a finite set, say $\mathcal{H} = \{h_i\}_{i=1}^H$, this optimization problem can be described as:

$\min_{q_1,\ldots,q_H} \; \sum_{i=1}^{H}\hat l_i q_i + \Big[\Big(\sum_{i=1}^{H}\frac{q_i^2}{p_i}\Big)\frac{12m-8}{64m^3\delta}\Big]^{1/4}$ subject to $\sum_{i=1}^{H} q_i = 1$, $q_i \ge 0$ for all $i = 1,\ldots,H$.    (40)

B.2 Non-convexity of the bound function
If the bound function $B_{\mathrm{sq},\chi^2}(Q)$ turned out to be convex, it would have a unique minimizer which could easily be obtained using the KKT conditions. We therefore investigate whether this bound function is convex in $Q$ using the first order condition for convexity.

Theorem 15. The bound function $B_{\mathrm{sq},\chi^2}(Q) = \sum_{i}\hat l_i q_i + \big[\big(\sum_{i}q_i^2/p_i\big)\frac{12m-8}{64m^3\delta}\big]^{1/4}$ is non-convex.

Proof. We use the first order condition to verify convexity of the bound function: we need to check whether $B_{\mathrm{sq},\chi^2}(Q') \ge B_{\mathrm{sq},\chi^2}(Q) + \langle\nabla B_{\mathrm{sq},\chi^2}(Q), Q'-Q\rangle$ holds for any pair of distributions $Q$ and $Q'$ on the classifier space $\mathcal{H}$. Writing $S(Q) := \sum_i q_i^2/p_i$, the linear terms cancel, the constant factor drops out, and the condition reduces to

$S(Q)^{3/4}\,S(Q')^{1/4} \;\ge\; \frac{1}{2}\Big(\sum_i\frac{q_i q_i'}{p_i} + \sum_i\frac{q_i^2}{p_i}\Big).$    (41)

Notice that this inequality does not depend on the $\hat l_i$ values. We have counterexamples which violate this convexity condition: for $H = 10$ and $P$ the uniform distribution on $\mathcal{H}$, a degenerate $Q$ with $q_1 = 1$ together with a suitable non-degenerate $Q'$ gives LHS = 6.087086 and RHS = 6.172964, violating (41). Thus we can claim that $B_{\mathrm{sq},\chi^2}$ is a non-convex function.
Remark 4. Computationally, this bound minimization problem is observed to have a single local minimum. The quasi-convexity of this bound function holds under a condition identified in Proposition 2.
We are interested in checking whether $B_{\mathrm{sq},\chi^2}(Q)$ is strictly quasi-convex; if so, we can claim that a local optimal solution is also a global optimal solution [2].

Definition 1 ([2]). Let $f : E \to \mathbb{R}$, where $E$ is a non-empty convex set in $\mathbb{R}^n$. The function $f$ is strictly quasi-convex if, for each $x_1, x_2 \in E$ with $f(x_1) \ne f(x_2)$, we have

$f(\alpha x_1 + (1-\alpha)x_2) < \max\big(f(x_1), f(x_2)\big) \quad \forall\,\alpha\in(0,1).$    (42)

Theorem 16 ([2]). Let $f : E \to \mathbb{R}$ be strictly quasi-convex, and consider the problem of minimizing $f(x)$ subject to $x \in E$, where $E$ is a non-empty convex set in $\mathbb{R}^n$. If $\bar x$ is a local optimal solution, then $\bar x$ is also a global optimal solution.
Proposition 2. The bound function $B_{\mathrm{sq},\chi^2}(Q)$ is strictly quasi-convex if the following condition holds for any $Q, Q'$ and each $\alpha\in(0,1)$:

$\Big(\frac{12m-8}{64m^3\delta}\Big)^{1/4}\Big[\Big(\sum_i\frac{(\alpha q_i+(1-\alpha)q_i')^2}{p_i}\Big)^{1/4} - \Big(\sum_i\frac{q_i^2}{p_i}\Big)^{1/4}\Big] \;<\; (1-\alpha)\big(\mathbb{E}_Q[\hat l] - \mathbb{E}_{Q'}[\hat l]\big),$

and hence a local minimum of the bound minimization problem (40) is also a global minimum.

Proof. $B_{\mathrm{sq},\chi^2}(Q)$ is defined on the simplex $\Delta_H$, a non-empty convex set in $\mathbb{R}^H$. For strict quasi-convexity we must show that, for each $Q \ne Q' \in \Delta_H$ with $B_{\mathrm{sq},\chi^2}(Q) \ne B_{\mathrm{sq},\chi^2}(Q')$,

$B_{\mathrm{sq},\chi^2}(\alpha Q + (1-\alpha)Q') < \max\big(B_{\mathrm{sq},\chi^2}(Q), B_{\mathrm{sq},\chi^2}(Q')\big) \quad \forall\,\alpha\in(0,1).$

Assume, without loss of generality, that $B_{\mathrm{sq},\chi^2}(Q) > B_{\mathrm{sq},\chi^2}(Q')$; we must show $B_{\mathrm{sq},\chi^2}(\alpha Q+(1-\alpha)Q') < B_{\mathrm{sq},\chi^2}(Q)$. Write $S(Q) := \sum_i q_i^2/p_i$ and $c := \big(\frac{12m-8}{64m^3\delta}\big)^{1/4}$, so that $B_{\mathrm{sq},\chi^2}(Q) = \mathbb{E}_Q[\hat l] + c\,S(Q)^{1/4}$. We consider four cases.

Case I: $\mathbb{E}_Q[\hat l] = \mathbb{E}_{Q'}[\hat l]$ and $S(Q) = S(Q')$. Then $\mathbb{E}_{\alpha Q+(1-\alpha)Q'}[\hat l] = \mathbb{E}_Q[\hat l]$, so it suffices to show that for any $Q \ne Q'$ and each $\alpha\in(0,1)$,

$S(\alpha Q + (1-\alpha)Q') < S(Q).$    (43)

Expanding the left hand side,

$S(\alpha Q+(1-\alpha)Q') = \alpha^2 S(Q) + (1-\alpha)^2 S(Q') + 2\alpha(1-\alpha)\sum_i\frac{q_i q_i'}{p_i} < S(Q) + (1-\alpha)^2\big[S(Q') - S(Q)\big],$    (44)

where the strict inequality uses $\sum_i q_i q_i'/p_i < S(Q)$, which holds by the Cauchy–Schwarz inequality for $Q \ne Q'$ whenever $S(Q') \le S(Q)$. Since $S(Q') = S(Q)$ here, the right hand side equals $S(Q)$; since $x \mapsto x^{1/4}$ is increasing, strict quasi-convexity follows in this case.

Case II: $S(Q) = S(Q')$ and $\mathbb{E}_Q[\hat l] > \mathbb{E}_{Q'}[\hat l]$. Then $\mathbb{E}_{\alpha Q+(1-\alpha)Q'}[\hat l] < \mathbb{E}_Q[\hat l]$, and by the argument above $S(\alpha Q+(1-\alpha)Q') < S(Q)$ since the second term on the right hand side of (44) is zero. Hence $B_{\mathrm{sq},\chi^2}(\alpha Q+(1-\alpha)Q') < B_{\mathrm{sq},\chi^2}(Q) = \max\big(B_{\mathrm{sq},\chi^2}(Q), B_{\mathrm{sq},\chi^2}(Q')\big)$, so $B_{\mathrm{sq},\chi^2}$ is strictly quasi-convex in this case too.

Case III: $\mathbb{E}_Q[\hat l] > \mathbb{E}_{Q'}[\hat l]$ and $S(Q) > S(Q')$. As before, $\mathbb{E}_{\alpha Q+(1-\alpha)Q'}[\hat l] < \mathbb{E}_Q[\hat l]$ and $S(\alpha Q+(1-\alpha)Q') < S(Q) + (1-\alpha)^2[S(Q')-S(Q)] < S(Q)$, the last inequality because $S(Q') - S(Q) < 0$. Hence $B_{\mathrm{sq},\chi^2}(\alpha Q+(1-\alpha)Q') < B_{\mathrm{sq},\chi^2}(Q) = \max\{B_{\mathrm{sq},\chi^2}(Q), B_{\mathrm{sq},\chi^2}(Q')\}$, and $B_{\mathrm{sq},\chi^2}$ is strictly quasi-convex in this case as well.

The condition for quasi-convexity holds easily in all the above three cases; the next case requires the added assumption.

Case IV: $\mathbb{E}_Q[\hat l] > \mathbb{E}_{Q'}[\hat l]$ and $S(Q) < S(Q')$, with $B_{\mathrm{sq},\chi^2}(Q) > B_{\mathrm{sq},\chi^2}(Q')$, i.e. $\mathbb{E}_Q[\hat l] - \mathbb{E}_{Q'}[\hat l] > c\big[S(Q')^{1/4} - S(Q)^{1/4}\big]$. We have to show that $B_{\mathrm{sq},\chi^2}(\alpha Q+(1-\alpha)Q') < B_{\mathrm{sq},\chi^2}(Q)$, which is equivalent to showing that for any such $Q, Q'$ and each $\alpha\in(0,1)$,

$c\big[S(\alpha Q+(1-\alpha)Q')^{1/4} - S(Q)^{1/4}\big] < \mathbb{E}_Q[\hat l] - \big(\alpha\mathbb{E}_Q[\hat l] + (1-\alpha)\mathbb{E}_{Q'}[\hat l]\big) = (1-\alpha)\big(\mathbb{E}_Q[\hat l] - \mathbb{E}_{Q'}[\hat l]\big).$

The above holds by the assumption in the statement of the proposition.

Thus, under the given condition, $B_{\mathrm{sq},\chi^2}$ is strictly quasi-convex and, by Theorem 16, admits a global minimum which can be identified from the KKT conditions.
Remark 5. The condition that, for any $Q, Q'$ and each $\alpha\in(0,1)$,

$\Big(\frac{12m-8}{64m^3\delta}\Big)^{1/4}\Big[\Big(\sum_i\frac{(\alpha q_i+(1-\alpha)q_i')^2}{p_i}\Big)^{1/4} - \Big(\sum_i\frac{q_i^2}{p_i}\Big)^{1/4}\Big] < (1-\alpha)\big(\mathbb{E}_Q[\hat l] - \mathbb{E}_{Q'}[\hat l]\big)$

is required to complete the proof of quasi-convexity of $B_{\mathrm{sq},\chi^2}(Q)$ for the case when $\mathbb{E}_Q[\hat l] > \mathbb{E}_{Q'}[\hat l]$ and $\sum_i q_i^2/p_i < \sum_i q_i'^2/p_i$. We have not been able to verify that this condition always holds for an arbitrary pair $(Q, Q')$; the other cases are easy to prove.

B.3 The posterior based on fixed point scheme, $Q^*_{\mathrm{sq},\chi^2}$

The Lagrangian function corresponding to the optimization problem (40), writing $C := \frac{12m-8}{64m^3\delta}$, is

$\mathcal{L}_{\mathrm{sq},\chi^2}(Q,\mu) := \sum_{i}\hat l_i q_i + \Big[C\sum_{i}\frac{q_i^2}{p_i}\Big]^{1/4} - \mu\Big(\sum_{i} q_i - 1\Big).$    (45)

At optimality, the posterior $Q$ sets the derivatives of this Lagrangian to zero. Setting $\partial\mathcal{L}_{\mathrm{sq},\chi^2}/\partial q_i = 0$ for all $i$:

$\hat l_i + \frac{C^{1/4}}{2}\Big(\sum_j\frac{q_j^2}{p_j}\Big)^{-3/4}\frac{q_i}{p_i} - \mu = 0 \;\Longrightarrow\; \frac{q_i}{\big(\sum_j q_j^2/p_j\big)^{3/4}} = \frac{2\,p_i(\mu-\hat l_i)}{C^{1/4}}, \quad i = 1,\ldots,H.$    (46)

Setting $\partial\mathcal{L}_{\mathrm{sq},\chi^2}/\partial\mu = 0$, i.e. $\sum_i q_i = 1$, and summing (46) over $i$ gives

$\frac{2}{C^{1/4}}\Big(\sum_j\frac{q_j^2}{p_j}\Big)^{3/4}\Big(\mu - \sum_j\hat l_j p_j\Big) = 1,$    (47)

$\Longrightarrow\; \mu = \sum_j\hat l_j p_j + \frac{C^{1/4}}{2\big(\sum_j q_j^2/p_j\big)^{3/4}}.$    (48)

Combining (46) and (48), we get the following fixed point equation in the $q_i$'s:

$q_i = p_i\Big[\frac{2\big(\sum_j q_j^2/p_j\big)^{3/4}}{C^{1/4}}\Big(\sum_j\hat l_j p_j - \hat l_i\Big) + 1\Big], \quad i = 1,\ldots,H.$    (49)

Theorem 17 (Optimal posterior on an ordered subset support). When the prior is the uniform distribution on $\mathcal{H}$, among all the posteriors with support a subset of size exactly $H'$, the best posterior, denoted $Q^*_{\mathrm{sq},\chi^2}(H')$, has support on the ordered subset $\mathcal{H}'_{\mathrm{ord}} = \{\hat l_1 \le \hat l_2 \le \ldots \le \hat l_{H'}\}$ consisting of the smallest $H'$ values in $\mathcal{H}$. The optimal posterior weights are determined as the solution of the following fixed point equation:

$q^{FP}_{i,\mathrm{sq},\chi^2}(H') = \frac{1}{H'} + \frac{2\,S^{3/4}}{H\,C^{1/4}}\Big(\frac{1}{H'}\sum_{j=1}^{H'}\hat l_j - \hat l_i\Big)$ for $i = 1,\ldots,H'$, and $q^{FP}_{i,\mathrm{sq},\chi^2}(H') = 0$ for $i = H'+1,\ldots,H$,    (50)

where $S := H\sum_{j=1}^{H'}\big(q^{FP}_{j,\mathrm{sq},\chi^2}(H')\big)^2$ and $C := \frac{12m-8}{64m^3\delta}$, under the assumption that, for the given $H'$, (50) converges to a fixed point solution, and for feasibility we require $q^{FP}_{i,\mathrm{sq},\chi^2}(H') > 0$ for $i = 1,\ldots,H'$.
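The fixed point map (50) is straightforward to iterate. The Python sketch below is our own rendering under our reading of (50) (function and variable names are ours); it starts from the uniform weights on the ordered subset and stops when the update stabilizes.

import numpy as np

def fp_posterior_sq_chi2_uniform(emp_risks_subset, H, m, delta, tol=1e-12, max_iter=10000):
    """Fixed point iteration (50): squared distance, chi^2 divergence, uniform prior.

    emp_risks_subset holds the H' smallest empirical risks; H is the total classifier count.
    Returns the weight vector on the subset, or None if it turns infeasible.
    """
    l_hat = np.asarray(emp_risks_subset, dtype=float)
    H_prime = len(l_hat)
    C = (12.0 * m - 8.0) / (64.0 * m ** 3 * delta)    # I_sq(m,2)/delta from Lemma 3
    q = np.full(H_prime, 1.0 / H_prime)               # initial guess: uniform on the subset
    for _ in range(max_iter):
        S = H * float(np.sum(q ** 2))                 # chi^2(Q||P) + 1 under the uniform prior
        q_new = 1.0 / H_prime + (2.0 * S ** 0.75 / (H * C ** 0.25)) * (l_hat.mean() - l_hat)
        if np.max(np.abs(q_new - q)) < tol:
            q = q_new
            break
        q = q_new
    return q if np.all(q >= 0) else None

Each iterate sums to one by construction (the deviations from the subset mean cancel), so only positivity needs to be checked at the end.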
C Optimal PAC-Bayesian Posterior using KL-distance Function

The PAC-Bayesian bound using the distance function $kl(\hat l, l) = \hat l\ln\big(\frac{\hat l}{l}\big) + (1-\hat l)\ln\big(\frac{1-\hat l}{1-l}\big)$ (for any $\hat l, l \in (0,1)$) is obtained as:

$P_S\Big\{ \forall\,Q \text{ on } \mathcal{H}:\; kl\big(\mathbb{E}_Q[\hat l], \mathbb{E}_Q[l]\big) \le \sqrt{\frac{\big(\chi^2(Q\Vert P)+1\big)\,I^{kl}_{R}(m,2)}{\delta}} \Big\} \ge 1-\delta.$    (51)

The upper bound on the averaged true risk $\mathbb{E}_Q[l]$ corresponding to the above PAC-Bayesian theorem is obtained as:

$B_{kl,\chi^2}(Q) = \sup_{r\in(0,1)}\Big\{ r : kl\big(\mathbb{E}_Q[\hat l], r\big) \le \sqrt{\frac{\big(\chi^2(Q\Vert P)+1\big)\,I^{kl}_{R}(m,2)}{\delta}} \Big\}.$    (52)

An inverse $kl(\cdot,\cdot)$ function does not exist since $kl$ is not a monotone function, and so the bound $B_{kl,\chi^2}(Q)$ does not have an explicit form. However, we can employ a numerical root finding algorithm such as that described in [20] (Algorithm KLroots) to obtain $B_{kl,\chi^2}(Q)$ for a given instance of the system parameters.

We first need to compute the constant $I^{kl}_{R}(m,2) := \sup_{l\in[0,1]}\sum_{k=0}^{m}\binom{m}{k} l^k(1-l)^{m-k}\, kl\big(\tfrac{k}{m}, l\big)$ in order to determine the bound value. For $m > 1028$, the computation is difficult due to storage limitations in the range of floating point numbers – it returns $I^{kl}_{R}(m)$ as NaN. We notice that $I^{kl}_{R}(m)$ decreases with $m$, and hence we can use $I^{kl}_{R}(1028)$ as an upper approximation of $I^{kl}_{R}(m)$ for $m > 1028$.

[Figure 3: Plot of the function $I^{kl}_{R}(m,l) = \sum_{k=0}^{m}\binom{m}{k} l^k(1-l)^{m-k}\, kl(\tfrac{k}{m}, l)$ as a function of the true risk $l\in[0,1]$ for different values of the sample size $m$, represented by different curves. The function $I^{kl}_{R}(m,l)$ is bimodal and symmetric about $l=0.5$. We are interested in the quantity $I^{kl}_{R}(m) = \sup_{l\in[0,1]} I^{kl}_{R}(m,l)$ as a function of $m$, which we identify graphically (and mark by a • on each curve).]

Table 5: Values of $I^{kl}_{R}(m) = \sup_{l\in[0,1]}\sum_{k=0}^{m}\binom{m}{k} l^k(1-l)^{m-k}\, kl(\tfrac{k}{m}, l)$ for different sample sizes $m$. We notice that $I^{kl}_{R}(m)$ decreases as $m$ increases. For a given $m$, $l^*(m)$ denotes the value of $l\in[0,1]$ at which the supremum is attained; we observe that $l^*(m) \to 1$ as $m$ grows beyond 1000.

Sample size, m | l*(m)  | I_kl(m)
50             | 0.98   | 0.0074799
100            | 0.99   | 0.0037092
200            | 0.995  | 0.0018470
500            | 0.998  | 0.0007369
1000           | 0.999  | 0.0003682
1020           | 0.999  | 0.0003609
1028           | 0.999  | 0.0003580
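The entries of Table 5 can be reproduced with a direct grid search over $l$; the Python sketch below is our own helper (unoptimized, names are ours) for that computation.

import numpy as np
from scipy.stats import binom

def kl_bernoulli(p, q):
    """Vector of kl(p, q) values for Bernoulli parameters, with the 0*log(0) = 0 convention."""
    p = np.asarray(p, dtype=float)
    out = np.zeros_like(p)
    pos = p > 0
    out[pos] += p[pos] * np.log(p[pos] / q)
    lt1 = p < 1
    out[lt1] += (1 - p[lt1]) * np.log((1 - p[lt1]) / (1 - q))
    return out

def I_kl(m, num_grid=999):
    """Grid approximation of I_kl(m) = sup_l E[kl(K/m, l)], K ~ Binomial(m, l).

    The grid stays inside (0, 1), where kl(., l) is finite.
    """
    ls = np.linspace(0.001, 0.999, num_grid)
    k = np.arange(m + 1)
    vals = [float(np.sum(binom.pmf(k, m, l) * kl_bernoulli(k / m, l))) for l in ls]
    idx = int(np.argmax(vals))
    return vals[idx], ls[idx]

# For m = 50 this grid search should return a value close to 0.0075, attained near one of the
# two symmetric maximizers (l about 0.98 or 0.02), in line with the first row of Table 5.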
C.1 The KL-distance bound minimization problem
For a finite classifier space $\mathcal{H} = \{h_i\}_{i=1}^H$, this optimization problem can be described as:

$\min_{q_1,\ldots,q_H,\, r} \; r$    (53a)

subject to $\Big(\sum_{i}\hat l_i q_i\Big)\ln\frac{\sum_{i}\hat l_i q_i}{r} + \Big(1-\sum_{i}\hat l_i q_i\Big)\ln\frac{1-\sum_{i}\hat l_i q_i}{1-r} = \sqrt{\frac{\big(\sum_{i} q_i^2/p_i\big)\,I^{kl}_{R}(m,2)}{\delta}}$    (53b)

$r \ge \sum_{i}\hat l_i q_i$    (53c)

$\sum_{i} q_i = 1$    (53d)

$q_i \ge 0 \quad \forall\, i = 1,\ldots,H.$    (53e)

Here, $r$ is the right root of $kl\big(\mathbb{E}_Q[\hat l], r\big) = \sqrt{\frac{(\chi^2(Q\Vert P)+1)\,I^{kl}_{R}(m,2)}{\delta}}$ for a given $\mathbb{E}_Q[\hat l]$. The above is a non-convex problem with a difference of convex (DC) equality constraint (53b). The constraint (53c) is a strict inequality which has been relaxed for modelling purposes.

C.2 The posterior based on fixed point scheme, $Q^{FP}_{kl,\chi^2}$

We derive the FP equation for the KL-distance based bound optimization problem below:
Theorem 18 (Optimal posterior on an ordered subset support). Among all the posteriors with support a subset of size exactly $H'$, a stationary point $Q^{FP}_{kl,\chi^2}(H')$ can be obtained as the solution of the following fixed point equation:

$q_i = \frac{1}{Z_{kl,\chi^2}}\, p_i\left[\sum_{j=1}^{H'}\frac{q_j^2}{p_j} + \frac{\hat l_i - \sum_{j=1}^{H'}\hat l_j q_j}{\sqrt{\frac{I^{kl}_{R}(m,2)}{\delta\,\sum_{j=1}^{H'} q_j^2/p_j}}}\,\ln\!\left(\frac{(1-r)\sum_{j=1}^{H'}\hat l_j q_j}{r\big(1-\sum_{j=1}^{H'}\hat l_j q_j\big)}\right)\right],$    (54)

where $Z_{kl,\chi^2}$ is a suitable normalization constant and $r$ is the solution of (53b) and (53c) for the given $Q = (q_1,\ldots,q_H)$.

Proof. The Lagrangian function for (53) can be written as

$\mathcal{L}_{kl,\chi^2} = r - \beta_1\Big[kl\Big(\sum_{j\le H'}\hat l_j q_j,\, r\Big) - \sqrt{\tfrac{(\sum_{j\le H'} q_j^2/p_j)\,I^{kl}_{R}(m,2)}{\delta}}\Big] - \beta_2\Big(r - \sum_{j\le H'}\hat l_j q_j\Big) - \mu\Big(\sum_{j\le H'} q_j - 1\Big) - \sum_{j\le H'}\mu_j q_j.$    (55)

Due to the strict inequality constraint (53c), the complementary slackness conditions for a stationary point imply that the Lagrange multiplier $\beta_2$ vanishes at optimality. Write $\hat L := \sum_{j\le H'}\hat l_j q_j$, $S := \sum_{j\le H'} q_j^2/p_j$ and $\Lambda := \ln\frac{\hat L}{r} - \ln\frac{1-\hat L}{1-r}$. Differentiating $\mathcal{L}_{kl,\chi^2}$ with respect to the primal variables $r$ and $q_i$, and with respect to the dual variable $\mu$, we get:

$\frac{\partial\mathcal{L}_{kl,\chi^2}}{\partial r} = 1 - \beta_1\Big[-\frac{\hat L}{r} + \frac{1-\hat L}{1-r}\Big],$    (56)

$\frac{\partial\mathcal{L}_{kl,\chi^2}}{\partial q_i} = -\beta_1\Big[\hat l_i\,\Lambda - \sqrt{\tfrac{I^{kl}_{R}(m,2)}{\delta S}}\,\frac{q_i}{p_i}\Big] - \mu - \mu_i, \quad i = 1,\ldots,H,$    (57)

$\frac{\partial\mathcal{L}_{kl,\chi^2}}{\partial\mu} = \sum_{j\le H'} q_j - 1.$    (58)

At an optimal solution, these derivatives are set to zero. Setting (56) to zero gives

$\beta_1 = \frac{r(1-r)}{r - \hat L} > 0,$    (59)

which is strictly positive since $r\in(0,1)$ and $r > \hat L$ by (53c); hence $\beta_1$ is a feasible value of the Lagrange multiplier. Next, multiplying (57) by $q_i$, using the complementary slackness condition $\mu_i q_i = 0$ (where $\mu_i$ is the multiplier of the constraint $q_i \ge 0$), and setting it to zero gives

$-\beta_1\, q_i\Big[\hat l_i\,\Lambda - \sqrt{\tfrac{I^{kl}_{R}(m,2)}{\delta S}}\,\frac{q_i}{p_i}\Big] - \mu\, q_i = 0.$    (60)

Since we are interested in finding the best posterior on the ordered subset of size $H' \le H$, only the first $H'$ components of $Q = (q_1,\ldots,q_H)$ take strictly positive values. Summing (60) over $i = 1,\ldots,H'$ and using $\sum_{j\le H'} q_j = 1$, we get

$\mu = -\beta_1\Big[\hat L\,\Lambda - \sqrt{\tfrac{I^{kl}_{R}(m,2)\, S}{\delta}}\Big].$    (61)

Substituting (61) back into (57) (set to zero for $i \le H'$) and cancelling $\beta_1$, we obtain

$\sqrt{\tfrac{I^{kl}_{R}(m,2)}{\delta S}}\Big(\frac{q_i}{p_i} - S\Big) = \big(\hat l_i - \hat L\big)\,\Lambda \;\Longrightarrow\; q_i = p_i\Big[S + \frac{\hat l_i - \hat L}{\sqrt{\frac{I^{kl}_{R}(m,2)}{\delta S}}}\,\Lambda\Big], \quad i = 1,\ldots,H'.$    (62)

For feasibility, we need $\sum_i q_i = 1$; normalizing by a suitable constant $Z_{kl,\chi^2}$ gives the normalized equation in the $q_i$'s:

$q_i = \frac{1}{Z_{kl,\chi^2}}\, p_i\Big[S + \frac{\hat l_i - \hat L}{\sqrt{\frac{I^{kl}_{R}(m,2)}{\delta S}}}\,\ln\Big(\frac{(1-r)\,\hat L}{r\,(1-\hat L)}\Big)\Big], \quad i = 1,\ldots,H'.$    (63)

The above is the fixed point equation (FPE) which identifies a stationary point of the bound minimization problem (53).
Corollary 2 (Optimal posterior on an ordered subset support). When the prior is the uniform distribution on $\mathcal{H}$, among all the posteriors with support a subset of size exactly $H'$, a stationary point $Q^{FP}_{kl,\chi^2}(H')$ for (53) can be obtained as the solution of the following fixed point equation:

$q_i = \frac{1}{Z_{kl,\chi^2}}\left[\sum_{j=1}^{H'} q_j^2 + \frac{\hat l_i - \sum_{j=1}^{H'}\hat l_j q_j}{\sqrt{\frac{H\, I^{kl}_{R}(m,2)}{\delta\,\sum_{j=1}^{H'} q_j^2}}}\,\ln\!\left(\frac{(1-r)\sum_{j=1}^{H'}\hat l_j q_j}{r\big(1-\sum_{j=1}^{H'}\hat l_j q_j\big)}\right)\right]$    (64)

for $i = 1,\ldots,H'$, where $Z_{kl,\chi^2}$ is a suitable normalization constant and $r$ is the solution of (53b) and (53c) for the given $Q = (q_1,\ldots,q_H)$.
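In practice, each pass of the fixed point map (64) needs the right root $r$ of (53b), which can be found with a standard bracketing root finder. The sketch below is our own Python rendering under our reading of (64) (all names are ours); it is an illustration, not the paper's implementation.

import numpy as np
from scipy.optimize import brentq

def kl_bern(p, r):
    """Bernoulli KL divergence kl(p, r) with the 0*log(0) = 0 convention."""
    val = 0.0
    if p > 0:
        val += p * np.log(p / r)
    if p < 1:
        val += (1 - p) * np.log((1 - p) / (1 - r))
    return val

def right_kl_root(p_hat, rhs):
    """Largest r in (p_hat, 1) with kl(p_hat, r) = rhs, as required by (53b)-(53c)."""
    hi = 1.0 - 1e-9
    if kl_bern(p_hat, hi) <= rhs:
        return hi
    return brentq(lambda r: kl_bern(p_hat, r) - rhs, p_hat + 1e-9, hi)

def fp_posterior_kl_chi2_uniform(emp_risks_subset, H, m, delta, I_kl_m, iters=500):
    """Fixed point iteration for (64): KL distance, chi^2 divergence, uniform prior.

    I_kl_m is the constant I_kl(m, 2), e.g. taken from Table 5; negative intermediate
    weights are clipped to zero as a pragmatic projection before renormalizing.
    """
    l_hat = np.asarray(emp_risks_subset, dtype=float)
    q = np.full(len(l_hat), 1.0 / len(l_hat))
    r = None
    for _ in range(iters):
        L = float(np.clip(np.dot(l_hat, q), 1e-9, 1 - 1e-6))     # E_Q[l_hat]
        S = float(np.sum(q ** 2))
        r = right_kl_root(L, np.sqrt(H * S * I_kl_m / delta))     # right root of (53b)
        lam = np.log((1 - r) * L / (r * (1 - L)))                 # negative, since r > L
        q_new = S + (l_hat - L) * lam / np.sqrt(H * I_kl_m / (delta * S))
        q_new = np.maximum(q_new, 0.0)
        q = q_new / q_new.sum()
    return q, r

When all the empirical risks coincide, the risk-dependent term vanishes and the iteration returns the uniform distribution, consistent with Lemma 4 below.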
Lemma 4. When all the classifiers have the same empirical risk (all $\hat l_i$ are equal), the optimal posterior for the bound minimization problem (53) is $Q \equiv P$.

KL-distance based bound minimization is non-convex with multiple stationary points, which makes it difficult to identify the global minimum even by the FP scheme. The iterative root finding algorithm adds to the computational complexity of the bound minimization algorithm.
C.3 Convex-concave procedure for finding a local solution for minimization of $B_{kl,\chi^2}(Q)$

We have seen that the optimization problem (53) for finding the bound $B_{kl,\chi^2}(Q)$ consists of a linear objective function and linear constraints, except for the constraint (53b), which takes the form

$kl\big(\mathbb{E}_Q[\hat l], r\big) = \sqrt{\frac{\big(\chi^2(Q\Vert P)+1\big)\,I^{kl}_{R}(m,2)}{\delta}}$    (65)

$\Longleftrightarrow\; \Big(\sum_{i}\hat l_i q_i\Big)\ln\frac{\sum_{i}\hat l_i q_i}{r} + \Big(1-\sum_{i}\hat l_i q_i\Big)\ln\frac{1-\sum_{i}\hat l_i q_i}{1-r} = \sqrt{\frac{\big(\sum_{i} q_i^2/p_i\big)\,I^{kl}_{R}(m,2)}{\delta}}.$    (66)

We know that $\chi^2(Q\Vert P)$ is jointly convex in both its arguments [26]. Based on the proof of Theorem 10, $\sqrt{\sum_i q_i^2/p_i}$ is a convex function of $Q$, and hence, for given system parameters $m$ and $\delta$, the right hand side of the above constraint is a convex function of $Q$. The left hand side is a composition of two functions: $\mathbb{E}_Q[\hat l]$ (a linear function) and $kl(p,q)$ (a jointly convex function). The superposition of a convex function and an affine mapping is convex, provided it is finite at at least one point [8, 4]. Hence $kl\big(\mathbb{E}_Q[\hat l], r\big)$ is convex in its arguments $(Q, r)$. This implies that the constraint (53b) is a difference of convex (DC) functions and the associated optimization problem (53) is a DC program.

Reformulating the original problem (53) in terms of inequality constraints of the form $f(x) - g(x) \le 0$, we have:

$\min_{q_1,\ldots,q_H,\, r}\; r$    (67a)

$kl\Big(\sum_i\hat l_i q_i,\, r\Big) - \sqrt{\tfrac{(\sum_i q_i^2/p_i)\,I^{kl}_{R}(m,2)}{\delta}} \le 0$    (67b)

$\sqrt{\tfrac{(\sum_i q_i^2/p_i)\,I^{kl}_{R}(m,2)}{\delta}} - kl\Big(\sum_i\hat l_i q_i,\, r\Big) \le 0$    (67c)

$\sum_i\hat l_i q_i - r \le 0$    (67d)

$\sum_i q_i = 1$    (67e)

$-q_i \le 0 \quad \forall\, i = 1,\ldots,H.$    (67f)

To apply the convex-concave procedure (CCP), we determine linear approximations of the DC functions in (67b) and (67c) at a point $(Q^0, r^0)$ which is feasible for (67), and equivalently for (53). Let $\widehat{kC}\big((Q,r);(Q^0,r^0)\big)$ denote the linear under-approximation at $(Q^0,r^0)$ of the function $kC(Q,r) := \sqrt{\tfrac{(\sum_i q_i^2/p_i)\,I^{kl}_{R}(m,2)}{\delta}}$ appearing in (67b).
$\widehat{kC}\big((Q,r);(Q^0,r^0)\big) := kC(Q^0,r^0) + \big\langle\nabla kC(Q^0,r^0),\,\big((Q-Q^0),(r-r^0)\big)\big\rangle$
$= \sqrt{\tfrac{(\sum_i (q_i^0)^2/p_i)\,I^{kl}_{R}(m,2)}{\delta}} + \sqrt{\tfrac{I^{kl}_{R}(m,2)}{\delta\,\sum_i (q_i^0)^2/p_i}}\sum_i\frac{q_i^0}{p_i}\,(q_i - q_i^0) + 0\cdot(r-r^0)$
$= \sqrt{\frac{I^{kl}_{R}(m,2)}{\delta\,\sum_i (q_i^0)^2/p_i}}\;\sum_i\frac{q_i^0\, q_i}{p_i},$

since the two constant terms cancel. Similarly, the linear under-approximation $\widehat{kK}\big((Q,r);(Q^0,r^0)\big)$ of the function $kK(Q,r) := kl\big(\sum_i\hat l_i q_i,\, r\big)$ at $(Q^0,r^0)$ is

$\widehat{kK}\big((Q,r);(Q^0,r^0)\big) := kK(Q^0,r^0) + \big\langle\nabla kK(Q^0,r^0),\,(Q-Q^0,\, r-r^0)\big\rangle$
$= \ln\frac{1-\sum_i\hat l_i q_i^0}{1-r^0} + \Big(\sum_i\hat l_i q_i\Big)\Big[\ln\frac{\sum_i\hat l_i q_i^0}{r^0} - \ln\frac{1-\sum_i\hat l_i q_i^0}{1-r^0}\Big] + \frac{r^0 - \sum_i\hat l_i q_i^0}{r^0(1-r^0)}\,(r - r^0).$

Using the above linear approximations $\widehat{kC}$ and $\widehat{kK}$ in (67b) and (67c), we can invoke the CCP procedure described in [12] to get a local minimizer of the KL-distance based bound minimization problem (53).

D Computational Illustrations for SVMs
The datasets considered for our computations, the scheme used to generate the classifiers, and the computation of the risk values are the same as in [22]. On this set of base classifiers, we compare the optimal PAC-Bayesian posteriors for the case of χ²-divergence, obtained using the FP scheme and the solver, for the different φ functions considered.

D.1 Illustration of various optimal posteriors, $Q^*_{\phi,\chi^2}$

We present some graphs to illustrate the nature of the optimal posteriors computed under the framework mentioned above. Figure 4 depicts the role of the confidence level δ in determining the optimal support size $H^*$ for $Q^*_{\mathrm{lin},\chi^2}$ in the case of a uniform prior. Figure 5 shows that the stationary point $Q^{FP}_{kl,\chi^2}$ obtained by the fixed point scheme has almost full support, and that the fixed point equation (15) always converges to a solution even when the solver throws an error such as 'M' (maximum number of iterations exceeded), 'I' (locally infeasible solution), 'E' (unknown error) or 'R' (restoration phase failed). Please see Table 6 for such examples.

[Figure 4: Illustration of the variation of the subset support of $Q^*_{\mathrm{lin},\chi^2}$ as the PAC-Bayesian confidence level δ changes. We consider the Bupa dataset (345 samples, 6 features) with a training sample of size m = 276 for determining our SVM classifiers using H = 1990 regularization parameter values from the set Λ. For a uniform prior distribution, the optimal posterior for the linear distance function, $Q^*_{\mathrm{lin},\chi^2}$, is computed via the Ipopt solver on the full simplex ($Q_{\mathrm{solver}}$) as well as via the fixed point equation (FPE) (29) on increasing ordered subsets of $\mathcal{H}$, denoted $Q_{FP}$. We observe that the FPE correctly identifies the global minimum. In the case of a uniform prior, the posterior weights $q^*_{i,\mathrm{lin},\chi^2}$ are negatively proportional to the empirical risk values $\hat l_i$ of the classifiers in the ordered support set. $H^*$ denotes the support size of the optimal posterior and $\bar l := \hat l_{H^*}$ denotes the value of the empirical risk beyond which the posterior weights are zero. For the given fixed parameters H and m, we consider three values of the confidence level δ; the optimal subset size $H^*$ decreases as δ increases, allowing sparser posteriors.]

[Figure 5: Illustration of the (almost) full support of $Q^{FP}_{kl,\chi^2}$. We consider the Spambase dataset (4601 samples, 57 features) with a training sample of size m = 3680 for determining our SVM classifiers using regularization parameter values from the set Λ. For a uniform prior distribution, the posterior for the KL-distance function, $Q^{FP}_{kl,\chi^2}$, is computed via the Ipopt solver on the full simplex ($Q_{\mathrm{solver}}$) as well as via the fixed point equation (FPE) (15) on increasing ordered subsets of $\mathcal{H}$, denoted $Q_{FP}$. We observe that the FPE always converges to a stationary point even when the solver throws an error and outputs infeasible solutions. In the case of a uniform prior, the posterior weights $q^{FP}_{i,kl,\chi^2}$ are negatively proportional to the empirical risk values $\hat l_i$ of the classifiers in the ordered support set. We notice that the support size of $Q_{FP}$, denoted $H_{FP}$, equals 1986 for all three values of the confidence level δ considered, implying that $Q_{FP}$ for the kl-χ² case has almost full support.]

Table 6: Bound values for the kl-χ² case: comparison of the bound values $B^{FP}_{kl,\chi^2}$ and $B^{\mathrm{solver}}_{kl,\chi^2}$ for the posterior obtained via the fixed point equation (64) with the linear search Algorithm 2 and the optimal posterior for minimizing the PAC-Bayesian bound (53) for the KL-distance with chi-squared divergence between the prior and posterior distributions, for nine UCI datasets, H ∈ {50, 200, 500, 1000, 1990}, and the validation set size v reported per dataset. The fixed point equation always converges and identifies the local minimum output by the Ipopt solver, even when the solver fails to identify a solution for reasons like local infeasibility (I), restoration phase failure (R), maximum number of iterations exceeded (M), unknown error (E), etc.

Table 7: Test error rates for the kl-χ² case: comparison of the test error rates $T^{FP}_{kl,\chi^2}$ and $T^{\mathrm{solver}}_{kl,\chi^2}$ when using the posterior obtained via the fixed point equation (64) with the linear search Algorithm 2 and the optimal posterior for minimizing the PAC-Bayesian bound (53) for the KL-distance with chi-squared divergence between the prior and posterior distributions, for the same datasets, values of H, and test set sizes t reported per dataset. The fixed point equation always converges and identifies the local minimum output by the Ipopt solver, even when the solver fails to identify a solution for reasons like local infeasibility (I), restoration phase failure (R), maximum number of iterations exceeded (M), unknown error (E), etc.

Table 8: Comparison of the bound values and test error rates of the optimal posterior obtained via the fixed point (FP) scheme and the posterior based on the convex-concave procedure (CCP) for minimizing the PAC-Bayesian bound $B_{kl,\chi^2}$ based on the KL-distance function with χ²-divergence. The CCP based posteriors are identified by the bound minimization model described in Section C.3. The bound values and test error rates of the FP scheme based solution are denoted $B^{FP}_{kl,\chi^2}$ and $T^{FP}_{kl,\chi^2}$; those of the CCP based posterior are denoted $B^{CCP}_{kl,\chi^2}$ and $T^{CCP}_{kl,\chi^2}$. For the computations, we consider SVM classifiers generated on nine datasets from the UCI repository [5] using the scheme in Section 7 of the main paper (also considered in [22]) for H = 50 values in Λ. We run the CCP procedure for 1000 different initializations of the posterior $Q^0$ (as done in [12]); the range, mean and standard deviation of the bound values and average test error rates of the resulting CCP based posteriors are tabulated (for example, for Spambase: $B^{FP}_{kl,\chi^2}$ = 0.43076, Range($B^{CCP}_{kl,\chi^2}$) = [0.46072, 0.53931], Mean($B^{CCP}_{kl,\chi^2}$) = 0.48658). We notice that $B^{FP}_{kl,\chi^2}$ is always better than $B^{CCP}_{kl,\chi^2}$ and that $T^{FP}_{kl,\chi^2}$ is comparable with the mean value of $T^{CCP}_{kl,\chi^2}$ for the different datasets considered. This might be because the FP scheme identifies the global minimum for the kl-χ² based bound minimization problem, whereas CCP converges to a local solution or a stationary point. 'NA' denotes the cases where the CCP cannot provide a linear approximation to $kl(\mathbb{E}_Q[\hat l], r)$ because a subgradient cannot be determined when $\mathbb{E}_Q[\hat l]$ takes the boundary value zero; such cases usually occur for the almost separable datasets Banknote and Mushroom, where $\mathbb{E}_Q[\hat l] = 0$ for any distribution Q since all the $\hat l_i$ take value zero.

D.2 Sparsity and Concentration of Optimal Posteriors, $Q^*_{\phi,\chi^2}$

We determine the optimal posteriors $Q^*_{\phi,\chi^2}$ for the different distance functions φ and compare their bound values and test error rates. To understand the differences between the nature of these posteriors for different choices of φ, we need to compare the vectors of posterior weights. For a large H, as considered here, it is difficult to compare these high-dimensional probability weight vectors elementwise, so we use different measures for capturing the information in these posteriors.

To measure the sparsity of the posteriors, we first compute their cumulative distribution functions (CDFs), denoted $F_{Q^*_{\phi,\chi^2}}(\cdot)$. We consider three significance levels α and identify the number of classifiers $N_{\phi,\chi^2}(\alpha)$, out of H = 1990, required by $F_{Q^*_{\phi,\chi^2}}(\cdot)$ to achieve the given significance level α. That is,

$N_{\phi,\chi^2}(\alpha) := \min\{\, i \mid F_{Q^*_{\phi,\chi^2}}(i) \ge \alpha \,\}.$    (68)

For a given level α, a low $N_{\phi,\chi^2}(\alpha)$ indicates that the distribution $Q^*_{\phi,\chi^2}$ is sparse. In our computations, we observe that for the three significance levels considered, the optimal posterior $Q^*_{kl,\chi^2}$ has large $N_{kl,\chi^2}(\alpha)$ values, implying almost full support, whereas $Q^*_{\mathrm{sq},\chi^2}$ is sparse, as reflected by its low $N_{\mathrm{sq},\chi^2}(\alpha)$ values (please see Table 9 for the computed values).

[Figure: Cumulative posterior weights $F_{Q^*_{\phi,\chi^2}}$ plotted against the empirical risk $\hat l$ for the Banknote, Spambase, Wdbc, Bupa and Mammographic datasets (H = 1990), comparing $F(Q^*_{\mathrm{lin},\chi^2})$, $F(Q^*_{\mathrm{sq},\chi^2})$ and $F(Q^*_{kl,\chi^2})$.]

We quantify the level of concentration of a posterior distribution on its support via the Herfindahl-Hirschman Index (HHI) [7, 28]. HHI is a prominent index in the economics literature, widely used for measuring the contribution of a sector to the economy or the market share of a firm in an industry; it is an indicator of the amount of competition among firms, and is defined as the square root of the sum of the squares of the contributions/market shares. For a probability vector such as our posteriors, HHI is therefore the ℓ₂ norm of the vector. A high HHI score indicates a high concentration of probabilities and vice versa. HHI scores for $Q^*_{\phi,\chi^2}$ are given in Table 9. We observe that $Q^*_{\mathrm{sq},\chi^2}$ has a relatively high HHI score, indicating higher concentration compared to $Q^*_{\mathrm{lin},\chi^2}$ and $Q^*_{kl,\chi^2}$. The differences in concentration levels are remarkable for datasets with highly varying empirical risk values.
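Both summaries are simple to compute from a posterior weight vector ordered by increasing empirical risk; the Python sketch below is our own helper (the significance levels shown are illustrative placeholders, not the exact values used in the study).

import numpy as np

def sparsity_and_concentration(q, alphas=(0.50, 0.80, 0.90)):
    """N(alpha) from (68) and the HHI (l2 norm) of a posterior weight vector q.

    q is assumed to be ordered by increasing empirical risk of the classifiers.
    """
    q = np.asarray(q, dtype=float)
    cdf = np.cumsum(q)                                                # F_Q(i)
    n_alpha = {a: int(np.searchsorted(cdf, a) + 1) for a in alphas}   # smallest i with F_Q(i) >= alpha
    hhi = float(np.linalg.norm(q, 2))                                 # high HHI = concentrated posterior
    return n_alpha, hhi

A sparse, concentrated posterior (e.g. the squared-distance one) shows small N(α) values together with a large HHI, while an almost-full-support posterior (e.g. the KL-distance one) shows the opposite pattern.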