Optimal Posteriors for Chi-squared Divergence based PAC-Bayesian Bounds and Comparison with KL-divergence based Optimal Posteriors and Cross-Validation Procedure
Puja Sahu and Nandyala Hemachandra
Abstract
We investigate optimal posteriors for the recently introduced [3] chi-squared divergence based PAC-Bayesian bounds in terms of the nature of their distribution, scalability of computations, and test set performance. For a finite classifier set, we deduce bounds for three distance functions: KL-divergence, linear and squared distances. Optimal posterior weights are proportional to deviations of empirical risks, usually with subset support. For the uniform prior, it is sufficient to search among posteriors on classifier subsets ordered by these risks. We show the bound minimization for the linear distance as a convex program and obtain a closed-form expression for its optimal posterior. The minimization for the squared distance is a quasi-convex program under a specific condition, and the one for KL-divergence is a non-convex optimization (a difference of convex functions). To compute such optimal posteriors, we derive fast converging fixed point (FP) equations. We apply these approaches to a finite set of SVM regularization parameter values to yield stochastic SVMs with tight bounds. We perform a comprehensive performance comparison between our optimal posteriors and known KL-divergence based posteriors on a variety of UCI datasets with varying ranges and variances in risk values. Chi-squared divergence based posteriors have weaker bounds and worse test errors, hinting at an underlying regularization by the KL-divergence based posteriors. Our study highlights the impact of the divergence function on the performance of PAC-Bayesian classifiers. We also compare our stochastic classifiers with a cross-validation based deterministic classifier. The latter has better test errors, but ours is more sample robust, has quantifiable generalization guarantees, and is computationally much faster.
Keywords:
Generalization guarantees, divergence measure, convex and non-convex constrained optimization, fixed point equations, sample robustness, SVM regularization parameter
In classification algorithms, the choice of the parameter(s) influences the level of accuracy that the generated classifier can achieve. For example, consider the Support Vector Machine (SVM) algorithm for classification with the regularization parameter λ > 0. This parameter is a user input which trades off between model complexity and training error. The optimal classifier that we get depends heavily on the sample S used for training and on the value of the parameter λ. We can control only this parameter value for obtaining a classifier with low (training) error, but not the given data. For a given training sample, we can choose the best value of the parameter from a prefixed set of values, which yields a classifier with the lowest error. However, this is a long drawn process. Moreover, there is no guarantee that the chosen value will yield a classifier having the low(est) error on another sample from the same distribution. This implies that the best parameter value is sample dependent and that there is no unique value which is best for almost all training samples (please see [21] and Appendix A, pg. 17 in [22] for an illustration on a UCI dataset). However, if we determine the set of λ values with, say, the lowest 30% error rates on each sample, we observe a recurring subset of λ values across these samples (please see [21] and Table 4 in Appendix B in [22] for an illustration on a UCI dataset). Thus, we have an ensemble of λ values to pick from. We can combine multiple base classifiers resulting from different parameter values to build a strong stochastic classifier using the PAC-Bayesian framework.

PAC-Bayesian Bounds and Optimal Posteriors
The PAC-Bayesian approach assumes an arbitrary but fixed prior distribution on the space of classifiers and outputs a posterior distribution on this space, corresponding to a stochastic classifier. This approach provides a probabilistic bound on the difference between the posterior averaged true and empirical risks of a stochastic classifier, as measured by a convex distance function. These bounds on the unknown averaged true risk offer a trade-off between the averaged empirical risk and a term which encompasses the model complexity of the stochastic classifier. The bound is computed based on a single sample, but with a high probability guarantee over different samples (from the same distribution). For a chosen distance function, we are interested in the 'optimal PAC-Bayesian posterior', defined as the posterior distribution which minimizes the corresponding PAC-Bayesian bound. By design, these bounds and the resulting optimal posterior are robust to the choice of sample used for training, addressing the sample bias.
Relevant Work
A well known family of bounds estimating the unknown true risk of a classifier, known as PAC-Bayesian bounds, was proposed by [17, 18, 23] using the idea of Bayesian priors and posteriors on the classifier space, and refined further by [13, 10, 15, 16]. Several authors improved the bounds for the choice of distance function they considered for evaluating the classifiers. While [13] provided a bound for the KL-divergence as the distance function φ, by tightening the threshold with a factor of √m instead of m, [6] generalized the framework of PAC-Bayesian bounds to a broader class of convex φ functions and relaxed the constraints on the tail bounds of the empirical risk of the classifiers under consideration. PAC-Bayesian theory has been used to devise margin bounds for linear classifiers such as SVMs [14, 11]. [1] specialized the PAC-Bayesian bounds using spherical Gaussian distributions on the space of linear classifiers, and the set up was extended to data-dependent priors [19]. More recently, [3] introduced PAC-Bayesian bounds based on the Rényi divergence between the posterior and the prior distributions. We use a specific case of this Rényi divergence which corresponds to the χ²-divergence.

All of the above consider a continuous (SVM) classifier space (an n-dimensional Euclidean space) and continuous prior as well as posterior distributions on it (spherical Gaussian distributions), whereas we consider a finite set of classifiers, such as those generated by a finite set of regularization parameter values for the SVM. Our χ²-divergence based PAC-Bayesian bounds are derived for this set up with a discrete prior distribution, and with three different distance functions between the posterior averaged empirical risk and the posterior averaged true risk. The motivation for choosing a different divergence function in the PAC-Bayesian framework is to achieve a better test set performance and a tighter risk bound. In order to do so, we first need to investigate the nature of these χ²-divergence based PAC-Bayesian bound minimization problems, identify the corresponding optimal PAC-Bayesian posteriors and understand their nature. These posteriors might not be at par with the classical KL-divergence based PAC-Bayesian posteriors or the cross-validation based procedure, but the comparison amongst them brings forth some insightful aspects of the PAC-Bayesian optimal posteriors. We list below the contributions of this paper.

Contributions
We are interested in the optimal PAC-Bayesian posterior which minimizes the χ²-divergence based PAC-Bayesian bound for a given distance function (Section 2). We consider a finite classifier set and three distance functions: the linear distance, the squared distance (a second degree polynomial) and the KL-divergence (an infinite degree polynomial).

• We deduce χ²-divergence based PAC-Bayesian bounds for the above three distance functions and identify the optimal posteriors for them via the respective bound minimization problems.

• The linear distance based bound was considered in [3]; we identify the associated bound minimization as a convex program and obtain a closed form expression for the global optimal posterior (Section 4).

• We also deduce PAC-Bayesian bounds for the squared distance and the KL-divergence, and show that they are non-convex programs (Sections 5 and 6). We further show that the squared distance based bound is quasi-convex under certain conditions. The KL-divergence based bound minimization problem involves a difference of convex (DC) functions and hence is a DC program. Therefore we applied a DC approach known as the Convex-Concave Procedure (CCP) [12] to find its local minimum. In our computations, we observed that the CCP did not work for certain cases, especially when we have almost linearly separable data. (Such cases are illustrated in Table 8 in Appendix D.)

• For deriving optimal posteriors for such non-convex cases, we identify Fixed Point (FP) equations deduced from the partial KKT system with strict positivity constraints. These FP equations converge even when the solver or an alternate approach like CCP fails to identify a solution, and they are much faster than the solver. (Some examples of such cases are in Tables 6, 7 and 8 in the appendix.)

• For any of the above three distance functions, for the uniform prior distribution, we simplify the search for optimal posteriors to the part of the simplex restricted to subsets of classifiers ordered by empirical risk values (Section 3).

• For computational illustration, we consider a comprehensive set of nine UCI datasets [5] with small to moderate numbers of examples and features, balanced and imbalanced classes, and different ranges and variances in the empirical risk values. Using such datasets helps us compare and understand the performance of optimal posteriors due to different distance functions for the χ²-divergence based bounds, and also compare with the known KL-divergence based optimal posteriors [21].

• We use these approaches on the set of SVMs generated by a finite set of regularization parameter values (Section 7). This leads us to the notion of a stochastic SVM characterized by an optimal posterior on the regularization parameter set. Usually small values of the regularization parameter are preferred. Keeping this in mind, we used an arithmetic-geometric series of regularization parameter values λ, with a logarithmic scale for λ ∈ (0, 0.1) and a linear scale for λ ≥ 0.1. We chose a mixture of logarithmically and linearly spaced values of λ so that we cover many different λ's corresponding to distinct SVM classifiers with low test errors.

  – Optimal posteriors for KL-divergence give extremely loose bounds and are computationally expensive, but have test error rates generally better than the linear distance ones. The optimal bound value and test error rate of the squared distance based optimal posterior are remarkably lower than those of the linear or KL-distance based posteriors when the base classifiers have high variation in empirical risk values.

  – This is accompanied by relatively high concentration on low empirical risk values and the sparse nature of the squared distance based posteriors. For almost separable datasets, posteriors due to these three distance functions have comparable PAC-Bayesian bound values and test error rates.

Table 1 outlines theoretical and computational aspects of the optimal posteriors considered in this paper.

• To understand the role of the divergence measure in PAC-Bayesian bounds, we conducted a comparative study of these χ²-divergence based optimal PAC-Bayesian posteriors with the posteriors derived for the classical KL-divergence based PAC-Bayesian bounds [21] (Section 8).

  – We observe that though both classes of posteriors have weights decreasing with increasing empirical risk values of classifiers, the rate at which they decrease differs: KL-divergence based posteriors decrease exponentially, while χ²-divergence based posteriors decrease linearly with the empirical risk values.

  – Another difference is in the size of the support set for the two classes of posteriors: KL-divergence based posteriors take up the full support on the set of base classifiers, whereas those for χ²-divergence usually depend only on a strict subset as their support set.

  – The class of optimal posteriors for χ²-divergence based PAC-Bayesian bounds is observed to have weaker bounds and higher test set errors than the class of KL-divergence based PAC-Bayesian posteriors on a set of SVM classifiers. Such behaviour can be attributed to the χ²-divergence based posteriors overfitting the data by choosing a strict subset support of classifiers with the least empirical risk values.

• We also compared the performance of the stochastic SVM classifier governed by these PAC-Bayesian posteriors with the deterministic SVM classifier obtained via the popular cross-validation procedure for regularization parameter selection (Section 8) as the baseline case. Though the cross-validation procedure gives a classifier with better performance on a test set, the PAC-Bayesian posteriors yield a sample robust classifier with quantifiable guarantees on the unknown true risk. On the computation side, the PAC-Bayesian procedure is more than 10 times faster than the cross-validation procedure.
Classical versions of the PAC-Bayesian theorem are derived using the Donsker-Varadhan inequality [3, 24] for change of measure, which is based on the KL-divergence between the two distributions (see Lemma 1 below). A new version of PAC-Bayesian results was derived by [3] which involves a change of measure guided by the Rényi divergence between the two distributions:

Theorem 1 ([3]). For any data distribution D over the input space X × Y, the following bound holds for any prior P over the set of classifiers H, for any α > 1 and any δ ∈ (0, 1], where the probability is over random i.i.d. samples S_m = {(x_i, y_i) | i = 1, ..., m} of size m drawn from D, for any convex function φ : [0,1] × [0,1] → R:

P_{S_m}\left[ \varphi\left( E_Q[\hat{l}], E_Q[l] \right) \le \left( E_{h \sim P}\left[ \left(\tfrac{Q(h)}{P(h)}\right)^{\alpha} \right] \right)^{1/\alpha} \left( \frac{I_{R\varphi}(m, \alpha')}{\delta} \right)^{1/\alpha'} \right] \ge 1 - \delta, \quad (1)

where α' = α/(α−1) and I_{Rφ}(m, α') := sup_{l ∈ [0,1]} [ Σ_{k=0}^{m} C(m,k) l^k (1−l)^{m−k} φ(k/m, l)^{α'} ]. Here, Q is an arbitrary posterior distribution on H, which may depend on the sample S and on the prior P. E_Q[l̂] := E_{h∼Q} (1/m) Σ_{i=1}^m l(h, x_i, y_i) denotes the averaged empirical risk, and E_Q[l] := E_{h∼Q} E_{(x,y)∼D}[l] denotes the averaged true risk of classifiers in H, computed using a loss function l(h, x, y) : H × X × Y → [a, b) (here, 0 ≤ a < b).

We are interested in identifying the optimal posteriors for different choices of distance functions for the case α = 2, which can be related to the chi-squared divergence measure χ²(Q||P) := E_{h∼P}[ (Q(h)/P(h))² − 1 ] between the distributions Q and P [3].
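As a quick illustration of the α = 2 specialization, the following is a minimal Python sketch (not part of the paper's own code) that computes χ²(Q||P) for discrete distributions and the resulting complexity factor sqrt((χ²(Q||P) + 1) I_Rφ(m, 2)/δ) appearing in the bounds below; the value of I_Rφ(m, 2) is passed in, since it depends on the chosen distance function φ.

```python
import numpy as np

def chi2_divergence(q, p):
    """chi^2(Q||P) = sum_i q_i^2 / p_i - 1 for discrete distributions on a finite set."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    return np.sum(q * q / p) - 1.0

def complexity_term(q, p, I_R_phi, delta):
    """sqrt((chi^2(Q||P) + 1) * I_Rphi(m, 2) / delta): the model-complexity factor for alpha = 2."""
    return np.sqrt((chi2_divergence(q, p) + 1.0) * I_R_phi / delta)

# Example: uniform prior over 10 classifiers, posterior concentrated on 3 of them,
# linear distance (I_Rlin(m, 2) = 1/(4m)) with m = 1000 and delta = 0.05.
# p = np.full(10, 0.1); q = np.array([0.5, 0.3, 0.2] + [0.0] * 7)
# print(complexity_term(q, p, I_R_phi=1.0 / (4 * 1000), delta=0.05))
```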
The Donsker-Varadhan inequality can be stated as below.

Lemma 1 (KL-divergence change of measure [3]). For any set H, for any two distributions P and Q on H, and for any measurable function φ : H → R, we have E_{h∼Q} φ(h) ≤ KL[Q||P] + ln( E_{h∼P} e^{φ(h)} ).

Table 1: An outline of theoretical aspects and computational results for the optimal posteriors Q*_{φ,χ²} = {q*_{i,φ,χ²}}_{i=1}^H minimizing the PAC-Bayesian bound B_{φ,χ²}(Q) based on the chi-squared divergence χ²(Q||P) = Σ_{i=1}^H q_i²/p_i − 1 between a posterior Q and a prior P on the classifier set H. We consider three distance functions φ: the KL-divergence kl(l̂, l) = l̂ ln(l̂/l) + (1−l̂) ln((1−l̂)/(1−l)), the linear distance φ_lin(l̂, l) = l − l̂, and the squared distance φ_sq(l̂, l) = (l − l̂)², for l̂, l ∈ (0,1). H denotes the classifier set size and H* the size of the support set of the optimal posterior Q*_{φ,χ²}. l̂_i denotes the empirical risk of a classifier in H computed on a sample of size m. I_{Rφ}(m) is a sample-size based constant for a distance function φ; it is a component of the bound function B_{φ,χ²}(Q).

Theoretical aspects (for uniform prior P):
• φ_lin: I_{R lin}(m, 2) = 1/(4m) (due to [3]); the bound is convex; the global minimum q*_{i,lin,χ²}(H*) has the closed form given in (9).
• φ_sq: I_{R sq}(m, 2) identified in Lemma 2; the bound is shown to be non-convex, and quasi-convex under a condition; a stationary point q^{FP}_{i,sq,χ²}(H*) is characterized by the fixed point equation (12).
• kl: I_{R kl}(m, 2) computed numerically based on the form given by [3]; the bound is non-convex (a difference of convex functions, DC); a stationary point q^{FP}_{i,kl,χ²}(H*) is characterized by the fixed point equation (15).

Computational aspects (for uniform prior P), solver (Ipopt) output versus the fixed point (FP) scheme:
• φ_lin: the solver identifies the global minimum; the global minimum is also identified analytically, so the FP scheme is not required.
• φ_sq: the solver identifies a unique (local) minimum even with different initializations; a closed form may not exist; the FP scheme matches the solver output.
• kl: the solver identifies multiple local minima with different initializations and throws errors for moderate and large H; a closed form may not exist; the FP scheme identifies a unique stationary point even with different initializations.

Optimal posteriors via PAC-Bayesian bound minimization

The PAC-Bayesian theorem (1) gives a high probability upper bound on the averaged true risk E_Q[l], assuming the distance function φ(E_Q[l̂], ·) is invertible for the given E_Q[l̂]:

B_{\varphi,\chi^2}(Q) \equiv B_{\varphi,\chi^2}(E_Q[\hat{l}], S_m, \delta, P) = f_{\varphi}\left( E_Q[\hat{l}],\; \varphi^{-1}_{E_Q[\hat{l}]}\left( \sqrt{ \frac{(\chi^2(Q\|P) + 1)\, I_{R\varphi}(m, 2)}{\delta} } \right) \right), \quad (2)

where φ^{-1}_{E_Q[l̂]}(K) = b implies φ(E_Q[l̂], b) = K for some b ∈ (0,1) and a given K > 0. Generally f_φ(·,·) is the sum of its arguments, except when φ is the KL-distance function. That is, the bound function B_{φ,χ²}(Q) is the sum of the averaged empirical risk E_Q[l̂] and a model complexity term which depends on the system parameters S_m, δ and P. We are interested in determining an optimal posterior distribution Q*_{φ,χ²} which minimizes the bound B_{φ,χ²}(Q) for a given distance function φ.

To characterize the minimum of B_{φ,χ²}(Q), we make use of the first order KKT conditions, which are necessary for a stationary point of a non-convex problem. These KKT conditions require the objective function and the active constraints to be differentiable at the local minimum. We derive fixed point (FP) equations for the optimal posterior using the partial KKT system. These FP equations use the KKT system with strict positivity constraints, due to which the complementary slackness conditions are automatically satisfied; hence the name 'partial' KKT system. The computations illustrate that these FP equations always converge to a stationary point at a very fast rate, even for a large classifier set when a non-convex solver fails to identify a solution. (Please see Table 6 and Table 7 in Appendix D for an illustration of such cases.)
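As a computational aside, the FP equations derived later (equations (12) and (15)) are of the form q = T(q) on the interior of the simplex and can be iterated directly. A minimal, generic sketch of such an iteration is given below; the update map T is supplied by the specific distance function, and the clipping/renormalization step is a simple safeguard added here for illustration, not part of the derivation.

```python
import numpy as np

def fixed_point_posterior(T, h_prime, tol=1e-10, max_iter=10_000):
    """Iterate q <- T(q), starting from the uniform distribution on h_prime classifiers.

    T: update map implementing a fixed point equation such as (12) or (15);
       it must return a vector of the same length as its input.
    Returns the iterate once successive updates differ by less than tol,
    or None if the iteration does not settle within max_iter steps.
    """
    q = np.full(h_prime, 1.0 / h_prime)           # start at the uniform posterior
    for _ in range(max_iter):
        q_new = T(q)
        q_new = np.clip(q_new, 1e-12, None)       # keep iterates strictly positive
        q_new = q_new / q_new.sum()               # re-project onto the simplex
        if np.max(np.abs(q_new - q)) < tol:
            return q_new
        q = q_new
    return None
```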
Framework

We work with a finite set of classifiers H = {h_i}_{i=1}^H of size H. The prior P = {p_i}_{i=1}^H and the posterior Q = {q_i}_{i=1}^H are discrete distributions on H, where p_i, q_i ≥ 0 for all i = 1, ..., H, with Σ_{i=1}^H p_i = 1 and Σ_{i=1}^H q_i = 1. For the differentiability required by the KKT conditions, our objective function should have an open domain, that is, the interior of the H-dimensional probability simplex: int(∆_H) = { (q_1, ..., q_H) | q_i > 0 ∀ i = 1, ..., H; Σ_{i=1}^H q_i = 1 }. In computations, we consider q_i ≥ ε for all i = 1, ..., H, for some ε > 0, to ensure the existence of a minimizer in int(∆_H). Our FP equations are derived using the partial KKT system on int(∆_H).

Q*_{φ,χ²} for uniform prior

We consider the special case of a uniform prior on the entire set H. We want to identify the optimal posterior Q*_{φ,χ²} with the H-dimensional probability simplex ∆_H as the feasible region. We show below that it is enough to restrict the search space to certain subsets of ∆_H. This reduces the computational complexity of the search from an exponential scale to a linear scale.

Theorem 2.
Consider a uniform prior distribution on the set H of classifiers, and a given set of posterior weights Q = {q_j}_{j=1}^{H'}. We have three choices of distance function, φ ∈ {φ_lin, φ_sq, kl}. Then, among all subsets H' ⊂ H of size H', the smallest bound value B_{φ,χ²}(Q, H') corresponding to the given posterior weights Q is achieved when H' is the subset formed by the first H' elements of the ordered set of classifiers ranked by non-decreasing empirical risk values, l̂_1 ≤ l̂_2 ≤ ... ≤ l̂_H.

Proof. We first consider the case of the linear and squared distance based bounds. Under the given set up, these bound functions are defined as follows:

B_{lin,\chi^2}(Q, H') := \sum_{i \in H'} \hat{l}_i q_i + \sqrt{ \frac{ \left(\sum_{i \in H'} q_i^2\right) H }{ 4 m \delta } }. \quad (3)

B_{sq,\chi^2}(Q, H') := \sum_{i \in H'} \hat{l}_i q_i + \sqrt{ H \left(\sum_{i \in H'} q_i^2\right) \frac{ I_{R\,sq}(m, 2) }{ \delta } }. \quad (4)

For a given set of posterior weights {q_j}_{j=1}^{H'}, the second terms of B_{lin,χ²}(Q, H') and B_{sq,χ²}(Q, H') are invariant of the support set H' as long as its cardinality is H'. Thus the value of the bound depends on the common first term, which is a sum of positive quantities. For given weights {q_j}_{j=1}^{H'}, the bounds (3) and (4) are the smallest when the sum Σ_{i∈H'} l̂_i q_i is minimized. This happens when H' consists of the classifiers with the smallest H' values in the set {l̂_i}_{i=1}^H. Furthermore, if the elements of H' are ordered by non-decreasing empirical risk values, l̂_1 ≤ l̂_2 ≤ ... ≤ l̂_{H'}, the posterior weights should be ordered non-increasingly. Hence, the claim of the theorem holds true.

Now, for the KL-divergence as the distance function, the bound value r is the solution to the following two equations:

kl\left( \sum_{i \in H'} \hat{l}_i q_i,\; r \right) = \sqrt{ \frac{ H \left(\sum_{i \in H'} q_i^2\right) I_{R\,kl}(m, 2) }{ \delta } }, \quad (5)

r \ge \sum_{i \in H'} \hat{l}_i q_i. \quad (6)

The right hand side of (5) is invariant of the support H' as long as it is of size H'. Let L̂ := Σ_{i∈H'} l̂_i q_i; then (5) is an implicit function of the variables L̂ and r. Using the implicit function theorem, we have

\frac{dr}{d\hat{L}} = - \frac{ \partial kl / \partial \hat{L} }{ \partial kl / \partial r } = \frac{ \ln\frac{\hat{L}}{r} - \ln\frac{1 - \hat{L}}{1 - r} }{ \frac{\hat{L}}{r} - \frac{1 - \hat{L}}{1 - r} }. \quad (7)

Using (6) and the strict monotonicity of the natural logarithm function, we can claim that dr/dL̂ > 0. That is, the bound r is a strictly increasing function of L̂ := Σ_{i∈H'} l̂_i q_i under the given set up. To find the least r for a given Q(H') = {q_j}_{j=1}^{H'}, we need to find the least Σ_{i∈H'} l̂_i q_i over all possible subsets H'. This happens when H' is the subset formed by the first ordered H' elements l̂_1 ≤ l̂_2 ≤ ... ≤ l̂_{H'}. Hence proved.

Corollary 1. As a consequence of Theorem 2 above, for determining the (global) optimal posterior Q*_{φ,χ²} it is sufficient to compare the bound values corresponding to the best posteriors on ordered subsets of H, ranked by non-decreasing l̂_i values. These ordered subsets can be uniquely identified by their size: an ordered subset of size 1 is {l̂_1}, of size 2 is {l̂_1, l̂_2}, and so on. Thus there exists an isomorphism between the set {1, ..., H} (which denotes the subset size) and the family of ordered increasing subsets of H.
Algorithm 1: OptQ_{φ-χ²} For Uniform Prior: algorithm for finding the optimal posterior for the PAC-Bayesian bound based on χ²-divergence when the prior is the uniform distribution.

Input: m, δ, H, {l̂_i}_{i=1}^H (sorted in non-decreasing order)
Output: Q*_{φ,χ²}

Define an array B*_{φ,χ²}[1 ... H] of size H
B*_{φ,χ²}[1] ← f_φ( l̂_1, φ^{-1}_{l̂_1}( sqrt( I_{Rφ}(m, 2) / (p_1 δ) ) ) )
for H' = 2, ..., H do
    flag ← 0; identify Q*_{φ,χ²} for H' via (9) or (12) or (15)
    for i = 1, ..., H' do
        if q*_{i,φ,χ²} < 0 then flag ← 1; break
    end
    if flag = 1 then break
    Compute B*_{φ,χ²}[H'] using Q*_{φ,χ²} in (2)
end
H* ← arg min_{H'} B*_{φ,χ²}[H']
Identify Q*_{φ,χ²} for H* via (9) or (12) or (15)
return Q*_{φ,χ²}
Correctness of Algorithm OptQ_{φ-χ²} For Uniform Prior
We want to determine the globally optimal posterior Q*_{φ,χ²} that has the minimum bound value B_{φ,χ²}(Q) over the H-dimensional probability simplex ∆_H. Using the result of Theorem 2, we can confine the search to a much smaller space of posteriors with support on a family of increasing ordered subsets of H. These ordered subsets are defined by their size; for example, an ordered subset of size H' ∈ [H] comprises the lowest H' values in the set {l̂_i}_{i=1}^H. This restricted space of posteriors, say ∆_ord ⊂ ∆_H, is a union of convex sets of posteriors with supports on the ordered subsets defined above. Due to the increasing subset relation between consecutive supports, this union is also a convex set. The search space ∆_ord is a restriction of ∆_H, yet it consists of uncountably many posteriors. We refine the search further by localizing to the optimal posteriors Q*_{φ,χ²}(H') on each increasing ordered subset and comparing their bound values B*_{φ,χ²}(H') to find the minimum. Thus, an exponential search on the restricted posterior space is simplified to a finite linear search on the support size. When using the FP scheme, we also need to verify that Q*_{φ,χ²}(H') satisfies the positivity constraints. We denote the support size of the optimal posterior Q*_{φ,χ²} by H* ∈ [H]. Therefore, for determining Q*_{φ,χ²} in ∆_ord, it is sufficient to search for H* in the set {1, ..., H}.
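As an illustration of this linear search, the following is a minimal Python sketch of Algorithm 1; the callables `optimal_posterior_on_prefix` and `bound_value` are hypothetical placeholders for the distance-specific formulas (9), (12) or (15) and for the bound (2).

```python
import numpy as np

def opt_q_phi_chi2(emp_risks, optimal_posterior_on_prefix, bound_value):
    """Linear search over ordered support sizes (a sketch of Algorithm 1).

    emp_risks: empirical risks of the H base classifiers.
    optimal_posterior_on_prefix(sorted_risks, h): candidate optimal weights on the
        h lowest-risk classifiers; may contain negative entries if the prefix is infeasible.
    bound_value(sorted_risks, q): PAC-Bayesian bound (2) evaluated at the weights q.
    """
    order = np.argsort(emp_risks)            # rank classifiers by empirical risk
    sorted_risks = np.asarray(emp_risks, float)[order]
    H = len(sorted_risks)

    best_bound, best_q, best_h = np.inf, None, None
    for h in range(1, H + 1):
        q = optimal_posterior_on_prefix(sorted_risks, h)
        if np.any(q < 0):                    # infeasible prefix: stop growing the support
            break
        bound = bound_value(sorted_risks, q)
        if bound < best_bound:
            best_bound, best_q, best_h = bound, q, h

    # map the weights on the ordered prefix back to the original classifier indexing
    q_full = np.zeros(H)
    q_full[order[:best_h]] = best_q
    return q_full, best_bound
```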
Optimal PAC-Bayesian Posterior using Linear Distance Function

As a basic case, we can consider the linear distance function φ_lin(l̂, l) = l − l̂ for l̂, l ∈ [0,1]. The optimal posterior Q*_{lin,χ²} that bounds the unknown averaged true risk of a stochastic classifier is obtained via the following minimization problem for the bound B_{lin,χ²}(Q) identified by [3]:

\min_{Q = (q_1, \ldots, q_H) \in \Delta_H} \; B_{lin,\chi^2}(Q) := \sum_{i=1}^{H} \hat{l}_i q_i + \sqrt{ \frac{ \sum_{i=1}^{H} q_i^2 / p_i }{ 4 m \delta } }. \quad (8)

Theorem 3.
The bound function B_{lin,χ²}(Q) (identified by [3]) is a strictly convex function and hence (8) is a convex program with a unique global minimum.

The proof (in Section A.1 in the appendix) uses the first order convexity property.

Q*_{lin,χ²} for uniform prior

We identify the optimal posterior and the optimal bound value for the linear distance function by exploiting the convexity of B_{lin,χ²}(Q). Proofs of Theorem 4 and Theorem 5 stated below are in Sections A.2 and A.3 in the appendix.

Theorem 4 (Optimal posterior on an ordered subset support). When the prior is the uniform distribution on H, among all posteriors with support a subset of H of size exactly H', the best posterior, denoted by Q*_{lin,χ²}(H'), has its support on the ordered subset H'_ord = { {l̂_i}_{i=1}^{H'} | l̂_1 ≤ l̂_2 ≤ ... ≤ l̂_{H'} } consisting of the smallest H' values in H. The optimal posterior weights are determined as follows:

q^*_{i,lin,\chi^2}(H') = \begin{cases} \dfrac{1}{H'} + \dfrac{ \frac{1}{H'}\sum_{j=1}^{H'} \hat{l}_j - \hat{l}_i }{ H' \sqrt{ \frac{H}{4 H' m \delta} - \widehat{var}_{H'}(\hat{l}) } } & i = 1, \ldots, H', \\[4pt] 0 & i = H'+1, \ldots, H, \end{cases} \quad (9)

where \widehat{var}_{H'}(\hat{l}) = \frac{1}{H'}\sum_{i=1}^{H'}\left( \frac{1}{H'}\sum_{j=1}^{H'}\hat{l}_j - \hat{l}_i \right)^2 = \frac{1}{H'}\sum_{i=1}^{H'}\hat{l}_i^2 - \left( \frac{1}{H'}\sum_{i=1}^{H'}\hat{l}_i \right)^2 is the variance of the values in H'_ord. We require H' to be such that H/(4 H' m δ) − var̂_{H'}(l̂) > 0, so that Q*_{lin,χ²}(H') is defined, and, for feasibility, q*_{i,lin,χ²}(H') > 0 for i = 1, ..., H'.

Using the closed form expression (9), we can identify the optimal posterior Q*_{lin,χ²} via Algorithm 1.
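For concreteness, here is a minimal sketch of the closed form (9) and of the corresponding optimal bound value (10) on an ordered prefix of size h, assuming a uniform prior over all H classifiers; it can be plugged into the Algorithm 1 sketch given earlier.

```python
import numpy as np

def linear_posterior_on_prefix(sorted_risks, h, H, m, delta):
    """Closed-form weights (9) for the linear distance on the h lowest-risk classifiers."""
    prefix = sorted_risks[:h]
    mean, var = prefix.mean(), prefix.var()
    gap = H / (4.0 * h * m * delta) - var       # must be positive for (9) to be defined
    if gap <= 0:
        return np.full(h, -1.0)                 # signal infeasibility to the caller
    return 1.0 / h + (mean - prefix) / (h * np.sqrt(gap))

def linear_bound_on_prefix(sorted_risks, h, H, m, delta):
    """Optimal bound value (10) for the ordered prefix of size h."""
    prefix = sorted_risks[:h]
    return prefix.mean() + np.sqrt(H / (4.0 * h * m * delta) - prefix.var())
```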
Remark 1. For given values of H, H', m and var̂_{H'}(l̂), the upper bound on δ is related to the sparseness of the optimal posterior Q*_{lin,χ²}. A higher δ diminishes the effect of the divergence term Σ_{i=1}^H q_i²/p_i and allows sparse solutions.

Theorem 5. The bound value of the best posterior Q*_{lin,χ²}(H') on an ordered subset of size H',

B^*_{lin,\chi^2}(H') := B_{lin,\chi^2}\left( Q^*_{lin,\chi^2}(H') \right) = \frac{ \sum_{i=1}^{H'} \hat{l}_i }{ H' } + \sqrt{ \frac{H}{4 H' m \delta} - \widehat{var}_{H'}(\hat{l}) }, \quad (10)

is a decreasing function of H' ≤ H*, the support size of the globally optimal posterior Q*_{lin,χ²}.

The PAC-Bayesian bound for the squared distance, φ_sq(l̂, l) = (l̂ − l)² for l̂, l ∈ [0,1], is identified below. We first need to identify I_{R sq}(m, 2) for a given sample size m. Details of the derivation are in Appendix B.
Lemma 2. For a given sample size m,

I_{R\,sq}(m, 2) := \sup_{l \in [0,1]} \sum_{k=0}^{m} \binom{m}{k} l^k (1-l)^{m-k} \left( \tfrac{k}{m} - l \right)^4,

which admits a closed-form expression in m (derived in Appendix B).

Theorem 6. For a finite set of classifiers H, the PAC-Bayesian bound B_{sq,χ²}(Q) based on the squared distance function with the χ²-divergence measure is given by:

B_{sq,\chi^2}(Q) := \sum_{i=1}^{H} \hat{l}_i q_i + \sqrt{ \left( \sum_{i=1}^{H} \frac{q_i^2}{p_i} \right) \frac{ I_{R\,sq}(m, 2) }{ \delta } }. \quad (11)

Proof. Using the PAC-Bayesian statement in (1) with φ_sq(l̂, l) = (l̂ − l)² for finite H, we obtain B_{sq,χ²}(Q) in (11) using (2) and the constant I_{R sq}(m, 2) identified via Lemma 2.

We want to determine the optimal posterior Q*_{sq,χ²} which minimizes B_{sq,χ²}(Q) over ∆_H. This bound function turns out to be non-convex in Q.

Theorem 7. The bound function B_{sq,χ²}(Q) = Σ_{i=1}^H l̂_i q_i + sqrt( (Σ_{i=1}^H q_i²/p_i) I_{R sq}(m, 2)/δ ) is non-convex.

We show this non-convexity, even when P ∼ Unif(H), via counter examples violating the first order convexity property in Section B.2 in the appendix.

Remark 2. Computationally, this bound minimization problem for (11) is observed to have a single solution. We used the bordered Hessian test to verify that the solution obtained is a local minimum. This motivates the following: the bound B_{sq,χ²}(Q) is shown to be quasi-convex under a condition on the system parameters.

Proposition 1. The bound function B_{sq,χ²}(Q) is strictly quasi-convex if the following condition holds for any Q, Q' and each α ∈ (0,1):

\sqrt{ \frac{ I_{R\,sq}(m,2) }{ \delta } } \left( \sqrt{ \sum_{i=1}^{H} \frac{ (\alpha q_i + (1-\alpha) q'_i)^2 }{ p_i } } - \sqrt{ \sum_{i=1}^{H} \frac{ q_i^2 }{ p_i } } \right) < (1 - \alpha)\left( E_Q[\hat{l}] - E_{Q'}[\hat{l}] \right),

and hence a local minimum of the bound minimization problem for the bound (11) is also a global minimum.

Q^{FP}_{sq,χ²} for uniform prior

For the uniform prior set up, we derive the FP equation for minimizing (11) on an ordered subset support of size H'.

Theorem 8 (Optimal posterior on an ordered subset support). When the prior is the uniform distribution on H, among all posteriors with support a subset of size exactly H', the best posterior, denoted by Q*_{sq,χ²}(H'), has its support on the ordered subset H'_ord = { {l̂_i}_{i=1}^{H'} | l̂_1 ≤ l̂_2 ≤ ... ≤ l̂_{H'} } consisting of the smallest H' values in H. The optimal posterior weights {q*_{i,sq,χ²}(H')} are determined as the solution to the following fixed point equation in {q_i(H')}_{i=1}^{H'}:

q_i(H') = \begin{cases} \dfrac{1}{H'} + \sqrt{ \sum_{j=1}^{H'} \big(q_j(H')\big)^2 } \, \sqrt{ \dfrac{\delta}{ H \, I_{R\,sq}(m,2) } } \left( \dfrac{1}{H'}\sum_{j=1}^{H'} \hat{l}_j - \hat{l}_i \right) & i = 1, \ldots, H', \\[4pt] 0 & i = H'+1, \ldots, H, \end{cases} \quad (12)

under the feasibility condition that q_i(H') > 0 for i = 1, ..., H'.

The proof of this theorem is in Section B.3 in the appendix. In our computations, the FP iterates in (12) do converge to a solution. Using them, we can identify the optimal posterior Q*_{sq,χ²} via Algorithm 1.

The chi-squared divergence based PAC-Bayesian bound using the distance function kl(l̂, l) = l̂ ln(l̂/l) + (1 − l̂) ln((1 − l̂)/(1 − l)) (for l̂, l ∈ [0,1]) is:

B_{kl,\chi^2}(Q) = \sup_{r \in (0,1)} \left\{ r : kl\left( E_Q[\hat{l}], r \right) \le \sqrt{ \frac{ (\chi^2(Q\|P) + 1)\, I_{R\,kl}(m, 2) }{ \delta } } \right\}, \quad (13)

where I_{R kl}(m, 2) := sup_{l ∈ [0,1]} Σ_{k=0}^{m} C(m,k) l^k (1−l)^{m−k} ( kl(k/m, l) )² should be computed first. For m > 1028, the computation is difficult due to storage limitations in the range of floating point numbers. We notice that I_{R kl}(m, 2) decreases with m, and hence we can use I_{R kl}(1028, 2) as an upper approximation for I_{R kl}(m, 2) for m > 1028. Please refer to Table 5 and Figure 3 in Appendix C for details.

kl(·,·) is not a monotone function and so its inverse does not exist. Thus, B_{kl,χ²}(Q) does not have an explicit form. However, we can employ a numerical root finding algorithm, such as that described in [20] (Algorithm KLroots), to obtain B_{kl,χ²}(Q) for given system parameter values.

For a finite classifier space H = {h_i}_{i=1}^H with empirical risk values {l̂_i}_{i=1}^H, the KL-distance bound minimization problem is:

\min_{ (q_1, \ldots, q_H) \in \Delta_H,\; r \in (0,1) } \; r \quad (14a)

\text{s.t.} \quad \left( \sum_{i=1}^{H} \hat{l}_i q_i \right) \ln \frac{ \sum_{i=1}^{H} \hat{l}_i q_i }{ r } + \left( 1 - \sum_{i=1}^{H} \hat{l}_i q_i \right) \ln \frac{ 1 - \sum_{i=1}^{H} \hat{l}_i q_i }{ 1 - r } = \sqrt{ \frac{ \left( \sum_{i=1}^{H} q_i^2 / p_i \right) I_{R\,kl}(m, 2) }{ \delta } }, \quad (14b)

r \ge \sum_{i=1}^{H} \hat{l}_i q_i. \quad (14c)

Here, r is the right root of (14b) for a given E_Q[l̂]. The above is known to be a non-convex problem with a difference of convex (DC) equality constraint (14b), and it has multiple stationary points. This fact is illustrated in our computations, where different initializations led to different stationary points. Using the bordered Hessian test, we verified that the stationary points computed on our datasets by the solver or by the FP equation (15) given below are either local minima or saddle points. The constraint (14c) is a relaxation of a strict inequality, used to ensure a solution in the closed domain. The iterative root finding algorithm adds to the computational complexity of the bound minimization algorithm.

The objective function and constraints of the above bound minimization (14) are either linear or difference of convex (DC) functions, hence it falls into the category of a DC program. We can make use of the convex-concave procedure (CCP), a powerful heuristic method used to find local solutions to DC programming problems [12]. This procedure makes a linear approximation, via a supporting hyperplane, to the second convex function in the DC decomposition. This converts the DC constraint into a convex constraint, and hence the original DC program is reduced to a convex program which can be solved easily. The details of the CCP for solving (14) are in Appendix C.3 and the related computations are in Table 8 in the appendix. Our general observation is that the fixed point scheme that we derive outperforms CCP.

Q^{FP}_{kl,χ²} for uniform prior

We derive the FP equation for (14) for the uniform prior when the support of Q is an ordered subset of size H'. The proof is in Section C.2 of the appendix. This FP equation is used in Algorithm 1 for determining Q^{FP}_{kl,χ²}.

Theorem 9 (Optimal posterior on an ordered subset support). When the prior is the uniform distribution on H, among all posteriors with support a subset of size exactly H', the best posterior, denoted by Q*_{kl,χ²}(H'), has its support on the ordered subset H'_ord = { {l̂_i}_{i=1}^{H'} | l̂_1 ≤ l̂_2 ≤ ... ≤ l̂_{H'} } consisting of the smallest H' values in H. A stationary point Q^{FP}_{kl,χ²}(H') for (14) can be obtained as the solution to the following fixed point equation in {q_i}_{i=1}^{H'}:

q_i = \frac{1}{Z_{kl,\chi^2}} \cdot \frac{ \left( \sum_{j=1}^{H'} q_j^2 \right) \left( \hat{l}_i - \sum_{j=1}^{H'} \hat{l}_j q_j \right) }{ \sqrt{ \frac{ H \left( \sum_{j=1}^{H'} q_j^2 \right) I_{R\,kl}(m, 2) }{ \delta } } \; \ln\!\left( \frac{ (1 - r) \sum_{j=1}^{H'} \hat{l}_j q_j }{ r \left( 1 - \sum_{j=1}^{H'} \hat{l}_j q_j \right) } \right) }, \quad (15)

for i = 1, ..., H', where Z_{kl,χ²} is a suitable normalization constant and r is the solution to (14b) and (14c) for a given Q = (q_1, ..., q_{H'}) ∈ int(∆_{H'}).
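Since kl(L̂, ·) is not invertible in closed form, the bound value appearing in (13)-(15) has to be found numerically. The following is a minimal sketch (not the KLroots routine of [20], just a plain bisection) that finds the right root r of kl(L̂, r) = c on (L̂, 1), using the fact that kl(L̂, r) is increasing in r on that interval.

```python
import math

def kl_bernoulli(p, q):
    """kl(p, q) = p ln(p/q) + (1-p) ln((1-p)/(1-q)) for p, q in (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_right_root(l_hat, c, tol=1e-12):
    """Right root r in (l_hat, 1) of kl(l_hat, r) = c, via bisection.

    Returns 1.0 (the trivial bound) if even r close to 1 does not reach c.
    Assumes l_hat in (0, 1) and c > 0.
    """
    lo, hi = l_hat, 1.0 - 1e-12
    if kl_bernoulli(l_hat, hi) <= c:
        return 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(l_hat, mid) <= c:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Example: averaged empirical risk 0.1 with complexity term c = 0.05
# print(kl_right_root(0.1, 0.05))   # approximately 0.22
```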
For computations, we included nine datasets from the UCI repository [5] with a small to moderate number of examples (306 to 5463 examples) and a small to moderate number of features (3 to 57 features). The details about the number of features, number of examples and class distribution of these datasets are listed in Table 2 [22]. These datasets span a variety ranging from almost linearly separable (Banknote, Mushroom and Waveform datasets) to moderately inseparable (Wdbc, Mammographic and Ionosphere datasets) to inseparable data (Spambase, Bupa and Haberman datasets). SVMs on these datasets have varying ranges and degrees of variation in their empirical risk values.

We consider a finite set of SVM regularization parameter values Λ = {λ_i}_{i=1}^H, say, between 0 and an upper bound, since small values of λ_i are preferable. The set Λ is an arithmetic-geometric progression (AGP) with a logarithmic scale for λ ∈ (0, 0.1) and a linear scale for λ ≥ 0.1. The logarithmic subset of Λ is a union of 3 geometric series, each with elements truncated between a small positive lower bound and 0.1. We use a lower bound slightly away from 0, since the SVM classifiers become very close (almost indistinguishable) and give similar training and test errors for very small values of λ which are almost zero. For values beyond 0.1, we use an arithmetic progression with a spacing of 0.05. We use an upper bound λ = 5 on this arithmetic series, since on most datasets SVMs generated by λ ≥ 5 do not have good training and test error rates. (A graphical illustration of the ranges and variation of training errors and test errors of the nine UCI datasets we considered on the chosen Λ range is depicted in Section G in [22].)

The SVM QP (with RBF kernels) was implemented using the ksvm function in the kernlab package [9] in R (version 3.1.3 (2015-03-09)). The Gaussian width parameter was estimated by kernlab using the sigest function, which estimates the 0.1 and 0.9 quantiles of the squared distance between the data points.

Each of these datasets was partitioned such that 80% of the examples formed a composite of a training set and a validation set (in equal proportion), used for constructing the set H = {h(λ_i) | λ_i ∈ Λ}_{i=1}^H of SVM classifiers, and the remaining 20% was used for computing their test error rates. The training set size (m), validation set size (v) and test set size (t) are in the ratio m : v : t = 0.4 : 0.4 : 0.2. The role of the validation set is to compute the empirical risk l̂_i of the SVM h(λ_i) ∈ H, which is used for deriving the PAC-Bayesian bound. We follow the scheme provided in [3, 25] to generate the set H: each classifier h(λ_i) ∈ H is trained on m training examples subsampled from this composite set and validated on the remaining v examples. Overlaps between the training sets of different classifiers are allowed; the same is true for their validation sets. (For further details about the dataset categorization, please refer to Section G.1 in [22].)
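A minimal sketch of such a mixed logarithmic/linear grid of regularization parameter values is given below; the lower bound, geometric ratio and counts used in the paper are not reproduced here, so the default values are placeholders for illustration only.

```python
import numpy as np

def lambda_grid(low=1e-4, ratio=2.0, lin_step=0.05, lin_max=5.0):
    """Mixed grid: geometrically spaced values in (0, 0.1), linearly spaced in [0.1, lin_max]."""
    log_part = []
    lam = low
    while lam < 0.1:
        log_part.append(lam)
        lam *= ratio                                        # geometric progression below 0.1
    lin_part = np.arange(0.1, lin_max + 1e-9, lin_step)      # arithmetic progression, spacing 0.05
    return np.unique(np.concatenate([np.array(log_part), lin_part]))

# Example: grid = lambda_grid(); one SVM h(lambda_i) is then trained per grid value.
```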
Dataset (number of features n; number of examples, Pos/Neg; training set size m; validation set size v; test set size t):
Spambase: 57; 4601 (2788/1813); 1840; 1840; 921
Bupa
Mammographic
Wdbc: 30; 569 (357/212); 227; 227; 115
Banknote
Mushroom: 22 (b); 5643 (c)
Ionosphere: 34; 351 (225/126); 140; 140; 71
Waveform: 40; 3308 (d)
Haberman

(b) after one-hot encoding for categorical features. (c) after removing the rows with missing values from the data. (d) number of examples when class '0' is removed.
Table 2: Details of the various UCI datasets used for computational experiments (Table 5 in Section G of [22]). We list the number of features n and the total number of examples, with their distribution into positive and negative classes, for each dataset. We also give the number of examples in the training, validation and test sets, according to the random partition created by the 0.4 : 0.4 : 0.2 split of the total dataset size.

The PAC-Bayesian bound minimization for finding the optimal posterior was implemented through the AMPL interface and solved using the Ipopt software package (version 3.12 (2016-05-01)) [27], a library for large-scale nonlinear optimization (http://projects.coin-or.org/Ipopt). All the computations were done on a machine equipped with 4 Intel Xeon 2.13 GHz cores and 64 GB RAM.
We present a comparative study of the χ²-divergence based optimal posteriors considered in this paper with the optimal posteriors for KL-divergence based PAC-Bayesian bounds considered by [21]. We also analyze the performance of these stochastic SVM classifiers governed by PAC-Bayesian posteriors with respect to the deterministic SVM classifier identified via the cross-validation procedure.

Given a set of base classifiers and having computed their empirical risk values, we observe the following differences and similarities between the optimal PAC-Bayesian posteriors due to the two divergence functions on this classifier set.
Table 3: PAC-Bayesian bounds and averaged test error rates for Q*_{φ,χ²}. We compare the bound values B*_{φ,χ²} and the average test error rates T_{φ,χ²} of the optimal posteriors Q*_{φ,χ²} for three distance functions: KL-divergence kl, linear distance φ_lin and squared distance φ_sq. For large sample size (m ≥ 1028), the constant I_{R kl}(m, 2) cannot be computed due to floating point storage limitations on the machine, so we use an upper approximation: for m > 1028, I_{R kl}(m, 2) ≤ I_{R kl}(1028, 2), since I_{R kl}(m, 2) is decreasing with m. Please see Table 7 in Appendix D for details. ⋆ refers to values obtained using the fixed point equation because the solver Ipopt does not converge to a solution for reasons like local infeasibility, Restoration Phase Failed, etc. Please see Appendix D for more such examples. The lowest 10% bound values and test error rates for each dataset are denoted in bold face. For a given posterior, we measure sparsity by the number of classifiers needed by its cumulative distribution function (CDF) to achieve a certain significance level. Concentration of a posterior is quantified in terms of its ℓ2 norm, which is equivalent to the HHI score used for measuring the market share of a firm in an industry [7, 28]. For almost separable datasets, the CDFs of the three posteriors are close to each other. Q*_{kl,χ²} has almost full support and low concentration. These posteriors give extremely loose bounds and are computationally expensive, but have test error rates better than the linear distance ones, Q*_{lin,χ²}, on most datasets. Squared distance based posteriors, Q*_{sq,χ²}, are sparse with relatively high concentration (relatively high ℓ2 norm) and have the tightest bounds and lowest test error rates. The contrast in test error rates is striking when the dataset yields classifiers with high variation in empirical risk values. See Section D.2 in the appendix for details.

Dataset: PAC-Bayesian bound B*_{lin,χ²}, B*_{sq,χ²}, B*_{kl,χ²}; average test error T_{lin,χ²}, T_{sq,χ²}, T_{kl,χ²}
Spambase 0.38054 ⋆ ⋆
Bupa 0.82183 ⋆ ⋆
Mammographic 0.50276 ⋆ ⋆
Wdbc 0.41631 ⋆ ⋆
Banknote 0.22283 ⋆ ⋆
Mushroom 0.10785 ⋆ ⋆
Ionosphere 0.64273 ⋆ ⋆
Waveform 0.18565
Haberman 0.70477 ⋆ ⋆

(i) Nature of optimal posteriors: Both KL-divergence and χ²-divergence based optimal posteriors exhibit a decreasing trend with respect to the empirical risk values; that is, the higher the empirical risk of a classifier, the lower its optimal posterior weight. However, the rate at which these posterior weights decrease is influenced by the choice of divergence function in the PAC-Bayesian bound. In the case of KL-divergence based PAC-Bayesian bounds, the optimal posterior weights decrease exponentially with the empirical risk values, whereas the optimal posteriors that minimize χ²-divergence based PAC-Bayesian bounds have linearly decreasing weights with respect to the empirical risk values.

When the prior is the uniform distribution, the optimal posterior weights in both cases are determined directly by the empirical risk values (the prior weights play no role). Consider the case of the linear distance function with χ²-divergence, whose optimal posterior weights are determined in Theorem 4 and given by Equation (9). Similarly, when we have the linear distance function with KL-divergence, the optimal posterior weights are given as (Equation (21) in [21]):

q^*_{i,lin,KL} = \frac{ p_i e^{-2 m \hat{l}_i} }{ \sum_{j=1}^{H} p_j e^{-2 m \hat{l}_j} } \quad \forall\, i = 1, \ldots, H. \quad (16)

Using a uniform prior, p_i = 1/H, in the above, we get the following expression for the optimal posterior weights, which depend on the empirical risk values l̂_i on an exponential scale:

q^*_{i,lin,KL} = \frac{ e^{-2 m \hat{l}_i} }{ \sum_{j=1}^{H} e^{-2 m \hat{l}_j} } \quad \forall\, i = 1, \ldots, H. \quad (17)

(ii) Size of the support set: KL-divergence based optimal posteriors take into account all the classifiers in the base set, though the posterior weight associated with a high risk classifier is infinitesimally small. On the other hand, χ²-divergence based optimal posteriors select only a strict subset of the base classifiers, comprising the ones with low empirical risk values.

(iii) Test set performance: Optimal posteriors for the KL-divergence based bounds have relatively lower test error rates than their χ²-divergence based counterparts. This hints at an underlying regularization phenomenon involving the support set and the tightness of the bound. The χ²-divergence based posteriors might be overfitting because they concentrate on a strict subset support of classifiers. In contrast, the KL-divergence based posteriors have the whole classifier set as their support and a better test set performance. This phenomenon is supported by the fact that KL-divergence based optimal posteriors yield tighter bounds than the ones derived for the case of χ²-divergence.
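To see the two decay profiles side by side, here is a small illustrative sketch (uniform prior assumed): the KL-based weights follow the Gibbs form (17), and the χ²-based weights follow the linear form (9). For simplicity the χ²-based sketch uses the largest feasible ordered prefix rather than the bound-optimal support size H* of Algorithm 1.

```python
import numpy as np

def kl_based_weights(emp_risks, m):
    """Gibbs-type weights of (17): exponential decay in the empirical risk."""
    r = np.asarray(emp_risks, float)
    w = np.exp(-2.0 * m * (r - r.min()))        # shift for numerical stability; normalization cancels it
    return w / w.sum()

def chi2_linear_weights(emp_risks, H, m, delta):
    """Weights of (9) on the largest feasible ordered prefix: linear decay in the risk.

    Returned weights are in sorted-risk order, padded with zeros outside the support.
    """
    risks = np.sort(np.asarray(emp_risks, float))
    for h in range(len(risks), 0, -1):          # shrink the support until (9) is feasible
        prefix = risks[:h]
        gap = H / (4.0 * h * m * delta) - prefix.var()
        if gap > 0:
            q = 1.0 / h + (prefix.mean() - prefix) / (h * np.sqrt(gap))
            if np.all(q > 0):
                return np.concatenate([q, np.zeros(len(risks) - h)])
    return None
```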
We performed 5-fold cross-validation (CV) on the datasets by setting aside 20% of the data as a test set for each dataset. The set of λ values is an arithmetic-geometric progression (AGP) with a logarithmic scale for λ ∈ (0, 0.1) and a linear scale for λ ≥ 0.1; this is the same set of λ values that was used for the proposed PAC-Bayesian technique. We report the test error of the "best" λ identified by the CV method and compare it with the PAC-Bayesian method using the (sq, χ²) pair (since this pair gives the lowest test error obtained by using the different distance functions). These values are reported in Table 4 below. In terms of relative test error, the CV method is significantly better than the proposed method on the Spambase, Bupa, Ionosphere and Haberman datasets, while the proposed PAC-Bayesian method is significantly better than the CV method on the Wdbc dataset. The difference in the two test errors is small (less than 20%) on the other datasets.

Thus, the CV method has better test error performance than the χ²-divergence based PAC-Bayesian posterior. But the CV method takes a remarkably longer time (between 2 to 10 hours) to identify the "best" λ than the time taken by the PAC-Bayesian method to identify the optimal posterior on the classifier space (which generally takes about 20 to 300 seconds). Thus, the computational complexity of the CV method is much higher than that of the PAC-Bayesian method.

Dataset: λ*, CV test error, sq-χ² PAC-Bayes test error, ∆ test error, relative test error, sq-χ² PAC-Bayes bound
Spambase ⋆
Bupa ⋆
Mammographic
Wdbc •
Banknote
Mushroom
Ionosphere ⋆
Waveform
Haberman ⋆

Table 4: Comparison of the cross-validation (CV) based deterministic SVM classifier with the sq-χ² PAC-Bayesian stochastic classifier. The set of λ values is an AGP with a logarithmic scale for λ ∈ (0, 0.1) and a linear scale for λ ≥ 0.1. ∆ test error is the amount by which the test error of the CV method is smaller or larger than that of the PAC-Bayesian method for the sq-χ² pair. Relative test error is the ratio of ∆ test error to the CV test error, signifying the relative difference between the test errors of the two methods. • denotes the instances where the PAC-Bayesian method has a significantly lower test error rate than the CV method, while ⋆ denotes the instances where the CV method has a significantly lower test error rate.

Apart from the computational benefits, the proposed PAC-Bayesian method also has statistical advantages over the CV method, as noted below:

(i) Sample robustness: The CV method is not as sample robust as the PAC-Bayesian method, even though it trains on multiple sub-samples (partitions) and reports the averaged training error as the CV error for choosing the best λ. For example, the 5-fold CV that we performed uses multiple training samples with a diminished sample size of 20% of the dataset size, whereas the PAC-Bayesian method uses a single and much larger training sample (60% of the dataset size in our scheme) to give an upper bound on the true risk.

(ii) Point estimate versus interval estimate: The CV method gives a point estimate of the true risk by averaging the CV error over multiple folds, but there are no guarantees associated with it. On the other hand, the PAC-Bayesian method gives an interval estimate of the form [0, Bnd], where Bnd denotes the upper bound given by the PAC-Bayesian theorem, and it intrinsically has a high-probability guarantee associated with it.
(iii) Deterministic versus stochastic classifier: The CV method outputs a deterministic classifier in terms of the "best" λ value, which has good test performance. The PAC-Bayesian technique is a committee method that outputs an optimal distribution on the set of classifiers, which yields a stochastic classifier. The classifier determined by the CV method may have better performance on a single test set, but the stochastic classifier obtained via the PAC-Bayesian technique will have comparable performance when used on multiple test set instances.

(iv) An upper bound on the true risk versus its point estimate: The PAC-Bayesian method gives a tight upper bound on the true risk of the stochastic classifier for all datasets. This upper bound holds even for the "best" deterministic classifier obtained by the CV method, as can be seen in Table 4 above. Thus the high-probability PAC-Bayesian upper bound is a more useful quantity than the estimate of the true risk given by the CV test error, which can be under-biased or over-biased depending on the training sample and the folds created from this sample for cross-validation.

These advantages strengthen the usefulness of the PAC-Bayesian method for constituting a stochastic classifier, and perhaps in tuning hyperparameters of other classification algorithms.
We determine optimal posteriors for the PAC-Bayesian bound minimization problem with bounds derived using the χ²-divergence function. The distance functions that we considered are: the linear distance, the squared distance (a second degree polynomial) and the KL-divergence (an infinite degree polynomial). We first show that, in the uniform prior set up, minimizers of these PAC-Bayesian bounds can be obtained by a restricted search on subsets of the classifier set ordered by empirical risks. The bound minimization problem for the linear distance case is shown to be a convex program, and we also derive a closed form expression for its optimal posterior, while the other two distance functions result in non-convex programs. We further show that the squared distance results in a quasi-convex bound under certain conditions, and it is computationally observed to have a single local minimum. We propose a convergent and computationally cheap fixed point based approach to identify the optimal posteriors for these bound minimization problems.

Our computational exercise is comprehensive. The nine UCI datasets we have considered take into account small to moderate numbers of examples and features, balanced and imbalanced classes, and different ranges and variances in the empirical risk values. Using this set of datasets helps us compare and understand the performance of optimal PAC-Bayesian posteriors due to different distance functions for a given divergence function, and also across the KL- and χ²-divergence functions.

Based on the computations on SVM classifiers, we observe that the squared distance based posteriors perform the best among the three distance functions in terms of bound values as well as average test error rates. The optimal posteriors for linear and squared distances have subset support, especially as the size of the classifier set increases. On the other hand, KL-distance based posteriors usually have full support but do not perform well on the test set. This could be because they overfit the data while training. These chi-squared divergence based optimal posteriors do not have a high measure of concentration, implying less bias towards classifiers with low empirical risks.

Comparing with KL-divergence based optimal PAC-Bayesian posteriors, we observe that both groups of PAC-Bayesian posteriors have weights decreasing with respect to the empirical risk values of the classifiers. The difference lies in the rate of decrease: KL-divergence based posterior weights decrease exponentially, whereas χ²-divergence based ones show a linear decrease. Also, the former have the full classifier set as their support, while the latter usually pick a strict subset of the base classifiers as their support. On test set performance, KL-divergence based posteriors are better than those based on χ²-divergence. This phenomenon hints at an underlying regularization by the KL-divergence based posteriors; perhaps, an implicit regularization.

We also provide a comparison of these PAC-Bayesian posteriors with the widely used cross-validation procedure as the baseline case. While CV has lower test errors, the PAC-Bayesian method has the advantages of sample robustness and a much lower computational cost over the cross-validation method.
Also, it provides a reliable high probability upper bound on the true risk, rather than the point estimate given by the cross-validation method.

The significance of this work is in understanding the importance of choosing a divergence function for the PAC-Bayesian bound and its influence on the resulting optimal posteriors which are used to design stochastic classifiers. The challenges associated with this study are: deducing the form of the bound for a given distance function, identifying the nature of the corresponding bound minimization problem, and obtaining a closed-form expression or a fixed point equation for the optimal posterior which minimizes this bound. Another challenge is identifying the support set of the optimal posterior for an arbitrary prior distribution on the classifier set. Therefore we considered the uniform prior, which provided us with a structure to address the problem of identifying the support set in a linear fashion. On the computational side, we had to carefully choose the set of regularization parameter values to be used for generating the base SVMs, since this greatly influences the performance of the stochastic SVMs built on them. To achieve a right mix of base classifiers with good risks, we considered an arithmetic-geometric progression of the regularization parameter values in the interval (0, 5].

As part of future work, we can extend our results to a non-uniform prior on the classifier space, where we do not have such a structure on the feasible region and may need to do a full simplex search. We would also like to understand the nature of our results on high dimensional datasets such as image segmentation [5], where each image is represented in matrix form rather than as a vector or a record.

References

[1] Amiran Ambroladze, Emilio Parrado-Hernández, and John Shawe-Taylor. Tighter PAC-Bayes bounds. In Neural Information Processing Systems, pages 9-16, 2006.
[2] M. S. Bazaraa, H. D. Sherali, and C. M. Shetty. Nonlinear Programming: Theory and Algorithms. Wiley, 2013.
[3] Luc Bégin, Pascal Germain, François Laviolette, and Jean-Francis Roy. PAC-Bayesian bounds based on the Rényi divergence. In Artificial Intelligence and Statistics, pages 435-444, 2016.
[4] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[5] Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017.
[6] Pascal Germain, Alexandre Lacasse, François Laviolette, and Mario Marchand. PAC-Bayesian learning of linear classifiers. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 353-360, 2009.
[7] Albert O. Hirschman. National Power and the Structure of Foreign Trade. University of California Press, 1945.
[8] Anatoli Juditsky. Lecture notes in convex optimization: Theory and algorithms. https://ljk.imag.fr/membres/Anatoli.Iouditski/, November 2015.
[9] Alexandros Karatzoglou, Alex Smola, Kurt Hornik, and Achim Zeileis. kernlab - An S4 Package for Kernel Methods in R, volume 11, 2004.
[10] John Langford. Tutorial on practical prediction theory for classification. Journal of Machine Learning Research, 6(Mar):273-306, 2005.
[11] John Langford and John Shawe-Taylor. PAC-Bayes & margins. In Neural Information Processing Systems (NIPS), pages 439-446, 2002.
[12] Thomas Lipp and Stephen Boyd. Variations and extension of the convex-concave procedure. Optimization and Engineering, 17(2):263-287, 2016.
[13] Andreas Maurer. A note on the PAC Bayesian theorem. CoRR, cs.LG/0411099, 2004.
[14] David McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51(1):5-21, 2003.
[15] David McAllester. A PAC-Bayesian tutorial with a dropout bound. arXiv preprint arXiv:1307.2118, 2013.
[16] David McAllester and Takintayo Akinbiyi. PAC-Bayesian theory. In Bernhard Schölkopf, Zhiyuan Luo, and Vladimir Vovk, editors, Empirical Inference: Festschrift in Honor of Vladimir N. Vapnik, chapter 10, pages 95-103. Springer Science and Business Media, 2013.
[17] David A. McAllester. Some PAC-Bayesian theorems. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, COLT '98, pages 230-234, New York, NY, USA, 1998. ACM.
[18] David A. McAllester. PAC-Bayesian model averaging. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, COLT '99, pages 164-170, New York, NY, USA, 1999. ACM.
[19] Emilio Parrado-Hernández, Amiran Ambroladze, John Shawe-Taylor, and Shiliang Sun. PAC-Bayes bounds with data dependent priors. Journal of Machine Learning Research, 13:3507-3531, 2012.
[20] Puja Sahu and Nandyala Hemachandra. Some new PAC-Bayesian bounds and their use in selection of regularization parameter for linear SVMs. In Conference on Data Science and Management of Data, pages 240-248, 2018. DOI: 10.1145/3152494.3152514.
[21] Puja Sahu and Nandyala Hemachandra. Optimal PAC-Bayesian posteriors for stochastic classifiers and their use for choice of SVM regularization parameter. In Wee Sun Lee and Taiji Suzuki, editors, The 11th Asian Conference on Machine Learning, volume 101 of Proceedings of Machine Learning Research, pages 268-283, Nagoya, Japan, 17-19 Nov 2019. PMLR.
[22] Puja Sahu and Nandyala Hemachandra. Optimal PAC-Bayesian posteriors for stochastic classifiers and their use for choice of SVM regularization parameter, 2019. https://arxiv.org/abs/1912.06803.
[23] Matthias Seeger. The proof of McAllester's PAC-Bayesian theorem. In Neural Information Processing Systems, 2002.
[24] Yevgeny Seldin, François Laviolette, Nicolò Cesa-Bianchi, John Shawe-Taylor, and Peter Auer. PAC-Bayesian inequalities for martingales. IEEE Transactions on Information Theory, 58(12):7086-7093, 2012.
[25] Niklas Thiemann, Christian Igel, Olivier Wintenberger, and Yevgeny Seldin. A strongly quasiconvex PAC-Bayesian bound. In Algorithmic Learning Theory, pages 466-492, 2017.
[26] Tim van Erven and Peter Harremoës. Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 60(7):3797-3820, 2014.
[27] Andreas Wächter and Lorenz T. Biegler. On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. Mathematical Programming, 106(1):25-57, 2006.
[28] Wikipedia contributors. Herfindahl index, Wikipedia, the free encyclopedia, 2019. [Online; accessed 25-May-2019].
A Optimal PAC-Bayesian Posterior using Linear Distance Function

As a basic case, we can consider the linear distance function, $\phi_{\mathrm{lin}}(\hat l, l) = l - \hat l$ for $\hat l, l \in [0,1]$. The PAC-Bayesian bound in this case takes the following simplified form [3]:

$P_S\Big\{ \mathbb{E}_Q[l] \le \mathbb{E}_Q[\hat l] + \sqrt{\tfrac{\chi^2(Q\Vert P)+1}{4m\delta}} \Big\} \ge 1-\delta.$    (18)

Thus, the upper bound on the true risk of a stochastic classifier governed by a distribution Q, when using the linear distance function with χ²-divergence between prior and posterior, is:

$B_{\mathrm{lin},\chi^2}(Q) = \sum_{i=1}^{H} \hat l_i q_i + \sqrt{\frac{\sum_{i=1}^{H} q_i^2/p_i}{4m\delta}}.$    (19)

A.1 The bound minimization problem
The corresponding bound minimization problem is:

$\min_{Q=(q_1,\ldots,q_H)} \; \sum_{i=1}^{H}\hat l_i q_i + \sqrt{\frac{\sum_{i=1}^{H} q_i^2/p_i}{4m\delta}}$ subject to $\sum_{i=1}^{H} q_i = 1$, $q_i \ge 0$ for all $i = 1,\ldots,H$.    (20)

We are interested in the distribution $Q^*_{\mathrm{lin},\chi^2}$ which is optimal for the above bound minimization problem, since it corresponds to the tightest PAC-Bayesian upper bound on the true risk of a stochastic classifier.

Theorem 10. The bound function $B_{\mathrm{lin},\chi^2}(Q) = \sum_{i=1}^{H}\hat l_i q_i + \sqrt{\frac{\sum_{i=1}^{H} q_i^2/p_i}{4m\delta}}$ is a strictly convex function, and hence the optimization problem (20) is a convex program with a unique global minimum.

Proof. $B_{\mathrm{lin},\chi^2}(Q)$ is a differentiable function of $Q = (q_i)_{i=1}^H$, so it suffices to verify the first order condition $B_{\mathrm{lin},\chi^2}(Q') \ge B_{\mathrm{lin},\chi^2}(Q) + \langle \nabla B_{\mathrm{lin},\chi^2}(Q), Q'-Q\rangle$ for any pair of distributions $Q, Q'$. Writing out both sides, the linear terms $\sum_i \hat l_i q_i'$ cancel and the condition reduces to

$\sqrt{\sum_i q_i'^2/p_i} \;\ge\; \frac{\sum_i q_i q_i'/p_i - \sum_i q_i^2/p_i}{\sqrt{\sum_i q_i^2/p_i}} + \sqrt{\sum_i q_i^2/p_i} \;\;\Longleftrightarrow\;\; \sqrt{\sum_i \frac{q_i'^2}{p_i}}\,\sqrt{\sum_i \frac{q_i^2}{p_i}} \;\ge\; \sum_i \frac{q_i q_i'}{p_i}.$    (21)

By the Cauchy–Schwarz inequality, (21) holds for any pair of distributions $Q, Q'$ and any given prior $P$, with equality if and only if $Q \equiv Q'$. This implies that the bound function $B_{\mathrm{lin},\chi^2}(Q)$ is strictly convex; therefore the optimization problem (20) has a unique global minimum.

A.2 The optimal posterior, $Q^*_{\mathrm{lin},\chi^2}$, via partial KKT system
Theorem 11. The global minimum of the bound minimization problem (20) can be obtained via

$q^*_{i,\mathrm{lin},\chi^2} = p_i\left(1 + \frac{\sum_{j=1}^{H}\hat l_j p_j - \hat l_i}{\sqrt{\frac{1}{4m\delta} - \widehat{\mathrm{var}}_P(\hat l)}}\right), \quad i = 1,\ldots,H,$    (22)

if the following two conditions are satisfied:

$q^*_{i,\mathrm{lin},\chi^2} \ge 0 \quad \forall\, i = 1,\ldots,H,$    (23)

$\frac{1}{4m\delta} > \widehat{\mathrm{var}}_P(\hat l),$    (24)

where $\widehat{\mathrm{var}}_P(\hat l) := \sum_{i=1}^{H} p_i\big(\sum_{j=1}^{H}\hat l_j p_j - \hat l_i\big)^2$ is the variance of the empirical risk values $\hat l_i$ under the prior distribution $P$.

Proof. The Lagrangian function corresponding to the optimization problem (20), ignoring the positivity constraints, is

$\mathcal{L}_{\mathrm{lin},\chi^2}(Q,\mu) := \sum_{i}\hat l_i q_i + \sqrt{\frac{\sum_{i} q_i^2/p_i}{4m\delta}} - \mu\Big(\sum_{i} q_i - 1\Big).$    (25)

At optimality, the posterior $Q$ sets the derivatives of $\mathcal{L}_{\mathrm{lin},\chi^2}(Q,\mu)$ to zero. Setting $\partial\mathcal{L}_{\mathrm{lin},\chi^2}/\partial q_i = 0$ gives

$\hat l_i + \frac{1}{2\sqrt{m\delta}}\,\frac{q_i/p_i}{\sqrt{\sum_j q_j^2/p_j}} - \mu = 0 \;\Longrightarrow\; \frac{q_i}{\sqrt{\sum_j q_j^2/p_j}} = 2\sqrt{m\delta}\,(\mu-\hat l_i)\,p_i.$    (26)

The primal feasibility condition $\partial\mathcal{L}_{\mathrm{lin},\chi^2}/\partial\mu = 0$, i.e. $\sum_i q_i = 1$, together with (26) summed over $i$ gives

$2\sqrt{m\delta}\,\sqrt{\sum_j q_j^2/p_j}\,\Big(\mu - \sum_i\hat l_i p_i\Big) = 1 \;\Longrightarrow\; \mu = \sum_i\hat l_i p_i + \frac{1}{2\sqrt{m\delta}\sqrt{\sum_j q_j^2/p_j}}.$    (27)

Combining (26) and (27), we obtain the relation $q_i = p_i + 2\sqrt{m\delta}\sqrt{\sum_j q_j^2/p_j}\; p_i\big(\sum_j \hat l_j p_j - \hat l_i\big)$ for all $i$. Using the transformation $z_i = q_i/p_i - 1$, and noting that $\sum_j p_j z_j = 0$ and $\sum_j q_j^2/p_j = 1 + \sum_j p_j z_j^2$, this system can be solved explicitly:

$z^*_i = \frac{\sum_{j}\hat l_j p_j - \hat l_i}{\sqrt{\frac{1}{4m\delta} - \widehat{\mathrm{var}}_P(\hat l)}}, \quad i = 1,\ldots,H,$

which implies

$q^*_{i,\mathrm{lin},\chi^2} = p_i\left(1 + \frac{\sum_{j}\hat l_j p_j - \hat l_i}{\sqrt{\frac{1}{4m\delta} - \widehat{\mathrm{var}}_P(\hat l)}}\right), \quad i = 1,\ldots,H.$    (28)

For $q^*_{i,\mathrm{lin},\chi^2}$ to be a real number, we need $\frac{1}{4m\delta} - \widehat{\mathrm{var}}_P(\hat l) > 0$.
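For concreteness, the closed form (22) is directly computable from the empirical risks, the prior weights and the parameters m and δ. The following Python sketch (the function and variable names are ours, not from the paper) evaluates the candidate posterior and signals when the validity conditions (23)-(24) fail.

import numpy as np

def optimal_posterior_lin_chi2(emp_risks, prior, m, delta):
    """Closed-form candidate posterior of Theorem 11 (linear distance, chi^2 divergence).

    Returns None when condition (24) fails (1/(4*m*delta) <= var_P(l_hat)) or
    when some weight would be negative (condition (23)); a sketch only.
    """
    l_hat = np.asarray(emp_risks, dtype=float)
    p = np.asarray(prior, dtype=float)
    mean_p = float(np.dot(p, l_hat))                 # sum_i p_i * l_hat_i
    var_p = float(np.dot(p, (l_hat - mean_p) ** 2))  # variance of risks under the prior
    gap = 1.0 / (4.0 * m * delta) - var_p
    if gap <= 0:                                     # condition (24) violated
        return None
    q = p * (1.0 + (mean_p - l_hat) / np.sqrt(gap))
    if np.any(q < 0):                                # condition (23) violated
        return None
    return q                                         # sums to 1 by construction

Weights come out larger for classifiers whose empirical risk is below the prior-weighted mean, and the vector sums to one automatically because the deviations from the mean cancel under the prior.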
Remark 3. We suspect that this upper bound on δ (via condition (24)) is related to the sparseness of the optimal posterior that minimizes the bound $B_{\mathrm{lin},\chi^2}(Q) = \sum_i\hat l_i q_i + \sqrt{\frac{\sum_i q_i^2/p_i}{4m\delta}}$. A higher δ diminishes the effect of the divergence $\chi^2(Q\Vert P) = \sum_i q_i^2/p_i - 1$; hence it allows sparse solutions (where some components of the posterior $Q$ take value zero), which have a higher divergence from the prior than a non-sparse solution.

A.3 Optimal posterior, $Q^*_{\mathrm{lin},\chi^2}$, for uniform prior

Theorem 12 (Optimal posterior on an ordered subset support). When the prior is the uniform distribution on $\mathcal{H}$, among all the posteriors with support a subset of $\mathcal{H}$ of size exactly $H'$, the optimal/best posterior, denoted $Q^*_{\mathrm{lin},\chi^2}(H')$, has support on the ordered subset $\mathcal{H}'_{\mathrm{ord}} = \{\hat l_1 \le \hat l_2 \le \ldots \le \hat l_{H'}\}$ consisting of the smallest $H'$ values in $\mathcal{H}$. The optimal posterior weights are

$q^*_{i,\mathrm{lin},\chi^2}(H') = \frac{1}{H'}\left(1 + \frac{\frac{1}{H'}\sum_{j=1}^{H'}\hat l_j - \hat l_i}{\sqrt{\frac{H}{4H'm\delta} - \widehat{\mathrm{var}}_{H'}(\hat l)}}\right)$ for $i = 1,\ldots,H'$, and $q^*_{i,\mathrm{lin},\chi^2}(H') = 0$ for $i = H'+1,\ldots,H$,    (29)

where $\widehat{\mathrm{var}}_{H'}(\hat l) = \frac{1}{H'}\sum_{i=1}^{H'}\big(\frac{1}{H'}\sum_{j=1}^{H'}\hat l_j - \hat l_i\big)^2 = \frac{1}{H'}\sum_{i=1}^{H'}\hat l_i^2 - \big(\frac{1}{H'}\sum_{i=1}^{H'}\hat l_i\big)^2$ is the variance of the values in $\mathcal{H}'_{\mathrm{ord}}$. We assume that the subset size $H'$ is such that $\frac{H}{4H'm\delta} - \widehat{\mathrm{var}}_{H'}(\hat l) > 0$, so that $Q^*_{\mathrm{lin},\chi^2}(H')$ is defined, and for feasibility we require $q^*_{i,\mathrm{lin},\chi^2}(H') > 0$ for $i = 1,\ldots,H'$.

Proof. Under the uniform prior setup, that is, when $p_i = 1/H$ for all $i = 1,\ldots,H$, we obtain $Q^*_{\mathrm{lin},\chi^2}(H')$ using the partial KKT system (ignoring the positivity constraints) and the proof technique of Theorem 11 above, restricted to the support $\mathcal{H}'_{\mathrm{ord}}$.

Theorem 13. The bound value of the optimal posterior $Q^*_{\mathrm{lin},\chi^2}(H')$ on an ordered subset of size $H'$,

$B^*_{\mathrm{lin},\chi^2}(H') := \frac{1}{H'}\sum_{i=1}^{H'}\hat l_i + \sqrt{\frac{H}{4H'm\delta} - \widehat{\mathrm{var}}_{H'}(\hat l)},$    (30)

is a decreasing function of $H'$ for $H' \le H^*$. Here $H^*$ is the size of the ordered subset which forms the support of the globally optimal posterior $Q^*_{\mathrm{lin},\chi^2}$.

Proof. Consider an ordered subset of size $H' \in \{1,\ldots,H\}$ such that $\frac{H}{4H'm\delta} - \widehat{\mathrm{var}}_{H'}(\hat l) > 0$, so that the posterior $Q^*_{\mathrm{lin},\chi^2}(H')$ given in (29) is defined. Suppose further that all the elements of this posterior are positive, so that it is feasible and hence the optimal posterior on the considered ordered subset of size $H'$.
The bound value at this optimal posterior can be computed as

$B^*_{\mathrm{lin},\chi^2}(H') := B_{\mathrm{lin},\chi^2}\big(Q^*_{\mathrm{lin},\chi^2}(H')\big) = \sum_{i=1}^{H'}\hat l_i\, q^*_{i,\mathrm{lin},\chi^2}(H') + \sqrt{\frac{H}{4m\delta}\sum_{i=1}^{H'}\big(q^*_{i,\mathrm{lin},\chi^2}(H')\big)^2}.$    (31)

For the first term, substituting (29) and writing $c := \frac{H}{4H'm\delta} - \widehat{\mathrm{var}}_{H'}(\hat l)$,

$\sum_{i=1}^{H'}\hat l_i\, q^*_{i,\mathrm{lin},\chi^2}(H') = \frac{1}{H'}\sum_{i=1}^{H'}\hat l_i + \frac{1}{H'\sqrt{c}}\sum_{i=1}^{H'}\hat l_i\Big(\frac{1}{H'}\sum_{j=1}^{H'}\hat l_j - \hat l_i\Big) = \frac{1}{H'}\sum_{i=1}^{H'}\hat l_i - \frac{\widehat{\mathrm{var}}_{H'}(\hat l)}{\sqrt{c}}.$    (32)

For evaluating the second term, we simplify the χ²-divergence factor:

$\sum_{i=1}^{H'}\big(q^*_{i,\mathrm{lin},\chi^2}(H')\big)^2 = \frac{1}{(H')^2}\sum_{i=1}^{H'}\Big(1 + \frac{\frac{1}{H'}\sum_j\hat l_j - \hat l_i}{\sqrt{c}}\Big)^2 = \frac{1}{H'}\Big(1 + \frac{\widehat{\mathrm{var}}_{H'}(\hat l)}{c}\Big) = \frac{1}{H'}\cdot\frac{\frac{H}{4H'm\delta}}{c}.$    (33)

Substituting (32) and (33) into (31), the bound value becomes

$B^*_{\mathrm{lin},\chi^2}(H') = \frac{1}{H'}\sum_{i=1}^{H'}\hat l_i - \frac{\widehat{\mathrm{var}}_{H'}(\hat l)}{\sqrt{c}} + \frac{\frac{H}{4H'm\delta}}{\sqrt{c}} = \frac{1}{H'}\sum_{i=1}^{H'}\hat l_i + \sqrt{\frac{H}{4H'm\delta} - \widehat{\mathrm{var}}_{H'}(\hat l)}.$    (34)

This bound value $B^*_{\mathrm{lin},\chi^2}(H')$ is the sum of an increasing term, $\frac{1}{H'}\sum_{i=1}^{H'}\hat l_i$, and a decreasing term, $\sqrt{\frac{H}{4H'm\delta} - \widehat{\mathrm{var}}_{H'}(\hat l)}$; it can be shown to be a decreasing function of $H'$ for $H' \le H^*$, as illustrated in Figure 1.

[Figure 1: Nature of $B^*_{\mathrm{lin},\chi^2}(H')$ and its components as the subset size $H'$ varies.]

Algorithm 2: OptQ-lin-χ² For Uniform Prior: algorithm for finding the optimal posterior for the PAC-Bayesian bound with linear distance function and χ²-divergence when the prior is the uniform distribution.
Input: $m, \delta, H, \{\hat l_i\}_{i=1}^H$ (sorted so that $\hat l_1 \le \ldots \le \hat l_H$). Output: $Q^*_{\mathrm{lin},\chi^2}$.
1.  flag ← 0
2.  for H' = 2, ..., H do
3.    $\widehat{\mathrm{var}}_{H'}(\hat l) \leftarrow \frac{1}{H'}\sum_{i=1}^{H'}\big(\frac{1}{H'}\sum_{j=1}^{H'}\hat l_j - \hat l_i\big)^2$
4.    if $\frac{H}{4H'm\delta} - \widehat{\mathrm{var}}_{H'}(\hat l) \le 0$ then flag ← 1; break
5.    DeltaBnd ← $\sqrt{\frac{H}{4H'm\delta} - \widehat{\mathrm{var}}_{H'}(\hat l)}$
6.    for i = 1, ..., H' do
7.      $q^*_{i,\mathrm{lin},\chi^2} \leftarrow \frac{1}{H'}\Big(1 + \frac{\frac{1}{H'}\sum_{j=1}^{H'}\hat l_j - \hat l_i}{\text{DeltaBnd}}\Big)$
8.      if $q^*_{i,\mathrm{lin},\chi^2} < 0$ then flag ← 1; break
9.    end
10.   if flag = 1 then break
11. end
12. if flag = 1 then
13.   $H^* \leftarrow H' - 1$; recompute DeltaBnd and $q^*_{i,\mathrm{lin},\chi^2}$ for $i = 1,\ldots,H^*$ with $H^*$ in place of $H'$; set $q^*_{i,\mathrm{lin},\chi^2} \leftarrow 0$ for $i = H^*+1,\ldots,H$
14. else
15.   $H^* \leftarrow H'$
16. end
17. return $Q^*_{\mathrm{lin},\chi^2} \leftarrow (q^*_{1,\mathrm{lin},\chi^2},\ldots,q^*_{H,\mathrm{lin},\chi^2})$
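The linear search above translates almost line for line into code. The sketch below is our own Python rendering of Algorithm 2 (names are ours; it assumes the risks may be passed unsorted and sorts them first).

import numpy as np

def opt_q_lin_chi2_uniform(emp_risks, m, delta):
    """Ordered-subset search of Algorithm 2 (uniform prior, linear distance, chi^2).

    Returns (Q_star, H_star); Q_star is indexed by increasing empirical risk.
    """
    l_hat = np.sort(np.asarray(emp_risks, dtype=float))
    H = len(l_hat)
    best_q, H_star = None, None
    for H_prime in range(2, H + 1):
        head = l_hat[:H_prime]
        radicand = H / (4.0 * H_prime * m * delta) - head.var()   # H/(4H'm*delta) - var_{H'}
        if radicand <= 0:                                         # bound (30) undefined: stop
            break
        q_head = (1.0 + (head.mean() - head) / np.sqrt(radicand)) / H_prime
        if np.any(q_head < 0):                                    # infeasible posterior: stop
            break
        best_q, H_star = q_head, H_prime                          # keep the largest feasible H'
    Q_star = np.zeros(H)
    if best_q is not None:
        Q_star[:H_star] = best_q
    return Q_star, H_star

Because Theorem 13 makes the per-subset bound decreasing up to H*, keeping the largest feasible H' is equivalent to stopping at the first infeasible subset, which is what the pseudocode does.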
A.3.1 Correctness of Algorithm OptQ-lin-χ² For Uniform Prior

We want to determine the globally optimal posterior $Q^*_{\mathrm{lin},\chi^2}$ that has the minimum bound value $B_{\mathrm{lin},\chi^2}(Q)$ over the $H$-dimensional probability simplex $\Delta_H$. Using the result of Theorem 1 in the main paper, we can confine the search to a much smaller space of posteriors with support on a family of increasing ordered subsets of $\mathcal{H}$. These ordered subsets are defined by their size; for example, an ordered subset of size $H' \in \{1,\ldots,H\}$ comprises the smallest $H'$ values in the set $\{\hat l_i\}_{i=1}^H$. Thus the restricted space of posteriors, say $\Delta_{\mathrm{ord}} \subset \Delta_H$, is a union of convex sets of posteriors with supports on the ordered subsets defined above. Due to the increasing subset relation between consecutive supports, this union is itself a convex set. Therefore, as a consequence of Theorem 10, the bound function $B_{\mathrm{lin},\chi^2}(Q)$ is convex on the set of posteriors $\Delta_{\mathrm{ord}}$ as well, which contains the global minimum. The search space $\Delta_{\mathrm{ord}}$ is a restriction of the simplex $\Delta_H$, yet it consists of uncountably many posteriors on the ordered subsets. We refine the search further by localizing to the optimal posteriors on each of the increasing ordered subsets and comparing their bound values to find the minimum. As identified by (30), these bound values $B^*_{\mathrm{lin},\chi^2}$ are functions of the subset size $H'$. $B^*_{\mathrm{lin},\chi^2}(H')$ is defined only for those values of $H'$ where $\frac{H}{4H'm\delta} - \widehat{\mathrm{var}}_{H'}(\hat l) > 0$; we ignore the $H'$ values where this condition is not met (Line 4 in Algorithm 2). Further, we also need to verify that, for a given $H'$, the optimal posterior $Q^*_{\mathrm{lin},\chi^2}(H')$ in (29) satisfies the positivity constraints, as done in Line 8 of the algorithm. Hence an exponential search over the restricted posterior space is simplified to a finite linear search over the support size. We denote the support size of $Q^*_{\mathrm{lin},\chi^2}$ by $H^* \in [H]$. Therefore, for finding the optimal posterior $Q^*_{\mathrm{lin},\chi^2}$ in the restricted posterior space $\Delta_{\mathrm{ord}}$, it is sufficient to search for $H^*$ in the set $\{1,\ldots,H\}$ of support sizes.

Warm start for searching the optimal support size, $H^*$. We can shorten the sequential search for the optimal support size $H^*$ over $\{1,\ldots,H\}$ by using a warm start value for $H'$. From this value we can reach $H^*$ by moving in a direction of decrease of the bound value $B^*_{\mathrm{lin},\chi^2}(H')$, as long as the corresponding posterior is defined and non-negative. This bound value is the sum of an increasing term, $\frac{1}{H'}\sum_{i=1}^{H'}\hat l_i$, and a decreasing term, $\sqrt{\frac{H}{4H'm\delta} - \widehat{\mathrm{var}}_{H'}(\hat l)}$. We expect $H^*$ to lie in the neighbourhood of the point of intersection of these two components of $B^*_{\mathrm{lin},\chi^2}(H')$, which can be obtained by equating the two terms and solving for $H'$:

$\frac{1}{H'}\sum_{i=1}^{H'}\hat l_i = \sqrt{\frac{H}{4H'm\delta} - \widehat{\mathrm{var}}_{H'}(\hat l)}.$    (35)

The value of $H'$ which satisfies (35) can be used as the warm start point of Algorithm 2. If this $H'$ satisfies the feasibility conditions for $Q^*_{\mathrm{lin},\chi^2}(H')$, we compare its bound value $B^*_{\mathrm{lin},\chi^2}(H')$ with those of its neighbours, $B^*_{\mathrm{lin},\chi^2}(H'-1)$ and $B^*_{\mathrm{lin},\chi^2}(H'+1)$, to identify a descent direction, and follow it until feasibility is violated.
Otherwise, if $B^*_{\mathrm{lin},\chi^2}(H')$ is infeasible or undefined at the warm start point, we keep decrementing $H'$ till we reach a feasible $Q^*_{\mathrm{lin},\chi^2}(H')$.

B Optimal PAC-Bayesian Posterior using Squared Distance Function
The distance function of our interest here is the squared distance function, $\phi_{\mathrm{sq}}(\hat l, l) = (\hat l - l)^2$ for $\hat l, l \in [0,1]$. The PAC-Bayesian bound for the squared distance function with chi-squared divergence can be stated as

$P_S\Big\{ \big(\mathbb{E}_Q[\hat l] - \mathbb{E}_Q[l]\big)^2 \le \sqrt{\big[\chi^2(Q\Vert P)+1\big]\,\frac{I^{sq}_{R}(m,2)}{\delta}} \Big\} \ge 1-\delta.$    (36)

The above statement gives the following probabilistic upper bound on the true risk of a stochastic classifier governed by a distribution $Q$ on $\mathcal{H}$:

$B_{\mathrm{sq},\chi^2}(Q) := \mathbb{E}_Q[\hat l] + \Big(\big[\chi^2(Q\Vert P)+1\big]\,\frac{I^{sq}_{R}(m,2)}{\delta}\Big)^{1/4}.$    (37)

We first need to identify the constant $I^{sq}_{R}(m,2)$ for a given sample size $m$.

Lemma 3. For a given sample size $m$, $l^* = 0.5$ is the maximizer over $l \in [0,1]$ of $I^{sq}_{R}(m,2,l) := \sum_{k=0}^{m}\binom{m}{k} l^k(1-l)^{m-k}\big(\tfrac{k}{m}-l\big)^4$, and hence $I^{sq}_{R}(m,2) := \sup_{l\in[0,1]} I^{sq}_{R}(m,2,l) = \frac{12m-8}{64m^3}$.

Proof. We have

$I^{sq}_{R}(m,2) = \sup_{l\in[0,1]} \sum_{k=0}^{m}\binom{m}{k} l^k(1-l)^{m-k}\Big(\frac{k}{m}-l\Big)^4 = \sup_{l\in[0,1]} \frac{1}{m^4}\sum_{k=0}^{m}\binom{m}{k} l^k(1-l)^{m-k}(k-ml)^4.$

The quantity to be maximized is the fourth central moment of a Binomial$(m,l)$ distribution, which can be expressed in terms of its lower central moments:

$\frac{1}{m^4}\Big\{ m\big[l(1-l)^4 + l^4(1-l)\big] + 3m(m-1)\,l^2(1-l)^2 \Big\} = \frac{1}{m^3}\, l(1-l)\big[(1-l)^3 + l^3 + 3(m-1)\,l(1-l)\big] = \frac{1}{m^3}\, u\,[1 + 3(m-2)\,u],$

where $u := l(1-l)$. $I^{sq}_{R}(m,2,l)$ is a smooth function of $l \in [0,1]$; written in terms of $u$ it is increasing in $u$ for $m \ge 2$, and $u$ is maximized at $l = 1/2$, so the unique maximum is attained at $l^* = 1/2$ for all $m \ge 2$ (see Figure 2). Substituting $u = 1/4$,

$I^{sq}_{R}(m,2) = \frac{1}{m^3}\cdot\frac{1}{4}\Big(1 + \frac{3(m-2)}{4}\Big) = \frac{3m-2}{16m^3} = \frac{12m-8}{64m^3}.$

[Figure 2: Plot of $I^{sq}_{R}(m,2,l)$ for different sample sizes $m \in \{50, 100, 200, 500, 1000, 1020, 1028\}$. The function is symmetric about $l = 0.5$, its maximizer.]

Substituting this constant in (36), the PAC-Bayesian statement becomes

$P_S\Big\{ \big(\mathbb{E}_Q[\hat l] - \mathbb{E}_Q[l]\big)^2 \le \sqrt{\big[\chi^2(Q\Vert P)+1\big]\,\frac{12m-8}{64m^3\delta}} \Big\} \ge 1-\delta.$    (38)
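The closed-form constant above can be sanity-checked numerically; the short Python sketch below (our own helper, not part of the original development) evaluates the binomial expectation on a grid of $l$ values.

import numpy as np
from scipy.stats import binom

def I_sq(m, num_grid=2001):
    """Numerically evaluate sup_l E[(K/m - l)^4], K ~ Binomial(m, l), as in Lemma 3."""
    ls = np.linspace(0.0, 1.0, num_grid)
    k = np.arange(m + 1)
    vals = [np.sum(binom.pmf(k, m, l) * (k / m - l) ** 4) for l in ls]
    idx = int(np.argmax(vals))
    return vals[idx], ls[idx]

# For m = 100 the grid search returns approximately (12*100 - 8)/(64*100**3) = 1.8625e-05,
# attained at l = 0.5, in agreement with the closed form of Lemma 3.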
Theorem 14. For a finite set of classifiers $\mathcal{H}$, the PAC-Bayesian upper bound on the averaged true risk based on the squared distance function, when chi-squared divergence is used as the measure of divergence between the prior and the posterior, is given by

$B_{\mathrm{sq},\chi^2}(Q) = \sum_{i=1}^{H}\hat l_i q_i + \Big[\Big(\sum_{i=1}^{H}\frac{q_i^2}{p_i}\Big)\frac{12m-8}{64m^3\delta}\Big]^{1/4}.$    (39)

Proof. Using the PAC-Bayesian statement in (38) for the case of a finite classifier set, and noting that $\chi^2(Q\Vert P)+1 = \sum_i q_i^2/p_i$, we obtain the above form of $B_{\mathrm{sq},\chi^2}(Q)$, where the constant $\frac{12m-8}{64m^3}$ has been identified via Lemma 3.

B.1 The bound minimization problem
We want to determine the optimal posterior $Q^*_{\mathrm{sq},\chi^2}$ which minimizes the upper bound $B_{\mathrm{sq},\chi^2}(Q)$. When the classifier space $\mathcal{H}$ is a finite set, say $\mathcal{H} = \{h_i\}_{i=1}^H$, this optimization problem can be described as:

$\min_{q_1,\ldots,q_H} \; \sum_{i=1}^{H}\hat l_i q_i + \Big[\Big(\sum_{i=1}^{H}\frac{q_i^2}{p_i}\Big)\frac{12m-8}{64m^3\delta}\Big]^{1/4}$ subject to $\sum_{i=1}^{H} q_i = 1$, $q_i \ge 0$ for all $i = 1,\ldots,H$.    (40)

B.2 Non-convexity of the bound function
If the bound function $B_{\mathrm{sq},\chi^2}(Q)$ turned out to be convex, it would have a unique minimizer which could easily be obtained using the KKT conditions. We therefore investigate whether this bound function is convex in $Q$ using the first order condition for convexity.

Theorem 15. The bound function $B_{\mathrm{sq},\chi^2}(Q) = \sum_{i}\hat l_i q_i + \big[\big(\sum_{i}q_i^2/p_i\big)\frac{12m-8}{64m^3\delta}\big]^{1/4}$ is non-convex.

Proof. We use the first order condition to verify convexity of the bound function: we need to check whether $B_{\mathrm{sq},\chi^2}(Q') \ge B_{\mathrm{sq},\chi^2}(Q) + \langle\nabla B_{\mathrm{sq},\chi^2}(Q), Q'-Q\rangle$ holds for any pair of distributions $Q$ and $Q'$ on the classifier space $\mathcal{H}$. Writing $S(Q) := \sum_i q_i^2/p_i$, the linear terms cancel, the constant factor drops out, and the condition reduces to

$S(Q)^{3/4}\,S(Q')^{1/4} \;\ge\; \frac{1}{2}\Big(\sum_i\frac{q_i q_i'}{p_i} + \sum_i\frac{q_i^2}{p_i}\Big).$    (41)

Notice that this inequality does not depend on the $\hat l_i$ values. We have counterexamples which violate this convexity condition: for $H = 10$ and $P$ the uniform distribution on $\mathcal{H}$, a degenerate $Q$ with $q_1 = 1$ together with a suitable non-degenerate $Q'$ gives LHS = 6.087086 and RHS = 6.172964, violating (41). Thus we can claim that $B_{\mathrm{sq},\chi^2}$ is a non-convex function.
Remark 4. Computationally, this bound minimization problem is observed to have a single local minimum. The quasi-convexity of this bound function holds under a condition identified in Proposition 2.
We are interested in checking whether $B_{\mathrm{sq},\chi^2}(Q)$ is strictly quasi-convex; if so, we can claim that a local optimal solution is also a global optimal solution [2].

Definition 1 ([2]). Let $f : E \to \mathbb{R}$, where $E$ is a non-empty convex set in $\mathbb{R}^n$. The function $f$ is strictly quasi-convex if, for each $x_1, x_2 \in E$ with $f(x_1) \ne f(x_2)$, we have

$f(\alpha x_1 + (1-\alpha)x_2) < \max\big(f(x_1), f(x_2)\big) \quad \forall\,\alpha\in(0,1).$    (42)

Theorem 16 ([2]). Let $f : E \to \mathbb{R}$ be strictly quasi-convex, and consider the problem of minimizing $f(x)$ subject to $x \in E$, where $E$ is a non-empty convex set in $\mathbb{R}^n$. If $\bar x$ is a local optimal solution, then $\bar x$ is also a global optimal solution.
Proposition 2. The bound function $B_{\mathrm{sq},\chi^2}(Q)$ is strictly quasi-convex if the following condition holds for any $Q, Q'$ and each $\alpha\in(0,1)$:

$\Big(\frac{12m-8}{64m^3\delta}\Big)^{1/4}\Big[\Big(\sum_i\frac{(\alpha q_i+(1-\alpha)q_i')^2}{p_i}\Big)^{1/4} - \Big(\sum_i\frac{q_i^2}{p_i}\Big)^{1/4}\Big] \;<\; (1-\alpha)\big(\mathbb{E}_Q[\hat l] - \mathbb{E}_{Q'}[\hat l]\big),$

and hence a local minimum of the bound minimization problem (40) is also a global minimum.

Proof. $B_{\mathrm{sq},\chi^2}(Q)$ is defined on the simplex $\Delta_H$, a non-empty convex set in $\mathbb{R}^H$. For strict quasi-convexity we must show that, for each $Q \ne Q' \in \Delta_H$ with $B_{\mathrm{sq},\chi^2}(Q) \ne B_{\mathrm{sq},\chi^2}(Q')$,

$B_{\mathrm{sq},\chi^2}(\alpha Q + (1-\alpha)Q') < \max\big(B_{\mathrm{sq},\chi^2}(Q), B_{\mathrm{sq},\chi^2}(Q')\big) \quad \forall\,\alpha\in(0,1).$

Assume, without loss of generality, that $B_{\mathrm{sq},\chi^2}(Q) > B_{\mathrm{sq},\chi^2}(Q')$; we must show $B_{\mathrm{sq},\chi^2}(\alpha Q+(1-\alpha)Q') < B_{\mathrm{sq},\chi^2}(Q)$. Write $S(Q) := \sum_i q_i^2/p_i$ and $c := \big(\frac{12m-8}{64m^3\delta}\big)^{1/4}$, so that $B_{\mathrm{sq},\chi^2}(Q) = \mathbb{E}_Q[\hat l] + c\,S(Q)^{1/4}$. We consider four cases.

Case I: $\mathbb{E}_Q[\hat l] = \mathbb{E}_{Q'}[\hat l]$ and $S(Q) = S(Q')$. Then $\mathbb{E}_{\alpha Q+(1-\alpha)Q'}[\hat l] = \mathbb{E}_Q[\hat l]$, so it suffices to show that for any $Q \ne Q'$ and each $\alpha\in(0,1)$,

$S(\alpha Q + (1-\alpha)Q') < S(Q).$    (43)

Expanding the left hand side,

$S(\alpha Q+(1-\alpha)Q') = \alpha^2 S(Q) + (1-\alpha)^2 S(Q') + 2\alpha(1-\alpha)\sum_i\frac{q_i q_i'}{p_i} < S(Q) + (1-\alpha)^2\big[S(Q') - S(Q)\big],$    (44)

where the strict inequality uses $\sum_i q_i q_i'/p_i < S(Q)$, which holds by the Cauchy–Schwarz inequality for $Q \ne Q'$ whenever $S(Q') \le S(Q)$. Since $S(Q') = S(Q)$ here, the right hand side equals $S(Q)$; since $x \mapsto x^{1/4}$ is increasing, strict quasi-convexity follows in this case.

Case II: $S(Q) = S(Q')$ and $\mathbb{E}_Q[\hat l] > \mathbb{E}_{Q'}[\hat l]$. Then $\mathbb{E}_{\alpha Q+(1-\alpha)Q'}[\hat l] < \mathbb{E}_Q[\hat l]$, and by the argument above $S(\alpha Q+(1-\alpha)Q') < S(Q)$ since the second term on the right hand side of (44) is zero. Hence $B_{\mathrm{sq},\chi^2}(\alpha Q+(1-\alpha)Q') < B_{\mathrm{sq},\chi^2}(Q) = \max\big(B_{\mathrm{sq},\chi^2}(Q), B_{\mathrm{sq},\chi^2}(Q')\big)$, so $B_{\mathrm{sq},\chi^2}$ is strictly quasi-convex in this case too.

Case III: $\mathbb{E}_Q[\hat l] > \mathbb{E}_{Q'}[\hat l]$ and $S(Q) > S(Q')$. As before, $\mathbb{E}_{\alpha Q+(1-\alpha)Q'}[\hat l] < \mathbb{E}_Q[\hat l]$ and $S(\alpha Q+(1-\alpha)Q') < S(Q) + (1-\alpha)^2[S(Q')-S(Q)] < S(Q)$, the last inequality because $S(Q') - S(Q) < 0$. Hence $B_{\mathrm{sq},\chi^2}(\alpha Q+(1-\alpha)Q') < B_{\mathrm{sq},\chi^2}(Q) = \max\{B_{\mathrm{sq},\chi^2}(Q), B_{\mathrm{sq},\chi^2}(Q')\}$, and $B_{\mathrm{sq},\chi^2}$ is strictly quasi-convex in this case as well.

The condition for quasi-convexity holds easily in all the above three cases; the next case requires the added assumption.

Case IV: $\mathbb{E}_Q[\hat l] > \mathbb{E}_{Q'}[\hat l]$ and $S(Q) < S(Q')$, with $B_{\mathrm{sq},\chi^2}(Q) > B_{\mathrm{sq},\chi^2}(Q')$, i.e. $\mathbb{E}_Q[\hat l] - \mathbb{E}_{Q'}[\hat l] > c\big[S(Q')^{1/4} - S(Q)^{1/4}\big]$. We have to show that $B_{\mathrm{sq},\chi^2}(\alpha Q+(1-\alpha)Q') < B_{\mathrm{sq},\chi^2}(Q)$, which is equivalent to showing that for any such $Q, Q'$ and each $\alpha\in(0,1)$,

$c\big[S(\alpha Q+(1-\alpha)Q')^{1/4} - S(Q)^{1/4}\big] < \mathbb{E}_Q[\hat l] - \big(\alpha\mathbb{E}_Q[\hat l] + (1-\alpha)\mathbb{E}_{Q'}[\hat l]\big) = (1-\alpha)\big(\mathbb{E}_Q[\hat l] - \mathbb{E}_{Q'}[\hat l]\big).$

The above holds by the assumption in the statement of the proposition.

Thus, under the given condition, $B_{\mathrm{sq},\chi^2}$ is strictly quasi-convex and, by Theorem 16, admits a global minimum which can be identified from the KKT conditions.
Remark 5. The condition that, for any $Q, Q'$ and each $\alpha\in(0,1)$,

$\Big(\frac{12m-8}{64m^3\delta}\Big)^{1/4}\Big[\Big(\sum_i\frac{(\alpha q_i+(1-\alpha)q_i')^2}{p_i}\Big)^{1/4} - \Big(\sum_i\frac{q_i^2}{p_i}\Big)^{1/4}\Big] < (1-\alpha)\big(\mathbb{E}_Q[\hat l] - \mathbb{E}_{Q'}[\hat l]\big)$

is required to complete the proof of quasi-convexity of $B_{\mathrm{sq},\chi^2}(Q)$ for the case when $\mathbb{E}_Q[\hat l] > \mathbb{E}_{Q'}[\hat l]$ and $\sum_i q_i^2/p_i < \sum_i q_i'^2/p_i$. We have not been able to verify that this condition always holds for an arbitrary pair $(Q, Q')$; the other cases are easy to prove.

B.3 The posterior based on fixed point scheme, $Q^*_{\mathrm{sq},\chi^2}$

The Lagrangian function corresponding to the optimization problem (40), writing $C := \frac{12m-8}{64m^3\delta}$, is

$\mathcal{L}_{\mathrm{sq},\chi^2}(Q,\mu) := \sum_{i}\hat l_i q_i + \Big[C\sum_{i}\frac{q_i^2}{p_i}\Big]^{1/4} - \mu\Big(\sum_{i} q_i - 1\Big).$    (45)

At optimality, the posterior $Q$ sets the derivatives of this Lagrangian to zero. Setting $\partial\mathcal{L}_{\mathrm{sq},\chi^2}/\partial q_i = 0$ for all $i$:

$\hat l_i + \frac{C^{1/4}}{2}\Big(\sum_j\frac{q_j^2}{p_j}\Big)^{-3/4}\frac{q_i}{p_i} - \mu = 0 \;\Longrightarrow\; \frac{q_i}{\big(\sum_j q_j^2/p_j\big)^{3/4}} = \frac{2\,p_i(\mu-\hat l_i)}{C^{1/4}}, \quad i = 1,\ldots,H.$    (46)

Setting $\partial\mathcal{L}_{\mathrm{sq},\chi^2}/\partial\mu = 0$, i.e. $\sum_i q_i = 1$, and summing (46) over $i$ gives

$\frac{2}{C^{1/4}}\Big(\sum_j\frac{q_j^2}{p_j}\Big)^{3/4}\Big(\mu - \sum_j\hat l_j p_j\Big) = 1,$    (47)

$\Longrightarrow\; \mu = \sum_j\hat l_j p_j + \frac{C^{1/4}}{2\big(\sum_j q_j^2/p_j\big)^{3/4}}.$    (48)

Combining (46) and (48), we get the following fixed point equation in the $q_i$'s:

$q_i = p_i\Big[\frac{2\big(\sum_j q_j^2/p_j\big)^{3/4}}{C^{1/4}}\Big(\sum_j\hat l_j p_j - \hat l_i\Big) + 1\Big], \quad i = 1,\ldots,H.$    (49)

Theorem 17 (Optimal posterior on an ordered subset support). When the prior is the uniform distribution on $\mathcal{H}$, among all the posteriors with support a subset of size exactly $H'$, the best posterior, denoted $Q^*_{\mathrm{sq},\chi^2}(H')$, has support on the ordered subset $\mathcal{H}'_{\mathrm{ord}} = \{\hat l_1 \le \hat l_2 \le \ldots \le \hat l_{H'}\}$ consisting of the smallest $H'$ values in $\mathcal{H}$. The optimal posterior weights are determined as the solution of the following fixed point equation:

$q^{FP}_{i,\mathrm{sq},\chi^2}(H') = \frac{1}{H'} + \frac{2\,S^{3/4}}{H\,C^{1/4}}\Big(\frac{1}{H'}\sum_{j=1}^{H'}\hat l_j - \hat l_i\Big)$ for $i = 1,\ldots,H'$, and $q^{FP}_{i,\mathrm{sq},\chi^2}(H') = 0$ for $i = H'+1,\ldots,H$,    (50)

where $S := H\sum_{j=1}^{H'}\big(q^{FP}_{j,\mathrm{sq},\chi^2}(H')\big)^2$ and $C := \frac{12m-8}{64m^3\delta}$, under the assumption that, for the given $H'$, (50) converges to a fixed point solution, and for feasibility we require $q^{FP}_{i,\mathrm{sq},\chi^2}(H') > 0$ for $i = 1,\ldots,H'$.
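The fixed point map (50) is straightforward to iterate. The Python sketch below is our own rendering under our reading of (50) (function and variable names are ours); it starts from the uniform weights on the ordered subset and stops when the update stabilizes.

import numpy as np

def fp_posterior_sq_chi2_uniform(emp_risks_subset, H, m, delta, tol=1e-12, max_iter=10000):
    """Fixed point iteration (50): squared distance, chi^2 divergence, uniform prior.

    emp_risks_subset holds the H' smallest empirical risks; H is the total classifier count.
    Returns the weight vector on the subset, or None if it turns infeasible.
    """
    l_hat = np.asarray(emp_risks_subset, dtype=float)
    H_prime = len(l_hat)
    C = (12.0 * m - 8.0) / (64.0 * m ** 3 * delta)    # I_sq(m,2)/delta from Lemma 3
    q = np.full(H_prime, 1.0 / H_prime)               # initial guess: uniform on the subset
    for _ in range(max_iter):
        S = H * float(np.sum(q ** 2))                 # chi^2(Q||P) + 1 under the uniform prior
        q_new = 1.0 / H_prime + (2.0 * S ** 0.75 / (H * C ** 0.25)) * (l_hat.mean() - l_hat)
        if np.max(np.abs(q_new - q)) < tol:
            q = q_new
            break
        q = q_new
    return q if np.all(q >= 0) else None

Each iterate sums to one by construction (the deviations from the subset mean cancel), so only positivity needs to be checked at the end.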
C Optimal PAC-Bayesian Posterior using KL-distance Function

The PAC-Bayesian bound using the distance function $kl(\hat l, l) = \hat l\ln\big(\frac{\hat l}{l}\big) + (1-\hat l)\ln\big(\frac{1-\hat l}{1-l}\big)$ (for any $\hat l, l \in (0,1)$) is obtained as:

$P_S\Big\{ \forall\,Q \text{ on } \mathcal{H}:\; kl\big(\mathbb{E}_Q[\hat l], \mathbb{E}_Q[l]\big) \le \sqrt{\frac{\big(\chi^2(Q\Vert P)+1\big)\,I^{kl}_{R}(m,2)}{\delta}} \Big\} \ge 1-\delta.$    (51)

The upper bound on the averaged true risk $\mathbb{E}_Q[l]$ corresponding to the above PAC-Bayesian theorem is obtained as:

$B_{kl,\chi^2}(Q) = \sup_{r\in(0,1)}\Big\{ r : kl\big(\mathbb{E}_Q[\hat l], r\big) \le \sqrt{\frac{\big(\chi^2(Q\Vert P)+1\big)\,I^{kl}_{R}(m,2)}{\delta}} \Big\}.$    (52)

An inverse $kl(\cdot,\cdot)$ function does not exist since $kl$ is not a monotone function, and so the bound $B_{kl,\chi^2}(Q)$ does not have an explicit form. However, we can employ a numerical root finding algorithm such as that described in [20] (Algorithm KLroots) to obtain $B_{kl,\chi^2}(Q)$ for a given instance of the system parameters.

We first need to compute the constant $I^{kl}_{R}(m,2) := \sup_{l\in[0,1]}\sum_{k=0}^{m}\binom{m}{k} l^k(1-l)^{m-k}\, kl\big(\tfrac{k}{m}, l\big)$ in order to determine the bound value. For $m > 1028$, the computation is difficult due to storage limitations in the range of floating point numbers – it returns $I^{kl}_{R}(m)$ as NaN. We notice that $I^{kl}_{R}(m)$ decreases with $m$, and hence we can use $I^{kl}_{R}(1028)$ as an upper approximation of $I^{kl}_{R}(m)$ for $m > 1028$.

[Figure 3: Plot of the function $I^{kl}_{R}(m,l) = \sum_{k=0}^{m}\binom{m}{k} l^k(1-l)^{m-k}\, kl(\tfrac{k}{m}, l)$ as a function of the true risk $l\in[0,1]$ for different values of the sample size $m$, represented by different curves. The function $I^{kl}_{R}(m,l)$ is bimodal and symmetric about $l=0.5$. We are interested in the quantity $I^{kl}_{R}(m) = \sup_{l\in[0,1]} I^{kl}_{R}(m,l)$ as a function of $m$, which we identify graphically (and mark by a • on each curve).]

Table 5: Values of $I^{kl}_{R}(m) = \sup_{l\in[0,1]}\sum_{k=0}^{m}\binom{m}{k} l^k(1-l)^{m-k}\, kl(\tfrac{k}{m}, l)$ for different sample sizes $m$. We notice that $I^{kl}_{R}(m)$ decreases as $m$ increases. For a given $m$, $l^*(m)$ denotes the value of $l\in[0,1]$ at which the supremum is attained; we observe that $l^*(m) \to 1$ as $m$ grows beyond 1000.

Sample size, m | l*(m)  | I_kl(m)
50             | 0.98   | 0.0074799
100            | 0.99   | 0.0037092
200            | 0.995  | 0.0018470
500            | 0.998  | 0.0007369
1000           | 0.999  | 0.0003682
1020           | 0.999  | 0.0003609
1028           | 0.999  | 0.0003580
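The entries of Table 5 can be reproduced with a direct grid search over $l$; the Python sketch below is our own helper (unoptimized, names are ours) for that computation.

import numpy as np
from scipy.stats import binom

def kl_bernoulli(p, q):
    """Vector of kl(p, q) values for Bernoulli parameters, with the 0*log(0) = 0 convention."""
    p = np.asarray(p, dtype=float)
    out = np.zeros_like(p)
    pos = p > 0
    out[pos] += p[pos] * np.log(p[pos] / q)
    lt1 = p < 1
    out[lt1] += (1 - p[lt1]) * np.log((1 - p[lt1]) / (1 - q))
    return out

def I_kl(m, num_grid=999):
    """Grid approximation of I_kl(m) = sup_l E[kl(K/m, l)], K ~ Binomial(m, l).

    The grid stays inside (0, 1), where kl(., l) is finite.
    """
    ls = np.linspace(0.001, 0.999, num_grid)
    k = np.arange(m + 1)
    vals = [float(np.sum(binom.pmf(k, m, l) * kl_bernoulli(k / m, l))) for l in ls]
    idx = int(np.argmax(vals))
    return vals[idx], ls[idx]

# For m = 50 this grid search should return a value close to 0.0075, attained near one of the
# two symmetric maximizers (l about 0.98 or 0.02), in line with the first row of Table 5.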
C.1 The KL-distance bound minimization problem
For a finite classifier space $\mathcal{H} = \{h_i\}_{i=1}^H$, this optimization problem can be described as:

$\min_{q_1,\ldots,q_H,\, r} \; r$    (53a)

subject to $\Big(\sum_{i}\hat l_i q_i\Big)\ln\frac{\sum_{i}\hat l_i q_i}{r} + \Big(1-\sum_{i}\hat l_i q_i\Big)\ln\frac{1-\sum_{i}\hat l_i q_i}{1-r} = \sqrt{\frac{\big(\sum_{i} q_i^2/p_i\big)\,I^{kl}_{R}(m,2)}{\delta}}$    (53b)

$r \ge \sum_{i}\hat l_i q_i$    (53c)

$\sum_{i} q_i = 1$    (53d)

$q_i \ge 0 \quad \forall\, i = 1,\ldots,H.$    (53e)

Here, $r$ is the right root of $kl\big(\mathbb{E}_Q[\hat l], r\big) = \sqrt{\frac{(\chi^2(Q\Vert P)+1)\,I^{kl}_{R}(m,2)}{\delta}}$ for a given $\mathbb{E}_Q[\hat l]$. The above is a non-convex problem with a difference of convex (DC) equality constraint (53b). The constraint (53c) is a strict inequality which has been relaxed for modelling purposes.

C.2 The posterior based on fixed point scheme, $Q^{FP}_{kl,\chi^2}$

We derive the FP equation for the KL-distance based bound optimization problem below:
Theorem 18 (Optimal posterior on an ordered subset support). Among all the posteriors with support a subset of size exactly $H'$, a stationary point $Q^{FP}_{kl,\chi^2}(H')$ can be obtained as the solution of the following fixed point equation:

$q_i = \frac{1}{Z_{kl,\chi^2}}\, p_i\left[\sum_{j=1}^{H'}\frac{q_j^2}{p_j} + \frac{\hat l_i - \sum_{j=1}^{H'}\hat l_j q_j}{\sqrt{\frac{I^{kl}_{R}(m,2)}{\delta\,\sum_{j=1}^{H'} q_j^2/p_j}}}\,\ln\!\left(\frac{(1-r)\sum_{j=1}^{H'}\hat l_j q_j}{r\big(1-\sum_{j=1}^{H'}\hat l_j q_j\big)}\right)\right],$    (54)

where $Z_{kl,\chi^2}$ is a suitable normalization constant and $r$ is the solution of (53b) and (53c) for the given $Q = (q_1,\ldots,q_H)$.

Proof. The Lagrangian function for (53) can be written as

$\mathcal{L}_{kl,\chi^2} = r - \beta_1\Big[kl\Big(\sum_{j\le H'}\hat l_j q_j,\, r\Big) - \sqrt{\tfrac{(\sum_{j\le H'} q_j^2/p_j)\,I^{kl}_{R}(m,2)}{\delta}}\Big] - \beta_2\Big(r - \sum_{j\le H'}\hat l_j q_j\Big) - \mu\Big(\sum_{j\le H'} q_j - 1\Big) - \sum_{j\le H'}\mu_j q_j.$    (55)

Due to the strict inequality constraint (53c), the complementary slackness conditions for a stationary point imply that the Lagrange multiplier $\beta_2$ vanishes at optimality. Write $\hat L := \sum_{j\le H'}\hat l_j q_j$, $S := \sum_{j\le H'} q_j^2/p_j$ and $\Lambda := \ln\frac{\hat L}{r} - \ln\frac{1-\hat L}{1-r}$. Differentiating $\mathcal{L}_{kl,\chi^2}$ with respect to the primal variables $r$ and $q_i$, and with respect to the dual variable $\mu$, we get:

$\frac{\partial\mathcal{L}_{kl,\chi^2}}{\partial r} = 1 - \beta_1\Big[-\frac{\hat L}{r} + \frac{1-\hat L}{1-r}\Big],$    (56)

$\frac{\partial\mathcal{L}_{kl,\chi^2}}{\partial q_i} = -\beta_1\Big[\hat l_i\,\Lambda - \sqrt{\tfrac{I^{kl}_{R}(m,2)}{\delta S}}\,\frac{q_i}{p_i}\Big] - \mu - \mu_i, \quad i = 1,\ldots,H,$    (57)

$\frac{\partial\mathcal{L}_{kl,\chi^2}}{\partial\mu} = \sum_{j\le H'} q_j - 1.$    (58)

At an optimal solution, these derivatives are set to zero. Setting (56) to zero gives

$\beta_1 = \frac{r(1-r)}{r - \hat L} > 0,$    (59)

which is strictly positive since $r\in(0,1)$ and $r > \hat L$ by (53c); hence $\beta_1$ is a feasible value of the Lagrange multiplier. Next, multiplying (57) by $q_i$, using the complementary slackness condition $\mu_i q_i = 0$ (where $\mu_i$ is the multiplier of the constraint $q_i \ge 0$), and setting it to zero gives

$-\beta_1\, q_i\Big[\hat l_i\,\Lambda - \sqrt{\tfrac{I^{kl}_{R}(m,2)}{\delta S}}\,\frac{q_i}{p_i}\Big] - \mu\, q_i = 0.$    (60)

Since we are interested in finding the best posterior on the ordered subset of size $H' \le H$, only the first $H'$ components of $Q = (q_1,\ldots,q_H)$ take strictly positive values. Summing (60) over $i = 1,\ldots,H'$ and using $\sum_{j\le H'} q_j = 1$, we get

$\mu = -\beta_1\Big[\hat L\,\Lambda - \sqrt{\tfrac{I^{kl}_{R}(m,2)\, S}{\delta}}\Big].$    (61)

Substituting (61) back into (57) (set to zero for $i \le H'$) and cancelling $\beta_1$, we obtain

$\sqrt{\tfrac{I^{kl}_{R}(m,2)}{\delta S}}\Big(\frac{q_i}{p_i} - S\Big) = \big(\hat l_i - \hat L\big)\,\Lambda \;\Longrightarrow\; q_i = p_i\Big[S + \frac{\hat l_i - \hat L}{\sqrt{\frac{I^{kl}_{R}(m,2)}{\delta S}}}\,\Lambda\Big], \quad i = 1,\ldots,H'.$    (62)

For feasibility, we need $\sum_i q_i = 1$; normalizing by a suitable constant $Z_{kl,\chi^2}$ gives the normalized equation in the $q_i$'s:

$q_i = \frac{1}{Z_{kl,\chi^2}}\, p_i\Big[S + \frac{\hat l_i - \hat L}{\sqrt{\frac{I^{kl}_{R}(m,2)}{\delta S}}}\,\ln\Big(\frac{(1-r)\,\hat L}{r\,(1-\hat L)}\Big)\Big], \quad i = 1,\ldots,H'.$    (63)

The above is the fixed point equation (FPE) which identifies a stationary point of the bound minimization problem (53).
Corollary 2 (Optimal posterior on an ordered subset support). When the prior is the uniform distribution on $\mathcal{H}$, among all the posteriors with support a subset of size exactly $H'$, a stationary point $Q^{FP}_{kl,\chi^2}(H')$ for (53) can be obtained as the solution of the following fixed point equation:

$q_i = \frac{1}{Z_{kl,\chi^2}}\left[\sum_{j=1}^{H'} q_j^2 + \frac{\hat l_i - \sum_{j=1}^{H'}\hat l_j q_j}{\sqrt{\frac{H\, I^{kl}_{R}(m,2)}{\delta\,\sum_{j=1}^{H'} q_j^2}}}\,\ln\!\left(\frac{(1-r)\sum_{j=1}^{H'}\hat l_j q_j}{r\big(1-\sum_{j=1}^{H'}\hat l_j q_j\big)}\right)\right]$    (64)

for $i = 1,\ldots,H'$, where $Z_{kl,\chi^2}$ is a suitable normalization constant and $r$ is the solution of (53b) and (53c) for the given $Q = (q_1,\ldots,q_H)$.
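In practice, each pass of the fixed point map (64) needs the right root $r$ of (53b), which can be found with a standard bracketing root finder. The sketch below is our own Python rendering under our reading of (64) (all names are ours); it is an illustration, not the paper's implementation.

import numpy as np
from scipy.optimize import brentq

def kl_bern(p, r):
    """Bernoulli KL divergence kl(p, r) with the 0*log(0) = 0 convention."""
    val = 0.0
    if p > 0:
        val += p * np.log(p / r)
    if p < 1:
        val += (1 - p) * np.log((1 - p) / (1 - r))
    return val

def right_kl_root(p_hat, rhs):
    """Largest r in (p_hat, 1) with kl(p_hat, r) = rhs, as required by (53b)-(53c)."""
    hi = 1.0 - 1e-9
    if kl_bern(p_hat, hi) <= rhs:
        return hi
    return brentq(lambda r: kl_bern(p_hat, r) - rhs, p_hat + 1e-9, hi)

def fp_posterior_kl_chi2_uniform(emp_risks_subset, H, m, delta, I_kl_m, iters=500):
    """Fixed point iteration for (64): KL distance, chi^2 divergence, uniform prior.

    I_kl_m is the constant I_kl(m, 2), e.g. taken from Table 5; negative intermediate
    weights are clipped to zero as a pragmatic projection before renormalizing.
    """
    l_hat = np.asarray(emp_risks_subset, dtype=float)
    q = np.full(len(l_hat), 1.0 / len(l_hat))
    r = None
    for _ in range(iters):
        L = float(np.clip(np.dot(l_hat, q), 1e-9, 1 - 1e-6))     # E_Q[l_hat]
        S = float(np.sum(q ** 2))
        r = right_kl_root(L, np.sqrt(H * S * I_kl_m / delta))     # right root of (53b)
        lam = np.log((1 - r) * L / (r * (1 - L)))                 # negative, since r > L
        q_new = S + (l_hat - L) * lam / np.sqrt(H * I_kl_m / (delta * S))
        q_new = np.maximum(q_new, 0.0)
        q = q_new / q_new.sum()
    return q, r

When all the empirical risks coincide, the risk-dependent term vanishes and the iteration returns the uniform distribution, consistent with Lemma 4 below.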
Lemma 4. When all the classifiers have the same empirical risk (all $\hat l_i$ are equal), the optimal posterior for the bound minimization problem (53) is $Q \equiv P$.

KL-distance based bound minimization is non-convex with multiple stationary points, which makes it difficult to identify the global minimum even by the FP scheme. The iterative root finding algorithm adds to the computational complexity of the bound minimization algorithm.
C.3 Convex-concave procedure for finding a local solution for minimization of $B_{kl,\chi^2}(Q)$

We have seen that the optimization problem (53) for finding the bound $B_{kl,\chi^2}(Q)$ consists of a linear objective function and linear constraints, except for the constraint (53b), which takes the form

$kl\big(\mathbb{E}_Q[\hat l], r\big) = \sqrt{\frac{\big(\chi^2(Q\Vert P)+1\big)\,I^{kl}_{R}(m,2)}{\delta}}$    (65)

$\Longleftrightarrow\; \Big(\sum_{i}\hat l_i q_i\Big)\ln\frac{\sum_{i}\hat l_i q_i}{r} + \Big(1-\sum_{i}\hat l_i q_i\Big)\ln\frac{1-\sum_{i}\hat l_i q_i}{1-r} = \sqrt{\frac{\big(\sum_{i} q_i^2/p_i\big)\,I^{kl}_{R}(m,2)}{\delta}}.$    (66)

We know that $\chi^2(Q\Vert P)$ is jointly convex in both its arguments [26]. Based on the proof of Theorem 10, $\sqrt{\sum_i q_i^2/p_i}$ is a convex function of $Q$, and hence, for given system parameters $m$ and $\delta$, the right hand side of the above constraint is a convex function of $Q$. The left hand side is a composition of two functions: $\mathbb{E}_Q[\hat l]$ (a linear function) and $kl(p,q)$ (a jointly convex function). The superposition of a convex function and an affine mapping is convex, provided it is finite at at least one point [8, 4]. Hence $kl\big(\mathbb{E}_Q[\hat l], r\big)$ is convex in its arguments $(Q, r)$. This implies that the constraint (53b) is a difference of convex (DC) functions and the associated optimization problem (53) is a DC program.

Reformulating the original problem (53) in terms of inequality constraints of the form $f(x) - g(x) \le 0$, we have:

$\min_{q_1,\ldots,q_H,\, r}\; r$    (67a)

$kl\Big(\sum_i\hat l_i q_i,\, r\Big) - \sqrt{\tfrac{(\sum_i q_i^2/p_i)\,I^{kl}_{R}(m,2)}{\delta}} \le 0$    (67b)

$\sqrt{\tfrac{(\sum_i q_i^2/p_i)\,I^{kl}_{R}(m,2)}{\delta}} - kl\Big(\sum_i\hat l_i q_i,\, r\Big) \le 0$    (67c)

$\sum_i\hat l_i q_i - r \le 0$    (67d)

$\sum_i q_i = 1$    (67e)

$-q_i \le 0 \quad \forall\, i = 1,\ldots,H.$    (67f)

To apply the convex-concave procedure (CCP), we determine linear approximations of the DC functions in (67b) and (67c) at a point $(Q^0, r^0)$ which is feasible for (67), and equivalently for (53). Let $\widehat{kC}\big((Q,r);(Q^0,r^0)\big)$ denote the linear under-approximation at $(Q^0,r^0)$ of the function $kC(Q,r) := \sqrt{\tfrac{(\sum_i q_i^2/p_i)\,I^{kl}_{R}(m,2)}{\delta}}$ appearing in (67b).
$\widehat{kC}\big((Q,r);(Q^0,r^0)\big) := kC(Q^0,r^0) + \big\langle\nabla kC(Q^0,r^0),\,\big((Q-Q^0),(r-r^0)\big)\big\rangle$
$= \sqrt{\tfrac{(\sum_i (q_i^0)^2/p_i)\,I^{kl}_{R}(m,2)}{\delta}} + \sqrt{\tfrac{I^{kl}_{R}(m,2)}{\delta\,\sum_i (q_i^0)^2/p_i}}\sum_i\frac{q_i^0}{p_i}\,(q_i - q_i^0) + 0\cdot(r-r^0)$
$= \sqrt{\frac{I^{kl}_{R}(m,2)}{\delta\,\sum_i (q_i^0)^2/p_i}}\;\sum_i\frac{q_i^0\, q_i}{p_i},$

since the two constant terms cancel. Similarly, the linear under-approximation $\widehat{kK}\big((Q,r);(Q^0,r^0)\big)$ of the function $kK(Q,r) := kl\big(\sum_i\hat l_i q_i,\, r\big)$ at $(Q^0,r^0)$ is

$\widehat{kK}\big((Q,r);(Q^0,r^0)\big) := kK(Q^0,r^0) + \big\langle\nabla kK(Q^0,r^0),\,(Q-Q^0,\, r-r^0)\big\rangle$
$= \ln\frac{1-\sum_i\hat l_i q_i^0}{1-r^0} + \Big(\sum_i\hat l_i q_i\Big)\Big[\ln\frac{\sum_i\hat l_i q_i^0}{r^0} - \ln\frac{1-\sum_i\hat l_i q_i^0}{1-r^0}\Big] + \frac{r^0 - \sum_i\hat l_i q_i^0}{r^0(1-r^0)}\,(r - r^0).$

Using the above linear approximations $\widehat{kC}$ and $\widehat{kK}$ in (67b) and (67c), we can invoke the CCP procedure described in [12] to get a local minimizer of the KL-distance based bound minimization problem (53).

D Computational Illustrations for SVMs
The datasets considered for our computations, the scheme used to generate the classifiers, and the computation of the risk values are the same as in [22]. On this set of base classifiers, we compare the optimal PAC-Bayesian posteriors for the case of χ²-divergence, obtained using the FP scheme and the solver, for the different φ functions considered.

D.1 Illustration of various optimal posteriors, $Q^*_{\phi,\chi^2}$

We present some graphs to illustrate the nature of the optimal posteriors computed under the framework mentioned above. Figure 4 depicts the role of the confidence level δ in determining the optimal support size $H^*$ for $Q^*_{\mathrm{lin},\chi^2}$ in the case of a uniform prior. Figure 5 shows that the stationary point $Q^{FP}_{kl,\chi^2}$ obtained by the fixed point scheme has almost full support, and that the fixed point equation (15) always converges to a solution even when the solver throws an error such as 'M' (maximum number of iterations exceeded), 'I' (locally infeasible solution), 'E' (unknown error) or 'R' (restoration phase failed). Please see Table 6 for such examples.

[Figure 4: Illustration of the variation of the subset support of $Q^*_{\mathrm{lin},\chi^2}$ as the PAC-Bayesian confidence level δ changes. We consider the Bupa dataset (345 samples, 6 features) with a training sample of size m = 276 for determining our SVM classifiers using H = 1990 regularization parameter values from the set Λ. For a uniform prior distribution, the optimal posterior for the linear distance function, $Q^*_{\mathrm{lin},\chi^2}$, is computed via the Ipopt solver on the full simplex ($Q_{\mathrm{solver}}$) as well as via the fixed point equation (FPE) (29) on increasing ordered subsets of $\mathcal{H}$, denoted $Q_{FP}$. We observe that the FPE correctly identifies the global minimum. In the case of a uniform prior, the posterior weights $q^*_{i,\mathrm{lin},\chi^2}$ are negatively proportional to the empirical risk values $\hat l_i$ of the classifiers in the ordered support set. $H^*$ denotes the support size of the optimal posterior and $\bar l := \hat l_{H^*}$ denotes the value of the empirical risk beyond which the posterior weights are zero. For the given fixed parameters H and m, we consider three values of the confidence level δ; the optimal subset size $H^*$ decreases as δ increases, allowing sparser posteriors.]

[Figure 5: Illustration of the (almost) full support of $Q^{FP}_{kl,\chi^2}$. We consider the Spambase dataset (4601 samples, 57 features) with a training sample of size m = 3680 for determining our SVM classifiers using regularization parameter values from the set Λ. For a uniform prior distribution, the posterior for the KL-distance function, $Q^{FP}_{kl,\chi^2}$, is computed via the Ipopt solver on the full simplex ($Q_{\mathrm{solver}}$) as well as via the fixed point equation (FPE) (15) on increasing ordered subsets of $\mathcal{H}$, denoted $Q_{FP}$. We observe that the FPE always converges to a stationary point even when the solver throws an error and outputs infeasible solutions. In the case of a uniform prior, the posterior weights $q^{FP}_{i,kl,\chi^2}$ are negatively proportional to the empirical risk values $\hat l_i$ of the classifiers in the ordered support set. We notice that the support size of $Q_{FP}$, denoted $H_{FP}$, equals 1986 for all three values of the confidence level δ considered, implying that $Q_{FP}$ for the kl-χ² case has almost full support.]

Table 6: Bound values for the kl-χ² case: comparison of the bound values $B^{FP}_{kl,\chi^2}$ and $B^{\mathrm{solver}}_{kl,\chi^2}$ for the posterior obtained via the fixed point equation (64) with the linear search Algorithm 2 and the optimal posterior for minimizing the PAC-Bayesian bound (53) for the KL-distance with chi-squared divergence between the prior and posterior distributions, for nine UCI datasets, H ∈ {50, 200, 500, 1000, 1990}, and the validation set size v reported per dataset. The fixed point equation always converges and identifies the local minimum output by the Ipopt solver, even when the solver fails to identify a solution for reasons like local infeasibility (I), restoration phase failure (R), maximum number of iterations exceeded (M), unknown error (E), etc.

Table 7: Test error rates for the kl-χ² case: comparison of the test error rates $T^{FP}_{kl,\chi^2}$ and $T^{\mathrm{solver}}_{kl,\chi^2}$ when using the posterior obtained via the fixed point equation (64) with the linear search Algorithm 2 and the optimal posterior for minimizing the PAC-Bayesian bound (53) for the KL-distance with chi-squared divergence between the prior and posterior distributions, for the same datasets, values of H, and test set sizes t reported per dataset. The fixed point equation always converges and identifies the local minimum output by the Ipopt solver, even when the solver fails to identify a solution for reasons like local infeasibility (I), restoration phase failure (R), maximum number of iterations exceeded (M), unknown error (E), etc.

Table 8: Comparison of the bound values and test error rates of the optimal posterior obtained via the fixed point (FP) scheme and the posterior based on the convex-concave procedure (CCP) for minimizing the PAC-Bayesian bound $B_{kl,\chi^2}$ based on the KL-distance function with χ²-divergence. The CCP based posteriors are identified by the bound minimization model described in Section C.3. The bound values and test error rates of the FP scheme based solution are denoted $B^{FP}_{kl,\chi^2}$ and $T^{FP}_{kl,\chi^2}$; those of the CCP based posterior are denoted $B^{CCP}_{kl,\chi^2}$ and $T^{CCP}_{kl,\chi^2}$. For the computations, we consider SVM classifiers generated on nine datasets from the UCI repository [5] using the scheme in Section 7 of the main paper (also considered in [22]) for H = 50 values in Λ. We run the CCP procedure for 1000 different initializations of the posterior $Q^0$ (as done in [12]); the range, mean and standard deviation of the bound values and average test error rates of the resulting CCP based posteriors are tabulated (for example, for Spambase: $B^{FP}_{kl,\chi^2}$ = 0.43076, Range($B^{CCP}_{kl,\chi^2}$) = [0.46072, 0.53931], Mean($B^{CCP}_{kl,\chi^2}$) = 0.48658). We notice that $B^{FP}_{kl,\chi^2}$ is always better than $B^{CCP}_{kl,\chi^2}$ and that $T^{FP}_{kl,\chi^2}$ is comparable with the mean value of $T^{CCP}_{kl,\chi^2}$ for the different datasets considered. This might be because the FP scheme identifies the global minimum for the kl-χ² based bound minimization problem, whereas CCP converges to a local solution or a stationary point. 'NA' denotes the cases where the CCP cannot provide a linear approximation to $kl(\mathbb{E}_Q[\hat l], r)$ because a subgradient cannot be determined when $\mathbb{E}_Q[\hat l]$ takes the boundary value zero; such cases usually occur for the almost separable datasets Banknote and Mushroom, where $\mathbb{E}_Q[\hat l] = 0$ for any distribution Q since all the $\hat l_i$ take value zero.

D.2 Sparsity and Concentration of Optimal Posteriors, $Q^*_{\phi,\chi^2}$

We determine the optimal posteriors $Q^*_{\phi,\chi^2}$ for the different distance functions φ and compare their bound values and test error rates. To understand the differences between the nature of these posteriors for different choices of φ, we need to compare the vectors of posterior weights. For a large H, as considered here, it is difficult to compare these high-dimensional probability weight vectors elementwise, so we use different measures for capturing the information in these posteriors.

To measure the sparsity of the posteriors, we first compute their cumulative distribution functions (CDFs), denoted $F_{Q^*_{\phi,\chi^2}}(\cdot)$. We consider three significance levels α and identify the number of classifiers $N_{\phi,\chi^2}(\alpha)$, out of H = 1990, required by $F_{Q^*_{\phi,\chi^2}}(\cdot)$ to achieve the given significance level α. That is,

$N_{\phi,\chi^2}(\alpha) := \min\{\, i \mid F_{Q^*_{\phi,\chi^2}}(i) \ge \alpha \,\}.$    (68)

For a given level α, a low $N_{\phi,\chi^2}(\alpha)$ indicates that the distribution $Q^*_{\phi,\chi^2}$ is sparse. In our computations, we observe that for the three significance levels considered, the optimal posterior $Q^*_{kl,\chi^2}$ has large $N_{kl,\chi^2}(\alpha)$ values, implying almost full support, whereas $Q^*_{\mathrm{sq},\chi^2}$ is sparse, as reflected by its low $N_{\mathrm{sq},\chi^2}(\alpha)$ values (please see Table 9 for the computed values).

[Figure: Cumulative posterior weights $F_{Q^*_{\phi,\chi^2}}$ plotted against the empirical risk $\hat l$ for the Banknote, Spambase, Wdbc, Bupa and Mammographic datasets (H = 1990), comparing $F(Q^*_{\mathrm{lin},\chi^2})$, $F(Q^*_{\mathrm{sq},\chi^2})$ and $F(Q^*_{kl,\chi^2})$.]

We quantify the level of concentration of a posterior distribution on its support via the Herfindahl-Hirschman Index (HHI) [7, 28]. HHI is a prominent index in the economics literature, widely used for measuring the contribution of a sector to the economy or the market share of a firm in an industry; it is an indicator of the amount of competition among firms, and is defined as the square root of the sum of the squares of the contributions/market shares. For a probability vector such as our posteriors, HHI is therefore the ℓ₂ norm of the vector. A high HHI score indicates a high concentration of probabilities and vice versa. HHI scores for $Q^*_{\phi,\chi^2}$ are given in Table 9. We observe that $Q^*_{\mathrm{sq},\chi^2}$ has a relatively high HHI score, indicating higher concentration compared to $Q^*_{\mathrm{lin},\chi^2}$ and $Q^*_{kl,\chi^2}$. The differences in concentration levels are remarkable for datasets with highly varying empirical risk values.
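Both summaries are simple to compute from a posterior weight vector ordered by increasing empirical risk; the Python sketch below is our own helper (the significance levels shown are illustrative placeholders, not the exact values used in the study).

import numpy as np

def sparsity_and_concentration(q, alphas=(0.50, 0.80, 0.90)):
    """N(alpha) from (68) and the HHI (l2 norm) of a posterior weight vector q.

    q is assumed to be ordered by increasing empirical risk of the classifiers.
    """
    q = np.asarray(q, dtype=float)
    cdf = np.cumsum(q)                                                # F_Q(i)
    n_alpha = {a: int(np.searchsorted(cdf, a) + 1) for a in alphas}   # smallest i with F_Q(i) >= alpha
    hhi = float(np.linalg.norm(q, 2))                                 # high HHI = concentrated posterior
    return n_alpha, hhi

A sparse, concentrated posterior (e.g. the squared-distance one) shows small N(α) values together with a large HHI, while an almost-full-support posterior (e.g. the KL-distance one) shows the opposite pattern.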