Sequential Maximum Margin Classifiers for Partially Labeled Data
Elizabeth Hou, Alfred O. Hero
University of Michigan, Dept. of Electrical Engineering and Computer Science, 1301 Beal Avenue, Ann Arbor, MI 48109-2122
ABSTRACT
In many real-world applications, data is not collected as one batch, but sequentially over time, and often it is not possible or desirable to wait until the data is completely gathered before analyzing it. Thus, we propose a framework to sequentially update a maximum margin classifier by taking advantage of the Maximum Entropy Discrimination principle. Our maximum margin classifier allows for a kernel representation to represent large numbers of features and can also be regularized with respect to a smooth sub-manifold, allowing it to incorporate unlabeled observations. We compare the performance of our classifier to its non-sequential equivalents in both simulated and real datasets.
Index Terms — semi-supervised classification, support vector machines, maximum entropy, maximum margin classifiers
1. INTRODUCTION
As the popularity of big data increases and more data is being gathered, the importance of sequential models that are able to continuously update with new data has increased. These models are particularly crucial in high throughput real-time applications such as speech or streaming text classification. To this end, we propose a sequential framework to update the probabilistic maximum margin classifier built from the Maximum Entropy Discrimination (MED) principle of [1].

The proposed sequential MED framework can be cast as recursive Bayesian estimation where the likelihood function is a log-linear model formed from a series of constraints and weighted by Lagrange multipliers. In the Gaussian case it shares similarities with the problem of Gaussian process classification, which has been previously studied [2, 3, 4, 5, 6, 7], but to the best of our knowledge, a method to recursively update the Gaussian process classifier has not been developed. In the single time point case, sequential MED can be specialized to the support vector machine [4] and Laplacian support vector machine [8] as previously discussed in [1] and [9].

We are interested in situations where we receive a stream of data $X^{(1)}, X^{(2)}, \ldots$ over time $t$ where each $X^{(t)}$ is a matrix of dimension $p \times n$, with $p$ denoting the number of feature variables and $n$ denoting the number of i.i.d. samples, where $n = n^{(t)}$ may vary with time. In the fully labeled scenario, the data has corresponding labels $y_i \in \{1, -1\}$ for all $i$ and $t$; however, in the partially labeled scenario, at each time point $t$, only $l^{(t)} < n^{(t)}$ of the samples have labels. We define the observed data at any time point $t$ as $\mathcal{D}^{(t)} = \{X^{(t)}, y^{(t)}\}$ and all observed data up to time $\tau$ as $\{\mathcal{D}^{(t)}\}_{t=1}^{\tau}$. Such scenarios would arise in a variety of domains such as a satellite that only transmits its data daily or a government agency that only releases its data quarterly with the corresponding reports. The rest of the paper is organized as follows: Section 2 and Section 3 discuss how to sequentially update the corresponding MED models for supervised and semi-supervised classification. Section 4 validates the method by simulation and we present an application to a dataset of spoken letters of the English alphabet.

(This work was partially supported by the Consortium for Verification Technology under Department of Energy National Nuclear Security Administration award number DE-NA0002534 and partially by the University of Michigan ECE Departmental Fellowship.)
2. SEQUENTIAL MED
Constrained relative entropy minimization is used to estimate the closest distribution to a given prior distribution subject to a set of moment constraints. The authors of [10] show that, if the prior distribution is from the exponential family, then the density that optimizes the constrained relative entropy problem is also a member of the exponential family. Similar to Bayesian conjugate priors, there exist relative entropy conjugate priors that facilitate evaluation of the closest distribution. These produce optimal constrained relative entropy densities, which can be thought of as posteriors, from the same parametric family as the prior. Maximum entropy discrimination (MED) [1] also admits conjugate priors, as it is a special case of constrained relative entropy minimization where one of the constraints is over a parametric family of discriminant functions $L(X \mid \Theta)$.

In this paper, we are interested in maximum margin binary classifiers. In this case the discriminant function $L(X \mid \theta, b) = f(X)\theta + b$ is linear for some feature transformation $f(\cdot)$, feature weight vector $\theta$, and bias term $b$. Slack variables $\gamma_i$ are used to create a margin in the constraints on $E(y_i(f(X_i)\theta + b) - \gamma_i)$, the expected hinge loss with slack variables. The MED objective function is

$$\min_{P(\Theta, \gamma \mid \mathcal{D})} \; KL\big(P(\Theta, \gamma \mid \mathcal{D}) \,\|\, P(\Theta, \gamma)\big) \quad \text{subject to} \quad (1)$$
$$\iint P(\Theta, \gamma \mid \mathcal{D}) \big( y_i (f(X_i)\theta + b) - \gamma_i \big) \, d\Theta \, d\gamma \ge 0 \quad \forall i = 1, \ldots, n$$

whose solution $P(\Theta, \gamma \mid \mathcal{D})$ is the constrained minimum relative entropy posterior. The associated MED decision rule $\hat{y}_{i'} = \mathrm{sgn}\big(\int P(\Theta \mid \mathcal{D})(f(x_{i'})\theta + b) \, d\Theta\big)$ is a weighted combination of discriminant functions. The minimum relative entropy posterior has the form

$$P(\Theta, \gamma \mid \mathcal{D}) = \frac{P(\Theta, \gamma)}{Z(\alpha)} \exp\Big\{ \sum_{i=1}^{n} \alpha_i \big( y_i (f(X)\theta + b) - \gamma_i \big) \Big\}$$

where $\alpha = [\alpha_1, \ldots, \alpha_n]^T \ge 0$ are Lagrange multipliers that minimize the partition function $Z(\alpha)$. It is common to set the initial prior distribution to the separable form $P(\Theta, \gamma) = P(\theta) P(b) \prod_{i=1}^{n} P(\gamma_i)$. If in addition we specify that $P(\gamma_i) = C e^{-C(1-\gamma_i)} I(\gamma_i \le 1)$, $P(\theta)$ is $N(0, I)$, and $P(b)$ is a zero mean Bayesian non-informative (diffuse) prior, denoted $N(0, \infty)$, then the Lagrange multipliers can be obtained as the solution $\hat{\alpha}$ to the constrained optimization

$$\max_{\alpha} \; -\frac{1}{2} \alpha^T Y f(X) f(X)^T Y \alpha + \sum_{i=1}^{n} \big( \alpha_i + \log(1 - \alpha_i / C) \big)$$
$$\text{subject to} \quad \sum_{i=1}^{n} y_i \alpha_i = 0 \quad \text{and} \quad \alpha_1, \ldots, \alpha_n \ge 0$$

where $Y = \mathrm{diag}(y)$. This objective function has a log-barrier term $\log(1 - \alpha_i / C)$ instead of the inequality constraints $\alpha_i \le C$ commonly found in the dual form of the SVM. Except in some ill-defined cases where the maximum lies near the boundary of the feasible set, the $\hat{\alpha}_i$ will be identical to the optimal support vectors that maximize the SVM objective. The authors in [1, 9] show that the maximum a posteriori (MAP) estimator for $\theta$ of the MED posterior is related to the Lagrange multipliers by $\hat{\theta} = f(X)^T Y \hat{\alpha}$, so the MED posterior mode is equivalent to a maximum margin classifier.

Under the separable prior assumptions above, the MED posterior $P(\Theta, \gamma \mid \mathcal{D})$ will take the factored form $P(\theta \mid \mathcal{D}) P(b \mid \mathcal{D}) P(\gamma)$. Because the slack parameters $\gamma_i$ do not depend on the data $\mathcal{D}$, the density $P(\gamma)$ does not affect the MED decision rule given after (1). Hence only $P(\theta \mid \mathcal{D})$ and $P(b \mid \mathcal{D})$ are important.
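For concreteness, the batch problem above is easy to prototype with an off-the-shelf constrained optimizer. The sketch below is a minimal illustration, not the authors' implementation; the helper name med_dual, the precomputed Gram-matrix input $K = f(X)f(X)^T$, and the choice of SLSQP are all assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def med_dual(K, y, C):
    """Solve the MED dual: max_a -0.5 a'YKYa + sum_i [a_i + log(1 - a_i/C)]
    subject to y'a = 0 and 0 <= a_i < C. Hypothetical helper, not from the paper."""
    n = len(y)
    YKY = (y[:, None] * K) * y[None, :]   # Y K Y with Y = diag(y)
    eps = 1e-8                            # keep a_i strictly below C for the log barrier

    def neg_obj(a):                       # negated objective for a minimizer
        return 0.5 * a @ YKY @ a - np.sum(a + np.log(1.0 - a / C))

    def grad(a):
        return YKY @ a - 1.0 + (1.0 / C) / (1.0 - a / C)

    res = minimize(neg_obj, np.full(n, 1e-3), jac=grad, method="SLSQP",
                   bounds=[(0.0, C - eps)] * n,
                   constraints=[{"type": "eq", "fun": lambda a: y @ a,
                                 "jac": lambda a: y}])
    return res.x                          # estimated Lagrange multipliers alpha-hat
```

The log barrier keeps the iterates strictly inside the box, so, as noted above, the solution essentially coincides with the SVM support vector coefficients away from the boundary.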
This data-dependent part of the MED posterior has the form $P(\theta \mid \mathcal{D}) P(b \mid \mathcal{D}) = N(f(X)^T Y \hat{\alpha}, I)\, N(0, \infty)$, which is a conjugate distribution. Due to this conjugacy, the posterior distribution optimizing the objective in (1) can be propagated forward in time in a recursive manner. The updating procedure is given in the following theorem and corollaries.

Theorem 1.
Let the MED prior at $t = 1$ be $\theta \sim N(0, I)$, $b \sim N(0, \infty)$, and $P(\gamma_i) = C^{(1)} e^{-C^{(1)}(1-\gamma_i)} I(\gamma_i \le 1)$. Then given data $\mathcal{D}^{(\tau)}$ at time point $\tau$, the relative entropy conjugate priors are

$$P\big(\theta \mid \{\mathcal{D}^{(t)}\}_{t=1}^{\tau-1}\big) = N\Big( \sum_{t=1}^{\tau-1} f(X^{(t)})^T Y^{(t)} \hat{\alpha}^{(t)}, \; I \Big)$$
$$P\big(b \mid \{\mathcal{D}^{(t)}\}_{t=1}^{\tau-1}\big) = N(0, \infty)$$
$$P(\gamma) = \prod_{i=1}^{n^{(\tau)}} C^{(\tau)} \exp\big\{ -C^{(\tau)}(1 - \gamma_i) \big\} I(\gamma_i \le 1)$$

and the MED posterior $P(\Theta \mid \{\mathcal{D}\}_{t=1}^{\tau})$ can be represented as

$$P\big(\theta \mid \{\mathcal{D}\}_{t=1}^{\tau}\big) = N\big( \mu + f(X^{(\tau)})^T Y^{(\tau)} \hat{\alpha}^{(\tau)}, \; I \big)$$

where $\mu = \sum_{t=1}^{\tau-1} f(X^{(t)})^T Y^{(t)} \hat{\alpha}^{(t)}$ is the prior mean and $P(b \mid \{\mathcal{D}\}_{t=1}^{\tau})$ is the same as the Bayes non-informative prior.

Introducing the kernel function $k(x, x') = \langle f(x), f(x') \rangle$ and the parameter transformation $\omega = f(X)\theta$, the posterior at time $\tau > 1$ can be represented in terms of this kernel.
Corollary 1.1. The equivalent prior at $t = 1$ for the transformed parameter is $\omega \sim N(0, K^{(1)})$ where $K^{(1)} = f(X^{(1)}) f(X^{(1)})^T$. Furthermore, the posterior at time $\tau$ is of Gaussian form $P(\omega \mid \{\mathcal{D}^{(t)}\}_{t=1}^{\tau}) = N(\mu_{(\tau)}, K^{(\tau)})$ where the mean parameter satisfies the recursion $\mu_{(\tau)} = \mu_{(\tau-1)} + K^{(\tau)} Y^{(\tau)} \hat{\alpha}^{(\tau)}$.

Since $P(\theta \mid \{\mathcal{D}^{(t)}\}_{t=1}^{\tau})$ is Gaussian, the MAP estimator is simply the mean parameter $\mu_{(\tau)}$ given in Corollary 1.1. Thus the decision rule reduces to $\hat{y}_{i'} = \mathrm{sgn}(f(x_{i'}) \hat{\theta} + \hat{b})$ where the MAP estimator $\hat{\theta}$ is a function of the previously estimated Lagrange multipliers $\hat{\alpha}^{(1)}, \ldots, \hat{\alpha}^{(\tau-1)}$ and the maximizing values $\hat{\alpha}^{(\tau)}$ and $\hat{b}$ for the current time point $\tau$.
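Since the decision rule only depends on the stored per-batch products $Y^{(t)}\hat{\alpha}^{(t)}$ and kernel evaluations against the stored batches, prediction requires no refitting. A minimal sketch with illustrative names (kernel(A, B) is assumed to return the Gram matrix between the rows of A and B):

```python
import numpy as np

def med_predict(X_query, X_batches, Yalpha_batches, b_hat, kernel):
    """MED decision rule: sign( sum_t k(x, X_t) Y_t alpha_t + b )."""
    score = b_hat
    for X_t, ya_t in zip(X_batches, Yalpha_batches):
        score = score + kernel(X_query, X_t) @ ya_t   # contribution of batch t
    return np.sign(score)
```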
Corollary 1.2. Given all previous $\hat{\alpha}^{(1)}, \ldots, \hat{\alpha}^{(\tau-1)}$, the current optimal Lagrange multipliers $\hat{\alpha}^{(\tau)}$ are the solution to

$$\max_{\alpha_{(\tau)}} \; -\frac{1}{2} \alpha_{(\tau)}^T Y^{(\tau)} K^{(\tau)} Y^{(\tau)} \alpha_{(\tau)} + \sum_{i=1}^{n^{(\tau)}} \log(1 - \alpha_{(\tau)i} / C^{(\tau)}) + \alpha_{(\tau)}^T \Big( \mathbf{1} - Y^{(\tau)} \sum_{t=1}^{\tau-1} k(X^{(\tau)}, X^{(t)}) Y^{(t)} \hat{\alpha}^{(t)} \Big)$$
$$\text{subject to} \quad y_{(\tau)}^T \alpha_{(\tau)} = 0 \quad \text{and} \quad \alpha_{(\tau)i} \ge 0 \text{ for all } i = 1, \ldots, n^{(\tau)}$$

and, holding the Lagrange multipliers fixed, the optimal bias

$$\hat{b} = \arg\min_b \sum_{s \in \{i \mid \hat{\alpha}_{(\tau)i} \ne 0\}} \Big| \Big( y^{(\tau)}_s - \sum_{t=1}^{\tau} k(X^{(\tau)}_s, X^{(t)}) Y^{(t)} \hat{\alpha}^{(t)} \Big) - b \Big|$$

ensures that the expectation constraints in the objective hold.

The above dual formulation for the Lagrange multipliers $\alpha_{(\tau)}$ has some interesting implications. Since the Lagrange multipliers from the previous time points are fixed at time step $\tau$, the factors $\mathbf{1} - Y^{(\tau)} \sum_{t=1}^{\tau-1} k(X^{(\tau)}, X^{(t)}) Y^{(t)} \hat{\alpha}^{(t)}$ are constants and can be thought of as (unnormalized) weights for $\alpha_{(\tau)}$, the Lagrange multipliers from the current time point. Thus the corresponding Lagrange multipliers for samples that are easily predicted using only the prior information will have lower weight than the Lagrange multipliers for samples that are difficult or incorrectly predicted.
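A sketch of one full update at time $\tau$ follows, combining the weighted dual above with the bias rule (a sum of absolute deviations is minimized by a median, here taken over the support vectors). All names are illustrative assumptions, not the authors' code:

```python
import numpy as np
from scipy.optimize import minimize

def seq_med_update(K_tau, K_cross_list, y_tau, Yalpha_prev, C):
    """One SeqMED update (Corollary 1.2). Hypothetical helper.
    K_tau: Gram matrix of the new batch; K_cross_list: k(X_tau, X_t) for t < tau;
    Yalpha_prev: stored products Y_t alpha_t from earlier updates."""
    n = len(y_tau)
    # prior margin on the new samples contributed by all previous time points
    m_prior = sum(Kc @ ya for Kc, ya in zip(K_cross_list, Yalpha_prev))
    w = 1.0 - y_tau * m_prior              # per-sample weights on the linear term
    YKY = (y_tau[:, None] * K_tau) * y_tau[None, :]
    eps = 1e-8

    def neg_obj(a):
        return 0.5 * a @ YKY @ a - w @ a - np.sum(np.log(1.0 - a / C))

    res = minimize(neg_obj, np.full(n, 1e-3), method="SLSQP",
                   bounds=[(0.0, C - eps)] * n,
                   constraints=[{"type": "eq", "fun": lambda a: y_tau @ a}])
    alpha = res.x
    # bias: median of residuals over the support vectors minimizes the L1 criterion
    sv = alpha > 1e-6
    margins = m_prior + K_tau @ (y_tau * alpha)
    b_hat = float(np.median(y_tau[sv] - margins[sv])) if sv.any() else 0.0
    return alpha, b_hat
```

Samples that the prior already classifies with a large margin receive weights $w_i$ near or below zero, matching the interpretation above.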
3. MANIFOLD REGULARIZATION
Next we consider the case where some of the labels are missing. Without loss of generality, we will assume the first $l$ points are labeled and the latter $n - l$ points are unlabeled.

We will adopt the semi-supervised MED classification framework of [9], called Laplacian MED (LapMED). LapMED introduces an additional "geometric" constraint

$$\iint P(\theta, \lambda) \Big( \int_{x \in \mathcal{M}} \theta^T f(x) \Delta_{\mathcal{M}} f(x) \theta \, dP_x - \lambda \Big) \, d\theta \, d\lambda \le 0 \quad (2)$$

to (1), where $\mathcal{M} = \mathrm{supp}(P_X) \subset \mathbb{R}^n$ is a compact submanifold, $\Delta_{\mathcal{M}}$ is the Laplace-Beltrami operator on $\mathcal{M}$, and $\lambda$ controls the complexity of the decision boundary in the intrinsic geometry of $P_X$. This constraint was motivated by the semi-supervised framework of [8] to encourage the function $f(x)$ to be smooth over the support set of the feature distribution $P_X$, inducing a geometric interpolation of unlabeled points. Since the marginal distribution is unknown, from [11],

$$f(X)^T L f(X) \to \int_{x \in \mathcal{M}} f(x) \Delta_{\mathcal{M}} f(x) \, dP_x, \quad \text{as } n \to \infty$$

where $L$ is the normalized graph Laplacian formed with a heat kernel. The LapMED posterior can be approximated as

$$P(\theta, b, \gamma, \lambda \mid \mathcal{D}) = \frac{P(\theta, b, \gamma, \lambda)}{Z(\alpha, \beta)} \exp\Big\{ \sum_{i=1}^{l} \alpha_i \big( y_i (f(X)\theta + b) - \gamma_i \big) + \beta \big( \lambda - \theta^T f(X)^T L f(X) \theta \big) \Big\}$$

where $\beta \ge 0$ is a Lagrange multiplier for the smoothness constraint. The distribution $P(\Theta, \gamma, \lambda \mid \mathcal{D})$ that minimizes the objective with the additional constraint (2) can similarly be factorized and, like the distribution of slack parameters considered in Section 2, the distribution of the smoothness parameter $\lambda$ is independent of the data $\mathcal{D}$. Likewise, the distributions of the decision rule coefficients $P(\Theta \mid \mathcal{D})$ are conjugate with their priors. Thus the updating procedure for the LapMED problem is similar to the updating procedure in Section 2.

Theorem 2. At $t = 0$, the MED priors for $\theta$ (or $\omega$), $b$, and $\gamma_i$ are the same as in Theorem 1, and the prior for $\lambda$ is a Bayesian zero mean point prior, denoted $\mathrm{Exp.}(\infty)$. Then given data $\mathcal{D}^{(\tau)}$ at time point $\tau$, the MED conjugate prior and posterior are still $\mathrm{Exp.}(\infty)$ for $\lambda$, the same as in Theorem 1 for $b$ and $\gamma_i$, and Gaussian of form $N(\mu_{(\tau)}, \Sigma_{(\tau)})$ for $\theta$ (or $\omega$). Define an $l \times n$ expansion matrix as $J = [I \; 0]$. Then the mean and covariance parameters for the distribution of $\theta$ are

$$\mu_{(\tau)} = G_{(\tau)}^{-1} \sum_{t=1}^{\tau} f(X^{(t)})^T J^T Y^{(t)} \hat{\alpha}^{(t)}, \qquad \Sigma_{(\tau)} = G_{(\tau)}^{-1},$$

where $G_{(\tau)} = G_{(\tau-1)} + 2\beta^{(\tau)} f(X^{(\tau)})^T L^{(\tau)} f(X^{(\tau)})$ is a recursive graph of vertex disjoint subgraphs, and for the distribution of $\omega$ are

$$\mu_{(\tau)} = \sum_{t=1}^{\tau} k^{(\tau)}\big(X^{(\tau)}, X^{(t)}\big) J^T Y^{(t)} \hat{\alpha}^{(t)}, \qquad \Sigma_{(\tau)} = k^{(\tau)}\big(X^{(\tau)}, X^{(\tau)}\big)$$

where $k^{(\tau)}(x, x') = \langle f(x), G_{(\tau)}^{-1} f(x') \rangle$ is a kernel function that can be recursively defined as

$$k^{(\tau)}(x, x') = k^{(\tau-1)}(x, x') - k^{(\tau-1)}(x, X^{(\tau)}) \Big( \big(2\beta^{(\tau)} L^{(\tau)}\big)^{-1} + k^{(\tau-1)}\big(X^{(\tau)}, X^{(\tau)}\big) \Big)^{-1} k^{(\tau-1)}(X^{(\tau)}, x'). \quad (3)$$

Theorem 2 gives the posterior distribution for semi-supervised classification, whose form is comparable to the form given in Corollary 1.1 for the supervised case. Indeed the forms are identical except for the presence of the precision matrix term $G_{(\tau)}$ in the semi-supervised case.
As the sparsity of $G_{(\tau)}$ is associated with the graph Laplacian, the kernel function of the semi-supervised case is a regularized version of the kernel function that appears in Corollary 1.1. If we let $\beta^{(t)}$ be a fixed parameter, then $\hat{\alpha}^{(t)}$ and $\hat{b}$ optimize an objective of the same form as in Corollary 1.2, but with kernel function $k^{(\tau)}(x, x')$. If $\beta^{(t)}$ is chosen to be 0, the sequential LapMED simply ignores the unlabeled data of time point $t$, and if all $\beta^{(t)}$'s are 0, then the unlabeled data is always ignored and the updating procedure is exactly the same as in the supervised scenario. These parameters are functions of $\gamma_A$ and $\gamma_I$, which are identical to the penalty parameters in the Laplacian SVM [8], associated with the reproducing kernel Hilbert space and the data distribution respectively: $C^{(t)} = \frac{1}{l^{(t)} \gamma_A}$ and $\beta^{(t)} = \frac{\gamma_I}{\gamma_A (n^{(t)})^2}$.

Because the kernel function in (3) is a function of the previous kernel functions, calculating a map to its associated Hilbert space $\mathcal{H}^{(\tau)}$ can be computationally expensive. Thus in this subsection, we derive an approximation to the map to $\langle x, x' \rangle_{\mathcal{H}^{(\tau)}}$, which is computationally easier than direct recursive calculation.

Recall that we approximate the constraint in (2), at any time point $t$, empirically with the graph Laplacian $L^{(t)}$ formed using the data from that time point $X^{(t)}$. However, the non-empirical constraint, using the Laplace-Beltrami operator over the unknown marginal distribution $P_x$, is actually the same at every time point. Thus as $n^{(\tau-1)} \to \infty$, the prior graph $G_{(\tau-1)}$ converges to

$$B \int_{x \in \mathcal{M}} f(x) \Delta_{\mathcal{M}} f(x) \, dP_x \approx B \sum_{i=1}^{\infty} \delta_i \xi_i^2 \upsilon_i(z) \upsilon_i(z') \quad (4)$$

where $B = 2\sum_{t=1}^{\tau} \beta^{(t)}$, $\delta_i$ are the eigenvalues of the Laplace-Beltrami operator, and $\upsilon_i(z)$ and $\xi_i$ are the infinite sequences of right singular functions and singular values of $f(x) = \int k(x, z) f(z) \, dz$. The approximate decomposition arises since the left singular functions of $f$ are the eigenfunctions of the Laplace-Beltrami operator [12, 8]. Thus instead of empirically approximating the Laplacian as a sum of subgraphs $G_{(\tau-1)} = I + 2\sum_{t=1}^{\tau-1} \beta^{(t)} f(X^{(t)})^T L^{(t)} f(X^{(t)})$, we can instead implement approximations to the eigen/singular values and singular functions in (4).

Assuming that the sample size $n$ is large enough, the average eigenvalues of the $\tau - 1$ graph Laplacians would be a good estimator for the eigenvalues of the Laplace-Beltrami operator. Additionally, the rows of the matrix $V^T$ from the singular value decomposition of $X$ will contain the basis for its row space. Thus, because the right singular functions form an orthonormal basis for the coimage of $f$, if the mapping approximately preserves the basis, the mapped average singular vectors $f(\bar{V}_i)$ would be good estimators for the right singular functions $\upsilon_i(z)$, and correspondingly so for the singular values. The posterior kernel function $k^{(\tau)}(x, x')$ using an approximation to the decomposition in (4) will no longer be a recursive function of prior kernel functions $k^{(\tau-1)}(x, x')$ of the same form as in (3). Instead, for $\tau > 1$, it uses a prior kernel function

$$\tilde{k}^{(\tau-1)}(x, x') = k(x, x') - k(x, \bar{V}^{(\tau-1)}) \Big( \mathrm{diag}\big(\bar{s}^2_{(\tau-1)} \bar{d}_{(\tau-1)}\big)^{-1} / B + k\big(\bar{V}^{(\tau-1)}, \bar{V}^{(\tau-1)}\big) \Big)^{-1} k(\bar{V}^{(\tau-1)}, x')$$

where $k(x, x') = \langle f(x), f(x') \rangle$ is the non-regularized kernel function.
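The recursion (3) is a Woodbury-style deformation of the previous kernel, so it can be applied lazily: each update wraps the previous kernel function in a closure. A minimal sketch, assuming a dense Laplacian and adding a small ridge because a graph Laplacian is singular (an assumption; the paper works with $(2\beta L)^{-1}$ formally):

```python
import numpy as np

def deform_kernel(k_prev, X_new, L_new, beta, ridge=1e-10):
    """Return k_new implementing Eq. (3):
    k_new(x,x') = k_prev(x,x') - k_prev(x,X)[(2 beta L)^{-1} + k_prev(X,X)]^{-1} k_prev(X,x').
    k_prev maps (A, B) to the Gram matrix between the rows of A and B."""
    n = L_new.shape[0]
    Linv = np.linalg.inv(2.0 * beta * L_new + ridge * np.eye(n))  # ridge guards singular L
    M = np.linalg.inv(Linv + k_prev(X_new, X_new))

    def k_new(A, B):
        return k_prev(A, B) - k_prev(A, X_new) @ M @ k_prev(X_new, B)
    return k_new
```

Nesting these closures is exactly why the cost grows with $\tau$, which is what the approximate kernel $\tilde{k}^{(\tau-1)}$ above avoids.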
So at time $\tau$, the singular vectors of $X^{(\tau-1)}$ are used to update the average singular vectors, in the above function, through

$$\bar{V}^{(\tau-1)} = \bar{V}^{(\tau-2)} + \frac{V^{(\tau-1)} - \bar{V}^{(\tau-2)}}{\tau - 1}$$

and similarly so for the average corresponding singular values $\bar{s}_{(\tau-1)}$ and the average eigenvalues of the graph Laplacians $\bar{d}_{(\tau-1)}$.

4. EXPERIMENTS

In this section, we compare the proposed sequential maximum margin classifiers to popular supervised and semi-supervised maximum margin classifiers (SVM [4] and LapSVM [8]), where the model is trained using just the current time point's data and where the model has been re-trained on all previous data. The former type of model is a lower bound on performance since it ignores all previous data, and the latter type of model is an upper bound since it is re-trained on all previous data at every time point. Note the MED and SVM models only differ by a weak log-barrier term in the objective function, making their performance identical, and similarly so for LapMED and LapSVM. Thus their performance curves will be referred to as Full SVM/MED and Full LapSVM/LapMED.
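Putting the pieces together, the evaluation protocol for the sequential model can be sketched as a loop over time points, reusing the hypothetical seq_med_update and med_predict helpers sketched in Section 2 (all names are illustrative; the full models would instead retrain on the concatenation of all batches):

```python
import numpy as np

def run_sequential(batches, X_test, y_test, C, kernel):
    """Train SeqMED batch-by-batch and record test accuracy after each update."""
    X_hist, Yalpha_hist, acc = [], [], []
    for X_t, y_t in batches:                       # one (data, labels) pair per time point
        K_t = kernel(X_t, X_t)
        K_cross = [kernel(X_t, Xh) for Xh in X_hist]
        alpha, b_hat = seq_med_update(K_t, K_cross, y_t, Yalpha_hist, C)
        X_hist.append(X_t)
        Yalpha_hist.append(y_t * alpha)            # only these products need to be stored
        y_pred = med_predict(X_test, X_hist, Yalpha_hist, b_hat, kernel)
        acc.append(np.mean(y_pred == y_test))
    return acc
```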
In both of the following simulations, the models receive roughly 100 samples ($n^{(t)} \ge 97$) at every time point, the parameters are empirically chosen with a validation set, and then the models are tested on an independent data set of 1000 test points. The test accuracy $(\mathrm{TP} + \mathrm{TN})/1000$ is the average accuracy over 100 trials of simulation.

In the first simulation, we generate data from 200 categorical distributions where 100 of the variables are sparse so they have high probability of being 0, another 50 of the variables have lower probability of being 0, and the final 50 variables are used to distinguish between the two classes. We use the term frequency-inverse document frequency (TF-IDF) kernel of [13], which is used in document processing and topic models. Figure 1 shows that the accuracy of the sequential model (SeqMED) improves as the model is updated with more training data, and it has much better results even after one model update versus the independent model (SVM) that ignores previous training data. Of course, the sequential model does not improve as rapidly as the model that is re-trained on all the data (Full SVM/MED), but this is the price paid for lower computational complexity. For example, at $t = 30$, SeqMED updates and fits 100 coefficients for the new data whereas Full SVM/MED fits 3,000 coefficients for all the data.

Fig. 1. Accuracy of prediction for categorical fully labeled simulated data. The proposed sequential MED (SeqMED) classifier performs almost as well as the full batch implementation of the SVM/MED (Full SVM/MED).

In the second simulation, we generate data from the interior of a 3-dimensional sphere where one class is roughly at the center of the sphere and the other class is on the shell, but only 10% of the samples are labeled. We use an RBF kernel with width 1 for the kernel function, and a heat kernel with width 0.01 and a 20 nearest neighbors graph for the graph Laplacian. Figure 2 shows improvement in performance of the sequential model similar to that in Figure 1. We use the approximate kernel function of Subsection 3.2 to perform each update, establishing that the approximation is adequate.
Fig. 2. Accuracy of prediction for continuous simulated data with 10% labeled.
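For reference, the normalized heat-kernel graph Laplacian used in this simulation (and in the Isolet experiment below) can be constructed as follows. This is a sketch under common conventions; the exact bandwidth scaling of the heat kernel is an assumption:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def heat_kernel_laplacian(X, n_neighbors=20, t=0.01):
    """Normalized graph Laplacian L = I - D^{-1/2} W D^{-1/2} with heat-kernel
    weights w_ij = exp(-||x_i - x_j||^2 / (4t)) on a symmetrized kNN graph."""
    D = kneighbors_graph(X, n_neighbors, mode="distance").toarray()
    A = np.maximum(D, D.T)                               # symmetrize the kNN graph
    W = np.where(A > 0, np.exp(-A**2 / (4.0 * t)), 0.0)  # heat-kernel edge weights
    deg = W.sum(axis=1)
    Dm12 = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    return np.eye(len(deg)) - Dm12 @ W @ Dm12
```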
We compare the proposed algorithms on the Isolet speech database from the UCI machine learning repository [14], following the experimental framework used in [8]. To train the models, we take the entire training set of 120 speakers (isolet1-isolet4) and break it into 24 groups (time points) of 5 speakers, where only the first speaker is labeled. At each time point, the models train on 260 samples (two time points, one of them $t = 21$, only have 259), where 52 of the samples are labeled. The parameters are set in the same way as in [8], and the test set is similarly composed of the 1,559 samples from isolet5. Figure 3 shows that, after two time points, the sequential model always performs better than the model that ignores previous data, and comes close to performing as well as the fully re-trained model as time progresses.

Fig. 3. Accuracy of prediction on isolet5 for models trained on partially labeled speech isolets 1-4. The proposed semi-supervised sequential Laplacian MED classifier (SeqLapMED) comes close to the full Laplacian SVM [8] as time progresses.
5. CONCLUSIONS
We have proposed recursive versions of supervised and semi-supervised maximum margin classifiers in the maximum entropy discrimination (MED) classification framework. The proposed sequential maximum margin classifiers perform nearly as well as the much more computationally expensive fully re-trained maximum margin classifiers, and significantly better than a classifier that ignores previous data.

A. APPENDIX
Proof of Theorem 1.
Let $\mu_{(\tau-1)} = \sum_{t=1}^{\tau-1} f(X^{(t)})^T Y^{(t)} \hat{\alpha}^{(t)}$ where $\mu_{(0)} = 0$. At time $\tau$, let the priors be $\theta \sim N(\mu_{(\tau-1)}, I)$, $b \sim N(0, \sigma^2)$ where $\sigma^2 \to \infty$, and $\gamma_i \sim C^{(\tau)} e^{-C^{(\tau)}(1-\gamma_i)} I(\gamma_i \le 1)$. Then the posterior

$$P\big(\theta, b, \gamma \mid \{\mathcal{D}\}_{t=1}^{\tau}\big) = \frac{P(\theta) P(b) P(\gamma)}{Z(\hat{\alpha}^{(\tau)})} \exp\Big\{ \sum_{i=1}^{n^{(\tau)}} \hat{\alpha}^{(\tau)}_i \big( y^{(\tau)}_i (f(X^{(\tau)})\theta + b) - \gamma_i \big) \Big\}$$
$$= \frac{P(\theta)}{Z_\theta(\hat{\alpha}^{(\tau)})} \exp\Big\{ \sum_{i=1}^{n^{(\tau)}} \hat{\alpha}^{(\tau)}_i y^{(\tau)}_i f(X^{(\tau)})\theta \Big\} \cdot \frac{P(b)}{Z_b(\hat{\alpha}^{(\tau)})} \exp\Big\{ b \sum_{i=1}^{n^{(\tau)}} y^{(\tau)}_i \hat{\alpha}^{(\tau)}_i \Big\} \cdot \frac{\prod_{i=1}^{n^{(\tau)}} P(\gamma_i)}{Z_{\gamma_i}(\hat{\alpha}^{(\tau)})} e^{-\sum_{i=1}^{n^{(\tau)}} \hat{\alpha}^{(\tau)}_i \gamma_i}$$
$$= P(\theta \mid X^{(1)}, y^{(1)}, \ldots, X^{(\tau)}, y^{(\tau)}) \, P(b \mid y^{(1)}, \ldots, y^{(\tau)}) \, P(\gamma).$$

So the posterior of the weights is

$$P(\theta \mid X^{(1)}, y^{(1)}, \ldots, X^{(\tau)}, y^{(\tau)}) = \frac{\exp\big\{ -0.5 (\theta - \mu_{(\tau-1)})^T (\theta - \mu_{(\tau-1)}) + \hat{\alpha}_{(\tau)}^T Y^{(\tau)} f(X^{(\tau)})\theta \big\}}{(2\pi)^{p/2} Z(\hat{\alpha}^{(\tau)})}$$
$$= \frac{\exp\big\{ -0.5 \big( \theta^T \theta - 2\mu_{(\tau-1)}^T \theta - 2\hat{\alpha}_{(\tau)}^T Y^{(\tau)} f(X^{(\tau)})\theta \big) \big\}}{\int \exp\big\{ -0.5 \big( \theta^T \theta - 2\mu_{(\tau-1)}^T \theta - 2\hat{\alpha}_{(\tau)}^T Y^{(\tau)} f(X^{(\tau)})\theta \big) \big\} \, d\theta}$$
$$= \exp\Big\{ -0.5 \big( \theta - (\mu_{(\tau-1)} + f(X^{(\tau)})^T Y^{(\tau)} \hat{\alpha}^{(\tau)}) \big)^T \big( \theta - (\mu_{(\tau-1)} + f(X^{(\tau)})^T Y^{(\tau)} \hat{\alpha}^{(\tau)}) \big) \Big\} \Big/ (2\pi)^{p/2}$$
$$\sim N\big( \mu_{(\tau-1)} + f(X^{(\tau)})^T Y^{(\tau)} \hat{\alpha}^{(\tau)}, \; I \big),$$

the posterior of the bias term is

$$P(b \mid y^{(1)}, \ldots, y^{(\tau)}) = \frac{(2\pi\sigma^2)^{-1/2} \exp\big\{ -0.5 (b^2 - 2\sigma^2 b\, y_{(\tau)}^T \hat{\alpha}^{(\tau)}) / \sigma^2 \big\}}{\int (2\pi\sigma^2)^{-1/2} \exp\big\{ -0.5 (b^2 - 2\sigma^2 b\, y_{(\tau)}^T \hat{\alpha}^{(\tau)}) / \sigma^2 \big\} \, db} = \frac{e^{-0.5 (b - \sigma^2 y_{(\tau)}^T \hat{\alpha}^{(\tau)})^2 / \sigma^2}}{\sqrt{2\pi\sigma^2}} \sim N\big( \sigma^2 y_{(\tau)}^T \hat{\alpha}^{(\tau)}, \; \sigma^2 \big)$$
$$\Rightarrow \text{if } \sigma^2 \to \infty, \text{ then } N\big( \sigma^2 y_{(\tau)}^T \hat{\alpha}^{(\tau)}, \sigma^2 \big) \to N(0, \infty)$$

as long as the optimal Lagrange multipliers satisfy $y_{(\tau)}^T \hat{\alpha}^{(\tau)} = 0$, and the posterior of the margin parameters $P(\gamma)$ does not depend on the data.

Proof of Corollary 1.1.
At time $\tau$, let $\omega = f(X^{(\tau)})\theta$ have prior $N(\mu_{(\tau-1)}, K^{(\tau)})$ where $\mu_{(\tau-1)} = \sum_{t=1}^{\tau-1} k(X^{(\tau)}, X^{(t)}) Y^{(t)} \hat{\alpha}^{(t)}$. Then the posterior

$$P(\omega \mid X^{(1)}, y^{(1)}, \ldots, X^{(\tau)}, y^{(\tau)}) = \frac{P(\omega)}{Z_\omega(\hat{\alpha}^{(\tau)})} \exp\Big\{ \sum_{i=1}^{n^{(\tau)}} \hat{\alpha}^{(\tau)}_i y^{(\tau)}_i \omega \Big\}$$
$$= \frac{\exp\big\{ -0.5 (\omega - \mu_{(\tau-1)})^T K_{(\tau)}^{-1} (\omega - \mu_{(\tau-1)}) + \hat{\alpha}_{(\tau)}^T Y^{(\tau)} \omega \big\}}{|2\pi K^{(\tau)}|^{1/2} Z_\omega(\hat{\alpha}^{(\tau)})}$$
$$= \frac{e^{-0.5 \big( \omega - (\mu_{(\tau-1)} + K^{(\tau)} Y^{(\tau)} \hat{\alpha}^{(\tau)}) \big)^T K_{(\tau)}^{-1} \big( \omega - (\mu_{(\tau-1)} + K^{(\tau)} Y^{(\tau)} \hat{\alpha}^{(\tau)}) \big)}}{|2\pi K^{(\tau)}|^{1/2}}$$
$$\sim N\big( \mu_{(\tau-1)} + K^{(\tau)} Y^{(\tau)} \hat{\alpha}^{(\tau)}, \; K^{(\tau)} \big).$$

Proof of Corollary 1.2.
The optimal Lagrange multipliers at $t = \tau$ are the solution to

$$\arg\max_{\alpha_{(\tau)}} -\log Z(\alpha_{(\tau)}) = \arg\max_{\alpha_{(\tau)}} -\log Z_\theta(\alpha_{(\tau)}) - \log Z_b(\alpha_{(\tau)}) - \log Z_\gamma(\alpha_{(\tau)})$$

or $\arg\max_{\alpha_{(\tau)}} -\log Z_\omega(\alpha_{(\tau)}) - \log Z_b(\alpha_{(\tau)}) - \log Z_\gamma(\alpha_{(\tau)})$, where

$$-\log Z_\theta(\alpha_{(\tau)}) = -\log\left( \int \frac{e^{\alpha_{(\tau)}^T Y^{(\tau)} f(X^{(\tau)})\theta - 0.5 (\theta - \mu_{(\tau-1)})^T (\theta - \mu_{(\tau-1)})}}{(2\pi)^{p/2}} \, d\theta \right)$$
$$= -\alpha_{(\tau)}^T Y^{(\tau)} f(X^{(\tau)}) \mu_{(\tau-1)} - 0.5\, \alpha_{(\tau)}^T Y^{(\tau)} f(X^{(\tau)}) f(X^{(\tau)})^T Y^{(\tau)} \alpha_{(\tau)},$$

$$-\log Z_\omega(\alpha_{(\tau)}) = -\log\left( \int \frac{e^{\alpha_{(\tau)}^T Y^{(\tau)} \omega - 0.5 (\omega - \mu_{(\tau-1)})^T K_{(\tau)}^{-1} (\omega - \mu_{(\tau-1)})}}{|2\pi K^{(\tau)}|^{1/2}} \, d\omega \right)$$
$$= -\alpha_{(\tau)}^T Y^{(\tau)} \mu_{(\tau-1)} - 0.5\, \alpha_{(\tau)}^T Y^{(\tau)} K^{(\tau)} Y^{(\tau)} \alpha_{(\tau)},$$

$$-\log Z_b(\alpha_{(\tau)}) = -\log\left( \int \frac{e^{-0.5 (b^2 - 2\sigma^2 b\, y_{(\tau)}^T \alpha_{(\tau)}) / \sigma^2}}{\sqrt{2\pi\sigma^2}} \, db \right) = -\log\left( e^{0.5 \sigma^2 (y_{(\tau)}^T \alpha_{(\tau)})^2} \right) = -0.5\, \sigma^2 \big( y_{(\tau)}^T \alpha_{(\tau)} \big)^2$$
$$\Rightarrow \text{if } \sigma^2 \to \infty, \text{ then } y_{(\tau)}^T \alpha_{(\tau)} = 0,$$

and

$$-\log Z_\gamma(\alpha_{(\tau)}) = -\sum_{i=1}^{n^{(\tau)}} \log Z_{\gamma_i}(\alpha_{(\tau)}) = -\sum_{i=1}^{n^{(\tau)}} \log\left( \int_{-\infty}^{1} C^{(\tau)} e^{-C^{(\tau)}(1-\gamma_i)} e^{-\alpha_{(\tau)i}\gamma_i} \, d\gamma_i \right)$$
$$= -\sum_{i=1}^{n^{(\tau)}} \log\left( \frac{C^{(\tau)}}{C^{(\tau)} - \alpha_{(\tau)i}} e^{-C^{(\tau)} + \gamma_i(C^{(\tau)} - \alpha_{(\tau)i})} \Big|_{-\infty}^{1} \right) = -\sum_{i=1}^{n^{(\tau)}} \log\left( \frac{C^{(\tau)} e^{-\alpha_{(\tau)i}}}{C^{(\tau)} - \alpha_{(\tau)i}} \right) = \sum_{i=1}^{n^{(\tau)}} \alpha_{(\tau)i} + \log\left( 1 - \frac{\alpha_{(\tau)i}}{C^{(\tau)}} \right).$$

Proof of Theorem 2.
At time $\tau$, let the priors for $b$ and $\gamma_i$ be the same as in Theorem 1, $\lambda \sim \mathrm{Exp.}(\nu)$ where $\nu \to \infty$, and $\theta$ (or $\omega$) $\sim N(\mu_{(\tau-1)}, \Sigma_{(\tau-1)})$. Then the posterior $P(\theta, b, \gamma, \lambda \mid \{\mathcal{D}\}_{t=1}^{\tau})$ and partition function $Z(\alpha_{(\tau)}, \beta^{(\tau)})$ factorize similarly as

$$P(\theta \mid X^{(1)}, y^{(1)}, \ldots, X^{(\tau)}, y^{(\tau)}) \, P(b \mid y^{(1)}, \ldots, y^{(\tau)}) \, P(\gamma) \, P(\lambda)$$

and

$$Z_\theta(\alpha_{(\tau)}, \beta^{(\tau)}) \, Z_b(\alpha_{(\tau)}) \, Z_\lambda(\beta^{(\tau)}) \prod_{i=1}^{l^{(\tau)}} Z_{\gamma_i}(\alpha_{(\tau)}).$$

The bias and margin terms are independent of $\beta^{(\tau)}$, so their posterior and partition functions are the same as in Theorem 1. The posterior of the smoothness parameter $\lambda$ does not depend on the data, and

$$-\log Z_\lambda(\beta^{(\tau)}) = -\log\left( \int_0^\infty \nu e^{-\nu\lambda} e^{\beta^{(\tau)}\lambda} \, d\lambda \right) = -\log\left( \frac{\nu}{\nu - \beta^{(\tau)}} \right) \Rightarrow \text{if } \nu \to \infty, \text{ then } \log(1 - \beta^{(\tau)}/\nu) = 0.$$

Let the parameters for the prior distribution of $\theta$ be

$$\mu_{(\tau-1)} = G_{(\tau-1)}^{-1} \sum_{t=1}^{\tau-1} f(X^{(t)})^T J^T Y^{(t)} \hat{\alpha}^{(t)}, \qquad \Sigma_{(\tau-1)} = G_{(\tau-1)}^{-1},$$

where $G_{(\tau-1)} = G_{(\tau-2)} + 2\beta^{(\tau-1)} f(X^{(\tau-1)})^T L^{(\tau-1)} f(X^{(\tau-1)})$ and $G_{(0)} = I$. Then the posterior

$$P(\theta \mid X^{(1)}, y^{(1)}, \ldots, X^{(\tau)}, y^{(\tau)}) = \frac{\exp\big\{ -0.5 (\theta - \mu_{(\tau-1)})^T \Sigma_{(\tau-1)}^{-1} (\theta - \mu_{(\tau-1)}) \big\}}{\det(2\pi\Sigma_{(\tau-1)})^{1/2}} \cdot \frac{\exp\big\{ \hat{\alpha}_{(\tau)}^T Y^{(\tau)} J f(X^{(\tau)})\theta - \beta^{(\tau)} \theta^T f(X^{(\tau)})^T L^{(\tau)} f(X^{(\tau)}) \theta \big\}}{Z_\theta(\hat{\alpha}_{(\tau)}, \beta^{(\tau)})}$$

which, after completing the square in $\theta$ (collecting the quadratic terms into $G_{(\tau)} = G_{(\tau-1)} + 2\beta^{(\tau)} f(X^{(\tau)})^T L^{(\tau)} f(X^{(\tau)})$ and the linear terms into $\sum_{t=1}^{\tau} f(X^{(t)})^T J^T Y^{(t)} \hat{\alpha}^{(t)}$), is

$$\exp\left\{ -0.5 \left( \theta - G_{(\tau)}^{-1} \sum_{t=1}^{\tau} f(X^{(t)})^T J^T Y^{(t)} \hat{\alpha}^{(t)} \right)^T G_{(\tau)} \left( \theta - G_{(\tau)}^{-1} \sum_{t=1}^{\tau} f(X^{(t)})^T J^T Y^{(t)} \hat{\alpha}^{(t)} \right) \right\} \Big/ \det\big(2\pi G_{(\tau)}^{-1}\big)^{1/2}$$
$$\sim N\left( G_{(\tau)}^{-1} \sum_{t=1}^{\tau} f(X^{(t)})^T J^T Y^{(t)} \hat{\alpha}^{(t)}, \; G_{(\tau)}^{-1} \right)$$

with, up to a term $\mathrm{Const.}_{\beta^{(\tau)}}$ that depends only on $\beta^{(\tau)}$ (the log-determinant ratio of $G_{(\tau)}$ and $G_{(\tau-1)}$ and the prior quadratic form),

$$-\log Z_\theta(\alpha_{(\tau)}, \beta^{(\tau)}) = \mathrm{Const.}_{\beta^{(\tau)}} - 0.5\, \alpha_{(\tau)}^T Y^{(\tau)} J f(X^{(\tau)}) G_{(\tau)}^{-1} f(X^{(\tau)})^T J^T Y^{(\tau)} \alpha_{(\tau)} - \alpha_{(\tau)}^T Y^{(\tau)} J f(X^{(\tau)}) G_{(\tau)}^{-1} \sum_{t=1}^{\tau-1} f(X^{(t)})^T J^T Y^{(t)} \hat{\alpha}^{(t)}$$

where $\mathrm{Const.}_{\beta^{(\tau)}}$ can be dropped from the objective when the $\beta^{(t)}$ are fixed parameters.

Or, let the parameters for the prior distribution of $\omega = f(X^{(\tau)})\theta$ be

$$\mu_{(\tau-1)} = \sum_{t=1}^{\tau-1} k^{(\tau-1)}\big(X^{(\tau)}, X^{(t)}\big) J^T Y^{(t)} \hat{\alpha}^{(t)}, \qquad \Sigma_{(\tau-1)} = k^{(\tau-1)}\big(X^{(\tau)}, X^{(\tau)}\big)$$

where $k^{(0)}(x, x') = \langle f(x), f(x') \rangle$ and

$$k^{(\tau-1)}(x, x') = k^{(\tau-2)}(x, x') - k^{(\tau-2)}(x, X^{(\tau-1)}) \Big( \big(2\beta^{(\tau-1)} L^{(\tau-1)}\big)^{-1} + k^{(\tau-2)}\big(X^{(\tau-1)}, X^{(\tau-1)}\big) \Big)^{-1} k^{(\tau-2)}(X^{(\tau-1)}, x').$$

The posterior

$$P(\omega \mid X^{(1)}, y^{(1)}, \ldots, X^{(\tau)}, y^{(\tau)}) = \frac{e^{-0.5 (\omega - \mu_{(\tau-1)})^T \Sigma_{(\tau-1)}^{-1} (\omega - \mu_{(\tau-1)})}}{\det(2\pi\Sigma_{(\tau-1)})^{1/2}} \cdot \frac{e^{\hat{\alpha}_{(\tau)}^T Y^{(\tau)} J \omega - \beta^{(\tau)} \omega^T L^{(\tau)} \omega}}{Z_\omega(\hat{\alpha}_{(\tau)}, \beta^{(\tau)})}$$

completes the square in the same way, giving

$$P(\omega \mid \cdot) \sim N\left( \sum_{t=1}^{\tau} k^{(\tau)}\big(X^{(\tau)}, X^{(t)}\big) J^T Y^{(t)} \hat{\alpha}^{(t)}, \; k^{(\tau)}\big(X^{(\tau)}, X^{(\tau)}\big) \right)$$

and

$$-\log Z_\omega(\alpha_{(\tau)}, \beta^{(\tau)}) = \mathrm{Const.}_{\beta^{(\tau)}} - 0.5\, \alpha_{(\tau)}^T Y^{(\tau)} J k^{(\tau)}\big(X^{(\tau)}, X^{(\tau)}\big) J^T Y^{(\tau)} \alpha_{(\tau)} - \alpha_{(\tau)}^T Y^{(\tau)} J \sum_{t=1}^{\tau-1} k^{(\tau)}\big(X^{(\tau)}, X^{(t)}\big) J^T Y^{(t)} \hat{\alpha}^{(t)}.$$

B. REFERENCES

[1] Tommi Jaakkola, Marina Meila, and Tony Jebara, "Maximum entropy discrimination," in
Advances in Neural Information Processing Systems 12, S. A. Solla, T. K. Leen, and K. Müller, Eds. 2000, pp. 470-476, MIT Press.

[2] Grace Wahba, "Support vector machines, reproducing kernel Hilbert spaces and the randomized GACV," Advances in Kernel Methods - Support Vector Learning, vol. 6, pp. 69-87, 1999.

[3] Tommi S. Jaakkola and David Haussler, "Probabilistic kernel regression models," in AISTATS, 1999.

[4] Alex J. Smola, Bernhard Schölkopf, and Klaus-Robert Müller, "The connection between regularization operators and support vector kernels," Neural Networks, vol. 11, no. 4, pp. 637-649, 1998.

[5] Manfred Opper and Ole Winther, "Gaussian process classification and SVM: Mean field results and leave-one-out estimator," Advances in Large Margin Classifiers, 1999.

[6] Peter Sollich, "Bayesian methods for support vector machines: Evidence and predictive class probabilities," Machine Learning, vol. 46, no. 1, pp. 21-52, 2002.

[7] Carl Edward Rasmussen and Christopher K. I. Williams, Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning), The MIT Press, 2005.

[8] Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani, "Manifold regularization: A geometric framework for learning from labeled and unlabeled examples," J. Mach. Learn. Res., vol. 7, pp. 2399-2434, Dec. 2006.

[9] Elizabeth Hou, Kumar Sricharan, and Alfred O. Hero, "Latent Laplacian maximum entropy discrimination for detection of high-utility anomalies," IEEE Transactions on Information Forensics and Security, vol. 13, no. 6, pp. 1446-1459, June 2018.

[10] Oluwasanmi Koyejo and Joydeep Ghosh, "A representation approach for relative entropy minimization with expectation constraints," in ICML WDDL Workshop, 2013.

[11] Alexander Grigoryan, "Heat kernels on weighted manifolds and applications," Cont. Math., vol. 398, pp. 93-191, 2006.

[12] R. R. Lederman and V. Rokhlin, "On the analytical and numerical properties of the truncated Laplace transform I," SIAM Journal on Numerical Analysis, vol. 53, no. 3, pp. 1214-1235, 2015.

[13] Charles Elkan, "Deriving TF-IDF as a Fisher kernel," in String Processing and Information Retrieval (SPIRE), 2005.

[14] M. Lichman, "UCI machine learning repository," 2013.