Stochastic Discriminative EM
Andrés R. Masegosa †‡
† Dept. of Computer and Information Science, Norwegian University of Science and Technology, Trondheim, Norway
‡ Dept. of Computer Science and A.I., University of Granada, Granada, Spain
Abstract
Stochastic discriminative EM (sdEM) is an online-EM-type algorithm for discriminative training of probabilistic generative models belonging to the exponential family. In this work, we introduce and justify this algorithm as a stochastic natural gradient descent method, i.e. a method which accounts for the information geometry in the parameter space of the statistical model. We show how this learning algorithm can be used to train probabilistic generative models by minimizing different discriminative loss functions, such as the negative conditional log-likelihood and the Hinge loss. The models trained by sdEM are always generative (i.e. they define a joint probability distribution) and, in consequence, they can deal with missing data and latent variables in a principled way, both when being learned and when making predictions. The performance of this method is illustrated on several text classification problems for which a multinomial naive Bayes and a latent Dirichlet allocation based classifier are learned using different discriminative loss functions.
1 Introduction

Online learning methods based on stochastic approximation theory [21] have been a promising research direction for tackling the learning problems of the so-called Big Data era [1, 10, 12]. Stochastic gradient descent (SGD) is probably the best known example of this kind of technique, used to solve a wide range of learning problems [9]. This algorithm and other variants [29] are usually employed to train discriminative models such as logistic regression or SVMs [10]. There are also some successful examples of the use of SGD for discriminative training of probabilistic generative models, as is the case of deep belief networks [19]. However, this learning algorithm cannot be used directly for the discriminative training of general generative models. One of the main reasons is that statistical estimation or risk minimization problems for generative models involve the solution of an optimization problem with a large number of normalization constraints [26], i.e. those which guarantee that the optimized parameter set defines a valid probabilistic model. Although successful solutions to this problem have been proposed [16, 22, 26, 32], they are based on ad-hoc methods which cannot be easily extended to other statistical models, and hardly scale to large data sets.

Stochastic approximation theory [21] has also been used for maximum likelihood estimation (MLE) of probabilistic generative models with latent variables, as is the case of the online EM algorithm [13, 30]. This method provides efficient MLE for a broad class of statistical models (i.e. exponential family models) by sequentially updating the so-called expectation parameters. The advantage of this approach is that the resulting iterative optimization algorithm is fairly simple and amenable, as it does not involve any normalization constraints.

In this paper we show that the derivation of Sato's online EM [30] can be extended to the discriminative learning of generative models by introducing a novel interpretation of this algorithm as a natural gradient algorithm [3]. The resulting algorithm, called stochastic discriminative EM (sdEM), is an online-EM-type algorithm that can train generative probabilistic models belonging to the exponential family using a wide range of discriminative loss functions, such as the negative conditional log-likelihood or the Hinge loss. In contrast to other discriminative learning approaches [26], models trained by sdEM can deal with missing data and latent variables in a principled way, both when being learned and when making predictions, because at any moment they always define a joint probability distribution. sdEM could be used for learning with large-scale data sets due to its stochastic approximation nature and, as we will show, because it allows the natural gradient of the loss function to be computed at no extra cost [3]. Moreover, if allowed by the generative model and the discriminative loss function, the presented algorithm could potentially be used interchangeably for classification, regression or any other prediction task. In this initial work, however, sdEM is only evaluated experimentally on classification problems.

The rest of this paper is organized as follows. Section 2 provides the preliminaries for the description of the sdEM algorithm, which is detailed in Section 3. A brief experimental evaluation is given in Section 4, while Section 5 contains the main conclusions of this work.
2 Preliminaries

We consider generative statistical models for prediction tasks, where Y denotes the random variable (or vector-valued random variable) to be predicted, X denotes the predictive variables, and y* denotes a prediction, which is made according to y* = arg max_y p(y, x | θ).
Assumption 1. The generative data model belongs to the exponential family with a natural (or canonical) parametrization

    p(y, x | θ) ∝ exp(⟨s(y, x), θ⟩ − A_l(θ))

where θ is the so-called natural parameter, which belongs to the so-called natural parameter space Θ ⊆ ℝ^K, s(y, x) is the vector of sufficient statistics belonging to a convex set S ⊆ ℝ^K, ⟨·,·⟩ denotes the dot product, and A_l is the log-partition function.
Assumption 2. We are given a conjugate prior distribution p(θ | α) of the generative data model,

    p(θ | α) ∝ exp(⟨s(θ), α⟩ − A_g(α))

where the sufficient statistics are s(θ) = (θ, −A_l(θ)) and the hyperparameter α has two components (ᾱ, ν): ν is a positive scalar and ᾱ is a vector also belonging to S [6].

The so-called expectation parameter µ ∈ S can also be used to parameterize probability distributions of the exponential family; it is a dual set of the model parameter θ [2]. The expectation parameter µ is defined as the expected vector of sufficient statistics with respect to θ:

    µ ≜ E[s(y, x) | θ] = ∫ s(y, x) p(y, x | θ) dy dx = ∂A_l(θ)/∂θ        (1)

The transformation between θ and µ is one-to-one, so Equation (1) can be inverted: θ = θ(µ). That is, for each θ ∈ Θ we always have an associated µ ∈ S, and both parameterize the same probability distribution.

To obtain the natural parameter θ associated with an expectation parameter µ, we make use of the negative of the entropy,

    H(µ) ≜ ∫ p(y, x | θ(µ)) ln p(y, x | θ(µ)) dy dx = sup_{θ ∈ Θ} ⟨µ, θ⟩ − A_l(θ)        (2)

Using the above function, the natural parameter θ can be explicitly expressed as

    θ = θ(µ) = ∂H(µ)/∂µ        (3)

Equations (1), (2) and (3) define the Legendre-Fenchel transform. Another key requirement of our approach is that the transformation from µ to θ can be computed in closed form.
Assumption 3. The transformation from the expectation parameter µ to the natural parameter θ, which can be expressed as

    θ(µ) = arg max_{θ ∈ Θ} ⟨µ, θ⟩ − A_l(θ)        (4)

is available in closed form.

The above mapping is also known as the maximum likelihood function, because θ((1/n) Σ_{i=1}^n s(y_i, x_i)) gives the maximum likelihood estimate θ* for a data set with n observations {(y_1, x_1), ..., (y_n, x_n)}.

For later convenience, we state the following relations between the Fisher information matrices I(θ) and I(µ) of the probability distributions p(y, x | θ) and p(y, x | θ(µ)), respectively [25]:

    I(θ) = ∂²A_l(θ)/∂θ∂θ = ∂µ/∂θ = I(µ)^{-1}        (5)
    I(µ) = ∂²H(µ)/∂µ∂µ = ∂θ/∂µ = I(θ)^{-1}        (6)
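As a concrete illustration of Equations (1)-(6) — our own example, not part of the original derivation — consider the Bernoulli distribution written in the form of Assumption 1:

    p(x | θ) = exp(xθ − A_l(θ)),   A_l(θ) = ln(1 + e^θ),   s(x) = x.

Then

    µ = ∂A_l(θ)/∂θ = e^θ/(1 + e^θ),        θ(µ) = ln(µ/(1 − µ)),
    H(µ) = µ ln µ + (1 − µ) ln(1 − µ),
    I(θ) = ∂²A_l(θ)/∂θ² = µ(1 − µ),        I(µ) = ∂²H(µ)/∂µ² = 1/(µ(1 − µ)) = I(θ)^{-1},

so the Legendre-Fenchel pair (1)-(3) and the inverse relation (5)-(6) between the two Fisher information matrices can all be read off in closed form.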
Let W = {w ∈ ℝ^K} be a parameter space on which a function L(w) is defined. When W is a Euclidean space with an orthonormal coordinate system, the negative gradient points in the direction of steepest descent. That is, the negative gradient −∂L(w)/∂w points in the same direction as the solution to

    arg min_{dw} L(w + dw)   subject to   ||dw||² = ε²        (7)

for sufficiently small ε, where ||dw||² is the squared length of a small increment vector dw connecting w and w + dw. This justifies the use of the classical gradient descent method for finding the minimum of L(w) by taking steps (of size ρ) in the direction of the negative gradient:

    w_{t+1} = w_t − ρ ∂L(w_t)/∂w        (8)

However, when W is a Riemannian space [4], there are no orthonormal linear coordinates, and the squared length of a vector dw is defined by

    ||dw||² = Σ_{ij} g_{ij}(w) dw_i dw_j        (9)

where the K × K matrix G = (g_{ij}) is called the Riemannian metric tensor, and it generally depends on w. G reduces to the identity matrix in the case of a Euclidean space [4]. In a Riemannian space, the steepest descent direction is no longer the traditional gradient: −∂L(w)/∂w is not the solution of Equation (7) when the squared length of dw is defined by Equation (9). Amari [3] shows that this solution can be computed by pre-multiplying the traditional gradient by the inverse of the Riemannian metric, G^{-1}.

Theorem 1. The steepest descent direction, or natural gradient, of L(w) in a Riemannian space is given by

    −∂̃L(w)/∂̃w = −G^{-1}(w) ∂L(w)/∂w        (10)

where ∂̃L(w)/∂̃w denotes the natural gradient.

As argued in [3], in statistical estimation problems we should use gradient descent methods which account for the natural gradient of the parameter space, since the parameter space of a statistical model (belonging to the exponential family or not) is a Riemannian space with the Fisher information matrix of the statistical model, I(w), as the metric tensor [2], and this is the only invariant metric that must be given to the statistical model [2].

Sato's online EM algorithm [30] is used for maximum likelihood estimation of missing-data-type statistical models. The model defines a probability distribution over two random (or vector-valued) variables X and Z, and is assumed to belong to the exponential family:

    p(z, x | θ) ∝ exp(⟨s(z, x), θ⟩ − A_l(θ))

where (z, x) denotes a so-called complete data event. The key aspect is that we can only observe x, since z is an unobservable event. In consequence, the loss function ℓ(x, θ) is defined by marginalizing z: ℓ(x, θ) = −ln ∫ p(z, x | θ) dz. The online setting assumes the observation of a non-finite data sequence {x_t}_{t≥1} independently drawn according to the unknown data distribution π. The objective function that EM seeks to minimize is given by the expectation L(θ) = E[ℓ(x, θ) | π]. We derive this algorithm in terms of the minimization of a loss function to highlight its connection with sdEM.
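To make this missing-data setting concrete, the following minimal Java sketch — our own illustration, not taken from the paper; the model, parameter values and names are assumptions — computes, for a two-component univariate Gaussian mixture, the marginal loss ℓ(x, θ) = −ln ∫ p(z, x | θ) dz and the expected sufficient statistics E_z[s(z, x) | θ] on which the online update derived next is based.

public final class GmmExpectedStats {
    // Illustrative sketch (not from the paper): weights, means and variances of a 2-component GMM.
    static final double[] W = {0.5, 0.5};
    static final double[] M = {-1.0, 2.0};
    static final double[] V = {1.0, 0.5};

    static double gauss(double x, double m, double v) {
        return Math.exp(-0.5 * (x - m) * (x - m) / v) / Math.sqrt(2.0 * Math.PI * v);
    }

    // Marginal loss ℓ(x, θ) = −ln Σ_k w_k N(x; m_k, v_k).
    static double marginalLoss(double x) {
        double p = 0.0;
        for (int k = 0; k < W.length; k++) p += W[k] * gauss(x, M[k], V[k]);
        return -Math.log(p);
    }

    // Expected sufficient statistics E_z[s(z, x) | θ]: for each component k, the responsibility
    // r_k = p(z = k | x, θ) together with r_k·x and r_k·x² (complete-data statistics averaged
    // over the posterior of the hidden component).
    static double[] expectedStats(double x) {
        int K = W.length;
        double[] r = new double[K];
        double norm = 0.0;
        for (int k = 0; k < K; k++) { r[k] = W[k] * gauss(x, M[k], V[k]); norm += r[k]; }
        double[] s = new double[3 * K];
        for (int k = 0; k < K; k++) {
            double rk = r[k] / norm;
            s[3 * k] = rk; s[3 * k + 1] = rk * x; s[3 * k + 2] = rk * x * x;
        }
        return s;
    }

    public static void main(String[] args) {
        double x = 0.3;
        System.out.println("loss  = " + marginalLoss(x));
        System.out.println("E[s]  = " + java.util.Arrays.toString(expectedStats(x)));
    }
}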
Sato [30] derived the stochastic updating equation of online EM by relying on the free-energy formulation, or lower-bound maximization, of the EM algorithm [24] and on a discounting averaging method. Using our own notation, this updating equation is expressed as follows:

    µ_{t+1} = (1 − ρ_t) µ_t + ρ_t E_z[s(z, x_t) | θ(µ_t)]
            = µ_t + ρ_t (E_z[s(z, x_t) | θ(µ_t)] − µ_t)
            = µ_t − ρ_t ∂ℓ(x_t, θ(µ_t))/∂θ        (11)

where E_z[s(z, x_t) | θ(µ_t)] denotes the expected sufficient statistics, E_z[s(z, x_t) | θ(µ_t)] = ∫ s(z, x_t) p(z | x_t, θ(µ_t)) dz. He proved the convergence of the above iteration by casting it as a second-order stochastic gradient descent using the equality

    ∂ℓ(x, θ)/∂θ = (∂µ/∂θ) ∂ℓ(x, θ(µ))/∂µ = I(µ)^{-1} ∂ℓ(x, θ(µ))/∂µ        (12)

which is obtained by first applying the chain rule and then the equality shown in Equation (5). It shows that online EM is equivalent to a stochastic gradient descent with I(µ_t)^{-1} as coefficient matrices [9]. Sato noted that the third term of the equality in Equation (12) resembles a natural gradient (see Theorem 1), but he did not explore the connection. The key insights of the above derivation, which were not noted by Sato, are that Equation (12) is also valid for loss functions other than the marginal log-likelihood, and that the convergence of Equation (11) does not depend on the formulation of EM as a "lower bound maximization" method [24].

3 Stochastic Discriminative EM (sdEM)

We consider the following supervised learning setup. Let us assume that we are given a data set D with n observations {(y_1, x_1), ..., (y_n, x_n)}. We are also given a discriminative loss function ℓ(y_i, x_i, θ), which is assumed to satisfy the mild conditions given in [9] (e.g., it can be a non-smooth function, such as the Hinge loss). For example, it could be the negative conditional log-likelihood (NCLL), ℓ(y_i, x_i, θ) = −ln p(y_i, x_i | θ) + ln ∫ p(y, x_i | θ) dy = −ln p(y_i | x_i, θ). Our learning problem consists in minimizing the following objective function:

    L(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, x_i, θ) − (1/n) ln p(θ | α) = E[ℓ(y, x, θ) | π] − (1/n) ln p(θ | α)        (13)

where π is now the empirical distribution of D and E[ℓ(y, x, θ) | π] the empirical risk. Although the above objective is not standard in the machine learning literature, we note that when ℓ is the negative log-likelihood (NLL) we recover classic maximum a posteriori estimation; the objective can be seen as an extension of that framework.

sdEM is presented as a generalization of Sato's online EM algorithm for finding the minimum of an objective function of the form of Equation (13) (i.e. the solution to our learning problem). The stochastic updating equation of sdEM can be expressed as follows:

    µ_{t+1} = µ_t − ρ_t I(µ_t)^{-1} ∂ℓ̄(y_t, x_t, θ(µ_t))/∂µ        (14)

where (y_t, x_t) denotes the t-th sample, randomly generated from π, and the function ℓ̄ is defined as ℓ̄(y_t, x_t, θ(µ_t)) = ℓ(y_t, x_t, θ(µ_t)) − (1/n) ln p(θ(µ_t) | α). We note that this loss function satisfies the following equality, which is the basis of a stochastic approximation method [21]: E[ℓ̄(y, x, θ(µ)) | π] = L(θ(µ)).

As in Amari's natural gradient algorithm [3], the main problem of sdEM as formulated in Equation (14) is the computation of the inverse of the Fisher information matrix at each step, which becomes prohibitive for large models.
The following result shows that this can be circumvented when we deal with distributions of the exponential family.
Theorem 2. In the exponential family, the natural gradient of a loss function with respect to the expectation parameters equals the gradient of the loss function with respect to the natural parameters:

    I(µ)^{-1} ∂ℓ̄(y, x, θ(µ))/∂µ = ∂ℓ̄(y, x, θ)/∂θ
Sketch of the proof. We first need to prove that I(µ) is a valid Riemannian metric tensor and, hence, that the expectation parameter space has a Riemannian structure defined by the metric I(µ), so that the definition of the natural gradient makes sense. This follows from the invariance of the Fisher information metric under one-to-one reparameterizations or, equivalently, under transformations of the coordinate system [2, 4]: I(µ) is a Riemannian metric because it is the Fisher information matrix of the reparameterized model p(y, x | θ(µ)), and the reparameterization is one-to-one, as commented in Section 2.2. The equality stated in the theorem then follows directly from Sato's derivation of the online EM algorithm (Equation (12)), which shows that we can avoid the computation of I(µ)^{-1} by using the natural parameters instead of the expectation parameters together with the function θ(µ).

Theorem 2 simplifies sdEM's updating equation to

    µ_{t+1} = µ_t − ρ_t ∂ℓ̄(y_t, x_t, θ(µ_t))/∂θ        (15)

sdEM can thus be interpreted as a stochastic gradient descent algorithm that iterates over the expectation parameters and is guided by the natural gradient in this Riemannian space.
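Continuing the Bernoulli example above (again our own illustration, not the paper's), Theorem 2 can be verified directly for the fully observed negative log-likelihood loss ℓ(x, θ) = −xθ + A_l(θ):

    ∂ℓ(x, θ)/∂θ = µ − x,
    ℓ(x, θ(µ)) = −x ln µ − (1 − x) ln(1 − µ),
    ∂ℓ(x, θ(µ))/∂µ = (µ − x)/(µ(1 − µ)),

so that I(µ)^{-1} ∂ℓ(x, θ(µ))/∂µ = µ(1 − µ) · (µ − x)/(µ(1 − µ)) = µ − x = ∂ℓ(x, θ)/∂θ, as stated by the theorem.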
Algorithm 1  Stochastic Discriminative EM (sdEM)
Require: D is randomly shuffled.
  µ_0 = ᾱ;   (initialize according to the prior)
  θ_0 = θ(µ_0);  t = 0;
  repeat
    for i = 1, ..., n do
      E-Step:      µ_{t+1} = µ_t − (1 + λt)^{-1} ∂ℓ̄(y_i, x_i, θ_t)/∂θ;
      Check-Step:  µ_{t+1} = Check(µ_{t+1}, S);
      M-Step:      θ_{t+1} = θ(µ_{t+1});  t = t + 1;
    end for
  until convergence
  return θ(µ_t);

An alternative proof of Theorem 2, based on more recent results in information geometry, has recently been given in [27]. The results of that work indicate that sdEM could also be interpreted as a mirror descent algorithm with a Bregman divergence as the proximity measure. It is beyond the scope of this paper to explore this relevant connection.

In this section we do not attempt to give a formal proof of the convergence of sdEM, since very careful technical arguments would be needed for this purpose [9]. We simply go through the main elements that define the convergence of sdEM as a stochastic approximation method [21]. According to Equation (14), sdEM can be seen as a stochastic gradient descent method with the inverse of the Fisher information matrix, I(µ)^{-1}, as a coefficient matrix [9]. As we are dealing with exponential families, these matrices are always positive-definite. Moreover, if the gradient ∂ℓ̄(y, x, θ)/∂θ can be computed exactly (in Section 3.4 we discuss what happens when this is not possible), then, from Theorem 2, it is an unbiased estimator of the natural gradient of the objective L(θ(µ)) defined in Equation (13):

    E[∂ℓ̄(y, x, θ)/∂θ | π] = I(µ)^{-1} ∂L(θ(µ))/∂µ        (16)

However, one key difference in terms of convergence between online EM and sdEM can be seen in Equation (11): there, µ_{t+1} is a convex combination of µ_t and the expected sufficient statistics, so µ_{t+1} ∈ S during all the iterations. As will be clear in the next section, we do not have this same guarantee in sdEM, but we can take advantage of the log-prior term of Equation (13) to avoid this problem.
This term plays a dual role as both a "regularization" term and a log-barrier function [31], i.e. a continuous function whose value increases to infinity as the parameter approaches the boundary of the feasible region, or the support of p(θ(µ) | α). Hence, if the step sizes ρ_t are small enough (as happens near convergence), sdEM always stays in the feasible region S, due to the effect of the log-prior term. The only problem is that in the initial iterations the step sizes ρ_t are large, so one iteration can jump out of the boundary of S. The method to avoid this depends on the particular model, but for the models examined in this work a simple check in every iteration suffices. For example, as we will see in the experimental section, when implementing a multinomial naive Bayes we check at every iteration that each sufficient statistic or "word count" is always positive; if a "word count" becomes negative at some point, we set it to a very small value. As mentioned above, this does not hurt the convergence of sdEM because in the limit this problem disappears due to the effect of the log-prior term.

The last ingredient required to assess the convergence of a stochastic gradient descent method is to verify that the sequence of step sizes satisfies Σ ρ_t = ∞ and Σ ρ_t² < ∞. Then, if the sequence (µ_t)_{t≥1} converges, it converges to the global minimum (µ*, θ* = θ(µ*)) if L(θ) is convex, or to a local minimum if L(θ) is not convex [9].

Finally, we give an algorithmic description of sdEM in Algorithm 1. Following [11], we consider step sizes of the form ρ_t = (1 + λt)^{-1}, where λ is a positive scalar (our experiments suggest that trying a small set of λ values suffices for obtaining quick convergence; the prior p also needs to be suitably chosen). As mentioned above, the "Check-Step" is introduced to guarantee that µ_t is always in S. Like the online EM algorithm [30, 13], Algorithm 1 resembles the classic expectation maximization algorithm [15] since, as we will see in the next section, the gradient is computed using expected sufficient statistics. Assumption 3 guarantees that the maximization step can be performed efficiently. This step differentiates sdEM from classic stochastic gradient descent methods, where such a computation does not exist.

Table 1: sdEM updating equations for fully observed data (Section 3.3).

  Loss    sdEM equation
  NLL     µ_{t+1} = (1 − ρ_t(1 + ν/n)) µ_t + ρ_t (s(y_t, x_t) + ᾱ/n)
  NCLL    µ_{t+1} = (1 − ρ_t ν/n) µ_t + ρ_t (s(y_t, x_t) − E_y[s(y, x_t) | θ(µ_t)] + ᾱ/n)
  Hinge   µ_{t+1} = (1 − ρ_t ν/n) µ_t + ρ_t · ( ᾱ/n                               if ln [p(y_t, x_t | θ)/p(ȳ_t, x_t | θ)] > 1
                                                s(y_t, x_t) − s(ȳ_t, x_t) + ᾱ/n   otherwise )
          where ȳ_t = arg max_{y ≠ y_t} p(y, x_t | θ)

As we have seen so far, the derivation of sdEM is complete except for the definition of the loss function. We now discuss how two well-known discriminative loss functions can be used with this algorithm.
Negative Conditional Log-likelihood (NCLL)
As mentioned above, this loss function is defined as follows:

    ℓ_CLL(y_t, x_t, θ) = −ln p(y_t, x_t | θ) + ln ∫ p(y, x_t | θ) dy

and its gradient is computed as

    ∂ℓ_CLL(y_t, x_t, θ)/∂θ = −s(y_t, x_t) + E_y[s(y, x_t) | θ]

where the sufficient statistic s(y_t, x_t) comes from the gradient of the ln p(y_t, x_t | θ) term of the NCLL loss, and the expected sufficient statistic E_y[s(y, x_t) | θ] = ∫ s(y, x_t) p(y | x_t, θ) dy comes from the gradient of the ln ∫ p(y, x_t | θ) dy term. As mentioned above, the computation of the gradient is similar to the expectation step of the classic EM algorithm.

The iteration equation of sdEM for the NCLL loss is detailed in Table 1. We note that in the case of multi-class prediction problems the integrals of the updating equation are replaced by sums over the different classes of the class variable Y. We also show the updating equation for the negative log-likelihood (NLL) loss for comparison purposes.
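The following Java sketch — our own minimal illustration, with class names, data layout and prior settings chosen by us (a Dirichlet prior with ν = 0 and ᾱ_i = 1, as in the experiments of Section 4) — implements Algorithm 1 with the NCLL row of Table 1 for a fully observed naive Bayes model with one categorical feature. The expectation parameters are simply class and per-class feature counts, and the M-step θ(µ) amounts to normalization.

public final class SdemNcllNaiveBayes {
    final int K, V, n;            // number of classes, feature values, training examples
    final double[] classCount;    // expectation parameters: class "counts"
    final double[][] featCount;   // expectation parameters: per-class feature "counts"

    SdemNcllNaiveBayes(int K, int V, int n) {
        this.K = K; this.V = V; this.n = n;
        classCount = new double[K];
        featCount = new double[K][V];
        for (int k = 0; k < K; k++) {              // initialize from the prior (ᾱ_i = 1)
            classCount[k] = 1.0;
            java.util.Arrays.fill(featCount[k], 1.0);
        }
    }

    // θ(µ): the joint p(y, x | θ(µ)) obtained by normalizing the current counts.
    double joint(int y, int x) {
        double cSum = 0, fSum = 0;
        for (int k = 0; k < K; k++) cSum += classCount[k];
        for (int v = 0; v < V; v++) fSum += featCount[y][v];
        return (classCount[y] / cSum) * (featCount[y][x] / fSum);
    }

    // One sdEM iteration: E-step (NCLL row of Table 1), check step, and implicit M-step.
    void step(int y, int x, double rho) {
        double[] post = new double[K];
        double z = 0;
        for (int k = 0; k < K; k++) { post[k] = joint(k, x); z += post[k]; }
        for (int k = 0; k < K; k++) post[k] /= z;                   // p(y = k | x, θ_t)
        for (int k = 0; k < K; k++) {
            double ind = (k == y) ? 1.0 : 0.0;
            classCount[k] += rho * (ind - post[k] + 1.0 / n);       // s − E_y[s] + ᾱ/n
            classCount[k] = Math.max(classCount[k], 1e-6);          // check step: stay inside S
            for (int v = 0; v < V; v++) {
                double s = (v == x) ? ind : 0.0;
                double e = (v == x) ? post[k] : 0.0;
                featCount[k][v] += rho * (s - e + 1.0 / n);
                featCount[k][v] = Math.max(featCount[k][v], 1e-6);
            }
        }
    }

    public static void main(String[] args) {
        int[][] data = {{0, 0}, {0, 1}, {1, 2}, {1, 2}};            // toy (class, feature) pairs
        SdemNcllNaiveBayes m = new SdemNcllNaiveBayes(2, 3, data.length);
        double lambda = 0.1;
        int t = 0;
        for (int epoch = 0; epoch < 50; epoch++)
            for (int[] yx : data) m.step(yx[0], yx[1], 1.0 / (1.0 + lambda * t++));
        System.out.println("p(y=0, x=0) = " + m.joint(0, 0));
    }
}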
Table 2: sdEM updating equations for partially observed data (Section 3.4).

  Loss    sdEM equation
  NLL     µ_{t+1} = (1 − ρ_t(1 + ν/n)) µ_t + ρ_t (E_z[s(y_t, z, x_t) | θ(µ_t)] + ᾱ/n)
  NCLL    µ_{t+1} = (1 − ρ_t ν/n) µ_t + ρ_t (E_z[s(y_t, z, x_t) | θ(µ_t)] − E_{yz}[s(y, z, x_t) | θ(µ_t)] + ᾱ/n)
  Hinge   µ_{t+1} = (1 − ρ_t ν/n) µ_t + ρ_t · ( ᾱ/n                                                                if ln [∫ p(y_t, z, x_t | θ) dz / ∫ p(ȳ_t, z, x_t | θ) dz] > 1
                                                E_z[s(y_t, z, x_t) | θ(µ_t)] − E_z[s(ȳ_t, z, x_t) | θ(µ_t)] + ᾱ/n   otherwise )
          where ȳ_t = arg max_{y ≠ y_t} ∫ p(y, z, x_t | θ) dz
The Hinge loss

Unlike the previous loss, which is valid for continuous and discrete (and vector-valued) predictions, this loss is only valid for binary or multi-class classification problems. Margin-based loss functions have been extensively used and studied by the machine learning community for binary and multi-class classification problems [5]. However, in our view, the application of margin-based losses (other than the negative conditional log-likelihood) to the discriminative training of probabilistic generative models is scarce and based on ad-hoc learning methods which, in general, are quite sophisticated [26]. In this section, we discuss how sdEM can be used to minimize the empirical risk of one of the most widely used margin-based losses, the Hinge loss, in binary and multi-class classification problems. First, however, we discuss how the Hinge loss can be defined for probabilistic generative models.

We build on LeCun et al.'s ideas [23] about energy-based learning for prediction problems. LeCun et al. [23] define the Hinge loss for energy-based models as

    max(0, 1 − (E(ȳ_t, x_t, w) − E(y_t, x_t, w)))

where E(·) is the energy function parameterized by a parameter vector w, E(y_t, x_t, w) is the energy associated with the correct answer y_t and E(ȳ_t, x_t, w) is the energy associated with the most offending incorrect answer, ȳ_t = arg min_{y ≠ y_t} E(y, x_t, w). Predictions y* are made using y* = arg min_y E(y, x_t, w*) once the parameter w* that minimizes the empirical risk is found. In our learning setting we consider the minus logarithm of the joint probability, −ln p(y_t, x_t | θ), as an energy function. In consequence, we define the Hinge loss as

    ℓ_hinge(y_t, x_t, θ) = max(0, 1 − ln [p(y_t, x_t | θ)/p(ȳ_t, x_t | θ)])        (17)

where ȳ_t denotes here too the most offending incorrect answer, ȳ_t = arg max_{y ≠ y_t} p(y, x_t | θ). The gradient of this loss function is simply

    ∂ℓ_hinge(y_t, x_t, θ)/∂θ = 0 if ln [p(y_t, x_t | θ)/p(ȳ_t, x_t | θ)] > 1, and −s(y_t, x_t) + s(ȳ_t, x_t) otherwise,

and the iteration equation for minimizing the empirical risk of the Hinge loss is also given in Table 1.

The generalization of sdEM to partially observable data is straightforward. We denote by Z the vector of non-observable variables. sdEM now handles statistical models which define a probability distribution over (y, z, x) belonging to the exponential family (Assumption 1); Assumptions 2 and 3 remain unaltered. The tuple (y, z, x) denotes the complete event or complete data, while the tuple (y, x) is the observed event or observed data, so we assume that our given data set D with n observations is expressed as {(y_1, x_1), ..., (y_n, x_n)}. sdEM's Equations (14) and (15) stay the same, with the only difference that the natural gradient is now defined using the inverse of the Fisher information matrix of the statistical model p(y, z, x | θ(µ)); the same applies to Theorem 2. The NCLL loss and the Hinge loss are defined as in Section 3.3, with the only difference that the computation of p(y_t, x_t | θ) and p(x_t | θ) requires marginalization over z: p(y_t, x_t | θ) = ∫ p(y_t, z, x_t | θ) dz and p(x_t | θ) = ∫ p(y, z, x_t | θ) dy dz. The updating equations of sdEM under partially observed data for the NCLL and Hinge losses are detailed in Table 2. New expected sufficient statistics need to be computed, E_z[s(y_t, z, x_t) | θ] = ∫ s(y_t, z, x_t) p(z | y_t, x_t, θ) dz and E_{yz}[s(y, z, x_t) | θ] = ∫ s(y, z, x_t) p(y, z | x_t, θ) dy dz. As previously, we also show the updating equation for the negative log-likelihood (NLL) loss for comparison purposes.

Figure 1: Toy example (Section 4.1). The result using the NLL loss (i.e. MLE estimation) is plotted with dashed lines, which represent the densities p(y = k) N(x; µ^(k), σ^(k)) for both classes (i.e. when the red line is higher than the blue line we predict the red class and vice versa). The estimated prediction accuracy of the MLE model is 78.6%. Solid lines represent the same estimation but using the NCLL and the Hinge loss; their estimated prediction accuracies are 90.4% and 90.6%, respectively. (a) NCLL loss; (b) Hinge loss.

For many interesting models [8], the expected sufficient statistics in the iteration equations shown in Tables 1 and 2 cannot be computed in closed form. This is not a problem as long as we can define unbiased estimators of these expected sufficient statistics, since the equality of Equation (16) still holds. As will be shown in the next section, we use sdEM to discriminatively train latent Dirichlet allocation (LDA) models [8]. Similarly to [28], for this purpose we employ collapsed Gibbs sampling to compute the expected sufficient statistics E_z[s(y_t, z, x_t) | θ], as it guarantees that at convergence samples are i.i.d. according to p(z | y_t, x_t, θ).

4 Experiments

We begin the experimental analysis of sdEM by learning a very simple Gaussian naive Bayes model composed of a binary class variable Y and a single continuous predictor X. Hence, the conditional density of the predictor given the class variable is assumed to be normally distributed. The interesting part of this toy example is that the training data is generated by a different model:
π(y = −1) = 0.5, π(x | y = −1) is a zero-mean Gaussian with a large variance, and π(x | y = 1) is a mixture of two narrow Gaussians centered at −5 and +5. Figure 1 shows the histogram of the 30,000 samples generated from the π distribution. The result is a mixture of 3 Gaussians: one in the center with a high variance, associated with y = −1, and two narrow Gaussians on both sides, associated with y = 1.

sdEM can be applied by considering 6 (non-minimal) sufficient statistics: N^(−1) and N^(1) as "counts" associated with both classes, respectively; S^(−1) and S^(1) as the "sum" of the x values associated with classes y = −1 and y = 1, respectively; and V^(−1) and V^(1) as the "sum of squares" of the x values for each class. We also have five parameters which are computed from the sufficient statistics as follows: two for the class prior, p(y = −1) = p(−1) = N^(−1)/(N^(−1) + N^(1)) and p(1) = N^(1)/(N^(−1) + N^(1)); and four for the two Gaussians which define the conditional of X given Y, µ^(−1) = S^(−1)/N^(−1), σ^(−1) = sqrt(V^(−1)/N^(−1) − (S^(−1)/N^(−1))²), and equally for µ^(1) and σ^(1).

The sdEM updating equations for the NCLL loss can be written as follows:

    N^(k)_{t+1} = N^(k)_t + ρ_t (I[y_t = k] − p_t(k | x_t)) + ρ_t/n
    S^(k)_{t+1} = (1 − ρ_t/n) S^(k)_t + ρ_t x_t (I[y_t = k] − p_t(k | x_t))
    V^(k)_{t+1} = (1 − ρ_t/n) V^(k)_t + ρ_t x_t² (I[y_t = k] − p_t(k | x_t)) + ρ_t/n

where k indexes both classes, k ∈ {−1, 1}, I[·] denotes the indicator function, p_t(k | x_t) is an abbreviation of p(y = k | x_t, θ_t), and θ_t is the parameter vector computed from the sufficient statistics at the t-th iteration.
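For concreteness, a minimal Java sketch of these updating equations follows (our own illustration; the variable names and the small initialization values are ours, and the positivity corrections anticipate the Check-Step given with the Hinge updates below). Classes y = −1 and y = +1 are stored at indices 0 and 1.

public final class SdemToyGaussianNB {
    // Non-minimal sufficient statistics: counts N, sums S and sums of squares V per class.
    final double[] N = {1.0, 1.0};
    final double[] S = {-1.0, 1.0};
    final double[] V = {2.0, 2.0};

    double prior(int k) { return N[k] / (N[0] + N[1]); }
    double mean(int k)  { return S[k] / N[k]; }
    double var(int k)   { return V[k] / N[k] - mean(k) * mean(k); }

    double gauss(double x, double m, double v) {
        return Math.exp(-0.5 * (x - m) * (x - m) / v) / Math.sqrt(2.0 * Math.PI * v);
    }

    // p(y = k | x, θ_t) computed from the current sufficient statistics (the M-step is implicit).
    double[] posterior(double x) {
        double p0 = prior(0) * gauss(x, mean(0), var(0));
        double p1 = prior(1) * gauss(x, mean(1), var(1));
        return new double[] {p0 / (p0 + p1), p1 / (p0 + p1)};
    }

    // One NCLL update; yIdx is 0 for y = −1 and 1 for y = +1, n is the training set size.
    void ncllStep(int yIdx, double x, double rho, int n) {
        double[] post = posterior(x);
        for (int k = 0; k < 2; k++) {
            double diff = ((k == yIdx) ? 1.0 : 0.0) - post[k];
            N[k] = N[k] + rho * diff + rho / n;
            S[k] = (1.0 - rho / n) * S[k] + rho * x * diff;
            V[k] = (1.0 - rho / n) * V[k] + rho * x * x * diff + rho / n;
            // Check-Step: keep the counts positive and the implied variances strictly positive.
            N[k] = Math.max(N[k], rho / n);
            V[k] = Math.max(V[k], S[k] * S[k] / N[k] + rho / n);
        }
    }
}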
Figure 2: Convergence trade-off of the Hinge loss versus the NCLL loss and the perplexity for a multinomial naive Bayes model trained by minimizing the Hinge loss using sdEM. Circle-lines, triangle-lines and cross-lines correspond to the results with the 20NewsGroup, Cade and Reuters-R52 datasets, respectively.

Figure 3: Convergence of the classification accuracy for a multinomial naive Bayes model trained by minimizing the NCLL loss (NCLL-MNB) and the Hinge loss (Hinge-MNB) using sdEM. Red circle-lines, red triangle-lines and red cross-lines correspond to the results of NCLL-MNB with the 20NewsGroup, Cade and Reuters-R52 datasets, respectively; the same applies to Hinge-MNB. The three blue and the three red solid lines detail the accuracy of logistic regression and SVM, respectively. The three dashed black lines detail the accuracy of plain MNB with a Laplace prior.
Similarly, the sdEM updating equations for the Hinge loss can be written as follows:

    N^(k)_{t+1} = N^(k)_t + k y_t ρ_t I[ln (p_t(y_t | x_t)/p_t(ȳ_t | x_t)) < 1] + ρ_t/n
    S^(k)_{t+1} = (1 − ρ_t/n) S^(k)_t + k y_t ρ_t x_t I[ln (p_t(y_t | x_t)/p_t(ȳ_t | x_t)) < 1]
    V^(k)_{t+1} = (1 − ρ_t/n) V^(k)_t + k y_t ρ_t x_t² I[ln (p_t(y_t | x_t)/p_t(ȳ_t | x_t)) < 1] + ρ_t/n

where the product k y_t is introduced in the updating equations to define the sign of the sum, and the indicator function I[·] determines when the Hinge loss is null.

In the above set of equations we have considered as conjugate prior for the Gaussians a three-parameter Normal-Gamma prior with ν = 1, ᾱ = 0 for S^(k) and ᾱ = 1 for V^(k) [6, page 268], and a Beta prior with ν = 0 and ᾱ = 1 for N^(k). We note that these priors assign zero probability to the "extreme" parameters p(k) = 0 (i.e. N^(k) = 0) and σ^(k) = 0 (i.e. V^(k)/N^(k) − (S^(k)/N^(k))² = 0). Finally, the "Check-Step" (see Algorithm 1), performed before computing θ_{t+1} and which guarantees that all sufficient statistics are valid, is implemented as follows:

    N^(k)_{t+1} = max(N^(k)_{t+1}, ρ_t/n)
    V^(k)_{t+1} = max(V^(k)_{t+1}, (S^(k)_{t+1})²/N^(k)_{t+1} + ρ_t/n)

That is, when the N^(k) "counts" are negative or too small, or when the V^(k) values would lead to negative or null deviations σ^(k) ≤ 0, they are fixed with the help of the prior term.

The result of this experiment is given in Figure 1 and clearly shows the different trade-offs of both loss functions compared to maximum likelihood estimation. It is interesting to see how a generative model which does not match the underlying distribution is able to achieve a fairly high prediction accuracy when trained with a discriminative loss function (using the sdEM algorithm).

Next, we briefly show how sdEM can be used to discriminatively train some generative models used for text classification, namely a multinomial naive Bayes and a similar classifier based on latent Dirichlet allocation models [8]. Supplementary material with full details of these experiments and the Java code used in this evaluation can be downloaded at: http://sourceforge.net/projects/sdem/
Multinomial Naive Bayes (MNB)
MNB assumes that the words in documents with the same class or label are distributed according to an independent multinomial distribution. sdEM can be easily applied to train this model.

Figure 4: Convergence of the classification accuracy of LDA classification models trained by sdEM using different loss functions (NLL, NCLL and Hinge) over 10 different random initializations, on the web-kb and Reuters-R8 data sets. The two dashed lines and the single solid line detail the maximum, minimum and mean accuracy of sLDA, respectively, over 10 random initializations.
The sufficient statistics are the "prior class counts" and the "word counts" for each class. The updating equations and the check step are the same as those of N^(k)_t in the previous toy example. The parameters of the MNB are computed simply through normalization operations. Two different conjugate Dirichlet distributions were considered: a "Laplace prior", where ᾱ_i = 1, and a "Log prior", where ᾱ_i equals the logarithm of the number of words in the corpus. We only report the analysis for the "Laplace prior" in the case of the NCLL loss and for the "Log prior" in the case of the Hinge loss; other combinations show similar results, although NCLL was more sensitive to the chosen prior.

We evaluate the application of sdEM to MNB on three well-known multi-class text classification problems: 20Newsgroup (20 classes), Cade (12 classes) and Reuters21578-R52 (52 classes). The data sets are stemmed. Full details about the data sets and the train/test splits used in this evaluation can be found in [14]. Figures 2 and 3 show the convergence behavior and the classification accuracy of sdEM for these models.
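Before moving to LDA, the following Java fragment sketches the Hinge-loss counterpart of the MNB update (the Hinge row of Table 1 specialized to word counts). It is our own illustration: the data layout, the 1e-6 floor used as check step and the treatment of the class prior (a Laplace-style pseudo-count added each step) are assumptions, and the paper's exact pseudo-code is Algorithm 4 of the supplementary material.

public final class HingeMnbStep {
    // classCount[k]: prior class counts; wordCount[k][w]: word counts of class k (expectation parameters).
    static double logJoint(double[] classCount, double[][] wordCount, int k, int[] doc) {
        double cSum = 0, wSum = 0;
        for (double c : classCount) cSum += c;
        for (double w : wordCount[k]) wSum += w;
        double lp = Math.log(classCount[k] / cSum);
        for (int w : doc) lp += Math.log(wordCount[k][w] / wSum);   // doc lists word indices, with repetitions
        return lp;
    }

    // One Hinge-loss sdEM step for the document doc with true class y.
    static void step(double[] classCount, double[][] wordCount,
                     int y, int[] doc, double rho, int n, double alpha) {
        int K = classCount.length;
        int offending = -1;
        double offendingLp = Double.NEGATIVE_INFINITY;
        for (int k = 0; k < K; k++) {
            if (k == y) continue;
            double lp = logJoint(classCount, wordCount, k, doc);
            if (lp > offendingLp) { offendingLp = lp; offending = k; }   // ȳ_t: most offending incorrect class
        }
        double margin = logJoint(classCount, wordCount, y, doc) - offendingLp;
        // Prior term ᾱ/n, added in every iteration (ν = 0 for the Dirichlet prior assumed here).
        for (int k = 0; k < K; k++) {
            classCount[k] += rho / n;
            for (int w = 0; w < wordCount[k].length; w++) wordCount[k][w] += rho * alpha / n;
        }
        if (margin <= 1.0) {                        // hinge active: move counts towards y and away from ȳ_t
            for (int w : doc) {
                wordCount[y][w] += rho;
                wordCount[offending][w] = Math.max(wordCount[offending][w] - rho, 1e-6);  // check step
            }
            classCount[y] += rho;
            classCount[offending] = Math.max(classCount[offending] - rho, 1e-6);
        }
    }
}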
Latent Dirichlet Allocation (LDA)

We briefly show the results of sdEM when discriminatively training LDA models. We define a classification model equal to MNB, but where the documents of the same class are now modeled using an independent LDA model. We implement this model by using, apart from the "prior class counts", the standard sufficient statistics of the LDA model, i.e. the "words per hidden topic counts", associated with each class label. Similarly to [28], we used an online collapsed Gibbs sampling method to obtain, at convergence, unbiased estimates of the expected sufficient statistics (see Table 2). This evaluation was carried out using the standard train/test split of the Reuters21578-R8 (8 classes) and web-kb (4 classes) data sets [14], with the same preprocessing as in the MNB experiments. Figure 4 shows the results of this comparison using 2-topic LDA models trained with the NCLL loss (NCLL-LDA), the Hinge loss (Hinge-LDA), and also the NLL loss (NLL-LDA), following the updating equations of Table 2. We compared these results with those returned by supervised LDA (sLDA) [7] using the same prior, but this time with 50 topics, because fewer topics produced worse results. We see again how a simple generative model trained with sdEM outperforms much more sophisticated models.
5 Conclusions

We introduce a new learning algorithm for discriminative training of generative models. This method is based on a novel view of the online EM algorithm as a stochastic natural gradient descent algorithm for minimizing general discriminative loss functions. It allows the training of a wide set of generative models, with or without latent variables, because the resulting models are always generative. Moreover, sdEM is comparatively simpler and easier to implement (and debug) than other ad-hoc approaches.
Acknowledgments
This work has been partially funded by the European Union's Seventh Framework Programme for research, technological development and demonstration under grant agreement no. 619209 (AMIDST project).

References

[1] Sungjin Ahn, Anoop Korattikara Balan, and Max Welling. Bayesian posterior sampling via stochastic gradient Fisher scoring. In ICML, 2012.
[2] Shun-ichi Amari. Differential-Geometrical Methods in Statistics. Springer, 1985.
[3] Shun-ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
[4] Shun-ichi Amari and Hiroshi Nagaoka. Methods of Information Geometry, volume 191. American Mathematical Society, 2007.
[5] Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
[6] José M. Bernardo and Adrian F. M. Smith. Bayesian Theory, volume 405. John Wiley & Sons, 2009.
[7] David M. Blei and Jon D. McAuliffe. Supervised topic models. In NIPS, volume 7, pages 121–128, 2007.
[8] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[9] L. Bottou. Online learning and stochastic approximations. On-line Learning in Neural Networks, 17(9), 1998.
[10] L. Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pages 177–186. Springer, 2010.
[11] L. Bottou. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade, pages 421–436. Springer, 2012.
[12] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In NIPS, volume 4, page 2, 2007.
[13] O. Cappé and E. Moulines. On-line expectation–maximization algorithm for latent data models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(3):593–613, 2009.
[14] Ana Cardoso-Cachopo. Improving Methods for Single-label Text Categorization. PhD thesis, 2007.
[15] Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1):1–38, 1977.
[16] Greiner et al. Structural extension to logistic regression: Discriminative parameter learning of belief net classifiers. Machine Learning, 59(3):297–322, 2005.
[17] R. E. Fan, K. W. Chang, C. J. Hsieh, X. R. Wang, and C. J. Lin. LIBLINEAR: A library for large linear classification. JMLR, 9:1871–1874, 2008.
[18] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1):10–18, 2009.
[19] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
[20] Thorsten Joachims. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. Springer, 1998.
[21] Harold Joseph Kushner and G. George Yin. Stochastic Approximation Algorithms and Applications. Springer, New York, 1997.
[22] S. Lacoste-Julien, F. Sha, and M. I. Jordan. DiscLDA: Discriminative learning for dimensionality reduction and classification. In NIPS, volume 83, page 85, 2008.
[23] Yann LeCun, Sumit Chopra, Raia Hadsell, M. Ranzato, and F. Huang. A tutorial on energy-based learning. Predicting Structured Data, 2006.
[24] Radford M. Neal and Geoffrey E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models, pages 355–368. Springer, 1998.
[25] Frank Nielsen and Vincent Garcia. Statistical exponential families: A digest with flash cards. arXiv preprint arXiv:0911.4863, 2009.
[26] F. Pernkopf, M. Wohlmayr, and S. Tschiatschek. Maximum margin Bayesian network classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(3):521–532, 2012.
[27] Garvesh Raskutti and Sayan Mukherjee. The information geometry of mirror descent. arXiv preprint arXiv:1310.7780, 2013.
[28] D. Rohde and O. Cappé. Online maximum-likelihood estimation for latent factor models. In Statistical Signal Processing Workshop (SSP), 2011 IEEE, pages 565–568, June 2011.
[29] David Ruppert. Efficient estimations from a slowly convergent Robbins–Monro process. Technical report, Cornell University Operations Research and Industrial Engineering, 1988.
[30] Masa-aki Sato. Convergence of on-line EM algorithm. In Proc. of the Int. Conf. on Neural Information Processing, volume 1, pages 476–481, 2000.
[31] S. J. Wright and J. Nocedal. Numerical Optimization, volume 2. Springer, New York, 1999.
[32] Jun Zhu, Amr Ahmed, and Eric P. Xing. MedLDA: maximum margin supervised topic models for regression and classification. In ICML, pages 1257–1264. ACM, 2009.
Appendices
This supplementary material aims to extend, detail and complement the experimental evaluation of sdEM given in the main paper. The structure of this document is as follows. Section A details the experimental evaluation of the multinomial naive Bayes classifier and introduces new experiments comparing with the stochastic gradient descent algorithm [10]. The experimental evaluation of sdEM applied to latent Dirichlet allocation models is detailed and extended in Section B. Section C points to the software repository from which all the code used in this experimental evaluation can be downloaded to reproduce these results.
A Multinomial Naive Bayes for text classification
Description of the algorithm
As commented in the main paper, a multinomial naive Bayes (MNB) classifier assumes that the words of the documents with the same class label are distributed according to an independent multinomial probability distribution. In this section we evaluate the use of sdEM to discriminatively train MNB models using the NCLL and the Hinge loss functions. In the first case, such a model is related to a logistic regression model, while in the second case we obtain a model directly related to a linear support vector machine classifier [20].

The general updating equations for this problem can be found in Table 1 of the main paper, but a detailed pseudo-code description is now given in Algorithm 2 for the NCLL loss and in Algorithm 4 for the Hinge loss. In both cases, the sufficient statistics are the "prior class counts", stored in the matrix C, and the "word counts per class", stored in the matrix N. The matrix M is introduced to allow efficient computation of the posterior probability of the class variable given a document d, p(Y | d, N, M, C, γ). How this posterior is computed is detailed in Algorithm 3. In that way, the computational complexity of processing a label-document pair is linear in the number of words of the document. Finally, the function Normalize(·, ·) produces a multinomial probability by normalizing the vector of counts; the second argument contains the prior correction considered in this normalization, i.e. the value added to each single component of the count vector to avoid null probabilities, similarly to what is done in Algorithm 3.

In both algorithms, we consider a Dirichlet prior for the multinomial distributions. As detailed in the header of these algorithms, two different priors are considered: prior P1, with Dirichlet meta-parameters α_k = 1, and prior P2, with α_k = ln |W|, where |W| denotes the total number of different words in the corpus. In both cases, the prior assigns null probability to parameters lying on the "border" of the parameter space (i.e. when a null probability is assigned to some word).

As commented in the "toy example" of the main paper (Section 4.1), the parametrization that we chose for this Dirichlet prior makes the ν parameter, arising in the exponential family form of this prior (see Assumption 2 of the main paper), equal to zero, ν = 0. This can be seen when expressing the Dirichlet distribution in the following exponential form:

    Dir(θ_1, ..., θ_K; α_1, ..., α_K) = [Γ(Σ_k α_k) / Π_k Γ(α_k)] θ_1^{α_1 − 1} ··· θ_K^{α_K − 1}
                                      = exp( Σ_k (α_k − 1) ln θ_k − ( Σ_k ln Γ(α_k) − ln Γ(Σ_k α_k) ) )

The last two terms inside the exponent correspond to the (negative) log-partition function A_g(α) of the prior, while the first term corresponds to the dot product between the sufficient statistics (ln θ_1, ..., ln θ_K) and the natural parameters (α_1 − 1, ..., α_K − 1). As can be seen, the ν parameter can be omitted in this definition, i.e. ν = 0.
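Algorithm 3 itself is not reproduced here; the fragment below is our own Java sketch of how such a posterior p(Y | d, N, M, C, γ) can be computed in log-space from the count matrices, under the assumption that M[k] caches the total word mass Σ_w N[k][w] of class k and that γ is the prior correction added to every count at normalization time.

public final class MnbPosterior {
    // N[k][w]: word counts per class; M[k]: cached Σ_w N[k][w]; C[k]: prior class counts;
    // gamma: prior correction; doc: word indices of the document (with repetitions).
    static double[] posterior(double[][] N, double[] M, double[] C, double gamma,
                              int[] doc, int vocabSize) {
        int K = C.length;
        double cSum = 0;
        for (double c : C) cSum += c;
        double[] logp = new double[K];
        for (int k = 0; k < K; k++) {
            logp[k] = Math.log((C[k] + gamma) / (cSum + K * gamma));
            double mass = M[k] + vocabSize * gamma;
            for (int w : doc) logp[k] += Math.log((N[k][w] + gamma) / mass);
        }
        // Normalize with log-sum-exp for numerical stability.
        double max = Double.NEGATIVE_INFINITY;
        for (double lp : logp) max = Math.max(max, lp);
        double z = 0;
        for (double lp : logp) z += Math.exp(lp - max);
        double[] post = new double[K];
        for (int k = 0; k < K; k++) post[k] = Math.exp(logp[k] - max) / z;
        return post;
    }
}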
Experimental Evaluation
As detailed in the main paper, we evaluate the application of sdEM to MNB on three well-known multi-class text classification problems: 20Newsgroup, Cade and Reuters21578-R52. These data sets are stemmed. Full details about the data sets and the train/test splits used in this evaluation can be found in [14], although Table 3 shows some of their main statistics.

Table 3: Data set statistics. |W| denotes the number of different words in the data set, |Y| the number of labels of the class variable, and D_train and D_test the number of documents in the training and the test set, respectively. These data sets can be downloaded, for example, from http://sourceforge.net/projects/sgmweka/.

  Name           |W|      |Y|   D_train   D_test
  ACL-IMDB       89527    2     47000     3000
  Amazon12       86914    2     267875    100556
  Cade           157483   12    27322     13661
  Reuters-R8     14575    8     2785      1396
  Reuters-R52    16145    52    5485      2189
  WebKb          7287     4     6532      2568
  20News-group   54580    20    11293     7528

Figure 5 (a), Figure 5 (b), Figure 6 (a) and Figure 6 (b) show the convergence behavior of sdEM when training the MNB by minimizing the NCLL loss (NCLL-MNB) and by minimizing the Hinge loss (Hinge-MNB), respectively (other λ values yield similar results and stable convergence, although at a lower pace). In both cases, we plot the evolution of the NCLL loss, the Hinge loss and the normalized perplexity (i.e. the perplexity measure [8] divided by the number of training documents) of the trained model at each epoch. We can see that there is a trade-off between the different losses. For example, Hinge-MNB decreases the Hinge loss (as expected) but tends to increase the NCLL loss, while it only decreases perplexity at the very beginning; this last trend is much stronger when considering the P1 prior. A similar behavior can be observed for NCLL-MNB, with the main difference that the NCLL loss is an upper bound of the Hinge loss, so when NCLL-MNB minimizes the NCLL loss it also minimizes the Hinge loss. Here it can also be observed that the perplexity remains quite stable, especially for P1.

Figure 5 (c) and Figure 6 (c) display the evolution of the classification accuracy for the above models. We compare it to the standard MNB with a "Laplace prior" (a "Log prior" was also evaluated but gave much worse results) and to L2-regularized logistic regression and primal L2-regularized SVM as implemented in the Liblinear toolkit v1.8 [17]. For the NCLL loss, the models seem to be more dependent on the chosen prior, especially for the Cade dataset. In any case, we can see that sdEM is able to train simple MNB models with a performance very close to that provided by highly optimized algorithms.

A new set of experiments is included in this analysis comparing the MNB models learnt with sdEM with the classic stochastic gradient descent (SGD) algorithm. This evaluation is made using the Amazon12 and ACL-IMDB data sets (whose main details can be found in Table 3). We chose these data sets because they are binary classification problems, which are very well defined problems for logistic regression and linear SVM models. How SGD is used to train these models can be seen in [11]. In this evaluation we simply plot the evolution of the classification accuracy of the SGD algorithm when training a linear classifier using the NCLL loss (NCLL-SGD) and the Hinge loss (Hinge-SGD) with L2 regularization, for different learning rates or decreasing step-size sequences ρ_t. SGD is implemented as detailed in [11], where the weight of the regularization term is fixed to 1e-4 and, as recommended in [11], the learning rates ρ_t are computed as ρ_t = λ / (1 + λ · 10⁻⁴ · t). We also look at the evolution of the classification accuracy of NCLL-MNB and Hinge-MNB with different priors on these two data sets and using different learning rates. In both cases, the plotted learning rates ρ_t are selected by using different λ values; these results are shown in Figures 7 and 8. In each case, we consider the 5 consecutive λ values with the quickest convergence speed.

Figure 5: Convergence behavior of sdEM applied to a multinomial naive Bayes model (NCLL-MNB) with different priors. Circle-lines, triangle-lines and cross-lines correspond to the results with the 20NewsGroup, Cade and Reuters-R52 datasets, respectively. In the third figure, the three solid lines detail the accuracy of logistic regression for these three data sets; the three dashed lines detail the accuracy of plain MNB with P1 (MNB results with P2 are omitted because they are much worse). (a) Convergence of the loss functions with P1; (b) convergence of the loss functions with P2; (c) convergence of accuracy.

Figure 6: Convergence behavior of sdEM applied to a multinomial naive Bayes model (Hinge-MNB) with different priors. Circle-lines, triangle-lines and cross-lines correspond to the results with the 20NewsGroup, Cade and Reuters-R52 datasets, respectively. In the third figure, the three solid lines detail the accuracy of SVM for these three data sets; the three dashed lines detail the accuracy of plain MNB with P1 (MNB results with P2 are omitted because they are much worse). (a) Convergence of the loss functions with P1; (b) convergence of the loss functions with P2; (c) convergence of accuracy.

Figure 7: Convergence of the classification accuracy for NCLL-SGD and NCLL-MNB for different priors and different learning rates on the Amazon12 and ACL-IMDB data sets. Solid lines detail the accuracy of the aforementioned Liblinear logistic regression and dashed lines the accuracy of plain MNB with the corresponding prior. The numbers in the bottom-right legends correspond to different λ values (see Section 3.2 of the main paper), which define how the sdEM learning rates ρ_t decrease over time. (a) NCLL-SGD; (b) NCLL-MNB with prior P1; (c) NCLL-MNB with prior P2.

Figure 8: Convergence of the classification accuracy for Hinge-SGD and Hinge-MNB with different learning rates and different priors on the Amazon12 and ACL-IMDB data sets. Solid lines detail the accuracy of the aforementioned Liblinear SVM classifier and dashed lines the accuracy of plain MNB with the corresponding prior. The numbers in the bottom-right legends correspond to different λ values (see Section 3.2 of the main paper), which define how the sdEM learning rates ρ_t decrease over time. (a) Hinge-SGD; (b) Hinge-MNB with prior P1; (c) Hinge-MNB with prior P2.

B Latent Dirichlet Allocation (LDA) for text classification
Description of the algorithm
As commented in the main paper, we depart from a classification model similar to MNB, but where the documents of the same class are now modeled using an independent LDA model instead of a multinomial distribution. The generative process of each label-document pair in the corpus is as follows [8]:

1. Choose a class label y ∼ p(y | θ_Y), a multinomial probability.
2. Choose N ∼ Poisson(ξ_y); the length of the document follows a Poisson distribution.
3. Choose φ_y ∼ Dir(α_y), a Dirichlet distribution with dimension |Z| (the meta-parameters are set to 1/|Z| in the experimental evaluation).
4. For each of the N words w_n:
   (a) Choose a topic z_n ∼ Multinomial(φ_y) with dimension |Z|.
   (b) Choose a word w_n ∼ p(w_n | z_n, β_y), a multinomial probability conditioned on the topic z_n.

In our case the unknown parameters are the β_y for each class label, which define the multinomial distribution of step 4(b), and the parameter θ_Y, which defines the class prior. We denote by d a document as a bag of words, d = {w_1, ..., w_N}, and by z_d a particular hidden topic assignment vector for the words in d. The sufficient statistics for this model then form a three-dimensional matrix indexed by k ∈ {1, ..., |Y|}, z ∈ {1, ..., |Z|} and w ∈ {1, ..., |W|}, where |W| denotes again the total number of different words in the corpus. The (k, z, w)-th component of this sufficient statistics matrix is computed as

    s_{k,z,w}(y, z_d, d) = I[y = k] Σ_n I[z_n = z] I[w_n = w]

As previously commented in the main paper, these sufficient statistics correspond to the "words per hidden topic counts". By adding the "prior class counts", we complete the sufficient statistics that define this classification model.

As also commented in the main paper, similarly to [28] we used an online collapsed Gibbs sampling method to obtain, at convergence, unbiased estimates of the expected sufficient statistics (see Section 3.5 of the main paper). This collapsed Gibbs sampling method makes use of the analytical marginalization of the parameter φ_y and samples in turn each of the indicator variables z_1, ..., z_N. The probability of an indicator variable z_n conditioned on all the words of the document and all the other indicator variables can be computed as follows:

    p(z_n | y, {z_{n'}}_{n' ≠ n}, d) ∝ β_{y,z_n,w_n} · (S^{(−w_n)}_{z_n} + α)        (18)

where S^{(−w_n)}_z = Σ_{n' ≠ n} I[z_{n'} = z] and β_{y,z_n,w_n} is the component of the β parameter vector which defines the probability that the n-th word in document d is equal to w_n given that its hidden topic is z_n and the class label of the document is y, p(w_n | z_n, β_y).

The above equation defines a Markov chain which, when run, generates unbiased samples from its stationary distribution p(z_n | d, y) (after discarding the first burn-in samples). We can then compute the expected sufficient statistics required to apply the sdEM algorithm over these models. Let us note that, under our online setting, the β parameter of Equation (18) is fixed to the values β_{t−1} estimated in the previous step, and this online collapsed Gibbs sampler only requires the simulation to be conditioned on the latent variables of the current observed document (i.e. it does not involve the hidden topics of the other documents in the corpus, as happens with its batch counterpart).

In Algorithm 5 and Algorithm 7, we give a pseudo-code description of the sdEM algorithm when applied to this LDA classification model using the NCLL and the Hinge loss functions, respectively. As can be seen, these algorithms do not directly follow the standard LDA implementation, because we employ the same simplification used in the implementation of the sLDA algorithm [7] for multi-class prediction (code available at ∼chongw/slda/). This simplification assumes that all the occurrences of the same word in a document share the same hidden topic. The first effect of this assumption is that the number of hidden variables is reduced and the algorithm is much quicker; whether this simplifying assumption has a positive or negative effect on the classification performance of the models is not evaluated here.

Note also, from the pseudo-code of these two algorithms, that Hinge-LDA will tend to be computationally more efficient than NCLL-LDA, because Hinge-LDA does not update any parameter when it classifies a document with a margin higher than 1, whereas NCLL-LDA always updates all the parameters. When dealing with a large number of classes, this may imply a great difference in computational performance, but this is not evaluated in this first experimental study.

We also use a heuristic method to initialize the hidden topic variables z_n of an incoming document, which consists in sampling the hidden topics according to Equation (18), where S^{(−w_n)}_{z_n} is computed on-the-fly, i.e. it is a vector of zeros for the first word and is then updated according to the sampled topics. It is similar to running collapsed Gibbs sampling for one iteration (this heuristic is proposed in http://shuyo.wordpress.com/2011/06/27/collapsed-gibbs-sampling-estimation-for-latent-dirichlet-allocation-3/). We emphasize again that these algorithms are based on the updating equations given in Table 2 of the main paper.
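As an illustration of this per-document sampler (our own sketch, not the paper's code; method and variable names are ours), the following Java fragment estimates the expected topic-word sufficient statistics E_z[s(y, z, x_t) | θ] for a single document by averaging topic assignments over post-burn-in Gibbs sweeps of Equation (18), with β fixed to the values from the previous sdEM step.

import java.util.Random;

public final class OnlineCollapsedGibbsLda {
    // Estimate E_z[ s_{y,z,w} ] for one document of class y, given beta[z][w] (fixed from the
    // previous sdEM step) and the Dirichlet meta-parameter alpha, as in Equation (18).
    static double[][] expectedTopicWordStats(int[] doc, double[][] beta, double alpha,
                                             int burnIn, int samples, Random rnd) {
        int Z = beta.length, n = doc.length;
        int[] z = new int[n];
        int[] topicCount = new int[Z];
        for (int i = 0; i < n; i++) { z[i] = rnd.nextInt(Z); topicCount[z[i]]++; }  // random init
        double[][] expected = new double[Z][beta[0].length];
        for (int sweep = 0; sweep < burnIn + samples; sweep++) {
            for (int i = 0; i < n; i++) {
                topicCount[z[i]]--;                             // remove word i from the counts
                double[] p = new double[Z];
                double norm = 0;
                for (int k = 0; k < Z; k++) {                   // Equation (18)
                    p[k] = beta[k][doc[i]] * (topicCount[k] + alpha);
                    norm += p[k];
                }
                double u = rnd.nextDouble() * norm, acc = 0;
                int newZ = Z - 1;
                for (int k = 0; k < Z; k++) { acc += p[k]; if (u <= acc) { newZ = k; break; } }
                z[i] = newZ;
                topicCount[newZ]++;
            }
            if (sweep >= burnIn)                                // accumulate post burn-in sweeps
                for (int i = 0; i < n; i++) expected[z[i]][doc[i]] += 1.0 / samples;
        }
        return expected;   // estimate of the "words per hidden topic counts" for this document
    }
}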
Experimental Evaluation
As previously mentioned in the paper, this evaluation was carried out using the standard train/test split of the Reuters21578-R8 and Web-KB data sets [14], with the same preprocessing as in the MNB experiments. Table 3 gives some statistics about these data sets. We used sdEM to train 2-topic LDA classifiers by minimizing the NCLL loss (NCLL-LDA), by minimizing the Hinge loss (Hinge-LDA), and also by minimizing the negative log-likelihood loss (NLL-LDA), following the updating equations of Table 2 in the main paper. We recall that in Figure 3 of the main paper we compare the classification accuracy of these models with the results obtained by supervised LDA (sLDA) [7] using the same prior, but with 50 topics, because with fewer topics it produced worse results.

In Figure 9 we plot the convergence behavior of the above models during the training phase. The aim is to highlight that there is again a similar trade-off between the different losses when we train this model by minimizing a discriminative loss function such as the NCLL or the Hinge loss versus when we train the same model by minimizing a "generative loss" such as the negative log-likelihood (NLL). Looking at these figures, we can see that neither NCLL-LDA nor Hinge-LDA decreases the perplexity loss, in contrast to NLL-LDA. We can also see that NLL-LDA does decrease both the NCLL and the Hinge loss, but not as successfully as NCLL-LDA or Hinge-LDA.
C sdEM Java Code
All the code used to build the experiments presented in this supplemental material and in the main paper can be downloaded from the following code repository: https://sourceforge.net/projects/sdem/ (in the "Files" tab). This code is written in Java and mostly builds on Weka [18] data structures.

Figure 9: Convergence behavior of sdEM when applied to the LDA classification model. The left-hand figures consider the NCLL-LDA model, i.e. sdEM minimizing the NCLL loss function; the series NCLL-disc and Perplexity-disc display the evolution of these two losses for the NCLL-LDA model. The right-hand figures consider the Hinge-LDA model, i.e. sdEM minimizing the Hinge loss function; the series Hinge-disc and Perplexity-disc display the evolution of these two losses for the Hinge-LDA model. The series NCLL-gen, Hinge-gen and Perplexity-gen show the evolution of the NCLL, Hinge and perplexity losses, respectively, for the NLL-LDA model, i.e. sdEM minimizing the negative log-likelihood (NLL) loss function.
[Figure 9, Reuters-R8 data set: panels plotting NCLL/Perplexity and Hinge/Perplexity vs. Epoch; series NCLL-disc, Perplexity-disc, NCLL-gen, Perplexity-gen and Hinge-disc, Perplexity-disc, Hinge-gen, Perplexity-gen]
[Figure 9, Web-KB data set: panels plotting NCLL/Perplexity and Hinge/Perplexity vs. Epoch; series NCLL-disc, Perplexity-disc, NCLL-gen, Perplexity-gen and Hinge-disc, Perplexity-disc, Hinge-gen, Perplexity-gen. Panel columns: (a) NCLL-LDA, (b) Hinge-LDA]

Algorithm 2 sdEM for Multinomial Naive Bayes with the NCLL loss. |d| denotes the number of different words in the current document d and |w|_d denotes the number of times word w appears in document d. |W| denotes the total number of different words in the corpus.
Require: D is randomly shuffled.
Require: α value as prior count for each word. Two values are considered, α = 1 and α = ln |W|.
  ∀k, w: N[k][w] = α; C[k] = 1.0; M[k] = α · |W|; t = 0; γ = 0;
  repeat
    for each label-document pair (y, d) do
      t = t + 1; ρ = ρ_t  {learning rate at step t, decreasing as controlled by λ; see Section 3.2 of the main paper}; γ = γ + α · ρ_t;
      for each distinct word w in the document d do
        N[y][w] = N[y][w] + ρ · |w|_d · (1 − p(Y = y | d, N, M, C, γ));
        M[y] = M[y] + ρ · |w|_d · (1 − p(Y = y | d, N, M, C, γ));
        for k = 1, ..., |Y|: k ≠ y do
          oldVal = N[k][w];
          N[k][w] = N[k][w] − ρ · |w|_d · p(Y = k | d, N, M, C, γ);
          N[k][w] = max(N[k][w], 0);
          M[k] = M[k] + (N[k][w] − oldVal);
        end for
      end for
      C[y] = C[y] + ρ · (1 − p(Y = y | d, N, M, C, γ));
      for k = 1, ..., |Y|: k ≠ y do
        C[k] = C[k] − ρ · p(Y = k | d, N, M, C, γ);
        C[k] = max(C[k], 0);
      end for
    end for
  until convergence
  N̄ = Normalize(N, γ); C̄ = Normalize(C, γ);
  return N̄ and C̄;
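To connect the pseudocode with the released Java/Weka code, here is a compact, self-contained Java sketch that mirrors Algorithm 2's per-document update together with the prediction step of Algorithm 3 that it relies on. The class and method names, and the representation of a document as a map from word index to count, are illustrative assumptions rather than the actual sdEM implementation.

    import java.util.Map;

    // Sketch of Algorithms 2 and 3: parameters are the counts N[k][w], M[k], C[k] and gamma.
    public class MnbNcllSketch {

        // Algorithm 3: smoothed class posteriors p(Y = k | d, N, M, C, gamma).
        static double[] predict(Map<Integer, Integer> doc, double[][] N, double[] M,
                                double[] C, double gamma) {
            int numClasses = C.length;
            double[] logDC = new double[numClasses];
            for (int k = 0; k < numClasses; k++) {
                logDC[k] = Math.log(C[k] + gamma);
                int sumW = 0;
                for (Map.Entry<Integer, Integer> e : doc.entrySet()) {
                    logDC[k] += e.getValue() * Math.log(N[k][e.getKey()] + gamma);
                    sumW += e.getValue();                 // accumulate |w|_d over distinct words
                }
                logDC[k] -= sumW * Math.log(M[k] + gamma * doc.size());
            }
            return logs2Probs(logDC);
        }

        // Exponentiates the log scores and normalizes them (with the usual max-shift for stability).
        static double[] logs2Probs(double[] logs) {
            double max = Double.NEGATIVE_INFINITY, sum = 0.0;
            for (double l : logs) max = Math.max(max, l);
            double[] p = new double[logs.length];
            for (int k = 0; k < logs.length; k++) { p[k] = Math.exp(logs[k] - max); sum += p[k]; }
            for (int k = 0; k < logs.length; k++) p[k] /= sum;
            return p;
        }

        // Algorithm 2: sdEM/NCLL update of (N, M, C) for one document d with true label y.
        static void ncllUpdate(int y, Map<Integer, Integer> doc, double[][] N, double[] M,
                               double[] C, double gamma, double rho) {
            double[] post = predict(doc, N, M, C, gamma);
            for (Map.Entry<Integer, Integer> e : doc.entrySet()) {
                int w = e.getKey(), count = e.getValue();
                double incr = rho * count * (1.0 - post[y]);   // push up the true class
                N[y][w] += incr;
                M[y] += incr;
                for (int k = 0; k < C.length; k++) {           // push down the other classes, clipped at zero
                    if (k == y) continue;
                    double oldVal = N[k][w];
                    N[k][w] = Math.max(N[k][w] - rho * count * post[k], 0.0);
                    M[k] += N[k][w] - oldVal;
                }
            }
            C[y] += rho * (1.0 - post[y]);
            for (int k = 0; k < C.length; k++) {
                if (k != y) C[k] = Math.max(C[k] - rho * post[k], 0.0);
            }
        }
    }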
Algorithm 3 Compute predictions p(Y = k | d, N, M, C, γ) with Multinomial Naive Bayes. |d| denotes the number of different words in the current document d and |w|_d denotes the number of times word w appears in document d. The function "Logs2Probs" simply exponentiates the log values and then normalizes them.
Require: N, M, C, γ with non-negative values.
  ∀k: LogDC[k] = 0.0;
  for k = 1, ..., |Y| do
    LogDC[k] = ln(C[k] + γ);
    sumW = 0;
    for each distinct word w in the document d do
      LogDC[k] = LogDC[k] + |w|_d · ln(N[k][w] + γ);
      sumW = sumW + |w|_d;
    end for
    LogDC[k] = LogDC[k] − sumW · ln(M[k] + γ · |d|);
  end for
  return Logs2Probs(LogDC);

Algorithm 4 sdEM for Multinomial Naive Bayes with the Hinge loss. |d| denotes the number of different words in the current document d and |w|_d denotes the number of times word w appears in document d. |W| denotes the total number of different words in the corpus.
Require: D is randomly shuffled.
Require: α value as prior count for each word. Two values are considered, α = 1 and α = ln |W|.
  ∀k, w: N[k][w] = α; C[k] = 1.0; M[k] = α · |W|; t = 0; γ = 0;
  repeat
    for each label-document pair (y, d) do
      t = t + 1; ρ = ρ_t; γ = γ + α · ρ_t;
      ȳ = argmax_{y′ ≠ y} p(Y = y′ | d, N, M, C, γ);
      if (ln p(Y = y | d, N, M, C, γ) − ln p(Y = ȳ | d, N, M, C, γ)) > 1 then
        Go to the next document;
      end if
      for each distinct word w in the document d do
        N[y][w] = N[y][w] + ρ · |w|_d · (1 − p(Y = y | d, N, M, C, γ));
        M[y] = M[y] + ρ · |w|_d · (1 − p(Y = y | d, N, M, C, γ));
        oldVal = N[ȳ][w];
        N[ȳ][w] = N[ȳ][w] − ρ · |w|_d · p(Y = ȳ | d, N, M, C, γ);
        N[ȳ][w] = max(N[ȳ][w], 0);
        M[ȳ] = M[ȳ] + (N[ȳ][w] − oldVal);
      end for
      C[y] = C[y] + ρ · (1 − p(Y = y | d, N, M, C, γ));
      C[ȳ] = C[ȳ] − ρ · p(Y = ȳ | d, N, M, C, γ);
      C[ȳ] = max(C[ȳ], 0);
    end for
  until convergence
  N̄ = Normalize(N, γ); C̄ = Normalize(C, γ);
  return N̄ and C̄;

Algorithm 5 sdEM for the LDA-based classifier using the NCLL loss. |d| denotes the number of different words in the current document d, |w|_d denotes the number of times word w appears in document d and |Z| denotes the number of hidden topics in the LDA model.
Require: D is randomly shuffled.
Require: η defines the prior for the "word per topic" counts. In the experiments, it is fixed to η = 0. .
  ∀k, z, w: N[k][z][w] = η/|Z|; C[k] = 1.0; M[k][z] = |W| · η/|Z|; t = 0; γ = 0;
  repeat
    for each label-document pair (y, d) do
      t = t + 1; ρ = ρ_t; γ = γ + (η/|Z|) · ρ_t;
      OnlineLDA(d, N[y], M[y], ρ, γ, ϖ = (1 − p(Y = y | d, N, M, C, γ)));
      for k = 1, ..., |Y|: k ≠ y do
        OnlineLDA(d, N[k], M[k], ρ, γ, ϖ = −p(Y = k | d, N, M, C, γ));
      end for
      C[y] = C[y] + ρ · (1 − p(Y = y | d, N, M, C, γ));
      for k = 1, ..., |Y|: k ≠ y do
        C[k] = C[k] − ρ · p(Y = k | d, N, M, C, γ);
        C[k] = max(C[k], 0);
      end for
    end for
  until convergence
  N̄ = Normalize(N, γ); C̄ = Normalize(C, γ);
  return N̄ and C̄;
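The only structural difference between the NCLL variants above and the Hinge variants (Algorithm 4 above and Algorithm 7 below) is the margin test that decides whether any update happens and which competing class is penalized. A minimal Java sketch of that test is given here; the method name and the use of -1 as a "skip this document" marker are illustrative choices.

    // Sketch of the Hinge-loss margin test shared by Algorithms 4 and 7.
    public class HingeMarginSketch {

        // Given the class posteriors for the current document and its true label y,
        // returns the strongest competitor yBar = argmax_{y' != y} p(Y = y' | d),
        // or -1 when ln p(y|d) - ln p(yBar|d) > 1 and no parameter update is needed.
        static int competitorOrSkip(double[] post, int y) {
            int yBar = -1;
            for (int k = 0; k < post.length; k++) {
                if (k != y && (yBar < 0 || post[k] > post[yBar])) yBar = k;
            }
            if (Math.log(post[y]) - Math.log(post[yBar]) > 1.0) return -1;
            return yBar;
        }
    }

When the competitor is returned, Algorithms 4 and 7 update only the true class y and that single competitor ȳ, which is why the Hinge-based models touch far fewer parameters per document than their NCLL counterparts when the number of classes is large.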
Algorithm 6 OnlineLDA(d, N, M, ρ, γ, ϖ). The vector s corresponds to the expected sufficient statistics for d computed by online collapsed Gibbs sampling.
Require: d, N, M, ρ, γ, ϖ properly computed.
  s = OnlineCollapsedGibbsSampling(d, N, M, γ);
  for each distinct word w in the document d do
    for z = 1, ..., |Z| do
      oldVal = N[z][w];
      N[z][w] = N[z][w] + ρ · ϖ · s[z][w];
      N[z][w] = max(N[z][w], 0);
      M[z] = M[z] + (N[z][w] − oldVal);
    end for
  end for

Algorithm 7 sdEM for the LDA-based classifier using the Hinge loss. |d| denotes the number of different words in the current document d, |w|_d denotes the number of times word w appears in document d and |Z| denotes the number of hidden topics in the LDA model.
Require: D is randomly shuffled.
Require: η defines the prior for the "word per topic" counts. In the experiments, it is fixed to η = 0. .
  ∀k, z, w: N[k][z][w] = η/|Z|; C[k] = 1.0; M[k][z] = |W| · η/|Z|; t = 0; γ = 0;
  repeat
    for each label-document pair (y, d) do
      t = t + 1; ρ = ρ_t; γ = γ + (η/|Z|) · ρ_t;
      ȳ = argmax_{y′ ≠ y} p(Y = y′ | d, N, M, C, γ);
      if (ln p(Y = y | d, N, M, C, γ) − ln p(Y = ȳ | d, N, M, C, γ)) > 1 then
        Go to the next document;
      end if