Gaussian Process Pseudo-Likelihood Models for Sequence Labeling
P. K. Srijith, Department of Computer Science, University of Sheffield, United Kingdom ([email protected]); P. Balamurugan, SIERRA Project Team, INRIA-ENS, Paris, France ([email protected]); Shirish Shevade, Computer Science and Automation, Indian Institute of Science, Bangalore ([email protected])
Abstract.
Several machine learning problems arising in natural language processing can be modeled as a sequence labeling problem. Gaussian processes (GPs) provide a Bayesian approach to learning such problems in a kernel based framework. We develop Gaussian process models based on pseudo-likelihood to solve sequence labeling problems. The pseudo-likelihood model enables one to capture multiple dependencies among the output components of the sequence without becoming computationally intractable. We use an efficient variational Gaussian approximation method to perform inference in the proposed model. We also provide an iterative algorithm which can effectively make use of the information from the neighboring labels to perform prediction. The ability to capture multiple dependencies makes the proposed approach useful for a wide range of sequence labeling problems. Numerical experiments on some sequence labeling problems in natural language processing demonstrate the usefulness of the proposed approach.
Keywords:
Gaussian processes, sequence labeling, variational inference
1 Introduction

Sequence labeling is the task of classifying a sequence of inputs into a sequence of outputs. It arises commonly in natural language processing (NLP) tasks such as part-of-speech tagging, chunking and named entity recognition. For instance, in part-of-speech (POS) tagging, the input is a sentence and the output is a sequence of POS tags. The output consists of components whose labels depend on the labels of other components in the output. Sequence labeling takes into account these inter-dependencies among the various components of the output [17].

In recent years, sequence labeling has received considerable attention from the machine learning community and is often studied under the general framework of structured prediction. Many algorithms have been proposed to tackle sequence labeling problems. The hidden Markov model (HMM) [20], the conditional random field (CRF) [13] and the structural support vector machine (SSVM) [25] are popular algorithms for sequence labeling. SSVM allows learning an SVM for predicting a structured output, including sequences. It is based on a large margin framework and is not probabilistic in nature. HMM is a probabilistic directed graphical model based on the Markov assumption and has been widely used for problems in speech and language processing. CRF is also a probabilistic model, based on the Markov random field assumption. These parametric approaches can provide an estimate of the uncertainty in predictions due to their probabilistic nature. However, they do not follow a Bayesian approach, as they make a pointwise estimate of their parameters. This makes them less robust and heavily dependent on cross-validation for model selection. Bayesian CRF [19] overcomes this problem by providing a Bayesian treatment to the CRF. Approaches like SSVM and the maximum margin Markov network (M3N) make use of kernel functions, which overcome the limitations arising from the parametric nature of models such as the CRF. Kernel CRF [14] was proposed to overcome this limitation of the CRF, but it is also not a Bayesian approach.

Gaussian processes (GPs) [21] have emerged as a better alternative, offering a non-parametric, fully Bayesian approach to solve the sequence labeling problem. An initial work which studied Gaussian processes for sequence labeling is [1], where GPs were proposed as an alternative to overcome the limitations of the CRF; however, they used a maximum a posteriori (MAP) approach instead of a fully Bayesian approach. This caused problems of model selection and robustness. A more recent work, GPstruct [7], provides a Bayesian approach to the general structured prediction problem with GPs. It uses a Markov chain Monte Carlo (MCMC) method to obtain the posterior distribution, which slows down the inference. Their approach is based on the Markov random field assumption, which cannot capture long range dependencies among the labels. This difficulty is overcome in [8], which uses an approximate likelihood to reduce the computational complexity arising from the consideration of larger dependencies. In [8], the proposed model was used to solve grid structured problems in computer vision and was found to be effective in these problems.

In this work, we develop a Gaussian process approach based on pseudo-likelihood to solve sequence labeling problems (which we call GPSL). The GPSL model helps to capture multiple dependencies among the output components in a sequence without becoming computationally intractable.
We develop a variational inference method to obtain the posterior which is faster than MCMC based approaches and does not suffer from convergence problems. We also provide an efficient algorithm to perform prediction in the GPSL model which effectively takes into account the dependence on multiple output components. We consider various GPSL models which capture different numbers of dependencies. We study the usefulness of these models on various sequence labeling problems arising in natural language processing (NLP). The GPSL models which capture more dependencies are found to be useful for these sequence labeling problems. They are also useful on sequence labeling data sets where the labels might be missing for some output components, for example, when the labels are obtained using crowd-sourcing. The main contributions of the paper are as follows:
1. A faster training algorithm based on variational inference.
2. An efficient prediction algorithm which considers multiple dependencies.
3. Application to sequence labeling problems in NLP.
The rest of the paper is organized as follows. Gaussian processes are introduced in Section 2. Section 3 discusses the proposed approach, Gaussian process sequence labeling (GPSL), in detail. We provide details of the variational inference and prediction algorithms for the GPSL model in Section 4 and Section 5 respectively. In Section 6, we study the performance of various GPSL models on sequence labeling problems, and we draw conclusions in Section 7.
Notations:
We consider a sequence labeling problem over sequences from an input-output space pair (X, Y). The input sequence space X is assumed to be made up of L components, X = X_1 × X_2 × ... × X_L, and the associated output sequence space has L components, Y = Y_1 × Y_2 × ... × Y_L. We assume a one-to-one mapping between the input and output components. Each component of the output space is assumed to take a discrete value from the set {1, 2, ..., J}. Each component in the input space is assumed to belong to a P dimensional space R^P representing the features of that input component. Consider a collection of N training input-output examples D = {(x_n, y_n)}_{n=1}^N, where each example (x_n, y_n) is such that x_n ∈ X and y_n ∈ Y. Thus, x_n consists of L components (x_{n1}, x_{n2}, ..., x_{nL}) and y_n consists of L components (y_{n1}, y_{n2}, ..., y_{nL}). The training data D contains NL input-output components.
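As a concrete illustration of this notation, the following minimal sketch shows one way such a data set could be represented in Python; the sizes and values are hypothetical and not taken from the paper.

```python
import numpy as np

# Hypothetical sizes: J possible labels, P-dimensional feature vectors per component.
J, P = 3, 5
rng = np.random.default_rng(0)

def make_example(L):
    """One training example (x_n, y_n): L input components and their L labels."""
    x_n = rng.standard_normal((L, P))        # x_n[l] is the feature vector of component l
    y_n = rng.integers(1, J + 1, size=L)     # y_n[l] takes a value in {1, ..., J}
    return x_n, y_n

# A collection D of N examples; sequence lengths may differ across examples.
D = [make_example(L) for L in (4, 6, 5)]
```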
2 Gaussian Processes

A Gaussian process (GP) is a collection of random variables such that the joint distribution of any finite subset of them is Gaussian [21]. It generalizes the Gaussian distribution to infinitely many random variables and is used as a prior over a latent function. The GP is completely specified by a mean function and a covariance function. The covariance function is defined over the latent function values of a pair of inputs and is evaluated using a Mercer kernel function over the pair of inputs. The covariance function expresses some general properties of functions, such as their smoothness and length-scale. A commonly used covariance function is the squared exponential (SE) or Gaussian kernel

cov(f(x_{mi}), f(x_{nl})) = K(x_{mi}, x_{nl}) = \sigma_f^2 \exp(-\kappa \|x_{mi} - x_{nl}\|^2).   (1)

Here, f(x_{mi}) and f(x_{nl}) are the latent function values associated with the input components x_{mi} and x_{nl} respectively, and \theta = (\sigma_f, \kappa) denotes the hyper-parameters associated with the covariance function K.

Multi-class classification approaches are useful when the output consists of a single component taking values from a finite discrete set {1, 2, ..., J}. Gaussian process multi-class classification approaches [26,10,9] associate a latent function f_j with every label j ∈ {1, 2, ..., J}. Let the vector of latent function values associated with a particular label j over all the training examples be f_j. The latent function f_j is assigned an independent GP prior with zero mean and covariance function K_j with hyper-parameters \theta_j. Thus, f_j ∼ N(0, K_j), where K_j is a matrix obtained by evaluating the covariance function K_j over all pairs of training data input components.
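For illustration, a minimal NumPy sketch of the squared exponential covariance (1) and a draw from the corresponding zero-mean GP prior is given below; the hyper-parameter values and input sizes are placeholders.

```python
import numpy as np

def se_kernel(X1, X2, sigma_f=1.0, kappa=0.5):
    """Squared exponential covariance (1): sigma_f^2 * exp(-kappa * ||x - x'||^2)."""
    sq_dists = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return sigma_f**2 * np.exp(-kappa * sq_dists)

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 5))          # 10 input components, P = 5 features each
K = se_kernel(X, X)                       # covariance matrix of the latent values
f_sample = rng.multivariate_normal(np.zeros(10), K + 1e-8 * np.eye(10))  # GP prior draw
```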
In multi-class classification, the likelihood of a multi-class output y_{nl} for an input x_{nl}, given the latent functions, is defined as [21]

p(y_{nl} | f_1(x_{nl}), f_2(x_{nl}), ..., f_J(x_{nl})) = \frac{\exp(f_{y_{nl}}(x_{nl}))}{\sum_{j=1}^{J} \exp(f_j(x_{nl}))}.   (2)

The likelihood (2) is known as the multinomial logistic or softmax function and is widely used for GP multi-class classification problems [26,9]. It is important to note that the likelihood function (2) used for multi-class classification problems is not Gaussian. Hence, the posterior over the latent functions cannot be obtained in closed form. GP multi-class classification approaches work by approximating the posterior as a Gaussian using approximate inference techniques such as the Laplace approximation [26] and variational inference [10,9]. The Gaussian approximated posterior is then used to make predictions on the test data points. These approximations also yield an approximate marginal likelihood or a lower bound on the marginal likelihood, which can be used to perform model selection [21].

A sequence labeling problem can be treated as a multi-class classification problem. One can use multi-class classification to obtain a label for each component of the output independently, but this fails to take into account the inter-dependence among components. If one considers the entire output as a distinct class, then there is an exponential number of classes and the learning problem becomes intractable. Hence, the sequence labeling problem has to be studied separately from multi-class classification problems.

3 Gaussian Process Sequence Labeling

Most of the previous approaches [13,7] to sequence labeling use a likelihood based on the Markov random field assumption, which captures only the interaction between neighboring output components. Non-neighboring components also play a significant role in problems such as sequence labeling. In these models, capturing such interactions is computationally expensive due to the large clique size. The proposed approach, Gaussian process sequence labeling (GPSL), can take into account interactions among various output components without becoming computationally intractable by using a pseudo-likelihood (PL) model [4].

The PL model defines the likelihood of an output y_n given the input x_n as p(y_n | x_n) ∝ \prod_{l=1}^{L} p(y_{nl} | x_{nl}, y_n \setminus y_{nl}), where y_n \setminus y_{nl} represents all labels in y_n except y_{nl}. PL models have been successfully used to address many sequence labeling problems in natural language processing [24,23]. They can capture long range dependencies without becoming computationally intractable, as the normalization is done for each output component separately. In models such as the CRF, normalization is done over the entire output. This renders them incapable of capturing long range dependencies, as the number of summations in the normalization grows exponentially. The PL model is different from a locally normalized model like the maximum entropy Markov model (MEMM), as each output component depends on several other output components. Therefore, PL models do not suffer from the label bias problem [17], unlike MEMM.
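To make the per-component normalization concrete, here is a rough sketch of the log pseudo-likelihood of one sequence, assuming some conditional model cond_prob(y_l, x_l, rest) is available; cond_prob and all values below are hypothetical placeholders.

```python
import numpy as np

def log_pseudo_likelihood(x, y, cond_prob):
    """log of prod_l p(y_l | x_l, y \\ y_l): each component is normalized separately."""
    total = 0.0
    for l in range(len(y)):
        rest = np.delete(y, l)                  # all labels except y_l
        total += np.log(cond_prob(y[l], x[l], rest))
    return total

# Toy placeholder conditional: ignores the context and returns a fixed probability.
uniform_cond = lambda y_l, x_l, rest: 1.0 / 3.0
x = np.zeros((4, 5))
y = np.array([1, 2, 2, 3])
print(log_pseudo_likelihood(x, y, uniform_cond))
```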
However, PL models create cyclic dependencies among the output components [11], and this makes prediction hard. We discuss an efficient approach to perform prediction in this case in Section 5.

Fig. 1. Dependence of latent functions and input-output components in the Gaussian process sequence labeling model. (a) Dependence among input and output components; the dependence on various output components is modelled separately. (b) Dependence of local and dependent latent functions; the local latent functions are defined over input-output pairs and the dependent latent functions are defined between output components.

The label of an output component need not depend on the labels of all the other output components. The dependencies among the output components are captured through a set S. Consider the directed graph in Figure 1a for a sequence labeling problem, where each output component is assumed to depend only on the neighboring output components. Here, the dependency set S contains two relations, one denoting the dependence of an output component on the previous output component and the other denoting its dependence on the next output component. One can also consider a model where an output component depends on the previous two output components and the next two output components. Let R denote the number of dependency relations in the set S (that is, R is the cardinality of S); we assume it to be the same for all the output components for the sake of clarity in presentation. Taking into account those dependencies, we can redefine the likelihood as

p(y_n | x_n) ∝ \prod_{l=1}^{L} p(y_{nl} | x_{nl}, y^S_{nl}).   (3)

Here, y^S_{nl} denotes the set of labels {y^d_{nl}}_{d=1}^{R} of the output components referred to by the dependency set S, and y^d_{nl} denotes the label of the d-th dependent output component. In (3), instead of conditioning on the rest of the labels, we condition y_{nl} only on the labels defined by the dependency set S.

Now, the likelihood p(y_{nl} | x_{nl}, y^S_{nl}) can be defined using a set of latent functions. We use different latent functions to model different dependencies. The dependency of the label y_{nl} on x_{nl} is defined as a local dependency and is modeled as in GP multi-class classification. We associate a latent function with each label in the set {1, 2, ..., J}. The latent function associated with a label j, denoted f^U_j, is called a local latent function. It is defined over all the training input components x_{nl} for every n and l, and the latent function values associated with a particular label j over the NL training components are denoted by f^U_j. The local latent functions associated with a particular input component x_{nl} are denoted f^U_{nl} = {f^U_{1nl}, ..., f^U_{Jnl}}. We also associate a latent function f^S_d with each dependency relation d ∈ S and call these dependent latent functions. These latent functions are defined over all the values of a pair of labels (\hat{y}_{nl}, y_{nl}), where \hat{y}_{nl} ∈ {1, 2, ..., J} and y_{nl} ∈ {1, 2, ..., J}. The latent function values associated with a particular dependency d over the J^2 label pair values are denoted by f^S_d. The dependence of the various latent functions on the input and output components for the directed graph in Figure 1a is depicted in Figure 1b. Given these latent functions, we define the likelihood p(y_{nl} | x_{nl}, y^S_{nl}) to be a member of the exponential family:

p(y_{nl} | x_{nl}, y^S_{nl}, {f^U_j}_{j=1}^{J}, {f^S_d}_{d=1}^{R}) = \frac{\exp(f^U_{y_{nl}}(x_{nl}) + \sum_{d=1}^{R} f^S_d(y^d_{nl}, y_{nl}))}{\sum_{\hat{y}_{nl}=1}^{J} \exp(f^U_{\hat{y}_{nl}}(x_{nl}) + \sum_{d=1}^{R} f^S_d(y^d_{nl}, \hat{y}_{nl}))}.   (4)

This differs from the softmax likelihood (2) used in multi-class classification in that it captures the dependencies among output components.
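A rough sketch of the per-component likelihood (4), with hypothetical latent values and a single dependency relation:

```python
import numpy as np

def component_likelihood(y, f_local, f_dep, dep_labels):
    """p(y | x, y^S) as in (4).

    f_local    : (J,) local latent values f^U_j(x) for the component.
    f_dep      : list of (J, J) arrays, f^S_d over label pairs, one per dependency.
    dep_labels : labels y^d of the dependent components (1-based).
    """
    scores = f_local.copy()
    for f_d, y_d in zip(f_dep, dep_labels):
        scores += f_d[y_d - 1, :]          # add f^S_d(y^d, q) for every candidate label q
    scores -= scores.max()                 # numerical stability
    p = np.exp(scores)
    return p[y - 1] / p.sum()

J = 3
f_local = np.array([0.1, 0.7, -0.2])
f_dep = [np.zeros((J, J))]                 # one dependency (e.g. the previous label)
print(component_likelihood(y=2, f_local=f_local, f_dep=f_dep, dep_labels=[1]))
```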
Given the latent functions and the input X = {x_n}_{n=1}^{N}, the likelihood of the output Y = {y_n}_{n=1}^{N} is

p(Y | X, {f^U_j}_{j=1}^{J}, {f^S_d}_{d=1}^{R}) = \prod_{n=1}^{N} \prod_{l=1}^{L} p(y_{nl} | x_{nl}, y^S_{nl}, {f^U_j}_{j=1}^{J}, {f^S_d}_{d=1}^{R}).   (5)

We impose independent GP priors over the latent functions {f^U_j}_{j=1}^{J} and {f^S_d}_{d=1}^{R}. The latent function f^U_j is given a zero mean GP prior with covariance function K^U_j parameterized by \theta_j. Thus, f^U_j is Gaussian with mean 0 and covariance K^U_j of size NL × NL, that is, p(f^U_j) = N(f^U_j; 0, K^U_j). K^U_j consists of covariance function evaluations over all the pairs of training data input components {{x_{nl}}_{l=1}^{L}}_{n=1}^{N}. The latent function f^S_d is given a zero mean GP prior with an identity covariance, which is defined to be 1 when the inputs are the same and 0 otherwise. Thus, f^S_d is Gaussian with mean 0 and covariance I of size J^2, that is, p(f^S_d) = N(f^S_d; 0, I_{J^2}). Let f^U = (f^U_1, f^U_2, ..., f^U_J) be the collection of all local latent functions and f^S = (f^S_1, f^S_2, ..., f^S_R) be the collection of all dependent latent functions. Then the prior over f^U and f^S is defined as

p(f^U, f^S | X) = N\left( \begin{bmatrix} f^U \\ f^S \end{bmatrix}; 0, \begin{bmatrix} K^U & 0 \\ 0 & K^S \end{bmatrix} \right),   (6)

where K^U = diag(K^U_1, K^U_2, ..., K^U_J) is a block diagonal matrix and K^S = I_{J^2} ⊗ I_R. The posterior over the latent functions p(f^U, f^S | D) is

p(f^U, f^S | X, Y) = \frac{1}{p(Y | X)} p(Y | X, f^U, f^S) p(f^U, f^S | X),

where p(Y | X) = ∫ p(Y | X, f^U, f^S) p(f^U, f^S | X) df^U df^S is called the evidence. The evidence is a function of the hyper-parameters \theta = (\theta_1, \theta_2, ..., \theta_J) and is maximized to estimate them. For notational simplicity, we suppress the dependence of the evidence, posterior and prior on the hyper-parameters \theta. Due to the non-Gaussian nature of the likelihood, the evidence is intractable and the posterior cannot be determined exactly. We use a variational inference technique to obtain an approximate posterior. Variational inference is faster than the sampling based techniques used in [7] and does not suffer from convergence problems [16]. It can easily handle multi-class problems and is scalable to models with a large number of parameters. Further, it provides an approximation to the evidence which is useful in estimating the hyper-parameters of the model.
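As an illustration of the prior structure in (6), the following sketch assembles the block diagonal K^U and the identity K^S for toy sizes; the kernel hyper-parameters and dimensions are placeholders.

```python
import numpy as np
from scipy.linalg import block_diag

def se_kernel(X1, X2, sigma_f=1.0, kappa=0.5):
    d = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return sigma_f**2 * np.exp(-kappa * d)

J, R, NL, P = 3, 2, 10, 5                      # toy sizes
rng = np.random.default_rng(1)
X_flat = rng.standard_normal((NL, P))          # all NL training input components

# K^U: block diagonal over the J local latent functions, one NL x NL kernel matrix each.
K_U = block_diag(*[se_kernel(X_flat, X_flat) for _ in range(J)])

# K^S: identity prior over the R dependent latent functions, each defined on J^2 label pairs.
K_S = np.eye(R * J * J)
print(K_U.shape, K_S.shape)                    # (J*NL, J*NL), (R*J^2, R*J^2)
```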
4 Variational Inference

A variational inference technique [16] approximates the intractable posterior by an approximate variational distribution. It approximates the posterior p(f | X, Y) by a variational distribution q(f | γ), where f = (f^U, f^S) and γ represents the variational parameters. In variational inference, this is done by minimizing the Kullback-Leibler (KL) divergence between q(f | γ) and p(f | X, Y). This is often intractable, and the variational parameters are instead obtained by maximizing a variational lower bound L(θ, γ):

KL(q(f | γ) || p(f | X, Y)) = −L(θ, γ) + log p(Y | X),   (7)

where L(θ, γ) = −KL(q(f | γ) || p(f | X)) + ∫ q(f | γ) log p(Y | X, f) df. Maximizing the variational lower bound L(θ, γ) results in minimizing the KL divergence KL(q(f | γ) || p(f | X, Y)), since the evidence p(Y | X) does not depend on the variational parameters.

We use a variational Gaussian (VG) approximate inference approach [18], where the variational distribution is assumed to be a Gaussian. Variational Gaussian approaches can be slow because of the requirement to estimate the covariance matrix. Fortunately, recent advances in VG inference approaches [18] enable one to compute the covariance matrix using O(NL) variational parameters. In fact, we use the VG approach for GPs [12], which requires the computation of only O(NL) variational parameters, but at the same time uses a concave variational lower bound. We assume the variational distribution q(f | γ) takes the form of a Gaussian distribution and factorizes as q(f^U | γ^U) q(f^S | γ^S), where γ = {γ^U, γ^S}. Let q(f^U | γ^U) = N(f^U; m^U, V^U), where γ^U = {m^U, V^U}, and q(f^S | γ^S) = N(f^S; m^S, V^S), where γ^S = {m^S, V^S}. Then, the variational lower bound L(θ, γ) can be written as

L(θ, γ) = \frac{1}{2}\big(\log |V^U Ω^U| + \log |V^S Ω^S| − tr(V^U Ω^U) − tr(V^S Ω^S) − m^{U\top} Ω^U m^U − m^{S\top} Ω^S m^S\big) + \sum_{n=1}^{N} \sum_{l=1}^{L} E_{q(f^U | γ^U) q(f^S | γ^S)}[\log p(y_{nl} | x_{nl}, y^S_{nl}, f)],   (8)

where Ω^U = (K^U)^{−1}, Ω^S = (K^S)^{−1}, and E_{q(x)}[f(x)] = ∫ f(x) q(x) dx represents the expectation of f(x) with respect to the density q(x). Since K^U is block diagonal, its inverse is block diagonal, and hence Ω^U is block diagonal, that is, Ω^U = diag(Ω^U_1, Ω^U_2, ..., Ω^U_J), where Ω^U_j = (K^U_j)^{−1}. Similarly, Ω^S is also block diagonal, with each block being a diagonal matrix I_{J^2}. The marginal variational distribution of the local latent function values f^U_j is a Gaussian with mean m^U_j and covariance V^U_j, and that of the dependent latent function values f^S_d is a Gaussian with mean m^S_d and covariance V^S_d. The variational lower bound L(θ, γ) requires computing an expectation of the log likelihood with respect to the variational distribution. However, the integral is intractable since the likelihood is a softmax function. So, we use Jensen's inequality to obtain a tractable lower bound to the expectation of the log likelihood. The variational lower bound L(θ, γ) can then be written as

L(θ, γ) = \frac{1}{2}\Big(\sum_{j=1}^{J}\big(\log |V^U_j Ω^U_j| − tr(V^U_j Ω^U_j) − m^{U\top}_j Ω^U_j m^U_j\big) + \sum_{d=1}^{R}\big(\log |V^S_d Ω^S_d| − tr(V^S_d Ω^S_d) − m^{S\top}_d Ω^S_d m^S_d\big)\Big) + \sum_{n=1}^{N} \sum_{l=1}^{L} \Big( m^U_{y_{nl} nl} + \sum_{d=1}^{R} m^S_{d(y^d_{nl}, y_{nl})} − \log \sum_{q=1}^{J} \exp\big( m^U_{q nl} + \tfrac{1}{2} V^U_{q(nl,nl)} + \sum_{d=1}^{R} m^S_{d(y^d_{nl}, q)} + \tfrac{1}{2} V^S_{d((y^d_{nl},q),(y^d_{nl},q))} \big) \Big).   (9)

The variational parameters γ = {{m^U_j}_{j=1}^{J}, {V^U_j}_{j=1}^{J}, {m^S_d}_{d=1}^{R}, {V^S_d}_{d=1}^{R}} are estimated by maximizing the variational lower bound (9). The lower bound is jointly concave with respect to all the variational parameters [6], and the optimum can be easily found using gradient based optimization techniques.

The variational parameters are estimated using a co-ordinate ascent approach. We repeatedly estimate each variational parameter while keeping the others fixed. The variational mean parameters m^U_j and m^S_d are estimated using gradient based approaches. The variational covariance matrices V^U_j and V^S_d are estimated under the positive semi-definite (p.s.d.) constraint. This can be done efficiently using the fixed point approach mentioned in [12]. It is reported to converge faster than other VG approaches for GPs and is based on a concave objective function similar to (9). The approach maintains the p.s.d. constraint on the covariance matrix and computes V^U_j by estimating only O(NL) variational parameters. The estimation of V^U_j using the fixed point approach converges, since (9) is strictly concave with respect to V^U_j. The variational covariance matrix V^S_d is diagonal since Ω^S_d is diagonal. Hence, for computing a p.s.d. V^S_d, we need to estimate only the diagonal elements of V^S_d under an element-wise non-negativity constraint. This can be done easily using gradient based methods. The variational parameters γ are estimated for a particular set of hyper-parameters θ. The hyper-parameters θ are also estimated by maximizing the lower bound (9). The variational parameters γ and the model parameters θ are estimated alternately, following a variational expectation maximization (EM) approach [16]. Algorithm 1 summarizes the various steps involved in our approach.

Algorithm 1 Model selection and learning in Gaussian process sequence labeling model
1: Input: Training data (X, Y), dependency set S
2: Initialize hyper-parameters θ, variational parameters γ
3: repeat
4:   repeat
5:     for j = 1 to J do
6:       Update m^U_j by maximizing (9) w.r.t. m^U_j
7:       Update V^U_j by maximizing (9) w.r.t. V^U_j
8:     end for
9:     for d = 1 to R do
10:      Update m^S_d by maximizing (9) w.r.t. m^S_d
11:      Update V^S_d by maximizing (9) w.r.t. V^S_d
12:    end for
13:  until relative increase in lower bound (9) is small
14:  Update θ by maximizing (9) w.r.t. θ
15: until relative increase in lower bound (9) is small
16: Return: θ, γ
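For illustration, the per-component term inside the double sum of (9), which the updates in Algorithm 1 repeatedly evaluate and differentiate, can be computed as in the following sketch; the variational means and variances below are placeholders.

```python
import numpy as np
from scipy.special import logsumexp

def site_lower_bound(y, m_local, v_local, m_dep, v_dep, dep_labels):
    """Per-site term of (9): m^U_y + sum_d m^S_d(y^d, y) - log sum_q exp(...).

    m_local, v_local : (J,) variational mean and variance of the local latent values.
    m_dep, v_dep     : lists of (J, J) arrays; v_dep[d] holds the diagonal of V^S_d
                       arranged over label pairs.
    dep_labels       : observed labels y^d of the dependent components (1-based).
    """
    scores = m_local + 0.5 * v_local
    for m_d, v_d, y_d in zip(m_dep, v_dep, dep_labels):
        scores = scores + m_d[y_d - 1, :] + 0.5 * v_d[y_d - 1, :]
    linear = m_local[y - 1] + sum(m_d[y_d - 1, y - 1] for m_d, y_d in zip(m_dep, dep_labels))
    return linear - logsumexp(scores)

J = 3
m_local, v_local = np.array([0.2, 0.5, -0.1]), np.full(J, 0.3)
m_dep, v_dep = [np.zeros((J, J))], [np.full((J, J), 0.2)]
print(site_lower_bound(y=2, m_local=m_local, v_local=v_local,
                       m_dep=m_dep, v_dep=v_dep, dep_labels=[1]))
```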
The variational lower bound (9) is strictly concave with respect to each of the variational parameters. Hence, the estimation of the variational parameters using the co-ordinate ascent algorithm (inner loop) converges [3]. Convergence of EM for the exponential family guarantees the convergence of Algorithm 1. The overall computational complexity of Algorithm 1 is dominated by the computation of V^U_j. It takes O(JN^3L^3) time, as it requires the inversion of J covariance matrices of size NL × NL. The computational complexity of estimating V^S_d is O(RNLJ^2) and is negligible compared to the estimation of V^U_j. Note that the computational complexity of the algorithm increases linearly with respect to the number of dependencies R.

5 Prediction

We propose an iterative prediction algorithm which can effectively take into account the presence of multiple dependencies. The variational posterior distributions estimated using the VG approximation, q(f^U) = \prod_{j=1}^{J} q(f^U_j) = \prod_{j=1}^{J} N(f^U_j; m^U_j, V^U_j) and q(f^S) = \prod_{d=1}^{R} q(f^S_d) = \prod_{d=1}^{R} N(f^S_d; m^S_d, V^S_d), can be used to predict a test output sequence y_* given a test input sequence x_*. The predictive probability of assigning a label y_{*l} to a component of the output y_*, given x_{*l} and the rest of the labels y_* \setminus y_{*l}, is

p(y_{*l} | x_{*l}, y_* \setminus y_{*l}) = ∫ p(y_{*l} | x_{*l}, y_* \setminus y_{*l}, f_*) p(f_*) df_*
= ∫ \frac{\exp(f^U_{y_{*l} *l} + \sum_{d=1}^{R} f^S_{d*}(y^d_{*l}, y_{*l}))}{\sum_{\hat{y}_{*l}=1}^{J} \exp(f^U_{\hat{y}_{*l} *l} + \sum_{d=1}^{R} f^S_{d*}(y^d_{*l}, \hat{y}_{*l}))} \prod_{j=1}^{J} p(f^U_{j*l}) \prod_{d=1}^{R} p(f^S_{d*}) \prod_{j=1}^{J} df^U_{j*l} \prod_{d=1}^{R} df^S_{d*},   (10)

where p(f_*) denotes the predictive distribution of all the latent function values for the test input x_*. In (10), p(f^U_{j*l}) represents the predictive distribution of the local latent function j for a test input component x_{*l}. This is Gaussian with mean m^U_{j*l} and variance v^U_{j*l}, where

m^U_{j*l} = K^{U\top}_{j*l} Ω^U_j m^U_j and v^U_{j*l} = K^U_{j*l,*l} − K^{U\top}_{j*l} (Ω^U_j − Ω^U_j V^U_j Ω^U_j) K^U_{j*l}.

Here, K^U_{j*l} is an NL dimensional vector obtained from the kernel evaluations for the label j between the test input component x_{*l} and the training data X, and K^U_{j*l,*l} represents the kernel evaluation of the test data input component x_{*l} with itself.
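A minimal sketch of the predictive mean and variance computation for one local latent function at a test component, under the formulas above; the kernel, variational parameters and sizes are placeholders.

```python
import numpy as np

def se_kernel(X1, X2, sigma_f=1.0, kappa=0.5):
    d = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return sigma_f**2 * np.exp(-kappa * d)

rng = np.random.default_rng(2)
NL, P = 10, 5
X_train = rng.standard_normal((NL, P))                 # NL training input components
x_star = rng.standard_normal((1, P))                   # one test input component

K = se_kernel(X_train, X_train) + 1e-6 * np.eye(NL)    # K^U_j over training components
Omega = np.linalg.inv(K)                               # Omega^U_j = (K^U_j)^{-1}
m_j = rng.standard_normal(NL)                          # variational mean m^U_j (placeholder)
V_j = np.eye(NL) * 0.1                                 # variational covariance V^U_j (placeholder)

k_star = se_kernel(X_train, x_star)[:, 0]              # K^U_{j*l}
k_ss = se_kernel(x_star, x_star)[0, 0]                 # K^U_{j(*l,*l)}

m_star = k_star @ Omega @ m_j
v_star = k_ss - k_star @ (Omega - Omega @ V_j @ Omega) @ k_star
print(m_star, v_star)
```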
f^S_d is independent of the test data input, and the predictive distribution p(f^S_{d*}) is the same as p(f^S_d). This is a Gaussian with mean m^S_d and covariance V^S_d. The computation of the expected value of the softmax with respect to the latent functions in (10) is intractable. Instead, we compute the softmax of the expected value of the latent functions and obtain a normalized probabilistic score. We refine the normalized score to take into account the uncertainty in the true labels associated with the dependencies, and compute the refined normalized score (RNS) as

RNS(y_{*l}, x_{*l}) = \frac{\exp(m^U_{y_{*l} *l} + v^U_{y_{*l} *l} + \sum_{d=1}^{R} E_{y^d_{*l}}[g_d(y^d_{*l}, y_{*l})])}{\sum_{q=1}^{J} \exp(m^U_{q*l} + v^U_{q*l} + \sum_{d=1}^{R} E_{y^d_{*l}}[g_d(y^d_{*l}, q)])}.

Here, g_d(y^d, y) = m^S_{d(y^d, y)} + V^S_{d((y^d,y),(y^d,y))} determines the contribution of the label y^d of dependency d in predicting the output label y. The RNS considers an expected value over all the possible labelings associated with a dependency d. The expectation is computed using the RNS values associated with the labels y^d_{*l} for the input x^d_{*l}, that is, E_{y^d_{*l}}[·] = \sum_{y^d_{*l}=1}^{J} RNS(y^d_{*l}, x^d_{*l})[·].

We provide an iterative approach to estimate the labels of a test output in Algorithm 2. An initial RNS value is computed without considering the dependencies. We then iteratively refine the RNS values, using the previously computed RNS values to take the dependencies into account. The process is continued until convergence. The final RNS values are used to make a prediction separately for each output component by assigning the label with the maximum RNS value.
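A rough NumPy sketch of this iterative refinement is given below, ahead of the precise statement in Algorithm 2; the predictive means and variances, the g_d terms and the handling of dependencies that fall outside the sequence are all placeholder assumptions.

```python
import numpy as np

def iterative_rns_predict(m_star, v_star, g, deps, n_iter=20, tol=1e-6):
    """Iteratively refine RNS scores and return the predicted label sequence.

    m_star, v_star : (L, J) predictive means / variances of the local latent functions.
    g              : list of (J, J) arrays with the g_d(y^d, y) contributions.
    deps           : list of offsets, e.g. [-1, +1] for previous and next components.
    """
    L, J = m_star.shape
    rns = np.exp(m_star + v_star)
    rns /= rns.sum(axis=1, keepdims=True)             # initial RNS, no dependencies
    for _ in range(n_iter):
        new = np.zeros_like(rns)
        for l in range(L):
            scores = m_star[l] + v_star[l]
            for g_d, off in zip(g, deps):
                ld = l + off
                if 0 <= ld < L:                        # skip dependencies falling off the sequence
                    scores = scores + rns[ld] @ g_d    # E_{y^d}[g_d(y^d, q)] under current RNS
            s = np.exp(scores - scores.max())
            new[l] = s / s.sum()
        if np.max(np.abs(new - rns)) < tol:
            rns = new
            break
        rns = new
    return rns.argmax(axis=1) + 1                      # labels in {1, ..., J}

L, J = 6, 3
rng = np.random.default_rng(3)
labels = iterative_rns_predict(rng.standard_normal((L, J)), np.full((L, J), 0.1),
                               g=[rng.standard_normal((J, J)) * 0.1 for _ in range(2)],
                               deps=[-1, +1])
print(labels)
```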
Algorithm 2 Prediction in Gaussian process sequence labeling model
1: Input: Test data x_* = (x_{*1}, ..., x_{*L}), posterior means {m^U_j}_{j=1}^{J} and {m^S_d}_{d=1}^{R}, and posterior covariances {V^U_j}_{j=1}^{J} and {V^S_d}_{d=1}^{R}
2: Obtain predictive means {{m^U_{j*l}}_{j=1}^{J}}_{l=1}^{L} and variances {{v^U_{j*l}}_{j=1}^{J}}_{l=1}^{L}
3: Initialize: RNS(y_{*l}, x_{*l}) = exp(m^U_{y_{*l} *l} + v^U_{y_{*l} *l}) / \sum_{q=1}^{J} exp(m^U_{q*l} + v^U_{q*l}), ∀ y_{*l} = 1, ..., J, ∀ l = 1, ..., L
4: Initialize: t = 0
5: repeat
6:   t = t + 1
7:   for l = 1 to L do
8:     for y_{*l} = 1 to J do
9:       RNS_t(y_{*l}, x_{*l}) = exp(m^U_{y_{*l} *l} + v^U_{y_{*l} *l} + \sum_{d=1}^{R} E_{y^d_{*l}}[g_d(y^d_{*l}, y_{*l})]) / \sum_{q=1}^{J} exp(m^U_{q*l} + v^U_{q*l} + \sum_{d=1}^{R} E_{y^d_{*l}}[g_d(y^d_{*l}, q)])
10:      where E_{y^d_{*l}}[·] = \sum_{y^d_{*l}=1}^{J} RNS_{t−1}(y^d_{*l}, x^d_{*l})[·]
11:    end for
12:  end for
13: until the change in RNS_t w.r.t. RNS_{t−1} is small
14: (ŷ_{*1}, ..., ŷ_{*L}) = (argmax_{y_{*1}} RNS_t(y_{*1}, x_{*1}), ..., argmax_{y_{*L}} RNS_t(y_{*L}, x_{*L}))
15: Return: (ŷ_{*1}, ..., ŷ_{*L})

The computational complexity of Algorithm 2 is O(J^2RL), and it is the same as that of the Viterbi algorithm [20] for the single dependency case. The convergence of Algorithm 2 follows from the analysis presented in [15] for a similar fixed point algorithm. The algorithm is found to converge in a few iterations in our experiments.

6 Experiments

We conduct experiments to study the generalization performance of the proposed Gaussian process sequence labeling (GPSL) model. We use sequence labeling problems in natural language processing to study the behavior of the proposed approach. Although the proposed approach is general and can handle dependencies of any length, we consider three different models of the proposed approach in our experiments. The first model, GPSL1, assumes that the current label depends only on the previous label. The second model, GPSL2, assumes that the current label depends both on the previous and the next label in the sequence. The third model, GPSL4, assumes that the current label depends on the previous two labels and the next two labels.

We consider four sequence labeling problems in natural language processing to study the performance of the proposed approach. The data sets for all these problems are obtained from the CRF++ toolbox (available at http://crfpp.googlecode.com/svn/trunk/doc/index.html). We provide a brief description of the task in each of these data sets.
Base NP: We need to identify noun phrases in a sentence. The starting word of a noun phrase is given the label B, while the words inside the noun phrase are given the label I. All the other words are given the label O. The task here is to assign each word a label from the set {B, I, O}.
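For instance, a made-up sentence labeled under this scheme:

```python
# Hypothetical Base NP example: each word is paired with a {B, I, O} tag.
words = ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
tags  = ["B",   "I",     "I",     "I",   "O",     "O",    "B",   "I",    "I"]
```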
Chunking: Shallow parsing or chunking identifies constituents in a sentence, such as noun phrases and verb phrases. Here, each word in a sentence is labeled as belonging to a verb phrase, noun phrase, etc. In the Chunking data set, words are assigned a label from a fixed set of chunk labels.

Segmentation: Segmentation is the process of finding meaningful segments in a text, such as words and sentences. We consider a word segmentation problem where words are identified from a Chinese sentence. The Segmentation data set assigns each unit in the sentence a label denoting whether it is the beginning of a word (B) or inside a word (I). The task is to assign one of these two labels to each unit in a sentence.

Japanese NE: We need to perform named entity recognition (NER), where the task is to identify whether the words in a sentence denote a named entity such as a person, place or time. We use the Japanese NE data set, where the Japanese words are assigned one of several named entity labels.

In all these data sets except Segmentation, a sentence is considered as an input and the words in the sentence as the input components.
In Segmentation, every character is considered as an input component. The features for each input component are extracted using the template files provided in the CRF++ package. The properties of all the data sets are summarized in Table 1. It mentions the number of sentences (N) used for training and testing. The effective sample size (NL) for the GPSL models is obtained by multiplying this quantity by the average sentence length, which increases the data size by an order of magnitude.

We compare the performance of the proposed approach with popular sequence labeling approaches: structural SVM (SSVM) [2] (code available at http://drona.csa.iisc.ernet.in/~shirish/structsvm_sdm.html), conditional random field (CRF) [5] (code available at http://leon.bottou.org/projects/sgd), and GPstruct [7] (code available at https://github.com/sebastien-bratieres/pygpstruct). All the models used a linear kernel. GPstruct experiments are run for 100000 elliptical slice sampling steps. The performance is measured in terms of the average Hamming loss over all the test data points. The Hamming loss between the actual test output y_* and the predicted test output ŷ_* is given by Loss(y_*, ŷ_*) = \sum_{l=1}^{L} I(y_{*l} ≠ ŷ_{*l}), where I(·) is the indicator function. Table 1 compares the performance (percentage of the average Hamming loss) of the various approaches on the four sequence labeling problems. The GPSL models, SSVM, CRF and GPstruct are run over independent partitions of the data set (the train and test set partitions are different from those used by [7]), and the mean of the Hamming loss over all the partitions, along with the standard deviation, is reported in Table 1.
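A small sketch of this per-sequence Hamming loss (the label sequences below are made up); the reported numbers average it over the test sequences and express it as a percentage.

```python
import numpy as np

def hamming_loss(y_true, y_pred):
    """Number of positions where the predicted label differs from the true one."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return int(np.sum(y_true != y_pred))

y_true = [1, 2, 2, 3, 1]
y_pred = [1, 2, 3, 3, 2]
print(hamming_loss(y_true, y_pred))       # 2 mismatches out of 5 positions
```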
The reported results show that the GPSL models with multiple dependencies performed better than GPstruct on Base NP and Segmentation. On the other two data sets, the GPSL models came close to GPstruct. We find that increasing the number of dependencies helped to improve the performance in general, except for the Segmentation data set. This is due to the difference in the nature of the sequence labeling task involved in segmentation.
Table 1. Properties of the sequence labeling data sets and a comparison of the performance of various models on these data sets. The approaches GPSL1, GPSL2, GPSL4, SSVM, CRF and GPstruct are compared using the average Hamming loss (in percentage). The numbers in bold face indicate the best results among these approaches. '⋆' and '†' denote whether the performance of a method is significantly different from the best performing method and from GPstruct respectively, according to a paired t-test with 5% significance level.

For the other data sets, the GPSL model which considered both the previous and the next label (GPSL2) gave a better performance. The performance of the GPSL model which considered the previous and the next two labels (GPSL4) improved only marginally, or worsened, compared to GPSL2 on these data sets. We note that increasing the number of dependencies beyond four did not bring any improvement in performance for the sequence labeling data sets that we have considered. Overall, the performance of the SSVM is found to be better than the other approaches on these sequence labeling data sets. However, the GPSL models have the advantage of being Bayesian and can provide a confidence over label predictions, which is useful for many NLP tasks.

The proposed GPSL models are implemented in Matlab. The GPSL Matlab programs are run on a 3.2 GHz Intel processor with 4GB of shared main memory under Linux. The SSVM approach is implemented in C, the CRF approach is coded in C++ and the GPstruct approach is in Python. Since the implementation languages differ, it is unfair to make a runtime comparison of the various approaches. Table 2 compares the average runtime (in seconds) for training the various GPSL models and GPstruct on the sequence labeling data sets. We find that the GPSL models are an order of magnitude faster than GPstruct. We also find that increasing the dependencies resulted in only a slight increase in runtime.
Table 2. Comparison of average running time (seconds) of various GPSL models and GPstruct.
Data          GPSL1      GPSL2      GPSL4      GPstruct
Segmentation  17.13      19.64      22.83      3.82e+03
Chunking      1.09e+03   1.35e+03   1.71e+03   4.56e+04
Base NP       6.01e+03   6.69e+03   7.25e+03   7.54e+04
Japanese NE   1.24e+03   1.56e+03   1.93e+03   4.92e+04

We conducted experiments to study the performance of Algorithm 2, which is used to make predictions. The algorithm is compared with the commonly used Viterbi algorithm [20] for the sequence labeling task. The Viterbi algorithm consists of a forward phase, which calculates the best value attained at the end of the sequence, and a backward phase, which finds the sequence of labels that leads to it. It is useful only for the setting where one considers a dependency with the previous label. Therefore, we study how the performance of the GPSL1 model differs when the Viterbi algorithm is used for prediction instead of the proposed algorithm. We consider an implementation of the Viterbi algorithm provided by the UGM toolkit [22]. Table 3 compares the predictive and runtime performance of the two algorithms. We observe that Algorithm 2 gave a better predictive and runtime performance than the Viterbi algorithm. The predictive performance of Algorithm 2 is significantly better than Viterbi on Segmentation, Chunking and Japanese NE; the t-values calculated using a paired t-test on these data sets are found to be greater than the critical value at the 5% level of significance for the corresponding degrees of freedom. We also observed that Algorithm 2 converged in 3-5 iterations on average.

Table 3. Comparison of the prediction algorithms using the GPSL1 model: average Hamming loss, paired t-test t-value, average runtime (seconds) and average number of iterations.
Data          Loss (Alg. 2)  Loss (Viterbi)  t-value  Runtime (Alg. 2)  Runtime (Viterbi)  Iterations (Alg. 2)
Segmentation  23.45          24.26           3.8183   0.1227            0.0856             5
Chunking      13.02          13.69           3.6421   0.2491            0.2628             5
Base NP       5.73           5.75            0.3162   0.5207            0.5338             4
Japanese NE   8.26           8.84            2.475    0.3661            0.5653             3

In many sequence labeling tasks in NLP, the labels of some of the output components might be missing in the training data set. This is common when crowd sourcing techniques are employed to obtain the labels. Sequence labeling approaches such as SSVM and CRF are not readily applicable to data sets with missing labels. GPSL models are useful for learning from data sets with missing labels due to their ability to capture larger dependencies. We learn the GPSL models from the sequence labeling data sets with some fraction of the labels missing. We vary the fraction of missing labels and study how the performance of our model varies with respect to the missing labels. Figure 2 shows the variation in performance of the various GPSL models as we vary the fraction of missing labels. The performance is measured in terms of accuracy, which is the complement of the average Hamming loss. We find that the performance of the GPSL models does not significantly degrade as the fraction of missing labels increases. Figure 2 shows that GPSL4, which uses the previous and the next two labels, provides a better performance than the other GPSL models. GPSL4 learns a better model by considering larger neighborhood information and is useful in handling data sets with missing labels.
Fig. 2. Variation in accuracy of the GPSL models as the fraction of missing labels is varied: (a) Base NP, (b) Chunking, (c) Segmentation, (d) Japanese NE.

7 Conclusions

We proposed a novel Gaussian process approach to perform sequence labeling based on a pseudo-likelihood approximation. The use of the pseudo-likelihood enabled the model to capture multiple dependencies without becoming computationally intractable. The approach used a faster inference scheme based on variational inference. We also proposed an approach to perform prediction which makes use of the information from the neighboring labels. The proposed approach is useful for a wide range of sequence labeling problems arising in natural language processing. Experimental results showed that the GPSL models, which capture multiple dependencies, are useful for sequence labeling problems. The ability to capture multiple dependencies makes them effective in handling data sets with missing labels.
References
1. Altun, Y., Hofmann, T., Smola, A.J.: Gaussian Process Classification for Segmenting and Annotating Sequences. In: ICML (2004)
2. Balamurugan, P., Shevade, S., Sundararajan, S., Keerthi, S.: A Sequential Dual Method for Structural SVMs. In: SDM. pp. 223-234 (2011)
3. Bertsekas, D.P.: Nonlinear Programming. Athena Scientific (1999)
4. Besag, J.: Statistical Analysis of Non-Lattice Data. The Statistician 24, 179-195 (1975)
5. Bottou, L.: Large-Scale Machine Learning with Stochastic Gradient Descent. In: COMPSTAT (2010)
6. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press (2004)
7. Bratieres, S., Quadrianto, N., Ghahramani, Z.: Bayesian Structured Prediction Using Gaussian Processes. IEEE Transactions on Pattern Analysis and Machine Intelligence (2014)
8. Bratieres, S., Quadrianto, N., Nowozin, S., Ghahramani, Z.: Scalable Gaussian Process Structured Prediction for Grid Factor Graph Applications. In: ICML (2014)
9. Chai, K.M.A.: Variational Multinomial Logit Gaussian Process. J. Mach. Learn. Res. 13 (2012)
10. Girolami, M., Rogers, S.: Variational Bayesian Multinomial Probit Regression with Gaussian Process Priors. Neural Computation 18(8), 1790-1817 (2006)
11. Heckerman, D., Chickering, D.M., Meek, C., Rounthwaite, R., Kadie, C.: Dependency Networks for Inference, Collaborative Filtering, and Data Visualization. J. Mach. Learn. Res. 1, 49-75 (2001)
12. Khan, M.E., Mohamed, S., Murphy, K.P.: Fast Bayesian Inference for Non-Conjugate Gaussian Process Regression. In: NIPS. pp. 3149-3157 (2012)
13. Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In: ICML. pp. 282-289 (2001)
14. Lafferty, J.D., Zhu, X., Liu, Y.: Kernel Conditional Random Fields: Representation and Clique Selection. In: ICML (2004)
15. Li, Q., Wang, J., Wipf, D.P., Tu, Z.: Fixed-Point Model for Structured Labeling. In: ICML. pp. 214-221 (2013)
16. Murphy, K.P.: Machine Learning: A Probabilistic Perspective. The MIT Press (2012)
17. Smith, N.A.: Linguistic Structure Prediction. Morgan and Claypool (2011)
18. Opper, M., Archambeau, C.: The Variational Gaussian Approximation Revisited. Neural Computation 21, 786-792 (2009)
19. Qi, Y., Szummer, M., Minka, T.P.: Bayesian Conditional Random Fields. In: AISTATS (2005)
20. Rabiner, L.R.: A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE 77(2), 257-286 (1989)
21. Rasmussen, C.E., Williams, C.K.I.: Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). MIT Press (2005)
22. Schmidt, M.: UGM: A Matlab Toolbox for Probabilistic Undirected Graphical Models (2007)