Conditional Random Fields and Support Vector Machines: A Hybrid Approach
Qinfeng Shi, The University of Adelaide, Adelaide, SA, Australia ([email protected])
Mark Reid, The Australian National University, Canberra, Australia ([email protected])
Tiberio Caetano, The Australian National University and NICTA, Canberra, Australia
October 25, 2018
Abstract
We propose a novel hybrid loss for multiclass and structured prediction problems that is a convex combination of log loss for Conditional Random Fields (CRFs) and a multiclass hinge loss for Support Vector Machines (SVMs). We provide a sufficient condition for when the hybrid loss is Fisher consistent for classification. This condition depends on a measure of dominance between labels – specifically, the gap in per-observation probabilities between the most likely labels. We also prove that Fisher consistency is necessary for parametric consistency when learning models such as CRFs.

We demonstrate empirically that the hybrid loss typically performs at least as well as – and often better than – both of its constituent losses on a variety of tasks. In doing so we also provide an empirical comparison of the efficacy of probabilistic and margin-based approaches to multiclass and structured prediction, and of the effects of label dominance on these results.
Conditional Random Fields (CRFs) and Support Vector Machines (SVMs) can be seen as representative of two different approaches to classification problems. The former is purely probabilistic – the conditional probability of classes given each observation is explicitly modelled – while the latter is purely discriminative – classification is performed without any attempt to model probabilities. Both approaches have their strengths and weaknesses. CRFs [7, 11] are known to yield the Bayes optimal solution asymptotically but often require a large number of training examples to do accurate modelling. In contrast, SVMs make more efficient use of training examples but are known to be inconsistent when there are more than two classes [13, 8].

Despite their differences, CRFs and SVMs appear very similar when viewed as optimisation problems. The most salient difference is the loss used by each: CRFs are trained using a log loss while SVMs typically use a hinge loss. In an attempt to capitalise on their relative strengths and avoid their weaknesses, we propose a novel hybrid loss which "blends" the two losses. After some background (§2) we provide the following analysis.
We argue that Fisher Consistency for Classification (FCC) – a.k.a. classification calibration – is too coarse a notion and introduce a distribution-dependent refinement called Conditional Fisher Consistency for Classification.

In classification problems, observations x ∈ X are paired with labels y ∈ Y via some joint distribution D over X × Y. We will write D(x, y) for the joint probability and D(y|x) for the conditional probability of y given x. Since the labels y are finite and discrete we will also use the notation D_y(x) for the conditional probability, to emphasise that distributions over Y can be thought of as vectors in R^k for k = |Y|. We will use q to denote distributions over Y when the observations x ∈ X are irrelevant.

When the number of possible labels k = |Y| > 2 we call the classification problem a multiclass classification problem. A special case of this type of problem is structured prediction, where the set of labels Y has some combinatorial structure that typically means k is very large [1]. As seen in the experimental section below, a variety of problems, such as text tagging, can be construed as structured prediction problems. (In structured prediction, each output y involves relationships among 'sub-components' of y. For example, the label of a pixel in an image depends on the labels of neighbouring pixels; that is where the term 'structured' comes from. However, different y's are typically not assumed to possess any joint structure, i.e., it is typically assumed that the data is drawn i.i.d. from X × Y. This is why structured prediction is no different in essence from multiclass classification.)

Given m training observations S = {(x_i, y_i)}_{i=1}^m drawn i.i.d. from D, the aim of the learner is to produce a predictor h : X → Y that minimises the misclassification error e_D(h) = P_D[h(x) ≠ y]. Since the true distribution is unknown, an approximate solution to this problem is typically found by minimising a regularised empirical estimate of a surrogate loss ℓ. Examples of surrogate losses will be discussed below. Once a loss is specified, a solution is found by solving

    min_f (1/m) Σ_{i=1}^m ℓ(f(x_i), y_i) + Ω(f)    (1)

where each model f : X → R^k assigns a vector of scores f(x) to each observation and the regulariser Ω(f) penalises overly complex functions. A model f found in this way can be transformed into a predictor by defining h_f(x) = argmax_{y∈Y} f_y(x). We will overload the definition of misclassification error and sometimes write e_D(f) as shorthand for e_D(h_f).

In structured prediction, the models are usually specified in terms of a parameter vector w ∈ R^n and a feature map φ : X × Y → R^n by defining f_y(x; w) = ⟨w, φ(x, y)⟩, and in this case the regulariser is Ω(f) = λ‖w‖² for some choice of λ ∈ R. This is the framework used to implement the SVMs and CRFs used in the experiments described in Section 4. Although much of our analysis does not assume any particular parametric model, we explicitly discuss the implications of doing so below.

The multiclass hinge loss used to train SVMs is the margin-based loss

    ℓ_H(f, y) = [1 − M(f, y)]_+    (2)

where [z]_+ = z for z > 0 and is 0 otherwise, and M(f, y) = f_y − max_{y'≠y} f_{y'} is the margin for the vector f ∈ R^k.
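To make the margin and the loss (2) concrete, the following is a minimal NumPy sketch (not the authors' code; the feature map φ and the weights below are hypothetical toy values) that computes f_y(x; w) = ⟨w, φ(x, y)⟩ for every candidate label and evaluates the margin-based hinge loss.

    import numpy as np

    def scores(w, phi, x, labels):
        # f_y(x; w) = <w, phi(x, y)> for every candidate label y.
        return np.array([w @ phi(x, y) for y in labels])

    def hinge_loss(f, y):
        # Equation (2): [1 - M(f, y)]_+ with margin M(f, y) = f_y - max_{y' != y} f_{y'}.
        margin = f[y] - np.delete(f, y).max()
        return max(0.0, 1.0 - margin)

    # Toy example with a hypothetical one-hot-block feature map phi(x, y).
    labels = [0, 1, 2]
    phi = lambda x, y: np.concatenate([x * (y == k) for k in labels])
    x = np.array([1.0, -0.5])
    w = np.zeros(2 * len(labels)); w[2:4] = [2.0, -1.0]   # favours label 1 on this x
    f = scores(w, phi, x, labels)
    print(f, hinge_loss(f, y=1))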
Intuitively, the hinge loss is minimised by models that not only classify observations correctly but also maximise the difference between the highest and second highest scores assigned to the labels.

While there are other, consistent losses for SVMs [13, 8], these cannot scale up to structured estimation due to computational issues. For example, the multiclass hinge loss Σ_{j≠y} [1 + f_j(x)]_+ is shown to be consistent in [8]. However, it requires evaluating f on all possible labels except the true y. This is intractable for structured estimation, where the number of possible labels grows exponentially with the size of the structured output. Since the other known consistent multiclass hinge losses have similar intractability, we focus only on the margin-based loss ℓ_H, which can be evaluated quickly using techniques from dynamic programming, linear programming, etc. [14, 12, 1].

The scores given to labels by a general model f : X → R^k can be transformed into a conditional probability distribution p(x; f) ∈ [0, 1]^k by letting

    p_y(x; f) = exp(f_y(x)) / Σ_{y'∈Y} exp(f_{y'}(x)).    (3)

It is easy to show that under this interpretation the hinge loss for a probabilistic model p = p(·; f) is given by

    ℓ_H(p, y) = [1 − ln(p_y / max_{y'≠y} p_{y'})]_+ .

The log loss used to train CRFs is ℓ_L(p, y) = −ln p_y. This loss penalises models that assign low probability to likely labels and, implicitly, models that assign high probability to unlikely labels.

We now propose a novel hybrid loss for probabilistic models that is a convex combination of the hinge and log losses,

    ℓ_α(p, y) = α ℓ_L(p, y) + (1 − α) ℓ_H(p, y)    (4)

where the mixture of the two losses is controlled by a parameter α ∈ [0, 1]. Setting α = 1 or α = 0 recovers the log loss or the hinge loss, respectively. The intention is that choosing α close to 0 will emphasise having the maximum gap between the largest and second largest label probabilities, while an α close to 1 will force models to prefer accurate probability assessments over strong classification.
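The softmax map (3), the log loss, the probabilistic form of the hinge loss and the hybrid combination (4) follow directly from the definitions above; the sketch below is illustrative only and is not the authors' implementation.

    import numpy as np

    def probs(f):
        # Softmax of the scores, equation (3).
        z = np.exp(f - f.max())          # subtract the max for numerical stability
        return z / z.sum()

    def log_loss(p, y):
        # l_L(p, y) = -ln p_y
        return -np.log(p[y])

    def hinge_loss_prob(p, y):
        # l_H(p, y) = [1 - ln(p_y / max_{y' != y} p_{y'})]_+
        runner_up = np.delete(p, y).max()
        return max(0.0, 1.0 - np.log(p[y] / runner_up))

    def hybrid_loss(p, y, alpha):
        # Equation (4): alpha * log loss + (1 - alpha) * hinge loss.
        return alpha * log_loss(p, y) + (1 - alpha) * hinge_loss_prob(p, y)

    f = np.array([2.0, 1.0, -0.5])       # scores for three labels
    p = probs(f)
    for alpha in (0.0, 0.5, 1.0):        # alpha = 0 gives the hinge loss, alpha = 1 the log loss
        print(alpha, hybrid_loss(p, y=0, alpha=alpha))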
A desirable property for a loss is that, given enough data, the models obtained by minimising the loss at each observation will make predictions that are consistent with the true label probabilities at each observation. Formally, we say a vector f ∈ R^|Y| is aligned with a distribution q over Y whenever maximisers of f are also maximisers for q, that is, when argmax_{y∈Y} f_y ⊆ argmax_{y∈Y} q_y. If, for all label distributions q, minimising the conditional risk L(f) = E_{y∼q}[ℓ(f, y)] for a loss ℓ yields a vector f* aligned with q, we say ℓ is Fisher consistent for classification (FCC) – or classification calibrated [13]. This is an important property for losses since it is equivalent to the asymptotic consistency of the empirical risk minimiser for that loss [13, Theorem 2].

The standard multiclass hinge loss ℓ_H is known to be inconsistent for classification when there are more than two classes [8, 13]. The analysis in [8] shows that the hinge loss is inconsistent whenever there is an instance x with a non-dominant distribution – that is, D_y(x) < 1/2 for all y ∈ Y. Conversely, a distribution is dominant for an instance x if there is some y with D_y(x) > 1/2.

In contrast, the log loss used to train non-parametric CRFs is Fisher consistent for probability estimation – that is, the associated risk is minimised by the true conditional distribution – and thus ℓ_L is FCC, since the minimising distribution is equal to D(x) and therefore aligned with D(x).

In order to analyse the consistency of the hybrid loss we introduce a more refined notion of Fisher consistency that takes into account the true distribution of class labels. If q = (q_1, ..., q_k) is a distribution over the labels Y then we say the loss ℓ is conditionally FCC with respect to q whenever minimising the conditional risk w.r.t. q, L_q(f) = E_{y∼q}[ℓ(f, y)], yields a predictor f* that is consistent with q. (Note that Fisher consistency for classification is weaker than Fisher consistency for density estimation: the former requires only the same prediction, while the latter requires that the estimated density equal the true data distribution. In this paper we focus on the former only.) Of course, if a loss ℓ is conditionally FCC w.r.t. q for all q it is, by definition, (unconditionally) FCC.

Theorem 1
Let q = (q_1, ..., q_k) be a distribution over labels and let y_1 = argmax_y q_y and y_2 = argmax_{y≠y_1} q_y be the two most likely labels. Then the hybrid loss ℓ_α is conditionally FCC for q whenever

    q_{y_1} > 1/2   or   α > 1 − (q_{y_1} − q_{y_2}) / (1 − q_{y_1}).    (5)

For the proof see Appendix A. Theorem 1 can be inverted and interpreted as a constraint on the conditional distributions of some data distribution D such that a hybrid loss with parameter α will yield consistent predictions. Specifically, the hybrid loss will be consistent if, for all x ∈ X such that q = D(x) has no dominant label (i.e., D_y(x) ≤ 1/2 for all y ∈ Y), the gap D_{y_1}(x) − D_{y_2}(x) between the top two probabilities is larger than (1 − α)(1 − D_{y_1}(x)). When this is not the case for some x, the classification problem for that instance is, in some sense, too difficult to disambiguate. In this sense, the bound can be seen as a property of distributions akin to Tsybakov's noise condition [?]. Making this analogy precise is the focus of ongoing work.
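Condition (5) is straightforward to test for any given conditional label distribution. The sketch below simply restates the sufficient condition of Theorem 1; the helper name and the example values are illustrative only.

    import numpy as np

    def hybrid_is_conditionally_fcc(q, alpha):
        # Sufficient condition of Theorem 1: the hybrid loss is conditionally FCC
        # for q if q has a dominant label (q_{y1} > 1/2) or if
        # alpha > 1 - (q_{y1} - q_{y2}) / (1 - q_{y1}).
        q = np.sort(np.asarray(q))[::-1]        # q[0], q[1] are the two largest probabilities
        if q[0] > 0.5:                          # dominant label
            return True
        threshold = 1.0 - (q[0] - q[1]) / (1.0 - q[0])
        return alpha > threshold

    # Non-dominant example with top probabilities 0.46 and 0.27 (threshold ~0.648).
    print(hybrid_is_conditionally_fcc([0.46, 0.27, 0.27], alpha=0.9))   # True
    print(hybrid_is_conditionally_fcc([0.46, 0.27, 0.27], alpha=0.3))   # False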
Since Fisher consistency is defined point-wise on observations, it is not directly applicable to parametric models, as these enforce inter-observational constraints (e.g. smoothness). Abstractly, assuming parametric hypotheses can be seen as a restriction over the space of allowable scoring functions. When learning parametric models, risks are minimised over some subset F of functions from X → R^Y instead of over all possible functions. We now show that, given some weak assumptions on the hypothesis class F, a loss being FCC is a necessary condition if the loss is also to be F-consistent.

We say a loss ℓ is F-consistent if, for any distribution, minimising its associated risk over F yields a hypothesis with minimal 0-1 loss in F. Recall that the risk of a hypothesis f ∈ F associated with a loss ℓ and distribution D over X × Y is L_D(f) = E_D[ℓ(y, f(x))], and its 0-1 risk or misclassification error is e_D(f) = P_D[y ≠ argmax_{y'∈Y} f_{y'}(x)]. Formally then, given a function class F we say ℓ is F-consistent if, for all distributions D,

    L_D(f*) = inf_{f∈F} L_D(f)  ⟹  e_D(f*) = inf_{f∈F} e_D(f).    (6)

(While this is simpler and stronger than the usual asymptotic notion of consistency [?], it most readily relates to FCC and suffices for our discussion, since we are only establishing that FCC is a necessary condition.)

We need a relatively weak condition on function classes F to state our theorem. We say a class F is regular if the following two properties hold: 1) for any g ∈ R^Y there exists an x ∈ X and an f ∈ F so that f(x) = g; and 2) for any x ∈ X and y ∈ Y there exists an f ∈ F so that y = argmax_{y'∈Y} f_{y'}(x). Intuitively, the first condition ensures that F is rich enough to realise any vector of scores at some observation, and the second that F can predict any given label at any given observation.

Theorem 2

For regular function classes F, any loss that is F-consistent is necessarily also Fisher Consistent for Classification (FCC).

The full proof is in Appendix B. The argument sketch is as follows: since F-consistency requires (6) to hold for all D, it must hold for a D with all of its mass on a single observation x. If ℓ is not FCC there must be some label distribution q and vector g so that L_q(g) is minimal but e_q(g) is not. Choosing x so that f(x) = g (by the regularity of F) and setting D(y|x) = q gives a contradiction.

We now give a PAC-Bayesian bound [10] for the generalisation error e_D of the hybrid model that can be specialised to recover a bound for the multiclass hinge loss. A similar, alternative bound for the hybrid loss and an extended proof are available in Appendix C.

Theorem 3 (Generalisation Margin Bound)
For any data distribution D, for any prior P over w, for any w, any δ ∈ (0, 1], any γ > 0 and any α ∈ (0, 1), with probability at least 1 − δ over random samples S from D with m instances, there exists a constant c such that

    e_D ≤ P_{(x,y)∼S}( E_Q[M(w', y)] ≤ γ ) + O( sqrt( [ (‖w‖² c / ((1−α)² γ²)) ln(m|Y|) + ln m + ln(1/δ) ] / m ) ).

Proof [sketch]. By choosing the weight prior P(w) = (1/Z₁) exp(−‖w‖²) and the posterior Q(w') = (1/Z₂) exp(−‖w' − w‖²), one can show e_D = P_D( E_Q[M(w', y)] ≤ 0 ) by the symmetry argument proposed in [?, 9]. Applying the PAC-Bayes margin bound [?, ?] and noting that the margin threshold satisfies γ' ≤ c(1−α)γ and that KL(Q‖P) = ‖w‖² yields the theorem.

Setting α = 0 in the above bound recovers a margin bound for SVMs (see [?] for an averaging classifier version of SVMs, and [?] for the structured case). Unfortunately, one cannot set α = 1 to obtain a PAC-Bayes bound for a pure log loss classifier in this manner, due to the (1 − α)^{-1} dependence. However, we are not aware of any PAC-Bayes bound on the generalisation error for log loss.

Figure 1: Training error for the hybrid, log and hinge losses vs. the number of classes in the non-dominant case.

The analysis of the hybrid loss suggests it should be able to outperform the hinge loss due to its improved consistency on distributions with non-dominant labels. Furthermore, it should also make more efficient use of data than the log loss on distributions with dominant labels. These hypotheses were confirmed by applying the hybrid, log and hinge losses to a number of synthetic multiclass data sets in which the data set size and the proportion of examples with non-dominant labels are carefully controlled.

We also compared the hybrid loss with the log and hinge losses on several real structured estimation problems and observed that the hybrid loss regularly outperforms the other losses and consistently performs at least as well as the better of the log and hinge losses on any problem.

Two types of multiclass simulations were performed. The first examined the performance of the hybrid, log and hinge losses when no observations had a dominant label, that is, all observations were drawn from a D with D_y(x) < 1/2 for all labels y. The second experiment considered distributions with a controlled mixture of observations with dominant and non-dominant labels.

Non-dominant Distributions
To make the experiment as simple as possible, we considered an observation space of size |X| = 1 and focused on varying the number of labels and their probabilities. The label set Y took the sizes |Y| = 3, 4, ..., 10. One label y* ∈ Y was assigned probability D_{y*}(x) = 0.46 and the remainder were given an equal portion of 0.54 (e.g., in the 3-class case the other labels each have probability 0.27, and in the 10-class case, 0.06). Note that this means that for all label set sizes the gap D_{y*}(x) − D_y(x) is at least 0.19, which is always greater than (1 − α)(1 − D_{y*}(x)) for the α used, so the hybrid consistency condition (5) is always met. Features were a constant value in R, as were the parameter vectors w_y ∈ R for y ∈ Y. Models were found using LBFGS [3]. The resulting training errors for the hinge, log and hybrid losses are plotted in Figure 1 as a function of the number of labels. As we can clearly see, the hinge loss error increases as the number of classes increases, whereas the errors for the log and hybrid losses remain constant at 1 − D_{y*}(x) = 0.54, in concordance with the consistency analysis.
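The label distributions used in this experiment can be constructed directly from the description above; the following sketch (illustrative only, not the original experiment code) builds them and draws labels for the single constant observation.

    import numpy as np

    def non_dominant_distribution(k, p_star=0.46):
        # One label gets p_star < 1/2; the remaining 1 - p_star (= 0.54) is shared
        # equally by the other k - 1 labels (0.27 each for k = 3, 0.06 each for k = 10).
        rest = (1.0 - p_star) / (k - 1)
        return np.array([p_star] + [rest] * (k - 1))

    rng = np.random.default_rng(0)
    for k in (3, 5, 10):
        dist = non_dominant_distribution(k)
        labels = rng.choice(k, size=5, p=dist)   # labels for the single constant observation
        print(k, dist.round(3), labels)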
Mix of Non-dominant and Dominant Distributions

The second synthetic experiment examined how the three losses performed given various training set sizes (denoted by m) and various proportions of instances with non-dominant distributions (denoted by ρ).

Figure 2: Performance of the hybrid, hinge, and log losses on non-dominant/dominant mixtures: (a) Hybrid vs. Hinge (31/15), (b) Hybrid vs. Log (34/15), (c) Hinge vs. Log (30/23). Points denote pairs of test accuracies for models trained on one of 60 data sets using the losses named on the axes. A score (a/b) denotes the vertical loss with a wins and b losses (ties not counted).

We generated 60 different data sets, all with Y = {1, 2, 3, 4, 5}, in the following manner. Instances came from either a non-dominant class distribution or a dominant class distribution. In the non-dominant class case, x is set to a predefined, constant, non-zero vector and its label distribution assigns one probability to label 1 and an equal, smaller share to each label y > 1, with no label exceeding probability 1/2. In the dominant case, each dimension x_i was drawn from a normal distribution N(µ = 1 + y, σ) depending on the class y = 1, ..., 5. The proportion ρ ranged over 10 values ρ = 0.1, 0.2, ..., 1.0, and for each ρ, test and validation sets of size 1000 were generated. Training sets of six sizes starting at m = 30 were used for each ρ value, for a total of 60 training sets. The optimal regularisation parameter λ and hybrid loss parameter α were selected using the validation set for each loss on each training set (a sketch of this selection protocol is given at the end of this subsection). Models with parameters w_y for y ∈ Y were then found using LBFGS [3] for each of the three losses on each of the 60 training sets and assessed using the test set.

The results are summarised in Figure 2. Each point shows the test accuracy for a pair of losses. The predominance of points above the diagonal lines in (a) and (b) shows that the hybrid loss outperforms the hinge loss and the log loss on most of the data sets, while the log and hinge losses perform competitively against each other.

Unlike the general multiclass case, structured estimation problems have a higher chance of non-dominant distributions because of the very large number of labels as well as ties or ambiguity regarding those labels. For example, in text chunking, changing the tag of one phrase while leaving the rest unchanged should not drastically change the probability predictions – especially when there are ambiguities. Because of the prevalence of non-dominant distributions, we expect training models using a hinge loss to perform poorly on these problems relative to training with the hybrid or log losses.
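As in the synthetic experiments above, the structured experiments that follow choose the regularisation parameter λ and the hybrid weight α on a validation set. That protocol amounts to a simple grid search; the sketch below assumes hypothetical train_fn and error_fn helpers standing in for the actual hybrid-loss trainer and 0-1 error, and the grids shown are placeholders.

    import itertools

    def select_hyperparameters(train_fn, error_fn, train_set, val_set,
                               lambdas=(1e-3, 1e-2, 1e-1),
                               alphas=(0.0, 0.25, 0.5, 0.75, 1.0)):
        # Fit a model for every (lambda, alpha) pair on the training split and
        # keep the pair with the lowest validation error.
        best = None
        for lam, alpha in itertools.product(lambdas, alphas):
            model = train_fn(train_set, lam, alpha)
            err = error_fn(model, val_set)
            if best is None or err < best[0]:
                best = (err, lam, alpha, model)
        return best    # (validation error, lambda, alpha, fitted model)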
CONLL2000 Text Chunking
Our first structured estimation experiment is carried out on the CONLL2000 text chunking task [4]. The data set has 8936 training sentences and 2012 testing sentences, with 106978 and 23852 phrases (a.k.a. chunks) respectively.
Figure 3: Estimated probabilities of the true label D_{y_i}(x_i) and of the most likely label D_{y*_i}(x_i), for (a) the testing set and (b) the training set. Sentences are sorted according to D_{y_i}(x_i) and D_{y*_i}(x_i) respectively in ascending order. D = 1/2 is shown as the straight black dotted line. About 700 sentences out of 2012 in the testing set and 2000 sentences out of 8936 in the training set have no dominant class.

Table 1: Accuracy, precision, recall and F1 score on the CONLL2000 text chunking task.

    Train Portion   Loss     Accuracy   Precision   Recall   F1 Score
    0.1             Hinge    91.14      85.31       85.52    85.41
    0.1             Log      92.05      87.04       87.01    87.02
    0.1             Hybrid   92.07      87.17       86.93    87.05
    1               Hinge    94.61      91.23       91.37    91.30
    1               Log      95.10      92.32       91.97    92.15
    1               Hybrid   95.11      92.35       92.00    92.17

The task is to divide a text into syntactically correlated parts of words, such as noun phrases, verb phrases, and so on. For a sentence with L chunks, its label consists of the tagging sequence of all its chunks, i.e. y = (y^1, y^2, ..., y^L), where y^j is the chunking tag for chunk j. As is common for this task, the label y is modelled as a 1D Markov chain to account for the dependency between adjacent chunking tags (y^j_i, y^{j+1}_i) given observation x_i. Clearly, the model has exponentially many possible labels, which suggests there are many non-dominant classes.

Since the true underlying distribution is unknown, we train a CRF on the training set and then apply the trained model to both the testing and training data sets to get an estimate of the conditional distributions for each instance (a sketch of this computation is given at the end of this subsection). We sort the sentences x_i from highest to lowest estimated probability of the true chunking label y_i given x_i. The result is plotted in Figure 3, from which we observe the existence of many non-dominant distributions — about 1/3 of the testing sentences and about 1/4 of the training sentences.

We split the data into three parts: training, testing and validation. We used the feature template from the CRF++ toolkit [6] and the CRF code from Leon Bottou [2]. λ and the weight α were determined via parameter selection using the validation set. To see the performance with different training sizes, we took part of the training data to learn the model and gathered statistics on the test set. The accuracy, precision, recall and F1 score on the test set are reported in Table 1 when using 10% and 100% of the training set. The hybrid loss outperforms both the hinge loss and the log loss (albeit marginally).

baseNP Chunking

A similar methodology to the previous experiment is applied to the BaseNP data set [6]. It has 900 sentences in total, and the task is to automatically classify whether a chunking phrase is a baseNP or not. We split the data into three parts: training, testing and validation. Once again, λ and α are determined via model selection on the validation set. We report the test accuracy, precision, recall and F1 score in Table 2 for training on increasing proportions of the training set. The hybrid outperforms the other two losses on all measures.
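The dominance diagnostic referred to above – estimating, for each sentence, the conditional probability of its gold tag sequence and of its most likely (Viterbi) tag sequence under a trained CRF – can be sketched as follows. Here crf_log_prob and crf_viterbi are placeholders for whatever CRF implementation is used; they are not functions defined in the paper.

    import numpy as np

    def dominance_statistics(sentences, gold_labels, crf_log_prob, crf_viterbi):
        # For each sentence x_i return (P(y_i | x_i), P(y*_i | x_i)), where y_i is the
        # gold tag sequence and y*_i the Viterbi (most likely) sequence under the CRF.
        # Sentences whose most likely sequence has probability <= 1/2 have no dominant label.
        p_true, p_best = [], []
        for x, y in zip(sentences, gold_labels):
            y_star = crf_viterbi(x)                     # argmax_y P(y | x)
            p_true.append(np.exp(crf_log_prob(x, y)))
            p_best.append(np.exp(crf_log_prob(x, y_star)))
        p_true, p_best = np.array(p_true), np.array(p_best)
        n_non_dominant = int((p_best <= 0.5).sum())
        return np.sort(p_true), np.sort(p_best), n_non_dominant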
Japanese named entity recognition

Finally, we used a multiclass data set containing 716 Japanese sentences and 17 annotated named entities [6]. The task is to locate and classify proper nouns and numerical information in a document into certain classes of named entities, such as names of persons, organizations, and locations. We train all three models on 216 sentences and test on 500 sentences with the default parameters found in Bottou's CRF code. The extra parameter α is selected for the smallest test error. The results are reported in Table 3. Once again, the hybrid loss outperforms the other two losses.

Conclusion and Discussion
We have provided theoretical and empirical motivation for the use of a novel hybrid loss for multiclass and structured prediction problems which can be used in place of the more common log loss or multiclass hinge loss. This new loss attempts to blend the strengths of purely discriminative approaches to classification, such as Support Vector Machines, with probabilistic approaches, such as Conditional Random Fields. Theoretically, the hybrid loss enjoys better consistency guarantees than the hinge loss, while experimentally we have seen that the addition of a purely discriminative component can improve accuracy when data is scarce.
Theoretically, we expect that some stronger sufficient conditions on α are possible, since the bounds used to establish Theorem 1 are not tight. Our conjecture is that a necessary and sufficient condition would include a dependency on the number of classes. We are also investigating connections between α and the multiclass Tsybakov noise condition [?].

To our knowledge, the notion of a regular function class for the purposes of consistency analysis is a novel one. Characterisations of this property for various existing parametric models would make testing for regularity easier.

One current limitation of the hybrid model is the use of a single, fixed α for all observations in a training set. One interesting avenue to explore would be trying to dynamically estimate a good value of α on a per-observation basis. This may further improve the efficacy of the hybrid loss by exploiting the robustness of SVMs (low α) when the label distribution for an observation has a dominant class, but switching to probability estimation via CRFs (high α) when this is not the case.

References

[1] G. Bakir, T. Hofmann, B. Schölkopf, A. Smola, B. Taskar, and S. V. N. Vishwanathan. Predicting Structured Data. MIT Press, Cambridge, Massachusetts, 2007.
[2] Leon Bottou. Stochastic gradient descent for conditional random fields (CRFs), version 1.3, 2010. http://leon.bottou.org/projects/sgd
[3] Richard H. Byrd, Jorge Nocedal, and Robert B. Schnabel. Representations of quasi-Newton matrices and their use in limited memory methods. Mathematical Programming, 1994.
[4] CoNLL. Shared task for the Conference on Computational Natural Language Learning (CoNLL-2000), 2000.
[5] K. Crammer and Y. Singer. On the learnability and design of output codes for multiclass problems. In N. Cesa-Bianchi and S. Goldman, editors, Proc. Annual Conf. Computational Learning Theory, pages 35–46, San Francisco, CA, 2000. Morgan Kaufmann Publishers.
[6] Taku Kudo. CRF++: Yet another CRF toolkit, version 0.53, 2010. http://crfpp.sourceforge.net/
[7] J. D. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic modeling for segmenting and labeling sequence data. In Proc. Intl. Conf. Machine Learning, volume 18, pages 282–289, San Francisco, CA, 2001. Morgan Kaufmann.
[8] Yufeng Liu. Fisher consistency of multicategory support vector machines. In Proc. Intl. Conf. Machine Learning, 2007.
[9] David McAllester. Generalization bounds and consistency for structured labeling. In Predicting Structured Data, Cambridge, Massachusetts, 2007. MIT Press.
[10] David A. McAllester. Some PAC-Bayesian theorems. In Proc. Annual Conf. Computational Learning Theory, pages 230–234, Madison, Wisconsin, 1998. ACM Press.
[11] F. Sha and F. Pereira. Shallow parsing with conditional random fields. In Proceedings of HLT-NAACL, pages 213–220, Edmonton, Canada, 2003. Association for Computational Linguistics.
[12] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16, pages 25–32, Cambridge, MA, 2004. MIT Press.
[13] A. Tewari and P. L. Bartlett. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8:1007–1025, 2007.
[14] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453–1484, 2005.
A Proof for Consistency
Proof of Theorem 1
We use L_α(p, D) = E_{y∼D}[ℓ_α(p, y)] for the conditional risk and ∆(Y) to denote the set of distributions over Y. Since we are free to permute labels within Y, we will assume without loss of generality that D_1 = max_{y∈Y} D_y and D_2 = max_{y≠1} D_y. The proof now proceeds by contradiction and assumes there is some minimiser p = argmin_{q∈∆(Y)} L_α(q, D) that is not aligned with D. That is, there is some y* ≠ 1 such that p_{y*} ≥ p_1. For simplicity, and again without loss of generality, we will assume y* = 2.

The first case to consider is when p_2 is a maximum and p_1 < p_2. Here we construct a q that "flips" the values of p_1 and p_2 and leaves all the other values unchanged. That is, q_1 = p_2, q_2 = p_1 and q_y = p_y for all y = 3, ..., k. Intuitively, this new point is closer to D and therefore the CRF component of the loss will be reduced while the SVM loss will not increase. The difference in conditional risks satisfies

    L_α(p, D) − L_α(q, D) = Σ_{y=1}^k D_y (ℓ_α(p, y) − ℓ_α(q, y))
        = D_1 (ℓ_α(p, 1) − ℓ_α(q, 1)) + D_2 (ℓ_α(p, 2) − ℓ_α(q, 2))
        = (D_1 − D_2)(ℓ_α(q, 2) − ℓ_α(q, 1))

since ℓ_α(p, 1) = ℓ_α(q, 2) and ℓ_α(p, 2) = ℓ_α(q, 1), and the other terms cancel by construction. As D_1 − D_2 > 0 by assumption, all that is required now is to show that

    ℓ_α(q, 2) − ℓ_α(q, 1) = α ln(q_1/q_2) + (1 − α)(ℓ_H(q, 2) − ℓ_H(q, 1))

is strictly positive. Since q_1 > q_y for y ≠ 1 we have ln(q_1/q_2) > 0, ℓ_H(q, 2) = [1 − ln(q_2/q_1)]_+ > 1, and ℓ_H(q, 1) = [1 − ln(q_1/q_2)]_+ < 1, and so ℓ_H(q, 2) − ℓ_H(q, 1) > 0. Thus, ℓ_α(q, 2) − ℓ_α(q, 1) > 0 as required.

Now suppose that p_1 = p_2 is a maximum. In this case we show that a slight perturbation q = (p_1 + ε, p_2 − ε, p_3, ..., p_k) yields a lower risk for small ε > 0. For y ≠ 1, 2 we have ℓ_L(p, y) − ℓ_L(q, y) = 0, and since p_1 > p_y and q_1 > q_y,

    ℓ_H(p, y) − ℓ_H(q, y) = (1 − ln(p_y/p_1)) − (1 − ln(q_y/q_1)) = ln(p_1/q_1) > 1 − q_1/p_1 = −ε/p_1

since −ln x > 1 − x for x ∈ (0, 1) and q_1 = p_1 + ε. Therefore

    ℓ_α(p, y) − ℓ_α(q, y) > −ε (1 − α)/p_1.    (7)

When y = 1,

    ℓ_L(p, 1) − ℓ_L(q, 1) = −ln(p_1/q_1) > 1 − p_1/q_1 = ε/(p_1 + ε)

and, since p_1 = p_2,

    ℓ_H(p, 1) − ℓ_H(q, 1) = (1 − ln(p_1/p_2)) − (1 − ln(q_1/q_2)) = ln(q_1/q_2) = ln((p_1 + ε)/(p_1 − ε)) > 1 − (p_1 − ε)/(p_1 + ε) = 2ε/(p_1 + ε).

And so

    ℓ_α(p, 1) − ℓ_α(q, 1) > ε [ α/(p_1 + ε) + 2(1 − α)/(p_1 + ε) ].    (8)

Finally, when y = 2 we have

    ℓ_L(p, 2) − ℓ_L(q, 2) = −ln(p_2/q_2) > 1 − p_2/q_2 = −ε/(p_2 − ε)

and

    ℓ_H(p, 2) − ℓ_H(q, 2) = (1 − ln(p_2/p_1)) − (1 − ln(q_2/q_1)) = ln(q_2/q_1) > 1 − q_1/q_2 = −2ε/(p_2 − ε).

Thus,

    ℓ_α(p, 2) − ℓ_α(q, 2) > −ε [ α/(p_2 − ε) + 2(1 − α)/(p_2 − ε) ].    (9)

Putting the inequalities (7), (8) and (9) together yields

    lim_{ε→0} (L_α(p, D) − L_α(q, D))/ε
        > lim_{ε→0} { (D_1 − D_2) [ α/(p_1 + ε) + 2(1 − α)/(p_1 + ε) ] − Σ_{y=3}^k D_y (1 − α)/p_1 }
        = ((D_1 − D_2)/p_1)(2 − α) − ((1 − D_1 − D_2)/p_1)(1 − α)
        = (1/p_1) ( (D_1 − D_2) + (1 − α)(2 D_1 − 1) ).

Observing that, since D_1 > D_2, when D_1 > 1/2 the final term is positive without any constraint on α, and when D_1 < 1/2 the difference in risks is positive whenever

    α > 1 − (D_1 − D_2)/(1 − D_1),    (10)

since then (1 − α)(1 − 2 D_1) ≤ (1 − α)(1 − D_1) < D_1 − D_2, completes the proof.
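As a quick numerical illustration of the flip argument (a sanity check added here, not part of the original proof): flipping the two largest entries of a misaligned p strictly lowers the conditional hybrid risk, here checked for a few values of α.

    import numpy as np

    def hybrid_risk(p, D, alpha):
        # Conditional risk L_alpha(p, D) = E_{y ~ D}[ l_alpha(p, y) ].
        risk = 0.0
        for y, d in enumerate(D):
            runner_up = np.delete(p, y).max()
            hinge = max(0.0, 1.0 - np.log(p[y] / runner_up))
            risk += d * (alpha * (-np.log(p[y])) + (1 - alpha) * hinge)
        return risk

    D = np.array([0.40, 0.35, 0.25])    # non-dominant: no label exceeds 1/2
    p = np.array([0.30, 0.45, 0.25])    # misaligned: argmax p != argmax D
    q = np.array([0.45, 0.30, 0.25])    # p with its two largest entries flipped
    for alpha in (0.2, 0.5, 0.8):
        print(alpha, hybrid_risk(p, D, alpha) > hybrid_risk(q, D, alpha))   # True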
B Proof of Necessity of FCC

Proof of Theorem 2
The proof is by contradiction. We assume we have a regular function class F and a loss ℓ which is F-consistent but not FCC. That is, (6) holds for ℓ, but there exists a distribution p over Y and a g ∈ R^Y which minimises the conditional risk L_p(g) while argmax_{y∈Y} g_y ≠ argmax_{y∈Y} p_y.

By the assumed regularity of F there is an x ∈ X and an f ∈ F so that f(x) = g. We now define a distribution D over X × Y that puts all its mass on the set {x} × Y so that D(x, y) = p_y. Since this distribution is concentrated on a single x, its full risk and its conditional risk on x are the same. That is, L_D(·) = L_p(·). Thus,

    L_D(f) = L_p(f) = inf_{f'∈F} L_p(f') = inf_{f'∈F} L_D(f')

and so, by the assumption of F-consistency, since f is a minimiser of L_D it must also minimise e_D. Once again, the construction of D means that e_D(f) = e_p(g) = P_{y∼p}[ y ≠ argmax_{y'∈Y} g_{y'} ] = 1 − p_{y_g}, where y_g = argmax_y g_y is the label predicted by g. However, e_D(f) = e_p(g) = 1 − p_{y_g} > 1 − p_{y*}, since y* = argmax_y p_y ≠ argmax_y g_y = y_g.

By the second regularity property, there must also be an f̂ ∈ F such that argmax_y f̂_y(x) = y*, so that e_D(f) > inf_{f'∈F} e_D(f') = e_D(f̂) = 1 − p_{y*}. Thus, we have shown that there exists a distribution D such that f ∈ F is a minimiser of the risk L_D but is not a minimiser of the misclassification rate e_D, which contradicts the assumption of the F-consistency of ℓ. Therefore, ℓ must be FCC.

C Proof for PAC-Bayes Bounds
To be explicit, we rewrite M and p_y as M(x, y; w) and p(y|x; w) when they are parameterised by w.

Theorem 4 (Generalisation Bound)
For any data distribution D, for any prior P over w, for any δ ∈ (0, 1] and α ∈ [0, 1) and for any γ ≥ 0, for any w, with probability at least 1 − δ over random samples S from D with m instances, we have

    E_D[ (γ − M(x, y; w))_+ ] ≤ (1/m) Σ_{i=1}^m (γ − M(x_i, y_i; w))_+
        + (1/(1 − α)) [ α sqrt(1/(2m)) + sqrt( ( ln(1/P(w)) + ln A(α, w) + ln(1/(δ(1 − e^{−1}))) ) / (2m) ) ],

where

    R(α, w) = α E_D[ −ln p(y|x; w) ] + (1 − α) E_D[ (γ − M(x, y; w))_+ ],
    R_S(α, w) = α (1/m) Σ_{i=1}^m −ln p(y_i|x_i; w) + (1 − α) (1/m) Σ_{i=1}^m (γ − M(x_i, y_i; w))_+,
    A(α, w) = E_{S∼D^m} e^{m(R(α, w) − R_S(α, w))}.

Here A is upper bounded independently of D. For example, for a zero-one loss it is upper bounded by m + 1 (see [?]). The theorem gives a bound on the true margin error of the hybrid model. The theorem follows immediately from Theorem 6 below.

Lemma 5 (PAC-Bayes bound [?, ?])
For any data distribution D, for any prior P and posterior Q over w, for any δ ∈ (0, 1] and for any loss ℓ, with probability at least 1 − δ over a random sample S from D with m instances, we have

    R(Q, ℓ) ≤ R_S(Q, ℓ) + sqrt( ( KL(Q‖P) + ln( (1/δ) E_{S∼D^m} E_{w∼P} e^{m(R(Q,ℓ) − R_S(Q,ℓ))} ) ) / (2m) ),

where KL(Q‖P) := E_{w∼Q} ln(Q(w)/P(w)) is the Kullback–Leibler divergence between Q and P, R(Q, ℓ) = E_{Q,D}[ℓ(x, y; w)], and R_S(Q, ℓ) = E_Q[ (1/m) Σ_{i=1}^m ℓ(x_i, y_i; w) ].

Theorem 6 (Bound on Averaging Classifier)

For any data distribution D, for any prior P and posterior Q over w, for any δ ∈ (0, 1] and α ∈ [0, 1) and for any γ ≥ 0, with probability at least 1 − δ over a random sample S from D with m instances, we have

    E_{Q,D}[ (γ − M(x, y; w))_+ ] ≤ (1/m) E_Q[ Σ_{i=1}^m (γ − M(x_i, y_i; w))_+ ]
        + (α/(1 − α)) sqrt(1/(2m)) + (1/(1 − α)) sqrt( ( KL(Q‖P) + ln A(α) + ln(1/(δ(1 − e^{−1}))) ) / (2m) ),

where KL(Q‖P) := E_{w∼Q} ln(Q(w)/P(w)) is the Kullback–Leibler divergence between Q and P, and

    R(α) = α E_{Q,D}[ −ln p(y|x; w) ] + (1 − α) E_{Q,D}[ (γ − M(x, y; w))_+ ],
    R_S(α) = E_Q[ α (1/m) Σ_{i=1}^m −ln p(y_i|x_i; w) + (1 − α) (1/m) Σ_{i=1}^m (γ − M(x_i, y_i; w))_+ ],
    A(α) = E_{S∼D^m} E_{w∼P} e^{m(R(α) − R_S(α))}.
Proof. Since E_D( E_Q[ (1/m) Σ_{i=1}^m −ln p(y_i|x_i; w) ] ) = E_{Q,D}[ −ln p(y|x; w) ], the Chernoff bound gives

    P_{S∼D^m}( E_Q[ (1/m) Σ_{i=1}^m −ln p(y_i|x_i; w) ] − E_{Q,D}[ −ln p(y|x; w) ] < ε ) > 1 − e^{−2mε²}.

Define B(S) := E_Q[ (1/m) Σ_{i=1}^m −ln p(y_i|x_i; w) ] − E_{Q,D}[ −ln p(y|x; w) ], write Φ := sqrt( (KL(Q‖P) + ln(1/δ) + ln A(α)) / (2m) ), and let E denote the event

    (1 − α) E_{Q,D}[ (γ − M(x, y; w))_+ ] ≥ (1 − α) E_Q[ (1/m) Σ_{i=1}^m (γ − M(x_i, y_i; w))_+ ] + αε + Φ.

Applying Lemma 5 to R(α) and R_S(α), we have for any P, Q

    δ > P_{S∼D^m}( R(α) ≥ R_S(α) + Φ )
      ≥ P_{S∼D^m}( R(α) ≥ R_S(α) + Φ, B(S) < ε )
      ≥ P_{S∼D^m}( E, B(S) < ε )
      = P_{S∼D^m}( E | B(S) < ε ) · P_{S∼D^m}( B(S) < ε )
      ≥ P_{S∼D^m}( E ) · P_{S∼D^m}( B(S) < ε ).

Dividing both sides by P_{S∼D^m}( B(S) < ε ), we get

    P_{S∼D^m}( E ) ≤ δ / P_{S∼D^m}( B(S) < ε ) ≤ δ / (1 − e^{−2mε²}).

Let ε = sqrt(1/(2m)), and then let δ' = δ/(1 − e^{−2mε²}) = δ/(1 − e^{−1}), so that δ = δ'(1 − e^{−1}). The theorem follows by substituting δ with δ' and dividing by (1 − α) on both sides of the inequality inside the probability.