Introducing the Hidden Neural Markov Chain framework
Elie Azeraf, Emmanuel Monfrini, Emmanuel Vignon, Wojciech Pieczynski
Elie Azeraf* (Watson Department, IBM GSB France; [email protected])
Emmanuel Monfrini (SAMOVAR, Telecom SudParis, Institut Polytechnique de Paris)
Emmanuel Vignon (Watson Department, IBM GSB France)
Wojciech Pieczynski (SAMOVAR, Telecom SudParis, Institut Polytechnique de Paris)

Abstract
Nowadays, neural network models achieve state-of-the-art results in many areas, such as computer vision or speech processing. For sequential data, especially for Natural Language Processing (NLP) tasks, Recurrent Neural Networks (RNNs) and their extensions, the Long Short-Term Memory (LSTM) network and the Gated Recurrent Unit (GRU), are among the most used models, having a "term-to-term" sequence processing. However, while many works create extensions and improvements of the RNN, few have focused on developing other ways to process sequential data with neural networks in a "term-to-term" way. This paper proposes the original Hidden Neural Markov Chain (HNMC) framework, a new family of sequential neural models. They are not based on the RNN but on the Hidden Markov Model (HMM), a probabilistic graphical model. This neural extension is possible thanks to the recent Entropic Forward-Backward algorithm for HMM restoration. We propose three different models: the classic HNMC, the HNMC2, and the HNMC-CN. After describing the construction of our models, we compare them with classic RNN and Bidirectional RNN (BiRNN) models on several sequence labeling tasks: Chunking, Part-Of-Speech tagging, and Named Entity Recognition. For every experiment, whatever the architecture or the embedding method used, one of our proposed models achieves the best results. This shows the potential of the new neural sequential framework, which can open the way to new models and might eventually compete with the prevalent BiLSTM and BiGRU.

Keywords: Hidden Markov Model · Entropic Forward-Backward · Recurrent Neural Network · Sequence Labeling · Hidden Neural Markov Chain
During the last years, neural network models [1, 2] have shown impressive performance in many areas, such as computer vision or speech processing. Among them, Natural Language Processing (NLP) has seen one of the most significant expansions. Recurrent Neural Network (RNN) [3–5] based models, treating text as sequential data, are among the most often used models for NLP tasks, especially the Long Short-Term Memory network (LSTM) [6] and the Gated Recurrent Unit (GRU) [7]. They can cover all textual applications, such as word embedding [8] or text translation [9]. They are the most prevalent sequential models with neural networks, having a term-to-term data processing.

However, while many works have been devoted to creating extensions of the RNN, very few of them have focused on a different way to use neural networks to process sequential data in a term-to-term manner. There are Transformer [10] based models, such as BERT [11] or XLNet [12], but they have a different structure, as they catch all the observations of the sequence at one time (under padding limitations) and require many more parameters and training power. In this paper, we only focus on neural models with term-to-term processing.

Among the sequential models, one of the most popular is the Hidden Markov Model (HMM) [13–15], also called Hidden Markov Chain, which is a probabilistic graphical model [16]. In this paper, we propose a new framework of sequential neural models based on the HMM, named Hidden Neural Markov Chains (HNMCs), composed of the classic HNMC, the HNMC of order 2 (HNMC2), and the HNMC with complexified noise (HNMC-CN). As RNNs, they are neural term-to-term models for sequential data processing. Their interest is due to a new way of computing the HMM's posterior marginal distribution, based on the Entropic Forward-Backward (EFB) algorithm, which allows considering arbitrary features [17] with the HMM. We adapt EFB to the HMM of order 2 (HMM2) and to the HMM with complexified noise (HMM-CN), presented in the next section. Therefore, we present the HNMC as the HMM neural extension, the HNMC2 as the HMM2 one, and the HNMC-CN as the HMM-CN one.

The paper is organized as follows. The next section presents the HMM model, its EFB algorithm, the HMM2, the HMM-CN, and their EFB algorithms. Then we introduce the HNMC, the HNMC2, and the HNMC-CN models. We specify the computational graph and the related training process of the HNMC. We also describe the differences between our proposed models and some previous ones combining HMMs and neural networks. The fourth part is devoted to experiments. We compare our models with the RNN and the Bidirectional RNN (BiRNN) [18] on different sequence labeling tasks: Part-Of-Speech (POS) tagging, Chunking, and Named Entity Recognition (NER). We implement several architectures with various embedding methods to reach a convincing empirical comparison. We only compare with the RNN and BiRNN, as the latter's extensions to catch longer memory information, leading to the LSTM and the GRU, are discussed as perspectives for HNMC based models in the last section.

* Elie Azeraf is also a member of SAMOVAR, Telecom SudParis, Institut Polytechnique de Paris.

Figure 1: Probabilistic oriented graph of the HMM
The Hidden Markov Model is a sequential model created sixty years ago and used in numerous applications [19–21]. It allows the restoration of a hidden sequence from an observed one.

Let $x_1^T = (x_1, \ldots, x_T)$ be a hidden realization of a stochastic process, taking its values in $\Lambda_X = \{\lambda_1, \ldots, \lambda_N\}$, and let $y_1^T = (y_1, \ldots, y_T)$ be an observed realization of a stochastic one, taking its values in $\Omega_Y = \{\omega_1, \ldots, \omega_M\}$. The couple $(x_1^T, y_1^T)$ is an HMM if its probabilistic law can be written:
$$p(x_1^T, y_1^T) = p(x_1)\, p(y_1 \mid x_1)\, p(x_2 \mid x_1)\, p(y_2 \mid x_2) \cdots p(x_T \mid x_{T-1})\, p(y_T \mid x_T).$$
The probabilistic oriented graph of the HMM is given in figure 1.
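For intuition, the following minimal Python sketch evaluates this factorization on a toy HMM; the parameter values are arbitrary illustrations, not taken from the paper.

```python
import numpy as np

# Toy HMM with N = 2 hidden states and M = 3 observation symbols.
pi0 = np.array([0.6, 0.4])                   # p(x_1 = lambda_i)
A = np.array([[0.7, 0.3],                    # A[i, j] = p(x_{t+1} = lambda_j | x_t = lambda_i)
              [0.2, 0.8]])
B = np.array([[0.5, 0.4, 0.1],               # B[i, y] = p(y_t = y | x_t = lambda_i)
              [0.1, 0.3, 0.6]])

def hmm_joint(x, y):
    """p(x_1^T, y_1^T) following the HMM factorization above."""
    p = pi0[x[0]] * B[x[0], y[0]]
    for t in range(1, len(x)):
        p *= A[x[t - 1], x[t]] * B[x[t], y[t]]
    return p

print(hmm_joint(x=[0, 1, 1], y=[2, 0, 1]))   # joint probability of one (hidden, observed) pair
```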
There are different ways to restore a hidden chain from an observed one using the HMM. With the Maximum A Posteriori (MAP) criterion, one can use the classic Viterbi algorithm [22]. With the Maximum Posterior Mode (MPM) criterion, one can use the classic Forward-Backward (FB) algorithm [19]. However, both the Viterbi and FB algorithms use the probabilities $p(y_t \mid x_t)$, making it impossible to consider arbitrary features of the observations [23, 24], in particular the output of a neural network function. To correct this drawback, the Entropic Forward-Backward (EFB) algorithm specified below computes the MPM using $p(x_t \mid y_t)$ and can take into account any features [17]. This makes possible the neural extension of the HMM we are going to present.

For the stationary HMM we consider in the whole paper, the EFB deals with the following parameters:

- $\pi(i) = p(x_t = \lambda_i)$;
- $a_i(j) = p(x_{t+1} = \lambda_j \mid x_t = \lambda_i)$;
- $L_y(i) = p(x_t = \lambda_i \mid y_t = y)$.

The MPM restoration method we consider consists of the maximization of the probabilities $p(x_t = \lambda_i \mid y_1^T)$. They are given from the entropic forward functions $\alpha$ and the entropic backward functions $\beta$ with:
$$p(x_t = \lambda_i \mid y_1^T) = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_{j=1}^N \alpha_t(j)\, \beta_t(j)} \quad (1)$$
The entropic forward functions are computed recursively as follows:

- For $t = 1$: $\alpha_1(i) = L_{y_1}(i)$;
- For $1 \leq t < T$:
$$\alpha_{t+1}(i) = \frac{L_{y_{t+1}}(i)}{\pi(i)} \sum_{j=1}^N \alpha_t(j)\, a_j(i) \quad (2)$$

And the entropic backward ones:

- For $t = T$: $\beta_T(i) = 1$;
- For $1 \leq t < T$:
$$\beta_t(i) = \sum_{j=1}^N \frac{L_{y_{t+1}}(j)}{\pi(j)}\, \beta_{t+1}(j)\, a_i(j) \quad (3)$$

One can normalize the values at each time step in (2) and (3) to avoid underflow problems without modifying the computation of the probabilities.

Figure 2: Probabilistic oriented graph of the HMM of order 2

In this paragraph, we describe an extension of the EFB above to the HMM2, which allows catching longer memory information than the HMM. The probabilistic law of $(x_1^T, y_1^T)$ for the HMM2 is:
$$p(x_1^T, y_1^T) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2) \cdots p(x_T \mid x_{T-2}, x_{T-1})\, p(y_1 \mid x_1)\, p(y_2 \mid x_2) \cdots p(y_T \mid x_T).$$
Its probabilistic graph is given in figure 2. We introduce the following notation to present the EFB algorithm for the HMM2:
$$a_{i,j}(k) = p(x_{t+2} = \lambda_k \mid x_t = \lambda_i, x_{t+1} = \lambda_j)$$
The EFB algorithm for the HMM2 is the following:

- For $t = 1$: $p(x_1 = \lambda_i \mid y_1^T) = \frac{\sum_j \alpha_2(i, j)\, \beta_2(i, j)}{\sum_k \sum_j \alpha_2(k, j)\, \beta_2(k, j)}$;
- For $2 \leq t \leq T$: $p(x_t = \lambda_i \mid y_1^T) = \frac{\sum_j \alpha_t(j, i)\, \beta_t(j, i)}{\sum_k \sum_j \alpha_t(j, k)\, \beta_t(j, k)}$.

The entropic forward-2 functions $\alpha$ are computed with the following recursion:

- For $t = 2$: $\alpha_2(j, i) = L_{y_1}(j)\, a_j(i)\, \frac{L_{y_2}(i)}{\pi(i)}$;
- For $2 \leq t < T$: $\alpha_{t+1}(j, i) = \sum_k \alpha_t(k, j)\, a_{k,j}(i)\, \frac{L_{y_{t+1}}(i)}{\pi(i)}$.

And the backward-2 functions $\beta$ with the following one:

- For $t = T$: $\beta_T(j, i) = 1$;
- For $2 \leq t < T$: $\beta_t(j, i) = \sum_k \beta_{t+1}(i, k)\, a_{j,i}(k)\, \frac{L_{y_{t+1}}(k)}{\pi(k)}$.

Figure 3: Probabilistic oriented graph of the HMM-CN

This paragraph describes the new HMM-CN model with its related new EFB. It is another extension of the HMM aiming to improve its results. Its probabilistic oriented graph is presented in figure 3. In this case, the hidden sequence is still a Markov chain, and the conditional law of the observation $y_t$ given $x_1^T$ depends on $x_{t-1}$, $x_t$, and $x_{t+1}$, implying a stronger dependency with the hidden chain.
The HMM-CN has the probabilistic law:
$$p(x_1^T, y_1^T) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_2) \cdots p(x_T \mid x_{T-1})\, p(y_1 \mid x_1, x_2)\, p(y_2 \mid x_1, x_2, x_3) \cdots p(y_T \mid x_{T-1}, x_T).$$
To present the EFB algorithm for the HMM-CN, we set:

- $I_{j,y}(i) = p(x_{t+1} = \lambda_i \mid x_t = \lambda_j, y_t = y)$;
- $J_{j,y}(i) = p(x_t = \lambda_i \mid x_{t+1} = \lambda_j, y_{t+1} = y)$.

The goal of the EFB algorithm is to compute $p(x_t = \lambda_i \mid y_1^T)$. Using $I_{j,y}(i)$ and $J_{j,y}(i)$ above, we show:
$$p(x_t = \lambda_i \mid y_1^T) = \frac{\alpha^{CN}_t(i)\, \beta^{CN}_t(i)}{\sum_j \alpha^{CN}_t(j)\, \beta^{CN}_t(j)},$$
with the entropic forward-cn functions $\alpha^{CN}$ computed with the following recursion:

- For $t = 1$: $\alpha^{CN}_1(i) = L_{y_1}(i)$;
- For $1 \leq t < T$:
$$\alpha^{CN}_{t+1}(i) = \sum_j \alpha^{CN}_t(j)\, I_{j,y_t}(i)\, \frac{L_{y_{t+1}}(i)\, J_{i,y_{t+1}}(j)}{\pi(j)\, a_j(i)} \quad (4)$$

And the entropic backward-cn functions $\beta^{CN}$ computed with the following one:

- For $t = T$: $\beta^{CN}_T(i) = 1$;
- For $1 \leq t < T$:
$$\beta^{CN}_t(i) = \sum_j \beta^{CN}_{t+1}(j)\, I_{i,y_t}(j)\, \frac{L_{y_{t+1}}(j)\, J_{j,y_{t+1}}(i)}{\pi(i)\, a_i(j)} \quad (5)$$

Proofs of the EFB algorithms for the HMM2 and the HMM-CN are given in the appendices.

To extend the HMM considered above to the HNMC, we have to model the three functions $\pi$, $a$, and $L$ with a feedforward neural network function modeling $\frac{L_{y_{t+1}}(i)}{\pi(i)}\, a_j(i)$. This neural network function has $y_{t+1}$ concatenated with the one-hot encoding of $j$ as input, and outputs a positive vector of size $N$. To do that, we use a positive last activation function, such as the exponential, the sigmoid, or a modified Exponential Linear Unit (mELU):
$$f(x) = \begin{cases} x + 1 & \text{if } x > 0 \\ e^x & \text{otherwise.} \end{cases}$$
Then, we apply the EFB algorithm for sequence restoration. The first step of the algorithm is performed thanks to the introduction of an initial state, which can be drawn randomly or set equal to a constant different from 0. Therefore, we have constructed the HNMC, a new model able to process sequential data in a "term-to-term" way with neural network functions.

We can stack HNMCs to add hidden layers, similarly to the stacked RNN practice, to achieve greater model complexity: the output of a first HNMC based EFB restoration layer becomes the input of the next one, and so on, applying the EFB layer after layer. For example, the computational graph of an HNMC composed of four layers is specified in figure 4. In the general case, we have $K + 2$ layers:

- an input layer $y$;
- $K$ hidden layers $h^{(1)}, h^{(2)}, \ldots, h^{(K)}$;
- an output layer $x$.

We consider that $(H^{(1)}, Y), (H^{(2)}, H^{(1)}), \ldots, (H^{(K)}, H^{(K-1)})$ are HMMs, and the last layer $H^{(K)}$ is connected with the output layer $x$ thanks to a feedforward neural network function denoted $f$. Finally, we compute, for each $t \in \{1, \ldots, T\}$, $x_t$ from $y_1^T$ as follows:

1. compute $h^{(1)}$ from $y_1^T$ using EFB;
2. compute $h^{(2)}$ from $h^{(1)}$ using EFB, considering $h^{(1)}$ as the observations; then compute $h^{(3)}$ from $h^{(2)}$ using EFB, and so on;
3. compute $h^{(K)}$ from $h^{(K-1)}$ using EFB, considering $h^{(K-1)}$ as the observations;
4. compute $x_t = f(h^{(K)}_t)$; $x_t$ is the output vector of probabilities of the different states at time $t$.

Figure 4: Computational graph of the HNMC with two hidden layers

Let us notice that, from a probabilistic point of view, this stacked HNMC can be seen as a particular Triplet Markov Chain [25] having $K + 2$ layers, and our restoration method would be an approximation of this model. Thus, the HNMC can be used as a sequential neural model with term-to-term processing, like the RNN.
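To make the construction concrete, here is a minimal PyTorch sketch of a single HNMC layer under the parameterization described above. The module name, the single linear map, the assumed form of the mELU, and the learnable initial vector replacing the first forward step are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class HNMCLayer(nn.Module):
    """Sketch of one HNMC layer (illustrative, not the authors' code).

    A feedforward map phi takes y_t concatenated with the one-hot encoding of a
    state j and returns a positive vector of size N, playing the role of
    L_{y_t}(i) a_j(i) / pi(i) in the entropic forward-backward recursions (2)-(3).
    """

    def __init__(self, obs_dim, n_states):
        super().__init__()
        self.n_states = n_states
        # "No hidden layer" setting: a single linear map.
        self.phi = nn.Linear(obs_dim + n_states, n_states)
        # One possible reading of the "initial state" used for the first forward step.
        self.alpha0 = nn.Parameter(torch.ones(n_states))

    @staticmethod
    def melu(x):
        # Modified ELU: a positive activation (assumed form: x + 1 if x > 0, e^x otherwise).
        return torch.where(x > 0, x + 1.0, torch.exp(x))

    def positive_matrix(self, y_t):
        # Evaluate phi for every one-hot j at once -> positive (N, N) matrix,
        # whose row j approximates [L_{y_t}(i) a_j(i) / pi(i)]_{i=1..N}.
        eye = torch.eye(self.n_states)
        inp = torch.cat([y_t.expand(self.n_states, -1), eye], dim=1)
        return self.melu(self.phi(inp))

    def forward(self, y):                        # y: (T, obs_dim)
        T = y.shape[0]
        mats = [self.positive_matrix(y[t]) for t in range(T)]
        # Entropic forward pass, normalized at each step to avoid underflow.
        alphas, a = [], self.alpha0
        for t in range(T):
            a = a @ mats[t]                      # alpha_{t+1}(i) = sum_j alpha_t(j) * mats[t][j, i]
            a = a / a.sum()
            alphas.append(a)
        # Entropic backward pass.
        betas = [torch.ones(self.n_states)] * T
        for t in range(T - 2, -1, -1):
            b = mats[t + 1] @ betas[t + 1]       # beta_t(i) = sum_j mats[t+1][i, j] * beta_{t+1}(j)
            betas[t] = b / b.sum()
        post = torch.stack([alphas[t] * betas[t] for t in range(T)])
        return post / post.sum(dim=1, keepdim=True)   # (T, N) posterior marginals
```

Stacking layers as in figure 4 then amounts to feeding the $(T, N)$ output of one such layer as the observations of the next one.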
Unlike the RNN, however, the HNMC uses all the observations $y_1^T$ to restore $x_t$, whereas the RNN uses only $y_1^t$. One can use the BiRNN to correct this drawback: it consists of applying a RNN from right to left, another one from left to right, and then concatenating the outputs.

Neural extensions of the HMM2 and the HMM-CN follow the same principles as for the HMM. For the HMM2, we model $a_{k,j}(i)\, \frac{L_{y_{t+1}}(i)}{\pi(i)}$ with a feedforward neural function with a positive last activation function, taking as input $y_{t+1}$ and the one-hot encoding of $(k, j)$. This model is denoted HNMC2. Concerning the HMM-CN, we use two different neural functions: one to model $\frac{J_{i,y_{t+1}}(j)}{\pi(j)}$, and the other one to model $\frac{I_{j,y_t}(i)\, L_{y_{t+1}}(i)}{a_j(i)}$, with the relevant inputs and positive outputs. This model is denoted HNMC-CN.

To learn the different parameters of each of our new models, we consider the backpropagation algorithm [26, 27] frequently used for neural network learning. Given a loss, for example the cross-entropy $L_{CE}$, a parameter $\theta$ of one of the model's functions, and a sequence $y_1^T$, we compute $\frac{\partial L_{CE}}{\partial \theta}$ with gradient backpropagation over all the intermediary variables. Then, we apply the gradient descent algorithm [28]:
$$\theta^{(new)} = \theta - \kappa \frac{\partial L_{CE}}{\partial \theta},$$
with $\kappa$ the learning rate. As for any neural network architecture, we can apply the gradient descent algorithm to HNMC based models. Therefore, we can create different architectures and combine them with other neural network models such as Convolutional Neural Networks [29] or feedforward ones.

The combination of HMMs with neural networks started in the 1990s [30], focusing on the concatenation of the two models. Nowadays, a few papers deal with the subject. The closest model to the HNMC is the neural HMM proposed in [31]. However, the proposed method is not EFB based, and neural networks model different parameters from those considered in this paper. Indeed, they model $p(y_t \mid x_t = \lambda_i)$. This implies a sum over all the possible observations to be computed, which considerably increases the number of parameters for NLP applications, where observations are words. It also prevents the combination with embedding methods, which aim to convert a word into a continuous vector. Moreover, the proposed training method is based on the Baum-Welch algorithm with Expectation-Maximization [32], or Direct Marginal Likelihood [33], so the ability to create various architectures as it is done with RNNs is not trivial. It also focuses on unsupervised tasks, which is not the case for the HNMC. Comparable works can be found in [34, 35]. Therefore, the proposed HNMC, based on different neuralized parameters with gradient descent training and aiming at a different objective, is an original way to combine HMMs with neural networks.
Table 1: Results of the different models for POS Tagging, Chunking, and NER, for Architecture 1 (the model only). Rows: POS Ext UD, Ch GloVe 00, NER FT 03; columns: RNN, BiRNN, HNMC, HNMC2, HNMC-CN; each entry is the mean score with its 95%-confidence interval over five runs.
Table 2: Results of the different models for POS Tagging, Chunking, and NER, for Architecture 2 (the model followed by a feedforward neural function). Rows: POS Ext UD, Ch GloVe 00, NER FT 03; columns: RNN, BiRNN, HNMC, HNMC2, HNMC-CN; each entry is the mean score with its 95%-confidence interval over five runs. The hidden size (HS) is 50 for POS tagging, 32 for Chunking, and 20 for NER.
Table 3: Results of the different models for POS Tagging, Chunking, and NER, for Architecture 3 (two models stacked). Rows: POS Ext UD, Ch GloVe 00, NER FT 03; columns: RNN, BiRNN, HNMC, HNMC2, HNMC-CN; each entry is the mean score with its 95%-confidence interval over five runs. The hidden size (HS) is 50 for POS tagging, 32 for Chunking, and 20 for NER.
This section presents some experimental results comparing the RNN, the BiRNN, the HNMC, the HNMC2, and the HNMC-CN. After a preliminary presentation of the different tasks and of the word embedding process, we create different architectures for all the models and test them on sequence labeling applications. The motivation for comparing our models with the RNN and the BiRNN is discussed in the perspectives.
We select sequence labeling applications as they are the most intuitive tasks on which to apply a sequential model in the NLP framework. The goal is to label every word in a sentence with a specific tag. We apply the different models to POS Tagging, Chunking, and NER, which are among the most popular sequence labeling applications.

POS tagging consists of labeling every word with its grammatical function, such as noun (NOUN), verb (VERB), determinant (DET), etc. For example, the sentence (Batman, is, the, vigilante, of, Gotham, .) has the labels (NOUN, VERB, DET, NOUN, PREP, NOUN, PUNCT). The accuracy score is used to evaluate this task.

Chunking consists of segmenting a sentence with a more global point of view than POS tagging. It decomposes the sentence into groups of words linked by a syntactic function, such as a noun phrase (NP), a verb phrase (VP), or an adjective phrase (ADJP), among others. For example, the sentence (The, worst, enemy, of, Batman, is, the, Joker, .) has the chunk tags (NP, NP, NP, PP, NP, VP, NP, NP, O), where O denotes a word having no chunk tag. The F1 score is used to measure the performance on this task.

The objective of NER is to find the different entities in a sentence. Entities can be the name of a person (PER), of a city (LOC), or of a company (ORG). For example, the sentence (Bruce, Wayne, ",", a, citizen, of, Gotham, ",", is, the, secret, identity, of, Batman, .) can have the entities (PER, PER, O, O, O, O, LOC, O, O, O, O, O, O, PER, O). The entity set depends on the use-case, and one can change it according to the objective. As for Chunking, the F1 score is used to evaluate the performance of a model.

For our experiments, we use three reference datasets: Universal Dependencies English (UD En) [36] for POS Tagging, CoNLL 2000 [37] for Chunking, and, with general entities, CoNLL 2003 [38] for NER. All these datasets are freely available; UD En can be found at https://universaldependencies.org/.

A sentence is composed of textual data, and this type of data cannot be the input of feedforward neural network functions, which take as input a numerical vector or scalar. The first step of our experiments therefore consists of a pre-processing task converting every word into a numerical vector, called word embedding or word encoding. In order to make our conclusions independent from the embedding, we use three different embedding methods: GloVe [40], FastText [41], and EXT encoding [42].
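As an illustration of this pre-processing step, here is a small sketch using Flair's classic word-embedding interface (assuming the pre-trained GloVe vectors can be downloaded by the library); the sentence is only an example.

```python
from flair.data import Sentence
from flair.embeddings import WordEmbeddings

# Non-contextual GloVe embeddings loaded through Flair.
glove = WordEmbeddings("glove")

sentence = Sentence("Batman is the vigilante of Gotham .")
glove.embed(sentence)

for token in sentence:
    # Each word is now mapped to a fixed-size numerical vector (a torch tensor).
    print(token.text, token.embedding.shape)
```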
To compare the different models on the different sequence labeling tasks, we implement three architectures for each model:

- Architecture 1: the model only;
- Architecture 2: the model followed by a feedforward neural network function, equivalent to figure 4 with the layers $(y, h^{(1)}, x)$ for the HNMC;
- Architecture 3: two models stacked, equivalent to figure 4 with the layers $(y, h^{(1)}, h^{(2)})$ for the HNMC.

Every model is programmed in Python using the PyTorch library [43] for automatic differentiation and the Flair library [44] for word encoding. The loss function is the cross-entropy. All the different parameters are modeled with feedforward neural networks without hidden layers, equivalent to a logistic regression. Regarding the activation functions, the HNMC based models always use mELU; the RNN and BiRNN are used as usual, with hyperbolic tangent functions. Every model uses the softmax function at the end of the architecture to output probabilities. We use the Adam optimizer [45] for all experiments, with mini-batch training. For Architecture 1, a single learning rate is used; for the other architectures, we use different learning rates for the different layers. This configuration gives the best experimental results for every model.

For each architecture, we realize three experiments: POS Tagging on UD En using EXT (POS Ext UD), Chunking on CoNLL 2000 using GloVe (Ch GloVe 00), and NER on CoNLL 2003 using FastText (NER FT 03). Each experiment is done five times; we report the mean and the 95%-confidence interval in Table 1, Table 2, and Table 3, with the different sizes of hidden layers, denoted HS.

First of all, we can notice that the HNMC is always better than the RNN. This is certainly because the HNMC uses all the observations to restore any hidden variable, making it a bidirectional alternative to the RNN without increasing the number of parameters, which are roughly equivalent. As expected, the HNMC2 achieves better results than the HNMC, and therefore than the RNN. However, the HNMC2 does not reach the BiRNN scores, except in some cases, especially for Chunking. Another interesting comparison concerns the HNMC-CN and the BiRNN. Indeed, the HNMC-CN achieves better results than the BiRNN for every experiment. It is a promising result, as prevalent models such as the BiLSTM and the BiGRU are based on the BiRNN. Therefore, the HNMC-CN can be an alternative to the BiRNN for sequence labeling applications. These different results, comparing HNMC based models with the RNN and the BiRNN, show the proposed sequential neural framework's potential.
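For concreteness, the training setup described above can be sketched as follows for Architecture 2; this is a hedged illustration reusing the HNMCLayer sketch given earlier, and the dimensions, tag count, and optimizer settings are placeholders rather than the paper's configuration.

```python
import torch
import torch.nn as nn

# HNMCLayer refers to the illustrative module sketched earlier in the paper.
n_states, obs_dim, n_tags = 50, 300, 17      # placeholder sizes, not the paper's values
hnmc = HNMCLayer(obs_dim, n_states)
head = nn.Linear(n_states, n_tags)           # feedforward output layer of Architecture 2

criterion = nn.CrossEntropyLoss()            # applies log-softmax internally
optimizer = torch.optim.Adam(list(hnmc.parameters()) + list(head.parameters()))

def train_step(y, tags):                     # y: (T, obs_dim) embedded sentence, tags: (T,) gold labels
    optimizer.zero_grad()
    posterior = hnmc(y)                      # (T, n_states) marginals from the EFB pass
    logits = head(posterior)
    loss = criterion(logits, tags)
    loss.backward()                          # backpropagation through the EFB recursions
    optimizer.step()
    return loss.item()
```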
We have presented the HNMC framework, a new family of sequential neural models, introducing the classic HNMC, the HNMC2, and the HNMC-CN. We have compared these three models with the RNN and the BiRNN. On the one hand, the HNMC achieves better results than the RNN with an equivalent number of parameters. On the other hand, the HNMC-CN achieves better results than the BiRNN for the different sequence labeling tasks.
As a promising perspective, we can extend the HNMC-CN with long-memory methods, as the BiRNN is extended to the BiLSTM and the BiGRU. These extensions of the HNMC-CN are therefore expected to compete with the BiLSTM and the BiGRU. It is a challenging perspective, as these models are the most prevalent ones for sequential data processing.
References

[1] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
[2] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[3] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science, 1985.
[4] Michael I Jordan. Attractor dynamics and parallelism in a connectionist sequential machine. In Artificial Neural Networks: Concept Learning, pages 112–127. 1990.
[5] Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever. An empirical exploration of recurrent network architectures. In International Conference on Machine Learning, pages 2342–2350, 2015.
[6] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[7] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
[8] Alan Akbik, Duncan Blythe, and Roland Vollgraf. Contextual string embeddings for sequence labeling. In COLING 2018, 27th International Conference on Computational Linguistics, pages 1638–1649, 2018.
[9] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
[10] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[12] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, pages 5753–5763, 2019.
[13] Ruslan Leont'evich Stratonovich. Conditional Markov processes. In Non-linear Transformations of Stochastic Processes, pages 427–453. Elsevier, 1965.
[14] Leonard E Baum and Ted Petrie. Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics, 37(6):1554–1563, 1966.
[15] Lawrence Rabiner and B Juang. An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1):4–16, 1986.
[16] Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. 2009.
[17] Elie Azeraf, Emmanuel Monfrini, Emmanuel Vignon, and Wojciech Pieczynski. Hidden Markov Chains, Entropic Forward-Backward, and Part-Of-Speech tagging. arXiv preprint arXiv:2005.10629, 2020.
[18] Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681, 1997.
[19] Lawrence R Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
[20] Jia Li, Amir Najmi, and Robert M Gray. Image classification by a two-dimensional hidden Markov model. IEEE Transactions on Signal Processing, 48(2):517–533, 2000.
[21] Thorsten Brants. TnT – a statistical part-of-speech tagger. In Sixth Applied Natural Language Processing Conference, pages 224–231, Seattle, Washington, USA, April 2000. Association for Computational Linguistics.
[22] Andrew Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2):260–269, 1967.
[23] Dan Jurafsky. Speech & Language Processing. Pearson Education India, 2000.
[24] Charles Sutton and Andrew McCallum. An introduction to conditional random fields for relational learning. Introduction to Statistical Relational Learning, 2:93–128, 2006.
[25] Wojciech Pieczynski, Cédric Hulard, and Thomas Veit. Triplet Markov chains in hidden signal restoration. In Image and Signal Processing for Remote Sensing VIII, volume 4885, pages 58–68. International Society for Optics and Photonics, 2003.
[26] Yann LeCun, D Touresky, G Hinton, and T Sejnowski. A theoretical framework for back-propagation. In Proceedings of the 1988 Connectionist Models Summer School, volume 1, pages 21–28. CMU, Pittsburgh, Pa: Morgan Kaufmann, 1988.
[27] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
[28] Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.
[29] Yann LeCun, Patrick Haffner, Léon Bottou, and Yoshua Bengio. Object recognition with gradient-based learning. In Shape, Contour and Grouping in Computer Vision, pages 319–345. Springer, 1999.
[30] Yoshua Bengio, Yann LeCun, and Donnie Henderson. Globally trained handwritten word recognizer using spatial representation, convolutional neural networks, and hidden Markov models. In Advances in Neural Information Processing Systems, pages 937–944, 1994.
[31] Ke Tran, Yonatan Bisk, Ashish Vaswani, Daniel Marcu, and Kevin Knight. Unsupervised neural hidden Markov models. arXiv preprint arXiv:1609.09007, 2016.
[32] Lloyd R Welch. Hidden Markov models and the Baum-Welch algorithm. IEEE Information Theory Society Newsletter, 53(4):10–13, 2003.
[33] Ruslan Salakhutdinov, Sam T Roweis, and Zoubin Ghahramani. Optimization with EM and expectation-conjugate-gradient. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 672–679, 2003.
[34] Weiyue Wang, Tamer Alkhouli, Derui Zhu, and Hermann Ney. Hybrid neural network alignment and lexicon model in direct HMM for statistical machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 125–131, 2017.
[35] Weiyue Wang, Derui Zhu, Tamer Alkhouli, Zixuan Gan, and Hermann Ney. Neural hidden Markov model for machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 377–382, 2018.
[36] Joakim Nivre, Marie-Catherine De Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, et al. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 1659–1666, 2016.
[37] Erik F. Tjong Kim Sang and Sabine Buchholz. Introduction to the CoNLL-2000 shared task: Chunking. In Fourth Conference on Computational Natural Language Learning and the Second Learning Language in Logic Workshop, 2000.
[38] Erik F. Tjong Kim Sang and Fien De Meulder. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147, 2003.
[39] Edward Loper and Steven Bird. NLTK: The Natural Language Toolkit. arXiv preprint cs/0205028, 2002.
[40] Jeffrey Pennington, Richard Socher, and Christopher D Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
[41] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146, 2017.
[42] Alexandros Komninos and Suresh Manandhar. Dependency based embeddings for sentence classification tasks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1490–1500, San Diego, California, June 2016. Association for Computational Linguistics.
[43] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8026–8037, 2019.
[44] Alan Akbik, Tanja Bergmann, Duncan Blythe, Kashif Rasul, Stefan Schweter, and Roland Vollgraf. FLAIR: An easy-to-use framework for state-of-the-art NLP. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 54–59, 2019.
[45] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
APPENDIX
We introduce a new notation: for each $t \in \{1, \ldots, T\}$, $\lambda_i \in \Lambda_X$, $y_t \in \Omega_Y$, $b_i(y_t) = p(y_t \mid x_t = \lambda_i)$.

Proof of the EFB algorithm for HMM2
The EFB algorithm for the HMM2 aims to compute, for each $t \in \{1, \ldots, T\}$ and $\lambda_i \in \Lambda_X$, $p(x_t = \lambda_i \mid y_1^T)$. We can show, with the law of the HMM2 and figure 2, for each $t > 1$:
$$p(x_t = \lambda_i \mid y_1^T) = \frac{\sum_j \alpha'_t(j, i)\, \beta'_t(j, i)}{\sum_k \sum_j \alpha'_t(j, k)\, \beta'_t(j, k)}$$
with:
$$\alpha'_t(j, i) = p(x_{t-1} = \lambda_j, x_t = \lambda_i, y_1^t), \qquad \beta'_t(j, i) = p(y_{t+1:T} \mid x_{t-1} = \lambda_j, x_t = \lambda_i).$$
$\alpha'$ can be computed with the following recursion:

- For $t = 2$: $\alpha'_2(j, i) = \pi(j)\, b_j(y_1)\, a_j(i)\, b_i(y_2)$;
- For $2 \leq t < T$: $\alpha'_{t+1}(j, i) = \sum_k \alpha'_t(k, j)\, a_{k,j}(i)\, b_i(y_{t+1})$.

And $\beta'$ with the following one:

- For $t = T$: $\beta'_T(j, i) = 1$;
- For $2 \leq t < T$: $\beta'_t(j, i) = \sum_k \beta'_{t+1}(i, k)\, a_{j,i}(k)\, b_k(y_{t+1})$.

We can show, for each $2 \leq t \leq T$:
$$\alpha_t(j, i) = \frac{\alpha'_t(j, i)}{p(y_1)\, p(y_2) \cdots p(y_t)} \quad (6)$$
$$\beta_t(j, i) = \frac{\beta'_t(j, i)}{p(y_{t+1})\, p(y_{t+2}) \cdots p(y_T)} \quad (7)$$
For $t = 2$:
$$\alpha'_2(j, i) = p(y_1, x_1 = \lambda_j, y_2, x_2 = \lambda_i) = p(y_1)\, L_{y_1}(j)\, a_j(i)\, p(y_2)\, \frac{L_{y_2}(i)}{\pi(i)},$$
so (6) holds for $t = 2$. We suppose (6) for $t$, and we prove it for $t + 1$:
$$\alpha_{t+1}(j, i) = \sum_k \frac{\alpha'_t(k, j)}{p(y_1)\, p(y_2) \cdots p(y_t)}\, a_{k,j}(i)\, \frac{b_i(y_{t+1})}{p(y_{t+1})} = \frac{\alpha'_{t+1}(j, i)}{p(y_1)\, p(y_2) \cdots p(y_t)\, p(y_{t+1})}.$$
About (7), it is true for $t = T$. We suppose (7) true for $t + 1$, and we prove it for $t$:
$$\beta_t(j, i) = \sum_k \frac{\beta'_{t+1}(i, k)}{p(y_{t+2}) \cdots p(y_T)}\, a_{j,i}(k)\, \frac{b_k(y_{t+1})}{p(y_{t+1})} = \frac{\beta'_t(j, i)}{p(y_{t+1})\, p(y_{t+2}) \cdots p(y_T)}.$$
(6) and (7) are thus proved for each $t$. Therefore,
$$p(x_t = \lambda_i \mid y_1^T) = \frac{\sum_j \alpha'_t(j, i)\, \beta'_t(j, i)}{\sum_k \sum_j \alpha'_t(j, k)\, \beta'_t(j, k)} = \frac{\sum_j \alpha_t(j, i)\, \beta_t(j, i)}{\sum_k \sum_j \alpha_t(j, k)\, \beta_t(j, k)},$$
which ends the proof of the EFB algorithm for the HMM2.
Proof of the EFB algorithm for HMM-CN
According to figure 3, $(x_{1:t-1}, y_{1:t-1})$ and $(x_{t+1:T}, y_{t+1:T})$ are independent conditionally on $(x_t, y_t)$, and thus we have:
$$p(x_t = \lambda_i \mid y_1^T) = \frac{\alpha^{CN'}_t(i)\, \beta^{CN'}_t(i)}{\sum_j \alpha^{CN'}_t(j)\, \beta^{CN'}_t(j)}$$
with:
$$\alpha^{CN'}_t(i) = p(x_t = \lambda_i, y_1^t), \qquad \beta^{CN'}_t(i) = p(y_{t+1:T} \mid x_t = \lambda_i, y_t).$$
$\alpha^{CN'}$ can be computed with the following recursion:

- For $t = 1$: $\alpha^{CN'}_1(i) = \pi(i)\, p(y_1 \mid x_1 = \lambda_i)$;
- For $1 \leq t < T$: $\alpha^{CN'}_{t+1}(i) = \sum_j \alpha^{CN'}_t(j)\, I_{j,y_t}(i)\, p(y_{t+1} \mid x_t = \lambda_j, x_{t+1} = \lambda_i)$.

And $\beta^{CN'}$ with the following one:

- For $t = T$: $\beta^{CN'}_T(i) = 1$;
- For $1 \leq t < T$: $\beta^{CN'}_t(i) = \sum_j I_{i,y_t}(j)\, p(y_{t+1} \mid x_t = \lambda_i, x_{t+1} = \lambda_j)\, \beta^{CN'}_{t+1}(j)$.

We can show:
$$\alpha^{CN}_t(i) = \frac{\alpha^{CN'}_t(i)}{p(y_1)\, p(y_2) \cdots p(y_t)} \quad (8)$$
$$\beta^{CN}_t(i) = \frac{\beta^{CN'}_t(i)}{p(y_{t+1})\, p(y_{t+2}) \cdots p(y_T)} \quad (9)$$
(8) is true for $t = 1$. We suppose (8) true for $t$, and we prove it for $t + 1$:
$$\alpha^{CN}_{t+1}(i) = \frac{1}{p(y_1) \cdots p(y_t)} \sum_j \alpha^{CN'}_t(j)\, I_{j,y_t}(i) \times \frac{p(x_{t+1} = \lambda_i \mid y_{t+1})\, p(x_t = \lambda_j \mid x_{t+1} = \lambda_i, y_{t+1})}{p(x_t = \lambda_j, x_{t+1} = \lambda_i)}$$
$$= \frac{1}{p(y_1) \cdots p(y_t)\, p(y_{t+1})} \sum_j \alpha^{CN'}_t(j)\, I_{j,y_t}(i) \times \frac{p(x_t = \lambda_j, x_{t+1} = \lambda_i, y_{t+1})}{p(x_t = \lambda_j, x_{t+1} = \lambda_i)} = \frac{\alpha^{CN'}_{t+1}(i)}{p(y_1) \cdots p(y_t)\, p(y_{t+1})}.$$
Therefore, (8) is proved for all $t$. The proof of (9) follows the same reasoning. (9) is true for $t = T$. We suppose (9) true for $t + 1$, and we prove it at $t$:
$$\beta^{CN}_t(i) = \frac{1}{p(y_{t+2}) \cdots p(y_T)} \sum_j \beta^{CN'}_{t+1}(j)\, I_{i,y_t}(j) \times \frac{p(x_{t+1} = \lambda_j \mid y_{t+1})\, p(x_t = \lambda_i \mid x_{t+1} = \lambda_j, y_{t+1})}{p(x_t = \lambda_i, x_{t+1} = \lambda_j)}$$
$$= \frac{1}{p(y_{t+1})\, p(y_{t+2}) \cdots p(y_T)} \sum_j \beta^{CN'}_{t+1}(j)\, I_{i,y_t}(j) \times \frac{p(x_t = \lambda_i, x_{t+1} = \lambda_j, y_{t+1})}{p(x_t = \lambda_i, x_{t+1} = \lambda_j)} = \frac{\beta^{CN'}_t(i)}{p(y_{t+1})\, p(y_{t+2}) \cdots p(y_T)},$$
which proves (9) for all $t$. Therefore,
$$p(x_t = \lambda_i \mid y_1^T) = \frac{\alpha^{CN}_t(i)\, \beta^{CN}_t(i)}{\sum_j \alpha^{CN}_t(j)\, \beta^{CN}_t(j)}.$$
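The same normalization argument underlies the classic EFB recursions (1)-(3). As a closing illustration, the following toy numerical check (a sketch with arbitrary random parameters, not taken from the paper) verifies that they recover the brute-force posterior marginals of a stationary HMM.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
N, M, T = 3, 4, 5                                        # states, observation symbols, sequence length

# Random stationary HMM: draw a transition matrix and take its stationary law as pi.
A = rng.random((N, N))
A /= A.sum(axis=1, keepdims=True)                        # A[i, j] = a_i(j)
evals, evecs = np.linalg.eig(A.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()
B = rng.random((N, M))
B /= B.sum(axis=1, keepdims=True)                        # B[i, y] = p(y_t = y | x_t = i)
L = (pi[:, None] * B) / (pi[:, None] * B).sum(axis=0)    # L[i, y] = p(x_t = i | y_t = y)

y = rng.integers(0, M, size=T)                           # an arbitrary observed sequence

# Entropic Forward-Backward, equations (1)-(3).
alpha = np.zeros((T, N))
beta = np.ones((T, N))
alpha[0] = L[:, y[0]]
for t in range(T - 1):
    alpha[t + 1] = L[:, y[t + 1]] / pi * (alpha[t] @ A)
for t in range(T - 2, -1, -1):
    beta[t] = A @ (L[:, y[t + 1]] / pi * beta[t + 1])
efb = alpha * beta
efb /= efb.sum(axis=1, keepdims=True)

# Brute-force posterior marginals from the joint law of the HMM.
brute = np.zeros((T, N))
for x in itertools.product(range(N), repeat=T):
    p = pi[x[0]] * B[x[0], y[0]]
    for t in range(1, T):
        p *= A[x[t - 1], x[t]] * B[x[t], y[t]]
    for t in range(T):
        brute[t, x[t]] += p
brute /= brute.sum(axis=1, keepdims=True)

print(np.allclose(efb, brute))                           # expected: True
```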