Highly Fast Text Segmentation With Pairwise Markov Chains
© 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Elie Azeraf*
Watson Department, IBM GSB France
[email protected]

Emmanuel Monfrini
SAMOVAR, Telecom SudParis, Institut Polytechnique de Paris

Emmanuel Vignon
Watson Department, IBM GSB France

Wojciech Pieczynski
SAMOVAR, Telecom SudParis, Institut Polytechnique de Paris

* Elie Azeraf is also a member of SAMOVAR, Telecom SudParis, Institut Polytechnique de Paris.

Abstract
The current trend in Natural Language Processing (NLP) is to use ever more extra data to build the best possible models. This implies higher computational costs and training times, difficulties for deployment, and concerns about the carbon footprint of these models, which point to a critical problem for the future. Against this trend, our goal is to develop NLP models requiring no extra data and minimizing training time. To do so, in this paper we explore Markov chain models, the Hidden Markov Chain (HMC) and the Pairwise Markov Chain (PMC), for NLP segmentation tasks. We apply these models to three classic applications: POS tagging, Named Entity Recognition, and Chunking. We develop an original method to adapt these models to the specific challenges of text segmentation, obtaining relevant performance with very short training and execution times. The PMC achieves results equivalent to those of Conditional Random Fields (CRF), one of the most widely applied models for these tasks when no extra data are used. Moreover, the PMC has training times 30 times shorter than those of the CRF, which validates this model given our objectives.

Keywords: Chunking · Hidden Markov Chain · Named Entity Recognition · Pairwise Markov Chain · Part-Of-Speech Tagging
1 Introduction

In the past ten years, developments in Deep Learning methods [1, 2] have enabled research in Natural Language Processing (NLP) to take an impressive leap. Some tasks, like Question Answering [3] or Sentiment Analysis [4], seemed unrealistic twenty years ago, and recent neural models now achieve better scores than humans [5, 6] for these applications. The main motivation behind this dynamic is the direct application of NLP models in industry, with tasks such as Named Entity Recognition and mail classification. However, the cost of improving model performance keeps increasing, and learning algorithms need ever more data and computing power to be trained and to make predictions. We can produce models that achieve impressive scores, but deployment and climate-change issues [7] raise concerns. Our aim in this paper is to initiate a reflection around light models by introducing a new Markov model design for text segmentation and comparing it with machine learning algorithms that have a reasonable carbon impact and execution time. The motivation for choosing segmentation tasks is explained later in this paper.

Hidden Markov Chains (HMC), introduced by Stratonovich sixty years ago [8, 9, 10, 11, 12, 13], model weak correlations, are widely used in machine learning, and have notably been applied to NLP segmentation tasks [14, 15, 16]. For over twenty years, HMCs have been strictly generalized to Pairwise Markov Chains (PMCs) [17], a family of models that includes HMCs. In PMCs, the "hidden" chain is not necessarily Markov, and the noise is modeled in a more correlated, and thus more informative, way. However, PMCs keep the same advantages as HMCs with regard to hidden data estimation. In particular, the training and estimation tasks still have linear complexity. PMCs have mainly been studied for image segmentation, with discrete hidden variables and continuous observations. It turns out that using PMCs instead of HMCs can divide the error rate by two [18, 19]. However, for specific reasons related to language processing, which we develop later, PMCs have never been applied to NLP tasks.

This paper explores the interest of using PMCs for three of the main text segmentation tasks: Part-Of-Speech (POS) tagging, Chunking, and Named Entity Recognition (NER). The best methods for these tasks are based on Deep Learning models [20, 21]. However, to produce excellent scores, these models require a large amount of extra data, which results in very long training times and difficulties in deploying them on classic architectures.

The paper is organized as follows. In the next section, we present and compare PMCs and HMCs, the Bayesian segmentation methods, and the parameter estimation algorithm. The third section is devoted to the text segmentation tasks and an original way to adapt Markov chain models to these tasks while keeping fast training and relevant results. Experiments are presented in section four. The last section is devoted to conclusions and perspectives.
2 Hidden and Pairwise Markov Chains

Let $X_1^T = (X_1, \dots, X_T)$ and $Y_1^T = (Y_1, \dots, Y_T)$ be two discrete stochastic processes. For all $t$ in $\{1, \dots, T\}$, $X_t$ takes its values in $\Lambda_X = \{\lambda_1, \dots, \lambda_N\}$ and $Y_t$ takes its values in $\Omega_Y = \{\omega_1, \dots, \omega_M\}$. Let us then consider the process $Z_1^T = (X_1^T, Y_1^T)$. Our study takes place in the latent variable setting, with an observed realization $y_1^T$ of $Y_1^T$, and the corresponding hidden realization $x_1^T$ of $X_1^T$ having to be estimated. In the following, in order to simplify the notation, for all $t$ in $\{1, \dots, T\}$, the events $\{X_t = x_t\}$, $\{Y_t = y_t\}$ and $\{Z_t = z_t\}$ will be denoted by $\{x_t\}$, $\{y_t\}$ and $\{z_t\}$, respectively.

We say that $Z_1^T$ is an HMC if its probability law can be written as:

$$p(z_1^T) = p(x_1)\, p(y_1 | x_1)\, p(x_2 | x_1)\, p(y_2 | x_2) \cdots p(x_T | x_{T-1})\, p(y_T | x_T) \quad (1)$$

Moreover, for all $\lambda_i, \lambda_j \in \Lambda_X$ and $\omega_k \in \Omega_Y$, we consider homogeneous HMCs defined with the following parameters:

• Initial probability $\pi = (\pi(1), \dots, \pi(N))$ with $\pi(i) = p(x_1 = \lambda_i)$;
• Transition matrix $A = \{a_i(j)\}$ with, $\forall t \in \{1, \dots, T-1\}$, $a_i(j) = p(x_{t+1} = \lambda_j | x_t = \lambda_i)$;
• Emission matrix $B = \{b_i(k)\}$ with, $\forall t \in \{1, \dots, T\}$, $b_i(k) = p(y_t = \omega_k | x_t = \lambda_i)$.

An oriented dependency graph of the HMC is given in Figure 1. This dependency graph is almost reduced to the minimum needed to link dependent couples of sequential data, and yet the HMC is known to be a robust model. PMCs, which we detail next, introduce more connections in the graph, as shown in Figure 2.

Figure 1: Probabilistic graphical model of the HMC.

Figure 2: Probabilistic graphical model of the PMC.

Starting from the most correlated form of the couple $(X_1^T, Y_1^T)$, we say that $Z_1^T$ is a PMC if $Z_1^T$ is a Markov chain, which is equivalent to admitting that its distribution is of the form:

$$p(z_1^T) = p(z_1)\, p(z_2 | z_1) \cdots p(z_T | z_{T-1}) \quad (2)$$

which can be rewritten, without loss of generality, as:

$$p(z_1^T) = p(x_1)\, p(y_1 | x_1)\, p(x_2 | x_1, y_1)\, p(y_2 | x_2, x_1, y_1) \cdots p(x_T | x_{T-1}, y_{T-1})\, p(y_T | x_T, x_{T-1}, y_{T-1}) \quad (3)$$

Comparing (1) and (3), we see that a PMC is an HMC [17] if and only if, $\forall t \in \{1, \dots, T-1\}$:

• $p(x_{t+1} | x_t, y_t) = p(x_{t+1} | x_t)$, and
• $p(y_{t+1} | x_{t+1}, x_t, y_t) = p(y_{t+1} | x_{t+1})$.

From a graphical point of view, Figures 1 and 2 show how PMCs are more general than HMCs. In particular, the Markov assumption on $X_1^T$ made for HMCs is not required for PMCs, yet exact Bayesian inference remains possible.

In this paper, we consider homogeneous PMCs defined with three sets of parameters, $\forall \lambda_i, \lambda_j \in \Lambda_X$, $\forall \omega_k, \omega_l \in \Omega_Y$, $\forall t \in \{1, \dots, T-1\}$:

• Initial probability matrix $\Pi^{PMC} = \{\pi^{PMC}(i, k)\}$ with $\pi^{PMC}(i, k) = p(x_1 = \lambda_i, y_1 = \omega_k)$;
• Transition matrix $A^{PMC} = \{a^{PMC}_{i,k}(j)\}$ with $a^{PMC}_{i,k}(j) = p(x_{t+1} = \lambda_j | x_t = \lambda_i, y_t = \omega_k)$;
• Emission matrix $B^{PMC} = \{b^{PMC}_{i,j,k}(l)\}$ with $b^{PMC}_{i,j,k}(l) = p(y_{t+1} = \omega_l | x_t = \lambda_i, x_{t+1} = \lambda_j, y_t = \omega_k)$.
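To make this parameterization concrete, the following sketch (our own illustrative code, not the authors' implementation) stores the homogeneous PMC parameters as numpy arrays indexed exactly as above and evaluates the factorization (3) for a given labelled sequence; the sizes N and M and the random parameter values are placeholders.

```python
import numpy as np

# Illustrative sizes: N hidden labels, M observed words (placeholders, not from the paper)
N, M = 4, 6
rng = np.random.default_rng(0)

# Homogeneous PMC parameters, indexed as in the text (random placeholder values):
Pi = rng.dirichlet(np.ones(N * M)).reshape(N, M)  # Pi[i, k]      = p(x_1 = lambda_i, y_1 = omega_k)
A = rng.dirichlet(np.ones(N), size=(N, M))        # A[i, k, j]    = p(x_{t+1} = lambda_j | x_t = lambda_i, y_t = omega_k)
B = rng.dirichlet(np.ones(M), size=(N, N, M))     # B[i, j, k, l] = p(y_{t+1} = omega_l | x_t = lambda_i, x_{t+1} = lambda_j, y_t = omega_k)

def pmc_joint_prob(x, y):
    """p(z_1^T) of a labelled sequence under the PMC factorization (3).
    x, y: equal-length lists of label and word indices."""
    p = Pi[x[0], y[0]]
    for t in range(len(x) - 1):
        p *= A[x[t], y[t], x[t + 1]] * B[x[t], x[t + 1], y[t], y[t + 1]]
    return p

print(pmc_joint_prob([0, 1, 2], [3, 0, 5]))
```

Choosing A[i, k, j] independent of k and B[i, j, k, l] depending only on j and l recovers the HMC case, which mirrors the two conditions stated above.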
There are two Bayesian methods to estimate the realization $\hat{x}_1^T$ of the hidden chain: the Marginal Posterior Mode (MPM) and the Maximum A Posteriori (MAP). The MPM estimator is given by $\hat{x}^{MPM} = (\hat{x}_1, \dots, \hat{x}_T)$ with, $\forall t \in \{1, \dots, T\}$:

$$p(\hat{x}_t | y_1^T) = \sup_{\lambda_i \in \Lambda_X} p(x_t = \lambda_i | y_1^T)$$

and the MAP estimator is:

$$\hat{x}^{MAP} = \arg\max_{x_1^T \in (\Lambda_X)^T} p(x_1^T | y_1^T)$$

The MPM estimator is computed with the Forward-Backward algorithm [9, 12], while the MAP is computed with the Viterbi algorithm [22]. We tested both of them, and MPM gives slightly better results than MAP in terms of accuracy for the tasks we consider, so we only present results for MPM.

We thus have to compute, $\forall t \in \{1, \dots, T\}$, $\forall \lambda_i \in \Lambda_X$, $p(x_t = \lambda_i | y_1^T)$, in order to select the most probable state at every time $t$. We present the Forward-Backward algorithm in the PMC case, which generalizes the classical HMC one. $\forall t \in \{1, \dots, T\}$, $\forall \lambda_i \in \Lambda_X$:

$$p(x_t = \lambda_i | y_1^T) = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_{\lambda_j \in \Lambda_X} \alpha_t(j)\, \beta_t(j)}$$

where, $\forall t \in \{1, \dots, T\}$, $\forall \lambda_i \in \Lambda_X$:

$$\alpha_t(i) = p(y_1^t, x_t = \lambda_i), \qquad \beta_t(i) = p(y_{t+1}^T | x_t = \lambda_i, y_t)$$

$\forall \lambda_i \in \Lambda_X$, $\forall t \in \{1, \dots, T\}$, $\alpha_t(i)$ and $\beta_t(i)$ can be computed with the forward and backward recursions:

• with $y_1 = \omega_k$, $\alpha_1(i) = \pi^{PMC}(i, k)$;
• $\forall t \in \{1, \dots, T-1\}$, with $y_t = \omega_k$ and $y_{t+1} = \omega_l$:
$$\alpha_{t+1}(i) = \sum_{\lambda_j \in \Lambda_X} a^{PMC}_{j,k}(i)\, b^{PMC}_{j,i,k}(l)\, \alpha_t(j)$$

and

• $\beta_T(i) = 1$;
• $\forall t \in \{1, \dots, T-1\}$, with $y_t = \omega_k$ and $y_{t+1} = \omega_l$:
$$\beta_t(i) = \sum_{\lambda_j \in \Lambda_X} a^{PMC}_{i,k}(j)\, b^{PMC}_{i,j,k}(l)\, \beta_{t+1}(j)$$

One can normalize the forward and backward probabilities at every step, which avoids numerical underflow without modifying the results. This algorithm can be executed with matrix computations, allowing a highly fast execution.
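The recursions above translate directly into a few lines of numpy. The sketch below is a minimal illustration of the normalized forward-backward pass and the resulting MPM estimate, assuming the Pi, A and B arrays of the previous sketch; it is our own code, not the authors' implementation.

```python
import numpy as np

def pmc_forward_backward(y, Pi, A, B):
    """Normalized forward-backward for a PMC observation sequence y (word indices).
    Returns the posterior marginals p(x_t = lambda_i | y_1^T) as a (T, N) array."""
    T, N = len(y), Pi.shape[0]
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))

    # Forward recursion: alpha_t(i) proportional to p(y_1^t, x_t = lambda_i)
    alpha[0] = Pi[:, y[0]]
    alpha[0] /= alpha[0].sum()  # per-step normalization avoids underflow, leaves marginals unchanged
    for t in range(T - 1):
        # alpha_{t+1}(i) = sum_j a_{j,y_t}(i) * b_{j,i,y_t}(y_{t+1}) * alpha_t(j)
        alpha[t + 1] = (alpha[t][:, None] * A[:, y[t], :] * B[:, :, y[t], y[t + 1]]).sum(axis=0)
        alpha[t + 1] /= alpha[t + 1].sum()

    # Backward recursion: beta_t(i) proportional to p(y_{t+1}^T | x_t = lambda_i, y_t)
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        # beta_t(i) = sum_j a_{i,y_t}(j) * b_{i,j,y_t}(y_{t+1}) * beta_{t+1}(j)
        beta[t] = (A[:, y[t], :] * B[:, :, y[t], y[t + 1]] * beta[t + 1][None, :]).sum(axis=1)
        beta[t] /= beta[t].sum()  # the common factor cancels in the final normalization

    post = alpha * beta
    return post / post.sum(axis=1, keepdims=True)

def mpm_estimate(y, Pi, A, B):
    """MPM segmentation: most probable label at each position."""
    return pmc_forward_backward(y, Pi, A, B).argmax(axis=1)
```

Each update is a sum over the N states, so the whole pass is linear in T, in line with the linear-complexity property recalled in the introduction.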
We estimate the parameters with the maximum likelihood estimator [23, 14], which consists in computing the empirical frequencies of the probabilities of interest. We have, $\forall \lambda_i, \lambda_j \in \Lambda_X$, $\forall \omega_k, \omega_l \in \Omega_Y$:

$$\hat{\pi}(i) = \frac{N_i}{L}, \qquad \hat{a}_i(j) = \frac{N_{i,j}}{N_i}, \qquad \hat{b}_i(k) = \frac{M_{i,k}}{N_i}$$

and

$$\hat{\pi}^{PMC}(i, k) = \frac{N_{i,k}}{L}, \qquad \hat{a}^{PMC}_{i,k}(j) = \frac{N_{i,k,j}}{M_{i,k}}, \qquad \hat{b}^{PMC}_{i,j,k}(l) = \frac{N_{i,k,j,l}}{N_{i,k,j}}$$

where $N_{i,k,j,l}$ is the number of occurrences of the pattern $(x_t = \lambda_i, y_t = \omega_k, x_{t+1} = \lambda_j, y_{t+1} = \omega_l)$ in the $L$ chains of the training set. Then $N_{i,k,j} = \sum_{\omega_l \in \Omega_Y} N_{i,k,j,l}$, $N_{i,j} = \sum_{\omega_k \in \Omega_Y} N_{i,k,j}$, $M_{i,k} = \sum_{\lambda_j \in \Lambda_X} N_{i,k,j}$ and $N_i = \sum_{\lambda_j \in \Lambda_X} N_{i,j}$. Finally, $N_i$ and $N_{i,k}$ are respectively the numbers of times $x_1 = \lambda_i$ and $z_1 = (\lambda_i, \omega_k)$ occur in the $L$ chains of the training set.

When card$(\Lambda_X)$ or card$(\Omega_Y)$ is huge, some patterns may not be observed in the training set, which implies that the corresponding estimate is zero. This is the case in NLP with $\Omega_Y$, the space of possible written words. To mitigate this problem, especially for the PMC, we accept to partially "downgrade" the PMC to an HMC when necessary. This original process is represented in Figure 3. It consists of approximating the forward and backward probabilities of the PMC by the HMC ones when those of the PMC equal zero.

Moreover, it is essential to note that "online" learning is fast and easy with these models. If new sentences enrich the training set, updating the parameters only requires us to "add" information to already determined values, without retraining the model on the complete dataset.

Figure 3: Graphical model of a partially "downgraded" PMC. The marked couples of consecutive observations never appeared in this order in the training set.
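As an illustration, a possible counting implementation of these maximum-likelihood estimates is sketched below (our own code, with illustrative names); contexts that never occur keep zero probabilities, which is exactly the situation handled by the partial "downgrade" from PMC to HMC described above.

```python
import numpy as np

def estimate_pmc(sequences, N, M):
    """Maximum-likelihood PMC parameters from L labelled chains.
    sequences: iterable of (x, y) pairs of equal-length label and word index lists."""
    Pi = np.zeros((N, M))        # counts of (x_1 = lambda_i, y_1 = omega_k), i.e. N_{i,k}
    A = np.zeros((N, M, N))      # counts of (x_t = lambda_i, y_t = omega_k, x_{t+1} = lambda_j), i.e. N_{i,k,j}
    B = np.zeros((N, N, M, M))   # counts N_{i,k,j,l}, stored in the order (i, j, k, l) of b^PMC_{i,j,k}(l)

    for x, y in sequences:
        Pi[x[0], y[0]] += 1
        for t in range(len(x) - 1):
            A[x[t], y[t], x[t + 1]] += 1
            B[x[t], x[t + 1], y[t], y[t + 1]] += 1

    # Relative frequencies; unseen contexts keep zero probabilities, the case where the
    # partial "downgrade" from PMC to HMC is applied during forward-backward.
    with np.errstate(invalid="ignore"):
        Pi = Pi / Pi.sum()
        A = np.nan_to_num(A / A.sum(axis=-1, keepdims=True))
        B = np.nan_to_num(B / B.sum(axis=-1, keepdims=True))
    return Pi, A, B
```

Because the estimates are plain counts, the "online" update mentioned above amounts to incrementing the same count arrays with the new sentences and renormalizing.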
3 Markov Chains for Text Segmentation

Let us now say a few words about the three text segmentation tasks we are interested in: POS tagging, Chunking, and NER. For these three tasks, we observe the words of sentences for which we have to find the corresponding labels.

POS tagging consists in labeling each word with its grammatical function, such as verb, determiner, or adjective. For example, $(y_1, y_2, \dots, y_{12})$ = (John, likes, the, blue, house, at, the, end, of, the, street, .) has $(x_1, x_2, \dots, x_{12})$ = (Noun, Verb, Det, Adj, Noun, Prep, Det, Noun, Prep, Det, Noun, Punct). The performance of POS tagging is evaluated with accuracy.

NER consists in discriminating the "entities" among the words of sentences. An "entity" can be a person (PER), a location (LOC), or an organization (ORG). The set of entities depends on the use case; it can involve, for example, finding the names of proteins or DNA in medical data, as in [24]. For example, (John, works, at, IBM, in, Paris, .) can have the entities (PER, O, O, ORG, O, LOC, O), where O stands for "no entity". The performance for NER is evaluated with the F1 score [25].

Chunking consists of segmenting a sentence into its sub-constituents, such as noun (NP), verb (VP), and prepositional (PP) phrases. For example, (We, saw, the, yellow, dog, .) has the chunks (NP, VP, NP, NP, NP, O). Like NER, the performance is evaluated with the F1 score. More details about these tasks can be found in the NLTK book [26].

We chose these three tasks because they can serve as the basis of more complex ones, for example text classification when the corpus is small and the architecture is limited, preventing the deployment of a heavy deep learning model. This can happen, for instance, with email classification, where models have to process hundreds or thousands of emails per second with at most 4 GB of RAM, given a relatively small training set. One way to handle this problem is to construct a text segmentation model, then another model that uses the predicted labels to make the final prediction. The three label types we consider, POS tags, chunk tags, and entities, are particularly useful for building this type of architecture, which motivates the choice of these tasks.

However, in the context of text segmentation, the cardinality of $\Omega_Y$ is very large, and managing unknown patterns (not seen in the training set) is a real challenge. Although the Markov properties of our processes partially help to tackle this problem, they are not enough: the PMC and the HMC have to be adapted to this problem, and our goal is also to keep a very short training time. First of all, we can use the "downgrading" process from PMC to HMC described in Section 2 to alleviate this problem when the PMC is used.

Moreover, for the HMC, we look for extra information in the unknown observed words. Given a word $\omega$, we introduce the following functions (a code sketch at the end of this section illustrates them):

• $u(\omega) = 1$ if the first letter of $\omega$ is uppercase, 0 otherwise;
• $h(\omega) = 1$ if $\omega$ has a hyphen, 0 otherwise;
• $f(\omega) = 1$ if $\omega$ is the first word of the sentence, 0 otherwise;
• $d(\omega) = 1$ if $\omega$ has a digit, 0 otherwise;
• $s_m(\omega)$ = the suffix of length $m$ of $\omega$.

Then, for $\omega_k \in \Omega_Y$ unknown and $\forall \lambda_i \in \Lambda_X$, we approximate $b_i(k)$ by

$$b_i(k) \approx p(u(\omega_k), h(\omega_k), f(\omega_k), d(\omega_k), s_m(\omega_k) \,|\, x_t = \lambda_i)$$

which is also estimated with empirical frequencies. If $s_m(\omega_k)$ is unknown in the training set, we back off to shorter and shorter suffixes. This original approximation method allows us to improve the segmentation of unknown words while keeping a very short training time with the PMC and the HMC. The choice of the different features depends on the language; these ones are selected for English, and one can add features according to language characteristics.

It is important to note that the choice of the features is crucial, as we cannot use arbitrary features with Markov chain models [23, 27, 28]. It protects the model from a "second level" of unknown patterns, and it avoids having our new approximation of $b_i(k)$ equal to zero too many times.
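The sketch below illustrates how these feature functions and the frequency-based approximation of $b_i(k)$ could be coded. It is our own illustration: the helper structures feature_counts and label_counts are hypothetical, and the suffix back-off lengths are placeholders, since the exact lengths used in the paper are not recoverable here.

```python
def word_features(word, is_first, m=3):
    """Features u, h, f, d and the length-m suffix of a word (English); m=3 is a placeholder."""
    return {
        "u": int(word[0].isupper()),               # 1 if the first letter is uppercase
        "h": int("-" in word),                     # 1 if the word contains a hyphen
        "f": int(is_first),                        # 1 if it is the first word of the sentence
        "d": int(any(c.isdigit() for c in word)),  # 1 if the word contains a digit
        "s": word[-m:].lower(),                    # suffix of length m
    }

def unknown_emission(word, is_first, label, feature_counts, label_counts):
    """Approximate b_i(k) = p(u, h, f, d, s_m | x_t = lambda_i) by empirical frequencies,
    backing off to shorter suffixes when the full pattern was never seen (lengths illustrative)."""
    for m in (3, 2, 1):
        pattern = tuple(sorted(word_features(word, is_first, m).items()))
        count = feature_counts.get((label, m, pattern), 0)
        if count > 0:
            return count / label_counts[label]
    return 0.0
```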
4 Experiments

To calibrate our models, we compare their performance to benchmark models that use no extra data: Maximum Entropy Markov Models [27], Recurrent Neural Networks [29, 30], the Long Short-Term Memory (LSTM) network [31], the Gated Recurrent Unit [32], the Conditional Random Field (CRF) [33], and the BiLSTM-CRF [34]. We present the results of the best one with acceptable training and execution times, the CRF, using the same features described in Section 3, to which we add the suffix of length 4 and prefixes of lengths 1 to 4. Comparing results to this model is particularly relevant, as it is one of the most widely applied models for these tasks when no extra data are used. The experiments are done with Python: we code our own Markov chain models, and we use the CRFsuite [35] library for the CRF models.
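The following sketch shows one possible way to set up such a CRF baseline. The paper only states that CRFsuite [35] is used, so the sklearn-crfsuite wrapper, the training options, and the exact feature dictionary below are our assumptions, not the authors' configuration.

```python
import sklearn_crfsuite  # Python wrapper around CRFsuite [35]; using this wrapper is our assumption

def crf_features(sentence, t):
    """Feature dict for word t: the features of Section 3 plus suffixes and prefixes
    up to length 4, as described for the CRF baseline (exact set assumed)."""
    w = sentence[t]
    feats = {
        "word": w.lower(),
        "upper": w[0].isupper(),
        "hyphen": "-" in w,
        "first": t == 0,
        "digit": any(c.isdigit() for c in w),
    }
    for m in range(1, 5):
        feats[f"suffix{m}"] = w[-m:].lower()
        feats[f"prefix{m}"] = w[:m].lower()
    return feats

def to_crf_input(sentences):
    return [[crf_features(s, t) for t in range(len(s))] for s in sentences]

# Hypothetical usage on a labelled corpus (train_sentences: list of word lists,
# train_labels: list of tag lists):
# crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
# crf.fit(to_crf_input(train_sentences), train_labels)
# predictions = crf.predict(to_crf_input(test_sentences))
```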
We use reference datasets for each task: CoNLL 2000 [36] for Chunking, UD English [37] for POS tagging, and CoNLL 2003 [38] for NER. In addition, we take advantage of having enough data to use CoNLL 2000 and CoNLL 2003 for the other tasks as well.

Table 1: HMC, CRF and PMC error rates for POS tagging on CoNLL 2000, CoNLL 2003 and UD English, overall and split into Known Words (KW) and Unknown Words (UW). [Numeric values unrecoverable from the source extraction.]

Table 2: HMC, CRF and PMC F1 scores for NER on CoNLL 2003, overall and split into Known Words (KW) and Unknown Words (UW). [Numeric values unrecoverable from the source extraction.]

Table 3: HMC, CRF and PMC F1 scores for Chunking on CoNLL 2000 and CoNLL 2003, overall and split into Known Words (KW) and Unknown Words (UW). [Numeric values unrecoverable from the source extraction.]

[Table of training times of HMC, CRF and PMC for CoNLL 2000 (POS, Chunking), CoNLL 2003 (POS, NER, Chunking) and UD English (POS); caption and most values unrecoverable from the source extraction.]
5 Conclusion and Perspectives

The PMC performs well with no extra data on these tasks, with results equivalent to those of the CRF. The main advantage of the PMC is its training and execution time, which is much shorter for text segmentation. This confirms the interest of using the PMC to build an extra-light model for text segmentation, which can then be used for more complex tasks without being restrictive for deployment.

We are aware that our models are not as competitive as the best deep learning ones, especially for NER. The latter use a large amount of data to construct word embeddings [40, 41, 42], which become the input of models such as the BiLSTM-CRF to achieve state-of-the-art results [43]. On the one hand, using these embeddings is not possible with the PMC and the Viterbi or Forward-Backward algorithms, as this model cannot use observations with arbitrary features. On the other hand, these neural models are heavy and impossible to deploy without a significant configuration.

The training time, the execution time, and the carbon footprint of the PMC are significantly lower, which is our project's main objective. As a perspective, the PMC has been extended into the Triplet Markov Chain (TMC) model [44, 45, 46]. We plan to apply this extension to observe the possible improvements, especially for NER, while keeping the relevant properties of Markov models.
References

[1] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
[2] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[3] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016.
[4] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics.
[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[6] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942, 2019.
[7] Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for deep learning in NLP. arXiv preprint arXiv:1906.02243, 2019.
[8] Ruslan Leont'evich Stratonovich. Conditional Markov processes. In Non-linear Transformations of Stochastic Processes, pages 427–453. Elsevier, 1965.
[9] Leonard E. Baum and Ted Petrie. Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics, 37(6):1554–1563, 1966.
[10] Olivier Cappé, Eric Moulines, and Tobias Rydén. Inference in Hidden Markov Models. Springer Science & Business Media, 2006.
[11] Lawrence Rabiner and B. Juang. An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1):4–16, 1986.
[12] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
[13] Yariv Ephraim and Neri Merhav. Hidden Markov processes. IEEE Transactions on Information Theory, 48(6):1518–1569, 2002.
[14] Thorsten Brants. TnT: a statistical part-of-speech tagger. In Proceedings of the Sixth Conference on Applied Natural Language Processing, pages 224–231. Association for Computational Linguistics, 2000.
[15] Sudha Morwal, Nusrat Jahan, and Deepti Chopra. Named entity recognition using hidden Markov model (HMM). International Journal on Natural Language Computing (IJNLC), 1(4):15–23, 2012.
[16] Asif Ekbal, S. Mondal, and Sivaji Bandyopadhyay. POS tagging using HMM and rule-based chunking. The Proceedings of SPSAL, 8(1):25–28, 2007.
[17] Wojciech Pieczynski. Pairwise Markov chains. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(5):634–639, May 2003.
[18] Stéphane Derrode and Wojciech Pieczynski. Signal and image segmentation using pairwise Markov chains. IEEE Transactions on Signal Processing, 52(9):2477–2489, September 2004.
[19] Ivan Gorynin, Hugo Gangloff, Emmanuel Monfrini, and Wojciech Pieczynski. Assessing the segmentation performance of pairwise and triplet Markov models. Signal Processing, 145, December 2017.
[20] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, pages 5753–5763, 2019.
[21] Alan Akbik, Duncan Blythe, and Roland Vollgraf. Contextual string embeddings for sequence labeling. In COLING 2018, 27th International Conference on Computational Linguistics, pages 1638–1649, 2018.
[22] Andrew Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2):260–269, 1967.
[23] Daniel Jurafsky and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition.
[24] Tomoko Ohta, Yuka Tateisi, Jin-Dong Kim, Hideki Mima, and Junichi Tsujii. The GENIA corpus: An annotated research abstract corpus in molecular biology domain. In Proceedings of the Second International Conference on Human Language Technology Research, pages 82–86, 2002.
[25] Leon Derczynski. Complementarity, F-score, and NLP evaluation. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 261–266, 2016.
[26] Edward Loper and Steven Bird. NLTK: the Natural Language Toolkit. arXiv preprint cs/0205028, 2002.
[27] Andrew McCallum, Dayne Freitag, and Fernando C. N. Pereira. Maximum entropy Markov models for information extraction and segmentation. In ICML, volume 17, pages 591–598, 2000.
[28] Charles Sutton and Andrew McCallum. An introduction to conditional random fields for relational learning. Introduction to Statistical Relational Learning, 2:93–128.
[29] Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever. An empirical exploration of recurrent network architectures. In International Conference on Machine Learning, pages 2342–2350, 2015.
[30] Zachary C. Lipton, John Berkowitz, and Charles Elkan. A critical review of recurrent neural networks for sequence learning. arXiv preprint arXiv:1506.00019, 2015.
[31] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[32] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
[33] John Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. 2001.
[34] Zhiheng Huang, Wei Xu, and Kai Yu. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991, 2015.
[35] Naoaki Okazaki. CRFsuite: a fast implementation of Conditional Random Fields (CRFs), 2007.
[36] Erik F. Sang and Sabine Buchholz. Introduction to the CoNLL-2000 shared task: Chunking. arXiv preprint cs/0009008, 2000.
[37] Joakim Nivre, Marie-Catherine De Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, et al. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 1659–1666, 2016.
[38] Erik F. Sang and Fien De Meulder. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. arXiv preprint cs/0306050, 2003.
[39] Slav Petrov, Dipanjan Das, and Ryan McDonald. A universal part-of-speech tagset. arXiv preprint arXiv:1104.2086, 2011.
[40] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
[41] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146, 2017.
[42] Alan Akbik, Tanja Bergmann, and Roland Vollgraf. Pooled contextualized embeddings for named entity recognition. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 724–728, 2019.
[43] Alan Akbik, Tanja Bergmann, Duncan Blythe, Kashif Rasul, Stefan Schweter, and Roland Vollgraf. FLAIR: An easy-to-use framework for state-of-the-art NLP. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 54–59, 2019.
[44] Wojciech Pieczynski. Chaînes de Markov triplet. Comptes Rendus Mathématique, 335(3):275–278, 2002.
[45] Shou Chen and Xiangqian Jiang. Modeling repayment behavior of consumer loan in portfolio across business cycle: A triplet Markov model approach. Complexity, 2020, 2020.
[46] Haoyu Li, Stéphane Derrode, and Wojciech Pieczynski. An adaptive and on-line IMU-based locomotion activity classification method using a triplet Markov model.