Enhancing Sequence-to-Sequence Neural Lemmatization with External Resources
Kirill Milintsevich
Institute of Computer Science, University of Tartu, Tartu, Estonia
[email protected]

Kairit Sirts
Institute of Computer Science, University of Tartu, Tartu, Estonia
[email protected]
Abstract
We propose a novel hybrid approach to lemmatization that enhances the seq2seq neural model with additional lemmas extracted from an external lexicon or a rule-based system. During training, the enhanced lemmatizer learns both to generate lemmas via a sequential decoder and to copy lemma characters from the external candidates supplied at run-time. Our lemmatizer enhanced with candidates extracted from the Apertium morphological analyzer achieves statistically significant improvements over baseline models not utilizing additional lemma information, reaching an average accuracy of 97.25% on a set of 23 UD languages, which is 0.55% higher than that obtained with the Stanford Stanza model on the same set of languages. We also compare with other methods of integrating external data into lemmatization and show that our enhanced system performs considerably better than a simple lexicon extension method based on the Stanza system, while achieving complementary improvements w.r.t. the data augmentation method. (Code: https://github.com/501Good/lexicon-enhanced-lemmatization)

Introduction

State-of-the-art lemmatization systems are based on attentional sequence-to-sequence neural architectures operating on characters that transform the surface word form into its lemma (Kanerva et al., 2018; Qi et al., 2018). Like any other supervised learning model, these systems depend on the amount and quality of the existing training data. Attempts to develop even more accurate lemmatization systems can focus on improving the model's architecture or on obtaining additional data. While annotating additional data is an ongoing process for many smaller languages in the Universal Dependencies (UD) collection, there are also other data sources available that can be useful for improving lemmatization systems.
In particular, we refer to existing rule-based morphological analyzers, lexicons, and other such resources. Three potential sources for extracting additional lemma candidates are the Apertium, Unimorph, and UD Lexicons initiatives. Apertium is an open-source rule-based machine translation platform (Forcada et al., 2011). It also includes rule-based morphological analyzers based on finite-state transducers that cover 80 languages. Unimorph (http://unimorph.org/) is a project aimed at collecting annotated morphological inflection data, including lemmas, from Wiktionary (Kirov et al., 2016), a free open dictionary for many languages. Currently, the Unimorph project covers 110 languages. UD Lexicons (http://atoll.inria.fr/~sagot/) is a collection of 53 morphological lexicons in the CoNLL-UL format covering 38 languages. UD Lexicons mostly use the Apertium and Giellatekno systems to generate the annotations (Sagot, 2018).

Several previous works have proposed methods to improve lemmatization systems by augmenting the training data with additional instances (Bergmanis and Goldwater, 2019; Kanerva et al., 2020). In this paper, we propose another approach that both modifies the model architecture and leverages additional data. Unlike previous work, where the model gains from extracting extra knowledge from additional data provided for training, our primary goal is to teach the model to use external resources, even those that may only become available later at test time. In particular, the proposed system is a dual-encoder model, which receives two inputs for each word: 1) the word form itself to be lemmatized and 2) (optionally) the lemma candidates for that word form extracted from a lexicon or generated by a rule-based system. Both inputs are encoded with two different encoders and passed to the decoder.
The decoder then learns, via two separate attention mechanisms, to generate the lemma through a combination of regular transduction and copying characters from the external candidates. This way, the model is trained to use two sources of information: the regular training set and the options proposed by an external resource.

Experiments with several models enhanced with external data on 23 UD languages show that the best model, using additional lemma candidates generated by the Apertium system, achieves significantly higher results than the baseline models trained on the UD training set only. We also compare our method with other methods using external data. The enhanced system performs considerably better than a simple lexicon extension method based on the Stanza system, and it achieves complementary improvements w.r.t. the data augmentation method of Kanerva et al. (2020).

Related Work

Nowadays, state-of-the-art lemmatization systems are typically based on a neural sequence-to-sequence architecture, as demonstrated by the variety of systems presented at the CoNLL 2018 (Zeman et al., 2018) and SIGMORPHON 2019 (McCarthy et al., 2019) shared tasks. Several systems, including the TurkuNLP pipeline, the winner of the lemmatization track at the CoNLL 2018 Shared Task, use an attention-based translation model (Kanerva et al., 2018; Qi et al., 2018). The input to the system is the character sequence of a surface form (SF), which is "translated" into the lemma by an attention-based decoder. The input sequence can also be extended with POS tags (Qi et al., 2018) and morphological features (Kanerva et al., 2018). Another approach was used by the UDPipe Future system, the second-best model at the CoNLL 2018 Shared Task. Straka (2018) proposed to produce a lemma by constructing a set of rules that transform the SF into a lemma. These rules can include copying, moving, or deleting a character in the SF, as well as additional rules for changing or preserving the casing.
Thus, the lemmatization task is rendered into a multi-class classification task of choosing the correct transformation rule among the set of all possible rules generated from the training set. A year later, Straka et al. (2019) improved the lemmatization result by adding BERT contextual embeddings (Devlin et al., 2019) to the input, which made theirs the best lemmatization system at the SIGMORPHON 2019 Shared Task.

Several previous works have proposed to leverage additional data to improve lemmatization. In the simplest form, the training data itself can be used to create a lexicon that maps word forms to their lemmas. This strategy has been adopted by the Stanford neural lemmatization system (Qi et al., 2018), which creates such lexicons from the training sets and resorts to lemma generation only when the lexicon lookup fails. One can easily imagine extending such a lexicon with external resources. Rosa and Mareček (2018) adopted another simple way of using Unimorph lexicons: post-fixing the morphological features and lemmas predicted by the UDPipe system (Straka and Straková, 2017). The post-fix is performed by simply looking up the SF in the Unimorph lexicon and, if a match is found, replacing the model prediction with the tags and lemmas found in the lexicon.

Another line of work has used additional data to augment the training data set. Bergmanis and Goldwater (2019) augmented their training set by first listing all non-ambiguous word-lemma pairs from Unimorph lexicons and then extracting sentences from Wikipedia that contained these words. They then trained the context-sensitive Lematus model (Bergmanis and Goldwater, 2018) on this extended, partially lemmatized data set. Kanerva et al. (2018) used Apertium's morphological analyzer module to extend the training set for languages with tiny UD datasets. Apertium was used to generate all possible morphological analyses for 5000 sentences selected from the Wikipedia of the respective language.
For each sentence, the most likely analysis sequence was then obtained via a disambiguating language model. The words that were assigned an Apertium-generated lemma during this process were added to the lemmatizer training set. In subsequent work, Kanerva et al. (2020) extended the training data even more. They used Apertium to analyze all words found in the CoNLL 2017 web crawl dataset (Ginter et al., 2017) or in the Wikipedia of the respective language. All new words with an unambiguous lemma and morphological analysis were added to the augmented training set.
Method

The core of the proposed model is the Stanford lemmatizer (Qi et al., 2018, 2020), which is a sequence-to-sequence model with attention. It takes the character-level word representation and the POS tag as input and processes them with a bidirectional LSTM encoder. It then passes the encoder outputs to an LSTM decoder, which applies a soft dot attention layer after every LSTM cell. Finally, the output is constructed via greedy decoding.

[Figure 1: The architecture of the dual-encoder enhanced lemmatizer. Layers that comprise the original Stanza lemmatizer are marked with a bold red border. Legend: h = hidden state for the last timestep; c = cell state for the last timestep; H = hidden states from the last layer for all timesteps; in = state(s) for the encoded SF-POS-FEATS; cn = state(s) for the encoded candidates.]

We make several changes to the model architecture, as shown in Figure 1. The components comprising the original Stanford lemmatizer are marked in the figure with a bold red border. First, we add another encoder that encodes the lemma candidates provided by the external system. The output representations of both encoders are combined with a linear layer and fed to the decoder. Secondly, we add another attention layer to the decoder that attends to the outputs of the second encoder; the outputs of both attention layers are then combined with a linear layer. Finally, in addition to the POS tag, we also add morphological features to the first encoder's input.

Additionally, we implement an encoder dropout to simulate the situation when the external candidates are absent. The encoder dropout value defines the probability of discarding all candidates from a batch during training; the model then trains only the main encoder on such a batch. This helps the model perform robustly both when the candidates in the second encoder are present and when they are absent.

Data
The models are trained and tested on the Universal Dependencies (UD) v2.5 corpora (Zeman et al., 2019). As additional external data, we use the lexicons from the Unimorph project (Kirov et al., 2016), UD Lexicons (Sagot, 2018), and lemmas generated with the Apertium morphological analyzer module (Forcada et al., 2011). We also experiment with a lexicon constructed from the training set to simulate the situation when no additional data is available; this scenario assesses the effect of the second encoder without external data. The experiments are conducted on 23 languages from the UD collection, selected because all of them are supported by Unimorph, UD Lexicons, and Apertium alike.

To extract lemmas from the Unimorph lexicon, the input surface form (SF) is queried from the lexicon to retrieve the corresponding lemma. Some morphological forms in the Unimorph lexicons consist of several space-separated tokens; these were discarded. UD Lexicons are distributed in the CoNLL-UL format, an extension of the CoNLL-U format, which makes the extraction process trivial since the lexicons are already pre-tokenized. For Apertium, all generated lemmas were stripped of special annotation symbols, and duplicate lemmas were removed. Finally, the simple training-set-based lexicon solution, similar to Qi et al. (2018), consists of two lookup dictionaries. The first lexicon maps SF-POS pairs to their lemmas; the second maps SFs alone to their possible lemmas found in the training set. The lemma candidates for an SF are selected by first querying the input SF and POS tag from the SF-POS dictionary and, in case of failure, falling back to the SF dictionary.
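As a minimal sketch, the two-dictionary lookup just described can be implemented as follows (the function names and data layout here are our own illustration, not the paper's released code):

```python
def build_lexicons(training_data):
    # training_data: iterable of (surface_form, pos, lemma) triples.
    # Builds the two lookup tables described above: (SF, POS) -> lemmas
    # and SF -> lemmas, preserving first-seen order, without duplicates.
    sf_pos, sf_only = {}, {}
    for form, pos, lemma in training_data:
        if lemma not in sf_pos.setdefault((form, pos), []):
            sf_pos[(form, pos)].append(lemma)
        if lemma not in sf_only.setdefault(form, []):
            sf_only[form].append(lemma)
    return sf_pos, sf_only

def lemma_candidates(form, pos, sf_pos, sf_only):
    # Query the SF-POS dictionary first; on failure, fall back to the
    # SF-only dictionary; otherwise return no candidates.
    return sf_pos.get((form, pos)) or sf_only.get(form) or []
```

If several candidates are returned for an SF, they are concatenated before being fed to the second encoder, as described in the experimental setup.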
Baselines
As the first baseline, we compare our results with Stanza, the lemmatization module from the Stanza pipeline (Qi et al., 2020), which is a repackaging of the Stanford lemmatization system from the CoNLL 2018 Shared Task (Qi et al., 2018). We used the lemmatization models trained on UD v2.5 available on the Stanza web page. As the Default baseline, we use our enhanced model with the second encoder always left empty.
Experimental Setup
We train four enhanced dual-encoder models that differ in the input to the second encoder. For all models, the input to the first encoder is the concatenation of the SF characters, the POS tag, and the morphological features. During the training phase, gold POS tags and morphological features are supplied, while during inference, POS tags predicted with the Stanza tagger are used. The input to the second encoder is as follows: for the second baseline (Default), it is always empty; for the Lexicon, Unimorph, and Apertium enhanced models, it contains the lemma candidate(s) from the training-set-based lexicon, the Unimorph lexicons, and the Apertium analyses, respectively. If several possible candidates are returned for an SF, they are concatenated. The encoder dropout for the Lexicon model is set to 0 .

Results

[Table 1 (columns: Treebank, Size; All words: Def, Lex, Uni, Apt, Stanza; Out-of-vocabulary: Def, Apt, Diff, OOV%): Lemmatization accuracy of the models enhanced with the training set lexicon (Lex), the Unimorph lexicon (Uni), and the Apertium system (Apt), as well as the Default (Def) and Stanza baselines, on 23 UD languages.]

Table 1 shows the results for all three enhanced systems and the two baselines. The Apertium model outperforms the other models for most languages, although the absolute differences are quite small. The Lexicon model and the Default baseline are on the same level on average, suggesting that supplying the model with lemmas extracted from the training set via the second encoder does not help to leverage the training data better. However, all enhanced models, including the Default model, perform better than the Stanza baseline, suggesting that omitting the lexicon heuristics and supplying the input tokens with both POS and morphological features might improve performance.

One-way ANOVA was performed to detect statistical differences between the systems. A significant difference between the scores was found at the p < .05 level; post-hoc pairwise comparisons showed that the Apertium-enhanced model scores significantly higher than both the Default and the Stanza baselines, with the p-values adjusted for multiple comparisons using the Bonferroni correction. (The results for be_hse were extreme outliers and were not included in the comparison. The Unimorph-enhanced model was excluded from this test as its results did not conform to the normality requirement.)

As the baseline model performances are already very high and the external information is expected to improve lemmatization most for new words unseen during training, we computed the accuracy on out-of-vocabulary (OOV) words for the best-performing Apertium model and the Default baseline. In this context, OOV words are those words in the test set that were not seen by the model during training. The results are shown in the right-most section of Table 1. The improvements on the OOV words are variable, depending on the language, although on average, the improvement of the Apertium model over the Default baseline is more than 1%. We hypothesize that the direction and the magnitude of these effects depend on the coverage and the quality of the Apertium morphological analyzer.

Analysis

In this section, we analyze the potential of the proposed method more thoroughly. First, we compare our enhanced system with alternative methods for deploying external data, particularly with the data augmentation method proposed by Kanerva et al. (2020) and a lexicon extension method implemented on top of the Stanza system (Qi et al., 2020). Secondly, we present further analyses to provide evidence that the improvements reported for the enhanced model in the previous section can be attributed to our system's ability to make use of external resources supplied to the model via the second encoder.
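The OOV evaluation described above can be sketched as follows (a minimal illustration; the function name and data shapes are our assumptions):

```python
def oov_accuracy(train_forms, gold_pairs, predictions):
    # gold_pairs: list of (surface_form, gold_lemma) test items;
    # predictions: predicted lemmas aligned with gold_pairs.
    # OOV words are test forms never seen in the training data.
    seen = set(train_forms)
    correct = total = 0
    for (form, gold), pred in zip(gold_pairs, predictions):
        if form in seen:
            continue
        total += 1
        correct += int(pred == gold)
    return correct / total if total else float("nan")
```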
We implemented the transducer augmentation method described by Kanerva et al. (2020). The basic idea of this method is to apply existing morphological analyzers (in this case, Apertium) to unannotated data to generate additional training instances. To obtain the augmentation data, we recreated the experiments of Kanerva et al. (2020) with 8K additional items. First, we collected a word frequency list for each language based on the automatically annotated CoNLL 2017 corpora (Ginter et al., 2017). For the languages not present in this dataset (Belarusian and Armenian), we used the Wikipedia dump to extract the word frequency list. Next, all words in the list were analyzed with the Apertium morphological analyzer. Then, we used the scripts from the original experiments of Kanerva et al. (2020) to convert the Apertium analyses to the UD format and filter out ambiguous cases. Finally, the 8K most frequent words not already present in the training set, together with their analyses, were chosen and appended to the UD training set.

Although both the enhanced and augmented systems utilize Apertium as the external source, their usage of the additional data differs. The augmented system uses Apertium to create extra labeled training data, while our enhanced model uses Apertium to generate additional lemma candidates for the words of the same initial training set. On the other hand, during test time, the augmented model must fully rely on the regularities learned during training, while our enhanced model can additionally look at the lemmas for words that were never seen during training.

The comparison of our Apertium-enhanced model and the augmented model is shown in the first two blocks of Table 2. The first two columns reintroduce the Default and Apertium-enhanced models' results from Table 1; the third and fourth columns show the same two models trained on the augmented training sets.
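The selection of augmentation items described above can be sketched as follows (a simplified illustration: `analyze` stands in for the combination of Apertium analysis, UD conversion, and ambiguity filtering, and all names are ours):

```python
from collections import Counter

def select_augmentation(word_freqs, analyze, train_vocab, k):
    # word_freqs: {word: corpus frequency}; analyze(word) returns the set
    # of lemmas proposed for the word (empty if it cannot be analyzed).
    # Keeps the k most frequent unseen words with an unambiguous analysis.
    selected = []
    for word, _ in Counter(word_freqs).most_common():
        if word in train_vocab:
            continue
        lemmas = analyze(word)
        if lemmas is not None and len(lemmas) == 1:
            selected.append((word, next(iter(lemmas))))
            if len(selected) == k:
                break
    return selected
```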
Overall, the average results for the Apertium-enhanced model and the augmented Default model (the column Def+8K) are very similar, with the average of the Apertium-enhanced model being slightly higher (97.25 vs. 97.17). The Apertium-enhanced model is better on 15 languages out of 23 (underlined in the table), while the augmented model surpasses the enhanced model on 8 languages. The Apt+8K column shows the results of a model combining both the augmentation and enhancement methods: the training data is first augmented with the additional 8K words and then additionally enhanced with the Apertium candidates via the second encoder. The combined approach scores best for 8 languages out of 23, resulting in an average absolute improvement of 0.14% over the augmented Default model and 0.06% over the Apertium-enhanced model. These results show that both the augmentation and enhancement methods can contribute in complementary ways. The augmentation scripts are available at https://github.com/jmnybl/universal-lemmatizer.
[Table 2 (columns: Treebank; Def, Apt (our models); Def+8K, Apt+8K (augmented models); Apt0.8, Apt+E, Apt+Uni, Apt+UD (the second encoder input varies)): Comparison of the enhanced models with the augmentation method: Def is the Default model, Apt is the Apertium-enhanced model, and Def+8K and Apt+8K are the same Default and Apertium-enhanced models trained on augmented data. For the models marked with †, the UD Lexicon is absent and is replaced with Apertium candidates instead.]

Another simple baseline method for using external data is to consult a lexicon or an external system first and only resort to neural generation when the surface form (SF) is not present in the lexicon. This is essentially how the Stanza lemmatizer works. Stanza constructs a lexicon based on the training set. During inference, the prediction goes through a cascade of three steps: 1) if the SF is present in the lexicon, the lemma is immediately retrieved from the lexicon; 2) if the SF is novel and missing from the lexicon, an edit operation is generated that decides whether the SF itself or its lowercased form is the lemma, or whether neither is true; 3) only in the last case is the lemma generated by the sequential decoder. To test the lexicon extension approach, we used the pretrained Stanza models but extended the lexicon stored in the Stanza system with additional items. Note that Stanza lexicons can only store one lemma per SF-POS combination. Thus, if any of the external lexicons contains ambiguous lemmas, the first encountered lemma is chosen for each word.

We extended the Stanza lexicons with both the Apertium 8K datasets used for training the augmented models in Section 6.1 and the UD Lexicons (Sagot, 2018). The results of these evaluations are shown in Table 3. The set of languages in this table is slightly different from that in Table 1, including only those languages for which UD Lexicons exist. The left block shows the results with the various Stanza models.
The first column shows the baseline Stanza results (taken from Table 1); the second and third columns present the Stanza model with its lexicon extended with the UD Lexicons and the 8K words, respectively. The original UD lexicon for Russian contained many erroneous lemmas due to poor post-processing, which skewed the average accuracy; we therefore did additional post-processing to put it in line with the other languages.

[Table 3 (columns: Treebank; Stanza, Stanza+UD, Stanza+8K; Apt, Lex+UD, Lex+8K): Evaluation of the effect of the Stanza-based lexicon extension method; comparison with the Apertium-enhanced (Apt) and the Lexicon-enhanced systems (Lex+UD and Lex+8K).]

The average scores of the Stanza systems extended with either the UD or the 8K lexicons remain roughly the same. However, when extending Stanza with the UD Lexicons, most languages improve at least slightly, as shown with the underlined scores in the column Stanza+UD. Overall, on average, the simple lexicon extension method falls considerably behind our Apertium-enhanced model (97.63 vs. 98.00), whose scores are again replicated in the first column of the right-most block.

However, the Apertium-enhanced model is not directly comparable to the Stanza models with extended lexicons because 1) the training data differs, as the enhanced model has access to extra lemma candidates for the training set words during training, and 2) the lexicons available during test time are different. Thus, in the last two columns of the right-hand block of Table 3, we also show the results of two Lexicon-enhanced models (recall Section 4 and Table 1), similarly extended with the UD and 8K lexicons. The Lexicon-enhanced model has access to the same data as the Stanza model during both training (training set + the training-set-based lexicon) and testing.

While the Lexicon-enhanced model alone does not perform better than the Default baseline (see the results in Table 1), adopting the additional UD or 8K lexicons during test time raises the results to the same level as the Apertium-enhanced model.
This shows that our proposed approach does not need additional resources during training: the model can be trained to use external sources based on the lexicon created from the training set. The system's real benefits can then be achieved when extra resources are used later, during test time. Without those resources, the model still performs on the same level as the non-enhanced baseline.

We hypothesize that our dual-encoder approach performs better than Stanza with an extended lexicon partly because of the differences in how the external data is used. Since Stanza uses the lexicon resources as the first step in the cascade, it is prone to potential errors and noise in the lexicons. The dual-encoder model is safer against noise in this respect because the lemma candidates are not simply chosen as the prediction if present but are rather fed through the system, which can decide how much to take or ignore from the given candidates. Also, because Stanza lexicons have the restriction of only one lemma per word-POS pair, the system might resolve some ambiguities erroneously. Our approach is also more flexible in this respect, as the second encoder can be given several candidates, and again, the system learns to decide for itself which candidate to draw from and to what extent. On average, there are 0.71 lemma candidates per input word, and 1.09 lemma candidates per input word when excluding those words that do not have external lemma candidates.
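For contrast, the Stanza-style cascade discussed above can be sketched as follows (the helper names are illustrative, not Stanza's actual API):

```python
def cascade_lemma(form, pos, lexicon, edit_classifier, generate):
    # Step 1: exact lexicon hit on the (surface form, POS) pair.
    if (form, pos) in lexicon:
        return lexicon[(form, pos)]
    # Step 2: a cheap edit decision for novel forms: keep the form
    # as-is, lowercase it, or defer to full generation.
    op = edit_classifier(form, pos)
    if op == "identity":
        return form
    if op == "lower":
        return form.lower()
    # Step 3: character-level seq2seq generation as the last resort.
    return generate(form, pos)
```

Because a lexicon hit short-circuits the cascade, any noise in the lexicon propagates directly to the output, whereas the dual-encoder model can down-weight a bad candidate.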
Next, we performed a set of evaluations to demonstrate the effect of the second encoder in the enhanced model. We suggest that the improvements presented in Table 1 for the Apertium-enhanced model over the Default baseline are indeed due to the input provided via the second encoder. To demonstrate that, we evaluated the test set for each language again on the same model that was trained with Apertium lemma candidates, but leaving the second encoder empty at test time. For that, we retrained the Apertium-enhanced models with an encoder dropout of 0.8. This means that during training, 80% of the time, the lemma candidates provided for the second encoder are dropped, and the model trains only the main encoder. The reasoning for using the dropout is similar to the one provided for the Lexicon-enhanced model in Section 3: if the lemma candidates are always provided during training, the model learns to rely equally on both encoders. Due to that, if the second encoder remains empty during testing, the performance degrades considerably. If, on the other hand, the dropout is used, then the model learns to make predictions both when the candidates in the second encoder are present and when they are absent. The results of these experiments are shown in the right-most block of Table 2.

We first show in Table 2 that the results of the Apertium-enhanced models trained with dropout are equivalent to the results obtained without dropout, as evidenced by the column Apt0.8. Next, when the second encoder is empty (column Apt+E), the test results are similar to the ones obtained with the Default model, providing evidence that the improvements are indeed due to the extra information supplied via the second encoder during test time. Additionally, we emulated the scenario where extra lexicon information becomes available after the model has been trained. In this case, it is straightforward to integrate this information into the system without having to retrain the model. The last two columns in Table 2 show the following scenarios in this respect: 1) Unimorph lexicons in addition to Apertium (the column Apt+Uni) and 2) UD Lexicons in addition to Apertium (the last column, Apt+UD). The results in Table 2 show that, on average, extending the Apertium system with these particular lexicons does not add any benefit.
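The candidate dropout used for these retrained models can be sketched as follows (a minimal illustration in plain Python; the batch layout and names are our assumptions):

```python
import random

def candidate_dropout(batch_candidates, p_drop, rng):
    # With probability p_drop (0.8 here), discard ALL external lemma
    # candidates for the batch, so the model must lemmatize from the
    # main encoder alone; otherwise pass the candidates on unchanged.
    if rng.random() < p_drop:
        return [[] for _ in batch_candidates]
    return batch_candidates
```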
The reasons why these additional lexicons do not help can be two-fold: 1) the UD Lexicons are for most languages constructed based on the Apertium system and thus might not add any extra information; 2) the coverage of the Unimorph lexicons in terms of lemmas is typically smaller than that of the Apertium systems.

Table 4 shows some examples where the Default model predicted an incorrect lemma while the Apertium-enhanced model predicted the correct one. In some cases, Apertium provided the only, and correct, candidate for the Apertium-enhanced model, which was picked as the final prediction. In other cases, several candidates are provided to the second encoder, and the enhanced model chooses the correct one in most of the cases. This indicates that the second encoder effectively learns how to use the candidates to better control the lemma generation.

All dual-encoder models were trained with both POS and morphological features in the input, while the Stanza baseline only uses POS-tag information. Thus, the effect of the morphological features is a potential confounding factor when comparing the performance of the enhanced models to the Stanza baseline. To evaluate the effect of the morphological features, we trained the Default and Apertium-enhanced models providing only POS-tag information in the input.

Figure 2 shows the improvement in accuracy over the Default model trained with POS-tags only of 1) the Default model trained with both POS-tags and morphological features, 2) the Apertium-enhanced model trained with only POS-tags, and 3) the Apertium-enhanced model trained with both POS-tags and morphological features. It can be seen that for some languages, most of the improvement comes from adding morphological features to the input, while for other languages adding the second encoder gives the main boost.
Input | Def | Apt | Candidate(s)
паперi ⟨paperi⟩ | *папер ⟨paper⟩ | папiр ⟨papir⟩ | папiр ⟨papir⟩
чотирьох ⟨čotyr'oh⟩ | *четвери ⟨četvery⟩ | чотири ⟨čotyry⟩ | четверо, чотири ⟨četvero, čotyry⟩
Antworten | *Antworte | Antworten | antworten, antwort
besten | bester | gut | gut
раскладзе ⟨raskladze⟩ | *раскладз ⟨raskladz⟩ | расклад ⟨rasklad⟩ | раскласцi, расклад ⟨rasklasci, rasklad⟩
стаiць ⟨staic'⟩ | стаiць ⟨staic'⟩ | стаяць ⟨stajac'⟩ | стаяць, стаiць ⟨stajac', staic'⟩

Table 4: Examples of Ukrainian, German, and Belarusian words corrected by the enhanced model. All predictions of the Default (Def) model are incorrect; the ungrammatical ones are marked with *. The predictions of the Apertium-enhanced (Apt) model shown here are correct. The last column shows the external candidates; transliterations are given in angle brackets.
However, for most languages, combining the second encoder and the morphological features provides the largest effect, which seems to be more complex than a linear combination of the two. We suppose that, in this scenario, the attention mechanism works differently: it presumably takes the morphological features into account when picking the correct lemma from among the multiple candidates.
[Figure 2: Independent and cumulative effects of the second encoder and the morphological features on the model's performance. The origin of the x-axis is the performance of the Default model with POS-tags only.]

Conclusion

We proposed a method for enhancing neural lemmatization by integrating external input into the model via a second encoder and showed that the system incorporating the Apertium morphological analyzer significantly improved performance over the baselines. Both Bergmanis and Goldwater (2019) and Kanerva et al. (2020) used external resources to augment the training data, and thus the improvement of their systems depends on the amount and quality of the extended data supplied during training. Our method, on the other hand, trains the system to use external information provided at run-time, thus making it independent of the particular external data available during training. We experimentally showed that the enhancement method is both slightly better than and complementary to the data augmentation method of Kanerva et al. (2020). We also compared our system with a simple lexicon extension method implemented on top of the Stanza system. When trained and tested in a comparable setting, the proposed enhanced system achieves considerably higher results.

Although the model's computational complexity is increased by introducing the second encoder, this is counterbalanced by our model being more robust to noise and to the ambiguities stemming from the external lexicons. Moreover, the main bottleneck in computation originates not from the neural network's increased size but rather from the external system: in our experiments, the main computational bottleneck was executing the transducer-based Apertium morphological analyser. To overcome this bottleneck, one possible trade-off between speed and accuracy is to precompile a candidate list large enough to cover the most frequent words of a given language. This is a problem that the simpler baseline methods adopting external resources also have to address.

Finally, it is worth noting that the proposed method could be beneficial for less-resourced languages. However, establishing this claim would require more systematic experiments exploring this question specifically, which we did not focus on in this paper. Still, because the significant improvements shown in this work were obtained on languages with larger datasets, the possible gains on smaller datasets may be even larger.
Acknowledgments
The first author was supported by the IT Academy Program (StudyITin.ee).
References
Toms Bergmanis and Sharon Goldwater. 2018. Context Sensitive Neural Lemmatization with Lematus. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1391–1400.

Toms Bergmanis and Sharon Goldwater. 2019. Training Data Augmentation for Context-Sensitive Neural Lemmatizer Using Inflection Tables and Raw Text. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4119–4128.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Mikel L. Forcada, Mireia Ginestí-Rosell, Jacob Nordfalk, Jim O'Regan, Sergio Ortiz-Rojas, Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez, Gema Ramírez-Sánchez, and Francis M. Tyers. 2011. Apertium: A Free/Open-Source Platform for Rule-Based Machine Translation. Machine Translation, 25(2):127–144.

Filip Ginter, Jan Hajič, Juhani Luotolahti, Milan Straka, and Daniel Zeman. 2017. CoNLL 2017 Shared Task – Automatically Annotated Raw Texts and Word Embeddings. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University.

Jenna Kanerva, Filip Ginter, Niko Miekka, Akseli Leino, and Tapio Salakoski. 2018. Turku Neural Parser Pipeline: An End-to-End System for the CoNLL 2018 Shared Task. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 133–142.

Jenna Kanerva, Filip Ginter, and Tapio Salakoski. 2020. Universal Lemmatizer: A Sequence to Sequence Model for Lemmatizing Universal Dependencies Treebanks. Natural Language Engineering, pages 1–30.

Christo Kirov, John Sylak-Glassman, Roger Que, and David Yarowsky. 2016. Very-large Scale Parsing and Normalization of Wiktionary Morphological Paradigms. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France. European Language Resources Association (ELRA).

Arya D. McCarthy, Ekaterina Vylomova, Shijie Wu, Chaitanya Malaviya, Lawrence Wolf-Sonkin, Garrett Nicolai, Christo Kirov, Miikka Silfverberg, Sebastian J. Mielke, Jeffrey Heinz, et al. 2019. The SIGMORPHON 2019 Shared Task: Morphological Analysis in Context and Cross-Lingual Transfer for Inflection. In Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 229–244.

Peng Qi, Timothy Dozat, Yuhao Zhang, and Christopher D. Manning. 2018. Universal Dependency Parsing from Scratch. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 160–170.

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python Natural Language Processing Toolkit for Many Human Languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations.

Rudolf Rosa and David Mareček. 2018. CUNI x-ling: Parsing Under-Resourced Languages in CoNLL 2018 UD Shared Task. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 187–196.

Benoît Sagot. 2018. A Multilingual Collection of CoNLL-U-Compatible Morphological Lexicons. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

Milan Straka. 2018. UDPipe 2.0 Prototype at CoNLL 2018 UD Shared Task. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 197–207.

Milan Straka and Jana Straková. 2017. Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 88–99, Vancouver, Canada. Association for Computational Linguistics.

Milan Straka, Jana Straková, and Jan Hajič. 2019. UDPipe at SIGMORPHON 2019: Contextualized Embeddings, Regularization with Morphological Categories, Corpora Merging. In Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 95–103.

University of Tartu. 2018. UT Rocket. share.neic.no.

Daniel Zeman, Jan Hajič, Martin Popel, Martin Potthast, Milan Straka, Filip Ginter, Joakim Nivre, and Slav Petrov. 2018. CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. In