Extending Neural Keyword Extraction with TF-IDF tagset matching
Boshko Koloski
Jožef Stefan Institute, Jožef Stefan International Postgraduate School, Jamova 39, Ljubljana
[email protected]

Senja Pollak
Jožef Stefan Institute, Jamova 39, Ljubljana
[email protected]

Blaž Škrlj
Jožef Stefan Institute, Jožef Stefan International Postgraduate School, Jamova 39, Ljubljana
[email protected]

Matej Martinc
Jožef Stefan Institute, Jamova 39, Ljubljana
[email protected]
Abstract
Keyword extraction is the task of identifying words (or multi-word expressions) that best describe a given document; in news portals, keywords serve to link articles of similar topics. In this work we develop and evaluate our methods on four novel datasets covering less-represented, morphologically-rich languages in the European news media industry (Croatian, Estonian, Latvian and Russian). First, we perform an evaluation of two supervised neural transformer-based methods (TNT-KID and BERT + BiLSTM-CRF) and compare them to a baseline TF-IDF based unsupervised approach. Next, we show that by combining the keywords retrieved by both neural transformer-based methods and extending the final set of keywords with an unsupervised TF-IDF based technique, we can drastically improve the recall of the system, making it appropriate for use as a recommendation system in the media house environment.
1 Introduction

Keywords are words (or multi-word expressions) that best describe the subject of a document, effectively summarise it and can also be used in several document categorization tasks. In online news portals, keywords enable efficient retrieval of articles when needed. Similar keywords characterise articles of similar topics, which can help editors to link related articles, journalists to find similar articles and readers to retrieve articles of interest when browsing the portals. For journalists, manually assigning tags (keywords) to articles represents a demanding task, and high-quality automated keyword extraction is one of the components of the news digitalization process that many media houses seek. The task of keyword extraction can generally be tackled in an unsupervised way, i.e., by relying on frequency based statistical measures (Campos et al., 2020) or graph statistics (Škrlj et al., 2019), or with a supervised keyword extraction tool, which requires a training set of sufficient size and from the appropriate domain. While supervised methods tend to work better due to their ability to adapt to the specifics of the syntax, semantics, content, genre and keyword assignment regime of a specific text (Martinc et al., 2020a), training them for less-resourced languages is problematic due to the scarcity of large manually annotated resources. For this reason, studies on supervised keyword extraction conducted on less-resourced languages are still very rare. To overcome this research gap, in this paper we focus on supervised keyword extraction for three less-resourced languages, Croatian, Latvian and Estonian, and one fairly well-resourced language (Russian), and conduct experiments on datasets of media partners in the EMBEDDIA project (http://embeddia.eu/).

In media house environments, automatic keyword extraction systems are expected to return a diverse list of keyword candidates (of constant length), which is then inspected by a journalist who manually selects appropriate candidates. While the supervised language model based approaches in most cases offer good enough precision for this type of usage as a recommendation system, the recall of these systems is nevertheless problematic. Supervised systems learn how many keywords should be returned for each news article from the gold standard train set, which generally contains only a small number of manually approved candidates per news article. For example, among the datasets used in our experiments (see Section 3), the Russian train set contains the most (on average 4.44) present keywords (i.e., keywords which appear in the text of the article and can be used for training of the supervised models) per article, while the Croatian train set contains only 1.32 keywords per article. This means that for Croatian, the model will learn to return around 1.32 keywords for each article, which is not enough.

To solve this problem we show that we can improve the recall of an existing supervised keyword extraction system by:

• Proposing an additional TF-IDF tagset matching technique, which finds additional keyword candidates by ranking the words in the news article that appear in a predefined keyword set containing words from the gold standard train set. The new hybrid system first checks how many keywords were returned by the supervised approach and, if the number is smaller than needed, expands the list with the best ranked keywords returned by the TF-IDF based extraction system.
• Combining the outputs of several state-of-the-art supervised keyword extraction approaches.

The rest of this work is structured as follows: Section 3 describes the datasets on which we evaluate our method. Section 4 describes our proposed method with all corresponding steps. The experimental settings are described in Section 5 and the evaluation of the proposed methods is presented in Section 6, followed by the conclusions and the proposed further work in Section 7.

2 Related work

Many different approaches have been developed to tackle the problem of extracting keywords. The early approaches, such as KP-Miner (El-Beltagy and Rafea, 2009) and RAKE (Rose et al., 2010), rely on unsupervised techniques which employ frequency based metrics for the extraction of keywords from text. Formally, these approaches search for the words w from vocabulary V that maximize a given metric h for a given text t:

kw = argmax_{w ∈ V} h(w, t)

In these approaches, frequency is of high relevance: the assumption is that the more frequent a given word, the more important the meaning this word carries for a given document. The most popular such metrics are the naïve frequency (simple counting) and the term frequency-inverse document frequency (TF-IDF) (Salton and McGill, 1986).
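To make the argmax formulation concrete, the following is a minimal sketch of an extractor in this family; the function name and the naïve frequency choice of h are ours for illustration and do not correspond to any of the cited systems.

```python
from collections import Counter

def extract_keywords(text, top_n=5):
    """Score every word w in the document's vocabulary V with a metric
    h(w, t) -- here the naive frequency -- and return the words that
    maximize it, i.e., kw = argmax_{w in V} h(w, t)."""
    tokens = text.lower().split()
    h = Counter(tokens)  # h(w, t): raw frequency of w in t
    return [w for w, _ in h.most_common(top_n)]
```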
The most recent state-of-the-art statistical approaches, such as YAKE (Campos et al., 2020), also employ frequency based features, but combine them with other features such as casing, position, relatedness to context and dispersion of a specific term in order to derive a final score for each keyword candidate.

Another line of research models the problem by exploiting concepts from graph theory. Approaches such as TextRank (Mihalcea and Tarau, 2004), SingleRank (Wan and Xiao, 2008), TopicRank (Bougouin et al., 2013) and Topical PageRank (Sterckx et al., 2015) build a graph G, i.e., a mathematical construct described by a set of vertices V and a set of edges E connecting two vertices. In one of the most recent approaches, called RaKUn (Škrlj et al., 2019), a directed graph is constructed from text, where vertices represent words and two words w_i and w_{i+1} are linked if they appear following one another. Keywords are ranked by a shortest path-based metric from graph theory, the load centrality.

The task of keyword extraction can also be tackled in a supervised way. One of the first supervised approaches was an algorithm named KEA (Witten et al., 2005), which uses only TF-IDF and the term's position in the text as features for term identification. More recent neural approaches to keyword detection consider the problem as a sequence-to-sequence generation task (Meng et al., 2017) and employ a generative model for keyword prediction with a recurrent encoder-decoder framework and an attention mechanism capable of detecting keywords in the input text sequence, whilst also potentially finding keywords that do not appear in the text.

Finally, the newest branch of models considers keyword extraction as a sequence labelling task and tackles keyword detection with transformers. Sahrawat et al. (2020) fed contextual embeddings generated by several transformer models (BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019), GPT-2 (Radford et al., 2019), etc.) into two types of neural architectures, a bidirectional long short-term memory network (BiLSTM) and a BiLSTM network with an additional conditional random fields layer (BiLSTM-CRF). Another state-of-the-art transformer based approach is TNT-KID (Transformer-based Neural Tagger for Keyword Identification) (Martinc et al., 2020a), which does not rely on pretrained language models such as BERT, but rather allows the user to train their own language model on the appropriate domain. The study shows that smaller unlabelled domain specific corpora can be successfully used for unsupervised pretraining, which makes the proposed approach easily transferable to low-resource languages. It also proposes several modifications to the transformer architecture in order to adapt it to the task and improve the performance of the model.

3 Data

We conducted experiments on datasets containing news in four languages: Latvian, Estonian, Russian and Croatian. The Latvian, Estonian and Russian datasets contain news from the Ekspress Group, specifically from Estonian Ekspress Meedia (news in Estonian and Russian) and from Latvian Delfi (news in Latvian and Russian). The dataset statistics are presented in Table 2. The media houses provided news articles from 2015 up to 2019, which we divided into training and test sets: for the training set we used the articles from 2018, while for the test set the articles from 2019 were used. In our study, we also use tagsets of keywords. A tagset corresponds either to a collection of keywords maintained by the editors of a media house (see e.g. the Estonian tagset), or to a tagset automatically constructed from the keywords assigned to the articles available in the training set. The type of tagset and the number of unique tags for each language are listed in Table 1.
Table 1: Distribution of tags provided per language. The media houses provided tagsets for Estonian and Russian, while the tags for Latvian and Croatian were constructed from the train dataset.

Dataset    Unique tags   Type of tags
Croatian   25599         Constructed
Estonian   52068         Provided
Russian    5899          Provided
Latvian    4015          Constructed

Table 2: Media partners' datasets used for empirical evaluation of keyword extraction algorithms.

                                       Train                                                          Test
Dataset    Total docs  Total kw.   Docs    Avg. doc len.  Avg. kw.  % present kw.  Avg. present kw.   Docs    Avg. doc len.  Avg. kw.  % present kw.  Avg. present kw.
Croatian   52756       26896       47479   420.32         3.10      0.47           1.32               5277    464.14         3.28      0.55           1.62
Estonian   18497       59242       10750   395.24         3.81      0.65           2.77               7747    411.59         4.09      0.69           3.12
Russian    25306       5953        13831   392.82         5.66      0.76           4.44               11475   335.93         5.43      0.79           4.33
Latvian    24774       4036        13133   378.03         3.23      0.53           1.69               11641   460.15         3.19      0.55           1.71
4 Methodology

The recent supervised neural methods are very precise but, as already mentioned in Section 1, in some cases they do not return a sufficient number of keywords. This is due to the fact that the methods are trained on training data with a low number of gold standard keywords (as can be seen from Table 2). To meet the media partners' needs, we designed a method which complements state-of-the-art neural methods (the TNT-KID method (Martinc et al., 2020b) and the transformer-based method proposed by Sahrawat et al. (2020), both described in Section 2) with a tagset matching approach, returning a constant number of keywords (k = 10). In our approach, we first take the keywords returned by a neural keyword extraction method and then complement the returned keyword list by adding the missing keywords to reach the set goal of k keywords. The added keywords are selected by taking the top-ranked candidates from the TF-IDF tagset matching extraction conducted on the preprocessed news articles and keywords.

4.1 Preprocessing

First, we concatenate the body and the title of the article. After that, we lowercase the text and remove stopwords. Finally, the text is tokenized and lemmatized with the Lemmagen3 lemmatizer (Juršič et al., 2010), which supports lemmatization for all the languages except Latvian; for Latvian we use the LatvianStemmer (https://github.com/rihardsk/LatvianStemmer). For stopword removal we used the Stopwords-ISO Python library (https://github.com/stopwords-iso), which contains stopwords for all four languages. The final cleaned textual input consists of the concatenation of all of the preprocessed words from the document. We apply the same preprocessing procedure to the predetermined tagsets for each language. The preprocessing procedure is visualized in Figure 1.

Figure 1: Preprocessing pipeline used for the document normalization and cleaning.
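A minimal sketch of this pipeline is shown below, assuming the lemmagen3 and stopwordsiso Python packages; the wiring and the function name preprocess are our illustration rather than the authors' released code, and whitespace splitting stands in for the actual tokenizer.

```python
from lemmagen3 import Lemmatizer  # Lemmagen3 (Juršič et al., 2010)
import stopwordsiso               # Stopwords-ISO stopword lists

def preprocess(title, body, lang="et"):
    """Concatenate title and body, lowercase, drop stopwords, then
    lemmatize. (For Latvian, a stemmer replaces the lemmatizer, since
    Lemmagen3 does not support it.)"""
    stops = stopwordsiso.stopwords(lang)
    lemmatize = Lemmatizer(lang).lemmatize
    tokens = (title + " " + body).lower().split()
    return " ".join(lemmatize(t) for t in tokens if t not in stops)
```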
4.2 TF-IDF tagset matching

The TF-IDF weighting scheme (Salton and McGill, 1986) assigns each word a weight w based on the frequency of the word in the document (term frequency) and the number of documents the word appears in (inverse document frequency). More specifically, TF-IDF is calculated with the following equation:
TF-IDF_{i,j} = tf_{i,j} · log_e(|D| / df_i)

The formula has two main components:

• Term frequency (tf) counts the number of appearances of a word in the document (in the equation above, tf_{i,j} denotes the number of occurrences of the word i in the document j).

• Inverse document frequency (idf) ensures that words appearing in more documents are assigned lower weights (in the equation above, df_i is the number of documents containing word i and |D| denotes the number of documents).

The assumption is that words with a higher TF-IDF value are more likely to be keywords.
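The weighting can be computed from scratch in a few lines, as in the sketch below; this is our illustration of the equation over a toy corpus, not the paper's implementation (which could equally rely on an off-the-shelf vectorizer whose smoothing defaults differ slightly).

```python
import math
from collections import Counter

def tf_idf(documents):
    """Per-document TF-IDF weights: tf_{i,j} * log_e(|D| / df_i)."""
    n_docs = len(documents)        # |D|
    df = Counter()                 # df_i: documents containing word i
    tokenized = [doc.split() for doc in documents]
    for tokens in tokenized:
        df.update(set(tokens))
    return [{w: tf * math.log(n_docs / df[w]) for w, tf in Counter(tokens).items()}
            for tokens in tokenized]

# Words shared by every document get weight 0; rarer words dominate.
print(tf_idf(["riigieksam tulemus riigieksam", "ilm ja tulemus"])[0])
```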
For a given neural keyword extraction method N and for each document d, we select the l best ranked keywords according to TF-IDF(tm) which appear in the keyword tagset of the specific dataset. Here, l corresponds to k − m, where k = 10 and m corresponds to the number of keywords returned by the neural method.

Since some of the keywords in the tagsets provided by the media partners were variations of the same root word (i.e., the keywords are not lemmatized), we created a mapping from a root word (i.e., a word lemma or a stem) to a list of its possible variations in the keyword dataset. For example, the word 'riigieksam' ('exam') appearing in an article could be mapped to three tags with the same root form 'riigieksam' in the tagset of the Estonian media house: 'riigieksamid', 'riigieksamide' and 'riigieksam'. We tested several strategies for mapping the occurrence of a word in the news article to a specific tag in the tagset. For each lemma that mapped to multiple tags, we tested returning a random tag, the tag of minimal length and the tag of maximal length. In the final version, we opted to return the tag with the minimal length, since this tag most often corresponded to the lemma of the word.
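A sketch of this mapping step under the minimal-length strategy; the helper names and the toy root function are ours (the actual system derives roots with the lemmatizer or stemmer from Section 4.1).

```python
from collections import defaultdict

def build_tag_index(tagset, root_of):
    """Group all tagset variants under their shared root form."""
    index = defaultdict(list)
    for tag in tagset:
        index[root_of(tag)].append(tag)
    return index

def resolve_tag(word, index, root_of):
    """Map a word from the article to a tag; if several tags share the
    root, return the shortest one, which most often equals the lemma."""
    candidates = index.get(root_of(word), [])
    return min(candidates, key=len) if candidates else None

root_of = lambda w: w[:10]  # toy stand-in for the real lemmatizer/stemmer
index = build_tag_index({"riigieksamid", "riigieksamide", "riigieksam"}, root_of)
print(resolve_tag("riigieksamide", index, root_of))  # -> 'riigieksam'
```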
5 Experimental settings

We conducted experiments on the datasets described in Section 3. We evaluate the following methods and combinations of methods (the combination logic shared by the hybrid entries is sketched in code after the list):

• TF-IDF(tm): Here, we employ the preprocessing and TF-IDF based weighting of keywords described in Section 4 and select the top-ranked keywords that are present in the tagset.
• TNT-KID (Martinc et al., 2020b): For each dataset, we first pretrain the model with an autoregressive language modelling objective. After that, the model is fine-tuned on the same train set for the keyword extraction task. We use the same hyperparameters as proposed in the original study (i.e., a sequence length of 256, an embedding size of 512 and a batch size of 8) and employ the same preprocessing (i.e., we lowercase and tokenize the text with a SentencePiece (Kudo and Richardson, 2018) byte-pair encoding tokenization scheme).
• BERT + BiLSTM-CRF (Sahrawat et al., 2020): We employ an uncased multilingual BERT model with an embedding size of 768 and 12 attention heads, with an additional BiLSTM-CRF token classification head, the same as in Sahrawat et al. (2020). More specifically, we use the 'bert-base-multilingual-uncased' implementation of BERT from the Transformers library (https://github.com/huggingface/transformers).

• TNT-KID & BERT + BiLSTM-CRF: We extract keywords with both of the methods and complement the TNT-KID extracted keywords with the BERT + BiLSTM-CRF extracted keywords in order to retrieve more keywords. Duplicates (i.e., keywords extracted by both methods) are removed.
• TNT-KID & TF-IDF: If the keyword set extracted by TNT-KID contains less than 10 keywords, it is expanded with keywords retrieved with the proposed TF-IDF(tm) approach, i.e., the best ranked keywords according to TF-IDF which do not appear in the keyword set extracted by TNT-KID.
• BERT + BiLSTM-CRF & TF-IDF: If the keyword set extracted by BERT + BiLSTM-CRF contains less than 10 keywords, it is expanded with keywords retrieved with the proposed TF-IDF(tm) approach, i.e., the best ranked keywords according to TF-IDF which do not appear in the keyword set extracted by BERT + BiLSTM-CRF.

• TNT-KID & BERT + BiLSTM-CRF & TF-IDF: The keyword set extracted with TNT-KID is complemented by keywords extracted with BERT + BiLSTM-CRF (duplicates are removed). If after the expansion the keyword set still contains less than 10 keywords, it is expanded again, this time with keywords retrieved by the TF-IDF(tm) approach.
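All of the hybrid combinations above reduce to the same merge-then-pad procedure, sketched below under our own function name; the ordering (neural outputs first, TF-IDF(tm) candidates as padding up to k = 10) follows the descriptions in the list.

```python
def combine_keywords(neural_outputs, tfidf_ranked, k=10):
    """Merge the keyword lists of one or more neural extractors in order,
    dropping duplicates, then pad with the best ranked TF-IDF(tm)
    candidates until k keywords are reached."""
    merged = []
    for keywords in neural_outputs:
        for kw in keywords:
            if kw not in merged:
                merged.append(kw)
    for kw in tfidf_ranked:          # supplies the l = k - m extras
        if len(merged) >= k:
            break
        if kw not in merged:
            merged.append(kw)
    return merged[:k]

# e.g. TNT-KID & BERT + BiLSTM-CRF & TF-IDF:
# combine_keywords([tntkid_kws, bert_kws], tfidf_tm_kws, k=10)
```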
For TNT-KID, which is the only model that requires language model pretraining, language models were trained on the train sets in Table 2 for up to ten epochs. Next, TNT-KID and BERT + BiLSTM-CRF were fine-tuned on the training datasets, which were randomly split into 80 percent of the documents used for training and 20 percent used for validation. Documents containing more than 256 tokens are truncated, while documents containing less than 256 tokens are padded with a special <pad> token at the end. We fine-tuned each model for a maximum of 10 epochs, and after each epoch the trained model was tested on the documents chosen for validation. The model that showed the best performance on this set of validation documents (in terms of F1@10 score) was used for keyword detection on the test set.

6 Results

For evaluation, we employ precision, recall and F1 score. While F1@10 and recall@10 are the most relevant metrics for the media partners, we also report precision@10, precision@5, recall@5 and F1@5. Only keywords which appear in the text (present keywords) were used as the gold standard, since we only evaluate approaches for keyword tagging that are not capable of finding keywords which do not appear in the text. Lowercasing and lemmatization (stemming in the case of Latvian) are performed on both the gold standard and the extracted keywords (keyphrases) during the evaluation.
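Under this protocol, the per-document scores reduce to the short computation below (a sketch, assuming the predictions and the present gold standard keywords have already been lowercased and lemmatized or stemmed):

```python
def scores_at_k(predicted, gold, k=10):
    """Precision@k, recall@k and F1@k for one document, against the
    present (in-text) gold standard keywords."""
    top_k = predicted[:k]
    hits = len(set(top_k) & set(gold))
    p = hits / len(top_k) if top_k else 0.0
    r = hits / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```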
The results of the evaluation of the best-performing models on all four languages are listed in Table 3.

Table 3: Results on the EMBEDDIA media partner datasets.

Model                                       P@5     R@5     F1@5    P@10    R@10    F1@10

Croatian
TF-IDF                                      0.3430  0.4955  0.4054  0.3364  0.4987  0.4018
TNT-KID                                     0.3364  0.4925  0.3998  0.3273  0.5089  0.3984
BERT + BiLSTM-CRF                           –       –       –       –       –       –
TNT-KID & TF-IDF(tm)                        0.2835  0.5664  0.3779  0.2594  0.6224  0.3662
BERT + BiLSTM-CRF & TF-IDF(tm)              0.3003  0.5569  0.3901  0.2782  0.5732  0.3746
TNT-KID & BERT + BiLSTM-CRF                 0.2961  0.5354  0.3813  0.2778  0.5778  0.3752
TNT-KID & BERT + BiLSTM-CRF & TF-IDF(tm)    0.2653  –       –       –       –       –

Estonian
TF-IDF                                      0.0716  0.1488  0.0966  0.0496  0.1950  0.0790
TNT-KID                                     –       –       –       –       –       –
BERT + BiLSTM-CRF                           0.5118  0.4617  0.4855  0.5078  0.4775  0.4922
TNT-KID & TF-IDF(tm)                        0.3463  0.5997  0.4391  0.1978  0.6541  0.3037
BERT + BiLSTM-CRF & TF-IDF(tm)              0.3175  0.4978  0.3877  0.1789  0.5381  0.2686
TNT-KID & BERT + BiLSTM-CRF                 0.4421  0.6014  0.5096  0.4028  0.6438  0.4956
TNT-KID & BERT + BiLSTM-CRF & TF-IDF(tm)    0.3588  –       –       –       –       –

Russian
TF-IDF                                      0.1764  0.2314  0.2002  0.1663  0.3350  0.2223
TNT-KID                                     –       –       –       –       –       –
BERT + BiLSTM-CRF                           0.6901  0.5467  0.5467  0.6849  0.5643  0.6187
TNT-KID & TF-IDF(tm)                        0.4519  0.6293  0.5261  0.2981  0.6946  0.4172
BERT + BiLSTM-CRF & TF-IDF(tm)              0.4157  0.5728  0.4818  0.2753  0.6378  0.3846
TNT-KID & BERT + BiLSTM-CRF                 0.6226  0.6375  0.6300  0.5877  0.6707  0.6265
TNT-KID & BERT + BiLSTM-CRF & TF-IDF(tm)    0.4622  –       –       –       –       –

Latvian
TF-IDF                                      0.2258  0.5035  0.3118  0.1708  0.5965  0.2655
TNT-KID                                     0.6089  0.6887  –       –       –       –
BERT + BiLSTM-CRF                           –       –       –       –       –       –
TNT-KID & TF-IDF(tm)                        –       –       –       –       –       –
BERT + BiLSTM-CRF & TF-IDF(tm)              –       –       –       –       –       –
TNT-KID & BERT + BiLSTM-CRF                 –       –       –       –       –       –
TNT-KID & BERT + BiLSTM-CRF & TF-IDF(tm)    –       –       –       –       –       –
Results suggest that the neural approaches, TNT-KID and BERT + BiLSTM-CRF, offer comparable performance on all datasets, but nevertheless achieve different results for different languages. TNT-KID outperforms the BERT + BiLSTM-CRF model according to all the evaluation metrics on the Estonian and Russian news datasets. It also outperforms all other methods in terms of precision and F1 score. On the other hand, BERT + BiLSTM-CRF performs better on the Croatian dataset in terms of precision and F1 score. On Latvian, TNT-KID achieves top results in terms of F1, while BERT + BiLSTM-CRF offers better precision. The proposed TF-IDF tagset matching method performs poorly in most of the cases. The exception is the Croatian dataset, where it outperforms TNT-KID according to all criteria but R@10. Most likely, this is connected to the distribution of articles in the Croatian train and test sets, where we have 47,479 train articles and 5,277 test articles, i.e., a train to test ratio of 8.62. In the other three languages, the ratio of train to test articles is 1.13 for Latvian, 1.21 for Russian and 1.39 for Estonian.

Even though the TF-IDF tagset matching method performs poorly on its own, we can nevertheless drastically improve the recall of both neural systems if we expand the keyword sets returned by the neural methods with the TF-IDF ranked keywords. The improvement is substantial and consistent for all datasets, but it nevertheless comes at the expense of lower precision and F1 score. This is not surprising, since the final expanded keyword set always contains 10 keywords, i.e., many more than the average number of present gold standard keywords in the media partner datasets (see Table 2), which badly affects the precision of the approach.

Combining the keywords returned by TNT-KID and BERT + BiLSTM-CRF also consistently improves recall, but again at the expense of lower precision and F1 score. Overall, for all four languages, the best performing method in terms of recall is TNT-KID & BERT + BiLSTM-CRF & TF-IDF(tm).
7 Conclusion and future work

In this work we tested two state-of-the-art neural approaches for keyword extraction, TNT-KID (Martinc et al., 2020a) and BERT + BiLSTM-CRF (Sahrawat et al., 2020), on three less-resourced European languages, Estonian, Latvian and Croatian, as well as on Russian. We also proposed a tagset based keyword expansion approach, which drastically improves the recall of the method, making it more suitable for application in the media house environment.

Our study is one of the very few in which supervised keyword extraction models were employed on several less-resourced languages. The results suggest that these models perform well on languages other than English and could also be successfully leveraged for keyword extraction on morphologically rich languages.

The focus of the study was whether we can improve the recall of the supervised models, in order to make them more useful as recommendation systems in the media house environment. Our method manages to increase the number of retrieved keywords, which drastically improves the recall for all languages. For example, by combining all neural methods and the TF-IDF based approach, we improve on the recall@10 achieved by the best performing neural model, TNT-KID, by 14.6 percentage points for Croatian, 9.70 percentage points for Estonian, 9.63 percentage points for Russian and 17.12 percentage points for Latvian. The resulting method nevertheless offers lower precision, which we will try to improve in future work.

In the future we also plan to perform a qualitative evaluation of our methods with journalists from the media houses. Next, we plan to explore how adding background knowledge from knowledge bases, either lexical (e.g. WordNet (Fellbaum, 1998)) or factual (e.g. WikiData (Vrandečić and Krötzsch, 2014)), would benefit the aforementioned methods. The assumption is that by linking the text representation with background knowledge we would achieve a more representative understanding of the articles and the concepts appearing in them, and consequently more successful keyword extraction.

In a traditional machine learning setting, the common practice of combining the outputs of different classifiers into a single output is referred to as stacking, and we propose further research on such combinations of various keyword extraction models. One way to improve the combination would be to add a notion of positional encoding, since keywords in the news media domain can often be found at the beginning of the article, and TF-IDF(tm) does not account for this when weighting the matched terms.
Acknowledgements will be added for the final, non-anonymized, version of the article.
References
Adrien Bougouin, Florian Boudin, and Béatrice Daille. 2013. TopicRank: Graph-based topic ranking for keyphrase extraction. In International Joint Conference on Natural Language Processing (IJCNLP), pages 543–551.

Ricardo Campos, Vítor Mangaravite, Arian Pasquali, Alípio Jorge, Célia Nunes, and Adam Jatowt. 2020. YAKE! Keyword extraction from single documents using multiple local features. Information Sciences, 509:257–289.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Samhaa R. El-Beltagy and Ahmed Rafea. 2009. KP-Miner: A keyphrase extraction system for English and Arabic documents. Inf. Syst., 34(1):132–144.

Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. Bradford Books.

Matjaž Juršič, Igor Mozetič, Tomaž Erjavec, and Nada Lavrač. 2010. LemmaGen: Multilingual lemmatisation with induced ripple-down rules. Journal of Universal Computer Science, 16(9):1190–1214.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Matej Martinc, Blaž Škrlj, and Senja Pollak. 2020a. TNT-KID: Transformer-based neural tagger for keyword identification. arXiv preprint arXiv:2003.09166.

Matej Martinc, Blaž Škrlj, and Senja Pollak. 2020b. TNT-KID: Transformer-based neural tagger for keyword identification.

Rui Meng, Sanqiang Zhao, Shuguang Han, Daqing He, Peter Brusilovsky, and Yu Chi. 2017. Deep keyphrase generation. arXiv preprint arXiv:1704.06879.

Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 404–411.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. Technical report, OpenAI.

Stuart Rose, Dave Engel, Nick Cramer, and Wendy Cowley. 2010. Automatic keyword extraction from individual documents. Text Mining: Applications and Theory, 1:1–20.

Dhruva Sahrawat, Debanjan Mahata, Mayank Kulkarni, Haimin Zhang, Rakesh Gosangi, Amanda Stent, Agniv Sharma, Yaman Kumar, Rajiv Ratn Shah, and Roger Zimmermann. 2020. Keyphrase extraction from scholarly articles as sequence labeling using contextualized embeddings. In Proceedings of the European Conference on Information Retrieval (ECIR 2020), pages 328–335.

Gerard Salton and Michael J. McGill. 1986. Introduction to Modern Information Retrieval.

Blaž Škrlj, Andraž Repar, and Senja Pollak. 2019. RaKUn: Rank-based keyword extraction via unsupervised learning and meta vertex aggregation. In International Conference on Statistical Language and Speech Processing, pages 311–323. Springer.

Lucas Sterckx, Thomas Demeester, Johannes Deleu, and Chris Develder. 2015. Topical word importance for fast keyphrase extraction. In Proceedings of the 24th International Conference on World Wide Web, pages 121–122.

Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: A free collaborative knowledgebase. Commun. ACM, 57(10):78–85.

Xiaojun Wan and Jianguo Xiao. 2008. Single document keyphrase extraction using neighborhood knowledge. In AAAI, volume 8, pages 855–860.

Ian H. Witten, Gordon W. Paynter, Eibe Frank, Carl Gutwin, and Craig G. Nevill-Manning. 2005. KEA: Practical automated keyphrase extraction. In