ELSKE: Efficient Large-Scale Keyphrase Extraction
Johannes Knittel, Steffen Koch, Thomas Ertl
Institute for Visualization and Interactive Systems, University of Stuttgart
[email protected]
Abstract
Keyphrase extraction methods can provide insights into large collections of documents such as social media posts. Existing methods, however, are less suited for the real-time analysis of streaming data, because they are computationally too expensive or require restrictive constraints regarding the structure of keyphrases. We propose an efficient approach to extract keyphrases from large document collections and show that the method also performs competitively on individual documents.
Introduction

Automatically extracting descriptive words (keywords) or phrases (keyphrases) from documents is important for a wide range of tasks, including document summarization and improved information retrieval in databases (Alami Merrouni et al., 2020). Several graph-based (Mihalcea and Tarau, 2004; Wan and Xiao, 2008; Bougouin and Boudin, 2013; Škrlj et al., 2019), statistical (El-Beltagy and Rafea, 2009; Rose et al., 2010; Campos et al., 2020), and machine learning-based methods (Meng et al., 2017; Xiong et al., 2019; Wang et al., 2019; Ye and Wang, 2020; Santosh et al., 2020) have been developed to find a limited set of concise words or phrases that best describe a given document.

A typical keyphrase extraction pipeline consists of two steps (Hasan and Ng, 2010). First, the algorithm extracts a set of candidate phrases. Then, a suitable ranking is applied to retrieve the best fits. Part-of-Speech tagging is often used to retrieve candidates that are composed of nouns and adjectives (Hasan and Ng, 2010; Mihalcea and Tarau, 2004; Wan and Xiao, 2008), but this excludes longer sequences. Furthermore, most POS taggers need a considerable amount of processing time (Horsmann et al., 2015). YAKE (Campos et al., 2020) does not make use of POS tagging, but focuses on extracting up to tri-grams per default. In recent years, machine learning techniques have been proposed that significantly outperform the previous state of the art, but powerful models are computationally expensive, need extensive training data, and may generalize less well to foreign domains due to the supervised training.

In this work, we shift the focus slightly from documents to collections of (micro-)documents. Apart from analyzing individual documents, keyphrase extraction methods can also provide insights into large document collections, for instance, to gain an overview of recent news reports or trending topics on social media.
Existing methods largely focus on single documents and do not take into account the particular challenges of analyzing streaming data such as continuously incoming tweets. Short documents provide little context that can be harvested for mining descriptive keyphrases, and the sheer quantity of newly published items per second requires a great amount of computational resources or efficient methods. This is particularly important for applications that need to process incoming documents immediately, e.g., in scenarios that aim at providing situational awareness. Dealing with large collections, we may also want to find frequent longer phrases to better understand the underlying data with additional context information.

Unfortunately, we can hardly rely on syntactical and structural assumptions regarding phrase candidates if we need to avoid Part-of-Speech tagging for efficiency. Extracting context-rich keyphrases from large datasets in a timely manner is therefore particularly challenging. The complexity of extracting every possible n-gram increases linearly with n, and an extended set of candidates will also lead to an increase of similarly worded keyphrases. Hence, we propose a new method to efficiently extract descriptive, but potentially long phrases that appear unusually often, including complete sentences. For the ranking, we extend the concept of TF-IDF to phrases and adapt it to the analysis of large document collections.

Method

We call the document or the collection of documents from which we want to extract keyphrases the source. This means that if we want to analyze collections, we concatenate the individual documents into one big document. Let V be our vocabulary of terms in the source; then each keyphrase p_i is a sequence (v_i1, ..., v_im) with v_ij ∈ V.
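To make this notation concrete, here is a minimal sketch (function and variable names are ours, not from the paper) that represents phrases as tuples of terms and counts every {1, ..., m}-gram of a toy source, i.e., the naive enumeration the pipeline described below avoids:

```python
from collections import Counter

def naive_ngram_counts(tokens, max_n):
    """Count every {1, ..., max_n}-gram in a tokenized source.

    Each phrase is a tuple of terms (v_i1, ..., v_im), matching the
    notation in the text; the counts are the phrase frequencies f_s.
    """
    counts = Counter()
    for i in range(len(tokens)):
        for n in range(1, max_n + 1):
            if i + n > len(tokens):
                break
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# A collection is concatenated into one big source document.
source = "happy birthday to you happy birthday".split()
counts = naive_ngram_counts(source, 3)
# counts[("happy", "birthday")] == 2
```

The cost of this enumeration grows linearly with the maximum phrase length m, which is why the candidate extraction described below prunes most positions early.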
Ranking with PF-IDF

While TF-IDF ranking of candidate keyphrases is often used as a baseline for evaluating more advanced ranking approaches, it performs surprisingly well in combination with Part-of-Speech tagging (Hasan and Ng, 2010; Meng et al., 2017) and, importantly, has few requirements and external dependencies. The main idea of TF-IDF for ranking keywords is to weight terms v_i with their frequency in the source f_s(v_i) in relation to the document frequency f_d(v_i) of the term in a reference collection comprising N documents:

    TF-IDF(v_i) = f_s(v_i) · ln(N / f_d(v_i))    (1)

A list of stop words that should be ignored often greatly improves the results. One way to extend TF-IDF to phrases is to sum up the individual scores of each term (Hasan and Ng, 2010), but this favors long phrases. In this work, we therefore set the phrase frequency in the source in relation to the document frequency of the phrase in a reference collection. Unfortunately, with an increasing number of words in the source the relation between those two components diverges and the influence of the inverse document frequency diminishes. For typical English documents, e.g., news reports, the maximum term frequency ranges around 500. If we analyze the concatenation of thousands or even hundreds of thousands of documents, however, the most frequent term can easily appear more often than 10,000 times. Hence, we introduce a sublinear scaling factor to adapt the phrase frequency depending on the size of the source:

    PF-IDF(p_i) = s(p_i) = f_s(p_i)^µ · ln(N / f_d(p_i))    (2)

If the maximum term frequency f_s^max exceeds 500, we set µ = log_{f_s^max}(500), otherwise µ = 1. This means we non-linearly scale the maximum term frequency in the source down to the upper limit of 500 while keeping the scaled frequency of terms that only appear once at 1.
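A sketch of this scoring in code; the upper limit of 500 follows the typical maximum term frequency mentioned above, and the function name and signature are our own, not part of the paper:

```python
import math

def pf_idf(f_s, f_d, N, f_max, limit=500):
    """PF-IDF score of a phrase (sketch).

    f_s:   phrase frequency in the source
    f_d:   document frequency of the phrase in the reference collection
    N:     number of documents in the reference collection
    f_max: maximum term frequency in the source
    """
    # Sublinear scaling: choose mu so that f_max ** mu == limit when the
    # most frequent term exceeds the limit; a frequency of 1 stays at 1
    # because 1 ** mu == 1 for any mu.
    mu = math.log(limit, f_max) if f_max > limit else 1.0
    return (f_s ** mu) * math.log(N / f_d)
```

With µ = 1 the formula reduces to plain TF-IDF applied to phrases, which corresponds to the bottom ranking in Figure 1.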
In other words, we adjust the term or phrase frequency such that the typical relation between term frequency and inverse document frequency remains similar irrespective of the size of the collection.

Efficient Candidate Extraction

Unfortunately, extracting every {1, 2, ..., m}-gram in the source and computing the corresponding document frequency is not feasible, especially if the source comprises millions of sentences. We exploit the fact that in most use cases we only want to extract the top k keyphrases, e.g., the top 1000. With this assumption we can speed up the process of extracting candidates as described in this section.

(1) Extracting Uni- and Bigrams: We first extract uni- and bigrams (excluding stop words), calculate their respective PF-IDF scores s(p_i), sort the results in descending order, and store the source and document frequencies in a map-like structure for fast retrieval. The score at position k (s_k) is a lower limit, i.e., we can divide s_k by the maximum possible inverse document frequency ln(N) and raise the result to the power of 1/µ to retrieve the minimum frequency threshold f_th. It follows that every pattern in our final top k has to have a frequency of at least f_th. Hence, we only need to extract phrases that appear at least as often as our threshold in the source. The higher f_th, the higher the speedup.

(2) Extracting Longer Phrases: We now need to calculate the frequencies of phrases that contain more than two words. We ignore phrases that only contain stop words or appear only once (in case the minimum frequency is 1). In the naive approach we would need to look at and count every {3, ..., m}-gram at every position in the source. With the lower limit of f_th, though, we can stop the inner loop early if the frequency of the current bigram is below the threshold.
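The threshold derivation and the pruned counting loop can be sketched as follows (our own helper names, not the authors' implementation; bigram_counts is assumed to come from step 1):

```python
import math
from collections import Counter

def frequency_threshold(s_k, N, mu):
    """Minimum source frequency a phrase needs to reach score s_k.

    From s = f**mu * idf with the maximum possible inverse document
    frequency ln(N): f_th = (s_k / ln N) ** (1 / mu).
    """
    return (s_k / math.log(N)) ** (1.0 / mu)

def extract_long_phrases(tokens, bigram_counts, f_th, max_n):
    """Count {3, ..., max_n}-grams, pruning with the bigram bound.

    Every bigram inside a phrase occurs at least as often as the phrase
    itself, so once the newest bigram falls below f_th no extension of
    the current phrase can reach the threshold and the inner loop stops.
    """
    counts = Counter()
    for i in range(len(tokens)):
        for n in range(2, max_n + 1):
            if i + n > len(tokens):
                break
            # O(1) lookup of the newest bigram, counted in step 1.
            if bigram_counts[tuple(tokens[i + n - 2:i + n])] < f_th:
                break
            if n >= 3:
                counts[tuple(tokens[i:i + n])] += 1
    # The bigram frequency is only an upper bound, so filter once more.
    return {p: c for p, c in counts.items() if c >= f_th}
```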
The frequency of any bigram in a sequence is an upper limit on the frequency of that sequence and any longer sequence. Retrieving a bigram frequency is in O(1) because we have already counted these in the first step. At the end of this step we discard every phrase that does not meet our frequency threshold f_th.

(3) Discarding Redundant Sub-Phrases: For each phrase (v_i1, ..., v_im) we also have m − 1 sub-phrases (v_i1), (v_i1, v_i2), ... in our candidate set from the previous steps. We want to discard those sub-phrases that have the same frequency in the source, because they only appear as part of the longer sequence.

(4) Calculating the Document Frequency: We need to retrieve the document frequency of each candidate phrase in our reference collection. To speed up the process we can first build a bigram-based index of the collection. Then, we can calculate the PF-IDF score of every phrase and discard patterns with a lower score than our threshold s_k.

Condensing the Candidate Set

After applying the first part of our pipeline, we could already extract the top k {1, 2, ..., m}-grams in the source according to the PF-IDF weighting scheme. However, among these candidates we often have several variations of similar phrases with slightly different frequencies. Hence, we want to further condense the field of candidates to retrieve the most salient and descriptive keyphrases.

Stop Word-Heavy Candidates:
We first remove candidates if they only contain one term v_i that is not a stop word and s(v_i) < s_k, i.e., only the additional stop word put the term above the threshold.

Redundant Longer Candidates:
Second, we remove longer phrases that provide little additional context. For instance, we want to discard at a birthday party if birthday party is one of our candidate phrases. We say a phrase p_j is a longer phrase of p_i if the sequence p_j contains the sequence p_i. For each phrase p_i we determine whether there is a longer phrase p_j with at most two additional words in front of p_i and/or after p_i, the overhang. We only keep the longer phrase if the individual PF-IDF score of any overhang is high enough (and the overhang does not contain only stop words), i.e., s(v_j) ≥ λ·s_k for an overhanging word v_j or s((v_j, v_j+1)) ≥ λ·s_k for an overhanging bigram. In the remaining part of the paper we set λ = 0. . A lower λ increases the number of additional phrase variations. If the overhang to the left or right is more than two words, we assume that the longer phrase is unique enough compared to the shorter phrase.

Redundant Shorter Candidates:
Third, we discard shorter phrases that are already well represented by longer phrases. Given a candidate phrase p_i, we determine the set M_i of the shortest and distinct longer phrases among the candidates, i.e., any phrase p_j ∈ M_i must not be a longer phrase of any other p_l ∈ M_i and must be incompatible with any other p_l ∈ M_i. A phrase p_j is incompatible with p_l if they share a common subsequence (p_i in this case), but continue differently in either direction. As an example, happy birthday is incompatible with great birthday, but not with birthday party. We remove the phrase p_i if s(p_i) − Σ_{p_j ∈ M_i} s(p_j) < s_k. For instance, we would discard day if the candidates memorial day and st patricks day were already covering most occurrences of day.

Most approaches try to find good initial candidates so that they only need to rank these in the final step. In contrast, our approach first collects candidates in a broader way, performs an initial ranking, and then reduces the set of keyphrases for the final ranking. The advantage of this strategy is that it imposes far fewer restrictions on potential keyphrases: the final list may contain phrases that start and/or end with stop words as well as complete sentences. At the same time, the second part of our pipeline keeps the number of redundant phrases low.

Figure 1 depicts an example of our approach applied to 1m tweets. It shows that the top keyphrases contain both single terms and longer phrases, and that our sublinear scaling reduces the score of frequent terms that reveal little context.

Ranking with adjusted phrase frequency (PF-IDF, µ ≠ 1): hampshire (22966), nhprimary2020 (6365), roger stone (14946), victory tonight is the beginning of the end for donald trump (5065), camden fairview high school (4441), hampshire primary (4232), stone case (4152), buttigieg (8483), bernie sanders (20571)

Ranking with plain phrase frequency (TF-IDF, µ = 1): trump (102388), bernie (44728), hampshire (22966), bernie sanders (20571), roger stone (14946), primary (17560), doj (17921), people (39007), prosecutors (14827)

Figure 1: Top 10 keyphrases from 1m tweets published around Feb 12, 2020 (phrase frequency in brackets). The top list depicts the final ranking using the adjusted phrase frequency (PF-IDF, µ ≠ 1), and the bottom list the ranking using the plain phrase frequency (TF-IDF, µ = 1).

Evaluation

              1k tweets (20k words)   100k tweets (2m words)   1m tweets (20m words)
              Time (s)   Speed-Up     Time (s)   Speed-Up      Time (s)   Speed-Up
Baseline      0.896      1            49.16      1             642.28     1
Top k=100
Top k=1000

Table 1: Performance of our candidate selection process compared to counting every {1, 2, ..., m}-gram (baseline).

We want to investigate the speed-up of our phrase candidate extraction pipeline (steps 1 to 4) compared to the baseline, that is, extracting and calculating the PF-IDF score of every {1, 2, ..., m}-gram in the source. To make the comparison fair, we disabled parallelization and used the same methods for both approaches to discard sub-phrases (step 3) and calculate the PF-IDF score (step 4), including the bigram-based index structure to quickly determine the document frequency in the reference collection. We tested different configurations on a collection of lowercase tweets without punctuation and report the average duration of three runs. The reference collection to determine the inverse document frequency comprises 2m tweets. Each tweet is made up of 20 words on average. Table 1 shows that our selection process is between one and two orders of magnitude faster than counting every m-gram. The run time of the second part of our pipeline is negligible compared to the selection process; it needs approximately between 200 and 400 ms for the third case with 20m words. Both approaches need to tokenize the input and convert it to a vector-based representation, which takes about as long as our k = 100 candidate selection process. It should be noted that tagging the same amount of data with a decent POS tagger typically takes even longer than our baseline: for instance, the popular Stanford POS tagger would need approximately between half an hour and five hours to process 20m tokens, depending on the model (Horsmann et al., 2015).

The goal of our method is to analyze collections of documents rather than single documents, but we still evaluated its performance on the SemEval (Kim et al., 2010), Krapivin (Krapivin, 2008), Inspec (Hulth, 2003), and NUS (Nguyen and Kan, 2008) datasets to compare it with previous work. We largely follow the procedure of Meng et al.
(2017), which was also done by Chen et al. (2020). We analyzed titles and abstracts, and measured the F1@k scores of present keyphrases of the gold standard. We applied stemming when comparing the extracted keyphrases with the gold standard and when determining which keyphrases of the gold standard are present, to make our results comparable with reported scores in related work. We used the list of English stop words from the NLTK toolkit (https://gist.github.com/sebleier/554280). TF-IDF (POS) describes the TF-IDF-based baseline method that uses POS-based rules for retrieving candidates (Hasan and Ng, 2010). The results in Table 2 show that while the supervised recurrent neural network-based techniques take the lead, our approach is competitive among the unsupervised methods.
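The matching step of this evaluation protocol can be sketched as follows; since the exact stemmer is not specified here, we substitute a deliberately crude suffix stripper (a real setup would use, e.g., a Porter stemmer), so the snippet is illustrative only:

```python
def crude_stem(word):
    """Very rough stand-in for a real stemmer such as Porter's."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def stem_phrase(phrase):
    """Lowercase, tokenize, and stem a keyphrase for comparison."""
    return tuple(crude_stem(w) for w in phrase.lower().split())

def f1_at_k(predicted, gold, k):
    """F1@k of the top-k predictions against the gold keyphrases,
    with both sides stemmed before comparison."""
    pred = [stem_phrase(p) for p in predicted[:k]]
    gold_set = {stem_phrase(g) for g in gold}
    correct = sum(1 for p in pred if p in gold_set)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold_set)
    return 2 * precision * recall / (precision + recall)
```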
Conclusion

We presented a new technique for extracting keyphrases that exhibits several advantages which are particularly relevant if continuously incoming data has to be analyzed in a timely manner. It can efficiently analyze large collections and imposes few restrictions on the length and structure of keyphrases, but it also performs reasonably well if targeted at individual documents.
Acknowledgments
This research was supported by the German Science Foundation (DFG) as part of the project VAOST (project number 392087235) and as part of the Priority Program VA4VGI (SPP 1894).
References
Zakariae Alami Merrouni, Bouchra Frikh, and Brahim Ouhbi. 2020. Automatic keyphrase extraction: a survey and trends. Journal of Intelligent Information Systems.

Table 2: Benchmarks on present keyphrase prediction (F1@5 and F1@10 on SemEval, Krapivin, Inspec, and NUS) with reported values from a) Meng et al. (2017), b) Chen et al. (2020), and c) Martinc et al. (2020). The last two RNN-based methods are supervised.

Adrien Bougouin and Florian Boudin. 2013. TopicRank: Graph-based topic ranking for keyphrase extraction. In Proc. IJCNLP 2013.

Ricardo Campos, Vítor Mangaravite, Arian Pasquali, Alípio Jorge, Célia Nunes, and Adam Jatowt. 2020. YAKE! Keyword extraction from single documents using multiple local features. Information Sciences.

Jun Chen, Xiaoming Zhang, Yu Wu, Zhao Yan, and Zhoujun Li. 2020. Keyphrase generation with correlation constraints. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018.

Samhaa R. El-Beltagy and Ahmed Rafea. 2009. KP-Miner: A keyphrase extraction system for English and Arabic documents. Information Systems.

Kazi Saidul Hasan and Vincent Ng. 2010. Conundrums in unsupervised keyphrase extraction: Making sense of the state-of-the-art. In Coling 2010 - 23rd International Conference on Computational Linguistics, Proceedings of the Conference.

Tobias Horsmann, Nicolai Erbs, and Torsten Zesch. 2015. Fast or accurate? - A comparative evaluation of PoS tagging models. Proceedings of the International Conference of the German Society for Computational Linguistics and Language Technology (GSCL-2015).

Anette Hulth. 2003. Improved automatic keyword extraction given more linguistic knowledge.

Su Nam Kim, Olena Medelyan, Min Yen Kan, and Timothy Baldwin. 2010. SemEval-2010 Task 5: Automatic keyphrase extraction from scientific articles. In ACL 2010 - SemEval 2010 - 5th International Workshop on Semantic Evaluation, Proceedings.

Mikalai Krapivin. 2008. Large dataset for keyphrase extraction. Technical Report.

Matej Martinc, Blaž Škrlj, and Senja Pollak. 2020. TNT-KID: Transformer-based neural tagger for keyword identification.

Rui Meng, Sanqiang Zhao, Shuguang Han, Daqing He, Peter Brusilovsky, and Yu Chi. 2017. Deep keyphrase generation. In ACL 2017 - 55th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers).

Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into texts. Proceedings of EMNLP.

Thuy Dung Nguyen and Min-Yen Kan. 2008. Keyphrase extraction in scientific publications. In Asian Digital Libraries. Looking Back 10 Years and Forging New Frontiers.

Stuart Rose, Dave Engel, Nick Cramer, and Wendy Cowley. 2010. Automatic keyword extraction from individual documents. In Text Mining: Applications and Theory.

Tokala Yaswanth Sri Sai Santosh, Debarshi Kumar Sanyal, Plaban Kumar Bhowmick, and Partha Pratim Das. 2020. DAKE: Document-level attention for keyphrase extraction.

Blaž Škrlj, Andraž Repar, and Senja Pollak. 2019. RaKUn: Rank-based keyword extraction via unsupervised learning and meta vertex aggregation. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics).

Xiaojun Wan and Jianguo Xiao. 2008. Single document keyphrase extraction using neighborhood knowledge. In Proceedings of the National Conference on Artificial Intelligence.

Yue Wang, Jing Li, Hou Pong Chan, Irwin King, Michael R. Lyu, and Shuming Shi. 2019. Topic-aware neural keyphrase generation for social media language.

Lee Xiong, Chuan Hu, Chenyan Xiong, Daniel Campos, and Arnold Overwijk. 2019. Open domain web keyphrase extraction beyond language modeling.

Hai Ye and Lu Wang. 2020. Semi-supervised learning for neural keyphrase generation. In