Title-Guided Encoding for Keyphrase Generation
Wang Chen, Yifan Gao, Jiani Zhang, Irwin King, Michael R. Lyu
Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong
Shenzhen Key Laboratory of Rich Media Big Data Analytics and Application, Shenzhen Research Institute, The Chinese University of Hong Kong, Shenzhen, China
{wchen, yfgao, jnzhang, king, lyu}@cse.cuhk.edu.hk

Abstract
Keyphrase generation (KG) aims to generate a set of keyphrases given a document, which is a fundamental task in natural language processing (NLP). Most previous methods solve this problem in an extractive manner, while recently several attempts have been made under the generative setting using deep neural networks. However, the state-of-the-art generative methods simply treat the document title and the document main body equally, ignoring the leading role of the title to the overall document. To solve this problem, we introduce a new model called Title-Guided Network (TG-Net) for the automatic keyphrase generation task, based on the encoder-decoder architecture with two new features: (i) the title is additionally employed as a query-like input, and (ii) a title-guided encoder gathers the relevant information from the title to each word in the document. Experiments on a range of KG datasets demonstrate that our model outperforms the state-of-the-art models by a large margin, especially for documents with either very low or very high title length ratios.
Introduction
Keyphrases are short phrases that can quickly provide the main information of a given document (the terms "document", "source text", and "context" are interchangeable in this study; all of them denote the concatenation of the title and the main body). Because of their succinct and accurate expression, keyphrases are widely used in information retrieval (Jones and Staveley 1999), document categorizing (Hulth and Megyesi 2006), opinion mining (Berend 2011), etc. Due to this huge potential value, various automatic keyphrase extraction and generation methods have been developed. As shown in Figure 1, the input usually consists of the title and the main body, and the output is a set of keyphrases.

Title: Within-document retrieval: A user-centred evaluation of relevance profiling
Main Body: We present a user-centred, task-oriented, comparative evaluation of two within-document retrieval tools. ProfileSkim computes a relevance profile for a document with respect to a query, and presents the profile as an interactive bar graph. … Relevance profiling should prove highly beneficial for users trying to identify relevant information within long documents. …
(a) Present Keyphrases: {within-document retrieval; relevance profiling}
(b) Absent Keyphrases: {interactive information retrieval; task-oriented evaluation; language models}

Figure 1: An example of keyphrase generation. The present keyphrases are bold and italic in the source text.

Most typical automatic keyphrase extraction methods (Witten et al. 1999; Medelyan, Frank, and Witten 2009; Mihalcea and Tarau 2004) focus on extracting present keyphrases like "relevance profiling" in Figure 1, which are phrases that appear verbatim in the source text. Their main idea is to first identify candidate phrases and then rank them. However, these methods ignore the semantic meaning underlying the context content and cannot generate absent keyphrases like "interactive information retrieval", which do not appear in the source text.

To overcome the above drawbacks, several encoder-decoder based keyphrase generation methods have been proposed, including CopyRNN (Meng et al. 2017) and CopyCNN (Zhang, Fang, and Weidong 2017). First, these methods treat the title and the main body equally and concatenate them as the only source text input. Then, the encoder maps each source text word into a hidden state vector, which is regarded as its contextual representation. Finally, based on these representations, the decoder generates keyphrases from a predefined vocabulary, regardless of the presence or absence of the keyphrases in the source text. A serious drawback of these models is that they ignore the leading role of the title and consequently fail to sufficiently utilize the already summarized information in it.

It is widely agreed that the title can be viewed as a high-level summary of a document, while the keyphrases provide more details of the key topics introduced in the document (Li et al. 2010). The two play a similar and complementary role with each other. Therefore, keyphrases should have close semantic meaning to the title (Li et al. 2010). For example, as shown in Figure 1, the title contains most of the salient points reflected by the keyphrases, including "retrieval", "profiling", and "evaluation". Statistically, we study the proportion of keyphrases related to the title on the largest KG dataset and show the results in Table 1. For simplicity, we define a TitleRelated keyphrase as a keyphrase containing at least one common non-stop-word with the title. From Table 1, we find that about 33% of absent keyphrases are TitleRelated. For present keyphrases, the TitleRelated percentage is up to around 60%. Considering that the length of a title is usually only 3%-6% of the corresponding source text, we can conclude that the title indeed contains highly summative and valuable information for generating keyphrases. Moreover, information in the title is also helpful in reflecting which parts of the main body are essential, such as the parts containing the same or related information as the title. For instance, in Figure 1, the point "evaluation" in the title can guide us to focus on the part "... task-oriented, comparative evaluation ..." of the main body, which is highly related to the absent keyphrase "task-oriented evaluation".

           Keyphrase   TitleRelated   %
Present    54,403      32,328         59.42
Absent     42,997      14,296         33.25

Table 1: The statistics of TitleRelated keyphrases on the validation set of KP20k.

To sufficiently leverage the title content, we introduce a new title-guided network that brings the above observation into the keyphrase generation scenario. In our model, the title is additionally treated as a query-like input in the encoding stage. First, two bi-directional Gated Recurrent Unit (GRU) (Cho et al. 2014) layers are adopted to separately encode the context and the title into corresponding contextual representations. Then, an attention-based matching layer gathers the relevant title information for each context word according to their semantic relatedness. Since the context is the concatenation of the title and the main body, this layer implicitly contains two parts. The former is the "title to title" self-matching, which aims to make the salient information in the title more important. The latter is the "main body to title" matching, wherein the title information is employed to reflect the importance of information in the main body. Next, an extra bi-directional GRU layer merges the original contextual information and the corresponding gathered title information into a final title-guided representation for each context word. Finally, the decoder, equipped with attention and copy mechanisms, utilizes the final title-guided context representation to predict keyphrases.

We evaluate our model on five real-world benchmarks, which test the ability of our model to predict present and absent keyphrases. Using these benchmarks, we demonstrate that our model can effectively exploit the title information and that it outperforms the relevant baselines by a significant margin: for present (absent) keyphrase prediction, the improvement gain in F1-measure at 10 (recall at 50) is up to 9.4% (19.1%) over the best baseline on the largest dataset. Besides, we probe the performance of our model and the strong baseline CopyRNN on documents with different title length ratios (i.e., the title length over the context length). Experimental results show that our model consistently improves the performance with large gains, especially for documents with either very low or very high title length ratios.

Our main contributions consist of three parts:
• A new perspective on keyphrase generation is explored, which sufficiently employs the title to guide the keyphrase prediction process.
• A novel TG-Net model is proposed, which can effectively leverage the useful information in the title.
• The overall empirical results on five real-world benchmarks show that our model outperforms the state-of-the-art models significantly on both present and absent keyphrase prediction, especially for documents with either very low or very high title length ratios.
Related Work
Automatic Keyphrase Extraction
Most automatic keyphrase extraction methods consist of two steps. First, the candidate identification step obtains a set of candidate phrases, such as phrases with specific part-of-speech (POS) tags (Medelyan, Frank, and Witten 2009; Witten et al. 1999). Second, in the ranking step, all the candidates are ranked based on their importance, computed by either unsupervised ranking approaches (Wan and Xiao 2008; Mihalcea and Tarau 2004; Florescu and Caragea 2017) or supervised machine learning approaches (Medelyan, Frank, and Witten 2009; Witten et al. 1999; Nguyen and Luong 2010; Florescu and Jin 2018). Finally, the top-ranked candidates are selected as the keyphrases. Besides these widely developed two-step approaches, some methods use a sequence labeling operation to extract keyphrases (Zhang et al. 2016; Luan, Ostendorf, and Hajishirzi 2017; Gollapalli, Li, and Yang 2017). But they still cannot generate absent keyphrases.

Some extraction approaches (Li et al. 2010; Liu et al. 2011) also consider the influence of the title. Li et al. (2010) propose a graph-based ranking algorithm that initializes the importance score of title phrases as one and all others as zero, and then propagates the influence of title phrases iteratively. The biggest difference between Li et al. (2010) and our method is that our method utilizes the contextual information of the title to guide the context encoding, while their model only considers phrase occurrence in the title. Liu et al. (2011) model the keyphrase extraction process as a translation operation from a document to its keyphrases, where the title is used as the target output to train the translator. Compared with our model, one difference is that this method still cannot handle the semantic meaning of the context. The other is that our model regards the title as an extra query-like input instead of a target output.
Automatic Keyphrase Generation
Keyphrase generation is an extension of keyphrase extraction that explicitly considers absent keyphrase prediction. CopyRNN (Meng et al. 2017) first frames the generation process as a sequence-to-sequence learning task and employs the widely used encoder-decoder framework (Sutskever, Vinyals, and Le 2014) with attention (Bahdanau, Cho, and Bengio 2015) and copy (Gu et al. 2016) mechanisms. Based on CopyRNN, various extensions (Hai and Lu 2018; Jun et al. 2018) have recently been proposed. However, these recurrent neural network (RNN) based models may suffer from low efficiency because of the computational dependency between the current time step and the preceding time steps in an RNN. To overcome this shortcoming, CopyCNN (Zhang, Fang, and Weidong 2017) applies a convolutional neural network (CNN) based encoder-decoder model (Gehring et al. 2017). CopyCNN employs position embeddings to obtain a sense of order in the input sequence and adopts gated linear units (GLU) (Dauphin et al. 2017) as the non-linearity function. CopyCNN not only achieves much faster keyphrase generation speed but also outperforms CopyRNN on five real-world benchmark datasets.

Nevertheless, both CopyRNN and CopyCNN treat the title and the main body equally, which ignores the semantic similarity between the title and the keyphrases. Motivated by the success of query-based encoding in various natural language processing tasks (Gao et al. 2018; Song, Wang, and Hamza 2017; Nema et al. 2017; Wang et al. 2017), we regard the title as an extra query-like input to guide the source context encoding. Consequently, we propose a TG-Net model to explicitly explore the useful information in the title. In this paper, we focus on how to incorporate title-guided encoding into an RNN-based model, but it is also convenient to apply this idea to a CNN-based model in a similar way.
Problem Definition
We denote vectors with bold lowercase letters, matrices with bold uppercase letters, and sets with calligraphic letters. We denote $\Theta$ as a set of parameters and $\mathbf{W}$ as a parameter matrix.

Keyphrase generation (KG) is usually formulated as follows: given a context $\mathbf{x}$, which is the concatenation of the title and the main body, output a set of keyphrases $\mathcal{Y} = \{\mathbf{y}^i\}_{i=1,\ldots,M}$, where $M$ is the number of keyphrases of $\mathbf{x}$. Here, the context $\mathbf{x} = [x_1, \ldots, x_{L_x}]$ and each keyphrase $\mathbf{y}^i = [y^i_1, \ldots, y^i_{L_{y^i}}]$ are both word sequences, where $L_x$ is the length (i.e., the total word number) of the context and $L_{y^i}$ is the length of the $i$-th keyphrase $\mathbf{y}^i$.

To adapt the encoder-decoder framework, the data is usually split into $M$ context-keyphrase pairs $\{(\mathbf{x}, \mathbf{y}^i)\}_{i=1,\ldots,M}$. Since we additionally use the title $\mathbf{t} = [t_1, \ldots, t_{L_t}]$ with length $L_t$ as an extra query-like input, we instead split the data into $M$ context-title-keyphrase triplets $\{(\mathbf{x}, \mathbf{t}, \mathbf{y}^i)\}_{i=1,\ldots,M}$ to feed our model. For conciseness, we use $(\mathbf{x}, \mathbf{t}, \mathbf{y})$ to represent such a triplet, where $\mathbf{y}$ is one of the target keyphrases.
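As a concrete illustration of this data preparation, the following minimal Python sketch (our own illustrative helper, not part of the paper's released code) splits one document into the context-title-keyphrase triplets described above:

```python
def make_triplets(title_tokens, body_tokens, keyphrases):
    """Split one document into M (context, title, keyphrase) triplets.

    The context x is the concatenation of the title t and the main body;
    each of the M keyphrases y^i yields one training triplet (x, t, y^i).
    """
    context = title_tokens + body_tokens  # x = [title; main body]
    return [(context, title_tokens, kp) for kp in keyphrases]


# Toy usage: a document with two keyphrases yields two triplets.
triplets = make_triplets(
    ["within-document", "retrieval"],
    ["we", "present", "a", "user-centred", "evaluation"],
    [["relevance", "profiling"], ["interactive", "information", "retrieval"]],
)
assert len(triplets) == 2  # M = 2
```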
Our Proposed Model

Title-Guided Encoder Module
As shown in Figure 2, the title-guided encoder module consists of a sequence encoding layer, a matching layer, and a merging layer. First, the sequence encoding layer reads the context input and the title input and learns their contextual representations separately. Then, the matching layer gathers the relevant title information for each context word, reflecting the important parts of the context. Finally, the merging layer merges the aggregated title information into each context word, producing the final title-guided context representation.

[Figure 2: The title-guided encoder module. (Best viewed in color.)]
Sequence Encoding Layer
At first, an embedding look-up table is applied to map each word within the context and the title into a dense vector of a fixed size $d_e$. To incorporate contextual information into the representation of each word, two bi-directional GRUs (Cho et al. 2014) are used to encode the context and the title respectively:

$\overrightarrow{\mathbf{u}}_i = \mathrm{GRU}(\mathbf{x}_i, \overrightarrow{\mathbf{u}}_{i-1})$,   (1)
$\overleftarrow{\mathbf{u}}_i = \mathrm{GRU}(\mathbf{x}_i, \overleftarrow{\mathbf{u}}_{i+1})$,   (2)
$\overrightarrow{\mathbf{v}}_j = \mathrm{GRU}(\mathbf{t}_j, \overrightarrow{\mathbf{v}}_{j-1})$,   (3)
$\overleftarrow{\mathbf{v}}_j = \mathrm{GRU}(\mathbf{t}_j, \overleftarrow{\mathbf{v}}_{j+1})$,   (4)

where $i = 1, 2, \ldots, L_x$ and $j = 1, 2, \ldots, L_t$. $\mathbf{x}_i$ and $\mathbf{t}_j$ are the $d_e$-dimensional embedding vectors of the $i$-th context word and the $j$-th title word respectively. $\overrightarrow{\mathbf{u}}_i$, $\overleftarrow{\mathbf{u}}_i$, $\overrightarrow{\mathbf{v}}_j$, and $\overleftarrow{\mathbf{v}}_j$ are $d/2$-dimensional hidden vectors, where $d$ is the hidden dimension of the bi-directional GRUs. The concatenations $\mathbf{u}_i = [\overrightarrow{\mathbf{u}}_i; \overleftarrow{\mathbf{u}}_i] \in \mathbb{R}^d$ and $\mathbf{v}_j = [\overrightarrow{\mathbf{v}}_j; \overleftarrow{\mathbf{v}}_j] \in \mathbb{R}^d$ are used as the contextual vectors of the $i$-th context word and the $j$-th title word respectively.
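To make this layer concrete, here is a minimal PyTorch sketch under our own naming and batching assumptions (it is not the authors' released code); the hyperparameter defaults follow the implementation details reported later ($d_e = 100$, $d = 256$):

```python
import torch.nn as nn

class SequenceEncodingLayer(nn.Module):
    """Embeds context/title words and encodes them with bi-directional GRUs.

    Produces u_i (context) and v_j (title) contextual vectors of size d,
    i.e., the concatenation of d/2-dimensional forward/backward states.
    """
    def __init__(self, vocab_size, d_e=100, d=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_e)  # shared look-up table
        self.context_gru = nn.GRU(d_e, d // 2, bidirectional=True, batch_first=True)
        self.title_gru = nn.GRU(d_e, d // 2, bidirectional=True, batch_first=True)

    def forward(self, context_ids, title_ids):
        u, _ = self.context_gru(self.embed(context_ids))  # (B, L_x, d)
        v, _ = self.title_gru(self.embed(title_ids))      # (B, L_t, d)
        return u, v
```

Padding, masking, and packed sequences are omitted here for brevity.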
Matching Layer

The attention-based matching layer is engaged to aggregate the relevant information from the title for each word within the context. The aggregation operation $\mathbf{c}_i = \mathrm{attn}(\mathbf{u}_i, [\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_{L_t}]; \mathbf{W}_1)$ is as follows:

$\mathbf{c}_i = \sum_{j=1}^{L_t} \alpha_{i,j} \mathbf{v}_j$,   (5)
$\alpha_{i,j} = \exp(s_{i,j}) / \sum_{k=1}^{L_t} \exp(s_{i,k})$,   (6)
$s_{i,j} = \mathbf{u}_i^{\mathrm{T}} \mathbf{W}_1 \mathbf{v}_j$,   (7)

where $\mathbf{c}_i \in \mathbb{R}^d$ is the aggregated information vector for the $i$-th word of $\mathbf{x}$, and $\alpha_{i,j}$ ($s_{i,j}$) is the normalized (unnormalized) attention score between $\mathbf{u}_i$ and $\mathbf{v}_j$.

Here, the matching layer is implicitly composed of two parts because the context is a concatenation of the title and the main body. The first part is the "title to title" self-matching part, wherein each title word attends to the whole title itself and gathers the relevant title information. This part is used to strengthen the important information in the title itself, which is essential for capturing the core information because the title already contains much highly summative information. The other part is the "main body to title" matching part, wherein each main body word also aggregates the relevant title information based on semantic relatedness. In this part, the title information is employed to reflect the importance of information in the main body, based on the fact that highly title-related information in the main body should contain core information. Through these two parts, this matching layer can utilize the title information much more sufficiently than any of the previous sequence-to-sequence methods.
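A minimal sketch of Eqs. (5)-(7) in the same hedged PyTorch style follows; note that the bilinear matrix is realized here as a bias-free linear map, which is equivalent up to a transpose of the learned parameter:

```python
import torch
import torch.nn as nn

class MatchingLayer(nn.Module):
    """Bilinear attention that aggregates title information for each
    context word: s_ij = u_i^T W v_j, alpha = softmax over title words,
    c_i = sum_j alpha_ij * v_j (Eqs. 5-7)."""
    def __init__(self, d=256):
        super().__init__()
        self.W = nn.Linear(d, d, bias=False)  # the bilinear parameter matrix

    def forward(self, u, v):
        # u: (B, L_x, d) context vectors; v: (B, L_t, d) title vectors
        scores = torch.bmm(self.W(u), v.transpose(1, 2))  # (B, L_x, L_t)
        alpha = torch.softmax(scores, dim=-1)             # normalize over title
        c = torch.bmm(alpha, v)                           # (B, L_x, d)
        return c
```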
Merging Layer

Finally, the original contextual vector $\mathbf{u}_i$ and the aggregated information vector $\mathbf{c}_i$ are used as the inputs of another information merging layer:

$\overrightarrow{\mathbf{m}}_i = \mathrm{GRU}([\mathbf{u}_i; \mathbf{c}_i], \overrightarrow{\mathbf{m}}_{i-1})$,   (8)
$\overleftarrow{\mathbf{m}}_i = \mathrm{GRU}([\mathbf{u}_i; \mathbf{c}_i], \overleftarrow{\mathbf{m}}_{i+1})$,   (9)
$\widetilde{\mathbf{m}}_i = \lambda \mathbf{u}_i + (1 - \lambda) [\overrightarrow{\mathbf{m}}_i; \overleftarrow{\mathbf{m}}_i]$,   (10)

where $[\mathbf{u}_i; \mathbf{c}_i] \in \mathbb{R}^{2d}$, $\overrightarrow{\mathbf{m}}_i \in \mathbb{R}^{d/2}$, $\overleftarrow{\mathbf{m}}_i \in \mathbb{R}^{d/2}$, $[\overrightarrow{\mathbf{m}}_i; \overleftarrow{\mathbf{m}}_i] \in \mathbb{R}^d$, and $\widetilde{\mathbf{m}}_i \in \mathbb{R}^d$. The $\mathbf{u}_i$ in Eq. (10) is a residual connection, and $\lambda \in (0, 1)$ is the corresponding hyperparameter. Eventually, we obtain the title-guided contextual representation of the context (i.e., $[\widetilde{\mathbf{m}}_1, \widetilde{\mathbf{m}}_2, \ldots, \widetilde{\mathbf{m}}_{L_x}]$), which is regarded as a memory bank for the later decoding process.
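The merging layer can be sketched in a few lines; again this is our own simplified rendering of Eqs. (8)-(10), not the official implementation:

```python
import torch
import torch.nn as nn

class MergingLayer(nn.Module):
    """Merges u_i and c_i with another bi-GRU plus a residual connection
    weighted by lambda (Eqs. 8-10)."""
    def __init__(self, d=256, lam=0.5):
        super().__init__()
        self.gru = nn.GRU(2 * d, d // 2, bidirectional=True, batch_first=True)
        self.lam = lam

    def forward(self, u, c):
        m, _ = self.gru(torch.cat([u, c], dim=-1))  # (B, L_x, d)
        return self.lam * u + (1.0 - self.lam) * m  # title-guided memory bank
```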
Decoder Module

After encoding the context into the title-guided representation, we engage an attention-based decoder (Luong, Pham, and Manning 2015) incorporating a copy mechanism (See, Liu, and Manning 2017) to produce keyphrases. Only one forward GRU is used in this module:

$\mathbf{h}_t = \mathrm{GRU}([\mathbf{e}_{t-1}; \widetilde{\mathbf{h}}_{t-1}], \mathbf{h}_{t-1})$,   (11)
$\hat{\mathbf{c}}_t = \mathrm{attn}(\mathbf{h}_t, [\widetilde{\mathbf{m}}_1, \widetilde{\mathbf{m}}_2, \ldots, \widetilde{\mathbf{m}}_{L_x}]; \mathbf{W}_2)$,   (12)
$\widetilde{\mathbf{h}}_t = \tanh(\mathbf{W}_3 [\hat{\mathbf{c}}_t; \mathbf{h}_t])$,   (13)

where $t = 1, 2, \ldots, L_y$, $\mathbf{e}_{t-1} \in \mathbb{R}^{d_e}$ is the embedding of the $(t-1)$-th predicted word wherein $\mathbf{e}_0$ is the embedding of the start token, $\hat{\mathbf{c}}_t \in \mathbb{R}^d$ is the vector aggregated for $\mathbf{h}_t \in \mathbb{R}^d$ from the memory bank $[\widetilde{\mathbf{m}}_1, \widetilde{\mathbf{m}}_2, \ldots, \widetilde{\mathbf{m}}_{L_x}]$, and $\widetilde{\mathbf{h}}_t \in \mathbb{R}^d$ is the attentional vector at time step $t$.

Consequently, the predicted probability distribution over the predefined vocabulary $\mathcal{V}$ at the current step is computed by:

$P_v(y_t \mid y_{<t}, \mathbf{x}, \mathbf{t}) = \mathrm{softmax}(\mathbf{W}_v \widetilde{\mathbf{h}}_t + \mathbf{b}_v)$,   (14)

where $y_{<t} = [y_1, \ldots, y_{t-1}]$ is the previously predicted word sequence and $\mathbf{b}_v \in \mathbb{R}^{|\mathcal{V}|}$ is a learnable parameter vector.

Before generating the predicted word, a copy mechanism is adopted to efficiently exploit the in-text information and to strengthen the extraction capability of our model. We follow See, Liu, and Manning (2017) and first calculate a soft switch between generating from the vocabulary and copying from the source context $\mathbf{x}$ at time step $t$:

$g_t = \sigma(\mathbf{w}_g^{\mathrm{T}} \widetilde{\mathbf{h}}_t + b_g)$,   (15)

where $\mathbf{w}_g \in \mathbb{R}^d$ is a learnable parameter vector and $b_g$ is a learnable parameter scalar. Eventually, we get the final predicted probability distribution over the dynamic vocabulary $\mathcal{V} \cup \mathcal{X}$, where $\mathcal{X}$ is the set of all words appearing in the source context. For simplicity, we use $P_v(y_t)$ and $P_{\mathrm{final}}(y_t)$ to denote $P_v(y_t \mid y_{<t}, \mathbf{x}, \mathbf{t})$ and $P_{\mathrm{final}}(y_t \mid y_{<t}, \mathbf{x}, \mathbf{t})$ respectively:

$P_{\mathrm{final}}(y_t) = (1 - g_t) P_v(y_t) + g_t \sum_{i: x_i = y_t} \hat{\alpha}_{t,i}$,   (16)

where $\hat{\alpha}_{t,i}$ is the normalized attention score between $\mathbf{h}_t$ and $\widetilde{\mathbf{m}}_i$. For all out-of-vocabulary (OOV) words (i.e., $y_t \notin \mathcal{V}$), we set $P_v(y_t)$ to zero. Similarly, if word $y_t$ does not appear in the source context $\mathbf{x}$ (i.e., $y_t \notin \mathcal{X}$), the copy probability $\sum_{i: x_i = y_t} \hat{\alpha}_{t,i}$ is set to zero.
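As a rough, unbatched illustration of Eq. (16), the following sketch mixes the generation distribution with the copy distribution via the soft switch $g_t$ over the dynamic vocabulary (how source OOV words are assigned extended ids is an assumption of this sketch, not prescribed by the paper):

```python
import torch

def final_distribution(p_vocab, attn, g_t, context_ids, extended_size):
    """Mix generation and copy probabilities as in Eq. (16), unbatched.

    p_vocab:       (|V|,) softmax distribution over the predefined vocabulary
    attn:          (L_x,) normalized attention scores over context positions
    g_t:           scalar copy switch in (0, 1), from Eq. (15)
    context_ids:   (L_x,) long tensor of context word ids in the dynamic
                   vocabulary V ∪ X (source OOV words get ids >= |V|)
    extended_size: |V| plus the number of distinct source OOV words
    """
    p_final = torch.zeros(extended_size)
    p_final[: p_vocab.size(0)] = (1.0 - g_t) * p_vocab  # generation part
    # Copy part: scatter-add attention mass onto the words appearing in x.
    p_final.index_add_(0, context_ids, g_t * attn)
    return p_final
```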
Training

We use the negative log-likelihood loss to train our model:

$\mathcal{L} = -\sum_{t=1}^{L_y} \log P_{\mathrm{final}}(y_t \mid y_{<t}, \mathbf{x}, \mathbf{t}; \Theta)$,   (17)

where $L_y$ is the length of the target keyphrase $\mathbf{y}$, $y_t$ is the $t$-th target word in $\mathbf{y}$, and $\Theta$ represents all the learnable parameters.
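A minimal sketch of Eq. (17), assuming the per-step final distributions of one target keyphrase have already been stacked into a tensor:

```python
import torch

def keyphrase_nll(p_final_steps, target_ids, eps=1e-12):
    """Negative log-likelihood of one target keyphrase (Eq. 17).

    p_final_steps: (L_y, extended_vocab) per-step final distributions
    target_ids:    (L_y,) gold word ids in the dynamic vocabulary
    """
    picked = p_final_steps.gather(1, target_ids.unsqueeze(1)).squeeze(1)
    return -(picked + eps).log().sum()  # eps guards against log(0)
```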
Experiment Settings

The keyphrase prediction performance is first evaluated by comparing our model with popular extractive methods and the state-of-the-art generative methods on five real-world benchmarks. Then, comparative experiments on different title length ratios are performed on our model and CopyRNN for further model exploration. Finally, an ablation study and a case study are conducted to better understand and interpret our model. The experimental results lead to the following findings:
• Our model outperforms the state-of-the-art models on all five benchmark datasets for both present and absent keyphrase prediction.
• Our model consistently improves the performance on various title length ratios and obtains relatively higher improvement gains for both very low and very high title length ratios.
• The title-guided encoding part and the copy part are consistently effective in both present and absent keyphrase prediction tasks.
We implement the models using PyTorch (Paszke et al. 2017) on the basis of the OpenNMT-py system (Klein et al. 2017).

Training Dataset
Because of their public accessibility, many commonly-used scientific publication datasets are employed to evaluate KG methods. This study also focuses on generating keyphrases from scientific publications. For all the generative models (i.e., our TG-Net model as well as all the encoder-decoder baselines), we choose the largest publicly available keyphrase generation dataset, KP20k, constructed by Meng et al. (2017), as the training dataset. KP20k consists of a large number of high-quality scientific publications from various computer science domains. In total, 567,830 articles are collected in this dataset: 527,830 for training, 20,000 for validation, and 20,000 for testing. Both the validation set and the testing set are randomly selected. Since the other commonly-used datasets are too small to train a reliable generative model, we only train the generative models on KP20k and then test the trained models on the testing part of each dataset listed in Table 2. As for the traditional supervised extractive baseline, we follow Meng et al. (2017) and use the dataset configuration shown in Table 2. To avoid the out-of-memory problem on KP20k, we use the validation set to train the traditional supervised extractive baseline.
Testing Datasets
Besides KP20k, we also adopt four other widely-used scientific datasets for comprehensive testing, including Inspec (Hulth 2003), Krapivin (Krapivin, Autaeu, and Marchese 2009), NUS (Nguyen and Kan 2007), and SemEval-2010 (Kim et al. 2010). Table 2 summarizes the statistics of each testing dataset.

Dataset        Total     Training   Testing
Inspec
Krapivin
NUS            211       FFCV       211
SemEval-2010   288       188        100
KP20k          567,830   527,830    20,000

Table 2: Statistics of each testing dataset.
Implementation Details
For all datasets, the main body is the abstract, and the context is the concatenation of the title and the abstract. During preprocessing, various operations are performed, including lowercasing, tokenizing with CoreNLP (Manning et al. 2014), and replacing all digits with the symbol ⟨digit⟩. We define the vocabulary $\mathcal{V}$ as the 50,000 most frequent words. We set the embedding dimension $d_e$ to 100, the hidden size $d$ to 256, and $\lambda$ to 0.5. All the initial states of the GRU cells are set to zero vectors, except that $\mathbf{h}_0$ is initialized as $[\overrightarrow{\mathbf{m}}_{L_x}; \overleftarrow{\mathbf{m}}_1]$. We share the embedding matrix among the context words, the title words, and the target keyphrase words. All the trainable variables, including the embedding matrix, are initialized randomly from a uniform distribution in [-0.1, 0.1]. The model is optimized by Adam (Kingma and Ba 2015) with batch size 64, initial learning rate 0.001, gradient clipping 1, and dropout rate 0.1. We halve the learning rate when the validation perplexity stops dropping. Early stopping is applied when the validation perplexity stops dropping for three consecutive evaluations. During testing, we set the maximum depth of beam search to 6 and the beam size to 200. We repeat the experiments of our model three times using different random seeds and report the averaged results.

During testing on KP20k, we do not remove any predicted single-word phrase in post-processing, which differs from Meng et al. (2017), since our model is trained on this dataset and can effectively learn the distribution of single-word keyphrases. For the other testing datasets, we only keep the first predicted single-word phrase, following Meng et al. (2017).
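The learning rate decay and early stopping rules described above can be sketched as follows (a simplified rendering with hypothetical helper names; `optimizer` is any torch.optim-style optimizer):

```python
def maybe_halve_lr(optimizer, ppl_history):
    """Halve the learning rate when validation perplexity stops dropping.

    `ppl_history` is the list of validation perplexities observed so far.
    """
    if len(ppl_history) >= 2 and ppl_history[-1] >= ppl_history[-2]:
        for group in optimizer.param_groups:
            group["lr"] *= 0.5

def should_stop(ppl_history, patience=3):
    """Early-stop when perplexity has not dropped for `patience`
    consecutive evaluations."""
    if len(ppl_history) <= patience:
        return False
    best_before = min(ppl_history[:-patience])
    return all(p >= best_before for p in ppl_history[-patience:])
```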
Baseline Models and Evaluation Metric
For the present keyphrase prediction experiment, we use two unsupervised models, TF-IDF and TextRank (Mihalcea and Tarau 2004), and one supervised model, Maui (Medelyan, Frank, and Witten 2009), as our traditional extraction baselines. Besides, we also select CopyRNN (Meng et al. 2017) and CopyCNN (Zhang, Fang, and Weidong 2017), the two state-of-the-art encoder-decoder models with copy mechanism (Gu et al. 2016), as the baselines for the present keyphrase prediction task. As for absent keyphrase prediction, since the traditional extraction baselines cannot generate such keyphrases, we only choose CopyRNN and CopyCNN as the baseline models. For all baselines, we use the same setups as Meng et al. (2017) and Zhang, Fang, and Weidong (2017).

Recall and F-measure (F1) are employed as our metrics for evaluating these algorithms. Recall is the number of correctly predicted keyphrases over the total number of target keyphrases. The F1 score is computed from Recall and Precision, where Precision is defined as the number of correctly predicted keyphrases over the total number of predicted keyphrases. Following Meng et al. (2017) and Zhang, Fang, and Weidong (2017), we also employ the Porter Stemmer for preprocessing when determining whether two keyphrases match.
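For clarity, the following sketch shows one plausible way to compute these metrics at a cutoff k with Porter stemming (our own rendering; corner-case conventions such as empty prediction lists may differ from the authors' evaluation script):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_phrase(phrase):
    """Normalize a keyphrase by lowercasing and Porter-stemming each word."""
    return " ".join(stemmer.stem(w) for w in phrase.lower().split())

def scores_at_k(predicted, targets, k):
    """Exact-match Precision/Recall/F1 over the top-k predictions,
    with Porter stemming applied before matching."""
    topk = [stem_phrase(p) for p in predicted[:k]]
    gold = {stem_phrase(t) for t in targets}
    correct = sum(1 for p in set(topk) if p in gold)
    precision = correct / len(topk) if topk else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```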
Results and Analysis
Present Keyphrase Predicting
In this section, we compare the present keyphrase prediction ability of these models on the five real-world benchmark datasets. The F1-measures at the top 5 and top 10 predictions of each model are shown in Table 3.

Model      Inspec         Krapivin       NUS            SemEval        KP20k
           F1@5   F1@10   F1@5   F1@10   F1@5   F1@10   F1@5   F1@10   F1@5   F1@10
TF-IDF     0.221  0.313   0.129  0.160   0.136  0.184   0.128  0.194   0.102  0.162
TextRank   0.223  0.281   0.189  0.162   0.195  0.196   0.176  0.187   0.175  0.147
Maui       0.040  0.042   0.249  0.216   0.249  0.268   0.044  0.039   0.270  0.230
CopyRNN    0.278  0.342   0.311  0.266   0.334  0.326   0.293  0.304   0.333  0.262
CopyCNN    0.285  0.346   0.314  0.272   0.342  0.330   0.295  0.308   0.351  0.288
TG-Net
% gain     10.5%  10.1%   11.1%  8.5%    18.7%  12.1%   7.8%   4.5%    6.0%   9.4%

Table 3: Present keyphrase predicting results on all test datasets. "% gain" is the improvement gain over CopyCNN.

Model      Inspec         Krapivin       NUS            SemEval        KP20k
           R@10   R@50    R@10   R@50    R@10   R@50    R@10   R@50    R@10   R@50
CopyRNN    0.047  0.100   0.113  0.202   0.058  0.116   0.043  0.067   0.125  0.211
CopyCNN    0.050  0.107   0.119  0.205   0.062  0.120   0.044  0.074   0.147  0.225
TG-Net
% gain     26.0%  7.5%    22.7%  23.4%   21.0%  14.2%   2.3%   2.7%    6.1%   19.1%

Table 4: Absent keyphrase predicting results on all test datasets. "% gain" is the improvement gain over CopyCNN.

From Table 3, we find that all the generative models significantly outperform all the traditional extraction baselines. Besides, we also note that our TG-Net model achieves the best performance on all the datasets by significant margins. For example, on the KP20k dataset, our model improves the F1@10 score by 9.4% over the best generative model, CopyCNN. Compared to CopyRNN, which also applies an RNN-based framework, our model improves by about 20.2%. These results show that our model obtains much stronger keyphrase extraction ability than CopyRNN and CopyCNN.
Absent Keyphrase Predicting
In this setting, we consider the absent keyphrase predicting ability, which requires understanding the semantic meaning of the context. Only the absent target keyphrases and the absent predictions are preserved for this evaluation. Recalls at the top 10 and top 50 predictions are engaged as the metrics to evaluate how many absent target keyphrases are correctly predicted.

The performance of all models is listed in Table 4. It is observed that our TG-Net model consistently outperforms the previous sequence-to-sequence models on all the datasets. For instance, our model exceeds the state-of-the-art model CopyCNN by 19.1% in R@50 score on KP20k. Overall, the results indicate that our model is able to capture the underlying semantic meaning of the context content much better than these baselines, as we anticipated.
Keyphrase Predicting on Various Title Length Ratios
To find out how our title incorporation influences the prediction ability, we compare the keyphrase predicting ability of the two RNN-based models (i.e., our model and CopyRNN) on different title length ratios. The title length ratio is defined as the title length over the context length. This analysis is based on the KP20k testing dataset. According to the title length ratio, we split the testing set into five groups (i.e., <3%, 3%-6%, 6%-9%, 9%-12%, and >12%).

[Figure 3: Present keyphrase predicting ability (F1@5 measure) on various title length ratios. Panel (a) plots the F1@5 scores of TG-Net and CopyRNN for each group; panel (b) plots the improvement gain of TG-Net over CopyRNN.]

From Figure 3(a), we observe that both models tend to perform better when the title length ratio is higher. One possible explanation is that when the title is long, it conveys substantial salient information of the abstract. Therefore, the chance for the models to attend to the core information is enhanced, which leads to the observed situation. This figure also shows that both TG-Net and CopyRNN perform worse on the >12% group than on the 9%-12% group. The main reason is that the >12% group contains some data with a short abstract, which lacks enough context information for correctly generating all keyphrases.

In Figure 3(b), we find that our TG-Net consistently improves the performance by a large margin on all five testing groups, which again indicates the effectiveness of our model. From a finer perspective, we note that the improvement gain is higher on the lowest (i.e., <3%) and the highest (i.e., >12%) groups. In the >12% group, the title plays a more important role than in the other groups, and consequently our model benefits more by not only explicitly emphasizing the title information itself but also utilizing it to guide the encoding of information in the main body. As for the <3% group, the effect of such a short title on the latter part of the context is small in CopyRNN because of the long distance. However, our model explicitly employs the title to guide the encoding of each context word regardless of the distance, which utilizes the title information much more sufficiently. Consequently, our model achieves a much higher improvement gain in this group. While we only display the results of present keyphrase prediction, the absent keyphrase prediction task shows similar results.
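The grouping used in this analysis can be reproduced with a simple helper like the following (our own sketch of the described bucketing):

```python
def title_length_ratio(title_tokens, context_tokens):
    """Title length over context length (the context includes the title)."""
    return len(title_tokens) / len(context_tokens)

def ratio_group(ratio):
    """Map a title length ratio to one of the five analysis groups."""
    if ratio < 0.03:
        return "<3%"
    if ratio < 0.06:
        return "3%-6%"
    if ratio < 0.09:
        return "6%-9%"
    if ratio < 0.12:
        return "9%-12%"
    return ">12%"
```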
Ablation Study
We also perform an ablation study on Krapivin to better understand the contributions of the main parts of our model. For a comprehensive comparison, we conduct this study on both present keyphrase prediction and absent keyphrase prediction.

              Present           Absent
Model         F1@5     F1@10    R@10     R@50
TG-Net
  -title      0.334    0.288    0.142    0.240
  -copy       0.306    0.281    0.127    0.216

Table 5: Ablation study on the Krapivin dataset.

As shown in Table 5, after we remove the title-guided part and only keep the sequence encoding of the context (i.e., -title), both the present and absent keyphrase prediction performance become obviously worse, indicating that our title-guided context encoding is consistently critical for both present and absent keyphrase generation. We also investigate the effect of removing the copy mechanism (i.e., -copy) from our TG-Net. From Table 5, we notice that the scores decrease dramatically on both present and absent keyphrase prediction, which demonstrates the effectiveness of the copy mechanism in finding important parts of the context.
Case Study
A keyphrase prediction example for a paper about the exponential stability of uncertain switched stochastic delay systems is shown in Figure 4. To be fair, we again only compare the RNN-based models (i.e., TG-Net and CopyRNN).

Title: Exponential stability of switched stochastic delay systems with non-linear uncertainties
Abstract: This article considers the robust exponential stability of uncertain switched stochastic systems with time-delay. Both almost sure (sample) stability and stability in mean square are investigated. Based on Lyapunov functional methods and linear matrix inequality techniques, new criteria for exponential robust stability of switched stochastic delay systems with non-linear uncertainties are derived in terms of linear matrix inequalities and average dwell-time conditions. Numerical examples are also given to illustrate the results.
(a) Present Keyphrases
Target: {stochastic systems; non-linear uncertainties; exponential stability; linear matrix inequality; average dwell-time}
CopyRNN: 1. linear matrix inequality, 2. switched stochastic systems, 3. robust stability, 4. exponential stability, 5. average dwell-time
TG-Net: 1. exponential stability, 2. switched stochastic systems, 3. average dwell-time, 4. non-linear uncertainties, 5. linear matrix inequality
(b) Absent Keyphrases
Target: {switched systems; time-delay system}
CopyRNN: 1. switched systems, 2. switched delay systems, 3. robust control, 4. uncertain systems, 5. switched stochastic stochastic systems
TG-Net: 1. almost sure stability, 2. switched systems, 3. time-delay systems, 4. mean square stability, 5. uncertain systems

Figure 4: A prediction example of CopyRNN and TG-Net. The top 5 predictions are compared and the correct predictions are highlighted in bold.

For present keyphrases, we find that the present keyphrase "non-linear uncertainties", which is a title phrase, is correctly predicted by our TG-Net, while CopyRNN fails to do so. As for absent keyphrases, we note that CopyRNN fails to predict the absent keyphrase "time-delay systems", while our TG-Net can effectively utilize the title information "stochastic delay systems" to locate the important abstract information "stochastic systems with time-delay" and then successfully generate this absent keyphrase. These results show that our model is capable of capturing the title-related core information more effectively and achieving better results in predicting present and absent keyphrases.
Conclusion
In this paper, we propose a novel TG-Net for the keyphrase generation task, which explicitly considers the leading role of the title to the overall document main body. Instead of simply concatenating the title and the main body as the only source input, our model explicitly treats the title as an extra query-like input to guide the encoding of the context. The proposed TG-Net is able to sufficiently leverage the highly summative information in the title to guide keyphrase generation. The empirical results on five popular real-world datasets exhibit the effectiveness of our model for both present and absent keyphrase generation, especially for documents with very low or very high title length ratios. One interesting future direction is to explore more appropriate evaluation metrics for the predicted keyphrases, instead of only considering the exact match with the human-labeled keyphrases as the current recall and F-measure do.

Acknowledgments
The work described in this paper was partially supported by the Research Grants Council of the Hong Kong Special Administrative Region, China (No. CUHK 14208815 and No. CUHK 14210717 of the General Research Fund), and Microsoft Research Asia (2018 Microsoft Research Asia Collaborative Research Award). We would like to thank Jingjing Li, Hou Pong Chan, Piji Li, and Lidong Bing for their comments.
References

[2015] Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.
[2011] Berend, G. 2011. Opinion expression mining by exploiting keyphrase extraction. In IJCNLP, 1162–1170.
[2014] Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 1724–1734.
[2017] Dauphin, Y. N.; Fan, A.; Auli, M.; and Grangier, D. 2017. Language modeling with gated convolutional networks. In ICML, 933–941.
[2017] Florescu, C., and Caragea, C. 2017. A position-biased PageRank algorithm for keyphrase extraction. In AAAI Student Abstracts, 4923–4924.
[2018] Florescu, C., and Jin, W. 2018. Learning feature representations for keyphrase extraction. In AAAI Student Abstracts.
[2018] Gao, Y.; Bing, L.; Li, P.; King, I.; and Lyu, M. R. 2018. Generating distractors for reading comprehension questions from real examinations. arXiv preprint arXiv:1809.02768.
[2017] Gehring, J.; Auli, M.; Grangier, D.; Yarats, D.; and Dauphin, Y. N. 2017. Convolutional sequence to sequence learning. In ICML, 1243–1252.
[2017] Gollapalli, S. D.; Li, X.; and Yang, P. 2017. Incorporating expert knowledge into keyphrase extraction. In AAAI, 3180–3187.
[2016] Gu, J.; Lu, Z.; Li, H.; and Li, V. O. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In ACL, volume 1, 1631–1640.
[2018] Hai, Y., and Lu, W. 2018. Semi-supervised learning for neural keyphrase generation. arXiv preprint arXiv:1808.06773.
[2006] Hulth, A., and Megyesi, B. B. 2006. A study on automatically extracted keywords in text categorization. In COLING and ACL, 537–544.
[2003] Hulth, A. 2003. Improved automatic keyword extraction given more linguistic knowledge. In EMNLP, 216–223.
[1999] Jones, S., and Staveley, M. S. 1999. Phrasier: a system for interactive document retrieval using keyphrases. In SIGIR, 160–167.
[2018] Jun, C.; Xiaoming, Z.; Yu, W.; Zhao, Y.; and Zhoujun, L. 2018. Keyphrase generation with correlation constraints. arXiv preprint arXiv:1808.07185.
[2010] Kim, S. N.; Medelyan, O.; Kan, M.-Y.; and Baldwin, T. 2010. SemEval-2010 task 5: Automatic keyphrase extraction from scientific articles. In Proceedings of the 5th International Workshop on Semantic Evaluation, 21–26.
[2015] Kingma, D. P., and Ba, J. 2015. Adam: A method for stochastic optimization. In ICLR.
[2017] Klein, G.; Kim, Y.; Deng, Y.; Senellart, J.; and Rush, A. 2017. OpenNMT: Open-source toolkit for neural machine translation. In ACL System Demonstrations, 67–72.
[2009] Krapivin, M.; Autaeu, A.; and Marchese, M. 2009. Large dataset for keyphrases extraction. Technical report, University of Trento.
[2010] Li, D.; Li, S.; Li, W.; Wang, W.; and Qu, W. 2010. A semi-supervised key phrase extraction approach: Learning from title phrases through a document semantic network. In ACL Short, 296–300.
[2011] Liu, Z.; Chen, X.; Zheng, Y.; and Sun, M. 2011. Automatic keyphrase extraction by bridging vocabulary gap. In CoNLL, 135–144.
[2017] Luan, Y.; Ostendorf, M.; and Hajishirzi, H. 2017. Scientific information extraction with semi-supervised neural tagging. In EMNLP, 2641–2651.
[2015] Luong, T.; Pham, H.; and Manning, C. D. 2015. Effective approaches to attention-based neural machine translation. In EMNLP, 1412–1421.
[2014] Manning, C.; Surdeanu, M.; Bauer, J.; Finkel, J.; Bethard, S.; and McClosky, D. 2014. The Stanford CoreNLP natural language processing toolkit. In ACL System Demonstrations, 55–60.
[2009] Medelyan, O.; Frank, E.; and Witten, I. H. 2009. Human-competitive tagging using automatic keyphrase extraction. In EMNLP, 1318–1327.
[2017] Meng, R.; Zhao, S.; Han, S.; He, D.; Brusilovsky, P.; and Chi, Y. 2017. Deep keyphrase generation. In ACL, volume 1, 582–592.
[2004] Mihalcea, R., and Tarau, P. 2004. TextRank: Bringing order into text. In EMNLP.
[2017] Nema, P.; Khapra, M. M.; Laha, A.; and Ravindran, B. 2017. Diversity driven attention model for query-based abstractive summarization. In ACL, volume 1, 1063–1072.
[2007] Nguyen, T. D., and Kan, M.-Y. 2007. Keyphrase extraction in scientific publications. In ICADL, 317–326.
[2010] Nguyen, T. D., and Luong, M.-T. 2010. WINGNUS: Keyphrase extraction utilizing document logical structure. In Proceedings of the 5th International Workshop on Semantic Evaluation, 166–169.
[2017] Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; and Lerer, A. 2017. Automatic differentiation in PyTorch. In NIPS-W.
[2017] See, A.; Liu, P. J.; and Manning, C. D. 2017. Get to the point: Summarization with pointer-generator networks. In ACL, volume 1, 1073–1083.
[2017] Song, L.; Wang, Z.; and Hamza, W. 2017. A unified query-based generative model for question generation and question answering. arXiv preprint arXiv:1709.01058.
[2014] Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In NIPS, 3104–3112.
[2008] Wan, X., and Xiao, J. 2008. Single document keyphrase extraction using neighborhood knowledge. In AAAI, 855–860.
[2017] Wang, W.; Yang, N.; Wei, F.; Chang, B.; and Zhou, M. 2017. Gated self-matching networks for reading comprehension and question answering. In ACL, volume 1, 189–198.
[1999] Witten, I. H.; Paynter, G. W.; Frank, E.; Gutwin, C.; and Nevill-Manning, C. G. 1999. KEA: Practical automatic keyphrase extraction. In Proceedings of the Fourth ACM Conference on Digital Libraries, 254–255.
[2016] Zhang, Q.; Wang, Y.; Gong, Y.; and Huang, X. 2016. Keyphrase extraction using deep recurrent neural networks on Twitter. In EMNLP, 836–845.
[2017] Zhang, Y.; Fang, Y.; and Weidong, X. 2017. Deep keyphrase generation with a convolutional sequence to sequence model. In