A New Citation Recommendation Strategy Based on Term Functions in Related Studies Section

Abstract

Purpose: Researchers frequently encounter the following problems when writing scientific articles: (1) selecting appropriate citations to support the research idea is challenging, and (2) the literature review is not conducted extensively, which leads to working on a research problem that others have already addressed well. This study focuses on citation recommendation in the related studies section by applying the term function of a citation context, potentially improving the efficiency of writing a literature review. Design/methodology/approach: We present nine term functions, three newly created and six identified from existing literature. Using these term functions as labels, we annotate 531 research papers on three topics to evaluate our proposed recommendation strategy. BM25 and Word2vec with VSM are implemented as the baseline models for the recommendation. The term function information is then applied to enhance the performance. Findings: The experiments show that the term function-based methods outperform the baseline methods in terms of recall, precision, and F1-score, demonstrating that term functions are useful in identifying valuable citations. Research limitations: The dataset is insufficient due to the complexity of annotating citation functions for paragraphs in the related studies section. More recent deep learning models should be applied to further validate the proposed approach. Practical implications: The citation recommendation strategy can be helpful for valuable citation discovery, semantic scientific retrieval, and automatic literature review generation. Originality/value: The proposed citation function-based citation recommendation can generate intuitive explanations of the results for users, improving the transparency, persuasiveness, and effectiveness of recommender systems.


Preprint submitted to Emerald

A New Citation Recommendation Strategy Based on Term Functions in Related Studies Section

Haihua Chen Department of Information Science, University of North Texas, Denton, TX, USA, 76201

Abstract

In the era of big scholarly data, researchers frequently encounter the following problems when writing scientific articles: 1) it is challenging to select appropriate references to support the research idea, and 2) the literature review is not conducted extensively, which leads to working on a research problem that has been well addressed by others. Citation recommendation helps researchers decide in a timely manner which articles should be cited, and supports a comprehensive and high-quality review of scientific literature. Some work has been done on this valuable and challenging task, but little of it has focused on applying the semantic information of the citation context. This paper proposes a new citation recommendation strategy based on term function, that is, the function or role of a citation context in the related studies section. We present 9 term functions as identified from the literature and annotate 531 research papers in 3 areas to evaluate our approach. The experimental results demonstrate that term functions are effective in identifying valuable references. The proposed method recommends more accurate citations for a given topic than several baseline methods. The citation recommendation strategy can be helpful for generating automatic summaries and literature reviews.

Keywords: Citation Recommendation, Term Function, Patterns of Paragraphs in Related Studies, BM25

Introduction

The amount of scientific literature has been increasing exponentially in recent years. For example, publications in the field of Computer Science included in Web of Science grew from 396 in 1995 to 37,684 in 2017. Due to this massive growth of scientific literature, it has become increasingly time-consuming and difficult for readers to review related literature and decide which articles to cite. Fortunately, citation recommendation (CR) has proven to be effective in helping users decide which papers from their reading list should be cited (McNee, Albert, & Cosley, 2002; Strohman, Croft, & Jensen, 2007; He, Pei, Kifer, Mitra, & Giles, 2010; Liu, Yan, & Yan, 2013; Raamkumar, Foo, & Pang, 2016; Beel & Dinesh, 2017). A CR system suggests previous studies to be reviewed and cited in new research articles. It helps researchers cite appropriate previous studies and avoid missing important literature. Usually, an automatic CR system accepts a research topic and provides a list of publications that can be cited. Unlike the traditional search approaches offered by search engines and digital libraries, CR systems focus on finding the relevant publications rather than any texts or pages similar to the topic (Strohman, Croft, & Jensen, 2007).

Research on CR has applied multi-semantic information and non-semantic information. Multi-semantic information is usually included in the citation context, defined as a sequence of words appearing around a citation placeholder. For example, He et al. (2010) designed a non-parametric probabilistic CR model that measured the context-based relevance between a citation context and a document to be recommended; a follow-up study automatically identified the citation contexts in a manuscript where citations were needed by applying contextual information (He, Kifer, Pei, Mitra, & Giles, 2011).
Tang & Zhang (2009) discovered topical aspects of the citation contexts of each paper and recommended papers based on the discovered topic distribution. Zarrinkalam & Kahani (2013) used citation context as a textual feature to enhance the performance of CR tasks. Duma et al. (2016; 2019) integrated core scientific concept classification and discourse annotation into context-based CR. Non-semantic information mainly concerns the relationships between articles and authors. For example, McNee, Albert, & Cosley (2002) created a rating matrix by using the citation web between papers. Chen, Mayanglambam, Hsu, Lu, Lee, & Ho (2011) also leveraged a citation-network-based methodology, which they named citation authority diffusion (CAD). Livne, Gokuladas, Teevan, Dumais, & Adar (2014) developed a CR system that took author similarity, venue relevancy, and co-citation into consideration to augment sparse citation networks. Son & Kim (2018) proposed a multilevel citation network-based academic paper recommender system that compares all the indirectly linked papers to the paper of interest; it can recommend papers related to both the research topic and the academic theory. Some scholars have also combined multi-semantic information with non-semantic information to enhance CR performance (Strohman, Croft, & Jensen, 2007; Bethard & Jurafsky, 2010; Ebesu & Fang, 2017; Jeong et al., 2019). Current CR systems could be further improved by considering new factors. One promising approach to enhancing CR services is to focus on recommending citations in the related studies sections, since citation content analysis has shown that more than 60% of the references and the most highly cited articles appear in the introduction and literature review sections (collectively referred to as related studies sections in this paper) of the citing papers (Ding, Liu, Guo, & Cronin, 2013).
In fact, recommending citations in related work sections has been discussed as a meaningful way of facilitating literature review writing (Huang, Wu, Mitra, & Giles, 2014; Livne et al., 2014). However, little research has been conducted on this valuable task, except for Raamkumar, Foo, & Pang (2015; 2016), who constructed a recommender system that provides a shortlist of papers based on article type preference, coverage, citation count, and user-specified prospective keywords to assist researchers' literature review and writing process. The drawback is that users still need to spend considerable effort on organizing this literature. We believe that a CR algorithm will save users a lot of time while writing if it can recommend papers categorized by their term functions in related studies sections. In this paper, term function refers to the semantic role or function of a segment or a paragraph in the related studies section (Cheng, 2015), which has been argued to be promising for scientific literature retrieval and scholarly recommendation (Li, Cheng, & Lu, 2017). For example, when people conduct research on citation context recognition (CCR), the related work may involve the problem statement of CCR, CCR methods, datasets used in CCR, CCR-related tools, CCR evaluation methods, and the applications of CCR. A citation recommendation service can therefore provide recommendation lists according to these "term functions" based on users' requirements, which is more likely to meet their information needs. This paper focuses on this innovative and challenging task, aiming to explore a more efficient CR strategy and to improve scholars' reading experience. Compared with previous studies, the contributions of this paper mainly include three aspects:

• We investigate the citation organization patterns in the related work sections. Scientific articles tend to follow a certain style of organizing sections and paragraphs (Luong, Nguyen, & Kan, 2012). To understand how researchers usually organize related studies, we develop a term function annotation scheme at the paragraph level; an annotation experiment showed that there were four common patterns of organizing literature in the related work sections.
• We propose a term function-based citation recommendation framework that recommends articles to users based on the term function they assign to a certain paragraph in the related work sections. To the best of our knowledge, the proposed framework is the first to introduce term function into the citation recommendation task.
• We conduct several experiments on "real-world" datasets obtained from ACM Digital Library to evaluate the impacts of the term function factors and the performance of the proposed method. The experimental results show that our method considering term functions outperforms the baseline method (BM25), improving F1 to 5.0% and Recall to 18.4% on average respectively, indicating the effectiveness of applying term function on citation recommendation.

The remainder of this paper is structured as follows: Section 2 reviews the related work. The proposed framework and term function annotation experiment are demonstrated in Section 3. Section 4 introduces the recommendation algorithms used in this paper. In Section 5, we describe our experiment based on the ACM Digital Library dataset and report the results. Finally, in Section 6, we conclude and discuss areas of future study.

Related work

In this study, we assume that paragraphs in related work sections can be organized by the term function of the citation sentences in them. Therefore, our work first annotates each paragraph in the related work sections with a certain term function, and then combines the paragraph content with the term function weight to recommend citations for this paragraph. Given this focus, we review the scientific literature related to this research from two perspectives: term function and citation recommendation.

Term function

Originally, term function refers to the semantic role that a term plays in the scientific literature (Cheng, 2015), and it also represents the function of the sentence to which the term belongs. If all the citation sentences in a paragraph of the related work section share a specific "term function" (for example, the research method related to the citing article), we believe this paragraph will be easier to understand, because all its citations focus on a specific term function (of a topic).

Besides the term function "research method" mentioned above, there are many other term functions, such as research topic, technology, dataset, application, and evaluation. However, previous definitions classify term functions according to different schemes, as shown in Table 1.

Table 1. Classification of term function (Li, Cheng, & Lu, 2017)

Classification of term function | Authors
Head, goal, method, other | Kondo (2009)
Technology, effect | Nanba, Kondo, & Takezawa (2010)
Focus, technique, domain | Gupta & Manning (2012)
Technique, application | Tsai, Kundu, & Roth (2013)
Method, task, other | Huang & Wan (2013)
Domain-independent: research topic, research method; domain-related: case, tool, data set, etc. | Cheng (2015)

Overall, the classification of term function is still unclear and not uniform. Thus, considering the specific problems we aim to solve in this article, we took some categories from existing literature and proposed three new categories of term function, which we introduce in the next section. Although several effective attempts have been made at term function analysis, this novel research issue remains to be developed, especially with regard to the gap between term function analysis and its applications, which is the starting point of our study.

Citation recommendation

The task of recommending citations for research papers was first introduced by McNee, Albert, & Cosley (2002). Since then, there has been a rich line of research on this topic. For example, Strohman, Croft, & Jensen (2007) combined the content of previous literature and its citation network to recommend relevant material that a given article should cite, finding that this mixed method performed much better than the text similarity approach; their research sparked interest in CR problems. Two years later, Tang and Zhang (2009) proposed a topic distribution discovery-based CR model, which performed well on sentence-level CR on the NIPS dataset; they argued that CR can be integrated into academic search systems to improve their service. He et al. (2011) presented a CR system that automatically identified contexts where citations were needed in a manuscript and recommended appropriate citations. Kates-harbeck & Haggblade (2013) used a machine learning method that utilized both context-based features and text-based features to rank references for a given short text. Recently, with the growth of scholarly data and the development of deep learning, neural networks have been applied to CR problems, aiming to train more robust models and enhance CR performance (Huang, Wu, Chen, Mitra, & Giles, 2015; Ebesu & Fang, 2017). In terms of the context scope on which a citation recommendation list is generated, the CR task can be divided into two types: local citation recommendation (LCR) (Tang, Wan, and Zhang, 2014; Yang et al., 2019) and global citation recommendation (GCR) (Tang, Wan, and Zhang, 2014; Ayala-Gomez et al., 2018).

Local Citation Recommendation

Local citation recommendation aims to recommend citations for a specific context where citations are needed, which is also called context-aware citation recommendation (Tang, Wan, and Zhang, 2014). Here, the specific context can be one sentence or several sentences; in this paper, it comprises all the sentences that appear in a paragraph in the related work sections. Since contexts contain rich semantic information, content-based approaches are usually used in LCR. He et al. (2010) proposed an effective context-aware citation recommendation approach that used a non-parametric probabilistic model to measure the context-based relevance between a citation context and a document. For a similar task, a translation model was applied to the words in the documents to translate the query terms in the citation context, which bridged the vocabulary gap between the citation context and the recommended document (Lu, He, Shan, & Yan, 2011; He, Nie, Lu, & Zhao, 2012). Meanwhile, He et al. (2011) presented a novel task: automatically identifying the locations where citations are needed in a manuscript without a bibliography and recommending citations for those locations. Rokach, Mitra, Kataria, Huang, & Giles (2013) presented a supervised learning method utilizing three types of features (general features, author-aware features, and context-aware features) to recommend citations for a given citation context. This approach has been applied to the CiteSeerX digital library as a citation recommendation service. Besides these features, time or publication date has also proven to be important in citation recommendation (Jiang, Liu, & Gao, 2014; Jiang, 2015; Jiang, Liu, & Gao, 2015; Gao, 2016). To provide a personalized citation recommendation service, Liu, Yan, & Yan (2013) combined user profiles with context-aware citation recommendation. Experimental results showed that the proposed strategy outperformed language model-based and translation model-based algorithms. Other research explored the effectiveness of citation networks (Livne et al., 2014; Son and Kim, 2018; Jiang, Yin, Gao, Lu, & Liu, 2018) and core scientific concepts (Duma et al., 2016) in local citation recommendation. Duma & Klein (2014) introduced a citation resolution method to evaluate context-based citation recommendation systems. Overall, taking the citation context as the main evidence for recommending citations has become an established practice in this field (Bhagavatula, Feldman, Power, & Ammar, 2018). Recently, with the popularity of deep learning in both academia and industry, different deep learning algorithms have been applied to citation recommendation. Huang et al. (2015) represented words and documents by learning simultaneously from pairs of citation contexts and cited documents; a neural probabilistic model was used to estimate the probability of words appearing in the citation context given a candidate reference paper. This approach improved the overall recommendation with a 5% gain on Recall@10 compared to the translation model. Ebesu & Fang (2017) proposed a neural citation network (NCN) that models the semantic composition of citation contexts and the corresponding cited documents' titles by exploiting author relations. The method proved very effective, outperforming several state-of-the-art baselines on all metrics (Recall, MAP, MRR, and NDCG) by 13-16%.

Global Citation Recommendation

Different from LCR, global citation recommendation focuses on recommending a reference list for a given paper (Tang, Wan, and Zhang, 2014). However, there exists a bias: these systems and studies focus only on finding the relevant papers rather than the important papers. It therefore remains difficult for researchers who cannot evaluate the academic value of the relevant papers to select appropriate citations, and the novelty of the recommended relevant papers has to be evaluated manually by the researchers (Chen et al., 2011). Rather than limiting the scope of GCR to recommending just relevant references, Küçüktunç, Saule, Kaya, & Çatalyürek (2012; 2013; 2015) recommended a diverse reference list that allowed users to reach either old, well-cited, well-known research papers or recent, less known ones. To improve users' experience, Wu, Hua, Li, & Pei (2012) proposed to recommend references based on users' information needs, such as publication time preference, self-citation preference, co-citation preference, and publication reputation preference. However, most citation recommendation research is based on a closed-world view that is limited to using a single data source for recommendation, which usually cannot meet users' information needs about different aspects while writing (Zarrinkalam & Kahani, 2012). To bridge this gap, Zarrinkalam & Kahani (2012) introduced a strategy to recommend citations based on multiple linked data sources provided by the emerging web of data, which enriched the background data of recommender systems and provided better recommendations. Later, they proposed to compute the semantic distances based on rational and textual features to recommend references for the input text from a bibliographic dataset (Zarrinkalam & Kahani, 2013; Yang et al., 2019). Kates-harbeck & Haggblade (2013) recommended references for a given abstract with a set of key technical words, an author list and a publication data using a two-stage methodology. In the first stage, they trained a classifier to rank a candidate reference list based on the paper score; in the second stage, they re-ranked the scored candidate papers using connectivity information. Their experimental results showed that text-based features were the most effective in ranking, and that the re-ranking algorithm improved the overall performance, but not significantly. Caragea, Silvescu, Mitra, & Giles (2013) compared the performance of a collaborative filtering-based (CF) approach and a singular value decomposition-based (SVD) approach on GCR using the CiteSeer dataset, finding that SVD performed better than CF since SVD can easily incorporate additional information.
Building on the different citation recommendation tasks and algorithms mentioned above, many citation recommendation systems have been developed, for example, ActiveCite (Zhou, 2010), CiteSight (Livne et al., 2014), RefSeer (Huang et al., 2014), ClusCite (Ren, Liu, Yu, Khandelwal, Gu, Wang, & Han, 2014), DiSCern (Chakraborty, Modani, Narayanam, & Nagar), and Rec4LRW (Sesagiri Raamkumar, Foo, & Pang, 2015). Because the task is both useful and challenging, citation recommendation is attracting more and more attention from researchers.

Proposed Approach

The overarching objective of our work is to recommend citations in the related work sections and to verify the effectiveness of term function in this recommendation task. Regarding the first objective, we notice that literature review sections are usually organized in particular patterns to address different aspects related to an article; recommending citations by following these patterns might therefore address users' personal information needs. As for the second objective, we observe that a specific term in a segment or a paragraph might indicate its role in the related work sections; in this paper, we define this role as the term function, which is valuable information for CR. The main problems addressed in our work are: (1) identifying the paragraph organization patterns in related work sections, and (2) based on these patterns, recommending citations by using term functions as a weighting parameter. Fig. 1 presents the framework of our proposed approach to these problems. Given a user topic, the CR system first retrieves a list of publications as candidates for recommendation. It then assigns weights to the candidates for re-ranking based on their term functions, which are obtained by matching the pattern in which each candidate occurs. Finally, the top-ranked candidates are provided to the user.

Fig. 1. Framework of term function-based citation recommendation

Paragraph organization patterns analysis in related work sections

Typically, researchers organize related literature by following some specific patterns (e.g., based on similarity, on different topics, on publication time, or on roles) rather than simply putting a list of relevant literature together. Meanwhile, this literature is organized to demonstrate the previous research problems, methods, datasets, and applications related to the paper, which we define as term functions in our study. In this section, we propose a term function classification scheme, which was used to annotate a real-world dataset acquired from ACL Anthology; we then analyze the annotation results to identify the paragraph organization patterns in related work sections.

Term function classification scheme in paragraph level

Term function refers to the semantic role and specific function that a term plays in the scientific literature (Cheng, 2015); the function of a term also reflects the function of the citation context in which it occurs. For example, support vector machine (SVM) is the core research problem in (Li, Wang, & Wang, 2010) but the method used to solve image classification problems in (Melgani & Bruzzone, …). Accordingly, a reference (…, 1998) is cited as a related research problem in (Li, Wang, & Wang, 2010) but as a related method for solving image classification problems in (Melgani & Bruzzone, …). We hypothesize that the related literature in the citing paper is organized by these term functions. To verify this hypothesis, we develop a term function classification scheme by combining categories from existing literature with three new categories that we draw from a pilot study, as shown in Table 2.

Table 2. Term functions in the related studies section

Category | Source | Description
Application | Tsai et al. (2013) | Describes existing application of the core problem and method in this article
Dataset | Cheng (2015) | Describes related datasets to this article
Evaluation | Cheng (2015) | Describes related evaluation methods to this article
Method | Huang & Wan (2013) | Describes previous work related to the core method of the article
Method+Problem | New | Describes the core method of the article and introduces what problems it can be used to solve
Problem | Kondo et al. (2009) | Describes previous work related to the core research problem of the article
Problem+Method | New | Describes the core research problem of the article and introduces the existing method to the problem
Tool | Cheng (2015) | Describes related tools used in this article
Topic-irrelevant | New | Describes previous work not very relevant

Annotation

We collected a dataset consisting of 238, 109, and 184 research papers on sentiment analysis, information extraction, and recommender systems, respectively, manually downloaded in PDF format from ACL Anthology and placed in three separate folders. All the papers were then converted into text format with the assistance of Apache PDFBox (https://pdfbox.apache.org/). To prepare the data for annotation, we extracted useful information, including the title, the abstract, and the related work section, from all text files in the dataset. Paragraph organization pattern annotation is complex manual work that requires extensive domain knowledge. Therefore, we invited two Ph.D. students in the field of data science who are familiar with sentiment analysis, information extraction, and recommender systems to annotate the data. Before annotation, they were required to pre-annotate 10 articles in each folder to understand the annotation guideline, which is as follows:

• First, read the title and abstract to identify the research problems and research methods of the article.
• Then find the paragraph that needs to be annotated (ignoring paragraphs without citations), read its contents to identify the features that indicate its term function, and label the paragraph with one of the term function categories mentioned above.
• Avoid subjective judgment during annotation.

The annotation results were further cross-checked by the first author of this paper. Pairwise Kappa coefficients (Viera & Garrett, 2005) were used to calculate the inter-annotator agreement. The Cohen's kappa scores were 0.778, 0.871, and 0.748 on sentiment analysis, information extraction, and recommender systems, respectively. These are satisfactory according to the existing evaluation standard (Viera & Garrett, 2005).
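The agreement figures above follow the standard definition of Cohen's kappa, the observed agreement corrected for the agreement expected by chance. A minimal sketch of that calculation is given below; the function name `cohen_kappa` and the toy labels are illustrative assumptions, not the authors' annotation tooling.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same paragraphs.

    labels_a, labels_b: equal-length sequences of term function labels
    (hypothetical input format chosen for this sketch).
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set")
    n = len(labels_a)
    # Observed agreement: share of paragraphs given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

For two annotators who agree on three of four paragraphs, with label proportions that give a chance agreement of 0.5, the function returns 0.5.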

Paragraph organization patterns analysis

To investigate how authors organize literature in related work sections, we conducted a statistical analysis of the annotation results. Fig. 2 presents the distribution of term functions under each category over different topics. Based on the results, most paragraphs belong to problem+method, problem, method+problem, and method, with percentages of 38.2%, 29.5%, 15.6%, and 6.0% on average, respectively, and with slight differences between topics. This supports the intuition that authors usually pay more attention to the research problems and methods relevant to their topics while writing a literature review. In other words, they tend to focus on investigating who was involved in studying the problems and methods, how they described the problems and methods, what methods have been used to solve the problem, and how the method was applied to other problems. Sometimes, existing datasets, evaluation methods, tools, and applications related to their research were also covered in two or more independent paragraphs. In addition, some topic-irrelevant materials were referred to in a few paragraphs, suggesting that the research topic was novel and that existing literature in the area was lacking. These findings suggest that it is necessary to recommend citations in the related work sections by considering these information needs. In this paper, we explore this innovative citation recommendation task based on the four major term functions: problem+method, problem, method+problem, and method.

Fig. 2. Statistical analysis of the term function distribution in three fields

Term function-based citation recommendation

As discussed previously, term function-based citation recommendation is a kind of global citation recommendation at the paragraph level. Users are required to input the topic (e.g., citation recommendation, translation model) and the term function (e.g., problem+method, method+problem); our approach then recommends a list of citations that are relevant to both the topic and the term function.

Problem definition

Definition 1 (Original document collection). Given a document $d$ with $t$ and $a$ as its title and abstract, the document has a set of paragraphs $P = \{p_1, p_2, \ldots, p_i\}$ in the related works section, and the term function of each paragraph $p_i$ is defined as $l_i$. Therefore, the paragraph set $P$ has a corresponding term function set $L = \{l_1, l_2, \ldots, l_i\}$. All the original documents together with their paragraph sets and term function sets form the original document collection $D$.

Definition 2 (Candidate document collection). Given a document $c$ with $t$ and $a$ as its title and abstract, we define its term function as $F = \{f_1, f_2, f_3, f_4\}$, where $f_1, f_2, f_3, f_4$ refer to problem+method, problem, method+problem, and method respectively, which are determined by the term function of its citing paragraphs. All the candidate documents were extracted from the references of the original documents; they form the candidate document collection $C$.

Definition 3 (Term function-based citation recommendation). Given a paragraph $p$ whose topic is $t$ and expected term function is $f$, our goal is to estimate the probability of each document in the candidate document collection $C$ being cited in this paragraph.

Citation recommendation with BM25 algorithm

BM25 (Robertson & Zaragoza, 2010) is a text-based method that has been widely employed in many information retrieval systems and proven to be effective. Recently, it has also been frequently selected as a baseline model in citation recommendation tasks (Ren et al., 2014; Sesagiri Raamkumar, Foo, & Pang, 2015; Jiang, Liu, & Gao, 2015; Gao, 2016; Ebesu & Fang, 2017; Bhagavatula, Feldman, Power, & Ammar, 2018). In this paper, we use BM25 as the baseline method. With the BM25 ranking function, the relevance score of a candidate document $d$ with respect to a paragraph as a query $q$ is calculated as follows:

$$Score(q, d) = \sum_{i=1}^{n} \log\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} \cdot \frac{f_i \cdot (k_1 + 1)}{f_i + k_1 \cdot \left(1 - b + b \cdot \frac{dl}{avgdl}\right)} \tag{1}$$

where $k_1$ and $b$ are free hyper-parameters, $n$ is the number of terms in the query $q$, and $dl$ is the length of the candidate document $d$. $avgdl$ and $N$ respectively denote the average document length and the total number of documents in the collection. $f_i$ and $n(q_i)$ respectively represent the frequency of a query term $q_i$ in the candidate document and the total number of documents that contain the query term $q_i$.

Term function weighting-based BM25 recommendation model
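Because the weighted model introduced in this section reuses the BM25 score of Equation (1) unchanged, a minimal Python sketch of that baseline may serve as a reference point. The function name `bm25_score`, the representation of documents as token lists, and the default values `k1=1.2` and `b=0.75` are assumptions made for illustration; the paper does not report its hyper-parameter settings or implementation.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one candidate document against a query, following Eq. (1).

    query_terms: tokens of the query paragraph q
    doc_terms:   tokens of the candidate document d
    corpus:      list of token lists, one per document in the collection
    k1, b:       free hyper-parameters (default values are assumptions)
    """
    n_docs = len(corpus)                                # N
    avgdl = sum(len(doc) for doc in corpus) / n_docs    # average length
    dl = len(doc_terms)                                 # length of d
    term_freq = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        # n(q_i): number of documents that contain the query term.
        n_q = sum(1 for doc in corpus if term in doc)
        idf = math.log((n_docs - n_q + 0.5) / (n_q + 0.5))
        f = term_freq[term]                             # f_i
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * dl / avgdl))
    return score
```

As in Equation (1), the logarithmic factor becomes negative for a term that occurs in more than half of the collection; this sketch keeps that behaviour rather than clipping it.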

In this paper, we explore the influence of term functions in citation recommendation and propose the term function weighting-based BM25 recommendation model. Compared with the original BM25 model introduced above, an important step is conducted before recommendation: computing the weight of each document in the candidate document collection on each of the four term functions. We define $F = (f_1, f_2, f_3, f_4)$ as the weighted term function of each document $d$ in the candidate document collection, where $f_1, f_2, f_3, f_4$ respectively represent the frequency of problem, method, problem+method, and method+problem among the paragraphs that cite $d$. The term function is then added to the BM25 model to improve the performance. For a document $d$ and a paragraph as a query $q$, $q_f$ denotes the expected term function of this query/paragraph. We calculate the relevance score on the term function dimension as follows:

$$p(q_f, d) = \frac{f}{f_1 + f_2 + f_3 + f_4}, \quad f \in \{f_1, f_2, f_3, f_4\} \tag{2}$$

For example, if a user inputs a query $q$ with the term function problem, and a document $d$ in the candidate document collection has the term function $F = (4, 2, 8, 1)$, we compute its term function weights on problem, method, problem+method, and method+problem respectively as $4/15$, $2/15$, $8/15$, and $1/15$, where the sum of the weights adheres to the following constraint:

$$\sum_{j=1}^{4} weight_j = 1, \quad weight_j \in [0, 1] \tag{3}$$

Finally, the relevance score of a candidate document $d$ with respect to a paragraph as a query $q$, which combines text similarity and term function matching, is computed with the following equation:

$$Score(q, d) = \left(1 + p(q_f, d)\right) \cdot \sum_{i=1}^{n} \log\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} \cdot \frac{f_i \cdot (k_1 + 1)}{f_i + k_1 \cdot \left(1 - b + b \cdot \frac{dl}{avgdl}\right)} \tag{4}$$

Pseudo code for recommendation with the BM25 or the term function weighting-based BM25 model is described in Algorithm 1.

Recommendation with BM25 or Term function weighting-based BM25

Input: Candidate paper list
    Rank papers by BM25 scores
    For each candidate paper in ranked order
        If year of candidate paper < year of original paper
            Add candidate paper into new recommendation list
        Else
            Continue
        End if
        Stop when length of new recommendation list is 30
Output: New recommendation list

Experiments

In this section, we first introduce the dataset for experiments and the experimental setup to evaluate the recommendation performance of our proposed approach. After that, we report the results of standard measures for information retrieval and recommendation (Recall, Precision, and F1 score). Finally, analysis and discussion are presented based on the experimental results.
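Before turning to the data, it may help to see the two scoring functions under evaluation in executable form. The following Python sketch implements Eq. (1), Eq. (2), Eq. (4) and the year filter of Algorithm 1 under stated assumptions: documents are already tokenized lists of terms, the defaults `k1=1.2` and `b=0.75` are common BM25 settings rather than values reported in this paper, and all function and field names (`bm25_score`, `term_function_weight`, `tfw_bm25_score`, `recommend`, `score`, `year`) are hypothetical.

```python
import math
from collections import Counter

# Order follows the definition of F used with Eq. (2).
TERM_FUNCTIONS = ("problem", "method", "problem+method", "method+problem")


def bm25_score(query_terms, doc_terms, doc_freq, n_docs, avgdl, k1=1.2, b=0.75):
    """Eq. (1). k1 and b defaults are assumptions, not values from the paper."""
    tf = Counter(doc_terms)
    dl = len(doc_terms)
    score = 0.0
    for term in query_terms:
        f_i = tf.get(term, 0)
        if f_i == 0:
            continue  # a term absent from the document contributes nothing
        n_qi = doc_freq.get(term, 0)
        idf = math.log((n_docs - n_qi + 0.5) / (n_qi + 0.5))
        score += idf * f_i * (k1 + 1) / (f_i + k1 * (1 - b + b * dl / avgdl))
    return score


def term_function_weight(counts, expected):
    """Eq. (2): share of the expected term function in a candidate's counts."""
    total = sum(counts)
    if total == 0:
        return 0.0  # assumption: a candidate with no labeled citing paragraph gets no boost
    return counts[TERM_FUNCTIONS.index(expected)] / total


def tfw_bm25_score(bm25, counts, expected):
    """Eq. (4): BM25 score scaled by (1 + p(q_f, d))."""
    return (1.0 + term_function_weight(counts, expected)) * bm25


def recommend(scored_candidates, original_year, limit=30):
    """Algorithm 1: rank by score, keep candidates older than the original paper."""
    ranked = sorted(scored_candidates, key=lambda c: c["score"], reverse=True)
    kept = [c for c in ranked if c["year"] < original_year]
    return kept[:limit]
```

With the worked example from Eq. (2), `term_function_weight((4, 2, 8, 1), "problem")` evaluates to 4/15, so a candidate's BM25 score is multiplied by 19/15 for a query whose expected term function is problem.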

Dataset

Since there is no existing standard benchmark dataset with term functions annotated at the paragraph level for term function-based citation recommendation, we collected and built our dataset from the ACL Anthology in the domains of information extraction, sentiment analysis, and recommender systems. In total, it includes 531 research articles with full text and their 2875 reference articles with title, publication year, and abstract. As mentioned in Section 3.1, all the paragraphs in the related work sections of the 531 research articles have been labeled with one of the four term functions. Since many of the reference articles are cited by several of the original articles, their term function distributions can be computed directly. For example, the term function distribution of the paper “On the recommending of citations for research papers” (McNee et al., 2002) is (3/8, 4/8, 0, 1/8) in terms of (problem+method, problem, method+problem, method), which can be used as weighting parameters. We use the Python NLTK toolkit to perform text pre-processing such as removing numbers and stop words and stemming. Finally, the experiment dataset is stored in a MySQL database.

In our experiment, we randomly partition the dataset into 5 subsamples and perform 5-fold cross-validation on exactly the same partition for our approaches and the baseline methods. In each round, four subsets were used as training sets for term function weighting and the remaining subset was used for testing. A small portion of examples split from the training set was used for validation. We repeated the process five times and averaged the results for evaluation.

Experimental setup

• BM25. We compute the similarity scores between the paragraph and the candidate recommendation articles, restricting the candidates to articles published in an earlier year than the original article.
• Term function weighting-based BM25. In this method, we modify the BM25 model by incorporating term function information and rank the candidate recommendation articles based on the scores computed by Eq. (4).

Moreover, we conduct a comparison study on three fields, namely information extraction (300 paragraphs), sentiment analysis (383 paragraphs), and recommender systems (510 paragraphs), to analyze the effect of term function on citation recommendation performance in different fields. There are therefore six runs in total: BM25 in information extraction, sentiment analysis, and recommender systems (IE-BM25, SA-BM25, RS-BM25) as baselines, and BM25 with term function weighting in the same three fields (IE-TFW-BM25, SA-TFW-BM25, and RS-TFW-BM25).

Evaluation metrics

Citation recommendation is essentially an information retrieval task in which the top-ranked documents are the most important to get right (Wu et al., 2012). Therefore, we employ IR evaluation measures including Precision, Recall, and F-measure in our experiments. For a given query (a paragraph with its labelled term function) in the test set, we use the original set of references, which were not seen during training, as the ground truth (Rokach et al., 2013). In the experiments, the number of recommended citations is set to 5, 10, 20, and 30 respectively.
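The measures above can be sketched for a single query as follows. This is a minimal illustration rather than the evaluation script used in the paper: the function name `precision_recall_f1_at_k` is hypothetical, precision is assumed to be divided by the cutoff k, and how per-query values are aggregated across the test set is not specified in the text.

```python
def precision_recall_f1_at_k(ranked_ids, relevant_ids, k):
    """Precision, recall and F1 for the top-k items of one ranked list.

    ranked_ids: candidate ids in ranked order (best first).
    relevant_ids: ids actually cited in the paragraph (the ground truth).
    """
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    precision = hits / k if k else 0.0  # assumption: divide by the cutoff k
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Cutoffs used in the experiments.
CUTOFFS = (5, 10, 20, 30)
```

Calling the function once per cutoff in `CUTOFFS` for every test paragraph and averaging the results yields one row of a table such as Table 3.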

Performance comparison

Table 3 shows the results for each compared method on our dataset, including the average precision, recall, and F1 scores for different numbers of citations in the recommendation list. The results yield several findings. First, the term function-based methods perform better than the baseline methods. Second, there are no significant differences between fields for either the baseline methods or the term function-based methods. Third, Fig. 3 shows that the term function-based method improves on the baseline by 5.0% (average) in F1 score when the number of recommended citations is set to 20, and by 12.9% (average) in recall when the number of recommended citations is set to 30, which demonstrates that term function is a useful feature in citation recommendation.

Table 3

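Performance of baseline methods and the methods with term function weighting

| Metrics   | Runs        | Top 5 | Top 10 | Top 20 | Top 30 |
|-----------|-------------|-------|--------|--------|--------|
| Precision | IE-BM25     | 10.6% | 12.6%  | 7.1%   | 6.0%   |
| Precision | IE-TFW-BM25 |       |        |        |        |
| Precision | SA-BM25     | 11.1% | 10.5%  | 8.2%   | 7.0%   |
| Precision | SA-TFW-BM25 |       |        |        |        |
| Precision | RS-BM25     | 14.3% | 12.1%  | 7.5%   | 6.4%   |
| Precision | RS-TFW-BM25 |       |        |        |        |
| Recall    | IE-BM25     | 14.9% | 35.1%  | 40.0%  | 50.6%  |
| Recall    | IE-TFW-BM25 |       |        |        |        |
| Recall    | SA-BM25     | 15.6% | 29.6%  | 46.3%  | 59.3%  |
| Recall    | SA-TFW-BM25 |       |        |        |        |
| Recall    | RS-BM25     | 21.3% | 36.2%  | 44.7%  | 57.4%  |
| Recall    | RS-TFW-BM25 |       |        |        |        |
| F1-score  | IE-BM25     | 12.4% | 18.5%  | 12.1%  | 10.7%  |
| F1-score  | IE-TFW-BM25 |       |        |        |        |
| F1-score  | SA-BM25     | 13.0% | 15.5%  | 13.9%  | 12.5%  |
| F1-score  | SA-TFW-BM25 |       |        |        |        |
| F1-score  | RS-BM25     | 17.1% | 18.1%  | 12.8%  | 11.5%  |
| F1-score  | RS-TFW-BM25 |       |        |        |        |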

Fig. 3. F1-score of different runs

To control the contribution of the term function in the term function-based method, we use a parameter α ∈ (0, 1] to adjust the weight of the term function-based similarity. The results are shown in Fig. 4: the term function enhances citation recommendation only to some extent, and when the weight of the term function similarity is too high, it introduces noise instead.

Fig. 4. F1-score with term function weight adjustment
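The text does not spell out how α enters the score. One reading that is consistent with the stated range is to scale the term function term of Eq. (4), so that α = 1 recovers Eq. (4) exactly. The sketch below illustrates that reading only; the formula, the function names `mixed_score` and `rank_with_alpha`, and the toy candidates are all assumptions and not the authors' implementation.

```python
def mixed_score(bm25, p_term_function, alpha):
    """Assumed form: (1 + alpha * p(q_f, d)) * BM25; alpha = 1 gives Eq. (4)."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    return (1.0 + alpha * p_term_function) * bm25


def rank_with_alpha(candidates, alpha):
    """Rank (doc_id, bm25, p_term_function) tuples by the assumed mixed score."""
    ordered = sorted(
        candidates,
        key=lambda c: mixed_score(c[1], c[2], alpha),
        reverse=True,
    )
    return [doc_id for doc_id, _, _ in ordered]


# Toy candidates: "a" matches the text better, "b" matches the term function better.
toy = [("a", 10.0, 0.0), ("b", 8.0, 0.5)]
ranking_low_alpha = rank_with_alpha(toy, 0.2)
ranking_full_alpha = rank_with_alpha(toy, 1.0)
```

In this toy case the order of the two candidates flips between α = 0.2 and α = 1.0, which is the kind of sensitivity that a sweep over α, such as the one summarized in Fig. 4, would expose.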

Results analysis and discussion

Our experimental results show that, compared with the traditional BM25 retrieval model, which has proven effective in information retrieval and recommendation, our term function weighting-based strategy recommends a more accurate and structured literature list. This demonstrates that term function is an effective feature in citation recommendation, especially when recommending papers for a structured literature review. Compared with deep learning-based recommendation approaches (Zhang et al., 2019), our strategy can generate intuitive explanations of the results for users or system designers, which can help improve system transparency, persuasiveness, trustworthiness, and effectiveness. For example, users can easily see both the content relevance and the term function relevance of a recommended item. This new citation recommendation strategy can benefit not only semantic scientific information retrieval but also automatic structured summarization and literature review generation.

However, our proposed strategy has some limitations. First, in our measurement we assume that any paper other than the actual citation is not relevant. In fact, there may be multiple papers that provide the same support or evidence and can equally serve as valid citations. Therefore, the above measures probably underestimate the actual performance. It would be interesting to examine which papers other than the ones actually cited are relevant as well, but that would require subjective, manual relevance judgements. Second, the amount of data is limited, since term function annotation is still a challenging task that requires field experts and a lot of manual work. Although Cheng (2015) is trying to develop automatic term function identification techniques, there is a huge gap between experiment and application. Nevertheless, our citation recommendation strategy opens a direction for how term functions can be used and how a structured literature review system can be constructed.

Conclusion and future work

In this paper, we proposed a new citation recommendation strategy based on term functions in the related studies section. Based on the hypothesis that researchers usually organize citations in related work sections according to some patterns, we developed a term function annotation scheme at the paragraph level. An annotation experiment showed that there were four common patterns of organizing literature in related work sections. Building on this, we designed a term function-based citation recommendation framework that recommends articles to users based on the term function they assign to a given paragraph in the related work section. Using a real-world dataset collected from the ACL Anthology, we designed recommendation experiments in three fields: information extraction, recommender systems, and sentiment analysis, with the BM25 model as the baseline method. The experimental results show that our proposed citation recommendation strategy improves on the baseline by 5.0% (average) in F1 score when the number of recommended citations is set to 20, and by 12.9% (average) in recall when the number of recommended citations is set to 30, demonstrating that term function is an effective feature in citation recommendation, especially when recommending papers for a structured literature review.

In the future, we will try to develop algorithms that automatically build large-scale data collections with a term function for each paragraph. In this way, we can combine state-of-the-art contextual word embeddings such as BERT with term functions to improve recommendation performance and to provide explainable recommendation results. We will also explore the usefulness of other features, such as citing time and citation location. More practically, we will implement a recommender system, based on our citation recommendation strategy, that assists researchers in writing structured literature reviews more easily.

References

[1] Ayala-Gomez, F., Daroczy, B., Benczur, A., Mathioudakis, M., & Gionis, A. (2018). Global citation recommendation using knowledge graphs. Journal of Intelligent and Fuzzy Systems, 34(5), 3089-3100. https://doi.org/10.3233/JIFS-169493.
[2] Beel, J., & Dinesh, S. (2017). Real-World Recommender Systems for Academia: The Pain and Gain in Building, Operating, and Researching them. In Proceedings of the ECIR'17 workshop on bibliometric-enhanced information retrieval (pp. 6-17).
[3] Bethard, S., & Jurafsky, D. (2010). Who should I cite: learning literature search models from citation behavior. In Proceedings of the 19th ACM international conference on information and knowledge management (CIKM'10) (pp. 609-618). ACM.
[4] Bhagavatula, C., Feldman, S., Power, R., & Ammar, W. (2018). Content-based citation recommendation. arXiv preprint arXiv:1802.08301.
[5] Burges, C. (1998). A tutorial on support vector machines for pattern recognition. Data mining and knowledge discovery, 2(2), 121-167.
[6] Caragea, C., Silvescu, A., Mitra, P., & Giles, C. L. (2013). Can't see the forest for the trees?: a citation recommendation system. In Proceedings of the 13th ACM/IEEE-CS joint conference on digital libraries (JCDL'13) (pp. 111-114). ACM.
[7] Chakraborty, T., Modani, N., Narayanam, R., & Nagar, S. (2015). Discern: a diversified citation recommendation system for scientific queries. In Proceedings of the 31st international conference on data engineering (ICDE'15) (pp. 555-566). IEEE.
[8] Chen, C., Mayanglambam, S., Hsu, F., Lu, C., Lee, H., & Ho, J. (2011). Novelty paper recommendation using citation authority diffusion. In Proceedings of the international conference on technologies and applications of artificial intelligence (TAAI'11) (pp. 126-131). IEEE.
[9] Cheng, Q. (2015). Term Function Recognition of Academic Text. (Unpublished doctoral dissertation). Wuhan University, Wuhan, China.
[10] Ding, Y., Liu, X., Guo, C., & Cronin, B. (2013). The distribution of references across texts: Some implications for citation analysis. Journal of Informetrics, 7(3), 583-592.
[11] Duma, D., & Klein, E. (2014). Citation resolution: A method for evaluating context-based citation recommendation systems. In Proceedings of the 52nd annual meeting of the association for computational linguistics (ACL'14) (Vol. 2, pp. 358-363).
[12] Duma, D., Liakata, M., Clare, A., Ravenscroft, J., & Klein, E. (2016). Applying Core Scientific Concepts to Context-Based Citation Recommendation. In Proceedings of the 10th edition of the language resources and evaluation conference (LREC'16).
[13] Duma, D., Liakata, M., Clare, A., Ravenscroft, J., & Klein, E. (2016). Rhetorical Classification of Anchor Text for Citation Recommendation. D-Lib Magazine, 22(9/10).
[14] Duma, D. (2019). Contextual citation recommendation using scientific discourse annotation schemes (Doctoral dissertation). United Kingdom, Edinburgh: The University of Edinburgh.
[15] Ebesu, T., & Fang, Y. (2017). Neural citation network for context-aware citation recommendation. In Proceedings of the 40th International ACM SIGIR conference on research and development in information retrieval (SIGIR'17) (pp. 1093-1096). ACM.
[16] Gao, Z. (2015). Examining influences of publication dates on citation recommendation systems. In Proceedings of the 12th international conference on fuzzy systems and knowledge discovery (FSKD'15) (pp. 1400-1405). IEEE.
[17] Huang, W., Wu, Z., Mitra, P., & Giles, C. (2014). RefSeer: Citation Recommendation System. In Proceedings of the 14th ACM/IEEE-CS joint conference on digital libraries (JCDL'14) (pp. 371–
Küçüktunç, O., Saule, E., Kaya, K., & Çatalyürek, U. (2012). Direction awareness in citation recommendation. In Proceedings of the international workshop on ranking in databases (DBRank ’12) in conjunction with VLDB ’