Improving Scholarly Knowledge Representation: Evaluating BERT-based Models for Scientific Relation Classification
Ming Jiang, Jennifer D'Souza, Sören Auer, and J. Stephen Downie

University of Illinois at Urbana-Champaign, USA
TIB Leibniz Information Centre for Science and Technology and L3S Research Center at Leibniz University of Hannover, Hannover, Germany
{mjiang17|jdownie}@illinois.edu, {jennifer.dsouza|auer}@tib.eu

Abstract.
With the rapidly growing number of research publications, there is a vast amount of scholarly information that needs to be organized in digital libraries. To deal with this challenge, digital libraries use semantic techniques to build knowledge-based structures for organizing scientific information. Identifying relations between scientific terms can help with the construction of a representative knowledge-based structure. While advanced automated techniques have been developed for relation extraction, many of these techniques were evaluated under different scenarios, which limits their comparability. To this end, this study presents a thorough empirical evaluation of eight BERT-based classification models by exploring two factors: 1) BERT model variants, and 2) classification strategies. To simulate real-world settings, we conduct our sentence-level assessment using the abstracts of scholarly publications in three corpora, two of which are distinct corpora and the third of which is the union of the first two. Our findings show that SciBERT models perform better than BERT-base models. The strategy of classifying a single relation each time is preferred in the corpus consisting of abundant scientific relations, while the strategy of identifying multiple relations at one time is beneficial to the corpus with sparse relations. Our results offer recommendations to the stakeholders of digital libraries for selecting the appropriate technique to build a structured knowledge-based system for the ease of scholarly information organization.
Keywords:
Digital library · Information extraction · Scholarly text mining · Semantic relation classification · Knowledge graphs · Neural machine learning.
Introduction

Today scientific endeavors are increasingly facing a publication deluge [27], which results in the rapid growth of scholarly publications needing to be accessible in digital libraries. While abundant resources have been provided for scholarly communication in digital libraries, it is still challenging for researchers to obtain comprehensive, fine-grained and context-sensitive scholarly knowledge for their research, especially for those who study a research problem that involves multiple disciplines [13]. According to [13,4], there are three main factors that lead to this issue. First, the set of keywords used for indexing publication documents may not be able to cover all aspects of knowledge involved in each publication. Second, the traditional keyword search on document-based publications fails to consider the semantic associations among the pieces of scholarly information. Finally, the solely manual processing of unstructured scholarly knowledge does not scale to the number of publications, thus rendering a large part of the scholarly canon unused. To address this problem, some initiatives [13] advocate combining human and machine intelligence to build an interlinked and semantically rich graph structure to organize the scholarly information in digital libraries.

A key aspect of building a knowledge graph for the scholarly record is identifying relations between scientific terms. In the natural language processing community, within the context of human-annotated datasets comprising abstracts of scholarly articles [5,10], seven relation types between scientific terms are being studied: HYPONYM-OF, PART-OF, USAGE, COMPARE, CONJUNCTION, FEATURE-OF, and RESULT. The annotations are in the form of the following generalized relation triples: ⟨experiment⟩ COMPARE ⟨another experiment⟩; ⟨method⟩ USAGE ⟨data⟩; ⟨method⟩ USAGE ⟨research task⟩. Since human language exhibits the phenomenon of paraphrasing, where the same concept can be expressed in various ways, the direct identification of a particular relation between scientific terms is impractical; the problem is instead addressed as a classification task. In the framework of an automated pipeline for generating knowledge graphs over massive volumes of scholarly records, scientific relation classification, the task reviewed in this paper, is therefore indispensable. The task is to recognize which particular relation holds between a pair of scientific terms in scholarly articles, given a set of predefined relations.

In this age of the "deep learning tsunami," one can combine various neural architectures to build automated scientific relation (SR) classification systems with high classification accuracy [18]. With the recent introduction of BERT [8] word embedding models, the opportunity to obtain boosted machine learning systems is further accentuated. While prior studies [6,26] of SR systems have demonstrated high classifier performance by tapping into these recent deep learning developments, the performance has been reported only on single evaluation scenarios, e.g., based on evaluating on a single dataset. Based on such lean evaluations of prior systems, it is at present difficult to obtain conclusive insights about the robustness of the classifiers in diverse real-world application settings, such as scholarly digital library frameworks hosting diverse collections of articles. Moreover, existing evaluations are not identical to each other, since their evaluation datasets differ greatly; hence results are not comparable. Therefore we have no general insights into the best choice among the SR systems to recommend in practical settings; so far they all seem equally good.

In this work, we comprehensively surveyed eight deep learning techniques for scientific relation classification based on two different classification strategies and four BERT model variants. Further, we evaluated these systems on all available evaluation resources for the SR task, including: a dataset of scholarly articles from the ACL Anthology [10]; a more diverse dataset of scholarly articles from various Artificial Intelligence conference proceedings [16]; and a third resource that combines the two datasets, thereby offering a more realistic setting with an unbalanced distribution of data domains and, further, offering robust training for classification models on annotations made by two different groups of annotators. Our ultimate goal is to help the stakeholders of digital libraries select the optimal tool to implement knowledge-based scientific information flows. In summary, we address the following research questions in this paper:

1. What is the impact of the eight classifiers on scientific relation classification?
2. Which of the seven relation types studied are easy or challenging for classification?
3. What is the practical relevance of the seven relation types in a scholarly knowledge graph?
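For illustration, the generalized relation triples above can be held in a small typed container. This is our own sketch for exposition; the container and the validation step are not a format used by the annotated datasets themselves.

```python
from typing import NamedTuple

# The seven relation types studied in this paper.
RELATION_TYPES = {
    "HYPONYM-OF", "PART-OF", "USAGE", "COMPARE",
    "CONJUNCTION", "FEATURE-OF", "RESULT",
}

class RelationTriple(NamedTuple):
    head: str      # first scientific term, e.g. "method"
    relation: str  # one of RELATION_TYPES
    tail: str      # second scientific term, e.g. "data"

def make_triple(head: str, relation: str, tail: str) -> RelationTriple:
    """Build a triple, rejecting unknown relation types."""
    if relation not in RELATION_TYPES:
        raise ValueError(f"unknown relation type: {relation}")
    return RelationTriple(head, relation, tail)
```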
Related Work

Relations Mined from Scientific Publications.
Overall, knowledge is organized in digital libraries based on the following three aspects of the digital collections: 1) metadata, 2) free-form content, and 3) ontologized content [23,12]. In this context, the main categories of relations that have been explored for scholarly publications belong to two groups. One group includes metadata relations such as authorship, co-authorship, and citations [24,22]. Research in this group mainly focuses on examining the social dimension of scholarly communication, such as co-author prediction [22] and scholarly community analysis [24]. The second group includes semantic relations, either as free-form semantic content classes [15,14] or as ontologized classes [21,19]. In the framework of automatic systems, content relations have been examined for: 1) scientific relation identification, which involves determining which scientific term pairs are related [10,14], and 2) scientific relation classification, which involves determining which relation type exists between related term pairs, where the relation types are typically pre-defined [26,6,16]. With respect to ontologized relation classes, prior work primarily considers the conceptual hierarchy based on formal concept analysis [21,19]. We attempt the task of classifying semantic relations that were created from free-form text. Given that digital libraries are interested in the creation of linked data [11], our attempted task directly facilitates the creation of scholarly knowledge graphs [3], offering structured data for use by librarians to generate linked data.
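To illustrate the linked-data connection, a relation triple can be rendered as an N-Triples-like statement. This is a simplistic sketch with hypothetical URIs, not a full RDF serializer (a real one must also escape special characters and distinguish literals from resources).

```python
def to_ntriple(head: str, relation: str, tail: str,
               base: str = "https://example.org/") -> str:
    """Render a (head, relation, tail) triple as an N-Triples-like line.
    The base URI is illustrative only."""
    def uri(label: str) -> str:
        # Naive slug: lowercase and replace spaces with underscores.
        return "<" + base + label.lower().replace(" ", "_") + ">"
    return f"{uri(head)} {uri(relation)} {uri(tail)} ."
```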
Techniques Developed for Relation Classification.
Both rule-based [1] and learning-based [7,28] methods have been developed for relation classification. Traditionally, learning-based systems relied on hand-crafted semantic and/or syntactic features [1,7]. In recent years, the success of deep learning techniques has nearly obviated the need to manually design features, since these techniques can more effectively learn latent feature representations for discriminating between relations. An attention-based bidirectional long short-term memory network (BiLSTM) [28] was one of the first top-performing systems that leveraged neural attention mechanisms to capture important information per sentence for relation classification. Another advanced system [17] leveraged a dynamic span graph framework based on BiLSTMs to simultaneously extract terms and infer their pairwise relations. Aside from these neural methods considering the word sequence order, transformer-based models [25] that use self-attention mechanisms to quantify the semantic association of each word to its context have become the current state of the art in relation classification, e.g., BERT word embeddings [8]. BERT can be trained to model data from any domain; the original BERT models were trained on books and Wikipedia. Now, with the newly introduced SciBERT [6], there are BERT models trained on scholarly publications as well.

With respect to the classification strategy, single-relation-at-a-time classification (SRC), which identifies the relation type for one entity pair at a time, is regularly adopted by prior work [28,17,6]. To improve classification efficiency, [26] designed a BERT-based classifier that can recognize multiple pairwise relationships at one time, which can be regarded as multiple-relations-at-a-time classification (MRC). Differing from prior work that emphasizes classification improvement, we focus on providing a fine-grained analysis of existing resources for selecting the proper tool to extract and organize scientific information in digital libraries.

Table 1: Overview of corpus statistics. 'Total' and '%' columns show the number and percentage of instances annotated with the corresponding relation over all abstracts, respectively. Generated in the Open Research Knowledge Graph [3].

Id  Relation      SemEval18         SciERC            Combined
                  Total    %        Total    %        Total    %
1   USAGE         658      42.13%   2,437    52.43%   3,095    49.84%
2   FEATURE-OF    392      25.10%   264       5.68%   656      10.56%
3   CONJUNCTION   -        -        582      12.52%   582       9.37%
4   PART-OF       304      19.46%   269       5.79%   573       9.23%
5   RESULT        92        5.89%   454       9.77%   546       8.79%
6   HYPONYM-OF    -        -        409       8.80%   409       6.59%
7   COMPARE       116       7.43%   233       5.01%   349       5.62%
    Overall       1,562    100%     4,648    100%     6,210    100%

Relation definitions and examples:
- USAGE: a scientific entity that is used for/by/on another scientific entity. E.g. "MT system is applied to Japanese".
- FEATURE-OF: an entity is a characteristic or abstract model of another entity. E.g. "computational complexity of unification".
- CONJUNCTION: entities that are related in a lexical conjunction, i.e., with 'and'/'or'. E.g. "videos from Google Video and a NatGeo documentary".
- PART-OF: scientific entities that are in a part-whole relationship. E.g. "describing the processing of utterances in a discourse".
- RESULT: an entity affects or yields a result. E.g. "With only 12 training speakers for SI recognition, we achieved a 7.5% word error rate".
- HYPONYM-OF: an entity whose semantic field is included within that of another entity. E.g. "Image matching is a problem in Computer Vision".
- COMPARE: an entity is compared to another entity. E.g. "conversation transcripts have features that differ significantly from neat texts".
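The contrast between the SRC and MRC strategies described above can be sketched as follows. Here `classify_pair` and `classify_all_pairs` are hypothetical stand-ins for trained models (a toy cue-word rule, not BERT); only the calling pattern is the point.

```python
def classify_pair(sentence: str, pair: tuple) -> str:
    """Stand-in for an SRC model: labels one marked term pair."""
    return "USAGE" if " for " in sentence else "COMPARE"

def classify_all_pairs(sentence: str, pairs: list) -> list:
    """Stand-in for an MRC model: one call labels every pair at once."""
    return ["USAGE" if " for " in sentence else "COMPARE" for _ in pairs]

def src_classify(sentence: str, pairs: list) -> list:
    # SRC: one model call (forward pass) per candidate term pair.
    return [classify_pair(sentence, pair) for pair in pairs]

def mrc_classify(sentence: str, pairs: list) -> list:
    # MRC: a single model call covers all candidate pairs in the sentence.
    return classify_all_pairs(sentence, pairs)
```

With n candidate pairs per sentence, SRC issues n forward passes where MRC issues one, which is the efficiency argument made above.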
Evaluation Corpora

For our comprehensive evaluations, we select the two publicly available NLP datasets [10,16] annotated for the scientific relation classification task. These datasets contain a set of scholarly abstracts manually annotated for their scientific terms and the scientific relations between pairs of terms. Additionally, we combine the two datasets into a third new dataset, which offers a more realistic evaluation setting since it provides a larger, more diverse task representation. In the sequel, we describe our evaluation corpora.
C1: The SemEval18 Corpus. This corpus was created for SemEval-2018 Task 7 [10] as a community-wide research initiative. It provided 500 manually annotated abstracts from scholarly articles in computational linguistics from the ACL Anthology. In the abstracts, six discrete semantic relations were originally defined and studied to capture the predominant information content. For our evaluation, we omit one of the six, viz. TOPIC, since it is not well represented in the corpus, and consider the following five relations: USAGE, RESULT, MODEL-FEATURE, PART-WHOLE, and COMPARE. In all, the dataset comprises 500 annotated abstracts partitioned into a training dataset of 350 abstracts and a test dataset of 150 abstracts for evaluating the trained machine learning model.
C2: The SciERC Corpus. Our second evaluation corpus [16] also contains a set of 500 manually annotated abstracts of scholarly articles with their scientific terms and their pairwise relations. Unlike the SemEval18 corpus, the SciERC corpus represents diverse underlying data domains: the abstracts were taken from 12 AI conference/workshop proceedings in five research areas: artificial intelligence, natural language processing, speech, machine learning, and computer vision. These abstracts were annotated for the following seven relations: COMPARE, PART-OF, CONJUNCTION, EVALUATE-FOR, FEATURE-OF, USED-FOR, and HYPONYM-OF. For machine learning, the corpus was prepartitioned into a 350/50/100 train/dev/test split. Between the two corpora, except for the CONJUNCTION and HYPONYM-OF relations, five of the defined relations are semantically identical.

C3: The Combined Corpus.
To create this corpus, we merged C1 and C2. This involved renaming relations. Specifically, USED-FOR in C2, which is semantically similar to USAGE in C1, was renamed to USAGE. Further, based on our observations of the two corpora, we found that the arguments of RESULT in C1 and EVALUATE-FOR in C2 were in reverse order. E.g., [accuracy] for [semantic classification] is labeled as "accuracy" → EVALUATE-FOR → "semantic classification" in C2, which can be regarded as "semantic classification" → RESULT → "accuracy". Accordingly, we renamed all instances annotated with the relation EVALUATE-FOR in corpus C2 to RESULT by flipping their argument order. The two additional relations in C2, i.e., HYPONYM-OF and CONJUNCTION, that were not in C1 were preserved as is. Thus, we created a third evaluation corpus of 1,000 abstracts presenting a more realistic evaluation scenario of large and heterogeneous data.

In Table 1, we present corpus statistics for each of the evaluation corpora. Across all three, USAGE is the most predominant relation. Particularly pertinent in the context of digital libraries are the columns 'SemEval18' and 'SciERC' in Table 1, which were generated from the Open Research Knowledge Graph [3]. The contributions in each scholarly article about the respective SemEval18 and SciERC datasets [10,16] are encoded via subject-predicate-object triples and were customized for comparability and merged in a single view. This customized comparison view of the scholarly knowledge graph of the two articles is persistently accessible for all researchers to access or edit. While this customized graph with machine-actionable data from the scholarly articles is not based upon the triples that are addressed in this work, the relations we study can be harnessed similarly.

BERT-based Scientific Relation Classifiers

BERT [8], Bidirectional Encoder Representations from Transformers, is a pretrained language representation built on cutting-edge neural technology that provides NLP practitioners with high-quality language features from text data simply out of the box, and improves performance on many NLP tasks.
These models return contextualized embeddings for tokens, which can be directly employed as features for various NLP tasks. Further, with minimal task-specific extensions over the core BERT architecture, the embeddings can be relatively inexpensively fine-tuned to the task at hand, in turn facilitating even greater boosts in task performance.

In this work, for scientific relation classification, we employ BERT embeddings and we also fine-tune them within two different neural extensions: 1) for single-relation-at-a-time classification (SRC); and 2) for multiple-relations-at-a-time classification (MRC). In the remainder of the section, we first describe the BERT models we employ, followed by the SRC and MRC classifiers that implement different classification objectives.
BERT Variants

BERT models as pretrained language representations are available in several variants depending on the model configuration parameters and on the underlying training data. While there are over 16 types, in this work we select the following four core variants.

BERT-base. The first two models we use are in the category of the pretrained BERT-base. They were pretrained on billions of words from text data comprising the BooksCorpus (800M words) [29] and English Wikipedia (2,500M words). The two models we select are: a cased model (where the case of the underlying words was preserved when training BERT-base) and an uncased model (where the underlying words were all lowercased when training BERT-base).

SciBERT. The next two models employed in this study are in the category of the pretrained scientific BERT called SciBERT. They are language models based on BERT but trained on a large corpus of scientific text. Specifically, they are trained on a random sample of 1.14M full-text papers from Semantic Scholar [2], 18% from the computer science domain and 82% from the broad biomedical domain. Like BERT-base, for SciBERT, we use its cased and uncased variants.
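The cased/uncased distinction can be illustrated with a simplified preprocessing step. This is only a sketch: real BERT uncased tokenizers also strip accents and apply WordPiece subword segmentation, both omitted here.

```python
def pre_tokenize(text: str, cased: bool) -> list:
    """Whitespace pre-tokenization; uncased variants lowercase the input
    first, so 'BiLSTM' and 'bilstm' map to the same vocabulary entries."""
    if not cased:
        text = text.lower()
    return text.split()
```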
BERT-based Classifiers

We implement the aforementioned BERT models within two neural system extensions that respectively adopt different classification strategies. (Code for the pretrained models: https://github.com/google-research/bert and https://github.com/allenai/scibert)

Single-relation-at-a-time Classification (SRC). Classification models built for SRC generally extend the core BERT architecture with one additional linear classification layer of K × H dimensions, where K is the number of labels and H denotes the size of the hidden states. The label probabilities are then normalized using a softmax function, and the classifier assigns the label with the maximum probability to the target.

Multiple-relations-at-a-time Classification (MRC). This strategy is a more recent innovation on the classification problem in which the classifier is trained with all the relation instances in a sentence at a time, or predicts all the instances in one pass, as opposed to separately for each instance. In this case, however, the core BERT architecture's self-attention mechanism is modified to efficiently consider representations of the relative positions of tokens that represent scientific terms [20,26]. While this modification enables encoding the novel multiple-relations-at-a-time problem, for obtaining the classification probabilities, the MRC is also extended with a linear classification layer, though not identical to the SRC's since it has to model the modified architecture.
Experimental Setup

Experimental datasets, BERT word embeddings, and classification strategies. Our comprehensive evaluation setup involved three different corpora, four BERT embedding variants, and two classification strategies. Given this, we trained a total of eight different classifiers, which for each of the three corpora resulted in 24 trained models. Each corpus was already prepartitioned as training/dev/test by the original creators, which we adopt. To train optimal classifiers on the respective corpora, we tuned the learning rate parameter η over a small set of candidate values.

Evaluation Metrics. We employ the standard machine learning classification evaluation indicators, i.e., Precision (P), Recall (R), F1-score (F1), and Accuracy (Acc).

Results

In this section, we present results from our comprehensive evaluations with respect to the three main research questions that undergird this study.
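The evaluation indicators listed above, computed per relation type and macro-averaged, can be sketched for plain label lists as follows.

```python
def prf1(gold, pred, label):
    """Precision, recall, and F1 for one relation label."""
    tp = sum(g == p == label for g, p in zip(gold, pred))
    fp = sum(p == label and g != label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 scores over the gold label set."""
    labels = sorted(set(gold))
    return sum(prf1(gold, pred, l)[2] for l in labels) / len(labels)

def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)
```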
Table 2: Scientific relation classification results over three datasets (SemEval18, SciERC, & Combined), four BERT model variants (BERT-base cased & uncased; SciBERT cased & uncased), and two classification strategies (SRC & MRC). Acc. is accuracy and F1 is the macro F1-score; top scores are in bold; scores in red from our surveyed systems are those that are on par with the state-of-the-art reported results [6,26] on SemEval18 and SciERC.

                     SRC                                            MRC
                     SemEval18     SciERC        Combined           SemEval18     SciERC        Combined         Avg ± Std (Acc.)
                     Acc.   F1     Acc.   F1     Acc.   F1          Acc.   F1     Acc.   F1     Acc.   F1
BERT-base uncased    76.42  71.74  84.60  77.25  81.75  77.38       80.40  79.98  83.42  74.84  80.84  76.29     81.24 ± 2.84
BERT-base cased      -      -      -      -      -      -           -      -      -      -      -      -         80.05 ± 4.14
SciBERT uncased      -      -      -      -      -      -           -      -      -      -      -      -         82.91 ± 2.04
SciBERT cased        -      -      -      -      -      -           -      -      -      -      -      -         81.71 ± 4.60
Avg. scores          Acc. 84.10, F1 80.22 (SRC)                     Acc. 82.35, F1 77.52 (MRC)

RQ1: What is the impact of the eight classifiers on scientific relation classification?
The eight classifiers are obtained from two classification strategies built over four BERT model variants. We examine their classification results (depicted in Table 2) in terms of the following three key characteristics of the classifiers.

The classification strategy, i.e., SRC vs. MRC. From the Acc and F1 scores shown in Table 2, we see that SRC outperforms MRC on all corpora except the SemEval18 corpus. One characteristic of the SemEval18 corpus is that it has a significantly lower number of annotations than the other two corpora. Thus we infer that the novel MRC strategy is more robust than SRC, because its performance level is unaffected by a drop in the number of annotations.

Word embedding features, i.e., BERT vs. SciBERT. Regarding the BERT word embedding models, SciBERT outperformed BERT on all three corpora with higher accuracy and F1 scores. Since our experimental corpora all comprise scholarly data, this is an expected result: word embeddings trained on similar data domains are better suited.

Vocabulary case in BERT models, i.e., cased vs. uncased. We observe that the uncased BERT models (SciBERT: 82.91, BERT: 81.24) show higher classification accuracy than their cased counterparts (SciBERT: 81.71, BERT: 80.05) on average. Further, the uncased models have a lower standard deviation in accuracy overall (SciBERT: 2.04, BERT: 2.84) than the cased models (SciBERT: 4.60, BERT: 4.14); comparisons on F1 are along similar lines. Hence, our results suggest that uncased BERT models can achieve more stable performance than cased variants.

In conclusion, with respect to the classification strategy, we find SRC outperforms MRC (see averaged scores in the last row of Table 2). Nevertheless, the advanced MRC strategy demonstrates consistently robust performance that remains relatively unaffected by smaller dataset sizes compared to SRC (e.g., SRC vs. MRC results on the SemEval18 corpus). On the other hand, with respect to the BERT word embedding variants, the averaged scores in the last column of Table 2 posit the SciBERT uncased model as the optimal word embedding model for scholarly articles.
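The mean-and-spread comparison above can be reproduced with the standard library; the per-setting score lists below are hypothetical, not the paper's actual per-corpus values.

```python
import statistics

def summarize(scores):
    """Mean and sample standard deviation of a list of accuracies."""
    return statistics.mean(scores), statistics.stdev(scores)

uncased = [80.0, 82.0, 81.0, 83.0]  # hypothetical per-setting accuracies
cased = [76.0, 85.0, 78.0, 84.0]

mean_u, std_u = summarize(uncased)
mean_c, std_c = summarize(cased)
# A lower standard deviation indicates more stable performance.
```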
RQ2: Which of the seven relation types studied are easy/challenging to classify?
Examining the fine-grained per-relation classification results in Tables 3 to 5, across all our evaluation corpora for both SRC and MRC, we note the classification ranked order.

Table 3: Per-relation classification scores (P, R, F1) of the best SRC and MRC systems on SemEval18, over the relations USAGE, RESULT, COMPARE, MODEL-FEATURE, and PART-WHOLE.

Table 4: Per-relation classification results (P, R, F1) of the best SRC and MRC systems on SciERC, over the relations USED-FOR, CONJUNCTION, HYPONYM-OF, EVALUATE-FOR, COMPARE, PART-OF, and FEATURE-OF.

Table 5: Per-relation classification results (P, R, F1) of the best SRC and MRC systems on the Combined corpus, over the relations CONJUNCTION, USAGE, HYPONYM-OF, COMPARE, RESULT, PART-OF, and FEATURE-OF.

Of all relations, USAGE (USED-FOR) is the easiest classification target, since in all three tables it is among the top two. USAGE is also the most predominant type in all corpora. This accounts, in part, for its high-scoring classification, since the classifiers are trained on a significant number of training instances for USAGE compared to the rest.

For the challenging relations, we examine the results per corpus. Starting with the Table 3 results for the SemEval18 corpus, we observe that both SRC and MRC find PART-WHOLE most challenging to classify. We surmise that this relation displays high diversity in the underlying natural language text from which it is induced; hence the classifier is unable to generalize a consistent set of patterns for it. Observing classification performance ranks, for three of the five relations (i.e., USAGE, MODEL-FEATURE, and PART-WHOLE), SRC and MRC obtain the same classification rank order. For RESULT and COMPARE, they are opposites, with SRC classifying RESULT better than MRC.

In the Table 4 results for SciERC, both classifiers perform significantly worse on two relations, viz. FEATURE-OF and PART-OF. Since these two relations are not the most underrepresented in the corpus, we theorize that their low classification performance is owed to the diversity of the natural language text from which they are deduced. In this case, obtaining more annotated instances is one way to boost classifier performance. In terms of the ranked order of performances on the relations, SRC and MRC perform identically on SciERC data.

Fig. 1: Confusion matrices for SRC and MRC misclassifications on the Combined data.

And lastly, in the Table 5 results on the Combined corpus, for the challenging relations, both SRC and MRC show the same result as they did on SciERC, i.e., FEATURE-OF followed by PART-OF are the most challenging. We theorize the same reason for the low scores on these relations, since Combined contains the SciERC data. Of the two corpora in the Combined dataset, SciERC additionally introduced CONJUNCTION, which SemEval18 did not have. CONJUNCTION is among the top two easiest relations to classify, with USAGE as the other, for the classifiers trained on SciERC and on the Combined corpus. Further, its classification is better in the Combined corpus than in SciERC. This lends an understanding of the realistic evaluation setting that the Combined corpus presents. To elaborate, for USAGE, instances from SemEval18 and SciERC (i.e., USED-FOR) are combined, resulting in an insignificant dip in performance (on the Combined corpus, USAGE ranks second easiest compared with SemEval18 and SciERC) since they are now non-uniform annotation signals. As opposed to the case of CONJUNCTION, the Combined corpus obtains a uniform annotation signal from just the SciERC corpus and ranks a minor degree higher at classifying it.

Finally, a list summarizing the top-scoring per-relation performances for scientific relation classification across all three tables includes the following: USAGE (SRC in SciERC), CONJUNCTION (SRC in Combined), HYPONYM-OF (SRC in SciERC), RESULT (MRC in SemEval18), PART-OF (MRC in SemEval18 for PART-WHOLE), and FEATURE-OF (MRC in SemEval18 for MODEL-FEATURE). Since the SemEval18 corpus appears the most times in the top-ranked results, we conclude that its annotations yield a relatively better-trained classifier. However, the SemEval18 corpus only includes scholarly abstracts from one AI domain, i.e., NLP (in the ACL Anthology), whereas SciERC is more comprehensively inclusive across various AI domains. Thus, an additional factor that classifiers trained on SciERC handle is domain diversity.
Error Analysis
A closer look at the misclassifications is portrayed in the confusion matrices in Figure 1 for the SRC and MRC strategies on the Combined corpus. Four of the seven relations, i.e., HYPONYM-OF, RESULT, PART-OF, and FEATURE-OF, are highly likely to be misclassified as USAGE. This shows that our classifiers are biased by the predominant USAGE relation. In general, an unbalanced distribution of training samples (see the details in the corpus section) is, more often than not, one of the main factors for confusion learned in machine learning systems. The most challenging relations, FEATURE-OF and PART-OF, are, after USAGE, highly likely to be confused with each other (FEATURE-OF as PART-OF with ∼10% confusion, and vice versa). For HYPONYM-OF and FEATURE-OF, which loosely demonstrate a relation hierarchy such that HYPONYM-OF subsumes FEATURE-OF but not the other way around, we find the classification confusion demonstrates a consistent pattern in this data: from the matrices, we see that HYPONYM-OF has a ∼6% likelihood of being predicted as FEATURE-OF, but none of the FEATURE-OF (0%) instances were confused with HYPONYM-OF.

To offer another pertinent angle on the classifier error analysis, we compute the word distance distributions between related scientific term pairs in the Combined corpus. This data is depicted in Figure 2. In general, most box plots shown in the figure are skewed, with a long upper whisker and a short lower whisker, which indicates that the majority of paired scientific terms are close in the text. This holds specifically for the scientific term pairs in the CONJUNCTION relation, which in linguistic terms should necessarily be close. This consistent pattern could be another reason why CONJUNCTION is among the easiest relations to classify. Further, the average word distance of FEATURE-OF, PART-OF, HYPONYM-OF, and COMPARE is closer to the lower quartile than for the other relations. Such varied distributions may bring challenges for a classifier to identify these relations. Notably, the similar median value and spread range of FEATURE-OF and PART-OF could account for why they are challenging to classify.

Fig. 2: Word distance distributions between scientific term pairs in the Combined data, per relation type.

Finally, we examine the third research question that undergirded this study.

RQ3: What is the practical relevance of the seven relations studied in this paper in a scholarly knowledge graph?
As a practical illustration of the relation triples studied in this work, we build a knowledge graph from their annotations in the 1000 scholarly abstracts in the Combined dataset. This graph is depicted in Figure 3. Looking at the corpus-level graph (the right graph), we observe that generic scientific terms such as "method," "approach," and "system" are the most densely connected nodes, as expected, since generic terms are found across research areas. In the zoomed-in ego-network of the term "machine translation" (the left graph), HYPONYM-OF is meaningfully highlighted by its role linking "machine translation" and its sibling nodes, the research tasks "speech recognition" and "natural language generation," to the parent node "NLP problems." The term "lexicon" is related by USAGE to "machine translation" and "operational foreign language tutoring." The CONJUNCTION link joins "machine translation" and "speech recognition," which both aim at translating information from one source to another. This knowledge graph now enables various property comparisons across the 1000 scholarly abstracts (consider the corpus statistics in Table 1, generated from the Open Research Knowledge Graph presented in the corpus section).

Fig. 3: A knowledge graph constructed from the relation triples in the Combined corpus. The graph on the left is the "ego-network" of the scientific term "machine translation". Node size is determined by weighted node degree. Colors denote modularity classes based on the graph structure. The graph was generated using https://gephi.org/
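The graph construction described above can be sketched with plain data structures: accumulate the triples into a weighted adjacency structure, then derive the weighted degree (used for node size in Figure 3) and the ego-network of a term. The triples below are a hypothetical hand-picked subset for illustration, not the full annotation set.

```python
from collections import defaultdict

# Hypothetical relation triples (subject, relation, object); the actual graph
# uses all annotated triples from the 1000 abstracts in the Combined corpus.
triples = [
    ("machine_translation", "HYPONYM-OF", "NLP_problems"),
    ("speech_recognition", "HYPONYM-OF", "NLP_problems"),
    ("natural_language_generation", "HYPONYM-OF", "NLP_problems"),
    ("lexicons", "USAGE", "machine_translation"),
    ("machine_translation", "CONJUNCTION", "speech_recognition"),
    ("machine_translation", "CONJUNCTION", "speech_recognition"),
]

# Adjacency structure with edge weights: repeated triples increase the weight.
graph = defaultdict(lambda: defaultdict(int))
for subj, rel, obj in triples:
    graph[subj][(obj, rel)] += 1
    graph[obj][(subj, rel)] += 1  # treat edges as undirected for degree

def weighted_degree(node):
    """Sum of edge weights incident to a node (drives node size in the figure)."""
    return sum(graph[node].values())

def ego_network(node):
    """The node plus its direct neighbors, as in the left graph of Figure 3."""
    return {node} | {nbr for nbr, _rel in graph[node]}

print(weighted_degree("machine_translation"))
print(sorted(ego_network("machine_translation")))
```

A dedicated graph library (e.g., NetworkX, or Gephi for layout and modularity classes, as used for Figure 3) would replace these helpers in practice; the sketch only shows the underlying bookkeeping.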
We have investigated the scientific relation classification task for improving scholarly knowledge representations in digital libraries. Our surveyed systems offer a comprehensive view of the results attainable when advanced neural technology is put together in varying combinations, including the combinations that produce state-of-the-art results. We have categorized this neural technology in terms of four BERT-based embedding models and two classification strategies. Our results indicate that, of the various word embedding models, SciBERT is the optimal choice for a corpus of scholarly data. In terms of classification strategies, the single-relation-at-a-time system outperforms the multiple-relations-at-a-time classification system. Nevertheless, the latter is robust over all datasets, even in settings with lean annotated data, in which case it outperforms the former. Our findings span a broad range of model performances for scientific relation classification, now available to the stakeholders of digital libraries.
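The contrast between the two classification strategies can be made concrete at the level of training-instance construction. The sketch below is a schematic illustration only; the sentence, term pairs, and labels are hypothetical, and the actual systems' input encodings are not reproduced here.

```python
# A sentence with three annotated term pairs; labels are illustrative.
sentence = "We apply SciBERT embeddings and a CRF layer for term extraction."
annotated_pairs = [
    (("SciBERT_embeddings", "term_extraction"), "USAGE"),
    (("CRF_layer", "term_extraction"), "USAGE"),
    (("SciBERT_embeddings", "CRF_layer"), "CONJUNCTION"),
]

def single_relation_instances(sentence, pairs):
    """One classification instance per term pair (single-relation-at-a-time)."""
    return [{"text": sentence, "pair": p, "label": lbl} for p, lbl in pairs]

def multi_relation_instance(sentence, pairs):
    """One instance per sentence, predicting all its relations at once
    (multiple-relations-at-a-time)."""
    return {"text": sentence,
            "pairs": [p for p, _ in pairs],
            "labels": [lbl for _, lbl in pairs]}

print(len(single_relation_instances(sentence, annotated_pairs)))
print(multi_relation_instance(sentence, annotated_pairs)["labels"])
```

The first strategy yields more, but simpler, training instances; the second shares one encoding of the sentence across all of its relations, which is consistent with its observed robustness on sparsely annotated corpora.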
To further facilitate the choice of the proper technique for classifying scientific relations toward the creation of structured, semantic representations over scholarly articles, there are several avenues worthwhile for future exploration. As we have seen in the course of examining RQ2, there exist label biases in the annotated corpora such that some relations are better represented than others (e.g., USAGE). Toward this end, such data need to be further curated by experts to enable a well-represented domain. Further, digital libraries deal with various domains of science in general. While our evaluations have been performed on corpora that cover the Artificial Intelligence research area, there remains plenty of potential to explore other research domains unrelated to Computer Science specifically. Finally, we have examined scientific relation classification in terms of seven relations, whereas ontologized models of the scientific world [9,19] posit a larger set of relations or properties. For this, techniques such as open information extraction or ontology-based extraction are viable alternatives for future developments.
References
1. Agichtein, E., Gravano, L.: Snowball: Extracting relations from large plain-text collections. In: Proceedings of the Fifth ACM Conference on Digital Libraries. pp. 85–94 (2000)
2. Ammar, W., Groeneveld, D., Bhagavatula, C., Beltagy, I., Crawford, M., Downey, D., Dunkelberger, J., Elgohary, A., Feldman, S., Ha, V., et al.: Construction of the literature graph in Semantic Scholar. In: NAACL, Volume 3 (Industry Papers). pp. 84–91 (2018)
3. Auer, S., Kovtun, V., Prinz, M., Kasprzik, A., Stocker, M., Vidal, M.E.: Towards a knowledge graph for science. In: Proceedings of the 8th International Conference on Web Intelligence, Mining and Semantics. pp. 1–6 (2018)
4. Auer, S., Mann, S.: Toward an open knowledge research graph. The Serials Librarian (2019)
5. Augenstein, I., Das, M., Riedel, S., Vikraman, L., McCallum, A.: SemEval 2017 Task 10: ScienceIE - extracting keyphrases and relations from scientific publications. In: Proceedings of SemEval-2017. pp. 546–555 (2017)
6. Beltagy, I., Lo, K., Cohan, A.: SciBERT: A pretrained language model for scientific text. In: EMNLP-IJCNLP. pp. 3615–3620. ACL, Hong Kong, China (Nov 2019)
7. Culotta, A., Sorensen, J.: Dependency tree kernels for relation extraction. In: Proceedings of the 42nd Annual Meeting of the ACL. p. 423. ACL (2004)
8. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL, Volume 1 (Long and Short Papers). pp. 4171–4186. ACL, Minneapolis, Minnesota (Jun 2019)
9. Fathalla, S., Vahdati, S., Auer, S., Lange, C.: Towards a knowledge graph representing research findings by semantifying survey articles. In: TPDL. pp. 315–327. Springer (2017)
10. Gábor, K., Buscaldi, D., Schumann, A.K., QasemiZadeh, B., Zargayouna, H., Charnois, T.: SemEval-2018 Task 7: Semantic relation extraction and classification in scientific papers. In: Proceedings of SemEval-2018. pp. 679–688 (2018)
11. Hallo, M., Luján-Mora, S., Maté, A., Trujillo, J.: Current state of linked data in digital libraries. Journal of Information Science (2), 117–127 (2016)
12. Haslhofer, B., Isaac, A., Simon, R.: Knowledge graphs in the libraries and digital humanities domain. arXiv preprint arXiv:1803.03198 (2018)
13. Jaradeh, M.Y., Oelen, A., Farfar, K.E., Prinz, M., D'Souza, J., Kismihók, G., Stocker, M., Auer, S.: Open Research Knowledge Graph: Next generation infrastructure for semantic scholarly knowledge. In: KCAP. pp. 243–246. ACM, New York, NY, USA (2019)
14. Jiang, M., Diesner, J.: A constituency parsing tree based method for relation extraction from abstracts of scholarly publications. In: Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13). pp. 186–191 (2019)
15. Klampfl, S., Kern, R.: An unsupervised machine learning approach to body text and table of contents extraction from digital scientific articles. In: TPDL. pp. 144–155. Springer (2013)
16. Luan, Y., He, L., Ostendorf, M., Hajishirzi, H.: Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction. In: EMNLP. pp. 3219–3232 (2018)
17. Luan, Y., Wadden, D., He, L., Shah, A., Ostendorf, M., Hajishirzi, H.: A general framework for information extraction using dynamic span graphs. arXiv preprint arXiv:1904.03296 (2019)
18. Manning, C.D.: Computational linguistics and deep learning. Computational Linguistics 41