Improving Scholarly Knowledge Representation: Evaluating BERT-based Models for Scientific Relation Classification
Ming Jiang, Jennifer D'Souza, Sören Auer, and J. Stephen Downie

University of Illinois at Urbana-Champaign, USA
TIB Leibniz Information Centre for Science and Technology and L3S Research Center at Leibniz University of Hannover, Hannover, Germany
{mjiang17|jdownie}@illinois.edu, {jennifer.dsouza|auer}@tib.eu

Abstract.
With the rapidly growing number of research publications, there is a vast amount of scholarly information that needs to be organized in digital libraries. To deal with this challenge, digital libraries use semantic techniques to build knowledge-based structures for organizing scientific information. Identifying relations between scientific terms can help with the construction of a representative knowledge-based structure. While advanced automated techniques have been developed for relation extraction, many of these techniques were evaluated under different scenarios, which limits their comparability. To this end, this study presents a thorough empirical evaluation of eight BERT-based classification models by exploring two factors: 1) BERT model variants, and 2) classification strategies. To simulate real-world settings, we conduct our sentence-level assessment using the abstracts of scholarly publications in three corpora, two of which are distinct corpora and the third of which is the union of the first two. Our findings show that SciBERT models perform better than BERT-base models. The strategy of classifying a single relation each time is preferred in the corpus consisting of abundant scientific relations, while the strategy of identifying multiple relations at one time is beneficial to the corpus with sparse relations. Our results offer recommendations to the stakeholders of digital libraries for selecting the appropriate technique to build a structured knowledge-based system for the ease of scholarly information organization.
Keywords:
Digital library · Information extraction · Scholarly text mining · Semantic relation classification · Knowledge graphs · Neural machine learning.
Introduction

Today scientific endeavors are increasingly facing a publication deluge [27], which results in the rapid growth of scholarly publications needing to be accessible in digital libraries. While abundant resources have been provided for scholarly communication in digital libraries, it is still challenging for researchers to obtain comprehensive, fine-grained and context-sensitive scholarly knowledge for their research, especially for those who study a research problem that involves multiple disciplines [13]. According to [13,4], there are three main factors that lead to this issue. First, the set of keywords used for indexing publication documents may not be able to cover all aspects of knowledge involved in each publication. Second, the traditional keyword search on document-based publications fails to consider the semantic associations among the pieces of scholarly information. Finally, the solely manual processing of unstructured scholarly knowledge does not scale to the number of publications, thus rendering a large part of the scholarly canon unused. To address this problem, some initiatives [13] advocate combining human and machine intelligence to build an interlinked and semantically rich graph structure to organize the scholarly information in digital libraries.

A key aspect of building a knowledge graph for the scholarly record is identifying relations between scientific terms. In the natural language processing community, within the context of human-annotated datasets comprising abstracts of scholarly articles [5,10], seven relation types between scientific terms are being studied: HYPONYM-OF, PART-OF, USAGE, COMPARE, CONJUNCTION, FEATURE-OF, and RESULT. The annotations are in the form of the following generalized relation triples: ⟨experiment⟩ COMPARE ⟨another experiment⟩; ⟨method⟩ USAGE ⟨data⟩; ⟨method⟩ USAGE ⟨research task⟩. Since human language exhibits the phenomenon of paraphrasing, where the same concept can be expressed in various ways, the direct identification of a particular relation between scientific terms is impractical; the problem is instead addressed as a classification task. In the framework of an automated pipeline for generating knowledge graphs over massive volumes of scholarly records, scientific relation classification, the task reviewed in this paper, is therefore indispensable. The task is to recognize which particular relation holds between a pair of scientific terms in scholarly articles, given a set of predefined relations.

In this age of the "deep learning tsunami," one can combine various neural architectures to build automated scientific relation (SR) classification systems with high classification accuracy [18]. With the recent introduction of BERT [8] word embedding models, the opportunity to obtain boosted machine learning systems is further accentuated. While prior studies [6,26] of SR systems have demonstrated high classifier performance by tapping into these recent deep learning developments, the performance has been reported only on single evaluation scenarios, e.g., based on evaluating on a single dataset. Based on such lean evaluations of prior systems, it is at present difficult to obtain conclusive insights about the robustness of the classifiers in diverse real-world application settings, such as scholarly digital library frameworks hosting diverse collections of articles. Moreover, existing evaluations are not identical to each other, since their evaluation datasets differ greatly; hence results are not comparable. Therefore we have no general insights into the best choice among the SR systems to recommend in practical settings; so far they all seem equally good.

In this work, we comprehensively surveyed eight deep learning techniques for scientific relation classification based on two different classification strategies and four BERT model variants. Further, we evaluated these systems on all available evaluation resources for the SR task, including: a dataset of scholarly articles from the ACL Anthology [10]; a more diverse dataset of scholarly articles from various Artificial Intelligence conference proceedings [16]; and a third resource that combines the two datasets, thereby offering a more realistic setting with an unbalanced distribution of data domains and, further, offering robust training for classification models on annotations made by two different groups of annotators. Our ultimate goal is to help the stakeholders of digital libraries select the optimal tool to implement knowledge-based scientific information flows. In summary, we address the following research questions in this paper:

1. What is the impact of the eight classifiers on scientific relation classification?
2. Which of the seven relation types studied are easy or challenging for classification?
3. What is the practical relevance of the seven relation types in a scholarly knowledge graph?
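For illustration, the generalized relation triples above can be held in a small typed container. This is our own sketch for exposition; the container and the validation step are not a format used by the annotated datasets themselves.

```python
from typing import NamedTuple

# The seven relation types studied in this paper.
RELATION_TYPES = {
    "HYPONYM-OF", "PART-OF", "USAGE", "COMPARE",
    "CONJUNCTION", "FEATURE-OF", "RESULT",
}

class RelationTriple(NamedTuple):
    head: str      # first scientific term, e.g. "method"
    relation: str  # one of RELATION_TYPES
    tail: str      # second scientific term, e.g. "data"

def make_triple(head: str, relation: str, tail: str) -> RelationTriple:
    """Build a triple, rejecting unknown relation types."""
    if relation not in RELATION_TYPES:
        raise ValueError(f"unknown relation type: {relation}")
    return RelationTriple(head, relation, tail)
```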
Related Work

Relations Mined from Scientific Publications.
Overall, knowledge is organized in digital libraries based on the following three aspects of the digital collections: 1) metadata, 2) free-form content, and 3) ontologized content [23,12]. In this context, the main categories of relations that have been explored for scholarly publications belong to two groups. One group includes metadata relations such as authorship, co-authorship, and citations [24,22]. Research in this group mainly focuses on examining the social dimension of scholarly communication, such as co-author prediction [22] and scholarly community analysis [24]. The second group includes semantic relations, either as free-form semantic content classes [15,14] or as ontologized classes [21,19]. In the framework of automatic systems, content relations have been examined for: 1) scientific relation identification, which involves determining which scientific term pairs are related [10,14], and 2) scientific relation classification, which involves determining which relation type exists between related term pairs, where the relation types are typically pre-defined [26,6,16]. With respect to ontologized relation classes, prior work primarily considers the conceptual hierarchy based on formal concept analysis [21,19]. We attempt the task of classifying semantic relations that were created from free-form text. Given that digital libraries are interested in the creation of linked data [11], our attempted task directly facilitates the creation of scholarly knowledge graphs [3], offering structured data for use by librarians to generate linked data.
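To illustrate the linked-data connection, a relation triple can be rendered as an N-Triples-like statement. This is a simplistic sketch with hypothetical URIs, not a full RDF serializer (a real one must also escape special characters and distinguish literals from resources).

```python
def to_ntriple(head: str, relation: str, tail: str,
               base: str = "https://example.org/") -> str:
    """Render a (head, relation, tail) triple as an N-Triples-like line.
    The base URI is illustrative only."""
    def uri(label: str) -> str:
        # Naive slug: lowercase and replace spaces with underscores.
        return "<" + base + label.lower().replace(" ", "_") + ">"
    return f"{uri(head)} {uri(relation)} {uri(tail)} ."
```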
Techniques Developed for Relation Classification.
Both rule-based [1] and learning-based [7,28] methods have been developed for relation classification. Traditionally, learning-based systems relied on hand-crafted semantic and/or syntactic features [1,7]. In recent years, the success of deep learning techniques has nearly obviated the need to manually design features, since these techniques can more effectively learn latent feature representations for discriminating between relations. An attention-based bidirectional long short-term memory network (BiLSTM) [28] was one of the first top-performing systems that leveraged neural attention mechanisms to capture important information per sentence for relation classification. Another advanced system [17] leveraged a dynamic span graph framework based on BiLSTMs to simultaneously extract terms and infer their pairwise relations. Aside from these neural methods considering the word sequence order, transformer-based models [25] that use self-attention mechanisms to quantify the semantic association of each word to its context have become the current state of the art in relation classification, e.g., BERT word embeddings [8]. BERT can be trained to model data from any domain; the original BERT models were trained on books and Wikipedia. Now, with the newly introduced SciBERT [6], there are BERT models trained on scholarly publications as well.

With respect to the classification strategy, single-relation-at-a-time classification (SRC), which identifies the relation type for one entity pair at a time, is regularly adopted by prior work [28,17,6]. To improve classification efficiency, [26] designed a BERT-based classifier that can recognize multiple pairwise relationships at one time, which can be regarded as multiple-relations-at-a-time classification (MRC). Differing from prior work that emphasizes classification improvement, we focus on providing a fine-grained analysis of existing resources for selecting the proper tool to extract and organize scientific information in digital libraries.

Table 1: Overview of corpus statistics. 'Total' and '%' columns show the number and percentage of instances annotated with the corresponding relation over all abstracts, respectively. Generated in the Open Research Knowledge Graph [3].

Id  Relation      SemEval18         SciERC            Combined
                  Total    %        Total    %        Total    %
1   USAGE         658      42.13%   2,437    52.43%   3,095    49.84%
2   FEATURE-OF    392      25.10%   264       5.68%   656      10.56%
3   CONJUNCTION   -        -        582      12.52%   582       9.37%
4   PART-OF       304      19.46%   269       5.79%   573       9.23%
5   RESULT        92        5.89%   454       9.77%   546       8.79%
6   HYPONYM-OF    -        -        409       8.80%   409       6.59%
7   COMPARE       116       7.43%   233       5.01%   349       5.62%
    Overall       1,562    100%     4,648    100%     6,210    100%

Relation definitions and examples:
- USAGE: a scientific entity that is used for/by/on another scientific entity. E.g. "MT system is applied to Japanese".
- FEATURE-OF: an entity is a characteristic or abstract model of another entity. E.g. "computational complexity of unification".
- CONJUNCTION: entities that are related in a lexical conjunction, i.e., with 'and'/'or'. E.g. "videos from Google Video and a NatGeo documentary".
- PART-OF: scientific entities that are in a part-whole relationship. E.g. "describing the processing of utterances in a discourse".
- RESULT: an entity affects or yields a result. E.g. "With only 12 training speakers for SI recognition, we achieved a 7.5% word error rate".
- HYPONYM-OF: an entity whose semantic field is included within that of another entity. E.g. "Image matching is a problem in Computer Vision".
- COMPARE: an entity is compared to another entity. E.g. "conversation transcripts have features that differ significantly from neat texts".
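The contrast between the SRC and MRC strategies described above can be sketched as follows. Here `classify_pair` and `classify_all_pairs` are hypothetical stand-ins for trained models (a toy cue-word rule, not BERT); only the calling pattern is the point.

```python
def classify_pair(sentence: str, pair: tuple) -> str:
    """Stand-in for an SRC model: labels one marked term pair."""
    return "USAGE" if " for " in sentence else "COMPARE"

def classify_all_pairs(sentence: str, pairs: list) -> list:
    """Stand-in for an MRC model: one call labels every pair at once."""
    return ["USAGE" if " for " in sentence else "COMPARE" for _ in pairs]

def src_classify(sentence: str, pairs: list) -> list:
    # SRC: one model call (forward pass) per candidate term pair.
    return [classify_pair(sentence, pair) for pair in pairs]

def mrc_classify(sentence: str, pairs: list) -> list:
    # MRC: a single model call covers all candidate pairs in the sentence.
    return classify_all_pairs(sentence, pairs)
```

With n candidate pairs per sentence, SRC issues n forward passes where MRC issues one, which is the efficiency argument made above.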
Evaluation Corpora

For our comprehensive evaluations, we select the two publicly available NLP datasets [10,16] annotated for the scientific relation classification task. These datasets contain a set of scholarly abstracts manually annotated for their scientific terms and the scientific relations between pairs of terms. Additionally, we combine the two datasets into a third new dataset, which offers a more realistic evaluation setting since it provides a larger, more diverse task representation. In the sequel, we describe our evaluation corpora.
C1: The SemEval18 Corpus. This corpus was created for SemEval-2018 Task 7 [10] as a community-wide research initiative. It provided 500 manually annotated abstracts from scholarly articles in computational linguistics from the ACL Anthology. In the abstracts, six discrete semantic relations were originally defined and studied to capture the predominant information content. For our evaluation, we omit one of the six, viz. TOPIC, since it is not well represented in the corpus, and consider the following five relations: USAGE, RESULT, MODEL-FEATURE, PART-WHOLE, and COMPARE. In all, the dataset comprises 500 annotated abstracts partitioned into a training dataset of 350 abstracts and a test dataset of 150 abstracts for evaluating the trained machine learning model.
C2: The SciERC Corpus. Our second evaluation corpus [16] also contains a set of 500 manually annotated abstracts of scholarly articles with their scientific terms and their pairwise relations. Unlike the SemEval18 corpus, the SciERC corpus represents diverse underlying data domains: the abstracts were taken from 12 AI conference/workshop proceedings in five research areas: artificial intelligence, natural language processing, speech, machine learning, and computer vision. These abstracts were annotated for the following seven relations: COMPARE, PART-OF, CONJUNCTION, EVALUATE-FOR, FEATURE-OF, USED-FOR, and HYPONYM-OF. For machine learning, the corpus was prepartitioned into a 350/50/100 train/dev/test split. Between the two corpora, except for the CONJUNCTION and HYPONYM-OF relations, five of the defined relations are semantically identical.

C3: The Combined Corpus.
To create this corpus, we merged C1 and C2. This involved renaming relations. Specifically, USED-FOR in C2, which is semantically similar to USAGE in C1, was renamed to USAGE. Further, based on our observations of the two corpora, we found that the arguments of RESULT in C1 and EVALUATE-FOR in C2 were in reverse order. E.g., [accuracy] for [semantic classification] is labeled as "accuracy" → EVALUATE-FOR → "semantic classification" in C2, which can be regarded as "semantic classification" → RESULT → "accuracy". Accordingly, we renamed all instances annotated with the relation EVALUATE-FOR in corpus C2 to RESULT by flipping their argument order. The two additional relations in C2, i.e., HYPONYM-OF and CONJUNCTION, that were not in C1 were preserved as is. Thus, we created a third evaluation corpus of 1,000 abstracts presenting a more realistic evaluation scenario of large and heterogeneous data.

In Table 1, we present corpus statistics for each of the evaluation corpora. Across all three, USAGE is the most predominant relation. Particularly pertinent in the context of digital libraries are the columns 'SemEval18' and 'SciERC' in Table 1, which were generated from the Open Research Knowledge Graph [3]. The contributions in each scholarly article about the respective SemEval18 and SciERC datasets [10,16] are encoded via subject-predicate-object triples and were customized for comparability and merged in a single view. This customized comparison view of the scholarly knowledge graph of the two articles is persistently accessible for all researchers to access or edit. While this customized graph with machine-actionable data from the scholarly articles is not based upon the triples that are addressed in this work, the relations we study can be harnessed similarly.

BERT-based Scientific Relation Classifiers

BERT [8], Bidirectional Encoder Representations from Transformers, is a pretrained language representation built on cutting-edge neural technology that provides NLP practitioners with high-quality language features from text data simply out of the box, and improves performance on many NLP tasks.
These models return contextualized embeddings for tokens, which can be directly employed as features for various NLP tasks. Further, with minimal task-specific extensions over the core BERT architecture, the embeddings can be relatively inexpensively fine-tuned to the task at hand, in turn facilitating even greater boosts in task performance.

In this work, for scientific relation classification, we employ BERT embeddings and we also fine-tune them within two different neural extensions: 1) for single-relation-at-a-time classification (SRC); and 2) for multiple-relations-at-a-time classification (MRC). In the remainder of the section, we first describe the BERT models we employ, followed by the SRC and MRC classifiers that implement different classification objectives.
BERT Variants

BERT models as pretrained language representations are available in several variants depending on the model configuration parameters and on the underlying training data. While there are over 16 types, in this work we select the following four core variants.

BERT-base. The first two models we use are in the category of the pretrained BERT-base. They were pretrained on billions of words from text data comprising the BooksCorpus (800M words) [29] and English Wikipedia (2,500M words). The two models we select are: a cased model (where the case of the underlying words was preserved when training BERT-base) and an uncased model (where the underlying words were all lowercased when training BERT-base).

SciBERT. The next two models employed in this study are in the category of the pretrained scientific BERT called SciBERT. They are language models based on BERT but trained on a large corpus of scientific text. Specifically, they are trained on a random sample of 1.14M full-text papers from Semantic Scholar [2], 18% from the computer science domain and 82% from the broad biomedical domain. Like BERT-base, for SciBERT, we use its cased and uncased variants.
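The cased/uncased distinction can be illustrated with a simplified preprocessing step. This is only a sketch: real BERT uncased tokenizers also strip accents and apply WordPiece subword segmentation, both omitted here.

```python
def pre_tokenize(text: str, cased: bool) -> list:
    """Whitespace pre-tokenization; uncased variants lowercase the input
    first, so 'BiLSTM' and 'bilstm' map to the same vocabulary entries."""
    if not cased:
        text = text.lower()
    return text.split()
```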
BERT-based Classifiers

We implement the aforementioned BERT models within two neural system extensions that respectively adopt different classification strategies. (Code for the pretrained models: https://github.com/google-research/bert and https://github.com/allenai/scibert)

Single-relation-at-a-time Classification (SRC). Classification models built for SRC generally extend the core BERT architecture with one additional linear classification layer of K × H dimensions, where K is the number of labels and H denotes the size of the hidden states. The label probabilities are then normalized using a softmax function, and the classifier assigns the label with the maximum probability to the target.

Multiple-relations-at-a-time Classification (MRC). This strategy is a more recent innovation on the classification problem in which the classifier is trained with all the relation instances in a sentence at a time, or predicts all the instances in one pass, as opposed to separately for each instance. In this case, however, the core BERT architecture's self-attention mechanism is modified to efficiently consider representations of the relative positions of tokens that represent scientific terms [20,26]. While this modification enables encoding the novel multiple-relations-at-a-time problem, for obtaining the classification probabilities, the MRC is also extended with a linear classification layer, though not identical to the SRC's since it has to model the modified architecture.
Experimental Setup

Experimental datasets, BERT word embeddings, and classification strategies. Our comprehensive evaluation setup involved three different corpora, four BERT embedding variants, and two classification strategies. Given this, we trained a total of eight different classifiers, which for each of the three corpora resulted in 24 trained models. Each corpus was already prepartitioned as training/dev/test by the original creators, which we adopt. To train optimal classifiers on the respective corpora, we tuned the learning rate parameter η over a small set of candidate values.

Evaluation Metrics. We employ the standard machine learning classification evaluation indicators, i.e., Precision (P), Recall (R), F1-score (F1), and Accuracy (Acc).

Results

In this section, we present results from our comprehensive evaluations with respect to the three main research questions that undergird this study.
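The evaluation indicators listed above, computed per relation type and macro-averaged, can be sketched for plain label lists as follows.

```python
def prf1(gold, pred, label):
    """Precision, recall, and F1 for one relation label."""
    tp = sum(g == p == label for g, p in zip(gold, pred))
    fp = sum(p == label and g != label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 scores over the gold label set."""
    labels = sorted(set(gold))
    return sum(prf1(gold, pred, l)[2] for l in labels) / len(labels)

def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)
```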
Table 2: Scientific relation classification results over three datasets (SemEval18, SciERC, & Combined), four BERT model variants (BERT-base cased & uncased; SciBERT cased & uncased), and two classification strategies (SRC & MRC). Acc. is accuracy and F1 is the macro F1-score; top scores are in bold; scores in red from our surveyed systems are those that are on par with the state-of-the-art reported results [6,26] on SemEval18 and SciERC.

                     SRC                                            MRC
                     SemEval18     SciERC        Combined           SemEval18     SciERC        Combined         Avg ± Std (Acc.)
                     Acc.   F1     Acc.   F1     Acc.   F1          Acc.   F1     Acc.   F1     Acc.   F1
BERT-base uncased    76.42  71.74  84.60  77.25  81.75  77.38       80.40  79.98  83.42  74.84  80.84  76.29     81.24 ± 2.84
BERT-base cased      -      -      -      -      -      -           -      -      -      -      -      -         80.05 ± 4.14
SciBERT uncased      -      -      -      -      -      -           -      -      -      -      -      -         82.91 ± 2.04
SciBERT cased        -      -      -      -      -      -           -      -      -      -      -      -         81.71 ± 4.60
Avg. scores          Acc. 84.10, F1 80.22 (SRC)                     Acc. 82.35, F1 77.52 (MRC)

RQ1: What is the impact of the eight classifiers on scientific relation classification?
The eight classifiers are obtained from two classification strategies built over four BERT model variants. We examine their classification results (depicted in Table 2) in terms of the following three key characteristics of the classifiers.

The classification strategy, i.e., SRC vs. MRC. From the Acc and F1 scores shown in Table 2, we see that SRC outperforms MRC on all corpora except the SemEval18 corpus. One characteristic of the SemEval18 corpus is that it has a significantly lower number of annotations than the other two corpora. Thus we infer that the novel MRC strategy is more robust than SRC, because its performance level is unaffected by a drop in the number of annotations.

Word embedding features, i.e., BERT vs. SciBERT. Regarding the BERT word embedding models, SciBERT outperformed BERT on all three corpora with higher accuracy and F1 scores. Since our experimental corpora all comprise scholarly data, this is an expected result: word embeddings trained on similar data domains are better suited.

Vocabulary case in BERT models, i.e., cased vs. uncased. We observe that the uncased BERT models (SciBERT: 82.91, BERT: 81.24) show higher classification accuracy than their cased counterparts (SciBERT: 81.71, BERT: 80.05) on average. Further, the uncased models have a lower standard deviation in accuracy overall (SciBERT: 2.04, BERT: 2.84) than the cased models (SciBERT: 4.60, BERT: 4.14); comparisons on F1 are along similar lines. Hence, our results suggest that uncased BERT models can achieve more stable performance than cased variants.

In conclusion, with respect to the classification strategy, we find SRC outperforms MRC (see averaged scores in the last row of Table 2). Nevertheless, the advanced MRC strategy demonstrates consistently robust performance that remains relatively unaffected by smaller dataset sizes compared to SRC (e.g., SRC vs. MRC results on the SemEval18 corpus). On the other hand, with respect to the BERT word embedding variants, the averaged scores in the last column of Table 2 posit the SciBERT uncased model as the optimal word embedding model for scholarly articles.
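The mean-and-spread comparison above can be reproduced with the standard library; the per-setting score lists below are hypothetical, not the paper's actual per-corpus values.

```python
import statistics

def summarize(scores):
    """Mean and sample standard deviation of a list of accuracies."""
    return statistics.mean(scores), statistics.stdev(scores)

uncased = [80.0, 82.0, 81.0, 83.0]  # hypothetical per-setting accuracies
cased = [76.0, 85.0, 78.0, 84.0]

mean_u, std_u = summarize(uncased)
mean_c, std_c = summarize(cased)
# A lower standard deviation indicates more stable performance.
```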
RQ2: Which of the seven relation types studied are easy/challenging to classify?
Examining the fine-grained per-relation classification results in Tables 3 to 5, across all our evaluation corpora for both SRC and MRC, we note the classification ranked order.

Table 3: Per-relation classification scores (P, R, F1) of the best SRC and MRC systems on SemEval18, over the relations USAGE, RESULT, COMPARE, MODEL-FEATURE, and PART-WHOLE.

Table 4: Per-relation classification results (P, R, F1) of the best SRC and MRC systems on SciERC, over the relations USED-FOR, CONJUNCTION, HYPONYM-OF, EVALUATE-FOR, COMPARE, PART-OF, and FEATURE-OF.

Table 5: Per-relation classification results (P, R, F1) of the best SRC and MRC systems on the Combined corpus, over the relations CONJUNCTION, USAGE, HYPONYM-OF, COMPARE, RESULT, PART-OF, and FEATURE-OF.

Of all relations, USAGE (USED-FOR) is the easiest classification target, since in all three tables it is among the top two. USAGE is also the most predominant type in all corpora. This accounts, in part, for its high-scoring classification, since the classifiers are trained on a significant number of training instances for USAGE compared to the rest.

For the challenging relations, we examine the results per corpus. Starting with the Table 3 results for the SemEval18 corpus, we observe that both SRC and MRC find PART-WHOLE most challenging to classify. We surmise that this relation displays high diversity in the underlying natural language text from which it is induced; hence the classifier is unable to generalize a consistent set of patterns for it. Observing classification performance ranks, for three of the five relations (i.e., USAGE, MODEL-FEATURE, and PART-WHOLE), SRC and MRC obtain the same classification rank order. For RESULT and COMPARE, they are opposites, with SRC classifying RESULT better than MRC.

In the Table 4 results for SciERC, both classifiers perform significantly worse on two relations, viz. FEATURE-OF and PART-OF. Since these two relations are not the most underrepresented in the corpus, we theorize that their low classification performance is owed to the diversity of the natural language text from which they are deduced. In this case, obtaining more annotated instances is one way to boost classifier performance. In terms of the ranked order of performances on the relations, SRC and MRC perform identically on SciERC data.

Fig. 1: Confusion matrices for SRC and MRC misclassifications on the Combined data.

And lastly, in the Table 5 results on the Combined corpus, for the challenging relations, both SRC and MRC show the same result as they did on SciERC, i.e., FEATURE-OF followed by PART-OF are the most challenging. We theorize the same reason for the low scores on these relations, since Combined contains the SciERC data. Of the two corpora in the Combined dataset, SciERC additionally introduced CONJUNCTION, which SemEval18 did not have. CONJUNCTION is among the top two easiest relations to classify, with USAGE as the other, for the classifiers trained on SciERC and on the Combined corpus. Further, its classification is better in the Combined corpus than in SciERC. This lends an understanding of the realistic evaluation setting that the Combined corpus presents. To elaborate, for USAGE, instances from SemEval18 and SciERC (i.e., USED-FOR) are combined, resulting in an insignificant dip in performance (on the Combined corpus, USAGE ranks second easiest compared with SemEval18 and SciERC) since they are now non-uniform annotation signals. As opposed to the case of CONJUNCTION, the Combined corpus obtains a uniform annotation signal from just the SciERC corpus and ranks a minor degree higher at classifying it.

Finally, a list summarizing the top-scoring per-relation performances for scientific relation classification across all three tables includes the following: USAGE (SRC in SciERC), CONJUNCTION (SRC in Combined), HYPONYM-OF (SRC in SciERC), RESULT (MRC in SemEval18), PART-OF (MRC in SemEval18 for PART-WHOLE), and FEATURE-OF (MRC in SemEval18 for MODEL-FEATURE). Since the SemEval18 corpus appears the most times in the top-ranked results, we conclude that its annotations yield a relatively better-trained classifier. However, the SemEval18 corpus only includes scholarly abstracts from one AI domain, i.e., NLP (in the ACL Anthology), whereas SciERC is more comprehensively inclusive across various AI domains. Thus, an additional factor that classifiers trained on SciERC handle is domain diversity.
Error Analysis
A closer look at the misclassifications is portrayed in the confusion matrices in Figure 1 for the SRC and MRC strategies on the Combined corpus. Four of the seven relations, i.e., HYPONYM-OF, RESULT, PART-OF, and FEATURE-OF, are highly likely to be misclassified as USAGE. This shows that our classifiers are biased by the predominant USAGE relation. In general, an unbalanced distribution of training samples (see the details in the corpus section) is, more often than not, one of the main factors for confusion learned in machine learning systems. The most challenging relations, FEATURE-OF and PART-OF, are, after USAGE, highly likely to be confused with each other (FEATURE-OF as PART-OF with ∼10% confusion, and vice versa). For HYPONYM-OF and FEATURE-OF, which loosely demonstrate a relation hierarchy such that HYPONYM-OF subsumes FEATURE-OF but not the other way around, we find the classification confusion demonstrates a consistent pattern in this data: from the matrices, we see that HYPONYM-OF has a ∼6% likelihood of being predicted as FEATURE-OF, but none of the FEATURE-OF (0%) instances were confused with HYPONYM-OF.

To offer another pertinent angle on the classifier error analysis, we compute the word distance distributions between related scientific term pairs in the Combined corpus. This data is depicted in Figure 2. In general, most box plots shown in the figure are skewed, with a long upper whisker and a short lower whisker, which indicates that the majority of paired scientific terms are close in the text. This holds specifically for the scientific term pairs in the CONJUNCTION relation, which in linguistic terms should necessarily be close. This consistent pattern could be another reason why CONJUNCTION is among the easiest relations to classify. Further, the average word distance of FEATURE-OF, PART-OF, HYPONYM-OF, and COMPARE is closer to the lower quartile than for the other relations. Such varied distributions may bring challenges for a classifier to identify these relations. Notably, the similar median value and spread range of FEATURE-OF and PART-OF could account for why they are challenging to classify.

Fig. 2: Word distance distributions between scientific term pairs in the Combined data, per relation type.

Finally, we examine the third research question that undergirded this study.

RQ3: What is the practical relevance of the seven relations studied in this paper in a scholarly knowledge graph?
As a practical illustration of the relation triples studied in this work, we build a knowledge graph from their annotations in the 1000 scholarly abstracts in the Combined dataset. This graph is depicted in Figure 3. Looking at the corpus-level graph (the right graph), we observe that generic scientific terms such as "method," "approach," and "system" are the most densely connected nodes, as expected, since generic terms are found across research areas. In the zoomed-in ego-network of the term "machine translation" (the left graph), HYPONYM-OF is meaningfully highlighted by its role linking "machine translation" and its sibling nodes, the research tasks "speech recognition" and "natural language generation," to the parent node "NLP problems." The term "lexicon" is related by USAGE to "machine translation" and "operational foreign language tutoring." The CONJUNCTION link joins "machine translation" and "speech recognition," which both aim at translating information from one source to another. This knowledge graph now enables various property comparisons across the 1000 scholarly abstracts (consider the corpus statistics in Table 1, generated from the Open Research Knowledge Graph presented in the corpus section).

Fig. 3: A knowledge graph constructed from the relation triples in the Combined corpus. The graph on the left is the "ego-network" of the scientific term "machine translation". Node size is determined by weighted node degree. Colors denote modularity classes based on the graph structure. The graph was generated using https://gephi.org/
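The graph construction described above can be sketched with plain data structures: accumulate the triples into a weighted adjacency structure, then derive the weighted degree (used for node size in Figure 3) and the ego-network of a term. The triples below are a hypothetical hand-picked subset for illustration, not the full annotation set.

```python
from collections import defaultdict

# Hypothetical relation triples (subject, relation, object); the actual graph
# uses all annotated triples from the 1000 abstracts in the Combined corpus.
triples = [
    ("machine_translation", "HYPONYM-OF", "NLP_problems"),
    ("speech_recognition", "HYPONYM-OF", "NLP_problems"),
    ("natural_language_generation", "HYPONYM-OF", "NLP_problems"),
    ("lexicons", "USAGE", "machine_translation"),
    ("machine_translation", "CONJUNCTION", "speech_recognition"),
    ("machine_translation", "CONJUNCTION", "speech_recognition"),
]

# Adjacency structure with edge weights: repeated triples increase the weight.
graph = defaultdict(lambda: defaultdict(int))
for subj, rel, obj in triples:
    graph[subj][(obj, rel)] += 1
    graph[obj][(subj, rel)] += 1  # treat edges as undirected for degree

def weighted_degree(node):
    """Sum of edge weights incident to a node (drives node size in the figure)."""
    return sum(graph[node].values())

def ego_network(node):
    """The node plus its direct neighbors, as in the left graph of Figure 3."""
    return {node} | {nbr for nbr, _rel in graph[node]}

print(weighted_degree("machine_translation"))
print(sorted(ego_network("machine_translation")))
```

A dedicated graph library (e.g., NetworkX, or Gephi for layout and modularity classes, as used for Figure 3) would replace these helpers in practice; the sketch only shows the underlying bookkeeping.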
We have investigated the scientific relation classification task for improving scholarly knowledge representations in digital libraries. Our surveyed systems offer a comprehensive view of the results attainable when advanced neural technology is put together in varying combinations, including the combinations that produce state-of-the-art results. We have categorized this neural technology in terms of four BERT-based embedding models and two classification strategies. Our results indicate that, of the various word embedding models, SciBERT is the optimal choice for a corpus of scholarly data. In terms of classification strategies, the single-relation-at-a-time system outperforms the multiple-relations-at-a-time classification system. Nevertheless, the latter is robust over all datasets, even in settings with lean annotated data, in which case it outperforms the former. Our findings span a broad range of model performances for scientific relation classification, now available to the stakeholders of digital libraries.
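The contrast between the two classification strategies can be made concrete at the level of training-instance construction. The sketch below is a schematic illustration only; the sentence, term pairs, and labels are hypothetical, and the actual systems' input encodings are not reproduced here.

```python
# A sentence with three annotated term pairs; labels are illustrative.
sentence = "We apply SciBERT embeddings and a CRF layer for term extraction."
annotated_pairs = [
    (("SciBERT_embeddings", "term_extraction"), "USAGE"),
    (("CRF_layer", "term_extraction"), "USAGE"),
    (("SciBERT_embeddings", "CRF_layer"), "CONJUNCTION"),
]

def single_relation_instances(sentence, pairs):
    """One classification instance per term pair (single-relation-at-a-time)."""
    return [{"text": sentence, "pair": p, "label": lbl} for p, lbl in pairs]

def multi_relation_instance(sentence, pairs):
    """One instance per sentence, predicting all its relations at once
    (multiple-relations-at-a-time)."""
    return {"text": sentence,
            "pairs": [p for p, _ in pairs],
            "labels": [lbl for _, lbl in pairs]}

print(len(single_relation_instances(sentence, annotated_pairs)))
print(multi_relation_instance(sentence, annotated_pairs)["labels"])
```

The first strategy yields more, but simpler, training instances; the second shares one encoding of the sentence across all of its relations, which is consistent with its observed robustness on sparsely annotated corpora.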
To further facilitate the choice of the proper technique for classifying scientific relations toward the creation of structured, semantic representations over scholarly articles, there are several avenues worthwhile for future exploration. As we have seen in the course of examining RQ2, there exist label biases in the annotated corpora such that some relations are better represented than others (e.g., USAGE). Toward this end, such data need to be further curated by experts to enable a well-represented domain. Further, digital libraries deal with various domains of science in general. While our evaluations have been performed on corpora that cover the Artificial Intelligence research area, there remains plenty of potential to explore other research domains unrelated to Computer Science specifically. Finally, we have examined scientific relation classification in terms of seven relations, whereas ontologized models of the scientific world [9,19] posit a larger set of relations or properties. For this, techniques such as open information extraction or ontology-based extraction are viable alternatives for future developments.
References
1. Agichtein, E., Gravano, L.: Snowball: Extracting relations from large plain-text collections. In: Proceedings of the Fifth ACM Conference on Digital Libraries. pp. 85–94 (2000)
2. Ammar, W., Groeneveld, D., Bhagavatula, C., Beltagy, I., Crawford, M., Downey, D., Dunkelberger, J., Elgohary, A., Feldman, S., Ha, V., et al.: Construction of the literature graph in Semantic Scholar. In: NAACL, Volume 3 (Industry Papers). pp. 84–91 (2018)
3. Auer, S., Kovtun, V., Prinz, M., Kasprzik, A., Stocker, M., Vidal, M.E.: Towards a knowledge graph for science. In: Proceedings of the 8th International Conference on Web Intelligence, Mining and Semantics. pp. 1–6 (2018)
4. Auer, S., Mann, S.: Toward an open knowledge research graph. The Serials Librarian (2019)
5. Augenstein, I., Das, M., Riedel, S., Vikraman, L., McCallum, A.: SemEval 2017 Task 10: ScienceIE - extracting keyphrases and relations from scientific publications. In: Proceedings of SemEval-2017. pp. 546–555 (2017)
6. Beltagy, I., Lo, K., Cohan, A.: SciBERT: A pretrained language model for scientific text. In: EMNLP-IJCNLP. pp. 3615–3620. ACL, Hong Kong, China (Nov 2019)
7. Culotta, A., Sorensen, J.: Dependency tree kernels for relation extraction. In: Proceedings of the 42nd Annual Meeting of the ACL. p. 423. ACL (2004)
8. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL, Volume 1 (Long and Short Papers). pp. 4171–4186. ACL, Minneapolis, Minnesota (Jun 2019)
9. Fathalla, S., Vahdati, S., Auer, S., Lange, C.: Towards a knowledge graph representing research findings by semantifying survey articles. In: TPDL. pp. 315–327. Springer (2017)
10. Gábor, K., Buscaldi, D., Schumann, A.K., QasemiZadeh, B., Zargayouna, H., Charnois, T.: SemEval-2018 Task 7: Semantic relation extraction and classification in scientific papers. In: Proceedings of SemEval-2018. pp. 679–688 (2018)
11. Hallo, M., Luján-Mora, S., Maté, A., Trujillo, J.: Current state of linked data in digital libraries. Journal of Information Science (2), 117–127 (2016)
12. Haslhofer, B., Isaac, A., Simon, R.: Knowledge graphs in the libraries and digital humanities domain. arXiv preprint arXiv:1803.03198 (2018)
13. Jaradeh, M.Y., Oelen, A., Farfar, K.E., Prinz, M., D'Souza, J., Kismihók, G., Stocker, M., Auer, S.: Open Research Knowledge Graph: Next generation infrastructure for semantic scholarly knowledge. In: KCAP. pp. 243–246. ACM, New York, NY, USA (2019)
14. Jiang, M., Diesner, J.: A constituency parsing tree based method for relation extraction from abstracts of scholarly publications. In: Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13). pp. 186–191 (2019)
15. Klampfl, S., Kern, R.: An unsupervised machine learning approach to body text and table of contents extraction from digital scientific articles. In: TPDL. pp. 144–155. Springer (2013)
16. Luan, Y., He, L., Ostendorf, M., Hajishirzi, H.: Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction. In: EMNLP. pp. 3219–3232 (2018)
17. Luan, Y., Wadden, D., He, L., Shah, A., Ostendorf, M., Hajishirzi, H.: A general framework for information extraction using dynamic span graphs. arXiv preprint arXiv:1904.03296 (2019)
18. Manning, C.D.: Computational linguistics and deep learning. Computational Linguistics 41