Experiments on transfer learning architectures for biomedical relation extraction
Walid Hafiane, Joël Legrand, Yannick Toussaint, Adrien Coulet
Inria Paris, Inserm UMR1138, Université de Paris, France
{firstname.lastname}@loria.fr
Abstract.
Relation extraction (RE) consists in automatically identifying and structuring relations of interest from texts. Recently, BERT improved the top performances for several NLP tasks, including RE. However, the best way to use BERT, within a machine learning architecture and within a transfer learning strategy, is still an open question, since it is highly dependent on each specific task and domain. Here, we explore various BERT-based architectures and transfer learning strategies (i.e., frozen or fine-tuned) for the task of biomedical RE on two corpora. Among the tested architectures and strategies, our *BERT-segMCNN with fine-tuning reaches performances higher than the state-of-the-art on the two corpora (1.73% and 32.77% absolute improvement on the ChemProt and PGxCorpus corpora, respectively). More generally, our experiments illustrate the expected interest of fine-tuning with BERT, but also the unexplored advantage of using structural information (with sentence segmentation), in addition to the context classically leveraged by BERT.

The biomedical literature is continuously growing in volume, which makes natural language processing (NLP) an essential tool for leveraging the domain knowledge it expresses. In this context, the NLP task of relation extraction (RE) plays an important role in many applications, such as semantic understanding, query answering or knowledge summarization. Indeed, RE allows to extract and structure elements of knowledge from natural language texts by identifying and typing the relationships that may be mentioned between named entities (Pawar et al., 2017). The resulting relations can be normalized and assembled in the form of a knowledge graph (KG), which summarizes a domain of knowledge and may serve as a structured intermediate for subsequent tasks of knowledge discovery or knowledge comparison (Monnin et al., 2019). In this work, we explore the task of RE, in the biomedical domain, from a deep learning perspective.

Deep neural networks have achieved significant success for the task of RE (Kumar, 2017), and more generally for NLP tasks (Collobert et al., 2011). In particular, convolutional and recurrent neural networks (CNN and RNN) have been successfully used for the task of biomedical RE. Inspired by the vision domain, multi-channel CNNs (MCNN) have been used in the NLP domain for RE (Quan et al., 2016), because of their ability to extract local features latent in the several embedding vectors. Here, the convolution kernels allow for extracting relevant features in a hierarchical manner. In addition, weight sharing implies a processing-time advantage for this architecture. In contrast, an RNN is an architecture that can capture the semantic features of the context. The recurrence allows for storing representations of the input sequence in the form of an activation. The last hidden state summarizes the context of the whole input, and thus is considered as a semantic representation at the sentence level. Bidirectional Long Short-Term Memory (BiLSTM) (Hochreiter and Schmidhuber, 1997) is an RNN broadly used in NLP, mainly for its ability to capture relatively short past and future context. However, the previously mentioned architectures fail at capturing long contexts.
This led to the adoption of attention mechanisms in order to capture long-distance relations (Chen et al., 2020). Leveraging this idea, BERT (Bidirectional Encoder Representations from Transformers) combines encoders and attention mechanisms, and advances the level of performance on many NLP tasks including RE (Devlin et al., 2019; Shi and Lin, 2019). The BERT principle, like that of Elmo (Embeddings from Language Models), is based on transfer learning (Peters et al., 2018). It consists in pre-training a language model on large corpora with unsupervised tasks, which provides a general language understanding, subsequently reused in downstream NLP tasks.

BERT models are strong at learning semantic features thanks to the transformer attention architecture, but they are less capable of capturing structural information. Indeed, no explicit order is imposed between the representation vectors in its different blocks, particularly in the attention sublayer. This leads to a weak contribution of the structural information to the BERT output. The only source of this information is the position embedding included in the input; this structural information propagates through the 12 transformer blocks of BERT, which results in a poor representation of this information. In contrast, Li et al. (2019) demonstrate that using sentence segmentation as a pre-processing may improve an RNN architecture by providing structural information. Chen et al. (2020) explore this pre-processing with a CNN architecture.

Our work explores two transfer learning strategies to enhance the performances of BERT variants (denoted *BERT) for the task of biomedical RE. We propose BERT-based architectures for the "frozen" and "fine-tuning" transfer learning strategies, in an attempt to extract more relevant features out of the BERT representation vectors. Moreover, we explore strengthening BERT with structural information using sentence segmentation as a post-processing, which is, to our knowledge, an original approach. We tested on two benchmark biomedical corpora, ChemProt (Kringelum et al., 2016) and PGxCorpus (Legrand et al., 2020).

This article is structured as follows: Section 2 provides elements of background on our learning task (biomedical RE), on the BERT architecture and on transfer learning strategies. Section 3 details the BERT-based architectures and transfer learning strategies we implemented. Section 4 exposes experiments and their results. Section 5 discusses our results and concludes.

Relation extraction (RE) aims at identifying, in unstructured text, all the instances of a predefined set of types of relations between identified entities. As a result, relations are composed of two or more named entities and a label that types the association between them. Figure 1 provides an example of a relation of the type "influences" between two entities (of types Limited_variation and Pharmacodynamic_phenotype). We consider here binary RE, which can be seen as a classification task consisting in computing a score for each possible relation type, given a sentence and two identified entities.

FIG. 1 – Example of a sentence with four named entities and a relation between two of them. This relation is an extract from PGxCorpus (Legrand et al., 2020).

We experimented with two manually annotated corpora which focus on slightly different kinds of biomedical relationships.
ChemProt is a corpus of relationships between proteins and chemicals, with 10,031 examples, distributed in train, validation, and test sets (Kringelum et al., 2016). Annotated relations belong to 13 different types of protein-molecule relations. Table 1 shows the distribution of examples over relation types in the train set. ChemProt is particularly unbalanced, since the most frequent relation type in the train set represents 39.4% of the examples, while the least frequent type represents only 0.1%. For instance, "AGONIST-INHIBITOR" has only 4 examples in the train set.
PGxCorpus is a corpus of pharmacogenomic relationships between genomic factors and drugs or drug responses (Legrand et al., 2020). It is composed of 2,875 examples, distributed over 7 different types of relations (isAssociatedWith, influences, causes, decreases, increases, treats, isEquivalentTo). The percentage and number of examples of each type of relationship are presented in Table 2. The distribution of relationships is unbalanced: the most frequent is "influences" (32.6%), while "causes" is the least frequent (5.8%).

The two corpora were chosen for distinct reasons: ChemProt is a well-established corpus that served as a benchmark in many previous works on transfer learning, enabling us to compare our performances with state-of-the-art approaches; PGxCorpus is a novel corpus for the domain of pharmacogenomics, which offers opportunities for studying the extraction of n-ary and/or nested relationships.
Range   Class                           Size
> 400   I.-DOWNREGULATOR / SUBSTRATE    487 (11.7%) / 480 (11.5%)
> 300   I.-UPREGULATOR / ACTIVATOR      387 (9.3%)  / 323 (7.7%)
> 200   ANTAGONIST / PRODUCT-OF         235 (5.6%)  / 233 (5.6%)
> 100   AGONIST / DOWNREGULATOR         156 (3.7%)  / 131 (3.1%)
> 10    UPREGULATOR / SUBSTRATE-P.      67 (1.6%)   / 14 (0.3%)
≤ 10    AGONIST-A. / AGONIST-I.         10 (0.2%)   / 4 (0.1%)
TAB. 1 – Distribution of relationships by type in ChemProt (train set).

Range   Class                           Size
> 400   Influences / IsAssociatedWith   937 (32.6%) / 733 (25.5%)
> 250   IsEquivalentTo / Decreases      293 (10.2%) / 263 (9.1%)
> 200   Increases / Treats              243 (8.5%)  / 238 (8.3%)
≤ 200   Causes                          168 (5.8%)
TAB. 2 – Distribution of relationships by type in PGxCorpus.

BERT (Devlin et al., 2019) is the first encoder architecture that captures past and future context simultaneously, unlike OpenAI GPT (Improving Language Understanding by Generative Pre-Training) (Radford et al., 2018), which only captures the past context, or Elmo (Peters et al., 2018), which uses a concatenation of independently captured past and future contexts. The BERT encoder is made up of a stack of 12 identical blocks, where each block is composed of two sublayers. The first one is a multi-head self-attention mechanism, and the second is a fully connected feedforward neural network. Residual connections are used around each of the two sublayers, followed by layer normalization applied after each sublayer. The transformer is pre-trained on two unsupervised tasks (i.e., prediction of masked words and prediction of the next sentence) on large corpora (i.e., Wikipedia, 2.5 billion words, and BookCorpus, 800 million words). The aim of the pre-training is to acquire a general language understanding, with the idea of transferring this ability to downstream tasks.

According to Pan and Yang (2010), transfer learning is divided into two main categories: inductive and transductive transfer. Inductive transfer consists in transferring information between different tasks, usually within the same domain, whereas transductive transfer consists in transferring information from one domain to another, within the same task.

Following the transductive approach, the initial BERT model has been used to develop models particularly adapted to specific domains and languages, such as those of biomedicine. Several variants of BERT result from this transfer; they use the pre-trained BERT model and apply a domain transfer in order to make BERT more efficient in the target domain. In particular, BioBERT initializes its weights with the BERT pre-trained model and then continues the training on a large set of biomedical texts (about 18 billion words) (Lee et al., 2020). BERT and BioBERT performances have already been compared on the task of RE on the ChemProt corpus, showing that BioBERT was outperforming. SciBERT is another BERT variant that follows the same approach, but uses 3.17 billion words of mixed scientific texts (Beltagy et al., 2019), of which 82% come from the biomedical domain. SciBERT is the state-of-the-art model for the task of RE on ChemProt.

Following the inductive approach, the BERT pre-trained model and its ability to understand, or represent, language is typically reused and enriched (e.g., with an additional layer of neurons) for different tasks such as named entity recognition or RE. In this context, we explore both frozen and fine-tuning strategies. The frozen strategy consists in reusing layers of neurons from a previously trained model and "freezing" them, i.e., avoiding updating their parameters during future training steps. Next, one usually adds layers on top of the frozen layers, so that these new layers consider the output of the frozen layers as input features for the final learning task. In contrast, the fine-tuning strategy also reuses layers from a previously trained model, plus additional layers on top, but allows tuning the parameters of the whole model with regard to the new train set. In the frozen strategy, the fact that the gradient back-propagation applies only to the added neurons presents an advantage in terms of processing time, whereas fine-tuning may obtain increased performance by adapting the pre-trained parameters to the new train set.

In this work we explore both transductive and inductive transfer learning for the task of biomedical RE, by experimenting not only with various BERT variants, but also with both the frozen and fine-tuning strategies.
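To make the difference between the two strategies concrete, here is a minimal PyTorch/Hugging Face sketch; the model identifier, head sizes and number of classes are illustrative assumptions, not the exact setup used in our experiments.

```python
import torch
from transformers import AutoModel

# Load a pre-trained *BERT encoder (BioBERT here; any BERT variant would do).
encoder = AutoModel.from_pretrained("dmis-lab/biobert-base-cased-v1.1")

# Frozen strategy: the pre-trained parameters are kept fixed, so only the
# task-specific layers added on top are updated during training.
for param in encoder.parameters():
    param.requires_grad = False
# Fine-tuning strategy: simply skip the loop above (requires_grad stays True),
# so gradients flow through the whole encoder as well as the added layers.

# A placeholder relation-classification head added on top of the encoder.
num_relation_types = 13  # e.g., ChemProt
head = torch.nn.Sequential(
    torch.nn.Linear(encoder.config.hidden_size, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, num_relation_types),
)

# Only parameters with requires_grad=True are handed to the optimizer.
trainable = [p for p in list(encoder.parameters()) + list(head.parameters())
             if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-3)
```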
*BERT+BiLSTM

In the frozen strategy, our baseline is the frozen architecture reported for RE in the SciBERT article, which consists in an RNN-based model on top of a BERT-like model (Beltagy et al., 2019). For simplicity, we denote a BERT-like model, i.e., either BERT, BioBERT or SciBERT, with *BERT. In this architecture, *BERT is used as pre-trained contextualized word embeddings. The added classifier is composed of a 2-layer BiLSTM of size 200, followed by a multilayer perceptron applied on the concatenation of the first and last BiLSTM vectors. This RNN part is supposed to extract the contextual information from the sequence of embedding vectors, in order to feed the multilayer perceptron classifier. For short, and depending on the frozen pre-trained model, we denote this baseline architecture by BERT+BiLSTM, BioBERT+BiLSTM and SciBERT+BiLSTM, or more generally *BERT+BiLSTM.
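A minimal sketch of this frozen-strategy head is given below. The 2-layer BiLSTM of size 200 and the concatenation of its first and last output vectors follow the description above; the MLP sizes and number of classes are assumptions.

```python
import torch
import torch.nn as nn

class BiLSTMHead(nn.Module):
    """Frozen-strategy classifier: a 2-layer BiLSTM over *BERT embeddings,
    followed by an MLP over the concatenated first and last BiLSTM outputs."""

    def __init__(self, hidden_size=768, lstm_size=200, num_classes=13):
        super().__init__()
        self.bilstm = nn.LSTM(hidden_size, lstm_size, num_layers=2,
                              batch_first=True, bidirectional=True)
        self.mlp = nn.Sequential(
            nn.Linear(4 * lstm_size, 64),  # first + last outputs, each of size 2*lstm_size
            nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, embeddings):
        # embeddings: (batch, seq_len, hidden), produced by a frozen *BERT encoder
        out, _ = self.bilstm(embeddings)
        first, last = out[:, 0, :], out[:, -1, :]
        return self.mlp(torch.cat([first, last], dim=-1))
```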
*BERT+MLP

In the fine-tuning strategy, our baseline is the architecture reported for RE with BERT, BioBERT and SciBERT. It consists in a simple multilayer perceptron (MLP), composed of a single fully connected hidden layer, added on top of the BERT pre-trained models (Devlin et al., 2019; Lee et al., 2020; Beltagy et al., 2019). In the following, we denote these architectures by BERT+MLP, BioBERT+MLP and SciBERT+MLP.

In the original SciBERT article, RE was evaluated on ChemProt following both the frozen (SciBERT+BiLSTM) and fine-tuning (SciBERT+MLP) strategies. We voluntarily reproduced their results for further comparisons with other architectures or corpora (i.e., PGxCorpus).
*BERT+MCNN

The first proposed architecture is the extension of BERT with an MCNN (denoted generally *BERT+MCNN). We experiment with this combination in the hope that the MCNN extracts local information from the representation vectors computed by BERT. Indeed, BERT has the ability to extract long-distance contextual information using multi-headed attention, whereas the local contextual information captured by the MCNN may provide additional discriminative features. The fact that the order between the different dimensions is not explicit in the representation vectors led us to consider them as independent features and to handle each dimension as a different embedding. Accordingly, each dimension represents a channel of the MCNN. This architecture is followed by a max-pooling to obtain a sentence representation vector, which in turn feeds a multilayer perceptron. The MCNN contains only few parameters (due to parameter sharing), no recurrence and relatively few layers, which globally tends to reduce the internal covariate shift effect. This let us envisage fine-tuning with this architecture (in addition to the frozen strategy, which is less computationally expensive). As a result, we experiment with this architecture in both the frozen and fine-tuning strategies.

*BERT+BiLSTM-MCNN

The second type of proposed architectures extends BERT with a BiLSTM and an MCNN. The exploration of this architecture is motivated by the fact that the BiLSTM may capture semantic features in the BERT representation vectors and that the MCNN may capture local information from the same vectors. To explore how these two kinds of features may be combined, we propose two *BERT+BiLSTM-MCNN architectures: a linear one (*BERT+BiLSTM-MCNN L.), where the BiLSTM outputs feed the MCNN, before a max-pooling and a multilayer perceptron for classification; and a parallel one (*BERT+BiLSTM-MCNN P.), where both the MCNN and the BiLSTM take the BERT representation vectors as input. The outputs of the MCNN and of the BiLSTM are then concatenated to feed the multilayer perceptron for classification. These architectures are experimented only in the frozen strategy, mainly because in fine-tuning the depth of BERT plus the BiLSTM-MCNN extension makes the computation very heavy and may lead to a vanishing gradient problem.
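Below is a minimal sketch of the MCNN block used, alone or after the BiLSTM, in the architectures above. The kernel sizes and filter counts follow the ranges reported in the experiments section, but the exact layout is an assumption rather than our implementation.

```python
import torch
import torch.nn as nn

class MCNNHead(nn.Module):
    """Multi-channel CNN over *BERT representation vectors: each embedding
    dimension is treated as an input channel; parallel 1-D convolutions with
    different kernel sizes extract local features, which are max-pooled and
    fed to an MLP classifier."""

    def __init__(self, hidden_size=768, kernel_sizes=(3, 5, 7), n_filters=6, num_classes=13):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(hidden_size, n_filters, k, padding=k // 2) for k in kernel_sizes]
        )
        self.mlp = nn.Sequential(
            nn.Linear(n_filters * len(kernel_sizes), 64),
            nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, embeddings):
        # embeddings: (batch, seq_len, hidden) -> (batch, hidden, seq_len) for Conv1d
        x = embeddings.transpose(1, 2)
        # Max-pool each convolution output over the sequence, then concatenate
        pooled = [conv(x).max(dim=-1).values for conv in self.convs]
        return self.mlp(torch.cat(pooled, dim=-1))
```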
FIG. 2 – Overview of our BERT-segMCNN architecture.

*BERT+segMCNN

The third proposed architecture extends BERT with five MCNNs, each of them taking as input a different partition (or segment) of the BERT representation vectors. We propose this architecture because, classically, no order is imposed in the transformer between representation vectors, in particular in the attention layer. As a result, the structural information may be limited in the output representation vectors. In order to strengthen the *BERT architecture with this information, we propose five MCNNs in parallel, as illustrated in Figure 2. The segmentation of the representation vectors is based on the location of the entities in the sentence. Representation vectors are split according to their belonging to one of the following five partitions: before the first occurring entity, the first entity, between the first and the second entity, the second entity, and after the second entity. In order to carry out this post-processing, entity positions are propagated from the input of the whole architecture to the input of the segMCNN block, as shown in Figure 2. Max-pooling and a fully connected (FC) layer are applied after each MCNN, which provides five vectors, each one representing one segment. This architecture ends with an MLP used for the classification of the sentence representation vector, obtained by the concatenation of the five segment vectors.
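As an illustration, a minimal sketch of the segmentation step and of a segMCNN-like head is given below. Entity spans, filter counts and layer sizes are illustrative assumptions, and empty segments are simply padded with a zero vector.

```python
import torch
import torch.nn as nn

def split_by_entities(embeddings, e1_span, e2_span):
    """Split a sequence of *BERT representation vectors into the five segments
    used by segMCNN: before entity 1, entity 1, between the entities, entity 2,
    after entity 2. embeddings: (seq_len, hidden); spans are (start, end) token
    indices with end exclusive, and entity 1 is assumed to occur first."""
    (s1, e1), (s2, e2) = e1_span, e2_span
    return [embeddings[:s1], embeddings[s1:e1], embeddings[e1:s2],
            embeddings[s2:e2], embeddings[e2:]]

class SegMCNNHead(nn.Module):
    """One small convolutional encoder per segment; the five pooled segment
    vectors are concatenated and classified by an MLP."""

    def __init__(self, hidden=768, n_filters=6, seg_dim=32, num_classes=13):
        super().__init__()
        self.convs = nn.ModuleList([nn.Conv1d(hidden, n_filters, 3, padding=1)
                                    for _ in range(5)])
        self.fcs = nn.ModuleList([nn.Linear(n_filters, seg_dim) for _ in range(5)])
        self.mlp = nn.Sequential(nn.Linear(5 * seg_dim, 64), nn.ReLU(),
                                 nn.Linear(64, num_classes))

    def forward(self, segments):
        seg_vectors = []
        for seg, conv, fc in zip(segments, self.convs, self.fcs):
            if seg.shape[0] == 0:                          # empty segment -> one zero vector
                seg = torch.zeros(1, conv.in_channels)
            x = conv(seg.t().unsqueeze(0))                 # (1, n_filters, seg_len)
            seg_vectors.append(fc(x.max(dim=-1).values))   # max-pool + FC -> (1, seg_dim)
        return self.mlp(torch.cat(seg_vectors, dim=-1))    # (1, num_classes)
```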
We set up four batches of experiments, each of them associated with one target corpus (ChemProt or PGxCorpus) and one transfer strategy (frozen or fine-tuning). In each batch, we tested for comparison both state-of-the-art architectures and the original ones we propose.
Depending on the experiments, we used as evaluation metric either the macro- or the micro-averaged F-measure (noted F-macro and F-micro respectively). The F-measure combines precision and recall by computing their harmonic mean. When reporting a unique F-measure for a multi-class classification task, there are various ways to average the F-measure: either by considering each example independently, i.e., averaging over all examples and treating them equally regardless of their class (F-micro); or by a simple arithmetic average over classes, treating all classes equally regardless of their size (F-macro). Classically, F-micro is preferred in highly unbalanced settings. In our case, we preferred F-macro in the experiments with PGxCorpus, and F-micro with ChemProt, because the latter includes rare types of relationships. We note that F-micro is computationally equivalent to the accuracy metric in the case of disjoint classes, which is the case in this work. In both cases, F-measures by class (i.e., by type of relationship) are available in an online supplementary material (see the results section).
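For readers less familiar with the two averaging schemes, here is a small, self-contained example on toy labels (scikit-learn is assumed for scoring, which is not necessarily what we used in our experiments):

```python
from sklearn.metrics import f1_score

# Toy predictions for a 3-class relation classification task
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 0]

# F-micro: every example counts equally; with disjoint classes it equals accuracy (4/6 here)
print(f1_score(y_true, y_pred, average="micro", zero_division=0))
# F-macro: every class counts equally; lower here because the rare class 2 is never predicted
print(f1_score(y_true, y_pred, average="macro", zero_division=0))
```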
Models are trained to minimize the cross-entropy function. With the ChemProt corpus, our models are trained on the train set and tested on the test set, and the validation set is used for model selection and hyperparameter tuning. We reused the hyperparameters tuned on ChemProt for the PGxCorpus experiments. For PGxCorpus, no train, validation and test sets are predefined; thus we adopted a 10-fold cross-validation strategy to define them randomly several times. Hyperparameter tuning provided us with various optimal settings for each pair of architecture and transfer strategy. Accordingly, we report here the sets of parameter values we used. For the CNN, the convolution filter sizes are (3, 5, 7) and the numbers of filters are (3, 6). For the BiLSTM we used two layers of size 200, and for the MLP one hidden layer of size (64, 100). We train our models with a batch size of 32 and use two regularization types: a dropout of (0.1, 0.25, 0.5) and an L2-regularization of (0, 0.01). We optimize the loss function using AdamWithDecay with a decay of (0, 0.01) and an initial learning rate of 0.001 for the frozen transfer strategy, and of ( . − , . − , − ) for the fine-tuning strategy. The numbers of epochs used are (30, 65) for the frozen strategy, and (5, 8) for the fine-tuning strategy. Added weights are initialized with a normal distribution (µ = 0, σ = 0. ). The learning procedure is initialized with different random weights (about 100 times) and we report the mean performances. We report the standard deviation to analyze the stability of our models.

The experiments have been developed with the Python and PyTorch libraries, with the BERT-base-uncased version of BERT, v1.1 of BioBERT-uncased and the SciBERT-SciVOCAB-uncased version of SciBERT. The code of our experiments is available at https://github.com/hafianewalid/Transfer-Learning-Architectures-for-Biomedical-Relation-Extraction.
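As a hedged illustration of this training setup, the snippet below wires together the loss, dropout and an AdamW optimizer (standing in for the AdamWithDecay optimizer mentioned above); the specific combination of values shown is only one point of the grid reported above, and the head is a placeholder.

```python
import torch

# Placeholder classification head; in the experiments it would sit on top of a *BERT encoder.
model = torch.nn.Sequential(
    torch.nn.Dropout(p=0.25),       # dropout in {0.1, 0.25, 0.5}
    torch.nn.Linear(768, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 13),
)

criterion = torch.nn.CrossEntropyLoss()            # models minimize cross-entropy
optimizer = torch.optim.AdamW(model.parameters(),
                              lr=1e-3,             # initial learning rate, frozen strategy
                              weight_decay=0.01)   # decay in {0, 0.01}
```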
Architecture                             F-micro   σ
(Beltagy et al., 2019) SciBERT+BiLSTM    75.03     –
BERT    + BiLSTM                         64.41     2.42
        + MCNN                           68.26     1.54
        + BiLSTM-MCNN L.                 75.35     1.02
        + BiLSTM-MCNN P.                 62.07     1.33
BioBERT + BiLSTM                         72.86     1.36
        + MCNN                           77.57     0.70
        + BiLSTM-MCNN L.                 79.24     0.76
        + BiLSTM-MCNN P.                 70.45     1.18
TAB. 3 – Evaluation of the performances of various transfer learning architectures on ChemProt, using the frozen strategy.

With ChemProt corpus

Results obtained with the frozen strategy and the ChemProt corpus are presented in Table 3.
The first line of Table 3 reports the state-of-the-art (SOTA) results from (Beltagy et al., 2019), which we reproduced and reported on the 10th line of the table (denoted SOTA reproduction); it drifts by 0.07% F-micro, within a standard deviation of 1. Regardless of the variant of BERT that we used, the *BERT+BiLSTM-MCNN L. models are the most performing on the ChemProt corpus. BioBERT+BiLSTM-MCNN achieves the highest performance with 80.08% F-micro.

With PGxCorpus

Architecture                        Precision   Recall   F-macro   σ
(Legrand et al., 2020) MCNN         –           –        45.67     4.51
BERT    + BiLSTM                    54.29       54.82    54.29     0.84
        + MCNN                      57.64       53.84    53.73     1.33
        + BiLSTM-MCNN L.            71.85       70.47    70.63     1.38
        + BiLSTM-MCNN P.            54.48       54.37    53.99
BioBERT + BiLSTM                    57.16       57.32    56.93     0.89
        + MCNN                      69.52       66.25    67.02     2.43
        + BiLSTM-MCNN L.
TAB. 4 – Evaluation of the performances of various transfer learning architectures on PGxCorpus, using the frozen strategy.
Results obtained with the frozen strategy and PGxCorpus are presented in Table 4. The first line reports the SOTA results from (Legrand et al., 2020), obtained with a simple MCNN, without any BERT pre-trained model. As expected, we observe that all our (BERT-based) architectures exceed this baseline and are more stable. In particular, we note that, independently of the variant of BERT used, the *BERT+BiLSTM-MCNN L. architecture produces the most performing models on PGxCorpus, similarly to what we observed with ChemProt. We also note that, this time, SciBERT+BiLSTM-MCNN L. (slightly) overpasses the BioBERT version of the same architecture.
With ChemProt corpus
Results obtained with the fine-tuning strategy and the ChemProt corpus are presented in Table 5. The first line of Table 5 reports the SOTA results from (Beltagy et al., 2019), which we reproduced and reported on the 8th line of the table (denoted SOTA reproduction); it drifts by 1.2% F-micro, within a standard deviation of 1.47. We observe that, for each variant of BERT, the models obtained with *BERT+segMCNN exceed the performances of the others. In particular, we note that BioBERT+segMCNN lets us reach a new state-of-the-art performance (85.37%, an absolute improvement of 1.73%). We also observe that the addition of our segmentation approach with five MCNNs in parallel (segMCNN) surpasses the SOTA results with both BioBERT and SciBERT as a base.

Architecture                           F-micro   σ
(Beltagy et al., 2019) SciBERT+MLP     83.64     –
BERT    + MLP                          79.28     1.21
        + MCNN                         72.18     1.73
        + segMCNN                      81.43     0.91
BioBERT + MLP                          82.28     3.27
        + MCNN                         83.90     1.11
        + segMCNN                      85.37
SciBERT + MLP (SOTA reproduction)      82.44     1.47
        + MCNN                         82.98     0.96
        + segMCNN                      84.77
TAB. 5 – Evaluation of the performances of various transfer learning architectures on ChemProt, using the fine-tuning strategy.

With PGxCorpus

Architecture                        Precision   Recall   F-macro   σ
(Legrand et al., 2020) MCNN         –           –        45.67     4.51
BERT    + MLP                       70.80       69.70    69.88     4.03
        + MCNN                      71.73       73.39    72.22
        + segMCNN                   74.31       74.53    74.17     1.08
BioBERT + MLP                       73.49       68.84    70.47     5.87
        + MCNN                      74.40       71.16    72.23     4.30
        + segMCNN                   77.38       77.24    77.00     2.56
SciBERT + MLP                       75.61       75.44    75.32     2.11
        + MCNN                      75.88       76.08    75.70     1.04
        + segMCNN                   78.44
TAB. 6 – Evaluation of the performances of various transfer learning architectures on PGxCorpus, using the fine-tuning strategy.
Results obtained with the fine-tuning strategy and PGxCorpus are presented in Table 6. The first line reports the SOTA results from (Legrand et al., 2020), obtained with a simple MCNN, without any BERT pre-trained model. As expected, we observe that all our (BERT-based) architectures exceed this baseline and are more stable. We observe that, for each variant of BERT, the *BERT+segMCNN model exceeds the performances of the others. In particular, we note that SciBERT+segMCNN lets us reach a new state-of-the-art performance on this corpus (78.44%, an absolute improvement of 32.77%). On this corpus, we observe a first improvement with the *BERT+MCNN architecture, and a second one with *BERT+segMCNN. We also note that, differently from the results obtained on ChemProt, the best BERT variant here is SciBERT (vs. BioBERT with ChemProt).

Experiments with the *BERT+MCNN architectures have been performed with both the frozen and fine-tuning strategies, and illustrate that the fine-tuning strategy produces much better performances.
We hypothesize that the BERT model may be reinforced with elements of structural information through sentence segmentation and by using the local information latent in its representation vectors. In this direction, we hope that our empirical approach gives some support to this hypothesis. In addition, our results illustrate the fact that the variant of BERT that one uses impacts final performances, and that this impact varies depending on the target corpus and the associated specific task. We also observe, as expected, that fine-tuning strategies, even if computationally demanding, generally produce better performances.

To conclude, we experimented with several BERT-based architectures and transfer strategies for the task of RE, with two biomedical corpora. We motivated our choice of architectures by the aim of leveraging several types of features: local features extracted by the MCNN, contextual features by the BiLSTM, and structural features coming from a sentence segmentation approach. We proposed different architectures adapted either to the frozen or to the fine-tuning strategy (*BERT+BiLSTM-MCNN L. and *BERT+segMCNN, respectively), leading to an improvement of the state-of-the-art performances for our specific tasks of biomedical RE. Even if our empirical contribution is limited to a subset of variants of BERT and to the specific task of biomedical RE, we think that our architectures (in particular *BERT+BiLSTM-MCNN L. and *BERT+segMCNN) may be suitable for other BERT variants, tasks and domains.
References
Beltagy, I., K. Lo, and A. Cohan (2019). SciBERT: A pretrained language model for scientific text. EMNLP.

Chen, Y., K. Wang, W. Yang, Y. Qin, R. Huang, and P. Chen (2020). A multi-channel deep neural network for relation extraction. IEEE Access 8, 13195–13203.

Collobert, R., J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. P. Kuksa (2011). Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12, 2493–2537.

Devlin, J., M.-W. Chang, K. Lee, and K. Toutanova (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL-HLT.

Hochreiter, S. and J. Schmidhuber (1997). Long short-term memory. Neural Comput. 9(8), 1735–1780.

Kringelum, J., S. K. Kjærulff, S. Brunak, O. Lund, T. I. Oprea, and O. Taboureau (2016). ChemProt-3.0: a global chemical biology diseases mapping. Database J. Biol. Databases Curation 2016.

Kumar, S. (2017). A survey of deep learning methods for relation extraction. CoRR abs/1705.03645.

Lee, J., W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, and J. Kang (2020). BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinform. 36(4), 1234–1240.

Legrand, J., R. Gogdemir, C. Bousquet, K. Dalleau, M. D. Devignes, W. Digan, C. J. Lee, N. C. Ndiaye, N. Petitpain, P. Ringot, M. Smaïl-Tabbone, Y. Toussaint, and A. Coulet (2020). PGxCorpus, a manually annotated corpus for pharmacogenomics. Scientific Data 7(1), 1–13.

Li, Z., J. Yang, X. Gou, and X. Qi (2019). Recurrent neural networks with segment attention and entity description for relation extraction from clinical texts. Artif. Intell. Medicine 97, 9–18.

Monnin, P., J. Legrand, G. Husson, P. Ringot, A. Tchechmedjiev, C. Jonquet, A. Napoli, and A. Coulet (2019). PGxO and PGxLOD: a reconciliation of pharmacogenomic knowledge of various provenances, enabling further comparison. BMC Bioinformatics 20(S4).

Pan, S. J. and Q. Yang (2010). A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22(10), 1345–1359.

Pawar, S., G. K. Palshikar, and P. Bhattacharyya (2017). Relation extraction: A survey. CoRR abs/1712.05191.

Peters, M. E., M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018). Deep contextualized word representations. NAACL-HLT, pp. 2227–2237.

Quan, C., L. Hua, X. Sun, and W. Bai (2016). Multichannel convolutional neural network for biological relation extraction. BioMed Research International.

Radford, A., K. Narasimhan, T. Salimans, and I. Sutskever (2018). Improving language understanding by generative pre-training. OpenAI, 1–10.

Shi, P. and J. Lin (2019). Simple BERT models for relation extraction and semantic role labeling. CoRR abs/1904.05255.