Memorization vs. Generalization: Quantifying Data Leakage in NLP Performance Evaluation
Aparna Elangovan, Jiayuan He, and Karin Verspoor
The University of Melbourne, Australia
RMIT University, Australia
[email protected]
{estrid.he, karin.verspoor}@unimelb.edu.au

Abstract
Public datasets are often used to evaluate the efficacy and generalizability of state-of-the-art methods for many tasks in natural language processing (NLP). However, the presence of overlap between the train and test datasets can lead to inflated results, inadvertently evaluating the model's ability to memorize and interpreting it as the ability to generalize. In addition, such data sets may not provide an effective indicator of the performance of these methods in real world scenarios. We identify leakage of training data into test data on several publicly available datasets used to evaluate NLP tasks, including named entity recognition and relation extraction, and study them to assess the impact of that leakage on the model's ability to memorize versus generalize.
1 Introduction

Shared tasks that provide publicly available datasets in order to evaluate and compare the performance of different methods on the same task and data are common in NLP. Held-out test sets are typically provided, enabling assessment of the generalizability of different methods to previously unseen data. These datasets have played a key role in driving progress in NLP, by defining focus tasks and by making annotated data available to the broader community, in particular in specialized domains such as biomedicine, where data can be difficult to obtain and quality data annotations require the detailed work of domain experts. Examples of tasks where benchmark data sets exist include open domain question answering (QA) (Berant et al., 2013; Joshi et al., 2017) and biomedical named entity recognition (Smith et al., 2008).

In the context of machine learning models, effectiveness is typically determined by the model's ability to both memorize and generalize (Chatterjee, 2018). A model that has a huge capacity to memorize will often work well in real world applications, particularly where large amounts of training data are available (Daelemans et al., 2005). The ability of a model to generalize relates to how well the model performs when applied to data that may differ from the data used to train the model, in terms of, e.g., the distribution of vocabulary or other relevant characteristics. The ability to memorize, taken to the extreme, can be considered equivalent to an exact match lookup table (Chatterjee, 2018), and the ability to generalize captures how well the model can deal with degrees of variation from the lookup table. An effective combination of memorization and generalization can be achieved where a model selectively memorizes only those aspects or features that matter in solving a target objective given an input, allowing it to generalize better and to be less susceptible to noise.

When there is considerable overlap in the training and test data for a task, models that memorize more effectively than they generalize may benefit from the structure of the evaluation data, with their performance inflated relative to models that are more robust in generalization. However, such models may make poor quality predictions outside of the shared task setting. The external validity of these evaluations can therefore be questioned (Ferro et al., 2018).

In this paper, we assess the overlap between the train and test data in publicly available datasets for Named Entity Recognition (NER), Relation Extraction (REL) and Text Classification (CLS) tasks, including the SST2 (Socher et al., 2013), BioCreative (Smith et al., 2008; Arighi et al., 2011) and AIMed (Bunescu et al., 2005) datasets, and examine the significant impact of not taking this overlap into account on performance evaluation.

We argue that robustness in generalization to unseen data is a key consideration of the performance of a model, and propose a framework to examine inadvertent leakage of data between dataset splits, in order to enable more controlled assessment of the memorization vs. generalization characteristics of different methods.
2 Related Work

The issue of memorization vs. generalization has been previously discussed in the context of question answering datasets, where, given only a question, a system must output the best answer it can find in available texts.

Lewis et al. (2020) identify 3 distinct issues for open domain QA evaluation: a) question memorization – recall the answer to a question that the model has seen at training time; b) answer memorization – answer novel questions at test time, where the model has seen the answer during training; and c) generalization – question and answer not seen during training time. They find that 58-71% of test answers occur in the training data in the 3 examined data sets, concluding that the majority of the test data does not assess answer generalization. They also find that 28-34% of test questions have paraphrased questions in the training data, and that a majority of these questions are duplicates differing only by a few words.

Similarly, Min (2020) identified repeating forms in QA test sets as a problem. The work proposed a novel template-based approach to splitting questions into paraphrase groups referred to as "Templates", and then controlling train/test data splits to ensure that all questions conforming to a given template appear in one segment of the data only. This was tested on the EMR Clinical Question Answering dataset emrQA (Pampari et al., 2018) and the Overnight dataset (Wang et al., 2015); it was demonstrated that models perform significantly worse on test sets where strict division is enforced. This paraphrase-based splitting methodology was also employed in their recent work on emrQA (Rawat et al., 2020).
3 Types of Train/Test Overlap

A common practice to create a train and test set is to shuffle the data instances in a dataset and generate random splits, without taking into account broader context. However, this can inadvertently lead to data leakage from the train set to the test set due to overlaps between similar train and test instances.

The type of overlap between train and test dataset depends on the type of the NLP task. Generally speaking, the leakage can occur either in the input texts or in the annotated outputs. We define the types of overlap which may occur in several NLP tasks as follows.

• In text classification (CLS) tasks such as sentiment analysis, overall (document-level) similarity in input texts can result in train/test leakage.

• In named entity recognition (NER) tasks, leakage from train to test data may occur when a) input sentences or passages are similar, or b) target entities are similar.

• In relation extraction (REL) tasks, leakage may occur when a) input sentences or passages are similar, or b) participating entities are similar.

We propose a framework for quantifying train-test overlaps, and conduct experiments to show the impact of train-test overlap on model performance. Next, we discuss the proposed framework in Sec. 4.2 and the experimental settings in Sec. 4.3. We present our findings, including the train-test overlaps in several benchmark datasets in Sec. 5.1 and the impact of data leakage in Sec. 5.2.

Algorithm 1: Compute overlap

procedure COMPARE(testset, trainset)
    totalscore ← 0
    n ← |testset|
    for test_i in testset do
        s ← BESTMATCH(test_i, trainset)
        totalscore ← totalscore + s
    end for
    return totalscore / n          ⊲ Average score
end procedure

procedure BESTMATCH(test_i, trainset)
    bestscore ← 0
    for train_j in trainset do
        s ← SIMILARITY(test_i, train_j)
        if s > bestscore then
            bestscore ← s
        end if
    end for
    return bestscore
end procedure
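Algorithm 1 maps directly onto a few lines of Python. The following is a minimal sketch, not the authors' released code: it assumes scikit-learn, using CountVectorizer for the stopword-filtered n-gram count vectors and cosine_similarity for the SIMILARITY function described in Sec. 4.2, and computes BESTMATCH for all test instances at once from the full similarity matrix.

# Minimal sketch of Algorithm 1 (assumes scikit-learn; not the authors' code).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def train_test_overlap(test_texts, train_texts, n=1):
    """COMPARE: average, over test instances, of the best cosine
    similarity against any single training instance."""
    # Count vectors of n-grams, ignoring English stopwords; fit on the
    # union so train and test share one vocabulary.
    vectorizer = CountVectorizer(ngram_range=(n, n), stop_words="english")
    vectorizer.fit(list(train_texts) + list(test_texts))
    train_vecs = vectorizer.transform(train_texts)
    test_vecs = vectorizer.transform(test_texts)

    # sim[i, j] = cosine similarity of test instance i and train instance j.
    sim = cosine_similarity(test_vecs, train_vecs)

    # BESTMATCH for every test instance, then the average over the test set.
    return sim.max(axis=1).mean()

train = ["good movie .", "an excellent ensemble cast"]
test = ["it 's still not a good movie ."]
print(train_test_overlap(test, train, n=1))  # 1.0, cf. the SST2 row of Table 1

This mirrors the 100.0 SST2 example in Table 1: once stopwords are removed, the test sentence reduces to the same unigram counts as a training sentence.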
4 Experimental Setup

4.1 Datasets

We examine overlap in the following datasets:

• AIMed – the AIMed dataset (Bunescu et al., 2005) for protein relation extraction (REL).

• BC2GM – the BioCreative II gene mention task dataset (Smith et al., 2008), an NER task.

• ChEMU – Chemical Reactions from Patents (He et al., 2020) for recognising names of chemicals, an NER task.

• BC3ACT – BioCreative III protein interaction classification (CLS) (Arighi et al., 2011).

• SST2 – the Stanford Sentiment Analysis Treebank (Socher et al., 2013), used to classify sentiments (CLS) in GLUE (Wang et al., 2018).

The AIMed dataset does not explicitly provide a test set, and 10-fold cross validation is used for evaluation in previous works (Hsieh et al., 2017; Zhang et al., 2019). In this paper, we use two types of splits of AIMed to evaluate the impact of data leakage: AIMed (R), which Randomly splits the dataset into 10 folds; and AIMed (U), which splits the dataset into 10 folds such that the documents within each resulting split are Unique (according to the document ID) with respect to the other splits. The document ID refers to the source document of a data instance, and data instances from the same source document have the same document ID (see the example in Appendix A).

Task  Dataset    Score  Split  Example
REL   AIMed (R)  100.0  Train  Thus, during PROTEIN1-mediated suppression of cell proliferation, PROTEIN and PROTEIN2 may be important for coordinating cell-cycle progression, DNA replication and repair of damaged DNA.
                        Test   Thus, during PROTEIN-mediated suppression of cell proliferation, PROTEIN1 and PROTEIN2 may be important for coordinating cell-cycle progression, DNA replication and repair of damaged DNA.
NER   BC2GM      100.0  Train  E2F family members
                        Test   E2F family members (1-5)
CLS   SST2       100.0  Train  good movie .
                        Test   it 's still not a good movie .
CLS   SST2       21.8   Train  herzog is obviously looking for a moral to his fable , but the notion that a strong , unified showing among germany and eastern european jews might have changed 20th-century history is undermined by ahola 's inadequate performance .
                        Test   of the unsung heroes of 20th century

Table 1: Examples of train-test matches and the corresponding unigram similarity score.

4.2 Measuring Train/Test Overlap

The pseudocode for measuring similarity is shown in Algorithm 1. Given a test instance test_i, we compute its similarity with the training set as the similarity with the training instance that is most similar to test_i. We then use the average similarity over all the test instances as an indicator of the extent of train/test overlap. The function SIMILARITY(·) can be any text similarity function. In this paper, we use a simple bag-of-words approach to compute text similarity. We represent each train/test instance with a count vector of unigrams/bigrams/trigrams, ignoring stopwords, and compute the similarity using the cosine similarity.

4.3 Measuring the Impact of Overlap

We assess the impact of data leakage on a machine learning model's performance. We split the test sets of BC2GM, ChEMU, BC3ACT and SST2 into four intervals considering four similarity threshold ranges (in terms of unigrams): [0-0.25), [0.25-0.50), [0.50-0.75), and [0.75-1.0]. For example, the test instances in the first interval are the most different from the training set, with a similarity of less than 0.25. This method allows full control of the similarity of instances within each interval, but results in a different number of instances in each interval. Thus, we consider another scenario where we split the test set into 4 quartiles based on similarity ranking, so that the number of samples remains the same in each quartile but the similarity threshold varies as a result. A sketch of both splitting schemes is given at the end of this section.

We finetune a BERT (base and cased) model (Devlin et al., 2019) for each dataset using its own training set, and compare the performance of the finetuned BERT model on the four different test intervals and test quartiles.

We compare the performance of AIMed (R) with AIMed (U) using 3 different models: the Zhang et al. (2019) convolutional residual network, the Hsieh et al. (2017) Bi-LSTM, and BioBERT (Lee et al., 2019). Following previous works, we preprocess the dataset and replace all non-participating proteins with the neutral name PROTEIN, and the participating entity pair with PROTEIN1 and PROTEIN2, so the model only ever sees pseudo protein names.
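The two test-set stratification schemes above also reduce to a few lines over the per-instance best-match scores (the row-wise maxima of the similarity matrix in the earlier sketch). A minimal sketch, assuming numpy and scores scaled to [0, 1]; the helper names are ours, for illustration only.

# Minimal sketch of the interval and quartile splits of Sec. 4.3
# (illustrative names, not the paper's code).
import numpy as np

def split_by_interval(scores, bounds=(0.25, 0.50, 0.75)):
    """Fixed thresholds: 1I = [0, 0.25), 2I = [0.25, 0.5),
    3I = [0.5, 0.75), 4I = [0.75, 1.0]; interval sizes vary."""
    edges = [0.0, *bounds, 1.0000001]  # nudge so a score of 1.0 falls in 4I
    return [np.flatnonzero((scores >= lo) & (scores < hi))
            for lo, hi in zip(edges, edges[1:])]

def split_by_quartile(scores):
    """Equal-sized quartiles by similarity rank; thresholds vary instead."""
    return np.array_split(np.argsort(scores), 4)

scores = np.array([0.10, 0.30, 0.55, 0.80, 1.00, 0.20, 0.61, 0.05])
print([ix.tolist() for ix in split_by_interval(scores)])  # varying sizes
print([q.tolist() for q in split_by_quartile(scores)])    # equal sizes

Evaluating the finetuned model separately on each interval (Table 4) or quartile (Table 5) then separates performance on near-duplicates of the training data from performance on genuinely unseen inputs.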
5 Results

5.1 Train/Test Overlap

Examples of similar train and test instances are shown in Table 1. The overall results of the train-test similarities of all datasets are shown in Table 2. In the BC2GM dataset, we find that there is 70% overlap between gene names in the train and test sets. On further analysis, we find that 2,308 out of 6,331 genes in the test set have exact matches in the train set. In the AIMed (R) dataset, we can see that there is over 73% overlap between the train and test sets, even when measured with trigrams.

Dataset    Task  uni  bi  tri
AIMed (R)  REL
AIMed (U)  REL

Table 2: Train-test similarity using unigrams (uni), bigrams (bi) and trigrams (tri). BC2GM and ChEMU are in BRAT standoff format and their similarities are shown for their text files ("text") and annotation files ("ann"). Similarities above 60.0 are highlighted in bold.

Split type  Method   P     R     F1
O           BiLSTM   78.8  75.2  76.9
O           ConvRes  79.0  76.8  77.6
Replicated experiments:
R           BiLSTM   74.5  69.7  71.7
U           BiLSTM
R           ConvRes  71.1  69.2  69.9
U           ConvRes
R           BioBERT  79.8  76.7  77.9
U           BioBERT

Table 3: Performance on AIMed (R) and AIMed (U). The split type (O) indicates the original results reported by the authors of each model.
5.2 Impact of Train/Test Overlap

We observe drops in F-score of more than 10 points between AIMed (R) and AIMed (U) across all three models, as shown in Table 3. This is in line with the similarity measurements in Table 2: the train-test similarity drops significantly from AIMed (R) to AIMed (U), since in AIMed (U) we only allow unique document IDs in different folds.

On the ChEMU NER dataset we observe nearly a 10-point drop in F-score (from 96.7) as the test instances become less similar to the training set. For BC2GM, as shown in Table 4, we note that (a) the interval 4I contains only 0.9% (47/5000) of the test instances, and (b) there is a significant drop in recall (90.6 → 77.5) from interval 3I to interval 4I.

6 Discussion

The bag-of-words approach to computing cosine similarity detects simple forms of overlap effectively, as shown in Table 2. A trend that can be seen is that overlap is more common in tasks that are manual-labour intensive, such as named entity recognition and relation extraction, compared to text classification.
D    SR  %      P     R     F1    A
BC2  F   100.0  77.5  86.4  81.7
BC2  1I  19.8   68.8  81.1  74.5
BC2  2I  74.1   78.3  86.9  82.4
BC2  3I  5.1    83.8  90.6  87.1
BC2  4I  0.9    79.5  77.5  78.5
ChE  F   100.0  93.8  94.4  94.1
ChE  1I  0.0    -     -     -
ChE  2I  10.0   84.6  86.6  85.6
ChE  3I
ChE  4I
BC3  F   100.0  45.1  84.1  58.7  82.1
BC3  1I  47.0   43.0  82.0  56.4  85.8
BC3  2I  51.0   46.0  85.6  59.9  78.8
BC3  3I  2.0    53.5  76.7  63.0  77.5
BC3  4I  0.0    0.0   0.0   0.0   0.0
SST  F   100.0  90.4  96.7  93.4  93.2
SST  1I  1.1    60.0  75.0  66.7  85.0
SST  2I  66.8   91.6  96.0  93.8  93.4
SST  3I  28.7   87.1  98.7  92.5  92.7
SST  4I  3.5    96.9  96.9  96.9  96.8

Table 4: Performance at various similarity thresholds and the corresponding percentage of test instances within each interval. Datasets (D): BC2 → BC2GM, ChE → ChEMU, BC3 → BC3ACT, SST → SST2. The similarity threshold ranges (SR) are [0, 0.25) = 1I, [0.25, 0.5) = 2I, [0.5, 0.75) = 3I, [0.75, 1] = 4I, and [0, 1] = F. Accuracy (A) is the official metric for the SST2 dataset according to the GLUE benchmark; all others use F1-score (F1) as the primary metric.
D    Q  Min   Max    P     R     F1    A
BC2  1  0.0   26.3   69.8  82.0  75.4
BC2  2  26.3  31.6   74.5  85.9  79.8
BC2  3  31.6  38.3   78.3  86.4  82.1
BC2  4  38.3  100.0  83.0  88.9  85.8
ChE  1  37.9  56.7   90.8  91.8  91.3
ChE  2  56.8  68.2   93.3  94.4  93.8
ChE  3  68.2  78.5   95.1  96.1  95.6
ChE  4  78.6  99.8   97.1  97.4  97.2
BC3  1  6.3   20.1   44.5  81.4  57.5
BC3  2
BC3  3
BC3  4
SST  1
SST  2  36.5  43.6   91.3  96.2  93.7  93.2
SST  3  43.6  53.5   91.2  97.3  94.2
SST  4  53.5  100.0  88.0  98.1  92.8  92.9
Table 5: Performance on four different test quartiles, where the number of samples in each quartile (Q) is kept the same. The minimum (Min) and maximum (Max) similarity within each quartile are also reported.

However, this approach may detect similarity even when the meanings are different, especially in the case of classification tasks, as shown for SST2 in Table 1. Semantic Text Similarity (STS) measurement is a challenging task in its own right, with a large body of literature and a number of shared tasks organized to address it (Cer et al., 2017; Wang et al., 2020; Karimi et al., 2015). More sophisticated methods for similarity measurement developed in these contexts could be incorporated into the framework for measuring the similarity of dataset splits; for simple leakage detection, however, the bag-of-words approach is arguably adequate. Sophisticated methods can also potentially lead to a chicken-and-egg problem if we use a machine learning model to compute semantic similarity.

The question of what level of similarity is acceptable is highly data- and task-dependent. If the training data has good volume and variety, the train-test similarity will naturally be higher, and so will the acceptable similarity.
We find that F-scores tend to be higher when the test set input text is similar to the training set, as shown in Tables 3 and 4. While this might be apparent, quantifying similarity in the test set helps us understand that high scores on the test set could be a result of similarity to the train set, and therefore measure memorization rather than a model's ability to generalize. If a model is trained on a sufficient volume and variety of data, then it may not matter in a real world context whether it memorizes or generalizes, and a model's ability to memorize is not necessarily a disadvantage. However, in the setting of a shared task, we often do not have access to sufficiently large training data sets, and hence it is important to consider test/train similarity when evaluating models: otherwise, in real world scenarios, the model may perform poorly when it encounters data not seen during training.
7 Conclusion

We conclude that quantifying train/test overlap is crucial to assessing the real world applicability of machine learning in NLP tasks, given our reliance on annotated data for training and testing in the NLP community. A single metric over a held-out test set is not sufficient to infer the generalizability of a model. Stratification of test sets by similarity enables more robust assessment of the memorization vs. generalization capabilities of models. Further development of approaches to structured consideration of model performance under different assumptions will improve our understanding of these tradeoffs.
References
Cecilia N Arighi, Phoebe M Roberts, Shashank Agarwal, Sanmitra Bhattacharya, Gianni Cesareni, Andrew Chatr-Aryamontri, Simon Clematide, Pascale Gaudet, Michelle Gwinn Giglio, Ian Harrow, et al. 2011. BioCreative III interactive task: an overview. BMC Bioinformatics, 12(S8):S4.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533–1544. Association for Computational Linguistics.

Razvan Bunescu, Ruifang Ge, Rohit J Kate, Edward M Marcotte, Raymond J Mooney, Arun K Ramani, and Yuk Wah Wong. 2005. Comparative experiments on learning information extractors for proteins and their interactions. Artificial Intelligence in Medicine, 33(2):139–155.

Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 Task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, Vancouver, Canada. Association for Computational Linguistics.

Satrajit Chatterjee. 2018. Learning and memorization. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 755–763, Stockholmsmässan, Stockholm, Sweden. PMLR.

Walter Daelemans, Antal Van den Bosch, et al. 2005. Memory-based language processing. Cambridge University Press.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Nicola Ferro, Norbert Fuhr, Gregory Grefenstette, Joseph A Konstan, Pablo Castells, Elizabeth M Daly, Thierry Declerck, Michael D Ekstrand, Werner Geyer, Julio Gonzalo, et al. 2018. The Dagstuhl perspectives workshop on performance modeling and prediction. In ACM SIGIR Forum, volume 52, pages 91–101. ACM, New York, NY, USA.

Jiayuan He, Dat Quoc Nguyen, Saber A. Akhondi, Christian Druckenbrodt, Camilo Thorne, Ralph Hoessel, Zubair Afzal, Zenan Zhai, Biaoyan Fang, Hiyori Yoshikawa, Ameer Albahem, Lawrence Cavedon, Trevor Cohn, Timothy Baldwin, and Karin Verspoor. 2020. Overview of ChEMU 2020: Named entity recognition and event extraction of chemical reactions from patents. In Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Eleventh International Conference of the CLEF Association (CLEF 2020), volume 12260. Lecture Notes in Computer Science.

Yu-Lun Hsieh, Yung-Chun Chang, Nai-Wen Chang, and Wen-Lian Hsu. 2017. Identifying protein-protein interactions in biomedical literature using recurrent neural networks with long short-term memory. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 240–245, Taipei, Taiwan. Asian Federation of Natural Language Processing.

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada. Association for Computational Linguistics.

Sarvnaz Karimi, Jie Yin, and Jiri Baum. 2015. Evaluation methods for statistically dependent text. Computational Linguistics, 41(3):539–548.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2019. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240.

Patrick Lewis, Pontus Stenetorp, and Sebastian Riedel. 2020. Question and answer test-train overlap in open-domain question answering datasets.

So Yeon Min. 2020. Towards knowledge-based, robust question answering.

Anusri Pampari, Preethi Raghavan, Jennifer Liang, and Jian Peng. 2018. emrQA: A large corpus for question answering on electronic medical records. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2357–2368, Brussels, Belgium. Association for Computational Linguistics.

Bhanu Pratap Singh Rawat, Wei-Hung Weng, So Yeon Min, Preethi Raghavan, and Peter Szolovits. 2020. Entity-enriched neural models for clinical question answering. In Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing, pages 112–122, Online. Association for Computational Linguistics.

L Smith, LK Tanabe, RJ Ando, CJ Kuo, IF Chung, CN Hsu, YS Lin, R Klinger, CM Friedrich, K Ganchev, et al. 2008. Overview of BioCreative II gene mention recognition. Genome Biology, 9(S2):S2.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA. Association for Computational Linguistics.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.

Yanshan Wang, Naveed Afzal, Sunyang Fu, Liwei Wang, Feichen Shen, Majid Rastegar-Mojarad, and Hongfang Liu. 2020. MedSTS: a resource for clinical semantic textual similarity. Language Resources and Evaluation, 54(1):57–72.

Yushi Wang, Jonathan Berant, and Percy Liang. 2015. Building a semantic parser overnight. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1332–1342, Beijing, China. Association for Computational Linguistics.

H. Zhang, R. Guan, F. Zhou, Y. Liang, Z. Zhan, L. Huang, and X. Feng. 2019. Deep residual convolutional neural network for protein-protein interaction extraction. IEEE Access, 7:89354–89365.
A AIMed document examples
The following example shows how multiple data instances are extracted from a single document in the AIMed dataset. The document with ID "AIMed.d0" has several instances, including "AIMed.d0.s0" and "AIMed.d0.s1". These instances thus have the same document ID.
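A document-unique split in the style of AIMed (U) can be reproduced with a group-aware fold assignment. The following is a minimal sketch of one standard way to do this (not necessarily the authors' implementation), assuming instance IDs of the form shown above, where the document ID is the prefix before the sentence suffix.

# Minimal sketch of an AIMed (U)-style split (one standard approach,
# not necessarily the authors' implementation).
from sklearn.model_selection import GroupKFold

instance_ids = ["AIMed.d0.s0", "AIMed.d0.s1", "AIMed.d1.s0",
                "AIMed.d1.s1", "AIMed.d2.s0", "AIMed.d2.s1"]
# Document ID = the prefix before the sentence suffix, e.g. "AIMed.d0".
doc_ids = [".".join(iid.split(".")[:2]) for iid in instance_ids]

# GroupKFold keeps all instances sharing a document ID in the same fold, so
# no source document straddles a train/test boundary. The paper uses 10
# folds; 3 suffice for this toy example.
for train_ix, test_ix in GroupKFold(n_splits=3).split(instance_ids, groups=doc_ids):
    train_docs = {doc_ids[i] for i in train_ix}
    test_docs = {doc_ids[i] for i in test_ix}
    assert not train_docs & test_docs  # folds are document-unique
    print(sorted(test_docs))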
B Class-wise similarity for BC3ACT
The test set has 5090 negative samples compared to 910 positive samples, with a 2.96 points higher mean similarity in the positive samples.
Test label  Statistic  Unigram  Bigram   Trigram
0           count      5090.00  5090.00  5090.00
            mean       26.31    6.70     1.73
            std        9.25     5.35     1.72
            min        6.28     0.00     0.00
            25%        19.70    3.29     0.79
            50%        25.16    5.07     1.39
            75%        31.53    8.29     2.27
            max        75.01    41.75    18.71
1           count      910.00   910.00   910.00
            mean       29.27    8.09     2.26
            std        9.36     6.00     1.73
            min        11.14    1.52     0.00
            25%        22.69    4.51     1.17
            50%        28.31    6.25     1.88
            75%        34.32    9.38     2.84
            max        74.01    51.20    18.97
Table 6: Class-wise similarity for BC3ACT dataset
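The layout of Table 6 (count, mean, std, min, quartiles, max per class) matches a grouped pandas describe(). A minimal sketch with toy values, assuming the per-instance best-match similarities (scaled to 0-100) are held in a data frame alongside the test label:

# Minimal sketch reproducing Table 6's layout with toy values (assumes pandas).
import pandas as pd

df = pd.DataFrame({
    "label":   [0, 0, 0, 0, 1, 1, 1],
    "unigram": [26.3, 19.7, 31.5, 25.2, 29.3, 22.7, 34.3],
    "bigram":  [6.7, 3.3, 8.3, 5.1, 8.1, 4.5, 9.4],
    "trigram": [1.7, 0.8, 2.3, 1.4, 2.3, 1.2, 2.8],
})
# count/mean/std/min/25%/50%/75%/max per class, as in Table 6.
print(df.groupby("label").describe().T)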
C BERT and similarity thresholds
Table 7 shows the impact of different similarity thresholds on precision, recall and F-score for the BC2GM test set, which has approximately 6,300 annotations. We also compare the recall when the target annotations are similar, as shown in Table 8. Here we only compare unigrams, as the number of tokens in a gene name tends to be small (on average fewer than 3). Table 9 shows BERT's performance using bigrams and trigrams on the SST2 and BC3ACT datasets.

Dataset  N  SR  %      P     R      F1
BC2GM    -  F   100.0  77.5  86.4   81.7
BC2GM    1  1I  19.8   68.8  81.1   74.5
BC2GM    1  2I  74.1   78.2  86.9   82.3
BC2GM    1  3I  5.1    83.8  90.6   87.1
BC2GM    1  4I  1.0    79.5  77.5   78.5
BC2GM    2  1I  91.7   76.9  86.3   81.4
BC2GM    2  2I  7.5    82.5  88.1   85.2
BC2GM    2  3I  0.3    1.0   1.0    1.0
BC2GM    2  4I  0.5    78.9  76.9   77.9
BC2GM    3  1I  98.5   77.4  86.4   81.7
BC2GM    3  2I  0.9    85.2  88.5   86.8
BC2GM    3  3I  0.1    50.0  100.0  66.7
BC2GM    3  4I  0.5    80.6  76.3   78.4

Table 7: NER performance of BERT at various similarity threshold ranges (SR) and the corresponding percentage of instances, when the similarity is computed using N-grams (N = 1, 2 and 3) over the input text. The ranges are [0, 0.25) = 1I, [0.25, 0.5) = 2I, [0.5, 0.75) = 3I, [0.75, 1] = 4I, and [0, 1] = F.

Dataset       N  SR  %      Recall
BC2GM (anno)  -  F   100.0  86.4
BC2GM (anno)  1  1I  16.7   84.5
BC2GM (anno)  1  2I  5.6    81.8
BC2GM (anno)  1  3I  24.7   85.6
BC2GM (anno)  1  4I  53.0   87.8

Table 8: NER recall of BERT at various similarity threshold ranges (SR) and the corresponding % of samples, using n-grams with N = 1 over the output annotated gene mentions.
Dataset  N  SR  %      P      R      F1     A
BC3ACT   -  F   100.0  45.1   84.1   58.7   82.1
BC3ACT   1  1I  47.0   43.0   82.0   56.4   85.8
BC3ACT   1  2I  51.0   46.0   85.6   59.9   78.8
BC3ACT   1  3I  2.0    53.5   76.7   63.0   77.5
BC3ACT   1  4I  0.0    0.0    0.0    0.0    0.0
BC3ACT   2  1I  98.2   45.0   84.1   58.6   82.2
BC3ACT   2  2I  1.8    48.8   83.3   61.5   76.6
BC3ACT   2  3I  0.0    100.0  100.0  100.0  100.0
BC3ACT   2  4I  0.0    0.0    0.0    0.0    -
BC3ACT   3  1I  100.0  45.1   84.1   58.7   82.1
BC3ACT   3  2I  0.0    0.0    0.0    0.0    -
BC3ACT   3  3I  0.0    0.0    0.0    0.0    -
BC3ACT   3  4I  0.0    0.0    0.0    0.0    -
SST2     -  F   100.0  90.4   96.7   93.4   93.2
SST2     1  1I  1.1    60.0   75.0   66.7   85.0
SST2     1  2I  66.8   91.6   96.0   93.8   93.4
SST2     1  3I  28.7   87.1   98.7   92.5   92.7
SST2     1  4I  3.5    96.9   96.9   96.9   96.8
SST2     2  1I  64.0   88.6   96.0   92.2   92.3
SST2     2  2I  30.8   93.1   97.3   95.1   94.8
SST2     2  3I  4.8    93.1   100.0  96.4   95.4
SST2     2  4I  0.4    100.0  100.0  100.0  100.0
SST2     3  1I  97.6   90.5   96.6   93.4   93.3
SST2     3  2I  1.9    82.6   100.0  90.5   88.2
SST2     3  3I  0.5    100.0  100.0  100.0  100.0
SST2     3  4I  0.0    0.0    0.0    0.0    -

Table 9: SST2 and BC3ACT performance at similarity thresholds using n-grams with N = 1, 2 and 3. The ranges are [0, 0.25) = 1I, [0.25, 0.5) = 2I, [0.5, 0.75) = 3I, [0.75, 1] = 4I, and [0, 1] = F.

D High similarity BC2GM samples
Table 10 shows the samples with over 75% similarity in the BC2GM dataset. The samples that caused the drop in recall are shown in Table 11.
Score   Test | Train
76.45   Test: Histological and immunophenotypic studies revealed 12 large cell lymphomas (11 B cell and one T cell), two small noncleaved cell lymphomas (B-cell phenotype), and five low grade B-cell lymphomas (two small lymphocytic and three follicular mixed lymphomas).  Train: The cases included 35 de novo diffuse aggressive lymphomas (DAL; 19 large-cell, 4 mixed-cell, and 12 large-cell immunoblastic), 52 transformed aggressive lymphomas derived from follicular lymphomas (TFL), 42 indolent follicular lymphomas (FL), 14 mantle cell lymphomas (MCL), and 27 small noncleaved cell lymphomas (SNCL).
77.46   Test: 98, 93-98).  Train: 356, 93-98].
81.65   Test: Free protein S deficiency in acute ischemic stroke.  Train: Ischemic stroke due to protein C deficiency.
83.41   Test: In stage I, histochemistry for copper was positive in 11 out of 21 cases: 6 cases were T+; 1 case R+ and 2 cases O+; 2 cases were T+, R+, O+.  Train: 3 cases
86.60   Test: STUDY DESIGN: Retrospective review.  Train: DESIGN: Retrospective study.
86.60   Test: Non-dialyzable transfer factor  Train: Dialyzable transfer factor.
100.00  Test: Recently we have performed a detailed analysis of specific neuronal populations affected by the mutation which shed new light on the role of Krox-20 in the segmentation and on the physiological consequences of its inactivation.  Train: Recently we have performed a detailed analysis of specific neuronal populations affected by the mutation which shed new light on the role of Krox-20 in the segmentation and on the physiological consequences of its inactivation.
100.00  Test: Slowly adapting type I mechanoreceptor discharge as a function of dynamic force versus dynamic displacement of glabrous skin of raccoon and squirrel monkey hand.  Train: Slowly adapting type I mechanoreceptor discharge as a function of dynamic force versus dynamic displacement of glabrous skin of raccoon and squirrel monkey hand.
100.00  Test: The recruitment of constitutively phosphorylated p185(neu) and the activated mitogenic pathway proteins to this membrane-microfilament interaction site provides a physical model for integrating the assembly of the mitogenic pathway with the transmission of growth factor signal to the cytoskeleton.  Train: The recruitment of constitutively phosphorylated p185(neu) and the activated mitogenic pathway proteins to this membrane-microfilament interaction site provides a physical model for integrating the assembly of the mitogenic pathway with the transmission of growth factor signal to the cytoskeleton.
100.00  Test: A heterologous promoter construct containing three repeats of a consensus Sp1 site, cloned upstream of a single copy of the ZII (CREB/AP1) element from the BZLF1 promoter linked to the beta-globin TATA box, exhibited phorbol ester inducibility.  Train: A heterologous promoter construct containing three repeats of a consensus Sp1 site, cloned upstream of a single copy of the ZII (CREB/AP1) element from the BZLF1 promoter linked to the beta-globin TATA box, exhibited phorbol ester inducibility.
100.00  Test: The reconstituted RNA polymerases containing the mutant alpha subunits were examined for their response to transcription activation by cAMP-CRP and the rrnBP1 UP element.  Train: The reconstituted RNA polymerases containing the mutant alpha subunits were examined for their response to transcription activation by cAMP-CRP and the rrnBP1 UP element.
100.00  Test: Analysis of 1 Mb of published sequence from the region of conserved synteny on human chromosome 5q31-q33 identified 45 gene candidates, including 35 expressed genes in the human IL-4 cytokine gene cluster.  Train: Analysis of 1 Mb of published sequence from the region of conserved synteny on human chromosome 5q31-q33 identified 45 gene candidates, including 35 expressed genes in the human IL-4 cytokine gene cluster.
100.00  Test: Although RAD17, RAD24 and MEC3 are not required for cell cycle arrest when S phase is inhibited by hydroxyurea (HU), they do contribute to the viability of yeast cells grown in the presence of HU, possibly because they are required for the repair of HU-induced DNA damage.  Train: Although RAD17, RAD24 and MEC3 are not required for cell cycle arrest when S phase is inhibited by hydroxyurea (HU), they do contribute to the viability of yeast cells grown in the presence of HU, possibly because they are required for the repair of HU-induced DNA damage.
100.00  Test: The promoter for HMG-CoA synthase contains two binding sites for the sterol regulatory element-binding proteins (SREBPs).  Train: The promoter for HMG-CoA synthase contains two binding sites for the sterol regulatory element-binding proteins (SREBPs).
100.00  Test: Coronary vasoconstriction caused by endothelin-1 is enhanced by ischemia-reperfusion and by norepinephrine present in concentrations typically observed after neonatal cardiopulmonary bypass.  Train: Coronary vasoconstriction caused by endothelin-1 is enhanced by ischemia-reperfusion and by norepinephrine present in concentrations typically observed after neonatal cardiopulmonary bypass.
100.00  Test: (LH P < 0.05, LH/FSH P < 0.01).  Train: (LH P < 0.05, LH/FSH P < 0.01).
100.00  Test: Determinants of recurrent ischaemia and revascularisation procedures after thrombolysis with recombinant tissue plasminogen activator in primary coronary occlusion.  Train: Determinants of recurrent ischaemia and revascularisation procedures after thrombolysis with recombinant tissue plasminogen activator in primary coronary occlusion.
100.00  Test: The human SHBG proximal promoter was analyzed by DNase I footprinting, and the functional significance of 6 footprinted regions (FP1-FP6) within the proximal promoter was studied in human HepG2 hepatoblastoma cells.  Train: The human SHBG proximal promoter was analyzed by DNase I footprinting, and the functional significance of 6 footprinted regions (FP1-FP6) within the proximal promoter was studied in human HepG2 hepatoblastoma cells.
100.00  Test: Biol.  Train: Biol.
100.00  Test: Copyright 1999 Academic Press.  Train: Copyright 1999 Academic Press.
100.00  Test: These results demonstrate a specific association of SIV and HIV-2 nef, but not HIV-1 nef, with TCRzeta.  Train: These results demonstrate a specific association of SIV and HIV-2 nef, but not HIV-1 nef, with TCRzeta.
100.00  Test: Urease activity, judged as the amount of ammonia production from urea, could be measured at 25 ng per tube (S/N = 1.5) with Jack bean meal urease.  Train: Urease activity, judged as the amount of ammonia production from urea, could be measured at 25 ng per tube (S/N = 1.5) with Jack bean meal urease.
100.00  Test: Copyright 1999 Academic Press.  Train: Copyright 1999 Academic Press.
100.00  Test: IV.  Train: IV.
100.00  Test: Copyright 1998 Academic Press.  Train: Copyright 1998 Academic Press.
100.00  Test: IV.  Train: IV.
100.00  Test: Biol.  Train: Biol.
100.00  Test: Copyright 1999 Academic Press.  Train: Copyright 1999 Academic Press.
100.00  Test: Copyright 1998 Academic Press.  Train: Copyright 1998 Academic Press.
100.00  Test: Copyright 2000 Academic Press.  Train: Copyright 2000 Academic Press.
100.00  Test: 1988).  Train: (1988) J.
100.00  Test: Biol.  Train: Biol.
100.00  Test: Acad.  Train: Acad.
100.00  Test: Virol.  Train: Virol.
100.00  Test: 1995.  Train: (1995) J.
100.00  Test: Natl.  Train: Natl.
100.00  Test: Copyright 1999 Academic Press.  Train: Copyright 1999 Academic Press.
100.00  Test: The activated glucocorticoid receptor forms a complex with Stat5 and enhances Stat5-mediated transcriptional induction.  Train: The activated glucocorticoid receptor forms a complex with Stat5 and enhances Stat5-mediated transcriptional induction.
100.00  Test: Copyright 1999 Academic Press.  Train: Copyright 1999 Academic Press.
100.00  Test: Chem.  Train: Chem.
100.00  Test: Appl.  Train: Appl.
100.00  Test: Copyright 1998 Academic Press.  Train: Copyright 1998 Academic Press.
100.00  Test: Sci.  Train: Sci.
100.00  Test: (1992) J.  Train: (1992) J.
100.00  Test: Acad.  Train: Acad.
100.00  Test: Mutational analysis of yeast CEG1 demonstrated that four of the five conserved motifs are essential for capping enzyme function in vivo.  Train: Mutational analysis of yeast CEG1 demonstrated that four of the five conserved motifs are essential for capping enzyme function in vivo.
100.00  Test: We also show that in fusions with the DNA binding domain of GAL4, full activity requires the entire BHV-alpha TIF, although both amino and carboxyl termini display some activity on their own.  Train: We also show that in fusions with the DNA binding domain of GAL4, full activity requires the entire BHV-alpha TIF, although both amino and carboxyl termini display some activity on their own.
Table 10: Samples with over 75% similarity in the BC2GM dataset.

Gene                      Position  Input
capping enzyme            88 100    Mutational analysis of yeast CEG1 demonstrated that four of the five conserved motifs are essential for capping enzyme function in vivo.
human IL-4 cytokine gene  145 165   Analysis of 1 Mb of published sequence from the region of conserved synteny on human chromosome 5q31-q33 identified 45 gene candidates, including 35 expressed genes in the human IL-4 cytokine gene cluster.
LH                        1 2       (LH P < 0.05, LH/FSH P < 0.01).
LH                        10 11     (LH P < 0.05, LH/FSH P < 0.01).
FSH                       13 15     (LH P < 0.05, LH/FSH P < 0.01).
Urease                    0 5       Urease activity, judged as the amount of ammonia production from urea, could be measured at 25 ng per tube (S/N = 1.5) with Jack bean meal urease.
Jack bean meal urease     101 118   Urease activity, judged as the amount of ammonia production from urea, could be measured at 25 ng per tube (S/N = 1.5) with Jack bean meal urease.
cAMP-CRP                  117 124   The reconstituted RNA polymerases containing the mutant alpha subunits were examined for their response to transcription activation by cAMP-CRP and the rrnBP1 UP element.
HIV-2 nef                 51 58     These results demonstrate a specific association of SIV and HIV-2 nef, but not HIV-1 nef, with TCRzeta.
HIV-1 nef                 66 73     These results demonstrate a specific association of SIV and HIV-2 nef, but not HIV-1 nef, with TCRzeta.

Table 11: Gene mentions in the high-similarity BC2GM samples that caused the drop in recall.