MT-Clinical BERT: Scaling Clinical Information Extraction with Multitask Learning
Andriy Mulyar and Bridget T. McInnes, Ph.D.
Virginia Commonwealth University, Richmond, Virginia, United States
Abstract
Clinical notes contain an abundance of important but not readily accessible information about patients. Systems to automatically extract this information rely on large amounts of training data, for which there exist limited resources to create. Furthermore, they are developed disjointly, meaning that no information can be shared amongst task-specific systems. This bottleneck unnecessarily complicates practical application, reduces the performance capabilities of each individual solution, and incurs the engineering debt of managing multiple information extraction systems. We address these challenges by developing Multitask-Clinical BERT: a single deep learning model that simultaneously performs eight clinical tasks spanning entity extraction, PHI identification, language entailment and similarity by sharing representations amongst tasks. We find our single system performs competitively with all state-of-the-art task-specific systems while also benefiting from massive computational benefits at inference.

Introduction
Electronic Health Records (EHRs) contain a wealth of actionable patient information in the form of structured fields and unstructured narratives within a patient's clinical note. While structured data such as billing codes provide coarse-grained signal pertaining to common conditions or treatments a patient may have experienced, a large quantity of vital information is not directly accessible due to being stored in unstructured, free-text notes. The task of automatically extracting structured information from this free-form text is known as information extraction and has been an intensely studied line of research over the past two decades. While the primary objective of information extraction is to gather fine-grained information about patients such as problems experienced, treatments undergone, tests conducted and drugs received, auxiliary tasks such as the automatic identification and subsequent removal of Personal Health Information (PHI) are also of pragmatic interest to the functioning of the health system controlling the EHR. To support this diverse set of information extraction challenges, several community-led shared tasks have annotated datasets for the construction and evaluation of automated information extraction systems. These include the identification of problems, treatments and tests; the identification of drugs, adverse drug events and drug-related information; and the de-identification of PHI. While these shared tasks have produced well-performing solutions, the resulting systems are disjoint, meaning that no information is shared between systems addressing each individual information extraction task. Notably, this means that each task requires a separate engineering effort to solve, narrow technical expertise to construct and disjoint computational resources to apply in clinical practice. Recently, this gap has been narrowed by advances in large-scale self-supervised text pre-training.
This paradigm has resulted in well-known language representation systems such as BERT which can easily be adapted to any single domain-specific task and achieve state-of-the-art performance. In the clinical space, researchers have similarly leveraged large clinical note repositories such as MIMIC-III to pre-train Clinical BERT instances, achieving large performance gains on several clinical NLP related tasks. While well-performing, a single fine-tuned Clinical BERT instance requires significant resources to deploy into a clinical informatics workflow, thus limiting its practical applicability. This fact is amplified by the observation that an isolated 110 million parameter model is required for each clinical task, scaling linearly the required hardware resources. This work introduces Multitask-Clinical BERT: a single, unified deep learning-based clinical information extraction system that concurrently addresses eight information extraction tasks via multitask learning. MT-Clinical BERT augments the well-known BERT deep learning architecture with a novel task fine-tuning scheme that allows the learning of features for multiple clinical tasks simultaneously. As a result, our system massively decreases the hardware and computational requirements of deploying BERT into clinical practice by successfully condensing eight 110 million parameter BERT instances into a single model while retaining nearly all BERT-associated task performance gains.

https://github.com/AndriyMulyar/multitasking_transformers
[Figure 1 diagram: a BERT encoder over input sub-word tokens ([CLS], Tok 1, . . . , Tok n, [SEP]) feeding eight task-specific FC-layer heads: NER heads for i2b2-2010, i2b2-2012, i2b2-2014, n2c2-2018 and quaero-2014; Entailment heads for MedNLI and MedRQE; and an STS head for n2c2-2019.]
Figure 1:
Eight-headed MT-Clinical BERT with a round-robin training schedule. Each Entailment head predicts a one-hot class indicator. The Semantic Text Similarity (STS) head predicts a similarity score in [0, 5] representing the semantic similarity of the two input sentences. Each Named Entity Recognition head predicts a one-hot entity indicator for each input sub-word token.

Our main contributions are summarized as follows:

1. We develop a single deep learning model that concurrently achieves competitive performance over eight clinical tasks spanning named entity recognition, entailment and semantic similarity. As a result, we achieve an eight-fold computational speed-up at inference compared to traditional per-task sequentially fine-tuned models.
2. We demonstrate the feasibility of multitask learning towards developing a universal clinical information extraction system that shares information amongst disjointly annotated datasets.
3. We release and benchmark against a new and more competitive BERT fine-tuning baseline for eight clinical tasks by performing extensive hyperparameter tuning for each task's dataset.

Methods
This section begins with a description of our clinical multitask learning system and then discusses the clinical text benchmarks evaluating its performance. In this work, we refer to a task as a tuple consisting of a text dataset and corresponding task objective (e.g., token classification).
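To make the task notion concrete, the pairing of each dataset with its objective (per Table 1) can be sketched as a small registry. The `Task` structure below is our illustrative assumption, not part of the authors' released code:

```python
from typing import NamedTuple

class Task(NamedTuple):
    """A task is a tuple of a text dataset and a task objective."""
    dataset: str     # name of the annotated corpus
    objective: str   # e.g. token classification, sentence-pair regression

# The eight tasks considered in this work (objectives per Table 1).
TASKS = [
    Task("n2c2-2019", "sentence-pair regression"),      # STS
    Task("MedNLI", "sentence-pair classification"),     # entailment
    Task("MedRQE", "sentence-pair classification"),     # entailment
    Task("n2c2-2018", "token classification"),          # NER
    Task("i2b2-2014", "token classification"),          # NER
    Task("i2b2-2012", "token classification"),          # NER
    Task("i2b2-2010", "token classification"),          # NER
    Task("quaero-2014", "token classification"),        # NER
]

assert len(TASKS) == 8
assert sum(t.objective == "token classification" for t in TASKS) == 5
```

Because a task is identified by this pair rather than by a model, two tasks can share the same objective (and hence head architecture) while drawing on different datasets.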
Multitasking Clinical BERT
The standard practice in transfer learning from BERT is a method known as sequential fine-tuning. During sequential fine-tuning, a BERT architecture is initialized with the parameters of a pre-trained, self-supervised model and then fine-tuned with loss signal from a task-specific head. This procedure adapts the weights of the base pre-trained model
into a task-specific feature encoder capable of representing the input text such that the task objective is easily discernible by the task-specific head (e.g., linearly separable in the case of classification). In contrast, hard-parameter multitask learning aims to adapt the weights of the base pre-trained model into a feature encoder capable of generating text representations suitable for multiple tasks simultaneously. In the case of BERT, this is achieved by treating the BERT Transformer stack as a feature encoder that feeds into multiple light-weight task-specific architectures, each implementing a different task objective.

Figure 2: Task-specific heads with corresponding input representations from the BERT hidden state sequence: (a) a Named Entity Recognition head on a sub-word token hidden state, (b) a Semantic Text Similarity head on the classification token hidden state, and (c) an Entailment head on the classification token hidden state.

Our multitasking model (Figure 1) comprises a BERT feature encoder with weights initialized from Bio + Clinical BERT and eight per-dataset task-specific heads. The head architectures are as follows:

• Named Entity Recognition (Figure 2a): token classification via a per-entity linear classifier on sub-word tokens, providing loss signal with cross-entropy loss.
• Semantic Text Similarity (Figure 2b): sentence-pair semantic similarity scoring via a linear regression on the sequence representation [CLS] token, providing loss signal via the mean squared error.
• Natural Language Inference (Figure 2c): sentence-pair logical entailment via a linear classifier on the sequence representation [CLS] token, providing loss signal with cross-entropy loss.
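As a minimal numerical sketch of these heads (in NumPy, not the authors' PyTorch implementation), each head is a single affine map over the relevant BERT hidden states. The hidden size of 768 matches BERT-base; the sequence length and label counts are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
H = 768                       # BERT-base hidden size
seq_len, n_entities = 128, 9  # illustrative values

# Hidden states from the BERT encoder: one vector per sub-word token;
# position 0 holds the [CLS] classification token.
hidden = rng.standard_normal((seq_len, H))

def ner_head(hidden, W, b):
    """Token classification: per-token logits over entity labels."""
    return hidden @ W + b                    # shape (seq_len, n_entities)

def sts_head(cls, w, b):
    """Linear regression on [CLS] for a jointly-encoded sentence pair."""
    return float(cls @ w + b)                # scalar, trained toward [0, 5]

def entailment_head(cls, W, b):
    """Linear classifier on [CLS] for a jointly-encoded sentence pair."""
    return cls @ W + b                       # shape (n_classes,)

ner_logits = ner_head(hidden, rng.standard_normal((H, n_entities)), np.zeros(n_entities))
score = sts_head(hidden[0], rng.standard_normal(H), 0.0)
ent_logits = entailment_head(hidden[0], rng.standard_normal((H, 3)), np.zeros(3))

assert ner_logits.shape == (seq_len, n_entities)
assert ent_logits.shape == (3,)
```

Because each head is only a linear layer, almost all capacity and computation lives in the shared encoder, which is what makes the eight-head model cheap to extend.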
Algorithm 1
MT-Clinical BERT Training Schedule
Require: θE: pre-trained Transformer encoder.
Require: θH = {θh1, . . . , θhn}: n task-specific heads.
  Randomly initialize θhi ∀ i ∈ {1, . . . , n}
  while all batches from the largest task dataset are not sampled do
      Sample a batch Di for each θhi ∈ θH
      for each (θhi, Di) do                ▷ One round-robin iteration
          Let θ = θE ◦ θhi                 ▷ Outputs of encoder into head θhi
          θ′ = θ − α∇θ L(θ, Di)
          Update θ with θ′
      end for
  end while

To train our multitasking model, the feature encoder must be adapted to support all tasks simultaneously. There are several established methods of adapting the feature encoder parameters by combining loss signal from each head during training (e.g., averaging/adding losses); however, most assume that the loss function is constant across all of the heads. In general this is not necessarily true. When different loss functions are present, the standard solution is to sub-sample instances from each dataset proportional to the dataset size and then proceed with batch stochastic gradient descent with respect to each individual loss function. We propose an alternative training scheme equivalent to proportional sub-sampling, but requiring no additional proportionality calculations. To do this, we cycle the heads and batched gradient updates in a round-robin fashion over the BERT feature encoder. This training schedule is summarized in Algorithm 1.

Data
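The round-robin schedule of Algorithm 1 can be sketched as follows. The gradient update on the shared encoder plus the active head is elided here (each yielded pair stands for one such update), and the dataset sizes are illustrative:

```python
from itertools import cycle

def round_robin_schedule(dataset_sizes, batch_size):
    """Yield (task_index, batch_offset) pairs following Algorithm 1: visit each
    task head once per round, drawing one batch per head, until the largest
    dataset is exhausted. Smaller datasets simply wrap around."""
    iters = [cycle(range(0, n, batch_size)) for n in dataset_sizes]
    largest = max(dataset_sizes)
    rounds = -(-largest // batch_size)   # ceil division: batches in largest dataset
    for _ in range(rounds):              # one round-robin pass per largest-dataset batch
        for task, it in enumerate(iters):
            yield task, next(it)         # stands for: θ ← θ − α∇L(θE ∘ θh_task, batch)

# Example: three toy datasets of 100, 40 and 10 examples, batch size 10.
schedule = list(round_robin_schedule([100, 40, 10], batch_size=10))
counts = [sum(1 for t, _ in schedule if t == i) for i in range(3)]
assert counts == [10, 10, 10]   # every head receives one update per round
```

Note that every head gets the same number of updates per round; the smallest dataset is cycled through repeatedly while the largest is seen once, so training length is governed by the largest dataset without any explicit proportionality calculation.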
In this section, we describe the eight clinical tasks used to evaluate our multitasking system. Table 1 showcases the tasks considered, the pre-defined train and evaluation splits used in our experiments and the corresponding task evaluation metric.
Table 1:
Clinical information extraction benchmarks with reported performance metric.

Task        Dataset      Metric       Description                         Train    Test
STS         n2c2-2019    Pearson Rho  Sentence Pair Semantic Similarity   1,641    410
Entailment  MedNLI       Accuracy     Sentence Pair Entailment            12,627   1,422
Entailment  MedRQE       Accuracy     Sentence Pair Entailment            8,588    302
NER         n2c2-2018    Micro-F1     Drug and Adverse Drug Event         36,384   23,462
NER         i2b2-2014    Micro-F1     PHI de-identification               17,310   11,462
NER         i2b2-2012    Micro-F1     Events                              16,468   13,594
NER         i2b2-2010    Micro-F1     Problems, Treatments and Tests      27,837   45,009
NER         quaero-2014  Micro-F1     UMLS Semantic Groups (French)       2,695    2,260

The Semantic Textual Similarity (STS) task is to assign a numerical score to sentence pairs indicating their degree of semantic similarity. Our system includes one STS dataset:

1. The n2c2-2019 dataset consists of de-identified pairs of clinical text snippets from the Mayo Clinic that were ordinally rated from 0 to 5 with respect to their semantic equivalence, where 0 indicates no semantic overlap and 5 indicates complete semantic overlap. The training dataset contains 1,642 sentence pairs; the test dataset contains 412 sentence pairs.

Textual Entailment is the task of determining if one text fragment is logically entailed by the previous text fragment. We utilize two Entailment datasets:

2. The MedNLI dataset consists of sentence pairs developed by physicians from the Past Medical History section of MIMIC-III clinical notes, annotated for Definitely True, Maybe True and Definitely False. The dataset contains 11,232 training, 1,395 development and 1,422 test instances. We combined the training and development instances for our work.

3. The MedRQE dataset consists of question-answer pairs from the National Institutes of Health (NIH) National Library of Medicine (NLM) clinical question collection consisting of Frequently Asked Questions (FAQs). The positive examples were drawn explicitly from the dataset, while the negative pairs were collected by associating a randomly combined question-answer pair as having at least one common keyword and at least one different keyword from the original question. The dataset contains 8,588 training pairs and 302 test pairs, with approximately 54.2% positive instances.

Named Entity Recognition (NER) is the task of automatically identifying mentions of specific entity types within unstructured text. In this work, we utilize five NER datasets:

4. The n2c2-2018 dataset consists of 505 de-identified discharge summaries drawn from the MIMIC-III clinical care database and annotated for Adverse Drug Events (ADEs) and the drug that caused them, the reason for taking the drug, and the associated dosage, route, and frequency information. The training and test sets contain 303 and 202 instances respectively.

5. The i2b2-2014 dataset consists of 28,772 de-identified discharge summaries provided by Partners HealthCare, annotated for personal health information (PHI) including patient names, physician names, hospital names, identification numbers, dates, locations and phone numbers. The training and test sets contain 17,310 and 11,462 instances respectively.

6. The i2b2-2012 dataset consists of de-identified discharge summaries provided by Partners HealthCare and MIMIC-II. The dataset was annotated for two entity types: 1) clinical events, including clinical concepts, departments, evidentials and occurrences; and 2) temporal expressions, referring to dates, times, durations, or frequencies. In this work, we evaluated over only the event annotations. The training and test sets contain 16,468 and 13,594 instances respectively.

7. The i2b2-2010 dataset consists of de-identified discharge summaries provided by Partners HealthCare and MIMIC-II, and de-identified discharge and progress notes from the University of Pittsburgh Medical Center. The dataset was annotated for three entity types: clinical concepts, clinical tests and clinical problems. These entities overlap with the i2b2-2012 event annotations. The training and test sets contain 27,837 and 45,009 instances respectively.

8. The quaero-2014 dataset consists of a French medical corpus containing three document types: 1) European Medicines Agency (EMEA) drug information; 2) MEDLINE research article titles; and 3) European Patent Office (EPO) patents. The dataset was annotated for ten types of clinical entities from the Unified Medical Language System (UMLS) Semantic Groups: Anatomy, Chemicals and Drugs, Devices, Disorders, Geographic Areas, Living Beings, Objects, Phenomena, Physiology, and Procedures. The training and test sets contain 2,695 and 2,260 instances respectively.

Evaluation
To ensure a competitive and fair comparison with existing state-of-the-art solutions, we perform a hyperparameter search for each individual Clinical BERT task fine-tuning run and report the best performing model on each task. Recent work has found negligible performance differences between random seed re-initialization and more complex methods of hyperparameter search during BERT fine-tuning, so we opt for the former. Specifically, for each task a Clinical BERT instance is initialized and fine-tuned for twenty training data epochs over five unique random seeds, resulting in 100 unique task-specific models. We report the top performing model at evaluation. We do not utilize a development set for training MT-Clinical BERT, as the multitasking paradigm itself largely removes the ability for a model to overfit any specific task. Additionally, we do not perform hypothesis testing due to the significant computational resources required.

Results and Discussion

Table 2:
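The random-seed search described above (five seeds, a checkpoint evaluated after each of twenty epochs, best of the resulting 100 models kept) can be sketched as follows; `fake_run` is a hypothetical stand-in for an actual fine-tune-and-evaluate call:

```python
import random

def seed_search(train_and_eval, seeds=range(5), epochs=20):
    """Random-seed hyperparameter search: fine-tune once per seed, evaluate the
    checkpoint after every epoch, and keep the best-scoring model overall."""
    best_score, best_config = float("-inf"), None
    for seed in seeds:                      # 5 re-initializations ...
        for epoch in range(1, epochs + 1):  # ... x 20 epoch checkpoints = 100 models
            score = train_and_eval(seed, epoch)
            if score > best_score:
                best_score, best_config = score, (seed, epoch)
    return best_score, best_config

# Toy stand-in for a fine-tuning run: a deterministic pseudo-score per (seed, epoch).
def fake_run(seed, epoch):
    return random.Random(seed * 100 + epoch).random()

score, (seed, epoch) = seed_search(fake_run)
assert seed in range(5) and 1 <= epoch <= 20
```

Treating each (seed, epoch) checkpoint as a candidate model is what yields the 100-model pool per task from which the reported baseline is selected.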
Clinical information extraction performance of MT-Clinical BERT versus hyperparameter-searched Clinical BERT fine-tuning runs. All span-level metrics are exact match. Task performances showcased in the column MT-Clinical BERT represent a single multitask-trained feature encoder with individual task-specific heads. All other reported results are generated from task-specific BERT models. Higher is better. [Per-task scores for the columns MT-Clinical BERT, Optimized Clinical BERT and Clinical BERT are not legible in the source.]

Evaluations reported in the column (Optimized) Clinical BERT represent individually fine-tuned, per-task BERT models. Evaluations reported in the column MT-Clinical BERT represent light-weight task-specific heads over a single multitask-trained BERT feature encoder. We find that the performances reported in the Clinical BERT paper can be substantially improved via hyperparameter search. While this is not surprising (the authors specify that performance was not their goal), it is important to compare improvements or degradations against a competitive baseline. All further discussion compares the multitasking model to the hyperparameter-optimized Clinical BERT baseline.

We observe a slight but consistent performance degradation in MT-Clinical BERT relative to sequential fine-tuning. Intuitively, this suggests that learning a general clinical text representation capable of supporting multiple tasks has the downside of losing the ability to exploit dataset or clinical note specific properties when compared to a single, task-specific model. This phenomenon can best be illustrated amongst the English token classification tasks, where the de-identification task, i2b2-2014, suffered the greatest performance degradation. Clinical BERT is pre-trained over MIMIC-III. As MIMIC-III is de-identified, all PHI markers in the original notes are replaced with special PHI tokens that do not linguistically align with the surrounding text (e.g., an instance of a hospital name would be replaced with the token [HOSPITAL]). Due to this, no PHI tokens are present in MIMIC-III and thus the pre-training procedure of Clinical BERT over the MIMIC-III corpus provides little signal pertaining to PHI tokens. Alsentzer et al.
observes and discusses this property at depth. These results suggest that a lack of PHI-related information during pre-training can be overcome by the encoder during sequential fine-tuning, but not as successfully when regularized by the requirement of supporting multiple tasks.

Surprisingly, MT-Clinical BERT confers a slight performance increase in the problem, treatment and test extraction task i2b2-2012 relative to the hyperparameter-tuned Clinical BERT baseline. This suggests that multitask regularization with the related problem, treatment and test extraction task in i2b2-2010 may be inducing features more suited to generalizability for these entity types. These are the only NER tasks with overlapping entity definitions.

Our final observation re-enforces the commonly laid out claim in the multitasking community related to task orthogonality/overlap. In the supervised multitask set-up, two tasks are said to have overlap when some characteristics of a given task (e.g., data domain, task objective, target label space) should intuitively help with performance on a different but related task. Otherwise, tasks are said to be orthogonal along that characteristic. The majority of the tasks (5/8) in this study are token classification objectives. Unlike the three segment-level tasks, these require the BERT feature encoder to learn task-robust contextual token representations which, due to their prevalence during training, may harm the formation of segment-level representations. This objective orthogonality is suggested by consistent and large performance decreases in the entailment tasks (MedNLI and MedRQE). We speculate that this could be aided by including additional clinical-related segment-level objectives during training or by incorporating the original next sentence prediction pre-training objective into the multitasking mix. Similarly, the quaero-2014 corpus is entirely in French. This naturally induces a lingual orthogonality relative to the other seven English corpora. This orthogonality manifests by inducing the largest loss in competitiveness (-6.4%) to fine-tuning baselines across all tasks. Again, we suspect that the inclusion of additional non-English token-level tasks could close this performance gap.

To summarize, the main insights from our analysis are:

• A general trend of degradation in MT-Clinical BERT task-specific performance relative to individual task-specific models. This is a direct trade-off against the eight-fold reduction in parameters and computational speed-up at inference provided by MT-Clinical BERT.
• A task-specific performance increase on i2b2-2012 by MT-Clinical BERT, potentially due to the regularization provided during multitask learning.
• The greatest relative reductions in multitasking performance occur on datasets (MedNLI, MedRQE and quaero-2014) with characteristics orthogonal to the predominantly English token classification (NER) tasks considered.

Experimental Details and Reproducibility
We base our implementation on the well-known HuggingFace "Transformers" implementation of BERT. During hyperparameter tuning, we re-initialize with five unique random seeds. All fine-tuning is performed with a constant learning rate. The NER heads train on 512 sub-word sequences with batch size 25, while STS and Entailment training is performed with a batch size of 40. All training and evaluation is conducted on a single Tesla V100 GPU. In addition to our pre-trained models, we support reproducibility by including all pre-processing necessary to replicate our results in the code release.

Avenues for Practical Impact
It is important that the recent NLP advances our work builds on can reach clinical practice. This section provides insight to the clinical NLP practitioner regarding the feasibility and advantages of transitioning MT-Clinical BERT into their clinical EHR analysis systems. The main contribution of this work is condensing eight 110M-parameter deep learning models into a single model. Practitioners can realize enormous computational benefits (8x) at inference while performing less implementation work by implementing a clinical information system based on our contribution. Importantly, our contribution is expandable via the integration of additional tasks during training. This means our system is capable of integrating and concurrently supporting future information extraction tasks, such as the inclusion of novel, currently undefined named entity types.
Limitations
We foresee the following limitations for both the implementation and scaling of our proposed system. First, the datasets considered are annotated over patient discharge summaries. Naturally, different types of notes may have differing underlying data distributions, which can lead to performance degradation. Second, we have observed from experiments in other domains that scaling the number of tasks during training inversely correlates with per-task performance. This means that multitask training with a large number of tasks may require careful ablation experiments to gauge the net benefit of adding any given task.

Related Work
Multitask learning has been an integral sub-field of the machine learning community for many decades. In the context of deep learning, programs in several domains spanning drug discovery, computer vision and natural language processing have continued achieving successes by sharing supervised signal and data between machine learning tasks. Half a decade ago, Ramsundar et al. introduced a multitask learning system based on hard-parameter sharing for drug discovery. This contribution achieved significant performance improvement in drug target identification by leveraging 259 unique drug target tasks during multitask training. Similarly, in computer vision, Yan et al. developed MULAN, a multitasking system that concurrently detects, tags and segments lesions in radiological report images. In 2019, Liu et al. introduced a multitasking model to improve performance on the GLUE natural language understanding benchmark and Google introduced T5, a system capable of multitasking by framing common NLP tasks as sequence generation objectives sharing a single encoder/decoder.

Future work
There are several directions for future work. We describe them and provide insight below.

• Adding more tasks and datasets. Is adding more tasks feasible and beneficial? There is strong evidence suggesting that including a greater number of overlapping tasks may increase task-specific predictive performance. This comes with the additional benefit of increasing computational performance at inference as described in this work.

• Learning from limited data. Do the representations obtained via multitask learning serve as a better initialization for learning from limited data resources? Work in this direction would benefit from the inclusion of instance ablation studies.

• Unifying NLP pipelines into end-to-end systems. Many common NLP tasks build upon the output of previous tasks. This naturally results in phenomena such as error propagation. Can the shared representations produced by a multitask encoder construct an effective joint NER and relation identification system? Recent work suggests this is possible, but can it be accomplished in the multitasking framework?

• Incorporating pre-training objectives during multitasking. In low-annotated-data domains such as clinical text, we suspect it may be useful to incorporate the self-supervised masked language modeling and next sentence prediction objectives during multitask training. During preliminary experiments, we find that this does not harm system performance.
Conclusion
We find that multitask learning is an effective mechanism to distill information from multiple clinical tasks into a single system. This has the main benefit of significant hardware and computational reductions at inference, with the trade-off of a small performance degradation. Our system directly increases the potential for the use of recent state-of-the-art NLP methods in clinical application. In addition, we contribute new state-of-the-art baselines for several clinical information extraction tasks. The data repositories and resources of the clinical NLP community have grown steadily over the past two decades - the doors have been opened to consolidate, cross-leverage and jointly build on these expensive annotation efforts. We make our implementation and pre-trained models publicly accessible.

Acknowledgements
The authors would like to thank Nick Rodriguez for his valuable commentary and suggestions on the final draft of this article.
References
1. Özlem Uzuner, Brett R South, Shuying Shen, and Scott L DuVall. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association, 18(5):552–556, 2011.
2. Weiyi Sun, Anna Rumshisky, and Özlem Uzuner. Evaluating temporal relations in clinical text: 2012 i2b2 challenge. Journal of the American Medical Informatics Association, 20(5):806–813, 2013.
3. Sam Henry, Kevin Buchan, Michele Filannino, Amber Stubbs, and Özlem Uzuner. 2018 n2c2 shared task on adverse drug events and medication extraction in electronic health records. Journal of the American Medical Informatics Association, 27(1):3–12, 2020.
4. Amber Stubbs, Christopher Kotfila, and Özlem Uzuner. Automated systems for the de-identification of longitudinal clinical narratives: Overview of 2014 i2b2/UTHealth shared task track 1. Journal of Biomedical Informatics, 58:S11–S19, 2015.
5. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.
6. Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. XLNet: Generalized autoregressive pretraining for language understanding, 2019.
7. Alistair EW Johnson, Tom J Pollard, Lu Shen, H Lehman Li-wei, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035, 2016.
8. Emily Alsentzer, John Murphy, William Boag, Wei-Hung Weng, Di Jindi, Tristan Naumann, and Matthew McDermott. Publicly available clinical BERT embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, pages 72–78, Minneapolis, Minnesota, USA, June 2019. Association for Computational Linguistics.
9. Sebastian Ruder. Neural Transfer Learning for Natural Language Processing. PhD thesis, National University of Ireland, Galway, 2019.
10. 2019 n2c2 shared-task and workshop. https://n2c2.dbmi.hms.harvard.edu/track1. Accessed: 2012-03-09.
11. Alexey Romanov and Chaitanya Shivade. Lessons from natural language inference in the clinical domain. arXiv preprint arXiv:1808.06752, 2018.
12. Asma Ben Abacha and Dina Demner-Fushman. Recognizing question entailment for medical question answering. In AMIA 2016, American Medical Informatics Association Annual Symposium, Chicago, IL, USA, November 12-16, 2016, 2016.
13. Aurélie Névéol, Cyril Grouin, Jeremy Leixa, Sophie Rosset, and Pierre Zweigenbaum. The QUAERO French medical corpus: A ressource for medical entity recognition and normalization. In Proc of BioTextMining Work, pages 24–30, 2014.
14. Alexa T McCray, Anita Burgun, and Olivier Bodenreider. Aggregating UMLS semantic types for reducing conceptual complexity. Studies in Health Technology and Informatics, 84(0 1):216, 2001.
15. Jesse Dodge, Gabriel Ilharco, Roy Schwartz, Ali Farhadi, Hannaneh Hajishirzi, and Noah Smith. Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping, 2020.
16. Bharath Ramsundar, Steven Kearnes, Patrick Riley, Dale Webster, David Konerding, and Vijay Pande. Massively multitask networks for drug discovery. arXiv preprint arXiv:1502.02072, 2015.
17. Ke Yan, Youbao Tang, Yifan Peng, Veit Sandfort, Mohammadhadi Bagheri, Zhiyong Lu, and Ronald M. Summers. MULAN: Multitask universal lesion analysis network for joint lesion detection, tagging, and segmentation. In MICCAI, 2019.
18. Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Multi-task deep neural networks for natural language understanding. In