Pre-training A Neural Language Model Improves The Sample Efficiency of an Emergency Room Classification Model
Binbin Xu, Cédric Gil-Jardiné, Frantz Thiessard, Eric Tellier, Marta Avalos, Emmanuel Lagarde
Neural Language Model for Automated Classification of Electronic Medical Records at the Emergency Room: The Significant Benefit of Unsupervised Generative Pre-training
Binbin Xu, INSERM U1219-IETO team, ISPED, Bordeaux, France ([email protected])
Cédric Gil-Jardiné, University Hospital of Bordeaux, Pole of Emergency Medicine; INSERM U1219-IETO team, ISPED ([email protected])
Frantz Thiessard, INSERM U1219-ERIAS team, ISPED, Bordeaux, France ([email protected])
Eric Tellier, University Hospital of Bordeaux, Pole of Emergency Medicine; INSERM U1219-IETO team, ISPED ([email protected])
Marta Avalos, INSERM U1219-SISTM team, ISPED, Bordeaux, France ([email protected])
Emmanuel Lagarde (corresponding author), INSERM U1219-IETO team, ISPED, Bordeaux, France ([email protected])
October 25, 2019
Abstract
The French TARPON project aims to build a national injury surveillance system based on emergency room (ER) visit reports. To this end, it is necessary to develop a coding system capable of classifying the causes of these visits based on clinical notes written in French by ER clinicians. While supervised learning techniques have shown good results in this area, they require the manual annotation of a large number of texts in order to build a sufficiently large labeled training dataset. Over the past two years, new levels of performance have been achieved with Transformer-based neural language models (NLMs) that incorporate an unsupervised generative pre-training step. Our hypothesis is that methods involving a generative self-supervised pre-training step can significantly reduce the number of annotated samples required for the supervised fine-tuning phase. We aimed to measure the gain, in terms of manual annotation load, obtained by adopting this pre-training step.

To test our hypothesis, we exploited the fact that we could derive the traumatic/non-traumatic nature of the cause of the ER visit from available diagnostic codes. We then designed a case study to predict from free-text clinical notes whether a given ER visit was the consequence of a traumatic or a non-traumatic event. We compared two scenarios. Scenario A consisted in training the GPT-2 NLM on a trauma/non-trauma labeled dataset (with a maximum of 161,930 notes) in a single fully supervised phase. In Scenario B, we split the training dataset in two parts: a large unlabeled one of 151,930 notes for the self-supervised pre-training phase and a much smaller labeled dataset (up to 10,000 notes) for the supervised fine-tuning. In both scenarios, the GPT-2 model is trained from scratch.

In Scenario A, the AUC and F1 score reach the values of . and . respectively after the processing of the 161,930 labeled notes. The use of generative pre-training (Scenario B) achieved an AUC of . and an F1 score of . after the processing of only 600 labeled clinical notes. To achieve the same performance, far more labeled clinical notes had to be processed in Scenario A.

To conclude, it is possible to easily adapt a multi-purpose NLM such as the GPT-2 to create a powerful tool for the classification of free-text notes with only a very small number of labeled samples.

Keywords: Neural Language Model · pre-training · Transformer · GPT-2
Over the past 10 years, neural language models (NLMs) have progressively taken the largest share in the field of natural language processing, with techniques based on long short-term memory and gated recurrent networks [1] or convolutional networks [2]. NLMs have become indispensable in this field, with applications such as machine translation, document classification, text summarization and speech recognition.

The benefit of unsupervised pre-training was identified early [3], but in the domain of NLMs, new levels of performance have only recently been achieved with models based on the concept of attention, which consists in learning dependencies between words in a sentence without regard to their distances. This mechanism has been implemented in a sequence-to-sequence neural network model, the Transformer architecture, proposed in 2017 [4]. This model can be trained with an unsupervised generative step that learns from a large set of text to predict the next token in a sentence [5]. One of the latest examples is the GPT-2, published in February 2019 by OpenAI. GPT-2 is a large Transformer-based language model with 1.5 billion parameters, trained on a dataset of 8 million web pages to predict the next word after a given prompt sentence [6]. This work quickly drew attention from the community, as it demonstrated the model's ability to generate artificial texts that are difficult to distinguish from human-written texts. Moreover, the meaning of these artificial sentences was surprisingly consistent with the original context text (prompt). Although only reduced versions of the full model were released to the public, its potential applications are already numerous. Indeed, beyond the capability to generate coherent texts, the GPT-2 has the potential to perform other tasks such as question answering and document classification.
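The attention mechanism just described can be made concrete. The following is a minimal, illustrative pure-Python sketch of the scaled dot-product attention at the heart of the Transformer [4] (the function names are ours, not from the GPT-2 code base): each query is compared with every key, regardless of distance, and the resulting weights mix the value vectors.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V, one query at a time.

    Q, K, V are lists of equal-length vectors (one per token).
    """
    d = len(K[0])
    outputs = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        weights = softmax(scores)
        # Weighted mix of the value vectors
        outputs.append([sum(w * v[j] for w, v in zip(weights, V))
                        for j in range(len(V[0]))])
    return outputs
```

With two orthogonal keys, a query aligned with the first key attends almost exclusively to the first value vector, which is the "where attention should be focused" behavior described above.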
Following the same idea as the BERT model [7], transferring many self-attention blocks from a pre-trained model proved sufficient to transfer contextual representations to the dataset. The training of the model is then performed in two distinct phases [8]. The first, generative pre-training, unsupervised (or more accurately self-supervised) phase consists in the exploitation of a text corpus; it leads to the ability to generate text automatically, and the relevance of these synthetic sentences suggests that the network has learned contextual semantic representations. The second, supervised fine-tuning phase consists in resuming learning from an annotated text corpus, with the objective of creating a system able to perform specific tasks.

We intended to leverage the document classification potential of the GPT-2 model to classify free-text clinical notes in the context of the TARPON project. This French project proposes to build a national surveillance system based on the exhaustive collection of emergency room (ER) visit reports in France. Its main feature is the application of automatic language analysis to extract the injury mechanism and cause from the digital medical record of each ER visit. The creation of this database and its matching with the French national health data system will be used to create a nation-wide, comprehensive and automated trauma/injury monitoring, research and alert system. More than 21 million unlabeled ER clinical notes are produced every year in France. The cause of the visit is not available as a standardized database, although it is fully described in free-text narratives stored in digital clinical records. The overall objective of the project is thus to develop a tool that derives standardized trauma/injury information and causes from these ER notes.
For that purpose, substantial amounts of expert-annotated data would be necessary to train a conventional text classification model with acceptable accuracy. Our hypothesis is that methods involving a generative self-supervised pre-training step, such as the GPT-2, can significantly reduce the number of expert-annotated samples required for the supervised fine-tuning phase. This is of paramount significance for all projects wishing to use NLMs for free-text classification tasks, because the manual annotation phase is by far the most expensive one. The objective of our study is therefore to measure the gain in terms of manual annotation load obtained by adopting this pre-training step.
To test our hypothesis, we exploited the current digital medical record data of our ER department. We could derive the traumatic/non-traumatic characteristic of the cause of the ER visit from the available diagnostic codes assigned by clinicians or technical staff at the time of the patient's hospitalization. We then designed a case study to assess whether we can predict from free-text clinical notes whether an ER visit is due to trauma or not.
Figure 1: Scenario A: supervised training on labeled clinical notes (from a dataset of 161,930 notes); prompts ending with the TARPON keyword are built for prediction.

In order to measure the gain obtained with the self-supervised training phase, we compared the performance of the two scenarios (Figures 1 and 2). Scenario A consisted in retraining the GPT-2 NLM from scratch on the labeled dataset in a single fully supervised phase. In Scenario B, we further split the training dataset in two parts: a large unlabeled dataset for the self-supervised pre-training phase and a smaller labeled dataset for the supervised training. The main question was therefore to assess how many clinical notes are required in this training part of Scenario B to achieve the same acceptable performance as in Scenario A. This gives us a measure of how much annotation load can be saved with Scenario B.
Figure 2: Scenario B: self-supervised pre-training on 151,930 unlabeled clinical notes (after this step the model is able to generate artificial text), followed by supervised training on labeled clinical notes (from a dataset of 10,000 notes); prompts ending with the TARPON keyword are built for prediction.
We retrieved clinical notes and International Classification of Diseases, version 10 (ICD-10) diagnostic codes from the digital medical record system of the adult ER of the University Hospital of Bordeaux, France, from 2011 to 2018. The ICD-10 [9] is the most widely used standard for recording diagnoses and medical procedures, and is the terminology mandatorily used in France for all stays in any private or public hospital. This dataset contains 288,404 medical records, of which 209,341 contain both a diagnosis code and a clinical note.

The labels (trauma / non-trauma event) were derived from the ICD-10 codes: a total of 56,410 visits with ICD-10 codes beginning with the letters S, T1 to T35 and V were coded as trauma, and 115,520 visits with ICD-10 codes beginning with the letters A, C, D, E, G, H, I, J, L, N were coded as non-trauma. A total of 37,411 visits with codes beginning with other letters (F, M, O, P, Q, T36 to T98, X40 to X57, Y10 to Y98, U, Z) were excluded because they correspond to pathologies whose traumatic nature is either uncertain or debatable from a semantic point of view. The total number of available clinical notes was therefore 171,930.

The sampling strategy is illustrated in Figure 3. For test purposes, 10,000 clinical notes were randomly selected and then frozen for both scenarios. The remaining 161,930 notes were used with their labels in Scenario A in order to estimate the number of notes needed to achieve maximum prediction performance on the 10,000-note test set. For Scenario B, we further split the 161,930 notes into a set of 151,930 unlabeled notes for unsupervised pre-training and a second set of 10,000 labeled notes for the supervised fine-tuning step.

To better determine the optimal required number of clinical notes, models were independently trained and evaluated for many cases with different, arbitrarily chosen numbers of notes. In total, for Scenario A, 26 cases (from 20, 40, ... notes up to all 161,930 notes) were evaluated. In Scenario B, 19 cases were studied, from 20, 40, ... up to 10,000 notes.
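The trauma/non-trauma labeling rule described above can be sketched as a small function. This is an illustrative simplification (the function name is ours, and real ICD-10 codes have subdivisions that a production implementation would have to handle), assuming the leading letter and two-digit number of a code are enough to decide:

```python
def label_from_icd10(code):
    """Return "trauma", "non_trauma", or None (excluded) for an ICD-10 code.

    Sketch of the paper's labeling rule:
    - S, T1-T35, V                     -> trauma
    - A, C, D, E, G, H, I, J, L, N    -> non-trauma
    - anything else (F, M, O, P, Q, T36-T98, U, Z, ...) -> excluded,
      because the traumatic nature is uncertain or debatable
    """
    letter = code[0].upper()
    if letter in ("S", "V"):
        return "trauma"
    if letter == "T":
        number = int(code[1:3])  # e.g. "T24" -> 24
        return "trauma" if 1 <= number <= 35 else None
    if letter in set("ACDEGHIJLN"):
        return "non_trauma"
    return None
```

For example, a femur fracture (S72) maps to trauma, pneumonia (J18) to non-trauma, and a depressive episode (F32) is excluded.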
Figure 3: Sampling strategy: 26 cases evaluated in Scenario A and 19 cases in Scenario B.

Like other neural language models based on convolutional and recurrent networks, the GPT-2 proposed by Radford and colleagues is a sequence-to-sequence transduction model [10]. The main feature of the Transformer architecture is the use of attention weights on text inputs [4]. During the training process, the network learns a context vector that gives global-level information on the inputs, indicating where attention should be focused. The novel approach consists in replacing recurrence with attention to handle the dependencies between input and output.

The GPT-2 is built to predict the next token from an input text sequence. By looping this process, it works as a text generator. Text can be generated de novo or from any arbitrary text prompt. The model was originally trained on millions of web pages without any explicit supervision. Four GPT-2 models with respectively 117, 345, 762 and 1542 million parameters were trained, and only the first three had been released at the time of writing. Only the first two models are trainable on standard workstations.

Note that the GPT-2 models are trained on web text mostly written in English, while our clinical notes are all in French. Consequently, in the present work, we did not use those pre-trained models and instead retrained the models from a random set of weights.

The 117M models were trained mainly on a single Nvidia GeForce GTX 1080 Ti with 11 GB of VRAM (4 parallel sessions can be run on our workstation with 4 GTX 1080 Ti cards). The 345M models were trained on another workstation with a single Nvidia TITAN RTX with 24 GB of VRAM.
In Scenario B, the pre-training step is referred to as unsupervised learning because it consists simply in reading the unlabeled clinical notes (Figure 2). It actually uses a sliding learning window on the text: the first part of the window is the input, and the last token is the token to be predicted. This first step leads to models that can generate texts resembling clinical notes in French, including the use of medical jargon and specialized abbreviations.

For the supervised learning phases (Scenario A and the second training process in Scenario B), we added a task identifier (e.g. TARPON) at the end of each clinical note, followed by a classification code: 1 for clinical notes corresponding to traumatic events and 0 for clinical notes corresponding to non-traumatic events. The codes are preferably chosen from the vocabulary so that the prediction (classification) probability can be directly extracted from the model for further quantification. As described above, this code was derived from the diagnosis classification manually coded by clinicians.

For both scenarios (Figures 1 and 2), the test phase consists in feeding the model with prompts built by adding the task identifier at the end of each test clinical note, and asking the model to predict the next token right after the task identifier. Ideally, this newly generated token should be one of the classification codes (tokens). On the first iterations, due to the random initialization and insufficient learning, the predicted token can be any token from the vocabulary other than the expected classification tokens, but it quickly turns out to be mainly the classification tokens. Our operating principle can therefore be compared to a question-answering task.
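The prompt construction and probability read-out just described can be sketched as follows. This is a schematic illustration (helper names such as `make_training_text` are ours, and the next-token distribution would in practice come from the GPT-2 model): a supervised example appends the task identifier and the class token, a test prompt stops at the task identifier, and the trauma probability is read off the next-token distribution restricted to the two class tokens.

```python
TASK_ID = "TARPON"

def make_training_text(note, is_trauma):
    """Supervised example: note + task identifier + class token (1/0)."""
    return f"{note} {TASK_ID} {1 if is_trauma else 0}"

def make_prompt(note):
    """Test prompt: the model must generate the token right after TASK_ID."""
    return f"{note} {TASK_ID}"

def trauma_probability(next_token_probs):
    """Classification probability from the model's next-token distribution.

    next_token_probs maps vocabulary tokens to probabilities. Restricting
    the distribution to the two class tokens and renormalizing yields
    P(trauma). Early in training the predicted mass may sit on unrelated
    tokens, in which case we fall back to an uninformative 0.5.
    """
    p1 = next_token_probs.get("1", 0.0)
    p0 = next_token_probs.get("0", 0.0)
    if p1 + p0 == 0.0:
        return 0.5
    return p1 / (p0 + p1)
```

Choosing class codes that are single vocabulary tokens is what makes this renormalization trick possible: the classifier's confidence is obtained without any extra layer on top of the language model.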
The prediction performance of the model was measured by the F1 score and the area under the ROC curve (AUC) [11]. Evaluations on the same 10,000 clinical notes were performed for both Scenario A and Scenario B.
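For reference, both metrics can be computed from scratch. The sketch below (with helper names of our own) computes the F1 score from true/false positive and negative counts, and the AUC as the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, which is the rank-statistic view of the area under the ROC curve [11].

```python
def f1_score(y_true, y_pred):
    """F1 = 2PR / (P + R) for binary labels; None when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return None  # precision and recall both null, as early in Scenario A
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def auc(y_true, scores):
    """AUC as P(score_pos > score_neg), ties counted as 1/2."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The `None` return mirrors the situation reported in the results: when the model has not yet produced any true positive, precision and recall are both null and the F1 score cannot be measured.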
No nominative data were necessary for this work. The dataset was, however, not checked and not specifically de-identified. Data processing and computing were conducted within the facilities of the Emergency Department of the University Hospital of Bordeaux, which have received regulatory clearance to host and exploit databases with personal and medical data.
For both scenarios, we compared the AUC (Figures 4 to 6) and the F1 score (Figures 7 to 9) by iterations, with a batch size of 1. The number of iterations needed to achieve a maximal AUC / F1 score value varied depending on the number of notes (Figures 4 and 5 for the AUC; Figures 7 and 8 for the F1 score). For each set of clinical notes, the maximum AUC / F1 score value was retained (Figure 6 for the AUC; Figure 9 for the F1 score) to represent how model performance varied with respect to the number of labeled notes.

In Scenario A, the AUC and F1 score reach the values of . and . respectively after the processing of all the 161,930 labeled notes. The use of generative pre-training (Scenario B) achieved an AUC of . and an F1 score of . after the processing of only 600 labeled clinical notes. To achieve the same performance, far more labeled clinical notes had to be processed in Scenario A (Figures 6 and 9). At the end of Scenario B, after training on all 10,000 notes, the AUC and F1 score are respectively . and . , corresponding to the cases of more than 100,000 notes in Scenario A. For 16 times more data, the gain of Scenario A over Scenario B is only an improvement of . in AUC and . in F1 score. Though the
AUC converged to the same ending point in both scenarios, the learning patterns were quite different. In Scenario A (Figure 4), the AUC started with a value of ∼ . . Because of insufficient learning, almost all clinical notes are classified as non-trauma at this stage. The AUC dropped during the first iterations due to clinical notes wrongly classified as trauma, then increased as expected. The main reason is that, in this question-answering style of study, the model has to perform two tasks at the same time: learning the semantic representation of the clinical notes and learning the classification task. In Scenario B (Figure 5), by contrast, the clinical-note generation task is learned during the pre-training phase, leading to a monotonically increasing AUC curve in step 2, corresponding to the learning of the classification task.

Figure 4: Scenario A: AUC by number of iterations. 26 cases.

Figure 5: Scenario B: AUC by number of iterations. 19 cases.

Figure 6: Comparison of the AUC by cases (number of notes) for both scenarios.

The same is observed for the F1 score. In Scenario A (Figure 7), the F1 score cannot be measured for the first 500 iterations, since recall and precision are both null, while in Scenario B the F1 score can be measured after only 20 iterations (up to . ) and reached . with 600 iterations. For comparison, in Scenario A the F1 score was only around . after 600 iterations.

Figure 7: Scenario A: F1 score by number of iterations. 26 cases.
Figure 8: Scenario B: F1 score by number of iterations. 19 cases.

Figure 9: Comparison of the F1 score by cases (number of notes) for both scenarios.

As regards training time, one iteration (batch size 1) took about . second; the required number of iterations depended on the data length and varied from 15,000 to 330,000, which resulted in training times ranging from 1 to 23 hours. The prediction task on the 10,000-note dataset lasted around 4 minutes for each iteration. As a result, each case run took from 4 hours up to 100 hours.

Comparing the 117M and 345M GPT-2 models showed no significant improvement from using the more complex model (Figure 10). However, the 345M model takes around . second for each iteration ( × longer). The classification task on the 10,000 notes with the 345M model required about  seconds, which is  × longer than with the 117M model (  seconds). Considering the time cost and performance, all the above-mentioned results (Figures 4 to 9) were obtained with the GPT-2 117M model.

Figure 10: Comparison of the GPT-2 117M and 345M models on the case of 161,930 notes in Scenario A and 10,000 notes in Scenario B.
As suggested by Radford and colleagues [8], large gains can be realized by generative pre-training on an unlabeled text corpus, saving a large amount of annotation load. In our clinical notes classification task, the order of magnitude is a factor of 10. In their paper, Radford and colleagues reported an improvement of 8.9% on commonsense reasoning (Stories Cloze Test), 5.7% on question answering (RACE), and 1.5% on textual entailment (MultiNLI) [8].

These results are in line with recent work showing that self-supervised pre-training methods, such as ELMo [12] and BERT [7], have established a qualitatively new level of performance on the most widely used Natural Language Understanding benchmarks. Howard and Ruder [13] in particular reported very similar results in a comparable text classification task, with a model trained on only 100 labeled samples that matches the performance of training from scratch on 20,000 samples. While the extensive use of pre-trained word embeddings could be considered of the same nature as generative pre-training, the gain provided by generative pre-training is a major step for those who seek to classify free-text documents with minimal manual coding effort at the same acceptable accuracy.

We have benefited from the work of the researchers who published the GPT-2 model, which still seems to be one of the most efficient today. The NLM field progresses fast, with extensive research efforts from the community. Other models have been and will be proposed, so text classification strategies will need to be updated. Recent and promising work includes that of Yang and colleagues and their XLNet model [14], which currently ranks first on the Stanford Question Answering Dataset (SQuAD2.0).

Probably because the GPT-2 model was only recently made public, few applications have been published to date. However, this type of tool will without doubt be extensively used in the near future for a wide range of tasks. In the area of document classification alone, such tools will likely provide faster and more relevant access to expected information, and these applications will certainly go beyond simple classification tasks. Of note, it is unusual to generate the next token (in a question-answering fashion) in an NLP model to perform classification tasks. A more classical approach would be to add a layer after a hidden state of the model and apply a softmax layer to output prediction probabilities. While this will be done in future work, adding a layer requires much more skill in Python/TensorFlow programming. That is why we decided to present a method that can be used by a much broader scientific community.

While the 345M GPT-2 model did not generate better results than the 117M model in the current study, the use of larger models could bring further improvement.
Unfortunately, the computing power required by larger models is far beyond our resources for this pilot study, so we have to be satisfied with the results presented here.

In this study, the trauma/non-trauma labeling of the clinical notes was indirectly based on the ICD-10 codes. We tried to maximize the consistency of the ground-truth labeling by selecting a subset of ICD-10 codes for which the traumatic/non-traumatic characteristic is indisputable. This method had the advantage of providing a large amount of labeled data, but it does not allow us to compare the model's performance with human annotation.
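The "more classical approach" mentioned in the discussion (a linear layer on a final hidden state followed by a softmax) can be sketched in a few lines. This is a generic illustration with our own names and weight shapes, not code from the GPT-2 repository:

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classification_head(hidden, weights, bias):
    """Class probabilities from a final hidden state of the language model.

    hidden:  final hidden-state vector
    weights: one weight row per class
    bias:    one bias per class
    """
    logits = [sum(h * w for h, w in zip(hidden, row)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)
```

Compared with the next-token approach used here, such a head outputs a proper probability over the classes directly, but it requires modifying the model graph, which is what makes it less accessible to non-specialists.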
Our work shows that it is possible to easily adapt a multi-purpose NLM such as the GPT-2 to create a powerful classification tool for free-text notes, even in languages other than English. The self-supervised training phase appeared to be a very powerful way to dramatically decrease the number of labeled samples required for supervised learning. These results will be used in the coming months to implement the exhaustive coding of all events leading to trauma-related emergency room visits, making it possible to build a national trauma observatory within the TARPON project framework. More generally, this also opens broad perspectives for those interested in automatic free-text annotation. In the field of health, this will be particularly useful for diagnosis coding, clinical report classification, and patient report analysis and mining.
Acknowledgements
We sincerely thank the anonymous reviewers and readers whose comments and suggestions helped improve and clarify this manuscript.
References

[1] Jinmiao Huang, Cesar Osorio, and Luke Wicent Sy. An empirical evaluation of deep learning for ICD-9 code assignment using MIMIC-III clinical notes. Computer Methods and Programs in Biomedicine, 177:141–153, 2019.
[2] M. Li, Z. Fei, M. Zeng, F. Wu, Y. Li, Y. Pan, and J. Wang. Automated ICD-9 coding via a deep learning approach. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 16(4):1193–1202, July 2019.
[3] Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11(Feb):625–660, 2010.
[4] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc., 2017.
[5] Sascha Rothe, Shashi Narayan, and Aliaksei Severyn. Leveraging pre-trained checkpoints for sequence generation tasks. CoRR, abs/1907.12461, 2019.
[6] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 2019.
[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018.
[8] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018.
[9] World Health Organization. International statistical classification of diseases and related health problems: 10th revision (ICD-10), fifth edition, 2016. World Health Organization, 2015.
[10] Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR, abs/1406.1078, 2014.
[11] David Martin Powers. Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. 2011.
[12] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. CoRR, abs/1802.05365, 2018.
[13] Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, Melbourne, Australia, July 2018. Association for Computational Linguistics.
[14] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. XLNet: Generalized autoregressive pretraining for language understanding. 2019.