Pre-training A Neural Language Model Improves The Sample Efficiency of an Emergency Room Classification Model
Binbin Xu, Cédric Gil-Jardiné, Frantz Thiessard, Eric Tellier, Marta Avalos, Emmanuel Lagarde
Neural Language Model for Automated Classification of Electronic Medical Records at the Emergency Room: The Significant Benefit of Unsupervised Generative Pre-training
Binbin Xu, INSERM U1219-IETO team, ISPED, Bordeaux, France ([email protected])
Cédric Gil-Jardiné, University Hospital of Bordeaux, Pole of Emergency Medicine; INSERM U1219-IETO team, ISPED ([email protected])
Frantz Thiessard, INSERM U1219-ERIAS team, ISPED, Bordeaux, France ([email protected])
Eric Tellier, University Hospital of Bordeaux, Pole of Emergency Medicine; INSERM U1219-IETO team, ISPED ([email protected])
Marta Avalos, INSERM U1219-SISTM team, ISPED, Bordeaux, France ([email protected])
Emmanuel Lagarde (corresponding author), INSERM U1219-IETO team, ISPED, Bordeaux, France ([email protected])
October 25, 2019
Abstract
The French TARPON project aims to build a national injury surveillance system based on emergency room (ER) visit reports. To this end, it is necessary to develop a coding system capable of classifying the causes of these visits based on clinical notes written in French by ER clinicians. While supervised learning techniques have shown good results in this area, they require the manual annotation of a large number of texts in order to build a sufficiently large labeled training dataset. Over the past two years, new levels of performance have been achieved with Transformer-based neural language models (NLMs) that incorporate an unsupervised generative pre-training step. Our hypothesis is that methods involving a generative self-supervised pre-training step can significantly reduce the number of annotated samples required for the supervised fine-tuning phase. We aimed to measure the gain, in terms of manual annotation load, obtained by adopting this pre-training step.

To test our hypothesis, we exploited the fact that we could derive the traumatic/non-traumatic nature of the cause of the ER visit from available diagnostic codes. We then designed a case study to predict from free-text clinical notes whether a given ER visit was the consequence of a traumatic or a non-traumatic event. We compared two scenarios. Scenario A consisted in training the GPT-2 NLM on a trauma/non-trauma labeled dataset (with a maximum of 161,930 notes) in a single fully supervised phase. In Scenario B, we split the training dataset in two parts: a large unlabeled one of 151,930 notes for the self-supervised pre-training phase and a much smaller labeled dataset (up to 10,000 notes) for the supervised fine-tuning. In both scenarios, the GPT-2 model is trained from scratch.

In Scenario A, the AUC and F1 score reach the values of . and . respectively after the processing of the 161,930 labeled notes. The use of generative pre-training (Scenario B) achieved an AUC of . and an F1 score of . after the processing of only 600 labeled clinical notes. To achieve the same performance, far more labeled clinical notes had to be processed in Scenario A.

To conclude, it is possible to easily adapt a multi-purpose NLM such as the GPT-2 to create a powerful tool for the classification of free-text notes with only a very small number of labeled samples.

Keywords: Neural Language Model · pre-training · Transformer · GPT-2
Over the past 10 years, neural language models (NLMs) have progressively taken the largest share in the field of natural language processing, with techniques based on long short-term memory and gated recurrent networks [1] or convolutional networks [2]. NLMs have become indispensable in this field, with applications such as machine translation, document classification, text summarization and speech recognition.

The benefit of unsupervised pre-training was identified early [3], but in the domain of NLMs, new levels of performance have only recently been achieved with models based on the concept of attention, which consists in learning dependencies between words in a sentence without regard to their distances. This mechanism has been implemented in a sequence-to-sequence neural network model, the Transformer architecture, proposed in 2017 [4]. This model can be trained with an unsupervised generative step that learns from a large set of text to predict the next token in a sentence [5]. One of the latest examples is the GPT-2, published in February 2019 by OpenAI. GPT-2 is a large Transformer-based language model with 1.5 billion parameters, trained on a dataset of 8 million web pages to predict the next word after a given prompt sentence [6]. This work quickly drew attention from the community, as it demonstrated the model's ability to generate artificial texts that are difficult to distinguish from human-written texts. Moreover, the meaning of these artificial sentences was surprisingly consistent with the original context text (prompt). Although only reduced versions of the full model were released to the public, its potential applications are already numerous. Indeed, beyond the capability to generate coherent texts, the GPT-2 has the potential to perform other tasks such as question answering and document classification.
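The attention mechanism just described can be made concrete. The following is a minimal, illustrative pure-Python sketch of the scaled dot-product attention at the heart of the Transformer [4] (the function names are ours, not from the GPT-2 code base): each query is compared with every key, regardless of distance, and the resulting weights mix the value vectors.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V, one query at a time.

    Q, K, V are lists of equal-length vectors (one per token).
    """
    d = len(K[0])
    outputs = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        weights = softmax(scores)
        # Weighted mix of the value vectors
        outputs.append([sum(w * v[j] for w, v in zip(weights, V))
                        for j in range(len(V[0]))])
    return outputs
```

With two orthogonal keys, a query aligned with the first key attends almost exclusively to the first value vector, which is the "where attention should be focused" behavior described above.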
Following the same idea as the BERT model [7], transferring many self-attention blocks from a pre-trained model proved sufficient to transfer contextual representations to the dataset. The training of the model is then performed in two distinct phases [8]. The first, generative pre-training, unsupervised (or more accurately self-supervised) phase consists in the exploitation of a text corpus; it leads to the ability to generate text automatically, and the relevance of these synthetic sentences suggests that the network has learned contextual semantic representations. The second, supervised fine-tuning phase consists in resuming learning from an annotated text corpus, with the objective of creating a system able to perform specific tasks.

We intended to leverage the document classification potential of the GPT-2 model to classify free-text clinical notes in the context of the TARPON project. This French project proposes to build a national surveillance system based on the exhaustive collection of emergency room (ER) visit reports in France. Its main feature is the application of automatic language analysis to extract the injury mechanism and cause from the digital medical record of each ER visit. The creation of this database and its matching with the French national health data system will be used to create a nation-wide, comprehensive and automated trauma/injury monitoring, research and alert system. More than 21 million unlabeled ER clinical notes are produced every year in France. The cause of the visit is not available as a standardized database, although it is fully described in free-text narratives stored in digital clinical records. The overall objective of the project is thus to develop a tool that derives standardized trauma/injury information and causes from these ER notes.
For that purpose, substantial amounts of expert-annotated data would be necessary to train a conventional text classification model with acceptable accuracy. Our hypothesis is that methods involving a generative self-supervised pre-training step, such as the GPT-2, can significantly reduce the number of expert-annotated samples required for the supervised fine-tuning phase. This is of paramount significance for all projects wishing to use NLMs for free-text classification tasks, because the manual annotation phase is by far the most expensive one. The objective of our study is therefore to measure the gain in terms of manual annotation load obtained by adopting this pre-training step.
To test our hypothesis, we exploited the current digital medical record data of our ER department. We could derive the traumatic/non-traumatic characteristic of the cause of the ER visit from the available diagnostic codes assigned by clinicians or technical staff at the time of the patient's hospitalization. We then designed a case study to assess whether we can predict from free-text clinical notes whether an ER visit is due to trauma or not.
Figure 1: Scenario A: supervised training on labeled clinical notes (from a dataset of 161,930 notes); prompts ending with the TARPON keyword are built for prediction.

In order to measure the gain obtained with the self-supervised training phase, we compared the performance of the two scenarios (Figures 1 and 2). Scenario A consisted in retraining the GPT-2 NLM from scratch on the labeled dataset in a single fully supervised phase. In Scenario B, we further split the training dataset in two parts: a large unlabeled dataset for the self-supervised pre-training phase and a smaller labeled dataset for the supervised training. The main question was therefore to assess how many clinical notes are required in this training part of Scenario B to achieve the same acceptable performance as in Scenario A. This gives us a measure of how much annotation load can be saved with Scenario B.
Figure 2: Scenario B: self-supervised pre-training on 151,930 unlabeled clinical notes (after this step the model is able to generate artificial text), followed by supervised training on labeled clinical notes (from a dataset of 10,000 notes); prompts ending with the TARPON keyword are built for prediction.
We retrieved clinical notes and International Classification of Diseases, version 10 (ICD-10) diagnostic codes from the digital medical record system of the adult ER of the University Hospital of Bordeaux, France, from 2011 to 2018. The ICD-10 [9] is the most widely used standard for recording diagnoses and medical procedures, and is the terminology mandatorily used in France for all stays in any private or public hospital. This dataset contains 288,404 medical records, of which 209,341 contain both a diagnosis code and a clinical note.

The labels (trauma / non-trauma event) were derived from the ICD-10 codes: a total of 56,410 visits with ICD-10 codes beginning with the letters S, T1 to T35 and V were coded as trauma, and 115,520 visits with ICD-10 codes beginning with the letters A, C, D, E, G, H, I, J, L, N were coded as non-trauma. A total of 37,411 visits with codes beginning with other letters (F, M, O, P, Q, T36 to T98, X40 to X57, Y10 to Y98, U, Z) were excluded because they correspond to pathologies whose traumatic nature is either uncertain or debatable from a semantic point of view. The total number of available clinical notes was therefore 171,930.

The sampling strategy is illustrated in Figure 3. For test purposes, 10,000 clinical notes were randomly selected and then frozen for both scenarios. The remaining 161,930 notes were used with their labels in Scenario A in order to estimate the number of notes needed to achieve maximum prediction performance on the 10,000-note test set. For Scenario B, we further split the 161,930 notes into a set of 151,930 unlabeled notes for unsupervised pre-training and a second set of 10,000 labeled notes for the supervised fine-tuning step.

To better determine the optimal required number of clinical notes, models were independently trained and evaluated for many cases with different, arbitrarily chosen numbers of notes. In total, for Scenario A, 26 cases (from 20, 40, ... notes up to all 161,930 notes) were evaluated. In Scenario B, 19 cases were studied, from 20, 40, ... up to 10,000 notes.
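The trauma/non-trauma labeling rule described above can be sketched as a small function. This is an illustrative simplification (the function name is ours, and real ICD-10 codes have subdivisions that a production implementation would have to handle), assuming the leading letter and two-digit number of a code are enough to decide:

```python
def label_from_icd10(code):
    """Return "trauma", "non_trauma", or None (excluded) for an ICD-10 code.

    Sketch of the paper's labeling rule:
    - S, T1-T35, V                     -> trauma
    - A, C, D, E, G, H, I, J, L, N    -> non-trauma
    - anything else (F, M, O, P, Q, T36-T98, U, Z, ...) -> excluded,
      because the traumatic nature is uncertain or debatable
    """
    letter = code[0].upper()
    if letter in ("S", "V"):
        return "trauma"
    if letter == "T":
        number = int(code[1:3])  # e.g. "T24" -> 24
        return "trauma" if 1 <= number <= 35 else None
    if letter in set("ACDEGHIJLN"):
        return "non_trauma"
    return None
```

For example, a femur fracture (S72) maps to trauma, pneumonia (J18) to non-trauma, and a depressive episode (F32) is excluded.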
Figure 3: Sampling strategy: 26 cases evaluated in Scenario A and 19 cases in Scenario B.

Like other neural language models based on convolutional and recurrent networks, the GPT-2 proposed by Radford and colleagues is a sequence-to-sequence transduction model [10]. The main feature of the Transformer architecture is the use of attention weights on text inputs [4]. During the training process, the network learns a context vector that gives global-level information on the inputs, indicating where attention should be focused. The novel approach consists in replacing recurrence with attention to handle the dependencies between input and output.

The GPT-2 is built to predict the next token from an input text sequence. By looping this process, it works as a text generator. Text can be generated de novo or from any arbitrary text prompt. The model was originally trained on millions of web pages without any explicit supervision. Four GPT-2 models with respectively 117, 345, 762 and 1542 million parameters were trained, and only the first three had been released at the time of writing. Only the first two models are trainable on standard workstations.

Note that the GPT-2 models are trained on web text mostly written in English, while our clinical notes are all in French. Consequently, in the present work, we did not use those pre-trained models and instead retrained the models from a random set of weights.

The 117M models were trained mainly on a single Nvidia GeForce GTX 1080 Ti with 11 GB of VRAM (4 parallel sessions can be run on our workstation with 4 GTX 1080 Ti cards). The 345M models were trained on another workstation with a single Nvidia TITAN RTX with 24 GB of VRAM.
In Scenario B, the pre-training step is referred to as unsupervised learning because it consists simply in reading the unlabeled clinical notes (Figure 2). It actually uses a sliding learning window on the text: the first part of the window is the input, and the last token is the token to be predicted. This first step leads to models that can generate texts resembling clinical notes in French, including the use of medical jargon and specialized abbreviations.

For the supervised learning phases (Scenario A and the second training process in Scenario B), we added a task identifier (e.g. TARPON) at the end of each clinical note, followed by a classification code: 1 for clinical notes corresponding to traumatic events and 0 for clinical notes corresponding to non-traumatic events. The codes are preferably chosen from the vocabulary so that the prediction (classification) probability can be directly extracted from the model for further quantification. As described above, this code was derived from the diagnosis classification manually coded by clinicians.

For both scenarios (Figures 1 and 2), the test phase consists in feeding the model with prompts built by adding the task identifier at the end of each test clinical note, and asking the model to predict the next token right after the task identifier. Ideally, this newly generated token should be one of the classification codes (tokens). On the first iterations, due to the random initialization and insufficient learning, the predicted token can be any token from the vocabulary other than the expected classification tokens, but it quickly turns out to be mainly the classification tokens. Our operating principle can therefore be compared to a question-answering task.
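The prompt construction and probability read-out just described can be sketched as follows. This is a schematic illustration (helper names such as `make_training_text` are ours, and the next-token distribution would in practice come from the GPT-2 model): a supervised example appends the task identifier and the class token, a test prompt stops at the task identifier, and the trauma probability is read off the next-token distribution restricted to the two class tokens.

```python
TASK_ID = "TARPON"

def make_training_text(note, is_trauma):
    """Supervised example: note + task identifier + class token (1/0)."""
    return f"{note} {TASK_ID} {1 if is_trauma else 0}"

def make_prompt(note):
    """Test prompt: the model must generate the token right after TASK_ID."""
    return f"{note} {TASK_ID}"

def trauma_probability(next_token_probs):
    """Classification probability from the model's next-token distribution.

    next_token_probs maps vocabulary tokens to probabilities. Restricting
    the distribution to the two class tokens and renormalizing yields
    P(trauma). Early in training the predicted mass may sit on unrelated
    tokens, in which case we fall back to an uninformative 0.5.
    """
    p1 = next_token_probs.get("1", 0.0)
    p0 = next_token_probs.get("0", 0.0)
    if p1 + p0 == 0.0:
        return 0.5
    return p1 / (p0 + p1)
```

Choosing class codes that are single vocabulary tokens is what makes this renormalization trick possible: the classifier's confidence is obtained without any extra layer on top of the language model.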
The prediction performance of the model was measured by the F1 score and the area under the ROC curve (AUC) [11]. Evaluations on the same 10,000 clinical notes were performed for both Scenario A and Scenario B.
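For reference, both metrics can be computed from scratch. The sketch below (with helper names of our own) computes the F1 score from true/false positive and negative counts, and the AUC as the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, which is the rank-statistic view of the area under the ROC curve [11].

```python
def f1_score(y_true, y_pred):
    """F1 = 2PR / (P + R) for binary labels; None when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return None  # precision and recall both null, as early in Scenario A
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def auc(y_true, scores):
    """AUC as P(score_pos > score_neg), ties counted as 1/2."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The `None` return mirrors the situation reported in the results: when the model has not yet produced any true positive, precision and recall are both null and the F1 score cannot be measured.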
No nominative data were necessary for this work. The dataset was, however, not checked and not specifically de-identified. Data processing and computing were conducted within the facilities of the Emergency Department of the University Hospital of Bordeaux, which have received regulatory clearance to host and exploit databases with personal and medical data.
For both scenarios, we compared the AUC (Figures 4 to 6) and the F1 score (Figures 7 to 9) by iterations, with a batch size of 1. The number of iterations needed to achieve a maximal AUC / F1 score value varied depending on the number of notes (Figures 4 and 5 for the AUC; Figures 7 and 8 for the F1 score). For each set of clinical notes, the maximum AUC / F1 score value was retained (Figure 6 for the AUC; Figure 9 for the F1 score) to represent how model performance varied with respect to the number of labeled notes.

In Scenario A, the AUC and F1 score reach the values of . and . respectively after the processing of all the 161,930 labeled notes. The use of generative pre-training (Scenario B) achieved an AUC of . and an F1 score of . after the processing of only 600 labeled clinical notes. To achieve the same performance, far more labeled clinical notes had to be processed in Scenario A (Figures 6 and 9). At the end of Scenario B, after training on all 10,000 notes, the AUC and F1 score are respectively . and . , corresponding to the cases of more than 100,000 notes in Scenario A. For 16 times more data, the gain of Scenario A over Scenario B is only an improvement of . in AUC and . in F1 score. Though the
AUC converged to the same ending point in both scenarios, the learning patterns were quite different. In Scenario A (Figure 4), the AUC started with a value of ∼ . . Because of insufficient learning, almost all clinical notes are classified as non-trauma at this stage. The AUC dropped during the first iterations due to clinical notes wrongly classified as trauma, then increased as expected. The main reason is that, in this question-answering style of study, the model has to perform two tasks at the same time: learning the semantic representation of the clinical notes and learning the classification task. In Scenario B (Figure 5), by contrast, the clinical-note generation task is learned during the pre-training phase, leading to a monotonically increasing AUC curve in step 2, corresponding to the learning of the classification task.

Figure 4: Scenario A: AUC by number of iterations. 26 cases.

Figure 5: Scenario B: AUC by number of iterations. 19 cases.

Figure 6: Comparison of the AUC by cases (number of notes) for both scenarios.

The same is observed for the F1 score. In Scenario A (Figure 7), the F1 score cannot be measured for the first 500 iterations, since recall and precision are both null, while in Scenario B the F1 score can be measured after only 20 iterations (up to . ) and reached . with 600 iterations. For comparison, in Scenario A the F1 score was only around . after 600 iterations.

Figure 7: Scenario A: F1 score by number of iterations. 26 cases.
Figure 8: Scenario B: F1 score by number of iterations. 19 cases.

Figure 9: Comparison of the F1 score by cases (number of notes) for both scenarios.

As regards training time, one iteration (batch size 1) took about . second; the required number of iterations depended on the data length and varied from 15,000 to 330,000, which resulted in training times ranging from 1 to 23 hours. The prediction task on the 10,000-note dataset lasted around 4 minutes for each iteration. As a result, each case run took from 4 hours up to 100 hours.

Comparing the 117M and 345M GPT-2 models showed no significant improvement from using the more complex model (Figure 10). However, the 345M model takes around . second for each iteration ( × longer). The classification task on the 10,000 notes with the 345M model required about  seconds, which is  × longer than with the 117M model (  seconds). Considering the time cost and performance, all the above-mentioned results (Figures 4 to 9) were obtained with the GPT-2 117M model.

Figure 10: Comparison of the GPT-2 117M and 345M models on the case of 161,930 notes in Scenario A and 10,000 notes in Scenario B.
As suggested by Radford and colleagues [8], large gains can be realized by generative pre-training on an unlabeled text corpus, saving a large amount of annotation load. In our clinical notes classification task, the order of magnitude is a factor of 10. In their paper, Radford and colleagues reported an improvement of 8.9% on commonsense reasoning (Stories Cloze Test), 5.7% on question answering (RACE), and 1.5% on textual entailment (MultiNLI) [8].

These results are in line with recent work showing that self-supervised pre-training methods, such as ELMo [12] and BERT [7], have established a qualitatively new level of performance on the most widely used Natural Language Understanding benchmarks. Howard and Ruder [13] in particular reported very similar results in a comparable text classification task, with a model trained on only 100 labeled samples that matches the performance of training from scratch on 20,000 samples. While the extensive use of pre-trained word embeddings could be considered of the same nature as generative pre-training, the gain provided by generative pre-training is a major step for those who seek to classify free-text documents with minimal manual coding effort at the same acceptable accuracy.

We have benefited from the work of the researchers who published the GPT-2 model, which still seems to be one of the most efficient today. The NLM field progresses fast, with extensive research efforts from the community. Other models have been and will be proposed, so text classification strategies will need to be updated. Recent and promising work includes that of Yang and colleagues and their XLNet model [14], which currently ranks first on the Stanford Question Answering Dataset (SQuAD2.0).

Probably because the GPT-2 model was only recently made public, few applications have been published to date. However, this type of tool will without doubt be extensively used in the near future for a wide range of tasks. In the area of document classification alone, such tools will likely provide faster and more relevant access to expected information, and these applications will certainly go beyond simple classification tasks. Of note, it is unusual to generate the next token (in a question-answering fashion) in an NLP model to perform classification tasks. A more classical approach would be to add a layer after a hidden state of the model and apply a softmax layer to output prediction probabilities. While this will be done in future work, adding a layer requires much more skill in Python/TensorFlow programming. That is why we decided to present a method that can be used by a much broader scientific community.

While the 345M GPT-2 model did not generate better results than the 117M model in the current study, the use of larger models could bring further improvement.
Unfortunately, the computing power required by larger models is far beyond our resources for this pilot study, so we have to be satisfied with the results presented here.

In this study, the trauma/non-trauma labeling of the clinical notes was indirectly based on the ICD-10 codes. We tried to maximize the consistency of the ground-truth labeling by selecting a subset of ICD-10 codes for which the traumatic/non-traumatic characteristic is indisputable. This method had the advantage of providing a large amount of labeled data, but it does not allow us to compare the model's performance with human annotation.
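The "more classical approach" mentioned in the discussion (a linear layer on a final hidden state followed by a softmax) can be sketched in a few lines. This is a generic illustration with our own names and weight shapes, not code from the GPT-2 repository:

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classification_head(hidden, weights, bias):
    """Class probabilities from a final hidden state of the language model.

    hidden:  final hidden-state vector
    weights: one weight row per class
    bias:    one bias per class
    """
    logits = [sum(h * w for h, w in zip(hidden, row)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)
```

Compared with the next-token approach used here, such a head outputs a proper probability over the classes directly, but it requires modifying the model graph, which is what makes it less accessible to non-specialists.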
Our work shows that it is possible to easily adapt a multi-purpose NLM such as the GPT-2 to create a powerful classification tool for free-text notes, even in languages other than English. The self-supervised training phase appeared to be a very powerful way to dramatically decrease the number of labeled samples required for supervised learning. These results will be used in the coming months to implement the exhaustive coding of all events leading to trauma-related emergency room visits, making it possible to build a national trauma observatory within the TARPON project framework. More generally, this also opens broad perspectives for those interested in automatic free-text annotation. In the field of health, this will be particularly useful for diagnosis coding, clinical report classification, and patient report analysis and mining.
Acknowledgements
We sincerely thank the anonymous reviewers and readers whose comments and suggestions helped improve and clarify this manuscript.
References

[1] Jinmiao Huang, Cesar Osorio, and Luke Wicent Sy. An empirical evaluation of deep learning for ICD-9 code assignment using MIMIC-III clinical notes. Computer Methods and Programs in Biomedicine, 177:141–153, 2019.
[2] M. Li, Z. Fei, M. Zeng, F. Wu, Y. Li, Y. Pan, and J. Wang. Automated ICD-9 coding via a deep learning approach. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 16(4):1193–1202, July 2019.
[3] Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11(Feb):625–660, 2010.
[4] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc., 2017.
[5] Sascha Rothe, Shashi Narayan, and Aliaksei Severyn. Leveraging pre-trained checkpoints for sequence generation tasks. CoRR, abs/1907.12461, 2019.
[6] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 2019.
[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018.
[8] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018.
[9] World Health Organization. International statistical classification of diseases and related health problems: 10th revision (ICD-10), fifth edition, 2016. World Health Organization, 2015.
[10] Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR, abs/1406.1078, 2014.
[11] David Martin Powers. Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. 2011.
[12] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. CoRR, abs/1802.05365, 2018.
[13] Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, Melbourne, Australia, July 2018. Association for Computational Linguistics.
[14] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. XLNet: Generalized autoregressive pretraining for language understanding. 2019.