Efficiently Reusing Natural Language Processing Models for Phenotype-Mention Identification in Free-text Electronic Medical Records: Methodology Study
Honghan Wu, Karen Hodgson, Sue Dyson, Katherine I. Morley, Zina M. Ibrahim, Ehtesham Iqbal, Robert Stewart, Richard JB Dobson, Cathie Sudlow
Original Paper - accepted by JMIR Medical Informatics on 22nd Oct 2019
Honghan Wu, Karen Hodgson, Sue Dyson, Katherine I Morley, Zina M Ibrahim, Ehtesham Iqbal, Robert Stewart, Richard JB Dobson, and Cathie Sudlow
Centre for Medical Informatics, Usher Institute, University of Edinburgh, United Kingdom; Health Data Research UK, University of Edinburgh, United Kingdom; School of Computer and Software, Nanjing University of Information Science & Technology, China; Institute of Psychiatry, Psychology & Neuroscience, King's College London, United Kingdom; South London and Maudsley NHS Foundation Trust, London, UK; Centre for Epidemiology and Biostatistics, Melbourne School of Global and Population Health, The University of Melbourne, Melbourne, Australia; Health Data Research UK, University College London, United Kingdom
* Corresponding author: 9 Little France Rd, Edinburgh EH16 4UX, United Kingdom; +44 (0)131 651 7882; [email protected]
Abstract
Background:
Many efforts have been put into the use of automated approaches, such as natural language processing (NLP), to mine or extract data from free-text medical records in order to construct comprehensive patient profiles for delivering better health care. Reusing NLP models in new settings, however, remains cumbersome, as it requires validation and/or retraining on new data iteratively to achieve convergent results.
Objective:
The aim of this work is to minimize the effort involved in reusing NLP models on free-text medical records.
Methods:
We formally define and analyse the model adaptation problem in phenotype-mention identification tasks. We identify two sources of wasted effort, "duplicate waste" and "imbalance waste", which collectively impede efficient model reuse. We propose a phenotype-embedding-based approach to minimise these sources of waste without the need for labelled data from new settings.
Results:
We conduct experiments on data from a large mental health registry to reuse NLP models in four phenotype-mention identification tasks. The proposed approach can choose the best model for a new task, identifying up to 76% of phenotype mentions (the duplicate waste) that need no validation or model retraining, with very good performance (93-97% accuracy). It can also provide guidance for validating and retraining the selected model on novel language patterns in new tasks, saving around 80% of the effort (the imbalance waste) required by "blind" model-adaptation approaches.
Conclusions:
Adapting pre-trained NLP models for new tasks can be more efficient and effective if the language pattern landscapes of old settings and new settings can be made explicit and comparable. Our experiments show that the phenotype-mention embedding approach is an effective way to model language patterns for phenotype-mention identification tasks and that its use can guide efficient NLP model reuse.
Keywords:
Natural Language Processing; Text Mining; Phenotype; Word Embedding; Phenotype Embedding; Model Adaptation; Electronic Health Records; Machine Learning; Clustering
Introduction
Compared to the structured components of electronic health records (EHRs), free text comprises a much deeper and larger volume of health data. For example, in a recent geriatric syndrome study[1], unstructured EHR data contributed a significant proportion of identified cases: 67.9% of falls, 86.6% of cases of visual impairment, and 99.8% of cases of lack of social support. Similarly, in a study of comorbidities using a database of anonymised EHRs from a psychiatric hospital in London (the South London and Maudsley NHS Foundation Trust, SLaM)[2], 1,899 cases of comorbid depression and type 2 diabetes were identified from unstructured EHRs, while only 19 cases could be found using structured diagnosis tables. The value of unstructured records for selecting cohorts has also been widely reported[3,4]. Extracting clinical variables or identifying phenotypes from unstructured EHR data is, therefore, essential for addressing many clinical questions and research hypotheses[5-10].
To this end, a variety of NLP tools and web services or cloud-based solutions[11-...] have been developed. Reusing them in a new setting, however, typically requires iterative validation and/or retraining on data from that setting. Such "blind" adaptation is costly in the clinical domain because of barriers to data access and the expensive clinical expertise needed for data labelling.

The "blindness" to the similarities and differences between the language pattern landscapes of the source setting (where the model was trained) and the target setting (the new task) causes at least two types of potentially unnecessary, avoidable effort. First, for data in the target setting with the same patterns as in the source setting, any validation or retraining effort is unnecessary because the model has already been trained and validated on these language patterns. We call this type of wasted effort "duplicate waste". The second type of waste occurs when the distribution of new language patterns in the target setting is unbalanced, ie, some, but not all, data instances belong to different language patterns. Model adaptation involves validating the model on these new data and further adjusting it when performance is not good enough. Without knowing which data instances belong to which language patterns, data instances have to be randomly sampled for validation and adaptation. In most cases, a minimal number of instances of every pattern needs to be processed before convergent results can be obtained. This is usually achieved via an iterative validation and adaptation process, which inevitably causes commonly used language patterns to be over-represented, so the model is over-validated/retrained on such data. This unnecessary effort on commonly used language patterns results from the pattern imbalance in the target setting, which unfortunately is the norm in almost all real-world EHR datasets. We call this "imbalance waste".

The ability to make language patterns visible and comparable would address whether an NLP model can be adapted to a new task and, importantly, provide guidance on how to solve new problems effectively and efficiently through the smart adaptation of existing models. In this paper, we introduce a contextualised embedding model to visualise such patterns and to provide guidance for reusing NLP models in phenotype-mention identification tasks. Here, a phenotype mention denotes an appearance of a word or phrase (representing a medical concept) in a document that indicates a phenotype related to a person. We note two aspects of this definition:

● Phenotype mention ≠ medical concept mention. When a medical concept mentioned in a document does not indicate a phenotype relating to a person (eg, the cases in the last two rows of Table 1), it is not a phenotype mention.

● Phenotype mention ≠ phenotype. A phenotype (eg, a disease or an associated trait) is a specific patient characteristic[15], ie, a patient-level feature such as a binary value indicating whether a patient is a smoker. For the same phenotype, however, a patient might have multiple phenotype mentions; for example, "xxx is a smoker" could be mentioned in different documents, or even multiple times in one document, and each of these appearances is a phenotype mention.

Table 1. Examples of contextualised phenotype mentions. The task of recognising contextualised phenotype mentions is to identify mentions of phenotypes in free-text records and to classify the context of each mention into five categories. The last two rows give examples of non-phenotype mentions: the two sentences do not describe incidents of a condition.
Examples | Types of phenotype mentions
49 year old man with hepatitis c | positive mention
with no evidence of cancer recurrence | negated mention
is concerning for local lung cancer recurrence | hypothetical mention
PAST MEDICAL HISTORY: 1) Atrial Fibrillation, 2) ... | history mention
Mother was A positive, hepatitis C carrier, and ... | mention of phenotype in another person
She visited the HIV clinic last week. | not a phenotype mention
The patient asked for information about stroke. | not a phenotype mention
(The first five rows are contextualised mentions.)
The focus of this work is to minimise the effort of reusing existing NLP model(s) to solve new tasks, rather than to propose a novel NLP model for phenotype-mention identification. We aim to address the problem of NLP model transferability in the task of extracting mentions of phenotypes from free-text medical records. Specifically, the task is to identify the above-defined phenotype mentions and the contexts in which they are mentioned[10]; Table 1 explains and gives examples of contextualised phenotype mentions. The research question to be investigated is formally defined as follows.

Figure 1. Assessing the transferability of a pre-trained model in solving a new task: discriminating between differently inaccurate mentions identified by the model in the new setting.
Definition 1.
Given an NLP model (denoted $m$) previously trained for some phenotype-mention identification task(s), and a new task (denoted $T$, where the phenotypes to be identified are new, the dataset is new, or both), $m$ is used in $T$ to identify a set of phenotype mentions, denoted $S$. The research question (illustrated in Figure 1) is how to partition $S$ to meet the following criteria:

1. a maximum p-known subset $S_{known}$ where $m$'s performance can be properly predicted using prior knowledge of $m$;
2. p-unknown subsets $\{S_{u_1}, S_{u_2}, \dots, S_{u_k}\}$, which meet the following criteria:
   a. $S_{u_1} \cup S_{u_2} \cup \dots \cup S_{u_k} = S - S_{known}$;
   b. $\forall i, j \in [1..k], i \neq j: S_{u_i} \cap S_{u_j} = \emptyset$;
   c. $\forall i \in [1..k]$: $S_{u_i}$ can be represented by a small number of instances $R_{u_i}$ so that $m$'s overall performance on $S_{u_i}$ can be predicted by its result on $R_{u_i}$;
   d. $k \ll |S| - |S_{known}|$.

The identification of the p-known subset (criterion 1) helps eliminate "duplicate waste" by avoiding unnecessary validation and adaptation on those phenotype mentions. Separating the remaining annotations into p-unknown subsets allows mentions to be processed separately according to their performance-relevant characteristics, which in turn helps avoid "imbalance waste". Criterion 2.a ensures complete coverage of all performance-unknown mentions; criterion 2.b ensures no overlap between mention subsets, so that no effort is duplicated on the same mentions. Criterion 2.c requires that the partitioning of the mentions is performance-relevant, meaning that model performance on a small number of samples can be generalised to the whole subset they are drawn from. Lastly, a small $k$ (criterion 2.d) enables efficient adaptation of a model.

Methods

Dataset and adaptable phenotype-mention identification models
Recently, we developed SemEHR[10], a semantic search toolkit that aims to use interactive information retrieval functionalities to replace NLP model building, so that clinical researchers can use a browser-based interface to access text mining results from a generic NLP model and (optionally) keep getting better results by iteratively feeding back to the system. A SLaM instance of this system has been trained to support six comorbidity studies (62,719 patients; 17,479,669 clinical notes in total), in which different combinations of physical conditions and mental disorders are extracted and analysed. Multimedia Appendix 1 gives details of the user interface and model performance. These studies effectively generated 23 phenotype-mention identification models and the relevant labelled data (>7,000 annotated documents), which we use to study model transferability.
Foundation of Proposed Approach
Our approach is based on the following assumption about a language pattern representation model.
Assumption 1.
There exists a pattern representation model, $A$, for identifying language patterns of phenotype mentions with the following characteristics:

1. each phenotype mention can be characterised by one and only one language pattern;
2. patterns are largely shared by different mentions;
3. there is a deterministic association between an NLP model's performance and such language patterns.
Theorem 1.
Given $A$, a pattern model meeting Assumption 1, $m$, an NLP model, and $T$, a new task, let $P_m$ be the pattern set that $A$ identifies from the dataset(s) that $m$ was trained or validated on, and let $P_T$ be the pattern set that $A$ identifies from $S$, the set of all mentions identified by $m$ in $T$. Then the problem defined in Definition 1 can be solved by a solution in which the mentions whose patterns fall in $P_m \cap P_T$ form the p-known subset and those whose patterns fall in $P_T - (P_m \cap P_T)$ form the p-unknown subsets. The proof of Theorem 1 can be found in Multimedia Appendix 2. The rest of this section gives details of a realisation of $A$ using distributed representation models.

Distributed representation for contextualised phenotype mentions
In computational linguistics, statistical language models are perhaps the most common approach to quantifying word sequences: a distribution is used to represent the probability of a sequence of words, $P(w_1, \dots, w_n)$. Among such models, the bag-of-words (BOW) model[15] is perhaps the earliest and simplest, yet it is still widely used and efficient in certain tasks[16]. To overcome BOW's limitations (eg, ignoring semantic similarities between words), more complex models have been introduced to represent word semantics[17-...]. Word-level semantics alone, however, cannot capture the context in which a phenotype is mentioned (eg, in "he worries about contracting HIV", HIV is a hypothetical phenotype mention).

The key idea of our approach is to use explicit mark-ups to represent phenotype semantics in the text so that they can be learnt through an approach similar to the word-embedding learning framework. Figure 2 illustrates our framework for extending the continuous BOW word-embedding architecture to capture the semantics of contextualised phenotype mentions. Explicit mark-ups of phenotype mentions are added to the architecture as placeholders for phenotype semantics. A mark-up (eg, C0038454_POS) is composed of two parts: a phenotype identifier (eg, C0038454) and a contextual description (eg, POS). The first part identifies a phenotype using a standardised vocabulary; in our implementation, the Unified Medical Language System (UMLS)[30] was chosen for its broad concept coverage and its comprehensive synonyms for concepts. The first benefit of using a standardised phenotype definition is that it helps to group together mentions of the same phenotype under different names. For example, the UMLS concept identifier C0038454 for stroke combines mentions of
Stroke, Cerebrovascular Accident, Brain Attack, and 43 other synonyms. The second benefit comes from the concept relations represented in the vocabulary hierarchy, which support the transferability computation that we elaborate on later. The second part of a phenotype-mention mark-up identifies the mention context. Six types of context are supported: POS for a positive mention, NEG for a negated mention, HYP for a hypothetical mention, HIS for a history mention, OTH for a mention of the phenotype in another person, and NOT for not a phenotype mention. The phenotype-mention mark-ups can be populated using the labelled data that NLP models were trained or validated on; in our implementation, the mark-ups were generated from the labelled subset of SLaM EHRs.
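To make the mark-up scheme concrete, the sketch below shows how a labelled sentence could be rewritten with CUI_CONTEXT tokens before embedding training. This is an illustrative reconstruction, not the authors' released code: the annotation format (token-offset spans) and the function names are assumptions.

```python
# Illustrative sketch (assumed data format): rewrite labelled sentences
# with phenotype mark-up tokens (CUI + context code) so that a standard
# CBOW learner treats each contextualised mention as one vocabulary item.

CONTEXT_CODES = {"positive": "POS", "negated": "NEG", "hypothetical": "HYP",
                 "history": "HIS", "other_person": "OTH", "not_mention": "NOT"}

def markup_sentence(tokens, annotations):
    """tokens: list of words; annotations: (start, end, cui, context) spans
    over token offsets, assumed to come from the labelled corpus."""
    spans = {start: (start, end, cui, ctx)
             for (start, end, cui, ctx) in annotations}
    out, i = [], 0
    while i < len(tokens):
        if i in spans:
            _, end, cui, ctx = spans[i]
            out.append(f"{cui}_{CONTEXT_CODES[ctx]}")  # e.g. C0038454_NEG
            i = end  # skip over the original mention words
        else:
            out.append(tokens[i])
            i += 1
    return out

sent = "patient denies any history of stroke".split()
print(markup_sentence(sent, [(5, 6, "C0038454", "negated")]))
# ['patient', 'denies', 'any', 'history', 'of', 'C0038454_NEG']
```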
Using phenotype embeddings and their semantics for assessing model transferability
Figure 3. Architecture of the phenotype-embedding-based approach for transferring pre-trained natural language processing models to identify new phenotypes or to apply them to new corpora. The word and phenotype embedding model is learnt from the training data of the reusable models in the source domain (the task that $m$ was trained for). No labelled data in the target domain (the new setting) are required for the adaptation guidance.

The learnt embeddings (including both word and contextualised phenotype vectors) are the building blocks underlying the language pattern representation model $A$ introduced at the beginning of this section, which is used to compute $P_m$ (the landscape of language patterns that $m$ is familiar with) and $P_T$ (the landscape of language patterns in the new task $T$) for assessing and guiding NLP model adaptation to new tasks. Figure 3 illustrates the architecture of our approach; the double-circle shape denotes the embeddings learnt from $m$'s labelled data. Essentially, the process is composed of two phases: (1) the documents from a new task (on the left of the figure) are annotated with phenotype mentions using a pre-trained model $m$; (2) a classification task uses the above embeddings to assess each mention: whether it is an instance of p-known (something similar enough to what $m$ is familiar with) or of a subset of p-unknown (something that is new to $m$). Specifically, the process is composed of the following steps.

1. Vectorise phenotype mentions in a new task
Each mention in the new task is represented as a vector of real numbers, using the learnt embedding model to combine its surrounding words as context semantics. Formally, let $s$ be a mention identified by $m$ in the new task; $s$ is represented by the function

$v(s) = f(w_{\rightarrow}(t_{i-k}), \dots, w_{\rightarrow}(t_{i+k+l}))$   (1)

where $w_{\rightarrow}$ is the embedding model that converts a word token into a vector, $t_j$ is the $j$th word in a document, $i$ is the offset of the first word of $s$ in the document, $l$ is the number of words in $s$, $k$ is the context window size, and $f$ is a function that combines a set of vectors into a single result vector (we use the average in our implementation). With such representations, all mentions are effectively placed in a vector space (depicted as a two-dimensional space on the right of the figure for illustration purposes).
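Equation (1) with $f$ set to the arithmetic mean can be implemented in a few lines. The sketch below assumes `w2v` is any token-to-vector mapping (eg, a dict or a trained gensim KeyedVectors object); the default window size and the skipping of out-of-vocabulary tokens are our assumptions.

```python
import numpy as np

def mention_vector(tokens, i, l, w2v, k=5):
    """Equation (1) with f = mean: embed tokens t_{i-k} .. t_{i+k+l} around
    a mention starting at offset i and spanning l tokens. Out-of-vocabulary
    tokens are skipped; returns None if nothing in the window is known."""
    window = tokens[max(0, i - k): i + k + l + 1]
    vecs = [w2v[t] for t in window if t in w2v]
    return np.mean(vecs, axis=0) if vecs else None
```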
2. Identify clusters (language patterns) of mention vectors

In the vector space, clusters form naturally based on the geometric distances between mention vectors. After trying different clustering algorithms and parameters, we chose DBScan[31] with Euclidean distance for vector clustering. Essentially, each cluster is a set of mentions considered to share the same (or a sufficiently similar) underlying language pattern; the language patterns in the new task are thus, technically, the vector clusters. We chose the cluster centroid (arithmetic mean) to represent a cluster (ie, its underlying language pattern).
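A minimal sketch of this clustering step using scikit-learn's DBSCAN (Euclidean distance by default) is shown below; the `min_samples` value is an assumption, and `eps` corresponds to the EPS parameter discussed in the Results section.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_mentions(vectors, eps=3.8, min_samples=2):
    """Cluster an (n, d) array of mention vectors; each cluster stands for
    one language pattern. Returns per-mention labels (-1 = unclustered)
    and the centroid (arithmetic mean) of each cluster."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(vectors)
    centroids = {c: vectors[labels == c].mean(axis=0)
                 for c in set(labels) if c != -1}
    return labels, centroids
```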
3. Choose a reference vector for classifying language patterns
After clusters (language patterns) are identified, the next step is to classify them as p-known or as subsets of p-unknown. We choose a reference-vector-based approach: classifying by the distance to a selected vector. The reference vector is picked (when the phenotype to be identified was part of $m$'s training) or generated (when the phenotype is new to $m$) from the phenotype embeddings that the model $m$ has previously learnt. Clearly, when the phenotype to be identified in the new task is new to $m$ (ie, not in the set of phenotypes it was developed for), the reference phenotype needs to be carefully selected so that it produces a sensible separation between p-known and p-unknown clusters. We use semantic similarity (the distance between two concepts in the UMLS tree structure) to choose the most similar phenotype from the phenotype list $m$ was trained for. Formally, let $c_p$ be the UMLS concept for a phenotype to be identified in the new task and $C_m$ the set of phenotype concepts that $m$ was trained for; the reference phenotype is chosen by

$R(c_p, C_m) = \arg\min_{c \in C_m} D(c, c_p)$   (2)

where $D$ is a distance function that counts the steps between two nodes in the UMLS concept tree. Once the reference phenotype has been chosen, the reference vector can be selected or generated (eg, by averaging) from this phenotype's contextual embeddings.
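Equation (2) only needs a node-distance function over the UMLS hierarchy. The sketch below treats the hierarchy as a plain adjacency map and uses breadth-first search; how the hierarchy is loaded from UMLS is left out, and the graph format is an assumption.

```python
from collections import deque

def tree_distance(graph, a, b):
    """Number of is-a steps between two concepts, with the UMLS hierarchy
    given as an undirected adjacency map {cui: set_of_neighbour_cuis}."""
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return float("inf")  # concepts not connected in the hierarchy

def reference_phenotype(c_p, C_m, graph):
    """Equation (2): the trained phenotype closest to the new one."""
    return min(C_m, key=lambda c: tree_distance(graph, c, c_p))
```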
4. Classify language patterns to guide model adaptation

Once the reference vector has been selected, clusters can be classified based on the distances between their centroids (the representative vectors of the clusters) and the reference vector. Given a distance threshold, this distance-based classification partitions the vector space into two subspaces centred on the reference vector: the subspace whose distance to the centre is less than the threshold is the p-known subspace, and the remainder is the p-unknown subspace. The union of clusters whose centroids lie within the p-known subspace is p-known, meaning that $m$'s performance on them can be predicted without further validation (removing duplicate waste). The remaining clusters are p-unknown clusters; $m$ can be validated and/or further trained on each p-unknown cluster separately instead of blindly across all clusters, removing imbalance waste.
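The final classification step can then be sketched as follows. Cosine similarity is used so that the threshold matches the similarity threshold reported with Figure 5; this choice (rather than raw Euclidean distance) is our assumption.

```python
import numpy as np

def partition_clusters(centroids, ref_vec, threshold=0.01):
    """Split clusters into p-known (centroid similar enough to the
    reference phenotype vector) and p-unknown, given the centroids
    produced by the clustering step."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    p_known = {c for c, v in centroids.items() if cos(v, ref_vec) >= threshold}
    return p_known, set(centroids) - p_known
```

Mentions in p-known clusters inherit $m$'s known performance, while each p-unknown cluster is sampled separately for validation or retraining.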
Results

Associations between embedding-based language patterns and model performance
As stated at the beginning of the Methods section, our approach is based on three assumptions about language patterns. It is therefore essential to quantify the extent to which the language patterns identified by our embedding-based approach meet these assumptions. The first assumption, that a phenotype mention can be assigned to one and only one language pattern, is met by construction, since (a) Equation (1) is a one-to-one function and (b) the DBScan algorithm (the vector clustering function chosen in our implementation) is also a one-to-one function. Assumption 2 can be quantified by the percentage of mentions that can be assigned to a cluster. This percentage can be increased by increasing the epsilon (EPS) parameter of DBScan (the maximum distance between two data items for them to be considered in the same neighbourhood). However, the degree to which mentions are clustered together needs to be balanced against the consequent reduced ability to identify performance-related language patterns, which is the third assumption: the association between language patterns and model performance.

To quantify this association, we propose a metric called "bad guy" separate power (SP for short), defined in Equation (3) below. The aim is to measure the extent to which a clustering can assemble incorrect data items (false-positive mentions of phenotypes) together. Let $C$ be a set of binary data items, $\forall c_i \in C: T(c_i) \in \{t, f\}$ ($t$ stands for true; $f$ stands for false). Given a clustering result $\{C_1, \dots, C_k \mid C_1 \cup \dots \cup C_k = C\}$, its separate power for $f$-typed data items is defined as

$SP(\{C_1, \dots, C_k\}, f) = \frac{\sum_{i=1}^{k} |\{c \mid c \in C_i, T(c)=f\}|^2 / |C_i|}{|\{c \mid c \in C, T(c)=f\}|}$   (3)

In our scenario, we would like the clustering to be able to separate easy cases (where good performance is achieved) from difficult cases (where performance is poor) for a model $m$.
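The SP metric is straightforward to compute. The sketch below is a direct transcription of Equation (3), with `False` marking an incorrect (false-positive) mention; the data layout is an assumption.

```python
def separate_power(clusters, is_correct):
    """Equation (3): SP for the f-typed (incorrect) items. clusters is a
    list of lists of item ids; is_correct maps item id -> True/False."""
    n_f = sum(1 for ok in is_correct.values() if not ok)
    if n_f == 0:
        return 0.0
    total = sum(sum(1 for c in cluster if not is_correct[c]) ** 2 / len(cluster)
                for cluster in clusters)
    return total / n_f

# Pure clusters of incorrect items give SP = 1.0:
print(separate_power([[1, 2], [3, 4]],
                     {1: True, 2: True, 3: False, 4: False}))  # 1.0
```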
Figure 4. Clustered percentage vs separate power on difficult cases. The X axis is the EPS parameter of the DBScan clustering algorithm (the longest distance between any two items within a cluster); the Y axis is the percentage. Two types of information (as functions of EPS) are plotted in each subfigure: the clustered percentage (solid line) and the separate power (SP) on incorrect cases (false-positive mentions of phenotypes). The latter has two series: (1) SP by chance (dash-dotted line), ie, when clustering by randomly selecting mentions; and (2) SP by clustering using phenotype embeddings (dashed line). $N$: number of all mentions; $N_f$: number of false-positive mentions.

To quantify the clustered percentage, the ability to separate mentions based on model performance, and the interplay between the two, we conducted experiments on selected phenotypes by continuously increasing the clustering parameter EPS from a low level. Figure 4 shows the results. In this experiment, we labelled mentions as correct or incorrect using SemEHR-labelled data on the SLaM corpus. Specifically, among the mention types in Table 1, incorrect mentions are those denoted "not a phenotype mention" and the remainder are labelled correct. We chose incorrect as the $f$ in Equation (3), meaning that we evaluate the separate power on incorrect mentions. Four phenotypes were selected for this evaluation: diabetes and hypertensive disease were selected because they were the most validated phenotypes; abscess (with 13% incorrect mentions) and blindness (with 47% incorrect mentions) were chosen to represent NLP models with different levels of performance.

The figure shows a clear trend in all cases: as EPS increases, the clustered percentage increases but the separate power decreases. This confirms a trade-off between the coverage of identified language patterns and their quality. Regarding separate power, the performance on the two selected common phenotypes (Figure 4a and 4b) is generally worse than for the other phenotypes, starting with lower power that decreases faster as EPS increases. The main reason is that the difficult cases (mentions with poor performance) for these two commonly encountered phenotypes are relatively rare (diabetes: 8.5%; hypertensive disease: 5.5%). In such situations, difficult cases are harder to separate because their patterns are under-represented. In general, however, compared to random clustering, the embedding-based clustering approach yields much better separate power in all cases. This confirms a high-level association between identified clusters and model performance. In particular, when the proportion of difficult cases approaches 50% (Figure 4d), the approach keeps SP values almost constantly near 1.0 as EPS increases, meaning it can almost always group difficult cases into their own clusters.

Model adaptation guidance evaluation
Technically, the guidance for model adaptation is composed of two parts: avoiding duplicate waste (skipping validation/training effort on cases the model is already familiar with) and avoiding imbalance waste (grouping new language patterns together so that validating/continuing to train on each group separately is more efficient than doing so over the whole corpus). To quantify the effectiveness of the guidance, the following metrics are introduced (a worked sketch of the waste metrics follows Figure 5 below).

● Duplicate Waste. This is the number of mentions whose patterns fall into what the model $m$ is familiar with. The quantity

$\frac{|\{s \mid pattern(s) \in P_m \cap P_T\}|}{|S|}$

is the proportion of mentions that need no validation or retraining before reusing $m$.

● Imbalance Waste. To achieve convergent performance, an NLP model needs to be trained on a minimal number (denoted $e$) of samples from each language pattern. Denoting the language pattern set in a new task as $C = \{C_1, \dots, C_k\}$, the following equation counts the minimum number of samples needed to achieve convergent results in "blind" adaptation:

$Conv\_Sampling(C, e) = \max_{i=1}^{k} \left( \frac{|S|}{|C_i|} \times \min(|C_i|, e) \right)$   (4)

When the language patterns are identifiable, the imbalance waste that can be avoided is quantified as $Conv\_Sampling(C, e) - \sum_{i=1}^{k} \min(|C_i|, e)$.

● Accuracy.
To evaluate whether our approach can really identify familiar patterns, we quantify the accuracy of the within-threshold clusters and also of the within-threshold single mentions that are not clustered. Both macro-accuracy (the average of all cluster accuracies) and micro-accuracy (the overall accuracy) are used; detailed explanations are given in [32].

Figure 5 shows the results of our NLP model adaptation guidance on four phenotype identification tasks. For each new phenotype identification task, the NLP model (pre-)trained for the semantically most similar phenotype (as defined in Equation 2) was chosen as the reuse model. Models and labelled data for the four pairs of phenotypes were selected from six physical comorbidity studies on SLaM data. Figure 5 shows that identified mentions have a high proportion of avoidable duplicate waste in all four cases: diabetes and heart attack start at about 50%; stroke and multiple sclerosis exceed 70%. This avoidable duplicate waste decreases as the threshold increases. The threshold is on similarity rather than distance, meaning that new patterns need to be more similar to the reuse model's embeddings to be counted as familiar; it is therefore understandable that duplicate waste decreases in such scenarios. In terms of accuracy, one would expect it to increase as only more similar patterns remain when the threshold increases. Interestingly, however, in all cases both macro- and micro-accuracy decrease slightly before increasing to near 1.0, a phenomenon worth future investigation. In general, the changes in accuracy are small (.03 to .08) and accuracy remains high (>.92). Given these observations, the threshold is normally set at .01 to optimise the avoidance of duplicate waste with minimal effect on accuracy. Specifically, in all cases, more than half of the identified mentions (50%+ in subfigures 5a and 5b; 70%+ in 5c and 5d) need no validation or training to obtain an accuracy of >0.95. In terms of effective adaptation to new patterns, the percentage of avoidable imbalance waste is around 80% in all cases, confirming that much more efficient retraining can be achieved through language-pattern-based guidance.

Figure 5. Identifying new phenotypes by reusing NLP models pre-trained for semantically close phenotypes. The four pairs of phenotype-mention identification models were chosen from SemEHR models trained on SLaM data; DBScan EPS value: 3.8; imbalance waste is calculated with e = 3, meaning at least 3 samples are needed for training from each language pattern. The X axis is the similarity threshold, ranging from 0.0 to 0.8; the Y axes, from top to bottom, are: the proportion of duplicate waste saved over the total number of mentions; macro-accuracy; micro-accuracy.
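For completeness, the two waste metrics can be sketched as below under the stated assumptions ($e = 3$, as in Figure 5). Equation (4) gives the expected "blind" random-sampling cost, which is driven by the rarest pattern; the example sizes are hypothetical.

```python
def conv_sampling(cluster_sizes, n_mentions, e=3):
    """Equation (4): samples needed in 'blind' random sampling so that every
    language pattern contributes at least min(|C_i|, e) instances."""
    return max(n_mentions / size * min(size, e) for size in cluster_sizes)

def avoidable_imbalance_waste(cluster_sizes, n_mentions, e=3):
    """Blind cost minus the cost of sampling each identified pattern."""
    guided = sum(min(size, e) for size in cluster_sizes)
    return conv_sampling(cluster_sizes, n_mentions, e) - guided

# Hypothetical task: 300 mentions in patterns of sizes 250, 40, and 10.
sizes = [250, 40, 10]
print(conv_sampling(sizes, 300))              # 90.0 (driven by the rare pattern)
print(avoidable_imbalance_waste(sizes, 300))  # 81.0, i.e., 90% of effort saved
```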
Effectiveness of phenotype semantics in model reuse
When considering NLP model reuse for a new task, if no existing model has been developed for the same phenotype-mention identification task, our approach chooses a model trained for the phenotype that is most semantically similar to it (based on Equation 2). To evaluate the effectiveness of such semantic relationships in reusing NLP models, we conducted experiments on the previous four phenotypes using phenotype models with different levels of semantic similarity. Table 2 shows the results. In all cases, reusing models trained for more similar phenotypes identified more duplicate waste under the same parameter settings. The first three cases in the table also achieved better accuracies, while multiple sclerosis had slightly better accuracy when reusing the diabetes model than the more semantically similar myasthenia gravis model; however, the latter identified 46% more duplicate waste.

Table 2. Comparison of the performance of reusing models with different levels of semantic similarity; the more semantically similar model in each pair is marked with *. Similarity threshold: 0.01; DBScan EPS: 0.38. Reusing models trained for more semantically similar phenotypes achieved adaptation with less effort (more duplicate waste identified) in all cases and was also more accurate in 3 out of 4 cases.

Model reuse cases (compared on duplicate waste, macro-accuracy, and micro-accuracy):
Diabetes by Type 2 Diabetes * / Diabetes by Hypercholesterolaemia
Stroke by Heart attack * / Stroke by Fatigue
Heart attack by Infarct * / Heart attack by Bruise
Multiple Sclerosis by Myasthenia Gravis * / Multiple Sclerosis by Diabetes
Discussion
Principal Results
Automated extraction methods (as surveyed recently by Ford et al[33]), many of which are freely available and open source, have been intensively investigated for mining free-text medical records[10,34-36].

Limitations
In this study, we did not evaluate the recall of adapted NLP models in new tasks. Although the models we chose generally achieve very good recall for identifying physical conditions (96-98%) within the SLaM records, investigating the transferability of recall is an important aspect of NLP model adaptation.

The model reuse experiments were conducted on identifying new phenotypes in document sets that had not previously been seen by the NLP model. However, these documents were still part of the same (SLaM) EHR system. Fully testing the generalisability of our approach will require evaluation of model reuse in a different EHR system, which will require a new set of access approvals as well as information governance approval for the sharing of embedding models between different hospitals.

We chose a phenotype embedding model to represent language patterns. One reason is that we had a limited number of manually annotated data items: the word embedding approach is unsupervised, and the word-level "semantics" learnt from the whole corpus help group similar words together in the vector space, which in turn improves phenotype-level clustering performance. However, thorough comparisons between different language pattern models are needed to reveal whether other approaches, in particular simpler or less computing-intensive ones, can achieve similar performance. In addition, implementation-wise, vector clustering is an important component of this approach. We compared DBScan with the k-nearest neighbours algorithm in our model, which revealed that DBScan achieved better separate power in most scenarios. On a 64-bit Windows 10 server with 16 GB of memory and an 8-core 3.6 GHz CPU, DBScan used about 200 MB of memory and took 0.038 seconds on nearly 300 data points, averaged over 100 executions. Nevertheless, in-depth comparisons with more clustering algorithms are worthwhile; in particular, larger datasets may be needed to compare clustering performance in terms of both computational cost and separate power.
Comparison with Prior Work
NLP model adaptation aims to adapt NLP models from a source domain (with abundant labelled data) to a target domain (with limited labelled data). This challenge has been extensively studied in the NLP community[37-...], including work on modelling language patterns for identifying unique "signatures" of micro-message authors. This paper models language patterns for characterising the "landscape" of phenotype mentions. One main difference is that we do not know how many clusters (or "signatures") of language patterns exist in our scenario. Technically, we use phenotype embeddings to model such patterns and, in particular, utilise phenotype semantic similarities (based on ontology hierarchies) to reuse learnt embeddings when necessary.

Conclusions
Making fine-grained language patterns visible and comparable (in computable form) is the key to supporting "smart" NLP model adaptation. We have shown that the phenotype-embedding-based approach proposed in this paper is an effective way to achieve this. However, our approach is just one way to model such fine-grained patterns. Investigating novel pattern representation models is an exciting research direction towards automated NLP model adaptation and composition (ie, combining various models together) for mining free-text electronic medical records in new settings with maximum efficiency and minimal effort.
Acknowledgements
This research was funded by a Medical Research Council / Health Data Research UK Grant (MR/S004149/1); an Industrial Strategy Challenge Grant (MC_PC_18029); and the National Institute for Health Research (NIHR) Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King's College London. The views expressed are those of the authors and not necessarily those of the NHS, the NIHR, or the Department of Health and Social Care.
Ethical approval and informed consent
De-identified patient records were accessed through the Clinical Record Interactive Search (CRIS) at the Maudsley NIHR Biomedical Research Centre, South London and Maudsley (SLaM) NHS Foundation Trust. This is a widely used clinical database with a robust data governance structure which has received ethical approval for secondary analysis (Oxford REC 18/SC/0372).
Conflicts of Interest
None declared.
Abbreviations
EHR: Electronic Health Record
LSTM: Long Short-Term Memory
NLP: Natural Language Processing
SemEHR: a semantic search toolkit that can be trained for different studies
SLaM: South London and Maudsley NHS Foundation Trust
Data availability statement
Multimedia Appendix 1: User interface and model performance of phenotype NLP models.
Multimedia Appendix 2: Proof of Theorem 1.

References
1. Kharrazi H, Anzaldi LJ, Hernandez L, Davison A, Boyd CM, Leff B, et al. The value of unstructured electronic health record data in geriatric syndrome case identification. Journal of the American Geriatrics Society. 2018. pp. 1499-...
...
... e42. doi:10.1097/ede.0000000000000856
7. Bell J, Kilic C, Prabakaran R, Wang YY, Wilson R, Broadbent M, et al. Use of electronic health records in identifying drug and alcohol misuse among psychiatric in-patients. The Psychiatrist. 2013. pp. 15-20. doi:10.1192/pb.bp.111.038240
8. Jackson RG, Ball M, Patel R, Hayes RD, Dobson RJB, Stewart R. TextHunter: a user-friendly tool for extracting generic concepts from free text in clinical research. AMIA Annu Symp Proc. 2014;2014:729-...
...
15. Harris ZS. Distributional structure. WORD. 1954. pp. 146-162.
...
Bengio Y, Ducharme R, Vincent P, Jauvin C. A neural probabilistic language model. J Mach Learn Res. 2003;3:1137-1155.
...
35. Savova GK, Ogren PV, Duffy PH, Buntrock JD, Chute CG. Mayo Clinic NLP system for patient smoking status identification. Journal of the American Medical Informatics Association. 2008. pp. 25-28. doi:10.1197/jamia.m2437
36. Albright D, Lanfranchi A, Fredriksen A, Styler WF 4th, Warner C, Hwang JD, et al. Towards comprehensive syntactic and semantic annotations of the clinical narrative. J Am Med Inform Assoc. 2013;20:922-...
...
... -88. doi:10.1109/mis.2018.012001555
41. Jiang J, Zhai C. Instance weighting for domain adaptation in NLP. Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics. Association for Computational Linguistics; 2007. pp. 264-271.