Patient ADE Risk Prediction through Hierarchical Time-Aware Neural Network Using Claim Codes
Jinhe Shi, Xiangyu Gao, Chenyu Ha, Yage Wang, Guodong Gao, Yi Chen
Jinhe Shi*, Xiangyu Gao*, Chenyu Ha†, Yage Wang†, Guodong Gao‡ and Yi Chen*
*New Jersey Institute of Technology, Newark, NJ, USA. Email: {js675, xg77, yi.chen}@njit.edu
†Inovalon, Bowie, MD, USA. Email: {cha, ywang2}@inovalon.com
‡University of Maryland, College Park, MD, USA. Email: [email protected]
Abstract—Adverse drug events (ADEs) are a serious health problem that can be life-threatening. While many studies have been performed on detecting the correlation between a drug and an AE, limited studies have been conducted on personalized ADE risk prediction. Among treatment alternatives, avoiding the drug that has a high likelihood of causing a severe AE can help physicians provide safer treatment to patients. Existing work on personalized ADE risk prediction uses only the information obtained in the current medical visit. Medical history, however, reveals each patient's unique characteristics and comprehensive medical information. The goal of this study is to assess the personalized ADE risks that a target drug may induce on a target patient, based on patient medical history recorded in claim codes, which provide information about diagnoses, drugs taken, and related medical supplies in addition to billing information. We developed HTNNR (Hierarchical Time-aware Neural Network for ADE Risk), a model that captures the characteristics of claim codes and their relationships. The empirical evaluation shows that the proposed HTNNR model substantially outperforms the comparison methods, especially for rare drugs.
Index Terms—Adverse Drug Event, Neural Network, Claim Code
I. INTRODUCTION
Adverse Drug Events (ADEs), defined as "an appreciably harmful or unpleasant event resulting from the use or misuse of a drug" [1], are a serious health problem that can be life-threatening. According to the FDA, the number of ADEs reported to FAERS (FDA Adverse Event Reporting System) resulting in death and serious outcomes has increased consistently [2], [3]. Statistics [4] show that each year ADEs account for over 3.5 million physician office visits, an estimated 1 million emergency department visits, and approximately 125,000 hospital admissions. In the inpatient setting, ADEs account for an estimated 1 in 3 of all hospital adverse events (AEs) and affect about 2 million hospital stays each year.

While pre-marketing review is conducted before any drug is approved for marketing, it is insufficient for identifying all potential ADEs due to the limited sample size and duration of clinical trials. Post-marketing surveillance is therefore critical for identifying ADEs. Although patients can report ADEs through voluntary and spontaneous reporting systems, such as FDA FAERS, the median under-reporting rate across 37 studies, using a wide variety of post-marketing surveillance methods from 12 countries, is 94% according to an earlier study [5].

There is increasing interest in using large-scale longitudinal clinical data, such as EHRs, associated clinical notes, and claims data, for studying ADEs. Such data contain rich and accurate information about patients' health status, treatment plans and clinical outcomes. Since such data is generated as part of medical practice, without relying on patient self-reporting, it is available at large scale with high quality.

The studies can be categorized into two types: ADE detection and personalized ADE risk prediction. The goal of ADE detection is to identify the correlation or causal relationship between a target drug and an observed AE.
Some use statistical methods such as disproportionality analysis [6]–[8]; others use machine learning methods such as support vector machines, random forests and neural networks [9], [10]. Besides detecting drug-AE correlation on the whole patient population, there are also studies on ADE risk stratification, which assesses the correlation on patient populations defined by their demographics [11].

In contrast to ADE detection for a population, studies on personalized ADE risk prediction assess the likelihood of an individual experiencing an AE based on individual characteristics and clinical history. Indeed, different patients may have different AE outcomes even when taking the same drug. Among alternative drugs for treatment, avoiding the one that has a high likelihood of causing a severe AE can help physicians provide safer treatment to patients, as a form of personalized treatment. There are only a few works addressing this problem [12], [13]. They take as input patient demographic information and clinical information of the current hospital visit.

What is lacking in the literature is to consider patient medical history, in addition to the current visit information, to make personalized predictions of ADE risks. Medical history better reveals each patient's unique characteristics, as well as the drugs and treatments taken in the past, which may interact with the current treatment to induce an AE [14], [15].

However, patient medical history data is often not readily available and is difficult to process. First, a patient may be seen at multiple healthcare centers that do not share patient data in their EHR systems. Second, patient self-reported medical history may not be accurate or comprehensive.
Furthermore, processing large-scale longitudinal medical history data, which contains diverse types of clinical information, poses technical complexity.

The goal of this study is to assess the personalized ADE risks that a target drug may induce on a target patient, based on patient medical history recorded in claims, to which we acquired access via collaboration with Inovalon, a healthcare analytics company. Our findings can be used by Medicare/Medicaid and health insurance companies to assist healthcare professionals in identifying safe treatment plans.

Claims data provide valuable information about patients. They contain information about diagnoses, drugs taken, related medical supplies and treatment procedures, besides billing information, for each patient encounter. While a patient may receive healthcare from multiple providers and have their medical information scattered across multiple EHR systems, claims data effectively record a patient's interactions across different healthcare systems and thus provide longitudinal and accurate data on the continuum of a patient's health care history [16].

However, there are several technical challenges that must be addressed. The first challenge is how to capture the "meanings" of claim codes. There are over 64K unique claim codes in the data, belonging to nine different types. We make an analogy between claim codes and words, and between claim histories and documents. We then propose to use word embedding methods from Natural Language Processing (NLP) to generate embeddings for claim codes, so that claim codes that are used in similar ways are represented with similar vectors, naturally capturing their meanings.

The second challenge is how to model patient medical claim history. A patient's claim history consists of encounters, and each encounter consists of claim codes. The relationship of claim codes within an encounter is different from that of claim codes in different encounters.
This presents a unique challenge, as existing work does not consider a patient's medical history but only the current medical visit. To model a patient's claim history, we propose HTNNR, a Hierarchical Time-aware Neural Network for ADE Risk. The first-layer neural network encodes the claim codes within an encounter into vectors, and the second-layer neural network represents the claim history, a sequence of encounters, as vectors. We propose to use a bi-directional neural network model to capture the unordered relationship among claim codes within an encounter. We further propose to use a time-aware deep learning model to capture not only the sequential but also the temporal relationships among encounters.

The contributions of our work include the following. First, to the best of our knowledge, this is the first study that uses patient claim history to make personalized predictions of drug-induced ADE risks. Second, we have made several technical contributions. We proposed claim code embedding, a hierarchical neural network model to capture patient claim history, and drug-claim code representations. We also used different neural network models for the encounter representation and for the claim history representation. Finally, extensive evaluation on about 500K patients demonstrates the effective prediction performance and high efficiency of our proposed approach.

The rest of the paper is organized as follows. Section II discusses the related work. Section III presents the problem statement and data overview. Section IV and Section V present the two methods for patient ADE risk prediction. Experimental results are presented in Section VI. Section VII concludes the paper.

II. RELATED WORK
Studies on ADEs can be categorized into ADE detection on a population, personalized AE risk prediction, and prediction of ADE outcome intensity (e.g., hospitalization and mortality).

The goal of ADE detection is to identify the correlation or causal relationship between a target drug and an observed AE, using statistical methods or machine learning methods. Some studies applied association rule mining methods for ADE detection [17], [18]. Disproportionality analysis is widely used for ADE detection from various data sources, such as EHR data [19], [20], clinical notes [8], and clinical trials [21]. Disproportionality analysis is based on the contrast between observed and expected numbers of co-occurrences, for any given combination of drug and AE, to detect possible causal relations between drugs and AEs. It does not, however, consider context features, which are rich in unstructured clinical notes. Various Natural Language Processing (NLP) and machine learning techniques have been applied to clinical notes to detect drug-AE associations, using expert-labeled ground truth. [9], [10] extract multiple features, such as drug and AE frequency and co-mention frequency, from clinical notes and use machine learning methods like support vector machines and random forests to detect drug-AE correlation. [22], [23] start with a named entity recognition module based on Conditional Random Fields to extract medical entities relevant to ADEs from clinical notes, and then use random forests and neural networks, respectively, as the relation classification model. Little has been studied on using claims data for ADE detection. [24] use ICD codes and GPI drug codes in claims data (see Table I for a description of the codes) as input and design a graph neural network model that constructs a drug-disease graph for ADE detection. They first embed disease codes and drug codes into graphs, respectively; the merged drug and disease graph is then fed into a graph neural network for ADE detection.
They used the SIDER database as the ground truth for ADEs. Besides detecting drug-AE correlation on the whole patient population, there are also studies on ADE risk stratification, which assesses the drug-AE correlation on patient populations defined by their demographics [11].

There are only a few studies in the category of personalized AE risk prediction. Since the AE risks of different patients are different, even for the same drug, these studies make risk predictions based on the individual patient's characteristics from clinical data. [13] develops a logistic regression model to predict the risks of AEs for in-patients based on patient features and the medical conditions during the hospital stay.
TABLE I
DESCRIPTION OF DIFFERENT CLAIM CODES
Code Type: Description

ICD: International Statistical Classification of Diseases (ICD) codes capture diseases, symptoms, abnormal findings, complaints, etc. They include diagnosis codes (ICD10DX and ICD9DX) and procedure codes (ICD9PX and ICD10PX).

CPT: CPT codes report medical, surgical, and diagnostic procedures and services to entities such as physicians, health insurance companies and accreditation organizations.

POS: Place of Service (POS) codes are two-digit codes placed on health care professional claims to indicate the setting in which a service was provided.

GPI: The Generic Product Identifier (GPI) is a 14-character hierarchical classification system that identifies drugs from their primary therapeutic use down to the unique interchangeable product, regardless of manufacturer or package size.

TOB: Type of Bill (TOB) codes identify the type of bill being submitted to a payer. TOB codes are four-digit alphanumeric codes that specify different pieces of information on a claim form.

REVENUE: Revenue codes are descriptions and dollar amounts charged for hospital services provided to a patient.

HCPCS: The Healthcare Common Procedure Coding System (HCPCS) is a collection of codes that represent procedures, supplies, products and services which may be provided to Medicare beneficiaries and to individuals enrolled in private health insurance programs.

DISCHARGE: Discharge codes identify where the patient is at the conclusion of a health care facility encounter (a visit or an inpatient stay).

LOINC: Logical Observation Identifiers Names and Codes (LOINC) is a database and universal standard for identifying medical laboratory observations.

They used multiple patient characteristics, such as gender and age, as features, and also extracted features from current medical conditions, such as the number of medications and the list of drugs taken. [12] takes clinical features as input, such as ADE indication codes, the primary diagnosis code and the length of the hospital stay, to predict in-patient ADE risks.
They used multiple machine learning models, such as random forests and support vector machines. Both make ADE risk predictions based on the information of the current hospital stay. Being most related to this category of studies, our work takes as input a patient's longitudinal medical history, not just the current medical encounter. Also, we consider AE risks induced by target drugs (perhaps due to interaction with other drugs or medical conditions), whereas existing studies consider AEs in general. The dataset used in our study is claims data.

Unlike studies on personalized AE risk prediction, which predict the likelihood of a specific AE occurring, there are also studies on predicting the likelihood of hospitalization and mortality of a patient due to outcomes of unspecified AEs. Both of the following use patient medical data from FAERS. [25] proposed a hybrid model to predict the outcomes of ADEs based on patients' demographic data, such as age and gender, and drug-taking information, such as the route of drug intake and whether the adverse reaction subsided when drug intake was terminated. [26] developed a system that takes patient demographics, drugs, and relevant diseases in pathology as input, and outputs an ADE risk outcome assessment.

III. PROBLEM STATEMENT
In this section we present the data description and theproblem definition.
A. Data Description
The input data is the medical claim history for a set of patients. Each claim history is composed of a sequence of encounters, and each encounter has a sequence of claim codes, as illustrated in Figure 3. At an encounter, a medical treatment and/or evaluation and management services are provided. There are nine different types of claim codes, which provide information on medical diagnoses (ICD), procedures and services (CPT, LOINC), the setting where services are provided (POS), drug information (GPI), billing (TOB, REVENUE, DISCHARGE), and codes for Medicare and private health insurance program users (HCPCS). Table I gives a description of these code types.

The data used in the empirical evaluation was provided by Inovalon, a technology company providing cloud-based platforms empowering data-driven healthcare. It contains the claims data of 500K patients for the duration 2015-2019. There are 64,070 unique claim codes. Figure 2 shows the distribution of the number of encounters per patient. We can see that most patients have fewer than 500 encounters. The average number of claim codes per patient is 1,052. Figure 1 shows the number of claim code occurrences in each category.

Fig. 1. Distribution of Claim Code Occurrences

Fig. 2. Distribution of Number of Encounters Per Patient
B. Problem Definition
Now we formally define the problem. We model it as a classification task. A patient's claim history is composed of a sequence of encounters, denoted as P = {e_1, e_2, ...}. Each encounter e_i is composed of a sequence of claim codes, e_i = {x_1, x_2, ...}. Consider a list of target ADEs and a target drug d. y ∈ {−1, 1} is the classification label, where y = 1 indicates that drug d induced at least one ADE in the target ADE list on this patient, and otherwise y = −1. For a set of patients who took drug d, their claim histories before taking d, along with the corresponding labels, are used to train the classification model. For a target patient who has not taken drug d, the model takes his claim history so far to predict the label, i.e., whether he will experience an ADE in the target ADE list if taking d now.
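As a concrete illustration of this formulation, the following minimal Python sketch (not the authors' code; the helper name is hypothetical) builds one training example: a claim history as a list of encounters, each a list of claim-code strings, paired with a label y ∈ {−1, 1}.

```python
# Sketch of one (claim history, label) training example for a target drug.

def make_training_example(encounters, ade_observed):
    """encounters: list of encounters (each a list of claim codes) recorded
    before the patient first took the target drug.
    ade_observed: True if a target ADE occurred within the observation
    window after taking the drug."""
    y = 1 if ade_observed else -1
    return encounters, y

history = [
    ["POS-11", "ICD10DX-I10", "CPT-99213"],     # encounter 1
    ["CPT-92083", "ICD10DX-H02831", "POS-11"],  # encounter 2
]
x, y = make_training_example(history, ade_observed=False)
```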
Identifying ADEs from Claim History. First, we identify ADEs from claim codes. Based on the literature, an ADE can be identified from the claim codes by the concurrent presence of selected diagnosis codes and selected indication codes [27]. Diagnosis codes are part of the ICD codes, as shown in Table I. An indication code is a special type of diagnosis code indicating that a patient experienced an ADE [27], [28]. Following existing work [28], we use the four categories of indication codes shown in Table II and their corresponding ICD codes. For example, ICD code "T46.9", which represents "Other and unspecified agents primarily affecting the cardiovascular system", is an indication code, indicating an ADE related to the cardiovascular system. If a diagnosis code and an indication code co-occur in an encounter, we consider that an ADE occurred, and the diagnosis code gives the information about the AE. For example, if the diagnosis code "I42.7" (Cardiomyopathy due to drugs and other external agents) and "T46.9" both occur in an encounter, then "I42.7" represents an ADE.

The diagnosis codes (ICD codes) of the target ADEs and the GPI codes of the target drugs are inputs of the problem. We consider that a target drug induces a target ADE experienced by a patient if the ICD code corresponding to the target ADE and one of the indication codes in Table II are found in the same encounter within a time period N after taking the drug, but not found in the claim history before taking the drug. Specifically, as illustrated in Figure 3, suppose a patient starts to take a target drug at encounter e_{M+1}. If no target ADE is found before encounter e_{M+1} but one is recorded in encounter e_{M'} along with an indication code, and the time duration between e_{M+1} and e_{M'} is less than N, then we consider that the target drug induces this ADE. We can also use other approaches to generate ground truth, such as human labeling.

Note that it is possible that an ADE is a result of drug-drug interaction [29]; in other words, sometimes multiple drugs together induce an ADE. If any of these drugs is a target drug, the corresponding claim sequence for that drug is labeled positively.

Also, N is considered the effective time of a drug to cause an AE. Currently N is set to 3 months for all drugs. Different values of N can be used for different drugs based on drug characteristics when such information becomes available.
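The labeling rule above can be sketched in Python as follows. This is an illustration, not the authors' pipeline: the function name is hypothetical, the indication-code set contains only the "T46.9" example from the text, and the 3-month window is approximated as 90 days.

```python
from datetime import date, timedelta

INDICATION_CODES = {"T46.9"}   # example indication code from the text
N = timedelta(days=90)         # observation period, roughly 3 months

def drug_induced_ade(encounters, drug_start, target_dx):
    """encounters: list of (date, set_of_codes); drug_start: date the target
    drug was first taken; target_dx: ICD code of the target ADE.
    True iff target_dx co-occurs with an indication code in an encounter
    within N after drug_start, and never appears before drug_start."""
    if any(target_dx in codes for d, codes in encounters if d < drug_start):
        return False
    return any(
        target_dx in codes and codes & INDICATION_CODES
        for d, codes in encounters
        if drug_start <= d <= drug_start + N
    )

encs = [
    (date(2018, 1, 5), {"ICD10DX-I10"}),
    (date(2018, 3, 1), {"I42.7", "T46.9"}),  # ADE with indication code
]
print(drug_induced_ade(encs, date(2018, 2, 1), "I42.7"))  # True
```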
TABLE II
INDICATION CODES FOR ADVERSE DRUG EVENTS

Indication Category: Description

A1: The ICD-10 code description includes the phrase 'induced by medication/drug'.
A2: The ICD-10 code description includes the phrase 'induced by medication or other causes'.
B1: The ICD-10 code description includes the phrase 'poisoning by medication'.
B2: The ICD-10 code description includes the phrase 'poisoning by or harmful use of medication or other causes'.

IV. FIRST ATTEMPT
Fig. 3. Claim History Illustration

Since a patient's medical claim history consists of a sequence of claim codes, which encode medical diagnoses, procedures and services conducted, drugs taken and so on, the problem can intuitively be modeled as a sequence classification problem. Figure 4 shows the system architecture. The input is the patient's medical claim history represented as a sequence of claim codes. Each claim code is then represented as an embedding vector. A deep learning model, Long Short-Term Memory (LSTM), is then used to learn the dependencies between the claim codes in order to predict whether the patient will experience a target ADE if taking a target drug.

Fig. 4. The architecture of the First Attempt Method
Claim Code Embedding.
The claim code embedding layer generates a vector for each claim code that captures the characteristics of codes and the relationships among codes. Word embedding is widely used in deep learning based NLP techniques. Using dense, low-dimensional vectors to encode words brings computational benefits to downstream neural network processing. Learned from word usage, word embeddings represent words that are used in similar ways with similar vectors, naturally capturing their meaning. We make the analogy that each claim code corresponds to a word, an encounter corresponds to a sentence, and a patient's claim history corresponds to a document. The usage of claim codes indicates their correlation, just like the usage of words in text. We use a popular word embedding method in NLP, the skip-gram model [30]. It takes as input the collection of all patients' claim code sequences and generates a low-dimensional, continuous, real-valued vector for each claim code as its embedding.
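To make the word analogy concrete, the sketch below (illustrative only, not the authors' code) generates the (center, context) training pairs that a skip-gram model consumes, treating each encounter as a "sentence" of claim codes.

```python
# Generate skip-gram (center, context) pairs from one encounter.

def skipgram_pairs(encounter, window=2):
    """Yield (center_code, context_code) pairs within a context window."""
    pairs = []
    for i, center in enumerate(encounter):
        lo, hi = max(0, i - window), min(len(encounter), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, encounter[j]))
    return pairs

enc = ["POS-11", "ICD10DX-I10", "ICD10DX-E109", "CPT-99213"]
pairs = skipgram_pairs(enc, window=1)
# adjacent codes become each other's context, e.g. ("POS-11", "ICD10DX-I10")
```

In practice, a library such as gensim's Word2Vec (with sg=1 for skip-gram) can learn the embeddings directly from such claim code sequences.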
Sequence Classification with LSTM.
After using an embedding to represent each claim code, the sequence of claim code embeddings is fed into a deep learning model to learn the claim code dependencies. We identify all patients in the training data who took a target drug. These patients' sequences of claim code embeddings before taking the target drug, and the corresponding labels of whether a target ADE is observed in the claim history within the time period N after taking the drug, are used to train the model. The trained model then predicts the label of each patient in the test data, based on his/her claim code embedding sequence so far.

In contrast to Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN) are designed for sequence prediction problems. (The performance evaluation of CNN and several feature-based machine learning methods is presented in Section VI.) However, RNNs suffer from the problem of gradient vanishing or exploding [31], where gradients may grow or decay exponentially over long sequences. This makes it difficult to model long-distance correlations in claim code sequences. Recall that the average number of claim codes per patient is 1,052.

We propose to use LSTM networks instead, which are designed to overcome the vanishing gradient problem and to efficiently learn long-term dependencies. LSTMs accomplish this by keeping an internal state that represents the memory cell of the LSTM neuron. This internal state controls the information flow through the cell state. The new cell state c_j and the output h_j are calculated as:

c_j = f_j ⊙ c_{j−1} + I_j ⊙ tanh(W_c [F_j, h_{j−1}] + b_c)  (1)

h_j = o_j ⊙ tanh(c_j)  (2)

where I_j, f_j and o_j denote the input, forget and output gates, respectively. Finally, the output layer uses a softmax function on the vector generated by the LSTM layer to make a prediction. This approach is referred to as LSTM in the rest of the paper.
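One LSTM step of Equations (1) and (2) can be sketched in numpy as follows. This is an illustration, not the authors' implementation: the gate computations follow the standard LSTM, and the stacked weight layout is an assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(F_j, h_prev, c_prev, W, b):
    """One LSTM step. W, b hold stacked parameters for the input, forget
    and output gates plus the candidate cell; [F_j, h_prev] is the
    concatenated input."""
    z = W @ np.concatenate([F_j, h_prev]) + b
    d = h_prev.size
    i_j, f_j, o_j = sigmoid(z[:d]), sigmoid(z[d:2*d]), sigmoid(z[2*d:3*d])
    c_j = f_j * c_prev + i_j * np.tanh(z[3*d:])  # Eq. (1)
    h_j = o_j * np.tanh(c_j)                     # Eq. (2)
    return h_j, c_j

rng = np.random.default_rng(0)
d, e = 4, 8                                  # hidden size, embedding size
W = rng.normal(size=(4*d, e + d))
b = np.zeros(4*d)
h, c = np.zeros(d), np.zeros(d)
h, c = lstm_step(rng.normal(size=e), h, c, W, b)
```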
V. HTNNR MODEL

After presenting the LSTM method in Section IV, we now discuss several characteristics of patient claim code history and propose a novel model, HTNNR (Hierarchical Time-Aware Neural Network for ADE Risk).
A. A Hierarchical Neural Network
The LSTM method models patient claim history as a sequence of claim codes. However, this approach may not accurately capture the relationships between the claim codes. Recall that the claim history actually consists of a sequence of encounters, each of which contains a sequence of claim codes. There are two observations. First, the number of claim codes in different encounters can vary widely. For instance, consider the three encounters illustrated in Figure 3. The first encounter represents a hospital stay, with 30 claim codes. The next encounter represents a follow-up with a specialist, with only four claim codes. The third encounter represents a visit to a primary care doctor for a flu, with another four claim codes. The LSTM model ignores the encounter information and just considers the claim code sequence, where code relationships are reflected by their distances. In this example, the 1st code and the 30th code are considered less related since their distance is 29, even though they actually belong to the same encounter. On the other hand, the 30th code and the 35th one are considered closely related since their distance is only 5. However, they are actually two encounters apart and not semantically closely related. The second observation is that the claim codes within an encounter are actually not ordered, collectively describing an encounter event.

Based on these observations, we propose a hierarchical framework to model the input data, as shown in Figure 5. The first layer in the framework generates a vector for each encounter, called the Encounter Representation. The second layer takes the sequence of encounter vectors as input and outputs an embedded vector for each patient's claim history, referred to as the Claim History Representation. This framework better captures the claim code relationships. We now discuss these two layers in turn.
B. Encounter Representation
The Encounter Representation takes the patient claim history as input. It has two components: a Bi-LSTM layer and a claim code attention layer. We discuss each in turn.
Bi-LSTM Representation for Encounters.
Recall that the LSTM method discussed in Section IV considers the claim history as a sequence of claim codes. However, claim codes in an encounter do not have a sequential order; they are a set of codes that collectively record an encounter event. Based on this observation, we propose to use Bi-directional Long Short-Term Memory (Bi-LSTM) [32] to generate a representation of the unordered claim codes in an encounter. Both the previous codes and the following codes within an encounter are considered by Bi-LSTM to model code dependencies. The output for the j-th claim code in an encounter is calculated as:

h_j = →h_j ⊕ ←h_j,  (3)

where ⊕ is a concatenation operation.

Claim Code Attention. Not all claim codes contribute equally to the semantic representation of an encounter. Attention neural networks have recently demonstrated success in document classification by learning the weights of words [33]. Hence, we apply the attention mechanism to set the weights of claim codes, so that the model can focus on the claim codes that are important to capture the semantics of an encounter. The encounter representation v_e is formed by a weighted sum of the vectors generated by Bi-LSTM:

E = tanh(H)  (4)

α = softmax(w^T E)  (5)

v_e = H α^T  (6)

Here H is a matrix consisting of the vectors [h_1, h_2, ..., h_T] that the Bi-LSTM layer produces, where T is the input length; w is a trained weight vector and w^T is its transpose.
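Equations (4)-(6) can be sketched directly in numpy. The shapes below are illustrative assumptions (d is the Bi-LSTM output size, T the number of codes in the encounter); this is not the authors' code.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def encounter_vector(H, w):
    """H: (d, T) matrix of Bi-LSTM outputs [h_1 ... h_T]; w: (d,) trained
    weight vector. Returns the encounter representation v_e."""
    E = np.tanh(H)          # Eq. (4)
    alpha = softmax(w @ E)  # Eq. (5): one attention weight per claim code
    return H @ alpha        # Eq. (6): weighted sum of code vectors

rng = np.random.default_rng(1)
d, T = 6, 5
v_e = encounter_vector(rng.normal(size=(d, T)), rng.normal(size=d))
```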
C. Claim History Representation

Given the encounter vectors v_{e_i} output by the Encounter Representation layer for every encounter e_i in a claim history, we now discuss how to generate a vector for each patient's claim history.

One intuitive way is to use an LSTM model on the sequence of encounter vectors to generate a claim history vector. Indeed, the sequential order of encounters indicates the temporal order of the encounter events. However, LSTM does not capture the time differences between the encounters. Referring to Figure 3, the first two encounters are 7 days apart, with the second encounter being a follow-up visit of a surgery performed in the first encounter. The time between the second and the third encounter is 9 months, with the third encounter being a visit to a primary care doctor for a flu. As we can see from this example, two adjacent encounters with a small time lapse often refer to closely related medical issues. On the other hand, two adjacent encounters that are a long time apart likely refer to unrelated medical issues. In this case, the previous encounter has less importance to the semantics of the current encounter. Thus, sequential order itself is inadequate to capture the relationship between encounters; we should also consider the actual time differences.

We propose to use a Time-aware LSTM (TLSTM) [34] to generate a claim history vector from the sequence of encounter vectors for each patient. For each encounter, we consider not only its claim codes, but also its timestamp. The major component of the TLSTM layer is the subspace decomposition applied on the memory of the previous time step. The short-term memory is adjusted proportionally to the amount of time elapsed between two patient encounters:

g(∆_i) = 1 / ∆_i  (7)
Fig. 5. Architecture Overview of HTNNR

ĉ_{i−1} = c_{i−1} ∗ g(∆_i)  (8)

c*_{i−1} = c^L_{i−1} + ĉ_{i−1}  (9)

Here ∆_i is the time span between encounter e_i and encounter e_{i−1}, c_{i−1} is the short-term memory in LSTM, ĉ_{i−1} is the short-term memory adjusted for the time span, and c*_{i−1} is the final adjusted previous memory that combines the normal long-term memory c^L_{i−1} and the adjusted short-term memory. As we can see, if the gap between encounters e_i and e_{i−1} is large, meaning no new information was recorded for the patient for a long time, the dependence on the short-term memory does not play a significant role in the prediction of the current output.

In this way, the final cell state in Equation 1 is changed to:

c_i = f_i ⊙ c*_{i−1} + I_i ⊙ tanh(W_c [F_i, h_{i−1}] + b_c)  (10)

The patient claim history vectors are calculated from the encounter vectors as follows:

h_i = TLSTM(v_{e_i}, ∆_i), i ∈ [1, M]  (11)

Here v_{e_i} is the encounter representation for encounter e_i, M is the number of encounters before the target drug was taken, and ∆_i is the elapsed time between encounters e_i and e_{i−1}.

Finally, the patient claim history vector is fed into an attention layer to learn the importance of different encounters in predicting whether the target patient will experience a target ADE.
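The time-aware memory adjustment of Equations (7)-(9) can be sketched in a few lines of numpy. This is an illustration under simplifying assumptions: the decomposition of the previous cell state into short- and long-term parts is taken as given, and ∆ is measured in days.

```python
import numpy as np

def time_adjusted_memory(c_short, c_long, delta):
    """Discount the short-term memory by the elapsed time between
    encounters, then recombine with the long-term memory."""
    g = 1.0 / delta            # Eq. (7): time decay
    c_short_hat = c_short * g  # Eq. (8): adjusted short-term memory
    return c_long + c_short_hat  # Eq. (9): final adjusted memory

c_short = np.array([0.8, -0.4])
c_long = np.array([0.2, 0.1])
near = time_adjusted_memory(c_short, c_long, delta=7.0)    # 7-day gap
far = time_adjusted_memory(c_short, c_long, delta=270.0)   # ~9-month gap
# with a large gap, the adjusted memory is dominated by c_long
```

This mirrors the Figure 3 example: after a 9-month gap, the short-term memory of the previous encounter contributes almost nothing.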
VI. EXPERIMENT

We implemented the proposed method, referred to as HTNNR. We have conducted extensive experiments to empirically evaluate the HTNNR model using real-life claims data. We start by discussing the model implementation, evaluation setting and comparison methods. Then we present the empirical evaluation results.
A. System Implementation
HTNNR is implemented in Python, and the hierarchical attention model is implemented using Keras with a TensorFlow backend. The experiments are run on a 20-core server. Existing work indicates that a large batch size may alleviate the impact of noisy data, while a small batch size can sometimes accelerate convergence [35]. We varied the batch size in experiments and set the training batch size to 256, balancing performance against the time and memory consumed by training. To train the ADE classifier, we use binary cross-entropy as the loss function. The optimizer we adopted is Adaptive Moment Estimation (Adam), which achieves fast gradient descent [36]. We use validation-based early stopping to obtain the model that works best on the validation data: the model with the minimum validation error is saved and used to make predictions on the testing data.
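The validation-based early stopping described above can be sketched as follows. This is an illustrative pure-Python loop, not the Keras callback the model actually uses; `train_step`, `validate`, and the patience value are assumptions for illustration.

```python
def train_with_early_stopping(train_step, validate, max_epochs=100, patience=5):
    """Keep the model state with the minimum validation error; stop once
    validation error has not improved for `patience` consecutive epochs."""
    best_err, best_state, stale = float("inf"), None, 0
    for epoch in range(max_epochs):
        state = train_step(epoch)   # one epoch of training -> model state
        err = validate(state)       # error on the held-out validation set
        if err < best_err:
            best_err, best_state, stale = err, state, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_state, best_err     # the saved model used for testing
```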
B. Evaluation Setting
The data we used is provided by Inovalon. Inovalon's MORE Registry dataset contains 500K patients. Each patient has a sequence of encounters, and each encounter contains a sequence of claim codes, with statistics presented in Section III-A.

Target Drugs. We evaluated our proposed methods on 10 drugs randomly selected from among all drugs taken by more than 20K patients in the dataset. Table III shows the GPI code, description, and number of patients taking each drug.
TABLE III
TARGET DRUGS

GPI code          Description              Patient Population
GPI-5818002510    Duloxetine HCl           22616
GPI-3610003000    Lisinopril               124716
GPI-4927006000    Omeprazole               138152
GPI-3400000310    Amlodipine Besylate      127326
GPI-4220003230    Fluticasone Propionate   106106
GPI-3320003010    Metoprolol Tartrate      75561
GPI-3615004020    Losartan Potassium       75570
GPI-5710001000    Alprazolam               44214
GPI-5816007010    Sertraline HCl           39258
GPI-6420001000    Acetaminophen            20618

Target ADEs.
ADEs are prevalent and not totally avoidable. The evaluation is performed on target ADEs that are severe. Table IV shows the target ADE list used in the evaluation, selected based on severity according to existing studies [37] and on occurrence in our data set. Here occurrence means the number of patients who experienced the ADE in our data set. Other ADEs could also be used in the evaluation.
Training and Testing Data. For each target drug, we extract all the patients whose claim history contains the GPI code of the drug. For each patient, we extract the claim history before taking the target drug. We then identify the occurrence of a target ADE within 3 months after the drug is taken, using the method discussed in Section III, to generate the label for this instance. We split the patients for each drug into training/testing/validation sets with ratio 0.7/0.2/0.1. The final result is the average over these 10 drugs.
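The labeling and split described above can be sketched as follows. This is a simplified sketch: the 3-month window is approximated as 90 days, and the date-based representation of claims is a hypothetical stand-in for the actual claim records.

```python
from datetime import timedelta

def label_instance(ade_dates, drug_date, window_days=90):
    """Label = 1 if any target-ADE claim falls within the window
    (about 3 months) after the drug-taking date, else 0."""
    end = drug_date + timedelta(days=window_days)
    return int(any(drug_date < d <= end for d in ade_dates))

def split_patients(patients, ratios=(0.7, 0.2, 0.1)):
    """Split the patient list for one drug into train/test/validation."""
    n = len(patients)
    n_train, n_test = int(n * ratios[0]), int(n * ratios[1])
    return (patients[:n_train],
            patients[n_train:n_train + n_test],
            patients[n_train + n_test:])
```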
C. Comparison Methods
Since ours is the first study to use claims history for personalized ADE risk prediction, there is no existing work to compare against directly. We therefore use several baseline approaches for comparison.

• Long Short-Term Memory (LSTM): This is the method discussed in Section IV.
TABLE IV
TARGET ADVERSE DRUG EVENTS (ADEs)

ADE code (ICD-10)   Description
L29.9               Pruritus
K27.9               Stomach or intestinal ulcers
L50.9               Urticaria
T78.40              Allergic Reaction
F32.9               Depression
R06.00              Dyspnea
D64.9               Anemia
D69.6               Thrombocytopenia
M25.50              Arthralgia
R00.2               Palpitation
R20.2               Paresthesia
F41.9               Anxiety
M79.1               Myalgia
I47.2               Ventricular tachycardia
I63.0               Anorexia
TABLE V
EVALUATION OF OVERALL EFFECTIVENESS

Systems         Accuracy   Precision   Recall   AUC
Random Forest   0.78       0.65        0.21     0.75
XGBoost         0.80       0.67        0.25     0.76
LSTM            0.84       0.69        0.34     0.81
CNN             0.83       0.60        0.37     0.80
HTNNR           0.88       0.84        0.51     0.89

• Convolutional Neural Network (CNN): This replaces the LSTM model in the method discussed in Section IV with a CNN model. CNNs have proven effective in computer vision [38] and natural language processing [39].

• Random Forest: Random forest is a classification algorithm consisting of many decision trees [40].

• XGBoost: XGBoost is an implementation of the gradient-boosted decision tree algorithm, which has been widely used in classification tasks such as emotion analysis [41] and image classification [42].

Note that every method is trained on the patients of each target drug independently. For Random Forest and XGBoost, we use Term Frequency (TF)-Inverse Document Frequency (IDF) vectors extracted from the claim code sequences as features. TF-IDF is commonly used as a feature representation in text classification tasks [43].
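The TF-IDF features used by the Random Forest and XGBoost baselines can be sketched in pure Python. This is a minimal, smoothing-free variant written for illustration; in practice a standard library implementation (e.g. scikit-learn's) would typically be used, and which variant the authors used is not stated.

```python
import math
from collections import Counter

def tfidf_vectors(code_sequences):
    """Turn each patient's claim-code sequence into a TF-IDF vector
    over the vocabulary of all observed claim codes."""
    n_docs = len(code_sequences)
    # Document frequency: in how many patients' histories a code appears.
    df = Counter(code for seq in code_sequences for code in set(seq))
    vocab = sorted(df)
    idf = {c: math.log(n_docs / df[c]) for c in vocab}
    vectors = []
    for seq in code_sequences:
        tf = Counter(seq)
        vectors.append([tf[c] / len(seq) * idf[c] for c in vocab])
    return vocab, vectors
```

A code that appears in every patient's history gets IDF 0, so it contributes nothing to the feature vectors, which is the intended down-weighting of uninformative, ubiquitous codes.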
D. Evaluation of Overall Effectiveness
Table V shows the performance of different methods on the target drugs. For each system, each number is the average performance over the ten target drugs. Several observations can be made.

The proposed HTNNR method consistently achieves the best performance on all metrics. One reason is that the hierarchical attention model differentiates the relationships of claim codes within an encounter from the relationships of encounters within a claim history, and uses different neural networks, Bi-LSTM and TLSTM respectively, to capture their different characteristics. The comparison systems, on the other hand, model the input as a single sequence of claim codes for each patient. Furthermore, the attention layers in HTNNR give higher weights to important claim codes and important encounters.

We also observe that the performance differences on precision and recall are much bigger than those on AUC and accuracy. It is relatively easy for a model to perform well on AUC and accuracy with imbalanced data. AUC represents the model's overall classification ability across thresholds; it does not reflect well the effect of the minority class. Even if a method mis-classifies most or all of the minority class, its AUC can still be high. Similarly, for imbalanced data, a model that always predicts the majority label obtains good accuracy. In our case, the target drug list has about 80% negative labels, so most methods perform similarly on AUC and accuracy; high AUC and accuracy can be misleading on imbalanced data. Achieving high precision and recall, on the other hand, is much more challenging. In the following, we focus the analysis on precision and recall.
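The point about misleading accuracy on imbalanced labels can be checked with a toy computation, assuming the roughly 80% negative rate mentioned above:

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# With ~80% negative labels, a trivial model that always predicts the
# majority (negative) class already scores 0.80 accuracy while
# recalling none of the positive (ADE) cases.
y_true = [0] * 80 + [1] * 20
y_majority = [0] * 100
acc = accuracy(y_true, y_majority)
recall = sum(p for t, p in zip(y_true, y_majority) if t == 1) / 20
```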
E. Evaluation on a Single Drug
Table V shows the average results over the 10 drugs. We now zoom in on a single drug. We randomly select one drug from the target drug list, GPI-3320003010, and evaluate the performance of the comparison systems and HTNNR on its ADE risk prediction, as shown in Figure 6. We show only precision and recall, as the differences in accuracy and AUC are similar to those in Table V. Several things are worth mentioning. First, the HTNNR model performs better than the comparison systems, consistent with the evaluation shown in Table V. We also observe that the improvement in recall is higher than that in precision. The hierarchical framework helps to find more shared ADE characteristics among the drugs, but at the same time introduces more noisy information. Thus recall benefits more than precision from training data drawn from multiple drugs.
Fig. 6. Evaluation on GPI-3320003010 (Metoprolol Tartrate)
F. Evaluation of Patient Claim History Length
All the results shown so far take as input each patient's entire medical claim history before taking the target drug. To evaluate how the length of patient claim history impacts personalized ADE prediction, we evaluated performance with different lengths of medical history considered. Figure 7 shows how performance varies with the length of each patient's medical history used to train HTNNR. The medical history always ends at the time the target drug is recorded in the claim history, with the duration counted backward. The results show that using 3 months of claim history yields better performance than using 1 month, since the model can benefit from a larger dataset. Beyond 3 months, the longer the history considered, the better the recall and the worse the precision. The reason is that longer history helps the model find more characteristics of patients and potential drug interactions, but at the same time introduces more noisy information. In a real application, the history length can be adjusted depending on which metric is more important.
Fig. 7. Evaluation on Different Length of Claim History
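Varying the history length amounts to truncating each patient's encounter sequence to a window that ends at the drug date. A minimal sketch follows; a month is approximated as 30 days, and the representation of encounters as (date, codes) pairs is hypothetical.

```python
from datetime import timedelta

def truncate_history(encounters, drug_date, months):
    """Keep only encounters within `months` (x 30 days) before the
    drug date; the window always ends when the target drug is recorded."""
    start = drug_date - timedelta(days=30 * months)
    return [(d, codes) for d, codes in encounters if start <= d < drug_date]
```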
To summarize, HTNNR achieves the best effectiveness on all evaluation metrics among all methods tested.

VII. CONCLUSIONS AND FUTURE WORK

In this paper, we studied how to use patient claims history for personalized ADE risk prediction. We proposed the HTNNR model, which captures the characteristics of claim codes and their relationships within a hierarchical framework. The first layer generates embeddings for claim codes and then produces a vector for each encounter using a Bi-LSTM model with an attention layer. The second layer takes the sequence of encounter vectors as input and uses a time-aware neural network to generate a claim history representation that captures the temporal order of encounters. The empirical evaluation shows that the proposed HTNNR model is effective and efficient, especially for rare drugs.

Since claim history is updated on a daily basis, as future work we will investigate how to incrementally train the model on newly available information without retraining from scratch every time.

REFERENCES

[1] J. R. Nebeker, P. Barach, and M. H. Samore, "Clarifying adverse drug events: a clinician's guide to terminology, documentation, and reporting," Annals of Internal Medicine, vol. 140, no. 10, pp. 795–801, 2004.
[2] K. B. Sonawane, N. Cheng, and R. A. Hansen, "Serious adverse drug events reported to the FDA: analysis of the FDA adverse event reporting system 2006-2014 database," Journal of Managed Care & Specialty Pharmacy.
[5] Drug Safety, vol. 29, no. 5, pp. 385–396, 2006.
[6] J.-L. Montastruc, A. Sommet, H. Bagheri, and M. Lapeyre-Mestre, "Benefits and strengths of the disproportionality analysis for identification of adverse drug reactions in a pharmacovigilance database," British Journal of Clinical Pharmacology, vol. 72, no. 6, p. 905, 2011.
[7] S. J. Evans, P. C. Waller, and S. Davis, "Use of proportional reporting ratios (PRRs) for signal generation from spontaneous adverse drug reaction reports," Pharmacoepidemiology and Drug Safety, vol. 10, no. 6, pp. 483–486, 2001.
[8] P. LePendu, S. V. Iyer, A. Bauer-Mehren, R. Harpaz, J. M. Mortensen, T. Podchiyska, T. A. Ferris, and N. H. Shah, "Pharmacovigilance using clinical notes," Clinical Pharmacology & Therapeutics, vol. 93, no. 6, pp. 547–555, 2013.
[9] A. Henriksson, M. Kvist, H. Dalianis, and M. Duneld, "Identifying adverse drug event information in clinical notes with distributional semantic representations of context," Journal of Biomedical Informatics, vol. 57, pp. 333–349, 2015.
[10] G. Wang, K. Jung, R. Winnenburg, and N. H. Shah, "A method for systematic discovery of adverse drug events from clinical notes," Journal of the American Medical Informatics Association, vol. 22, no. 6, pp. 1196–1204, 2015.
[11] K. Haerian, D. Varn, S. Vaidya, L. Ena, H. Chase, and C. Friedman, "Detection of pharmacovigilance-related adverse events using electronic health records and automated methods," Clinical Pharmacology & Therapeutics, vol. 92, no. 2, pp. 228–234, 2012.
[12] C. McMaster, D. Liew, C. Keith, P. Aminian, and A. Frauman, "A machine-learning algorithm to optimise automated adverse drug reaction detection from clinical coding," Drug Safety, vol. 42, no. 6, pp. 721–725, 2019.
[13] J. M. Bos, G. A. Kalkman, H. Groenewoud, P. M. van den Bemt, P. A. De Smet, J. E. Nagtegaal, A. Wieringa, G. J. van der Wilt, and C. Kramers, "Prediction of clinically relevant adverse drug events in surgical patients," PLoS ONE, vol. 13, no. 8, p. e0201645, 2018.
[14] R. Liu, M. D. M. AbdulHameed, K. Kumar, X. Yu, A. Wallqvist, and J. Reifman, "Data-driven prediction of adverse drug reactions induced by drug-drug interactions," BMC Pharmacology and Toxicology, vol. 18, no. 1, p. 44, 2017.
[15] G. Jiang, H. Liu, H. R. Solbrig, and C. G. Chute, "Mining severe drug-drug interaction adverse events using semantic web technologies: a case study," BioData Mining, vol. 8, no. 1, p. 12, 2015.
[16] J. D. Stein, F. Lum, P. P. Lee, W. L. Rich III, and A. L. Coleman, "Use of health care claims data to study patients with ophthalmologic conditions," Ophthalmology, vol. 121, no. 5, pp. 1134–1141, 2014.
[17] C. Wang, X.-J. Guo, J.-F. Xu, C. Wu, Y.-L. Sun, X.-F. Ye, W. Qian, X.-Q. Ma, W.-M. Du, and J. He, "Exploration of the association rules mining technique for the signal detection of adverse drug events in spontaneous reporting systems," PLoS ONE, vol. 7, no. 7, p. e40561, 2012.
[18] J. M. Reps, U. Aickelin, J. Ma, and Y. Zhang, "Refining adverse drug reactions using association rule mining for electronic healthcare data." IEEE, 2014, pp. 763–770.
[19] H. Z. Lo, W. Ding, and Z. Nazeri, "Mining adverse drug reactions from electronic health records." IEEE, 2013, pp. 1137–1140.
[20] R. Harpaz, S. Vilar, W. DuMouchel, H. Salmasian, K. Haerian, N. H. Shah, H. S. Chase, and C. Friedman, "Combing signals from spontaneous reports and electronic health records for detection of adverse drug reactions," Journal of the American Medical Informatics Association, vol. 20, no. 3, pp. 413–419, 2013.
[21] P. Dias, A. Penedones, C. Alves, C. F. Ribeiro, and F. B. Marques, "The role of disproportionality analysis of pharmacovigilance databases in safety regulatory actions: a systematic review," Current Drug Safety, vol. 10, no. 3, pp. 234–250, 2015.
[22] A. B. Chapman, K. S. Peterson, P. R. Alba, S. L. DuVall, and O. V. Patterson, "Detecting adverse drug events with rapidly trained classification models," Drug Safety, vol. 42, no. 1, pp. 147–156, 2019.
[23] B. Dandala, V. Joopudi, and M. Devarakonda, "Adverse drug events detection in clinical notes by jointly modeling entities and relations using neural networks," Drug Safety, vol. 42, no. 1, pp. 135–146, 2019.
[24] H. Kwak, M. Lee, S. Yoon, J. Chang, S. Park, and K. Jung, "Drug-disease graph: Predicting adverse drug reaction signals via graph neural network with clinical data," in Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 2020, pp. 633–644.
[25] T. Islam, N. Hussain, S. Islam, and A. Chakrabarty, "Detecting adverse drug reaction with data mining and predicting its severity with machine learning." IEEE, 2018, pp. 1–5.
[26] A. Valeanu, C. Damian, C. D. Marineci, and S. Negres, "The development of a scoring and ranking strategy for a patient-tailored adverse drug reaction prediction in polypharmacy," Scientific Reports, vol. 10, no. 1, pp. 1–11, 2020.
[27] S. R. Walter, R. O. Day, B. Gallego, and J. I. Westbrook, "The impact of serious adverse drug reactions: a population-based study of a decade of hospital admissions in New South Wales, Australia," British Journal of Clinical Pharmacology, vol. 83, no. 2, pp. 416–426, 2017.
[28] C. M. Hohl, A. Karpov, L. Reddekopp, and J. Stausberg, "ICD-10 codes used to identify adverse drug events in administrative data: a systematic review," Journal of the American Medical Informatics Association, vol. 21, no. 3, pp. 547–557, 2014.
[29] M. M. Alvim, L. A. da Silva, I. C. G. Leite, and M. S. Silvério, "Adverse events caused by potential drug-drug interactions in an intensive care unit of a teaching hospital," Revista Brasileira de Terapia Intensiva, vol. 27, no. 4, p. 353, 2015.
[30] D. Guthrie, B. Allison, W. Liu, L. Guthrie, and Y. Wilks, "A closer look at skip-gram modelling," in LREC, vol. 6, 2006, pp. 1222–1225.
[31] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[32] A. Graves, A.-r. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks." IEEE, 2013, pp. 6645–6649.
[33] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, "Hierarchical attention networks for document classification," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 1480–1489.
[34] I. M. Baytas, C. Xiao, X. Zhang, F. Wang, A. K. Jain, and J. Zhou, "Patient subtyping via time-aware LSTM networks," in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017, pp. 65–74.
[35] E. Hoffer, T. Ben-Nun, I. Hubara, N. Giladi, T. Hoefler, and D. Soudry, "Augment your batch: better training with larger batches," arXiv preprint arXiv:1901.09335, 2019.
[36] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[37] A. Gottlieb, R. Hoehndorf, M. Dumontier, and R. B. Altman, "Ranking adverse drug reactions with crowdsourcing," Journal of Medical Internet Research, vol. 17, no. 3, p. e80, 2015.
[38] S. Khan, H. Rahmani, S. A. A. Shah, and M. Bennamoun, "A guide to convolutional neural networks for computer vision," Synthesis Lectures on Computer Vision, vol. 8, no. 1, pp. 1–207, 2018.
[39] W. Yin, K. Kann, M. Yu, and H. Schütze, "Comparative study of CNN and RNN for natural language processing," arXiv preprint arXiv:1702.01923, 2017.
[40] A. Liaw, M. Wiener et al., "Classification and regression by randomForest," R News, vol. 2, no. 3, pp. 18–22, 2002.
[41] M. Jabreel and A. Moreno, "EiTAKA at SemEval-2018 Task 1: An ensemble of n-channels ConvNet and XGBoost regressors for emotion analysis of tweets," arXiv preprint arXiv:1802.09233, 2018.
[42] X. Ren, H. Guo, S. Li, S. Wang, and J. Li, "A novel image classification method with CNN-XGBoost model," in International Workshop on Digital Watermarking. Springer, 2017, pp. 378–390.
[43] W. Zhang, T. Yoshida, and X. Tang, "A comparative study of TF*IDF, LSI and multi-words for text classification,"