Explicit-Blurred Memory Network for Analyzing Patient Electronic Health Records
Prithwish Chakraborty ([email protected]), Center for Computational Health, IBM Research, Yorktown Heights, NY, USA
Fei Wang ([email protected]), Department of Healthcare Policy and Research, Weill Cornell Medicine, Cornell University, New York, NY, USA
Jianying Hu ([email protected]), Center for Computational Health, IBM Research, Yorktown Heights, NY, USA
Daby Sow ([email protected]), Center for Computational Health, IBM Research, Yorktown Heights, NY, USA
ABSTRACT
In recent years, we have witnessed an increased interest in temporal modeling of patient records from large-scale Electronic Health Records (EHR). While simpler RNN models have been used for such problems, memory networks, which in other domains were found to generalize well, are underutilized. Traditional memory networks involve diffused and non-linear operations in which the influence of past events on outputs is not readily quantifiable. We posit that this lack of interpretability makes such networks ill-suited for EHR analysis. While networks with explicit memory have been proposed recently, the discontinuities imposed by their discrete operations make such networks harder to train and require more supervision. The problem is further exacerbated in the limited-data setting of EHR studies. In this paper, we propose a novel memory architecture that is more interpretable than traditional memory networks while being easier to train than explicit memory banks. Inspired by well-known models of human cognition, we propose partitioning the external memory space into (a) a primary explicit memory block that stores exact replicas of recent events to support interpretation, followed by (b) a secondary blurred memory block that accumulates salient aspects of past events dropped from the explicit block as higher-level abstractions, allowing training with less supervision by stabilizing the gradients. We apply the model to 3 learning problems on ICU records from the MIMIC III database spanning millions of data points. Our model performs comparably to the state of the art while also, crucially, enabling ready interpretation of the results.
CCS CONCEPTS
• Applied computing → Health informatics; • Computing methodologies → Artificial intelligence; Machine learning algorithms.

KEYWORDS
electronic health records, neural networks, memory networks
ACM Reference Format:
Prithwish Chakraborty, Fei Wang, Jianying Hu, and Daby Sow. 2020. Explicit-Blurred Memory Network for Analyzing Patient Electronic Health Records. In DSHealth '20, August 24, 2020, San Diego, CA. ACM, New York, NY, USA, 6 pages.
In this new era of Big Data, large volumes of patient medical data are continuously being collected and are becoming increasingly available for research. Intelligent analysis of such large-scale medical data can uncover valuable insights complementary to existing medical knowledge and improve the quality of care delivery. Among the various kinds of medical data available, longitudinal Electronic Health Records (EHR), which comprehensively capture patient health information over time, have proven to be one of the most important data sources for such studies. EHRs are routinely collected from clinical practice, and the richness of the information they contain provides significant opportunities to apply AI techniques to extract nuggets of insight. Over the years, many researchers have postulated various temporal models of EHR for tasks such as early identification of heart failure [13], readmission prediction [15], and acute kidney injury prediction [14]. For such analysis to be of practical use, the models should provide support for generating interpretations or post-hoc explanations. While the necessary properties of interpretations / explanations are still being debated [7], it is generally desirable to ascertain the importance of past events on model predictions at a particular time point. Furthermore, despite their initial success, RNN model applications for EHR also suffer from the inherent difficulty of identifying and controlling the temporal contents that should be memorized by these RNN models.

Contemporaneously, we have also witnessed tremendous architectural advances for temporal models aimed at better generalization capabilities. In particular, memory networks [2, 3, 12] are an exciting class of architectures that aim to separate the process of learning the operations and the operands by using an external block of memory to memorize past events from the data. Such networks have been extensively applied to different problems and were found to generalize well [3]. However, there have been only a limited number of applications of memory networks for clinical data modeling [9, 10]. One of the primary obstacles is the inherently difficult problem of identifying important past events, due to the diffused manner in which such networks store past events in memory. While [4, 5] have explored the possibility of using explicit memories that can store past events exactly, with varying degrees of success, such models are difficult to train. The discontinuities arising from the discrete operations either necessitate learning with high levels of supervision, such as REINFORCE with appropriate reward shaping, or are handled via stochastic reparameterization under annealing routines that suffer from high variance in gradients.

In this paper, we propose
EBmRNN: a novel explicit-blurred memory architecture for longitudinal EHR analysis. Our model is inspired by the well-known Atkinson-Shiffrin model of human memory [1]. Our key contributions are as follows:
• We propose a partitioning of the external memory of generic memory networks into a blurred-explicit memory architecture that supports better interpretability and can be trained with limited supervision.
• We evaluate the model over 3 classification problems on longitudinal EHR data. Our results show EBmRNN achieves accuracies comparable to state-of-the-art architectures.
• We discuss the support for interpretations inherent in EBmRNN and analyze the same over the different tasks.
Model:
Memory networks are a special class of Recurrent Neural Networks that employ external memory banks to store computed results. The separation between operands and operators provided by such architectures has been shown to increase network capacity and/or help generalization over datasets. However, the involved operations are in general highly complex and render such networks very difficult to interpret. Our proposed architecture is shown in Figure 1. The architecture is inspired by the Atkinson-Shiffrin model of cognition and is composed of three parts:
Figure 1: EBmRNN architecture: the memory controller processes observations and can choose to store them discretely in the explicit memory block or as diffused higher-level abstractions in the blurred memory.

• a controller (e.g., an LSTM network) that processes inputs sequentially and produces a candidate memory representation at each time point $t$, along with control vectors to manage the external memory. Mathematically, it can be expressed as follows:

$$[k^E_t, k^B_t, m_t, e_t], h_t = \mathrm{RNN}(h_{t-1}, x_t, r_{t-1}) \quad (1)$$

• an 'explicit' memory bank, where the generated candidate memory representation is stored. Depending on the outputs of a controlling read gate, the candidate memory can be stored explicitly or passed on to the blurred memory. When it is stored explicitly and the bank is already full, an older memory is removed based on its information content and passed on to the blurred memory bank. To update the memory explicitly, we discretely select the index by making use of the Gumbel-Softmax trick as shown below:

$$u_t = \alpha^E u_{t-1} + (1 - \alpha^E) w^{r,E}_t, \quad \gamma_t = \sigma\left(a^T_\gamma h_t + b_\gamma\right), \quad w^{w,E}_t = \text{Gumbel-Softmax}\left(-(\tilde{w}^{w,E}_t + \gamma_t u_t)\right) \quad (2)$$

where $u_t \in \mathbb{R}^{N_E}$ is a network-learnt usage estimate, $\tilde{w}^{w,E}_t$ is a content-based retention weight (detailed in the Appendix), $\alpha^E$ is a hyper-parameter capturing the effect of current reads on the slots, and $w^{w,E}_t$ is a one-hot encoded weight vector over memory slots.

• The memory passed on to the blurred memory bank is diffused according to the control vectors and stored as high-level concepts.

To generate outputs at time $t$, the architecture makes use of a read gate to select the memories stored in explicit and blurred memory that are useful at that time point:

$$g_t = \sigma\left(a^{B,T}_g r^B_t + a^{E,T}_g r^E_t + b_g\right), \quad r_t = \mathrm{ReLU}\left((1 - g_t) W^B_r r^B_t + g_t W^E_r r^E_t + b_r\right) \quad (3)$$

where $B$ and $E$ are the blurred and explicit memories, $g_t \in \mathbb{R}$ is the read gate output, and $r_t \in \mathbb{R}^D$ is the final output. The full model description is presented in the Appendix.
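As a concrete sketch of the gated read in Equation 3, the following PyTorch module combines the explicit and blurred read vectors. The class name, tensor shapes, and the folding of the bias terms into the linear layers are our own illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class GatedMemoryRead(nn.Module):
    """Gated combination of explicit and blurred memory reads (Eq. 3)."""

    def __init__(self, D: int):
        super().__init__()
        # a_g^B / a_g^E score the two read vectors; the gate bias b_g is
        # folded into the first linear layer.
        self.a_g_B = nn.Linear(D, 1)
        self.a_g_E = nn.Linear(D, 1, bias=False)
        # W_r^B / W_r^E project the reads; the output bias b_r is folded in.
        self.W_r_B = nn.Linear(D, D)
        self.W_r_E = nn.Linear(D, D, bias=False)

    def forward(self, r_B: torch.Tensor, r_E: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.a_g_B(r_B) + self.a_g_E(r_E))  # g_t in (0, 1)
        return torch.relu((1 - g) * self.W_r_B(r_B) + g * self.W_r_E(r_E))
```

Averaging the logged values of $g_t$ over time quantifies how much the model leaned on the explicit bank, which is exactly the analysis performed later in Figure 3.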
Experimental setup: We evaluated the performance of the proposed EBmRNN on the publicly available MIMIC III (Medical Information Mart for Intensive Care) data set [8]. The data includes vital signs, medications, laboratory measurements, observations, and clinical notes. For this paper, we focused on the structured data fields and followed the MIMIC III benchmark proposed in [6] to construct cohorts for 3 specific learning tasks of great interest to the critical care community, namely 'in-hospital mortality', 'decompensation', and 'phenotype' classification. To estimate the effectiveness of the EBmRNN scheme, we compared it with the following baseline algorithms: Logistic Regression using the features used in [6], Long Short Term Memory (LSTM) networks, and Gated Recurrent Unit (GRU) networks. We also looked at a variant of EBmRNN that does not have access to blurred memory, hereafter referred to as EmRNN. Comparison with EmRNN allows the training to proceed via a direct path to explicit memories and hence estimate their effect more accurately. EmRNN is completely interpretable, while EBmRNN is interpretable to the limit allowed by the complexities of the problem. Details on the exact cohort definitions and constructions are provided in [6]. More details on the tasks are also presented in the Appendix.
Data description:
The dataset for each of the tasks is described below:
In Hospital Mortality Prediction:
This task is a classification problem where the learning algorithm is asked to predict mortality using the first 48 hours of data collected on the patient for each ICU stay. All ICU stays for which the length of stay is unknown or less than 48 hours have been discarded from the study. Following exactly the benchmark cohort construction proposed in [6], we were left with 17903 ICU stays for training and 3236 ICU stays for testing.

Table 1: Performance comparison for classification tasks on test dataset.

model   In-Hospital mortality   Decompensation   Phenotype          Phenotype
        AUC-ROC                 AUC-ROC          AUC-ROC (macro)    AUC-ROC (micro)
LR      0.8485                  0.8678           0.7385             0.7995
LSTM    0.8542                  0.8927           —                  —

Figure 2: Case Study: Explicit memory slot utilization (memory slots plotted over time) to store events for separate patients for the 3 tasks, using 8 slots for the memory: (a) In Hospital Mortality, (b) Decompensation, (c) Phenotyping. Each slot is annotated by the time index of the event stored in memory. Memory utilization patterns exhibit long-term dependency modeling.

Figure 3: Case Study: Influence of explicit memory over time for the 3 tasks, one patient per task and 4 hops of memory per read: (a) In Hospital Mortality, (b) Decompensation, (c) Phenotyping. The legends indicate the explicit memory influence for each of the hops. Influence patterns vary across tasks, indicating the task complexity as well as the modeling flexibility of EBmRNN.

Decompensation Prediction:
This task is a binary classification problem. Decompensation is synonymous with a rapid deterioration of health, typically linked to very serious complications and prompting "track and trigger" initiatives by the medical staff. There are many ways to define decompensation. We adopt the approach used in [6] and represent the decompensation outcome as a binary variable indicating whether the patient will die in the next 24 hours. Consequently, data for each patient is labeled every hour with this binary outcome variable. The resulting data set for this task consists of 2,908,414 training instances and 523,208 testing instances as reported in [6], with a decompensation rate of 2.06%.
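As an illustration of this hourly labeling rule, a small sketch is shown below; the function name and the representation of the time of death are our own assumptions, not the benchmark's actual code [6].

```python
from typing import Optional
import numpy as np

def decompensation_labels(n_hours: int, death_hour: Optional[float]) -> np.ndarray:
    """Hourly binary decompensation targets for one ICU stay.

    death_hour is the hour of in-hospital death relative to ICU
    admission, or None if the patient survived the stay.
    """
    hours = np.arange(n_hours)
    if death_hour is None:
        return np.zeros(n_hours, dtype=int)
    # Label is 1 when death occurs within the next 24 hours.
    return ((hours <= death_hour) & (death_hour - hours <= 24.0)).astype(int)
```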
Phenotyping:
This task is a multi-label classification problem where the learning algorithm attempts to classify 25 common ICU conditions, including 12 critical ones such as respiratory failure and sepsis, and 8 chronic comorbidities such as diabetes and metabolic disorders. This classification is performed at the ICU stay level, resulting in 35,621 instances in the training set and 6,281 instances in the testing set.

For each patient, the input data consists of an hourly vector of features containing average vital signs (e.g., heart rate, diastolic blood pressure), health assessment scores (e.g., Glasgow Coma Scale), and various patient-related demographics.
All the models were trained for 100 epochs. We used the recommended settings for the baseline methods from [6]. In this paper, we wanted to understand the relative importance of the memory banks, and as such chose to study how the network uses the two different memory banks under similar capacity conditions. For EmRNN and EBmRNN, hyperparameters such as the memory size were chosen to keep capacity comparable across models. While $\alpha^E$ can be learned during the training process, following past work we used a fixed value of 0.7. We chose a 2-layered GRU with dropout as our controller, and the models were trained using SGD with momentum (0.9) along with gradient clipping.
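As a concrete illustration of this training setup, here is a minimal PyTorch sketch. Only the 2-layer GRU controller, SGD with momentum 0.9, and gradient clipping come from the text; the input and hidden sizes, learning rate, dropout rate, and clipping threshold are illustrative assumptions.

```python
import torch

# A stand-in for the full EBmRNN model; in our experiments the controller
# itself is a 2-layer GRU with dropout.
model = torch.nn.GRU(input_size=76, hidden_size=128,
                     num_layers=2, dropout=0.3, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

def training_step(loss: torch.Tensor) -> None:
    optimizer.zero_grad()
    loss.backward()
    # Clip gradients; the discrete reads/writes to the explicit bank make
    # training sensitive to occasional large gradients.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```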
Table 1 shows the AUC-ROC for the different tasks. Overall, we note that EBmRNN is on par with or outperforms each of the baselines for each of the tasks. Song et al. [11] found success with a multi-layered, large transformer-based model, which can be considered the state of the art among all architectures. It is interesting to note that our results, using a single layer of memory, are comparable to the many-layered transformer approach, thus indicating the efficiency of the proposed architecture. In the subsequent paragraphs, we discuss the key insights derived from the experiments.
How to interpret EBmRNN? To analyze the interpretability inherent in the model, we picked a patient for each of the tasks under consideration. We used a trained model with 8 slots and allowing 4 reads to generate the predictions. As mentioned before, the explicit memory allows complete traceability of inputs by storing each input in a distinct memory slot. Figure 2 depicts the contents of the explicit memory over time, discretized by 1 hour. Such slot utilization patterns provide an insight into the contents recognized by the network as being important for the task at hand. Furthermore, the plots also show that the model is able to remember, explicitly, far-off time points for an extended period before caching them into the blurred memory space.
How to interpret the influence of explicit memory? In addition to exact memory contents, we can also analyze the importance of the explicit memory for specific tasks by analyzing the control for the read gate $g_t$ over time. Figure 3 shows the temporal progression of the read gate for the 3 patients from the previous analysis across the three distinct tasks. Interestingly, we can see the model using different patterns of usage for different tasks. While the network assigns almost equal importance to both banks for in-hospital mortality, it places high importance on explicit memory for phenotyping. This also correlates with the improved performance of EmRNN on the phenotyping task.
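In practice, this analysis only requires logging the scalar gate $g_t$ at every time step and hop during a forward pass and summarizing it afterwards; a small sketch, assuming such a (T, K) trace tensor has been recorded:

```python
import torch

def explicit_influence(gate_trace: torch.Tensor) -> torch.Tensor:
    """Mean read-gate value per memory hop.

    gate_trace: tensor of shape (T, K) holding g_t for T time steps and
    K hops. Values near 1 mean the model leaned on the explicit bank;
    values near 0 mean it leaned on the blurred bank.
    """
    return gate_trace.mean(dim=0)
```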
Why do we need the blurred memory? Given the interpretability provided by the explicit memory, it may be tempting to avoid the use of blurred memory in favor of EmRNN. As our results indicate, such a model can perform well for certain tasks. However, for tasks such as "in-hospital mortality", the blurred memory provides the network with additional capacity. Also, from a practical point of view, we found EmRNN difficult to train: in spite of the Gumbel-Softmax reparameterization trick, the gradients frequently exploded and required higher supervision. On the other hand, the presence of the blurred bank helped the training by providing a more tractable path. If the use case demands a higher value for interpretability, we recommend either using a smaller blurred memory bank or performing relative regularization of the read gates for the blurred component.
In this work, we have introduced EBmRNN, a memory network architecture able to mimic human memory models by combining sensory, explicit, and long-term memories for classification tasks. The proposed scheme achieves state-of-the-art levels of performance while being more interpretable, especially when explicit memories are utilized more. Our future work will aim at presenting such interpretations via an end-to-end system following a user-centered design approach.
REFERENCES
[1] R. C. Atkinson and R. M. Shiffrin. 1968. Human memory: A proposed system and its control processes. In K. W. Spence and J. T. Spence (Eds.), The Psychology of Learning and Motivation: Advances in Research and Theory (Vol. 2). 89–105.
[2] Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural Turing Machines. CoRR abs/1410.5401. http://dblp.uni-trier.de/db/journals/corr/corr1410.html
[3] Nature.
[4] CoRR abs/1701.08718 (2017). arXiv:1701.08718 http://arxiv.org/abs/1701.08718
[5] Caglar Gulcehre, Sarath Chandar, Kyunghyun Cho, and Y. Bengio. 2018. Dynamic Neural Turing Machine with Continuous and Discrete Addressing Schemes. Neural Computation 30 (2018), 1–28. https://doi.org/10.1162/neco_a_01060
[6] Hrayr Harutyunyan, Hrant Khachatrian, David C. Kale, and Aram Galstyan. 2017. Multitask Learning and Benchmarking with Clinical Time Series Data. CoRR abs/1703.07771. http://dblp.uni-trier.de/db/journals/corr/corr1703.html
[7] In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 3543–3556.
[8] Alistair E.W. Johnson, Tom J. Pollard, Lu Shen, Li-wei H. Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data.
[9] In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 1637–1645.
[10] Aaditya Prakash, Siyuan Zhao, Sadid A. Hasan, Vivek V. Datla, Kathy Lee, Ashequl Qadir, Joey Liu, and Oladimeji Farri. 2017. Condensed Memory Networks for Clinical Diagnostic Inferencing. In AAAI. 3274–3280.
[11] Huan Song, Deepta Rajan, Jayaraman J. Thiagarajan, and Andreas Spanias. 2018. Attend and diagnose: Clinical time series analysis using attention models. In Thirty-Second AAAI Conference on Artificial Intelligence.
[12] Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. 2015. End-To-End Memory Networks. In Proceedings of the International Conference on Neural Information Processing Systems (NIPS). arXiv:1503.08895
[13] Jimeng Sun et al. 2012. Combining knowledge and data driven insights for identifying risk factors using electronic health records. In AMIA Annual Symposium Proceedings, Vol. 2012. American Medical Informatics Association, 901.
[14] Nenad Tomašev, Xavier Glorot, Jack W. Rae, Michal Zielinski, Harry Askham, Andre Saraiva, Anne Mottram, Clemens Meyer, Suman Ravuri, Ivan Protsyuk, et al. 2019. A clinically applicable approach to continuous prediction of future acute kidney injury. Nature.
[15] PLoS ONE.
Appendices
A MODEL DESCRIPTION
A.1 Explicit-Blurred Memory Augmented RNN
Let us denote the sequence of observations as $x = x_1, x_2, \cdots, x_T$, where $T$ is the length of the sequence and $x_t \in \mathbb{R}^U$. Similarly, let us denote the set of desired outputs as $y = y_1, y_2, \cdots, y_T$, $y_t \in \mathbb{R}^V$. To model $y$ from $x$, $x$ is fed sequentially to the proposed EBmRNN with parameters and hyper-parameters that will be defined below.
In EBmRNN, we split the conventional memory network architecture into two banks: (a) an explicit memory bank ($E$) and (b) a blurred or diffused memory bank ($B$). Figure 1 shows a high-level overview of the EBmRNN cell at time $t$. This cell has access to an explicit memory bank $E \in \mathbb{R}^{N_E \times D}$ to persist past events discretely, where $N_E$ denotes the capacity of the memory and $D$ is the dimensionality of each memory slot. The cell also has access to a blurred or diffused memory $B \in \mathbb{R}^{N_B \times D}$ where abstractions of important salient features from past observations are stored.

Observations at time $t$ are fed to this recurrent cell to produce an output $r_t \in \mathbb{R}^D$ based on the current input $x_t$ and the external explicit and blurred memories $E$ and $B$. $r_t$ summarizes information extracted from both $E$ and $B$ that is deemed relevant for the generation of the output $y_t$. $r_t$ is designed to contain enough abstraction of past observations seen by EBmRNN, including the current input $x_t$, so that specific tasks can generate a desired $y_t$ using only a shallow network outside of the cell. This design choice helps the interpretability of the model as it facilitates linking $y_t$ to memories in $E$ pointing explicitly to inputs $x_t$, while still retaining the expressiveness of a blurred memory. Analyzing how EBmRNN uses $E$ provides a natural way to track how attentive EBmRNN is to input data stored in $E$, while analyzing EBmRNN's focus on $B$ enables us to track the importance of long-term dependencies. Details on how $r_t$ is computed are presented in the next subsection.

In addition to $E$ and $B$, there are three primary components controlling the functioning of the cell:
(1) The controller ($C$), which senses inputs to EBmRNN and maps these inputs into control signals for the management of all read and write operations to the memory banks.
(2) The read gate, controlling read accesses to the memory banks from control signals emitted by the controller.
(3) The write gate, controlling writes into the memory banks from control signals emitted by the controller.
In the remainder of this section, we describe these three components in detail.
A.1.1 The Controller.
At each time point $t$, the controller receives the current input $x_t$ and generates several outputs to manage $E$ and $B$ with appropriate read and write instructions sent to the read and write gates. As it receives $x_t$, the controller updates its hidden state $h_t \in \mathbb{R}^C$ based on the past output of the cell $r_{t-1}$, its past hidden state $h_{t-1}$, and the current input $x_t$. In addition to updating its hidden state $h_t$, the controller emits two keys $k^E_t \in \mathbb{R}^D$ and $k^B_t \in \mathbb{R}^D$ to be used by the read gate to control access to memory contents from $E$ and $B$. To control write operations, the controller also produces $m_t \in \mathbb{R}^D$, a representation of $x_t$ that will be consumed by the write gate. $m_t$ represents information from $x_t$ that is a candidate for a write into $E$ and $B$. The controller also produces $e_t \in \mathbb{R}^D$, an erase weight vector that will be consumed by the write gate to forget content from $B$. In this work, we model the controller with standard recurrent neural network architectures such as Gated Recurrent Units (GRU) or Long Short Term Memory networks (LSTM). The operations of the controller are summarized below:

$$[k^E_t, k^B_t, m_t, e_t], h_t = \mathrm{RNN}(h_{t-1}, x_t, r_{t-1}) \quad (4)$$
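A minimal sketch of such a controller, assuming a GRU cell with one linear head per control signal; the head names and the sigmoid applied to the erase vector are our assumptions:

```python
import torch
import torch.nn as nn

class Controller(nn.Module):
    """GRU controller emitting the control signals of Eq. 4."""

    def __init__(self, U: int, C: int, D: int):
        super().__init__()
        self.rnn = nn.GRUCell(U + D, C)  # consumes [x_t; r_{t-1}]
        self.k_E = nn.Linear(C, D)       # read key for the explicit bank
        self.k_B = nn.Linear(C, D)       # read key for the blurred bank
        self.m = nn.Linear(C, D)         # candidate memory content m_t
        self.e = nn.Linear(C, D)         # erase vector e_t for the blurred bank

    def forward(self, x_t, r_prev, h_prev):
        h_t = self.rnn(torch.cat([x_t, r_prev], dim=-1), h_prev)
        return (self.k_E(h_t), self.k_B(h_t), self.m(h_t),
                torch.sigmoid(self.e(h_t)), h_t)
```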
A.1.2 The Read Gate and Read Operations. The read gate enforces read accesses from $E$ and $B$ by consuming $k^E_t$ and $k^B_t$ and comparing these keys against the contents of the two memory banks $E$ and $B$. Using this addressing scheme, weight vectors over the memories are computed as follows:

$$w^{r,B}_t = \mathrm{Softmax}(S(k^B_t, B_{t-1})), \quad w^{r,E}_t = \text{Gumbel-Softmax}(S(k^E_t, E_{t-1})) \quad (5)$$

where $S$ denotes an appropriate distance function between the key vectors and the memory locations. For our purpose, we use the cosine similarity measure as a distance function. $w^{r,B}_t \in \mathbb{R}^{N_B}$ and $w^{r,E}_t \in \mathbb{R}^{N_E}$. To ensure discrete access, the $w^{r,E}_t$ weights are required to be one-hot encoded vectors. While Softmax is a natural choice for soft selection of indices for $w^{r,B}_t$, it is not applicable for the hard selection required for $w^{r,E}_t$. Gumbel-Softmax is a newer paradigm that is applicable in this context, compared to alternatives like top-K Softmax that can introduce discontinuities. Gumbel-Softmax uses a stochastic re-parameterization scheme to avoid the non-differentiabilities that arise from making discrete choices during normal model training. We use the straight-through optimization procedure, which allows the network to make discrete decisions on the forward pass while estimating the gradients on the backward pass using Gumbel-Softmax. More details on this scheme can be found in [4].

The read vectors $r^E_t$ and $r^B_t$ from each of the banks are computed as follows:

$$r^B_t = w^{r,B}_t B_{t-1}, \quad r^E_t = w^{r,E}_t E_{t-1} \quad (6)$$

Both $r^B_t$ and $r^E_t$ belong to $\mathbb{R}^D$. We combine the two content reads from the two banks using a gate as follows:

$$g_t = \sigma\left(a^{B,T}_g r^B_t + a^{E,T}_g r^E_t + b_g\right), \quad r_t = \mathrm{ReLU}\left((1 - g_t) W^B_r r^B_t + g_t W^E_r r^E_t + b_r\right) \quad (7)$$

$g_t \in \mathbb{R}$ while $r_t \in \mathbb{R}^D$. The final output from EBmRNN can then be produced by a shallow layer that combines the contributions from the two memory banks represented by $r_t$:

$$y_t = \mathrm{Softmax}\left(W_y r_t + b_y\right) \quad (8)$$

Equation 7 ensures that the network can learn to produce its desired output $y_t$ using information from either memory bank. The gated value $g_t$ controls the relative effect of the blurred and explicit memories on the output. On one hand, higher average values of $g_t$ ensure that the network relies more on explicit memories and is as such easier to interpret. On the other hand, lower values of $g_t$ cause the network to rely more on blurred memories and be harder to interpret. Depending on the learning task at hand, there can be an interesting trade-off between learning performance and interpretability that is controlled by this gating scheme. In fact, one could introduce a hyper-parameter in Equation 7 to control this trade-off between $W^B_r$ and $W^E_r$. The read operations are repeated $K$ times to generate $K$ hops from the memory.
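A sketch of the content-based read addressing in Equations 5 and 6, using PyTorch's straight-through Gumbel-Softmax (hard=True picks one slot on the forward pass while gradients follow the soft relaxation); the function name, shapes, and temperature default are assumptions:

```python
import torch
import torch.nn.functional as F

def read_memories(k_E, k_B, E, B, tau: float = 1.0):
    """Content-based read addressing (Eqs. 5-6).

    k_E, k_B: (D,) read keys; E: (N_E, D) explicit bank; B: (N_B, D)
    blurred bank.
    """
    s_E = F.cosine_similarity(k_E.unsqueeze(0), E, dim=-1)  # (N_E,)
    s_B = F.cosine_similarity(k_B.unsqueeze(0), B, dim=-1)  # (N_B,)
    # hard=True: one-hot selection forward, soft gradients backward.
    w_E = F.gumbel_softmax(s_E, tau=tau, hard=True)
    w_B = F.softmax(s_B, dim=-1)
    return w_E @ E, w_B @ B  # r_t^E, r_t^B
```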
A.1.3 The Write Gate and Write Operations. Once memories are read, the controller updates the memory banks for the next state. At each time point, the controller generates the memory representation $m_t$ for the input $x_t$. The update strategies for the two banks are slightly different, and we start by describing the explicit bank update first.

Explicit memory update:
As long as the explicit bank is not full, newer memories $m_t$ are simply appended to it, and the update equation can be given as:

$$E_t = [E_{t-1}; m_t] \quad (9)$$

Once the entire memory is filled up, the network needs to learn to forget less important memory slots to generate a filtered explicit memory $\tilde{E}_{t-1}$ and update the memory following Equation 9. From an information-theoretic intuition, more information can be retained by the network by sustaining a higher entropy within the memory banks. The network learns the importance of the old memories with respect to the new candidate memory content $m_t$ as follows:

$$\tilde{w}^{w,E}_t = \mathrm{Softmax}(-S(m_t, E_{t-1})) \quad (10)$$

$\tilde{w}^{w,E}_t \in \mathbb{R}^{N_E}$. Equation 10 only uses the content to generate the importance of the memory locations. Specifically, interpreting the values of $\tilde{w}^{w,E}_t$ in terms of retention probabilities, locations with dissimilar contents will have higher retention probability, thereby forcing the network to store discriminative content in the explicit memory.

Past research has also shown that usage-based addressing can significantly improve the expressiveness of the network. We follow the scheme proposed by [4] and make use of an auxiliary variable $u_t$ that tracks a moving average of past read values for each memory location of $E$. The final write vector along with the usage update is given as:

$$u_t = \alpha^E u_{t-1} + (1 - \alpha^E) w^{r,E}_t, \quad \gamma_t = \sigma\left(a^T_\gamma h_t + b_\gamma\right), \quad w^{w,E}_t = \text{Gumbel-Softmax}\left(-(\tilde{w}^{w,E}_t + \gamma_t u_t)\right) \quad (11)$$

$u_t \in \mathbb{R}^{N_E}$, $\gamma_t \in \mathbb{R}$, $w^{w,E}_t \in \mathbb{R}^{N_E}$. $\alpha^E$ is a hyper-parameter capturing the effect of current reads on the slots. Although other addressing mechanisms have been proposed in the literature, we chose this setting for model simplicity and also to better capture the desirable properties of EHR applications.

The explicit bank $E$ is then updated by removing the slot with the highest value of $w^{w,E}_t$ ($\hat{m}_t$ from slot $j$) and replacing its content with $m_t$. At that time, we also reset the usage value for the slot (i.e., $u_t[j] = 0$). Since $w^{w,E}_t$ is a one-hot encoded vector, the equations for the popped memory, and subsequently the update of the explicit memory, are given as below:

$$\hat{m}_t = E^T_{t-1} w^{w,E}_t, \quad E_t = E_{t-1} \circ (\mathbf{1}_{N_E \times D} - w^{w,E}_t \mathbf{1}^T_D) + w^{w,E}_t m^T_t \quad (12)$$

where $\mathbf{1}_{N_E \times D}$ represents an $N_E \times D$ matrix of all ones and $\mathbf{1}_D$ represents the same for a $D$-dimensional vector.

Blurred memory update: The blurred memories are used to represent past events with more abstract concepts that can capture long-term dependencies. The memory bank $B$ provides a place for memories forgotten from the explicit bank to be stored in a more abstract sense. $B$ also allows EBmRNN to track and access a higher dimensional construct of the current memory representation. We generate a candidate blurred memory using the following equation:

$$f_t = \sigma(W^i_f m_t + W^E_f \hat{m}_t + b_f), \quad m^B_t = \mathrm{ReLU}\left((1 - f_t) W^i m_t + f_t W^E \hat{m}_t + b_m\right) \quad (13)$$

We generate write vectors $w^{w,B}_t$ using a formulation similar to Equation 10 by replacing the Gumbel-Softmax with a Softmax. The final update equation for the blurred memory can then be given as follows:

$$w^{w,B}_t = \mathrm{Softmax}(S(m^B_t, B_{t-1})), \quad B_t = B_{t-1} \circ (\mathbf{1}_{N_B \times D} - w^{w,B}_t e^T_t) + w^{w,B}_t (m^B_t)^T \quad (14)$$

where $e_t$ is the erase vector emitted by the controller.
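A sketch of the evict-and-replace update of Equations 10-12 for a full explicit bank; the returned evicted content is what would then be blended into the blurred bank via Equations 13-14. Function and variable names are our own, and the usage reset mirrors the $u_t[j] = 0$ step described above.

```python
import torch
import torch.nn.functional as F

def write_explicit(E, m_t, u_t, gamma_t, tau: float = 1.0):
    """Evict-and-replace update for a full explicit bank (Eqs. 10-12).

    E: (N_E, D) bank; m_t: (D,) candidate content; u_t: (N_E,) read-usage
    trace; gamma_t: scalar usage weight. Slots whose content resembles
    m_t and that are rarely read are the most likely to be evicted.
    """
    # Retention weights: dissimilar slots get higher values (Eq. 10).
    w_tilde = F.softmax(-F.cosine_similarity(m_t.unsqueeze(0), E, dim=-1), dim=-1)
    # One-hot eviction choice via straight-through Gumbel-Softmax (Eq. 11).
    w_w = F.gumbel_softmax(-(w_tilde + gamma_t * u_t), tau=tau, hard=True)
    m_hat = w_w @ E                       # popped content, bound for B
    E_new = E * (1 - w_w.unsqueeze(-1)) + w_w.unsqueeze(-1) * m_t  # Eq. 12
    u_new = u_t * (1 - w_w)               # reset usage of the replaced slot
    return E_new, m_hat, u_new
```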