Clinical Information Extraction via Convolutional Neural Network
Peng Li
Computer Science and Engineering, University of Texas at Arlington, [email protected]
Heng Huang
Computer Science and Engineering, University of Texas at Arlington, [email protected]
Abstract
We report an implementation of a clinical information extraction tool that leverages deep neural networks to annotate event spans and their attributes from raw clinical notes and pathology reports. Our approach uses context words, their part-of-speech tags, and shape information as features. We then employ a temporal (1D) convolutional neural network to learn hidden feature representations. Finally, we use a Multilayer Perceptron (MLP) to predict event spans. The empirical evaluation demonstrates that our approach significantly outperforms the baselines.
Introduction

In the past few years, there has been much interest in applying neural-network-based deep learning techniques to all kinds of natural language processing (NLP) tasks, from low-level tasks such as language modeling, POS tagging, named entity recognition, and semantic role labeling (Collobert et al., 2011; Mikolov et al., 2013), to high-level tasks such as machine translation, information retrieval, and semantic analysis (Kalchbrenner and Blunsom, 2013; Socher et al., 2011a; Tai et al., 2015), as well as sentence relation modeling tasks such as paraphrase identification and question answering (Socher et al., 2011b; Iyyer et al., 2014; Yin and Schutze, 2015). Deep representation learning has demonstrated its importance for these tasks: all of them obtain performance improvements by learning either word-level or sentence-level representations.

In this work, we bring deep representation learning technologies to the clinical domain. Specifically, we focus on clinical information extraction, using clinical notes and pathology reports from the Mayo Clinic. Our system identifies event expressions consisting of the following components:

• The spans (character offsets) of the expression in the raw text
• Contextual modality: ACTUAL, HYPOTHETICAL, HEDGED or GENERIC
• Degree: MOST, LITTLE or N/A
• Polarity: POS or NEG
• Type: ASPECTUAL, EVIDENTIAL or N/A

The input of our system consists of raw clinical notes or pathology reports like the one below:
April 23, 2014: The patient did not have any postoperative bleeding so we will resume chemotherapy with a larger bolus on Friday even if there is slight nausea.
The output is a set of annotations over the text that capture the key information, such as event mentions and attributes. Table 1 illustrates the output of clinical information extraction in detail.

Clinical Note: April 23, 2014: The patient did not have any postoperative bleeding so we will resume chemotherapy with a larger bolus on Friday even if there is slight nausea.

Event Mention    Event Attributes
bleeding         type=N/A, polarity=NEG, degree=N/A, modality=ACTUAL
resume           type=ASPECTUAL, polarity=POS, degree=N/A, modality=ACTUAL
chemotherapy     type=ASPECTUAL, polarity=POS, degree=N/A, modality=ACTUAL
bolus            type=ASPECTUAL, polarity=POS, degree=N/A, modality=ACTUAL
nausea           type=ASPECTUAL, polarity=POS, degree=N/A, modality=HYPOTHETICAL

Table 1: An example of information extraction from a clinical note.

To solve this task, the major challenge is how to precisely identify the spans (character offsets) of the event expressions in raw clinical notes. Traditional machine learning approaches usually build a supervised classifier with features generated by the Apache clinical Text Analysis and Knowledge Extraction System (cTAKES), a natural language processing system for the extraction of information from electronic medical record clinical free text. For example, the BluLab system (Velupillai et al., 2015) extracted morphological (lemma), lexical (token), and syntactic (part-of-speech) features encoded by cTAKES. Although such domain-specific information extraction tools can improve performance, learning how to use them well for clinical-domain feature engineering is still very time-consuming. In short, a simple and effective method that leverages only basic NLP modules yet achieves high extraction performance is desirable to save costs.

To address this challenge, we propose a method based on deep neural networks, specifically a convolutional neural network (Collobert et al., 2011), that learns hidden feature representations directly from raw clinical notes.
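To make the target representation concrete, one event mention from Table 1 could be modeled as the following record. This is a hedged sketch: the class and field names are illustrative, not the Anafora schema the corpus actually uses.

```python
from dataclasses import dataclass

@dataclass
class EventAnnotation:
    """One event mention: its character span plus the four attributes."""
    begin: int        # character offset of the first character
    end: int          # character offset one past the last character
    event_type: str   # ASPECTUAL, EVIDENTIAL or N/A
    polarity: str     # POS or NEG
    degree: str       # MOST, LITTLE or N/A
    modality: str     # ACTUAL, HYPOTHETICAL, HEDGED or GENERIC

note = ("April 23, 2014: The patient did not have any postoperative "
        "bleeding so we will resume chemotherapy with a larger bolus "
        "on Friday even if there is slight nausea.")

# The "bleeding" row of Table 1, located by its character offsets.
start = note.find("bleeding")
bleeding = EventAnnotation(begin=start, end=start + len("bleeding"),
                           event_type="N/A", polarity="NEG",
                           degree="N/A", modality="ACTUAL")
```

Recovering the surface form by slicing the note with the stored offsets is exactly the exact-span criterion the evaluation uses later.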
More specifically, our method first extracts a window of surrounding words for each candidate word. We then attach to each word its part-of-speech tag and shape information as extra features. Next, our system deploys a temporal convolutional neural network to learn hidden feature representations. Finally, our system uses a Multilayer Perceptron (MLP) to predict event spans. Note that we use the same model to predict event attributes.
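These two feature extraction steps can be sketched as follows. The helper names and padding token are ours, not the paper's; the shape mapping follows the x/X/d convention the paper describes later.

```python
def word_shape(token):
    """Abstract letter pattern: lower-case -> "x", upper-case -> "X",
    digits -> "d"; punctuation is retained unchanged."""
    return "".join(
        "x" if ch.islower() else
        "X" if ch.isupper() else
        "d" if ch.isdigit() else ch
        for ch in token)

def context_window(tokens, i, size, pad="<PAD>"):
    """Symmetric window of `size` tokens on each side of position i,
    padded at the sentence boundaries."""
    padded = [pad] * size + tokens + [pad] * size
    return padded[i:i + 2 * size + 1]
```

For example, `word_shape("April")` yields "Xxxxx" and `word_shape("23,")` yields "dd,"; each token in a window then contributes its mention, POS tag, and shape as three discrete features.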
The major advantage of our system is that we leverage only NLTK tokenization and a POS tagger to preprocess our training dataset. When implementing our neural-network-based clinical information extraction system, we found that it is not easy to construct high-quality training data due to the noisy format of clinical notes. Choosing the proper tokenizer is quite important for span identification. After several experiments, we found that "RegexpTokenizer" matches our needs. This tokenizer can generate spans for each token via a sophisticated regular expression like the one below:

nltk.tokenize.RegexpTokenizer(r"\w+|\$[\d\.]+|\S+")

We then use "PerceptronTagger" as our part-of-speech tagger due to its fast tagging speed. Note that when extracting context words, make sure to deploy the same tokenization module instead of just splitting strings by whitespace.
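Because exact character offsets matter here, the span behavior of the tokenizer above can be reproduced with the standard-library `re` module alone. This is a sketch equivalent to NLTK's `RegexpTokenizer.span_tokenize`, useful to see exactly which offsets are produced.

```python
import re

# The same pattern the paper passes to NLTK's RegexpTokenizer:
# word runs, dollar amounts, or any other run of non-whitespace.
PATTERN = re.compile(r"\w+|\$[\d\.]+|\S+")

def span_tokenize(text):
    """Return (start, end) character offsets for each token."""
    return [(m.start(), m.end()) for m in PATTERN.finditer(text)]

spans = span_tokenize("slight nausea.")
tokens = ["slight nausea."[s:e] for s, e in spans]
```

Note that the trailing period becomes its own token with its own span, which is what allows event mentions like "nausea" to be identified with exact offsets.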
Event span identification is the task of extracting the character offsets of event expressions in raw clinical notes. This subtask is quite important because the accuracy of event span identification affects the accuracy of attribute identification. We first run our neural network classifier to identify event spans. Then, given each span, our system tries to identify the attribute values.
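The two-stage procedure can be sketched as below. Here `span_clf` and the per-attribute classifiers are hypothetical stand-ins for the trained neural models; only the control flow is from the paper.

```python
def extract_events(tokens, spans, span_clf, attr_clfs):
    """Stage 1: decide which tokens are event mentions.
    Stage 2: predict each attribute value for every detected span."""
    events = []
    for tok, (begin, end) in zip(tokens, spans):
        if span_clf(tok):
            event = {"span": (begin, end)}
            for name, clf in attr_clfs.items():
                event[name] = clf(tok)
            events.append(event)
    return events

# Toy run with trivial classifiers standing in for the neural models.
toy = extract_events(
    ["no", "bleeding"], [(0, 2), (3, 11)],
    span_clf=lambda tok: tok == "bleeding",
    attr_clfs={"polarity": lambda tok: "NEG"})
```

The cascade makes the dependency explicit: an attribute can only be scored for a span that stage 1 actually found, which is why span errors propagate into attribute accuracy.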
The way we use a temporal convolutional neural network for event span and attribute classification is similar to the approach proposed by (Collobert et al., 2011). Generally speaking, we can consider a word as represented by K discrete features w ∈ D^1 × · · · × D^K, where D^k is the dictionary for the k-th feature. In our scenario, we use three features: token mention, POS tag, and word shape. Word shape features represent the abstract letter pattern of the word by mapping lower-case letters to "x", upper-case letters to "X", and numbers to "d", while retaining punctuation. We associate a lookup table with each feature. Given a word, a feature vector is obtained by concatenating all lookup table outputs. A clinical snippet is thus transformed into a word embedding matrix, which can be fed to further 1-dimensional convolutional neural network and max pooling layers. Below we briefly introduce the core concepts of Convolutional Neural Networks (CNNs).

Temporal Convolution
Temporal convolution applies a one-dimensional convolution over the input sequence. The one-dimensional convolution is an operation between a vector of weights m ∈ R^m and a vector of inputs viewed as a sequence x ∈ R^n. The vector m is the filter of the convolution. Concretely, we think of x as the input sentence and x_i ∈ R as a single feature value associated with the i-th word in the sentence. The idea behind the one-dimensional convolution is to take the dot product of the vector m with each m-gram in the sentence x to obtain another sequence c:

c_j = m^T x_{j−m+1:j}.  (1)

Usually, x_i is not a single value but a d-dimensional word vector, so that x ∈ R^{d×n}. There exist two types of 1-d convolution operations. One was introduced by (Waibel et al., 1989) and is also known as Time Delay Neural Networks (TDNNs). The other was introduced by (Collobert et al., 2011). In a TDNN, the weights m ∈ R^{d×m} form a matrix, and each row of m is convolved with the corresponding row of x. In the (Collobert et al., 2011) architecture, a sequence of length n is represented as:

x_{1:n} = x_1 ⊕ x_2 ⊕ · · · ⊕ x_n,  (2)

where ⊕ is the concatenation operation. In general, let x_{i:i+j} refer to the concatenation of words x_i, x_{i+1}, . . . , x_{i+j}. A convolution operation involves a filter w ∈ R^{hd}, which is applied to a window of h words to produce a new feature. For example, a feature c_i is generated from a window of words x_{i:i+h−1} by:

c_i = f(w · x_{i:i+h−1} + b),  (3)

where b ∈ R is a bias term and f is a non-linear function such as the hyperbolic tangent. This filter is applied to each possible window of words in the sequence {x_{1:h}, x_{2:h+1}, . . . , x_{n−h+1:n}} to produce the feature map:

c = [c_1, c_2, . . . , c_{n−h+1}],  (4)

where c ∈ R^{n−h+1}. We also employ dropout on the penultimate layer with a constraint on the ℓ2-norms of the weight vectors.
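Under the (Collobert et al., 2011) formulation above, the convolution and max-pooling steps can be sketched in plain Python. This is a toy illustration with scalar arithmetic, assuming tanh as the non-linearity f; it is not the actual Lasagne implementation.

```python
import math

def conv1d_feature_map(x, w, b, h):
    """Apply one filter w (length h*d) to every window of h word
    vectors, as in Eqs. (3)-(4): c_i = tanh(w . x_{i:i+h-1} + b)."""
    n = len(x)
    c = []
    for i in range(n - h + 1):
        # Concatenate the h word vectors of the window (Eq. 2).
        window = [v for word in x[i:i + h] for v in word]
        c.append(math.tanh(sum(wi * xi for wi, xi in zip(w, window)) + b))
    return c  # feature map of length n - h + 1

# Toy example: 4 words with d = 2, window h = 2, so len(c) = 3.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
c = conv1d_feature_map(x, w=[1.0, 1.0, 1.0, 1.0], b=0.0, h=2)
pooled = max(c)  # max pooling over time keeps one value per filter
```

With 300 filters as in our configuration, stacking the pooled values yields a fixed-length representation regardless of the input sequence length.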
Dropout prevents co-adaptation of hidden units by randomly dropping a proportion p of the hidden units during forward-backpropagation. That is, given the penultimate layer z = [ĉ_1, . . . , ĉ_m], instead of using

y = w · z + b  (5)

for output unit y in forward propagation, dropout uses

y = w · (z ◦ r) + b,  (6)

where ◦ is the element-wise multiplication operator and r ∈ R^m is a masking vector of Bernoulli random variables with probability p of being 1. Gradients are backpropagated only through the unmasked units. At test time, the learned weight vectors are scaled by p such that ŵ = p w, and ŵ is used to score unseen sentences. We additionally constrain the ℓ2-norms of the weight vectors by rescaling w to have ||w||_2 = s whenever ||w||_2 > s after a gradient descent step.

We use the Clinical TempEval corpus as the evaluation dataset. This corpus is based on a set of 600 clinical notes and pathology reports from cancer patients at the Mayo Clinic. These notes were manually de-identified by the Mayo Clinic to replace names, locations, etc. with generic placeholders, but time expressions were not altered. The notes were then manually annotated with times, events and temporal relations. These annotations include time expression types, event attributes and an increased focus on temporal relations. The event, time and temporal relation annotations were distributed separately from the text using the Anafora standoff format. Table 2 shows the number of documents and event expressions in the training, development and testing portions of the 2016 THYME data.
All of the tasks were evaluated using the standard metrics of precision (P), recall (R) and F1:

P = |S ∩ H| / |S|,   R = |S ∩ H| / |H|,   F1 = (2 · P · R) / (P + R),  (7)

where S is the set of items predicted by the system and H is the set of items manually annotated by humans. Applying these metrics to the tasks only requires a definition of what counts as an "item" for each task. For evaluating the spans of event expressions, items are tuples of character offsets; thus, a system only receives credit for identifying events with exactly the same character offsets as the manually annotated ones. For evaluating the attributes of event expressions, items are tuples of (begin, end, value), where begin and end are character offsets and value is the value given to the relevant attribute; thus, systems only receive credit for an event attribute if they both find an event with the correct character offsets and assign the correct value to that attribute (Bethard et al., 2015).

Category     Train     Dev     Test
Documents      293     147      151
Events       38872   20973    18989

Table 2: Number of documents and event expressions in the training, development and testing portions of the THYME data (http://alt.qcri.org/semeval2016/task12/index.php?id=data).

We want to maximize the likelihood of the correct class, which is equivalent to minimizing the negative log-likelihood (NLL). More specifically, the label ŷ given the inputs x_h is predicted by a softmax classifier that takes the hidden state h_j as input:

p̂_θ(y | x_h) = softmax(W · x_h + b),
ŷ = argmax_y p̂_θ(y | x_h).  (8)

The objective function is then the negative log-likelihood of the true class labels y^k:

J(θ) = −(1/m) Σ_{k=1}^{m} log p̂_θ(y^k | x_h^k) + λ ||θ||_2^2,  (9)

where m is the number of training examples and the superscript k indicates the k-th example.

Hyperparameters
We use the Lasagne deep learning framework (https://github.com/Lasagne/Lasagne). We first initialize our word representations using the publicly available 300-dimensional GloVe word vectors (http://nlp.stanford.edu/projects/glove/). We deploy a CNN model with a kernel width (kw) of 2 and a filter size of 300; the sequence length (seqlen) is determined by the window size, the number of filters is seqlen − kw + 1, the stride is 1, and the pool size is seqlen − filter size + 1. The CNN activation function is the hyperbolic tangent and the MLP activation function is the sigmoid; the MLP hidden dimension is 50. We initialize the CNN weights using a uniform distribution. Finally, by stacking a softmax function on top, we obtain normalized log-probabilities. Training is done through stochastic gradient descent over shuffled mini-batches with the AdaGrad update rule (Duchi et al., 2011). The learning rate is set to 0.05 and the mini-batch size is 100. The model parameters were regularized with per-minibatch L2 regularization.

Table 3 shows the results on the event expression tasks. Our initial submissions RUN 4 and RUN 5 outperformed the memorization baseline on every metric on every task. The precision of our event span identification is close to the maximum report; however, our system obtained lower recall. One of the main reasons is that our training objective function is accuracy-oriented. Table 4 shows the results on the phase 2 subtask.
DocTimeRel
Methods          P       R       F1
Memorize                         0.675
Ours RUN5      0.788   0.788   0.788
Ours RUN6      0.786   0.786   0.786
Median report                    0.724
Max report                       0.843

Table 4: Phase 2: DocTimeRel
Methods          span                  modality              degree                polarity              type
                 P     R     F1        P     R     F1        P     R     F1        P     R     F1        P     R     F1
Memorize         0.878 0.834 0.855     0.810 0.770 0.789     0.874 0.831 0.852     0.812 0.772 0.792     0.855 0.813 0.833
Ours RUN4        0.908 0.842 0.874     0.842 0.780 0.810     0.904 0.838 0.869     0.876 0.812 0.842     0.877 0.813 0.844
Ours RUN5        0.900 0.850 0.874     0.837 0.790 0.813     0.896 0.845 0.870     0.861 0.813 0.836     0.869 0.820 0.844
Median report    0.887 0.846 0.874     0.830 0.780 0.810     0.882 0.838 0.869     0.868 0.813 0.839     0.854 0.813 0.844
Max report       0.915 0.891 0.903     0.866 0.843 0.855     0.911 0.887 0.899     0.900 0.875 0.887     0.894 0.870 0.882

Table 3: System performance comparison. Note that RUN4 means the window size is 4 and RUN5 means the window size is 5.

Conclusion

In this paper, we introduced a new clinical information extraction system that leverages only deep neural networks to identify event spans and their attributes from raw clinical notes. We trained deep-neural-network-based classifiers to extract clinical event spans. Our method attaches to each word its part-of-speech tag and shape information as extra features, and then employs a temporal convolutional neural network to learn hidden feature representations. The experimental results demonstrate that our approach consistently outperforms the existing baseline methods on standard evaluation datasets. Our research shows that we can obtain competitive results without the help of a domain-specific feature extraction toolkit such as cTAKES; we leverage only basic natural language processing modules such as tokenization and part-of-speech tagging. With the help of deep representation learning, we can dramatically reduce the cost of clinical information extraction system development.
References

[Bethard et al. 2015] Steven Bethard, Leon Derczynski, Guergana Savova, James Pustejovsky, and Marc Verhagen. 2015. SemEval-2015 Task 6: Clinical TempEval. In International Workshop on Semantic Evaluation (SemEval-2015).

[Collobert et al. 2011] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537.

[Duchi et al. 2011] John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159.

[Iyyer et al. 2014] Mohit Iyyer, Jordan Boyd-Graber, Leonardo Claudino, Richard Socher, and Hal Daumé III. 2014. A neural network for factoid question answering over paragraphs. In Empirical Methods in Natural Language Processing.

[Kalchbrenner and Blunsom 2013] Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1700–1709, Seattle, Washington, USA. Association for Computational Linguistics.

[Mikolov et al. 2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

[Socher et al. 2011a] Richard Socher, Eric H. Huang, Jeffrey Pennington, Christopher D. Manning, and Andrew Y. Ng. 2011a. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Advances in Neural Information Processing Systems, pages 801–809.

[Socher et al. 2011b] Richard Socher, Jeffrey Pennington, Eric H. Huang, Andrew Y. Ng, and Christopher D. Manning. 2011b. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 151–161. Association for Computational Linguistics.

[Tai et al. 2015] Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075.

[Velupillai et al. 2015] Sumithra Velupillai, Danielle L. Mowery, Samir Abdelrahman, Lee Christensen, and Wendy W. Chapman. 2015. BluLab: Temporal information extraction for the 2015 clinical TempEval challenge. In International Workshop on Semantic Evaluation (SemEval-2015).

[Waibel et al. 1989] Alexander Waibel, Toshiyuki Hanazawa, Geoffrey Hinton, Kiyohiro Shikano, and Kevin J. Lang. 1989. Phoneme recognition using time-delay neural networks. Acoustics, Speech and Signal Processing, IEEE Transactions on, 37(3):328–339.

[Yin and Schutze 2015] Wenpeng Yin and Hinrich Schutze. 2015. MultiGranCNN: An architecture for general matching of text chunks on multiple levels of granularity. In