Cross-type Biomedical Named Entity Recognition with Deep Multi-Task Learning
Xuan Wang*, Yu Zhang, Xiang Ren*, Yuhao Zhang, Marinka Zitnik, Jingbo Shang, Curtis Langlotz and Jiawei Han
Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA; Department of Computer Science, University of Southern California, Los Angeles, CA 90089, USA; School of Medicine, Stanford University, Stanford, CA 94305, USA; Department of Computer Science, Stanford University, Stanford, CA 94305, USA

*To whom correspondence should be addressed.
Abstract
Motivation:
State-of-the-art biomedical named entity recognition (BioNER) systems often require handcrafted features specific to each entity type, such as genes, chemicals and diseases. Although recent studies explored using neural network models for BioNER to free experts from manual feature engineering, the performance remains limited by the available training data for each entity type.
Results:
We propose a multi-task learning framework for BioNER to collectively use the training data of different types of entities and improve the performance on each of them. In experiments on 15 benchmark BioNER datasets, our multi-task model achieves substantially better performance compared with state-of-the-art BioNER systems and baseline neural sequence labeling models. Further analysis shows that the large performance gains come from sharing character- and word-level information among relevant biomedical entities across differently labeled corpora.
Availability:
Our source code is available at https://github.com/yuzhimanhua/lm-lstm-crf.
Contact: [email protected], [email protected].
Supplementary information:
Supplementary data are available at Bioinformatics online.
1 Introduction

Biomedical named entity recognition (BioNER) is one of the most fundamental tasks in biomedical text mining; it aims to automatically recognize and classify biomedical entities (e.g., genes, proteins, chemicals and diseases) in text. BioNER can be used to identify new gene names from text (Smith et al., 2008). It also serves as a primitive step for many downstream applications, such as relation extraction (Cokol et al., 2005) and knowledge base completion (Szklarczyk et al., 2017; Wei et al., 2013; Xie et al., 2013; Szklarczyk et al., 2015).

BioNER is typically formulated as a sequence labeling problem whose goal is to assign a label to each word in a sentence. State-of-the-art BioNER systems often require handcrafted features (e.g., capitalization, prefix and suffix) to be specifically designed for each entity type (Ando, 2007; Leaman and Lu, 2016; Zhou and Su, 2004; Lu et al., 2015). This feature generation process takes the majority of the time and cost of developing a BioNER system (Leser and Hakenberg, 2005), and leads to highly specialized systems that cannot be directly used to recognize new types of entities. The accuracy of the resulting BioNER tools remains a limiting factor in the performance of biomedical text mining pipelines (Huang and Lu, 2015).

Recent NER studies consider neural network models to automatically generate quality features (Chiu and Nichols, 2016; Ma and Hovy, 2016; Lample et al., 2016; Liu et al., 2018). Crichton et al. took each word token and its surrounding context words as input into a convolutional neural network (CNN). Habibi et al. adopted the model from Lample et al. and used word embeddings as input into a bidirectional long short-term memory-conditional random field (BiLSTM-CRF) model. These neural network models free experts from manual feature engineering. However, these models have millions of parameters and require very large datasets to reliably estimate the parameters. This poses a major challenge for biomedicine, where datasets of this scale are expensive and slow to create, and thus neural network models cannot realize their potential performance to the fullest (Camacho et al., 2018). Although neural network models can outperform traditional sequence labeling models (e.g., CRF models (Lafferty et al., 2001)), they are still outperformed by handcrafted feature-based systems in multiple domains (Crichton et al., 2017).

One direction to address the above challenge is to use the labeled data of different entity types to augment the training signals for each of them, as information like word semantics and grammatical structure may be shared across different datasets. However, simply combining all datasets and training one single model over multiple entity types can introduce many false negatives, because each dataset is typically annotated for only one or a few entity types. For example, combining dataset A for gene recognition and dataset B for chemical recognition will result in missing chemical entity labels in dataset A and missing gene entity labels in dataset B. Multi-task learning (MTL) (Collobert and Weston, 2008; Søgaard and Goldberg, 2016) offers a solution to this issue by collectively training a model on several related tasks, so that each task benefits model learning in the other tasks without introducing additional errors.
MTL has been successfully applied in natural language processing (Collobert and Weston, 2008), speech recognition (Deng et al., 2013), computer vision (Girshick, 2015) and drug discovery (Ramsundar et al., 2015), but it has so far seen limited use and limited success in BioNER. Crichton et al. explored MTL with a CNN model for BioNER. However, their model only considers word-level features as input, ignoring character-level lexical information, which is often crucial for modeling biomedical entities (e.g., -ase can be an important subword feature for gene/protein entity recognition). As a result, their best-performing multi-task CNN model does not outperform state-of-the-art systems that rely on handcrafted features (Crichton et al., 2017).

In this paper, we propose a new multi-task learning framework using character-level neural models for BioNER. The proposed framework, despite being simple and not requiring any feature engineering, achieves excellent benchmark performance. Our multi-task model is built upon a single-task neural network model (Liu et al., 2018). In particular, we consider a BiLSTM-CRF model with an additional context-dependent BiLSTM layer for modeling character sequences. A prominent advantage of our multi-task model is that inputs from different datasets can efficiently share both character- and word-level representations by reusing parameters in the corresponding BiLSTM units. We compare the proposed multi-task model with state-of-the-art BioNER systems and baseline neural network models on 15 benchmark BioNER datasets and observe substantially better performance. We further show through detailed experimental analysis on 5 datasets that the proposed approach adds only marginal computational overhead and outperforms strong baseline neural models that do not consider multi-task learning, suggesting that multi-task learning plays an important role in its success. Altogether, this work introduces a new text-mining approach that can help scientists exploit knowledge buried in the biomedical literature in a systematic and unbiased way.
2 Background

Let Φ denote the set of labels indicating whether a word is part of a specific entity type or not. Given a sequence of words w = \{w_1, w_2, ..., w_n\}, the output is a sequence of labels y = \{y_1, y_2, ..., y_n\}, with y_i ∈ Φ. For example, given a sentence "... including the RING1 ...", the output should be "... O O S-GENE ...", in which "O" indicates a non-entity type and "S-GENE" indicates a single-token GENE type.

A long short-term memory (LSTM) neural network is a specific type of recurrent neural network that models dependencies between elements in a sequence through recurrent connections (Fig. 1).

Fig. 1. Architecture of a long short-term memory neural network.

The input to an LSTM network is a sequence of vectors X = \{x_1, x_2, ..., x_T\}, where vector x_i is the representation vector of a word in the input sentence. The output is a sequence of vectors H = \{h_1, h_2, ..., h_T\}, where h_i is a hidden state vector. At step t of the recurrent calculation, the network takes x_t, c_{t-1} and h_{t-1} as inputs and produces c_t and h_t via the following intermediate calculations:

i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)
g_t = \tanh(W_g x_t + U_g h_{t-1} + b_g)
c_t = f_t \odot c_{t-1} + i_t \odot g_t
h_t = o_t \odot \tanh(c_t),

where \sigma(\cdot) and \tanh(\cdot) denote the element-wise sigmoid and hyperbolic tangent functions, respectively, and \odot denotes element-wise multiplication. i_t, f_t and o_t are referred to as the input, forget and output gates, respectively; g_t and c_t are intermediate calculation steps. At t = 1, h_0 and c_0 are initialized to zero vectors. The trainable parameters are W_j, U_j and b_j for j ∈ {i, f, o, g}.

The LSTM architecture described above can only process the input in one direction. The bi-directional long short-term memory (BiLSTM) model improves on the LSTM by feeding the input to the LSTM network twice, once in the original direction and once in the reversed direction. Outputs from both directions are concatenated to represent the final output. This design allows the detection of dependencies on both previous and subsequent words in a sequence.

A naive way of applying the BiLSTM network to sequence labeling is to use the output hidden state vectors to make independent tagging decisions. However, in many sequence labeling tasks such as BioNER, it is useful to also model the dependencies across output tags. The BiLSTM-CRF network adds a conditional random field (CRF) layer on top of a BiLSTM network. This BiLSTM-CRF network takes the input sequence X = \{x_1, x_2, ..., x_n\} and predicts an output label sequence y = \{y_1, y_2, ..., y_n\}. A score is defined as

s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i},

where P is an n × k matrix of the output from the BiLSTM layer, n is the sequence length, k is the number of distinct labels, and A is a (k+2) × (k+2) transition matrix in which A_{i,j} represents the transition probability from the i-th label to the j-th label. The two additional labels mark the start and the end of a sentence, which is why A has k+2 rows and columns.
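To make the scoring function concrete, here is a minimal NumPy sketch that evaluates s(X, y) for a given tag sequence. It is a toy example, not the paper's implementation; the start/end handling simply uses the two extra labels described above:

```python
import numpy as np

def crf_score(P, A, y, start, end):
    """s(X, y) = sum_i A[y_i, y_{i+1}] + sum_i P[i, y_i], with the two extra
    labels `start` and `end` padding the tag sequence (y_0 and y_{n+1})."""
    path = [start] + list(y) + [end]
    emission = sum(P[i, t] for i, t in enumerate(y))          # BiLSTM scores
    transition = sum(A[a, b] for a, b in zip(path, path[1:])) # tag transitions
    return emission + transition

k, n = 3, 5                               # 3 labels, a sentence of 5 words
rng = np.random.default_rng(0)
P = rng.normal(size=(n, k))               # BiLSTM outputs, one row per word
A = rng.normal(size=(k + 2, k + 2))       # transitions incl. start/end labels
y = [0, 2, 1, 1, 0]                       # a candidate tag sequence
print(crf_score(P, A, y, start=k, end=k + 1))
```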
Fig. 2. Architecture of a single-task neural network. The input is a sentence from the biomedical literature. Rectangles denote character and word embeddings; empty round rectangles denote the first, character-level BiLSTM; shaded round rectangles denote the second, word-level BiLSTM; pentagons denote the concatenation units. The tags on top, e.g., 'O' and 'S-GENE', are the output of the final CRF layer, i.e., the entity labels assigned to each word in the sentence.
Fig. 3. Three multi-task learning neural network models. The empty circles denote the character embeddings. The empty round rectangles denote the character-level BiLSTM. The shaded circles denote the word-level embeddings. The shaded round rectangles denote the word-level BiLSTM. The squares denote the CRF layer. (a) MTM-C: multi-task learning neural network with a shared character layer and a task-specific word layer; (b) MTM-W: multi-task learning neural network with a task-specific character layer and a shared word layer; (c) MTM-CW: multi-task learning neural network with shared character and word layers.
We further define Y_X as the set of all possible label sequences given the input sequence X. The training process maximizes the log-probability of the label sequence y given the input sequence X:

\log p(y \mid X) = \log \frac{e^{s(X, y)}}{\sum_{y' \in Y_X} e^{s(X, y')}}.   (1)

A three-layer BiLSTM-CRF architecture is employed by Lample et al. and Habibi et al. to jointly model the word and character sequences in the input sentence. In this architecture, the first BiLSTM layer takes the character embedding sequence of each word as input and produces a character-level representation vector for this word as output. This character-level vector is then concatenated with a word embedding vector and fed into a second BiLSTM layer. Lastly, a CRF layer takes the output vectors from the second BiLSTM layer and outputs the best tag sequence by maximizing the log-probability in Equation 1. In practice, the character embedding vectors are randomly initialized and co-trained during model training, while the word embedding vectors are retrieved directly from a pre-trained word embedding lookup table. The classical Viterbi algorithm is used to infer the final labels for the CRF model. The three-layer BiLSTM-CRF model is a differentiable neural network architecture that can be trained by backpropagation.

3 Methods

3.1 Single-task model

The vanilla BiLSTM-CRF model can learn high-quality representations for words that appeared in the training dataset. However, it often fails to generalize to out-of-vocabulary (OOV) words, i.e., words that did not appear in the training dataset, because they have no pre-trained word embedding. Such OOV words are common in biomedical text (67.21% of the words in the datasets in Table 1). Therefore, for the baseline single-task BioNER model, we use a neural network architecture that better handles OOV words. As shown in Fig. 2, our single-task model consists of three layers. In the first layer, a BiLSTM network models the character sequence of the input sentence, taking character embedding vectors as input. Hidden state vectors at the word boundaries of this character-level BiLSTM are then selected and concatenated with word embedding vectors to form word representations. Next, these word representation vectors are fed into a word-level BiLSTM layer (the upper BiLSTM layer in Fig. 2). Lastly, the output of this word-level BiLSTM is fed into a CRF layer for label prediction. Compared to the vanilla BiLSTM-CRF model, a major advantage of this model is that it can infer the meaning of an out-of-vocabulary word from its character sequence and the characters around it. For example, the model is able to infer that "RING2" likely represents a gene symbol even though the network may have only seen the word "RING1" during training.
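A minimal PyTorch-style sketch of this word-representation step follows. It is simplified and the names are illustrative: the actual implementation of Liu et al. (2018) selects directional states at word starts and ends, while this sketch simply reads the bidirectional output at each word-final character:

```python
import torch
import torch.nn as nn

class CharWordRepresentation(nn.Module):
    def __init__(self, n_chars, n_words, char_dim=30, word_dim=200, char_hidden=100):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)   # co-trained from random init
        self.word_emb = nn.Embedding(n_words, word_dim)   # loaded from pre-trained vectors
        self.char_lstm = nn.LSTM(char_dim, char_hidden,
                                 bidirectional=True, batch_first=True)

    def forward(self, char_ids, word_ids, boundaries):
        # char_ids: (1, chars in sentence); boundaries: last-character index of each word
        char_out, _ = self.char_lstm(self.char_emb(char_ids))  # (1, T_char, 2*char_hidden)
        char_repr = char_out[:, boundaries, :]                 # states at word boundaries
        return torch.cat([char_repr, self.word_emb(word_ids)], dim=-1)

# toy usage: "RING2 binds" -> two words, boundaries at the characters '2' and 's'
model = CharWordRepresentation(n_chars=128, n_words=1000)
char_ids = torch.tensor([[ord(c) % 128 for c in "RING2 binds"]])
word_ids = torch.tensor([[5, 7]])          # hypothetical vocabulary indices
reprs = model(char_ids, word_ids, boundaries=[4, 10])
print(reprs.shape)                          # torch.Size([1, 2, 400])
```

An unseen token such as "RING2" thus still receives an informative representation through its character states, even when its word embedding falls back to an unknown-word vector.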
3.2 Multi-task models

An important characteristic of the BioNER task is the limited availability of supervised training data. We propose a multi-task learning approach that addresses this problem by training different BioNER models on datasets with different entity types while sharing parameters across these models. We hypothesize that the proposed approach can make more efficient use of the data and encourage the models to learn representations of words and characters (which are shared between multiple corpora) in a more effective and generalized way.

We give a formal definition of the multi-task setting as follows. Given m datasets, for i ∈ {1, ..., m}, each dataset D_i consists of n_i training samples, i.e., D_i = \{(w_{ij}, y_{ij})\}_{j=1}^{n_i}. We denote the training matrix for each dataset as X_i = \{x_{i1}, ..., x_{in_i}\}, where x_{ij} is the feature representation of the input word sequence w_{ij}, and the labels for each dataset as y_i = \{y_{i1}, ..., y_{in_i}\}. The model parameters include the word-level BiLSTM parameters \theta_i^w, the character-level BiLSTM parameters \theta_i^c and the output CRF parameters \theta_i^o. A multi-task model therefore consists of m different models, each trained on a separate dataset, while sharing part of the model parameters across datasets. The loss function L of the multi-task model is

L = \sum_{i=1}^{m} \lambda_i L_i = \sum_{i=1}^{m} \lambda_i \log P_{\theta_i^w, \theta_i^c, \theta_i^o}(y_i \mid X_i),

where the log-likelihood term is as in Equation 1 and \lambda_i is a positive hyper-parameter that controls the contribution of each dataset. We observed that our multi-task model achieves very competitive performance with \lambda_i = 1 on all evaluated datasets and therefore use this value in our experiments, although performance could likely be improved by further tuning the \lambda_i values.
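A sketch of one training epoch under this loss is shown below. The names are hypothetical: models[i] wraps the network for dataset D_i, sharing BiLSTM modules across tasks according to the chosen variant, and neg_log_likelihood returns -log P(y | X) from Equation 1:

```python
import torch

def train_epoch(models, loaders, lambdas, optimizer):
    """One multi-task epoch: datasets are used iteratively, and every batch
    updates both the task-specific and the shared parameters."""
    for i, loader in enumerate(loaders):            # D_1, ..., D_m in turn
        for X, y in loader:
            optimizer.zero_grad()
            # minimizing -lambda_i * log P(y|X) maximizes the weighted likelihood L
            loss = lambdas[i] * models[i].neg_log_likelihood(X, y)
            loss.backward()                         # gradients also reach shared layers
            optimizer.step()

# one optimizer over the union of all parameters, e.g.:
# optimizer = torch.optim.SGD({p for m in models for p in m.parameters()}, lr=0.01)
```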
We propose three different multi-task models, as illustrated in Fig. 3. The three models differ in which part of the model parameters (\theta_i^w, \theta_i^c, \theta_i^o) is shared across the datasets.

MTM-C: In this model, the character-level parameters are shared, \theta_i^c = \theta^c. All datasets are used iteratively to train the model. When a dataset is used, the parameters updated during training are \theta^c and \theta_i^w. The detailed architecture of this multi-task model is shown in Fig. 3(a).
MTM-W: In this model, the word-level parameters are shared, \theta_i^w = \theta^w. When a dataset is used, the parameters updated during training are \theta^w and \theta_i^c. The detailed architecture of this multi-task model is shown in Fig. 3(b).
MTM-CW: In this model, both \theta_i^c = \theta^c and \theta_i^w = \theta^w are shared among the tasks, and each dataset has its specific \theta_i^o for label prediction. MTM-CW shares the most information across tasks of the three multi-task models: it enables sharing both character- and word-level information between different biomedical entities, whereas the other two models share only part of this information. The detailed architecture of this multi-task model is shown in Fig. 3(c).

4 Experimental setup

4.1 Datasets

We test our method on the same 15 datasets used by Crichton et al. and find that our model achieves substantially better performance on 14 of them compared with baseline neural network models. Due to space limitations, we report detailed results of the multi-task model on 5 main datasets (Table 1), which together cover the major biomedical entity types (e.g., genes, proteins, chemicals, diseases). We also include full results on all 15 datasets in
Supplementary Material: Performance comparison on 15 datasets. The performance of the multi-task model is slightly different when trained on the 5 datasets compared with when trained on all 15 datasets (shown in Supplementary Material: Performance comparison on 15 datasets), as the MTL model has access to more data.
Table 1. Biomedical NER datasets used in the experiments.

Dataset       | Size             | Entity types (counts)
BC2GM         | 20,000 sentences | Gene/Protein (24,583)
BC4CHEMD      | 10,000 abstracts | Chemical (84,310)
BC5CDR        | 1,500 articles   | Chemical (15,935), Disease (12,852)
NCBI-Disease  | 793 abstracts    | Disease (6,881)
JNLPBA        | 2,404 abstracts  | Gene/Protein (35,336), Cell Line (4,330), DNA (10,589), Cell Type (8,649), RNA (1,069)
In our experiments, we follow the experimental setup of Crichton et al. and divide each dataset into training, development and test sets. We use the training and development sets to train the final model. All datasets are publicly available. As part of preprocessing, word labels are encoded using the IOBES scheme. In this scheme, for example, a word describing a gene entity is tagged with "B-Gene" if it is at the beginning of the entity, "I-Gene" if it is in the middle of the entity, and "E-Gene" if it is at the end of the entity. A single-word gene entity is tagged with "S-Gene", and all other words, not describing entities of interest, are tagged as "O". Next, we briefly describe the 5 main datasets and their corresponding state-of-the-art BioNER systems.
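A small self-contained sketch of this encoding (the span-to-tag helper is illustrative, not part of the paper's pipeline):

```python
def to_iobes(tokens, spans, etype="Gene"):
    """Encode entity spans [(start, end_inclusive), ...] as IOBES tags."""
    tags = ["O"] * len(tokens)
    for s, e in spans:
        if s == e:
            tags[s] = f"S-{etype}"             # single-token entity
        else:
            tags[s] = f"B-{etype}"             # beginning of the entity
            for k in range(s + 1, e):
                tags[k] = f"I-{etype}"         # inside the entity
            tags[e] = f"E-{etype}"             # end of the entity
    return tags

# ['O', 'O', 'B-Gene', 'E-Gene'] for the two-token mention "RING1 protein":
print(to_iobes(["expression", "of", "RING1", "protein"], [(2, 3)]))
```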
BC2GM. The state-of-the-art system reported for the BioCreative II gene mention recognition task adopts a semi-supervised learning method with alternating structure optimization (Ando, 2007).

BC4CHEMD. The state-of-the-art system reported for the BioCreative IV chemical entity mention recognition task is the CHEMDNER system (Lu et al., 2015), which is based on mixed conditional random fields with Brown clustering of words.

BC5CDR. The state-of-the-art system reported for the most recent BioCreative V chemical and disease mention recognition task is the TaggerOne system (Leaman and Lu, 2016), which uses a semi-Markov model for joint entity recognition and normalization.

NCBI-Disease. The NCBI disease corpus was initially introduced for disease name recognition and normalization and has since been widely used in many applications. The state-of-the-art system on this dataset is also the TaggerOne system (Leaman and Lu, 2016).

JNLPBA. The state-of-the-art system (Zhou and Su, 2004) for the 2004 JNLPBA shared task on biomedical entity (gene/protein, DNA, RNA, cell line, cell type) recognition uses a hidden Markov model (HMM). Although this task and model are older than the others, they remain a competitive benchmark for comparison.
4.2 Evaluation metrics

We report the performance of all compared methods on the test sets. We deem each predicted entity correct only if both the entity boundary and the entity type are the same as in the ground-truth annotation (i.e., exact match). We then calculate precision, recall and F1 scores on all datasets, and macro-averaged F1 scores over all entity types. For error analysis, we compare the ratios of false positive (FP) and false negative (FN) labels in the single-task and multi-task models and include the results in Supplementary Material: Error analysis.

The test set of the BC2GM dataset is constructed slightly differently from the test sets of the other datasets: BC2GM additionally provides a list of alternative answers for each entity in the test set. A predicted entity is deemed correct as long as it matches the ground truth or one of the alternative answers. We refer to this measurement as alternative match and report scores under both exact match and alternative match for the BC2GM dataset. (All datasets can be downloaded from https://github.com/cambridgeltl/MTL-Bioinformatics-2016.)
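As a concrete sketch of this exact-match scoring (a hypothetical helper; it assumes entities have already been decoded from the IOBES tags into (sentence_id, start, end, type) tuples):

```python
def precision_recall_f1(gold_entities, pred_entities):
    """Entity-level exact match: a prediction counts only if boundary and
    type both coincide with a ground-truth entity."""
    gold, pred = set(gold_entities), set(pred_entities)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {(0, 2, 3, "Gene"), (1, 0, 0, "Chemical")}
pred = {(0, 2, 3, "Gene"), (1, 0, 1, "Chemical")}  # second span: boundary error
print(precision_recall_f1(gold, pred))             # (0.5, 0.5, 0.5)
```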
Table 2. Performance and average training time of the baseline neural network models and the proposed MTM-CW model. Bold: best scores; *: significantly worse than the MTM-CW model (p ≤ 0.05); **: significantly worse than the MTM-CW model (p ≤ 0.01). The details of the dataset benchmark systems and evaluation methods are described in Sections 4.1 and 4.2, respectively. Columns: Dataset Benchmark, Crichton et al., Lample et al., Habibi et al., Ma and Hovy, STM, MTM-CW; rows: precision, recall and F1 for BC2GM (exact and alternative match), BC4CHEMD, BC5CDR, NCBI-Disease and JNLPBA, plus the average training time in seconds per sentence (Lample et al./Habibi et al. 1.59, Ma and Hovy 0.95, STM 0.71, MTM-CW 0.75). [Per-cell scores are not recoverable from this extraction; see the original article for the full table.]

4.3 Pre-trained word embeddings

We initialize the word embedding matrix with pre-trained word vectors from Pyysalo et al. (2013) in all experiments. These word embeddings are trained using the skip-gram model, as described in Mikolov et al. (2013), on three different corpora: (1) abstracts from the PubMed database, (2) abstracts from PubMed together with full-text articles from PubMed Central (PMC), and (3) the entire PubMed database of abstracts and full-text articles together with the Wikipedia corpus. (The pre-trained word vectors can be downloaded from http://bio.nlplab.org/.) We found that the third set of word vectors leads to the best results on the development set and therefore used it for model development. We provide a full comparison of the different word embeddings in
Supplementary Material: Performance of Word Embeddings. In all experiments, we replace rare words (i.e., words with a frequency of less than 5) with a special unknown-word token.
4.4 Training details

All the neural network models are trained on one GeForce GTX 1080 GPU. To train our neural models, we use a learning rate of 0.01 with a decay rate of 0.05 applied at every epoch of training. The dimensions of the word and character embedding vectors are set to 200 and 30, respectively (Liu et al., 2018). We use a hidden size of 200 (the best performance among 100, 200 and 300) for both the character- and word-level BiLSTM layers. Note that Liu et al. consider advanced strategies, such as highway structures, to further improve performance. We did not observe any significant performance boost from these strategies and thus do not adopt them in this work; the performance of model variations with these strategies can be found in Supplementary Material: Performance of Model Variations. To train the baseline neural network models, we use the default parameter settings reported in the corresponding papers (Lample et al., 2016; Habibi et al., 2017; Ma and Hovy, 2016), as we found that the defaults also lead to almost optimal performance on the development set.
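The exact form of the per-epoch decay is not spelled out here; a plausible reading, following the implementation of Liu et al. (2018), is lr_t = lr_0 / (1 + decay · t). A sketch under that assumption:

```python
base_lr, decay_rate = 0.01, 0.05           # values stated in the paper

def learning_rate(epoch):
    # assumed per-epoch schedule: lr_t = lr_0 / (1 + decay * t)
    return base_lr / (1 + decay_rate * epoch)

print([round(learning_rate(t), 4) for t in (0, 10, 20)])  # [0.01, 0.0067, 0.005]
```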
5 Results

We compare the proposed single-task (Section 3.1) and multi-task (Section 3.2) models with state-of-the-art BioNER systems (reported for each dataset) and three neural network models from Crichton et al., Lample et al. (with Habibi et al.), and Ma and Hovy. The evaluation metrics include precision, recall and F1 score (Tsai et al., 2006) (Table 2). We denote the results of the best system previously reported for each dataset as "Dataset Benchmark". For the method proposed by Crichton et al., we quote their experimental results directly. For the other neural network models, we repeat each experiment three times and report the mean and standard deviation (Table 2). To compare directly with the results in Crichton et al., we measure statistical significance with the same t-test as used in their paper.

We observe that the MTM-CW model achieves significantly higher F1 scores than the state-of-the-art benchmark systems (column Dataset Benchmark in Table 2) on all five datasets. Following established practice in the literature, we use exact match to compare benchmark performance on all datasets except BC2GM, where we report benchmark performance based on alternative match. Furthermore, MTM-CW generally achieves significantly higher F1 scores than the other neural network models. These results show that the proposed multi-task learning neural network significantly outperforms both state-of-the-art systems and other neural networks. In particular, the MTM-CW model consistently outperforms the single-task model, demonstrating that multi-task learning successfully leverages information across the different datasets and mutually enhances performance on each single task. We further investigate the performance of the three multi-task models (MTM-C, MTM-W and MTM-CW; Table 3). The best-performing multi-task model is MTM-CW, indicating the importance of both the morphological information captured by the character-level BiLSTM and the lexical and contextual information captured by the word-level BiLSTM.
Table 3. F1 scores of the three multi-task models proposed in this paper. Bold: best scores; *: significantly worse than the MTM-CW model (p ≤ 0.05); **: significantly worse than the MTM-CW model (p ≤ 0.01). Recoverable MTM-C scores: BC2GM 77.80, BC4CHEMD 88.16, BC5CDR 86.05, NCBI-Disease 82.94, JNLPBA 71.79. [The MTM-W and MTM-CW columns are not recoverable from this extraction; see the original article.]

Fig. 4. Macro-averaged F1 scores of the proposed multi-task model compared with the benchmarks on different entity types. Benchmark refers to the performance of state-of-the-art BioNER systems.
We also conduct a more fine-grained comparison of all models on the four major biomedical entity types: genes/proteins, chemicals, diseases and cell lines, since these are the most frequently annotated entity types (Fig. 4). Each entity type is drawn from multiple datasets: genes/proteins from BC2GM and JNLPBA, chemicals from BC4CHEMD and BC5CDR, diseases from BC5CDR and NCBI-Disease, and cell lines from JNLPBA. The MTM-CW model performs consistently better than the neural network model of Habibi et al. (2017) on all four entity types. It also outperforms the state-of-the-art systems (Benchmark in Fig. 4) on all entity types except cell lines. These results further confirm that the multi-task neural network model achieves significantly better performance than state-of-the-art systems and other neural network models for BioNER.
A biomedical entity dictionary is a manually curated list of entity names that belong to a specific entity type. Traditional BioNER systems make heavy use of such dictionaries in addition to other data. To study whether our approach can benefit from entity dictionaries, we retrieve biomedical entity dictionaries for three entity types (genes/proteins, chemicals and diseases) from the Comparative Toxicogenomics Database (CTD) (Davis et al., 2017). We use these entity dictionaries in a neural network model in two different ways: (1) dictionary post-processing, which matches the 'O'-labeled entities against the dictionary to reduce the false negative rate, or (2) a dictionary feature, which provides additional information about words to the word-level BiLSTM. This dictionary feature indicates whether a word sequence consisting of a word and its neighbors is present in a dictionary; we consider word sequences of up to six words, which adds 21 additional dimensions for each entity type (see the sketch below). We compare the performance of MTM-CW with and without dictionaries (Table 4).
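One plausible reading of the 21 dimensions (an assumption, not the authors' documented feature layout): for every n-gram length from 1 to 6 and every position the current token can occupy inside the n-gram, emit a binary dictionary-membership indicator, giving 1 + 2 + ... + 6 = 21 features per entity type. A sketch:

```python
def dict_features(tokens, i, lexicon, max_len=6):
    """Binary dictionary-match features for tokens[i] (21 per entity type)."""
    feats = []
    for n in range(1, max_len + 1):            # n-gram lengths 1..6
        for offset in range(n):                # position of tokens[i] in the n-gram
            start = i - offset
            if 0 <= start and start + n <= len(tokens):
                ngram = " ".join(tokens[start:start + n]).lower()
                feats.append(int(ngram in lexicon))
            else:
                feats.append(0)                # n-gram falls outside the sentence
    return feats                               # 1 + 2 + ... + 6 == 21 indicators

lexicon = {"hemolytic uremic syndrome"}        # toy disease dictionary
tokens = ["associated", "with", "hemolytic", "uremic", "syndrome"]
assert sum(dict_features(tokens, 3, lexicon)) == 1   # matched as a 3-gram
```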
Table 4. F1 scores of the proposed multi-task model using the CTD entity dictionary. Bold: best scores; *: significantly worse than the MTM-CW model (p ≤ 0.05); **: significantly worse than the MTM-CW model (p ≤ 0.01). Columns: MTM-CW, +Dictionary Feature, +Dictionary Post-processing; rows: BC2GM, BC4CHEMD, BC5CDR (MTM-CW: 88.78), NCBI-Disease, JNLPBA. [The remaining scores are not recoverable from this extraction; see the original article.]

We observe no significant performance improvement when the biomedical entity dictionaries are included in the MTM-CW model at the pre-processing stage. Moreover, including the dictionaries at the post-processing stage even hurts performance, presumably because of the higher false positive rate introduced by the dictionaries when words share a surface name with dictionary entities but do not share the same meaning or entity type. These results indicate that our multi-task model, by sharing information at both the character and word levels, learns effective data representations and generalizes to new data without external lexicon resources.
We compare the average training time (seconds per sentence) of our method on the 5 main datasets with that of the baseline neural models in Table 2. Since our multi-task model is trained on the 5 datasets together, we calculate and compare the average training time over all datasets instead of on each individual one. We find that our single-task neural model STM is the most efficient of the neural models and almost halves the training time (0.71 s/sentence) compared with Lample et al. and Habibi et al. (1.59 s/sentence). Compared to the single-task model STM, our multi-task model MTM-CW achieves an 8.0% overall F1 improvement with only 5.1% additional training time. MTM-CW is slightly slower than STM because it takes a few more epochs to reach convergence when trained on the 5 datasets together.
To investigate the major advantages of the multi-task model over the single-task models, we examine example sentences with predicted labels (Table 5). The true labels and the labels predicted by each model are underlined in each sentence.

One major challenge of BioNER is to recognize a long entity in its entirety. In Case 1, the true gene entity is "endo-beta-1,4-glucanase-encoding genes". The single-task models tend to break this entity into two parts separated by a comma, while the multi-task model detects the gene entity as a whole. This could be due to co-training on multiple datasets containing long entity examples. Another challenge is to detect the correct boundaries of biomedical entities. In Case 2, the correct protein entity is "SMase" in the phrase "SMase - sphingomyelin complex structure". The single-task models recognize the whole phrase as a protein entity, whereas our multi-task model detects the correct right boundary of the protein entity, probably also because it sees more examples from other datasets, which may contain "sphingomyelin" as a non-chemical entity. In Case 3, the modifiers "human" and "complement factor" in front of "H deficiency" should be included as part of the true entity. The single-task models miss these words, while the multi-task model detects the correct left boundary of the disease entity. In summary, the multi-task model deals better with two critical challenges of BioNER: (1) recognizing long entities in their entirety and (2) detecting the correct left and right boundaries of biomedical entities. Both improvements come from collectively training on multiple datasets with different entity types and sharing useful information between the datasets.
Table 5. Case study of prediction results from different models. The true labels and the labels predicted by each model are underlined in each sentence, and a brief summary of the error type is included at the end of each example. [Underlining is not reproduced in this extraction.]
Genes/Proteins
Case 1
True label: This fragment contains two complete endo - beta - 1, 4 - glucanase - encoding genes, designated celCCC and celCCG.
Habibi: This fragment contains two complete endo - beta - 1, 4 - glucanase - encoding genes, designated celCCC and celCCG.
STM: This fragment contains two complete endo - beta - 1, 4 - glucanase - encoding genes, designated celCCC and celCCG.
MTM-CW: This fragment contains two complete endo - beta - 1, 4 - glucanase - encoding genes, designated celCCC and celCCG.
Error: Entity integrity: breaking a long entity into parts loses the entity's integrity.

Case 2
True label: A model for the SMase - sphingomyelin complex structure was built to investigate how the SMase specifically recognizes its substrate.
Habibi: A model for the SMase - sphingomyelin complex structure was built to investigate how the SMase specifically recognizes its substrate.
STM: A model for the SMase - sphingomyelin complex structure was built to investigate how the SMase specifically recognizes its substrate.
MTM-CW: A model for the SMase - sphingomyelin complex structure was built to investigate how the SMase specifically recognizes its substrate.
Error: Right boundary error: false detection of non-entity tokens as part of the true entity.
Diseases
Case 3
True label: ... human complement factor H deficiency associated with hemolytic uremic syndrome.
Habibi: ... human complement factor H deficiency associated with hemolytic uremic syndrome.
STM: ... human complement factor H deficiency associated with hemolytic uremic syndrome.
MTM-CW: ... human complement factor H deficiency associated with hemolytic uremic syndrome.
Error: Left boundary error: failure to detect the correct left boundary of the true entity due to modifier words in front.
6 Conclusion

We proposed a neural multi-task learning approach for biomedical named entity recognition. The proposed approach, despite being simple and not requiring manual feature engineering, outperforms state-of-the-art systems and several strong neural network models on benchmark BioNER datasets. We also showed through detailed analysis that the strong performance is achieved by the multi-task model with only marginally added training time, and confirmed that the large performance gains of our approach come mainly from sharing character- and word-level information between biomedical entity types.

Lastly, we highlight several future directions for improving the multi-task BioNER model. First, combining single-task and multi-task models might be a fruitful direction. Second, by further resolving the entity boundary and type conflict problem, we could build a unified system for recognizing multiple types of biomedical entities with high performance and efficiency.
References
Ando, R. K. (2007). BioCreative II gene mention tagging system at IBM Watson. In Proc. Second BioCreative Chall. Eval. Work., volume 23, pages 101–103.
Camacho, D. M., Collins, K. M., Powers, R. K., Costello, J. C., and Collins, J. J. (2018). Next-generation machine learning for biological networks. Cell, (7), 1581–1592.
Chiu, J. P. C. and Nichols, E. (2016). Named entity recognition with bidirectional LSTM-CNNs. Trans. Assoc. Comput. Linguist., 357–370.
Cokol, M., Iossifov, I., Weinreb, C., and Rzhetsky, A. (2005). Emergent behavior of growing knowledge about molecular interactions. Nat. Biotechnol., (10), 1243–1247.
Collobert, R. and Weston, J. (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proc. 25th ICML, pages 160–167.
Crichton, G., Pyysalo, S., Chiu, B., and Korhonen, A. (2017). A neural network multi-task learning approach to biomedical named entity recognition. BMC Bioinf., (1), 368.
Davis, A. P., Grondin, C. J., Johnson, R. J., Sciaky, D., King, B. L., McMorran, R., Wiegers, J., Wiegers, T. C., and Mattingly, C. J. (2017). The Comparative Toxicogenomics Database: update 2017. Nucleic Acids Res., (D1), D972–D978.
Deng, L., Hinton, G., and Kingsbury, B. (2013). New types of deep neural network learning for speech recognition and related applications: An overview. In Proc. IEEE ICASSP, pages 8599–8603.
Girshick, R. (2015). Fast R-CNN. In Proc. IEEE ICCV, pages 1440–1448.
Habibi, M., Weber, L., Neves, M., Wiegandt, D. L., and Leser, U. (2017). Deep learning with word embeddings improves biomedical named entity recognition. Bioinformatics, (14), i37–i48.
Huang, C.-C. and Lu, Z. (2015). Community challenges in biomedical text mining over 10 years: success, failure and the future. Briefings Bioinf., (1), 132–144.
Lafferty, J., McCallum, A., and Pereira, F. C. N. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. 17th ICML, pages 282–289.
Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., and Dyer, C. (2016). Neural architectures for named entity recognition. In Proc. Conf. NAACL-HLT, pages 260–270.
Leaman, R. and Lu, Z. (2016). TaggerOne: joint named entity recognition and normalization with semi-Markov models. Bioinformatics, (18), 2839–2846.
Leser, U. and Hakenberg, J. (2005). What makes a gene name? Named entity recognition in the biomedical literature. Briefings Bioinf., (4), 357–369.
Liu, L., Shang, J., Xu, F., Ren, X., Gui, H., Peng, J., and Han, J. (2018). Empower sequence labeling with task-aware neural language model. In AAAI, pages 5245–5253.
Lu, Y., Ji, D., Yao, X., Wei, X., and Liang, X. (2015). CHEMDNER system with mixed conditional random fields and multi-scale word clustering. J. Cheminf., (S1), S4.
Ma, X. and Hovy, E. (2016). End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proc. 54th Annu. Meet. ACL, pages 1064–1074.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Adv. NIPS, pages 3111–3119.
Pyysalo, S., Ginter, F., Moen, H., Salakoski, T., and Ananiadou, S. (2013). Distributional semantics resources for biomedical text processing. In Proc. Lang. Biol. Med., pages 39–44.
Ramsundar, B., Kearnes, S. M., Riley, P., Webster, D., Konerding, D. E., and Pande, V. S. (2015). Massively multitask networks for drug discovery. CoRR, abs/1502.02072.
Smith, L., Tanabe, L. K., nee Ando, R. J., Kuo, C.-J., Chung, I.-F., Hsu, C.-N., Lin, Y.-S., Klinger, R., Friedrich, C. M., Ganchev, K., and others (2008). Overview of BioCreative II gene mention recognition. Genome Biol., (S2), S2.
Søgaard, A. and Goldberg, Y. (2016). Deep multi-task learning with low level tasks supervised at lower layers. In Proc. 54th Annu. Meet. ACL, pages 231–235.
Szklarczyk, D., Santos, A., von Mering, C., Jensen, L. J., Bork, P., and Kuhn, M. (2015). STITCH 5: augmenting protein–chemical interaction networks with tissue and affinity data. Nucleic Acids Res.