An Analysis of Simple Data Augmentation for Named Entity Recognition
Xiang Dai (University of Sydney, Sydney, Australia; CSIRO Data61, Sydney, Australia) and Heike Adel (Bosch Center for Artificial Intelligence, Renningen, Germany)
Abstract
Simple yet effective data augmentation techniques have been proposed for sentence-level and sentence-pair natural language processing tasks. Inspired by these efforts, we design and compare data augmentation for named entity recognition, which is usually modeled as a token-level sequence labeling problem. Through experiments on two data sets from the biomedical and materials science domains (i2b2-2010 and MaSciP), we show that simple augmentation can boost performance for both recurrent and transformer-based models, especially for small training sets.
Introduction
Modern deep learning techniques typically require a lot of labeled data (Bowman et al., 2015; Conneau et al., 2017). However, in real-world applications, such large labeled data sets are not always available. This is especially true in specific domains, such as the biomedical and materials science domains, where annotating data requires expert knowledge and is usually time-consuming (Karimi et al., 2015; Friedrich et al., 2020). Different approaches have been investigated to solve this low-resource problem. For example, transfer learning pretrains language representations on self-supervised or rich-resource source tasks and then adapts these representations to the target task (Ruder, 2019; Gururangan et al., 2020). Data augmentation expands the training set by applying transformations to training instances without changing their labels (Wang and Perez, 2017).
Recently, there has been increased interest in applying data augmentation techniques to sentence-level and sentence-pair natural language processing (NLP) tasks, such as text classification (Wei and Zou, 2019; Xie et al., 2019), natural language inference (Min et al., 2020) and machine translation (Wang et al., 2018). Augmentation methods explored for these tasks either create augmented instances by manipulating a few words in the original instance, such as word replacement (Zhang et al., 2015; Wang and Yang, 2015; Cai et al., 2020), random deletion (Wei and Zou, 2019), or word position swap (Şahin and Steedman, 2018; Min et al., 2020); or create entirely artificial instances via generative models, such as variational autoencoders (Yoo et al., 2019; Mesbah et al., 2019) or back-translation models (Yu et al., 2018; Iyyer et al., 2018).
Different from these sentence-level NLP tasks, named entity recognition (NER) makes predictions on the token level. That is, for each token in the sentence, NER models predict a label indicating whether the token belongs to a mention and which entity type the mention has. Therefore, applying transformations to tokens may also change their labels. Due to this difficulty, data augmentation for NER is comparatively less studied. In this work, we fill this research gap by exploring data augmentation techniques for NER, a token-level sequence labeling problem.
Our contributions can be summarized as follows:
1. We survey previously used data augmentation techniques for sentence-level and sentence-pair NLP tasks and adapt some of them for the NER task.
2. We conduct empirical comparisons of different data augmentations using two English domain-specific data sets: MaSciP (Mysore et al., 2019) and i2b2-2010 (Uzuner et al., 2011). Results show that simple augmentation can even improve over a strong baseline with large-scale pretrained transformers.
Related Work
In this section, we survey previously used data augmentation methods for NLP tasks, grouping them into four categories:
Word replacement:
Various word replacement variants have been explored for text classification tasks. Zhang et al. (2015) and Wei and Zou (2019) replace words with one of their synonyms, retrieved from an English thesaurus (e.g., WordNet). Kobayashi (2018) replaces words with other words that are predicted by a language model at the word positions. Xie et al. (2019) replace uninformative words with low TF-IDF scores by other uninformative words for topic classification tasks.
For machine translation, word replacement has also been used to generate additional parallel sentence pairs. Wang et al. (2018) replace words in both the source and the target sentence by other words uniformly sampled from the source and the target vocabularies. Fadaee et al. (2017) search for contexts where a common word can be replaced by a low-frequency word, relying on recurrent language models. Gao et al. (2019) replace a randomly chosen word by a soft word, which is a probabilistic distribution over the vocabulary, provided by a language model.
In addition, there are two special word replacement cases, inspired by dropout and masked language modeling: replacing a word by a zero word (i.e., dropping the entire word embedding) (Iyyer et al., 2015), or by a [MASK] token (Wu et al., 2018).
Mention replacement:
Raiman and Miller (2017) augment a question answering training set using an external knowledge base. In particular, they extract nominal groups in the training set, perform string matching with entities in Wikidata, and then randomly replace them with other entities of the same type. In order to remove gender bias from coreference resolution systems, Zhao et al. (2018) propose to generate an auxiliary dataset where all male entities are replaced by female entities, and vice versa, using a rule-based approach.
Swap words:
Wei and Zou (2019) randomly choose two words in the sentence and swap their positions to augment text classification training sets. Min et al. (2020) explore syntactic transformations (e.g., subject/object inversion) to augment the training data for natural language inference. Şahin and Steedman (2018) rotate tree fragments around the root of the dependency tree to form a synthetic sentence and augment low-resource language part-of-speech tagging training sets.
Generative models:
Yu et al. (2018) train a question answering model with data generated by back-translation from a neural machine translation model. Kurata et al. (2016) and Hou et al. (2018) use a sequence-to-sequence model to generate diversely augmented utterances to improve the dialogue language understanding module. Xia et al. (2019) convert data from a high-resource language to a low-resource language, using a bilingual dictionary and an unsupervised machine translation model, in order to expand the machine translation training set for the low-resource language.
Data Augmentation for NER
Inspired by the efforts described in Section 2, we design several simple data augmentation methods for NER. Note that these augmentations do not rely on any externally trained models, such as machine translation models or syntactic parsing models, which are by themselves difficult to train in low-resource, domain-specific scenarios.
Label-wise token replacement (LwTR):
For each token, we use a binomial distribution to randomly decide whether it should be replaced. If yes, we then use a label-wise token distribution, built from the original training set, to randomly select another token with the same label. Thus, we keep the original label sequence unchanged. Taking the instance in Table 1 as an example, five tokens are replaced by other tokens which share the same label with the original tokens.
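The procedure can be sketched in a few lines of Python. The sketch below is illustrative rather than the exact implementation used in our experiments; it assumes each training instance is a pair of a token list and a BIO-label list.

```python
import random
from collections import defaultdict

def build_label_token_pool(train_set):
    """For every label, collect all tokens that carry that label in the training set."""
    pool = defaultdict(list)
    for tokens, labels in train_set:              # each instance: (token list, BIO-label list)
        for token, label in zip(tokens, labels):
            pool[label].append(token)
    return pool

def label_wise_token_replacement(tokens, labels, pool, p=0.3):
    """Replace each token with probability p by another token that shares its label.
    The label sequence is returned unchanged."""
    new_tokens = [
        random.choice(pool[label]) if random.random() < p else token
        for token, label in zip(tokens, labels)
    ]
    return new_tokens, list(labels)
```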
Synonym replacement (SR):
Our second approach is similar to LwTR, except that we replace the token with one of its synonyms retrieved from WordNet. Note that the retrieved synonym may consist of more than one token. However, its BIO labels can be derived using a simple rule: if the replaced token is the first token within a mention (i.e., the corresponding label is 'B-EntityType'), we assign the same label to the first token of the retrieved multi-word synonym and 'I-EntityType' to the other tokens. If the replaced token is inside a mention (i.e., the corresponding label is 'I-EntityType'), we assign its label to all tokens of the multi-word synonym.
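As an illustration, the following sketch implements this rule with NLTK's WordNet interface; using NLTK is our own assumption here, since the description above only requires that synonyms come from WordNet.

```python
import random
from nltk.corpus import wordnet  # requires a one-time nltk.download('wordnet')

def wordnet_synonyms(token):
    """Return WordNet lemmas for the token, excluding the token itself."""
    lemmas = {l.name().replace('_', ' ') for s in wordnet.synsets(token) for l in s.lemmas()}
    lemmas.discard(token)
    return sorted(lemmas)

def synonym_replacement(tokens, labels, p=0.3):
    """Replace each token with probability p by a (possibly multi-word) WordNet synonym."""
    new_tokens, new_labels = [], []
    for token, label in zip(tokens, labels):
        synonyms = wordnet_synonyms(token)
        if synonyms and random.random() < p:
            syn_tokens = random.choice(synonyms).split()     # synonym may span several tokens
            new_tokens.extend(syn_tokens)
            if label.startswith('B-'):
                # first token keeps 'B-EntityType', the remaining ones become 'I-EntityType'
                new_labels.extend([label] + ['I-' + label[2:]] * (len(syn_tokens) - 1))
            else:
                # 'I-EntityType' (or 'O') is copied to every token of the synonym
                new_labels.extend([label] * len(syn_tokens))
        else:
            new_tokens.append(token)
            new_labels.append(label)
    return new_tokens, new_labels
```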
None (original)  She did not complain of headache or any other neurological symptoms .
                 O O O O O B-problem O B-problem I-problem I-problem I-problem O
LwTR             L. One not complain of headache he any interatrial neurological current .
                 O O O O O B-problem O B-problem I-problem I-problem I-problem O
SR               She did non complain of headache or whatsoever former neurologic symptom .
                 O O O O O B-problem O B-problem I-problem I-problem I-problem O
MR               She did not complain of neuropathic pain syndrome or acute pulmonary disease .
                 O O O O O B-problem I-problem I-problem O B-problem I-problem I-problem O
SiS              not complain She did of headache or neurological any symptoms other .
                 O O O O O B-problem O B-problem I-problem I-problem I-problem O
Table 1: Original training instance and different types of augmented instances. We highlight changes using blue color. Note that LwTR (label-wise token replacement) and SiS (shuffle within segments) change the token sequence only, whereas SR (synonym replacement) and MR (mention replacement) may also change the label sequence.
Mention replacement (MR):
For each mention in the instance, we use a binomial distribution to randomly decide whether it should be replaced. If yes, we randomly select another mention of the same entity type from the original training set as the replacement. The corresponding BIO-label sequence is changed accordingly. For example, in Table 1, the mention 'headache [B-problem]' is replaced by another problem mention, 'neuropathic pain syndrome [B-problem I-problem I-problem]'.
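A minimal sketch of mention replacement, again assuming (tokens, labels) pairs with BIO labels, could look as follows.

```python
import random
from collections import defaultdict

def build_mention_pool(train_set):
    """Collect every mention (token span and label span) per entity type from the training set."""
    pool = defaultdict(list)
    for tokens, labels in train_set:
        i = 0
        while i < len(labels):
            if labels[i].startswith('B-'):
                entity_type, j = labels[i][2:], i + 1
                while j < len(labels) and labels[j] == 'I-' + entity_type:
                    j += 1
                pool[entity_type].append((tokens[i:j], labels[i:j]))
                i = j
            else:
                i += 1
    return pool

def mention_replacement(tokens, labels, pool, p=0.3):
    """Replace each mention with probability p by another mention of the same entity type."""
    new_tokens, new_labels, i = [], [], 0
    while i < len(labels):
        if labels[i].startswith('B-') and random.random() < p:
            entity_type, j = labels[i][2:], i + 1
            while j < len(labels) and labels[j] == 'I-' + entity_type:
                j += 1
            replacement_tokens, replacement_labels = random.choice(pool[entity_type])
            new_tokens.extend(replacement_tokens)
            new_labels.extend(replacement_labels)
            i = j                                  # skip the original mention
        else:
            new_tokens.append(tokens[i])
            new_labels.append(labels[i])
            i += 1
    return new_tokens, new_labels
```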
Shuffle within segments (SiS):
We first split the token sequence into segments of the same label. Thus, each segment corresponds to either a mention or a sequence of out-of-mention tokens. For example, the original sentence in Table 1 is split into five segments: [She did not complain of], [headache], [or], [any other neurological symptoms], [.]. Then, for each segment, we use a binomial distribution to randomly decide whether it should be shuffled. If yes, the order of the tokens within the segment is shuffled, while the label order is kept unchanged.
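The segment-wise shuffling can be sketched as below; the segmentation logic is our own reading of the description above, starting a new segment at every 'B-' label and at every transition into a run of 'O' labels.

```python
import random

def shuffle_within_segments(tokens, labels, p=0.3):
    """Shuffle tokens inside each segment with probability p; the label sequence stays unchanged."""
    # A segment is either a single mention or a maximal run of out-of-mention ('O') tokens.
    segments, current = [], []
    for i, label in enumerate(labels):
        starts_new_segment = label.startswith('B-') or (label == 'O' and (i == 0 or labels[i - 1] != 'O'))
        if starts_new_segment and current:
            segments.append(current)
            current = []
        current.append(tokens[i])
    if current:
        segments.append(current)

    new_tokens = []
    for segment in segments:
        if random.random() < p:
            random.shuffle(segment)
        new_tokens.extend(segment)
    return new_tokens, list(labels)
```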
All:
We also explore augmenting the training set with all of the aforementioned augmentation methods. That is, for each training instance, we create multiple augmented instances, one per augmentation method.
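Assuming the four helper functions from the sketches above are in scope, combining them could look as follows; the function and parameter names are illustrative, not those of any released implementation.

```python
def augment_with_all_methods(train_set, token_pool, mention_pool, p=0.3, copies_per_method=1):
    """Add, for every original instance, `copies_per_method` augmented copies per method."""
    augmented = list(train_set)
    for tokens, labels in train_set:
        for _ in range(copies_per_method):
            augmented.append(label_wise_token_replacement(tokens, labels, token_pool, p))
            augmented.append(synonym_replacement(tokens, labels, p))
            augmented.append(mention_replacement(tokens, labels, mention_pool, p))
            augmented.append(shuffle_within_segments(tokens, labels, p))
    return augmented
```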
                          MaSciP                      i2b2-2010
                          Train     Dev      Test     Train      Dev      Test
Number of sentences       1,901     109      158      13,868     2,447    27,625
Number of tokens          61,750    4,158    4,585    129,087    20,454   267,249
Number of mentions        18,874    1,190    1,259    14,376     2,143    31,161
Number of entity types    21        20       21       3          3        3

Table 2: Descriptive statistics of the data sets.
Experiments and Results
We present an empirical analysis of the data augmentation methods described in Section 3 on two English datasets from the materials science and biomedical domains: MaSciP (Mysore et al., 2019) and i2b2-2010 (Uzuner et al., 2011). MaSciP (https://github.com/olivettigroup/annotated-materials-syntheses) contains synthesis procedures annotated with synthesis operations and their typed arguments (e.g., Material, Synthesis-Apparatus). We use the train-dev-test split provided by the authors. i2b2-2010 (https://portal.dbmi.hms.harvard.edu/) focuses on the identification of Problem, Treatment and Test mentions in patient reports. We use the train-test split from its corresponding shared task setting and randomly select 15% of the sentences from the training set as the development set.
To simulate a low-resource setting, we select the first 50, 150, and 500 sentences that contain at least one mention from the training set to create the corresponding small, medium, and large training sets (denoted as S, M, and L in Table 3; the complete training set is denoted as F) for each data set. Note that we apply data augmentation only to the training set, without changing the development and test sets.
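For illustration, such subsets could be drawn as in the following sketch, which again assumes (tokens, labels) pairs; the exact selection code used in our experiments may differ.

```python
def low_resource_subset(train_set, n_sentences):
    """Take the first n_sentences training instances that contain at least one mention."""
    subset = []
    for tokens, labels in train_set:
        if any(label != 'O' for label in labels):
            subset.append((tokens, labels))
        if len(subset) == n_sentences:
            break
    return subset

# e.g., small, medium, large = (low_resource_subset(train_set, n) for n in (50, 150, 500))
```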
We model NER as a sequence-labeling task. Let $x = \langle x_1, \ldots, x_T \rangle$ be a sequence of $T$ tokens; the model aims to predict a label sequence $y = \langle y_1, \ldots, y_T \rangle$, where each label is composed of a position indicator (e.g., BIO schema) and an entity type. State-of-the-art sequence-labeling models roughly consist of two components: a neural encoder which creates contextualized embeddings $r_i$ for each token, and a conditional random field output layer, which captures dependencies between neighboring labels:
$$\hat{P}(y_{1:T} \mid r_{1:T}) \propto \prod_{i=1}^{T} \psi_i(y_{i-1}, y_i, r_i).$$
We consider two encoder variants in our study: one based on LSTMs (Graves et al., 2013) and one based on BERT (Devlin et al., 2019). The LSTM-based encoder consists of a context-independent token embedding layer (e.g., GloVe (Pennington et al., 2014)) and a bidirectional LSTM layer, whose weights are learned from scratch. The representations $r_i$ are obtained by concatenating the hidden states of the forward and backward LSTMs at each token position. The BERT-based encoder consists of a sub-token embedding layer and a stack of multi-head self-attention and fully connected feed-forward layers. The final hidden state corresponding to the first sub-token within each token is used as the representation $r_i$. Studies on domain-specific BERT models show that effectiveness on downstream tasks can be improved when the BERT models are further pretrained on in-domain data (Gururangan et al., 2020; Dai et al., 2020). We thus choose SciBERT (Beltagy et al., 2019), which is pretrained on scholarly articles, and fine-tune it on the NER task. In our preliminary experiments, we observe that SciBERT achieves significantly better results than BERT (Devlin et al., 2019).
We use the micro-averaged string-match F1 score to evaluate the effectiveness of the models. The model which is most effective on the development set, measured using the F1 score, is finally evaluated on the test set.
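To illustrate the first-sub-token pooling described above, here is a hedged sketch using the Hugging Face transformers API; the SciBERT checkpoint name and the use of this particular library are assumptions for illustration, not a statement about our actual implementation.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Checkpoint name is an assumption; any BERT-style model with a fast tokenizer works the same way.
MODEL_NAME = "allenai/scibert_scivocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

def token_representations(words):
    """Return one vector per word: the final hidden state of its first sub-token."""
    encoding = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**encoding).last_hidden_state[0]        # (num_subtokens, hidden_size)
    representations, seen_words = [], set()
    for position, word_id in enumerate(encoding.word_ids()):
        if word_id is not None and word_id not in seen_words:  # first sub-token of each word
            representations.append(hidden[position])
            seen_words.add(word_id)
    return torch.stack(representations)                        # (num_words, hidden_size)

vectors = token_representations(["She", "did", "not", "complain", "of", "headache", "."])
```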
For each augmentation method, we tune the number of generated instances per training instance over the list {1, 3, 6, 10}. When all data augmentation methods are applied, we reduce this tuning list to {1, 2, 3}, so that the total number of generated instances per original training instance is roughly the same across experiments. We also tune the p value of the binomial distribution which is used to decide whether a token or a mention should be replaced (cf. Section 3). It is searched over the range from 0.1 to 0.7 with a step of 0.2. We perform grid search to find the best combination of these two hyperparameters on the development set.
[Table 3: the numeric results matrix could not be recovered from the extracted text; only the caption is retained.]
Table 3: Evaluation results in terms of span-level F1 score. The Small set contains 50 training instances, Medium contains 150 instances, Large contains 500 instances, and Full uses the complete training set. We repeat all experiments five times with different random seeds and report mean values and standard deviations. The ∆ column shows the average improvement due to data augmentation. Underlined results are significantly better than the baseline model without data augmentation (paired Student's t-test).
Table 3 provides the evaluation results on the test sets. The first conclusion we can draw is that all data augmentation techniques can improve over the baseline where no augmentation is used, although there is no single clear winner across both recurrent and transformer models. Synonym replacement outperforms the other augmentation methods on average when transformer models are used, whereas mention replacement appears to be most effective for recurrent models.
Second, applying all data augmentation methods together outperforms any single data augmentation on average, although, when the complete training set is used, applying a single data augmentation method may achieve better results (cf. MaSciP-Recurrent and i2b2-2010-Transformer). This scenario may reflect a trade-off between the diversity and the validity of augmented instances (Hou et al., 2018; Xie et al., 2019). On the one hand, applying all data augmentation methods together may prevent overfitting by producing diverse training instances. This positive effect is especially useful when the training sets are small. On the other hand, it may also increase the risk of altering the ground-truth label or generating invalid instances. This negative effect may dominate for larger training sets.
Third, data augmentation techniques are more effective when the training sets are small. For example, all data augmentation methods achieve significant improvements when the training set contains only 50 instances. In contrast, when the complete training sets are used, only three augmentation methods achieve significant improvements and some even decrease the performance. This has also been observed in previous work on machine translation tasks (Fadaee et al., 2017).
Last but not least, we notice that previous studies mainly investigate the effectiveness of data augmentation with recurrent models where most of the parameters are learned from scratch.
Considering the significant improvements when using pretrained transformer models, we argue that it is important to investigate the effectiveness of such techniques also on pretrained models, such as BERT (Devlin et al., 2019), because they are supposed to capture various kinds of knowledge via self-supervised learning.
Conclusion
We survey previously used data augmentation methods for sentence-level and sentence-pair NLP tasks and adapt them to NER, a token-level task. Through experiments on two domain-specific data sets, we show that simple data augmentation can improve performance even over strong baselines.
References
Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A pretrained language model for scientific text. In EMNLP-IJCNLP, pages 3613–3618, Hong Kong, China.
Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In EMNLP, pages 632–642, Lisbon, Portugal.
Hengyi Cai, Hongshen Chen, Yonghao Song, Cheng Zhang, Xiaofang Zhao, and Dawei Yin. 2020. Data manipulation: Towards effective instance learning for neural dialogue generation via learning to augment and reweight. In ACL, pages 6334–6343, Online.
Alexis Conneau, Holger Schwenk, Loïc Barrault, and Yann Lecun. 2017. Very deep convolutional networks for text classification. In EACL, pages 1107–1116, Valencia, Spain.
Xiang Dai, Sarvnaz Karimi, Ben Hachey, and Cecile Paris. 2020. Cost-effective selection of pretraining data: A case study of pretraining BERT on social media. arXiv preprint arXiv:2010.01150.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, pages 4171–4186, Minneapolis, Minnesota.
Marzieh Fadaee, Arianna Bisazza, and Christof Monz. 2017. Data augmentation for low-resource neural machine translation. In ACL, pages 567–573, Vancouver, Canada.
Annemarie Friedrich, Heike Adel, Federico Tomazic, Johannes Hingerl, Renou Benteau, Anika Marusczyk, and Lukas Lange. 2020. The SOFC-exp corpus and neural approaches to information extraction in the materials science domain. In ACL, pages 1255–1268, Online.
Fei Gao, Jinhua Zhu, Lijun Wu, Yingce Xia, Tao Qin, Xueqi Cheng, Wengang Zhou, and Tie-Yan Liu. 2019. Soft contextual data augmentation for neural machine translation. In ACL, pages 5539–5544, Florence, Italy.
Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech recognition with deep recurrent neural networks. In ICASSP, pages 6645–6649.
Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don't stop pretraining: Adapt language models to domains and tasks. In ACL, pages 8342–8360, Online.
Yutai Hou, Yijia Liu, Wanxiang Che, and Ting Liu. 2018. Sequence-to-sequence data augmentation for dialogue language understanding. In COLING, pages 1234–1245, Santa Fe, New Mexico, USA.
Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. 2015. Deep unordered composition rivals syntactic methods for text classification. In ACL-IJCNLP, pages 1681–1691, Beijing, China.
Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettlemoyer. 2018. Adversarial example generation with syntactically controlled paraphrase networks. In NAACL, pages 1875–1885, New Orleans, Louisiana.
Sarvnaz Karimi, Alejandro Metke-Jimenez, Madonna Kemp, and Chen Wang. 2015. CADEC: A corpus of adverse drug event annotations. J Biomed Inform, 55:73–81.
Sosuke Kobayashi. 2018. Contextual augmentation: Data augmentation by words with paradigmatic relations. In NAACL, pages 452–457, New Orleans, Louisiana.
Gakuto Kurata, Bing Xiang, and Bowen Zhou. 2016. Labeled data generation with encoder-decoder LSTM for semantic slot filling. In INTERSPEECH, pages 725–729.
Sepideh Mesbah, Jie Yang, Robert-Jan Sips, Manuel Valle Torre, Christoph Lofi, Alessandro Bozzon, and Geert-Jan Houben. 2019. Training data augmentation for detecting adverse drug reactions in user-generated content. In EMNLP-IJCNLP, pages 2349–2359, Hong Kong, China.
Junghyun Min, R. Thomas McCoy, Dipanjan Das, Emily Pitler, and Tal Linzen. 2020. Syntactic data augmentation increases robustness to inference heuristics. In ACL, pages 2339–2352, Online.
Sheshera Mysore, Zachary Jensen, Edward Kim, Kevin Huang, Haw-Shiuan Chang, Emma Strubell, Jeffrey Flanigan, Andrew McCallum, and Elsa Olivetti. 2019. The materials science procedural text corpus: Annotating materials synthesis procedures with shallow semantic structures. In ACL@LAW, pages 56–64, Florence, Italy.
Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, pages 1532–1543, Doha, Qatar.
Jonathan Raiman and John Miller. 2017. Globally normalized reader. In EMNLP, pages 1059–1069, Copenhagen, Denmark.
Sebastian Ruder. 2019. Neural Transfer Learning for Natural Language Processing. Ph.D. thesis, National University of Ireland, Galway.
Gözde Gül Şahin and Mark Steedman. 2018. Data augmentation via dependency tree morphing for low-resource languages. In EMNLP, pages 5004–5009, Brussels, Belgium.
Özlem Uzuner, Brett R South, Shuying Shen, and Scott L DuVall. 2011. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. J. Am. Med. Inform. Assoc., 18(5):552–556.
Jason Wang and Luis Perez. 2017. The effectiveness of data augmentation in image classification using deep learning. CoRR, abs/1712.04621.
William Yang Wang and Diyi Yang. 2015. That's so annoying!!!: A lexical and frame-semantic embedding based data augmentation approach to automatic categorization of annoying behaviors using #petpeeve tweets. In EMNLP, pages 2557–2563, Lisbon, Portugal.
Xinyi Wang, Hieu Pham, Zihang Dai, and Graham Neubig. 2018. SwitchOut: An efficient data augmentation algorithm for neural machine translation. In EMNLP, pages 856–861, Brussels, Belgium.
Jason Wei and Kai Zou. 2019. EDA: Easy data augmentation techniques for boosting performance on text classification tasks. In EMNLP-IJCNLP, pages 6382–6388, Hong Kong, China.
Xing Wu, Shangwen Lv, Liangjun Zang, Jizhong Han, and Songlin Hu. 2018. Conditional BERT contextual augmentation. CoRR, abs/1812.06705.
Mengzhou Xia, Xiang Kong, Antonios Anastasopoulos, and Graham Neubig. 2019. Generalized data augmentation for low-resource translation. In ACL, pages 5786–5796, Florence, Italy.
Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc V. Le. 2019. Unsupervised data augmentation. CoRR, abs/1904.12848.
Kang Min Yoo, Youhyun Shin, and Sang-goo Lee. 2019. Data augmentation for spoken language understanding via joint variational generation. In AAAI, Honolulu, Hawaii.
Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V. Le. 2018. QANet: Combining local convolution with global self-attention for reading comprehension. In ICLR.
Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In NIPS, pages 649–657.
Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2018. Gender bias in coreference resolution: Evaluation and debiasing methods. In NAACL, New Orleans, Louisiana.