Establishing Strong Baselines for the New Decade: Sequence Tagging, Syntactic and Semantic Parsing with BERT
Han He
Computer Science, Emory University, Atlanta, GA 30322, USA
[email protected]
Jinho D. Choi
Computer Science, Emory University, Atlanta, GA 30322, USA
[email protected]
Abstract
This paper presents new state-of-the-art models for three tasks, part-of-speech tagging, syntactic parsing, and semantic parsing, using the cutting-edge contextualized embedding framework known as BERT. For each task, we first replicate and simplify the current state-of-the-art approach to enhance its model efficiency. We then evaluate our simplified approaches on those three tasks using token embeddings generated by BERT. 12 datasets in both English and Chinese are used for our experiments. The BERT models outperform the previously best-performing models by 2.5% on average (7.5% for the most significant case). Moreover, an in-depth analysis on the impact of BERT embeddings is provided using self-attention, which helps in understanding this rich representation. All models and source codes are publicly available so that researchers can improve upon and utilize them to establish strong baselines for the next decade.
Introduction

It is no exaggeration to say that word embeddings trained by vector-based language models (Mikolov et al., 2013; Pennington et al., 2014; Bojanowski et al., 2017) have changed the game of NLP once and for all. These pre-trained word embeddings, trained on large corpora, improve downstream tasks by encoding rich word semantics into vector space. However, word senses are ignored in these earlier approaches such that a unique vector is assigned to each word, neglecting polysemy in context. Recently, contextualized embedding approaches have emerged with advanced techniques to dynamically generate word embeddings from different contexts. To address polysemous words, Peters et al. (2018) introduce ELMo, a word-level Bi-LSTM language model. Akbik et al. (2018) apply a similar approach at the character level, called Flair, concatenating the hidden states corresponding to the first and the last characters of each word to build the embedding of that word. Apart from these unidirectional recurrent language models, Devlin et al. (2018) replace the transformer decoder from Radford et al. (2018) with a bidirectional transformer encoder, then train BERT on a 3.3B-word corpus. After scaling the model size to hundreds of millions of parameters, BERT brings markedly large improvement to a wide range of tasks without substantial task-specific modifications.

In this paper, we verify the effectiveness and conciseness of BERT by first generating token-level embeddings from it, then integrating them into task-oriented yet efficient model structures (Section 3). With careful investigation and engineering, our simplified models significantly outperform many of the previous state-of-the-art models, achieving the highest scores for 11 out of 12 datasets (Section 4).

To reveal the essence of BERT in these tasks, we analyze our tagging models with self-attention, and find that BERT embeddings capture contextual information better than pre-trained embeddings, but not necessarily better than embeddings generated by a character-level language model (Section 5.1). Furthermore, an extensive comparison between our baseline and BERT models shows that BERT models handle long sentences robustly (Section 5.2). One of the key findings is that BERT embeddings are much more related to semantics than syntax (Section 5.3). Our findings are consistent with the training procedure of BERT, and provide guiding references for future research.

To the best of our knowledge, this is the first work that tightly integrates BERT embeddings into these three downstream tasks and presents such high performance. All our resources including the models and the source codes are publicly available at https://github.com/emorynlp/bert-2019.

Related Work
Rich initial word encodings substantially improve the performance of downstream NLP tasks, and have been studied for decades. Except for matrix factorization methods (Pennington et al., 2014), most works train language models to predict some words given their contexts. Among these works, CBOW and Skip-Gram (Mikolov et al., 2013) are pioneers of neural language models, extracting features within a fixed-length window. Joulin et al. (2017) then augment these models with subword information to handle out-of-vocabulary words.

To learn contextualized representations, Peters et al. (2018) apply a bidirectional language model (bi-LM) to a tokenized unlabeled corpus. Similarly, the contextual string embeddings (Akbik et al., 2018) model language at the character level, which can efficiently extract morphological features. However, a bi-LM consists of two unidirectional LMs without left or right context, leading to potential bias on one side. To address this limitation, BERT (Devlin et al., 2018) employs a masked LM to jointly condition on both left and right contexts, showing impressive improvement in various tasks.
Sequence tagging is one of the most well-studied NLP tasks, and can be directly applied to part-of-speech (POS) tagging and named entity recognition (NER). As a general trend, fine-grained features often result in better performance. Akbik et al. (2018) feed the contextual string embeddings into a Bi-LSTM-CRF tagger (Huang et al., 2015), improving tagging accuracy with rich morphological and contextual information. In a more meticulously designed system, Bohnet et al. (2018) generate representations from both string and token based character Bi-LSTM language models, then employ a meta-BiLSTM to integrate them.

Besides, joint learning and semi-supervised learning can lead to better generalization. As a highly end-to-end approach, the character-level transition system proposed by Kurita et al. (2017) benefits from joint learning on Chinese word segmentation, POS tagging, and dependency parsing. Recently, Clark et al. (2018) exploit large-scale unlabeled data with Cross-View Training (CVT), which improves the RNN feature detector shared between the full model and auxiliary modules.
Dependency trees and constituency structures are two closely related syntactic forms. Choe and Charniak (2016) cast constituency parsing as language modeling, achieving high UAS after conversion to dependency trees. Kuncoro et al. (2017) investigate recurrent neural network grammars through ablations and a gated attention mechanism, finding that lexical heads are crucial in phrasal representation.

Recently, graph-based parsers have resurged due to their ability to exploit modern GPU parallelization. Dozat and Manning (2017) successfully implement a graph-based dependency parser with a biaffine attention mechanism, showing impressive performance and decent simplicity. Clark et al. (2018) improve the feature detector of the biaffine parser through CVT and joint learning, while Ma et al. (2018) introduce stack-pointer networks to model the parsing history of a transition-based parser, with the biaffine attention mechanism built in.
Currently, the parsing community is shifting from syntactic dependency tree parsing to semantic dependency graph parsing (SDP). As graph nodes can have multiple heads or zero heads, SDP allows for more flexible representations of sentence meanings. Wang et al. (2018) modify the preconditions of the List-Based Arc-Eager transition system (Choi and McCallum, 2013), implementing it with Bi-LSTM Subtraction and Tree-LSTM for feature extraction. Among graph-based approaches, Peng et al. (2017) investigate higher-order structures across different graph formalisms with a tensor scoring strategy, benefiting from multitask learning. Dozat and Manning (2018) replace the softmax cross-entropy in the biaffine parser with sigmoid cross-entropy, successfully turning the syntactic tree parser into a simple yet accurate semantic graph parser.
BERT splits each token into subwords using WordPiece (Wu et al., 2016), which do not necessarily reflect any morphology in linguistics. For example, 'Rainwater' gets split into 'Rain' and '##water', whereas 'running' or 'rapidly' remain unchanged, although typical morphology would split them into run+ing and rapid+ly. To obtain token-level embeddings for tagging and parsing tasks, the following two methods are experimented with:

Last Embedding: Since the subwords from each token are trained to predict one another during language modeling, their embeddings must be correlated. Thus, one way is to pick the embedding of the last subword as a representation of the token.
Average Embedding: For a compound word like 'doghouse' that gets split into 'dog' and '##house', the embedding of either subword alone may not represent the whole token well; thus, the other way is to take the average of all subword embeddings as the representation of the token.
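As an illustration, the following is a minimal sketch, not the authors' code, of deriving token-level embeddings from BERT subwords with the last and average strategies; it assumes the HuggingFace transformers package, and the model name is only illustrative.

```python
# Sketch: pool BERT subword embeddings into token-level embeddings
# via the "last" and "average" strategies (illustrative, not the authors' code).
import torch
from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

tokens = ["Rainwater", "remains", "rapidly"]            # pre-tokenized input
enc = tokenizer(tokens, is_split_into_words=True, return_tensors="pt")

with torch.no_grad():
    subword_emb = model(**enc).last_hidden_state[0]     # (num_subwords, 768)

word_ids = enc.word_ids(0)                              # maps each subword to its token
token_emb_last, token_emb_avg = [], []
for i in range(len(tokens)):
    idx = [j for j, w in enumerate(word_ids) if w == i] # subword positions of token i
    token_emb_last.append(subword_emb[idx[-1]])         # last subword only
    token_emb_avg.append(subword_emb[idx].mean(dim=0))  # average of all subwords
```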
Model                  In-domain    Out-of-domain
BERT_BASE: Last        -            -
BERT_BASE: Average     -            -
BERT_LARGE: Last       -            -
BERT_LARGE: Average    -            -

Table 1: Results from the PSD semantic parsing task (Section 4.3) using the last and average embedding methods.
Table 1 shows results from a semantic parsing task, PSD (Section 4.3), using the last and average embedding methods with the BERT_BASE and BERT_LARGE models. The average method is chosen for all our experiments since it gives a marginal advantage on the out-of-domain dataset.
While Devlin et al. (2018) report that adding just an additional output layer to the BERT encoder can build powerful models in a wide range of tasks, its computational cost is too high. Thus, we separate the BERT architecture from downstream models, and feed pre-generated BERT embeddings, $e^{\text{BERT}}$, as input to task-specific encoders:

$F_i = \text{Encoder}(X \oplus e^{\text{BERT}})$
Alternatively, BERT embeddings can be concatenated with the output of a certain hidden layer:

$F_h = \text{Encoder}_{[h:]}(\text{Encoder}_{[:h]}(X) \oplus e^{\text{BERT}})$
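The two integration strategies can be sketched as follows; this is a hedged illustration rather than the exact configuration, with toy dimensions and a two-stage Bi-LSTM standing in for the task-specific encoder.

```python
# Sketch of the two integration strategies: F_i concatenates pre-generated BERT
# embeddings to the encoder input; F_h concatenates them to an intermediate
# encoder layer's output. Dimensions are illustrative.
import torch
import torch.nn as nn

class InputConcat(nn.Module):                   # F_i
    def __init__(self, d_x=100, d_bert=768, d_h=400):
        super().__init__()
        self.encoder = nn.LSTM(d_x + d_bert, d_h, batch_first=True,
                               bidirectional=True)

    def forward(self, x, e_bert):               # x: (B, T, d_x), e_bert: (B, T, d_bert)
        return self.encoder(torch.cat([x, e_bert], dim=-1))[0]

class HiddenConcat(nn.Module):                  # F_h
    def __init__(self, d_x=100, d_bert=768, d_h=400):
        super().__init__()
        self.lower = nn.LSTM(d_x, d_h, batch_first=True, bidirectional=True)
        self.upper = nn.LSTM(2 * d_h + d_bert, d_h, batch_first=True,
                             bidirectional=True)

    def forward(self, x, e_bert):
        h = self.lower(x)[0]                    # Encoder_[:h](X)
        return self.upper(torch.cat([h, e_bert], dim=-1))[0]
```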
Table 2 shows results from the PSD semantic parsing task (Section 4.3) using the average method from Section 3.1. $F_i$ shows a slight advantage for both BERT_BASE and BERT_LARGE over $F_h$; thus, it is chosen for all our experiments. BERT_BASE uses 12 layers, 768 hidden cells, 12 attention heads, and 110M parameters, while BERT_LARGE uses 24 layers, 1024 hidden cells, 16 attention heads, and 340M parameters. Both models are uncased, since they are reported to achieve high scores for all tasks except for NER (Devlin et al., 2018).
Model               In-domain    Out-of-domain
BERT_BASE: F_i      -            -
BERT_BASE: F_h      -            -
BERT_LARGE: F_i     -            -
BERT_LARGE: F_h     -            -

Table 2: Results from the PSD semantic parsing task (Section 4.3) using $F_i$ and $F_h$.

For sequence tagging, the Bi-LSTM-CRF (Huang et al., 2015) with the Flair contextual embeddings (Akbik et al., 2018) is used to establish a baseline for English. Given a token $w$ in a sequence where $c_i$ and $c_j$ are the starting and ending characters of $w$ ($i$ and $j$ are the character offsets; $i \le j$), the Flair embedding of $w$ is generated by concatenating two hidden states: that of $c_{j+1}$ from the forward LSTM and that of $c_{i-1}$ from the backward LSTM (Figure 1):

$e^{\text{Flair}}_{i,j} = h^f(c_{j+1}) \oplus h^b(c_{i-1})$

$e^{\text{Flair}}_{i,j}$ is then concatenated with a pre-trained token embedding of $w$ and fed into the Bi-LSTM-CRF. In our approach, we present two models, one substituting the Flair and pre-trained embeddings with BERT, and the other concatenating BERT with the other embeddings. Note that variational dropout is not used in our approach, to reduce complexity.

Figure 1: Generating the Flair embedding for 'apple'.
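A rough sketch of this construction, assuming toy dimensions and randomly initialized character LSTMs rather than the trained Flair language models:

```python
# Sketch: build a Flair-style embedding for a word spanning characters c_i..c_j
# by concatenating the forward hidden state at c_{j+1} with the backward hidden
# state at c_{i-1}. The sentence and dimensions are illustrative.
import torch
import torch.nn as nn

chars = list("we buy apple tv")             # character sequence of the sentence
char_emb = nn.Embedding(256, 32)
fwd_lm = nn.LSTM(32, 64, batch_first=True)  # stand-in forward character LM
bwd_lm = nn.LSTM(32, 64, batch_first=True)  # stand-in backward character LM

x = char_emb(torch.tensor([[ord(c) for c in chars]]))        # (1, T, 32)
h_f, _ = fwd_lm(x)                                            # left-to-right states
h_b, _ = bwd_lm(torch.flip(x, dims=[1]))
h_b = torch.flip(h_b, dims=[1])                               # re-align to positions

i, j = 7, 11                                # 'apple' spans characters 7..11
e_flair = torch.cat([h_f[0, j + 1], h_b[0, i - 1]], dim=-1)   # (128,)
```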
As Chinese is characterized as a morphologically poor language, the Flair embeddings are not used for tagging tasks; only pre-trained and BERT embeddings are used for our experiments in Chinese.
A simplified variant of the biaffine parser (Dozat and Manning, 2017) is used for syntactic parsing (Figure 2). Compared to the original version, the trainable word embeddings are removed and lemmas are used instead of forms to retrieve pre-trained embeddings in our version, leading to less complexity yet better generalization.

Figure 2: The syntactic and semantic parsers, each feeding MLP (arc-h), MLP (arc-d), MLP (rel-h), and MLP (rel-d) features into biaffine classifiers; the syntactic parser uses softmax cross-entropy loss with MST decoding, whereas the semantic parser uses sigmoid cross-entropy loss and keeps arcs scoring above 0.5.
Given the $i$'th token $w_i$, the feature vector is created by concatenating its pre-trained lemma embedding $e^{\text{LEM}}_i$, its POS embedding $e^{\text{POS}}_i$ learned during training, and the representation $e^{\text{BERT}}_i$ from the last layer of BERT. This feature vector is fed into a Bi-LSTM, generating two recurrent states $r^f_i$ and $r^b_i$:

$r^f_i = \text{LSTM}^{\text{forward}}(e^{\text{LEM}}_i \oplus e^{\text{POS}}_i \oplus e^{\text{BERT}}_i)$
$r^b_i = \text{LSTM}^{\text{backward}}(e^{\text{LEM}}_i \oplus e^{\text{POS}}_i \oplus e^{\text{BERT}}_i)$

Two multi-layer perceptrons (MLP) are then used to extract features for $w_i$ being a head, $h^{\text{(arc-h)}}_i$, or a dependent, $h^{\text{(arc-d)}}_i$, and two additional MLPs are used to extract $h^{\text{(rel-h)}}_i$ and $h^{\text{(rel-d)}}_i$ for labeled parsing:

$h^{\text{(arc-h)}}_i = \text{MLP}^{\text{(arc-h)}}(r^f_i \oplus r^b_i) \in \mathbb{R}^{k}$
$h^{\text{(arc-d)}}_i = \text{MLP}^{\text{(arc-d)}}(r^f_i \oplus r^b_i) \in \mathbb{R}^{k}$
$h^{\text{(rel-h)}}_i = \text{MLP}^{\text{(rel-h)}}(r^f_i \oplus r^b_i) \in \mathbb{R}^{l}$
$h^{\text{(rel-d)}}_i = \text{MLP}^{\text{(rel-d)}}(r^f_i \oplus r^b_i) \in \mathbb{R}^{l}$

$h^{\text{(arc-h)}}_{1..n}$ are stacked into a matrix $H^{\text{(arc-h)}}$, and $h^{\text{(arc-d)}}_{1..n}$ are stacked into another matrix $H^{\text{(arc-d)}}$ with a bias for the prior probability of each token being a head, as follows ($n$ is the number of tokens; $U^{\text{(arc)}} \in \mathbb{R}^{k \times (k+1)}$):

$H^{\text{(arc-h)}} = (h^{\text{(arc-h)}}_1, \ldots, h^{\text{(arc-h)}}_n) \in \mathbb{R}^{k \times n}$
$H^{\text{(arc-d)}} = (h^{\text{(arc-d)}}_1, \ldots, h^{\text{(arc-d)}}_n) \oplus \mathbf{1} \in \mathbb{R}^{(k+1) \times n}$
$S^{\text{(arc)}} = H^{\text{(arc-h)}\top} \cdot U^{\text{(arc)}} \cdot H^{\text{(arc-d)}} \in \mathbb{R}^{n \times n}$

$S^{\text{(arc)}}$ is called a bilinear classifier that predicts head words. Additionally, arc labels are predicted by another biaffine classifier $S^{\text{(rel)}}$, which combines $m$ bilinear classifiers for multi-class classification ($m$ is the number of labels; $U^{\text{(rel)}}_i \in \mathbb{R}^{l \times (l+1)}$, $V^{\text{(rel)}} \in \mathbb{R}^{(2 \cdot l+1) \times m}$):

$H^{\text{(rel-h)}} = (h^{\text{(rel-h)}}_1, \ldots, h^{\text{(rel-h)}}_n) \in \mathbb{R}^{l \times n}$
$H^{\text{(rel-d)}} = (h^{\text{(rel-d)}}_1, \ldots, h^{\text{(rel-d)}}_n) \oplus \mathbf{1} \in \mathbb{R}^{(l+1) \times n}$
$U^{\text{rel}}_i = H^{\text{(rel-h)}\top} \cdot U^{\text{(rel)}}_i \cdot H^{\text{(rel-d)}} \in \mathbb{R}^{n \times n}$
$S^{\text{(rel)}} = (U^{\text{rel}}_1, \ldots, U^{\text{rel}}_m) + (H^{\text{(rel-h)}} \oplus H^{\text{(rel-d)}})^\top \cdot V^{\text{(rel)}} \in \mathbb{R}^{m \times n \times n}$

During training, softmax cross-entropy is used to optimize $S^{\text{(arc)}}$ and $S^{\text{(rel)}}$. Note that for the optimization of $S^{\text{(rel)}}$, gold heads are used instead of predicted ones. During decoding, a maximum spanning tree algorithm is adopted to search for the optimal tree based on the scores in $S^{\text{(arc)}}$.

Dozat and Manning (2018) adapted their original biaffine parser to generate dependency graphs for semantic parsing, where each token can have zero to many heads. Since the tree structure is no longer guaranteed, sigmoid cross-entropy is used instead so that independent binary predictions can be made for every token to be considered a head of any other token. Once arc predictions are made, the labels with the highest scores in $S^{\text{(rel)}}$ are output as the label predictions, as illustrated in Figure 2. This updated implementation is further simplified in our approach by removing the trainable word embeddings, the character-level feature detector, and their corresponding linear transformations. Moreover, instead of using an interpolation between the head and label losses, equal weights are applied to both losses, reducing the hyperparameters to tune.
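To make the biaffine scoring concrete, the sketch below shows an unlabeled arc scorer in the spirit of the equations above; dimensions and module names are illustrative, not the authors' implementation. In the syntactic setting, the resulting n-by-n scores are trained with softmax cross-entropy and decoded with a maximum spanning tree; in the semantic setting, they are trained with sigmoid cross-entropy and arcs scoring above 0.5 are kept.

```python
# Sketch of a biaffine arc scorer over Bi-LSTM states (illustrative dimensions).
import torch
import torch.nn as nn

class BiaffineArcScorer(nn.Module):
    def __init__(self, d_rnn=800, k=500):
        super().__init__()
        self.mlp_h = nn.Sequential(nn.Linear(d_rnn, k), nn.ReLU())  # MLP (arc-h)
        self.mlp_d = nn.Sequential(nn.Linear(d_rnn, k), nn.ReLU())  # MLP (arc-d)
        self.U = nn.Parameter(torch.zeros(k, k + 1))                # U^(arc)

    def forward(self, r):                       # r: (B, n, d_rnn) Bi-LSTM states
        h = self.mlp_h(r)                       # (B, n, k) head representations
        d = self.mlp_d(r)                       # (B, n, k) dependent representations
        ones = torch.ones(d.size(0), d.size(1), 1)
        d = torch.cat([d, ones], dim=-1)        # append bias term -> (B, n, k+1)
        # score[b, i, j] = h_i^T U d_j : score of token i heading token j
        return torch.einsum("bik,kl,bjl->bij", h, self.U, d)
```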
Experiments

Three sets of experiments are conducted to evaluate the impact of our approaches using BERT (Section 3). For sequence tagging (Section 4.1), part-of-speech tagging is chosen, where each token gets assigned a fine-grained POS tag. For syntactic parsing (Section 4.2), dependency parsing is chosen, where each token finds exactly one head, generating a tree per sentence. For semantic parsing (Section 4.3), semantic dependency parsing is chosen, where each token finds zero to many heads, generating a graph per sentence. Every task is tested on both English and Chinese to ensure robustness across languages. Standard datasets are adopted for all experiments for fair comparisons to many previous approaches. All our models are run three times, and average scores with standard deviations are reported. Section A describes our environmental settings and data splits in detail for the replication of this work.
For part-of-speech tagging, the Wall Street Journal corpus from the Penn Treebank 3 (Marcus et al., 1993) is used for English, and the Penn Chinese Treebank 5.1 (Xue et al., 2005) is used for Chinese. Table 3 shows tagging results on the test sets.

(a) Results from the English test set. BERT_BS and BERT_LG are BERT's uncased base and cased large models, respectively.

Model                      ALL      OOV
Ma and Hovy (2016)         97.55    -
Ling et al. (2015)         97.78    n/a
Clark et al. (2018)        97.79    n/a
Akbik et al. (2018)        97.85    n/a
Baseline                   97.70    -
\BERT_BS                   -        -
\BERT_LG                   -        -
+BERT_BS                   -        -
+BERT_LG                   -        -

(b) Results from the Chinese test set. *: evaluated on the character level due to automatic segmentation, so these results are not directly comparable to ours but are reported for reference.

Model                      ALL      OOV
Zhang et al. (2015)        94.47*   n/a
Zhang et al. (2014)        94.62*   n/a
Kurita et al. (2017)       94.84*   n/a
Hatori et al. (2011)       94.64    n/a
Wang and Xue (2014)        96.0     n/a
Baseline                   95.65    -
\BERT                      96.38    -
+BERT                      -        -

Table 3: Test results for part-of-speech tagging, where token-level accuracy is used as the evaluation metric. ALL: all tokens, OOV: out-of-vocabulary tokens.
For English, the baseline is our replication of the Flair model using both GloVe and Flair embeddings (Section 3.3). It shows a slightly lower accuracy, -0.15%, than the original model (Akbik et al., 2018) due to the lack of variational dropout. \BERT substitutes GloVe and Flair with BERT embeddings, and +BERT uses all three types of embeddings. The baseline outperforms all BERT models in the ALL test, implying that Flair's Bi-LSTM character language model is more effective than BERT's word-piece approach. No significant difference is found between BERT_BS and BERT_LG. However, an interesting trend is found in the OOV test, where the +BERT_LG model shows good improvement over the baseline. This implies that BERT embeddings can still contribute to the Flair model for OOV, although the CNN character language model from Ma and Hovy (2016) is marginally more effective than +BERT for out-of-vocabulary tokens.

For Chinese, the Bi-LSTM-CRF model with FastText embeddings is used as the baseline (Section 3.3). \BERT, which substitutes FastText embeddings with BERT, and +BERT, which adds BERT embeddings to the baseline, show progressive improvement over the prior model in both the ALL and OOV tests. +BERT gives an accuracy that is 1.25% higher than the previous state of the art using joint learning between tagging and parsing (Wang and Xue, 2014).
The same datasets used for POS tagging, the Penn Treebank and the Penn Chinese Treebank (Section 4.1), are used for dependency parsing as well. Table 4 shows parsing results on the test sets.

(a) Results from the English test set.

Model                          UAS      LAS
Dozat and Manning (2017)       95.74    94.08
Kuncoro et al. (2017)          95.8     94.6
Ma et al. (2018)               95.87    94.19
Choe and Charniak (2016)       95.9     94.1
Clark et al. (2018)            96.6     95.0
Baseline                       95.78    -
\BERT                          96.76    -
+BERT                          -        -

(b) Results from the Chinese test set.

Model                          UAS      LAS
Dozat and Manning (2017)       89.30    88.23
Ma et al. (2018)               90.59    89.29
Baseline                       91.02    -
\BERT                          93.21    -
+BERT                          -        -

Table 4: Test results for dependency parsing, where unlabeled and labeled attachment scores (UAS and LAS) are used as the evaluation metrics.

Our simplified version of the biaffine parser (Section 3.4) is used as the baseline, where GloVe and FastText embeddings are used for English and Chinese, respectively. The baseline model gives a comparable result to the original model (Dozat and Manning, 2017) for English, yet shows a notably better result for Chinese, which can be due to higher quality embeddings from FastText. \BERT substitutes the pre-trained embeddings with BERT and +BERT adds BERT embeddings to the baseline. Moreover, BERT's uncased base model is used for English. Between \BERT and +BERT, no significant difference is found, implying that those pre-trained embeddings are not so useful when coupled with BERT. All BERT models show significant improvement over the baselines for both languages, and outperform the previous state-of-the-art approaches using cross-view training (Clark et al., 2018) and stack-pointer networks (Ma et al., 2018) by 0.29% and 3% in LAS for English and Chinese, respectively. Considering the simplicity of our +BERT models, these results are remarkable.
The English dataset from the SemEval 2015 Task 18: Broad-Coverage Semantic Dependency Parsing (Oepen et al., 2015) and the Chinese dataset from the SemEval 2016 Task 9: Chinese Semantic Dependency Parsing (Che et al., 2016) are used for semantic dependency parsing.

(a) Results from the in-domain (ID) test sets.

Model                          DM       PAS      PSD      AVG
Du et al. (2015)               89.1     91.3     75.7     85.3
Almeida and Martins (2015)     89.4     91.7     77.6     86.2
Wang et al. (2018)             90.3     91.7     78.6     86.9
Peng et al. (2017)             90.4     92.7     78.5     87.2
Dozat and Manning (2018)       93.7     93.9     81.0     89.5
Baseline                       92.48    94.56    85.00    90.68
Baseline \BERT                 94.37    96.03    86.59    92.33
Baseline +BERT                 -        -        -        -

(b) Results from the out-of-domain (OOD) test sets.

Model                          DM       PAS      PSD      AVG
Du et al. (2015)               81.8     87.2     73.3     80.8
Almeida and Martins (2015)     83.8     87.6     76.2     82.5
Wang et al. (2018)             84.9     87.6     75.9     82.8
Peng et al. (2017)             85.3     89.0     76.4     83.6
Dozat and Manning (2018)       88.9     90.6     79.4     86.3
Baseline                       86.98    91.35    77.28    85.34
Baseline \BERT                 90.49    94.31    79.31    88.07
Baseline +BERT                 -        -        -        -

Table 5: Test results for semantic dependency parsing in English; labeled dependency F1 scores are used as the evaluation metrics. The standard deviations are reported in Section A.3. DM: DELPH-IN dependencies, PAS: Enju dependencies, PSD: Prague dependencies, AVG: macro-average of (DM, PAS, PSD).
Table 5 shows the English results on the test sets. The baseline, \BERT, and +BERT models are similar to the ones in Section 4.2, except they use the sigmoid instead of the softmax function in the output layer to accept multiple heads (Section 3.5). Our baseline is a simplified version of Dozat and Manning (2018); its average scores are 1.2% higher and 1.0% lower than the original model for ID and OOD, respectively, due to different hyperparameter settings. +BERT shows good improvement over \BERT on both test sets, implying that BERT embeddings are complementary to those pre-trained embeddings, and surpasses the previous state-of-the-art scores by 3% and 2% for ID and OOD, respectively.
                               NEWS              TEXT
Model                          UF       LF       UF       LF
Artsymenia et al. (2016)       77.64    59.06    82.41    68.59
Wang et al. (2018)             81.14    63.30    85.71    72.92
Baseline                       80.51    64.90    88.06    77.28
Baseline \BERT                 82.91    67.17    90.83    -
Baseline +BERT                 -        -        -        -

Table 6: Test results for semantic dependency parsing in Chinese, where unlabeled and labeled dependency F1 scores (UF and LF) are used as the evaluation metrics. The standard deviations are also reported in Section A.3. NEWS: newswire, TEXT: textbook.
Table 6 shows the Chinese results on the test sets. No significant difference is found between \BERT and +BERT. +BERT significantly outperforms the previous state of the art by 4% and 7.5% in LF for NEWS and TEXT, respectively, which confirms that BERT embeddings are very effective for semantic dependency parsing in both English and Chinese.
This section gives an in-depth analysis of the strong results achieved by our approaches (Section 4) to better understand the role of BERT in these tasks.
The performance of the \BERT models is surprisingly low for English POS tagging, compared to even a linear model achieving an accuracy of 97.64% on the same dataset (Choi, 2016). This aligns with the findings reported for BERT (Devlin et al., 2018) and ELMo (Peters et al., 2018), another popular contextualized embedding approach, where their POS and named entity tagging results do not surpass the state of the art. To study how tagging models are trained with BERT embeddings, we augment the baseline and \BERT_BS models in Table 3(a) with dot-product self-attention (Luong et al., 2015), and extract their attention weights. We then average the attention matrices decoded from sentences with an equal length, 30 tokens, to find any general trend.

Figure 3: Averaged attention matrices on sentences with 30 tokens: (a) English: Flair, (b) English: BERT, (c) Chinese: FastText, (d) Chinese: BERT. Each cell depicts the attention weight between w_i and w_j, representing the i'th and j'th tokens. All models are based on the Bi-LSTM-CRF (Huang et al., 2015) using the Flair (Akbik et al., 2018), FastText (Bojanowski et al., 2017), and BERT (Devlin et al., 2018) embeddings.

Comparing attention matrices across languages, it is clear that the Chinese matrices are much more checkered, implying that more context is required to make correct predictions in Chinese than in English. This makes sense because Chinese words tend to be more polysemous than English ones (Huang et al., 2007), so they rely more on context to disambiguate their categories. For the Flair and BERT models in English, the Flair matrix is more checkered and its diagonal is darker, implying that it uses more context while individual token embeddings convey more information for POS tagging, so their weights are higher than the ones in the BERT matrix. For the FastText and BERT models in Chinese, on the other hand, the BERT model is slightly more checkered and its diagonal is darker, indicating that BERT is better suited for this task than FastText.
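The analysis procedure can be sketched as follows; `tagger_states` is a hypothetical function returning the tagger's recurrent states for a sentence, and the attention layer is the dot-product formulation of Luong et al. (2015).

```python
# Sketch: compute dot-product self-attention weights over a tagger's recurrent
# states and average the matrices over all sentences of the same length.
import torch

def attention_matrix(states):              # states: (T, d) recurrent states
    scores = states @ states.t()           # dot-product self-attention scores
    return torch.softmax(scores, dim=-1)   # (T, T) attention weights

def averaged_attention(sentences, length=30):
    # `tagger_states` is a hypothetical hook into the trained tagger.
    mats = [attention_matrix(tagger_states(s))
            for s in sentences if len(s) == length]
    return torch.stack(mats).mean(dim=0)   # averaged (length, length) matrix
```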