Generalizing Natural Language Analysis through Span-relation Representations
Zhengbao Jiang¹, Wei Xu², Jun Araki³, Graham Neubig¹
¹Language Technologies Institute, Carnegie Mellon University
²Department of Computer Science and Engineering, Ohio State University
³Bosch Research North America
{zhengbaj,gneubig}@cs.cmu.edu, [email protected], [email protected]

Abstract
Natural language processing covers a wide variety of tasks predicting syntax, semantics, and information content, and usually each type of output is generated with specially designed architectures. In this paper, we provide the simple insight that a great variety of tasks can be represented in a single unified format consisting of labeling spans and relations between spans, and thus a single task-independent model can be used across different tasks. We perform extensive experiments to test this insight on 10 disparate tasks spanning dependency parsing (syntax), semantic role labeling (semantics), relation extraction (information content), aspect based sentiment analysis (sentiment), and many others, achieving performance comparable to state-of-the-art specialized models. We further demonstrate benefits of multi-task learning, and also show that the proposed method makes it easy to analyze differences and similarities in how the model handles different tasks. Finally, we convert these datasets into a unified format to build a benchmark, which provides a holistic testbed for evaluating future models for generalized natural language analysis.
1 Introduction

A large number of natural language processing (NLP) tasks exist to analyze various aspects of human language, including syntax (e.g., constituency and dependency parsing), semantics (e.g., semantic role labeling), information content (e.g., named entity recognition and relation extraction), or sentiment (e.g., sentiment analysis). At first glance, these tasks are seemingly very different in both the structure of their output and the variety of information that they try to capture. To handle these different characteristics, researchers usually use specially designed neural network architectures. In this paper we ask the simple questions: are the
task-specific architectures really necessary? Or, with the appropriate representational methodology, can we devise a single model that can perform — and achieve state-of-the-art performance on — a large number of natural language analysis tasks?

Figure 1: An example from BRAT, consisting of POS, NER, and RE.

Interestingly, in the domain of efficient human annotation interfaces, it is already standard to use unified representations for a wide variety of NLP tasks. Figure 1 shows one example of the BRAT (Stenetorp et al., 2012) annotation interface, which has been used for annotating data for tasks as broad as part-of-speech tagging, named entity recognition, relation extraction, and many others. Notably, this interface has a single unified format that consists of spans (e.g., the span of an entity), labels on the spans (e.g., the variety of entity such as "person" or "location"), and labeled relations between the spans (e.g., "born-in"). These labeled relations can form a tree or a graph structure, expressing the linguistic structure of sentences (e.g., a dependency tree). We detail this BRAT format and how it can be used to represent a wide number of natural language analysis tasks in Section 2.

The simple hypothesis behind our paper is: if humans can perform natural language analysis in a single unified format, then perhaps machines can as well. Fortunately, there already exist NLP models that perform span prediction and prediction of relations between pairs of spans, such as the end-to-end coreference model of Lee et al. (2017). We extend this model with minor architectural modifications (which are not our core contributions) and pre-trained contextualized representations (e.g.,
BERT; Devlin et al. (2019)), then demonstrate the applicability and versatility of this single model on 10 tasks, including named entity recognition (NER), relation extraction (RE), coreference resolution (Coref.), open information extraction (OpenIE), part-of-speech tagging (POS), dependency parsing (Dep.), constituency parsing (Consti.), semantic role labeling (SRL), aspect based sentiment analysis (ABSA), and opinion role labeling (ORL). While previous work has used similar formalisms to understand the representations learned by pre-trained embeddings (Tenney et al., 2019a,b), to the best of our knowledge this is the first work that uses such a unified model to actually perform analysis. Moreover, we demonstrate that despite the model's simplicity, it can achieve performance comparable to special-purpose state-of-the-art models on the tasks above (Table 1). We also demonstrate that this framework allows us to easily perform multi-task learning (MTL), leading to improvements when there are related tasks to be learned from or data is sparse. Further analysis shows that dissimilar tasks exhibit divergent attention patterns, which explains why MTL is harmful on certain tasks. We have released our code and the General Language Analysis Datasets (GLAD) benchmark with 8 datasets covering 10 tasks in the BRAT format¹ at https://github.com/neulab/cmu-multinlp, and provide a leaderboard to facilitate future work on generalized models for NLP.

¹ In contrast to work on pre-trained contextualized representations like ELMo (Peters et al., 2018) or BERT (Devlin et al., 2019), which learn unified features to represent the input in different tasks, we propose a unified representational methodology that represents the output of different tasks. Analysis models using BERT still use special-purpose output predictors for specific tasks or task classes.

| Model | NER | RE | Coref. | OpenIE | POS | Dep. | Consti. | SRL | ABSA | ORL |
|---|---|---|---|---|---|---|---|---|---|---|
| Different Models for Different Tasks | | | | | | | | | | |
| ELMo (Peters et al., 2018) | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ |
| BERT (Devlin et al., 2019) | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| SpanBERT (Joshi et al., 2019) | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| Single Model for Different Tasks | | | | | | | | | | |
| Guo et al. (2016) | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ |
| Swayamdipta et al. (2018) | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | ✓ | ✓ | ✗ | ✗ |
| Strubell et al. (2018) | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ |
| Clark et al. (2018) | ✓ | ✗ | ✗ | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ |
| Luan et al. (2018, 2019) | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| Dixit and Al-Onaizan (2019) | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| Marasović and Frank (2018) | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ |
| Hashimoto et al. (2017) | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ |
| This Work | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |

Table 1: A comparison of the tasks covered by previous work and our work.

2 Span-Relation Representations

In this section, we explain how the BRAT format can be used to represent a large number of tasks. There are two fundamental types of annotations: span annotations and relation annotations. Given a sentence $x = [w_1, w_2, \ldots, w_n]$ of $n$ tokens, a span annotation $(s_i, l_i)$ consists of a contiguous span of tokens $s_i = [w_{b_i}, w_{b_i+1}, \ldots, w_{e_i}]$ and its label $l_i$ ($l_i \in \mathcal{L}$), where $b_i$/$e_i$ are the start/end indices respectively, and $\mathcal{L}$ is a set of span labels. A relation annotation $(s_j, s_k, r_{jk})$ refers to a relation $r_{jk}$ ($r_{jk} \in \mathcal{R}$) between the head span $s_j$ and the tail span $s_k$, where $\mathcal{R}$ is a set of relation types. This span-relation representation can easily express many tasks by defining $\mathcal{L}$ and $\mathcal{R}$ accordingly, as summarized in Table 2a and Table 2b. These tasks fall into two categories: span-oriented tasks, where the goal is to predict labeled spans (e.g., named entities in NER), and relation-oriented tasks, where the goal is to predict relations between two spans (e.g., the relation between two entities in RE). For example, constituency parsing (Collins, 1997) is a span-oriented task aiming to produce a syntactic parse tree for a sentence, where each node of the tree is an individual span associated with a constituent label. Coreference resolution (Pradhan et al., 2012) is a relation-oriented task that links an expression to its mentions within or beyond a single sentence. Dependency parsing (Kübler et al., 2009) is also a relation-oriented task, in which each relation links a word to its syntactic parent with the corresponding dependency type.

| Task | Spans annotated with labels |
|---|---|
| NER | [Barack Obama]_person was born in [Hawaii]_location . |
| Consti. | [And [[their suspicions]_NP [of [each other]_NP]_PP]_NP [run [deep]_ADVP]_VP .]_S |
| POS | [What]_WP [kind]_NN [of]_IN [memory]_NN ? |
| ABSA | Great laptop that offers [many great features]_positive ! |

Table 2a: Span-oriented tasks. Spans are marked with brackets and subscripted with their labels.
| Task | Spans and relations annotated with labels |
|---|---|
| RE | The [burst] has been caused by [pressure]. (cause-effect between the two spans) |
| Coref. | I voted for [Tom] because [he] is clever. (coref. between Tom and he) |
| SRL | [We] [brought] [you] [the tale of two cities]. (ARG0: We; ARG2: you; ARG1: the tale of two cities) |
| OpenIE | [The four lawyers] [climbed] out from [under a table]. (ARG0: The four lawyers; ARG1: under a table) |
| Dep. | The entire division employs about 850 workers. (det: The → division; amod: entire → division; nsubj: division → employs; advmod: about → 850; nummod: 850 → workers; dobj: workers → employs) |
| ORL | [We] therefore as MDC do [not accept] [this result]. (holder: We; target: this result) |

Table 2b: Relation-oriented tasks. Directed arcs indicate the relations between spans.

To properly scope this work, we also note what is not covered. Notably, sentence-level tasks such as text classification and natural language inference are not covered, although they could also be formulated using this span-relation representation by treating the entire sentence as a span. We chose to omit these tasks because they are already well-represented by previous work on generalized architectures (Lan and Xu, 2018) and multi-task learning (Devlin et al., 2019; Liu et al., 2019), and thus we mainly focus on tasks using phrase-like spans. In addition, the span-relation representations described here are designed for natural language analysis, and cannot handle tasks that require generation of text, such as machine translation (Bojar et al., 2014), dialog response generation (Lowe et al., 2015), and summarization (Nallapati et al., 2016). There are also a small number of analysis tasks, such as semantic parsing to logical forms (Banarescu et al., 2013), where the outputs are not directly associated with spans in the input; handling these tasks is beyond the scope of this work.
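To make the span-relation format concrete, here is a minimal sketch in Python (not the authors' released code) of how such annotations can be held in memory; the class and field names are illustrative, and the example labels come from Figure 1 and Table 2a:

```python
# Minimal data structures for the span-relation format: a sentence is a
# token list plus span annotations (s_i, l_i) and relation annotations
# (s_j, s_k, r_jk). Start/end indices are inclusive, matching b_i/e_i.
from dataclasses import dataclass, field
from typing import List

@dataclass(frozen=True)
class Span:
    start: int   # index b_i of the first token (inclusive)
    end: int     # index e_i of the last token (inclusive)
    label: str   # l_i, drawn from the task-specific label set L

@dataclass(frozen=True)
class Relation:
    head: Span   # s_j
    tail: Span   # s_k
    label: str   # r_jk, drawn from the task-specific relation set R

@dataclass
class Sentence:
    tokens: List[str]
    spans: List[Span] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)

# NER labels spans; RE labels a relation between two spans.
sent = Sentence(tokens="Barack Obama was born in Hawaii .".split())
obama = Span(0, 1, "person")
hawaii = Span(5, 5, "location")
sent.spans += [obama, hawaii]
sent.relations.append(Relation(obama, hawaii, "born-in"))
```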
3 Span-Relation Model

Now that it is clear that a very large number of analysis tasks can be formulated in a single format, we turn to devising a single model that can solve these tasks. We base our model on a span-based model first designed for end-to-end coreference resolution (Lee et al., 2017), which was later adapted for other tasks (He et al., 2018; Luan et al., 2018, 2019; Dixit and Al-Onaizan, 2019; Zhang and Zhao, 2019). At the core of the model is a module that represents each span as a fixed-length vector, which is used to predict labels for spans or span pairs. We first briefly describe the span representation, used and proven to be effective in previous works, then highlight some details we introduce to make this model generalize to a wide variety of tasks.
Span Representation
Given a sentence $x = [w_1, w_2, \ldots, w_n]$ of $n$ tokens, a span $s_i = [w_{b_i}, w_{b_i+1}, \ldots, w_{e_i}]$ is represented by concatenating two components: a content representation $z_i^c$ calculated as the weighted average across all token embeddings in the span, and a boundary representation $z_i^u$ that concatenates the embeddings at the start and end positions of the span. Specifically,

$c_1, c_2, \ldots, c_n = \mathrm{TokenRepr}(w_1, w_2, \ldots, w_n)$,   (1)
$u_1, u_2, \ldots, u_n = \mathrm{BiLSTM}(c_1, c_2, \ldots, c_n)$,   (2)
$z_i^c = \mathrm{SelfAttn}(c_{b_i}, c_{b_i+1}, \ldots, c_{e_i})$,   (3)
$z_i^u = [u_{b_i}; u_{e_i}], \quad z_i = [z_i^c; z_i^u]$,   (4)

where TokenRepr could be non-contextualized, such as GloVe (Pennington et al., 2014), or contextualized, such as BERT (Devlin et al., 2019). We refer to Lee et al. (2017) for further details.
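For illustration, the following is a minimal PyTorch sketch of Equations (2)-(4), assuming the TokenRepr output of Equation (1) is given and a batch size of one; the module and parameter names are ours, not from the paper's implementation, and SelfAttn is a simple learned softmax over the span's tokens in the spirit of Lee et al. (2017):

```python
import torch
import torch.nn as nn

class SpanRepr(nn.Module):
    """Sketch of Eqs. (2)-(4): BiLSTM boundary features plus a
    self-attentive weighted average of token embeddings in the span."""
    def __init__(self, dim: int):
        super().__init__()
        # Eq. (2): contextualize token embeddings with a BiLSTM.
        self.bilstm = nn.LSTM(dim, dim // 2, batch_first=True,
                              bidirectional=True)
        # Scoring layer for the self-attention weights in Eq. (3).
        self.attn_score = nn.Linear(dim, 1)

    def forward(self, c: torch.Tensor, b: int, e: int) -> torch.Tensor:
        # c: (1, n, dim) token embeddings from TokenRepr (Eq. (1)),
        # e.g. GloVe or BERT; b/e: inclusive span boundaries.
        u, _ = self.bilstm(c)                         # Eq. (2)
        span = c[:, b:e + 1]                          # tokens inside the span
        a = torch.softmax(self.attn_score(span), dim=1)
        z_c = (a * span).sum(dim=1)                   # Eq. (3): content repr.
        z_u = torch.cat([u[:, b], u[:, e]], dim=-1)   # Eq. (4): boundary repr.
        return torch.cat([z_c, z_u], dim=-1)          # z_i = [z_c; z_u]

repr_dim = 64
enc = SpanRepr(repr_dim)
tokens = torch.randn(1, 10, repr_dim)   # stand-in for TokenRepr output
z = enc(tokens, b=2, e=5)               # representation of span w_2..w_5
print(z.shape)                          # (1, 3 * repr_dim)
```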
Span and Relation Label Prediction

Since we extract spans and relations in an end-to-end fashion, we introduce two additional labels, NEG SPAN and NEG REL, in $\mathcal{L}$ and $\mathcal{R}$ respectively. NEG SPAN indicates invalid spans (e.g., spans that are not named entities in NER) and NEG REL indicates invalid span pairs without any relation between them (e.g., no relation exists between two arguments in SRL).
| Dataset | Domain | #Sentences | Task | #Spans | #Relations | Metric |
|---|---|---|---|---|---|---|
| Wet Lab Protocols (Kulkarni et al., 2018) | biology | 14,301 | NER | 60,745 | - | F1 |
| | | | RE | 60,745 | 43,773 | F1 |
| CoNLL-2003 (Sang and Meulder, 2003) | news | 20,744 | NER | 35,089 | - | F1 |
| SemEval-2010 Task 8 (Hendrickx et al., 2010) | misc. | 10,717 | RE | 21,437 | 10,717 | Macro F1 ◦ |
| OntoNotes 5.0 ⋆ (Pradhan et al., 2013) | misc. | 94,268 | Coref. | 194,477 | 1,166,513 | Avg. F1 |
| | | | SRL | 745,796 | 543,534 | F1 |
| | | | POS | 1,631,995 | - | Accuracy |
| | | | Dep. | 1,722,571 | 1,628,558 | LAS |
| | | | Consti. | 1,320,702 | - | Evalb F1 † |
| Penn Treebank (Marcus et al., 1994) | speech, news | 49,208 | POS | 1,173,766 | - | Accuracy |
| | | 43,948 | Dep. | 1,090,777 | 1,046,829 | LAS |
| | | 43,948 | Consti. | 871,264 | - | Evalb F1 † |
| OIE2016 (Stanovsky and Dagan, 2016) | news, Wiki | 2,534 | OpenIE | 15,717 | 12,451 | F1 |
| MPQA 3.0 (Deng and Wiebe, 2015) | news | 3,585 | ORL | 13,841 | 9,286 | F1 |
| SemEval-2014 Task 4 (Pontiki et al., 2014) | reviews | 4,451 | ABSA | 7,674 | - | Accuracy ◦ |

Table 3: Statistics of GLAD, consisting of 10 tasks from 8 datasets. ⋆ Following He et al. (2018), we use a subset of the OntoNotes 5.0 dataset based on CoNLL 2012 splits (Pradhan et al., 2012). ◦ Previous works use gold standard spans in these evaluations. † We use the bracket scoring program Evalb (Collins, 1997) in constituency parsing.
We first predict labels for all spans up to a length of $l$ words using a multilayer perceptron (MLP): $\mathrm{softmax}(\mathrm{MLP}^{\mathrm{span}}(z_i)) \in \Delta^{|\mathcal{L}|}$, where $\Delta^{|\mathcal{L}|}$ is a $|\mathcal{L}|$-dimensional simplex. Then we keep the top $K = \tau \cdot n$ spans with the lowest NEG SPAN probability in relation prediction for efficiency, where a smaller pruning threshold $\tau$ indicates more aggressive pruning. Another MLP is applied to pairs of the remaining spans to produce their relation scores: $o_{jk} = \mathrm{MLP}^{\mathrm{rel}}([z_j; z_k; z_j \cdot z_k]) \in \mathbb{R}^{|\mathcal{R}|}$, where $j$ and $k$ index two spans.
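A minimal sketch of this prediction layer follows (hypothetical helper and variable names; a real implementation would batch sentences and mask invalid spans):

```python
import torch
import torch.nn as nn

def score_spans_and_relations(z, span_mlp, rel_mlp, n, neg_span_idx=0, tau=0.5):
    """Label all candidate spans, keep the K = tau * n spans with the lowest
    NEG_SPAN probability, and score every ordered pair of survivors.
    z: (num_spans, d) span representations for a sentence of n tokens."""
    span_logits = span_mlp(z)                               # (num_spans, |L|)
    neg_prob = torch.softmax(span_logits, dim=-1)[:, neg_span_idx]
    k = min(z.size(0), max(1, int(tau * n)))
    keep = torch.topk(-neg_prob, k).indices                 # least-NEG_SPAN spans
    zk = z[keep]
    # o_jk = MLP([z_j; z_k; z_j * z_k]) for every ordered pair (j, k).
    zj = zk.unsqueeze(1).expand(k, k, -1)
    zk2 = zk.unsqueeze(0).expand(k, k, -1)
    rel_logits = rel_mlp(torch.cat([zj, zk2, zj * zk2], dim=-1))  # (k, k, |R|)
    return span_logits, keep, rel_logits

d = 192
span_mlp = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, 5))
rel_mlp = nn.Sequential(nn.Linear(3 * d, 128), nn.ReLU(), nn.Linear(128, 4))
out = score_spans_and_relations(torch.randn(30, d), span_mlp, rel_mlp, n=20)
```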
Application to Disparate Tasks

For most of the tasks, we can simply maximize the probability of the ground truth relation for all pairs of the remaining spans. However, some tasks might have different requirements; e.g., coreference resolution aims to cluster spans referring to the same concept, and we do not care which antecedent a span is linked to if there are multiple ones. Thus, we provide two training loss functions (both are sketched in code at the end of this subsection):

1. Pairwise: Maximize the probabilities of the ground truth relations for all pairs of the remaining spans independently: $\mathrm{softmax}(o_{jk})_{r_{jk}}$, where $r_{jk}$ indexes the ground truth relation.
2. Head: Maximize the probability of the ground truth head spans for a specific span $s_j$: $\sum_{k \in \mathrm{head}(s_j)} \mathrm{softmax}([\tilde{o}_{j1}, \tilde{o}_{j2}, \ldots, \tilde{o}_{jK}])_k$, where $\mathrm{head}(\cdot)$ returns the indices of one or more heads and $\tilde{o}_{jk}$ is the corresponding scalar from $o_{jk}$ indicating how likely two spans are related.

We use option 1 for all tasks except coreference resolution, which uses option 2. Note that the above loss functions differ only in how relation scores are normalized; the other parts of the model remain the same across different tasks. At test time, we follow previous inference methods to generate valid outputs. For coreference resolution, we link a span to the antecedent with the highest score (Lee et al., 2017). For constituency parsing, we use greedy top-down decoding to generate a valid parse tree (Stern et al., 2017). For dependency parsing, each word is linked to exactly one parent with the highest relation probability. For other tasks, we predict relations for all span pairs and use those not predicted as NEG REL to construct outputs.

Our core insight is that the above formulation is largely task-agnostic, meaning that a task can be modeled in this framework as long as it can be formulated as a span-relation prediction problem with properly defined span labels $\mathcal{L}$ and relation labels $\mathcal{R}$. As shown in Table 1, this unified Span-Relation (SpanRel) model makes it simple to scale to a large number of language analysis tasks, with breadth far beyond that of previous work.
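As referenced above, here is a minimal sketch of the two loss options, assuming relation scores have already been computed; it follows the formulas only in spirit and omits batching:

```python
import torch
import torch.nn.functional as F

def pairwise_loss(o: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    """Option 1: independent cross-entropy over relation labels for every
    pair of remaining spans. o: (num_pairs, |R|) relation scores;
    r: (num_pairs,) gold label ids, with NEG_REL for unrelated pairs."""
    return F.cross_entropy(o, r)

def head_loss(o_j: torch.Tensor, head_mask: torch.Tensor) -> torch.Tensor:
    """Option 2 (coreference): normalize the scores of span s_j against all
    K candidates and maximize the total probability mass on its gold heads.
    o_j: (K,) scalar scores; head_mask: (K,) booleans marking gold heads
    (a dummy 'no antecedent' candidate can be marked when none exists,
    as in Lee et al. (2017))."""
    log_p = torch.log_softmax(o_j, dim=-1)
    # -log sum_k p_k over gold heads == -logsumexp of their log-probs.
    return -torch.logsumexp(log_p[head_mask], dim=-1)

# Toy usage with random scores.
loss1 = pairwise_loss(torch.randn(6, 4), torch.tensor([0, 0, 1, 3, 0, 2]))
loss2 = head_loss(torch.randn(5), torch.tensor([False, True, True, False, False]))
```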
Multi-task Learning
The SpanRel model makes it easy to perform multi-task learning (MTL) by sharing all parameters except for the MLPs used for label prediction. However, because different tasks capture different linguistic aspects, they are not equally beneficial to each other. It is expected that jointly training on related tasks is helpful, while forcing the same model to solve unrelated tasks might even hurt performance (Ruder, 2017). Compared to manually choosing source tasks based on prior knowledge, which might be sub-optimal when the number of tasks is large, SpanRel offers a systematic way to examine the relative benefits of source-target task pairs, by either performing pairwise MTL or attention-based analysis, as we will show in Section 4.3.
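A minimal sketch of this parameter-sharing scheme, with illustrative module names (the encoder stands in for the shared TokenRepr, BiLSTM, and span representation):

```python
import torch
import torch.nn as nn

class SpanRelMTL(nn.Module):
    """Illustrative sketch: one shared encoder plus per-task span/relation
    label MLPs, which are the only task-specific parameters."""
    def __init__(self, encoder: nn.Module, d: int, tasks: dict):
        super().__init__()
        self.encoder = encoder  # shared across all tasks
        # tasks maps a task name to (num_span_labels, num_rel_labels),
        # both label sets including NEG_SPAN / NEG_REL.
        self.span_heads = nn.ModuleDict({
            t: nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, ns))
            for t, (ns, _) in tasks.items()})
        self.rel_heads = nn.ModuleDict({
            t: nn.Sequential(nn.Linear(3 * d, 128), nn.ReLU(), nn.Linear(128, nr))
            for t, (_, nr) in tasks.items()})

    def span_logits(self, tokens, task: str):
        z = self.encoder(tokens)          # shared span representations
        return self.span_heads[task](z)   # only this MLP is task-specific

model = SpanRelMTL(encoder=nn.Linear(300, 192), d=192,
                   tasks={"ner": (5, 1), "srl": (3, 27)})
logits = model.span_logits(torch.randn(30, 300), task="ner")
```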
4 Experiments

We first describe our General Language Analysis Datasets (GLAD) benchmark and evaluation metrics, then conduct experiments to (1) verify that SpanRel can achieve comparable performance across all tasks (Section 4.2), and (2) demonstrate its benefits in multi-task learning (Section 4.3).

4.1 GLAD Benchmark and Evaluation Metrics

As summarized in Table 3, we convert 8 widely used datasets with annotations of 10 tasks into the BRAT format and include them in the GLAD benchmark. It covers diverse domains, providing a holistic testbed for natural language analysis evaluation. The major evaluation metric is span-based F1 (denoted F1), a standard metric for SRL. Precision is the proportion of extracted spans (spans not predicted as NEG SPAN) that are consistent with the ground truth. Recall is the proportion of ground truth spans that are correctly extracted. Span F1 is also applicable to relations, where an extracted relation (a relation not predicted as NEG REL) is correct iff both its head and tail spans have correct boundaries and the predicted relation label is correct. To make fair comparisons with existing works, we also compute standard metrics for different tasks, as listed in Table 3.
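For concreteness, span-based F1 can be computed as in the following sketch; relation F1 is analogous, with (head span, tail span, relation) tuples instead of (start, end, label) triples:

```python
def span_f1(pred_spans, gold_spans):
    """Span-based F1 as described above: a predicted (start, end, label)
    triple counts as correct iff it exactly matches a gold one. Inputs are
    collections of such triples (NEG_SPAN predictions already removed)."""
    pred, gold = set(pred_spans), set(gold_spans)
    correct = len(pred & gold)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# Example: one boundary error lowers both precision and recall.
gold = {(0, 1, "person"), (5, 5, "location")}
pred = {(0, 1, "person"), (4, 5, "location")}
print(round(span_f1(pred, gold), 2))  # 0.5
```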
Implementation Details
We attempted four token representation methods (Equation 1), namely GloVe (Pennington et al., 2014), ELMo (Peters et al., 2018), BERT (Devlin et al., 2019), and SpanBERT (Joshi et al., 2019). We use BERT-base in our main results and report BERT-large in Appendix B. A three-layer BiLSTM with 256 hidden units is used (Equation 2). Both span and relation prediction MLPs have two layers with 128 hidden units. Dropout (Srivastava et al., 2014) of 0.5 is applied to all layers. For GloVe and ELMo, we use Adam (Kingma and Ba, 2015) with a learning rate of 1e-3 and early stopping with a patience of 3. For BERT and SpanBERT, we follow the standard fine-tuning recipe, with a learning rate of 5e-5, $\beta_1 = 0.9$, $\beta_2 = 0.999$, L2 weight decay of 0.01, warmup over the first 10% of steps, and the number of epochs tuned on the development set. The task-specific hyperparameters, maximal span length and pruning ratio, are tuned on the development set and listed in Appendix C.

4.2 Comparison with State-of-the-Art Task-specific Models

We compare the SpanRel model with state-of-the-art task-specific models by training on data from a single task. By doing so we attempt to answer the research question: can a single model with minimal task-specific engineering achieve performance competitive with or superior to models that have been specifically engineered? We select competitive SOTA models mainly based on settings, e.g., single-task learning and end-to-end extraction of spans and relations. To make fair comparisons, token embeddings (GloVe, ELMo, BERT) and other hyperparameters (e.g., the number of antecedents in Coref. and the maximal span length in SRL) in our method are set to match those used by the SOTA models, so as to focus on differences brought about by the model architecture.

As shown in Table 4, the SpanRel model achieves performance comparable to task-specific SOTA methods, regardless of whether the token representation is contextualized or not. This indicates that the span-relation format can generically represent a large number of natural language analysis tasks and that it is possible to devise a single unified model that achieves strong performance on all of them. It provides a strong and generic baseline for natural language analysis tasks and a way to examine the usefulness of task-specific designs.

| Category | Task | Metric | Dataset | Setting | SOTA Model | Previous SOTA | Our Model |
|---|---|---|---|---|---|---|---|
| IE | NER | F1 | CoNLL03 | BERT | Devlin et al. (2019) | 92.8 | 92.2 |
| IE | NER | F1 | WLP | ELMo | Luan et al. (2019) | 79.5 | 79.2 |
| IE | RE | Macro F1 | SemEval10 | BERT, gold | Wu and He (2019) | 89.3 | 87.4 |
| IE | RE | F1 | WLP | ELMo | Luan et al. (2019) | 64.1 | 65.5 |
| IE | Coref. | Avg. F1 | OntoNotes | GloVe, CharCNN | Lee et al. (2017) ◦ | | |
| IE | OpenIE | F1 | OIE2016 | ELMo | Stanovsky et al. (2018) ⋆ | | |
| SRL | SRL | F1 | OntoNotes | ELMo | He et al. (2018) † | | |
| Parsing | Consti. | Evalb F1 | PTB | BERT | Kitaev et al. (2019) | 95.6 | 95.5 |
| Sentiment | ABSA | Accuracy | SemEval14 | BERT, gold | Xu et al. (2019) ⋄ | | |
| Sentiment | ORL | F1 | MPQA 3.0 | GloVe, gold | Marasović and Frank (2018) ⋆ | | |

Table 4: Comparison between SpanRel models and task-specific SOTA models. Following Luan et al. (2019), we perform NER and RE jointly on the WLP dataset. We use gold entities in SemEval-2010 Task 8, gold aspect terms in SemEval-2014 Task 4, and gold opinion expressions in MPQA 3.0 to be consistent with existing works. ◦ The small version of Lee et al. (2017)'s method, with 100 antecedents and no speaker features. ⋆ For OpenIE and ORL, we use span-based F1 instead of the syntactic-head-based F1 and binary coverage F1 used in the original papers, because the latter are biased towards extracting long spans. † For SRL, we compare with He et al. (2018) because they also extract predicates and arguments in an end-to-end way. ⋄ We follow Xu et al. (2019) to report accuracy of the restaurant and laptop domains separately in ABSA.
4.3 Multi-task Learning

To demonstrate the benefit of the SpanRel model in MTL, we perform single-task learning (STL) and MTL across all tasks using end-to-end settings.² Following Liu et al. (2019), we perform MTL+fine-tuning and show the results in separate columns of Table 5. Contextualized token representations yield significantly better results than GloVe on all tasks, indicating that pre-training on large corpora is almost universally helpful to NLP tasks. Comparing the results of MTL+fine-tuning with STL, we find that performance with GloVe drops on 8 out of 15 tasks, most of which are tasks with relatively sparse data. This is probably because the capacity of the GloVe-based model is too small to store all the patterns required by different tasks. The results with contextualized representations are mixed, with some tasks being improved and others remaining the same or degrading. We hypothesize that this is because different tasks capture different linguistic aspects and thus are not equally helpful to each other; reconciling these seemingly different tasks in the same model might be harmful to some tasks. Notably, as the contextualized representations become stronger, the performance of MTL+FT becomes more favorable. 5 out of 15 tasks (NER, RE, OpenIE, SRL, ORL) observe statistically significant improvements (paired bootstrap re-sampling) with SpanBERT, a contextualized embedding pre-trained with span-based training objectives, while only one task degrades (ABSA), indicating its superiority in reconciling spans from different tasks. The GLAD benchmark provides a holistic testbed for evaluating natural language analysis capability.

² Span-based F1 is used as the evaluation metric in SemEval-2010 Task 8 and SemEval-2014 Task 4, as opposed to the macro F1 and accuracy reported in the original papers, because we aim at end-to-end extraction.

| | | | | GloVe | | | ELMo | | | BERT-base | | | SpanBERT-base | | |
| Category | Task | Metric | Dataset | STL | MTL | +FT | STL | MTL | +FT | STL | MTL | +FT | STL | MTL | +FT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| IE | NER | F1 | CoNLL03 | 88.4 | 86.2 ↓ | | | | | | | | | | |
| | RE | F1 | SemEval10 | 50.7 | 15.2 ↓ | | | | | | | | | | |
| | Coref. | Avg. F1 | OntoNotes | 56.3 | 50.3 ↓ | | | | | | | | | | |
| | OpenIE | F1 | OIE2016 | 28.3 | 6.8 ↓ | | | | | | | | | | |
| SRL | SRL | F1 | OntoNotes | 78.0 | 77.9 | | | | | | | | | | |
| Parsing | Dep. | LAS | PTB | 92.9 | 93.2 ↑ | | | | | | | | | | |
| | Consti. | Evalb F1 | PTB | 93.4 | - | 93.8 | 95.3 | - | 95.3 | 95.5 | - | 95.2 | 95.8 | - | 95.5 |
| | | Evalb F1 | OntoNotes | 91.0 | - | | | | | | | | | | |
| Sentiment | ABSA | F1 | SemEval14 | 63.5 | 48.5 ↓ | | | | | | | | | | |
| | ORL | F1 | MPQA 3.0 | 38.2 | 18.4 ↓ | | | | | | | | | | |
| POS | POS | Accuracy | PTB | 96.8 | 96.8 | 96.8 | 97.7 | 97.7 | 97.8 | 97.6 | 97.3 | 97.3 | 97.6 | 97.6 | 97.6 |
| | | Accuracy | OntoNotes | 97.0 | 97.0 | 97.1 | 98.2 | 98.2 | 98.3 | 97.7 | 97.8 | 97.8 | 98.3 | 98.3 | 98.3 |

Table 5: Comparison between STL and MTL+fine-tuning across all tasks. ↑ indicates results better than STL, ↓ indicates worse, and cells without an arrow are almost the same (i.e., a difference within 0.5). Constituency parsing requires more memory than other tasks, so we restrict its span length to 10 in MTL and thus do not report MTL results.

Task Relatedness Analysis
To further investigate how different tasks interact with each other, we choose five source tasks (i.e., tasks used to improve other tasks: POS, NER, Consti., Dep., and SRL) that have been widely used in MTL (Hashimoto et al., 2017; Strubell et al., 2018) and six target tasks (i.e., tasks to be improved: OpenIE, NER, RE, ABSA, ORL, and SRL) to perform pairwise multi-task learning.

We hypothesize that although language modeling pre-training is theoretically orthogonal to MTL (Swayamdipta et al., 2018), in practice their benefits tend to overlap. To analyze these two factors separately, we start with a weak representation, GloVe, to study task relatedness, then move to BERT to demonstrate how much we can still improve with MTL given strong contextualized representations. As shown in Table 6 (GloVe), tasks are not equally useful to each other. Notably, (1) for OpenIE and ORL, multi-task learning with SRL improves performance significantly, while other tasks lead to less or no improvement; and (2) dependency parsing and SRL are generic source tasks that are beneficial to most of the target tasks. The unified SpanRel model makes it easy to perform MTL and decide on beneficial source tasks.

Next, we demonstrate that our framework also provides a platform for analysis of similarities and differences between tasks. Inspired by the intuition that attention coefficients are somewhat indicative of a model's internal focus (Li et al., 2016; Vig, 2019; Clark et al., 2019), we hypothesize that the similarity or difference between attention mechanisms may be correlated with similarity between tasks, or even the success or failure of MTL. To test this hypothesis, we extract the attention maps of two BERT-based SpanRel models (trained on a source task $t'$ and a target task $t$ separately) over sentences $\mathcal{X}_t$ from the target task, and compute their similarity using the Frobenius norm:

$\mathrm{sim}_k(t, t') = -\frac{1}{|\mathcal{X}_t|} \sum_{x \in \mathcal{X}_t} \left\| A_k^t(x) - A_k^{t'}(x) \right\|_F$,

where $A_k^t(x)$ is the attention map extracted from the $k$-th head by running the model trained on task $t$ on sentence $x$. We select OpenIE as the target task because it shows the largest performance variation when paired with different source tasks (34.0-38.8) in Table 6. We visualize the attention similarity of all heads in BERT (12 layers × 12 heads) between two mutually harmful tasks (OpenIE/POS on the left) and between two mutually helpful tasks (OpenIE/SRL on the right) in Figure 2a. A common trend is that heads in higher layers exhibit more divergence, probably because they are closer to the prediction layer and thus easier to be affected by the end task. Overall, it can be seen that OpenIE/POS shows much more attention divergence than OpenIE/SRL. A notable difference is that almost all heads in the last two layers of the OpenIE/POS models differ significantly, while some heads in the last two layers of the OpenIE/SRL models still behave similarly, providing evidence that failure of MTL can be attributed to the fact that dissimilar tasks require different attention patterns. We further compute average attention similarities for all source tasks in Figure 2b, and we can see that there is a strong correlation (Pearson correlation of 0.97) between the attention similarity and the performance of pairwise MTL, supporting our hypothesis that attention pattern similarities can be used to predict improvements from MTL.

Figure 2: Attention-based task relatedness analysis. (a) Attention similarity between OpenIE/POS (left) and between OpenIE/SRL (right) for all heads. (b) Correlation between attention similarity and MTL performance.
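A sketch of this head-wise similarity computation, assuming the attention maps have already been extracted from the two models:

```python
import torch

def attention_similarity(maps_t, maps_t_prime):
    """maps_*: one (num_heads, n, n) attention tensor per sentence x in X_t,
    extracted from the models trained on tasks t and t' respectively.
    Returns the per-head similarity sim_k(t, t') defined above."""
    per_sentence = [
        -torch.linalg.norm(a - b, ord="fro", dim=(-2, -1))  # (num_heads,)
        for a, b in zip(maps_t, maps_t_prime)]
    return torch.stack(per_sentence).mean(dim=0)  # average over sentences

# Toy usage: 144 heads (12 layers x 12 heads), two sentences of length 7 and 9.
a = [torch.rand(144, 7, 7), torch.rand(144, 9, 9)]
b = [torch.rand(144, 7, 7), torch.rand(144, 9, 9)]
print(attention_similarity(a, b).shape)  # torch.Size([144])
```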
MTL under Different Settings
We analyze how token representations and sizes of the target dataset affect the performance of MTL. Comparing BERT and GloVe in Table 6, the improvements become smaller or vanish as the token representation becomes stronger; e.g., the improvement on OpenIE with SRL reduces from 5.8 to 1.6. This is expected because both large-scale pre-training and MTL aim to learn general representations, and their benefits tend to overlap in practice. Interestingly, some helpful source tasks become harmful when we shift from GloVe to BERT, such as OpenIE paired with POS. We conjecture that the gains of MTL might have already been achieved by BERT, while the task-specific characteristics of POS hurt the performance of OpenIE. We did not observe many tasks benefiting from MTL for the GloVe-based model in Table 5, because there the model is trained on all tasks (instead of two), which is beyond its limited model capacity. The improvements of MTL shrink as the size of the SRL dataset increases, as shown in Figure 3, indicating that MTL is useful when the target data is sparse.

| | GloVe | | | | | | BERT-base | | | | | |
| Target | STL | POS | NER | Consti. | Dep. | SRL | STL | POS | NER | Consti. | Dep. | SRL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OpenIE | 28.3 | ↑ | ↓ | ↑ | ↑ | ↑ | | ↓ | ↓ | ↓ | ↑ | ↑ |
| NER (WLP) | 77.6 | 77.8 | ↑ | ↑ | ↑ | ↑ | | | | | | |
| RE (WLP) | 64.9 | ↑ | ↑ | ↑ | ↑ | ↑ | | | | | | |
| RE (SemEval10) | 50.7 | ↑ | ↑ | ↓ | ↑ | ↑ | | ↓ | ↓ | ↓ | | |
| ABSA | 63.5 | 63.4 | 62.8 | ↓ | ↓ | ↓ | | ↓ | ↑ | ↓ | ↓ | |
| ORL | 38.2 | 35.7 | ↓ | ↓ | ↑ | ↑ | | ↑ | ↑ | | | |
| SRL (10k) | 68.8 | ↑ | ↑ | ↑ | | - | 78.7 | ↑ | ↑ | ↑ | ↑ | - |

Table 6: Performance of pairwise multi-task learning with GloVe and BERT-base. ↑ indicates results better than STL, ↓ indicates worse, and cells without an arrow are almost the same (i.e., a difference within 0.5). We show the performance after fine-tuning. The dataset of the source tasks POS, Consti., and Dep. is PTB, and the dataset of NER is CoNLL-2003.

Figure 3: MTL performance of SRL with respect to the data size.
Time Complexity Analysis
Time complexities of span and relation prediction are $O(l \cdot n)$ and $O(K^2) = O(\tau^2 \cdot n^2)$ respectively for a sentence of $n$ tokens (Section 3). The time complexity of BERT is $O(L \cdot n^2)$, dominated by its $L$ self-attention layers. Since the pruning threshold $\tau$ is usually less than 1, the computational overhead introduced by the span-relation output layer is much smaller than that of BERT. In practice, we observe that the training/testing time is mainly spent in BERT. For SRL, one of the most computation-intensive tasks, with long spans and dense span/relation annotations, 85.5% of the time is spent in BERT. For POS, a less heavy task, the time spent in BERT increases to 98.5%. Another option for span prediction is to formulate it as a sequence labeling task, as in previous works (Lample et al., 2016; He et al., 2017), where the time complexity is $O(n)$. Although slower than token-based labeling models, span-based models offer the advantages of being able to model overlapping spans and to use span-level information for label prediction (Lee et al., 2017).
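As a back-of-the-envelope illustration of these claims, with made-up but plausible values:

```python
# Candidate counts behind the complexity claims above, for a sentence of
# n = 50 tokens with max span length l = 10, pruning ratio tau = 0.5, and
# a BERT encoder with L = 12 self-attention layers.
n, l, tau, L = 50, 10, 0.5, 12

candidate_spans = l * n          # span prediction: O(l * n)      -> 500
K = int(tau * n)                 # spans kept after pruning       -> 25
relation_pairs = K * K           # relation scoring: O(tau^2 n^2) -> 625
bert_ops = L * n * n             # self-attention: O(L * n^2)     -> 30000
print(candidate_spans, relation_pairs, bert_ops)
```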
5 Related Work

General Architectures for NLP

There has been rising interest in developing general architectures for different NLP tasks, with the most prominent examples being the sequence labeling framework (Collobert et al., 2011; Ma and Hovy, 2016) used for tagging tasks and the sequence-to-sequence framework (Sutskever et al., 2014) used for generation tasks. Moreover, researchers typically pick related tasks, motivated by either linguistic insights or empirical results, and create a general framework to perform MTL; several of these works are summarized in Table 1. For example, Swayamdipta et al. (2018) and Strubell et al. (2018) use constituency and dependency parsing to improve SRL. Luan et al. (2018, 2019) and Wadden et al. (2019) use a span-based model to jointly solve three information-extraction-related tasks (NER, RE, and Coref.). Li et al. (2019) formulate both nested NER and flat NER as a machine reading comprehension task. Compared to existing works, we aim to create an output representation that can solve nearly every natural language analysis task in one fell swoop, allowing us to cover a far broader range of tasks with a single model.

In addition, NLP has seen a recent burgeoning of contextualized representations pre-trained on large corpora (e.g., ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019)). These methods focus on learning generic input representations, but are agnostic to the output representation, requiring different predictors for different tasks. In contrast, we present a methodology to formulate the output of different tasks in a unified format. Our work is thus orthogonal to work on contextualized embeddings. Indeed, in Section 4.3, we demonstrate that the SpanRel model can benefit from stronger contextualized representation models, and it even provides a testbed for their use in natural language analysis.
Benchmarks for Evaluating Natural Language Understanding
Due to the rapid development of NLP models, large-scale benchmarks such as SentEval (Conneau and Kiela, 2018), GLUE (Wang et al., 2019b), and SuperGLUE (Wang et al., 2019a) have been proposed to facilitate fast and holistic evaluation of models' understanding ability. They mainly focus on sentence-level tasks, such as natural language inference, while our GLAD benchmark focuses on token/phrase-level analysis tasks with diverse coverage of different linguistic structures. New tasks and datasets can be conveniently added to our benchmark as long as they are in the BRAT standoff format, which is one of the most commonly used data formats in the NLP community; e.g., it has been used in the BioNLP shared tasks (Kim et al., 2009) and the Universal Dependencies project (McDonald et al., 2013).
6 Conclusion

We provide the simple insight that a large number of natural language analysis tasks can be represented in a single format consisting of spans and relations between spans. As a result, these tasks can be solved in a single modeling framework that first extracts spans and predicts their labels, then predicts relations between spans. We attempted 10 tasks with this SpanRel model and showed that this generic task-independent model can achieve performance competitive with state-of-the-art methods tailored to each task. We merge 8 datasets into our GLAD benchmark for evaluating future models for natural language analysis. Future directions include (1) devising hierarchical span representations that can handle spans of different lengths and diverse content more effectively and efficiently, and (2) robust multi-task learning or meta-learning algorithms that can reconcile very different tasks.
Acknowledgments
This work was supported by gifts from Bosch Research. We would like to thank Hiroaki Hayashi, Bohan Li, Pengcheng Yin, Hao Zhu, Paul Michel, and Antonios Anastasopoulos for their insightful comments and suggestions.
References
Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract meaning representation for sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse (LAW-ID@ACL 2013), pages 178–186.

Michele Banko, Michael J. Cafarella, Stephen Soderland, Matthew Broadhead, and Oren Etzioni. 2007. Open information extraction from the web. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI 2007), pages 2670–2676.

Ondrej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, Radu Soricut, Lucia Specia, and Ales Tamchyna. 2014. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation (WMT@ACL 2014), pages 12–58.

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT's attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286.

Kevin Clark, Minh-Thang Luong, Christopher D. Manning, and Quoc Le. 2018. Semi-supervised sequence modeling with cross-view training. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018), pages 1914–1925.

Michael Collins. 1997. Three generative, lexicalised models for statistical parsing. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (ACL 1997), pages 16–23.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel P. Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537.

Alexis Conneau and Douwe Kiela. 2018. SentEval: An evaluation toolkit for universal sentence representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

Lingjia Deng and Janyce Wiebe. 2015. MPQA 3.0: An entity/event-level sentiment corpus. In Proceedings of NAACL-HLT 2015, pages 1323–1328.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019 (Volume 1: Long and Short Papers), pages 4171–4186.

Kalpit Dixit and Yaser Al-Onaizan. 2019. Span-level model for relation extraction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019), pages 5308–5314.

Daniel Gildea and Daniel Jurafsky. 2002. Automatic labeling of semantic roles. Computational Linguistics, 28(3):245–288.

Jiang Guo, Wanxiang Che, Haifeng Wang, Ting Liu, and Jun Xu. 2016. A unified architecture for semantic role labeling and relation classification. In Proceedings of COLING 2016, pages 1264–1274.

Kazuma Hashimoto, Caiming Xiong, Yoshimasa Tsuruoka, and Richard Socher. 2017. A joint many-task model: Growing a neural network for multiple NLP tasks. In Proceedings of EMNLP 2017, pages 1923–1933.

Luheng He, Kenton Lee, Omer Levy, and Luke Zettlemoyer. 2018. Jointly predicting predicates and arguments in neural semantic role labeling. In Proceedings of ACL 2018 (Volume 2: Short Papers), pages 364–369.

Luheng He, Kenton Lee, Mike Lewis, and Luke Zettlemoyer. 2017. Deep semantic role labeling: What works and what's next. In Proceedings of ACL 2017 (Volume 1: Long Papers), pages 473–483.

Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian Padó, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. 2010. SemEval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In Proceedings of the 5th International Workshop on Semantic Evaluation (SemEval@ACL 2010), pages 33–38.

Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2019. SpanBERT: Improving pre-training by representing and predicting spans. CoRR, abs/1907.10529.

Jin-Dong Kim, Tomoko Ohta, Sampo Pyysalo, Yoshinobu Kano, and Jun'ichi Tsujii. 2009. Overview of BioNLP'09 shared task on event extraction. In Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing: Shared Task (BioNLP '09), pages 1–9.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015).

Nikita Kitaev, Steven Cao, and Dan Klein. 2019. Multilingual constituency parsing with self-attention and pre-training. In Proceedings of ACL 2019, pages 3499–3505.

Sandra Kübler, Ryan McDonald, and Joakim Nivre. 2009. Dependency parsing. Synthesis Lectures on Human Language Technologies, 1(1):1–127.

Chaitanya Kulkarni, Wei Xu, Alan Ritter, and Raghu Machiraju. 2018. An annotated corpus for machine reading of instructions in wet lab protocols. In Proceedings of NAACL-HLT 2018 (Volume 2: Short Papers), pages 97–106.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of NAACL-HLT 2016, pages 260–270.

Wuwei Lan and Wei Xu. 2018. Neural network models for paraphrase identification, semantic textual similarity, natural language inference, and question answering. In Proceedings of COLING 2018, pages 3890–3902.

Kenton Lee, Luheng He, Mike Lewis, and Luke Zettlemoyer. 2017. End-to-end neural coreference resolution. In Proceedings of EMNLP 2017, pages 188–197.

Jiwei Li, Will Monroe, and Dan Jurafsky. 2016. Understanding neural networks through representation erasure. CoRR, abs/1612.08220.

Xiaoya Li, Jingrong Feng, Yuxian Meng, Qinghong Han, Fei Wu, and Jiwei Li. 2019. A unified MRC framework for named entity recognition. CoRR, abs/1910.11476.

Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019. Multi-task deep neural networks for natural language understanding. In Proceedings of ACL 2019 (Volume 1: Long Papers), pages 4487–4496.

Ryan Lowe, Nissan Pow, Iulian Serban, and Joelle Pineau. 2015. The Ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. In Proceedings of SIGDIAL 2015, pages 285–294.

Yi Luan, Luheng He, Mari Ostendorf, and Hannaneh Hajishirzi. 2018. Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction. In Proceedings of EMNLP 2018, pages 3219–3232.

Yi Luan, Dave Wadden, Luheng He, Amy Shah, Mari Ostendorf, and Hannaneh Hajishirzi. 2019. A general framework for information extraction using dynamic span graphs. In Proceedings of NAACL-HLT 2019 (Volume 1: Long and Short Papers), pages 3036–3046.

Xuezhe Ma and Eduard H. Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of ACL 2016 (Volume 1: Long Papers).

Ana Marasović and Anette Frank. 2018. SRL4ORL: Improving opinion role labeling using multi-task learning with semantic role labeling. In Proceedings of NAACL-HLT 2018 (Volume 1: Long Papers), pages 583–594.

Mitchell P. Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger. 1994. The Penn Treebank: Annotating predicate argument structure. In Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994.

Ryan McDonald, Joakim Nivre, Yvonne Quirmbach-Brundage, Yoav Goldberg, Dipanjan Das, Kuzman Ganchev, Keith Hall, Slav Petrov, Hao Zhang, Oscar Täckström, Claudia Bedini, Núria Bertomeu Castelló, and Jungmee Lee. 2013. Universal dependency annotation for multilingual parsing. In Proceedings of ACL 2013 (Volume 2: Short Papers), pages 92–97.

Ramesh Nallapati, Bowen Zhou, Cícero Nogueira dos Santos, Çaglar Gülçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning (CoNLL 2016), pages 280–290.

Christina Niklaus, Matthias Cetto, André Freitas, and Siegfried Handschuh. 2018. A survey on open information extraction. In Proceedings of COLING 2018, pages 3866–3878.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP 2014, pages 1532–1543.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL-HLT 2018 (Volume 1: Long Papers), pages 2227–2237.

Maria Pontiki, Dimitris Galanis, John Pavlopoulos, Harris Papageorgiou, Ion Androutsopoulos, and Suresh Manandhar. 2014. SemEval-2014 task 4: Aspect based sentiment analysis. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval@COLING 2014), pages 27–35.

Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Hwee Tou Ng, Anders Björkelund, Olga Uryupina, Yuchen Zhang, and Zhi Zhong. 2013. Towards robust linguistic analysis using OntoNotes. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning (CoNLL 2013), pages 143–152.

Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang. 2012. CoNLL-2012 shared task: Modeling multilingual unrestricted coreference in OntoNotes. In Joint Conference on EMNLP and CoNLL: Proceedings of the Shared Task, pages 1–40.

Adwait Ratnaparkhi. 1996. A maximum entropy model for part-of-speech tagging. In Conference on Empirical Methods in Natural Language Processing.

Sebastian Ruder. 2017. An overview of multi-task learning in deep neural networks. CoRR, abs/1706.05098.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning (CoNLL 2003), pages 142–147.

Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958.

Gabriel Stanovsky and Ido Dagan. 2016. Creating a large benchmark for open information extraction. In Proceedings of EMNLP 2016, pages 2300–2305.

Gabriel Stanovsky, Julian Michael, Luke Zettlemoyer, and Ido Dagan. 2018. Supervised open information extraction. In Proceedings of NAACL-HLT 2018 (Volume 1: Long Papers), pages 885–895.

Pontus Stenetorp, Sampo Pyysalo, Goran Topić, Tomoko Ohta, Sophia Ananiadou, and Jun'ichi Tsujii. 2012. BRAT: A web-based tool for NLP-assisted text annotation. In Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2012), pages 102–107.

Mitchell Stern, Jacob Andreas, and Dan Klein. 2017. A minimal span-based neural constituency parser. In Proceedings of ACL 2017 (Volume 1: Long Papers), pages 818–827.

Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-informed self-attention for semantic role labeling. In Proceedings of EMNLP 2018, pages 5027–5038.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27 (NIPS 2014), pages 3104–3112.

Swabha Swayamdipta, Sam Thomson, Kenton Lee, Luke Zettlemoyer, Chris Dyer, and Noah A. Smith. 2018. Syntactic scaffolds for semantic structures. In Proceedings of EMNLP 2018, pages 3772–3782.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019a. BERT rediscovers the classical NLP pipeline. In Proceedings of ACL 2019 (Volume 1: Long Papers), pages 4593–4601.

Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R. Bowman, Dipanjan Das, and Ellie Pavlick. 2019b. What do you learn from context? Probing for sentence structure in contextualized word representations. In International Conference on Learning Representations (ICLR 2019).

Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 252–259.

Jesse Vig. 2019. A multiscale visualization of attention in the transformer model. In Proceedings of ACL 2019 (Volume 3: System Demonstrations), pages 37–42.

David Wadden, Ulme Wennberg, Yi Luan, and Hannaneh Hajishirzi. 2019. Entity, relation, and event extraction with contextualized span representations. In Proceedings of EMNLP-IJCNLP 2019, pages 5783–5788.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019a. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. CoRR, abs/1905.00537.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019b. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations (ICLR 2019).

Shanchan Wu and Yifan He. 2019. Enriching pre-trained language model with entity information for relation classification. CoRR, abs/1905.08284.

Hu Xu, Bing Liu, Lei Shu, and Philip Yu. 2019. BERT post-training for review reading comprehension and aspect-based sentiment analysis. In Proceedings of NAACL-HLT 2019 (Volume 1: Long and Short Papers), pages 2324–2335.

Bishan Yang and Claire Cardie. 2013. Joint inference for fine-grained opinion extraction. In Proceedings of ACL 2013 (Volume 1: Long Papers), pages 1640–1649.

Junlang Zhang and Hai Zhao. 2019. Span based open information extraction. CoRR, abs/1901.10879.
A Detailed Explanations of 10 Tasks

Span-oriented Tasks (Table 2a)
- Named Entity Recognition (Sang and Meulder, 2003): NER is traditionally considered a sequence labeling task. We model named entities as spans over one or more tokens.
- Constituency Parsing (Collins, 1997): Constituency parsing aims to produce a syntactic parse tree for each sentence. Each node in the tree is an individual span associated with a constituent label, and spans are nested.
- Part-of-speech Tagging (Ratnaparkhi, 1996; Toutanova et al., 2003): POS tagging is another sequence labeling task, where every single token is an individual span with a POS tag.
- Aspect-based Sentiment Analysis (Pontiki et al., 2014): ABSA is a task that consists of identifying certain spans as aspect terms and predicting their associated sentiments.

Relation-oriented Tasks (Table 2b)
- Relation Extraction (Hendrickx et al., 2010): RE concerns the relation between two entities.
- Coreference (Pradhan et al., 2012): Coreference resolution links named, nominal, and pronominal mentions that refer to the same concept, within or beyond a single sentence.
- Semantic Role Labeling (Gildea and Jurafsky, 2002): SRL aims to identify arguments of a predicate (verb or noun) and classify them with semantic roles in relation to the predicate.
- Open Information Extraction (Banko et al., 2007; Niklaus et al., 2018): In contrast to the fixed relation types in RE, OpenIE aims to extract open-domain predicates and their arguments (usually subjects and objects) from a sentence.
- Dependency Parsing (Kübler et al., 2009): Spans are single-word tokens, and a relation links a word to its syntactic parent with the corresponding dependency type.
- Opinion Role Labeling (Yang and Cardie, 2013): ORL detects spans that are opinion expressions, as well as holders and targets related to these opinions.
B Results of BERT Large Model
Table 7 shows the performance of single-task learning with different token representations. BERT-large achieves the best performance on most of the tasks.
SpanBERT base
BERT large
IE NER F CoNLL03 88.4 91.9 91.0 91.3 90.9WLP 77.6 79.2 78.1 77.9 78.3RE F SemEval10 50.7 61.8 61.7 62.1 64.7WLP 64.9 65.5 64.7 64.1 65.1Coref Avg F OntoNotes 56.3 62.2 66.3 70.0 -OpenIE F OIE2016 28.3 35.2 36.7 36.5 36.5SRL F OntoNotes 78.0 82.4 83.3 83.1 84.4Parsing Dep. LAS PTB 92.9 94.7 94.9 95.1 95.3OntoNotes 90.4 92.3 94.1 94.2 94.5Consti. Evalb F PTB 93.4 95.3 95.5 95.8 95.8OntoNotes 91.0 93.2 93.6 94.3 93.9Sentiment ABSA F SemEval14 63.5 69.2 70.8 70.0 73.8ORL F MPQA 3.0 38.2 42.9 44.5 45.2 47.1POS Accuracy PTB 96.8 97.7 97.6 97.6 97.4OntoNotes 97.0 98.2 97.7 98.3 97.9
Table 7: Single-task learning performance of the SpanRel model with different token representations. BERT large requires a large amount of memory so we cannot feed the entire document to the model in coreference resolution.
C Task-specific Hyperparameters

| | NER | RE | Coref. | OpenIE | POS | Dep. | Consti. | SRL | ABSA | ORL |
|---|---|---|---|---|---|---|---|---|---|---|
| max span length $l$ | 10 | 5 | 10 | 30 | 1 | 1 | - | 30 | 10 | 30 |
| pruning ratio $\tau$ | - | 5 | 0.4 | 0.8 | - | 1.0 | - | 1.0 | - | 0.3 |

Table 8: Task-specific hyperparameters. Span-oriented tasks do not need a pruning ratio.