ZJUKLAB at SemEval-2021 Task 4: Negative Augmentation with Language Model for Reading Comprehension of Abstract Meaning
Xin Xie∗, Xiangnan Chen∗, Xiang Chen∗, Yong Wang, Ningyu Zhang, Shumin Deng, Huajun Chen†
Zhejiang University; AZFT Joint Lab for Knowledge Engine; Microsoft
{xx2020, xnchen2020, xiang chen}@zju.edu.cn
{zhangningyu, 231sm, huajunsir}@zju.edu.cn
[email protected]

Abstract
This paper presents our systems for the three Subtasks of SemEval-2021 Task 4: Reading Comprehension of Abstract Meaning (ReCAM). We explain the algorithms used to learn our models and the process of tuning the algorithms and selecting the best model. Inspired by the similarity between the ReCAM task and language model pre-training, we propose a simple yet effective technique, namely, negative augmentation with language model. Evaluation results demonstrate the effectiveness of our proposed approach. Our models rank 4th on the official test sets of both Subtask 1 and Subtask 2, with an accuracy of 87.9% and 92.8%, respectively. We further conduct a comprehensive model analysis and observe interesting error cases, which may promote future research.

1 Introduction

Past decades have witnessed huge progress in representation learning in Natural Language Processing (NLP). With pre-trained language models, machine reading comprehension (MRC) models can extract answers from given documents and even outperform humans on benchmark datasets such as SQuAD (Rajpurkar et al., 2016). However, these successes sometimes lead to hype in which the models are described as "understanding" language or capturing "meaning" (Bender and Koller, 2020). Note that the intention of MRC is to let systems read a text like human beings: extracting textual information and understanding the meaning of the text before answering questions. This means the systems should not only summarize the semantics of the text but also comprehend abstract concepts under the constraint of general knowledge about the world (Wang and Jiang, 2016). Nevertheless, few works or benchmarks focus on this direction.

∗ Equal contribution and shared co-first authorship. Our implementation is publicly available at https://github.com/zjunlp/SemEval2021Task4
SemEval-2021 Task 4 (Zheng et al., 2021) is an MRC task that focuses on evaluating a model's ability to understand abstract words. The Reading Comprehension of Abstract Meaning (ReCAM) task is divided into three Subtasks: Subtask 1, ReCAM-Imperceptibility; Subtask 2, ReCAM-Nonspecificity; and Subtask 3, ReCAM-Intersection. Unlike previous MRC datasets such as CNN/Daily Mail (Hermann et al., 2015), SQuAD (Rajpurkar et al., 2018), and CoQA (Reddy et al., 2019), which ask computers to predict concrete concepts, e.g., named entities, this task challenges a model's ability to fill in abstract words removed from human-written summaries based on the English context.

Note that this task's input format is similar to the MLM pre-training task of BERT (Devlin et al., 2019), which aims to predict masked tokens. Pre-trained language models (PLMs) such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), ALBERT (Lan et al., 2020), and DeBERTa (He et al., 2021) have achieved success on MRC tasks. Inspired by this, we introduce a simple yet effective method, namely, Negative Augmentation with Language model (NAL), in
SemEval-2021 Task 4. Specifically, we augment the answer distribution with an additional negative candidate drawn from the masked language model's predictions. Previous work (Petroni et al., 2019; Zhou et al., 2020) indicates that pre-trained language models have already captured much world knowledge. Thus, we argue that this knowledge can help guide model training and identify ambiguous abstract meanings. Further, we introduce other techniques such as label smoothing and domain-adaptive pre-training in our system. We describe the detailed approaches used for the Subtasks in Section 3.

We conduct comprehensive experiments in Section 3, and our systems rank 4th for Subtask 1: ReCAM-Imperceptibility and 4th for Subtask 2: ReCAM-Nonspecificity on the leaderboard. In our experiments, we observe that PLMs without fine-tuning can easily reach over 60% accuracy on both Subtask 1 and Subtask 2, demonstrating that pre-trained language models already capture some abstract meanings. We further find that our negative augmentation with language model improves performance on both Subtask 1 and Subtask 2. Finally, we conduct error analysis to promote future research.
2 Related Work

Machine reading comprehension (MRC), a challenging task, has received increasing attention recently. According to the type of answer, reading comprehension tasks can be divided into four categories (Chen, 2018):

1) Cloze-style: The question contains an "@placeholder," and the system must choose a word or entity from a set of candidate answers to fill in the "@placeholder" and make the sentence complete.
2) Multiple choice: The system must choose a suitable answer from K given candidates. The answer can be one word or a sentence.
3) Span prediction: This kind of task, also called extractive question answering, requires the system to extract a suitable span of text from a given passage as the answer to the question.
4) Free-form answer: The answer may be any type of text, which requires mining deep contextual semantic information from a given question and a collection of candidate documents, and even combining multiple articles to give the best answer.
SemEval-2021 Task 4 requires the system to have strong reading comprehension ability, not only because the task is in the cloze-style format mentioned above, but also because the answers are abstract words. There are two definitions of abstract words: imperceptibility and nonspecificity. Concrete words refer to things, events, and properties that we can perceive directly with our senses (Spreen and Schulz, 1966; Turney et al., 2011). Compared to concrete words like "trees" and "red," abstract words under the imperceptibility definition are created by humans rather than pointing to things in the natural world. For example, as shown in Table 1, "wants" and "achieved" describe a person's attitude towards something and a person's accomplishment of something. Meanwhile, abstract words under the nonspecificity definition can be described as hypernyms: by determining whether one word can generalize another, we can build dictionaries of different levels, and the words at higher levels are the nonspecificity words. Compared to concrete concepts like groundhog and whale, hypernyms such as vertebrate are regarded as more abstract (Changizi, 2008).

P: Briton Davies won F42 shot put gold with a Games record at Rio 2016, but was unable to defend his 2012 discus title as it did not feature in Brazil. "I don't normally say what I'm going for," said the Welshman, 25. "But this time I'm definitely going for the two golds in both disciplines and nothing will be better than being in front of a home crowd." ...
Q: Paralympic champion Aled Sion Davies @placeholder two gold medals at the 2017 World Para Athletics Championships in London.
A: (A) suffered (B) promoted (C) remains (D) wants (E) achieved

P: ... Low vitamin D levels can lead to brittle bones and rickets in children. The figures from the HSCNI show a dramatic rise in Vitamin D prescriptions over the last 10 years. The data does not include Vitamin D bought over the counter...
Q: Rickets does not have the ring of a 21st Century problem - it sounds more like the @placeholder of a bygone era.
A: (A) horror (B) size (C) fate (D) tale (E) death

Table 1: Examples from SemEval-2021 Task 4. Given a passage and a question, the model needs to pick the best option from the five candidates to replace @placeholder.

The difference between Subtask 1 and Subtask 2 lies in the definition of abstract words, so the inputs of the two Subtasks are the same. As shown in Table 1, the input can be represented as a triple <P, Q, A>, where P = s_1, s_2, ..., s_m is a passage from CNN/Daily Mail (Hermann et al., 2015), Q is a human-written summary of the passage with one abstract word replaced by "@placeholder", and A is a set of candidate abstract words for filling in the "@placeholder" in the question.

3 System Description

Recently, large pre-trained language models (PLMs) such as GPT (Radford et al., 2018), BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), ALBERT (Lan et al., 2020), and DeBERTa (He et al., 2021) have swept the NLP community (Zhang et al., 2020c). The powerful semantic feature extraction capabilities of PLMs mean that, for downstream tasks, we only need to make better use of the BERT-like model itself instead of adding different layers to it.

Similar to a standard multiple-choice task, we have five candidates, one passage, and one question per sample. We leverage PLMs as encoders to capture a global context representation of the passage, question, and answer. Then a decoder is used to determine the score of each <P, Q, A> pair.
Since we have n candidate answers A_1, ..., A_n, for every passage we construct n input samples [Q−A_i; P], the concatenation of Q−A_i and P. Because the question is a summary with an abstract word removed, we construct Q−A_i by replacing "@placeholder" with an option from the candidate set instead of concatenating Q and A. After encoding all n inputs for a single passage, we obtain global representations T_i for the different options in the candidate set: when fine-tuning PLMs, the first special token [CLS] represents the global meaning of the whole input. We use a dense decoder layer f to compute a score for each T_i as follows:

T_i = PLM([Q−A_i; P])  (1)

score_i = exp(f(T_i)) / Σ_{i'} exp(f(T_{i'}))  (2)

where [Q−A_i; P] is the input constructed according to the conventions of PLMs and MRC tasks, and T_i is the final hidden state of the first token [CLS]. The candidate answer with the highest score is identified as the final prediction.

Figure 1: The procedure of Negative Augmentation with Language Model (NAL).

Previous research (Gao et al., 2020; Yang et al., 2019) demonstrates that there is a gap between language model pre-training and fine-tuning on downstream tasks. Given this, and inspired by the task definition's similarity to MLM, we introduce the negative augmentation with language model mechanism (Section 3.2). Note that the additional label enhances the discriminability of the abstract meanings in a contrastive manner. In other words, the model is encouraged
NOT to generate the abstract tokens predicted by the language model, but rather the gold candidates from the given documents. We further introduce label smoothing (Section 3.3), which enhances model performance. Finally, we leverage task-adaptive pre-training (Section 3.4), inspired by Gururangan et al. (2020), to obtain better performance.
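As a concrete illustration of the input construction and scoring in Eqs. (1) and (2), the sketch below builds one [Q−A_i; P] string per candidate and normalizes per-candidate logits with a softmax. The PLM encoder and dense decoder are not reproduced here: the logits are made-up stand-ins for f(PLM(...)[CLS]), so only the mechanics, not the numbers, reflect the actual system.

```python
import math

def build_inputs(question, passage, candidates):
    """Build one [Q-A_i; P] input per candidate: "@placeholder" in the
    summary is replaced by the candidate word, then the passage follows."""
    return [question.replace("@placeholder", c) + " [SEP] " + passage
            for c in candidates]

def softmax(logits):
    """Eq. (2): normalize the per-candidate logits f(T_i) into scores."""
    m = max(logits)                       # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

candidates = ["suffered", "promoted", "remains", "wants", "achieved"]
inputs = build_inputs(
    "Aled Sion Davies @placeholder two gold medals at the 2017 Championships.",
    "Briton Davies won F42 shot put gold at Rio 2016 ...",
    candidates)
logits = [0.2, -1.1, -0.4, 1.3, 2.5]      # stand-ins for f(PLM(...)[CLS])
scores = softmax(logits)
prediction = candidates[scores.index(max(scores))]
```

In the real system, each logit would come from the dense layer applied to the [CLS] representation of the corresponding input.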
3.2 Negative Augmentation with Language Model

Inspired by the shared format of MLM and this task, we first conduct a toy experiment to test whether a PLM can get the right answer without any supervised signal. First, we replace the "@placeholder" with [MASK] to reconstruct the input and ask a BERT model with an MLM head to predict the token at the [MASK] position. We then calculate the similarity between the predicted word and each option in the candidate set, and take the option with the highest similarity score as the model's choice. We find that the BERT model, without any fine-tuning, achieves over 60% accuracy on both Subtask 1 and Subtask 2. This shows that PLMs are able to predict abstract words, and that the predicted words can be leveraged as negative candidates during fine-tuning.

Note that huge language models have large numbers of parameters, so PLMs are able to store much knowledge through their pre-training tasks. However, [MASK] is not used when fine-tuning the model for downstream tasks, and how to use the knowledge stored during pre-training more explicitly on downstream tasks has become a hot topic of current research.

Figure 2: System overview (best viewed in color). The top of the figure refers to normal fine-tuning of multiple-choice models, which ignores the form of the pre-training tasks, while the bottom refers to our system with Negative Augmentation with Language Model (NAL), which uses the abstract words predicted by the original PLM as negative candidates to augment fine-tuning.

Motivated by this, we try to bridge the gap between pre-training and downstream tasks. Inspired by contrastive learning (Chen et al., 2020; Robinson et al., 2020), where stronger negative samples help the model learn better, we introduce our negative augmentation with language model method. Specifically, we let the PLM predict the "@placeholder" replaced with the [MASK] token to generate negative candidates.
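The augmentation step can be sketched as follows. A masked-LM head would propose fillers for the [MASK] position; here its top-k output is stubbed with fixed words, since the point is the filtering and target construction, not the PLM call. The first proposal that is neither an existing option nor the gold answer is appended as an extra negative, and the augmented option set is paired with the label-smoothed targets of Eq. (6) in Section 3.3. The value ε = 0.1 is a placeholder, as the paper's exact setting is not recoverable from the text.

```python
def mask_placeholder(question):
    """Rewrite the summary so an MLM head can fill in the blank."""
    return question.replace("@placeholder", "[MASK]")

def augment_with_negative(candidates, gold, mlm_topk):
    """Append the first MLM prediction that is a genuinely new, wrong word."""
    for word in mlm_topk:
        if word != gold and word not in candidates:
            return candidates + [word]
    return candidates

def smoothed_targets(candidates, gold, eps=0.1):
    """Eq. (6): soft labels over the (augmented) option set --
    1 - eps for the gold option, eps / (K - 1) for every other option."""
    k = len(candidates)
    return [1 - eps if c == gold else eps / (k - 1) for c in candidates]

question = "The Northern Lights were spotted across @placeholder of England."
options = ["millions", "parts", "half", "isle", "remains"]
stub_topk = ["all", "half", "parts"]      # what an MLM head might predict
augmented = augment_with_negative(options, "parts", stub_topk)
targets = smoothed_targets(augmented, "parts", eps=0.1)
```

Note how "half" and "parts" are skipped (already an option / the gold answer), so only "all" is added as the extra negative.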
Thus, we can leverage those negative words, which may mislead the models, to help train the models. Formally, we have:

P = p(m_i | θ, [Q−A; P]),  m_i ∈ [1, 2, ..., |V|]  (3)

where P is the distribution over words predicted by the model, m_i is a token in the vocabulary, and |V| is the size of the vocabulary. We use this distribution to obtain the most confusing words to augment our models, as described in Figure 2. Due to GPU limitations, we add the single most probable word to augment our models.

3.3 Label Smoothing

Label smoothing is a well-known "trick" to improve model performance effectively. It encourages the activations of the penultimate layer to be close to the template of the correct class and equally distant from the templates of the incorrect classes (Müller et al., 2019). With more options than the original dataset, produced by the approach in Section 3.2, label smoothing magnifies our method's effect during fine-tuning. Suppose the output of the final layer and softmax layer is:

p_k = exp(x^T w_k) / Σ_{l=1}^{K} exp(x^T w_l)  (4)

where p_k is the likelihood the model assigns to the k-th class, w_k represents the weights and biases of the last layer, and x is the vector of activations of the penultimate layer, concatenated with "1" to account for the bias. The cross-entropy loss is:

L = − Σ_{k=1}^{K} y_k log(p_k)  (5)

The cross-entropy loss without label smoothing only considers whether the positive example is correct and pays no attention to the relationships among the negative examples. We therefore soften y as follows:

y_i = 1 − ε if i is the right answer, and y_i = ε / (K − 1) for each wrong answer  (6)

We set ε to a small constant in our models.

3.4 Task-adaptive Pre-training

The BERT-like model is pre-trained on general-domain corpora such as Wikipedia. Since the passages
mainly come from CNN/Daily Mail, the data distribution may be quite different from the pre-training data. Therefore, we task-adaptively pre-train BERT with the masked language model and next sentence prediction tasks on the domain-specific data. Task-adaptive pre-training not only makes the model fit the in-domain distribution better, but also helps the model predict good negative words to enhance the original dataset, as described in Section 3.2. We take two approaches to task-adaptive pre-training:

1) In-domain pre-training: we use the source data, CNN/Daily Mail, for task-adaptive pre-training of our base models (Sun et al., 2020).
2) Within-task pre-training: we replace the "@placeholder" with the correct answer and use the same input format as in the fine-tuning steps, i.e., [Q−A; P] (Gururangan et al., 2020).

4 Experiments

The overall statistics of the training/trial/development/test splits for Subtask 1 and Subtask 2 can be found in Table 2.

Table 2: Statistics of the SemEval-2021 Task 4 dataset.

For data preprocessing, we use byte-level BPE encoding (Sennrich et al., 2016); the official vocabulary contains more than fifty thousand byte-level tokens. All tokens are stored in
MERGES.TXT, while
VOCAB.JSON is a byte-to-index mapping. Generally speaking, the higher a token's frequency, the smaller its byte index. Since the passages in Subtask 1 and Subtask 2 are long on average, we split the long paragraphs: we limit the maximum number of tokens in an input sample [Q−A; P] to 256 for our system. Statistically, about 60% of the paragraphs exceed this limit (including special tokens such as [CLS] and [SEP]). For these input samples, we divide them into new input samples of at most 256 tokens; to be more specific, we divide the passage into different inputs that share the same question and answer.

Our system is implemented with PyTorch (Paszke et al., 2019), and we use the PyTorch versions of the pre-trained language models from https://github.com/huggingface/transformers (version 3.3.0). We employ RoBERTa, ALBERT, and DeBERTa large models as our PLM encoders and fine-tune them with the AdamW optimizer (Loshchilov and Hutter, 2018). We set the batch size to 1 and a model-specific maximum input length for each encoder. Because batch size usually has a significant influence on BERT-like models, and given the limits of GPU memory, we use gradient accumulation during training, so the effective batch size is 32. We pick the best learning rate on the dev set, fine-tuning RoBERTa, ALBERT, and DeBERTa with separate learning rates, and set the number of epochs to 8 for ALBERT and 12 for RoBERTa and DeBERTa. We save the best model on the validation set for testing during training. Because the formats of Subtask 1 and Subtask 2 are the same, we use the same batch size and maximum input length for both.

On Subtask 1, the ReCAM-Imperceptibility task, the evaluation results are illustrated in Table 3. We use three baseline models: RoBERTa-Large, DeBERTa-Large, and ALBERT-xxLarge. RoBERTa-Large + NAL, DeBERTa-Large + NAL, and ALBERT-xxLarge + NAL denote the language models trained with our proposed negative augmentation with language model. Ensemble refers to the ensemble of the three models above with all strategies. We find that ALBERT achieves better performance on Subtask 1 but fails to perform well on Subtask 2, while DeBERTa and RoBERTa perform better on Subtask 2.
Model                   Dev    Test
Baseline
RoBERTa-Large            -      -
ALBERT-xxLarge           -      -
DeBERTa-Large            -      -
Ours
RoBERTa-Large + NAL     85.9   86.1
ALBERT-xxLarge + NAL    86.2   85.6
DeBERTa-Large + NAL     86.7   86.8
Ensemble                 -      -

Table 3: Results (Accuracy) on Subtask 1.
Model                   Dev    Test
Baseline
RoBERTa-Large            -      -
ALBERT-xxLarge           -      -
DeBERTa-Large            -      -
Ours
RoBERTa-Large + NAL     91.1   89.7
ALBERT-xxLarge + NAL    89.3   88.6
DeBERTa-Large + NAL     91.3   90.3
Ensemble                 -      -

Table 4: Results (Accuracy) on Subtask 2.
Compared with the original RoBERTa, DeBERTa, and ALBERT models, each model is improved substantially in accuracy by NAL. We further observe that DeBERTa and RoBERTa, which share the same architecture, obtain better performance than ALBERT on the dev and test sets. We think a possible reason is that ALBERT uses cross-layer weight sharing, which may reduce the model's generalization ability in reading comprehension, especially for abstract word meanings. Finally, the ensemble of the best RoBERTa, DeBERTa, and ALBERT models leads to a further significant improvement in accuracy compared with the baselines, and this is our final submission to the leaderboard.
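The paper does not spell out how the three models are combined; a common choice, sketched here purely as an assumption, is to average the per-candidate probabilities of the individual models and take the argmax:

```python
def ensemble_scores(per_model_scores):
    """Average candidate probabilities across models (assumed strategy)."""
    n_models = len(per_model_scores)
    n_cands = len(per_model_scores[0])
    return [sum(m[i] for m in per_model_scores) / n_models
            for i in range(n_cands)]

# Hypothetical per-candidate probabilities from RoBERTa, ALBERT, DeBERTa.
scores = [
    [0.10, 0.55, 0.15, 0.10, 0.10],
    [0.05, 0.40, 0.35, 0.10, 0.10],
    [0.10, 0.60, 0.10, 0.10, 0.10],
]
avg = ensemble_scores(scores)
best = avg.index(max(avg))   # index of the ensemble's chosen candidate
```

Other schemes, such as majority voting over the three predictions, would behave similarly when the models agree.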
On Subtask 2, the ReCAM-Nonspecificity task, the experimental results are shown in Table 4. Similar to Subtask 1, we choose RoBERTa, DeBERTa, and ALBERT as our baseline models. RoBERTa-Large + NAL, ALBERT-xxLarge + NAL, and DeBERTa-Large + NAL are the models with negative augmentation with language model, and Ensemble refers to the ensemble of RoBERTa, DeBERTa, and ALBERT with all strategies. We notice that our proposed mechanism brings a significant average improvement in accuracy over the baselines, demonstrating the effectiveness of our strategies: negative augmentation with language model, label smoothing, and task-adaptive pre-training. The ensemble of the three enhanced models (RoBERTa-Large + NAL, ALBERT-xxLarge + NAL, and DeBERTa-Large + NAL) obtains the best test-set accuracy, which is also our final submission to the leaderboard.

Subtask 3 focuses on the model's transferability. During the evaluation period, we use the data of Subtask 2 to evaluate the models trained on Subtask 1, and vice versa. The model trained on Subtask 1 and evaluated on Subtask 2 obtains 82% accuracy on the dev set.

During the experiments for all tasks, we also tried different decoders, such as MLPs and other network architectures. Eventually, we found that they do not improve the system's performance. An explanation is that the pre-trained language models have already captured the global contextual sentence meaning at the [CLS] token.
5 Analysis

During our experiments, we conduct case studies to figure out how our NAL method helps the model boost performance. From Table 5, we notice that the original PLM considers "all" and "half" as its choices instead of "parts". Even after fine-tuning on the downstream task, the baseline model still chooses "half". With our NAL method, we add such misleading negative words during training to help the model correct the knowledge learned from the pre-training task.
In usual MRC tasks, the length of the passage is a key factor in how well models solve the problems. We conduct experiments to analyze performance with respect to passage length. Contrary to the common assumption, from Figure 3 and Figure 4 we observe that instances with long passages obtain better performance. We think that understanding abstract meaning may require comprehensive context information from long passages, and we will conduct further analysis in future work.

Question: The Aurora Borealis, better known as the Northern Lights, was spotted across @placeholder of England on Sunday.
Answer set: (A) millions, (B) parts, (C) half, (D) isle, (E) remains
NAL set: {all, half, parts}
Baseline: (C) half
Model with NAL: (B) parts

Question: The BBC is providing live coverage of the Scottish National Party conference in Glasgow. This live @placeholder has finished.
Answer set: (A) results, (B) recording, (C) event, (D) action, (E) center
NAL set: {blog, recording, stream}
Baseline: (B) recording
Model with NAL: (C) event

Table 5: The negative options clearly help the model better understand the abstract meaning in the passage and question. The correct answers are (B) parts and (C) event.
Figure 3: Results (Accuracy) on Subtask 1 with respect to passage length, for RoBERTa-large, ALBERT-xxlarge, and DeBERTa-large.

Figure 4: Results (Accuracy) on Subtask 2 with respect to passage length, for RoBERTa-large, ALBERT-xxlarge, and DeBERTa-large.
We select four types of error cases to promote further research. We classify the examples according to the main cause of the error (pre-training, fine-tuning, and so on), which helps us better understand what the model learns from pre-training and fine-tuning.
Case 1 - Influenced by the original pre-training task

• Passage: "...found the United States to have the highest number of sleep deprived students, with 73% of 9 and 10 year olds and 80% of 13 and 14 year olds identified by their teachers as being adversely affected. The BBC's Jane O'Brien reports."
• Question: Sleep deprivation is a significant hidden factor in lowering the @placeholder of school pupils, according to researchers carrying out international education tests.
• Answer: (A) morale (B) IQ (C) mortality (D) closure (E) achievement
• Negative augmented choice: (F) intelligence
• Right option: (E) achievement
• Wrong option: (B) IQ
• Potential causes: After pre-training on a large general-domain corpus, PLMs have a strong bias when predicting the [MASK] token, like the "IQ" the model predicts for "@placeholder". Even after fine-tuning, our models still cannot recognize the strong evidence "being adversely affected". In daily life, we would not say that being adversely affected by lack of sleep lowers one's IQ; we usually say that lack of sleep may lower one's future achievement.
• How to help models? To prevent the model from relying too much on its pre-training tasks, we create more negative samples to help the model understand what is wrong or right about the abstract words.
Case 2 - Adversely affected by fine-tuning

• Passage: "17 May 2017 Last updated at 12:44 BST Adrien Gulfo, wearing red, who plays for the Swiss side Pully Football, tried to clear the ball away from his goal with a spectacular bicycle kick. Unfortunately for him it all went very wrong - watch the video... There was a happy ending to the story for Gulfo though, Pully went through to the cup final on penalties after the match finished 3-3."
• Question: You won't believe this own goal that was @placeholder in the Swiss lower league!
• Answer: (A) scored (B) born (C) eliminated (D) closed (E) beaten
• Negative augmented choice: (F) played (the model predicts "scored," but since that is the right answer, we choose another prediction, "played," as the augmented choice)
• Right option: (A) scored
• Wrong option: (E) beaten
• Potential causes: It is quite strange that the original PLM can predict the right answer but fails after fine-tuning. We suppose that during fine-tuning, the inconsistency of abstract-word prediction and interference from other words cause the model's performance to decrease in some cases.
• How to help models? We could use our NAL approach to increase the weight of the knowledge learned in the pre-training task, or leverage external knowledge (Zhang et al., 2019, 2020b; Yu et al., 2020; Zhang et al., 2020a).
Case 3 - Obscure abstract word meaning

• Passage: "...Mr Habgood said: "We're pretty sure it will be popular because it was when East Street was closed for other reasons and we want to make it a friendlier place to be. It does fit with our larger objectives to improve the town and make it safer for cyclists and pedestrians."..."
• Question: Three busy town center streets are to be pedestrianised in a bid to improve @placeholder for shoppers and cyclists.
• Answer: (A) opportunities (B) services (C) quality (D) disruption (E) safety
• Negative augmented choice: (F) access
• Right option: (E) safety
• Wrong option: (B) services
• Potential causes: Due to the limit of GPU memory, we cannot feed the whole long passage into the model at once. During training, the model can only see a small chunk of the passage, so it cannot obtain a global representation of the passage.
• How to help models? We chunk long passages with a sliding-window approach to help the model understand the whole passage.
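The sliding-window splitting mentioned above can be sketched as follows: a long token sequence is cut into overlapping windows of at most a fixed length, and every window is paired with the same question and answer set. The stride of 128 is an illustrative choice, not a value from the paper; the 256-token cap matches the system's stated input limit.

```python
def sliding_windows(tokens, max_len=256, stride=128):
    """Split a token list into overlapping chunks of at most max_len."""
    if len(tokens) <= max_len:
        return [tokens]
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break                      # last window reaches the end
        start += stride
    return chunks

def expand_example(question, passage_tokens, answers, max_len=256):
    """Each chunk becomes a new input with the same question and answers."""
    return [(question, chunk, answers)
            for chunk in sliding_windows(passage_tokens, max_len)]

# A 600-token passage becomes four overlapping inputs.
tokens = [f"tok{i}" for i in range(600)]
examples = expand_example("Q with @placeholder", tokens, ["a", "b"],
                          max_len=256)
```

The overlap lets each abstract word be seen with context on both sides, at the cost of some duplicated computation per passage.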
Case 4 - Hypernyms are not always right

• Passage: "North Wales Fire and Rescue Service was called to Express Linen Services on Vale Road in Llandudno Junction just before 19:30 GMT on Thursday. North Wales Police said a man was treated at the scene for smoke inhalation. Police have asked people to avoid the area..."
• Question: A number of @placeholder have been evacuated as firefighters tackle a blaze at a commercial laundry firm's premises in Conwy county.
• Answer: (A) families (B) properties (C) water (D) disruption (E) vehicles
• Negative augmented choice: (F) homes
• Right option: (B) properties
• Wrong option: (A) families
• Potential causes: Hypernymy is the main focus of Subtask 2; the model may consider "families" as a hypernym of the "people" occurring in the passage and choose "(A) families" instead of the right answer, "(B) properties".
• How to help models? We try to use the proposed NAL to add more abstract words learned during pre-training to mitigate this issue.
6 Conclusion

This paper presents our system design for SemEval-2021 Task 4. We propose a simple yet effective method called negative augmentation with language model. Comprehensive experiments demonstrate the effectiveness of our proposed approach. We also conduct case studies and investigate why the model fails to obtain the correct prediction. Note that language models are pre-trained on huge corpora; recently, researchers have identified bias in language models, which may mislead model predictions. Our proposed negative augmentation with language model helps the model better discriminate between candidates during fine-tuning, thus boosting performance. From another perspective, as described in Section 3.2, a language model without any fine-tuning achieves over 60% accuracy on both Subtask 1 and Subtask 2. This indicates that bias exists in the datasets (part of the abstract meaning can be obtained from the language model alone). Stronger benchmarks should be constructed in the future.
References
Emily M. Bender and Alexander Koller. 2020. Climbing towards NLU: On meaning, form, and understanding in the age of data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5185–5198, Online. Association for Computational Linguistics.

Mark A. Changizi. 2008. Economically organized hierarchies in WordNet and the Oxford English Dictionary. Cognitive Systems Research, 9(3):214–228.

Danqi Chen. 2018. Neural Reading Comprehension and Beyond. Ph.D. thesis, Stanford University.

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. 2020. A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 1597–1607. PMLR.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Tianyu Gao, Adam Fisch, and Danqi Chen. 2020. Making pre-trained language models better few-shot learners. CoRR, abs/2012.15723.

Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don't stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342–8360, Online. Association for Computational Linguistics.

Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2021. DeBERTa: Decoding-enhanced BERT with disentangled attention. arXiv:2006.03654 [cs].

Karl Moritz Hermann, Tomás Kociský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 1693–1701.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations (ICLR). OpenReview.net.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Ilya Loshchilov and Frank Hutter. 2018. Fixing weight decay regularization in Adam.

Rafael Müller, Simon Kornblith, and Geoffrey E. Hinton. 2019. When does label smoothing help? In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 4696–4705.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 8024–8035.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473, Hong Kong, China. Association for Computational Linguistics.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, Melbourne, Australia. Association for Computational Linguistics.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.

Siva Reddy, Danqi Chen, and Christopher D. Manning. 2019. CoQA: A conversational question answering challenge.
Trans. Assoc. Comput. Linguistics ,7:249–266.Joshua Robinson, Ching-Yao Chuang, Suvrit Sra, andStefanie Jegelka. 2020. Contrastive learning withhard negative samples.
CoRR , abs/2010.04592.Rico Sennrich, Barry Haddow, and Alexandra Birch.2016. Neural machine translation of rare wordswith subword units. In
Proceedings of the 54th An-nual Meeting of the Association for ComputationalLinguistics (Volume 1: Long Papers) , pages 1715–1725, Berlin, Germany. Association for Computa-tional Linguistics.Otfried Spreen and Rudolph W. Schulz. 1966. Param-eters of abstraction, meaningfulness, and pronuncia-bility for 329 nouns.
Journal of Verbal Learning andVerbal Behavior , 5(5):459–468.Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang.2020. How to Fine-Tune BERT for Text Classifica-tion? arXiv:1905.05583 [cs] . ArXiv: 1905.05583.Peter Turney, Yair Neuman, Dan Assaf, and Yohai Co-hen. 2011. Literal and metaphorical sense identifica-tion through concrete and abstract context. In
Pro-ceedings of the 2011 Conference on Empirical Meth-ods in Natural Language Processing , pages 680–690, Edinburgh, Scotland, UK. Association for Com-putational Linguistics.Shuohang Wang and Jing Jiang. 2016. Machine com-prehension using match-lstm and answer pointer.
CoRR , abs/1608.07905.Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Car-bonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019.Xlnet: Generalized autoregressive pretraining forlanguage understanding. In
Advances in NeuralInformation Processing Systems 32: Annual Con-ference on Neural Information Processing Systems2019, NeurIPS 2019, December 8-14, 2019, Vancou-ver, BC, Canada , pages 5754–5764.Haiyang Yu, Ningyu Zhang, Shumin Deng, HongbinYe, Wei Zhang, and Huajun Chen. 2020. Bridgingtext and knowledge with multi-prototype embedding for few-shot relational triple extraction. In
Proceed-ings of the 28th International Conference on Com-putational Linguistics, COLING 2020, Barcelona,Spain (Online), December 8-13, 2020 , pages 6399–6410. International Committee on ComputationalLinguistics.Ningyu Zhang, Shumin Deng, Juan Li, Xi Chen, WeiZhang, and Huajun Chen. 2020a. Summarizing chi-nese medical answer with graph convolution net-works and question-focused dual attention. In
Pro-ceedings of the 2020 Conference on Empirical Meth-ods in Natural Language Processing: Findings,EMNLP 2020, Online Event, 16-20 November 2020 ,pages 15–24. Association for Computational Lin-guistics.Ningyu Zhang, Shumin Deng, Zhanlin Sun, JiaoyanChen, Wei Zhang, and Huajun Chen. 2020b. Rela-tion adversarial network for low resource knowledgegraph completion. In
WWW ’20: The Web Confer-ence 2020, Taipei, Taiwan, April 20-24, 2020 , pages1–12. ACM / IW3C2.Ningyu Zhang, Shumin Deng, Zhanlin Sun, Guany-ing Wang, Xi Chen, Wei Zhang, and Huajun Chen.2019. Long-tail relation extraction via knowledgegraph embeddings and graph convolution networks.In
Proceedings of the 2019 Conference of the NorthAmerican Chapter of the Association for Compu-tational Linguistics: Human Language Technolo-gies, Volume 1 (Long and Short Papers) , pages3016–3025, Minneapolis, Minnesota. Associationfor Computational Linguistics.Ningyu Zhang, Qianghuai Jia, Kangping Yin, LiangDong, Feng Gao, and Nengwei Hua. 2020c. Concep-tualized representation learning for chinese biomed-ical text mining. arXiv preprint arXiv:2008.10813 .Boyuan Zheng, Xiaoyu Yang, Yuping Ruan, QuanLiu, Zhen-Hua Ling, Si Wei, and Xiaodan Zhu.2021. SemEval-2021 task 4: Reading comprehen-sion of abstract meaning. In
Proceedings of the15th International Workshop on Semantic Evalua-tion (SemEval-2021) .Xuhui Zhou, Yue Zhang, Leyang Cui, and DandanHuang. 2020. Evaluating commonsense in pre-trained language models. In