IIE-NLP-Eyas at SemEval-2021 Task 4: Enhancing PLM for ReCAM with Special Tokens, Re-Ranking, Siamese Encoders and Back Translation
Yuqiang Xie, Luxi Xing, Wei Peng, Yue Hu∗
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
{xieyuqiang,xingluxi,pengwei,huyue}@iie.ac.cn
∗ Corresponding author.

Abstract
This paper introduces our systems for all three subtasks of SemEval-2021 Task 4: Reading Comprehension of Abstract Meaning. To help our model better represent and understand abstract concepts in natural language, we design several simple and effective approaches on top of the backbone model (RoBERTa). Specifically, we formalize the subtasks into the multiple-choice question answering format and add special tokens to the abstract concepts; the final prediction of the question answering model is then taken as the result of each subtask. Additionally, we employ several fine-tuning tricks to improve performance. Experimental results show that our approaches achieve significant improvements over the baseline systems. Our approaches rank eighth on subtask-1 and tenth on subtask-2.
Introduction

The computer's ability to understand, represent, and express abstract meaning is a fundamental problem on the way to true natural language understanding. SemEval-2021 Task 4: Reading Comprehension of Abstract Meaning (ReCAM) provides a well-formed benchmark that aims to study the machine's ability to represent and understand abstract concepts (Zheng et al., 2021).

The Reading Comprehension of Abstract Meaning (ReCAM) task is divided into three subtasks: Imperceptibility, Nonspecificility, and Interaction. Please refer to the task description paper (Zheng et al., 2021) for more details. To address these challenges in ReCAM, we first formalize all subtasks as a type of multiple-choice Question Answering (QA) task, following Xing et al. (2020). Recently, large Pre-trained Language Models (PLMs), such as
GPT-2 (Radford et al., 2019), BERT (Devlin et al., 2019), and RoBERTa (Liu et al., 2019), have demonstrated excellent ability on various natural language understanding tasks (Wang et al., 2018; Zellers et al., 2018, 2019). We therefore employ the state-of-the-art PLM, RoBERTa, as our backbone model. Moreover, we design several simple and effective approaches to improve the performance of the backbone model, such as adding special tokens, sentence re-ranking, label smoothing, and back translation.

This paper describes the approaches for all subtasks developed by the IIE-NLP-Eyas team (Natural Language Processing group of the Institute of Information Engineering, Chinese Academy of Sciences). Our contributions are summarized as follows:

• We design several simple and effective approaches to improve the performance of PLMs on all three subtasks, such as adding special tokens and sentence re-ranking;

• Experiments demonstrate that the proposed methods achieve significant improvements over the PLM baseline, and we obtain eighth place in subtask-1 and tenth place in subtask-2 in the final official evaluation.
Method

Since the format of the tasks in ReCAM is the same, we use a unified framework to address all of them. The following sections detail our methods.
Task Definition
We first present the description of the symbols. Formally, there are seven key elements in all subtasks, i.e., $\{D, Q, A_1, A_2, A_3, A_4, A_5\}$, where $D$ denotes the given article, $Q$ denotes the summary of the article with a placeholder, and $A_*$ denotes the candidate abstract concepts that can fill the placeholder in all subtasks.
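For concreteness, one instance under this notation can be pictured as in the short sketch below; the field names and the "@placeholder" marker are illustrative assumptions about the data layout rather than an excerpt from the official dataset.

# Illustrative example only: field names and the "@placeholder" marker are
# assumptions about the data layout, not an excerpt from the official data.
example = {
    "article": "D: the given passage ...",
    "question": "Q: a summary of the passage with a @placeholder to fill in.",
    "options": ["A1", "A2", "A3", "A4", "A5"],  # candidate abstract concepts
    "label": 2,                                 # index of the gold option
}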
Multi-Choice Based Model

Pre-trained language models have made a great contribution to MRC tasks. A significant recent milestone is BERT (Devlin et al., 2019), which achieved new state-of-the-art results on eleven natural language processing tasks. In this section, we describe the multi-choice based model that we use in all subtasks. Considering that the BERT-style model RoBERTa (Liu et al., 2019) performs more strongly than BERT, we utilize it as our backbone model; it introduces more data and bigger models for better performance. A multiple-choice based QA model $M$ consists of a PLM encoder and a task-specific classification layer, which includes a feed-forward neural network $f(\cdot)$ and a softmax operation. For each question-answer pair, the calculation of $M$ is as follows:
$$\mathrm{score}_i = \frac{\exp(f(S_i))}{\sum_{i'} \exp(f(S_{i'}))} \quad (1)$$
$$S_i = \mathrm{PLM}([Q; A_i; D]) \quad (2)$$
where $[\cdot]$ is the input constructed according to the instructions of the PLM, and $S_*$ is the final hidden state of the first token (<s> for RoBERTa). For more details, we refer to the original work on PLMs (Liu et al., 2019). The candidate answer with the highest score is identified as the final prediction. The model $M$ is trained end-to-end with the cross-entropy objective function.
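As an illustration, a minimal sketch of the scoring model in Equations 1-2 is given below. It assumes the HuggingFace transformers library; the class and variable names are ours and do not come from the official system code.

# A minimal sketch of the multiple-choice scoring model in Equations 1-2.
# Assumes the HuggingFace `transformers` library; names are illustrative.
import torch
import torch.nn as nn
from transformers import AutoModel

class MultiChoiceQA(nn.Module):
    def __init__(self, model_name="roberta-large"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)             # PLM encoder
        self.classifier = nn.Linear(self.encoder.config.hidden_size, 1)  # f(.)

    def forward(self, input_ids, attention_mask):
        # input_ids: (batch, num_options, seq_len), one sequence per [Q; A_i; D]
        batch, num_options, seq_len = input_ids.size()
        outputs = self.encoder(
            input_ids=input_ids.view(-1, seq_len),
            attention_mask=attention_mask.view(-1, seq_len),
        )
        s_i = outputs.last_hidden_state[:, 0]                    # first-token state, Eq. 2
        logits = self.classifier(s_i).view(batch, num_options)   # f(S_i)
        return torch.log_softmax(logits, dim=-1)                 # scores over options, Eq. 1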
Special Tokens

To help the PLM represent and understand the abstract concepts in textual descriptions, we add special tokens to enhance the semantic representation of the candidate concepts. The idea is similar to the prompt template of Xing et al. (2020). We use special tokens to mark each candidate concept in the input sequence.
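A sketch of this step is given below. The exact marker strings are not spelled out above, so the "<opt>"/"</opt>" tokens and the "@placeholder" string are hypothetical placeholders used only for illustration.

# Illustration of registering marker tokens and wrapping a candidate concept.
# "<opt>"/"</opt>" and "@placeholder" are hypothetical placeholders; the exact
# special tokens used in the system are not specified here.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModel.from_pretrained("roberta-large")

tokenizer.add_special_tokens({"additional_special_tokens": ["<opt>", "</opt>"]})
model.resize_token_embeddings(len(tokenizer))  # allocate embeddings for the new tokens

def build_input(question, option, passage, max_length=256):
    # Mark the candidate abstract concept so the encoder can attend to it explicitly.
    filled = question.replace("@placeholder", f"<opt> {option} </opt>")
    return tokenizer(filled, passage, truncation=True, max_length=max_length)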
Sentence Ranking
As the given passage is too long to be handled by the Pre-trained Language Models (PLMs), we consider refining the passage input by rearranging the order of the sentences in the passage. With this reordering, the sentences that are more critical to the question can appear at the beginning of the passage. Although the passage's sequential information is sacrificed, we keep more of the question-relevant information of the passage. Suppose the passage $D$ contains $N$ sentences, i.e., $D = \{W_1, W_2, ..., W_N\}$, where each sentence $W_n = \{t_1, t_2, ..., t_M\}$ includes $M$ tokens, and denote the given cloze-style question as $Q$. To rank the sentences in $D$, we resort to BERT to compute the similarity score between each sentence $W_n$ and $Q$, following the algorithm of Zhang et al. (2020). After ranking, the sentences in $D$ are sorted in descending order of similarity score, and we obtain a rearranged passage $\hat{D}$ as the passage input to the QA model.
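A sketch of the re-ranking step is given below, assuming the bert_score package released with Zhang et al. (2020) as the sentence-question similarity scorer; the helper name is illustrative.

# Sketch of sentence re-ranking with BERT-based similarity (Zhang et al., 2020).
# Assumes the `bert_score` package; the helper name is illustrative.
from bert_score import score as bertscore

def rerank_passage(sentences, question):
    # Score every passage sentence against the cloze question, then sort the
    # sentences in descending order of similarity so the most question-relevant
    # ones appear at the beginning of the rearranged passage.
    refs = [question] * len(sentences)
    _, _, f1 = bertscore(sentences, refs, lang="en", verbose=False)
    order = sorted(range(len(sentences)), key=lambda i: f1[i].item(), reverse=True)
    return " ".join(sentences[i] for i in order)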
Siamese Encoders

When exploring the dataset, we find that the complete question statement, i.e., the statement obtained after replacing the placeholder token with a candidate option, also contains semantic information that can help judge the options. Based on this observation, we propose a siamese-encoder architecture to inject the additional complete question statement without disturbing the input that contains the passage. It can also be seen as introducing an auxiliary task to assist the main task. Specifically, the training of the siamese-encoder architecture is as follows:
$$l_i^1 = \mathrm{PLM}([\hat{Q}_i])[0] \quad (3)$$
$$l_i^2 = \mathrm{PLM}([Q; A_i; D])[0] \quad (4)$$
$$P(A_i \mid \hat{Q}) = \mathrm{softmax}(f(l_i^1)) \quad (5)$$
$$P(A_i \mid D, Q) = \mathrm{softmax}(f(l_i^2)) \quad (6)$$
where $\mathrm{PLM}(\cdot)$ stands for the PLM encoder, $\hat{Q}_i$ is the complete question statement, $i$ indicates the $i$-th candidate answer, and $f(\cdot)$ is the feed-forward network. To coordinate the two losses, we opt for an uncertainty loss (Kendall et al., 2018) to adjust them adaptively through $\sigma_{1,2}$:
$$\mathcal{L}(\theta, \sigma_1, \sigma_2) = \frac{1}{2\sigma_1^2}\mathcal{L}_1(\theta) + \frac{1}{2\sigma_2^2}\mathcal{L}_2(\theta) + \log \sigma_1\sigma_2$$
where $\mathcal{L}_{1,2}$ are the cross-entropy losses between the model predictions $P_{1,2}$ and the ground-truth label, respectively.
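A minimal sketch of the loss combination is given below. It uses the common log-variance parameterisation of the uncertainty weighting of Kendall et al. (2018), which is one numerically stable way to realise the formula above, not necessarily the exact implementation used in the system.

# Sketch of uncertainty-based weighting of the two cross-entropy losses
# (Kendall et al., 2018), using a log-variance parameterisation for stability.
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    def __init__(self):
        super().__init__()
        self.log_var = nn.Parameter(torch.zeros(2))  # learnable log(sigma_k^2)

    def forward(self, loss_question_only, loss_with_passage):
        # Expects two scalar loss tensors (L_1 and L_2 in the text).
        losses = torch.stack([loss_question_only, loss_with_passage])
        precision = torch.exp(-self.log_var)  # 1 / sigma_k^2
        # 0.5 * (L_k / sigma_k^2 + log sigma_k^2), summed over the two tasks
        return 0.5 * (precision * losses + self.log_var).sum()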
Back Translation

Generally speaking, more successful neural networks require a large number of parameters, often in the millions. For a neural network to work correctly, a lot of data is needed for training, but in practice there is rarely as much data as we would like. Data augmentation plays two roles: one is to increase the amount of training data and improve the generalization ability of the model; the other is to add noisy data and improve the robustness of the model.

Subtask            Train   Trial   Dev   Test
Imperceptibility   3227    1000    837   2025
Nonspecificility   3318    1000    851   2017

Table 1: Data scale of each subtask.

A large number of works (Buslaev et al., 2018; Bloice et al., 2019; Chen et al., 2020; Cubuk et al., 2020; Sato et al., 2018; Zhu et al., 2020) use data augmentation to obtain better performance. In the field of computer vision, much work (Buslaev et al., 2018; Bloice et al., 2019; Chen et al., 2020; Cubuk et al., 2020) applies operations to existing data, such as flipping, translation, or rotation, to create more data so that neural networks generalize better. Adding Gaussian noise in text processing (Sato et al., 2018) can also achieve the effect of data augmentation. Besides, some works (Miyato et al., 2017; Zhu et al., 2020) utilize adversarial training methods for data augmentation. For convenience and simplicity, we adopt back translation (Sennrich et al., 2016) to increase the amount of training data; back translation has also been used to construct pseudo-parallel corpora in unsupervised machine translation (Lample et al., 2018). Specifically, we use the Google translation API (https://translate.google.com) to translate the passage into French, and then translate the translation back into English. The pseudo-parallel corpus is obtained as:
$$\{D'\} = \mathrm{bkt}(\{D\}) \quad (7)$$
where $\{D'\}$ is the translated English corpus that we use as augmented data and $\mathrm{bkt}$ is back translation.

As for the question, given the existence of the special placeholder token, forced translation may result in grammatical errors and semantic gaps. Therefore, the questions and options are kept in their original form. After obtaining the pseudo-parallel corpus, we train our model on it together with the training data using the cross-entropy loss function.
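The augmentation step can be pictured as in the sketch below. The translate helper is a hypothetical stub standing in for the translation service (e.g. the Google translation API mentioned above); it is not a real client implementation.

# Sketch of the back-translation augmentation in Equation 7. The `translate`
# helper is a hypothetical stub standing in for a translation service
# (e.g. the Google translation API); plug in a real MT backend to run it.
def translate(text, src, dest):
    raise NotImplementedError("plug in a machine-translation backend here")

def back_translate(passages, pivot="fr"):
    augmented = []
    for passage in passages:
        french = translate(passage, src="en", dest=pivot)      # English -> French
        round_trip = translate(french, src=pivot, dest="en")   # French  -> English
        augmented.append(round_trip)
    # Questions and options are left untouched, since forced translation of the
    # placeholder token may introduce grammatical errors and semantic gaps.
    return augmented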
Label Smoothing

Furthermore, to improve the generalization ability of the model trained on a single task and to prevent overconfidence of the model, we consider training with label smoothing (Miller et al., 1996; Pereyra et al., 2017). When training with label smoothing, the hard one-hot label distribution is replaced with a softened label distribution through a smoothing value $\alpha$, which is a hyper-parameter. In our experiments, the smoothing value $\alpha$ is selected on the dev set along with the other hyper-parameters.
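A sketch of one standard formulation of label smoothing for the multiple-choice head is shown below; the gold option receives $1-\alpha$ and the remaining options share $\alpha$, and the value of alpha in the code is illustrative rather than the tuned value.

# Sketch of a standard label-smoothing loss for the multiple-choice head:
# the gold option receives 1 - alpha and the remaining options share alpha.
# The value of alpha below is illustrative, not the tuned value.
import torch

def label_smoothing_loss(log_probs, labels, alpha=0.1):
    # log_probs: (batch, num_options) log-softmax scores; labels: (batch,) gold indices
    num_options = log_probs.size(1)
    targets = torch.full_like(log_probs, alpha / (num_options - 1))
    targets.scatter_(1, labels.unsqueeze(1), 1.0 - alpha)  # softened one-hot
    return -(targets * log_probs).sum(dim=1).mean()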
Experiments

In all subtasks, the data scale of each task is shown in Table 1. We train the model on the training data together with the related pseudo data generated by back translation, select hyper-parameters based on the best-performing model on the dev set, and then report results on the test set.

Our system is implemented with PyTorch, and we use the PyTorch version of the pre-trained language models (https://github.com/huggingface/transformers). We employ the RoBERTa large model (Liu et al., 2019) as our PLM encoder in Equation 2. The Adam optimizer (Kingma and Ba, 2014) is used to fine-tune the model. We introduce the detailed setup of the best model on the development dataset. For subtask-1 and subtask-2, the hyper-parameters are shown in Table 2.

Hyper-parameter      Value
LR                   { }
Batch size           {16, 32}
Gradient norm        1.0
Warm-up              { }
Max. input length    ( )

Table 2: Hyper-parameters of our approach.

Imperceptibility

From Table 3, we can see the results of our approach on subtask-1 of ReCAM. Compared with the backbone RoBERTa large model, our methods achieve significant improvements. It is interesting that the special token is the most helpful part for the Imperceptibility subtask.

Models                              Trial Acc.   Dev Acc.
RoBERTa-large (Liu et al., 2019)    85.85        82.12
(1) w/ special tokens
(2) w/ sentence ranking             86.54        83.52
(3) w/ label smoothing              86.88        85.85
(4) w/ siamese encoders             86.62        83.22
(5) w/ back translation             87.23        84.32
Our Approach

Table 3: The results of our approach on subtask-1. Our approach is the final, stable and best model: RoBERTa-large with special tokens.
Nonspecificility
Table 4 shows the results of our approach on subtask-2 of ReCAM. Similarly, the models with special tokens work well on the Nonspecificility subtask. Compared with the backbone RoBERTa large model, our methods again achieve improvements.

Models                              Trial Acc.   Dev Acc.
RoBERTa-large (Liu et al., 2019)

Table 4: The results of our approach on subtask-2. Our approach is the final, stable and best model: RoBERTa-large with special tokens and label smoothing.
Interaction
We also perform subtask-3 of ReCAM, Interaction, which aims to provide more insight into the relationship between the two views on abstractness. In this task, we test the performance of our system when it is trained on one definition and evaluated on the other. The results of our system across the Imperceptibility and Nonspecificility subtasks are shown in Table 5. We find that our model is relatively robust across the different kinds of abstract concepts.

Trained on   Tested on   Test Acc.
Subtask-1    Subtask-1   87.51
Subtask-1    Subtask-2   84.13
Subtask-2    Subtask-2   89.64
Subtask-2    Subtask-1   81.09

Table 5: The results of our approach on subtask-3.
Ablation Study

In this part, we perform an ablation study of our approach. As shown in Tables 3 and 4, our proposed methods help the backbone model better represent and understand the abstract concepts. Note that the special tokens bring the PLMs the best improvements on both subtask-1 and subtask-2. It is possible that the special tokens teach the model to focus on the abstract concept in a stronger manner. Moreover, the other common tricks bring only small improvements.

We also search for the best special tokens for ReCAM on the dev set of subtask-1; the comparison is shown in Table 6.

Special Token   Trial Acc.   Dev Acc.

Table 6: The results of models with different special tokens on subtask-1.
References
Marcus D. Bloice, Peter M. Roth, and Andreas Holzinger. 2019. Biomedical image augmentation using Augmentor. Bioinform., 35(21):4522–4524.

Alexander V. Buslaev, Alex Parinov, Eugene Khvedchenya, Vladimir I. Iglovikov, and Alexandr A. Kalinin. 2018. Albumentations: Fast and flexible image augmentations. CoRR, abs/1809.06839.

Pengguang Chen, Shu Liu, Hengshuang Zhao, and Jiaya Jia. 2020. GridMask data augmentation. CoRR, abs/2001.04086.

Ekin D. Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V. Le. 2020. RandAugment: Practical automated data augmentation with a reduced search space. In CVPR Workshops, pages 3008–3017.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, pages 4171–4186.

Alex Kendall, Yarin Gal, and Roberto Cipolla. 2018. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In CVPR, pages 7482–7491.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018. Unsupervised machine translation using monolingual corpora only. In ICLR.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

David J. Miller, Ajit V. Rao, Kenneth Rose, and Allen Gersho. 1996. A global optimization technique for statistical classifier design. IEEE Trans. Signal Process., 44(12):3108–3122.

Takeru Miyato, Andrew M. Dai, and Ian J. Goodfellow. 2017. Adversarial training methods for semi-supervised text classification. In ICLR.

Gabriel Pereyra, George Tucker, Jan Chorowski, Lukasz Kaiser, and Geoffrey E. Hinton. 2017. Regularizing neural networks by penalizing confident output distributions. In ICLR (Workshop).

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog.

Motoki Sato, Jun Suzuki, Hiroyuki Shindo, and Yuji Matsumoto. 2018. Interpretable adversarial perturbation in input embedding space for text. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, pages 4323–4330, Stockholm, Sweden.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.

Luxi Xing, Yuqiang Xie, Yue Hu, and Wei Peng. 2020. IIE-NLP-NUT at SemEval-2020 Task 4: Guiding PLM with prompt template reconstruction strategy for ComVE. In SemEval@COLING.

Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 93–104, Brussels, Belgium. Association for Computational Linguistics.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, Florence, Italy. Association for Computational Linguistics.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.

Boyuan Zheng, Xiaoyu Yang, Yuping Ruan, Quan Liu, Zhen-Hua Ling, Si Wei, and Xiaodan Zhu. 2021. SemEval-2021 Task 4: Reading comprehension of abstract meaning. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021).

Chen Zhu, Yu Cheng, Zhe Gan, Siqi Sun, Tom Goldstein, and Jingjing Liu. 2020. FreeLB: Enhanced adversarial training for natural language understanding. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia.