IIE-NLP-Eyas at SemEval-2021 Task 4: Enhancing PLM for ReCAM with Special Tokens, Re-Ranking, Siamese Encoders and Back Translation
Yuqiang Xie, Luxi Xing, Wei Peng, Yue Hu∗
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
{xieyuqiang,xingluxi,pengwei,huyue}@iie.ac.cn
∗ Corresponding author.

Abstract
This paper introduces our systems for all three subtasks of SemEval-2021 Task 4: Reading Comprehension of Abstract Meaning. To help our model better represent and understand abstract concepts in natural language, we design several simple and effective approaches on top of the backbone model (RoBERTa). Specifically, we formalize the subtasks into the multiple-choice question answering format and add special tokens to the abstract concepts; the final prediction of the question answering model is then taken as the result of each subtask. Additionally, we employ several fine-tuning tricks to improve performance. Experimental results show that our approaches achieve significant improvements over the baseline systems. Our approaches rank eighth on subtask-1 and tenth on subtask-2.
Introduction

The computer's ability to understand, represent, and express abstract meaning is a fundamental problem on the way to true natural language understanding. SemEval-2021 Task 4: Reading Comprehension of Abstract Meaning (ReCAM) provides a well-formed benchmark that aims to study the machine's ability to represent and understand abstract concepts (Zheng et al., 2021).

The Reading Comprehension of Abstract Meaning (ReCAM) task is divided into three subtasks: Imperceptibility, Nonspecificility, and Interaction. Please refer to the task description paper (Zheng et al., 2021) for more details. To address these challenges in ReCAM, we first formalize all subtasks as a type of multiple-choice Question Answering (QA) task, following Xing et al. (2020). Recently, large Pre-trained Language Models (PLMs), such as
GPT-2 (Radford et al., 2019), BERT (Devlin et al., 2019), and RoBERTa (Liu et al., 2019), have demonstrated excellent ability on various natural language understanding tasks (Wang et al., 2018; Zellers et al., 2018, 2019). We therefore employ the state-of-the-art PLM, RoBERTa, as our backbone model. Moreover, we design several simple and effective approaches to improve the performance of the backbone model, such as adding special tokens, sentence re-ranking, label smoothing, and back translation.

This paper describes the approaches for all subtasks developed by the IIE-NLP-Eyas team (Natural Language Processing group of the Institute of Information Engineering, Chinese Academy of Sciences). Our contributions are summarized as follows:

• We design several simple and effective approaches to improve the performance of PLMs on all three subtasks, such as adding special tokens and sentence re-ranking;

• Experiments demonstrate that the proposed methods achieve significant improvements over the PLM baseline, and we obtain eighth place in subtask-1 and tenth place in subtask-2 in the final official evaluation.
Method

Since the format of the tasks in ReCAM is the same, we use a unified framework to address all of them. The following sections detail our methods.
Task Definition
We first present the description of the symbols. Formally, there are seven key elements in all subtasks, i.e., $\{D, Q, A_1, A_2, A_3, A_4, A_5\}$, where $D$ denotes the given article, $Q$ denotes the summary of the article with a placeholder, and $A_*$ denotes the candidate abstract concepts that can fill the placeholder in all subtasks.
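For concreteness, one instance under this notation can be pictured as in the short sketch below; the field names and the "@placeholder" marker are illustrative assumptions about the data layout rather than an excerpt from the official dataset.

# Illustrative example only: field names and the "@placeholder" marker are
# assumptions about the data layout, not an excerpt from the official data.
example = {
    "article": "D: the given passage ...",
    "question": "Q: a summary of the passage with a @placeholder to fill in.",
    "options": ["A1", "A2", "A3", "A4", "A5"],  # candidate abstract concepts
    "label": 2,                                 # index of the gold option
}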
Multi-Choice Based Model

Pre-trained language models have made a great contribution to MRC tasks. A significant recent milestone is BERT (Devlin et al., 2019), which achieved new state-of-the-art results on eleven natural language processing tasks. In this section, we describe the multi-choice based model that we use in all subtasks. Considering that the BERT-style model RoBERTa (Liu et al., 2019) performs more strongly than BERT, we utilize it as our backbone model; it introduces more data and bigger models for better performance. A multiple-choice based QA model $M$ consists of a PLM encoder and a task-specific classification layer, which includes a feed-forward neural network $f(\cdot)$ and a softmax operation. For each question-answer pair, the calculation of $M$ is as follows:
$$\mathrm{score}_i = \frac{\exp(f(S_i))}{\sum_{i'} \exp(f(S_{i'}))} \quad (1)$$
$$S_i = \mathrm{PLM}([Q; A_i; D]) \quad (2)$$
where $[\cdot]$ is the input constructed according to the instructions of the PLM, and $S_*$ is the final hidden state of the first token (<s> for RoBERTa). For more details, we refer to the original work on PLMs (Liu et al., 2019). The candidate answer with the highest score is identified as the final prediction. The model $M$ is trained end-to-end with the cross-entropy objective function.
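As an illustration, a minimal sketch of the scoring model in Equations 1-2 is given below. It assumes the HuggingFace transformers library; the class and variable names are ours and do not come from the official system code.

# A minimal sketch of the multiple-choice scoring model in Equations 1-2.
# Assumes the HuggingFace `transformers` library; names are illustrative.
import torch
import torch.nn as nn
from transformers import AutoModel

class MultiChoiceQA(nn.Module):
    def __init__(self, model_name="roberta-large"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)             # PLM encoder
        self.classifier = nn.Linear(self.encoder.config.hidden_size, 1)  # f(.)

    def forward(self, input_ids, attention_mask):
        # input_ids: (batch, num_options, seq_len), one sequence per [Q; A_i; D]
        batch, num_options, seq_len = input_ids.size()
        outputs = self.encoder(
            input_ids=input_ids.view(-1, seq_len),
            attention_mask=attention_mask.view(-1, seq_len),
        )
        s_i = outputs.last_hidden_state[:, 0]                    # first-token state, Eq. 2
        logits = self.classifier(s_i).view(batch, num_options)   # f(S_i)
        return torch.log_softmax(logits, dim=-1)                 # scores over options, Eq. 1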
Special Tokens

To help the PLM represent and understand the abstract concepts in textual descriptions, we add special tokens to enhance the semantic representation of the candidate concepts. The idea is similar to the prompt template of Xing et al. (2020). We use special tokens to mark each candidate concept in the input sequence.
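A sketch of this step is given below. The exact marker strings are not spelled out above, so the "<opt>"/"</opt>" tokens and the "@placeholder" string are hypothetical placeholders used only for illustration.

# Illustration of registering marker tokens and wrapping a candidate concept.
# "<opt>"/"</opt>" and "@placeholder" are hypothetical placeholders; the exact
# special tokens used in the system are not specified here.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModel.from_pretrained("roberta-large")

tokenizer.add_special_tokens({"additional_special_tokens": ["<opt>", "</opt>"]})
model.resize_token_embeddings(len(tokenizer))  # allocate embeddings for the new tokens

def build_input(question, option, passage, max_length=256):
    # Mark the candidate abstract concept so the encoder can attend to it explicitly.
    filled = question.replace("@placeholder", f"<opt> {option} </opt>")
    return tokenizer(filled, passage, truncation=True, max_length=max_length)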
Sentence Ranking
As the given passage is too long to be handled by the Pre-trained Language Models (PLMs), we consider refining the passage input by rearranging the order of the sentences in the passage. With this reordering, the sentences that are more critical to the question can appear at the beginning of the passage. Although the passage's sequential information is sacrificed, we keep more of the question-relevant information of the passage. Suppose the passage $D$ contains $N$ sentences, i.e., $D = \{W_1, W_2, ..., W_N\}$, where each sentence $W_n = \{t_1, t_2, ..., t_M\}$ includes $M$ tokens, and denote the given cloze-style question as $Q$. To rank the sentences in $D$, we resort to BERT to compute the similarity score between each sentence $W_n$ and $Q$, following the algorithm of Zhang et al. (2020). After ranking, the sentences in $D$ are sorted in descending order of similarity score, and we obtain a rearranged passage $\hat{D}$ as the passage input to the QA model.
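A sketch of the re-ranking step is given below, assuming the bert_score package released with Zhang et al. (2020) as the sentence-question similarity scorer; the helper name is illustrative.

# Sketch of sentence re-ranking with BERT-based similarity (Zhang et al., 2020).
# Assumes the `bert_score` package; the helper name is illustrative.
from bert_score import score as bertscore

def rerank_passage(sentences, question):
    # Score every passage sentence against the cloze question, then sort the
    # sentences in descending order of similarity so the most question-relevant
    # ones appear at the beginning of the rearranged passage.
    refs = [question] * len(sentences)
    _, _, f1 = bertscore(sentences, refs, lang="en", verbose=False)
    order = sorted(range(len(sentences)), key=lambda i: f1[i].item(), reverse=True)
    return " ".join(sentences[i] for i in order)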
Siamese Encoders

When exploring the dataset, we find that the complete question statement, i.e., the statement obtained after replacing the placeholder token with a candidate option, also contains semantic information that can help judge the options. Based on this observation, we propose a siamese-encoder architecture to inject the additional complete question statement without disturbing the input that contains the passage. It can also be seen as introducing an auxiliary task to assist the main task. Specifically, the training of the siamese-encoder architecture is as follows:
$$l_i^1 = \mathrm{PLM}([\hat{Q}_i])[0] \quad (3)$$
$$l_i^2 = \mathrm{PLM}([Q; A_i; D])[0] \quad (4)$$
$$P(A_i \mid \hat{Q}) = \mathrm{softmax}(f(l_i^1)) \quad (5)$$
$$P(A_i \mid D, Q) = \mathrm{softmax}(f(l_i^2)) \quad (6)$$
where $\mathrm{PLM}(\cdot)$ stands for the PLM encoder, $\hat{Q}_i$ is the complete question statement, $i$ indicates the $i$-th candidate answer, and $f(\cdot)$ is the feed-forward network. To coordinate the two losses, we opt for an uncertainty loss (Kendall et al., 2018) to adjust them adaptively through $\sigma_{1,2}$:
$$\mathcal{L}(\theta, \sigma_1, \sigma_2) = \frac{1}{2\sigma_1^2}\mathcal{L}_1(\theta) + \frac{1}{2\sigma_2^2}\mathcal{L}_2(\theta) + \log \sigma_1\sigma_2$$
where $\mathcal{L}_{1,2}$ are the cross-entropy losses between the model predictions $P_{1,2}$ and the ground-truth label, respectively.
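A minimal sketch of the loss combination is given below. It uses the common log-variance parameterisation of the uncertainty weighting of Kendall et al. (2018), which is one numerically stable way to realise the formula above, not necessarily the exact implementation used in the system.

# Sketch of uncertainty-based weighting of the two cross-entropy losses
# (Kendall et al., 2018), using a log-variance parameterisation for stability.
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    def __init__(self):
        super().__init__()
        self.log_var = nn.Parameter(torch.zeros(2))  # learnable log(sigma_k^2)

    def forward(self, loss_question_only, loss_with_passage):
        # Expects two scalar loss tensors (L_1 and L_2 in the text).
        losses = torch.stack([loss_question_only, loss_with_passage])
        precision = torch.exp(-self.log_var)  # 1 / sigma_k^2
        # 0.5 * (L_k / sigma_k^2 + log sigma_k^2), summed over the two tasks
        return 0.5 * (precision * losses + self.log_var).sum()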
Back Translation

Generally speaking, more successful neural networks require a large number of parameters, often in the millions. For a neural network to work correctly, a lot of data is needed for training, but in practice there is rarely as much data as we would like. Data augmentation plays two roles: one is to increase the amount of training data and improve the generalization ability of the model; the other is to add noisy data and improve the robustness of the model.

Subtask            Train   Trial   Dev   Test
Imperceptibility   3227    1000    837   2025
Nonspecificility   3318    1000    851   2017

Table 1: Data scale of each subtask.

A large number of works (Buslaev et al., 2018; Bloice et al., 2019; Chen et al., 2020; Cubuk et al., 2020; Sato et al., 2018; Zhu et al., 2020) use data augmentation to obtain better performance. In the field of computer vision, much work (Buslaev et al., 2018; Bloice et al., 2019; Chen et al., 2020; Cubuk et al., 2020) applies operations to existing data, such as flipping, translation, or rotation, to create more data so that neural networks generalize better. Adding Gaussian noise in text processing (Sato et al., 2018) can also achieve the effect of data augmentation. Besides, some works (Miyato et al., 2017; Zhu et al., 2020) utilize adversarial training methods for data augmentation. For convenience and simplicity, we adopt back translation (Sennrich et al., 2016) to increase the amount of training data; back translation has also been used to construct pseudo-parallel corpora in unsupervised machine translation (Lample et al., 2018). Specifically, we use the Google translation API (https://translate.google.com) to translate the passage into French, and then translate the translation back into English. The pseudo-parallel corpus is obtained as:
$$\{D'\} = \mathrm{bkt}(\{D\}) \quad (7)$$
where $\{D'\}$ is the translated English corpus that we use as augmented data and $\mathrm{bkt}$ is back translation.

As for the question, given the existence of the special placeholder token, forced translation may result in grammatical errors and semantic gaps. Therefore, the questions and options are kept in their original form. After obtaining the pseudo-parallel corpus, we train our model on it together with the training data using the cross-entropy loss function.
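The augmentation step can be pictured as in the sketch below. The translate helper is a hypothetical stub standing in for the translation service (e.g. the Google translation API mentioned above); it is not a real client implementation.

# Sketch of the back-translation augmentation in Equation 7. The `translate`
# helper is a hypothetical stub standing in for a translation service
# (e.g. the Google translation API); plug in a real MT backend to run it.
def translate(text, src, dest):
    raise NotImplementedError("plug in a machine-translation backend here")

def back_translate(passages, pivot="fr"):
    augmented = []
    for passage in passages:
        french = translate(passage, src="en", dest=pivot)      # English -> French
        round_trip = translate(french, src=pivot, dest="en")   # French  -> English
        augmented.append(round_trip)
    # Questions and options are left untouched, since forced translation of the
    # placeholder token may introduce grammatical errors and semantic gaps.
    return augmented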
Label Smoothing

Furthermore, to improve the generalization ability of the model trained on a single task and to prevent overconfidence of the model, we consider training with label smoothing (Miller et al., 1996; Pereyra et al., 2017). When training with label smoothing, the hard one-hot label distribution is replaced with a softened label distribution through a smoothing value $\alpha$, which is a hyper-parameter. In our experiments, the smoothing value $\alpha$ is selected on the dev set along with the other hyper-parameters.
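A sketch of one standard formulation of label smoothing for the multiple-choice head is shown below; the gold option receives $1-\alpha$ and the remaining options share $\alpha$, and the value of alpha in the code is illustrative rather than the tuned value.

# Sketch of a standard label-smoothing loss for the multiple-choice head:
# the gold option receives 1 - alpha and the remaining options share alpha.
# The value of alpha below is illustrative, not the tuned value.
import torch

def label_smoothing_loss(log_probs, labels, alpha=0.1):
    # log_probs: (batch, num_options) log-softmax scores; labels: (batch,) gold indices
    num_options = log_probs.size(1)
    targets = torch.full_like(log_probs, alpha / (num_options - 1))
    targets.scatter_(1, labels.unsqueeze(1), 1.0 - alpha)  # softened one-hot
    return -(targets * log_probs).sum(dim=1).mean()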
Experiments

In all subtasks, the data scale of each task is shown in Table 1. We train the model on the training data together with the related pseudo data generated by back translation, select hyper-parameters based on the best-performing model on the dev set, and then report results on the test set.

Our system is implemented with PyTorch, and we use the PyTorch version of the pre-trained language models (https://github.com/huggingface/transformers). We employ the RoBERTa large model (Liu et al., 2019) as our PLM encoder in Equation 2. The Adam optimizer (Kingma and Ba, 2014) is used to fine-tune the model. We introduce the detailed setup of the best model on the development dataset. For subtask-1 and subtask-2, the hyper-parameters are shown in Table 2.

Hyper-parameter      Value
LR                   { }
Batch size           {16, 32}
Gradient norm        1.0
Warm-up              { }
Max. input length    ( )

Table 2: Hyper-parameters of our approach.

Imperceptibility

From Table 3, we can see the results of our approach on subtask-1 of ReCAM. Compared with the backbone RoBERTa large model, our methods achieve significant improvements. It is interesting that the special token is the most helpful part for the Imperceptibility subtask.

Models                              Trial Acc.   Dev Acc.
RoBERTa-large (Liu et al., 2019)    85.85        82.12
(1) w/ special tokens
(2) w/ sentence ranking             86.54        83.52
(3) w/ label smoothing              86.88        85.85
(4) w/ siamese encoders             86.62        83.22
(5) w/ back translation             87.23        84.32
Our Approach

Table 3: The results of our approach on subtask-1. Our approach is the final, stable and best model: RoBERTa-large with special tokens.
Nonspecificility
Table 4 shows the results of our approach on subtask-2 of ReCAM. Similarly, the models with special tokens work well on the Nonspecificility subtask. Compared with the backbone RoBERTa large model, our methods again achieve improvements.

Models                              Trial Acc.   Dev Acc.
RoBERTa-large (Liu et al., 2019)

Table 4: The results of our approach on subtask-2. Our approach is the final, stable and best model: RoBERTa-large with special tokens and label smoothing.
Interaction
We also perform subtask-3 of ReCAM, Interaction, which aims to provide more insight into the relationship between the two views on abstractness. In this task, we test the performance of our system when it is trained on one definition and evaluated on the other. The results of our system across the Imperceptibility and Nonspecificility subtasks are shown in Table 5. We find that our model is relatively robust across the different kinds of abstract concepts.

Trained on   Tested on   Test Acc.
Subtask-1    Subtask-1   87.51
Subtask-1    Subtask-2   84.13
Subtask-2    Subtask-2   89.64
Subtask-2    Subtask-1   81.09

Table 5: The results of our approach on subtask-3.
Ablation Study

In this part, we perform an ablation study of our approach. As shown in Tables 3 and 4, our proposed methods help the backbone model better represent and understand the abstract concepts. Note that the special tokens bring the PLMs the best improvements on both subtask-1 and subtask-2. It is possible that the special tokens teach the model to focus on the abstract concept in a stronger manner. Moreover, the other common tricks bring only small improvements.

We also search for the best special tokens for ReCAM on the dev set of subtask-1; the comparison is shown in Table 6.

Special Token   Trial Acc.   Dev Acc.

Table 6: The results of models with different special tokens on subtask-1.
References
Marcus D. Bloice, Peter M. Roth, and Andreas Holzinger. 2019. Biomedical image augmentation using Augmentor. Bioinform., 35(21):4522–4524.

Alexander V. Buslaev, Alex Parinov, Eugene Khvedchenya, Vladimir I. Iglovikov, and Alexandr A. Kalinin. 2018. Albumentations: Fast and flexible image augmentations. CoRR, abs/1809.06839.

Pengguang Chen, Shu Liu, Hengshuang Zhao, and Jiaya Jia. 2020. GridMask data augmentation. CoRR, abs/2001.04086.

Ekin D. Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V. Le. 2020. RandAugment: Practical automated data augmentation with a reduced search space. In CVPR Workshops, pages 3008–3017.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, pages 4171–4186.

Alex Kendall, Yarin Gal, and Roberto Cipolla. 2018. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In CVPR, pages 7482–7491.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018. Unsupervised machine translation using monolingual corpora only. In ICLR.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

David J. Miller, Ajit V. Rao, Kenneth Rose, and Allen Gersho. 1996. A global optimization technique for statistical classifier design. IEEE Trans. Signal Process., 44(12):3108–3122.

Takeru Miyato, Andrew M. Dai, and Ian J. Goodfellow. 2017. Adversarial training methods for semi-supervised text classification. In ICLR.

Gabriel Pereyra, George Tucker, Jan Chorowski, Lukasz Kaiser, and Geoffrey E. Hinton. 2017. Regularizing neural networks by penalizing confident output distributions. In ICLR (Workshop).

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog.

Motoki Sato, Jun Suzuki, Hiroyuki Shindo, and Yuji Matsumoto. 2018. Interpretable adversarial perturbation in input embedding space for text. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, pages 4323–4330, Stockholm, Sweden.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.

Luxi Xing, Yuqiang Xie, Yue Hu, and Wei Peng. 2020. IIE-NLP-NUT at SemEval-2020 Task 4: Guiding PLM with prompt template reconstruction strategy for ComVE. In SemEval@COLING.

Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 93–104, Brussels, Belgium. Association for Computational Linguistics.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, Florence, Italy. Association for Computational Linguistics.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.

Boyuan Zheng, Xiaoyu Yang, Yuping Ruan, Quan Liu, Zhen-Hua Ling, Si Wei, and Xiaodan Zhu. 2021. SemEval-2021 Task 4: Reading comprehension of abstract meaning. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021).

Chen Zhu, Yu Cheng, Zhe Gan, Siqi Sun, Tom Goldstein, and Jingjing Liu. 2020. FreeLB: Enhanced adversarial training for natural language understanding. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia.