ISCAS at SemEval-2020 Task 5: Pre-trained Transformers for Counterfactual Statement Modeling
Yaojie Lu, Annan Li, Hongyu Lin, Xianpei Han, Le Sun
Chinese Information Processing Laboratory, State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing, China
University of Chinese Academy of Sciences, Beijing, China
{yaojie2017,liannan2019,hongyu2016,xianpei,sunle}@iscas.ac.cn

Abstract
ISCAS participated in two subtasks of SemEval 2020 Task 5: detecting counterfactual statements and detecting antecedent and consequence. This paper describes our system, which is based on pre-trained transformers. For the first subtask, we train several transformer-based classifiers for detecting counterfactual statements. For the second subtask, we formulate antecedent and consequence extraction as a query-based question answering problem. The two subsystems both achieved third place in the evaluation. Our system is openly released at https://github.com/casnlu/ISCAS-SemEval2020Task5.
Counterfactual statements describe events that did not actually happen or cannot happen, as well as the possible consequences if the events had happened. Counterfactual detection aims to identify counterfactual statements in language and to understand the antecedents and consequents in these statements. For instance, the following sentence is a counterfactual statement, in which the if-clause is the antecedent and the main clause is the consequence:

Her post-traumatic stress could have been avoided if a combination of paroxetine and exposure therapy had been prescribed two months earlier.

Once the statement is understood, we can accumulate causal knowledge about "post-traumatic stress", i.e., "a combination of paroxetine and exposure therapy may help cure post-traumatic stress". To model counterfactual semantics and reasoning in natural language, SemEval 2020 Task 5 provides an English benchmark for two basic problems: detecting counterfactual statements and detecting antecedent and consequence (Yang et al., 2020).

We build our evaluation systems on pre-trained transformer-based neural network models, which have shown significant improvements over conventional methods in many NLP fields (Devlin et al., 2019; Liu et al., 2020; Lan et al., 2020). Specifically, for subtask 1, we design several transformer-based classifiers to detect counterfactual statements. Because counterfactual antecedents are usually expressed with explicit conditional connectives, such as if and wish, we also equip the transformers with an additional convolutional neural network to capture this strong local context information. For subtask 2, we formulate antecedent and consequence extraction as a query-based question answering problem. Specifically, to effectively model context information in counterfactual statements, we design two different kinds of input queries for antecedents/consequences and regard counterfactual statements as given paragraphs.

The rest of this paper is organized as follows. Section 2 introduces the background of pre-trained transformers. Section 3 gives an overview of our system for the two subtasks. Sections 4 and 5 describe the detailed experiment setup and the overall system performance on the two subtasks. Finally, we conclude this paper in Section 6.

Figure 1: The Transformer for Detecting Counterfactual Statements.
Different from pre-trained word embeddings in NLP (Pennington et al., 2014), pre-trained contextualized models aim to learn encoders that represent words in context for downstream tasks. BERT (Devlin et al., 2019) is a representative large-scale pre-trained transformer, trained with masked language modeling (MLM) and next sentence prediction (NSP) tasks.

The Whole Word Masking model (BERT-WWM) is a simple but effective variant of BERT: during pre-training, all of the tokens corresponding to a word are always masked together, instead of masking individual WordPiece tokens (sub-tokens).

RoBERTa (Liu et al., 2020) further improves on BERT's pre-training procedure and achieves substantial improvements. The improvements include training the model longer, with bigger batches over more data; removing the next sentence prediction objective; training on longer sequences; and dynamically changing the masking pattern applied to the training data.

ALBERT (Lan et al., 2020) incorporates factorized embedding parameterization and cross-layer parameter sharing to reduce the number of parameters of BERT. These two methods significantly reduce the parameter count, improving parameter efficiency and facilitating the learning of larger models. Besides, ALBERT replaces BERT's NSP task with a sentence order prediction (SOP) self-supervised task, since it helps the model better learn sentence coherence.
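As a concrete illustration, such pre-trained backbones can be loaded as drop-in encoders. The sketch below assumes the HuggingFace transformers library and publicly released checkpoint names, neither of which is specified in our system description.

```python
# Illustrative only: loading the three families of pre-trained transformers with
# the HuggingFace `transformers` library (the checkpoint names are assumptions).
from transformers import AutoModel, AutoTokenizer

for checkpoint in ["bert-large-cased-whole-word-masking",  # BERT-WWM
                   "roberta-large",                         # RoBERTa
                   "albert-xxlarge-v2"]:                    # ALBERT
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    # output_hidden_states=True exposes all layer outputs, which the
    # scalar-mix pooling described in Section 3 aggregates per token.
    encoder = AutoModel.from_pretrained(checkpoint, output_hidden_states=True)
```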
Given a candidate text $x = \{w_1, w_2, ..., w_n\}$, our system needs to: 1) determine whether the candidate contains a counterfactual statement; and 2) extract the antecedent and consequence from the counterfactual statement. For the example in Section 1, we first detect that it is a counterfactual statement, then extract "if a combination of paroxetine and exposure therapy had been prescribed two months earlier" as its antecedent and "Her post-traumatic stress could have been avoided" as its consequence. In the following, we describe our two sub-systems in detail.

To detect counterfactual statements, we build classifiers based on contextualized representations. We first represent each word in the text using its contextualized representation, then obtain the overall text representation using two different aggregation methods, and finally determine whether the text contains a counterfactual statement using a classifier. The overall framework is shown in Figure 1.
Contextualized Word Representation Layer.
To capture the counterfactual semantics in natural language, we learn a contextualized representation for each token. To alleviate the out-of-vocabulary problem in text representation, we first convert the raw input text into WordPiece sub-tokens (https://github.com/google-research/bert) $\{\tilde{x}_1, ..., \tilde{x}_n\}$ from the pre-defined vocabulary. Then, the two special symbols [CLS] and [SEP] are added to the head and tail of the sentence. Finally, we feed the tokenized text $\tilde{x} = \{[CLS], \tilde{x}_1, ..., \tilde{x}_n, [SEP]\}$ into an $L$-layer pre-trained transformer to obtain the contextualized representation of each sub-token.

Following Tenney et al. (2019), we pool token $i$'s representation $h_i \in \mathbb{R}^d$ across all BERT layers using scalar mixing (Peters et al., 2018): $h_i = \gamma \sum_{j=1}^{L} \alpha_j x_i^{(j)}$, where $x_i^{(j)} \in \mathbb{R}^d$ is the embedding of token $i$ from BERT layer $j$, the $\alpha_j$ are softmax-normalized weights, and $\gamma$ is a scalar parameter. We denote the final representation of the special symbol [CLS] as $C \in \mathbb{R}^d$. For each token, we use the representation of its first sub-token as the token-level representation.
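A minimal PyTorch sketch of this scalar-mix pooling follows; the module and variable names are ours, and the layer outputs are assumed to come from a transformer run with all hidden states returned.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Scalar mixing across L transformer layers (Peters et al., 2018):
    h_i = gamma * sum_j softmax(alpha)_j * x_i^(j)."""
    def __init__(self, num_layers: int):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(num_layers))  # per-layer mixing logits
        self.gamma = nn.Parameter(torch.ones(()))            # global scaling factor

    def forward(self, layer_states):
        # layer_states: list of L tensors, each of shape [batch, seq_len, d]
        weights = torch.softmax(self.alpha, dim=0)
        mixed = sum(w * h for w, h in zip(weights, layer_states))
        return self.gamma * mixed  # [batch, seq_len, d]
```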
Contextualized Information Aggregation.

After obtaining the representation of each word, we produce an aggregated feature vector $r$ that captures the counterfactual information of the entire statement. We investigate two different aggregation strategies: [CLS] aggregation and convolutional neural network (CNN) aggregation.

In [CLS] aggregation, we directly use the representation $C$ of the special symbol [CLS] as the aggregated feature $r$ (Devlin et al., 2019).

In counterfactual statements, connectives such as "if", "even if", and "would" are often used to express the relation between antecedent and consequence. To capture these local patterns, we employ a CNN (Kim, 2014) to aggregate sentence information. Given the token sequence $\{h_1, ..., h_n\}$, a convolutional filter scans the sequence and extracts the local feature $l_i = \tanh(w \cdot h_{i:i+h-1} + b)$. Finally, a max-pooling layer produces the feature $r$ for counterfactual statement detection: $r = \max_{1 \le i \le n} l_i$.
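The CNN aggregation can be sketched as follows, assuming PyTorch; the default filter count and window size here simply mirror the values reported in Section 4.

```python
import torch
import torch.nn as nn

class CNNAggregator(nn.Module):
    """1-D convolution over token representations followed by max pooling,
    producing the statement feature r = max_i tanh(w . h_{i:i+h-1} + b)."""
    def __init__(self, hidden_dim: int, num_filters: int = 300, window: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(hidden_dim, num_filters, kernel_size=window)

    def forward(self, token_reps):
        # token_reps: [batch, seq_len, hidden_dim]
        local = torch.tanh(self.conv(token_reps.transpose(1, 2)))  # [batch, filters, n-h+1]
        return local.max(dim=-1).values                            # [batch, filters]
```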
Counterfactual Statement Classifier.

After aggregation, the feature vector $r$ is fed into the counterfactual classifier, which computes the probability that the text is a counterfactual statement:

$P(y = 1 \mid x) = \sigma(w_c \cdot r + b_c)$  (1)

where $w_c$ is the weight vector, $b_c$ is the bias term, and $\sigma$ is the sigmoid function.

Given the training set $D = \{(x_i, y_i)\}$, we train all parameters using a binary cross-entropy loss:

$L = -\sum_{i \in D} \left[ y_i \log P(y = 1 \mid x_i) + (1 - y_i) \log(1 - P(y = 1 \mid x_i)) \right]$  (2)
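Equations (1) and (2) correspond to a single linear layer with a sigmoid output trained with binary cross-entropy, as in the PyTorch sketch below (module and parameter names are ours).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CounterfactualClassifier(nn.Module):
    """Binary classifier over the aggregated feature r (Equations 1-2)."""
    def __init__(self, feature_dim: int):
        super().__init__()
        self.scorer = nn.Linear(feature_dim, 1)  # holds w_c and b_c

    def forward(self, r, labels=None):
        logits = self.scorer(r).squeeze(-1)      # w_c . r + b_c
        probs = torch.sigmoid(logits)            # P(y = 1 | x)
        if labels is None:
            return probs
        # Binary cross-entropy over the training batch (Equation 2).
        loss = F.binary_cross_entropy_with_logits(logits, labels.float())
        return probs, loss
```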
We now describe how to extract antecedent and consequence via a question answering-style procedure. Given a counterfactual statement $s$, we first construct an antecedent query $q_a$ and a consequence query $q_c$ separately, and then extract the corresponding antecedent $a_a$ and consequence $a_c$ from the text by answering these two questions. The overall framework is illustrated in Figure 2.

Query Construction.

We design two kinds of queries for extraction: name queries and definition queries. For name queries, we directly use "antecedent" and "consequence" as the query. To enrich the semantic information of the questions, we also propose definition queries, which use the dictionary definition of each label. For "antecedent", the definition query is "a preceding event, condition, or cause"; for "consequent", the definition query is "a result or effect".

Question and Context Encoding.
We represent the input question $q^*$ for extraction and the counterfactual statement $s$ as a single packed sequence: $\{[CLS], q^*, [SEP], s, [SEP]\}$. First, $q^*$ and $s$ are tokenized into sub-token sequences using WordPiece tokenization, as shown in Figure 2. After tokenization, we feed the packed sequence into the pre-trained transformer and obtain the final hidden vector $h_q^i \in \mathbb{R}^d$ for the $i$-th sub-token in the query, $h_s^j \in \mathbb{R}^d$ for the $j$-th sub-token in the statement, and $C \in \mathbb{R}^d$ for the special token [CLS].
Figure 2: The Transformer for Detecting Antecedent and Consequence.
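A sketch of the query construction and packed-sequence encoding, assuming the HuggingFace transformers tokenizer; the query strings are the ones defined above, while the function name and checkpoint are illustrative.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # any backbone from Section 4

# Definition queries for the two labels.
QUERIES = {
    "antecedent": "a preceding event, condition, or cause",
    "consequent": "a result or effect",
}

def encode_pair(label: str, statement: str, max_len: int = 128):
    # Builds the packed sequence [CLS] query [SEP] statement [SEP];
    # token_type_ids distinguish the query from the counterfactual statement.
    return tokenizer(QUERIES[label], statement,
                     truncation="only_second", max_length=max_len,
                     return_tensors="pt")

inputs = encode_pair("antecedent",
                     "Her post-traumatic stress could have been avoided if a combination "
                     "of paroxetine and exposure therapy had been prescribed two months earlier.")
```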
Answer Prediction.
To extract continuous text fragments, we employ a pointer network to predict the start position and end position of the answer text. The pointer network contains a start vector $w_{start}$ and an end vector $w_{end}$, which are used to score word $i$ as the start/end of the answer. The score of word $i$ being the start of the answer is computed as the dot product between $w_{start}$ and the token's hidden state, $w_{start} \cdot h_s^i$; the score of the end is calculated in the same way. We define the score of a candidate span from position $j$ to position $k$ as $S_{j,k} = w_{start} \cdot h_s^j + w_{end} \cdot h_s^k$, where $k \ge j$.

Since some statements do not contain consequences (such statements cover 14.64% of the training set), we regard the questions corresponding to these statements as unanswerable. For these questions, we treat the [CLS] token as both the start and the end of the answer span; the score of a statement without a consequence is thus $S_{null} = w_{start} \cdot C + w_{end} \cdot C$.

For model training, we update the full model by maximizing the likelihood of the gold start token $j^*$ and end token $k^*$ (including [CLS]):

$L = -\sum_{i \in D} \left[ \log P(y_{start} = j^* \mid x_i) + \log P(y_{end} = k^* \mid x_i) \right]$

$P(y_{start} = j^* \mid x_i) = \dfrac{\exp(w_{start} \cdot h_s^{j^*})}{\exp(w_{start} \cdot C) + \sum_{j=1}^{n} \exp(w_{start} \cdot h_s^j)}$

$P(y_{end} = k^* \mid x_i) = \dfrac{\exp(w_{end} \cdot h_s^{k^*})}{\exp(w_{end} \cdot C) + \sum_{k=1}^{n} \exp(w_{end} \cdot h_s^k)}$  (3)

where the parameters of the pointer network are trained from scratch.
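A minimal sketch of such a pointer network and its training loss, assuming PyTorch; position 0 of the packed sequence is [CLS], so index 0 serves as the no-answer span. For simplicity, the softmax here runs over the whole packed sequence rather than only the statement tokens and [CLS] as in Equation (3).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpanPointer(nn.Module):
    """Start/end scoring over the packed sequence; [CLS] (index 0) is the null span."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.w_start = nn.Linear(hidden_dim, 1, bias=False)
        self.w_end = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, hidden_states, start_positions=None, end_positions=None):
        # hidden_states: [batch, seq_len, d] from the pre-trained transformer
        start_logits = self.w_start(hidden_states).squeeze(-1)  # w_start . h
        end_logits = self.w_end(hidden_states).squeeze(-1)      # w_end . h
        if start_positions is None:
            return start_logits, end_logits
        # Cross-entropy = negative log-likelihood of the gold start/end positions;
        # unanswerable questions use start = end = 0 (the [CLS] position).
        loss = (F.cross_entropy(start_logits, start_positions)
                + F.cross_entropy(end_logits, end_positions))
        return start_logits, end_logits, loss
```

At inference time, the span $(j, k)$ with $k \ge j$ that maximizes start_logits[j] + end_logits[k] would be compared against the null score start_logits[0] + end_logits[0], following the span scoring described above.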
Datasets.

Subtask 1. This subtask contains 13,000 instances for model training and 7,000 unseen instances for online evaluation. We sampled 1,500 instances from the whole dataset as our development set. Then, we split the remaining 11,500 instances into 5 folds of 2,300 instances each. We trained five models on five groups of data for ensemble voting: each group takes four folds as the training set and the remaining fold for early stopping, and the five classifiers vote on the final label, as sketched below.
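One possible way to realize this split and ensemble is sketched below with scikit-learn's KFold; the tooling and the majority-vote rule are assumptions, since the system description only states that five models vote.

```python
import numpy as np
from sklearn.model_selection import KFold

def five_fold_splits(examples, seed: int = 42):
    """Yields (training folds, early-stopping fold) index pairs over the 11,500 instances."""
    kfold = KFold(n_splits=5, shuffle=True, random_state=seed)
    for train_idx, stop_idx in kfold.split(examples):
        yield train_idx, stop_idx

def majority_vote(predictions):
    # predictions: array of shape [5, num_examples] with 0/1 labels from the five models
    return (np.asarray(predictions).sum(axis=0) >= 3).astype(int)
```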
Subtask 2. This subtask contains a total of 3,551 instances for model training and 1,950 unseen instances for online evaluation. We sampled 3,200 instances as the training set and took the remaining 351 instances as the development set.
Hyper-parameters.

Subtask 1. For each model, we selected the best fine-tuning learning rate (among 5e-6, 1e-5, and 3e-5) on the development set. Because of the GPU memory limitation, we truncated the maximum total input sequence length after WordPiece tokenization to 128. We employ three different pre-trained transformers in our submission for the SemEval 2020 Task 5 official evaluation: BERT (Devlin et al., 2019), ALBERT (Lan et al., 2020), and RoBERTa (Liu et al., 2020). We used a batch size of 8 for ALBERT-xxlarge and 24 for the other models. For CNN aggregation, we used a single CNN layer with a window size of 3 and a hidden size of 300.
Subtask 2.
We fine-tuned all models on the training data for 5 epochs, using the same learning rate for the BERT parameters and the task parameters, evaluating and saving models every 250 steps with a batch size of 16. We trained the large model on two 24G GPUs in parallel and selected the best model on the development set for online evaluation.

We report the performance on subtask 1 and subtask 2, scored by the evaluation server (https://competitions.codalab.org/competitions/21691). In subtask 1, the models are evaluated using Precision (P), Recall (R), and F1-score (F) for binary classification. There are four metrics for subtask 2: Exact Match (EM), Precision, Recall, and F1-score. Exact match measures the percentage of predictions that match the annotated antecedents and consequences exactly. Note that the F1-score in subtask 2 is a token-level metric, calculated based on the offsets of the predicted antecedent and consequence.
Model                            F        R        P
BERT-Large-Cased-WWM + [CLS]     87.70    87.50    87.90
BERT-Large-Cased-WWM + CNN       88.00    87.90    88.10
RoBERTa-Large + [CLS]            89.80    -        -
RoBERTa-Large + CNN              89.70    89.60    89.80
ALBERT-XXLarge + [CLS]           -        -        92.20
ALBERT-XXLarge + CNN             89.00    87.70    90.40
ALBERT-XXLarge + [CLS] + CNN     -        -        -

Table 1: Subtask 1 test results.

Table 1 shows the overall results of our seven runs on subtask 1. We can see that our system achieved very competitive performance. The precision of ALBERT-XXLarge with [CLS] aggregation (92.20) ranked 1st among all teams, and our best F score (90.00) ranked 3rd among all teams.
Model                                 F        R        P        EM
BERT-Base-Cased + Name                86.30    90.30    86.00    51.60
BERT-Base-Uncased + Name              86.60    90.20    86.70    51.90
BERT-Base-Cased + Definition          86.30    90.30    86.00    52.40
BERT-Base-Uncased + Definition        86.80    90.00    87.10    52.50
BERT-Large-Uncased-WWM + Name         87.30    89.80    -        -
BERT-Large-Uncased-WWM + Definition   -        -        -        -

Table 2: Subtask 2 test results. "Name" indicates name queries and "Definition" indicates definition-enriched queries.

Table 2 shows the overall results of our runs on subtask 2. Our QA-based method ranked 1st on the R score and 3rd on the F and P scores. Besides, our system achieved 2nd on the EM score and surpassed the third-place team by a large margin (4.90). From the results in Table 2, we can see that: 1) The definition-based queries achieved better performance than the name-based queries. We believe this is because the definition-based query provides richer semantic information than the name-based query. 2) Uncased models are better than cased models on both F and EM scores. This may be because our model focuses more on capturing the structural information of counterfactual expressions, whereas case information is more useful for capturing information about named entities, such as persons and locations.

Conclusion
In this paper, we propose a transformer-based system for counterfactual modeling. For counterfactual statement detection, we investigated a variety of advanced pre-trained models and two efficient aggregation methods. For antecedent and consequent extraction, we framed the problem as a span-based question answering task and designed definition-enriched queries to extract the required spans from counterfactual statements. Evaluation results demonstrate the effectiveness of our system. For future work, we plan to investigate how to inject external knowledge into counterfactual modeling systems, for example via knowledge-enriched transformers.
References
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June. Association for Computational Linguistics.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, Doha, Qatar, October. Association for Computational Linguistics.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2020. RoBERTa: A robustly optimized BERT pretraining approach.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar, October. Association for Computational Linguistics.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana, June. Association for Computational Linguistics.

Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R. Bowman, Dipanjan Das, and Ellie Pavlick. 2019. What do you learn from context? Probing for sentence structure in contextualized word representations.

Xiaoyu Yang, Stephen Obadinma, Huasha Zhao, Qiong Zhang, Stan Matwin, and Xiaodan Zhu. 2020. SemEval-2020 task 5: Counterfactual recognition.