ISCAS at SemEval-2020 Task 5: Pre-trained Transformers for Counterfactual Statement Modeling
Yaojie Lu, Annan Li, Hongyu Lin, Xianpei Han, Le Sun
Chinese Information Processing Laboratory, State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing, China
University of Chinese Academy of Sciences, Beijing, China
{yaojie2017,liannan2019,hongyu2016,xianpei,sunle}@iscas.ac.cn

Abstract
ISCAS participated in two subtasks of SemEval 2020 Task 5: detecting counterfactual statements and detecting antecedent and consequence. This paper describes our system, which is based on pre-trained transformers. For the first subtask, we train several transformer-based classifiers for detecting counterfactual statements. For the second subtask, we formulate antecedent and consequence extraction as a query-based question answering problem. The two subsystems both achieved third place in the evaluation. Our system is openly released at https://github.com/casnlu/ISCAS-SemEval2020Task5.
Counterfactual statements describe events that did not actually happen or cannot happen, as well as the possible consequences if the events had happened. Counterfactual detection aims to identify counterfactual statements in language and to understand the antecedents and consequents in these statements. For instance, the following sentence is a counterfactual statement, in which the if-clause is the antecedent and the main clause is the consequence:

Her post-traumatic stress could have been avoided if a combination of paroxetine and exposure therapy had been prescribed two months earlier.

Once the statement is understood, we can accumulate causal knowledge about "post-traumatic stress", i.e., "a combination of paroxetine and exposure therapy may help cure post-traumatic stress". To model counterfactual semantics and reasoning in natural language, SemEval 2020 Task 5 provides an English benchmark for two basic problems: detecting counterfactual statements and detecting antecedent and consequence (Yang et al., 2020).

We build our evaluation systems on pre-trained transformer-based neural network models, which have shown significant improvements over conventional methods in many NLP fields (Devlin et al., 2019; Liu et al., 2020; Lan et al., 2020). Specifically, for subtask 1, we design several transformer-based classifiers to detect counterfactual statements. Because counterfactual antecedents are usually expressed with explicit conditional connectives, such as if and wish, we also equip the transformers with an additional convolutional neural network to capture this strong local context information. For subtask 2, we formulate antecedent and consequence extraction as a query-based question answering problem. Specifically, to effectively model context information in counterfactual statements, we design two different kinds of input queries for antecedents/consequences and regard counterfactual statements as given paragraphs.

The rest of this paper is organized as follows. Section 2 introduces the background of pre-trained transformers. Section 3 gives an overview of our system for the two subtasks. Sections 4 and 5 describe the detailed experiment setup and the overall system performance on the two subtasks. Finally, we conclude this paper in Section 6.

Figure 1: The Transformer for Detecting Counterfactual Statements.
Different from pre-trained word embeddings in NLP (Pennington et al., 2014), pre-trained contextualized models aim to learn encoders that represent words in context for downstream tasks. BERT (Devlin et al., 2019) is a representative large-scale pre-trained transformer, trained with masked language modeling (MLM) and next sentence prediction (NSP) tasks.

The Whole Word Masking model (BERT-WWM) is a simple but effective variant of BERT: during pre-training, all of the tokens corresponding to a word are always masked together, instead of masking individual WordPiece tokens (sub-tokens).

RoBERTa (Liu et al., 2020) further improves on BERT's pre-training procedure and achieves substantial improvements. The improvements include training the model longer, with bigger batches over more data; removing the next sentence prediction objective; training on longer sequences; and dynamically changing the masking pattern applied to the training data.

ALBERT (Lan et al., 2020) incorporates factorized embedding parameterization and cross-layer parameter sharing to reduce the number of parameters of BERT. These two methods significantly reduce the parameter count, improving parameter efficiency and facilitating the learning of larger models. Besides, ALBERT replaces BERT's NSP task with a sentence order prediction (SOP) self-supervised task, since it helps the model better learn sentence coherence.
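As a concrete illustration, such pre-trained backbones can be loaded as drop-in encoders. The sketch below assumes the HuggingFace transformers library and publicly released checkpoint names, neither of which is specified in our system description.

```python
# Illustrative only: loading the three families of pre-trained transformers with
# the HuggingFace `transformers` library (the checkpoint names are assumptions).
from transformers import AutoModel, AutoTokenizer

for checkpoint in ["bert-large-cased-whole-word-masking",  # BERT-WWM
                   "roberta-large",                         # RoBERTa
                   "albert-xxlarge-v2"]:                    # ALBERT
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    # output_hidden_states=True exposes all layer outputs, which the
    # scalar-mix pooling described in Section 3 aggregates per token.
    encoder = AutoModel.from_pretrained(checkpoint, output_hidden_states=True)
```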
Given a candidate text $x = \{w_1, w_2, ..., w_n\}$, our system needs to: 1) determine whether the candidate contains a counterfactual statement; and 2) extract the antecedent and consequence from the counterfactual statement. For the example in Section 1, we first detect that it is a counterfactual statement, then extract "if a combination of paroxetine and exposure therapy had been prescribed two months earlier" as its antecedent and "Her post-traumatic stress could have been avoided" as its consequence. In the following, we describe our two sub-systems in detail.

To detect counterfactual statements, we build classifiers based on contextualized representations. We first represent each word in the text using its contextualized representation, then obtain the overall text representation using two different aggregation methods, and finally determine whether the text contains a counterfactual statement using a classifier. The overall framework is shown in Figure 1.
Contextualized Word Representation Layer.
To capture the counterfactual semantics in natural language, we learn a contextualized representation for each token. To alleviate the out-of-vocabulary problem in text representation, we first convert the raw input text into WordPiece sub-tokens (https://github.com/google-research/bert) $\{\tilde{x}_1, ..., \tilde{x}_n\}$ from the pre-defined vocabulary. Then, the two special symbols [CLS] and [SEP] are added to the head and tail of the sentence. Finally, we feed the tokenized text $\tilde{x} = \{[CLS], \tilde{x}_1, ..., \tilde{x}_n, [SEP]\}$ into an $L$-layer pre-trained transformer to obtain the contextualized representation of each sub-token.

Following Tenney et al. (2019), we pool token $i$'s representation $h_i \in \mathbb{R}^d$ across all BERT layers using scalar mixing (Peters et al., 2018): $h_i = \gamma \sum_{j=1}^{L} \alpha_j x_i^{(j)}$, where $x_i^{(j)} \in \mathbb{R}^d$ is the embedding of token $i$ from BERT layer $j$, the $\alpha_j$ are softmax-normalized weights, and $\gamma$ is a scalar parameter. We denote the final representation of the special symbol [CLS] as $C \in \mathbb{R}^d$. For each token, we use the representation of its first sub-token as the token-level representation.
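A minimal PyTorch sketch of this scalar-mix pooling follows; the module and variable names are ours, and the layer outputs are assumed to come from a transformer run with all hidden states returned.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Scalar mixing across L transformer layers (Peters et al., 2018):
    h_i = gamma * sum_j softmax(alpha)_j * x_i^(j)."""
    def __init__(self, num_layers: int):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(num_layers))  # per-layer mixing logits
        self.gamma = nn.Parameter(torch.ones(()))            # global scaling factor

    def forward(self, layer_states):
        # layer_states: list of L tensors, each of shape [batch, seq_len, d]
        weights = torch.softmax(self.alpha, dim=0)
        mixed = sum(w * h for w, h in zip(weights, layer_states))
        return self.gamma * mixed  # [batch, seq_len, d]
```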
Contextualized Information Aggregation.

After obtaining the representation of each word, we produce an aggregated feature vector $r$ that captures the counterfactual information of the entire statement. We investigate two different aggregation strategies: [CLS] aggregation and convolutional neural network (CNN) aggregation.

In [CLS] aggregation, we directly use the representation $C$ of the special symbol [CLS] as the aggregated feature $r$ (Devlin et al., 2019).

In counterfactual statements, connectives such as "if", "even if", and "would" are often used to express the relation between antecedent and consequence. To capture these local patterns, we employ a CNN (Kim, 2014) to aggregate sentence information. Given the token sequence $\{h_1, ..., h_n\}$, a convolutional filter scans the sequence and extracts the local feature $l_i = \tanh(w \cdot h_{i:i+h-1} + b)$. Finally, a max-pooling layer produces the feature $r$ for counterfactual statement detection: $r = \max_{1 \le i \le n} l_i$.
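The CNN aggregation can be sketched as follows, assuming PyTorch; the default filter count and window size here simply mirror the values reported in Section 4.

```python
import torch
import torch.nn as nn

class CNNAggregator(nn.Module):
    """1-D convolution over token representations followed by max pooling,
    producing the statement feature r = max_i tanh(w . h_{i:i+h-1} + b)."""
    def __init__(self, hidden_dim: int, num_filters: int = 300, window: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(hidden_dim, num_filters, kernel_size=window)

    def forward(self, token_reps):
        # token_reps: [batch, seq_len, hidden_dim]
        local = torch.tanh(self.conv(token_reps.transpose(1, 2)))  # [batch, filters, n-h+1]
        return local.max(dim=-1).values                            # [batch, filters]
```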
Counterfactual Statement Classifier.

After aggregation, the feature vector $r$ is fed into the counterfactual classifier, which computes the probability that the text is a counterfactual statement:

$P(y = 1 \mid x) = \sigma(w_c \cdot r + b_c)$  (1)

where $w_c$ is the weight vector, $b_c$ is the bias term, and $\sigma$ is the sigmoid function.

Given the training set $D = \{(x_i, y_i)\}$, we train all parameters using a binary cross-entropy loss:

$L = -\sum_{i \in D} \left[ y_i \log P(y = 1 \mid x_i) + (1 - y_i) \log(1 - P(y = 1 \mid x_i)) \right]$  (2)
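Equations (1) and (2) correspond to a single linear layer with a sigmoid output trained with binary cross-entropy, as in the PyTorch sketch below (module and parameter names are ours).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CounterfactualClassifier(nn.Module):
    """Binary classifier over the aggregated feature r (Equations 1-2)."""
    def __init__(self, feature_dim: int):
        super().__init__()
        self.scorer = nn.Linear(feature_dim, 1)  # holds w_c and b_c

    def forward(self, r, labels=None):
        logits = self.scorer(r).squeeze(-1)      # w_c . r + b_c
        probs = torch.sigmoid(logits)            # P(y = 1 | x)
        if labels is None:
            return probs
        # Binary cross-entropy over the training batch (Equation 2).
        loss = F.binary_cross_entropy_with_logits(logits, labels.float())
        return probs, loss
```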
We now describe how to extract antecedent and consequence via a question answering-style procedure. Given a counterfactual statement $s$, we first construct an antecedent query $q_a$ and a consequence query $q_c$ separately, and then extract the corresponding antecedent $a_a$ and consequence $a_c$ from the text by answering these two questions. The overall framework is illustrated in Figure 2.

Query Construction.

We design two kinds of queries for extraction: name queries and definition queries. For name queries, we directly use "antecedent" and "consequence" as the query. To enrich the semantic information of the questions, we also propose definition queries, which use the dictionary definition of each label. For "antecedent", the definition query is "a preceding event, condition, or cause"; for "consequent", the definition query is "a result or effect".

Question and Context Encoding.
We represent the input question $q^*$ for extraction and the counterfactual statement $s$ as a single packed sequence: $\{[CLS], q^*, [SEP], s, [SEP]\}$. First, $q^*$ and $s$ are tokenized into sub-token sequences using WordPiece tokenization, as shown in Figure 2. After tokenization, we feed the packed sequence into the pre-trained transformer and obtain the final hidden vector $h_q^i \in \mathbb{R}^d$ for the $i$-th sub-token in the query, $h_s^j \in \mathbb{R}^d$ for the $j$-th sub-token in the statement, and $C \in \mathbb{R}^d$ for the special token [CLS].
Figure 2: The Transformer for Detecting Antecedent and Consequence.
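A sketch of the query construction and packed-sequence encoding, assuming the HuggingFace transformers tokenizer; the query strings are the ones defined above, while the function name and checkpoint are illustrative.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # any backbone from Section 4

# Definition queries for the two labels.
QUERIES = {
    "antecedent": "a preceding event, condition, or cause",
    "consequent": "a result or effect",
}

def encode_pair(label: str, statement: str, max_len: int = 128):
    # Builds the packed sequence [CLS] query [SEP] statement [SEP];
    # token_type_ids distinguish the query from the counterfactual statement.
    return tokenizer(QUERIES[label], statement,
                     truncation="only_second", max_length=max_len,
                     return_tensors="pt")

inputs = encode_pair("antecedent",
                     "Her post-traumatic stress could have been avoided if a combination "
                     "of paroxetine and exposure therapy had been prescribed two months earlier.")
```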
Answer Prediction.
To extract continuous text fragments, we employ a pointer network to predict the start position and end position of the answer text. The pointer network contains a start vector $w_{start}$ and an end vector $w_{end}$, which are used to score word $i$ as the start/end of the answer. The score of word $i$ being the start of the answer is computed as the dot product between $w_{start}$ and the token's hidden state, $w_{start} \cdot h_s^i$; the score of the end is calculated in the same way. We define the score of a candidate span from position $j$ to position $k$ as $S_{j,k} = w_{start} \cdot h_s^j + w_{end} \cdot h_s^k$, where $k \ge j$.

Since some statements do not contain consequences (such statements cover 14.64% of the training set), we regard the questions corresponding to these statements as unanswerable. For these questions, we treat the [CLS] token as both the start and the end of the answer span; the score of a statement without a consequence is thus $S_{null} = w_{start} \cdot C + w_{end} \cdot C$.

For model training, we update the full model by maximizing the likelihood of the gold start token $j^*$ and end token $k^*$ (including [CLS]):

$L = -\sum_{i \in D} \left[ \log P(y_{start} = j^* \mid x_i) + \log P(y_{end} = k^* \mid x_i) \right]$

$P(y_{start} = j^* \mid x_i) = \dfrac{\exp(w_{start} \cdot h_s^{j^*})}{\exp(w_{start} \cdot C) + \sum_{j=1}^{n} \exp(w_{start} \cdot h_s^j)}$

$P(y_{end} = k^* \mid x_i) = \dfrac{\exp(w_{end} \cdot h_s^{k^*})}{\exp(w_{end} \cdot C) + \sum_{k=1}^{n} \exp(w_{end} \cdot h_s^k)}$  (3)

where the parameters of the pointer network are trained from scratch.
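A minimal sketch of such a pointer network and its training loss, assuming PyTorch; position 0 of the packed sequence is [CLS], so index 0 serves as the no-answer span. For simplicity, the softmax here runs over the whole packed sequence rather than only the statement tokens and [CLS] as in Equation (3).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpanPointer(nn.Module):
    """Start/end scoring over the packed sequence; [CLS] (index 0) is the null span."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.w_start = nn.Linear(hidden_dim, 1, bias=False)
        self.w_end = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, hidden_states, start_positions=None, end_positions=None):
        # hidden_states: [batch, seq_len, d] from the pre-trained transformer
        start_logits = self.w_start(hidden_states).squeeze(-1)  # w_start . h
        end_logits = self.w_end(hidden_states).squeeze(-1)      # w_end . h
        if start_positions is None:
            return start_logits, end_logits
        # Cross-entropy = negative log-likelihood of the gold start/end positions;
        # unanswerable questions use start = end = 0 (the [CLS] position).
        loss = (F.cross_entropy(start_logits, start_positions)
                + F.cross_entropy(end_logits, end_positions))
        return start_logits, end_logits, loss
```

At inference time, the span $(j, k)$ with $k \ge j$ that maximizes start_logits[j] + end_logits[k] would be compared against the null score start_logits[0] + end_logits[0], following the span scoring described above.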
Datasets.

Subtask 1. This subtask contains 13,000 instances for model training and 7,000 unseen instances for online evaluation. We sampled 1,500 instances from the whole dataset as our development set. Then, we split the remaining 11,500 instances into 5 folds of 2,300 instances each. We trained five models on five groups of data for ensemble voting: each group takes four folds as the training set and the remaining fold for early stopping, and the five classifiers vote on the final label, as sketched below.
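One possible way to realize this split and ensemble is sketched below with scikit-learn's KFold; the tooling and the majority-vote rule are assumptions, since the system description only states that five models vote.

```python
import numpy as np
from sklearn.model_selection import KFold

def five_fold_splits(examples, seed: int = 42):
    """Yields (training folds, early-stopping fold) index pairs over the 11,500 instances."""
    kfold = KFold(n_splits=5, shuffle=True, random_state=seed)
    for train_idx, stop_idx in kfold.split(examples):
        yield train_idx, stop_idx

def majority_vote(predictions):
    # predictions: array of shape [5, num_examples] with 0/1 labels from the five models
    return (np.asarray(predictions).sum(axis=0) >= 3).astype(int)
```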
Subtask 2. This subtask contains a total of 3,551 instances for model training and 1,950 unseen instances for online evaluation. We sampled 3,200 instances as the training set and took the remaining 351 instances as the development set.
Hyper-parameters.

Subtask 1. For each model, we selected the best fine-tuning learning rate (among 5e-6, 1e-5, and 3e-5) on the development set. Because of the GPU memory limitation, we truncated the maximum total input sequence length after WordPiece tokenization to 128. We employ three different pre-trained transformers in our submission for the SemEval 2020 Task 5 official evaluation: BERT (Devlin et al., 2019), ALBERT (Lan et al., 2020), and RoBERTa (Liu et al., 2020). We used a batch size of 8 for ALBERT-xxlarge and 24 for the other models. For CNN aggregation, we used a single CNN layer with a window size of 3 and a hidden size of 300.
Subtask 2.
We fine-tuned all models on the training data for 5 epochs, using the same learning rate for the BERT parameters and the task parameters, evaluating and saving models every 250 steps with a batch size of 16. We trained the large model on two 24G GPUs in parallel and selected the best model on the development set for online evaluation.

We report the performance on subtask 1 and subtask 2, scored by the evaluation server (https://competitions.codalab.org/competitions/21691). In subtask 1, the models are evaluated using Precision (P), Recall (R), and F1-score (F) for binary classification. There are four metrics for subtask 2: Exact Match (EM), Precision, Recall, and F1-score. Exact match measures the percentage of predictions that match the annotated antecedents and consequences exactly. Note that the F1-score in subtask 2 is a token-level metric, calculated based on the offsets of the predicted antecedent and consequence.
Model                            F        R        P
BERT-Large-Cased-WWM + [CLS]     87.70    87.50    87.90
BERT-Large-Cased-WWM + CNN       88.00    87.90    88.10
RoBERTa-Large + [CLS]            89.80    -        -
RoBERTa-Large + CNN              89.70    89.60    89.80
ALBERT-XXLarge + [CLS]           -        -        92.20
ALBERT-XXLarge + CNN             89.00    87.70    90.40
ALBERT-XXLarge + [CLS] + CNN     -        -        -

Table 1: Subtask 1 test results.

Table 1 shows the overall results of our seven runs on subtask 1. We can see that our system achieved very competitive performance. The precision of ALBERT-XXLarge with [CLS] aggregation (92.20) ranked 1st among all teams, and our best F score (90.00) ranked 3rd among all teams.
Model                                 F        R        P        EM
BERT-Base-Cased + Name                86.30    90.30    86.00    51.60
BERT-Base-Uncased + Name              86.60    90.20    86.70    51.90
BERT-Base-Cased + Definition          86.30    90.30    86.00    52.40
BERT-Base-Uncased + Definition        86.80    90.00    87.10    52.50
BERT-Large-Uncased-WWM + Name         87.30    89.80    -        -
BERT-Large-Uncased-WWM + Definition   -        -        -        -

Table 2: Subtask 2 test results. "Name" indicates name queries and "Definition" indicates definition-enriched queries.

Table 2 shows the overall results of our runs on subtask 2. Our QA-based method ranked 1st on the R score and 3rd on the F and P scores. Besides, our system achieved 2nd on the EM score and surpassed the third-place team by a large margin (4.90). From the results in Table 2, we can see that: 1) The definition-based queries achieved better performance than the name-based queries. We believe this is because the definition-based query provides richer semantic information than the name-based query. 2) Uncased models are better than cased models on both F and EM scores. This may be because our model focuses more on capturing the structural information of counterfactual expressions, whereas case information is more useful for capturing information about named entities, such as persons and locations.

Conclusion
In this paper, we propose a transformer-based system for counterfactual modeling. For counterfactual statement detection, we investigated a variety of advanced pre-trained models and two efficient aggregation methods. For antecedent and consequent extraction, we framed the problem as a span-based question answering task and designed definition-enriched queries to extract the required spans from counterfactual statements. Evaluation results demonstrate the effectiveness of our system. For future work, we plan to investigate how to inject external knowledge into counterfactual modeling systems, for example via knowledge-enriched transformers.
References
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June. Association for Computational Linguistics.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, Doha, Qatar, October. Association for Computational Linguistics.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2020. RoBERTa: A robustly optimized BERT pretraining approach.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar, October. Association for Computational Linguistics.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana, June. Association for Computational Linguistics.

Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R. Bowman, Dipanjan Das, and Ellie Pavlick. 2019. What do you learn from context? Probing for sentence structure in contextualized word representations.

Xiaoyu Yang, Stephen Obadinma, Huasha Zhao, Qiong Zhang, Stan Matwin, and Xiaodan Zhu. 2020. SemEval-2020 task 5: Counterfactual recognition.