An Empirical Study on Measuring the Similarity of Sentential Arguments with Language Model Domain Adaptation
ChaeHun Park∗
School of Computing, KAIST
[email protected]
Sangwoo Seo
Scatterlab Inc. [email protected]
Abstract
Measuring the similarity between two different sentential arguments is an important task in argument mining. However, one of the challenges in this field is that the dataset must be annotated using expertise in a variety of topics, making supervised learning with labeled data expensive. In this paper, we investigated whether this problem could be alleviated through transfer learning. We first adapted a pretrained language model to a domain of interest using self-supervised learning. Then, we fine-tuned the model to the task of measuring the similarity between sentences taken from different domains. Our approach improves the correlation with human-annotated similarity scores compared to competitive baseline models on the Argument Facet Similarity dataset in an unsupervised setting. Moreover, we achieve comparable performance to a fully supervised baseline model by using only about 60% of the labeled data samples. We believe that our work suggests the possibility of a generalized argument clustering model for various argumentative topics.
Providing diverse opinions on a controversial issue is one of the most important functions of argument mining. To this end, methods for grouping relevant arguments within a given topic by their similarities (Misra et al., 2016; Reimers et al., 2019; Chen et al., 2019) should be developed to prevent redundant outcomes (argument clustering). This step plays a crucial role in preventing users from being overwhelmed by the number of retrieved arguments and in clarifying the inconspicuous aspects.

However, obtaining a sufficiently large labeled dataset is usually time-consuming and expensive. A continual annotation and training process for an unseen topic is also required to avoid performance degradation. To address this, various domain adaptation methods (Li et al., 2019; Das et al., 2019; Wang et al., 2019; Cao et al., 2020) have recently been explored. These studies aimed to appropriately transfer the knowledge within the domain or task. In particular, several studies found that continual pretraining of a language model (e.g., BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019)) is effective for both unsupervised domain adaptation (Ma et al., 2019; Rietzler et al., 2020) and general supervised learning (Howard and Ruder, 2018; Gururangan et al., 2020).

In this study, we attempted to alleviate the low-resource problem of an argument clustering task by leveraging recent transfer learning strategies. Specifically, we fine-tuned BERT on a semantic textual similarity (STS) task to transfer the ability to measure the similarity between two sentences. Concurrently, we adapted the model to sentences from domains of interest. These two methods can drive the model to encode the proper representation, in the aspects of both domain and task.

We evaluated our approach under various conditions, including the use of the labeled target dataset and the order of training.

∗ Work done while the author was an intern at Scatterlab.
Experimental results show that our approach improved the correlation with human-annotated similarity scores against competitive baseline models in an unsupervised setting on the Argument Facet Similarity (AFS) dataset (Misra et al., 2016). The sample efficiency was also improved, in that comparable performance to a fully supervised baseline model was obtained by using only about 60% of the labeled dataset.

Our contributions are as follows: (1) We formulate the task of measuring the similarity between sentential arguments as an unsupervised domain adaptation problem. (2) We investigate various strategies to adapt the pretrained language model to the desired domain and task. (3) Our proposed approach consistently achieves higher correlation scores than strong baseline models in unsupervised, low-resource, and fully supervised settings.

Figure 1: Overview of MLM_domain → MLM_tgt → STS_src. STS_tgt is only performed in a supervised setting.
We formulated the argument clustering task as measuring the similarity between two sentential arguments. For this, we used sentence-BERT (Reimers and Gurevych, 2019) as our basic architecture. When two sentences were given, each sentence was individually transformed into a fixed-size vector by a shared single BERT. We used cosine similarity to measure the similarity score between the two sentences.

Our approach consists of two different methods (Fig. 1). The first method adapts the pretrained BERT to domains of interest through self-supervised learning (Section 2.1). The other method fine-tunes the sentence-BERT architecture on an STS task with a dataset other than our target dataset (Section 2.2).
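As a sketch of this scoring step: the per-token vectors from the shared encoder are mean-pooled into one fixed-size sentence embedding, and the pair is scored by cosine similarity. The code below is illustrative only; the random token vectors stand in for BERT's actual outputs, which are assumed here rather than computed.

```python
import math
import random
from typing import List

def mean_pool(token_vectors: List[List[float]]) -> List[float]:
    """Average the per-token vectors into a single fixed-size sentence embedding."""
    n, dim = len(token_vectors), len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]

def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Similarity score between two sentence embeddings, in [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Stand-in for the shared BERT encoder: random per-token vectors.
rng = random.Random(42)
tokens_a = [[rng.gauss(0, 1) for _ in range(768)] for _ in range(7)]
tokens_b = [[rng.gauss(0, 1) for _ in range(768)] for _ in range(5)]

score = cosine_similarity(mean_pool(tokens_a), mean_pool(tokens_b))
assert -1.0 <= score <= 1.0
```

Because both sentences pass through the same encoder, the embeddings live in one space and cosine similarity is a meaningful comparison.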
We used masked language modeling (MLM) to adapt BERT to our target data distribution. This strategy randomly masks tokens of an input sequence and trains the model to correctly predict the original tokens based on their unmasked context. This process was expected to shift the distribution of the model toward the desired domain and enable the model to extract better representations of target sentences. The adapted BERT is then used to obtain semantically meaningful sentence embeddings.

For this step, we used two unlabeled corpora with different characteristics, following Gururangan et al. (2020). The first corpus is composed of sentences from the target dataset itself, to adapt the model to the target distribution. We denote this adapted BERT by MLM_tgt. The second is a larger corpus that contains arguments on various topics other than the ones in the target dataset. This domain-level adaptation conveys more general knowledge of argumentation to the model. This model is denoted by MLM_domain.

We performed supervised learning on a sentence-pair similarity regression task using the STSb dataset (Cer et al., 2017). The underlying hypothesis here was that the ability to measure the similarity between relatively common sentences could be transferred to our narrow range of domains. This can be regarded as typical unsupervised domain adaptation training, where only the labeled dataset from the source domain (STSb) exists. This model is denoted by STS_src.

We considered different combinations of the abovementioned methods to find the best curriculum strategy. If two or more methods were used, each method was processed sequentially. For instance, if the STS_src and MLM_domain methods were chosen, two different models can be built depending on the order of training (MLM_domain → STS_src and STS_src → MLM_domain). These models were either fine-tuned for the target task (if labeled data existed) or used directly for the target task. We did not investigate combinations in which MLM_domain follows the other two methods (STS_src and MLM_tgt), since the number of available data samples is much larger for MLM_domain (2.3M) than for the others (6K and 8K, respectively).
Name                          MLM   Fine-tuning   Size   Model
AFS (Misra et al., 2016)      ✓     △             8K     tgt
Reddit (Hua and Wang, 2018)   ✓     —             2.3M   domain
STSb (Cer et al., 2017)       —     ✓             6K     src

Table 1: Dataset details. Fine-tuning on AFS (△) was performed in a supervised setting only.
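Under the constraint above (MLM_domain, when used, always comes first), the candidate curricula can be enumerated mechanically. A small sketch, with the method names written out for illustration:

```python
from itertools import permutations
from typing import List

METHODS = ("MLM_domain", "MLM_tgt", "STS_src")

def valid_curricula() -> List[str]:
    """Enumerate ordered, non-empty method sequences, skipping any in
    which MLM_domain would follow STS_src or MLM_tgt."""
    curricula = []
    for r in range(1, len(METHODS) + 1):
        for order in permutations(METHODS, r):
            if "MLM_domain" in order and order.index("MLM_domain") != 0:
                continue  # MLM_domain must not follow the smaller-data methods
            curricula.append(" -> ".join(order))
    return curricula

# Nine configurations remain, matching the "Ours" rows of Table 2.
assert len(valid_curricula()) == 9
```

Each surviving sequence is run left to right, with every stage initializing the next, before any final fine-tuning on the target task.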
We used the AFS dataset (Misra et al., 2016) as our main target dataset for the argument clustering task. This dataset contains sentential arguments on three controversial topics (gun control, death penalty, and gay marriage). The STSb dataset was used as the source domain for the STS task (Cer et al., 2017). In the AFS and STSb datasets, similarity scores are annotated on a scale from 0 to 5. For domain-level MLM, we used a dataset crawled from the Reddit r/ChangeMyView subcommunity (Hua and Wang, 2018). In this community, users post replies to change the viewpoints of other users on various controversial topics. The details of each dataset are described in Table 1.

We used the Adam optimizer (Kingma and Ba, 2015) with the initial learning rate set to 2e-5 and applied gradient clipping with a maximum norm of 1 (Pascanu et al., 2013). We trained MLM on AFS for 10 epochs and on Reddit for 5 epochs. We fine-tuned the STS task for 5 epochs on both the STSb and AFS datasets. In MLM, we randomly masked 15% of the tokens in a sentence. We used dropout with a rate of 0.1 (Srivastava et al., 2014). We set the random seed to 42 for every experiment.

We compared our approach with the following baseline models: BERT (Devlin et al., 2019), GloVe (Pennington et al., 2014), InferSent (Conneau et al., 2017), and Universal Sentence Encoder (Cer et al., 2018). The similarity score between two sentence embeddings was measured by cosine similarity. As previously mentioned, the original BERT and all of our methods are used as the encoder of sentence-BERT to obtain a sentence embedding of each sentential argument.
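The 15% masking step can be sketched as follows. This is a deliberate simplification: full BERT-style MLM also leaves some selected tokens unchanged or swaps them for random tokens (the 80/10/10 scheme), which is omitted here, and the helper name is illustrative.

```python
import random
from typing import List, Tuple

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens: List[str], rate: float = 0.15,
                seed: int = 42) -> Tuple[List[str], List[int]]:
    """Replace `rate` of the tokens with [MASK]; return the corrupted
    sequence and the positions the model must reconstruct."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    corrupted = [MASK_TOKEN if i in positions else t
                 for i, t in enumerate(tokens)]
    return corrupted, positions

sentence = ("the problem with capital punishment however "
            "is that it is permanent").split()
corrupted, positions = mask_tokens(sentence)
assert all(corrupted[i] == MASK_TOKEN for i in positions)
```

The model is then trained to predict the original token at each masked position, which pulls its distribution toward the target domain.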
We evaluated the Pearson correlation (r) and Spearman's rank correlation coefficient (ρ) for each method, following previous works (Misra et al., 2016). The pretrained BERT (bert-base-uncased) from Huggingface (Wolf et al., 2019) was used for our experiments.
Model                               r       ρ

Unsupervised - Baseline
GloVe                              .1443   .1632
InferSent-GloVe                    .2741   .2699
InferSent-FastText                 .2741   .2699
BERT                               .3464   .3413
Universal Sentence Encoder         .4445   .4358

Unsupervised - Ours
MLM_tgt                            .3947   .4071
STS_src                            .4002   .3881
STS_src → MLM_tgt                  .4195   .4203
MLM_domain                         .4654   .4564
MLM_tgt → STS_src                  .4662   .4454
MLM_domain → MLM_tgt               .4707   .4648
MLM_domain → STS_src               .4767   .4699
MLM_domain → STS_src → MLM_tgt     .4779   .4685
MLM_domain → MLM_tgt → STS_src     .5209   .5085

Table 2: Evaluation results in an unsupervised setting. The highest score is highlighted in bold.
Table 2 presents the evaluation results of each model in an unsupervised setting. Among the baseline models, Universal Sentence Encoder showed the best performance. From the results of our methods, we observed that all of our proposed single models achieved better performance in both metrics than the original BERT model. A combination of any method followed by others performed better than single methods. In particular, our best model (MLM_domain → MLM_tgt → STS_src) improved the Pearson correlation by 50.37% and the Spearman's rank correlation by 48.98% compared with BERT. These results indicate that our proposed method can effectively measure the similarity of sentential arguments in the unsupervised setting. We also found that even if the same methods were used, performance differed significantly depending on the order of training (for instance, MLM_tgt → STS_src and STS_src → MLM_tgt). We speculate that this is because fine-tuning the model with a proper downstream task is required in the final process of training, which should be further investigated in future work.

Figure 2: Spearman's rank correlation (ρ) for each model as a function of the ratio of data samples used in fine-tuning. The dotted red line indicates BERT in a supervised setting.
To verify the sample efficiency of the proposed methods, we further fine-tuned each model on the AFS dataset, increasing the ratio of labeled data samples in increments of 10%. The results are depicted in Fig. 2. Our models reached the performance of the fully supervised BERT using only about 60% of the labeled data. In the fully supervised case, our best model improved both metrics by 3-4% over BERT (Table 3).
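The subsampling procedure can be sketched as below; the helper name and the fixed shuffling seed are our own illustrative choices, not taken from the experimental code.

```python
import random
from typing import Iterator, List, Tuple

def training_subsets(pairs: List, seed: int = 42) -> Iterator[Tuple[float, List]]:
    """Shuffle the labeled pairs once, then yield the first 10%, 20%, ...,
    100% of them as progressively larger fine-tuning sets."""
    rng = random.Random(seed)
    shuffled = pairs[:]
    rng.shuffle(shuffled)
    for k in range(1, 11):
        ratio = k / 10
        yield ratio, shuffled[: int(len(shuffled) * ratio)]
```

Shuffling once and taking growing prefixes keeps the subsets nested, so each larger budget strictly extends the previous one and the curves in Fig. 2 are comparable across ratios.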
One natural question is whether the performance improvement of our approach was due to the increase in the number of training samples, regardless of the training details. To verify this, we used the MNLI dataset (Williams et al., 2018) to train BERT with either MLM (MLM_MNLI) or a supervised NLI classification task (NLI_MNLI). The training epochs for MLM and NLI fine-tuning were set to 5 and 3, respectively. The results are presented in Table 4. As can be observed, supervised training on the MNLI dataset slightly dropped the performance of BERT, regardless of whether the labeled AFS dataset was used. Masked language modeling improved the performance compared to the original BERT, although it was not superior to any of our methods.

Model                               r       ρ

Supervised
BERT                               .7520   .7249
MLM_tgt                            .7637   .7407
STS_src                            .7655   .7455
MLM_tgt → STS_src                  .7756   .7549
MLM_domain → STS_src               .7776   .7591
MLM_domain                         .7786   .7581
MLM_domain → STS_src → MLM_tgt     .7789   .7579
MLM_domain → MLM_tgt               .7801   .7570

Table 3: Evaluation results in a supervised setting. The highest score is highlighted in bold.

Model        r                ρ

Unsupervised
NLI_MNLI    .3325 (-.0139)   .3030 (-.0383)
MLM_MNLI    .3772 (+.0308)   .3804 (+.0391)

Supervised
NLI_MNLI    .7367 (-.0153)   .7024 (-.0225)
MLM_MNLI    .7593 (+.0073)   .7375 (+.0126)

Table 4: Evaluation results for the MNLI dataset. NLI_MNLI and MLM_MNLI denote the models trained by the original NLI task and MLM, respectively. The numbers in parentheses represent differences from the original BERT.
We investigated a way of leveraging transfer learning to address the low-resource problem of the sentential argument clustering task. To this end, we used two simple methods to adapt the pretrained language model to the target data distribution and the task itself. Experimental results showed a reasonable performance gain in the unsupervised setting, as well as an improvement in sample efficiency in the supervised setting. These empirical results imply that our approach could be used to train a more efficient and accurate model for argument clustering.

As future work, we intend to extend our approach to a general clustering setup, not limited to sentence-pair similarity. We also plan to investigate whether such knowledge could be transferred to other tasks in argument mining, for instance, stance classification (Bar-Haim et al., 2017) and evidence detection (Thorne et al., 2019).

References
Roy Bar-Haim, Indrajit Bhattacharya, Francesco Dinuzzo, Amrita Saha, and Noam Slonim. 2017. Stance classification of context-dependent claims. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 251-261, Valencia, Spain. Association for Computational Linguistics.

Yu Cao, Meng Fang, Baosheng Yu, and Joey Tianyi Zhou. 2020. Unsupervised domain adaptation on reading comprehension. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, New York, NY, USA, February 7-12, 2020, pages 7480-7487. AAAI Press.

Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1-14, Vancouver, Canada. Association for Computational Linguistics.

Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Universal sentence encoder. CoRR, abs/1803.11175.

Sihao Chen, Daniel Khashabi, Wenpeng Yin, Chris Callison-Burch, and Dan Roth. 2019. Seeing things from a different angle: Discovering diverse perspectives about claims. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 542-557, Minneapolis, Minnesota. Association for Computational Linguistics.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670-680, Copenhagen, Denmark. Association for Computational Linguistics.

Manirupa Das, Zhen Wang, Evan Jaffe, Madhuja Chattopadhyay, Eric Fosler-Lussier, and Rajiv Ramnath. 2019. Learning to answer subjective, specific product-related queries using customer reviews by adversarial domain adaptation. CoRR, abs/1910.08270.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don't stop pretraining: Adapt language models to domains and tasks. In Proceedings of ACL 2020.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328-339, Melbourne, Australia. Association for Computational Linguistics.

Xinyu Hua and Lu Wang. 2018. Neural argument generation augmented with externally retrieved evidence. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 219-230, Melbourne, Australia. Association for Computational Linguistics.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization.

Zheng Li, Xin Li, Ying Wei, Lidong Bing, Yu Zhang, and Qiang Yang. 2019. Transferable end-to-end aspect-based sentiment analysis with selective adversarial learning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4590-4600, Hong Kong, China. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Xiaofei Ma, Peng Xu, Zhiguo Wang, Ramesh Nallapati, and Bing Xiang. 2019. Domain adaptation with BERT-based domain classification and data selection. In Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019), pages 76-83, Hong Kong, China. Association for Computational Linguistics.

Amita Misra, Brian Ecker, and Marilyn Walker. 2016. Measuring the similarity of sentential arguments in dialogue. In Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 276-287, Los Angeles. Association for Computational Linguistics.

Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310-1318.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543, Doha, Qatar. Association for Computational Linguistics.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982-3992, Hong Kong, China. Association for Computational Linguistics.

Nils Reimers, Benjamin Schiller, Tilman Beck, Johannes Daxenberger, Christian Stab, and Iryna Gurevych. 2019. Classification and clustering of arguments with contextualized word embeddings. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 567-578, Florence, Italy. Association for Computational Linguistics.

Alexander Rietzler, Sebastian Stabinger, Paul Opitz, and Stefan Engl. 2020. Adapt or get left behind: Domain adaptation through BERT language model finetuning for aspect-target sentiment classification. In Proceedings of The 12th Language Resources and Evaluation Conference, pages 4933-4941, Marseille, France. European Language Resources Association.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929-1958.

James Thorne, Andreas Vlachos, Oana Cocarascu, Christos Christodoulopoulos, and Arpit Mittal. 2019. The FEVER2.0 shared task. In Proceedings of the Second Workshop on Fact Extraction and VERification (FEVER), pages 1-6.

Huazheng Wang, Zhe Gan, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, and Hongning Wang. 2019. Adversarial domain adaptation for machine reading comprehension. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2510-2520, Hong Kong, China. Association for Computational Linguistics.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112-1122. Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing.