"Laughing at you or with you": The Role of Sarcasm in Shaping the Disagreement Space
Debanjan Ghosh∗, Ritvik Shrivastava∗, and Smaranda Muresan
Educational Testing Service; MindMeld, Cisco Systems; Data Science Institute, Columbia University
[email protected], [email protected], {rs3868, smara}@columbia.edu

Abstract
Detecting arguments in online interactions is useful to understand how conflicts arise and get resolved. Users often use figurative language, such as sarcasm, either as a persuasive device or to attack the opponent by an ad hominem argument. To further our understanding of the role of sarcasm in shaping the disagreement space, we present a thorough experimental setup using a corpus annotated with both argumentative moves (agree/disagree) and sarcasm. We exploit joint modeling in terms of (a) applying discrete features that are useful in detecting sarcasm to the task of argumentative relation classification (agree/disagree/none), and (b) multitask learning for argumentative relation classification and sarcasm detection using deep learning architectures (e.g., dual Long Short-Term Memory (LSTM) with hierarchical attention and Transformer-based architectures). We demonstrate that modeling sarcasm improves the argumentative relation classification task (agree/disagree/none) in all setups.
User-generated conversational data such as discussion forums provide a wealth of naturally occurring arguments. The ability to automatically detect and classify argumentative relations (e.g., agree/disagree) in threaded discussions is useful for understanding how collective opinions form and how conflict arises and is resolved (van Eemeren et al., 1993; Abbott et al., 2011; Walker et al., 2012b; Misra and Walker, 2013; Ghosh et al., 2014; Rosenthal and McKeown, 2015; Stede and Schneider, 2018). Linguistic and argumentation theories have thoroughly studied the use of sarcasm in argumentation, including its effectiveness as a persuasive device or as a means to express an ad hominem

(∗ Equal Contribution.)
Prior Turn: Today, no informed creationist would deny natural selection.
Current Turn: Seeing how this was proposed over a century and a half ago by Darwin, what took the creationists so long to catch up?
Arg. Rel.: Agree

Prior Turn: Personally I wouldn't own a gun for self defense because I am just not that big of a sissy.
Current Turn: Because taking responsibility for ones own safety is certainly a sissy thing to do?
Arg. Rel.: Disagree

Prior Turn: I'm not surprised that no one on your side of the debate would correct you, but wolves and dogs are both members of the same species. The Canid species.
Current Turn: Wow, you're even wrong when you get away from your precious Bible and try to sound scientific.
Arg. Rel.: Disagree

Prior Turn: The hand of God kept me from serious harm. Maybe He has a plan for me.
Current Turn: You better hurry up. Aren't you like 113 years old.
Arg. Rel.: None
Table 1: Sarcastic turns that disagree, agree, or have no argumentative relation with their prior turns.

fallacy (attacking the opponent instead of her/his argument) (Tindale and Gough, 1987; van Eemeren and Grootendorst, 1992; Gibbs and Izett, 2005; Averbeck, 2013). We propose an experimental setup to further our understanding of the role of sarcasm in shaping the disagreement space in online interactions. The disagreement space, defined in the context of the dialogical perspective on argumentation, is seen as the speech acts initiating the difference of opinion that argumentation is intended to resolve (Jackson, 1992; van Eemeren et al., 1993). Our study is based on the Internet Argument Corpus (IAC) introduced by Abbott et al. (2011), which contains online discussions annotated for the presence/absence and the type of an argumentative move (agree/disagree/none) as well as the presence/absence of sarcasm. Consider the dialogue turns from IAC in Table 1, where the current turn (henceforth, ct) is a sarcastic response to the prior turn (henceforth, pt). These dialogue moves can be argumentative (agree/disagree) or not argumentative (none). The argumentative move can express agreement (first example) or disagreement (the second example is an undercutter, while the third example is an ad hominem attack). The fourth example, although sarcastic, is not argumentative. Note that none of the current turns contain explicit lexical terms that could signal an argumentative relation with the prior turn. Instead, the argumentative move is expressed implicitly through sarcasm.

We study whether modeling sarcasm can improve the detection and classification of argumentative relations in online discussions. We propose a thorough experimental setup to answer this question using feature-based machine learning approaches and deep learning models.
For the former, we show that combining features that are useful for detecting sarcasm (Joshi et al., 2015; Muresan et al., 2016; Ghosh and Muresan, 2018) with state-of-the-art argument features leads to better performance on the argumentative relation classification task (agree/disagree/none) (Section 5). For the deep learning approaches, we hypothesize that multitask learning, which allows representations to be shared between multiple tasks (here, the tasks of argumentative relation classification and sarcasm detection), leads to better generalization. We investigate the impact of multitask learning for a dual Long Short-Term Memory (LSTM) network with hierarchical attention (Ghosh et al., 2017) (Section 4.2) and BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2019), including an optional joint multitask learning objective with uncertainty-based weighting of task-specific losses (Kendall et al., 2018) (Section 4.3). We demonstrate that multitask learning improves the performance of the argumentative relation classification task in all settings (Section 5). We provide a detailed qualitative analysis (Section 5.1) to give insights into when and how modeling sarcasm helps. We make the code from our experiments publicly available: https://github.com/ritvikshrivastava/multitask_transformers. The Internet Argument Corpus (IAC) (Walker et al., 2012b) is publicly available at https://nlds.soe.ucsc.edu/iac2.

Argument mining is a growing area of research in computational linguistics, focusing on the detection of argumentative structures in text (see Stede and Schneider (2018) for an overview). This paper focuses on two subtasks: argumentative relation identification and classification (i.e., agree/disagree/none). Some of the earlier work on argumentative relation identification and classification relied on feature-based machine learning models, focusing on online discussions (Abbott et al., 2011; Walker et al., 2012b; Misra and Walker, 2013; Ghosh et al., 2014; Wacholder et al., 2014) and monologues (Stab and Gurevych, 2014, 2017; Persing and Ng, 2016; Ghosh et al., 2016). Stab and Gurevych (2014) proposed a set of lexical, syntactic, semantic, and discourse features to classify argument components. On the same essay dataset, Nguyen and Litman (2016) utilized contextual information to improve accuracy. Both Stab and Gurevych (2017) and Persing and Ng (2016) used Integer Linear Programming (ILP) based joint modeling to detect argument components and relations. Rosenthal and McKeown (2015) introduced sentence similarity and accommodation features, whereas Menini and Tonelli (2016) showed how entailment between text pairs can discover argumentative relations. Our argumentative features in the feature-based model are based on the above works (Section 4.1). We show that additional features that are useful in sarcasm detection (Joshi et al., 2015; Ghosh and Muresan, 2018) enhance performance on the argumentative relation identification and classification tasks.

In addition to feature-based models, deep learning models have recently been used for these tasks. Potash et al.
(2017) proposed a pointer network, and Hou and Jochim (2017) offered an LSTM+Attention network to predict argument components and relations jointly, whereas Chakrabarty et al. (2019) exploited adaptive pre-training (Gururangan et al., 2020) for BERT to identify argument relations. We use two multitask learning objectives (argumentative relation identification/classification and sarcasm detection), as our goal is to investigate whether identifying sarcasm can help in modeling the disagreement space. Majumder et al. (2019) and Chauhan et al. (2020) used multitask learning for sarcasm & sentiment, and for sarcasm, sentiment, & emotion, respectively, where a direct link between the corresponding tasks is evident.

Finally, analyzing the role of sarcasm and verbal irony in argumentation has a long history in linguistics (Tindale and Gough, 1987; Gibbs and Izett, 2005; Averbeck, 2013; van Eemeren and Grootendorst, 1992). We propose joint modeling of argumentative relation detection and sarcasm detection to empirically validate sarcasm's role in shaping the disagreement space in online conversations. While the focus of our paper is not to provide a state-of-the-art sarcasm detection model, our feature-based models, along with the deep learning models for sarcasm detection, are based on state-of-the-art approaches. We implemented discrete features such as pragmatic features (González-Ibáñez et al., 2011; Muresan et al., 2016), diverse sarcasm markers (Ghosh and Muresan, 2018), and incongruity detection features (Riloff et al., 2013; Joshi et al., 2015). The LSTM models are influenced by Ghosh and Veale (2017) and Ghosh et al. (2018), where contextual knowledge is used to detect sarcasm. Lastly, transformer models such as BERT and RoBERTa have been used in the winning entries of a recent shared task on sarcasm detection (Ghosh et al., 2020).
In our research, for both kinds of deep learning models, the best results are obtained using the multitask setup, showing that multitask learning indeed helps improve both tasks.
Our training and test data are collected from the Internet Argument Corpus (IAC) (Walker et al., 2012a). This corpus consists of posts from conversations in online forums on a range of controversial political and social topics such as Evolution, Abortion, Gun Control, and Gay Marriage (Abbott et al., 2011, 2016). Multiple versions of the IAC corpora are publicly available, and we use a particular subset, marked as IAC_orig, collected from Abbott et al. (2011). This subset consists of around 10K pairs of conversation turns (i.e., a prior turn pt and the current turn ct) that were annotated using Mechanical Turk for argumentative relations (agree/disagree/none) and other characteristics such as sarcasm/non-sarcasm, respect/insult, and nice/nastiness. The median Cohen's κ is 0.5 across all topics.

For agree/disagree/none relations, the annotation was a scalar judgment on an 11-point scale [-5, 5], where "-5" indicates a high disagreement move, "0" indicates a none relation, and "5" denotes a high agreement move. We converted the scalar values to three categories: disagree (D) for values in [-5, -2], none (N) for values in [-1, 1], and agree (A) for values in [2, 5], where the scalar partitions follow prior work with IAC (Misra and Walker, 2013; Rosenthal and McKeown, 2015). Each "current turn" that is part of a <pt, ct> pair is also labeled with a Sarcasm (S) or Non-Sarcasm (NS) label.

Table 2: Dataset statistics; A (Agree), D (Disagree), N (None); S (Sarcasm), NS (Non-Sarcasm). [Only the Agree row is recoverable: A with S, 315 (33%); A with NS, 638 (67%).]

Table 2 shows the data statistics in terms of argumentative relations (A/D/N) and sarcasm (S/NS). We split the dataset into training (80%; 7,982 turn pairs), test (10%; 999 turn pairs), and dev (10%; 999 turn pairs) sets, where each set contains a proportional number of instances (e.g., 80% of the 315 sarcastic turns (S) with argument relation label A (agree), i.e., 252 turns, appear in the training set). The dev set is used for parameter tuning.

We present the computational approaches used to investigate whether modeling sarcasm can help detect argumentative relations. As our goal is to provide a comprehensive empirical investigation of sarcasm's role in argument mining rather than to propose new models, we explore three separate machine learning approaches well established for studying argumentation and figurative language. First, we implement a Logistic Regression method that exploits a combination of state-of-the-art features to detect argumentative relations as well as sarcasm (Section 4.1). Second, we present a dual LSTM architecture with hierarchical attention and its multitask learning setup (Section 4.2). Third, we discuss experiments using pre-trained BERT models and our multitask learning architectures based on them (Section 4.3).

4.1 Logistic Regression with Discrete Features
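The scalar-to-category conversion described above can be sketched as follows (the bin boundaries are those reported in the paper; the function name is ours):

```python
def argument_label(score: int) -> str:
    """Map an IAC scalar agreement judgment in [-5, 5] to a
    categorical argumentative relation label:
    D (disagree) for [-5, -2], N (none) for [-1, 1], A (agree) for [2, 5].
    """
    if score < -5 or score > 5:
        raise ValueError("IAC judgments lie in [-5, 5]")
    if score <= -2:
        return "D"
    if score <= 1:
        return "N"
    return "A"
```

For example, a judgment of -3 maps to D, 0 maps to N, and 4 maps to A.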
We use a Logistic Regression (LR) model that uses both argument-relevant (ArgF) and sarcasm-relevant (SarcF) features. Unless mentioned otherwise, all features were extracted from the current turn ct.

Argument-relevant features (ArgF). We first evaluate the features that are reported as being useful for identifying and classifying argumentative relations: (a) n-grams (unigrams, bigrams, trigrams) created from the full vocabulary of the IAC corpus; (b) argument lexicons: two lists of twenty words each, representing agreement (e.g., "agree", "accord") and disagreement (e.g., "differ", "oppose"), respectively (Rosenthal and McKeown, 2015); (c) sentiment lexicons such as MPQA (Wilson et al., 2005) and the opinion lexicon (Hu and Liu, 2004) to identify sentiment in the turns; (d) hedge features, since hedges are often used to mitigate the speaker's commitment (Tan et al., 2016); (e) PDTB discourse markers, because claims often start with discourse markers such as "therefore" and "so" (we discard markers from the temporal relation); (f) modal verbs, because they signal the degree of certainty when expressing a claim (Stab and Gurevych, 2014); (g) pronouns, since they dialogically point to the previous speaker's stance; (h) textual entailment, which captures whether a position expressed in the prior turn is accepted in the current turn (Cabrio and Villata, 2012; Menini and Tonelli, 2016) (we used the AllenNLP textual entailment toolkit (Gardner et al., 2017)); (i) lemma overlap, to determine topical alignment between the prior and current turn (Somasundaran and Wiebe, 2010); we compute the lemma overlap of nouns, verbs, and adjectives between the turns; and (j) negation, to extract explicit negation cues (e.g., "not", "don't") that often signal disagreement.

Sarcasm-relevant features (SarcF). As sarcasm-relevant features we use: (a) Linguistic Inquiry and Word Count (LIWC) (Pennebaker et al., 2001) features, to capture linguistic, social, individual, and psychological processes; (b) sentiment incongruity, that is, capturing the number of sentiment-polarity flips between the prior turn pt and the current turn ct, and the number of positive and negative sentiment words in the turns (Joshi et al., 2015); and (c) sarcasm markers used by Ghosh and Muresan (2018), such as capitalization, quotation marks, punctuation, exclamations that emphasize a sense of surprise, tag questions, interjections (because they seem to undermine a literal evaluation), hyperbole (because users frequently overstate the magnitude of an event in sarcasm), and emoticons & emojis, since they often emphasize the sarcastic intent.

We use SKLL, an open-source Python package that wraps around the Scikit-learn tool (Pedregosa et al., 2011), and perform the feature-based experiments using the Logistic Regression model from Scikit-learn. In the experimental runs, LR_ArgF (the model that uses just the ArgF features) denotes the individual model, and LR_ArgF+SarcF (the model that uses both ArgF and SarcF features) is the joint model.
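To make two of the sarcasm-relevant signals concrete, here is a minimal, dependency-free sketch of sentiment incongruity and explicit negation cues. The tiny lexicons are toy stand-ins for illustration only, not the MPQA or opinion lexicons used in the experiments, and the polarity-flip rule is a crude proxy for the feature of Joshi et al. (2015):

```python
# Toy stand-in lexicons (the paper uses MPQA and the Hu & Liu opinion lexicon).
POS_WORDS = {"great", "wonderful", "agree", "fine", "good"}
NEG_WORDS = {"wrong", "nonsense", "biased", "bad", "ignorant"}
NEGATION_CUES = {"not", "don't", "never", "no"}

def sentiment_counts(tokens):
    """Count positive and negative sentiment words in a turn."""
    pos = sum(1 for t in tokens if t in POS_WORDS)
    neg = sum(1 for t in tokens if t in NEG_WORDS)
    return pos, neg

def sentiment_incongruity(prior_tokens, current_tokens):
    """Return 1 if the dominant sentiment polarity flips between the
    prior and the current turn, else 0."""
    def polarity(tokens):
        pos, neg = sentiment_counts(tokens)
        return (pos > neg) - (neg > pos)  # +1, -1, or 0
    p, c = polarity(prior_tokens), polarity(current_tokens)
    return int(p != 0 and c != 0 and p != c)

def negation_features(tokens):
    """Count explicit negation cues, which often signal disagreement."""
    return sum(1 for t in tokens if t in NEGATION_CUES)
```

For instance, a positive prior turn ("what a great plan") followed by a negative current turn ("it is wrong and biased") yields an incongruity value of 1.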
LSTMs are able to learn long-term dependencies (Hochreiter and Schmidhuber, 1997) and have been shown to be effective in Natural Language Inference (NLI) research, where the task is to establish the relationship between multiple inputs (Rocktäschel et al., 2015). This type of architecture is often denoted as a dual architecture, since one LSTM models the premise and the other models the hypothesis (in Recognizing Textual Entailment (RTE) tasks). Ghosh et al. (2018) used the dual LSTM architecture with hierarchical attention (HAN) (Yang et al., 2016) for sarcasm detection to model the conversation context, and we use their approach in this paper to model the current turn ct and the prior turn pt. HAN applies attention both at the word level and at the sentence level. The distinctive characteristic of this attention is that the word/sentence representations are weighted by measuring their similarity to a word/sentence-level context vector, respectively; these context vectors are randomly initialized and jointly learned during training (Yang et al., 2016). We compute vector representations for the current turn ct and the prior turn pt and concatenate the vectors from the two LSTMs for the final softmax decision (i.e., A, D, or N for argumentative relation detection). Henceforth, this dual LSTM architecture is denoted as LSTM_attn.

To measure the impact of sarcasm on argumentative relation detection, we use a multitask learning approach. Multitask learning aims to leverage useful information in multiple related tasks to improve each task's performance (Caruana, 1997; Liu et al., 2019). We use a simple hard parameter sharing network. The architecture is a replica of LSTM_attn, with the modification of employing two loss functions, one for sarcasm detection (i.e., training using the S and NS labels) and another for the argumentative relation classification task (i.e., training using the A, D, and N labels).

(SKLL toolkit: https://pypi.org/project/skll/)

Figure 1: Sentence-level Multitask Attention Network for prior turn pt and current turn ct. Figure is inspired by Yang et al. (2016).

Figure 1 shows the high-level architecture of the dual LSTM and multitask learning (LSTM_MT). The prior turn pt (left) and the current turn ct (right) are read by two separate LSTMs (LSTM_pt and LSTM_ct). In the case of LSTM_MT, the concatenation of v_pt and v_ct is passed through a dense+Softmax layer for the MTL, as shown in Figure 1. Similar to the LR models, LSTM_attn now represents the individual model (i.e., it predicts only the argumentative relation), whereas LSTM_MT represents the joint model.

Dynamic Multitask Loss.
In addition to simply adding the two losses, we also employed dynamic weighting of the task-specific losses during training, based on the homoscedastic uncertainty of the tasks, as proposed in Kendall et al. (2018):

L = Σ_t L_t / (2σ_t²) + log σ_t    (1)

where L_t and σ_t denote the task-specific loss and its variance, respectively, over the training instances. We denote this variation as LSTM_MT_uncert.

BERT (Devlin et al., 2019), a bidirectional transformer model, has achieved state-of-the-art performance for many NLP tasks. BERT is initially trained on masked-token prediction and next-sentence prediction tasks over large corpora (English Wikipedia and the Book Corpus). During training, a special token "[CLS]" is added to the beginning of each training instance, and "[SEP]" tokens are added to mark the end of an utterance and to separate utterances when there are two (e.g., pt and ct). During evaluation, the learned representation for the "[CLS]" token is processed by an additional layer with a nonlinear activation. In its standard form, pre-trained BERT ("bert-base-uncased") can be used for transfer learning by fine-tuning on a downstream task, i.e., argument relation detection, where training instances are labeled as A, D, and N. We denote as BERT_orig the BERT baseline model that is fine-tuned over the training partition of only the argumentative relation data (i.e., individual task training). Unless mentioned otherwise, we use the BERT predictions available via the "[CLS]" token. To this end, we propose a couple of variations in the multitask learning settings, briefly described in the following sections.

Figure 2: Alternating mini-batch training based on the task type (BERT_ALT).
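A minimal sketch of the uncertainty-weighted multitask loss of Equation (1), assuming the Kendall et al. (2018) form L_t/(2σ_t²) + log σ_t (in the actual models, the σ_t are learned jointly with the network parameters rather than supplied as fixed values):

```python
import math

def uncertainty_weighted_loss(task_losses, sigmas):
    """Combine task-specific losses L_t with per-task uncertainty
    weights: L = sum_t L_t / (2 * sigma_t**2) + log(sigma_t).
    A larger sigma_t downweights that task's loss, while the
    log(sigma_t) term penalizes unbounded growth of sigma_t."""
    return sum(l / (2.0 * s ** 2) + math.log(s)
               for l, s in zip(task_losses, sigmas))
```

With unit variances the weighted loss reduces to half the plain sum of the task losses, e.g. `uncertainty_weighted_loss([1.0, 2.0], [1.0, 1.0])` gives 1.5.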
Multitask Learning with BERT. The first model we use for multitask learning is denoted as BERT_MT (i.e., BERT Multitask Learning). Here, we pass the BERT output embeddings to two classification heads, one for each task (i.e., detection of argumentative relations and of sarcasm), and the relevant gold labels are passed to them. Each classification head is a linear layer (of size 3 and 2 for the argumentative relation and sarcasm tasks, respectively).

Dynamic Loss: Similar to the LSTM architecture, here too we experiment with the dynamic multitask loss. We denote this variation as BERT_MT_uncert.

Alternate Multitask Learning.
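The hard parameter sharing idea behind the two classification heads can be sketched without any deep learning library: one shared encoder output feeds a 3-way head (A/D/N) and a 2-way head (S/NS). The dimensionality and the weights below are arbitrary toy values, not the model's actual parameters:

```python
def linear_head(vec, weight_rows, bias):
    """One linear classification head: logits[i] = W[i] . vec + b[i]."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weight_rows, bias)]

# Shared encoder output (a stand-in for the BERT "[CLS]" embedding).
shared = [0.5, -1.0, 2.0]

# Task-specific heads: 3 logits for argumentative relation (A/D/N),
# 2 logits for sarcasm (S/NS). Toy weights for illustration.
arg_W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
arg_b = [0.0, 0.0, 0.0]
sarc_W = [[1.0, 1.0, 1.0], [-1.0, -1.0, -1.0]]
sarc_b = [0.1, -0.1]

arg_logits = linear_head(shared, arg_W, arg_b)     # 3 values
sarc_logits = linear_head(shared, sarc_W, sarc_b)  # 2 values
```

Both heads read the same shared vector; during training, each task's loss updates the shared encoder, which is what lets one task inform the other.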
We employ another multitask learning technique, where we attempt to enrich the learning by fine-tuning on additional labeled material from the sarcasm detection task. Notably, we exploit "sarcasm V2", a sarcasm detection dataset that was also curated from the original IAC corpus and was released by Oraby et al. (2016). We pre-process the "sarcasm V2" dataset by removing duplicates that appear in IAC_orig, ending up with 3,513 training instances and 423 dev instances balanced between the S/NS categories, which we merge into the sarcasm training and dev sets from IAC_orig, respectively. Note that, unlike in the original multitask setting, we now have more sarcastic instances (a total of 11,495) than instances labeled with argumentative relations (7,982 instances, as before) for training, while the test set from IAC_orig is kept unchanged.

Since the training data are now unequal between the two tasks of argumentative relation and sarcasm detection, we create mini-batches so that each batch consists of instances with only one task's labels (i.e., either argumentative labels or sarcasm labels). The batches from the two tasks are interleaved uniformly: in each iteration, the BERT output is passed to only one of the two tasks' classification heads, and the related loss is used to update the parameters. This way, the model trains on both tasks but alternates between them per mini-batch iteration, while the extra batches of sarcasm data from the "sarcasm V2" dataset are handled together at the end. This model is denoted as BERT_ALT (see Figure 2).

For brevity, the parameter-tuning descriptions for all models (Logistic Regression, dual LSTM, BERT) are in the supplemental material.
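The alternating mini-batch schedule can be sketched as a generator that interleaves batches from the two tasks and yields the surplus batches of the larger task at the end. This is a simplified illustration of the scheme, not the authors' exact implementation:

```python
def alternate_batches(arg_batches, sarc_batches):
    """Yield (task_name, batch) pairs, alternating between the two
    tasks per iteration; extra batches from the larger task (here,
    sarcasm, after merging "sarcasm V2") are yielded at the end."""
    n = min(len(arg_batches), len(sarc_batches))
    for i in range(n):
        yield ("argument", arg_batches[i])
        yield ("sarcasm", sarc_batches[i])
    for batch in arg_batches[n:]:
        yield ("argument", batch)
    for batch in sarc_batches[n:]:
        yield ("sarcasm", batch)
```

In the training loop, the task name selects which classification head receives the batch, so only that task's loss updates the parameters in that iteration.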
Table 3 presents the classification results on the test set. We report F1 scores for each class (A, D, and N) and the overall micro-averaged F1 score (F1_micro), used to account for the multi-class setting and class imbalance.

Table 3: Results for argumentative relation detection (F1_micro and F1 scores per category) on the test set of IAC_orig. Rows: LR_ArgF, LR_ArgF+SarcF; LSTM_Attn, LSTM_MT, LSTM_MT_uncert; BERT_orig, BERT_MT, BERT_MT_uncert, BERT_ALT. [Numeric scores did not survive extraction.] α∗ depicts significance (measured via McNemar's test) against the corresponding individual model (LR_ArgF, LSTM_Attn, and BERT_orig, respectively). Highest scores per group of models are in bold.
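For reference, micro-averaged F1 pools true positives, false positives, and false negatives across all classes before computing precision, recall, and F1 (for single-label multi-class prediction this equals accuracy). A minimal sketch:

```python
def micro_f1(gold, pred):
    """Micro-F1 over multi-class labels: pool TP/FP/FN across classes."""
    classes = set(gold) | set(pred)
    tp = fp = fn = 0
    for c in classes:
        tp += sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp += sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn += sum(1 for g, p in zip(gold, pred) if g == c and p != c)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```

For example, with gold labels ["A", "D", "N", "D"] and predictions ["A", "N", "N", "D"], three of four instances are correct and micro-F1 is 0.75.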
Attn , BERT orig ,respectively). Highest scores per group of models arein bold . The LR model using both the
SarcF and
ArgF features performs better than the model thatuses
ArgF features alone, improving the overallperformance by an absolute 2.9% F1 micro , andshowing a huge impact on the agreement class( A ) (8.6% absolute improvement). Table 4 showsthe top discrete features for argumentative relationidentification. From ArgF features (first column),we notice discourse expansion (“particularly”), con-trast (“although”) and agree/disagree lexicon get-ting high feature weights. We also notice pronouns receive large feature weights because argumenta-tive text often refers to personal stance (e.g., “youthink”, “I believe”). However, when analyzing
ArgF + SarcF features we find various sarcasmmarkers, such as tag questions, hyperbole, multi-ple punctuation, or sarcasm characteristics such assentiment incongruity receive the highest weights.For
LSTM models , we see that multitask learn-ing helps, LSTM MT uncert showing a 2.8% im-provement over the single model LSTM Attn , whichis statistically significant. Moreover, we notice thatthe improvement for the agree ( A ) and disagree( D ) classes is 5.1%, with just a small reduction forthe none ( N ) class (0.7%).For BERT , we notice better results when per-forming multitask learning, while the best per-forming model is obtained from BERT MT uncert where we experimented with the dynamic weight-ing of task-specific losses during the training pro-cess (Kendall et al., 2018). The performance in-crease is consistent across all three classes. Thedifference in performance among each setup is sta- R ArgF LR ArgF + SarcF pronouns : I. my (both A ),your(s) ( D ); discourse : so,because, for (all A ), inciden-tally, particularly, although(all D ); disagree lexicon :disagree, differ (both D ); agree lexicon : agreed( A ); entailment relation ; negation ( D ) pronouns : mine, my (both A ), you ( D ); discourse :then ( A ), though, however(both D ); modal : will ( A ); punctuation : multiple ques-tion marks (both A and D ); tag question : “are you”, “doyou” (both D ); hyperbole :wonderful ( A ), nonsense, bi-ased (both D ); LIWC dimen-sions : anxiety, assent, cer-tainty (all D ); sentiment in-congruity ( D ); interj : so,agreed (both A ) Table 4: Top discrete features from LR
ArgF andLR
ArgF + SarcF models, respectively. A and D depictthe argumentative relations (agree and disagree) for theparticular feature. tistically significant, as shown in Table 3. More-over, BERT MT uncert model improves the F micro by a large margin when compared to the LR andthe LSTM models. However, adding more datafor the auxiliary task (i.e., sarcasm detection) aspresented in BERT
ALT did not provide any sig-nificant improvement, only a 0.2 improvement of F micro over BERT MT (however it does showimprovement over the single task model). The rea-son could be that although “sarcasm V2”is a subsetof the original IAC corpus, it was annotated by adifferent set of Turkers than
IAC orig with differentannotation guidelines.Between the three classes - A , D , and N - weobserve the lowest performance on the A class.This is unsurprising, given the highly unbalancedsetting of the training data ( A occurs less than10% of times in the IAC orig , see Table 2).In sum, these improvements through multitasklearning over single task argumentative relation de-tection indicate that modeling sarcasm is usefulin modeling the disagreement space in online dis-cussions. This provides an empirical justificationto existing theories that study sarcasm’s impact inmodeling argumentation, persuasion, and argumentfallacies such as ad hominem attacks. Finally, wenotice that multitask learning also improves theperformance on the sarcasm detection task (resultsare presented in the Appendix).
To further investigate the effect of multitask learning, we present qualitative analysis studies to:

1. Understand the models' performance by looking at the turns correctly classified by the multitask models and misclassified by the corresponding individual single-task models. We analyze the turns in terms of sarcastic characteristics: whether they depict incongruity, humor, or sarcasm indicators (i.e., markers).
2. Understand when both the multitask and individual models made incorrect predictions.

To address the first issue, we compare the predictions of the multitask and individual models across the different settings. For example, BERT_MT_uncert correctly identifies 6 A, 50 D, and 60 N instances more than BERT_orig (out of 91, 398, and 510 instances, respectively). Two of the authors independently investigated a random sample of 100 instances (the qual set) chosen from the union of the test instances that are correctly predicted only by the multitask models (LR_ArgF+SarcF, LSTM_MT_uncert, BERT_MT_uncert, and BERT_ALT) and not by the corresponding individual models (LR_ArgF, LSTM_attn, and BERT_orig). For both the Transformer- and LSTM-based models, we explore how attention heads behave and whether common patterns exist (e.g., attending to words with opposite meanings when incongruity occurs). We display heat maps of the attention weights for a pair of prior and current turns (LSTM-based models) (Figure 3), whereas for BERT we display word-to-word attentions (Figures 4, 5, 6, 7, and 8) using visualization tools (Vig, 2019; Yang and Zhang, 2018). All the examples presented in this section are argumentative moves (i.e., turns with A or D) correctly identified by our multitask learning models but wrongly predicted as none (N) by the individual models. Moreover, the multitask learning models also correctly predict that these turns are instances of sarcasm.

Incongruity between prior turn and current turn.
Semantic incongruity, which can appear between the conversation context pt and the current turn ct, is an inherent characteristic of sarcasm (Joshi et al., 2015). This characteristic highlights the inconsistency between expectations and reality, making sarcasm or irony highly effective in persuasive communication (Gibbs and Izett, 2005). (Clark et al. (2019) have probed different layers and attention heads in BERT to find patterns, e.g., whether a token consistently attends to a fixed token in a specific layer; to avoid confusion and bias, we select attention examples from only the middle layer, layer 6.)
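The attention weights visualized in the heat maps below arise from the attention-pooling step described in Section 4.2. A simplified, dependency-free sketch (in the actual HAN of Yang et al. (2016), the scores come from a one-layer MLP over LSTM hidden states rather than a raw dot product):

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attention_pool(word_vecs, context_vec):
    """Score each word vector against a (learned) context vector,
    normalize the scores with softmax, and return the weighted sum
    (the attended representation) together with the weights."""
    scores = [sum(w * c for w, c in zip(vec, context_vec))
              for vec in word_vecs]
    weights = softmax(scores)
    dim = len(word_vecs[0])
    pooled = [sum(weights[i] * word_vecs[i][d]
                  for i in range(len(word_vecs)))
              for d in range(dim)]
    return pooled, weights
```

Words whose representations align with the learned context vector receive larger weights, which is exactly what the heat maps display: darker cells mark tokens (e.g., sarcasm markers) that dominate the pooled representation.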
Figure 3: Attention heatmap of a particular turn pair from LSTM_attn (left) and LSTM_MT_uncert (right), showing higher weights on sarcasm markers such as "Oops" and "!!" for LSTM_MT_uncert (disagree relation).
Figure 4: BERT_MT_uncert (right) attending to contrasting words more, in word-level attention, than BERT_orig (left) (disagree relation).
Figure 5: BERT_ALT (right) attending only to contrasting words, in comparison to BERT_orig (left) (disagree relation). However, the strength of the contrast in the case of BERT_ALT is lower than for BERT_MT_uncert on the same example turns.

In the case of BERT, Figure 4 presents the turns "evolution can't prove the book of genesis false" (pt) ↔ "ignorant of science think evolution has anything to do with the bible" (ct). Here, BERT_MT_uncert shows more attention between the incongruous terms ("genesis" ↔ "science", "evolution") as well as to the mocking word "ignorance". Likewise, Figure 6 presents the two turns "you are quite anti religious it seems" (pt) ↔ "anti ignorance and superstition . . . this is religion" (ct). We notice the word "religious" attending to "anti" and "ignorance" with high weights in the case of BERT_MT_uncert (from pt to ct), whereas BERT_orig only attends to the word "religious"
from the pt to ct turn. By modeling sarcasm, the multitask learning models can better predict argumentative moves that are expressed implicitly.

We also evaluate the BERT_ALT model on the examples presented in Figure 4 and Figure 6. Figure 5 shows that although BERT_ALT attends (from pt to ct) to the incongruous terms "genesis" ↔ "evolution", the strength of the relation (i.e., the attention weight) is comparatively lower than for BERT_MT_uncert (see Figure 4). By contrast, between Figure 6 and Figure 7, the BERT_MT_uncert model attends to multiple words in ct from the word "religion" in pt, but the BERT_ALT model attends to only two words, "anti" and "ignorance", with high weights from "religion" (pt to ct).

Figure 6: BERT_MT_uncert (right) attending to contrasting words more than BERT_orig (left) (disagree relation).
Figure 7: BERT_ALT (right) attending only to the contrasting words, in comparison to BERT_orig (left) (disagree relation).

Humor by word repetition.
Often the current turn ct sarcastically taunts the prior turn pt by word repetition and rhyme, imposing a humorous comic effect, also regarded as the phonetic style of humor (Yang et al., 2015). For the pair "genetics has nothing to do with it" (pt) ↔ "are saying that genetics has nothing to do with genetics?" (ct), we notice in BERT_MT-uncert that the token "it" in pt correctly attends to both occurrences of "genetics" in ct, where the second occurrence is the co-referent of "it" (Figure 8); this link is missed by the individual model BERT_orig.

Figure 8: BERT_MT-uncert (right) attending to co-referenced words in a humorous example missed by the BERT_orig model (left) (disagree relation)

Role of sarcasm markers.
Sarcasm markers are indicators that alert the reader that an utterance is sarcastic (Attardo, 2000). Comparing the logistic regression models LR_ArgF+SarcF and LR_ArgF, we observe that markers such as multiple punctuation marks ("???"), tag questions ("are you"), and upper case ("NOT") received the highest feature weights (Table 4). In Figure 3, while the individual model LSTM_attn attends to the words almost equally, in the multitask variant several sarcasm markers, such as "ya", "oops", and numerous exclamation marks ("!!"), receive larger attention weights.

Addressing the second issue (i.e., cases where both the multitask and single-task models make wrong predictions), we notice that over 100 examples of the none (N) class were classified as argumentative by both BERT_MT-uncert and BERT_orig. For the none (N) class, one of the most common sources of wrong predictions is when the current turn ct sarcastically takes a "different stance" on a topic from pt in a narrow context, but the whole turn is not argumentative. In the following example: "does he just say the opposite of everything <name> says?" (pt) ↔ "using <name> as a 180 compass is just fine by me" (ct), the BERT_MT-uncert, BERT_orig, LSTM_MT-uncert, and LSTM_attn models all predict disagree (D) (since ct is sarcastic about "<name>"), whereas the gold label is none (N). Looking closely at this pair of turns, it seems that ct presents an ad hominem attack (on the person "<name>") rather than a none relation.

For argumentative turns (agree and disagree) that are wrongly classified as none by all models, we found two common patterns: the use of concessions (e.g., "it's a consideration, but I doubt we should be promoting this . . . ") and arguments with uncommitted beliefs (e.g., "it is possible that", "that could probably be", "possibly, I must admit").
Conclusion

Linguistic and argumentation theories have studied the use of sarcasm in argumentation, including its effectiveness as a persuasive device or as a means to express an ad hominem fallacy. We present a comprehensive experimental study of argumentative relation identification and classification using sarcasm detection as an additional task. First, in the discrete feature space, we show that sarcasm-related features, in addition to argument-related features, improve the accuracy of the argumentative relation identification/classification task by 3%. Next, we show that multitask learning, using both a dual LSTM framework and BERT, improves performance over the corresponding single-task model by a statistically significant margin. In both cases, the dynamic weighting of task-specific losses performs best. We provide a detailed qualitative analysis, manually investigating a large sample to show which characteristics of sarcasm are attended to and may have guided correct predictions in the argumentative relation identification/classification task. In the future, we aim to study this synergy further by looking at sarcasm alongside persuasive strategies (e.g., ethos, pathos, logos) and argument fallacies (e.g., the ad hominem attack, also noticed by Habernal et al. (2018)).
Acknowledgements
The authors thank the anonymous reviewers and Tuhin Chakrabarty for helpful comments.
References
Rob Abbott, Brian Ecker, Pranav Anand, and Marilyn Walker. 2016. Internet argument corpus 2.0: An SQL schema for dialogic social media and the corpora to go with it. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 4445–4452.
Rob Abbott, Marilyn Walker, Pranav Anand, Jean E. Fox Tree, Robeson Bowmani, and Joseph King. 2011. How can you say such things?!?: Recognizing disagreement in informal political argument. In Proceedings of the Workshop on Languages in Social Media, pages 2–11. Association for Computational Linguistics.
Salvatore Attardo. 2000. Irony markers and functions: Towards a goal-oriented theory of irony and its processing. Rask, 12(1):3–20.
Joshua M. Averbeck. 2013. Comparisons of ironic and sarcastic arguments in terms of appropriateness and effectiveness in personal relationships. Argumentation and Advocacy, 50(1):47–57.
Elena Cabrio and Serena Villata. 2012. Combining textual entailment and argumentation theory for supporting online debates interactions. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 208–212.
Rich Caruana. 1997. Multitask learning. Machine Learning, 28(1):41–75.
Tuhin Chakrabarty, Christopher Hidey, Smaranda Muresan, Kathy McKeown, and Alyssa Hwang. 2019. AMPERSAND: Argument mining for PERSuAsive oNline discussions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2933–2943, Hong Kong, China. Association for Computational Linguistics.
Dushyant Singh Chauhan, S. R. Dhanush, Asif Ekbal, and Pushpak Bhattacharyya. 2020. Sentiment and emotion help sarcasm? A multi-task learning framework for multi-modal sarcasm, sentiment and emotion analysis. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4351–4360.
Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT's attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286, Florence, Italy. Association for Computational Linguistics.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Frans H. van Eemeren and Rob Grootendorst. 1992. Argumentation, Communication, and Fallacies: A Pragma-dialectical Perspective. Lawrence Erlbaum Associates, Inc.
Frans Hendrik van Eemeren, Rob Grootendorst, Sally Jackson, Scott Jacobs, et al. 1993. Reconstructing Argumentative Discourse. University of Alabama Press.
Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke S. Zettlemoyer. 2017. AllenNLP: A deep semantic natural language processing platform.
Aniruddha Ghosh and Tony Veale. 2017. Magnets for sarcasm: Making sarcasm detection timely, contextual and very personal. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 482–491.
Debanjan Ghosh, Alexander R. Fabbri, and Smaranda Muresan. 2018. Sarcasm analysis using conversation context. Computational Linguistics, 44(4):755–792.
Debanjan Ghosh, Alexander Richard Fabbri, and Smaranda Muresan. 2017. The role of conversation context for sarcasm detection in online interactions. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 186–196, Saarbrücken, Germany. Association for Computational Linguistics.
Debanjan Ghosh, Aquila Khanam, Yubo Han, and Smaranda Muresan. 2016. Coarse-grained argumentation features for scoring persuasive essays. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 549–554, Berlin, Germany. Association for Computational Linguistics.
Debanjan Ghosh and Smaranda Muresan. 2018. "With 1 follower I must be awesome :p". Exploring the role of irony markers in irony recognition. In Proceedings of ICWSM.
Debanjan Ghosh, Smaranda Muresan, Nina Wacholder, Mark Aakhus, and Matthew Mitsui. 2014. Analyzing argumentative discourse units in online interactions. In Proceedings of the First Workshop on Argumentation Mining, pages 39–48.
Debanjan Ghosh, Avijit Vajpayee, and Smaranda Muresan. 2020. A report on the 2020 sarcasm detection shared task. In Proceedings of the Second Workshop on Figurative Language Processing, pages 1–11, Online. Association for Computational Linguistics.
Raymond W. Gibbs and Christin Izett. 2005. Irony as persuasive communication. Figurative Language Comprehension: Social and Cultural Influences, pages 131–151.
Roberto González-Ibáñez, Smaranda Muresan, and Nina Wacholder. 2011. Identifying sarcasm in Twitter: A closer look. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 581–586, Portland, Oregon, USA. Association for Computational Linguistics.
Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don't stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342–8360, Online. Association for Computational Linguistics.
Ivan Habernal, Henning Wachsmuth, Iryna Gurevych, and Benno Stein. 2018. Before name-calling: Dynamics and triggers of ad hominem fallacies in web argumentation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 386–396, New Orleans, Louisiana. Association for Computational Linguistics.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
Yufang Hou and Charles Jochim. 2017. Argument relation classification using a joint inference model. In Proceedings of the 4th Workshop on Argument Mining, pages 60–66, Copenhagen, Denmark. Association for Computational Linguistics.
Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 168–177. ACM.
S. Jackson. 1992. 'Virtual standpoints' and the pragmatics of conversational argument. Argumentation Illuminated, pages 260–269.
Aditya Joshi, Vinita Sharma, and Pushpak Bhattacharyya. 2015. Harnessing context incongruity for sarcasm detection. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 757–762, Beijing, China. Association for Computational Linguistics.
Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas Mikolov. 2016. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.
Alex Kendall, Yarin Gal, and Roberto Cipolla. 2018. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7482–7491.
Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019. Multi-task deep neural networks for natural language understanding. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4487–4496, Florence, Italy. Association for Computational Linguistics.
Navonil Majumder, Soujanya Poria, Haiyun Peng, Niyati Chhaya, Erik Cambria, and Alexander Gelbukh. 2019. Sentiment and sarcasm classification with multitask learning. IEEE Intelligent Systems, 34(3):38–43.
Stefano Menini and Sara Tonelli. 2016. Agreement and disagreement: Comparison of points of view in the political domain. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 2461–2470.
Amita Misra and Marilyn A. Walker. 2013. Topic independent identification of agreement and disagreement in social media dialogue. In Proceedings of the SIGDIAL 2013 Conference, pages 41–50. Association for Computational Linguistics.
Smaranda Muresan, Roberto Gonzalez-Ibanez, Debanjan Ghosh, and Nina Wacholder. 2016. Identification of nonliteral language in social media: A case study on sarcasm. Journal of the Association for Information Science and Technology.
Huy Nguyen and Diane Litman. 2016. Context-aware argumentative relation mining. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1127–1137.
Shereen Oraby, Vrindavan Harrison, Lena Reed, Ernesto Hernandez, Ellen Riloff, and Marilyn Walker. 2016. Creating and characterizing a diverse corpus of sarcasm in dialogue, page 31.
Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830.
James W. Pennebaker, Martha E. Francis, and Roger J. Booth. 2001. Linguistic inquiry and word count: LIWC 2001. Mahway: Lawrence Erlbaum Associates, 71:2001.
Isaac Persing and Vincent Ng. 2016. Modeling stance in student essays. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2174–2184, Berlin, Germany. Association for Computational Linguistics.
Peter Potash, Alexey Romanov, and Anna Rumshisky. 2017. Here's my point: Joint pointer architecture for argument mining. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1364–1373, Copenhagen, Denmark. Association for Computational Linguistics.
Ellen Riloff, Ashequl Qadir, Prafulla Surve, Lalindra De Silva, Nathan Gilbert, and Ruihong Huang. 2013. Sarcasm as contrast between a positive sentiment and negative situation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 704–714. Association for Computational Linguistics.
Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, and Phil Blunsom. 2015. Reasoning about entailment with neural attention. arXiv preprint arXiv:1509.06664.
Sara Rosenthal and Kathleen McKeown. 2015. I couldn't agree more: The role of conversational structure in agreement and disagreement detection in online discussions, page 168.
Swapna Somasundaran and Janyce Wiebe. 2010. Recognizing stances in ideological on-line debates. In Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, pages 116–124. Association for Computational Linguistics.
Christian Stab and Iryna Gurevych. 2014. Identifying argumentative discourse structures in persuasive essays. In EMNLP, pages 46–56.
Christian Stab and Iryna Gurevych. 2017. Parsing argumentation structures in persuasive essays. Computational Linguistics, 43(3):619–659.
Manfred Stede and Jodi Schneider. 2018. Argumentation mining. Synthesis Lectures on Human Language Technologies, 11(2):1–191.
Chenhao Tan, Vlad Niculae, Cristian Danescu-Niculescu-Mizil, and Lillian Lee. 2016. Winning arguments: Interaction dynamics and persuasion strategies in good-faith online discussions. In Proceedings of WWW.
Christopher W. Tindale and James Gough. 1987. The use of irony in argumentation. Philosophy & Rhetoric, pages 1–17.
Jesse Vig. 2019. A multiscale visualization of attention in the transformer model. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 37–42, Florence, Italy. Association for Computational Linguistics.
Nina Wacholder, Smaranda Muresan, Debanjan Ghosh, and Mark Aakhus. 2014. Annotating multiparty discourse: Challenges for agreement metrics. LAW VIII, page 120.
Marilyn A. Walker, Pranav Anand, Robert Abbott, and Ricky Grant. 2012a. Stance classification using dialogic properties of persuasion. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 592–596. Association for Computational Linguistics.
Marilyn A. Walker, Jean E. Fox Tree, Pranav Anand, Rob Abbott, and Joseph King. 2012b. A corpus for research on deliberation and debate. In LREC, pages 812–817.
Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 347–354. Association for Computational Linguistics.
Diyi Yang, Alon Lavie, Chris Dyer, and Eduard Hovy. 2015. Humor recognition and humor anchor extraction. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2367–2376.
Jie Yang and Yue Zhang. 2018. NCRF++: An open-source neural sequence labeling toolkit. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.
Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of NAACL-HLT, pages 1480–1489.
Appendix
A Logistic Regression model with an L2 penalty is employed, where the class weights are proportional to the number of instances of the A, D, and N classes. The regularization strength C is searched over a grid using the dev data. The following values were tried for C: [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000].

Dual LSTM and Multi-task Learning experiment:
For the LSTM-network experiments we searched the hyperparameters over the dev set. In particular, we experimented with different mini-batch sizes (e.g., 8, 16, 32), dropout values (e.g., 0.3, 0.5, 0.7), numbers of epochs (e.g., 40, 50), hidden states of different sizes (100, 300), and the Adam optimizer (learning rate of 0.01). Embeddings were generated using FastText vectors (300 dimensions) (Joulin et al., 2016). Any token occurring fewer than five times was replaced by a special UNK token, whose vector is created from random samples drawn from a normal (Gaussian) distribution between 0.0 and 0.17. After tuning, we use the following hyperparameters for the test set: mini-batch size of 16, hidden state of size 300, number of epochs = 50, and dropout value of 0.5. Task-specific losses for the dynamic multitask version were learned during training.
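The dynamic weighting of the task-specific losses follows the uncertainty-based scheme of Kendall et al. (2018). A minimal sketch of the combined loss is given below; note that the log-variance terms are written as plain numbers here purely for illustration, whereas in the actual multitask models they are trainable parameters updated jointly with the network:

```python
import math

def uncertainty_weighted_loss(task_losses, log_vars):
    # Kendall et al. (2018): total = sum_i exp(-s_i) * L_i + s_i,
    # where s_i = log(sigma_i^2) is the log-variance of task i.
    # In the paper's dynamic multitask models the s_i are learned during
    # training; here they are fixed numbers for illustration only.
    return sum(math.exp(-s) * loss + s for loss, s in zip(task_losses, log_vars))

# With s_i = 0 (unit variance) the combination reduces to a plain sum of the
# two task losses (argumentative relation + sarcasm detection).
print(uncertainty_weighted_loss([1.2, 0.8], [0.0, 0.0]))  # -> 2.0

# Raising the log-variance of the noisier task down-weights its loss,
# at the cost of the additive log-variance penalty.
print(uncertainty_weighted_loss([1.2, 0.8], [0.0, 1.0]))
```

Because each weight exp(-s_i) shrinks as s_i grows while the penalty s_i grows only linearly, the optimizer can learn to discount a task whose loss is noisy without zeroing it out entirely.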
BERT-based models:
We use the dev partition for hyperparameter tuning, experimenting with different mini-batch sizes (e.g., 8, 16, 32, 48), numbers of epochs (3, 5, 6), and a learning rate of 3e-5, and we optimized the networks with the Adam optimizer. The training partitions were fine-tuned for 5 epochs with batch size = 16. Each training epoch took between 08:46 ∼

Although improving sarcasm detection is not the focus of our paper, we observe that multi-task learning improves performance on this task as well when compared to the single-task model. We present results for the deep learning models in Table 5. The multi-task models (both LSTM and BERT) outperform the corresponding single-task models (by 6.9 F1 and 6.4 F1 for the LSTM and BERT models, respectively). We note that the results on this particular dataset are much lower than on other datasets used for sarcasm detection. For example, LSTM_attn, the best model used by Ghosh et al. (2018), obtained only a 52.9 F1 score on this dataset, while it obtained 70.34 F1 on Sarcasm V2 (also derived from IAC but using different annotation guidelines), 74.96 F1 on a Twitter dataset, and 75.41 F1 on a Reddit dataset (Ghosh et al., 2018).

Table 5: Evaluations of sarcasm detection on the test set of IAC (precision, recall, and F1 for the LSTM_attn, LSTM_MT, BERT_orig, and BERT_MT-uncert models)
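For reference, precision/recall/F1 scores of the kind reported in Table 5 can be computed as below. The gold and predicted labels in this snippet are invented for demonstration; they are not the paper's model outputs:

```python
# Illustrative computation of binary precision/recall/F1 for sarcasm
# detection; the labels below are made up for demonstration only.
from sklearn.metrics import precision_recall_fscore_support

gold = [1, 0, 1, 1, 0, 0, 1, 0]  # 1 = sarcastic, 0 = not sarcastic
pred = [1, 0, 0, 1, 0, 1, 1, 0]

# average="binary" scores the positive (sarcastic) class only.
p, r, f1, _ = precision_recall_fscore_support(gold, pred, average="binary")
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")  # -> P=0.75 R=0.75 F1=0.75
```

Here 3 of the 4 predicted positives are correct (precision 3/4) and 3 of the 4 gold positives are recovered (recall 3/4), giving an F1 of 0.75.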