Contextual Argument Component Classification for Class Discussions
Luca Lugini and Diane Litman
Department of Computer Science and Learning Research and Development Center
University of Pittsburgh
Pittsburgh, PA, USA
Abstract
Argument mining systems often consider contextual information, i.e., information outside of an argumentative discourse unit, when trained to accomplish tasks such as argument component identification, classification, and relation extraction. However, prior work has not carefully analyzed the utility of different contextual properties in context-aware models. In this work, we show how two different types of contextual information, local discourse context and speaker context, can be incorporated into a computational model for classifying argument components in multi-party classroom discussions. We find that both context types can improve performance, although the improvements are dependent on context size and position.
In a typical argument mining system, the first task is identifying spans of text consisting of argumentative discourse units (ADUs), i.e., argument component identification (ACI). The next task, argument component classification (ACC), consists of assigning a label to each ADU according to an argument model, e.g., claims, evidence, etc. For example, row 1 in Table 1 is labeled claim since speaker 7 provides their personal view, while row 2 is labeled evidence because it references facts from a text.

While "context" has been used in the argument mining literature to refer to several phenomena, we consider context to be auxiliary textual information outside the span of an ADU. It is generally accepted that context is important in argument mining. Stab and Gurevych (2014) as well as Nguyen and Litman (2016) use context features extracted from the sentence containing an ADU to improve ACC. Persing and Ng (2016), Habernal and Gurevych (2017), and Aker et al. (2017) similarly use contextual features in joint ACI/ACC models. Opitz and Frank (2019) analyze a previous argument mining system and find that, for its predictions, it relies on context more than it does on ADU content. Chakrabarty et al. (2019) indirectly model context in ACC by fine-tuning a BERT model to predict the next sentence (i.e., the context). Eger et al. (2017) analyzed several neural models for jointly performing ACI, ACC, and argument relation extraction. All of these works share several limitations: (i) the context is either limited to a single configuration (e.g., one sentence before/after the ADU) or optimized along a single dimension (typically size but not position); (ii) only a subset of the features extracted for the target ADU are also extracted for context; (iii) context is typically based on ADU adjacency, although other ways of building context (e.g., based on speakers in multi-party dialogues) are possible.
This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/.
Table 1: Excerpt from a classroom discussion showing speaker ID, ADU, and argument component labels. (Columns: Row ID, ADU, AC.)

In this work, we improve upon baseline ACC models for multi-party discussions by incorporating two types of context. We define local context as ADUs preceding and/or following a target ADU, regardless of speaker. Speaker context consists of ADUs that a specific speaker previously voiced during the discussion. Our results show that both context types can individually and jointly improve ACC performance, with performance gains dependent on context size and position.
The dataset used to build and evaluate our proposed ACC models consists of 3,135 ADUs in a corpus of 29 text-based (i.e., centered around a book, play, or other literary work), multi-party classroom discussions between high school students (on average 15 students per discussion) (Olshefski et al., 2020). The discussions (average length of 34 minutes) were audio-recorded, manually transcribed, and student turns were manually segmented into multiple ADUs when needed. ADUs were then manually annotated according to a simplified version of Toulmin's argumentative model (Toulmin, 1958) consisting of three labels: (i) claims, arguable statements that voice a specific interpretation of a text; (ii) evidence, facts or documentation used to support a claim; (iii) warrants, reasoning given to explain how certain evidence supports a claim. An inter-rater reliability analysis showed a Krippendorff αU of 0.96 for segmentation and a Cohen's kappa of 0.74 for argument components. The label distribution is 65.3% claims, 24.3% evidence, and 10.4% warrants. ADUs are additionally labeled with the ID of the speaker who voiced the utterance. Table 1 shows an excerpt from an annotated discussion.

To evaluate the utility of context, we introduce two baseline ACC models and propose several contextual extensions. The source code for all models (and all parameters) is available at https://github.com/lucalugini/coling2020_argmining.

Baseline Models.
Our first model (hybrid baseline) is based on the model of Lugini and Litman (2018), which was developed for a similar type of dataset: an embedding generated by a convolutional neural network (CNN) is concatenated with a set of handcrafted features, and a softmax classifier is used to predict argument components. Given the limited size of our dataset, however, we only use a subset of the original model's handcrafted features, namely those used by Speciteller (Li and Nenkova, 2015); this reduces the number of handcrafted features from over 7,000 to 114 and avoids overfitting. The Speciteller feature set consists of pretrained word vectors (the average of the word vectors for each word in the ADU), as well as the number of connectives, words, numbers, symbols, and capital letters, the number of stopwords normalized by ADU length, the number of subjective and polar words (from the MPQA (Wilson et al., 2009) and General Inquirer (Stone and Hunt, 1963) dictionaries), average word familiarity (from the MRC Psycholinguistic Database (Wilson, 1988)), average characters per word, and inverse document frequency statistics (minimum and maximum). The dimensionality of the final feature vector is 2,514 (114 for the handcrafted features and 2,400 for the CNN). Our second model (
BERT baseline) is based on recent advances in Transformer architectures (Vaswani et al., 2017): a pretrained BERT model (Devlin et al., 2019; Wolf et al., 2019) generates word embeddings of dimensionality 768; average pooling is used to compute a fixed-size ADU embedding; and a softmax classifier predicts argument components.
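The pooling step above can be sketched as mask-aware average pooling over per-token vectors. The random arrays below are stand-ins for the 768-dimensional embeddings a pretrained BERT encoder would emit; this is an illustration, not the authors' released code.

```python
import numpy as np

def adu_embedding(token_embeddings, attention_mask):
    """Average-pool per-token vectors into one fixed-size ADU embedding,
    ignoring padded token positions."""
    mask = attention_mask[:, :, None].astype(float)
    return (token_embeddings * mask).sum(axis=1) / mask.sum(axis=1)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 10, 768))           # batch of 2 ADUs, 10 token slots
mask = np.array([[1] * 10, [1] * 6 + [0] * 4])   # second ADU has 6 real tokens
emb = adu_embedding(tokens, mask)
print(emb.shape)  # (2, 768)
```

The resulting 768-dimensional vector is what the softmax classifier consumes.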
Adding Local Context.
We define local context as ADUs preceding and/or following the target ADU, regardless of speaker ID. Context size is measured in terms of complete ADUs (i.e., an entire utterance or part of it), while context position refers to the position of the context ADUs relative to the target ADU (i.e., preceding, following, or both). We believe defining context in terms of ADUs is the most straightforward choice, since it is the same unit of analysis used for individual argument components. Though beyond the scope of this paper, another compelling choice is to define context based on the number of words outside the target ADU instead. We address the prior-work limitations highlighted earlier in two ways: (i) we explore the impact of varying both the size and position of ADUs included in the context, instead of picking a single position and optimizing size; (ii) we model context using the same features used for the target ADU. Each context ADU is converted into a fixed-size feature vector using the baseline models described above, then concatenated to the feature vector for the target ADU. A maximum context size of 6 was chosen based on results showing diminishing returns and on the fact that increasing the size further would go beyond "local" context. We additionally evaluate whether context size and position can be automatically optimized by adding an attention layer (Luong et al., 2015): context size is set to the maximum value and both preceding and following positions are included; the attention mechanism then aggregates all context ADUs into a single vector. (The dataset was obtained from https://discussiontracker.cs.pitt.edu.)
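The attention-based aggregation can be sketched as follows. Dot-product scoring against the target ADU is one plausible instantiation of a Luong-style layer; the 128-dimensional vectors are illustrative, not the paper's actual feature dimensions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend_context(target_vec, context_vecs):
    """Aggregate a variable number of context ADU vectors into one
    fixed-size summary via dot-product attention against the target
    ADU, then concatenate the summary with the target features."""
    scores = context_vecs @ target_vec     # one relevance score per context ADU
    weights = softmax(scores)              # normalized attention weights
    context_summary = weights @ context_vecs
    return np.concatenate([target_vec, context_summary])

rng = np.random.default_rng(2)
target = rng.normal(size=128)
context = rng.normal(size=(6, 128))        # maximum local context size of 6
combined = attend_context(target, context)
print(combined.shape)  # (256,)
```

Because the weights sum to 1, the summary stays the same size regardless of how many context ADUs are available.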
Adding Speaker Context.
Students exhibit highly variable behavior with respect to how they build arguments. For example, in a discussion between six students from the dataset, one student only voiced claims, only two students voiced warrants, and only two students voiced more than 10% of their argument components as evidence. We hypothesize that the argument component classifier can benefit from being informed of the propensity of a particular speaker to voice each argument component at any given point. While we have access to speaker IDs, ground-truth ADU labels are not available when making predictions; therefore we need to extract this information from the ADU text. Given the speaker ID for the target ADU, the speaker context module performs the following steps: (1) gather the speaker's previous ADUs from the discussion; (2) convert each ADU into a feature vector; (3) aggregate them into a single, fixed-size feature vector and concatenate it with the baseline features (and possibly with the local context). Step 1 simply involves filtering ADUs based on speaker ID, which is readily available in each discussion. Step 2 can be achieved in several ways; for the sake of simplicity, in the hybrid baseline we use a CNN to generate a feature vector for each ADU. To further reduce complexity, we implemented a CNN with the same structure as the one in the hybrid baseline model, but with the number of filters reduced from 16 to 4, resulting in a 200-dimensional vector. For the BERT baseline, the same embedding and average-pooling model was used in this step. Step 3 was accomplished using a Long Short-Term Memory (LSTM) network (Hochreiter and Schmidhuber, 1997). The final speaker context feature vector has dimensionality 100 (the output dimensionality of the LSTM).
We additionally experimented with automatically optimizing speaker context size by replacing the LSTM with an attention layer: setting the speaker context size to the maximum (40 in this case), the attention layer aggregates all ADUs into a single feature vector.
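The three steps can be sketched as below. A simple mean stands in for the paper's LSTM/attention aggregator, and the 16-dimensional feature vectors are illustrative placeholders for the CNN/BERT features.

```python
import numpy as np

def speaker_context(adus, target_idx, k=5):
    """Step 1: collect up to the k most recent prior ADUs voiced by the
    target ADU's speaker. `adus` is a list of (speaker_id, feature_vector)
    pairs in discussion order."""
    speaker = adus[target_idx][0]
    prior = [vec for sid, vec in adus[:target_idx] if sid == speaker]
    return prior[-k:]

def aggregate(context_vecs, dim):
    """Steps 2-3, sketched: collapse the context vectors into one
    fixed-size vector (a mean stands in for the paper's LSTM/attention)."""
    if not context_vecs:
        return np.zeros(dim)
    return np.mean(context_vecs, axis=0)

rng = np.random.default_rng(3)
adus = [(s, rng.normal(size=16)) for s in [7, 3, 7, 5, 7, 3, 7]]
ctx = speaker_context(adus, target_idx=6, k=5)
print(len(ctx))                  # 3 prior ADUs by speaker 7
print(aggregate(ctx, 16).shape)  # (16,)
```

The aggregated vector is then concatenated with the baseline (and possibly local-context) features before classification.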
Context Examples.
Assume the target ADU is row 5 in Table 1. A local context of size 3 contains rows 2, 3, and 4. A speaker context of size 3 contains rows 1, 2, and 3.
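The two selection rules behind this example can be written down directly. The speaker IDs below are hypothetical stand-ins consistent with the example (rows 1–3 and 5 by one speaker, row 4 by another); rows are 0-indexed in the code.

```python
def local_context(n_rows, target, size):
    """ADUs immediately preceding the target, regardless of speaker."""
    return list(range(max(0, target - size), target))

def speaker_ctx(speakers, target, size):
    """The target speaker's own prior ADUs, most recent `size` of them."""
    prior = [i for i in range(target) if speakers[i] == speakers[target]]
    return prior[-size:]

# Hypothetical speaker IDs for the five rows of Table 1 (0-indexed).
speakers = ["s7", "s7", "s7", "s2", "s7"]
print(local_context(5, target=4, size=3))       # [1, 2, 3] -> rows 2, 3, 4
print(speaker_ctx(speakers, target=4, size=3))  # [0, 1, 2] -> rows 1, 2, 3
```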
All models are evaluated using ten-fold cross-validation, and results are shown in Table 2.
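The fold construction can be sketched as a simple shuffled split; the paper does not specify its exact partitioning scheme, so this is one plausible implementation.

```python
import random

def ten_fold_indices(n, seed=0):
    """Split n example indices into ten disjoint folds; each fold serves
    once as the test set while the remaining nine train the model."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::10] for i in range(10)]

folds = ten_fold_indices(3135)          # 3,135 ADUs in the corpus
print(len(folds))                       # 10
print(sum(len(f) for f in folds))       # 3135
```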
Local Context.
We extended both baseline models with local context extracted in three different ways: only ADUs preceding the target, only following ADUs, and both preceding and following ADUs. We report three main observations when adding context to the hybrid baseline. First, with respect to position, all models including prior ADUs in the local context statistically significantly outperformed the baseline.

Table 2: Results for different experimental settings. Each row shows the best results for the corresponding settings when varying context size and position. Bold font shows the best results for each model. (Columns: Row, Model, Context, Kappa, Precision, Recall, F-score.)

Figure 1: Results for various context sizes. (a) and (c) compare the baselines to models with local context. (b) and (d) compare the baselines with speaker context and both context types.

Second, performance improves as context size increases: increasing context size from 2 (1 prior/next ADU) to 4 (2 prior/next ADUs) results in significantly better precision and f-score.

Speaker Context.
We define speaker context size K as the K closest prior ADUs to the target ADU. We experimented with speaker context sizes ranging from a minimum of 5 to a maximum of 40 (effectively considering all speaker turns since the beginning of the discussion). As shown in Figure 1 (b), including any previous ADUs from the current speaker improves performance over the baseline model. For hybrid models, all context sizes result in statistically significantly better precision, recall, and f-score.

Local Context and Speaker Context.
After individually adding each of the two context types to the baseline models, we experimented with including both context types simultaneously. In this setting we obtained the best overall performance (rows 4 and 8 of Table 2). As we can see from the grey and purple lines in Figure 1 (b), for hybrid models, modeling both contexts simultaneously always outperforms speaker context alone. For speaker context sizes greater than 5, the improvements are statistically significant. In this experiment we kept local context size and position constant (both prior and next ADUs, size 4) and varied speaker context size. Repeating this experiment with different local context settings yielded similar results. We also observed that combining both context types yielded our highest kappa values.

Conclusions and Future Work
In this paper we analyzed the impact of context for predicting argument components in multi-party discussions. We defined two types of context, local context and speaker context, and analyzed how different models perform when varying context size. We also investigated the effect of different positions for local context. We evaluated the two context types separately as well as simultaneously on two types of neural network models. Experimental results support our claim that both context size and position are important when incorporating context in argument mining systems; therefore, sentences beyond the ones immediately surrounding an ADU should be considered. Our results also show that speaker context can improve performance in multi-party discussions. Finally, we investigated the use of an attention mechanism for optimizing context size and found that its effectiveness is dependent on the type of model used. Our future plans include two main directions: 1) repeating our local context experiments on larger datasets, including other domains; 2) evaluating the effectiveness of speaker context on multi-party web discussions, where discussions are usually longer and author ID is typically available.
Acknowledgements
This work was supported by the National Science Foundation (1842334 and 1917673), and in part by the University of Pittsburgh Center for Research Computing through the resources provided.
References
Ahmet Aker, Alfred Sliwa, Yuan Ma, Ruishen Lui, Niravkumar Borad, Seyedeh Ziyaei, and Mina Ghobadi. 2017. What works and what does not: Classifier and feature analysis for argument mining. In Proceedings of the 4th Workshop on Argument Mining, pages 91–96.

Tuhin Chakrabarty, Christopher Hidey, Smaranda Muresan, Kathy McKeown, and Alyssa Hwang. 2019. AMPERSAND: Argument mining for PERSuAsive oNline discussions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2933–2943, Hong Kong, China, November.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186, Minneapolis, Minnesota, June.

Steffen Eger, Johannes Daxenberger, and Iryna Gurevych. 2017. Neural end-to-end learning for computational argumentation mining. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11–22.

Ivan Habernal and Iryna Gurevych. 2017. Argumentation mining in user-generated web discourse. Computational Linguistics, 43(1):125–179.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Junyi Jessy Li and Ani Nenkova. 2015. Fast and accurate prediction of sentence specificity. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pages 2281–2287, January.

Luca Lugini and Diane Litman. 2018. Argument component classification for classroom discussions. In Proceedings of the 5th Workshop on Argument Mining, pages 57–67.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421.

Huy Nguyen and Diane J. Litman. 2016. Improving argument mining in student essays by learning and exploiting argument indicators versus essay topics. In FLAIRS Conference, pages 485–490.

Christopher Olshefski, Luca Lugini, Ravneet Singh, Diane Litman, and Amanda Godley. 2020. The discussion tracker corpus of collaborative argumentation. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 1033–1043, Marseille, France, May. European Language Resources Association.

Juri Opitz and Anette Frank. 2019. Dissecting content and context in argumentative relation analysis. In Proceedings of the 6th Workshop on Argument Mining, pages 25–34, Florence, Italy, August.

Isaac Persing and Vincent Ng. 2016. End-to-end argumentation mining in student essays. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1384–1394.

Christian Stab and Iryna Gurevych. 2014. Annotating argument components and relations in persuasive essays. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 1501–1510.

Philip J. Stone and Earl B. Hunt. 1963. A computer approach to content analysis: Studies using the General Inquirer system. In Proceedings of the May 21–23, 1963, Spring Joint Computer Conference, pages 241–256. ACM.

Stephen Toulmin. 1958. The Uses of Argument. Cambridge: Cambridge University Press.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. 2009. Recognizing contextual polarity: An exploration of features for phrase-level sentiment analysis. Computational Linguistics, 35(3):399–433.

Michael Wilson. 1988. MRC Psycholinguistic Database: Machine-usable dictionary, version 2.00. Behavior Research Methods, 20(1):6–10.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing.