A Transfer Learning Approach for Dialogue Act Classification of GitHub Issue Comments
Ayesha Enayet
Department of Computer Science, University of Central Florida
Orlando, FL
[email protected]

Gita Sukthankar
Department of Computer Science, University of Central Florida
Orlando, FL
[email protected]
Abstract—Social coding platforms, such as GitHub, serve as laboratories for studying collaborative problem solving in open source software development; a key feature is their ability to support issue reporting, which is used by teams to discuss tasks and ideas. Analyzing the dialogue between team members, as expressed in issue comments, can yield important insights about the performance of virtual teams. This paper presents a transfer learning approach for performing dialogue act classification on issue comments. Since no large labeled corpus of GitHub issue comments exists, employing transfer learning enables us to leverage standard dialogue act datasets in combination with our own GitHub comment dataset. We compare the performance of several word- and sentence-level encoding models, including Global Vectors for Word Representation (GloVe), the Universal Sentence Encoder (USE), and Bidirectional Encoder Representations from Transformers (BERT). Being able to map issue comments to dialogue acts is a useful stepping stone towards understanding cognitive team processes.
Index Terms—social coding platforms, dialogue act classification, transfer learning, embeddings for natural language processing
I. INTRODUCTION
The emergence of online collaboration platforms has dramatically changed the dynamics of human teamwork, creating a veritable army of virtual teams composed of workers in different physical locations. Software engineering requires a tremendous amount of collaborative problem solving, making it an excellent domain for team cognition researchers who seek to understand the manifestation of cognition applied to team tasks. Mining data from social coding platforms such as GitHub can yield insights about the thought processes of virtual teams. Previous work on issue comments [1]–[3] has focused on emotional aspects of team communication, such as sentiment and politeness. Our aim is to map issue comments to states in team cognition such as information gathering, knowledge building, and problem solving. To do this we employ dialogue act (DA) classification, in order to identify the intent of the speaker.

Dialogue act classification has a broad range of natural language processing applications, including machine translation, dialogue systems, and speech recognition. Semantic-based classification of human utterances is a challenging task, and the lack of a large annotated corpus that represents class variations makes the job even harder. Compared to the examples of human utterances available in standard datasets like the Switchboard (SWBD) corpus and the ICSI Meeting Recorder Dialogue Act (MRDA) corpus, GitHub utterances are more complex.

The primary purpose of our study is the DA classification of GitHub issue comments by harnessing the strength of transfer learning, using word- and sentence-level embedding models fine-tuned on our dataset. For word-level transfer learning we used GloVe vectors [4], while the Universal Sentence Encoder [5] and BERT [6] models were used for sentence-level transfer. This paper presents a comparison of the performance of various architectures on GitHub dialogues in a limited-resource scenario. A second contribution is our publicly available dataset of annotated issue comments. The dataset is available at https://drive.google.com/drive/folders/1kLZvzfE80VeEYA1tquaaj6nSiT57f83?usp=sharing. In the field of computational collective intelligence, where people collaborate and work in teams to achieve goals, dialogue act classification can play a vital role in understanding human teamwork.

This research was supported by DARPA program HR001117S0018.

II. BACKGROUND
Unlike general purpose communication platforms such as Twitter and Facebook, GitHub is specialized to support virtual teams of software developers whose primary communication goal is to discuss new features and monitor software bugs. It facilitates distributed, asynchronous collaboration in open source software (OSS) development. Code development, issue reporting, and social interactions are tracked by the 20+ event types. Our assumption is that each software repository is maintained by a team and that the events associated with the repository form a partial history of the team's activities and social interactions.

GitHub has an open API to collect metadata about users, repositories, and the activities of users on repositories. Developers make changes to the code repository by pushing their content, while GitHub tracks the version control process. Any GitHub user can contribute to a repository by sending a pull request. Repository maintainers review pull requests, discuss possible modifications in the comments, and decide whether to accept or reject the requests. GitHub also supports passive social media style interactions, such as following repositories or developers. Within GitHub's issue handling infrastructure, users can report a bug or provide a feature request by opening an issue. Issue closure rates thus reflect the speed with which teams resolve problems and can be used as a measure of team performance.

III. RELATED WORK
Issue resolution has been viewed by many researchers as a rich source of information about the emotional health of the team and how it affects the software development process [7]. Kikas et al. demonstrated a model for predicting issue lifetime that included a single feature aggregating textual comment information [8]. Several studies have employed sentiment analysis [1]–[3] and topic modeling [9] to study GitHub issue comments. Ortu et al. conducted a large study on communication patterns in which they measured politeness and emotional affect in issue comments; their aim was to understand how contribution levels modulate communication patterns [3]. Murgia et al. demonstrated a machine learning classifier for identifying love, joy, or sadness in issue comments [2]. An empirical study of issue comments conducted by Guzman et al. [1] showed that the sentiment expressed in issue comments varies based on day of week, geographic dispersion of the team, and the programming language. Yang et al. addressed the more practical question of the relationship between issue comment sentiment and bug fixing speed [10].

In contrast, our aim is to study the team cognition aspects of collaborative problem solving using dialogue act classification. Unlike topic modeling or sentiment analysis, dialogue act classification has not been extensively applied to GitHub data. However, Saha et al. [11] proposed a deep learning approach for the dialogue act classification of Twitter data. A convolutional neural network was used to create the classifier, along with hand-crafted rules. Seven classes were included: statement, expression, suggestion, request, question, threat, and other. In contrast, our work uses a transfer learning approach and a significantly larger set of classes.

Prior to deep learning, statistical approaches such as hidden Markov models were used for dialogue act classification. The HMM represents discourse structure, with dialogue acts as states. Stolcke et al. demonstrated such a model that combined prosodic, lexical, and collocational cues [12]. Chen et al. proposed the CRF-Attentive Structured Network (CRF-ASN) framework to exploit CRF-attentive structure dependencies along with end-to-end training [13].

This paper presents a transfer learning approach for dialogue act classification that compensates for our small dataset. To do this, we learn an embedding from a larger dataset. The next section surveys the state of the art in embedding models for natural language processing.
A. Embeddings
Embeddings are a mechanism for mapping a high-dimensional space to a low-dimensional one while retaining the most effective structural representations. They can be used as part of the transfer learning process to mitigate the low availability of labeled language resources for various NLP tasks. This paper presents transfer learning results using the following state-of-the-art embedding methods: Global Vectors for Word Representation (GloVe) [4], the Universal Sentence Encoder (USE) [5], and Bidirectional Encoder Representations from Transformers (BERT) [6]. We compare these embedding models with the probabilistic technique proposed by Duran et al. [14] on our GitHub issue comments dataset.
1) Global Vectors for Word Representation (GloVe):
Pennington et al. [4] proposed the GloVe model in 2014. It creates a word-level embedding that leverages both local context window and global matrix factorization methods. GloVe employs a log-bilinear prediction-based technique that utilizes word–word co-occurrence statistics to identify meaningful structure and generate word-level embeddings. We use the GloVe model to illustrate the results of DA classification of GitHub data using word-level embeddings.
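GloVe vectors are distributed as plain text, one token per line followed by its vector components. A minimal sketch of loading that format and comparing words by cosine similarity might look like the following; the toy 3-dimensional vectors stand in for the real 100-dimensional entries of glove.6B.100d.txt:

```python
import math

def load_glove(lines):
    """Parse GloVe's plain-text format: one word per line,
    followed by its vector components separated by spaces."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy vectors standing in for real GloVe entries.
toy_file = [
    "bug 0.9 0.1 0.0",
    "issue 0.8 0.2 0.1",
    "banana 0.0 0.1 0.9",
]
vecs = load_glove(toy_file)
# Semantically related words should score higher than unrelated ones.
print(cosine(vecs["bug"], vecs["issue"]) > cosine(vecs["bug"], vecs["banana"]))  # True
```

In the pipelines below, vectors loaded this way populate the embedding layer's weight matrix before training.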
2) Universal Sentence Encoder (USE):
In 2018, Google Research released the Universal Sentence Encoder (USE) model for sentence-level transfer learning, which achieves consistent performance across multiple NLP tasks [5]. There are two different variants of the model: 1) a transformer architecture, which gives high accuracy at the cost of high resource consumption, and 2) a deep averaging network (DAN), which requires few resources at the cost of slightly reduced accuracy. The former uses attention-based, context-aware encoding subgraphs for the transfer architecture. The model outputs a 512-dimensional vector. The deep averaging network works by averaging word and bigram embeddings to use as an input to a deep neural network. The models are trained on web news, Wikipedia, web question-answer pages, discussion forums, and the Stanford Natural Language Inference (SNLI) corpus.
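The front end of the deep averaging network variant can be sketched in a few lines. This is a simplified illustration only: the toy 3-dimensional word vectors stand in for the trained USE weights, and bigram embeddings (which the real DAN also averages) are omitted:

```python
def average_embedding(tokens, vectors, dim):
    """DAN front end: average the embeddings of the input words
    (out-of-vocabulary words are skipped) into one fixed-size vector."""
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return [0.0] * dim
    return [sum(vals) / len(found) for vals in zip(*found)]

# Toy word vectors standing in for trained embeddings.
vectors = {"do": [0.1, 0.2, 0.3], "you": [0.3, 0.2, 0.1], "have": [0.2, 0.2, 0.2]}
# "kids" is out of vocabulary here and is skipped.
sentence_vec = average_embedding(["do", "you", "have", "kids"], vectors, 3)
print([round(x, 6) for x in sentence_vec])  # [0.2, 0.2, 0.2]
```

In the real model, the averaged vector is then passed through a feed-forward network to produce the 512-dimensional sentence encoding.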
3) Bidirectional Encoder Representations from Transformers (BERT):
Also created at Google, BERT is the first model that was trained on both left and right contexts [6]. To achieve a pre-trained deep bidirectional representation, it uses a masked language model, which follows the cloze deletion task. The model is trained on the BooksCorpus and English Wikipedia. The code for BERT is available at https://github.com/google-research/bert. There are two available flavors of BERT: 1) BERT-Base and 2) BERT-Large. BERT-Base has 12 transformer blocks, a hidden size of 768, 12 self-attention heads, and 110 million parameters. BERT-Large uses a fairly large network, with 24 transformer blocks, a hidden size of 1024, 16 self-attention heads, and 340 million parameters.

IV. METHOD
Treating our dialogue act classification as a transfer learning problem enables us to leverage embeddings learned on a dataset that is over 200 times larger than our test dataset. Five dialogue act classification pipelines were created to evaluate the performance of four different word- and sentence-level embedding models.

Fig. 1. Architecture diagrams for (i) Probabilistic Representation with RNN, (ii) GloVe+LSTM, (iii) Universal Sentence Encoder+LSTM (USE+LSTM), (iv) Universal Sentence Encoder (USE), and (v) Bidirectional Encoder Representations from Transformers (BERT).

TABLE I: DATASET STATISTICS
(Only the column headers — Dataset, Categories, W, DA — survived extraction; the table body is not recoverable.)

TABLE II: DATASET SAMPLE
Speaker | Utterance | DA | Description
A | I don't, I don't have any kids. | sd | Statement-non-Opinion
A | I, uh, my sister has a, she just had a baby, | sd | Statement-non-Opinion
A | he's about five months old | sd | Statement-non-Opinion
A | and she was worrying about going back to work and what she was going to do with him and – | sd | Statement-non-Opinion
A | Uh-huh. | b | Acknowledge
A | do you have kids? | qy | Yes-No-Question
B | I have three. | na | Affirmative non-yes Answer
A | Oh, really? | bh | Backchannel in question form
A. Datasets
For our study, we collected a dataset of issue comments from GitHub and hand-annotated them using a standard dialogue act tagset, DAMSL (Dialog Act Markup in Several Layers), to facilitate the transfer process. We have made the dataset available at https://drive.google.com/drive/folders/1kLZvzfE80VeEYA1tquaaj6nSiT57f83?usp=sharing. The tagset is available at https://web.stanford.edu/~jurafsky/ws97/manual.august1.html. Our test set consists of 859 instances from more than 50 GitHub issues.

The models were trained using the Switchboard Dialogue Act Corpus (SwDA). SwDA is one of the most popular public datasets for DA classification. It consists of 1,155 human-to-human telephone speech conversations. The dataset is tagged using 42 tags from the DAMSL tagset. Table I shows the statistics of both the test and training datasets. Table II shows examples from the SwDA training dataset. Table III shows examples from our GitHub issue comment dataset. GitHub issue comments are a complex combination of computer commands, special symbols, and standard English. From these examples, it is clear that this is a challenging transfer learning problem.

B. Probabilistic Representation with Recurrent Neural Networks
Duran et al. proposed a probabilistic technique to represent utterances while using an LSTM sentence model for dialogue act classification [14]. The probabilistic distribution of each word in the corpus over DA categories provides the representation of the utterances. The model does not incorporate contextual features at the discourse level. The set of keywords, consisting of all the words that occur above a threshold frequency, is used to define an n × m matrix X, where m is the number of categories and n is the number of keywords. Each entry x_ij of the matrix represents the probability of tag j given word i. Training was accomplished using code downloaded from https://github.com/NathanDuran/Probabilistic-RNN-DA-Classifier.

TABLE III: GITHUB DATASET SAMPLE
Speaker | Utterance | DA | Description
A | What steps will reproduce the problem? | qw | Wh-Question
B | Give the *exact* arguments passed to include-what-you-use, and attach the input source file that gives the problem (minimal test-cases are much appreciated!) | ad | Action-directive
B | Run IWYU against the following file: | ad | Action-directive
B | iwyu clear | ad | Action-directive
B | iwyu cat ./cstdarg.cpp | ad | Action-directive
C | What is the expected output? | qw | Wh-Question
C | What do you see instead? | qw | Wh-Question

C. GloVe + LSTM
We use glove.6B.100d.txt, downloaded from https://nlp.stanford.edu/projects/glove/, to train our model on the SwDA dataset. The model consists of input, embedding, LSTM, and one dense layer with 42 output labels. The network was trained using the Adam optimizer.
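Before the embedding layer can look up GloVe vectors, each utterance must be converted to a fixed-length sequence of vocabulary indices. A minimal sketch of that preprocessing step follows; the vocabulary, the choice of index 0 for padding/unknown tokens, and the maximum length of 6 are illustrative assumptions, not values taken from the paper:

```python
def build_vocab(utterances):
    """Assign each distinct token an integer id; 0 is reserved for padding."""
    vocab = {}
    for utt in utterances:
        for tok in utt.lower().split():
            vocab.setdefault(tok, len(vocab) + 1)
    return vocab

def encode(utterance, vocab, max_len):
    """Map tokens to ids (unknown tokens -> 0) and pad/truncate to max_len."""
    ids = [vocab.get(tok, 0) for tok in utterance.lower().split()]
    return (ids + [0] * max_len)[:max_len]

utterances = ["do you have kids ?", "I have three ."]
vocab = build_vocab(utterances)
print(encode("do you have three ?", vocab, 6))  # [1, 2, 3, 7, 5, 0]
```

Each integer id then indexes a row of the GloVe embedding matrix inside the network's embedding layer.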
D. Universal Sentence Encoder (USE)
The Universal Sentence Encoder was applied to the GitHub issue comments after fine-tuning on the SwDA dataset. The code to load the USE model is available at https://tfhub.dev/google/universal-sentence-encoder/1. We chose the USE transformer-based architecture model with three dense layers and a softmax activation function. The network was trained using the Adam optimizer.
E. USE+LSTM
We also combine the Universal Sentence Encoder with an LSTM. This model consists of input, embedding, convolution, LSTM, and one dense output layer.
F. Bidirectional Encoder Representations from Transformers (BERT)
This architecture was implemented using the bert_en_uncased_L-12_H-768_A-12 model from TensorFlow Hub. The model has 12 hidden layers (i.e., transformer blocks), a hidden size of 768, and 12 attention heads. As proposed by [6], we append a single dense layer to BERT.

TABLE IV: TRAINING, VALIDATION & TEST ACCURACY OF ALL THE MODELS
Model      | Acc    | Val Acc | Test Acc (GitHub)
GloVe+LSTM | 0.5089 | 0.5195  | 0.3714
Prob+LSTM  | 0.7672 | 0.7694  | 0.4412
USE        | 0.7247 | 0.6951  | 0.5071
USE+LSTM   | 0.3841 | 0.4257  | 0.4074
BERT       | 0.7151 | 0.7151  | 0.4063
V. EVALUATION
Table IV shows the performance of all five architectures. The Universal Sentence Encoder had the best performance on the GitHub issue comments, with a test accuracy of 50.71%, which is 6% higher than the accuracy achieved using the probabilistic sentence representation. The other three models showed significantly lower performance than USE, lagging by almost 10%. The probabilistic sentence representation approach exhibited the highest validation accuracy, 76.9%, which is significantly higher than USE's validation accuracy of 69.5%. The well-known BERT model had a validation accuracy of 71.5%, but a low test accuracy.

It is instructive to examine the performance differences between the best (USE) and second best (probabilistic representation) models. Figure 2 shows the confusion matrix of the classification results obtained using the USE model, and Figure 3 shows the confusion matrix of the probabilistic representation method. In both cases the most confused tag pair is sd (statement-non-opinion) and sv (statement-opinion). USE correctly classified 91.71% of the sd occurrences, while the probabilistic representation method only classified 76% correctly. On the other hand, USE classified 38.98% of the sv utterances correctly, while the probabilistic representation method classified 50% of them correctly.

Table V shows the precision, recall, and F score of our best performing architecture. In Table V, support represents the number of occurrences of each tag in the GitHub dataset. USE failed to classify the third most frequent tag, i.e., ad (action-directive), in the test dataset. The average precision of USE over all tags is 53%, the average recall is 51%, and the average F score is 42%. A difference of only 2% between precision and recall shows that the results of the model are consistent. BERT is one of the newest models for transfer learning; however, our results show that fine-tuning BERT does not improve performance much in comparison to the Universal Sentence Encoder.
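The per-tag precision, recall, and F score reported in Table V follow the standard definitions; a small self-contained sketch of computing them for one tag from gold and predicted label lists (illustrative data, not the paper's actual predictions) is:

```python
def tag_metrics(gold, pred, tag):
    """Precision, recall, and F1 for a single dialogue act tag."""
    tp = sum(1 for g, p in zip(gold, pred) if g == tag and p == tag)
    fp = sum(1 for g, p in zip(gold, pred) if g != tag and p == tag)
    fn = sum(1 for g, p in zip(gold, pred) if g == tag and p != tag)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy gold/predicted tag sequences.
gold = ["sd", "sd", "sv", "qw", "sd"]
pred = ["sd", "sv", "sv", "qw", "sd"]
p, r, f = tag_metrics(gold, pred, "sd")
print(round(p, 2), round(r, 2), round(f, 2))  # 1.0 0.67 0.8
```

Averaging these per-tag scores, weighted by support, yields the "avg / total" row of Table V.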
Prior work has shown that BERT does not benefit as much from fine-tuning as other embeddings [15].

VI. CONCLUSION
TABLE V: PRECISION, RECALL, & F SCORE OF ALL THE TAGS (USE)
Tag | Precision | Recall | F score | Support
sd | 0.51 | 0.92 | 0.66 | 350
b | 0.11 | 0.20 | 0.14 | 5
sv | 0.47 | 0.39 | 0.43 | 119
% | 0.00 | 0.00 | 0.00 | 4
aa | 1.00 | 0.05 | 0.09 | 21
ba | 0.67 | 0.21 | 0.32 | 29
qy | 0.80 | 0.66 | 0.72 | 61
ny | 0.00 | 0.00 | 0.00 | 6
fc | 0.04 | 0.50 | 0.07 | 2
qw | 0.48 | 0.58 | 0.52 | 19
nn | 0.00 | 0.00 | 0.00 | 1
bk | 0.00 | 0.00 | 0.00 | 0
h | 0.00 | 0.00 | 0.00 | 16
qy^d | 0.50 | 0.17 | 0.25 | 6
bh | 0.00 | 0.00 | 0.00 | 1
^q | 0.00 | 0.00 | 0.00 | 2
bf | 0.00 | 0.00 | 0.00 | 14
fo_o_fw_"_by_bc | 0.00 | 0.00 | 0.00 | 0
na | 0.00 | 0.00 | 0.00 | 4
ad | 0.67 | 0.02 | 0.04 | 108
b^m | 0.00 | 0.00 | 0.00 | 0
qo | 1.00 | 0.03 | 0.06 | 31
qh | 0.00 | 0.00 | 0.00 | 0
^h | 0.00 | 0.00 | 0.00 | 0
ar | 0.00 | 0.00 | 0.00 | 1
ng | 0.00 | 0.00 | 0.00 | 2
br | 0.00 | 0.00 | 0.00 | 0
no | 0.00 | 0.00 | 0.00 | 23
fp | 1.00 | 0.83 | 0.91 | 6
qrr | 0.00 | 0.00 | 0.00 | 0
arp_nd | 0.00 | 0.00 | 0.00 | 0
t3 | 0.00 | 0.00 | 0.00 | 0
oo_co_cc | 0.00 | 0.00 | 0.00 | 18
aap_am | 0.00 | 0.00 | 0.00 | 2
t1 | 0.00 | 0.00 | 0.00 | 0
bd | 0.00 | 0.00 | 0.00 | 0
^g | 0.00 | 0.00 | 0.00 | 0
qw^d | 0.00 | 0.00 | 0.00 | 0
fa | 1.00 | 0.14 | 0.25 | 7
ft | 0.00 | 0.00 | 0.00 | 0
avg / total | 0.53 | 0.51 | 0.42 | 859

Fig. 2. Confusion Matrix: Universal Sentence Encoder (all classes).

This paper demonstrates a dialogue act classification system for GitHub issue comments. Due to the lack of publicly available training sets of formal teamwork dialogues, we formulated the problem as a transfer learning task, using both sentence-level and word-level embedding models to leverage information from the SwDA dataset. A significant contribution of our work is identifying the embedding model that performs best after fine-tuning on issue comments. We used GloVe, probabilistic representation, USE, and BERT embeddings to train five different models. USE showed the best performance, with an accuracy of 50.71%. The low accuracy of USE on DA classification, as compared to its accuracy on other state-of-the-art NLP tasks, shows the complex nature of dialogue act classification. We evaluated many different settings for learning rates, epochs, and batch size; even though minor accuracy improvements were achievable, the performance of the embedding models remained fairly stable.

Our aim is to map issue comments to cognitive states in the Macrocognition in Teams Model (MITM) [16].
Drawing from research on externalized cognition, team cognition, group communication and problem solving, and collaborative learning and adaptation, MITM provides a coherent, theoretically based conceptualization for understanding complex team processes and how these emerge and change over time. MITM consists of five components: Team Problem-Solving Outcomes, Externalized Team Knowledge, Internalized Knowledge, Team Knowledge Building, and Individual Knowledge Building. It captures the parallel and iterative processes engaged by teams as they synthesize these components in service of team cognitive processes such as problem solving, decision making, and planning. MITM has been applied to other team problem solving scenarios in military logistics [17] and business planning [18], but has never been used to analyze software engineering teams. Its usage in the domain of software engineering would be a major research contribution to the field of team cognition.

Although it is possible to directly label issue comments using an MITM code book, that type of labeling would be less compatible with existing dialogue act datasets. Instead we are constructing a mapping that relates the DAMSL tagset to these cognitive states. For instance, the question tags in DAMSL clearly relate to information gathering processes. Also, many of the DAMSL classes are less relevant to the team cognition process and could be ignored. The most commonly occurring classes in the GitHub issue comments (statement-opinion, statement-non-opinion, agree/accept, acknowledge, and abandoned) are all relevant to the Macrocognition in Teams Model, and we plan to tune our dialogue act classifiers to bolster the performance on these classes.

Fig. 3. Confusion Matrix: Probabilistic representation+LSTM. This illustration only includes classes with the largest support. The classes shown are: sv=statement-opinion, sd=statement-non-opinion, aa=agree/accept, b=acknowledge, and %=abandoned.
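Such a tagset-to-state mapping can be sketched as a simple lookup table. The specific assignments below are illustrative assumptions, not the finalized code book:

```python
# Hypothetical mapping from DAMSL dialogue act tags to MITM-style
# cognitive states; these assignments are illustrative, not the
# finalized code book.
DA_TO_STATE = {
    "qy": "information_gathering",  # yes-no question
    "qw": "information_gathering",  # wh-question
    "qo": "information_gathering",  # open question
    "sd": "knowledge_building",     # statement-non-opinion
    "sv": "knowledge_building",     # statement-opinion
    "ad": "problem_solving",        # action-directive
}

def to_cognitive_state(da_tag):
    """Map a DAMSL tag to a cognitive state; tags outside the mapping
    (the classes less relevant to team cognition) fall through to 'other'."""
    return DA_TO_STATE.get(da_tag, "other")

tags = ["qw", "ad", "sd", "fp"]
print([to_cognitive_state(t) for t in tags])
# ['information_gathering', 'problem_solving', 'knowledge_building', 'other']
```

Applied to a classified issue thread, this kind of lookup would turn a sequence of dialogue acts into a sequence of team cognitive states.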
In future work, we will continue to improve the size and quality of our publicly released dataset by recruiting more annotators to help with the labeling task and by more systematically studying inter-coder reliability.

REFERENCES
[1] E. Guzman, D. Azócar, and Y. Li, "Sentiment analysis of commit comments in GitHub: An empirical study," in Proceedings of the Working Conference on Mining Software Repositories, 2014, pp. 352–355.
[2] A. Murgia, M. Ortu, P. Tourani, B. Adams, and S. Demeyer, "An exploratory qualitative and quantitative analysis of emotions in issue report comments of open source systems," Empirical Software Engineering, vol. 23, no. 1, pp. 521–564, 2018.
[3] M. Ortu, T. Hall, M. Marchesi, R. Tonelli, D. Bowes, and G. Destefanis, "Mining communication patterns in software development: A GitHub analysis," in International Conference on Predictive Models and Data Analytics in Software Engineering, 2018.
[4] J. Pennington, R. Socher, and C. D. Manning, "GloVe: Global vectors for word representation," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
[5] D. Cer, Y. Yang, S.-y. Kong, N. Hua, N. Limtiaco, R. S. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar et al., "Universal sentence encoder," arXiv preprint arXiv:1803.11175, 2018.
[6] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[7] M. R. Islam and M. F. Zibran, "Towards understanding and exploiting developers' emotional variations in software engineering," in IEEE International Conference on Software Engineering Research, Management and Applications (SERA), 2016, pp. 185–192.
[8] R. Kikas, M. Dumas, and D. Pfahl, "Using dynamic and contextual features to predict issue lifetime in GitHub projects," in IEEE/ACM Working Conference on Mining Software Repositories (MSR), 2016, pp. 291–302.
[9] X. Wang, M. Lee, A. Pinchbeck, and F. Fard, "Where does LDA sit for GitHub?" in IEEE/ACM International Conference on Automated Software Engineering Workshop (ASEW), 2019, pp. 94–97.
[10] B. Yang, X. Wei, and C. Liu, "Sentiments analysis in GitHub repositories: An empirical study," in Asia-Pacific Software Engineering Conference Workshops (APSECW), 2017, pp. 84–89.
[11] T. Saha, S. Saha, and P. Bhattacharyya, "Tweet act classification: A deep learning based classifier for recognizing speech acts in Twitter," in International Joint Conference on Neural Networks (IJCNN), IEEE, 2019, pp. 1–8.
[12] A. Stolcke, K. Ries, N. Coccaro, E. Shriberg, R. Bates, D. Jurafsky, P. Taylor, R. Martin, C. V. Ess-Dykema, and M. Meteer, "Dialogue act modeling for automatic tagging and recognition of conversational speech," Computational Linguistics, vol. 26, no. 3, pp. 339–373, 2000.
[13] Z. Chen, R. Yang, Z. Zhao, D. Cai, and X. He, "Dialogue act recognition via CRF-attentive structured network," in International ACM SIGIR Conference on Research & Development in Information Retrieval, 2018, pp. 225–234.
[14] N. Duran and S. Battle, "Probabilistic word association for dialogue act classification with recurrent neural networks," in International Conference on Engineering Applications of Neural Networks, Springer, 2018, pp. 229–239.
[15] D. Yu and Z. Yu, "MIDAS: A dialog act annotation scheme for open domain human machine spoken conversations," arXiv preprint arXiv:1908.10023, 2019.
[16] S. M. Fiore, K. A. Smith-Jentsch, E. Salas, N. Warner, and M. Letsky, "Toward an understanding of macrocognition in teams: Developing and defining complex collaborative processes and products," Theoretical Issues in Ergonomics Science, vol. 11, no. 4, pp. 250–271, 2010.
[17] S. Hutchins and T. Kendall, "Understanding cognition in team collaboration through use of communications analysis," Proceedings of the Human Factors and Ergonomics Society Annual Meeting, vol. 54, pp. 443–447, 2010.
[18] I. Seeber, R. Maier, and B. Weber, "Macrocognition in collaboration: Analyzing processes of team knowledge building with CoPrA,"