A Transfer Learning Approach for Dialogue Act Classification of GitHub Issue Comments
Ayesha Enayet
Department of Computer Science, University of Central Florida
Orlando, FL
[email protected]

Gita Sukthankar
Department of Computer Science, University of Central Florida
Orlando, FL
[email protected]
Abstract—Social coding platforms, such as GitHub, serve as laboratories for studying collaborative problem solving in open source software development; a key feature is their ability to support issue reporting, which is used by teams to discuss tasks and ideas. Analyzing the dialogue between team members, as expressed in issue comments, can yield important insights about the performance of virtual teams. This paper presents a transfer learning approach for performing dialogue act classification on issue comments. Since no large labeled corpus of GitHub issue comments exists, employing transfer learning enables us to leverage standard dialogue act datasets in combination with our own GitHub comment dataset. We compare the performance of several word- and sentence-level encoding models, including Global Vectors for Word Representation (GloVe), the Universal Sentence Encoder (USE), and Bidirectional Encoder Representations from Transformers (BERT). Being able to map issue comments to dialogue acts is a useful stepping stone towards understanding cognitive team processes.
Index Terms—social coding platforms, dialogue act classification, transfer learning, embeddings for natural language processing
I. INTRODUCTION
The emergence of online collaboration platforms has dramatically changed the dynamics of human teamwork, creating a veritable army of virtual teams composed of workers in different physical locations. Software engineering requires a tremendous amount of collaborative problem solving, making it an excellent domain for team cognition researchers who seek to understand the manifestation of cognition applied to team tasks. Mining data from social coding platforms such as GitHub can yield insights about the thought processes of virtual teams. Previous work on issue comments [1]–[3] has focused on emotional aspects of team communication, such as sentiment and politeness. Our aim is to map issue comments to states in team cognition such as information gathering, knowledge building, and problem solving. To do this we employ dialogue act (DA) classification, in order to identify the intent of the speaker.

Dialogue act classification has a broad range of natural language processing applications, including machine translation, dialogue systems, and speech recognition. Semantic-based classification of human utterances is a challenging task, and the lack of a large annotated corpus that represents class variations makes the job even harder. Compared to the examples of human utterances available in standard datasets like the Switchboard (SWBD) corpus and the ICSI Meeting Recorder Dialogue Act (MRDA) corpus, GitHub utterances are more complex.

The primary purpose of our study is the DA classification of GitHub issue comments by harnessing the strength of transfer learning, using word- and sentence-level embedding models fine-tuned on our dataset. For word-level transfer learning we used GloVe vectors [4], while the Universal Sentence Encoder [5] and BERT [6] models were used for sentence-level transfer. This paper presents a comparison of the performance of various architectures on GitHub dialogues in a limited-resource scenario. A second contribution is our publicly available dataset of annotated issue comments. The dataset is available at https://drive.google.com/drive/folders/1kLZvzfE80VeEYA1tquaaj6nSiT57f83?usp=sharing. In the field of computational collective intelligence, where people collaborate and work in teams to achieve goals, dialogue act classification can play a vital role in understanding human teamwork.

This research was supported by DARPA program HR001117S0018.

II. BACKGROUND
Unlike general purpose communication platforms such as Twitter and Facebook, GitHub is specialized to support virtual teams of software developers whose primary communication goal is to discuss new features and monitor software bugs. It facilitates distributed, asynchronous collaboration in open source software (OSS) development. Code development, issue reporting, and social interactions are tracked by the 20+ event types. Our assumption is that each software repository is maintained by a team and that the events associated with the repository form a partial history of the team's activities and social interactions.

GitHub has an open API to collect metadata about users, repositories, and the activities of users on repositories. Developers make changes to the code repository by pushing their content, while GitHub tracks the version control process. Any GitHub user can contribute to a repository by sending a pull request. Repository maintainers review pull requests, discuss possible modifications in the comments, and decide whether to accept or reject the requests. GitHub also supports passive social media style interactions, such as following repositories or developers. Within GitHub's issue handling infrastructure, users can report a bug or provide a feature request by opening an issue. Issue closure rates thus reflect the speed with which teams resolve problems and can be used as a measure of team performance.

III. RELATED WORK
Issue resolution has been viewed by many researchers as a rich source of information about the emotional health of the team and how it affects the software development process [7]. Kikas et al. demonstrated a model for predicting issue lifetime that included a single feature aggregating textual comment information [8]. Several studies have employed sentiment analysis [1]–[3] and topic modeling [9] to study GitHub issue comments. Ortu et al. conducted a large study on communication patterns in which they measured politeness and emotional affect in issue comments; their aim was to understand how contribution levels modulate communication patterns [3]. Murgia et al. demonstrated a machine learning classifier for identifying love, joy, or sadness in issue comments [2]. An empirical study of issue comments conducted by Guzman et al. [1] showed that the sentiment expressed in issue comments varies based on day of week, geographic dispersion of the team, and the programming language. Yang et al. addressed the more practical question of the relationship between issue comment sentiment and bug fixing speed [10].

In contrast, our aim is to study the team cognition aspects of collaborative problem solving using dialogue act classification. Unlike topic modeling or sentiment analysis, dialogue act classification has not been extensively applied to GitHub data. However, Saha et al. [11] proposed a deep learning approach for the dialogue act classification of Twitter data. A convolutional neural network was used to create the classifier, along with hand-crafted rules. Seven classes were included: statement, expression, suggestion, request, question, threat, and other. In contrast, our work uses a transfer learning approach and a significantly larger set of classes.

Prior to deep learning, statistical approaches such as hidden Markov models were used for dialogue act classification. The HMM represents discourse structure, with dialogue acts as states. Stolcke et al. demonstrated such a model that combined prosodic, lexical, and collocational cues [12]. Chen et al. proposed the CRF-Attentive Structured Network (CRF-ASN) framework to exploit CRF-attentive structure dependencies along with end-to-end training [13].

This paper presents a transfer learning approach for dialogue act classification that compensates for our small dataset. To do this, we learn an embedding from a larger dataset. The next section surveys the state of the art in embedding models for natural language processing.
A. Embeddings
Embeddings are a mechanism for mapping a high-dimensional space to a low-dimensional one while retaining the most effective structural representations. They can be used as part of the transfer learning process to mitigate the low availability of labeled language resources for various NLP tasks. This paper presents transfer learning results using the following state-of-the-art embedding methods: Global Vectors for Word Representation (GloVe) [4], the Universal Sentence Encoder (USE) [5], and Bidirectional Encoder Representations from Transformers (BERT) [6]. We compare these embedding models with the probabilistic technique proposed by Duran et al. [14] on our GitHub issue comments dataset.
1) Global Vectors for Word Representation (GloVe):
Pennington et al. [4] proposed the GloVe model in 2014. It creates a word-level embedding that leverages both local context window and global matrix factorization methods. GloVe employs a log-bilinear prediction-based technique that utilizes word–word co-occurrence statistics to identify meaningful structure and generate word-level embeddings. We use the GloVe model to illustrate the results of DA classification of GitHub data using word-level embeddings.
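GloVe vectors are distributed as plain text, one token per line followed by its vector components. A minimal sketch of loading that format and comparing words by cosine similarity might look like the following; the toy 3-dimensional vectors stand in for the real 100-dimensional entries of glove.6B.100d.txt:

```python
import math

def load_glove(lines):
    """Parse GloVe's plain-text format: one word per line,
    followed by its vector components separated by spaces."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy vectors standing in for real GloVe entries.
toy_file = [
    "bug 0.9 0.1 0.0",
    "issue 0.8 0.2 0.1",
    "banana 0.0 0.1 0.9",
]
vecs = load_glove(toy_file)
# Semantically related words should score higher than unrelated ones.
print(cosine(vecs["bug"], vecs["issue"]) > cosine(vecs["bug"], vecs["banana"]))  # True
```

In the pipelines below, vectors loaded this way populate the embedding layer's weight matrix before training.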
2) Universal Sentence Encoder (USE):
In 2018, Google Research released the Universal Sentence Encoder (USE) model for sentence-level transfer learning, which achieves consistent performance across multiple NLP tasks [5]. There are two different variants of the model: 1) a transformer architecture, which gives high accuracy at the cost of high resource consumption, and 2) a deep averaging network (DAN), which requires few resources at the cost of slightly reduced accuracy. The former uses attention-based, context-aware encoding subgraphs for the transfer architecture. The model outputs a 512-dimensional vector. The deep averaging network works by averaging word and bigram embeddings to use as an input to a deep neural network. The models are trained on web news, Wikipedia, web question-answer pages, discussion forums, and the Stanford Natural Language Inference (SNLI) corpus.
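The front end of the deep averaging network variant can be sketched in a few lines. This is a simplified illustration only: the toy 3-dimensional word vectors stand in for the trained USE weights, and bigram embeddings (which the real DAN also averages) are omitted:

```python
def average_embedding(tokens, vectors, dim):
    """DAN front end: average the embeddings of the input words
    (out-of-vocabulary words are skipped) into one fixed-size vector."""
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return [0.0] * dim
    return [sum(vals) / len(found) for vals in zip(*found)]

# Toy word vectors standing in for trained embeddings.
vectors = {"do": [0.1, 0.2, 0.3], "you": [0.3, 0.2, 0.1], "have": [0.2, 0.2, 0.2]}
# "kids" is out of vocabulary here and is skipped.
sentence_vec = average_embedding(["do", "you", "have", "kids"], vectors, 3)
print([round(x, 6) for x in sentence_vec])  # [0.2, 0.2, 0.2]
```

In the real model, the averaged vector is then passed through a feed-forward network to produce the 512-dimensional sentence encoding.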
3) Bidirectional Encoder Representations from Transformers (BERT):
Also created at Google, BERT is the first model that was trained on both left and right contexts [6]. To achieve a pre-trained deep bidirectional representation, it uses a masked language model, which follows the cloze deletion task. The model is trained on the BooksCorpus and English Wikipedia. The code for BERT is available at https://github.com/google-research/bert. There are two available flavors of BERT: 1) BERT-Base and 2) BERT-Large. BERT-Base has 12 transformer blocks, a hidden size of 768, 12 self-attention heads, and 110 million parameters. BERT-Large uses a fairly large network, with 24 transformer blocks, a hidden size of 1024, 16 self-attention heads, and 340 million parameters.

IV. METHOD
Treating our dialogue act classification as a transfer learning problem enables us to leverage embeddings learned on a dataset that is over 200 times larger than our test dataset. Five dialogue act classification pipelines were created to evaluate the performance of four different word- and sentence-level embedding models.

Fig. 1. Architecture diagrams for (i) Probabilistic Representation with RNN, (ii) GloVe+LSTM, (iii) Universal Sentence Encoder+LSTM (USE+LSTM), (iv) Universal Sentence Encoder (USE), and (v) Bidirectional Encoder Representations from Transformers (BERT).

TABLE I: DATASET STATISTICS
(Only the column headers — Dataset, Categories, W, DA — survived extraction; the table body is not recoverable.)

TABLE II: DATASET SAMPLE
Speaker | Utterance | DA | Description
A | I don't, I don't have any kids. | sd | Statement-non-Opinion
A | I, uh, my sister has a, she just had a baby, | sd | Statement-non-Opinion
A | he's about five months old | sd | Statement-non-Opinion
A | and she was worrying about going back to work and what she was going to do with him and – | sd | Statement-non-Opinion
A | Uh-huh. | b | Acknowledge
A | do you have kids? | qy | Yes-No-Question
B | I have three. | na | Affirmative non-yes Answer
A | Oh, really? | bh | Backchannel in question form
A. Datasets
For our study, we collected a dataset of issue comments from GitHub and hand-annotated them using a standard dialogue act tagset, DAMSL (Dialog Act Markup in Several Layers), to facilitate the transfer process. We have made the dataset available at https://drive.google.com/drive/folders/1kLZvzfE80VeEYA1tquaaj6nSiT57f83?usp=sharing. The tagset is available at https://web.stanford.edu/~jurafsky/ws97/manual.august1.html. Our test set consists of 859 instances from more than 50 GitHub issues.

The models were trained using the Switchboard Dialogue Act Corpus (SwDA). SwDA is one of the most popular public datasets for DA classification. It consists of 1,155 human-to-human telephone speech conversations. The dataset is tagged using 42 tags from the DAMSL tagset. Table I shows the statistics of both the test and training datasets. Table II shows examples from the SwDA training dataset. Table III shows examples from our GitHub issue comment dataset. GitHub issue comments are a complex combination of computer commands, special symbols, and standard English. From these examples, it is clear that this is a challenging transfer learning problem.

B. Probabilistic Representation with Recurrent Neural Networks
Duran et al. proposed a probabilistic technique to represent utterances while using an LSTM sentence model for dialogue act classification [14]. The probabilistic distribution of each word in the corpus over DA categories provides the representation of the utterances. The model does not incorporate contextual features at the discourse level. The set of keywords, consisting of all the words that occur above a threshold frequency, is used to define an n × m matrix X, where m is the number of categories and n is the number of keywords. Each entry x_ij of the matrix represents the probability of tag j given word i. Training was accomplished using code downloaded from https://github.com/NathanDuran/Probabilistic-RNN-DA-Classifier.

TABLE III: GITHUB DATASET SAMPLE
Speaker | Utterance | DA | Description
A | What steps will reproduce the problem? | qw | Wh-Question
B | Give the *exact* arguments passed to include-what-you-use, and attach the input source file that gives the problem (minimal test-cases are much appreciated!) | ad | Action-directive
B | Run IWYU against the following file: | ad | Action-directive
B | iwyu clear | ad | Action-directive
B | iwyu cat ./cstdarg.cpp | ad | Action-directive
C | What is the expected output? | qw | Wh-Question
C | What do you see instead? | qw | Wh-Question

C. GloVe + LSTM
We use glove.6B.100d.txt, downloaded from https://nlp.stanford.edu/projects/glove/, to train our model on the SwDA dataset. The model consists of input, embedding, LSTM, and one dense layer with 42 output labels. The network was trained using the Adam optimizer.
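Before the embedding layer can look up GloVe vectors, each utterance must be converted to a fixed-length sequence of vocabulary indices. A minimal sketch of that preprocessing step follows; the vocabulary, the choice of index 0 for padding/unknown tokens, and the maximum length of 6 are illustrative assumptions, not values taken from the paper:

```python
def build_vocab(utterances):
    """Assign each distinct token an integer id; 0 is reserved for padding."""
    vocab = {}
    for utt in utterances:
        for tok in utt.lower().split():
            vocab.setdefault(tok, len(vocab) + 1)
    return vocab

def encode(utterance, vocab, max_len):
    """Map tokens to ids (unknown tokens -> 0) and pad/truncate to max_len."""
    ids = [vocab.get(tok, 0) for tok in utterance.lower().split()]
    return (ids + [0] * max_len)[:max_len]

utterances = ["do you have kids ?", "I have three ."]
vocab = build_vocab(utterances)
print(encode("do you have three ?", vocab, 6))  # [1, 2, 3, 7, 5, 0]
```

Each integer id then indexes a row of the GloVe embedding matrix inside the network's embedding layer.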
D. Universal Sentence Encoder (USE)
The Universal Sentence Encoder was applied to the GitHub issue comments after fine-tuning on the SwDA dataset. The code to load the USE model is available at https://tfhub.dev/google/universal-sentence-encoder/1. We chose the USE transformer-based architecture model with three dense layers and a softmax activation function. The network was trained using the Adam optimizer.
E. USE+LSTM
We also combine the Universal Sentence Encoder with an LSTM. This model consists of input, embedding, convolution, LSTM, and one dense output layer.
F. Bidirectional Encoder Representations from Transformers (BERT)
This architecture was implemented using the bert_en_uncased_L-12_H-768_A-12 model from TensorFlow Hub. The model has 12 hidden layers (i.e., transformer blocks), a hidden size of 768, and 12 attention heads. As proposed by [6], we append a single dense layer to BERT.

TABLE IV: TRAINING, VALIDATION & TEST ACCURACY OF ALL THE MODELS
Model      | Acc    | Val Acc | Test Acc (GitHub)
GloVe+LSTM | 0.5089 | 0.5195  | 0.3714
Prob+LSTM  | 0.7672 | 0.7694  | 0.4412
USE        | 0.7247 | 0.6951  | 0.5071
USE+LSTM   | 0.3841 | 0.4257  | 0.4074
BERT       | 0.7151 | 0.7151  | 0.4063
V. EVALUATION
Table IV shows the performance of all five architectures. The Universal Sentence Encoder had the best performance on the GitHub issue comments, with a test accuracy of 50.71%, which is 6% higher than the accuracy achieved using the probabilistic sentence representation. The other three models showed significantly lower performance than USE, lagging by almost 10%. The probabilistic sentence representation approach exhibited the highest validation accuracy, 76.9%, which is significantly higher than USE's validation accuracy of 69.5%. The well-known BERT model had a validation accuracy of 71.5%, but a low test accuracy.

It is instructive to examine the performance differences between the best (USE) and second best (probabilistic representation) models. Figure 2 shows the confusion matrix of the classification results obtained using the USE model, and Figure 3 shows the confusion matrix of the probabilistic representation method. In both cases the most confused tag pair is sd (statement-non-opinion) and sv (statement-opinion). USE correctly classified 91.71% of the sd occurrences, while the probabilistic representation method only classified 76% correctly. On the other hand, USE classified 38.98% of the sv utterances correctly, while the probabilistic representation method classified 50% of them correctly.

Table V shows the precision, recall, and F score of our best performing architecture. In Table V, support represents the number of occurrences of each tag in the GitHub dataset. USE failed to classify the third most frequent tag, i.e., ad (action-directive), in the test dataset. The average precision of USE over all tags is 53%, the average recall is 51%, and the average F score is 42%. A difference of only 2% between precision and recall shows that the results of the model are consistent. BERT is one of the newest models for transfer learning; however, our results show that fine-tuning BERT does not improve performance much in comparison to the Universal Sentence Encoder.
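The per-tag precision, recall, and F score reported in Table V follow the standard definitions; a small self-contained sketch of computing them for one tag from gold and predicted label lists (illustrative data, not the paper's actual predictions) is:

```python
def tag_metrics(gold, pred, tag):
    """Precision, recall, and F1 for a single dialogue act tag."""
    tp = sum(1 for g, p in zip(gold, pred) if g == tag and p == tag)
    fp = sum(1 for g, p in zip(gold, pred) if g != tag and p == tag)
    fn = sum(1 for g, p in zip(gold, pred) if g == tag and p != tag)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy gold/predicted tag sequences.
gold = ["sd", "sd", "sv", "qw", "sd"]
pred = ["sd", "sv", "sv", "qw", "sd"]
p, r, f = tag_metrics(gold, pred, "sd")
print(round(p, 2), round(r, 2), round(f, 2))  # 1.0 0.67 0.8
```

Averaging these per-tag scores, weighted by support, yields the "avg / total" row of Table V.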
Prior work has shown that BERT does not benefit as much from fine-tuning as other embeddings [15].

VI. CONCLUSION
TABLE V: PRECISION, RECALL, & F SCORE OF ALL THE TAGS (USE)
Tag | Precision | Recall | F score | Support
sd | 0.51 | 0.92 | 0.66 | 350
b | 0.11 | 0.20 | 0.14 | 5
sv | 0.47 | 0.39 | 0.43 | 119
% | 0.00 | 0.00 | 0.00 | 4
aa | 1.00 | 0.05 | 0.09 | 21
ba | 0.67 | 0.21 | 0.32 | 29
qy | 0.80 | 0.66 | 0.72 | 61
ny | 0.00 | 0.00 | 0.00 | 6
fc | 0.04 | 0.50 | 0.07 | 2
qw | 0.48 | 0.58 | 0.52 | 19
nn | 0.00 | 0.00 | 0.00 | 1
bk | 0.00 | 0.00 | 0.00 | 0
h | 0.00 | 0.00 | 0.00 | 16
qy^d | 0.50 | 0.17 | 0.25 | 6
bh | 0.00 | 0.00 | 0.00 | 1
^q | 0.00 | 0.00 | 0.00 | 2
bf | 0.00 | 0.00 | 0.00 | 14
fo_o_fw_"_by_bc | 0.00 | 0.00 | 0.00 | 0
na | 0.00 | 0.00 | 0.00 | 4
ad | 0.67 | 0.02 | 0.04 | 108
b^m | 0.00 | 0.00 | 0.00 | 0
qo | 1.00 | 0.03 | 0.06 | 31
qh | 0.00 | 0.00 | 0.00 | 0
^h | 0.00 | 0.00 | 0.00 | 0
ar | 0.00 | 0.00 | 0.00 | 1
ng | 0.00 | 0.00 | 0.00 | 2
br | 0.00 | 0.00 | 0.00 | 0
no | 0.00 | 0.00 | 0.00 | 23
fp | 1.00 | 0.83 | 0.91 | 6
qrr | 0.00 | 0.00 | 0.00 | 0
arp_nd | 0.00 | 0.00 | 0.00 | 0
t3 | 0.00 | 0.00 | 0.00 | 0
oo_co_cc | 0.00 | 0.00 | 0.00 | 18
aap_am | 0.00 | 0.00 | 0.00 | 2
t1 | 0.00 | 0.00 | 0.00 | 0
bd | 0.00 | 0.00 | 0.00 | 0
^g | 0.00 | 0.00 | 0.00 | 0
qw^d | 0.00 | 0.00 | 0.00 | 0
fa | 1.00 | 0.14 | 0.25 | 7
ft | 0.00 | 0.00 | 0.00 | 0
avg / total | 0.53 | 0.51 | 0.42 | 859

Fig. 2. Confusion Matrix: Universal Sentence Encoder (all classes).

This paper demonstrates a dialogue act classification system for GitHub issue comments. Due to the lack of publicly available training sets of formal teamwork dialogues, we formulated the problem as a transfer learning task, using both sentence-level and word-level embedding models to leverage information from the SwDA dataset. A significant contribution of our work is identifying the embedding model that performs best after fine-tuning on issue comments. We used GloVe, probabilistic representation, USE, and BERT embeddings to train five different models. USE showed the best performance, with an accuracy of 50.71%. The low accuracy of USE on DA classification, as compared to its accuracy on other state-of-the-art NLP tasks, shows the complex nature of dialogue act classification. We evaluated many different settings for learning rates, epochs, and batch size; even though minor accuracy improvements were achievable, the performance of the embedding models remained fairly stable.

Our aim is to map issue comments to cognitive states in the Macrocognition in Teams Model (MITM) [16].
Drawing from research on externalized cognition, team cognition, group communication and problem solving, and collaborative learning and adaptation, MITM provides a coherent, theoretically based conceptualization for understanding complex team processes and how these emerge and change over time. MITM consists of five components: Team Problem-Solving Outcomes, Externalized Team Knowledge, Internalized Knowledge, Team Knowledge Building, and Individual Knowledge Building. It captures the parallel and iterative processes engaged by teams as they synthesize these components in service of team cognitive processes such as problem solving, decision making, and planning. MITM has been applied to other team problem solving scenarios in military logistics [17] and business planning [18], but has never been used to analyze software engineering teams. Its usage in the domain of software engineering would be a major research contribution to the field of team cognition.

Although it is possible to directly label issue comments using an MITM code book, that type of labeling would be less compatible with existing dialogue act datasets. Instead we are constructing a mapping that relates the DAMSL tagset to these cognitive states. For instance, the question tags in DAMSL clearly relate to information gathering processes. Also, many of the DAMSL classes are less relevant to the team cognition process and could be ignored. The most commonly occurring classes in the GitHub issue comments (statement-opinion, statement-non-opinion, agree/accept, acknowledge, and abandoned) are all relevant to the Macrocognition in Teams Model, and we plan to tune our dialogue act classifiers to bolster the performance on these classes.

Fig. 3. Confusion Matrix: Probabilistic representation+LSTM. This illustration only includes classes with the largest support. The classes shown are: sv=statement-opinion, sd=statement-non-opinion, aa=agree/accept, b=acknowledge, and %=abandoned.
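Such a tagset-to-state mapping can be sketched as a simple lookup table. The specific assignments below are illustrative assumptions, not the finalized code book:

```python
# Hypothetical mapping from DAMSL dialogue act tags to MITM-style
# cognitive states; these assignments are illustrative, not the
# finalized code book.
DA_TO_STATE = {
    "qy": "information_gathering",  # yes-no question
    "qw": "information_gathering",  # wh-question
    "qo": "information_gathering",  # open question
    "sd": "knowledge_building",     # statement-non-opinion
    "sv": "knowledge_building",     # statement-opinion
    "ad": "problem_solving",        # action-directive
}

def to_cognitive_state(da_tag):
    """Map a DAMSL tag to a cognitive state; tags outside the mapping
    (the classes less relevant to team cognition) fall through to 'other'."""
    return DA_TO_STATE.get(da_tag, "other")

tags = ["qw", "ad", "sd", "fp"]
print([to_cognitive_state(t) for t in tags])
# ['information_gathering', 'problem_solving', 'knowledge_building', 'other']
```

Applied to a classified issue thread, this kind of lookup would turn a sequence of dialogue acts into a sequence of team cognitive states.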
In future work, we will continue to improve the size and quality of our publicly released dataset by recruiting more annotators to help with the labeling task and by more systematically studying inter-coder reliability.

REFERENCES
[1] E. Guzman, D. Azócar, and Y. Li, "Sentiment analysis of commit comments in GitHub: An empirical study," in Proceedings of the Working Conference on Mining Software Repositories, 2014, pp. 352–355.
[2] A. Murgia, M. Ortu, P. Tourani, B. Adams, and S. Demeyer, "An exploratory qualitative and quantitative analysis of emotions in issue report comments of open source systems," Empirical Software Engineering, vol. 23, no. 1, pp. 521–564, 2018.
[3] M. Ortu, T. Hall, M. Marchesi, R. Tonelli, D. Bowes, and G. Destefanis, "Mining communication patterns in software development: A GitHub analysis," in International Conference on Predictive Models and Data Analytics in Software Engineering, 2018.
[4] J. Pennington, R. Socher, and C. D. Manning, "GloVe: Global vectors for word representation," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
[5] D. Cer, Y. Yang, S.-y. Kong, N. Hua, N. Limtiaco, R. S. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar et al., "Universal sentence encoder," arXiv preprint arXiv:1803.11175, 2018.
[6] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[7] M. R. Islam and M. F. Zibran, "Towards understanding and exploiting developers' emotional variations in software engineering," in IEEE International Conference on Software Engineering Research, Management and Applications (SERA), 2016, pp. 185–192.
[8] R. Kikas, M. Dumas, and D. Pfahl, "Using dynamic and contextual features to predict issue lifetime in GitHub projects," in IEEE/ACM Working Conference on Mining Software Repositories (MSR), 2016, pp. 291–302.
[9] X. Wang, M. Lee, A. Pinchbeck, and F. Fard, "Where does LDA sit for GitHub?" in IEEE/ACM International Conference on Automated Software Engineering Workshop (ASEW), 2019, pp. 94–97.
[10] B. Yang, X. Wei, and C. Liu, "Sentiments analysis in GitHub repositories: An empirical study," in Asia-Pacific Software Engineering Conference Workshops (APSECW), 2017, pp. 84–89.
[11] T. Saha, S. Saha, and P. Bhattacharyya, "Tweet act classification: A deep learning based classifier for recognizing speech acts in Twitter," in International Joint Conference on Neural Networks (IJCNN), IEEE, 2019, pp. 1–8.
[12] A. Stolcke, K. Ries, N. Coccaro, E. Shriberg, R. Bates, D. Jurafsky, P. Taylor, R. Martin, C. V. Ess-Dykema, and M. Meteer, "Dialogue act modeling for automatic tagging and recognition of conversational speech," Computational Linguistics, vol. 26, no. 3, pp. 339–373, 2000.
[13] Z. Chen, R. Yang, Z. Zhao, D. Cai, and X. He, "Dialogue act recognition via CRF-attentive structured network," in International ACM SIGIR Conference on Research & Development in Information Retrieval, 2018, pp. 225–234.
[14] N. Duran and S. Battle, "Probabilistic word association for dialogue act classification with recurrent neural networks," in International Conference on Engineering Applications of Neural Networks, Springer, 2018, pp. 229–239.
[15] D. Yu and Z. Yu, "MIDAS: A dialog act annotation scheme for open domain human machine spoken conversations," arXiv preprint arXiv:1908.10023, 2019.
[16] S. M. Fiore, K. A. Smith-Jentsch, E. Salas, N. Warner, and M. Letsky, "Toward an understanding of macrocognition in teams: Developing and defining complex collaborative processes and products," Theoretical Issues in Ergonomics Science, vol. 11, no. 4, pp. 250–271, 2010.
[17] S. Hutchins and T. Kendall, "Understanding cognition in team collaboration through use of communications analysis," Proceedings of the Human Factors and Ergonomics Society Annual Meeting, vol. 54, pp. 443–447, 2010.
[18] I. Seeber, R. Maier, and B. Weber, "Macrocognition in collaboration: Analyzing processes of team knowledge building with CoPrA,"