Towards Automatic Generation of Short Summaries of Commits
Siyuan Jiang and Collin McMillan
Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN, USA. Email: {sjiang1, cmc}@nd.edu
Abstract: Committing to a version control system means submitting a software change to the system. Each commit can have a message to describe the submission. Several approaches have been proposed to automatically generate the content of such messages. However, the quality of the automatically generated messages falls far short of what humans write. In studying the differences between auto-generated and human-written messages, we found that 82% of the human-written messages have only one sentence, while the automatically generated messages often have multiple lines. Furthermore, we found that the commit messages often begin with a verb followed by a direct object. This finding inspired us to use a “verb+object” format in this paper to generate short commit summaries. We split the approach into two parts: verb generation and object generation. As a first step, we trained a classifier to classify a diff into a verb. We are seeking feedback from the community before we continue to work on generating direct objects for the commits.
I. INTRODUCTION
A commit is the action of a software developer submitting a software change to a version control system. Commits can have commit messages, which are often written by developers to describe the changes. Commit messages are important because developers use them to review, validate, and understand the commits, but commit messages are sometimes non-informative or even empty [1].
To address this problem, automatic commit message generation techniques have been proposed. They often use program analysis and differencing techniques to generate summaries of changes [1]–[4]. These summaries are much shorter than the diff files (generated by differencing tools), but the summaries still tend to have multiple lines. Other techniques generate commit messages from other project documents. For example, Rastkar and Murphy proposed to generate the commit messages from user stories [5]. The summaries generated by these techniques are useful, but what is still missing is one-sentence summaries that convey the key ideas of commits.
The idea of generating one-sentence summaries is based on our exploratory data analysis of the two million commit messages that we present in this paper. We used natural language processing (NLP) techniques to analyze the text of the commit messages and found that the majority of the commit messages are only one sentence long, and nearly half of the commit messages begin with a verb followed by a direct object. This finding inspired us to design a method for generating commit summaries that are similar to what developers write: “verb + object”. These one-phrase summaries can be the leading sentences or the topics of the summaries generated by the existing techniques (Figure 1).
[Figure 1: diagram relating a commit message (content, intention) to a diff file obtained by text differencing techniques. Labels: an extended summary of the changes, such as the changed class names (ChangeScribe, DeltaDoc, ARENA, etc.); a summary of related documents, such as user stories (Rastkar and Murphy’s approach); a one-phrase summary (our approach).]
Fig. 1. Existing techniques for commit message generation compared to our approach
We divided our work into three parts: the exploratory data analysis, the verb generation, and the direct-object generation. We summarized the exploratory data analysis in the previous paragraph and describe it in Section IV. For the verb generation, we trained a Naive Bayes classifier to identify, based on diff files, the verbs that should be the key verbs in the commit messages. We also conducted a preliminary evaluation of the verb generation (Section V).
In this ERA paper, we present several open questions to the community that we hope will guide our future work, in particular the generation of direct objects for the verbs.
Our contributions include:
• Using NLP techniques to analyze the commit messages, which enables us to analyze a large set of the messages (which we release in our online appendix)
• Discovery of a common phrase structure that is used by software developers to write commit messages, and a program that automatically extracts such phrase structure
• A proposal that aims to generate one-sentence commit messages that convey the key ideas of commits
In the rest of this paper, we present a motivational example, the related work, the exploratory data analysis, the verb generation technique, and the future work.
Online Appendix
We put our scripts and results in our online appendix: http://nd.edu/~sjiang1/commitact
[Figure 2: three panels labeled “Diff file”, “DeltaDoc”, and “Our approach”; the “Our approach” panel shows “change the producer info”.]
Fig. 2. The diff file, the commit message generated by DeltaDoc [2], and the commit message that our approach aims to generate for Commit r3909 in iText (Section II)
II. EXAMPLE
In this section, we borrow the example of Commit r3909 in iText from the DeltaDoc paper [2]. The diff file of r3909 and the summary generated by DeltaDoc are shown in Figure 2. The size of the generated document is about half of the diff file, but it is still difficult to get the general idea at first glance. Similarly, ChangeScribe [4] also generates messages that are several lines long. What is missing is a leading sentence that summarizes all the changes in a commit.
Now consider the commit message that the developer wrote: “Changing the producer info.” This phrase contains the action of the commit, “change”, and the object of the action, “the producer info”. The developer can skim this phrase and understand what was changed in the commit.
Currently, our approach generates “change” for r3909, and in the future, we will have an approach to generate “the producer info”. The combination of the two approaches is going to generate phrases like “change the producer info”.
III. RELATED WORK
Our project has two parts: exploratory data analysis and commit message generation. Based on the two parts, we separate the related work into three categories: empirical studies about commit messages, empirical studies about diff files, and techniques that generate commit messages.
A. Empirical Studies about Commit Messages
Several empirical studies about commit messages have been conducted for commit classification and commit message generation [2], [3], [6]–[8]. For example, Moreno et al. [3] manually inspected the existing release notes before they designed an approach to generate release notes automatically. Buse and Weimer [2] conducted a similar manual inspection for automatic commit message generation. Like these previous studies, our exploratory data analysis aims to gain insights for our approach of generating commit messages.
Different from the previous studies, we used natural language processing (NLP) techniques, which help us to mine information from the existing commit messages automatically and confirm hypotheses on a large data set. Besides manual inspection, the previous studies also computed the sizes of commit messages and analyzed the messages as bags of words [7], [8]. In contrast, we are able to conduct grammar analysis on the commit messages. The grammar analysis led to a key finding that shaped our approach.
B. Empirical Studies about Commit Changes
There are many empirical studies about the changes in commits [6]–[9]. For example, Fluri et al. studied change types based on their syntax differencing technique [9]. Currently, we have not conducted an empirical study on the commit changes, but we plan to study the content of the diff files in the future. Instead of looking for change types, we will study whether there are overlapping words in the commit messages and their diff files and where we can locate the overlapping words in the diff files.
C. Commit Message Generation Techniques
A common way to generate commit messages is summarizing the code changes of a commit [2]–[4]. Many techniques use syntax differences to present code changes [2]–[4], [9]. Different from the existing techniques, we use diff files (generated by the git diff command) in our approach. Diff files are textual differences and easy to obtain. On the other hand, syntax differencing requires code parsing. Additionally, syntax differencing includes only code changes, while diff files contain other changes, such as comment and makefile changes. While the two differencing types have their own advantages and disadvantages, we chose to use the diff files as a first step, because they are easier to obtain.
To include context information in a commit message, several approaches consider the information outside the text or code changes of a commit [3], [5], [10]. While we agree that the context information is an important part of a commit message, our approach currently focuses on summarizing text changes into a short sentence to increase the readability and interpretability of a commit message.
Our approach to generating a verb for a commit is similar to the approach taken by Le et al. to link issue reports to commits [10]. Le et al. conducted textual similarity analysis between commit messages and issue reports, where they used term frequency-inverse document frequency (tf-idf) to represent commit messages and issue reports. We also used tf-idf, but we use it to represent the diff files instead of the commit messages.
[Figure 3: histogram. Axes: Number of sentences per commit; Frequency (number of commits).]
Fig. 3. Histogram of the number of sentences in the commit messages
IV. EXPLORATORY DATA ANALYSIS
We conducted an exploratory data analysis that is similar to the analysis done by Hattori and Lanza [7]. Hattori and Lanza found that most commits include few files and very few commits have hundreds of files. Likewise, we found that most commit messages have few sentences and few commit messages have more than ten sentences.
The Data Set
First, we obtained 967 commits from the work by Mauczka et al. [11]. Second, we obtained all the commits from the top 1,000 popular Java projects on GitHub (due to space limits, we put the details in our online appendix, Section I). Then, we filtered out the commit messages that are empty or have non-English letters. In the end, we obtained 2,027,734 commits.
Removing Special Commits
We excluded the rollback and merge commits from our analysis. Version control systems often provide automatic commit messages for rollbacks and merges, such as “merge commits X and Y”. From the two million commits, we removed nearly 400k rollbacks and merges by checking whether the commit messages begin with “merge” or “rollback”.
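The prefix check described above can be sketched in a few lines of Python. This is only an illustration, not the script from our online appendix: the function names are hypothetical, and ignoring letter case and leading whitespace is an assumption that the text does not state.

```python
def is_special_commit(message: str) -> bool:
    """Return True if the message starts with "merge" or "rollback".

    Assumption: the comparison ignores case and leading whitespace.
    """
    return message.strip().lower().startswith(("merge", "rollback"))


def remove_special_commits(messages):
    """Keep only the messages that do not look like merges or rollbacks."""
    return [m for m in messages if not is_special_commit(m)]


if __name__ == "__main__":
    sample = [
        "Merge branch 'dev' into master",
        "rollback r3908",
        "Fix NPE in parser",
    ]
    print(remove_special_commits(sample))
```

A prefix check of this kind is cheap enough for millions of messages, but it would also drop a hand-written message such as “Merge two helper classes”, which is a known limitation of the heuristic.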
Number of the Sentences
In the remaining 1.6 million commit messages, we counted the number of sentences in each commit message by using Stanford CoreNLP [12]. The majority of the commit messages have few sentences. 82% of the commit messages have only one sentence. Only 0.2% of the commit messages have more than ten sentences. Figure 3 shows the histogram of the number of sentences in the commit messages (excluding the messages that have more than ten sentences due to space limits).
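The counting step can be illustrated with the sketch below. It is a simplified stand-in and not our actual pipeline: we used the sentence splitter of Stanford CoreNLP, whereas this sketch assumes a naive rule (a sentence ends at ".", "!" or "?" followed by whitespace). The function names are hypothetical.

```python
import re
from collections import Counter

# Assumption: a sentence boundary is ".", "!" or "?" followed by whitespace.
# A real sentence splitter such as CoreNLP handles many more cases.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def count_sentences(message: str) -> int:
    """Count sentences in a commit message with the naive boundary rule."""
    parts = [p for p in _SENTENCE_BOUNDARY.split(message.strip()) if p]
    return len(parts)


def sentence_histogram(messages, max_sentences=10):
    """Map sentence count -> number of messages, up to max_sentences."""
    counts = Counter(count_sentences(m) for m in messages)
    return {n: counts[n] for n in sorted(counts) if n <= max_sentences}


if __name__ == "__main__":
    msgs = ["Changing the producer info.", "Fix bug. Add test.", "Update docs"]
    print(sentence_histogram(msgs))
```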
Grammar Analysis on the Commit Messages
We took two steps in the grammar analysis. First, we manually read 12 randomly sampled commit messages from the commits we obtained from Mauczka et al. [11]. In this step, we formed the hypothesis that “verb + object” is a common phrase structure in the commit messages. Second, to confirm the hypothesis, we used Stanford CoreNLP [12] to detect the verbs and their direct objects in the first sentences of the commit messages. In the 1.6 million messages, we found 763,826 messages (47% of the 1.6 million messages) whose first sentences begin with a verb and its direct object.
[Figure 4: bar chart. Axes: Verb types; Frequency (0 to 150,000). Verb types shown: upgrade, handle, replace, revert, set, implement, create, allow, rename, change, ignore, improve, prepare, make, move, use, update, remove, fix, add.]
Fig. 4. Histogram of the verbs in commit messages
TABLE I
THE VERB GROUPS
Id  Verb types                     Id  Verb types    Id  Verb types
1   add, create, make, implement   6   move, change  12  allow
2   fix                            7   prepare       13  set
3   remove                         8   improve       14  revert
4   update, upgrade                9   ignore        15  replace
5   use                            10  handle
                                   11  rename
V. CLASSIFYING DIFFS INTO VERB GROUPS
In this section, our goal is to generate a verb from a commit. We used diff files (i.e., textual differences) to represent the changes of commits because diff files can be easily obtained with the git diff command. Then we treated the problem of verb generation as a multiclass classification problem: classifying a diff file into one of the verb groups, where a verb group is a group of verbs that have similar meanings. As the first step, we define our verb groups in the following section.
A. Verb Groups
When we analyzed the phrase structures of the commit messages (Section IV), we retrieved for each commit a verb from the commit message. There are 763k verbs in total. We transformed the verbs into their lemmas and we called each distinct lemma a verb type. There are 4962 verb types in the 763k verbs. Figure 4 shows the histogram of the 20 most frequent verb types. Alali et al. [6] have reported a list of frequent words in commit messages, which overlaps with our frequent verb types.
From all the verb types, we considered only the 20 most frequent verb types, which cover 70% of the commit messages (537k commit messages). We grouped similar verb types by using a word embedding tool, which uses the word2vec method [13]. Finally, we manually inspected the grouped verbs and added “implement” to the group of “add”. There are 15 verb groups in total, which are shown in Table I. The first, third, and fifth columns list the ids of the verb groups.
Labeling
To label each diff file, we used the verb that we extracted from the commit message, and we labeled the diff file with the id of the verb group that includes the verb. Because the verb groups include only the 20 most frequent verb types, in this study we excluded the diffs that have other verbs. In total, we have 537k labeled diff files.
[Figure 5: flowchart with a training phase and a testing phase. Elements: diff files (training set), diff files (test set), computing tf-idf (in both phases), Naive Bayes classifier learning, classifier, classification, verb group labels, ground truth verb group labels, evaluation. Legend: data, process, model.]
Fig. 5. The overall approach
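The labeling rule can be written down as a lookup from a verb lemma to the verb group ids of Table I. The sketch below is illustrative only: the names `VERB_GROUPS`, `label_diff`, and `build_labeled_set` are hypothetical, and the input is assumed to be pairs of a diff text and an already lemmatized verb.

```python
# Verb groups as listed in Table I (group id -> verb types).
VERB_GROUPS = {
    1: ["add", "create", "make", "implement"],
    2: ["fix"],
    3: ["remove"],
    4: ["update", "upgrade"],
    5: ["use"],
    6: ["move", "change"],
    7: ["prepare"],
    8: ["improve"],
    9: ["ignore"],
    10: ["handle"],
    11: ["rename"],
    12: ["allow"],
    13: ["set"],
    14: ["revert"],
    15: ["replace"],
}

# Inverted index: verb type -> group id.
VERB_TO_GROUP = {
    verb: group_id for group_id, verbs in VERB_GROUPS.items() for verb in verbs
}


def label_diff(verb_lemma: str):
    """Return the verb group id for a lemma, or None if it is not in Table I."""
    return VERB_TO_GROUP.get(verb_lemma.lower())


def build_labeled_set(commits):
    """Label (diff_text, verb_lemma) pairs; drop diffs whose verb has no group."""
    labeled = []
    for diff_text, verb_lemma in commits:
        group_id = label_diff(verb_lemma)
        if group_id is not None:
            labeled.append((diff_text, group_id))
    return labeled
```

For the example of Section II, `label_diff("change")` gives group 6, while a verb outside the 20 most frequent types, such as "refactor", gives `None` and its diff is excluded.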
B. The Data Set
We removed the diff files that are larger than 1MB due to space limits. We also removed the diff files that contain non-ASCII characters. In the end, we have 509k labeled diff files. We randomly selected 3k diff files as the test set, and the rest of the diff files are used for training.
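A minimal sketch of this filtering and splitting step follows. The names are hypothetical, and two details are assumptions that the text leaves open: 1MB is taken to mean 2**20 bytes, and the random split uses a fixed seed so that it can be repeated.

```python
import random

MAX_DIFF_BYTES = 1024 * 1024  # assumption: "1MB" means 2**20 bytes


def keep_diff(diff_bytes: bytes) -> bool:
    """Keep a diff only if it is at most 1MB and contains only ASCII bytes."""
    if len(diff_bytes) > MAX_DIFF_BYTES:
        return False
    return diff_bytes.isascii()


def split_train_test(items, test_size, seed=0):
    """Randomly hold out test_size items; return (training set, test set)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]


if __name__ == "__main__":
    diffs = [b"+ int x = 1;", "+ caf\u00e9".encode("utf-8")]
    kept = [d for d in diffs if keep_diff(d)]
    train, test = split_train_test(kept, test_size=1)
    print(len(train), len(test))
```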
C. Overall Approach
The overall approach is shown in Figure 5. We chose a Naive Bayes classifier to classify the diff files into the verb groups. Before training the classifier on the diff files, we computed tf-idf (term frequency-inverse document frequency) for every word type (i.e., distinct word) as the features of the diff files. Tf-idf is a common textual feature that evaluates the importance of a word type by two factors: 1) the number of times the word type occurs in a diff file divided by the total number of words in the diff file, and 2) the number of times the word type occurs in all the diff files [10].
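To make the feature concrete, the sketch below computes tf-idf vectors for a list of diff texts in plain Python. Several choices are assumptions, since the exact formula is not given above: tokens are split on whitespace, the term frequency is the count divided by the number of tokens in the diff (factor 1 above), and the second factor uses the common textbook form log(N / df), where df is the number of diff files that contain the word type. The function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(diff_text: str):
    """Assumption: word types are whitespace-separated tokens."""
    return diff_text.split()


def compute_tfidf(diff_texts):
    """Return one {word type: tf-idf weight} dict per diff text."""
    docs = [tokenize(text) for text in diff_texts]
    n_docs = len(docs)

    # Document frequency: in how many diffs does each word type occur?
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return vectors


if __name__ == "__main__":
    for vector in compute_tfidf(["+ foo bar", "- foo baz"]):
        print(vector)
```

With this form, a word type that occurs in every diff file (such as "foo" in the usage example) gets weight zero, so tokens shared by all diffs contribute nothing to the classifier.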
D. Evaluation
The overall accuracy is 39%; the precision is 43%; and the recall is 39%. The classifier works best for verb groups 1 and 9. The precision for verb group 1 is 38% and the recall is 100%; the precision for verb group 9 is 100% and the recall is 41%. Although we trained the classifier with 15 verb groups, the classifier classified the test set into five verb groups and was not able to detect any of the other ten verb groups. We plan to improve our training approach by 1) trying other machine learning techniques, such as random forests [10]; 2) using SMOTE [14] to address the problem of the unbalanced data set (most of the diffs are labeled with verb group 1).
VI. DISCUSSION AND FUTURE WORK
In the process of this project, we have formed several potential research questions to be discussed at the conference. We hope the conversations at the conference will help in directing us towards answering these questions.
RQ1
What techniques are appropriate for generating direct objects for the commits? We observed that the direct objects often occur in the diff files. So one of our options is to use extractive summarization techniques to extract the “direct objects” from the diff files.
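As a starting point for RQ1, the observation that direct objects often occur in the diff files could be checked with a simple word-overlap test such as the sketch below. It is not an extractive summarization technique, only a hypothetical measurement helper: the function names, the tokenization by runs of letters, and the restriction to added and removed lines are all assumptions.

```python
import re


def changed_line_tokens(diff_text: str):
    """Collect lowercased alphabetic tokens from added and removed lines."""
    tokens = set()
    for line in diff_text.splitlines():
        if line.startswith(("+++ ", "--- ")):
            continue  # file header lines, not content changes
        if line.startswith(("+", "-")):
            tokens.update(t.lower() for t in re.findall(r"[A-Za-z]+", line))
    return tokens


def object_overlap(direct_object: str, diff_text: str):
    """Return the words of the direct object that also occur in the diff."""
    diff_tokens = changed_line_tokens(diff_text)
    return [w for w in direct_object.lower().split() if w in diff_tokens]


if __name__ == "__main__":
    # Hypothetical diff, invented for illustration (not the diff of r3909).
    diff = (
        "--- a/Doc.java\n"
        "+++ b/Doc.java\n"
        '-    String producer = "iText";\n'
        "+    String producer = getProducerInfo();\n"
    )
    print(object_overlap("the producer info", diff))
```

In this invented example only "producer" is found, because "getProducerInfo" is treated as a single token; splitting identifiers at case boundaries would be one possible refinement.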
RQ2
What machine learning models and features suit the verb-generation task better? To improve our verb-generation approach, we can try other classification methods, such as decision trees. Feature-wise, diff files follow a certain format and we can create some features to represent the characteristics of a diff file, for example, the number of “+” in a diff file.
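Format-based features of this kind could be extracted as in the sketch below, which assumes the unified diff format produced by git diff. The feature set and the function name are hypothetical examples, not features that we have evaluated.

```python
def diff_format_features(diff_text: str):
    """Count added lines, removed lines, hunks, and files in a unified diff.

    Simplification: a removed line whose content starts with "-- " would be
    mistaken for a "--- " file header and skipped.
    """
    added = removed = hunks = files = 0
    for line in diff_text.splitlines():
        if line.startswith("+++ "):
            files += 1  # one "+++ " header per changed file
        elif line.startswith("--- "):
            continue  # old-file header, counted via the "+++ " line
        elif line.startswith("@@"):
            hunks += 1
        elif line.startswith("+"):
            added += 1
        elif line.startswith("-"):
            removed += 1
    return {
        "added_lines": added,
        "removed_lines": removed,
        "hunks": hunks,
        "files": files,
    }


if __name__ == "__main__":
    sample = "\n".join([
        "diff --git a/A.java b/A.java",
        "--- a/A.java",
        "+++ b/A.java",
        "@@ -1,3 +1,4 @@",
        " class A {",
        "-    int x = 1;",
        "+    int x = 2;",
        "+    int y = 3;",
        " }",
    ])
    print(diff_format_features(sample))
```

Such counts could be appended to the tf-idf features; for instance, a diff with many removed lines and few added lines might hint at verb group 3 (“remove”), although this is a conjecture that would need to be tested.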
RQ3
To what extent are the short summaries useful? Although we think the short summaries are useful based on our experience, we need to conduct a study to confirm our hypothesis. Our current assumption is that the short summaries help developers understand a commit more quickly.
ACKNOWLEDGMENT
This work was partially supported by the NSF CCF-1452959 and CNS-1510329 grants, and the Office of Naval Research grant N000141410037. Any opinions, findings, and conclusions expressed herein are the authors’ and do not necessarily reflect those of the sponsors.
REFERENCES
[1] M. Linares-Vásquez, L. F. Cortés-Coy, J. Aponte, and D. Poshyvanyk, “Changescribe: A tool for automatically generating commit messages,” in ICSE ’15, vol. 2, May 2015, pp. 709–712.
[2] R. P. Buse and W. R. Weimer, “Automatically documenting program changes,” in ASE ’10, 2010, pp. 33–42.
[3] L. Moreno, G. Bavota, M. Di Penta, R. Oliveto, A. Marcus, and G. Canfora, “Automatic generation of release notes,” in Proceedings of the 2014 FSE, pp. 484–495.
[4] M. Linares-Vásquez, L. F. Cortés-Coy, J. Aponte, and D. Poshyvanyk, “Changescribe: A tool for automatically generating commit messages,” in , vol. 2, pp. 709–712.
[5] S. Rastkar and G. C. Murphy, “Why did this code change?” in Proceedings of the 2013 ICSE, ser. ICSE ’13, 2013, pp. 1193–1196.
[6] A. Alali, H. Kagdi, and J. I. Maletic, “What’s a typical commit? A characterization of open source software repositories,” in , pp. 182–191.
[7] L. P. Hattori and M. Lanza, “On the nature of commits,” in , pp. 63–71.
[8] A. Hindle, D. M. German, M. W. Godfrey, and R. C. Holt, “Automatic classification of large changes into maintenance categories,” in , pp. 30–39.
[9] B. Fluri, M. Wuersch, M. Pinzger, and H. Gall, “Change distilling: Tree differencing for fine-grained source code change extraction,” IEEE TSE, vol. 33, no. 11, pp. 725–743, 2007.
[10] T. D. B. Le, M. Linares-Vasquez, D. Lo, and D. Poshyvanyk, “Rclinker: Automated linking of issue reports and commits leveraging rich contextual information,” in , pp. 36–47.
[11] A. Mauczka, F. Brosch, C. Schanes, and T. Grechenig, “Dataset of developer-labeled commit messages,” ser. MSR ’15, pp. 490–493.
[12] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky, “The Stanford CoreNLP natural language processing toolkit,” in ACL System Demonstrations, 2014, pp. 55–60.
[13] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in NIPS, 2013, pp. 3111–3119.
[14] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “Smote: synthetic minority over-sampling technique,”