Towards Automatic Generation of Short Summaries of Commits

Abstract

Committing to a version control system means submitting a software change to the system. Each commit can have a message to describe the submission. Several approaches have been proposed to automatically generate the content of such messages. However, the quality of the automatically generated messages falls far short of what humans write. In studying the differences between auto-generated and human-written messages, we found that 82% of the human-written messages have only one sentence, while the automatically generated messages often have multiple lines. Furthermore, we found that the commit messages often begin with a verb followed by a direct object. This finding inspired us to use a "verb+object" format in this paper to generate short commit summaries. We split the approach into two parts: verb generation and object generation. As a first attempt, we trained a classifier to map a diff to a verb. We are seeking feedback from the community before we continue to work on generating direct objects for the commits.



Siyuan Jiang and Collin McMillan

Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN, USA
Email: {sjiang1, cmc}@nd.edu

I. INTRODUCTION

A commit is the action of software developers submitting a software change to a version control system. Commits can have commit messages, which are often written by developers to describe the changes. Commit messages are important because developers use them to review, validate, and understand the commits, but commit messages are sometimes non-informative or even empty [1].

To address this problem, automatic commit message generation techniques have been proposed. They often use program analysis and differencing techniques to generate summaries of changes [1]–[4]. These summaries are much shorter than the diff files (generated by differencing tools), but the summaries still tend to have multiple lines. Other techniques generate commit messages from other project documents. For example, Rastkar and Murphy proposed to generate the commit messages from user stories [5]. The summaries generated by these techniques are useful, but what is still missing is one-sentence summaries which convey the key ideas of commits.

The idea of generating one-sentence summaries is based on our exploratory data analysis on the two million commit messages that we present in this paper. We used natural language processing (NLP) techniques to analyze the text of the commit messages and found that the majority of the commit messages are only one sentence long, and nearly half of the commit messages begin with a verb followed by a direct object. This finding inspired us to design a method for generating commit summaries that are similar to what developers write: “verb + object”. These one-phrase summaries can be the leading sentences or the topics of the summaries generated by the existing techniques (Figure 1).

Fig. 1. Existing techniques for commit message generation compared to our approach. [Figure labels: a commit message; a diff file; text differencing techniques; content; intention; one-phrase summary (our approach); an extended summary of the changes, such as the changed class names (ChangeScribe, DeltaDoc, ARENA, etc.); summary of related documents, such as user stories (Rastkar and Murphy’s approach).]

We divided our work into three parts: the exploratory data analysis, the verb generation, and the direct-object generation. We summarized the exploratory data analysis in the previous paragraph and describe it in Section IV. For the verb generation, we trained a Naive Bayes classifier to identify, based on diff files, the verbs that should be the key verbs in the commit messages. We also conducted a preliminary evaluation of the verb generation in Section V.

In this ERA paper, we present several open questions to the community that we hope will guide our future work, and in particular the generation of direct objects for the verbs.

Our contributions include:
• Using NLP techniques to analyze the commit messages, which enables us to analyze a large set of the messages (which we release in our online appendix)
• Discovery of a common phrase structure that is used by software developers to write commit messages, and a program that automatically extracts such phrase structure
• A proposal that aims to generate one-sentence commit messages that convey the key ideas of commits

In the rest of this paper, we will present a motivational example, the related work, the exploratory data analysis, the verb generation technique, and the future work.

Online Appendix

We put our scripts and results on our online appendix: http://nd.edu/~sjiang1/commitact

Fig. 2. The diff file, the commit message generated by DeltaDoc [2], and the commit message that our approach aims to generate for Commit r3909 in iText (Section II). [Figure panels: Diff file; DeltaDoc; Our approach: “change the producer info”.]

II. EXAMPLE

In this section, we borrow the example of Commit r3909 in iText from the paper of DeltaDoc [2]. The diff file of r3909 and the summary generated by DeltaDoc are shown in Figure 2. The size of the generated document is about half of the diff file, but it is still difficult to get the general idea at first glance. Similarly, ChangeScribe [4] also generates messages that are several lines long. What is missing is a leading sentence that summarizes all the changes in a commit.

Now consider the commit message that the developer wrote: “Changing the producer info.” This phrase contains the action of the commit, “change”, and the object of the action, “the producer info”. The developer can skim this phrase and understand what was changed in the commit.

Currently, our approach generates “change” for r3909, and in the future, we will have an approach to generate “the producer info”. The combination of the two approaches is going to generate phrases like “change the producer info”.

III. RELATED WORK

Our project has two parts: exploratory data analysis and commit message generation. Based on the two parts, we separate the related work into three categories: empirical studies about commit messages, empirical studies about diff files, and techniques that generate commit messages.

A. Empirical Studies about Commit Messages

Several empirical studies about commit messages have been conducted for commit classification and commit message generation [2], [3], [6]–[8]. For example, Moreno et al. [3] manually inspected the existing release notes before they designed an approach to generate release notes automatically. Buse and Weimer [2] conducted a similar manual inspection for automatic commit message generation. Like these previous studies, our exploratory data analysis aims to gain insights for our approach of generating commit messages.

Different from the previous studies, we used natural language processing (NLP) techniques, which help us to mine information from the existing commit messages automatically and confirm hypotheses on a large data set. Besides manual inspection, the previous studies also computed the sizes of commit messages and analyzed the messages as bags of words [7], [8]. In contrast, we are able to conduct grammar analysis on the commit messages. The grammar analysis led to a key finding that shaped our approach.

B. Empirical Studies about Commit Changes

There are many empirical studies about the changes in commits [6]–[9]. For example, Fluri et al. studied change types based on their syntax differencing technique [9]. Currently, we have not conducted an empirical study on the commit changes, but we plan to study the content of the diff files in the future. Instead of looking for change types, we will study whether the commit messages and their diff files share overlapping words and where the overlapping words are located in the diff files.

C. Commit Message Generation Techniques

A common way to generate commit messages is summarizing code changes of a commit [2]–[4]. Many techniques use syntax differences to present code changes [2]–[4], [9]. Different from the existing techniques, we use diff files (generated by the git diff command) in our approach. Diff files are textual differences and easy to obtain. On the other hand, syntax differencing requires code parsing. Additionally, syntax differencing includes only code changes, while diff files contain other changes, such as comment and makefile changes. While the two differencing types have their own advantages and disadvantages, we chose to use the diff files as our first attempt, because they are easier to obtain.

To include context information in a commit message, several approaches consider the information outside the text or code changes of a commit [3], [5], [10]. While we agree that the context information is an important part of a commit message, our approach currently focuses on summarizing text changes into a short sentence to increase readability and interpretability of a commit message.

Our approach to generate a verb for a commit is similar to the approach taken by Le et al. to link issue reports to commits [10]. Le et al. conducted textual similarity analysis between commit messages and issue reports, where they used term frequency-inverse document frequency (tf-idf) to represent commit messages and issue reports. We also used tf-idf, but we use it to represent the diff files instead of the commit messages.

Fig. 3. Histogram of the number of sentences in the commit messages. [Axes: number of sentences per commit (horizontal); frequency, i.e., number of commits (vertical).]

IV. EXPLORATORY DATA ANALYSIS

We conducted an exploratory data analysis that is similar to the analysis done by Hattori and Lanza [7]. Hattori and Lanza found that most commits include few files and very few commits have hundreds of files. Likewise, we found that most commit messages have few sentences and few commit messages have more than ten sentences.

The Data Set

First, we obtained 967 commits from the work by Mauczka et al. [11]. Second, we obtained all the commits from the top 1,000 popular Java projects on GitHub (due to space limits, we put the details in our online appendix, Section I). Then, we filtered out the commit messages that are empty or have non-English letters. In the end, we obtained 2,027,734 commits.

Removing Special Commits

We excluded the rollback and merge commits from our analysis. Version control systems often provide automatic commit messages for rollbacks and merges, such as “merge commits X and Y”. From the two million commits, we removed nearly 400k rollbacks and merges by checking whether the commit messages begin with “merge” or “rollback”.
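The prefix check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the case-insensitive comparison and the stripping of leading whitespace are our guesses rather than details given in the paper.

```python
def is_special_commit(message: str) -> bool:
    """Return True if the message begins with "merge" or "rollback".

    Assumption: leading whitespace is ignored and the comparison is
    case-insensitive; the paper does not state either detail.
    """
    return message.strip().lower().startswith(("merge", "rollback"))


def remove_special_commits(messages):
    """Keep only the messages that are not merges or rollbacks."""
    return [m for m in messages if not is_special_commit(m)]


sample = ["Merge branch 'dev'", "Fix NPE in parser", "rollback to r12"]
kept = remove_special_commits(sample)
```

A prefix test of this kind is cheap enough to run over millions of messages, at the cost of also dropping hand-written messages that happen to start with one of the two words.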

Number of the Sentences

In the remaining 1.6 million commit messages, we counted the number of sentences in each commit message by using Stanford CoreNLP [12]. The majority of the commit messages have few sentences: 82% of the commit messages have only one sentence, and only 0.2% of the commit messages have more than ten sentences. Figure 3 shows the histogram of the number of sentences in the commit messages (excluding the messages that have more than ten sentences, due to space limits).

Grammar Analysis on the Commit Messages

We took two steps in the grammar analysis. First, we manually read 12 randomly sampled commit messages from the commits we obtained from Mauczka et al. [11]. In this step, we formed the hypothesis that “verb + object” is a common phrase structure in the commit messages. Second, to confirm the hypothesis, we used Stanford CoreNLP [12] to detect the verbs and their direct objects in the first sentences of the commit messages. In the 1.6 million messages, we found 763,826 messages (47% of the 1.6 million messages) where the first sentence begins with a verb and its direct object.

Fig. 4. Histogram of the verbs in commit messages. [Axes: frequency, from 0 to 150,000; verb types. Verb types shown: add, fix, remove, update, use, move, make, prepare, improve, ignore, change, rename, allow, create, implement, set, revert, replace, handle, upgrade.]

TABLE I
THE VERB GROUPS

Id  Verb types                     Id  Verb types      Id  Verb types
1   add, create, make, implement   6   move, change    12  allow
2   fix                            7   prepare         13  set
3   remove                         8   improve         14  revert
4   update, upgrade                9   ignore          15  replace
5   use                            10  handle
                                   11  rename

V. CLASSIFYING DIFFS INTO VERB GROUPS

In this section, our goal is to generate a verb from a commit. We used diff files (i.e., textual differences) to represent the changes of commits because diff files can be easily obtained by the git diff command. Then we treated the problem of verb generation as a multiclass classification problem: classifying a diff file into one of the verb groups, where a verb group is a group of verbs that have similar meanings. As the first step, we define our verb groups in the following section.

A. Verb Groups

When we analyzed phrase structures of the commit messages (Section IV), we retrieved for each commit a verb from the commit message. There are 763k verbs in total. We transformed the verbs into their lemmas and we called each distinct lemma a verb type. There are 4962 verb types in the 763k verbs. Figure 4 shows the histogram of the 20 most frequent verb types. Alali et al. [6] have reported a list of frequent words in commit messages, which overlaps with our frequent verb types.

From all the verb types, we considered only the 20 most frequent verb types, which cover 70% of the commit messages (537k commit messages). We grouped similar verb types by using a word embedding tool, which uses the word2vec method [13]. Finally, we manually inspected the grouped verbs and added “implement” to the group of “add”. There are 15 verb groups in total, which are shown in Table I. The first, third, and fifth columns list the ids of the verb groups.

Fig. 5. The overall approach. [Figure content: in the training phase, tf-idf is computed for the diff files of the training set and, together with the ground truth verb group labels, is used for Naive Bayes classifier learning; in the testing phase, tf-idf is computed for the diff files of the test set, the classifier assigns verb group labels, and these labels are evaluated against the ground truth verb group labels. Legend: data, process, model.]

Labeling

To label each diff file, we used the verb that we extracted from the commit message, and we labeled the diff file with the id of the verb group that includes the verb. Because the verb groups include only the 20 most frequent verb types, in this study we excluded the diffs that have other verbs. In total, we have 537k labeled diff files.
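The labeling step can be illustrated with a small lookup table built from Table I. The sketch below assumes that the verb lemma has already been extracted from the commit message (the paper uses Stanford CoreNLP for that step); the names VERB_GROUPS, VERB_TO_GROUP, and label_diff are hypothetical.

```python
# Verb groups copied from Table I: group id -> verb types in the group.
VERB_GROUPS = {
    1: ["add", "create", "make", "implement"],
    2: ["fix"],
    3: ["remove"],
    4: ["update", "upgrade"],
    5: ["use"],
    6: ["move", "change"],
    7: ["prepare"],
    8: ["improve"],
    9: ["ignore"],
    10: ["handle"],
    11: ["rename"],
    12: ["allow"],
    13: ["set"],
    14: ["revert"],
    15: ["replace"],
}

# Inverted index: verb type -> group id.
VERB_TO_GROUP = {
    verb: group_id for group_id, verbs in VERB_GROUPS.items() for verb in verbs
}


def label_diff(verb_lemma):
    """Return the verb group id for a lemma, or None if the diff is excluded."""
    return VERB_TO_GROUP.get(verb_lemma)
```

A diff whose commit message starts with a verb outside the 20 frequent verb types gets the label None and would be dropped from the data set, which mirrors the exclusion described above.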

B. The Data Set

We removed the diff files that are larger than 1MB due to space limits. We also removed the diff files that have non-ASCII codes. In the end, we have 509k labeled diff files. We randomly selected 3k diff files as the test set, and the rest of the diff files are used for training.
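A minimal sketch of this filtering and splitting step follows. Several details are assumptions rather than facts from the paper: 1MB is taken to mean 2**20 bytes, a diff is treated as a bytes object, the random split uses a fixed seed for reproducibility, and all names are hypothetical.

```python
import random

MAX_DIFF_BYTES = 1024 * 1024  # assumption: "1MB" read as 2**20 bytes


def keep_diff(diff_bytes: bytes) -> bool:
    """Keep a diff only if it is at most 1MB and contains ASCII bytes only."""
    return len(diff_bytes) <= MAX_DIFF_BYTES and diff_bytes.isascii()


def split_train_test(items, test_size, seed=0):
    """Shuffle a copy of the items and hold out test_size of them for testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]


train, test = split_train_test(range(10), test_size=3)
```

In the setting described above, test_size would be 3,000 and the remaining labeled diff files would form the training set.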

C. Overall Approach

The overall approach is shown in Figure 5. We chose a Naive Bayes classifier to classify the diff files into the verb groups. Before training the classifier on the diff files, we computed tf-idf (term frequency-inverse document frequency) for every word type (i.e., distinct word) as the features of the diff files. Tf-idf is a common textual feature that evaluates the importance of a word type by two factors: 1) the number of times the word type occurs in a diff file divided by the total number of words in the diff file, and 2) the number of times the word type occurs in all the diff files [10].
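To make the pipeline concrete, here is a small, dependency-free sketch of tf-idf features feeding a Naive Bayes classifier. It is not the authors' implementation. The whitespace tokenizer, the smoothed inverse document frequency log((1 + N) / (1 + df)) + 1, the multinomial variant of Naive Bayes, and the Laplace smoothing constant alpha are all assumptions chosen for illustration, and the four training "diffs" are toy data.

```python
import math
from collections import Counter, defaultdict


def tokenize(diff_text):
    """Assumption: word types are whitespace-separated tokens."""
    return diff_text.split()


def fit_idf(docs):
    """Smoothed inverse document frequency for every word type in docs."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf(tokens, idf):
    """tf = count / total words in the diff; unseen word types are skipped."""
    counts = Counter(tokens)
    total = len(tokens)
    return {t: (c / total) * idf[t] for t, c in counts.items() if t in idf}


class NaiveBayes:
    """Multinomial Naive Bayes over real-valued (tf-idf) feature weights."""

    def fit(self, vectors, labels, alpha=1.0):
        self.vocab = {t for v in vectors for t in v}
        class_counts = Counter(labels)
        self.log_prior = {
            c: math.log(k / len(labels)) for c, k in class_counts.items()
        }
        weight = {c: defaultdict(float) for c in class_counts}
        for v, y in zip(vectors, labels):
            for t, w in v.items():
                weight[y][t] += w
        self.log_like = {}
        for c in class_counts:
            denom = sum(weight[c].values()) + alpha * len(self.vocab)
            self.log_like[c] = {
                t: math.log((weight[c][t] + alpha) / denom) for t in self.vocab
            }
        return self

    def predict(self, vector):
        def score(c):
            return self.log_prior[c] + sum(
                w * self.log_like[c][t]
                for t, w in vector.items()
                if t in self.vocab
            )

        return max(self.log_prior, key=score)


# Toy data: (diff text, verb group id), with group 1 = "add" and 2 = "fix".
TRAIN = [
    ("+ add new file", 1),
    ("+ add new class", 1),
    ("- fix null bug", 2),
    ("- fix crash bug", 2),
]
train_tokens = [tokenize(text) for text, _ in TRAIN]
idf = fit_idf(train_tokens)
clf = NaiveBayes().fit(
    [tfidf(t, idf) for t in train_tokens], [y for _, y in TRAIN]
)
pred_add = clf.predict(tfidf(tokenize("+ add new method"), idf))
pred_fix = clf.predict(tfidf(tokenize("- fix memory bug"), idf))
```

On real data, a library implementation would normally replace these hand-written classes; the sketch only shows how the two tf-idf factors combine into a feature weight and how the classifier picks the verb group with the highest posterior score.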

D. Evaluation

The overall accuracy is 39%; the precision is 43%; and the recall is 39%. The classifier works best for verb groups 1 and 9. The precision for verb group 1 is 38% and the recall is 100%; the precision for verb group 9 is 100% and the recall is 41%. Although we trained the classifier with 15 verb groups, the classifier classified the test set into five verb groups and was not able to detect any of the other ten verb groups. We plan to improve our training approach by 1) trying other machine learning techniques, such as random forests [10]; 2) using SMOTE [14] to address the problem of the unbalanced data set (most of the diffs are labeled with verb group 1).

VI. DISCUSSION AND FUTURE WORK

In the process of this project, we have formed several potential research questions to be discussed at the conference. We hope the conversations at the conference will help direct us towards answering these questions.

RQ1

What techniques are appropriate for generating direct objects for the commits? We observed that the direct objects often occur in the diff files. So one of our options is to use extractive summarization techniques to extract the “direct objects” from the diff files.
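One simple way to probe the observation behind RQ1 is to check which words of a direct object also appear in the changed lines of the diff. The sketch below is an illustration only, not a proposed technique from the paper: the identifier-splitting regular expression, the restriction to added and removed lines, and the function names are all assumptions.

```python
import re

# Assumption: camelCase identifiers are split into their component words.
WORD_RE = re.compile(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])")


def changed_line_words(diff_text):
    """Lowercased words from added/removed lines, skipping file headers."""
    words = set()
    for line in diff_text.splitlines():
        if line.startswith(("+++", "---")):
            continue  # file header lines, not changed content
        if line.startswith(("+", "-")):
            words.update(w.lower() for w in WORD_RE.findall(line[1:]))
    return words


def object_words_in_diff(object_phrase, diff_text):
    """Return the words of the phrase that also occur in the changed lines."""
    diff_words = changed_line_words(diff_text)
    return [w for w in object_phrase.lower().split() if w in diff_words]


example_diff = (
    "--- a/Doc.java\n"
    "+++ b/Doc.java\n"
    "@@ -1 +1 @@\n"
    '-  String producer = "iText";\n'
    "+  String producerInfo = getProducer();\n"
)
found = object_words_in_diff("the producer info", example_diff)
```

For this made-up diff, the content words "producer" and "info" are recovered from the changed lines, while the article "the" is not.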

RQ2

What machine learning models and features suit the verb-generation task better? To improve our verb-generation approach, we can try other classification methods, such as decision trees. Feature-wise, diff files follow a certain format, and we can create some features to represent the characteristics of a diff file, for example, the number of “+” in a diff file.
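As an illustration of such format-based features, the sketch below counts added lines, removed lines, hunks, and files in a unified diff as produced by git diff. The feature set and the function name are hypothetical, and the line-prefix rules are a simplification: a removed line whose own content starts with "--" would be mistaken for a file header.

```python
def diff_shape_features(diff_text):
    """Count simple structural features of a unified diff."""
    added = removed = hunks = files = 0
    for line in diff_text.splitlines():
        if line.startswith(("+++", "---")):
            continue  # file header lines are not changed content
        if line.startswith("diff --git"):
            files += 1
        elif line.startswith("@@"):
            hunks += 1
        elif line.startswith("+"):
            added += 1
        elif line.startswith("-"):
            removed += 1
    return {"added": added, "removed": removed, "hunks": hunks, "files": files}


sample_diff = """diff --git a/A.java b/A.java
--- a/A.java
+++ b/A.java
@@ -1,3 +1,3 @@
 class A {
-  int x = 1;
+  int x = 2;
+  int y = 3;
 }
"""
features = diff_shape_features(sample_diff)
```

Features like these could be appended to the tf-idf vector, for instance to test whether diffs with only added lines tend to belong to the “add” verb group.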

RQ3

To what extent are the short summaries useful? Although we think the short summaries are useful based on our experience, we need to conduct a study to confirm our hypothesis. Our current assumption is that the short summaries help developers understand a commit more quickly.

ACKNOWLEDGMENT

This work was partially supported by the NSF CCF-1452959 and CNS-1510329 grants, and the Office of Naval Research grant N000141410037. Any opinions, findings, and conclusions expressed herein are the authors’ and do not necessarily reflect those of the sponsors.

REFERENCES

[1] M. Linares-Vásquez, L. F. Cortés-Coy, J. Aponte, and D. Poshyvanyk, “Changescribe: A tool for automatically generating commit messages,” in ICSE ’15, vol. 2, May 2015, pp. 709–712.
[2] R. P. Buse and W. R. Weimer, “Automatically documenting program changes,” in ASE ’10, 2010, pp. 33–42.
[3] L. Moreno, G. Bavota, M. Di Penta, R. Oliveto, A. Marcus, and G. Canfora, “Automatic generation of release notes,” in Proceedings of the 2014 FSE, pp. 484–495.
[4] M. Linares-Vásquez, L. F. Cortés-Coy, J. Aponte, and D. Poshyvanyk, “Changescribe: A tool for automatically generating commit messages,” in , vol. 2, pp. 709–712.
[5] S. Rastkar and G. C. Murphy, “Why did this code change?” in Proceedings of the 2013 ICSE, ser. ICSE ’13, 2013, pp. 1193–1196.
[6] A. Alali, H. Kagdi, and J. I. Maletic, “What’s a typical commit? A characterization of open source software repositories,” in , pp. 182–191.
[7] L. P. Hattori and M. Lanza, “On the nature of commits,” in , pp. 63–71.
[8] A. Hindle, D. M. German, M. W. Godfrey, and R. C. Holt, “Automatic classification of large changes into maintenance categories,” in , pp. 30–39.
[9] B. Fluri, M. Wuersch, M. Pinzger, and H. Gall, “Change distilling: Tree differencing for fine-grained source code change extraction,” IEEE TSE, vol. 33, no. 11, pp. 725–743, 2007.
[10] T. D. B. Le, M. Linares-Vasquez, D. Lo, and D. Poshyvanyk, “Rclinker: Automated linking of issue reports and commits leveraging rich contextual information,” in , pp. 36–47.
[11] A. Mauczka, F. Brosch, C. Schanes, and T. Grechenig, “Dataset of developer-labeled commit messages,” ser. MSR ’15, pp. 490–493.
[12] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky, “The Stanford CoreNLP natural language processing toolkit,” in ACL System Demonstrations, 2014, pp. 55–60.
[13] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in NIPS, 2013, pp. 3111–3119.
[14] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “Smote: synthetic minority over-sampling technique,”