Sci. China Inf. Sci. | 2021

Predicting accepted pull requests in GitHub


Abstract


Various open source software hosting sites, such as GitHub, support pull-based development and allow developers to make contributions flexibly and efficiently [1]. In GitHub, contributors fork the main repository of a project and change the code independently. When a set of changes is ready, contributors create pull requests and submit them to the main repository. Any developer can comment on a pull request and exchange opinions about it [2, 3]. Developers freely discuss whether the code style meets the quality standard, whether the repository requires modification, or whether the submitted code is of high quality, and contributors can revise their code according to these comments. Members of the core team of a project, hereafter referred to as integrators, are responsible for inspecting the submitted code changes, identifying issues (e.g., vulnerabilities), and deciding whether to accept pull requests and merge the code changes into the main repository [1].

Integrators act as the guardians of project quality. As a project gains popularity, the volume of incoming pull requests increases, heavily burdening the integrators [1]. A previous study [4] examined integrator recommendation for pull requests. Beyond recommending integrators, however, inspecting code changes still consumes much of the integrators' time and effort. An approach that predicts accepted pull requests can help integrators either enforce an immediate rejection of deficient code changes or allocate more resources to overcome the deficiency. Integrators can thus avoid wasting time on the inspection of code changes that will be rejected and focus on more promising ones that will eventually be merged.

In this study, we propose an accepted pull request prediction approach named XGPredict, which builds an XGBoost classifier on the training dataset of a project to predict whether pull requests will be accepted. First, we extract code, text, and project features. The code features measure the characteristics of the modified code; the text features are extracted from the natural language descriptions of pull requests, such as the title and body; and the project features mainly characterize the recent development history of the project. All of these features can be extracted automatically and immediately when a pull request is submitted. Based on these features, we leverage the XGBoost classifier to predict accepted pull requests. XGPredict computes the acceptance and rejection probabilities of a pull request: if the acceptance probability is higher than or equal to the rejection probability, the pull request is predicted as accepted; otherwise, it is predicted as rejected. XGPredict not only returns the prediction result but also provides the acceptance and rejection probabilities, which helps integrators make decisions during code review.

In the following, the XGBoost classifier is introduced first. Then, the code, text, and project features are described. Finally, the framework of the accepted pull request prediction approach, XGPredict, is presented.
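As a concrete illustration of the decision rule above, the following Python sketch trains an XGBoost classifier on feature vectors and compares the predicted acceptance and rejection probabilities. The feature values and hyperparameters are hypothetical, not the exact setup of XGPredict; only the accept-if-p(accept) ≥ p(reject) rule comes from the approach described here.

```python
import numpy as np
import xgboost as xgb

# Hypothetical training data: each row is a pull request described by
# code, text, and project features (the feature choice here is illustrative,
# not the exact feature set of XGPredict).
X_train = np.array([
    [3, 120, 0.8, 15],   # e.g., changed files, added lines, text score, team size
    [1,  10, 0.2,  3],
    [7, 500, 0.9, 40],
])
y_train = np.array([1, 0, 1])  # 1 = accepted, 0 = rejected

# Train an XGBoost classifier on the project's history (assumed hyperparameters).
clf = xgb.XGBClassifier(n_estimators=100, max_depth=4)
clf.fit(X_train, y_train)

# Score a newly submitted pull request.
X_new = np.array([[2, 80, 0.7, 10]])
p_reject, p_accept = clf.predict_proba(X_new)[0]

# Decision rule from the paper: accepted iff p_accept >= p_reject.
prediction = "accepted" if p_accept >= p_reject else "rejected"
print(f"p(accept)={p_accept:.2f}, p(reject)={p_reject:.2f} -> {prediction}")
```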
XGBoost. XGBoost (extreme gradient boosting) is a supervised machine learning algorithm that implements a process called boosting to yield accurate models [5]. The input of XGBoost comprises pairs of training examples $(x_0, y_0), (x_1, y_1), \ldots, (x_n, y_n)$, where $x$ is a vector of features describing an example and $y$ is its label. The output of XGBoost is the predicted value $\hat{y}$. XGBoost aims to minimize a loss function $L$, which measures the difference between $\hat{y}$ and $y$. To do so, it uses a decision tree boosting algorithm that sequentially builds an ensemble of decision trees. When growing a tree, XGBoost uses a greedy algorithm to choose, as the node at which to split, the feature that achieves the minimum loss $L$ given the current tree structure. Rather than enumerating all possible tree structures $q$, the greedy algorithm starts from a single leaf and iteratively adds branches to the tree. Assume that $I_L$ and $I_R$ are the instance sets of the left and right nodes after a split, and let $I = I_L \cup I_R$. Let $G_L$ and $G_R$ be the sums of the first-order gradients of the loss function over $I_L$ and $I_R$, respectively, and $H_L$ and $H_R$ the corresponding sums of second-order gradients. $\lambda$ is the regularization penalty on leaf weights, and $\gamma$ is the complexity cost of introducing a new leaf node. The loss reduction obtained by the split is then

$$\mathcal{L}_{\mathrm{split}} = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma.$$

XGBoost uses the formula specified above to evaluate the candidate splits and keeps the split that yields the largest loss reduction.
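A minimal numeric sketch of this split criterion follows; the gradient statistics are made-up values chosen only to show the arithmetic, not numbers from the paper.

```python
# Split-gain evaluation from the formula above.
# G_L, G_R: sums of first-order gradients over the left/right instance sets;
# H_L, H_R: corresponding sums of second-order gradients;
# lam: regularization penalty (lambda); gamma: cost of adding a leaf node.
def split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.1):
    return 0.5 * (G_L**2 / (H_L + lam)
                  + G_R**2 / (H_R + lam)
                  - (G_L + G_R)**2 / (H_L + H_R + lam)) - gamma

# Made-up gradient statistics for one candidate split (not from the paper).
print(split_gain(G_L=-4.0, H_L=3.0, G_R=2.5, H_R=2.0))  # ~2.75
# A positive gain means the split reduces the loss enough to justify the new
# leaf; the greedy algorithm keeps the candidate with the largest gain.
```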

Sci. China Inf. Sci., Volume 64 (2021). DOI: 10.1007/S11432-018-9823-4
