Archive | 2019

Learning Code Transformations via Neural Machine Translation

 

Abstract


Source code evolves – inevitably – to remain useful, secure, correct, readable, and efficient. Developers perform software evolution and maintenance activities by transforming existing source code via corrective, adaptive, perfective, and preventive changes. These code changes are usually managed and stored by a variety of tools and infrastructures such as version control, issue trackers, and code review systems. Software evolution and maintenance researchers have been mining these code archives in order to distill useful insights regarding the nature of developers’ activities. One of the long-standing goals of software engineering research is to better support and automate the different types of code changes performed by developers. In this thesis we depart from classic manually crafted rule- or heuristic-based approaches, and propose a novel technique to learn code transformations by leveraging the vast amount of publicly available code changes performed by developers. We rely on Deep Learning, and in particular on Neural Machine Translation (NMT), to train models able to learn code change patterns and apply them to novel, unseen source code. First, we tackle the problem of generating source code mutants for Mutation Testing. In contrast to classic approaches, which rely on handcrafted mutation operators, we propose to automatically learn how to mutate source code by observing real faults. We mine millions of bug-fixing commits from GitHub, and then process and abstract this source code. This data is used to train and evaluate an NMT model to translate fixed code into buggy code (i.e., the mutated code). In the second project, we rely on the same dataset of bug-fixes to learn code transformations for the purpose of Automated Program Repair (APR). APR is one of the most challenging research problems in Software Engineering; its goal is to automatically fix bugs without developers’ intervention.
We train a model to translate buggy code into fixed code (i.e., learning patches) and, in conjunction with beam search, generate many different potential patches for a given buggy method. In our empirical investigation we found that such a model is able to fix thousands of unique buggy methods in the wild. Finally, in our third project we push our novel technique to its limits by enlarging the scope to consider not only bug-fixing activities, but any type of meaningful code change performed by developers. We focus on accepted and merged code changes that have undergone a Pull Request (PR) process. We quantitatively and qualitatively investigate the code transformations learned by the model to build a taxonomy. The taxonomy shows that NMT can replicate a wide variety of meaningful code changes, especially refactorings and bug-fixing activities. In this dissertation we illustrate and evaluate the proposed techniques, which represent a significant departure from earlier approaches in the literature. The promising results corroborate the potential applicability of learning techniques, such as NMT, to a variety of Software Engineering tasks.
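The abstract mentions using beam search to generate many candidate patches for one buggy method. The following is a minimal Python sketch of generic beam search over token sequences, not the dissertation's implementation. The names `beam_search`, `toy_next_token_probs`, and `TOY_MODEL` are hypothetical, and the hand-written probability table only stands in for a trained NMT decoder.

```python
import math
from typing import Callable, Dict, List, Tuple

EOS = "<eos>"  # assumed end-of-sequence marker

Hypothesis = Tuple[float, List[str]]  # (log-probability, tokens)


def beam_search(
    next_token_probs: Callable[[List[str]], Dict[str, float]],
    start: List[str],
    beam_width: int,
    max_len: int,
) -> List[Hypothesis]:
    """Return up to beam_width highest-scoring token sequences."""
    beams: List[Hypothesis] = [(0.0, list(start))]
    finished: List[Hypothesis] = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates: List[Hypothesis] = []
        for logp, toks in beams:
            for tok, p in next_token_probs(toks).items():
                if p > 0:
                    candidates.append((logp + math.log(p), toks + [tok]))
        if not candidates:
            break
        # Keep only the beam_width best expansions.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for logp, toks in candidates[:beam_width]:
            if toks[-1] == EOS:
                finished.append((logp, toks))
            else:
                beams.append((logp, toks))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    finished.sort(key=lambda c: c[0], reverse=True)
    return finished[:beam_width]


# Hypothetical stand-in for a decoder: next-token distribution
# conditioned only on the previous token.
TOY_MODEL: Dict[str, Dict[str, float]] = {
    "<sos>": {"if": 1.0},
    "if": {"(": 1.0},
    "(": {"x": 1.0},
    "x": {">": 0.6, ">=": 0.3, "!=": 0.1},
    ">": {"0": 1.0},
    ">=": {"0": 1.0},
    "!=": {"0": 1.0},
    "0": {")": 1.0},
    ")": {EOS: 1.0},
}


def toy_next_token_probs(prefix: List[str]) -> Dict[str, float]:
    return TOY_MODEL.get(prefix[-1], {EOS: 1.0})


patches = beam_search(toy_next_token_probs, ["<sos>"], beam_width=3, max_len=10)
for logp, toks in patches:
    print(round(math.exp(logp), 2), " ".join(toks[1:-1]))
```

With a beam width of 3 the sketch returns three candidate token sequences that differ in the comparison operator, ranked by probability; with a beam width of 1 it reduces to greedy decoding and returns only the most likely one.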

DOI 10.21220/S2-1Z1J-D577
Language English
