Publication


Featured research published by Miltiadis Allamanis.


foundations of software engineering | 2014

Learning natural coding conventions

Miltiadis Allamanis; Earl T. Barr; Christian Bird; Charles A. Sutton

Every programmer has a characteristic style, ranging from preferences about identifier naming to preferences about object relationships and design patterns. Coding conventions define a consistent syntactic style, fostering readability and hence maintainability. When collaborating, programmers strive to obey a project’s coding conventions. However, one third of reviews of changes contain feedback about coding conventions, indicating that programmers do not always follow them and that project members care deeply about adherence. Unfortunately, programmers are often unaware of coding conventions because inferring them requires a global view, one that aggregates the many local decisions programmers make and identifies emergent consensus on style. We present NATURALIZE, a framework that learns the style of a codebase, and suggests revisions to improve stylistic consistency. NATURALIZE builds on recent work in applying statistical natural language processing to source code. We apply NATURALIZE to suggest natural identifier names and formatting conventions. We present four tools focused on ensuring natural code during development and release management, including code review. NATURALIZE achieves 94% accuracy in its top suggestions for identifier names. We used NATURALIZE to generate 18 patches for 5 open source projects: 14 were accepted.
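
As a rough illustration of the statistical idea behind NATURALIZE (not the authors' implementation), the sketch below trains a smoothed trigram language model over code tokens and ranks candidate identifier names by how probable they make the surrounding token context; tokenization and candidate generation are assumed to happen elsewhere.

```python
# A rough sketch of the statistical idea behind NATURALIZE (not the authors'
# implementation): a smoothed trigram model over code tokens ranks candidate
# identifier names by the probability of the surrounding token context.
import math
from collections import Counter

def train_trigram_model(token_streams):
    """Count trigrams and bigrams over tokenized source files."""
    trigrams, bigrams = Counter(), Counter()
    for tokens in token_streams:
        padded = ["<s>", "<s>"] + tokens + ["</s>"]
        for i in range(len(padded) - 2):
            trigrams[tuple(padded[i:i + 3])] += 1
            bigrams[tuple(padded[i:i + 2])] += 1
    return trigrams, bigrams

def log_prob(tokens, trigrams, bigrams, alpha=0.1, vocab_size=10_000):
    """Log-probability of a token sequence under add-alpha smoothing."""
    padded = ["<s>", "<s>"] + tokens + ["</s>"]
    logp = 0.0
    for i in range(len(padded) - 2):
        num = trigrams[tuple(padded[i:i + 3])] + alpha
        den = bigrams[tuple(padded[i:i + 2])] + alpha * vocab_size
        logp += math.log(num / den)
    return logp

def rank_names(before, after, candidates, trigrams, bigrams):
    """Rank candidate identifier names for a single renaming site."""
    return sorted(candidates,
                  key=lambda name: log_prob(before + [name] + after,
                                            trigrams, bigrams),
                  reverse=True)
```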


mining software repositories | 2013

Mining source code repositories at massive scale using language modeling

Miltiadis Allamanis; Charles A. Sutton

The tens of thousands of high-quality open source software projects on the Internet raise the exciting possibility of studying software development by finding patterns across truly large source code repositories. This could enable new tools for developing code, encouraging reuse, and navigating large projects. In this paper, we build the first giga-token probabilistic language model of source code, based on 352 million lines of Java. This is 100 times the scale of the pioneering work by Hindle et al. The giga-token model is significantly better at the code suggestion task than previous models. More broadly, our approach provides a new “lens” for analyzing software projects, enabling new complexity metrics based on statistical analysis of large corpora. We call these metrics data-driven complexity metrics. We propose new metrics that measure the complexity of a code module and the topical centrality of a module to a software project. In particular, it is possible to distinguish reusable utility classes from classes that are part of a program’s core logic based solely on general information theoretic criteria.
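
A minimal sketch of what such a data-driven complexity metric could look like: the average per-token cross-entropy of a file under a language model trained on a large corpus. The paper uses a giga-token n-gram model; a unigram model stands in here, and the tokenize() helper in the usage comment is an assumption.

```python
# Data-driven complexity sketch: average per-token cross-entropy of a file
# under a corpus-trained model. Easily predicted (boilerplate/utility) files
# score low; surprising files score high. Unigram model as a stand-in.
import math
from collections import Counter

def train_unigram(corpus_tokens):
    counts = Counter(corpus_tokens)
    return counts, sum(counts.values())

def cross_entropy(file_tokens, counts, total, alpha=1.0):
    """Average bits per token with add-alpha smoothing."""
    vocab = len(counts) + 1
    bits = 0.0
    for tok in file_tokens:
        p = (counts[tok] + alpha) / (total + alpha * vocab)
        bits -= math.log2(p)
    return bits / max(len(file_tokens), 1)

# Usage sketch (tokenize() is assumed): rank a project's files by "surprise".
# counts, total = train_unigram(tok for f in training_files for tok in tokenize(f))
# ranked = sorted(project_files,
#                 key=lambda f: cross_entropy(tokenize(f), counts, total),
#                 reverse=True)
```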


foundations of software engineering | 2015

Suggesting accurate method and class names

Miltiadis Allamanis; Earl T. Barr; Christian Bird; Charles A. Sutton

Descriptive names are a vital part of readable, and hence maintainable, code. Recent progress on automatically suggesting names for local variables tantalizes with the prospect of replicating that success with method and class names. However, suggesting names for methods and classes is much more difficult. This is because good method and class names need to be functionally descriptive, but suggesting such names requires the model to go beyond local context. We introduce a neural probabilistic language model for source code that is specifically designed for the method naming problem. Our model learns which names are semantically similar by assigning them to locations, called embeddings, in a high-dimensional continuous space, in such a way that names with similar embeddings tend to be used in similar contexts. These embeddings seem to contain semantic information about tokens, even though they are learned only from statistical co-occurrences of tokens. Furthermore, we introduce a variant of our model that is, to our knowledge, the first that can propose neologisms, names that have not appeared in the training corpus. We obtain state-of-the-art results on the method, class, and even the simpler variable naming tasks. More broadly, the continuous embeddings that are learned by our model have the potential for wide application within software engineering.
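
The scoring idea can be illustrated with a small numpy sketch: names and context tokens share one continuous embedding space, and candidate method names are ranked by how close their embeddings lie to a representation of the method's context. The embeddings below are random placeholders (in the paper they are learned from token co-occurrences), and the neologism-generating variant is not shown.

```python
# Ranking candidate method names by embedding similarity to the method context.
# Random vectors stand in for learned embeddings, so the printed ranking is
# arbitrary; with real embeddings, names used in similar contexts rank highest.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["read", "file", "bytes", "stream", "readAllBytes", "closeQuietly"]
emb = {tok: rng.normal(size=64) for tok in vocab}  # placeholder embeddings

def context_vector(context_tokens):
    """Average the embeddings of the tokens in and around the method body."""
    return np.mean([emb[t] for t in context_tokens if t in emb], axis=0)

def rank_candidate_names(context_tokens, candidates):
    ctx = context_vector(context_tokens)
    cosine = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return sorted(candidates, key=lambda name: cosine(emb[name], ctx), reverse=True)

print(rank_candidate_names(["read", "file", "bytes", "stream"],
                           ["readAllBytes", "closeQuietly"]))
```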


foundations of software engineering | 2014

Mining idioms from source code

Miltiadis Allamanis; Charles A. Sutton

We present the first method for automatically mining code idioms from a corpus of previously written, idiomatic software projects. We take the view that a code idiom is a syntactic fragment that recurs across projects and has a single semantic purpose. Idioms may have metavariables, such as the body of a for loop. Modern IDEs commonly provide facilities for manually defining idioms and inserting them on demand, but this does not help programmers to write idiomatic code in languages or using libraries with which they are unfamiliar. We present Haggis, a system for mining code idioms that builds on recent advances in statistical natural language processing, namely nonparametric Bayesian probabilistic tree substitution grammars. We apply Haggis to several of the most popular open source projects from GitHub. We present a wide range of evidence that the resulting idioms are semantically meaningful, demonstrating that they do indeed recur across software projects and that they occur more frequently in illustrative code examples collected from a Q&A site. Manual examination of the most common idioms indicates that they describe important program concepts, including object creation, exception handling, and resource management.
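
As a crude, purely frequency-based proxy for the idiom-mining idea (Haggis itself infers a nonparametric Bayesian tree substitution grammar, and mines Java rather than Python), the sketch below enumerates small AST fragments with Python's ast module, abstracts away concrete identifiers, and keeps fragments that recur across several distinct projects.

```python
# Frequency-based proxy for idiom mining: structural AST fragments that recur
# across projects, with identifiers/literals abstracted like metavariables.
import ast
from collections import defaultdict

def fragment_signature(node, depth=3):
    """Structural signature of an AST subtree up to a fixed depth."""
    if depth == 0:
        return "_"
    children = [fragment_signature(c, depth - 1) for c in ast.iter_child_nodes(node)]
    return f"{type(node).__name__}({','.join(children)})"

def mine_fragments(project_sources):
    """project_sources: dict of project name -> list of source-code strings."""
    seen_in = defaultdict(set)
    for project, sources in project_sources.items():
        for src in sources:
            for node in ast.walk(ast.parse(src)):
                if isinstance(node, (ast.For, ast.With, ast.Try, ast.If)):
                    seen_in[fragment_signature(node)].add(project)
    # Candidate idioms: fragments seen in the most distinct projects.
    return sorted(seen_in.items(), key=lambda kv: len(kv[1]), reverse=True)
```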


mining software repositories | 2013

Why, when, and what: Analyzing Stack Overflow questions by topic, type, and code

Miltiadis Allamanis; Charles A. Sutton

Questions from Stack Overflow provide a unique opportunity to gain insight into what programming concepts are the most confusing. We present a topic modeling analysis that combines question concepts, types, and code. Using topic modeling, we are able to associate programming concepts and identifiers (like the String class) with particular types of questions, such as “how to perform encoding”.
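
A small sketch of the general approach using scikit-learn's LDA implementation (not the exact model from the paper): fit a topic model over question text and inspect the top terms per topic. The example questions are made up for illustration.

```python
# Topic modeling over question text with LDA; top terms per topic hint at the
# programming concepts that questions cluster around.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

questions = [
    "How do I convert a String to UTF-8 encoding in Java?",
    "Why does my for loop throw an IndexOutOfBoundsException?",
    "How to parse a JSON response into a Python dict?",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(questions)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top_terms = [terms[i] for i in topic.argsort()[-5:][::-1]]
    print(f"topic {k}: {top_terms}")
```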


ACM Computing Surveys | 2018

A Survey of Machine Learning for Big Code and Naturalness

Miltiadis Allamanis; Earl T. Barr; Premkumar T. Devanbu; Charles A. Sutton

Research at the intersection of machine learning, programming languages, and software engineering has recently taken important steps in proposing learnable probabilistic models of source code that exploit the abundance of patterns of code. In this article, we survey this work. We contrast programming languages against natural languages and discuss how these similarities and differences drive the design of probabilistic models. We present a taxonomy based on the underlying design principles of each model and use it to navigate the literature. Then, we review how researchers have adapted these models to application areas and discuss cross-cutting and application-specific challenges and opportunities.


IEEE Transactions on Software Engineering | 2017

Autofolding for Source Code Summarization

Jaroslav M. Fowkes; Pankajan Chanthirasegaran; Razvan Ranca; Miltiadis Allamanis; Mirella Lapata; Charles A. Sutton

Developers spend much of their time reading and browsing source code, raising new opportunities for summarization methods. Indeed, modern code editors provide code folding, which allows one to selectively hide blocks of code. However, this is impractical to use, as folding decisions must be made manually or based on simple rules. We introduce the autofolding problem, which is to automatically create a code summary by folding less informative code regions. We present a novel solution by formulating the problem as a sequence of AST folding decisions, leveraging a scoped topic model for code tokens. On an annotated set of popular open source projects, we show that our summarizer outperforms simpler baselines, yielding a 28 percent error reduction. Furthermore, we find through a case study that our summarizer is strongly preferred by experienced developers. More broadly, we hope this work will aid program comprehension by turning code folding into a usable and valuable tool.
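
A greedy heuristic sketch of the autofolding idea, with a simple corpus-based salience score standing in for the paper's scoped topic model over AST regions: regions dominated by common, uninformative tokens are folded first, until the remaining summary fits a line budget.

```python
# Greedy autofolding sketch: fold the least salient regions until the visible
# summary fits within a line budget. Salience here is a simple information
# score against a background corpus, not the paper's scoped topic model.
import math

def salience(region_tokens, background_counts, total):
    """Average information content of a region's tokens w.r.t. the corpus."""
    if not region_tokens:
        return 0.0
    vocab = len(background_counts) + 1
    bits = 0.0
    for tok in region_tokens:
        p = (background_counts.get(tok, 0) + 1) / (total + vocab)
        bits -= math.log2(p)
    return bits / len(region_tokens)

def autofold(regions, background_counts, total, budget_lines):
    """regions: list of (name, tokens, line_count); returns names to fold."""
    ranked = sorted(regions, key=lambda r: salience(r[1], background_counts, total))
    folded, visible_lines = [], sum(r[2] for r in regions)
    for name, _, lines in ranked:
        if visible_lines <= budget_lines:
            break
        folded.append(name)
        visible_lines -= lines
    return folded
```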


international conference on software engineering | 2016

TASSAL: autofolding for source code summarization

Jaroslav M. Fowkes; Pankajan Chanthirasegaran; Razvan Ranca; Miltiadis Allamanis; Mirella Lapata; Charles A. Sutton

We present a novel tool, TASSAL, that automatically creates a summary of each source file in a project by folding its least salient code regions. The intended use case for our tool is the first-look problem: helping developers who are unfamiliar with a new codebase and are attempting to understand it. TASSAL aids developers in this task by folding away less informative regions of code, allowing them to focus their efforts on the most informative ones. While modern code editors do provide code folding to selectively hide blocks of code, it is impractical to use, as folding decisions must be made manually or based on simple rules. We find through a case study that TASSAL is strongly preferred by experienced developers over simple folding baselines, demonstrating its usefulness. In short, we strongly believe TASSAL can aid program comprehension by turning code folding into a usable and valuable tool. A video highlighting the main features of TASSAL can be found at https://youtu.be/_yu7JZgiBA4.


international conference on adaptive and natural computing algorithms | 2013

Effective Rule-Based Multi-label Classification with Learning Classifier Systems

Miltiadis Allamanis; Fani A. Tzima; Pericles A. Mitkas

In recent years, multi-label classification has attracted a significant body of research, motivated by real-life applications such as text classification and medical diagnoses. However, rule-based methods, and especially Learning Classifier Systems (LCS), for tackling such problems have only been sparsely studied. This is the motivation behind our current work that introduces a generalized multi-label rule format and uses it as a guide for further adapting the general Michigan-style LCS framework. The resulting LCS algorithm is thoroughly evaluated and found competitive to other state-of-the-art multi-label classification methods.
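
A toy sketch of what a generalized multi-label rule and fitness-weighted voting at prediction time might look like. The ternary encoding here ('#' = don't care / abstain) is an illustrative assumption, and the rule-discovery and Michigan-style evolutionary machinery of the actual LCS are omitted.

```python
# Toy multi-label rule matching and voting; not the paper's exact encoding.
def condition_matches(condition, instance):
    """Ternary condition over binary features: '0', '1', or '#' (don't care)."""
    return all(c == "#" or c == x for c, x in zip(condition, instance))

def predict(rules, instance, n_labels):
    """Aggregate fitness-weighted votes from all matching rules."""
    votes = [0.0] * n_labels
    for condition, labels, fitness in rules:
        if condition_matches(condition, instance):
            for i, lab in enumerate(labels):
                if lab == "1":
                    votes[i] += fitness      # rule asserts the label
                elif lab == "0":
                    votes[i] -= fitness      # rule asserts its absence
    return [i for i, v in enumerate(votes) if v > 0]

# A hypothetical rule population over 3 binary features and 2 labels.
rules = [("1#0", "1#", 0.9), ("##1", "01", 0.6)]
print(predict(rules, "110", n_labels=2))   # -> [0]
```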


international conference on machine learning | 2016

A Convolutional Attention Network for Extreme Summarization of Source Code

Miltiadis Allamanis; Hao Peng; Charles A. Sutton

Collaboration


Dive into Miltiadis Allamanis's collaborations.

Top Co-Authors

Earl T. Barr

University College London
