Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Joshua Charles Campbell is active.

Publication


Featured research published by Joshua Charles Campbell.


Mining Software Repositories | 2014

Syntax errors just aren't natural: improving error reporting with language models

Joshua Charles Campbell; Abram Hindle; José Nelson Amaral

A frustrating aspect of software development is that compiler error messages often fail to locate the actual cause of a syntax error. An errant semicolon or brace can result in many errors reported throughout the file. We seek to find the actual source of these syntax errors by relying on the consistency of software: valid source code is usually repetitive and unsurprising. We exploit this consistency by constructing a simple N-gram language model of lexed source code tokens. We implemented an automatic Java syntax-error locator using the corpus of the project itself and evaluated its performance on mutated source code from several projects. Our tool, trained on the past versions of a project, can effectively augment the syntax error locations produced by the native compiler. Thus we provide a methodology and tool that exploits the naturalness of software source code to detect syntax errors alongside the parser.
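As a rough illustration of the idea (this is not the authors' UnnaturalCode tool, and the token data below is invented), the sketch trains an add-one-smoothed trigram model over lexed token streams and reports the window of tokens the model finds most surprising, which is where a syntax error most likely hides.

```python
# Minimal sketch: a trigram language model over lexed tokens used to point
# at the most "surprising" region of a token stream. Illustrative only.
import math
from collections import Counter
from typing import List, Sequence

def train_trigram_model(corpus: Sequence[List[str]]):
    """Count trigrams and bigram contexts over lexed token streams."""
    trigrams, contexts, vocab = Counter(), Counter(), set()
    for tokens in corpus:
        padded = ["<s>", "<s>"] + tokens + ["</s>"]
        vocab.update(padded)
        for i in range(len(padded) - 2):
            ctx = (padded[i], padded[i + 1])
            trigrams[ctx + (padded[i + 2],)] += 1
            contexts[ctx] += 1
    return trigrams, contexts, len(vocab)

def token_entropy(tokens: List[str], model) -> List[float]:
    """Per-token negative log-probability (bits) under the trigram model."""
    trigrams, contexts, vocab_size = model
    padded = ["<s>", "<s>"] + tokens
    scores = []
    for i in range(2, len(padded)):
        ctx = (padded[i - 2], padded[i - 1])
        p = (trigrams[ctx + (padded[i],)] + 1) / (contexts[ctx] + vocab_size)  # add-one smoothing
        scores.append(-math.log2(p))
    return scores

def most_surprising_window(tokens: List[str], model, width: int = 5) -> int:
    """Return the start index of the window with the highest mean entropy."""
    scores = token_entropy(tokens, model)
    starts = range(max(1, len(scores) - width + 1))
    return max(starts, key=lambda i: sum(scores[i:i + width]) / len(scores[i:i + width]))

# Hypothetical usage: train on known-good token streams, query a broken one.
good = [["if", "(", "x", ")", "{", "y", "(", ")", ";", "}"]] * 20
broken = ["if", "(", "x", ")", ")", "{", "y", "(", ")", ";", "}"]
model = train_trigram_model(good)
print("suspicious window starts at token", most_surprising_window(broken, model))
```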


Mining Software Repositories | 2013

Deficient documentation detection: a methodology to locate deficient project documentation using topic analysis

Joshua Charles Campbell; Chenlei Zhang; Zhen Xu; Abram Hindle; James Miller

A project's documentation is the primary source of information for developers using that project. With hundreds of thousands of programming-related questions posted on programming Q&A websites, such as Stack Overflow, we question whether developer-written documentation provides enough guidance for programmers. In this study, we wanted to know if there are any topics which are inadequately covered by the project documentation. We combined questions from Stack Overflow and documentation from the PHP and Python projects. Then, we applied topic analysis to this data using latent Dirichlet allocation (LDA), and found topics in Stack Overflow that did not overlap with the project documentation. We successfully located topics that had deficient project documentation. We also found topics in need of tutorial documentation that were outside of the scope of the PHP or Python projects, such as MySQL and HTML.
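The general pipeline can be approximated with off-the-shelf tooling. The sketch below is a minimal illustration, not the authors' pipeline: it fits scikit-learn's LDA implementation to a handful of invented documentation/Q&A-style snippets and prints the top words per topic.

```python
# Minimal LDA topic-extraction sketch; the corpus and parameters are illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "connect to mysql database using pdo and prepared statements",
    "render html form and validate user input on submit",
    "read a file into a string and split it into lines",
    "mysql query returns empty result set after insert",
    "escape html output to prevent cross site scripting",
    "open file for writing and append text to it",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=3, random_state=0)
lda.fit(counts)

terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:5]
    print(f"topic {k}:", ", ".join(terms[i] for i in top))
```

In a study like the one described above, topics fitted to the combined corpus could then be compared against the topics present in the project documentation alone to spot gaps.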


The Art and Science of Analyzing Software Data | 2015

Latent Dirichlet Allocation: Extracting Topics from Software Engineering Data

Joshua Charles Campbell; Abram Hindle; Eleni Stroulia

Topic analysis is a powerful tool that extracts “topics” from document collections. Unlike manual tagging, which is effort intensive and requires expertise in the documents’ subject matter, topic analysis (in its simplest form) is an automated process. Relying on the assumption that each document in a collection refers to a small number of topics, it extracts bags of words attributable to these topics. These topics can be used to support document retrieval or to relate documents to each other through their associated topics. Given the variety and amount of textual information included in software repositories, in issue reports, in commit and source-code comments, and in other forms of documentation, this method has found many applications in the software-engineering field of mining software repositories. This chapter provides an overview of the theory underlying latent Dirichlet allocation (LDA), the most popular topic-analysis method today. Next, it illustrates, with a brief tutorial introduction, how to employ LDA on a textual data set. Third, it reviews the software-engineering literature for uses of LDA for analyzing textual software-development assets, in order to support developers’ activities. Finally, we discuss the interpretability of the automatically extracted topics, and their correlation with tags provided by subject-matter experts.

We propose a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams [6], and Hofmann’s aspect model, also known as probabilistic latent semantic indexing (pLSI) [3]. In the context of text modeling, our model posits that each document is generated as a mixture of topics, where the continuous-valued mixture proportions are distributed as a latent Dirichlet random variable. Inference and learning are carried out efficiently via variational algorithms. We present empirical results on applications of this model to problems in text modeling, collaborative filtering, and text classification.
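The generative story summarized above can be stated compactly. The following is the standard (smoothed) LDA generative process, included only as a reference formulation rather than anything specific to this chapter:

```latex
% Standard (smoothed) LDA generative process: K topics, D documents,
% document d has N_d words; alpha and eta are Dirichlet hyperparameters.
\begin{align*}
\beta_k   &\sim \operatorname{Dirichlet}(\eta)            & k &= 1,\dots,K \\
\theta_d  &\sim \operatorname{Dirichlet}(\alpha)          & d &= 1,\dots,D \\
z_{d,n}   &\sim \operatorname{Multinomial}(\theta_d)      & n &= 1,\dots,N_d \\
w_{d,n}   &\sim \operatorname{Multinomial}(\beta_{z_{d,n}})
\end{align*}
```

Only the words $w_{d,n}$ are observed; the per-document topic mixtures $\theta_d$ and per-topic word distributions $\beta_k$ are inferred, e.g. via the variational algorithms mentioned above.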


Mining Software Repositories | 2016

The unreasonable effectiveness of traditional information retrieval in crash report deduplication

Joshua Charles Campbell; Eddie Antonio Santos; Abram Hindle

Organizations like Mozilla, Microsoft, and Apple are flooded with thousands of automated crash reports per day. Although crash reports contain valuable information for debugging, there are often too many for developers to examine individually. Therefore, in industry, crash reports are often automatically grouped together in buckets. Ubuntu's repository contains crashes from hundreds of software systems available with Ubuntu. A variety of crash report bucketing methods are evaluated using data collected by Ubuntu's Apport automated crash reporting system. The trade-off between precision and recall of numerous scalable crash deduplication techniques is explored. A set of criteria that a crash deduplication method must meet is presented, and several methods that meet these criteria are evaluated on a new dataset. The evaluations presented in this paper show that off-the-shelf information retrieval techniques, which were not designed to be used with crash reports, outperform other techniques that are specifically designed for the task of crash bucketing at realistic industrial scales. This research indicates that automated crash bucketing still has a lot of room for improvement, especially in terms of identifier tokenization.
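As a rough sketch of what "off-the-shelf information retrieval" can look like for this task (this is not the paper's exact method, and the stack traces and threshold are invented), the example below indexes stack traces with TF-IDF and assigns an incoming crash to the bucket of its nearest neighbour when the cosine similarity clears a threshold.

```python
# Minimal IR-style crash deduplication sketch: each stack trace is treated as a
# "document" of frame identifiers. Illustrative data and threshold only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

known_crashes = [
    "g_malloc g_slice_alloc gtk_widget_show main",
    "malloc_consolidate free g_free gtk_widget_destroy main",
    "strlen strcpy parse_config load_settings main",
]
buckets = [0, 0, 1]  # bucket id already assigned to each known crash

vectorizer = TfidfVectorizer(token_pattern=r"\S+")
index = vectorizer.fit_transform(known_crashes)

def assign_bucket(stack_trace: str, threshold: float = 0.3) -> int:
    """Return the bucket of the most similar known crash, or open a new bucket."""
    query = vectorizer.transform([stack_trace])
    sims = cosine_similarity(query, index)[0]
    best = sims.argmax()
    return buckets[best] if sims[best] >= threshold else max(buckets) + 1

print(assign_bucket("g_malloc g_slice_alloc gtk_widget_show gtk_main main"))
```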


PeerJ | 2017

Finding and correcting syntax errors using recurrent neural networks

Eddie Antonio Santos; Joshua Charles Campbell; Abram Hindle; José Nelson Amaral

Minor syntax errors are made by novice and experienced programmers alike; however, novice programmers lack the years of intuition that help them resolve these tiny errors. Standard LR parsers typically resolve syntax errors and their precise location poorly. We propose a methodology that not only helps locate where syntax errors occur, but also suggests possible changes to the token stream that can fix the error identified. This methodology finds syntax errors by checking whether two language models “agree” on each token. If the models disagree, it indicates a possible syntax error; the methodology tries to suggest a fix by finding an alternative token sequence obtained from the models. We trained two LSTM (long short-term memory) language models on a large corpus of JavaScript code collected from GitHub. The dual LSTM neural network model predicts the correct location of the syntax error 54.74% of the time within its top 4 suggestions and produces an exact fix up to 35.50% of the time. The results show that this tool and methodology can locate and suggest corrections for syntax errors. Our methodology is of practical use to all programmers, but will be especially useful to novices frustrated with incomprehensible syntax errors.
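The core "two models must agree" idea can be illustrated without neural networks. In the sketch below, trivial bigram models stand in for the paper's forward and backward LSTMs purely so the example is self-contained; a token that both directions find improbable is flagged as a likely error location. All data is invented.

```python
# Minimal sketch of the agreement check between a forward and a backward
# language model. Bigram counts substitute for the paper's LSTMs.
from collections import Counter, defaultdict
from typing import List, Sequence

def train_bigram(corpus: Sequence[List[str]], reverse: bool = False):
    """Estimate P(next | prev) by counting, optionally over reversed streams."""
    counts = defaultdict(Counter)
    for tokens in corpus:
        stream = list(reversed(tokens)) if reverse else tokens
        for prev, nxt in zip(["<s>"] + stream, stream + ["</s>"]):
            counts[prev][nxt] += 1
    return counts

def prob(counts, prev: str, token: str) -> float:
    total = sum(counts[prev].values())
    return (counts[prev][token] + 1) / (total + 1000)  # crude smoothing

def flag_disagreements(tokens: List[str], fwd, bwd, cutoff: float = 0.01):
    """Positions where BOTH the forward and backward model find the token unlikely."""
    flagged = []
    for i, tok in enumerate(tokens):
        prev_fwd = tokens[i - 1] if i > 0 else "<s>"
        prev_bwd = tokens[i + 1] if i + 1 < len(tokens) else "<s>"
        if prob(fwd, prev_fwd, tok) < cutoff and prob(bwd, prev_bwd, tok) < cutoff:
            flagged.append(i)
    return flagged

good = [["var", "x", "=", "f", "(", "y", ")", ";"]] * 50
fwd = train_bigram(good)
bwd = train_bigram(good, reverse=True)
# A stray ";" inserted before the call's "(" is improbable in both directions.
print(flag_disagreements(["var", "x", "=", "f", ";", "(", "y", ")", ";"], fwd, bwd))
```

A fix-suggestion step, as described above, would then try deleting, replacing, or inserting tokens at the flagged position and keep the variant both models score highest.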


PeerJ | 2016

Anatomy of a crash repository.

Joshua Charles Campbell; Eddie Antonio Santos; Abram Hindle

This work investigates the properties of crash reports collected from Ubuntu Linux users. Understanding crash reports is important to better store, categorize, prioritize, parse, triage, assign bugs to, and potentially synthesize them. Understanding what is in a crash report, and how the metadata and stack traces in crash reports vary, will help solve, debug, and prevent the causes of crashes. Ten different aspects of 40,592 crash reports about 1,921 pieces of software, submitted by users and developers to the Ubuntu project, were analyzed and plotted, and statistical distributions were fitted to some of them. We investigated the structure and properties of crash reports. Crashes have many properties that seem to follow distributions similar to standard statistical distributions, but with even longer tails than expected. These aspects of crash reports have not been analyzed statistically before. We found that many applications only had a single crash, while a few applications had a large number of crashes reported. Crash bucket size (clusters of similar crashes) also followed a Zipf-like distribution. The lifespan of buckets ranged from less than an hour to over four years. Some stack traces were short, and some were so long they were truncated by the tool that produced them. Many crash reports had no recursion, some contained recursion, and some displayed evidence of unbounded recursion. The linguistics literature hints that sentence length follows a gamma distribution; this is not the case for function name length. Additionally, only two hardware architectures and a few signals are reported for almost all of the crashes in the Ubuntu dataset. Many crashes were similar, but there were also many unique crashes. This study of crashes from 1,921 projects will be valuable for anyone who wishes to cluster or deduplicate crash reports, synthesize or simulate crash reports, store or triage crash reports, or data-mine crash reports.
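As a small illustration of checking for a Zipf-like shape (this is not the paper's statistical analysis, and the bucket sizes are invented), the sketch below estimates the slope of the rank-size relation in log-log space.

```python
# Minimal rank-size sketch: a Zipf-like law predicts size ~ rank^(-s).
import numpy as np

bucket_sizes = np.array([900, 420, 260, 150, 90, 60, 41, 30, 22, 15, 9, 5, 3, 2, 1])
ranks = np.arange(1, len(bucket_sizes) + 1)

# Fit s by least squares on the logarithms of rank and size.
slope, intercept = np.polyfit(np.log(ranks), np.log(bucket_sizes), 1)
print(f"estimated Zipf exponent s ~ {-slope:.2f}")
```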


PeerJ | 2015

Error location in Python: where the mutants hide

Joshua Charles Campbell; Abram Hindle; José Nelson Amaral

Dynamic scripting programming languages present a unique challenge to software engineering tools that depend on static analysis. Dynamic languages do not benefit from the full lexical and syntax analysis provided by compilers and static analysis tools. Prior work exploited a statically typed language (Java) and a simple n-gram language model to find syntax-error locations in programs. This work investigates whether n-gram-based error location on source code written in a dynamic language is effective without static analysis or compilation. UnnaturalCode.py is a syntax-error locator developed for the Python programming language. The UnnaturalCode.py approach is effective on Python code, but faces significantly more challenges than its Java counterpart did. UnnaturalCode.py generalizes the success of previous statically-typed approaches to a dynamically-typed language.
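One concrete difference from the Java setting is that the lexer itself can fail on broken input. The sketch below (not the UnnaturalCode.py implementation) lexes Python source into abstract token-kind names of the sort an n-gram model could consume, returning None when lexing itself fails.

```python
# Minimal sketch: turn Python source into a stream of token-kind names.
# Unlike a Java lexer, Python's tokenize module can raise on malformed input,
# which is one of the extra challenges described above.
import io
import tokenize

def lex(source: str):
    """Return token kind names, or None if the input cannot even be lexed."""
    try:
        tokens = tokenize.generate_tokens(io.StringIO(source).readline)
        return [tokenize.tok_name[tok.type] for tok in tokens]
    except (tokenize.TokenError, SyntaxError):
        return None

print(lex("def f(x):\n    return x + 1\n"))
print(lex("def f(x:\n    return x + 1\n"))  # unclosed parenthesis: lexing fails
```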


Mining Software Repositories | 2014

GreenMiner: a hardware based mining software repositories software energy consumption framework

Abram Hindle; Alex Wilson; Kent Rasmussen; E. Jed Barlow; Joshua Charles Campbell; Stephen Romansky


Computer Science and Software Engineering | 2014

The power of system call traces: predicting the software energy consumption impact of changes

Karan Aggarwal; Chenlei Zhang; Joshua Charles Campbell; Abram Hindle; Eleni Stroulia


IEEE International Conference on Software Analysis, Evolution and Reengineering | 2018

Syntax and sensibility: Using language models to detect and correct syntax errors

Eddie Antonio Santos; Joshua Charles Campbell; Dhvani Patel; Abram Hindle; José Nelson Amaral

Collaboration


Dive into Joshua Charles Campbell's collaborations.
