Syntax and Stack Overflow: A methodology for extracting a corpus of syntax errors and fixes
Alexander William Wong, Amir Salimi, Shaiful Chowdhury, Abram Hindle
Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada
{alex.wong, asalimi, shaiful, abram.hindle}@ualberta.ca

Abstract—One problem when studying how to find and fix syntax errors is how to get natural and representative examples of syntax errors. Most syntax error datasets are not free, open, and public, or they are extracted from novice programmers and do not represent syntax errors that the general population of developers would make. Programmers of all skill levels post questions and answers to Stack Overflow which may contain snippets of source code along with corresponding text and tags. Many snippets do not parse, thus they are ripe for forming a corpus of syntax errors and corrections. Our primary contribution is an approach for extracting natural syntax errors and their corresponding human-made fixes to help syntax error research. A Python abstract syntax tree parser is used to determine preliminary errors and corrections on code blocks extracted from the SOTorrent data set. We further analyzed our code by executing the corrections in a Python interpreter. We applied our methodology to produce a public data set of 62,965 Python Stack Overflow code snippets with corresponding tags, errors, and stack traces. We found that errors made by Stack Overflow users do not match errors made by student developers or random mutations, implying there is a serious representativeness risk within the field. Finally, we share our dataset openly so that future researchers can re-use and extend our syntax errors and fixes.
Index Terms—stack overflow, natural, syntax errors, python, mining software repositories
I. INTRODUCTION
Syntax errors stymie novice and expert developers alike [1, 2]. Many researchers have sought to help developers recover from syntax errors by locating syntax errors [3] and/or fixing them [4]. One impediment to this kind of research is getting access to a representative corpus of syntax errors and their subsequent corrections. Corpora are limited because syntax error commits are rare: open source developers typically do not commit syntax-error-ridden code to their repositories.

Many existing datasets suffer from limited access due to ethical concerns, while other datasets lack representativeness as they are pulled from student developers rather than practicing software engineers [5, 6]. Some authors address the lack of access to syntax errors by synthetically creating their own syntax error datasets via mutation [3, 7]. Mutation does not fully address the representativeness of syntax errors, as mutants can differ from the syntax errors of actual developers [8]. These shortcomings can be mitigated by using data containing natural errors created by a more general population of developers, where natural is defined as a "product of human effort" [9]. Unfortunately, naturally made syntax errors are difficult to obtain, as current methods require fine-grained observations of programmer activity [10].

Thus the goal of this work is to reproducibly and openly produce a corpus of Python syntax errors that is both natural and representative of syntax errors made by developers as a whole. A dataset of natural human-made syntax errors and human-revised fixes would enable future software engineering research. Models tasked with source code completion or error detection could use this dataset to train and evaluate their performance on actual developer errors. Statistical analysis of commonly made syntax errors and their corresponding fixes could offer insight into planning future software language specifications.

This dataset would enable novel approaches for measuring the similarity of source code to natural languages. Potential new studies could explore the naturalness of syntax corrections, comparing how natural language errors are fixed to how code errors are fixed [11]. Most importantly, a replicable and reproducible methodology to produce such a corpus would allow researchers to update and improve the dataset on their own and in the future.

We extend the SOTorrent dataset, built from Stack Overflow, to extract syntax errors and corrections [12]. Our main contribution is a methodology for the extraction of human-made errors and their fixes. With our methodology, it is important to understand how accurate the pairs of syntax errors and their fixes are, and what the general characteristics of those syntax errors are. We answer the following research questions:

RQ0) How accurate is our approach for extracting pairs of syntax errors and their fixes?
RQ1) What are Python parse errors and corresponding runtime properties of their corrections in Stack Overflow?
RQ2) Are Python errors in Stack Overflow similar to errors made by student programmers?
RQ3) Are Python errors in Stack Overflow similar to generated errors created by mutating valid code?

Our generated dataset is released to enable future work in source code naturalness research.

II. RELATED WORK
Syntactically incorrect code is artificially derivable, as formal programming languages provide grammar rules which can be referred to for correctness. Random token-level insertions, deletions, and replacements were performed to generate syntax errors from existing open source Java projects [13]. Campbell et al. created Python syntax errors from valid code mined from GitHub by applying mutations on tokens, characters, and lines [3]. Although generated errors are appealing due to the availability of open source code, Just et al. demonstrated limitations of using mutations for software testing research [7]. Given the task of using mutants as replacements for real faults in automated software testing research, only 73% of mutants were coupled to faults. Furthermore, when accounting for code coverage, the mutant-to-fault coupling effect is small [7].

Automated source code repair, like identifying and refactoring improper method names, also required a labeled dataset of valid and invalid source code [14]. Program repair is often viewed as different from syntax error correction because testing is performed, which serves as a benchmark for repaired code, while syntax errors rely primarily on parseability.

Free and open datasets of naturally made errors and their fixes are more difficult to obtain. Blackbox, a data collection project within the BlueJ Java development environment, requires manual staff contact for access to data and forbids the release of the raw dataset [10]. Pritchard analyzed Python programs submitted to CS Circles, an online tool for learning Python [5]. Kelley et al. studied Python code submitted by students in an introductory programming course at MIT [6]. Gathering this data without privileged access to the provided code submissions is difficult, limiting the reproducibility of their research. Our research used Stack Overflow and is advantageous as the raw content is freely accessible to the internet, revisions and history are tracked, and contributors have a wide range of software engineering expertise and skill sets.

SOTorrent Dataset: https://zenodo.org/record/2273117
Dataset of Python 3.6 Natural Syntax Errors and Corrections: https://doi.org/10.6084/m9.figshare.8244686.v1

III. METHODOLOGY
Our methodology for finding syntax error and fix pairs is: leverage the historical view of Stack Overflow extracted from the SOTorrent project [12]; parse detected Python source snippet histories; extract pairs of failed and fixed revisions; validate pairs; and finally record successfully evaluated pairs.
A. Code Snippets from SOTorrent
The SOTorrent schema contains a PostBlockVersion table which stores version history on the latest Posts, which must be either a question or answer [12]. We queried a subset of rows from the PostBlockVersion table where the content type equaled CodeSnippet and the corresponding Post was tagged with a term matching python. If the corresponding Post was typed Answer, we used the tags defined in the referenced Question. To limit TabError exceptions, extraneous white space is removed from the content while preserving nested indentation.
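The paper does not publish its exact normalization routine; a minimal sketch, assuming that expanding tabs and stripping the common leading indentation suffices, might look like the following. The helper name `normalize_snippet` is hypothetical:

```python
import textwrap

def normalize_snippet(code: str) -> str:
    """Remove extraneous leading white space from a snippet while
    preserving its nested indentation. Hypothetical helper; the paper
    does not specify its exact routine."""
    # Expanding tabs first avoids mixed tab/space prefixes that would
    # otherwise defeat dedent and later raise TabError at parse time.
    expanded = code.expandtabs(4)
    return textwrap.dedent(expanded).strip("\n")

# A snippet pasted from an indented Markdown code block:
snippet = "    def f():\n        return 1\n"
print(normalize_snippet(snippet))
```

Snippets extracted from indented Markdown code blocks carry a uniform four-space prefix, which this kind of dedent removes without disturbing the inner block structure.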
B. Parsing Abstract Syntax Tree
To determine the presence of syntax errors, we attempted to parse abstract syntax trees (AST) given the extracted code. We tackled this problem using Python's built-in ast module, which compiles source code into an AST object. Valid code parses without error, while invalid code raises errors such as IndentationError, TabError, SyntaxError, or MemoryError. The specific error message, offending line number, and column offset are stored for the indentation, tab, and syntax errors (memory errors offer no such metadata).

Stack Overflow 2019 Survey: https://insights.stackoverflow.com/survey/2019
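A sketch of this classification step using the `ast` module (not the authors' exact implementation) might look like:

```python
import ast

def classify_parse(code: str):
    """Try to parse a snippet into an AST. Return None when the snippet
    parses cleanly, otherwise (error_name, message, lineno, offset).
    Illustrative sketch of the classification step."""
    try:
        ast.parse(code)
        return None
    except (IndentationError, TabError, SyntaxError) as e:
        # TabError and IndentationError subclass SyntaxError; all three
        # expose the error message, offending line number, and column offset.
        return (type(e).__name__, e.msg, e.lineno, e.offset)
    except MemoryError:
        # MemoryError carries no location metadata.
        return ("MemoryError", None, None, None)

print(classify_parse("if True\n    pass"))  # missing colon: a SyntaxError
print(classify_parse("x = 1"))              # valid code: no error
```

Catching the subclasses in the same handler but recording `type(e).__name__` preserves the distinction between the three parse error classes that the dataset tracks.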
C. Filtering Candidates using Block Versions
One issue is that many extracted code snippets containing a python matching tag are not valid Python, but instead contain stack traces, script execution commands, tabular data, markup, etc. Further candidate filtering is necessary to distinguish invalid Python code from unrelated text.

To address this issue, we queried for an existing prior version within the PostBlockVersion table for each code snippet that successfully parsed into an AST. If the prior version contained a parse error, we stored these two versions of code as our syntax error and fix. Unrelated snippets that are never fixed into valid Python source code are removed from analysis.
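Setting the SQL layer aside, the pairing logic amounts to scanning a block's version history for a failing revision directly followed by one that parses. The helper below is an illustrative sketch, not the pipeline's actual code:

```python
def extract_pairs(versions):
    """Given one code block's version history, ordered oldest to newest,
    as (content, parse_error_or_None) tuples, return (error, fix) pairs
    where a failing version is directly followed by a parseable one.
    Simplified sketch of the filtering step."""
    pairs = []
    for (prev_code, prev_err), (curr_code, curr_err) in zip(versions, versions[1:]):
        if prev_err is not None and curr_err is None:
            pairs.append((prev_code, curr_code))
    return pairs

history = [
    ("if True\n    pass", "SyntaxError"),  # broken revision
    ("if True:\n    pass", None),          # fixed revision
]
print(extract_pairs(history))
```

Histories that never reach a parseable revision yield no pairs, which mirrors how unrelated non-Python text drops out of the analysis.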
D. Runtime Validation with the Interpreter
We checked for code snippet validity by running the corrected source code. Code that validly parsed into an AST may still be faulty due to various errors, like importing invalid modules or referencing undefined variables. Source code that cannot parse into an AST does not need to be run, as the thrown exception will be the same as the parse error.

We used the Python built-in exec function, isolating the global and local execution scopes. To mitigate the impact of executing arbitrary source code, we ran the snippets in an isolated Linux kernel virtual machine (KVM) with no access to the network. Each snippet invocation was further containerized within its own Docker image running Python 3.6 on Debian Jessie. We capped the maximum execution time at four seconds to account for long-running and non-terminating programs. All encountered exceptions and their corresponding stack traces are stored in our dataset for further analysis.
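The scoping and exception capture can be sketched as follows; the KVM/Docker sandboxing and the four-second cap are deliberately omitted here, so this sketch must not be used on untrusted code:

```python
import traceback

def run_fix(code: str):
    """Execute a corrected snippet in a fresh scope and record any runtime
    exception with its stack trace. The paper's sandboxing (networkless KVM,
    Docker, four-second timeout) is omitted from this sketch."""
    scope = {}  # one shared dict for globals and locals, as at module level
    try:
        exec(code, scope, scope)
        return None
    except Exception as exc:
        return (type(exc).__name__, traceback.format_exc())

print(run_fix("x = 1 + 1"))             # parses and runs cleanly
print(run_fix("print(undefined_var)"))  # a typical NameError case
```

Passing a fresh dictionary as both the global and local namespace keeps each snippet from seeing state left behind by previously executed snippets.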
IV. RESULTS

We successfully parsed and analyzed 82,469,989 code snippets from the SOTorrent dataset. Of these snippets, 7,260,631 contained a python matching tag. A subset of 3,774,225 snippets parsed correctly. Of the parseable snippets, 1,550,090 had prior versions, 62,965 of which contained a parse error. A summary of the filtering steps can be found in Tab. I. With this dataset, we answer the four research questions.
TABLE I
FILTERED DATASET ENTITY COUNTS

Filter Metric               Entity Block Count
All code snippets           82,469,989
Block matched python tag    7,260,631
Content AST parseable       3,774,225
Prior version exists        1,550,090
Prior version AST error     62,965
RQ0. How accurate is our approach for extracting pairs of syntax errors and their fixes?

We wanted to determine how accurate our methodology is at extracting true syntax errors and fixes. Three co-authors together inspected 100 random samples of syntax errors and corrections from the extracted dataset of 62,965 pairs. Only 2 observed cases were not entirely syntax error revisions, as they included large amounts of new code. The 95% confidence interval for our methodology is 0-5% erroneous cases. While our methodology is accurate, future research can address these erroneous cases (e.g., by leveraging the code snippet version similarity provided by SOTorrent).

Stack Overflow is a good source for methodologically generating pairs of natural syntax errors and fixes.
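An audit sample like the one above can be drawn reproducibly; in this sketch `pair_ids` merely stands in for the dataset's 62,965 row identifiers:

```python
import random

# Draw a reproducible 100-pair sample for manual inspection.
# pair_ids is a hypothetical stand-in for the dataset's row identifiers.
random.seed(0)  # fixed seed so the audit sample can be re-drawn later
pair_ids = range(62965)
audit_sample = random.sample(pair_ids, 100)
print(len(audit_sample), len(set(audit_sample)))
```

Sampling without replacement guarantees 100 distinct pairs, and the fixed seed lets other researchers reconstruct the same audit set.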
RQ1. What are Python parse errors and corresponding run-time properties of their corrections in Stack Overflow?
Stack Overflow posts are written and parsed using a subset of Markdown. Code blocks are defined by using four-space indentation, creating an indented code block, or by surrounding the text with three backtick characters (```) to create a fenced code block. In addition, the Stack Overflow text editor does not provide code assistance features, such as syntax highlighting and code completion.

Of the 62,965 AST parse errors, there were 35,564 (56.5%) syntax errors, 27,075 (43.0%) indentation errors, and 326 (0.5%) tab errors. Over 41% of all AST parse errors were SyntaxError: invalid syntax. A detailed summary of the error types and messages is listed in Tab. II.

Over a third of all corrected snippets still throw a NameError. Roughly one third, or 21,332, corrected snippets ran in a Python interpreter with no errors. Only 0.4% of all evaluated snippets triggered an execution timeout by running for longer than four seconds. The results from running all AST fixes in a Python interpreter are summarized in Tab. III.

We extracted 62,965 parse errors and corrections from over 82 million code snippets. Of the corrections, 21,332 ran in a Python interpreter without error.
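The headline percentages above can be rechecked directly from the reported counts:

```python
# AST parse error counts reported in RQ1; the three classes sum to the
# full 62,965 extracted parse errors.
counts = {"SyntaxError": 35564, "IndentationError": 27075, "TabError": 326}
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name}: {n} ({100 * n / total:.1f}%)")  # 56.5%, 43.0%, 0.5%
```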
RQ2. Are Python errors in Stack Overflow similar to errors made by student programmers?
We compared the distribution of our errors with the distribution of runtime errors as presented in two prior works. If the distribution of errors is similar to student programmer errors, then student errors can be viewed as representative. If they are not similar, then this provides evidence for arguments questioning how representative student errors are, given that their error distribution is different.

Kelley et al. collected the distribution of Python errors submitted by MIT programming students to an online tutor as part of an introductory Python programming course [6]. The collected student error distribution comprised TypeError: …; AttributeError: …; NameError: …; SyntaxError: …; IndexError: …; and other: …. We performed a Pearson's Chi-squared test for given probabilities, comparing our combined errors with the MIT student error distributions. We found that the error distribution in Stack Overflow was not statistically similar to the errors made by MIT student programmers (α = 0.…, χ² = 239031, df = 5, p-value < …).

Pritchard collected the student error distribution from interactive tutorials within CS Circles [5]. The reported distribution was SyntaxError: …; NameError: …; EOFError: …; IndentationError: …; and other: …. We performed a Chi-squared test for the probabilities given in the CS Circles student Python error distributions and found that errors in Stack Overflow were not similarly distributed (α = 0.…, χ² = 154689, df = 4, p-value < …).

The distribution of errors found in Stack Overflow extracted source code snippets did not match the distribution of errors prior literature has found for student and novice developers. This empirically suggests that creating a corpus of syntax errors strictly from novice developers is insufficient to capture the full spectrum of errors made by the developer community as a whole. Additional methods of contributing to existing syntax corpus datasets are therefore necessary. We therefore emphasize the value of our corpus, as it provides a traceable distribution of natural programming errors and corrections.

The errors posted on Stack Overflow are not similar to the errors made by student and novice programmers.

TABLE II
TOP PYTHON AST PARSE ERRORS

Error        Message                                                          Count   %
Syntax       invalid syntax                                                   26336   41.83
Indentation  expected an indented block                                       14535   23.08
Indentation  unexpected indent                                                11030   17.52
Syntax       Missing parentheses in call to 'print'                            3063    4.86
Syntax       unexpected EOF while parsing                                      2732    4.34
Syntax       EOL while scanning string literal                                 2045    3.25
Indentation  unindent does not match any outer indentation level               1468    2.33
Tab          inconsistent use of tabs and spaces in indentation                 326    0.52
Syntax       invalid character in identifier                                    297    0.47
Syntax       unexpected character after line continuation character             183    0.29
Syntax       invalid token                                                      139    0.22
Syntax       positional argument follows keyword argument                       126    0.20
Syntax       EOF while scanning triple-quoted string literal                    102    0.16
Syntax       (unicode error) 'unicodeescape' codec can't decode bytes…           97    0.15
Syntax       can't assign to function call                                       94    0.15
Syntax       can't assign to operator                                            63    0.10
Syntax       illegal target for annotation                                       59    0.09
Syntax       keyword can't be an expression                                      54    0.09
Syntax       Generator expression must be parenthesized if not sole argument     47    0.07
Indentation  unexpected unindent                                                 42    0.07
Syntax       other…                                                             127    0.20
Total                                                                         62965  100.00

TABLE III
TOP PYTHON CORRECTED CODE RUNTIME ERRORS

Error              Count   %
NameError          24270   38.55
No Error           21332   33.88
…                      …       …
Execution Timeout    251    0.40
AttributeError       244    0.39
ImportError          116    0.18
other…               518    0.82
Total              62965  100.00

RQ3. Are Python errors in Stack Overflow similar to generated errors created by mutating valid code?
We evaluated our errors against a distribution of syntax errors generated by random source code mutations, summarized in Fig. 1. Campbell et al. provided frequencies of Python exceptions caused by mutations of a randomly chosen token within a sequence of valid Python source code [3]. We performed a Pearson's Chi-squared test for the distributions of token deletions, insertions, and replacements.

• Delete: χ² = 933962, df = 10, p-value < …
• Insert: χ² = 8424235, df = 10, p-value < …
• Replace: χ² = 652478, df = 10, p-value < …

We reject the assumption that the Stack Overflow extracted code error distributions are similar to the mutated source code for any mutation type at α = 0.…. This suggests that random mutations alone are insufficient for creating an error corpus comparable to natural faults. As the distributions of mutation-based syntax errors were also not similar to errors found in programming novices, additional tuning is necessary in order to emulate natural distributions of syntax errors.

The errors posted on Stack Overflow are not similar to generated errors from randomly mutating source code.

V. FUTURE WORK
The methodology and data provided in this paper enable additional opportunities for software engineering research.

• The current paper generates syntax errors and corrections only for Python 3.6. A natural, incremental addition is to perform the same computation for all Python versions.
• Further languages can also be analyzed using our methodology. These languages will have their own static parsing and runtime execution specifications.
• Extra runtime error corrections can be done by installing required dependencies or initializing variables.
• We will investigate how this dataset enhances existing code completion and syntax error location models, as well as provide a new evaluation benchmark for these tasks.
• More insight on natural error corrections is crucial. Fixing an improperly indented block of code involves multiple white-space changes across many lines but can be resolved by highlighting the lines of code in an editor and indenting the selection. Character/token changes alone do not fully encapsulate developer-code interaction.

Fig. 1. Comparison of error distributions in our study vs. random mutations (Python error frequencies by error type; series: Our Study, Delete Token, Insert Token, Replace Token)

VI. THREATS TO VALIDITY
We relied on user-submitted tags to identify Python code. Because snippets not matching the Python tag were not considered, unaccounted Python code in Stack Overflow may still exist. Unfortunately, relying only on source code parseability generated many false positives. One recurring issue was that JavaScript object notation (JSON) has the same structure as Python dictionaries. Without matching the Python tag, an additional 5.4 million false positive code snippets would require analysis.

Another concern is whether or not we can rely on our methodology to accurately distinguish truly syntactically invalid Python code from syntactically invalid arbitrary text. Although we found no occurrences of blatantly invalid source code, we publicize our dataset so anyone can audit our results.

Our approach obtains pairs of snippets where one snippet is syntactically correct and the other is not. We only look at each snippet's previous version at one point in time rather than looking at the entire history. This prevents the inclusion of code blocks that have evolved greatly over time.

Another hurdle is that, to obtain runtime faults from parseable Python code snippets, it is necessary to run the source code in a Python interpreter. Despite our attempts to eliminate variability of source code execution using KVMs and Docker, we acknowledge that more work is necessary to rigorously evaluate runtime faults of programs. Improvements involve accommodating operating system specificity, file system case sensitivity, and underlying hardware limitations.

We argue that the distribution of errors extracted from Stack Overflow is representative of developers, but not representative of software projects as a whole. We acknowledge this distinction as code snippets are a subset of working software.

VII. CONCLUSION
We provide a novel methodology for automatically extracting natural source code syntax errors and their fixes from Stack Overflow, which sets the groundwork for future software naturalness and future syntax error detection and correction research. Syntax errors extracted from Stack Overflow do not match prior distributions found in novice code or randomly generated errors. We hope our data will be used for training and evaluating code completion & error detection models, analysis of common programming language pitfalls, and future source code naturalness research.
REFERENCES

[1] E. S. Tabanao, M. M. T. Rodrigo, and M. C. Jadud, "Predicting at-risk novice Java programmers through the analysis of online protocols," in Proceedings of the Seventh International Workshop on Computing Education Research, ser. ICER '11. New York, NY, USA: ACM, 2011, pp. 85–92. [Online]. Available: http://doi.acm.org/10.1145/2016911.2016930
[2] P. Denny, A. Luxton-Reilly, and E. Tempero, "All syntax errors are not equal," in Proceedings of the 17th ACM Annual Conference on Innovation and Technology in Computer Science Education, ser. ITiCSE '12. New York, NY, USA: ACM, 2012, pp. 75–80. [Online]. Available: http://doi.acm.org/10.1145/2325296.2325318
[3] J. C. Campbell, A. Hindle, and J. N. Amaral, "Error location in Python: where the mutants hide," PeerJ PrePrints, vol. 3, p. e1132v1, May 2015. [Online]. Available: https://doi.org/10.7287/peerj.preprints.1132v1
[4] E. A. Santos, J. C. Campbell, D. Patel, A. Hindle, and J. N. Amaral, "Syntax and sensibility: Using language models to detect and correct syntax errors," in …. IEEE Computer Society, March 2018, pp. 311–322. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/8330219
[5] D. Pritchard, "Frequency distribution of error messages," in Proceedings of the 6th Workshop on Evaluation and Usability of Programming Languages and Tools, ser. PLATEAU 2015. New York, NY, USA: ACM, 2015, pp. 1–8. [Online]. Available: http://doi.acm.org/10.1145/2846680.2846681
[6] A. K. Kelley et al., "A system for classifying and clarifying Python syntax errors for educational purposes," Master's thesis, Massachusetts Institute of Technology, 2018. [Online]. Available: http://hdl.handle.net/1721.1/119750
[7] R. Just, D. Jalali, L. Inozemtseva, M. D. Ernst, R. Holmes, and G. Fraser, "Are mutants a valid substitute for real faults in software testing?" in Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, ser. FSE 2014. New York, NY, USA: ACM, 2014, pp. 654–665. [Online]. Available: http://doi.acm.org/10.1145/2635868.2635929
[8] M. Jimenez, T. T. Checkam, M. Cordy, M. Papadakis, M. Kintis, Y. L. Traon, and M. Harman, "Are mutants really natural?: A study on how 'naturalness' helps mutant selection," in Proceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, ser. ESEM '18. New York, NY, USA: ACM, 2018, pp. 3:1–3:10. [Online]. Available: http://doi.acm.org/10.1145/3239235.3240500
[9] A. Hindle, E. T. Barr, Z. Su, M. Gabel, and P. Devanbu, "On the naturalness of software," in Proceedings of the 34th International Conference on Software Engineering, ser. ICSE '12. Piscataway, NJ, USA: IEEE Press, 2012, pp. 837–847. [Online]. Available: http://dl.acm.org/citation.cfm?id=2337223.2337322
[10] N. C. C. Brown, M. Kölling, D. McCall, and I. Utting, "Blackbox: A large scale repository of novice programmers' activity," in Proceedings of the 45th ACM Technical Symposium on Computer Science Education, ser. SIGCSE '14. New York, NY, USA: ACM, 2014, pp. 223–228. [Online]. Available: http://doi.acm.org/10.1145/2538862.2538924
[11] B. Ray, V. Hellendoorn, S. Godhane, Z. Tu, A. Bacchelli, and P. Devanbu, "On the 'naturalness' of buggy code," in …, May 2016, pp. 428–439. [Online]. Available: http://doi.acm.org/10.1145/2884781.2884848
[12] S. Baltes, L. Dumani, C. Treude, and S. Diehl, "SOTorrent: reconstructing and analyzing the evolution of Stack Overflow posts," in Proceedings of the 15th International Conference on Mining Software Repositories, MSR 2018, Gothenburg, Sweden, May 28-29, 2018, A. Zaidman, Y. Kamei, and E. Hill, Eds. ACM, 2018, pp. 319–330. [Online]. Available: https://doi.org/10.1145/3196398.3196430
[13] J. C. Campbell, A. Hindle, and J. N. Amaral, "Syntax errors just aren't natural: Improving error reporting with language models," in Proceedings of the 11th Working Conference on Mining Software Repositories, ser. MSR 2014. New York, NY, USA: ACM, 2014, pp. 252–261. [Online]. Available: http://doi.acm.org/10.1145/2597073.2597102
[14] K. Liu, D. Kim, T. F. D. A. Bissyande, T. Kim, K. Kim, A. Koyuncu, S. Kim, and Y. Le Traon, "Learning to spot and refactor inconsistent method names," in 41st ACM/IEEE International Conference on Software Engineering (ICSE)