Syntax and Stack Overflow: A methodology for extracting a corpus of syntax errors and fixes
Alexander William Wong, Amir Salimi, Shaiful Chowdhury, Abram Hindle
Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada
{alex.wong, asalimi, shaiful, abram.hindle}@ualberta.ca

Abstract—One problem when studying how to find and fix syntax errors is how to get natural and representative examples of syntax errors. Most syntax error datasets are not free, open, and public, or they are extracted from novice programmers and do not represent syntax errors that the general population of developers would make. Programmers of all skill levels post questions and answers to Stack Overflow which may contain snippets of source code along with corresponding text and tags. Many snippets do not parse, thus they are ripe for forming a corpus of syntax errors and corrections. Our primary contribution is an approach for extracting natural syntax errors and their corresponding human-made fixes to help syntax error research. A Python abstract syntax tree parser is used to determine preliminary errors and corrections on code blocks extracted from the SOTorrent data set. We further analyzed our code by executing the corrections in a Python interpreter. We applied our methodology to produce a public data set of 62,965 Python Stack Overflow code snippets with corresponding tags, errors, and stack traces. We found that errors made by Stack Overflow users do not match errors made by student developers or random mutations, implying there is a serious representativeness risk within the field. Finally, we share our dataset openly so that future researchers can re-use and extend our syntax errors and fixes.
Index Terms—stack overflow, natural, syntax errors, python, mining software repositories
I. INTRODUCTION
Syntax errors stymie novice and expert developers alike [1, 2]. Many researchers have sought to help developers recover from syntax errors by locating syntax errors [3] and/or fixing them [4]. One impediment to this kind of research is getting access to a representative corpus of syntax errors and their subsequent corrections. Corpora are limited because syntax error commits are rare: open source developers typically do not commit syntax-error-ridden code to their repositories.

Many existing datasets suffer from limited access due to ethical concerns, while other datasets lack representativeness as they are pulled from student developers rather than practicing software engineers [5, 6]. Some authors address the lack of access to syntax errors by synthetically creating their own syntax error datasets via mutation [3, 7]. Mutation does not fully address the representativeness of syntax errors, as mutants can differ from the syntax errors of actual developers [8]. These shortcomings can be mitigated by using data containing natural errors created by a more general population of developers, where natural is defined as a "product of human effort" [9]. Unfortunately, naturally made syntax errors are difficult to obtain, as current methods require fine-grained observations of programmer activity [10].

Thus the goal of this work is to reproducibly and openly produce a corpus of Python syntax errors that is both natural and representative of syntax errors made by developers as a whole. A dataset of natural human-made syntax errors and human-revised fixes would enable future software engineering research. Models tasked with source code completion or error detection could use this dataset to train and evaluate their performance on actual developer errors. Statistical analysis of commonly made syntax errors and their corresponding fixes could offer insight into planning future software language specifications.

This dataset would enable novel approaches for measuring the similarity of source code to natural languages. Potential new studies could explore the naturalness of syntax corrections, comparing how natural language errors are fixed to how code errors are fixed [11]. Most importantly, a replicable and reproducible methodology to produce such a corpus would allow researchers to update and improve the dataset on their own and in the future.

We extend the SOTorrent dataset, built from Stack Overflow, to extract syntax errors and corrections [12]. Our main contribution is a methodology for the extraction of human-made errors and their fixes. With our methodology, it is important to understand how accurate the pairs of syntax errors and their fixes are, and what the general characteristics of those syntax errors are. We answer the following research questions:

RQ0) How accurate is our approach for extracting pairs of syntax errors and their fixes?
RQ1) What are Python parse errors and corresponding runtime properties of their corrections in Stack Overflow?
RQ2) Are Python errors in Stack Overflow similar to errors made by student programmers?
RQ3) Are Python errors in Stack Overflow similar to generated errors created by mutating valid code?

Our generated dataset is released to enable future work in source code naturalness research.

II. RELATED WORK
Syntactically incorrect code is artificially derivable, as formal programming languages provide grammar rules which can be referred to for correctness. Random token-level insertions, deletions, and replacements were performed to generate syntax errors from existing open source Java projects [13]. Campbell et al. created Python syntax errors from valid code mined from GitHub by applying mutations on tokens, characters, and lines [3]. Although generated errors are appealing due to the availability of open source code, Just et al. demonstrated limitations of using mutations for software testing research [7]. Given the task of using mutants as replacements for real faults in automated software testing research, only 73% of mutants were coupled to faults. Furthermore, when accounting for code coverage, the mutant-to-fault coupling effect is small [7].

Automated source code repair, like identifying and refactoring improper method names, also required a labeled dataset of valid and invalid source code [14]. Program repair is often viewed as different from syntax error correction because testing is performed, which serves as a benchmark for repaired code, while syntax errors rely primarily on parseability.

Free and open datasets of naturally made errors and their fixes are more difficult to obtain. Blackbox, a data collection project within the BlueJ Java development environment, requires manual staff contact for access to data and forbids the release of the raw dataset [10]. Pritchard analyzed Python programs submitted to CS Circles, an online tool for learning Python [5]. Kelley et al. studied Python code submitted by students in an introductory programming course at MIT [6]. Gathering this data without privileged access to the provided code submissions is difficult, limiting the reproducibility of their research. Our research used Stack Overflow and is advantageous as the raw content is freely accessible to the internet, revisions and history are tracked, and contributors have a wide range of software engineering expertise and skill sets.

SOTorrent Dataset: https://zenodo.org/record/2273117
Dataset of Python 3.6 Natural Syntax Errors and Corrections: https://doi.org/10.6084/m9.figshare.8244686.v1

III. METHODOLOGY
Our methodology for finding syntax error and fix pairs is: leverage the historical view of Stack Overflow extracted from the SOTorrent project [12]; parse detected Python source snippet histories; extract pairs of failed and fixed revisions; validate pairs; and finally record successfully evaluated pairs.
A. Code Snippets from SOTorrent
The SOTorrent schema contains a PostBlockVersion table which stores version history on the latest Posts, which must be either a question or answer [12]. We queried a subset of rows from the PostBlockVersion table where the content type equaled CodeSnippet and the corresponding Post was tagged with a term matching python. If the corresponding Post was typed Answer, we used the tags defined in the referenced Question. To limit TabError exceptions, extraneous white space is removed from the content while preserving nested indentation.
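The paper does not publish its exact normalization routine; a minimal sketch, assuming that expanding tabs and stripping the common leading indentation suffices, might look like the following. The helper name `normalize_snippet` is hypothetical:

```python
import textwrap

def normalize_snippet(code: str) -> str:
    """Remove extraneous leading white space from a snippet while
    preserving its nested indentation. Hypothetical helper; the paper
    does not specify its exact routine."""
    # Expanding tabs first avoids mixed tab/space prefixes that would
    # otherwise defeat dedent and later raise TabError at parse time.
    expanded = code.expandtabs(4)
    return textwrap.dedent(expanded).strip("\n")

# A snippet pasted from an indented Markdown code block:
snippet = "    def f():\n        return 1\n"
print(normalize_snippet(snippet))
```

Snippets extracted from indented Markdown code blocks carry a uniform four-space prefix, which this kind of dedent removes without disturbing the inner block structure.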
B. Parsing Abstract Syntax Tree
To determine the presence of syntax errors, we attempted to parse abstract syntax trees (AST) given the extracted code. We tackled this problem using Python's built-in ast module, which compiles source code into an AST object. Valid code parses without error, while invalid code raises errors such as IndentationError, TabError, SyntaxError, or MemoryError. The specific error message, offending line number, and column offset are stored for the indentation, tab, and syntax errors (memory errors offer no such metadata).

Stack Overflow 2019 Survey: https://insights.stackoverflow.com/survey/2019
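A sketch of this classification step using the `ast` module (not the authors' exact implementation) might look like:

```python
import ast

def classify_parse(code: str):
    """Try to parse a snippet into an AST. Return None when the snippet
    parses cleanly, otherwise (error_name, message, lineno, offset).
    Illustrative sketch of the classification step."""
    try:
        ast.parse(code)
        return None
    except (IndentationError, TabError, SyntaxError) as e:
        # TabError and IndentationError subclass SyntaxError; all three
        # expose the error message, offending line number, and column offset.
        return (type(e).__name__, e.msg, e.lineno, e.offset)
    except MemoryError:
        # MemoryError carries no location metadata.
        return ("MemoryError", None, None, None)

print(classify_parse("if True\n    pass"))  # missing colon: a SyntaxError
print(classify_parse("x = 1"))              # valid code: no error
```

Catching the subclasses in the same handler but recording `type(e).__name__` preserves the distinction between the three parse error classes that the dataset tracks.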
C. Filtering Candidates using Block Versions
One issue is that many extracted code snippets containing a python matching tag are not valid Python, but instead contain stack traces, script execution commands, tabular data, markup, etc. Further candidate filtering is necessary to distinguish invalid Python code from unrelated text.

To address this issue, we queried for an existing prior version within the PostBlockVersion table for each code snippet that successfully parsed into an AST. If the prior version contained a parse error, we stored these two versions of code as our syntax error and fix. Unrelated snippets that are never fixed into valid Python source code are removed from analysis.
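Setting the SQL layer aside, the pairing logic amounts to scanning a block's version history for a failing revision directly followed by one that parses. The helper below is an illustrative sketch, not the pipeline's actual code:

```python
def extract_pairs(versions):
    """Given one code block's version history, ordered oldest to newest,
    as (content, parse_error_or_None) tuples, return (error, fix) pairs
    where a failing version is directly followed by a parseable one.
    Simplified sketch of the filtering step."""
    pairs = []
    for (prev_code, prev_err), (curr_code, curr_err) in zip(versions, versions[1:]):
        if prev_err is not None and curr_err is None:
            pairs.append((prev_code, curr_code))
    return pairs

history = [
    ("if True\n    pass", "SyntaxError"),  # broken revision
    ("if True:\n    pass", None),          # fixed revision
]
print(extract_pairs(history))
```

Histories that never reach a parseable revision yield no pairs, which mirrors how unrelated non-Python text drops out of the analysis.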
D. Runtime Validation with the Interpreter
We checked for code snippet validity by running the corrected source code. Code that validly parsed into an AST may still be faulty due to various errors, like importing invalid modules or referencing undefined variables. Source code that cannot parse into an AST does not need to be run, as the thrown exception will be the same as the parse error.

We used the Python built-in exec function, isolating the global and local execution scopes. To mitigate the impact of executing arbitrary source code, we ran the snippets in an isolated Linux kernel virtual machine (KVM) with no access to the network. Each snippet invocation was further containerized within its own Docker image running Python 3.6 on Debian Jessie. We capped the maximum execution time at four seconds to account for long-running and non-terminating programs. All encountered exceptions and their corresponding stack traces are stored in our dataset for further analysis.
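The scoping and exception capture can be sketched as follows; the KVM/Docker sandboxing and the four-second cap are deliberately omitted here, so this sketch must not be used on untrusted code:

```python
import traceback

def run_fix(code: str):
    """Execute a corrected snippet in a fresh scope and record any runtime
    exception with its stack trace. The paper's sandboxing (networkless KVM,
    Docker, four-second timeout) is omitted from this sketch."""
    scope = {}  # one shared dict for globals and locals, as at module level
    try:
        exec(code, scope, scope)
        return None
    except Exception as exc:
        return (type(exc).__name__, traceback.format_exc())

print(run_fix("x = 1 + 1"))             # parses and runs cleanly
print(run_fix("print(undefined_var)"))  # a typical NameError case
```

Passing a fresh dictionary as both the global and local namespace keeps each snippet from seeing state left behind by previously executed snippets.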
IV. RESULTS

We successfully parsed and analyzed 82,469,989 code snippets from the SOTorrent dataset. Of these snippets, 7,260,631 contained a python matching tag. A subset of 3,774,225 snippets parsed correctly. Of the parseable snippets, 1,550,090 had prior versions, 62,965 of which contained a parse error. A summary of the filtering steps can be found in Tab. I. With this dataset, we answer the four research questions.
TABLE I
FILTERED DATASET ENTITY COUNTS

Filter Metric               Entity Block Count
All code snippets           82,469,989
Block matched python tag    7,260,631
Content AST parseable       3,774,225
Prior version exists        1,550,090
Prior version AST error     62,965
RQ0. How accurate is our approach for extracting pairs of syntax errors and their fixes?

We wanted to determine how accurate our methodology is at extracting true syntax errors and fixes. Three co-authors together inspected 100 random samples of syntax errors and corrections from the extracted dataset of 62,965 pairs. Only 2 observed cases were not entirely syntax error revisions, as they included large amounts of new code. The 95% confidence interval for our methodology is 0-5% erroneous cases. While our methodology is accurate, future research can address these erroneous cases (e.g., by leveraging the code snippet version similarity provided by SOTorrent).

Stack Overflow is a good source for methodologically generating pairs of natural syntax errors and fixes.
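An audit sample like the one above can be drawn reproducibly; in this sketch `pair_ids` merely stands in for the dataset's 62,965 row identifiers:

```python
import random

# Draw a reproducible 100-pair sample for manual inspection.
# pair_ids is a hypothetical stand-in for the dataset's row identifiers.
random.seed(0)  # fixed seed so the audit sample can be re-drawn later
pair_ids = range(62965)
audit_sample = random.sample(pair_ids, 100)
print(len(audit_sample), len(set(audit_sample)))
```

Sampling without replacement guarantees 100 distinct pairs, and the fixed seed lets other researchers reconstruct the same audit set.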
RQ1. What are Python parse errors and corresponding run-time properties of their corrections in Stack Overflow?
Stack Overflow posts are written and parsed using a subset of Markdown. Code blocks are defined by using four-space indentation, creating an indented code block, or by surrounding the text with three backtick characters (```) to create a fenced code block. In addition, the Stack Overflow text editor does not provide code assistance features, such as syntax highlighting and code completion.

Of the 62,965 AST parse errors, there were 35,564 (56.5%) syntax errors, 27,075 (43.0%) indentation errors, and 326 (0.5%) tab errors. Over 41% of all AST parse errors were SyntaxError: invalid syntax. A detailed summary of the error types and messages is listed in Tab. II.

Over a third of all corrected snippets still throw a NameError. Roughly one third, or 21,332, corrected snippets ran in a Python interpreter with no errors. Only 0.4% of all evaluated snippets triggered an execution timeout by running for longer than four seconds. The results from running all AST fixes in a Python interpreter are summarized in Tab. III.

We extracted 62,965 parse errors and corrections from over 82 million code snippets. Of the corrections, 21,332 ran in a Python interpreter without error.
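The headline percentages above can be rechecked directly from the reported counts:

```python
# AST parse error counts reported in RQ1; the three classes sum to the
# full 62,965 extracted parse errors.
counts = {"SyntaxError": 35564, "IndentationError": 27075, "TabError": 326}
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name}: {n} ({100 * n / total:.1f}%)")  # 56.5%, 43.0%, 0.5%
```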
RQ2. Are Python errors in Stack Overflow similar to errors made by student programmers?
We compared the distribution of our errors with the distribution of runtime errors as presented in two prior works. If the distribution of errors is similar to student programmer errors, then student errors can be viewed as representative. If they are not similar, then this provides evidence for arguments questioning how representative student errors are, given that their error distribution is different.

Kelley et al. collected the distribution of Python errors submitted by MIT programming students to an online tutor as part of an introductory Python programming course [6]. The collected student error distribution comprised TypeError: …; AttributeError: …; NameError: …; SyntaxError: …; IndexError: …; and other: …. We performed a Pearson's Chi-squared test for given probabilities, comparing our combined errors with the MIT student error distributions. We found that the error distribution in Stack Overflow was not statistically similar to the errors made by MIT student programmers (α = 0.…, χ² = 239031, df = 5, p-value < …).

Pritchard collected the student error distribution from interactive tutorials within CS Circles [5]. The reported distribution was SyntaxError: …; NameError: …; EOFError: …; IndentationError: …; and other: …. We performed a Chi-squared test for the probabilities given in the CS Circles student Python error distributions and found that errors in Stack Overflow were not similarly distributed (α = 0.…, χ² = 154689, df = 4, p-value < …).

The distribution of errors found in Stack Overflow extracted source code snippets did not match the distribution of errors prior literature has found for student and novice developers. This empirically suggests that creating a corpus of syntax errors strictly from novice developers is insufficient to capture the full spectrum of errors made by the developer community as a whole. Additional methods of contributing to existing syntax corpus datasets are therefore necessary. We therefore emphasize the value of our corpus, as it provides a traceable distribution of natural programming errors and corrections.

The errors posted on Stack Overflow are not similar to the errors made by student and novice programmers.

TABLE II
TOP PYTHON AST PARSE ERRORS

Error        Message                                                          Count   %
Syntax       invalid syntax                                                   26336   41.83
Indentation  expected an indented block                                       14535   23.08
Indentation  unexpected indent                                                11030   17.52
Syntax       Missing parentheses in call to 'print'                            3063    4.86
Syntax       unexpected EOF while parsing                                      2732    4.34
Syntax       EOL while scanning string literal                                 2045    3.25
Indentation  unindent does not match any outer indentation level               1468    2.33
Tab          inconsistent use of tabs and spaces in indentation                 326    0.52
Syntax       invalid character in identifier                                    297    0.47
Syntax       unexpected character after line continuation character             183    0.29
Syntax       invalid token                                                      139    0.22
Syntax       positional argument follows keyword argument                       126    0.20
Syntax       EOF while scanning triple-quoted string literal                    102    0.16
Syntax       (unicode error) 'unicodeescape' codec can't decode bytes…           97    0.15
Syntax       can't assign to function call                                       94    0.15
Syntax       can't assign to operator                                            63    0.10
Syntax       illegal target for annotation                                       59    0.09
Syntax       keyword can't be an expression                                      54    0.09
Syntax       Generator expression must be parenthesized if not sole argument     47    0.07
Indentation  unexpected unindent                                                 42    0.07
Syntax       other…                                                             127    0.20
Total                                                                         62965  100.00

TABLE III
TOP PYTHON CORRECTED CODE RUNTIME ERRORS

Error              Count   %
NameError          24270   38.55
No Error           21332   33.88
…                      …       …
Execution Timeout    251    0.40
AttributeError       244    0.39
ImportError          116    0.18
other…               518    0.82
Total              62965  100.00

RQ3. Are Python errors in Stack Overflow similar to generated errors created by mutating valid code?
We evaluated our errors against a distribution of syntax errors generated by random source code mutations, summarized in Fig. 1. Campbell et al. provided frequencies of Python exceptions caused by mutations of a randomly chosen token within a sequence of valid Python source code [3]. We performed a Pearson's Chi-squared test for the distributions of token deletions, insertions, and replacements.

• Delete: χ² = 933962, df = 10, p-value < …
• Insert: χ² = 8424235, df = 10, p-value < …
• Replace: χ² = 652478, df = 10, p-value < …

We reject the assumption that the Stack Overflow extracted code error distributions are similar to the mutated source code for any mutation type at α = 0.…. This suggests that random mutations alone are insufficient for creating an error corpus comparable to natural faults. As the distributions of mutation-based syntax errors were also not similar to errors found in programming novices, additional tuning is necessary in order to emulate natural distributions of syntax errors.

The errors posted on Stack Overflow are not similar to generated errors from randomly mutating source code.

V. FUTURE WORK
The methodology and data provided in this paper enable additional opportunities for software engineering research.

• The current paper generates syntax errors and corrections only for Python 3.6. A natural, incremental addition is to perform the same computation for all Python versions.
• Further languages can also be analyzed using our methodology. These languages will have their own static parsing and runtime execution specifications.
• Extra runtime error corrections can be done by installing required dependencies or initializing variables.
• We will investigate how this dataset enhances existing code completion and syntax error location models, as well as provide a new evaluation benchmark for these tasks.
• More insight on natural error corrections is crucial. Fixing an improperly indented block of code involves multiple white-space changes across many lines but can be resolved by highlighting the lines of code in an editor and indenting the selection. Character/token changes alone do not fully encapsulate developer-code interaction.

Fig. 1. Comparison of error distributions in our study vs. random mutations (Python error frequencies by error type; series: Our Study, Delete Token, Insert Token, Replace Token)

VI. THREATS TO VALIDITY
We relied on user-submitted tags to identify Python code. Because snippets not matching the Python tag were not considered, unaccounted Python code in Stack Overflow may still exist. Unfortunately, relying only on source code parseability generated many false positives. One recurring issue was that JavaScript object notation (JSON) has the same structure as Python dictionaries. Without matching the Python tag, an additional 5.4 million false positive code snippets would require analysis.

Another concern is whether or not we can rely on our methodology to accurately distinguish truly syntactically invalid Python code from syntactically invalid arbitrary text. Although we found no occurrences of blatantly invalid source code, we publicize our dataset so anyone can audit our results.

Our approach obtains pairs of snippets where one snippet is syntactically correct and the other is not. We only look at each snippet's previous version at one point in time rather than looking at the entire history. This prevents the inclusion of code blocks that have evolved greatly over time.

Another hurdle is that, to obtain runtime faults from parseable Python code snippets, it is necessary to run the source code in a Python interpreter. Despite our attempts to eliminate variability of source code execution using KVMs and Docker, we acknowledge that more work is necessary to rigorously evaluate runtime faults of programs. Improvements involve accommodating operating system specificity, file system case sensitivity, and underlying hardware limitations.

We argue that the distribution of errors extracted from Stack Overflow is representative of developers, but not representative of software projects as a whole. We acknowledge this distinction as code snippets are a subset of working software.

VII. CONCLUSION
We provide a novel methodology for automatically extracting natural source code syntax errors and their fixes from Stack Overflow, which sets the groundwork for future software naturalness and future syntax error detection and correction research. Syntax errors extracted from Stack Overflow do not match prior distributions found in novice code or randomly generated errors. We hope our data will be used for training and evaluating code completion & error detection models, analysis of common programming language pitfalls, and future source code naturalness research.
REFERENCES

[1] E. S. Tabanao, M. M. T. Rodrigo, and M. C. Jadud, "Predicting at-risk novice Java programmers through the analysis of online protocols," in Proceedings of the Seventh International Workshop on Computing Education Research, ser. ICER '11. New York, NY, USA: ACM, 2011, pp. 85–92. [Online]. Available: http://doi.acm.org/10.1145/2016911.2016930
[2] P. Denny, A. Luxton-Reilly, and E. Tempero, "All syntax errors are not equal," in Proceedings of the 17th ACM Annual Conference on Innovation and Technology in Computer Science Education, ser. ITiCSE '12. New York, NY, USA: ACM, 2012, pp. 75–80. [Online]. Available: http://doi.acm.org/10.1145/2325296.2325318
[3] J. C. Campbell, A. Hindle, and J. N. Amaral, "Error location in Python: where the mutants hide," PeerJ PrePrints, vol. 3, p. e1132v1, May 2015. [Online]. Available: https://doi.org/10.7287/peerj.preprints.1132v1
[4] E. A. Santos, J. C. Campbell, D. Patel, A. Hindle, and J. N. Amaral, "Syntax and sensibility: Using language models to detect and correct syntax errors," in …. IEEE Computer Society, March 2018, pp. 311–322. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/8330219
[5] D. Pritchard, "Frequency distribution of error messages," in Proceedings of the 6th Workshop on Evaluation and Usability of Programming Languages and Tools, ser. PLATEAU 2015. New York, NY, USA: ACM, 2015, pp. 1–8. [Online]. Available: http://doi.acm.org/10.1145/2846680.2846681
[6] A. K. Kelley et al., "A system for classifying and clarifying Python syntax errors for educational purposes," Master's thesis, Massachusetts Institute of Technology, 2018. [Online]. Available: http://hdl.handle.net/1721.1/119750
[7] R. Just, D. Jalali, L. Inozemtseva, M. D. Ernst, R. Holmes, and G. Fraser, "Are mutants a valid substitute for real faults in software testing?" in Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, ser. FSE 2014. New York, NY, USA: ACM, 2014, pp. 654–665. [Online]. Available: http://doi.acm.org/10.1145/2635868.2635929
[8] M. Jimenez, T. T. Checkam, M. Cordy, M. Papadakis, M. Kintis, Y. L. Traon, and M. Harman, "Are mutants really natural?: A study on how 'naturalness' helps mutant selection," in Proceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, ser. ESEM '18. New York, NY, USA: ACM, 2018, pp. 3:1–3:10. [Online]. Available: http://doi.acm.org/10.1145/3239235.3240500
[9] A. Hindle, E. T. Barr, Z. Su, M. Gabel, and P. Devanbu, "On the naturalness of software," in Proceedings of the 34th International Conference on Software Engineering, ser. ICSE '12. Piscataway, NJ, USA: IEEE Press, 2012, pp. 837–847. [Online]. Available: http://dl.acm.org/citation.cfm?id=2337223.2337322
[10] N. C. C. Brown, M. Kölling, D. McCall, and I. Utting, "Blackbox: A large scale repository of novice programmers' activity," in Proceedings of the 45th ACM Technical Symposium on Computer Science Education, ser. SIGCSE '14. New York, NY, USA: ACM, 2014, pp. 223–228. [Online]. Available: http://doi.acm.org/10.1145/2538862.2538924
[11] B. Ray, V. Hellendoorn, S. Godhane, Z. Tu, A. Bacchelli, and P. Devanbu, "On the 'naturalness' of buggy code," in …, May 2016, pp. 428–439. [Online]. Available: http://doi.acm.org/10.1145/2884781.2884848
[12] S. Baltes, L. Dumani, C. Treude, and S. Diehl, "SOTorrent: reconstructing and analyzing the evolution of Stack Overflow posts," in Proceedings of the 15th International Conference on Mining Software Repositories, MSR 2018, Gothenburg, Sweden, May 28-29, 2018, A. Zaidman, Y. Kamei, and E. Hill, Eds. ACM, 2018, pp. 319–330. [Online]. Available: https://doi.org/10.1145/3196398.3196430
[13] J. C. Campbell, A. Hindle, and J. N. Amaral, "Syntax errors just aren't natural: Improving error reporting with language models," in Proceedings of the 11th Working Conference on Mining Software Repositories, ser. MSR 2014. New York, NY, USA: ACM, 2014, pp. 252–261. [Online]. Available: http://doi.acm.org/10.1145/2597073.2597102
[14] K. Liu, D. Kim, T. F. D. A. Bissyande, T. Kim, K. Kim, A. Koyuncu, S. Kim, and Y. Le Traon, "Learning to spot and refactor inconsistent method names," in 41st ACM/IEEE International Conference on Software Engineering (ICSE)