Learning Off-By-One Mistakes: An Empirical Study
Hendrig Sellik, Onno van Paridon, Georgios Gousios, Maurício Aniche
Hendrig Sellik
Delft University of Technology
Delft, The Netherlands
[email protected]

Onno van Paridon
Adyen N.V.
Amsterdam, The Netherlands
[email protected]

Georgios Gousios, Maurício Aniche
Delft University of Technology
Delft, The Netherlands
{g.gousios,m.f.aniche}@tudelft.nl
Abstract—Mistakes in binary conditions are a source of error in many software systems. They happen when developers use, e.g., '<' or '>' instead of '<=' or '>='. These boundary mistakes are hard to find and impose manual, labor-intensive work on software developers. While previous research has proposed solutions to identify errors in boundary conditions, the problem remains open. In this paper, we explore the effectiveness of deep learning models in learning and predicting mistakes in boundary conditions. We train different models on approximately 1.6M examples with faults in different boundary conditions. We achieve a precision of 85% and a recall of 84% on a balanced dataset, but lower numbers on an imbalanced dataset. We also test the models on 41 real-world boundary condition bugs found on GitHub, where the model shows only modest performance. Finally, we test the model on a large-scale Java code base from Adyen, our industrial partner. The model reported 36 buggy methods, but none of them were confirmed by developers.
Index Terms—machine learning for software engineering, deep learning for software engineering, software testing, boundary testing.
I. INTRODUCTION
Off-by-one mistakes happen when developers do not correctly implement a boundary condition in the code. Such mistakes often occur when developers use '>' or '<' in cases where they should have used '>=' or '<=', or vice versa. Take the example of an off-by-one error in the Gson library, which we illustrate in Figure 1. The toFind.length() < limit condition is wrong. The fix replaces the < operator with the <= operator. Such mistakes are particularly difficult to find in source code. After all, the result of the program is not always obviously wrong, as it is "merely off by one". In most cases, the mistake will lead to an "out of bounds" situation, which will then result in an application crash.

A large body of knowledge in the software testing field is dedicated to (manual) boundary testing techniques (e.g., [1, 2, 3, 4, 5]). However, manually inspecting code for off-by-one errors is time-consuming, since determining which binary operator is the correct one is usually heavily context-dependent. The industry has been relying on static analysis tools, such as SpotBugs (https://spotbugs.github.io) or PVS-Studio. SpotBugs promises to identify possible infinite loops, as well as array indices, offsets, lengths, and indexes that are out of bounds. PVS-Studio also tries to identify mistakes in conditional statements and indexes that are out of bounds in array manipulation. And while they can indeed find some of these mistakes, many go undetected. As we later show in this paper, none of the real-world off-by-one errors could be detected by the state-of-the-practice static analysis tools.

We conjecture that, for a tool to be able to precisely identify mistakes in boundary conditions, it should be able to capture the overall context of the source code under analysis. Understanding the context of the source code has traditionally been a challenge for static analysis techniques. However, recent advances in machine and deep learning have shown that models can learn useful information from the syntactic and semantic information that exists in source code. Tasks that were deemed not possible before, such as method naming [6, 7, 8], type inference [9, 10, 11], and bug finding [12], are now feasible. The lack of reliable tools that detect off-by-one mistakes leaves an excellent opportunity for researchers to experiment with machine learning approaches.

Inspired by the code2vec and code2seq models proposed by Alon et al. [8, 13], we trained several deep learning models on likely correct methods and their counterparts affected by off-by-one mistakes. The models are trained on over 1.6M examples, and the best results are obtained with the Code2Seq [13] model, achieving 85% precision and 84% recall on a balanced testing set. However, our results also show that the model, when tested on a real-world dataset consisting of 41 bugs in open-source systems, yields low performance (55% precision and 46% recall).

Finally, we tested the best models at one of our industrial partners. Adyen is one of the world's largest payment service providers, allowing customers from over 150 countries to use over 250 payment methods, including different internet bank transfers and point-of-sale solutions. The company works in a highly regulated banking industry and, combined with the high processing volumes, there is little to no room for errors. Hence, Adyen uses industry-standard best practices for early bug detection, such as code reviews, unit testing, and static analysis. It is in Adyen's best interest to look into novel tools to prevent software defects from finding their way into its large code base, preferring methods that scale and do not waste the most expensive resource of the company: the developers' time. Our results show that, while the model did not reveal any bugs per se, it pointed developers to code that they considered to deviate from good practices.

    private boolean skipTo(String toFind) throws IOException {
      outer:
      for (; pos + toFind.length() < limit || fillBuffer(toFind.length()); pos++) {
        for (int c = 0; c < toFind.length(); c++) {
          if (buffer[pos + c] != toFind.charAt(c)) {
            continue outer;
          }
          ...
        }
      }

Fig. 1: An off-by-one error in the Gson library, fixed in commit 161b4ba (https://github.com/google/gson/commit/161b4ba).

II. RELATED WORK

The use of static analysis tools is quite common among software development teams (e.g., [15, 16]). These tools, however, rely on bug pattern detectors that are manually crafted and fine-tuned by static analysis experts. The vast amount of different bug patterns makes it very difficult to cover more than a fraction of them.

Machine Learning for Software Engineering has seen rapid development in recent years, inspired by successful applications in the Natural Language Processing field [17]. It is applied to many tasks related to source code, such as code translation (e.g., [18]), type inference (e.g., [9, 10]), code refactoring (e.g., [19]) and, as we list below, bug identification.

Pradel et al. [12] use a technique similar to Word2Vec [20] to learn embeddings for JavaScript code tokens extracted from the AST. These embeddings are used to train two-layer feed-forward binary classification models to detect bugs. Each trained model focuses on a single bug type, and the authors test it on problems such as wrong binary operator, wrong operand in a binary operation, and swapped function arguments. These models do not use all the tokens from the code, but only those specific to the problem at hand. For example, the model that detects swapped function arguments only uses embeddings of the function name and arguments, with a few other AST nodes as features.

Allamanis et al. [21] use a Gated Graph Neural Network [22] to detect variable misuse bugs at the token level. As input to the model, the authors use an AST graph of the source code and augment it with additional edges from the control flow graph.

Pascarella et al. [23] show that defective commits are often composed of both defective and non-defective files. They also train a model to predict defective files in a given commit.

Habib et al. [24] create an embedding for methods using a one-hot encoding of tokens such as keywords (for, if, etc.), separators (;, (), etc.), identifiers (method and variable names), and literals (values such as "abc" and 10). The embeddings for the first 50 tokens are then used to create a binary classification model. The oracle for the training data is a state-of-the-art static analysis tool, and the results show that neural bug finding can be highly successful for some patterns, but fails at others.

Li et al. [25] use the method AST in combination with a global Program Dependency Graph and Data Flow Graph to determine whether the source code in a given method is buggy or not. The authors use Word2Vec to extract AST node embeddings, with a combination of a GRU attention layer and an attentional convolutional layer to build a representation of the method's body. Node2Vec [26] is used to create a distributed representation of the data flow graph of the file in which the inspected method resides.
The results are combined into a method vector which is used to make a softmax prediction.

Wang et al. [27] define bug prediction as a binary classification problem and train three different graph neural networks based on control flow graphs of Java code. They use a novel interval-based propagation mechanism to more efficiently generalize a Graph Neural Network (GNN). The resulting method embedding is fed into a feed-forward neural network to find null-pointer dereference, array index out of bounds, and class cast exceptions. For each bug type, a separate bug detector is trained.

III. APPROACH

In order to detect off-by-one errors in Java code, we aim to create a hypothesis function that calculates an output based on the inputs generated from an example. More specifically, we train and compare different binary classification machine learning models to classify Java source code methods into one of two possible output labels: "defective" and "non-defective". If a method is classified as "defective", it is suffering from an off-by-one error; otherwise, it is deemed to be free from such errors.

These models are based on the Code2Vec [8] and Code2Seq [13] models, state-of-the-art deep learning models originally developed for generating method names and descriptions. The models use Abstract Syntax Tree (AST) paths of a method as features and create an embedding by combining them with the help of an attention mechanism. In addition, we also build a Random Forest baseline model based on source code tokens.

We acquired the datasets necessary for the training of these models from the work of Alon et al. [8], which results in an imbalanced dataset of 920K examples (1-to-10 ratio) and a balanced dataset of 1.6M examples when combined with our automatically mutated methods.

We train on both imbalanced and balanced data to see the difference in performance. We then evaluate the accuracy of the model on 41 real-world open-source off-by-one errors. In addition, we further train the models with data from a company project to fine-tune the model and find bugs from that project specifically.

In Figure 2, we show the overall research method. In the following, we provide a more detailed description of the approach and research questions.

Fig. 2: The flow of the research, including data collection, mutation, training and testing. (The figure depicts splitting and mutating methods from the java-med and java-large datasets, tuning hyper-parameters on java-med, training on the mutated java-large dataset, transferring the trained model, further training on the closed-source company project, and finding bugs, with results for open-source software and for company data.)

TABLE I: The different datasets used in this paper.

  Dataset                 Train       Validation   Test
  java-large-balanced     1,593,610   30,634       48,516
  java-large-imbalanced   876,485     16,849       26,684
  Adyen
  open-source bugs        -           -            82

A. Datasets

We used the java-large dataset provided by Alon et al. [13] for model training. We used Adyen's production Java code to further train and test the model with project-specific data. Finally, we used an additional real-world-bugs dataset to evaluate the models on real-world bugs. A summary of the datasets can be seen in Table I.

1) The java-large-balanced dataset consists of 9,500 top-starred Java projects from GitHub created since January 2007. Out of those 9,500 projects, 9,000 were randomly selected for the training set, 250 for the validation set, and the remaining 300 were used for the testing set.
Originally, this dataset contained about 16M methods, of which 836,380 were candidates for off-by-one errors (i.e., methods with loops and if conditions containing the binary operators <, <=, > or >=). After mutating the methods, the final balanced dataset consisted of 1,672,760 methods, 836,380 of them assumed to be correct and 836,380 assumed to be buggy.

2) The additional imbalanced dataset java-large-imbalanced was constructed to emulate more realistic data, where the majority of the code is not defective. A 10-to-1 ratio between non-defective and defective methods was chosen since it resulted in high precision while having a reasonable recall. We empirically observed that, upon increasing the ratio of non-defective methods even further, the model did not return possibly defective methods when running on Adyen's codebase. That is, if the ratio was higher than 10-to-1, the recall of the model became too low for it to be usable.

3) Adyen's code is a repository containing the production Java code of the company. It consists of over 200,000 methods, of which 7,435 contain a mutation candidate to produce an off-by-one error. After mutating the methods, this resulted in a balanced dataset containing 14,870 data points.

4) 41 real-world bugs in boundary conditions were used for manual evaluation. We extracted the bugs from the 500 most starred GitHub Java projects. The analyzed projects were not part of the training and evaluation sets and thus were not seen by a model before testing. Using a Pydriller script [28], we extracted a list of candidate commits where authors made a change to a comparator (e.g., a ">" to ">="; a "<=" to "<", etc.). This process returned a list of 1,571 candidate commits, which were analyzed manually until 41 were confirmed to be off-by-one errors and added to the dataset. The manual analysis was stopped at that point due to it being a very labor-intensive process.

B. Generating positive and negative instances

In order to train a supervised binary classification model, we require defective examples. To get those, we modified existing, likely correct code to produce likely incorrect code. For each method, we found a list of possible mutation points and selected a random one. After this, we altered the selected binary expression using JavaParser (https://github.com/javaparser/javaparser/) in a way that generates an off-by-one error.

Because we change only one of the expressions, the equivalent mutant problem does not exist for the training examples, unless the original code was unreachable at the position of the mutation. (An equivalent mutant may still arise, for example, if we mutate "dead code"; however, we conjecture that this is a negligible problem and will not affect the results.) It is also important to note that the datasets are split at the project level for the java-large dataset and at a sub-module level for Adyen's code. This means that the positive and the negative examples both end up in the same training, validation, or test set. We did this to avoid evaluating model predictions on code that differs by only one binary operator from code that was used during training.
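As an illustration of this mutation step, the sketch below flips one randomly chosen comparison operator using JavaParser. It is a minimal reconstruction, not the implementation used in the study: the class and method names are ours, and, unlike the real pipeline, it operates on a whole compilation unit rather than on individual methods.

    import com.github.javaparser.StaticJavaParser;
    import com.github.javaparser.ast.CompilationUnit;
    import com.github.javaparser.ast.expr.BinaryExpr;
    import com.github.javaparser.ast.expr.BinaryExpr.Operator;

    import java.util.List;
    import java.util.Map;
    import java.util.Random;
    import java.util.stream.Collectors;

    // Minimal sketch of the off-by-one mutation; illustrative, not the paper's code.
    public class OffByOneMutator {

        // Each comparison operator mapped to its off-by-one counterpart.
        private static final Map<Operator, Operator> SWAP = Map.of(
                Operator.LESS, Operator.LESS_EQUALS,
                Operator.LESS_EQUALS, Operator.LESS,
                Operator.GREATER, Operator.GREATER_EQUALS,
                Operator.GREATER_EQUALS, Operator.GREATER);

        private static final Random RANDOM = new Random();

        /** Returns a mutated copy of the source, or null if there is no mutation point. */
        public static String mutate(String javaSource) {
            CompilationUnit cu = StaticJavaParser.parse(javaSource);

            // All binary expressions using <, <=, > or >= are candidate mutation points.
            List<BinaryExpr> candidates = cu.findAll(BinaryExpr.class).stream()
                    .filter(e -> SWAP.containsKey(e.getOperator()))
                    .collect(Collectors.toList());
            if (candidates.isEmpty()) {
                return null;
            }

            // Select one random mutation point and flip its operator.
            BinaryExpr target = candidates.get(RANDOM.nextInt(candidates.size()));
            target.setOperator(SWAP.get(target.getOperator()));
            return cu.toString();
        }
    }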
C. Model Architecture

The models we used in this work are based on the recent Code2Vec model [8] and its enhancement Code2Seq [13], plus a baseline model that makes use of a random forest. We describe the models in more detail in the next sub-sections.

1) Code2Vec: The Code2Vec model created by Alon et al. [8] is a neural network model used to create embeddings of Java methods. These embeddings were used in the original work to predict method names.

The architecture of this model requires Java methods to be split into path contexts based on the AST of the method. A path context is a random path between two nodes in the AST and consists of two terminal nodes x_s, x_t and the path p_j between those terminal nodes, which does not include the terminals. The embeddings for those terminal nodes and paths are learned during training and stored in two separate vocabularies. During training, these parts are concatenated into a single context vector c_i = [x_s; p_j; x_t] of length 2|x_s| + |p_j|, where the embeddings x_s and x_t have equal length.

The acquired context vectors c_i are passed through the same fully connected (dense) neural network layer (using the same weights). The network uses a hyperbolic tangent activation function and dropout in order to generate a combined context vector c̃_i. The size of the dense layer allows controlling the size of the resulting context vector.

The attention mechanism of the model works by using a global attention vector a ∈ R^h, which is initialized randomly and learned with the rest of the network. It is used to calculate an attention weight α_i for each individual combined context vector c̃_i.

Some methods do not have a large enough AST to generate the required number of context paths. For these, dummy (masked) context paths are fed to the model and receive an attention weight α_i of zero. This enables the model to use examples of the same shape.

During training, a tag vocabulary tags_vocab ∈ R^{|Y|×l} is created, where each tag (label) y_i ∈ Y corresponds to an embedding of size l. The tag embeddings are learned during training; in the task proposed by the authors, they represent method names.

A prediction for a new example is made by computing the normalized dot product between the code vector v and each of the tag embeddings tags_vocab_i, resulting in a probability for each tag y_i. The higher the probability, the more likely the tag belongs to the method.
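Written out, the attention aggregation just described computes, for n combined context vectors (our transcription of the formulation in Alon et al. [8]):

\[ \alpha_i = \frac{\exp(\tilde{c}_i^{\top} a)}{\sum_{j=1}^{n} \exp(\tilde{c}_j^{\top} a)}, \qquad v = \sum_{i=1}^{n} \alpha_i \,\tilde{c}_i, \qquad q(y_k) = \frac{\exp(v^{\top} \mathit{tags\_vocab}_k)}{\sum_{y' \in Y} \exp(v^{\top} \mathit{tags\_vocab}_{y'})} \]

where v is the code vector and q(y_k) is the predicted probability of tag y_k.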
2) Code2Seq: The Code2Seq model created by Alon et al. [13] is a sequence-to-sequence model used to create embeddings of Java methods from which method descriptions are learned. The original work generated sequences of natural language words to describe methods.

Similarly to the Code2Vec model, the model works by generating random paths from the AST with a specified maximum length. Each path consists of two terminal tokens x_s, x_t and the path p_j between those terminal nodes which, in Code2Seq, includes the terminal nodes p_s, p_t ∈ p_j, but not the tokens.

It is important to distinguish between terminal tokens and path nodes. The former are user-defined values, such as a number or a variable called stringBuilder, while the latter come from a limited set of AST constructs such as NameExpr, BlockStmt, and ReturnStmt. There are around 400 different node types predefined in the JavaParser implementation.

During training, the path nodes and the terminal tokens are encoded differently. Terminal tokens get partitioned into subtokens based on the camelCase notation, which is a standard coding convention in Java. For example, a terminal token stringBuilder will be partitioned into string and Builder. The subtokens are turned into embeddings with a learned matrix E^subtokens, and an encoding is created for the entire token by summing the values of its subtokens.

Paths of the AST are also split into nodes, and each of the nodes corresponds to a value in a learned embedding matrix E^nodes. These embeddings are fed into a bi-directional LSTM whose final states result in a forward-pass output h→ and a backward-pass output h←. These are concatenated to produce a path encoding.

As with the Code2Vec model, the encodings of the terminal tokens and the path are concatenated, and the resulting encoding is the input to a dense layer with tanh activation that creates a combined context vector c̃_i. Finally, to provide an initial state to the decoder, the representations of all n paths in a given method are averaged.

The decoder uses the initial state h_0 to generate an output sequence while attending over all the combined context vectors c̃_1, ..., c̃_n. The resulting output sequence represents a natural language description of the method. We adapted Code2Seq's sequence output to be a single binary token (0 or 1), indicating whether the method is defective.
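In equation form, the encoding of one path context as described above is (our paraphrase of Alon et al. [13]; W denotes the dense layer's weight matrix, a name we introduce here):

\[ \mathrm{enc}(w) = \sum_{s \in \mathrm{subtokens}(w)} E^{\mathrm{subtokens}}_{s}, \qquad \mathrm{enc}(p_j) = [\overrightarrow{h}\,;\,\overleftarrow{h}], \qquad \tilde{c}_i = \tanh\!\left(W \left[\mathrm{enc}(x_s)\,;\,\mathrm{enc}(p_j)\,;\,\mathrm{enc}(x_t)\right]\right) \]

with the decoder's initial state obtained by averaging: \( h_0 = \frac{1}{n} \sum_{i=1}^{n} \tilde{c}_i \).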
3) Baseline Model: We developed a baseline model to assess the performance of a simpler architecture. For this, we used a Random Forest model [29] and compared its performance on the same datasets.

First, we tokenized the Java methods using the leaf nodes of their respective ASTs. After this, all the tokens of a method were vectorized using the TF-IDF method. The vectorized tokens of one method comprised a training example for the Random Forest model. This model was then trained on all of the methods from the java-large training set.

D. Hyper-parameter optimization and model training

For hyper-parameter optimization, we used Bayesian optimization [30]. We selected model precision as the optimization target, since high precision is required to obtain a usable defect prediction model. We used Bayesian optimization over other methods like random search or grid search because it generates a surrogate function that searches the hyper-parameter space based on previous results and acts as intuition for parameter selection. This saves significant time, because the actual model does not need to run as often: wrong parameter ranges are discarded early in the process.

The hyper-parameters are optimized on java-med-balanced, another dataset made public by Alon et al. [8]. The dataset consists of 1,000 top-starred Java projects from GitHub. Out of those 1,000 projects, 800 were randomly selected for the training set, 100 for the validation set, and the remaining 100 were used for the testing set. Originally, this dataset contained about 4M methods, but 170,295 were candidates for off-by-one errors (i.e., methods with loops and if conditions containing the binary operators <, <=, > or >=). This resulted in a balanced dataset of 340,590 methods, 170,295 of them assumed to be correct and 170,295 assumed to be buggy.

We ran the optimization for four different scenarios: two runs on the balanced java-med dataset with the Code2Vec and Code2Seq models, respectively, and an additional two runs with the same models on the imbalanced datasets. We used a machine with an Intel(R) Xeon(R) CPU E5-2660 v3 processor running at 2.60GHz with a Tesla M60 graphics card.

Once the hyper-parameters were identified, we trained the Code2Vec and Code2Seq models (as well as the baseline) using the balanced and imbalanced versions of the java-large dataset, and performed further training with the source code of our industrial partner. We show the training times of the final models in Table II.

TABLE II: Model training times. B=Balanced dataset, I=Imbalanced dataset, E=number of epochs, T=time to train.

            Balanced        Imbalanced      Adyen
  Model     T         E     T         E     T      E
  Baseline  5h33m     1     1h59m     1     48s    1
  Code2Vec  1d2h2m    52    11h6m     52    1h1m   53
  Code2Seq  3d18h18m  14    2d14h41m  15    1h8m   17

E. Analysis

We report the precision and recall of our models. Precision helps to evaluate the models' proneness to classify negative examples as positive, also known as false positives. A model with high precision has a low false-positive rate, and a model with low precision has a high false-positive rate. More formally, precision is the number of true positive (TP) predictions divided by the sum of true positive and false positive (FP) predictions.

For a bug detection model, low precision means a high number of false positives, making developers spend their time checking a large number of errors reported by the model only to find very few predictions that are actually defective. Hence, in this work, we prefer high precision for a bug-detection model.

Monitoring precision alone is not enough, since a model that is precise but predicts only a few bugs per thousands of bugs is also not useful. Hence, recall is also measured. It measures the models' ability to find all the defective examples in the dataset. The recall of a model is low when it does not find many of the positive examples in the dataset, and very high if it manages to find all of them. More formally, it is the number of true positive predictions divided by the sum of true positive and false negative (FN) predictions.

Ideally, a bug prediction model would find all of the bugs in the dataset and have a high recall score. However, deep learning networks usually do not achieve perfect precision and recall at the same time. For more difficult problems, a probabilistic model offers a trade-off: when increasing the threshold of the model's confidence for the positive class, the recall will decline. For this reason, the scikit-learn package was used to also produce a precision-recall curve, to observe how precision and recall change as the confidence needed to classify an example as positive (defective) varies.
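In formulas, the metrics discussed above (including the F1 score later reported in Table IV) are:

\[ \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \]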
F. Reproducibility

We provide all the source code (data collection, pre-processing, and machine learning models) in our online appendix [31]. The source code is also available on GitHub (https://github.com/hsellik/thesis/tree/MSR-2021).

IV. METHODOLOGY

The goal of this study is to measure the effectiveness of deep learning models in identifying off-by-one mistakes. To that aim, we propose three research questions:

• RQ1: How do the models perform on a controlled dataset? In order to obtain a vast quantity of data, we use a controlled dataset (see Section III-A). We train the models on the dataset and use metrics such as precision and recall to assess the performance.

• RQ2: How well do the methods generalize to a dataset made of real-world bugs? We mine a dataset of real-world off-by-one error bugs from GitHub issues of various open-source projects. Then we use a model to predict the error-proneness of a method before and after a fix. This indicates how well the model works on real-world data and enables us to extract the precision metric and compare it to the one from RQ1.

• RQ3: Can the approach be used to find bugs in a large-scale industry project? One useful application of an error-detection model is to analyze an existing project and point out methods containing off-by-one errors. We make several runs where the model is first trained on a dataset with mutated code and then tested on real code to find such errors. In addition, we further train the model with a different version of the industry project to find errors in future versions of the project.

To answer RQ1, we performed hyper-parameter optimization. After this, we selected the best hyper-parameter values and trained the model with randomly initialized parameters on the java-large dataset, on the same machine as used for hyper-parameter optimization (see Section III-D). We trained the Code2Seq and Code2Vec models until there was no gain in precision on the evaluation set for three epochs of training. After this, we assessed the model on the testing set of the java-large dataset.

The process was conducted for three different configurations of data:
1) BB - the training data was balanced (B), with the cross-validation and testing data also being balanced (B).
2) BI - the training data was balanced (B), with the cross-validation and testing data being imbalanced (I).
3) II - the training data was imbalanced (I), with the cross-validation and testing data also being imbalanced (I).

The data imbalance was inspired by the work of Habib et al. [24], who reported that a bug detection model trained on a balanced dataset would have poor performance when tested in a more real-life scenario with imbalanced classes.

To answer RQ2, we selected the model that performed best on the controlled java-large testing set (see Table III), which was the model based on the Code2Seq architecture. After this, the model was tested on the bugs and their fixes found in several real-world Java projects (the open-source bugs dataset in Table I).

First, we tested the model on the correct code, obtained from the GitHub diff after the change, to see the classification performance on non-defective code. To test the model's performance on defective code, we reverted the example to the state where the bug was present using the git version control system. After this, we recorded the model's prediction on the defective method.

In addition, as a way to compare our work with static analysis, we applied three popular static analyzers to the same set of defective and non-defective snippets: SpotBugs (v.4.0.0-beta1), PVS-Studio (v.7.04.34029), and the static analyzer integrated with IntelliJ IDEA (v.2019.2.3).

To answer RQ3, we trained the Code2Seq model only on the data generated from the company project, but the training did not start from randomly initialized weights. Instead, the process started from the weights acquired after training on the java-large dataset (see Figure 2).

We selected the Code2Seq-based model because it had the best performance on the imbalanced testing set of the controlled java-large set. We used the performance on the imbalanced controlled set as the criterion, since we assumed that the company project also contains more non-defective examples than defective ones.

We used the pre-trained model because the company project alone did not contain enough data for the training process. Additionally, due to the architecture of the Code2Seq and Code2Vec models, the embeddings of the terminal and AST node vocabularies did not receive additional updates during further training with company data.
We trained the model until there was no gain in precision for three epochs on the validation set, and after this, we tested the model on the test set consisting of controlled Adyen data.

We conducted an additional check on Adyen data by trying to find bugs in the most recent version of the project. More specifically, we updated the project to its most recent version using the git version control system and, without any modifications to the original code, used the model to predict whether each Java method in the code base had an off-by-one mistake. We analyzed all bug predictions that were over a threshold of 0.8 to see if they contained bugs. The 0.8 threshold was defined after manual experimentation. We aimed at a set of methods that was large enough to bring us interesting conclusions, yet small enough to enable us to manually verify each of them.

A. Threats to Validity

In this section, we discuss the threats to the validity of this study and the actions we took to mitigate them.

1) Internal validity: Our method performs mutations to generate faulty examples from likely correct code by editing one of the binary conditions within the method. This means that while the correct examples represent a diverse set of methods from open-source projects, the likely incorrect methods may not represent a realistic distribution of real-world bugs. This affects the model that is being trained with those examples and also the testing results conducted on this data.

2) External validity: While the work included a diverse set of open-source projects, the only closed-source project used during this study was Adyen's. Hence, closed-source projects (in training and in validation) are under-represented in this study.

Moreover, we have only experimented with Java code snippets. While the approach seems generic enough to be applicable to any programming language, the results might vary given the particular way that developers write code in different communities. Therefore, more experimentation needs to be conducted before we can argue that the results generalize to any programming language.

V. RESULTS

In the following sections, we present the results of our research questions.

A. RQ1: How do the models perform on a controlled dataset?

In Table III, we show the precision and recall of the different models. In Figures 3a and 3b, we show the ROC curve and the precision-recall curve of the experiment with the Code2Seq-based model on the imbalanced java-large dataset.

Fig. 3: ROC and precision-recall curves for Experiment II with the java-large dataset. (a) ROC curve, area under the curve 0.89. (b) Precision-recall curve, area under the curve 0.65.

TABLE III: Model results in controlled testing sets. BB stands for balanced training and testing set, II stands for imbalanced training and testing set. Pr.=precision, Re.=recall.

                 Exp. BB          Exp. BI          Experiment II
                 java-large       java-large       java-large       Adyen (cross-project)   Adyen (further trained)
  Model          Pr.%    Re.%     Pr.%    Re.%     Pr.%    Re.%     Pr.%    Re.%            Pr.%    Re.%
  Code2Seq       85.23   84.82    36.08   84.86    83.04   42.34    71.15   24.66           66.66   30.66
  Code2Vec       80.11   77.01    28.52   75.53    64.65   41.00    53.85   20.46           43.95   23.39
  Baseline       50.00   49.08    8.99    49.18    17.86   0.15     0.00    0.00            9.25    0.92
  Offside [14]   80.9    75.6     -       -        -       -        -       -               -       -

Observation 1: Models present high precision and recall when trained and tested with balanced data. The results show that when training models on a balanced dataset with an equal amount of defective/non-defective code and then testing the same model on a balanced testing set, both the Code2Vec and Code2Seq models achieve great precision and recall, where the Code2Seq-based model has better precision (85.23% vs. 80.11%) and recall (84.82% vs. 77.01%).

Observation 2: The metrics drop considerably when tested on an imbalanced dataset.
When simulating a more real-life scenario by creating an imbalance in the testing set with more non-defective methods, the recall of the models remained similar, increasing from 84.82% to 84.86% for the Code2Seq model and dropping from 77.01% to 75.53% for the Code2Vec model. However, the precision of the models dropped drastically, with the Code2Seq model falling from 85.23% to 36.08% and the Code2Vec model from 80.11% to 28.52%. The baseline model also drops in precision, from 50% to 8.99%, while keeping the same recall.

Observation 3: The low precision can be mitigated by training on an imbalanced dataset, but at the cost of recall. We trained the Code2Seq and Code2Vec models on an imbalanced dataset, and the results show that the precision score on imbalanced data returned almost to the same level for the Code2Seq-based model (83.04% vs 85.23%), but remained lower for the Code2Vec-based model (64.65% vs 80.11%). However, the recall declined drastically, from 84.82% to 42.34% for the Code2Seq model and from 77.01% to 41.00% for the Code2Vec model. The ROC and precision-recall curves (Figures 3a and 3b) show how precision and recall can be traded off against each other by changing the classification threshold.
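This behaviour follows directly from the definition of precision. Writing π for the fraction of defective methods in the test set, r for the recall (true-positive rate), and f for the false-positive rate, precision can be rewritten as:

\[ \mathrm{Precision} = \frac{r\,\pi}{r\,\pi + f\,(1 - \pi)} \]

At a 10-to-1 imbalance (π = 1/11 ≈ 0.09), even a modest false-positive rate dominates the denominator. For example, taking the Code2Seq BB operating point (r ≈ 0.85 and, as implied by its 85.23% precision on balanced data, f ≈ 0.15), the formula predicts a precision of roughly 37% at π = 1/11, close to the 36.08% observed in Experiment BI.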
RQ1 summary: Both the Code2Seq- and Code2Vec-based models present high accuracy on a balanced dataset. The numbers drop when we make use of imbalanced (i.e., more similar to the real world) datasets.

B. RQ2: How well do the methods generalize to a dataset made of real-world bugs?

The performance of the model on the 41 real-world boundary mistakes and their non-defective counterparts is presented in Table IV.

TABLE IV: Results of applying the Code2Seq model to 41 real-world off-by-one bugs and their corrected versions. B=balanced training set, I=imbalanced training set, Pr=Precision, Re=Recall, Thr=Threshold.

  Model         Thr   TP   TN   FP   FN   Pr     Re     F1
  Code2Seq (B)  0.5   19   26   15   22   55.88  46.34  50.67
  Code2Seq (B)  0.8   10   33   8    31   55.56  24.39  33.90
  Code2Seq (I)  0.5   0    41   0    41   0      0      0

Observation 4: The model can detect real-world bugs, but with a high false-positive rate. Out of the 41 defective methods, 19 (46.34%) were classified correctly, and out of the 41 correct methods, 26 (63.41%) were classified correctly. Precision and recall scores of 55.88% and 46.34% were achieved when evaluating the Code2Seq model trained on balanced data on the real-world bugs, using a threshold of 0.5. Compared to the results from the java-large testing set with mutated methods, the results are significantly lower, with precision and recall being 29.35 and 38.08 points lower, respectively (see the metrics for the Code2Seq model in Experiment BB in Table III).

Observation 5: The state-of-the-practice linter tools did not find any of the real-world bugs. As an interesting remark, none of the bugs was identified by any of the state-of-the-practice linting tools we experimented with. This reinforces the need for approaches that identify such bugs (by means of static analysis or deep learning).

RQ2 summary: The model presents only reasonable performance on real-world off-by-one mistakes in open-source projects. Static analysis tools did not detect any of the bugs.

C. RQ3: Can the approach be used to find bugs in a large-scale industry project?

We present the accuracy of the model at our industrial partner, Adyen, also in Table III.

Observation 6: Models trained on open-source data show satisfactory results on the industry dataset. Our empirical findings show that when a model is trained on an open-source dataset and then applied to the company project (following the same pipeline of mutating methods to generate positive and negative instances), it achieves good precision and recall scores, with 71.15% and 24.66% for the Code2Seq model and somewhat lower 53.85% and 20.46% for the Code2Vec model, respectively.

Observation 7: Further training on the Adyen project did not yield better results. We hypothesized that training the model further on Adyen's code base would give a boost in precision and recall scores. The recall of the models improved, by 6.0 percentage points for the Code2Seq-based model and 2.93 for the Code2Vec-based model. However, the precision of both models dropped, by 4.49 percentage points for Code2Seq and 9.9 for Code2Vec.

Observation 8: The model did not reveal any bugs, but 20% of the reported methods were considered suspicious by the developers. Running the model on a newer version of the repository reported 36 potential bugs with a confidence above the 0.8 threshold (which we chose after experimenting with different thresholds and analyzing whether the number of suspicious methods the model returned was feasible to investigate manually). While no bugs were found after manually analyzing all the reported snippets, we marked seven methods as suspicious. When we showed these methods to the developers, they agreed that, while not containing a bug per se, the seven methods deviate from good coding standards and should be refactored. More specifically, four methods had a for loop initialized at a wrong index (i.e., the loop was initialized with i = 1, but inside the body the code performed several i - 1 operations), and three snippets had hard-coded, unusual constraints in the binary expression (e.g., a > 256, where 256 is a specific business constraint). Interestingly, Pradel and Sen [12] also observed that models can sometimes point to pieces of code that are not buggy, but highly deviate from coding standards.
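The two recurring patterns looked roughly like the following constructed examples (ours, for illustration; not Adyen's code):

    import java.util.List;

    // Constructed illustrations of the two suspicious (but not buggy) patterns.
    class SuspiciousPatterns {

        // Pattern 1: the loop starts at 1 and keeps indexing i - 1.
        // Functionally correct, but starting at 0 would be clearer and less error-prone.
        static int sum(List<Integer> items) {
            int total = 0;
            for (int i = 1; i <= items.size(); i++) {
                total += items.get(i - 1);
            }
            return total;
        }

        // Pattern 2: a hard-coded business constraint inside a binary expression;
        // 256 stands in for a domain-specific limit that deserves a named constant.
        static boolean exceedsLimit(String payload) {
            return payload.length() > 256;
        }
    }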
Observation 9: The model can potentially be useful at commit time; however, the number of false alarms is to be considered. Fixing mistakes regarding good code practices in old pieces of software might not be considered worthwhile at large companies, given the possible unwanted changes to the behavior of the software. However, if such a system were to be employed during automated testing, the alerts might help developers adhere to better practices. We observed the model pointing to relevant problems in 7 out of the 36 potential bugs (20% of the methods it identified). While 20% might be considered a low number, one might argue that inspecting 36 methods out of a code base that contains thousands of methods is not a costly operation and might be worth the effort. However, we still do not know the number of false negatives the tool might produce, as inspecting all the methods of the code base is unfeasible.

RQ3 summary: When tested on a large-scale industrial software system, the approach did not reveal any bugs per se, but pointed to code considered to deviate from good practices.

VI. FUTURE WORK

We see much room for improvement before these models can reliably identify off-by-one errors. In the following, we list the improvements we believe to be most urgent:

The need for more data points for the off-by-one problem. In this paper, we leveraged the existing java-large dataset created by Alon et al. [13]. While the entire dataset was built on top of 9,500 GitHub projects and contained approximately 16M methods, only around 836k had binary conditions (e.g., methods with loops and ifs containing a <, <=, > or >=). We augmented this dataset to 1.6M examples by introducing the defective samples. Nevertheless, there is a big difference between 16M and 1.6M methods for training. As Alon et al. [8] argue, "a simpler model with more data might perform better than a complex model with little data". It should be part of any future work to devise a much larger dataset for the off-by-one problem and to try the models we experiment with here before proposing more complex models.

Moreover, our dataset contains fewer usages of >= or <= compared to usages of > or <, clearly reflecting the preferences of developers when coding such boundaries. These differences can lead to biased training and, as a result, we observed models tending to give false positive results in the case of >= or <=. One way to mitigate the issue is to create a balanced dataset with a more equal distribution of binary operators, as well as of the places where they occur (if-conditions, for- and while-loops, ternary expressions, etc.).

The challenges of imbalanced data. In this study, we explored the effects of balance and imbalance on the effectiveness of the model. However, the real imbalance of the problem in real life (i.e., the proportion between methods with off-by-one mistakes and methods without them) is unknown, although we strongly believe it to be imbalanced. Nevertheless, a 10:1 proportion enables us to gain an initial understanding of how models handle such high imbalance. Our results show that it indeed negatively affects the performance of the model. Therefore, we suggest researchers focus their attention on how to make these models better in the face of imbalanced datasets.

The support for inter-procedural analysis. Currently, our approach only supports the analysis of the AST of one method. However, the behaviour of a method, and the possibility of bugs therein, also depends on the contents of other methods. For example, in recent research by Compton et al. [32], the embedding vectors from the Code2Vec model are concatenated to form an embedding for the entire class. Future work should explore whether class embeddings would perform better.

Experimenting with different (and more recent) architectures. In our work, we mainly looked at the Code2Vec and Code2Seq models. We now see more recent models, such as the GREAT model proposed by Hellendoorn et al. [33], which uses transformers and also captures the data-flow of the code. We believe that data-flow information would enhance the performance of our models.

Making use of Byte-Pair Encoding (BPE) techniques. NLP models are often dependent on the vocabulary they are trained on. The out-of-vocabulary (OoV) problem also occurs in this work. When testing the models trained on open-source data at Adyen, we had to replace unknown tokens with a generic UNK token. We conjecture that this may diminish the effectiveness of the models. We unfortunately did not measure how many times the UNK token was used in our experiments; we plan to measure this more precisely in future replications of this work. In future work, we also plan to make use of techniques such as Byte-Pair Encoding (BPE) [34, 35], which attempts to mitigate the impact of out-of-vocabulary tokens. We note that the use of BPE is becoming more and more common in software engineering models (e.g., [10, 33, 36]).
A deeper understanding of the differences between our model and Pradel and Sen's [12] model. The DeepBugs paper explores the effectiveness of deep learning models on a similar problem, which the authors call "wrong binary operator". The overall idea of their approach is similar to ours (in other words, their work also served as inspiration for this one): the negative instances (i.e., the buggy code) are generated through mutations of the positive code (i.e., non-buggy code), the code representation is a vector based on the embeddings of all the identifiers in the code, and the classification task is a feed-forward neural network that learns from a balanced set of positive and negative instances. Their results show an accuracy of 89%-92% on the controlled dataset (i.e., slightly higher than our results in RQ1), and a precision of 68% in the manual analysis (i.e., higher than our results in RQ2). Interestingly, the authors also observe that the model reports non-buggy code which deviates from best practices (i.e., similar to our observations in RQ3). When designing this study, we did not explicitly compare the results to DeepBugs. The embeddings derived from code2vec/code2seq capture more information, and we conjectured that they would naturally supersede DeepBugs.

We nevertheless see a few differences between the two works. First, in their "wrong binary operator" task, the mutation replaces the (correct) binary operator with any binary operator; e.g., a correct i < length can become i % length. In our case, we limit ourselves only to off-by-one mistakes, i.e., a correct i < length will always become i <= length. We conjecture that this may increase the difficulty for the model to learn, as the bugs are now slightly more subtle. Second, while the manual analysis conducted in the DeepBugs paper is performed on the testing set (which contains artificial bugs), our RQ2 explores the performance of the model on real-world bugs, i.e., bugs that were found and fixed by developers. This extra dose of reality may be the reason for the lower performance. Finally, we assumed that more robust models such as code2vec and code2seq would better capture the intricacies of the off-by-one mistake. The model used in DeepBugs is simpler and yet as accurate as ours. More work is needed to understand the pros and cons of our model and how both works can be combined to develop better and more accurate models.

VII. CONCLUSIONS

Software development practices offer many techniques for detecting bugs at an early stage. However, these methods come with their own challenges and are either too labor-intensive or leave a lot of room for improvement. In this paper, we adapted recent state-of-the-art deep learning models to detect off-by-one errors in Java code, which are traditionally hard for static analysis tools due to their high dependency on context.

We concluded that the trained models, while effective on controlled datasets, still do not work well in real-world situations. We see the use of deep learning models to identify off-by-one errors as promising. Nevertheless, there is still much room for improvement, and we hope that this paper helps researchers in paving the road for future studies in this direction.

ACKNOWLEDGMENTS

We thank Jón Arnar Briem, Jordi Smit, and Pavel Rapoport for their participation in the workshop version of this paper.

REFERENCES

[1] B. Jeng and E. J. Weyuker, "A simplified domain-testing strategy," ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 3, no. 3, pp. 254–270, 1994.

[2] S. C.
Reid, "An empirical analysis of equivalence partitioning, boundary value analysis and random testing," in Proceedings Fourth International Software Metrics Symposium. IEEE, 1997, pp. 64–73.

[3] D. Hoffman, P. Strooper, and L. White, "Boundary values and automated component testing," Software Testing, Verification and Reliability, vol. 9, no. 1, pp. 3–26, 1999.

[4] B. Legeard, F. Peureux, and M. Utting, "Automated boundary testing from Z and B," in International Symposium of Formal Methods Europe. Springer, 2002, pp. 21–40.

[5] P. Samuel and R. Mall, "Boundary value testing based on UML models." IEEE, 2005, pp. 94–99.

[6] M. Allamanis, E. T. Barr, C. Bird, and C. Sutton, "Learning natural coding conventions," in Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, 2014, pp. 281–293.

[7] ——, "Suggesting accurate method and class names," in Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, 2015, pp. 38–49.

[8] U. Alon, M. Zilberstein, O. Levy, and E. Yahav, "code2vec: Learning distributed representations of code," Proceedings of the ACM on Programming Languages, vol. 3, no. POPL, pp. 1–29, 2019.

[9] V. J. Hellendoorn, C. Bird, E. T. Barr, and M. Allamanis, "Deep learning type inference," in Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2018, pp. 152–162.

[10] M. Allamanis, E. T. Barr, S. Ducousso, and Z. Gao, "Typilus: Neural type hints," in PLDI, 2020.

[11] M. Pradel, G. Gousios, J. Liu, and S. Chandra, "Typewriter: Neural type prediction with search-based validation," arXiv preprint arXiv:1912.03768, 2019.

[12] M. Pradel and K. Sen, "DeepBugs: A learning approach to name-based bug detection," Proceedings of the ACM on Programming Languages, vol. 2, no. OOPSLA, pp. 1–25, 2018.

[13] U. Alon, S. Brody, O. Levy, and E. Yahav, "code2seq: Generating sequences from structured representations of code," in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=H1gKYo09tX

[14] J. A. Briem, J. Smit, H. Sellik, P. Rapoport, G. Gousios, and M. Aniche, "OffSide: Learning to identify mistakes in boundary conditions," in Proceedings of the IEEE/ACM 42nd International Conference on Software Engineering Workshops, 2020, pp. 203–208.

[15] K. F. Tómasdóttir, M. Aniche, and A. van Deursen, "The adoption of javascript linters in practice: A case study on eslint," IEEE Transactions on Software Engineering, 2018.

[16] K. F. Tómasdóttir, M. Aniche, and A. van Deursen, "Why and how javascript developers use linters." IEEE, 2017, pp. 578–589.

[17] A. Hindle, E. T. Barr, Z. Su, M. Gabel, and P. Devanbu, "On the naturalness of software," in 2012 34th International Conference on Software Engineering (ICSE). IEEE, 2012, pp. 837–847.

[18] X. Chen, C. Liu, and D. Song, "Tree-to-tree neural networks for program translation," in Advances in Neural Information Processing Systems, 2018, pp. 2547–2557.

[19] M. Aniche, E. Maziero, R. Durelli, and V. Durelli, "The effectiveness of supervised machine learning algorithms in predicting software refactoring," Transactions on Software Engineering (TSE), 2020.

[20] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.

[21] M. Allamanis, M. Brockschmidt, and M.
Khademi, "Learning to represent programs with graphs," arXiv preprint arXiv:1711.00740, 2017.

[22] Y. Li, R. Zemel, M. Brockschmidt, and D. Tarlow, "Gated graph sequence neural networks," in Proceedings of ICLR'16, April 2016.

[23] L. Pascarella, F. Palomba, and A. Bacchelli, "Fine-grained just-in-time defect prediction," Journal of Systems and Software, vol. 150, pp. 22–36, 2019.

[24] A. Habib and M. Pradel, "Neural bug finding: A study of opportunities and challenges," arXiv preprint arXiv:1906.00307, 2019.

[25] Y. Li, S. Wang, T. N. Nguyen, and S. Van Nguyen, "Improving bug detection via context-based code representation learning and attention-based neural networks," Proceedings of the ACM on Programming Languages, vol. 3, no. OOPSLA, pp. 1–30, 2019.

[26] A. Grover and J. Leskovec, "node2vec: Scalable feature learning for networks," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016, pp. 855–864.

[27] Y. Wang, F. Gao, L. Wang, and K. Wang, "Learning a static bug finder from data," arXiv preprint arXiv:1907.05579, 2019.

[28] D. Spadini, M. Aniche, and A. Bacchelli, "PyDriller: Python framework for mining software repositories," in Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2018, pp. 908–911.

[29] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.

[30] J. Snoek, H. Larochelle, and R. P. Adams, "Practical bayesian optimization of machine learning algorithms," in Advances in Neural Information Processing Systems, 2012, pp. 2951–2959.

[31] H. Sellik, O. van Paridon, G. Gousios, and M. Aniche, "Learning off-by-one mistakes: An empirical study (appendix)," https://doi.org/10.5281/zenodo.4560410, 2021.

[32] R. Compton, E. Frank, P. Patros, and A. Koay, "Embedding java classes with code2vec: improvements from variable obfuscation," in MSR 2020. ACM, 2020.

[33] V. J. Hellendoorn, C. Sutton, R. Singh, P. Maniatis, and D. Bieber, "Global relational models of source code," in International Conference on Learning Representations, 2019.

[34] P. Gage, "A new algorithm for data compression," C Users Journal, vol. 12, no. 2, pp. 23–38, 1994.

[35] R. Sennrich, B. Haddow, and A. Birch, "Neural machine translation of rare words with subword units," arXiv preprint arXiv:1508.07909, 2015.

[36] R.-M. Karampatsis, H. Babii, R. Robbes, C. Sutton, and A. Janes, "Big code != big vocabulary: Open-vocabulary models for source code," arXiv preprint arXiv:2003.07914, 2020.