Learning Off-By-One Mistakes: An Empirical Study
Hendrig Sellik, Onno van Paridon, Georgios Gousios, Maurício Aniche
Hendrig Sellik
Delft University of Technology
Delft, The Netherlands
[email protected]

Onno van Paridon
Adyen N.V.
Amsterdam, The Netherlands
[email protected]

Georgios Gousios, Maurício Aniche
Delft University of Technology
Delft, The Netherlands
{g.gousios,m.f.aniche}@tudelft.nl
Abstract—Mistakes in binary conditions are a source of error in many software systems. They happen when developers use, e.g., '<' or '>' instead of '<=' or '>='. These boundary mistakes are hard to find and impose manual, labor-intensive work on software developers. While previous research has proposed solutions to identify errors in boundary conditions, the problem remains open. In this paper, we explore the effectiveness of deep learning models in learning and predicting mistakes in boundary conditions. We train different models on approximately 1.6M examples with faults in different boundary conditions. We achieve a precision of 85% and a recall of 84% on a balanced dataset, but lower numbers on an imbalanced dataset. We also test the models on 41 real-world boundary condition bugs found on GitHub, where the model shows only modest performance. Finally, we test the model on a large-scale Java code base from Adyen, our industrial partner. The model reported 36 buggy methods, but none of them were confirmed by developers.
Index Terms—machine learning for software engineering, deep learning for software engineering, software testing, boundary testing.
I. INTRODUCTION
Off-by-one mistakes happen when developers do not correctly implement a boundary condition in the code. Such mistakes often occur when developers use '>' or '<' in cases where they should have used '>=' or '<=', or vice versa. Take the example of an off-by-one error in the Gson library, which we illustrate in Figure 1. The toFind.length() < limit condition is wrong. The fix replaces the < operator with the <= operator. Such mistakes are particularly difficult to find in source code. After all, the result of the program is not always obviously wrong, as it is "merely off by one". In most cases, the mistake will lead to an "out of bounds" situation, which will then result in an application crash.

A large body of knowledge in the software testing field is dedicated to (manual) boundary testing techniques (e.g., [1, 2, 3, 4, 5]). However, manually inspecting code for off-by-one errors is time-consuming, since determining which binary operator is the correct one is usually heavily context-dependent. The industry has been relying on static analysis tools, such as SpotBugs (https://spotbugs.github.io) or PVS-Studio. SpotBugs promises to identify possible infinite loops, as well as array indices, offsets, lengths, and indexes that are out of bounds. PVS-Studio also tries to identify mistakes in conditional statements and indexes that are out of bounds in array manipulation. And while they can indeed find some of these mistakes, many go undetected. As we later show in this paper, none of the real-world off-by-one errors could be detected by the state-of-the-practice static analysis tools.

We conjecture that, for a tool to be able to precisely identify mistakes in boundary conditions, it should be able to capture the overall context of the source code under analysis. Understanding the context of the source code has traditionally been a challenge for static analysis techniques. However, recent advances in machine and deep learning have shown that models can learn useful information from the syntactic and semantic information that exists in source code. Tasks that were deemed not possible before, such as method naming [6, 7, 8], type inference [9, 10, 11], and bug finding [12], are now feasible. The lack of reliable tools that detect off-by-one mistakes leaves an excellent opportunity for researchers to experiment with machine learning approaches.

Inspired by the code2vec and code2seq models proposed by Alon et al. [8, 13], we trained several deep learning models on likely correct methods and their counterparts affected by off-by-one mistakes. The models are trained on over 1.6M examples, and the best results are obtained with the Code2Seq [13] model, achieving 85% precision and 84% recall on a balanced testing set. However, our results also show that the model, when tested on a real-world dataset consisting of 41 bugs in open-source systems, yields low performance (55% precision and 46% recall).

Finally, we tested the best models at one of our industrial partners. Adyen is one of the world's largest payment service providers, allowing customers from over 150 countries to use over 250 payment methods, including different internet bank transfers and point-of-sale solutions. The company works in a highly regulated banking industry and, combined with the high processing volumes, there is little to no room for errors. Hence, Adyen uses industry-standard best practices for early bug detection, such as code reviews, unit testing, and static analysis. It is in Adyen's best interest to look into novel tools to prevent software defects from finding their way into its large code base, preferring methods that scale and do not waste the most expensive resource of the company: the developers' time. Our results show that, while the model did not reveal any bugs per se, it pointed developers to code that they considered to deviate from good practices.

    private boolean skipTo(String toFind) throws IOException {
      outer:
      for (; pos + toFind.length() < limit || fillBuffer(toFind.length()); pos++) {
        for (int c = 0; c < toFind.length(); c++) {
          if (buffer[pos + c] != toFind.charAt(c)) {
            continue outer;
          }
          ...
        }
      }

Fig. 1: An off-by-one error in the Gson library, fixed in commit 161b4ba (https://github.com/google/gson/commit/161b4ba).

II. RELATED WORK

The use of static analysis tools is quite common among software development teams (e.g., [15, 16]). These tools, however, rely on bug pattern detectors that are manually crafted and fine-tuned by static analysis experts. The vast amount of different bug patterns makes it very difficult to cover more than a fraction of them.

Machine Learning for Software Engineering has seen rapid development in recent years, inspired by successful applications in the Natural Language Processing field [17]. It is applied to many tasks related to source code, such as code translation (e.g., [18]), type inference (e.g., [9, 10]), code refactoring (e.g., [19]) and, as we list below, bug identification.

Pradel et al. [12] use a technique similar to Word2Vec [20] to learn embeddings for JavaScript code tokens extracted from the AST. These embeddings are used to train two-layer feed-forward binary classification models to detect bugs. Each trained model focuses on a single bug type, and the authors test it on problems such as wrong binary operator, wrong operand in a binary operation, and swapped function arguments. These models do not use all the tokens from the code, but only those specific to the problem at hand. For example, the model that detects swapped function arguments only uses embeddings of the function name and arguments, with a few other AST nodes as features.

Allamanis et al. [21] use a Gated Graph Neural Network [22] to detect variable misuse bugs at the token level. As input to the model, the authors use an AST graph of the source code and augment it with additional edges from the control flow graph.

Pascarella et al. [23] show that defective commits are often composed of both defective and non-defective files. They also train a model to predict defective files in a given commit.

Habib et al. [24] create an embedding for methods using a one-hot encoding of tokens such as keywords (for, if, etc.), separators (;, (), etc.), identifiers (method and variable names), and literals (values such as "abc" and 10). The embeddings for the first 50 tokens are then used to create a binary classification model. The oracle for the training data is a state-of-the-art static analysis tool, and the results show that neural bug finding can be highly successful for some patterns, but fails at others.

Li et al. [25] use the method AST in combination with a global Program Dependency Graph and Data Flow Graph to determine whether the source code in a given method is buggy or not. The authors use Word2Vec to extract AST node embeddings, with a combination of a GRU attention layer and an attentional convolutional layer to build a representation of the method's body. Node2Vec [26] is used to create a distributed representation of the data flow graph of the file in which the inspected method resides.
The results are combined into a method vector which is used to make a softmax prediction.

Wang et al. [27] define bug prediction as a binary classification problem and train three different graph neural networks based on control flow graphs of Java code. They use a novel interval-based propagation mechanism to more efficiently generalize a Graph Neural Network (GNN). The resulting method embedding is fed into a feed-forward neural network to find null-pointer dereference, array index out of bounds, and class cast exceptions. For each bug type, a separate bug detector is trained.

III. APPROACH

In order to detect off-by-one errors in Java code, we aim to create a hypothesis function that calculates an output based on the inputs generated from an example. More specifically, we train and compare different binary classification machine learning models to classify Java source code methods into one of two possible output labels: "defective" and "non-defective". If a method is classified as "defective", it is suffering from an off-by-one error; otherwise, it is deemed to be free from such errors.

These models are based on the Code2Vec [8] and Code2Seq [13] models, state-of-the-art deep learning models originally developed for generating method names and descriptions. The models use Abstract Syntax Tree (AST) paths of a method as features and create an embedding by combining them with the help of an attention mechanism. In addition, we also build a Random Forest baseline model based on source code tokens.

We acquired the datasets necessary for the training of these models from the work of Alon et al. [8], which results in an imbalanced dataset of 920K examples (1-to-10 ratio) and a balanced dataset of 1.6M examples when combined with our automatically mutated methods.

We train on both imbalanced and balanced data to see the difference in performance. We then evaluate the accuracy of the model on 41 real-world open-source off-by-one errors. In addition, we further train the models with data from a company project to fine-tune the model and find bugs from that project specifically.

In Figure 2, we show the overall research method. In the following, we provide a more detailed description of the approach and research questions.

Fig. 2: The flow of the research, including data collection, mutation, training and testing. (The figure depicts splitting and mutating methods from the java-med and java-large datasets, tuning hyper-parameters on java-med, training on the mutated java-large dataset, transferring the trained model, further training on the closed-source company project, and finding bugs, with results for open-source software and for company data.)

TABLE I: The different datasets used in this paper.

  Dataset                 Train       Validation   Test
  java-large-balanced     1,593,610   30,634       48,516
  java-large-imbalanced   876,485     16,849       26,684
  Adyen
  open-source bugs        -           -            82

A. Datasets

We used the java-large dataset provided by Alon et al. [13] for model training. We used Adyen's production Java code to further train and test the model with project-specific data. Finally, we used an additional real-world-bugs dataset to evaluate the models on real-world bugs. A summary of the datasets can be seen in Table I.

1) The java-large-balanced dataset consists of 9,500 top-starred Java projects from GitHub created since January 2007. Out of those 9,500 projects, 9,000 were randomly selected for the training set, 250 for the validation set, and the remaining 300 were used for the testing set.
Originally, this dataset contained about 16M methods, of which 836,380 were candidates for off-by-one errors (i.e., methods with loops and if conditions containing the binary operators <, <=, > or >=). After mutating the methods, the final balanced dataset consisted of 1,672,760 methods, 836,380 of them assumed to be correct and 836,380 assumed to be buggy.

2) The additional imbalanced dataset java-large-imbalanced was constructed to emulate more realistic data, where the majority of the code is not defective. A 10-to-1 ratio between non-defective and defective methods was chosen since it resulted in high precision while having a reasonable recall. We empirically observed that, upon increasing the ratio of non-defective methods even further, the model did not return possibly defective methods when running on Adyen's codebase. That is, if the ratio was higher than 10-to-1, the recall of the model became too low for it to be usable.

3) Adyen's code is a repository containing the production Java code of the company. It consists of over 200,000 methods, of which 7,435 contain a mutation candidate to produce an off-by-one error. After mutating the methods, this resulted in a balanced dataset containing 14,870 data points.

4) 41 real-world bugs in boundary conditions were used for manual evaluation. We extracted the bugs from the 500 most starred GitHub Java projects. The analyzed projects were not part of the training and evaluation sets and thus were not seen by a model before testing. Using a Pydriller script [28], we extracted a list of candidate commits where authors made a change to a comparator (e.g., a ">" to ">="; a "<=" to "<", etc.). This process returned a list of 1,571 candidate commits, which were analyzed manually until 41 were confirmed to be off-by-one errors and added to the dataset. The manual analysis was stopped at that point due to it being a very labor-intensive process.

B. Generating positive and negative instances

In order to train a supervised binary classification model, we require defective examples. To get those, we modified existing, likely correct code to produce likely incorrect code. For each method, we found a list of possible mutation points and selected a random one. After this, we altered the selected binary expression using JavaParser (https://github.com/javaparser/javaparser/) in a way that generates an off-by-one error.

Because we change only one of the expressions, the equivalent mutant problem does not exist for the training examples, unless the original code was unreachable at the position of the mutation. (An equivalent mutant may still arise, for example, if we mutate "dead code"; however, we conjecture that this is a negligible problem and will not affect the results.) It is also important to note that the datasets are split at the project level for the java-large dataset and at a sub-module level for Adyen's code. This means that the positive and the negative examples both end up in the same training, validation, or test set. We did this to avoid evaluating model predictions on code that differs by only one binary operator from code that was used during training.
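As an illustration of this mutation step, the sketch below flips one randomly chosen comparison operator using JavaParser. It is a minimal reconstruction, not the implementation used in the study: the class and method names are ours, and, unlike the real pipeline, it operates on a whole compilation unit rather than on individual methods.

    import com.github.javaparser.StaticJavaParser;
    import com.github.javaparser.ast.CompilationUnit;
    import com.github.javaparser.ast.expr.BinaryExpr;
    import com.github.javaparser.ast.expr.BinaryExpr.Operator;

    import java.util.List;
    import java.util.Map;
    import java.util.Random;
    import java.util.stream.Collectors;

    // Minimal sketch of the off-by-one mutation; illustrative, not the paper's code.
    public class OffByOneMutator {

        // Each comparison operator mapped to its off-by-one counterpart.
        private static final Map<Operator, Operator> SWAP = Map.of(
                Operator.LESS, Operator.LESS_EQUALS,
                Operator.LESS_EQUALS, Operator.LESS,
                Operator.GREATER, Operator.GREATER_EQUALS,
                Operator.GREATER_EQUALS, Operator.GREATER);

        private static final Random RANDOM = new Random();

        /** Returns a mutated copy of the source, or null if there is no mutation point. */
        public static String mutate(String javaSource) {
            CompilationUnit cu = StaticJavaParser.parse(javaSource);

            // All binary expressions using <, <=, > or >= are candidate mutation points.
            List<BinaryExpr> candidates = cu.findAll(BinaryExpr.class).stream()
                    .filter(e -> SWAP.containsKey(e.getOperator()))
                    .collect(Collectors.toList());
            if (candidates.isEmpty()) {
                return null;
            }

            // Select one random mutation point and flip its operator.
            BinaryExpr target = candidates.get(RANDOM.nextInt(candidates.size()));
            target.setOperator(SWAP.get(target.getOperator()));
            return cu.toString();
        }
    }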
C. Model Architecture

The models we used in this work are based on the recent Code2Vec model [8] and its enhancement Code2Seq [13], plus a baseline model that makes use of a random forest. We describe the models in more detail in the next sub-sections.

1) Code2Vec: The Code2Vec model created by Alon et al. [8] is a neural network model used to create embeddings of Java methods. These embeddings were used in the original work to predict method names.

The architecture of this model requires Java methods to be split into path contexts based on the AST of the method. A path context is a random path between two nodes in the AST and consists of two terminal nodes x_s, x_t and the path p_j between those terminal nodes, which does not include the terminals. The embeddings for those terminal nodes and paths are learned during training and stored in two separate vocabularies. During training, these parts are concatenated into a single context vector c_i = [x_s; p_j; x_t] of length 2|x_s| + |p_j|, where the embeddings x_s and x_t have equal length.

The acquired context vectors c_i are passed through the same fully connected (dense) neural network layer (using the same weights). The network uses a hyperbolic tangent activation function and dropout in order to generate a combined context vector c̃_i. The size of the dense layer allows controlling the size of the resulting context vector.

The attention mechanism of the model works by using a global attention vector a ∈ R^h, which is initialized randomly and learned with the rest of the network. It is used to calculate an attention weight α_i for each individual combined context vector c̃_i.

Some methods do not have a large enough AST to generate the required number of context paths. For these, dummy (masked) context paths are fed to the model and receive an attention weight α_i of zero. This enables the model to use examples of the same shape.

During training, a tag vocabulary tags_vocab ∈ R^{|Y|×l} is created, where each tag (label) y_i ∈ Y corresponds to an embedding of size l. The tag embeddings are learned during training; in the task proposed by the authors, they represent method names.

A prediction for a new example is made by computing the normalized dot product between the code vector v and each of the tag embeddings tags_vocab_i, resulting in a probability for each tag y_i. The higher the probability, the more likely the tag belongs to the method.
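Written out, the attention aggregation just described computes, for n combined context vectors (our transcription of the formulation in Alon et al. [8]):

\[ \alpha_i = \frac{\exp(\tilde{c}_i^{\top} a)}{\sum_{j=1}^{n} \exp(\tilde{c}_j^{\top} a)}, \qquad v = \sum_{i=1}^{n} \alpha_i \,\tilde{c}_i, \qquad q(y_k) = \frac{\exp(v^{\top} \mathit{tags\_vocab}_k)}{\sum_{y' \in Y} \exp(v^{\top} \mathit{tags\_vocab}_{y'})} \]

where v is the code vector and q(y_k) is the predicted probability of tag y_k.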
2) Code2Seq: The Code2Seq model created by Alon et al. [13] is a sequence-to-sequence model used to create embeddings of Java methods from which method descriptions are learned. The original work generated sequences of natural language words to describe methods.

Similarly to the Code2Vec model, the model works by generating random paths from the AST with a specified maximum length. Each path consists of two terminal tokens x_s, x_t and the path p_j between those terminal nodes which, in Code2Seq, includes the terminal nodes p_s, p_t ∈ p_j, but not the tokens.

It is important to distinguish between terminal tokens and path nodes. The former are user-defined values, such as a number or a variable called stringBuilder, while the latter come from a limited set of AST constructs such as NameExpr, BlockStmt, and ReturnStmt. There are around 400 different node types predefined in the JavaParser implementation.

During training, the path nodes and the terminal tokens are encoded differently. Terminal tokens get partitioned into subtokens based on the camelCase notation, which is a standard coding convention in Java. For example, a terminal token stringBuilder will be partitioned into string and Builder. The subtokens are turned into embeddings with a learned matrix E^subtokens, and an encoding is created for the entire token by summing the values of its subtokens.

Paths of the AST are also split into nodes, and each of the nodes corresponds to a value in a learned embedding matrix E^nodes. These embeddings are fed into a bi-directional LSTM whose final states result in a forward-pass output h→ and a backward-pass output h←. These are concatenated to produce a path encoding.

As with the Code2Vec model, the encodings of the terminal tokens and the path are concatenated, and the resulting encoding is the input to a dense layer with tanh activation that creates a combined context vector c̃_i. Finally, to provide an initial state to the decoder, the representations of all n paths in a given method are averaged.

The decoder uses the initial state h_0 to generate an output sequence while attending over all the combined context vectors c̃_1, ..., c̃_n. The resulting output sequence represents a natural language description of the method. We adapted Code2Seq's sequence output to be a single binary token (0 or 1), indicating whether the method is defective.
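In equation form, the encoding of one path context as described above is (our paraphrase of Alon et al. [13]; W denotes the dense layer's weight matrix, a name we introduce here):

\[ \mathrm{enc}(w) = \sum_{s \in \mathrm{subtokens}(w)} E^{\mathrm{subtokens}}_{s}, \qquad \mathrm{enc}(p_j) = [\overrightarrow{h}\,;\,\overleftarrow{h}], \qquad \tilde{c}_i = \tanh\!\left(W \left[\mathrm{enc}(x_s)\,;\,\mathrm{enc}(p_j)\,;\,\mathrm{enc}(x_t)\right]\right) \]

with the decoder's initial state obtained by averaging: \( h_0 = \frac{1}{n} \sum_{i=1}^{n} \tilde{c}_i \).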
3) Baseline Model: We developed a baseline model to assess the performance of a simpler architecture. For this, we used a Random Forest model [29] and compared its performance on the same datasets.

First, we tokenized the Java methods using the leaf nodes of their respective ASTs. After this, all the tokens of a method were vectorized using the TF-IDF method. The vectorized tokens of one method comprised a training example for the Random Forest model. This model was then trained on all of the methods from the java-large training set.

D. Hyper-parameter optimization and model training

For hyper-parameter optimization, we used Bayesian optimization [30]. We selected model precision as the optimization target, since high precision is required to obtain a usable defect prediction model. We used Bayesian optimization over other methods like random search or grid search because it generates a surrogate function that searches the hyper-parameter space based on previous results and acts as intuition for parameter selection. This saves significant time, because the actual model does not need to run as often: wrong parameter ranges are discarded early in the process.

The hyper-parameters are optimized on java-med-balanced, another dataset made public by Alon et al. [8]. The dataset consists of 1,000 top-starred Java projects from GitHub. Out of those 1,000 projects, 800 were randomly selected for the training set, 100 for the validation set, and the remaining 100 were used for the testing set. Originally, this dataset contained about 4M methods, but 170,295 were candidates for off-by-one errors (i.e., methods with loops and if conditions containing the binary operators <, <=, > or >=). This resulted in a balanced dataset of 340,590 methods, 170,295 of them assumed to be correct and 170,295 assumed to be buggy.

We ran the optimization for four different scenarios: two runs on the balanced java-med dataset with the Code2Vec and Code2Seq models, respectively, and an additional two runs with the same models on the imbalanced datasets. We used a machine with an Intel(R) Xeon(R) CPU E5-2660 v3 processor running at 2.60GHz with a Tesla M60 graphics card.

Once the hyper-parameters were identified, we trained the Code2Vec and Code2Seq models (as well as the baseline) using the balanced and imbalanced versions of the java-large dataset, and performed further training with the source code of our industrial partner. We show the training times of the final models in Table II.

TABLE II: Model training times. B=Balanced dataset, I=Imbalanced dataset, E=number of epochs, T=time to train.

            Balanced        Imbalanced      Adyen
  Model     T         E     T         E     T      E
  Baseline  5h33m     1     1h59m     1     48s    1
  Code2Vec  1d2h2m    52    11h6m     52    1h1m   53
  Code2Seq  3d18h18m  14    2d14h41m  15    1h8m   17

E. Analysis

We report the precision and recall of our models. Precision helps to evaluate the models' proneness to classify negative examples as positive, also known as false positives. A model with high precision has a low false-positive rate, and a model with low precision has a high false-positive rate. More formally, precision is the number of true positive (TP) predictions divided by the sum of true positive and false positive (FP) predictions.

For a bug detection model, low precision means a high number of false positives, making developers spend their time checking a large number of errors reported by the model only to find very few predictions that are actually defective. Hence, in this work, we prefer high precision for a bug-detection model.

Monitoring precision alone is not enough, since a model that is precise but predicts only a few bugs per thousands of bugs is also not useful. Hence, recall is also measured. It measures the models' ability to find all the defective examples in the dataset. The recall of a model is low when it does not find many of the positive examples in the dataset, and very high if it manages to find all of them. More formally, it is the number of true positive predictions divided by the sum of true positive and false negative (FN) predictions.

Ideally, a bug prediction model would find all of the bugs in the dataset and have a high recall score. However, deep learning networks usually do not achieve perfect precision and recall at the same time. For more difficult problems, a probabilistic model offers a trade-off: when increasing the threshold of the model's confidence for the positive class, the recall will decline. For this reason, the scikit-learn package was used to also produce a precision-recall curve, to observe how precision and recall change as the confidence needed to classify an example as positive (defective) varies.
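In formulas, the metrics discussed above (including the F1 score later reported in Table IV) are:

\[ \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \]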
F. Reproducibility

We provide all the source code (data collection, pre-processing, and machine learning models) in our online appendix [31]. The source code is also available on GitHub (https://github.com/hsellik/thesis/tree/MSR-2021).

IV. METHODOLOGY

The goal of this study is to measure the effectiveness of deep learning models in identifying off-by-one mistakes. To that aim, we propose three research questions:

• RQ1: How do the models perform on a controlled dataset? In order to obtain a vast quantity of data, we use a controlled dataset (see Section III-A). We train the models on the dataset and use metrics such as precision and recall to assess the performance.

• RQ2: How well do the methods generalize to a dataset made of real-world bugs? We mine a dataset of real-world off-by-one error bugs from GitHub issues of various open-source projects. Then we use a model to predict the error-proneness of a method before and after a fix. This indicates how well the model works on real-world data and enables us to extract the precision metric and compare it to the one from RQ1.

• RQ3: Can the approach be used to find bugs in a large-scale industry project? One useful application of an error-detection model is to analyze an existing project and point out methods containing off-by-one errors. We make several runs where the model is first trained on a dataset with mutated code and then tested on real code to find such errors. In addition, we further train the model with a different version of the industry project to find errors in future versions of the project.

To answer RQ1, we performed hyper-parameter optimization. After this, we selected the best hyper-parameter values and trained the model with randomly initialized parameters on the java-large dataset, on the same machine as used for hyper-parameter optimization (see Section III-D). We trained the Code2Seq and Code2Vec models until there was no gain in precision on the evaluation set for three epochs of training. After this, we assessed the model on the testing set of the java-large dataset.

The process was conducted for three different configurations of data:
1) BB - the training data was balanced (B), with the cross-validation and testing data also being balanced (B).
2) BI - the training data was balanced (B), with the cross-validation and testing data being imbalanced (I).
3) II - the training data was imbalanced (I), with the cross-validation and testing data also being imbalanced (I).

The data imbalance was inspired by the work of Habib et al. [24], who reported that a bug detection model trained on a balanced dataset would have poor performance when tested in a more real-life scenario with imbalanced classes.

To answer RQ2, we selected the model that performed best on the controlled java-large testing set (see Table III), which was the model based on the Code2Seq architecture. After this, the model was tested on the bugs and their fixes found in several real-world Java projects (the open-source bugs dataset in Table I).

First, we tested the model on the correct code, obtained from the GitHub diff after the change, to see the classification performance on non-defective code. To test the model's performance on defective code, we reverted the example to the state where the bug was present using the git version control system. After this, we recorded the model's prediction on the defective method.

In addition, as a way to compare our work with static analysis, we applied three popular static analyzers to the same set of defective and non-defective snippets: SpotBugs (v.4.0.0-beta1), PVS-Studio (v.7.04.34029), and the static analyzer integrated with IntelliJ IDEA (v.2019.2.3).

To answer RQ3, we trained the Code2Seq model only on the data generated from the company project, but the training did not start from randomly initialized weights. Instead, the process started from the weights acquired after training on the java-large dataset (see Figure 2).

We selected the Code2Seq-based model because it had the best performance on the imbalanced testing set of the controlled java-large set. We used the performance on the imbalanced controlled set as the criterion, since we assumed that the company project also contains more non-defective examples than defective ones.

We used the pre-trained model because the company project alone did not contain enough data for the training process. Additionally, due to the architecture of the Code2Seq and Code2Vec models, the embeddings of the terminal and AST node vocabularies did not receive additional updates during further training with company data.
We trained the model until there was no gain in precision for three epochs on the validation set, and after this, we tested the model on the test set consisting of controlled Adyen data.

We conducted an additional check on Adyen data by trying to find bugs in the most recent version of the project. More specifically, we updated the project to its most recent version using the git version control system and, without any modifications to the original code, used the model to predict whether each Java method in the code base had an off-by-one mistake. We analyzed all bug predictions that were over a threshold of 0.8 to see if they contained bugs. The 0.8 threshold was defined after manual experimentation. We aimed at a set of methods that was large enough to bring us interesting conclusions, yet small enough to enable us to manually verify each of them.

A. Threats to Validity

In this section, we discuss the threats to the validity of this study and the actions we took to mitigate them.

1) Internal validity: Our method performs mutations to generate faulty examples from likely correct code by editing one of the binary conditions within the method. This means that while the correct examples represent a diverse set of methods from open-source projects, the likely incorrect methods may not represent a realistic distribution of real-world bugs. This affects the model that is being trained with those examples and also the testing results conducted on this data.

2) External validity: While the work included a diverse set of open-source projects, the only closed-source project used during this study was Adyen's. Hence, closed-source projects (in training and in validation) are under-represented in this study.

Moreover, we have only experimented with Java code snippets. While the approach seems generic enough to be applicable to any programming language, the results might vary given the particular way that developers write code in different communities. Therefore, more experimentation needs to be conducted before we can argue that the results generalize to any programming language.

V. RESULTS

In the following sections, we present the results of our research questions.

A. RQ1: How do the models perform on a controlled dataset?

In Table III, we show the precision and recall of the different models. In Figures 3a and 3b, we show the ROC curve and the precision-recall curve of the experiment with the Code2Seq-based model on the imbalanced java-large dataset.

Fig. 3: ROC and precision-recall curves for Experiment II with the java-large dataset. (a) ROC curve, area under the curve 0.89. (b) Precision-recall curve, area under the curve 0.65.

TABLE III: Model results in controlled testing sets. BB stands for balanced training and testing set, II stands for imbalanced training and testing set. Pr.=precision, Re.=recall.

                 Exp. BB          Exp. BI          Experiment II
                 java-large       java-large       java-large       Adyen (cross-project)   Adyen (further trained)
  Model          Pr.%    Re.%     Pr.%    Re.%     Pr.%    Re.%     Pr.%    Re.%            Pr.%    Re.%
  Code2Seq       85.23   84.82    36.08   84.86    83.04   42.34    71.15   24.66           66.66   30.66
  Code2Vec       80.11   77.01    28.52   75.53    64.65   41.00    53.85   20.46           43.95   23.39
  Baseline       50.00   49.08    8.99    49.18    17.86   0.15     0.00    0.00            9.25    0.92
  Offside [14]   80.9    75.6     -       -        -       -        -       -               -       -

Observation 1: Models present high precision and recall when trained and tested with balanced data. The results show that when training models on a balanced dataset with an equal amount of defective/non-defective code and then testing the same model on a balanced testing set, both the Code2Vec and Code2Seq models achieve great precision and recall, where the Code2Seq-based model has better precision (85.23% vs. 80.11%) and recall (84.82% vs. 77.01%).

Observation 2: The metrics drop considerably when tested on an imbalanced dataset.
When simulating a more real-life scenario by creating an imbalance in the testing set with more non-defective methods, the recall of the models remained similar, increasing from 84.82% to 84.86% for the Code2Seq model and dropping from 77.01% to 75.53% for the Code2Vec model. However, the precision of the models dropped drastically, with the Code2Seq model falling from 85.23% to 36.08% and the Code2Vec model from 80.11% to 28.52%. The baseline model also drops in precision, from 50% to 8.99%, while keeping the same recall.

Observation 3: The low precision can be mitigated by training on an imbalanced dataset, but at the cost of recall. We trained the Code2Seq and Code2Vec models on an imbalanced dataset, and the results show that the precision score on imbalanced data returned almost to the same level for the Code2Seq-based model (83.04% vs 85.23%), but remained lower for the Code2Vec-based model (64.65% vs 80.11%). However, the recall declined drastically, from 84.82% to 42.34% for the Code2Seq model and from 77.01% to 41.00% for the Code2Vec model. The ROC and precision-recall curves (Figures 3a and 3b) show how precision and recall can be traded off against each other by changing the classification threshold.
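This behaviour follows directly from the definition of precision. Writing π for the fraction of defective methods in the test set, r for the recall (true-positive rate), and f for the false-positive rate, precision can be rewritten as:

\[ \mathrm{Precision} = \frac{r\,\pi}{r\,\pi + f\,(1 - \pi)} \]

At a 10-to-1 imbalance (π = 1/11 ≈ 0.09), even a modest false-positive rate dominates the denominator. For example, taking the Code2Seq BB operating point (r ≈ 0.85 and, as implied by its 85.23% precision on balanced data, f ≈ 0.15), the formula predicts a precision of roughly 37% at π = 1/11, close to the 36.08% observed in Experiment BI.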
RQ1 summary: Both the Code2Seq- and Code2Vec-based models present high accuracy on a balanced dataset. The numbers drop when we make use of imbalanced (i.e., more similar to the real world) datasets.

B. RQ2: How well do the methods generalize to a dataset made of real-world bugs?

The performance of the model on the 41 real-world boundary mistakes and their non-defective counterparts is presented in Table IV.

TABLE IV: Results of applying the Code2Seq model to 41 real-world off-by-one bugs and their corrected versions. B=balanced training set, I=imbalanced training set, Pr=Precision, Re=Recall, Thr=Threshold.

  Model         Thr   TP   TN   FP   FN   Pr     Re     F1
  Code2Seq (B)  0.5   19   26   15   22   55.88  46.34  50.67
  Code2Seq (B)  0.8   10   33   8    31   55.56  24.39  33.90
  Code2Seq (I)  0.5   0    41   0    41   0      0      0

Observation 4: The model can detect real-world bugs, but with a high false-positive rate. Out of the 41 defective methods, 19 (46.34%) were classified correctly, and out of the 41 correct methods, 26 (63.41%) were classified correctly. Precision and recall scores of 55.88% and 46.34% were achieved when evaluating the Code2Seq model trained on balanced data on the real-world bugs, using a threshold of 0.5. Compared to the results from the java-large testing set with mutated methods, the results are significantly lower, with precision and recall being 29.35 and 38.08 points lower, respectively (see the metrics for the Code2Seq model in Experiment BB in Table III).

Observation 5: The state-of-the-practice linter tools did not find any of the real-world bugs. As an interesting remark, none of the bugs was identified by any of the state-of-the-practice linting tools we experimented with. This reinforces the need for approaches that identify such bugs (by means of static analysis or deep learning).

RQ2 summary: The model presents only reasonable performance on real-world off-by-one mistakes in open-source projects. Static analysis tools did not detect any of the bugs.

C. RQ3: Can the approach be used to find bugs in a large-scale industry project?

We present the accuracy of the model at our industrial partner, Adyen, also in Table III.

Observation 6: Models trained on open-source data show satisfactory results on the industry dataset. Our empirical findings show that when a model is trained on an open-source dataset and then applied to the company project (following the same pipeline of mutating methods to generate positive and negative instances), it achieves good precision and recall scores, with 71.15% and 24.66% for the Code2Seq model and somewhat lower 53.85% and 20.46% for the Code2Vec model, respectively.

Observation 7: Further training on the Adyen project did not yield better results. We hypothesized that training the model further on Adyen's code base would give a boost in precision and recall scores. The recall of the models improved, by 6.0 percentage points for the Code2Seq-based model and 2.93 for the Code2Vec-based model. However, the precision of both models dropped, by 4.49 percentage points for Code2Seq and 9.9 for Code2Vec.

Observation 8: The model did not reveal any bugs, but 20% of the reported methods were considered suspicious by the developers. Running the model on a newer version of the repository reported 36 potential bugs with a confidence above the 0.8 threshold (which we chose after experimenting with different thresholds and analyzing whether the number of suspicious methods the model returned was feasible to investigate manually). While no bugs were found after manually analyzing all the reported snippets, we marked seven methods as suspicious. When we showed these methods to the developers, they agreed that, while not containing a bug per se, the seven methods deviate from good coding standards and should be refactored. More specifically, four methods had a for loop initialized at a wrong index (i.e., the loop was initialized with i = 1, but inside the body the code performed several i - 1 operations), and three snippets had hard-coded, unusual constraints in the binary expression (e.g., a > 256, where 256 is a specific business constraint). Interestingly, Pradel and Sen [12] also observed that models can sometimes point to pieces of code that are not buggy, but highly deviate from coding standards.
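The two recurring patterns looked roughly like the following constructed examples (ours, for illustration; not Adyen's code):

    import java.util.List;

    // Constructed illustrations of the two suspicious (but not buggy) patterns.
    class SuspiciousPatterns {

        // Pattern 1: the loop starts at 1 and keeps indexing i - 1.
        // Functionally correct, but starting at 0 would be clearer and less error-prone.
        static int sum(List<Integer> items) {
            int total = 0;
            for (int i = 1; i <= items.size(); i++) {
                total += items.get(i - 1);
            }
            return total;
        }

        // Pattern 2: a hard-coded business constraint inside a binary expression;
        // 256 stands in for a domain-specific limit that deserves a named constant.
        static boolean exceedsLimit(String payload) {
            return payload.length() > 256;
        }
    }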
Observation 9: The model can potentially be useful at commit time; however, the number of false alarms is to be considered. Fixing mistakes regarding good code practices in old pieces of software might not be considered worthwhile at large companies, given the possible unwanted changes to the behavior of the software. However, if such a system were to be employed during automated testing, the alerts might help developers adhere to better practices. We observed the model pointing to relevant problems in 7 out of the 36 potential bugs (20% of the methods it identified). While 20% might be considered a low number, one might argue that inspecting 36 methods out of a code base that contains thousands of methods is not a costly operation and might be worth the effort. However, we still do not know the number of false negatives the tool might produce, as inspecting all the methods of the code base is unfeasible.

RQ3 summary: When tested on a large-scale industrial software system, the approach did not reveal any bugs per se, but pointed to code considered to deviate from good practices.

VI. FUTURE WORK

We see much room for improvement before these models can reliably identify off-by-one errors. In the following, we list the improvements we believe to be most urgent:

The need for more data points for the off-by-one problem. In this paper, we leveraged the existing java-large dataset created by Alon et al. [13]. While the entire dataset was built on top of 9,500 GitHub projects and contained approximately 16M methods, only around 836k had binary conditions (e.g., methods with loops and ifs containing a <, <=, > or >=). We augmented this dataset to 1.6M examples by introducing the defective samples. Nevertheless, there is a big difference between 16M and 1.6M methods for training. As Alon et al. [8] argue, "a simpler model with more data might perform better than a complex model with little data". It should be part of any future work to devise a much larger dataset for the off-by-one problem and to try the models we experiment with here before proposing more complex models.

Moreover, our dataset contains fewer usages of >= or <= compared to usages of > or <, clearly reflecting the preferences of developers when coding such boundaries. These differences can lead to biased training and, as a result, we observed models tending to give false positive results in the case of >= or <=. One way to mitigate the issue is to create a balanced dataset with a more equal distribution of binary operators, as well as of the places where they occur (if-conditions, for- and while-loops, ternary expressions, etc.).

The challenges of imbalanced data. In this study, we explored the effects of balance and imbalance on the effectiveness of the model. However, the real imbalance of the problem in real life (i.e., the proportion between methods with off-by-one mistakes and methods without them) is unknown, although we strongly believe it to be imbalanced. Nevertheless, a 10:1 proportion enables us to gain an initial understanding of how models handle such high imbalance. Our results show that it indeed negatively affects the performance of the model. Therefore, we suggest researchers focus their attention on how to make these models better in the face of imbalanced datasets.

The support for inter-procedural analysis. Currently, our approach only supports the analysis of the AST of one method. However, the behaviour of a method, and the possibility of bugs therein, also depends on the contents of other methods. For example, in recent research by Compton et al. [32], the embedding vectors from the Code2Vec model are concatenated to form an embedding for the entire class. Future work should explore whether class embeddings would perform better.

Experimenting with different (and more recent) architectures. In our work, we mainly looked at the Code2Vec and Code2Seq models. We now see more recent models, such as the GREAT model proposed by Hellendoorn et al. [33], which uses transformers and also captures the data-flow of the code. We believe that data-flow information would enhance the performance of our models.

Making use of Byte-Pair Encoding (BPE) techniques. NLP models are often dependent on the vocabulary they are trained on. The out-of-vocabulary (OoV) problem also occurs in this work. When testing the models trained on open-source data at Adyen, we had to replace unknown tokens with a generic UNK token. We conjecture that this may diminish the effectiveness of the models. We unfortunately did not measure how many times the UNK token was used in our experiments; we plan to measure this more precisely in future replications of this work. In future work, we also plan to make use of techniques such as Byte-Pair Encoding (BPE) [34, 35], which attempts to mitigate the impact of out-of-vocabulary tokens. We note that the use of BPE is becoming more and more common in software engineering models (e.g., [10, 33, 36]).
A deeper understanding of the differences between our model and Pradel and Sen's [12] model. The DeepBugs paper explores the effectiveness of deep learning models on a similar problem, which the authors call "wrong binary operator". The overall idea of their approach is similar to ours (in other words, their work also served as inspiration for this one): the negative instances (i.e., the buggy code) are generated through mutations of the positive code (i.e., non-buggy code), the code representation is a vector based on the embeddings of all the identifiers in the code, and the classification task is a feed-forward neural network that learns from a balanced set of positive and negative instances. Their results show an accuracy of 89%-92% on the controlled dataset (i.e., slightly higher than our results in RQ1), and a precision of 68% in the manual analysis (i.e., higher than our results in RQ2). Interestingly, the authors also observe that the model reports non-buggy code which deviates from best practices (i.e., similar to our observations in RQ3). When designing this study, we did not explicitly compare the results to DeepBugs. The embeddings derived from code2vec/code2seq capture more information, and we conjectured that they would naturally supersede DeepBugs.

We nevertheless see a few differences between the two works. First, in their "wrong binary operator" task, the mutation replaces the (correct) binary operator with any binary operator; e.g., a correct i < length can become i % length. In our case, we limit ourselves only to off-by-one mistakes, i.e., a correct i < length will always become i <= length. We conjecture that this may increase the difficulty for the model to learn, as the bugs are now slightly more subtle. Second, while the manual analysis conducted in the DeepBugs paper is performed on the testing set (which contains artificial bugs), our RQ2 explores the performance of the model on real-world bugs, i.e., bugs that were found and fixed by developers. This extra dose of reality may be the reason for the lower performance. Finally, we assumed that more robust models such as code2vec and code2seq would better capture the intricacies of the off-by-one mistake. The model used in DeepBugs is simpler and yet as accurate as ours. More work is needed to understand the pros and cons of our model and how both works can be combined to develop better and more accurate models.

VII. CONCLUSIONS

Software development practices offer many techniques for detecting bugs at an early stage. However, these methods come with their own challenges and are either too labor-intensive or leave a lot of room for improvement. In this paper, we adapted recent state-of-the-art deep learning models to detect off-by-one errors in Java code, which are traditionally hard for static analysis tools due to their high dependency on context.

We concluded that the trained models, while effective on controlled datasets, still do not work well in real-world situations. We see the use of deep learning models to identify off-by-one errors as promising. Nevertheless, there is still much room for improvement, and we hope that this paper helps researchers in paving the road for future studies in this direction.

ACKNOWLEDGMENTS

We thank Jón Arnar Briem, Jordi Smit, and Pavel Rapoport for their participation in the workshop version of this paper.

REFERENCES

[1] B. Jeng and E. J. Weyuker, "A simplified domain-testing strategy," ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 3, no. 3, pp. 254–270, 1994.

[2] S. C.
Reid, "An empirical analysis of equivalence partitioning, boundary value analysis and random testing," in Proceedings Fourth International Software Metrics Symposium. IEEE, 1997, pp. 64–73.

[3] D. Hoffman, P. Strooper, and L. White, "Boundary values and automated component testing," Software Testing, Verification and Reliability, vol. 9, no. 1, pp. 3–26, 1999.

[4] B. Legeard, F. Peureux, and M. Utting, "Automated boundary testing from Z and B," in International Symposium of Formal Methods Europe. Springer, 2002, pp. 21–40.

[5] P. Samuel and R. Mall, "Boundary value testing based on UML models." IEEE, 2005, pp. 94–99.

[6] M. Allamanis, E. T. Barr, C. Bird, and C. Sutton, "Learning natural coding conventions," in Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, 2014, pp. 281–293.

[7] ——, "Suggesting accurate method and class names," in Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, 2015, pp. 38–49.

[8] U. Alon, M. Zilberstein, O. Levy, and E. Yahav, "code2vec: Learning distributed representations of code," Proceedings of the ACM on Programming Languages, vol. 3, no. POPL, pp. 1–29, 2019.

[9] V. J. Hellendoorn, C. Bird, E. T. Barr, and M. Allamanis, "Deep learning type inference," in Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2018, pp. 152–162.

[10] M. Allamanis, E. T. Barr, S. Ducousso, and Z. Gao, "Typilus: Neural type hints," in PLDI, 2020.

[11] M. Pradel, G. Gousios, J. Liu, and S. Chandra, "Typewriter: Neural type prediction with search-based validation," arXiv preprint arXiv:1912.03768, 2019.

[12] M. Pradel and K. Sen, "DeepBugs: A learning approach to name-based bug detection," Proceedings of the ACM on Programming Languages, vol. 2, no. OOPSLA, pp. 1–25, 2018.

[13] U. Alon, S. Brody, O. Levy, and E. Yahav, "code2seq: Generating sequences from structured representations of code," in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=H1gKYo09tX

[14] J. A. Briem, J. Smit, H. Sellik, P. Rapoport, G. Gousios, and M. Aniche, "OffSide: Learning to identify mistakes in boundary conditions," in Proceedings of the IEEE/ACM 42nd International Conference on Software Engineering Workshops, 2020, pp. 203–208.

[15] K. F. Tómasdóttir, M. Aniche, and A. van Deursen, "The adoption of javascript linters in practice: A case study on eslint," IEEE Transactions on Software Engineering, 2018.

[16] K. F. Tómasdóttir, M. Aniche, and A. van Deursen, "Why and how javascript developers use linters." IEEE, 2017, pp. 578–589.

[17] A. Hindle, E. T. Barr, Z. Su, M. Gabel, and P. Devanbu, "On the naturalness of software," in 2012 34th International Conference on Software Engineering (ICSE). IEEE, 2012, pp. 837–847.

[18] X. Chen, C. Liu, and D. Song, "Tree-to-tree neural networks for program translation," in Advances in Neural Information Processing Systems, 2018, pp. 2547–2557.

[19] M. Aniche, E. Maziero, R. Durelli, and V. Durelli, "The effectiveness of supervised machine learning algorithms in predicting software refactoring," Transactions on Software Engineering (TSE), 2020.

[20] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.

[21] M. Allamanis, M. Brockschmidt, and M.
Khademi, "Learning to represent programs with graphs," arXiv preprint arXiv:1711.00740, 2017.

[22] Y. Li, R. Zemel, M. Brockschmidt, and D. Tarlow, "Gated graph sequence neural networks," in Proceedings of ICLR'16, April 2016.

[23] L. Pascarella, F. Palomba, and A. Bacchelli, "Fine-grained just-in-time defect prediction," Journal of Systems and Software, vol. 150, pp. 22–36, 2019.

[24] A. Habib and M. Pradel, "Neural bug finding: A study of opportunities and challenges," arXiv preprint arXiv:1906.00307, 2019.

[25] Y. Li, S. Wang, T. N. Nguyen, and S. Van Nguyen, "Improving bug detection via context-based code representation learning and attention-based neural networks," Proceedings of the ACM on Programming Languages, vol. 3, no. OOPSLA, pp. 1–30, 2019.

[26] A. Grover and J. Leskovec, "node2vec: Scalable feature learning for networks," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016, pp. 855–864.

[27] Y. Wang, F. Gao, L. Wang, and K. Wang, "Learning a static bug finder from data," arXiv preprint arXiv:1907.05579, 2019.

[28] D. Spadini, M. Aniche, and A. Bacchelli, "PyDriller: Python framework for mining software repositories," in Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2018, pp. 908–911.

[29] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.

[30] J. Snoek, H. Larochelle, and R. P. Adams, "Practical bayesian optimization of machine learning algorithms," in Advances in Neural Information Processing Systems, 2012, pp. 2951–2959.

[31] H. Sellik, O. van Paridon, G. Gousios, and M. Aniche, "Learning off-by-one mistakes: An empirical study (appendix)," https://doi.org/10.5281/zenodo.4560410, 2021.

[32] R. Compton, E. Frank, P. Patros, and A. Koay, "Embedding java classes with code2vec: improvements from variable obfuscation," in MSR 2020. ACM, 2020.

[33] V. J. Hellendoorn, C. Sutton, R. Singh, P. Maniatis, and D. Bieber, "Global relational models of source code," in International Conference on Learning Representations, 2019.

[34] P. Gage, "A new algorithm for data compression," C Users Journal, vol. 12, no. 2, pp. 23–38, 1994.

[35] R. Sennrich, B. Haddow, and A. Birch, "Neural machine translation of rare words with subword units," arXiv preprint arXiv:1508.07909, 2015.

[36] R.-M. Karampatsis, H. Babii, R. Robbes, C. Sutton, and A. Janes, "Big code != big vocabulary: Open-vocabulary models for source code," arXiv preprint arXiv:2003.07914, 2020.