Aroma: Code Recommendation via Structural Code Search
Sifei Luan, Di Yang, Celeste Barnaby, Koushik Sen, Satish Chandra
SIFEI LUAN, Facebook, USA
DI YANG∗, University of California, Irvine, USA
CELESTE BARNABY, Facebook, USA
KOUSHIK SEN†, University of California, Berkeley, USA
SATISH CHANDRA,
Facebook, USA

Programmers often write code that has similarity to existing code written somewhere. A tool that could help programmers search for such similar code would be immensely useful. Such a tool could help programmers to extend partially written code snippets to completely implement necessary functionality, help to discover extensions to the partial code which are commonly included by other programmers, help to cross-check against similar code written by other programmers, or help to add extra code which would fix common mistakes and errors. We propose Aroma, a tool and technique for code recommendation via structural code search. Aroma indexes a huge code corpus including thousands of open-source projects, takes a partial code snippet as input, searches the corpus for method bodies containing the partial code snippet, and clusters and intersects the results of the search to recommend a small set of succinct code snippets which both contain the query snippet and appear as part of several methods in the corpus. We evaluated Aroma on 2000 randomly selected queries created from the corpus, as well as 64 queries derived from code snippets obtained from Stack Overflow, a popular website for discussing code. We implemented Aroma for 4 different languages, and developed an IDE plugin for Aroma. Furthermore, we conducted a study where we asked 12 programmers to complete programming tasks using Aroma, and collected their feedback. Our results indicate that Aroma is capable of retrieving and recommending relevant code snippets efficiently.

CCS Concepts: • Information systems → Near-duplicate and plagiarism detection; • Software and its engineering → Development frameworks and environments; Software post-development issues.

Additional Key Words and Phrases: code recommendation, structural code search, clone detection, feature-based code representation, clustering
ACM Reference Format:
Sifei Luan, Di Yang, Celeste Barnaby, Koushik Sen, and Satish Chandra. 2019. Aroma: Code Recommendation via Structural Code Search. Proc. ACM Program. Lang. 3, OOPSLA, Article 152 (October 2019), 28 pages. https://doi.org/10.1145/3360578
Suppose an Android programmer wants to write code to decode a bitmap. The programmer is familiar with the libraries necessary to write the code, but they are not quite sure how to write the code completely with proper error handling and suitable configurations. They write the code snippet shown in Listing 1 as a first attempt. The programmer now wants to know how others have implemented this functionality fully and correctly in related projects. Specifically, they want to know the customary way to extend the code so that proper setup is done, common errors are handled, and appropriate library methods are called. It would be nice if a tool could return a few code snippets, shown in Listings 2 and 3, which demonstrate how to configure the decoder to use less memory, and how to handle potential runtime exceptions, respectively. We call this the code recommendation problem.

∗ This work was done in part while this author was an intern at Facebook.
† This work was done in part while this author was a visiting scientist at Facebook.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
© 2019 Copyright held by the owner/author(s). 2475-1421/2019/10-ART152 https://doi.org/10.1145/3360578
Proc. ACM Program. Lang., Vol. 3, No. OOPSLA, Article 152. Publication date: October 2019.

    InputStream input = manager.open(fileName);
    Bitmap image = BitmapFactory.decodeStream(input);
Listing 1. Suppose an Android programmer writes this code to decode a bitmap.

    final BitmapFactory.Options options = new BitmapFactory.Options();
    options.inSampleSize = 2;
    Bitmap bmp = BitmapFactory.decodeStream(is, null, options);
Listing 2. A recommended code snippet that shows how to configure the decoder to use less memory. Recommended lines are highlighted.

    try {
        InputStream is = am.open(fileName);
        image = BitmapFactory.decodeStream(is);
        is.close();
    } catch (IOException e) {
        // ...
    }

Listing 3. Another recommended code snippet that shows how to properly close the input stream and handle any potential IOException. Recommended lines are highlighted.

There are a few existing techniques which could potentially be used to get code recommendations. For example, code-to-code search tools [Kim et al. 2018; Krugler 2013] could retrieve relevant code snippets from a corpus using a partial code snippet as a query. However, such code-to-code search tools return many relevant code snippets without removing or aggregating similar-looking ones. Moreover, such tools do not make any effort to carve out common and concise code snippets from similar-looking retrieved code snippets. Pattern-based code completion tools [Mover et al. 2018; Nguyen et al. 2012, 2009] mine common API usage patterns from a large corpus and use those patterns to recommend code completions for partially written programs, as long as the partial program matches a prefix of a mined pattern. Such tools work well for the mined patterns; however, they cannot recommend any code outside the mined patterns—the number of mined patterns is usually limited to a few hundred. We emphasize that the meaning of the phrase “code recommendation” in Aroma is different from the term “API code recommendation” [Nguyen et al. 2016a,b]. The latter is a recommendation engine for the next API method to invoke given a code change, whereas Aroma aims to recommend code snippets, as shown in Listings 2 and 3, for programmers to learn common usages and integrate those usages with their own code. Aroma’s recommendations contain more syntactic variety than just API usages; for instance, the recommended code snippet in Listing 3 includes a try-catch block, and Example B in Table 1 recommends adding an if statement that modifies a variable. Code clone detectors [Cordy and Roy 2011; Jiang et al. 2007; Kamiya et al. 2002; Sajnani et al. 2016] are another set of techniques that could potentially be used to retrieve recommended code snippets. However, code clone detection tools usually retrieve code snippets that are almost identical to a query snippet.
Such retrieved code snippets may not always contain extra code which could be used to extend the query snippet.

Adapted from https://github.com/zom/Zom-Android/blob/master/app/src/main/java/org/awesomeapp/messenger/ui/stickers/StickerGridAdapter.java
Adapted from https://github.com/yuyuyu123/ZCommon/blob/master/zcommon/src/main/java/com/cc/android/zcommon/utils/android/AssetUtils.java
We propose Aroma, a code recommendation engine. Given a code snippet as an input query and a large corpus of code containing millions of methods, Aroma returns a set of recommended code snippets such that each recommended code snippet:
• contains the query snippet approximately, and
• is contained approximately in a non-empty set of method bodies in the corpus.
Furthermore, Aroma ensures that any two recommended code snippets are not quite similar to each other.

Aroma works by first indexing the given corpus of code. Then Aroma searches for a small set (e.g. 1000) of method bodies which contain the query code snippet approximately. A challenge in designing this search step is that a query snippet, unlike a natural language query, has structure, which should be taken into account while searching for code. Once Aroma has retrieved a small set of code snippets which approximately contain the query snippet, Aroma prunes the retrieved snippets so that the resulting pruned snippets become similar to the query snippet. It then ranks the retrieved code snippets based on the similarity of the pruned snippets to the query snippet. This step helps to rank the retrieved snippets based on how well they contain the query snippet. The step is precise, but relatively expensive; however, it is only performed on a small set of code snippets, making it efficient in practice. After ranking the retrieved code snippets, Aroma clusters the snippets so that similar snippets fall under the same cluster. Aroma then intersects the snippets in each cluster to carve out a maximal code snippet which is common to all the snippets in the cluster and which contains the query snippet. The set of intersected code snippets is then returned as recommended code snippets. Figure 3 shows an outline of the algorithm. For the query shown in Listing 1, Aroma recommends the code snippets shown in Listings 2 and 3.
The right column of Table 1 shows more examples of code snippets recommended by Aroma for the code queries shown in the left column of the table.

To the best of our knowledge, Aroma is the first tool which can recommend relevant code snippets given a query code snippet. The advantages of Aroma are the following:
• A code snippet recommended by Aroma does not simply come from a single method body, but is generated from several similar-looking code snippets via intersection. This increases the likelihood that Aroma’s recommendation is idiomatic rather than one-off.
• Aroma does not require mining common coding patterns or idioms ahead of time. Therefore, Aroma is not limited to a set of mined patterns—it can retrieve new and interesting code snippets on the fly.
• Aroma is fast enough to use in real time. A key innovation in Aroma is that it first retrieves a small set of snippets based on approximate search, and then performs the heavy-duty pruning and clustering operations on this set. This enables Aroma to create recommended code snippets for a given query from a large corpus containing millions of methods within a couple of seconds on a multi-core server machine.
• Aroma is easy to deploy for different programming languages because its core algorithm works on generic parse trees. We have implemented Aroma for Hack, Java, JavaScript and Python.
• Although we developed Aroma for the purpose of code recommendation, it could also be used to perform efficient and precise code-to-code structural search.

We have implemented Aroma in C++ for four programming languages: Hack [Verlaguet and Menghrajani 2014], Java, JavaScript and Python. We have also implemented IDE plugins for all four of these languages. We report our experimental evaluation of Aroma for the Java programming language. We have used Aroma to index 5,417 GitHub Java Android projects. We performed our experiments for Android Java because we initially developed Aroma for Android based on internal
Table 1. Aroma code recommendation examples. Each example shows the query code snippet, explanatory notes, and the Aroma code recommendation; extra recommended lines were highlighted in the original table.

Example A: Configuring Objects.
Query code snippet:
    TextView textView = (TextView) view.findViewById(R.id.textview);
    SpannableString content = new SpannableString("Content");
    content.setSpan(new UnderlineSpan(), 0, content.length(), 0);
    textView.setText(content);
• This code snippet adds underline to a piece of text.
• The recommended code suggests adding a callback handler to pop up a dialog once the underlined text is touched upon.
• Intersected from a cluster of 2 methods.
Aroma code recommendation:
    TextView licenseView = (TextView) findViewById(R.id.library_license_link);
    SpannableString underlinedLicenseLink = new SpannableString(getString(R.string.library_license_link));
    underlinedLicenseLink.setSpan(new UnderlineSpan(), 0, underlinedLicenseLink.length(), 0);
    licenseView.setText(underlinedLicenseLink);
    licenseView.setOnClickListener(v -> {
        FragmentManager fm = getSupportFragmentManager();
        LibraryLicenseDialog libraryLicenseDlg = new LibraryLicenseDialog();
        libraryLicenseDlg.show(fm, "fragment_license");
    });

Example B: Post-Processing.
Query code snippet:
    Bitmap bitmap = BitmapFactory.decodeResource(getResources(), R.drawable.image);
• This code snippet decodes a bitmap.
• The recommended code suggests applying Gaussian blur on the decoded image, a customary effect to be applied.
• Intersected from a cluster of 4 methods.
Aroma code recommendation:
    int radius = seekBar.getProgress();
    if (radius < 1) {
        radius = 1;
    }
    Bitmap bitmap = BitmapFactory.decodeResource(getResources(), R.drawable.image);
    imageView.setImageBitmap(blur.gaussianBlur(radius, bitmap));

Example C: Correlated Statements.
Query code snippet:
    EditText et = (EditText) findViewById(R.id.inbox);
    et.setSelection(et.getText().length());
• This code snippet moves the cursor to the end in a text area.
• The recommended code suggests also configuring the action bar to create a more focused view.
• Intersected from a cluster of 2 methods.
Aroma code recommendation:
    super.onCreate(savedInstanceState);
    setContentView(R.layout.material_edittext_activity_main);
    getSupportActionBar().setDisplayHomeAsUpEnabled(true);
    getSupportActionBar().setDisplayShowTitleEnabled(false);
    EditText singleLineEllipsisEt = (EditText) findViewById(R.id.singleLineEllipsisEt);
    singleLineEllipsisEt.setSelection(singleLineEllipsisEt.getText().length());

Example D: Exact Recommendations.
Query code snippet:
    PackageInfo pInfo = getPackageManager().getPackageInfo(getPackageName(), 0);
    String version = pInfo.versionName;
• This partial code snippet gets the current version of the application. The rest of the code snippet (not shown) catches and handles possible NameNotFound errors.
• The recommended code suggests the exact same error handling as in the original code snippet.
• Intersected from a cluster of 2 methods.
Aroma code recommendation:
    try {
        PackageInfo pInfo = getPackageManager().getPackageInfo(getPackageName(), 0);
        String version = pInfo.versionName;
        TextView versionView = (TextView) findViewById(R.id.about_project_version);
        versionView.setText("v" + version);
    } catch (PackageManager.NameNotFoundException ex) {
        Log.e(...);
    }

Example E: Alternative Recommendations.
Query code snippet:
    i.putExtra("parcelable_extra", (Parcelable) myParcelableObject);
• This partial code snippet demonstrates one way to attach an object to an Intent. The rest of the code snippet (not shown) shows a different way to serialize and attach an object.
• Intersected from a cluster of 10 methods.
Aroma code recommendation:
    Intent intent = new Intent(this, BoardTopicActivity.class);
    intent.putExtra(SMTHApplication.BOARD_OBJECT, (Parcelable) board);
    startActivity(intent);
• The recommended code does not suggest the other way of serializing the object, but rather suggests a common way to complete the operation by starting an activity with an Intent containing a serialized object.

Sources:
Adapted from the Stack Overflow post “Can I underline text in an android layout?” [https://stackoverflow.com/questions/2394939], by Anthony Forloney [https://stackoverflow.com/users/166712].
Adapted from https://github.com/tonyvu2014/android-shoppingcart/blob/master/demo/src/main/java/com/android/tonyvu/sc/demo/ProductActivity.java.
Adapted from the Stack Overflow post “How to set a bitmap from resource” [https://stackoverflow.com/questions/4955305], by xandy [https://stackoverflow.com/users/109112].
Adapted from https://github.com/TonnyL/GaussianBlur/blob/master/app/src/main/java/io/github/marktony/gaussianblur/MainActivity.java.
Adapted from the Stack Overflow post “Place cursor at the end of text in EditText” [https://stackoverflow.com/questions/6624186], by Marqs [https://stackoverflow.com/users/400493].
Adapted from https://github.com/cymcsg/UltimateAndroid/blob/master/deprecated/UltimateAndroidGradle/demoofui/src/main/java/com/marshalchen/common/demoofui/sampleModules/MaterialEditTextActivity.java.
Adapted from the Stack Overflow post “How to get the build/version number of your Android application?” [https://stackoverflow.com/questions/6593822], by plus- [https://stackoverflow.com/users/709635].
Adapted from https://github.com/front-line-tech/background-service-lib/blob/master/SampleService/servicelib/src/main/java/com/flt/servicelib/AbstractPermissionExtensionAppCompatActivity.java.
Adapted from the Stack Overflow post “How to send an object from one Android Activity to another using Intents?” [https://stackoverflow.com/questions/2141166], by Jeremy Logan [https://stackoverflow.com/users/76835].
Adapted from https://github.com/zfdang/zSMTH-Android/blob/master/app/src/main/java/com/zfdang/zsmth_android/MainActivity.java.
All Stack Overflow content is licensed under CC-BY-SA 3.0. All URLs were accessed in August 2018.
developers’ need. We evaluated Aroma using code snippets obtained from Stack Overflow. We manually analyzed and categorized the recommendations into several representative categories. We also evaluated Aroma recommendations on 50 partial code snippets, where we found that Aroma can recommend the exact code snippets for 37 queries, and in the remaining 13 cases Aroma provides alternative recommendations that are still useful. On average, Aroma takes 1.6 seconds to create recommendations for a query code snippet on a 24-core CPU. In our large-scale automated evaluation, we used a micro-benchmarking suite containing artificially created query snippets to evaluate the effectiveness of various design choices in Aroma. Finally, we conducted a user study of Aroma by observing 12 Hack programmers interacting with the IDE on 4 short programming tasks, and found that Aroma is a helpful addition to the existing coding assistant tools.

The rest of the paper is organized as follows: Section 2 presents a case study that reveals the opportunity for a code recommendation tool like Aroma. In Section 3, we describe the algorithm Aroma uses to create code recommendations. In Section 4, we manually assess how useful Aroma code recommendations are. Since code search is a key component of creating recommendations, in Section 5 we measure the search recall of Aroma and compare it with other techniques. Section 6 introduces the real-world deployment of Aroma. In Section 7, we report the initial developer experience with using the Aroma tool. Section 8 presents related work. Finally, Section 9 concludes the paper.
Aroma is based on the idea that new code often resembles code that has already been written—therefore, programmers can benefit from recommendations from existing code. To substantiate this claim, we conducted an experiment to measure the similarity of new code to existing code. This experiment was conducted on a large codebase in the Hack language.

We first collected all code commits submitted in a two-day period. From these commits, we extracted a set of changesets. A changeset is defined as a set of contiguous added or modified lines in a code commit. We filtered out changesets that were shorter than two lines or longer than seven lines. We decided to use this filter because longer changesets are more likely to span multiple methods, and we wanted to limit our dataset to code added or modified within a single method. Alternatively, we could have taken portions of changesets found within a single method; however, since changesets are raw text, finding the method boundaries involves additional parsing. We stuck to the simple solution of taking short changesets.

For each of the first 1000 changesets in this set, we used Aroma to perform a code-to-code search, taking the snippet as input and returning a list of methods in the repository that contain structurally similar code. Aroma was used because it was already implemented for Hack—but for the purpose of this experiment, any code-to-code search tool or clone detector would work. The results are ranked by similarity score: the percentage of features in the search query that are also found in the search result. For each changeset, we took the top-ranked method and its similarity score.
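The similarity score used here can be sketched as a plain set-overlap computation. The feature strings below are illustrative placeholders, not Aroma's actual features, and the function is a sketch of the scoring idea rather than Aroma's implementation:

```python
def similarity_score(query_features, result_features):
    """Fraction of the query's features that also appear in the result."""
    query_features = set(query_features)
    return len(query_features & set(result_features)) / len(query_features)

# Hypothetical feature sets for a changeset and its top-ranked method:
changeset_features = {"f1", "f2", "f3", "f4"}
method_features = {"f1", "f2", "f3", "g9"}
print(similarity_score(changeset_features, method_features))  # -> 0.75
```

A score of 1.0 would mean every feature of the changeset is found in the retrieved method.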
71 changesets did not yield any search results, because they contained only comments or variable lists, which Aroma disregards in search (see Section 3.2).

To interpret the results, we first needed to assess the correlation between the similarity score (i.e. a measure of the syntactic similarity) and the semantic similarity between the changeset and the result. Two authors manually looked over a random sample of 50 pairs of changesets and result methods, and decided whether each method contained code similar enough to the changeset that a programmer could adopt the existing code (by copy-pasting or refactoring) with minimal changes. Using this criterion, each pair was deemed “similar” or “not similar”. Conflicting judgments were cross-checked and re-assessed. As shown in the box plot in Figure 1, there is a clear distinction in
Fig. 1. Distribution of similarity scores used to obtain the threshold.
Fig. 2. Proportion of new code similar to existing code (similar code: 35.3%; non-similar code: 64.7%).

similarity scores between the manually-labeled “similar” and “not similar” pairs. Note that in this figure, the top and bottom of the box represent the third and first quartiles of the data. The lines extending above and below the box represent the maximum and minimum, and the line running through the box represents the median value.

We chose the first quartile of the similarity scores of the manually-labeled similar pairs—0.71—as the threshold similarity score to decide whether a retrieved code snippet contains meaningful semantic similarities to new code in the commit. We found that for 35.3% of changesets, the most similar result had a score of at least 0.71, meaning that in these cases it would be easy for a programmer to adapt the existing code with minimal effort, should the code be provided to them.

These results indicate that a considerable amount of new code contains similarities to code that already exists in a large code repository. With Aroma, we aim to utilize this similar code to offer concise, helpful code recommendations to programmers. The amount of similar code in new commits suggests that Aroma has the potential to save a lot of programmers’ time and effort.
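The threshold selection can be reproduced with the standard library. The scores below are invented for illustration and are not the study's data; on the real data the first quartile came out to 0.71:

```python
import statistics

# Hypothetical similarity scores of the manually-labeled "similar" pairs.
similar_scores = [0.5, 0.6, 0.7, 0.8, 0.9]

# First quartile of the "similar" scores, used as the cutoff.
threshold = statistics.quantiles(similar_scores, n=4, method="inclusive")[0]
print(threshold)  # -> 0.6

# A changeset counts as having a meaningful match if its top-ranked
# result scores at least the threshold.
top_scores = [0.55, 0.75, 0.9]
fraction_similar = sum(s >= threshold for s in top_scores) / len(top_scores)
```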
Figure 3 illustrates the overall architecture of Aroma. In order to generate code recommendations, Aroma must first featurize the code corpus. To do so, Aroma parses the body of each method in the corpus and creates its parse tree. It extracts a set of structural features from each parse tree. Then, given a query code snippet, Aroma runs the following phases to create recommendations:
• Light-weight Search. In this phase, Aroma takes a query code snippet, and outputs a list of the top few (e.g. 1000) methods that have the most overlap with the query. To do so, Aroma extracts custom features from the query and each method in the corpus. Aroma intersects the set of features of the query and each method body, and uses the cardinality of the intersection to compute the degree of overlap between the query and the method body. To make this computation efficient, Aroma represents the set of features of a code snippet as a sparse vector and performs matrix multiplication to compute the degree of overlap of all methods with the query code.
• Prune and Rerank. In this phase, Aroma reranks the list of method bodies retrieved from the previous phase using a more precise, but expensive, algorithm for computing similarity.
• Cluster and Intersect.
In the final phase, Aroma clusters the reranked list of code snippets from the previous phase. Clustering is based on the similarity of the method bodies. Clustering also needs to satisfy constraints which ensure that recommendations are of high quality. Therefore, we have devised a custom clustering algorithm which takes the constraints into account. Aroma then intersects the snippets in each cluster to come up with recommended code snippets. This approach of clustering and intersection helps to create a succinct, yet diverse set of recommendations.

Fig. 3. Aroma code recommendation pipeline.

We next describe the details of each step using the code snippet shown in Listing 4 as the running example.

    if (view instanceof ViewGroup) {
        for (int i = 0; i < ((ViewGroup) view).getChildCount(); i++) {
            View innerView = ((ViewGroup) view).getChildAt(i);
        }
    }

Listing 4. A code snippet adapted from a Stack Overflow post. This snippet is used as the running example through Section 3.
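The overlap computation in the light-weight search phase can be sketched with binary feature vectors: with 0/1 vectors, the dot product of the query's vector with a method's vector equals the cardinality of their feature-set intersection, which is why the phase reduces to a sparse matrix multiplication. The feature strings, method names, and dense vectors below are illustrative simplifications, not Aroma's actual representation:

```python
def to_vector(features, vocab):
    """Encode a feature set as a 0/1 vector over a fixed vocabulary."""
    return [1 if f in features else 0 for f in vocab]

def rank_methods(query_features, corpus):
    """Rank corpus methods by dot-product overlap with the query."""
    vocab = sorted({f for feats in corpus.values() for f in feats} | set(query_features))
    q = to_vector(query_features, vocab)
    scores = {}
    for name, feats in corpus.items():
        v = to_vector(feats, vocab)
        # dot product of 0/1 vectors = |feature-set intersection|
        scores[name] = sum(qi * vi for qi, vi in zip(q, v))
    return sorted(scores, key=scores.get, reverse=True)

corpus = {
    "methodA": {"decodeStream", "open", "close"},
    "methodB": {"setText", "findViewById"},
}
print(rank_methods({"decodeStream", "open"}, corpus))  # methodA ranks first
```

A production implementation would stack all method vectors into one sparse matrix so a single matrix-vector product scores the whole corpus at once.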
In this section, we introduce several notations and definitions used to compute the features of a code snippet. The terminology and notation are also used to describe Aroma formally.

Definition 1 (Keyword tokens). This is the set of all tokens in a language whose values are fixed as part of the language. Keyword tokens include keywords such as while, if, else, and symbols such as {, }, ., +, *. The set of all keyword tokens is finite for a language.

Definition 2 (Non-keyword tokens). This is the set of all tokens that are not keyword tokens. Non-keyword tokens include variable names, method names, field names, and literals. Examples of non-keyword tokens are i, length, 0, 1, etc. The set of non-keyword tokens is non-finite for most languages.

Adapted from the Stack Overflow post “How to hide soft keyboard on android after clicking outside EditText?” [https://stackoverflow.com/questions/11656129], by Navneeth G [https://stackoverflow.com/users/1135909]. CC-BY-SA 3.0 License. Accessed in August 2018.
Definition 3 (Simplified Parse Tree). A simplified parse tree is a data structure we use to represent a program. It is recursively defined as a non-empty list whose elements could be any of the following:
• a non-keyword token,
• a keyword token, or
• a simplified parse tree.
Moreover, a simplified parse tree cannot be a list containing a single simplified parse tree.

We picked this particular representation of programs instead of a conventional abstract syntax tree representation because the representation only consists of program tokens, and does not use any special language-specific rule names such as IfStatement, block, etc. As such, the representation can be used uniformly across various programming languages. Moreover, one could perform an in-order traversal of a simplified parse tree and print the token names to obtain the original program, albeit unformatted. We use this feature of a simplified parse tree to show the recommended code snippets.

Definition 4 (Label of a Simplified Parse Tree). The label of a simplified parse tree is obtained by concatenating all the elements of the list representing the tree as follows:
• If an element is a keyword token, the value of the token is used for concatenation.
• If an element is a non-keyword token or a simplified parse tree, the special symbol # is used for concatenation.
For example, the label of the simplified parse tree ["x", ">", ["y", ".", "f"]] is "#>#".

Fig. 4. The simplified parse tree representation of the code in Listing 4. Keyword tokens at the leaves are omitted to avoid clutter. Variable nodes are highlighted in double circles.
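Definition 4 can be sketched in a few lines, assuming simplified parse trees are nested Python lists, "#" is the special symbol, and keyword tokens are drawn from a small illustrative set rather than a full lexical specification:

```python
# Illustrative subset of keyword tokens; a real implementation would take
# these from the language's lexical specification.
KEYWORDS = {"if", "else", "while", "for", "instanceof",
            "(", ")", "{", "}", ".", ",", ";", "+", "*", ">", "<", "="}

def label(tree):
    """Concatenate keyword tokens verbatim; subtrees and non-keyword
    tokens each contribute the special symbol '#' (Definition 4)."""
    parts = []
    for element in tree:
        if isinstance(element, list) or element not in KEYWORDS:
            parts.append("#")  # subtree or non-keyword token
        else:
            parts.append(element)  # keyword token kept verbatim
    return "".join(parts)

print(label(["x", ">", ["y", ".", "f"]]))  # -> #>#
```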
Figure 4 visualizes the simplified parse tree of the code snippet in Listing 4. In the figure, each internal node represents a simplified parse tree, and is labeled using the tree’s label as defined above. Since keyword tokens in a simplified parse tree become part of the label of the tree, we do not create leaf nodes for keyword tokens in the tree diagram—we only add leaf nodes for non-keyword tokens. We show the label of each node in the tree, and add a unique index to each label as a subscript to distinguish between any two similar labels.
To obtain the simplified parse tree of a code snippet, Aroma relies on a language-specific parser. For example, Aroma utilizes the ANTLR4 [Parr 2013] Java parser to produce the parse tree for a Java program. Aroma traverses the parse tree produced by the parser to collect, at each internal node, the tokens and subtrees that are immediate descendants of that node. The collected elements at each node form a list, which is a simplified parse tree. Aroma uses the list at each internal node to create the label for the node. The distinction between keyword and non-keyword tokens is made using the language’s lexical specification. Aroma performs a second traversal of the tree and uses the static scoping rules of the language to identify the global variables. Aroma uses the knowledge of the global variables in the featurization phase, which we describe later. Note that this process of creating a simplified parse tree from a code snippet is language-dependent and requires knowledge about the grammar and static scoping rules of the language. Once the simplified parse tree of a code snippet has been created, the rest of Aroma’s algorithm is language-agnostic.

Given a simplified parse tree t, we use the following notations. All examples refer to Figure 4, with subscripts denoting order of occurrence.
• L(t) denotes the label of the tree t. E.g. L(if₁) = if(#){#}.
• N(t) denotes the list of all non-keyword tokens present in t or in any of its sub-trees, in the same order as they appear in the source code. E.g. N(#.#(#)₁) = [ViewGroup, view, getChildAt, i].
• If n is a non-keyword direct child of t, then we use P(n) to denote the parent of n, which is t. E.g. P(view₁) = #instanceof#₁.
• If t′ is a simplified parse tree and is a direct child of t, then we again use P(t′) to denote the parent of t′, which is t. E.g. P(#instanceof#₁) = if(#){#}₁.
• If n₁ and n₂ are two non-keyword tokens in a program, and n₂ appears after n₁ in the program without any intervening non-keyword token, then we use Prev(n₂) to denote n₁ and Next(n₁) to denote n₂. E.g. Prev(view₂) = ViewGroup₂, Next(ViewGroup₂) = view₂.
• If n₁ and n₂ are two non-keyword tokens denoting the same local variable in a program, and n₁ and n₂ are two consecutive usages of the variable in the source code, then we use PrevUse(n₂) to denote n₁ and NextUse(n₁) to denote n₂. E.g. PrevUse(view₂) = view₁, NextUse(view₁) = view₂.
• If n is a non-keyword token denoting a local variable and it is the i-th child of its parent t, then the context of n, denoted by C(n), is defined to be:
  – (i, L(t)), if L(t) ≠ #.#(#). E.g. C(view₂) = (2, (#)#₂).
  – The first non-keyword token that is not a local variable in N(t), otherwise. This is to accommodate cases like x.foo(), where we want the context feature for x to be foo rather than (1, #.#(#)), because the former better reflects its usage context.

The high-level goal of this step is to take a simplified parse tree for a code snippet, and extract a set of structural features from that parse tree. A key requirement of the features is that if two code snippets are similar, they should have the same collection of features.

A simple way to featurize a code snippet is to treat the labels of all nodes in the simplified parse tree as features. This simple approach creates a problem if we have two code snippets 1) which only differ in local variable names, and 2) where one code snippet can be obtained from the other by alpha-renaming the local variables. The two code snippets should be considered similar, but the collection of features will differ in the names of some of the variables. Therefore, we replace each token that denotes a local variable by the special token #. We do not perform similar replacements
for global variables and method names. This is because such identifiers are often part of some library API and cannot be alpha-renamed to obtain similar programs.

Treating the labels of parse tree nodes as the only features does not help to capture the relations between the nodes. Such relations are necessary to identify the structural features of a code snippet. For example, without such relations Aroma will treat the snippets if (x > 0) z = 3; and if (z > 3) x = 0; as similar, since they have the exact same collection of node labels (i.e. {if(#)#, #>#, #=#, #, 0, 3}, with local variables replaced by #). If we can somehow create a feature encapsulating the fact that 3 belongs to the body of the if statement in the first snippet, Aroma will distinguish between the two snippets. Therefore, Aroma also creates features which represent some relations between certain pairs of nodes in the parse tree. Examples of some such features involving the token 3 are (3, 2, if(#)#), (3, 2, #=#), and (#, 3). The first feature, which is denoted as a triplet, states that the 2nd child of a node labeled if(#)# has a descendant leaf node with label 3. Similarly, the second feature asserts that the 2nd child of a node labeled #=# has a descendant leaf node with label 3. We call these two features parent features, as they help capture the relation of a leaf node with its parent, grand-parent, and great-grand-parent. The third feature relays the fact that a variable leaf node appears before 3. We call such features sibling features. In summary, the parent features and sibling features capture some local relations between the nodes in a parse tree. However, these features are not exhaustive enough to recreate the parse tree from the features.
This non-exhaustiveness of features helps Aroma tolerate some differences between otherwise similar code snippets, and helps Aroma retrieve closely related, but non-identical, code snippets during search.

Since we replace all local variable names with #VAR, we also need to relate two variable usages in a code snippet which refer to the same local variable. For example, in the code snippet if (y < 0) x = -x;, we will have three features of the form #VAR, corresponding to the two occurrences of x and the one occurrence of y. However, the collection of features described so far does not express the fact that two features refer to the same variable. There is no direct way to state that two variables are related, since we have gotten rid of variable names. Rather, we capture features relating the consecutive usage contexts of the same local variable. We call such features variable usage features. For example, the contexts of the two usages of x are (1, =) and (1, -), respectively. The first context corresponds to the first usage of x and denotes that there is a variable which is the first child of a node labeled =. The index and the parent node label together form the context of this particular variable usage. Similarly, the second context denotes the second usage of x. We create a feature of the form ((1, =), (1, -)), which captures the relation between the contexts of two consecutive usages of the same variable.

We now describe formally how a code snippet is featurized by Aroma. Given a simplified parse tree, we extract four kinds of features for each non-keyword token n in the program represented by the tree:
(1) A Token Feature of the form n. If n is a local variable, we replace n with #VAR.
(2) Parent Features of the form (n, i1, L(t1)), (n, i2, L(t2)), and (n, i3, L(t3)). Here n is the i1-th child of t1, t1 is the i2-th child of t2, and t2 is the i3-th child of t3. As before, if n is a local variable, then we replace n with #VAR.
Note that in each of these features, we do not specify whether the third element of the feature is the parent, grandparent, or great-grandparent. This helps Aroma tolerate some differences between otherwise similar code snippets.
(3) Sibling Features of the form (n, Next(n)) and (Prev(n), n). As before, if any of n, Next(n), or Prev(n) is a local variable, it is replaced with #VAR.
(4) Variable Usage Features of the form (C(PrevUse(n)), C(n)) and (C(n), C(NextUse(n))). We only add these features if n is a local variable.
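As a concrete illustration of these four feature kinds, the sketch below extracts them from a toy simplified parse tree. The `Node` encoding, the label strings (such as `%=%`), and the treatment of operators as keyword tokens are our own illustrative assumptions, and the special-case context rule for field accesses is omitted; this is a sketch of the idea, not Aroma's implementation.

```python
from collections import Counter

VAR = "#VAR"  # special token replacing local variable names

class Node:
    def __init__(self, label, children):
        self.label, self.children = label, children

def leaves(t, anc=()):
    # yield (token, ancestors); ancestors[k] = (i, node) means the k-th
    # enclosing subtree is the i-th child (1-based) of `node`
    for i, c in enumerate(t.children, 1):
        if isinstance(c, Node):
            yield from leaves(c, ((i, t),) + anc)
        else:
            yield c, ((i, t),) + anc

def featurize(tree, local_vars, keywords):
    feats = Counter()
    toks = [(t, a) for t, a in leaves(tree) if t not in keywords]
    name = lambda t: VAR if t in local_vars else t
    for tok, anc in toks:
        feats[name(tok)] += 1                       # (1) token feature
        for i, t in anc[:3]:                        # (2) parent features
            feats[(name(tok), i, t.label)] += 1
    for (a, _), (b, _) in zip(toks, toks[1:]):      # (3) sibling features
        feats[(name(a), name(b))] += 1
    ctx = lambda anc: (anc[0][0], anc[0][1].label)  # context: (index, parent label)
    last = {}
    for tok, anc in toks:                           # (4) variable usage features
        if tok in local_vars:
            if tok in last:
                feats[(last[tok], ctx(anc))] += 1
            last[tok] = ctx(anc)
    return feats
```

For a toy tree of x = -x (with x a local variable), this yields the token feature #VAR twice, a parent feature (#VAR, 1, %=%), one sibling feature (#VAR, #VAR), and one variable usage feature relating the two usage contexts of x.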
For a non-keyword token n ∈ N(t), we use F(n) to denote the multi-set of features extracted for n. We extend the definition of F to a set of non-keyword tokens Q as follows: F(Q) = ⊎_{n ∈ Q} F(n), where ⊎ denotes multi-set union. For a simplified parse tree t, we use F(t) to denote the multi-set of features of all non-keyword tokens in t, i.e. F(t) = F(N(t)). Let F be the set of all features that can be extracted from a given corpus of code.

Table 2 illustrates the features extracted for two non-keyword tokens from the simplified parse tree in Figure 4. In the interest of space, we do not show the features extracted by Aroma for all non-keyword tokens.

Table 2. Features for selected tokens in Figure 4
Token  Token Feature  Parent Features                           Sibling Features                          Variable Usage Features
view   #VAR           (#VAR, 2, …), (#VAR, 1, …), (#VAR, 1, …)  (ViewGroup, #VAR), (#VAR, getChildCount)  ((1, …), (2, …)), ((2, …), (2, …))
0      0              (0, 1, int …), (0, 1, …)                  (#VAR, 0), (0, #VAR)                      –

In this phase, Aroma takes a query code snippet and outputs a list of the top few (e.g. 1000) methods that contain the most overlap with the query. To compute the top methods, we need to compute the degree of overlap between the query and each method body in the corpus. Because our corpus has millions of methods, the degree of overlap must be computable quickly at query time. We use the feature sets of code snippets to compute the degree of overlap, which we call the overlap score. Specifically, Aroma intersects the set of features of the query and the method body, and uses the cardinality of the intersection as the overlap score. Computing the intersection and its cardinality could be computationally expensive. For efficiency, we represent the set of features of a code snippet as a sparse vector and perform matrix multiplication to compute the overlap scores of all methods with respect to the query code. We next describe the formal details of the phase.

Given a large code corpus containing millions of methods, Aroma parses and creates a simplified parse tree for each method body. It then featurizes each simplified parse tree. Let M be the set of simplified parse trees of all method bodies in the corpus. Aroma also parses the query code snippet to create its simplified parse tree, say q, and extracts its features. For the simplified parse tree m of each method body in the corpus, we use the cardinality of the set S(F(m)) ∩ S(F(q)) as an approximate score, called the overlap score, of how much of the query code snippet overlaps with the method body. Here S(X) denotes the set of elements of the multi-set X, where we ignore the count of each element in the multi-set.
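The overlap-score retrieval can be sketched with an inverted index over binary feature vectors. This is an illustrative implementation of the same computation (each query feature contributes 1 to the score of every method containing it, exactly as in a sparse matrix-vector product), not Aroma's actual code.

```python
from collections import defaultdict
from heapq import nlargest

def build_index(method_features):
    """method_features: list of feature sets, one per method body.
    Returns an inverted index: feature -> list of method ids."""
    index = defaultdict(list)
    for mid, feats in enumerate(method_features):
        for f in set(feats):          # binary entries: ignore feature counts
            index[f].append(mid)
    return index

def search(index, query_features, eta=1000):
    """Overlap score = |S(F(m)) ∩ S(F(q))|: each query feature adds 1 to
    the score of every method containing it; keep the top eta methods."""
    scores = defaultdict(int)
    for f in set(query_features):
        for mid in index.get(f, ()):
            scores[mid] += 1
    return nlargest(eta, scores.items(), key=lambda kv: kv[1])
```

For example, indexing the feature sets {a, b, c}, {b, c, d}, and {x, y} and querying with {b, c, z} scores the first two methods 2 each and the third 0.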
Aroma computes a list of η method bodies whose overlap scores are highest with respect to the query code snippet. In our implementation, η is usually 1000.

The computation of this list can be reduced to a simple multiplication between a matrix and a sparse vector as follows. The features of a code snippet can be represented as a sparse vector of length |F|: the vector has an entry for each feature in F. If a feature f_i is present in F(m), the multi-set of features of the simplified parse tree m, then the i-th entry of the vector is 1, and 0 otherwise. Note that the elements of each vector can be either 0 or 1; we ignore the count of each feature in the vector. To understand this decision, consider a method m that contains a feature f numerous times (say n times). Then, say we give Aroma a query q that contains f once. The overlap score between m and q would be increased by n, even though the multiple instances of this feature do not actually indicate greater overlap between m and q. The sparse feature vectors of all method bodies can then be organized as a matrix, say D, of shape |M| × |F|. Let v_q be the sparse feature vector of the query code snippet q. Then D · v_q is a vector of size |M| that gives the overlap score of each method body with respect to the query snippet. Aroma picks the top η method bodies with the highest overlap scores. Let N be the list of simplified parse trees of the method bodies picked by Aroma.

The corpus we used for evaluation has over 37 million unique features, but each method has an average of only 63 features, so the feature vectors are very sparse. Thus, the matrix multiplication described above can be done efficiently using a fast sparse matrix multiplication library; for our corpus, this phase finishes in less than a second.

In the following phases, we need a sub-algorithm to compute a maximal code snippet that is common to two given code snippets.
For example, given the code snippets x = 1; y = 2; and y = 2; z = 3;, we need an algorithm that computes y = 2; as the intersection of the two code snippets. This algorithm could be implemented using a longest-common subsequence (LCS) [Porter 1997] computation on strings, by treating the two code snippets as strings. Such an algorithm was used in SNIFF [Chatterjee et al. 2009], which performs natural language search over small code snippets. However, LCS does not work well for Aroma, because the common parts of two code snippets are often not exactly identical. To illustrate this point, suppose we are given the two code snippets x = 1; if (y > 1) if (z < 0) w = 4; and if (z < 0) if (y > 1) w = 4; v = 10;, where the nesting of the two if statements has been swapped. LCS will retrieve either if (y > 1) w = 4; or if (z < 0) w = 4; as the intersection, i.e. LCS drops one of the if statements along with the non-common assignment statements. Ideally, we should have both if statements in the intersection, i.e. the intersection algorithm should compute either if (y > 1) if (z < 0) w = 4; or if (z < 0) if (y > 1) w = 4; as the intersection.

The example also shows that when only fuzzy similarity exists between the given code snippets, there can be two candidate intersections: one closer in form to the first snippet, and one closer to the second. We resolve this ambiguity by picking the intersected snippet that is closer to the second snippet. Thus, we can think of the intersection as a code snippet obtained by taking the second snippet and dropping those fragments which have no resemblance to the first snippet. That is, the algorithm prunes the second snippet while retaining the parts common with the first one.

A simple way to prune the second snippet is to look at its parse tree and find a subtree which is most similar to the first snippet. However, such an algorithm would be expensive, because there are exponentially many subtrees in a given tree.
Instead, Aroma uses a greedy algorithm which gives us a maximal subtree of the second snippet's parse tree. We have observed that if we can identify all the leaf nodes in the second snippet's parse tree that need to be present in the intersection, we can get a maximal subtree by simply retaining all the nodes and edges in the tree that lie on a path from the root to one of the identified leaf nodes. We next formally describe the pruning algorithm.

Let us assume we are given two code snippets, say m1 and m2, in the form of their parse trees. The computation of the optimal pruned simplified parse tree, say m2^p, requires us to find a subset, say R, of the leaf nodes of m2. Recall that the set of leaf nodes of m2 is denoted by N(m2) and contains exactly the non-keyword tokens in the parse tree. The set R should be such that the similarity between m2^p and m1 is maximal. We use the cardinality of the multi-set intersection of the features of two code snippets as their similarity score. That is, the similarity score between two snippets given as parse trees, say m1 and m2, is |F(m1) ∩ F(m2)|. Let us denote it by SimScore(m1, m2). Once we have computed the set of leaf nodes R that need to be present in the intersection, m2^p is the subtree consisting of the nodes in R, and any internal nodes and edges in m2 which are along a path from some n ∈ R to the root node of m2. The greedy algorithm for computing R is described in Algorithm 1.

Algorithm 1
The Pruning Algorithm

procedure Prune(F(m1), m2)
    ▷ The first argument takes the features of m1 instead of m1 itself, to simplify the description of Phase III.
    R ← ∅
    F ← ∅
    repeat
        n ← argmax_{n′ ∈ N(m2) \ R} SimScore(F(m1), F ⊎ F(n′))
        if n exists and SimScore(F(m1), F ⊎ F(n)) > SimScore(F(m1), F) then
            R ← R ∪ {n}
            F ← F ⊎ F(n)
        end if
    until R does not change anymore
    return m2^p, where m2^p is obtained from m2 by retaining all the non-keyword tokens in R and any internal node or edge which appears on a path from some n ∈ R to the root of m2
end procedure

In the algorithm, Aroma maintains the collection of features of the intersected snippet in the variable F, and the set of leaf nodes of the intersected code in the variable R. Initially, the algorithm starts with an empty set of leaf nodes. It then iteratively adds leaf nodes to the set from the parse tree of the second method (i.e. m2). A node n is added if it increases the similarity between the first method and the tree that can be obtained from R. Since F maintains the features of the tree that can be constructed from R, we can get the features of R ∪ {n} by simply adding the features associated with n (i.e. F(n)) to F. If no such node can be found, the algorithm constructs the intersected tree from R and returns it.

We next show how Aroma uses the pruning algorithm to rerank the snippets retrieved in Phase I. Given a query, say q, and a method body, say m, pruning the method with respect to the query (i.e. Prune(F(q), m)) gives a code snippet that is common to both the query and the method. The similarity score between the query and the pruned code snippet is then an alternative way to quantify the overlap between the query and the method. We found empirically that if we use this alternative score to rerank the methods retrieved in Phase I (i.e. N), our ranking of search results improves slightly.
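Algorithm 1 can be sketched as follows, representing each candidate leaf of m2 by its own feature multiset and SimScore by the size of a multi-set intersection. The list-of-Counters encoding is our own simplification, and the sketch returns the selected leaf set R rather than the pruned tree.

```python
from collections import Counter

def sim_score(f1, f2):
    """|F(m1) ∩ F(m2)| on feature multisets."""
    return sum((f1 & f2).values())

def prune(f_m1, leaf_feats):
    """Greedy pruning in the spirit of Algorithm 1. f_m1: Counter of
    features of the first snippet; leaf_feats: one Counter per non-keyword
    token of the second snippet. Returns the selected leaf indices R."""
    R, F = set(), Counter()
    while True:
        best, gain = None, 0
        for i, fn in enumerate(leaf_feats):
            if i in R:
                continue
            # improvement in similarity if leaf i is added
            g = sim_score(f_m1, F + fn) - sim_score(f_m1, F)
            if g > gain:
                best, gain = i, g
        if best is None:          # no leaf improves the score: stop
            return R
        R.add(best)
        F += leaf_feats[best]
```

For instance, pruning a snippet whose leaves carry features {a}, {c}, {b} against a first snippet with features {a, b} selects the first and third leaves and drops the middle one.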
Aroma uses the reranked list, which we call N, in the next phase for clustering and intersection. Note that the pruning algorithm is greedy, so we may not find the best intersection between two code snippets. In Section 5, we show that in very rare cases the greedy pruning algorithm may not give us the best recommended code snippets.

Listing 5 shows a code snippet from the reranked search results for the query code snippet in Listing 4. In the code snippet, the highlighted tokens are selected by the pruning algorithm to maximize the similarity score to the query snippet.

In the final phase, Aroma prepares recommendations by clustering and intersecting the reranked search results from the previous phase. Clustering and intersection are computationally expensive. Therefore, we pick from the list of search results the
top η2 = 100 methods whose overlap score with the query is above a threshold τ1 = 0.65, and run the last phase on them. In the discussion below, we assume that N has been modified to contain only these top η2 search results.

Clustering.
Aroma clusters together method bodies that are similar to each other. The clustering step is necessary to avoid creating redundant recommendations: for each cluster, only one recommendation is generated. Furthermore, the methods in a cluster may contain unnecessary, extraneous code fragments. An intersection of the code snippets in a cluster helps to create a concise recommendation by getting rid of these unnecessary code fragments.

A cluster contains method bodies that are similar to each other. Specifically, a cluster must satisfy the following two constraints:
(1) If we intersect the snippets in a cluster, we should get a code snippet that has more code fragments than the query. This ensures that Aroma's recommendation (which is obtained by intersecting the snippets in the cluster) is an extension of the query snippet.
(2) The pruned code snippets in a cluster must be similar to each other. This is because Aroma has been designed to perform search that tolerates some degree of difference between the query and the results. As such, two code snippets may overlap with different parts of the query. If two such code snippets are part of a cluster, then their intersection will not contain the query snippet. Therefore, the recommendation, which is obtained by intersecting all the snippets in a cluster, will not contain any part of the query. This is undesirable, because we want a recommendation that contains the query plus some extra new code.
Moreover, Aroma does not require the clusters to be disjoint.

Because of these constraints, we cannot simply use a textbook clustering algorithm such as k-means, DBSCAN, or Affinity Propagation. We tried using those clustering algorithms initially (ignoring the constraints) and got poor results. Therefore, we developed a custom clustering algorithm that takes the constraints into account. At a high level, the clustering algorithm starts by treating each method body as a separate cluster.
Then, it iteratively merges a cluster with another cluster containing a single snippet, provided that the merged cluster satisfies the cluster constraints and the size of the recommended code snippet from the merged cluster is reduced minimally. We next formally describe the clustering algorithm.

We use N(i) to denote the tree at index i in the list N. A cluster is a tuple of indices of the form (i1, . . . , ik), where ij < ij+1 for all 1 ≤ j < k. A tuple (i1, . . . , ik) denotes a cluster containing the code snippets N(i1), . . . , N(ik). We define the commonality score of the tuple t = (i1, . . . , ik) as

cs(t) = |∩_{1≤j≤k} F(N(ij))|

Similarly, we define the commonality score of the tuple t with respect to the query q as

csq(t) = |∩_{1≤j≤k} F(Prune(F(q), N(ij)))|

We say that a tuple t = (i1, . . . , ik) is a valid tuple, or a valid cluster, if
(1) l(t) = cs(t)/csq(t) is greater than some user-defined threshold τ2 (which is 1.5 in our experiments). This ensures that after intersecting all the snippets in the cluster, we get a snippet that is at least τ2 times bigger than the query code snippet.
(2) s(t) = csq(t)/|F(Prune(F(q), N(i1)))| is greater than some user-defined threshold τ3 (which is 0.9 in our experiments). This requirement ensures that the pruned snippets in the cluster overlap the query in the same way. Specifically, it says that the intersection of the pruned snippets in a cluster should be very similar to the first pruned snippet.

The set of valid tuples C is computed iteratively as follows:
(1) C1 is the set {(i) | 1 ≤ i ≤ |N| and (i) is a valid tuple}.
(2) C_{ℓ+1} = C_ℓ ∪ {(i1, . . . , ik, i) | (i1, . . . , ik) ∈ C_ℓ and ik < i ≤ |N| and (i1, . . . , ik, i) is a valid tuple and ∀j, if ik < j ≤ |N| then l((i1, . . .
, ik, i)) ≥ l((i1, . . . , ik, j))}.

Aroma computes C1, C2, . . . iteratively until it finds an ℓ such that C_{ℓ+1} = C_ℓ. C = C_ℓ is then the set of all clusters. We developed this custom clustering algorithm because existing popular clustering algorithms such as k-means, DBSCAN, and Affinity Propagation all gave poor recommendations. Our clustering algorithm makes use of several similarity metrics (the containment score and the Jaccard similarity of various feature sets), whereas standard clustering algorithms usually depend on a single notion of distance. We found the current best similarity metric and clustering algorithm through trial and error.

After computing all valid tuples, Aroma sorts the tuples in ascending order of the first index in each tuple, and then in descending order of the length of each tuple. It also drops any tuple t from the list if it is similar (i.e. has a Jaccard similarity of more than 0.5) to any tuple appearing before t in the sorted list. This ensures that the recommended code snippets are not too similar to each other. Let N be the sorted list of the remaining clusters.

Intersection.
Aroma creates a recommendation by intersecting all the snippets in each cluster. The intersection algorithm uses the
Prune function described in Algorithm 1, and ensures that the intersection does not discard any code fragment that is part of the query. Formally, given a tuple t = (i1, . . . , ik), Intersect(t, q) returns a code snippet that is the intersection of the code snippets N(i1), . . . , N(ik), while ensuring that we retain any code that is similar to q. Intersect((i1, . . . , ik), q) is defined recursively as follows:
• Intersect((i1), q) = Prune(F(q), N(i1)).
• Intersect((i1, i2), q) = Prune(F(N(i2)) ⊎ F(q), N(i1)).
• Intersect((i1, . . . , ij, ij+1), q) = Prune(F(N(ij+1)) ⊎ F(q), Intersect((i1, . . . , ij), q)).

In the running example, Listing 5 and Listing 6 form a cluster. To intersect them, Aroma prunes Listing 5 with respect to the union of the features of the query code and of Listing 6. The result of the intersection is shown in Listing 7, which is returned as the recommended code snippet for this cluster. Finally, Aroma picks the top K (where K = 5 in our implementation) clusters from N and returns the intersection of each tuple with the query code snippet as recommendations.

if (!(view instanceof EditText)) {
    view.setOnTouchListener(new
View.OnTouchListener() {
        public boolean onTouch(View v, MotionEvent event) {
            hideKeyBoard();
            return false;
        }
    });
}
if (view instanceof ViewGroup) {
    for (int i = 0; i < ((ViewGroup) view).getChildCount(); i++) {
        View innerView = ((ViewGroup) view).getChildAt(i);
        setupUIToHideKeyBoardOnTouch(innerView);
    }
}

Listing 5. A method body containing the query code snippet in Listing 4. The highlighted text represents tokens selected in the pruning step. Adapted from https://github.com/arcbit/arcbit-android/blob/master/app/src/main/java/com/arcbit/arcbit/ui/SendFragment.java

if (!(view instanceof EditText)) {
    view.setOnTouchListener(new
View.OnTouchListener() {
        public boolean onTouch(View v, MotionEvent event) {
            Utils.toggleSoftKeyBoard(LoginActivity.this, true);
            return false;
        }
    });
}
if (view instanceof ViewGroup) {
    for (int i = 0; i < ((ViewGroup) view).getChildCount(); i++) {
        View innerView = ((ViewGroup) view).getChildAt(i);
        setupUI(innerView);
    }
}

Listing 6. Another method containing the query code snippet in Listing 4. The highlighted text represents tokens selected in the pruning step.

if (!(view instanceof EditText)) {
    view.setOnTouchListener(new View.OnTouchListener() {
        public boolean onTouch(View v, MotionEvent event) {
            // your code...
            return false;
        }
    });
}
if (view instanceof ViewGroup) {
    for (int i = 0; i < ((ViewGroup) view).getChildCount(); i++) {
        View innerView = ((ViewGroup) view).getChildAt(i);
        setupUIToHideKeyBoardOnTouch(innerView);
    }
}

Listing 7. A recommended code snippet created by intersecting code in Listing 5 and Listing 6. Extra lines are highlighted.
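The cluster-validity constraints and the recursive intersection can be sketched together as follows. Here trees are modeled by their feature multisets (Counters) and the prune step is passed in as a parameter standing in for Algorithm 1; both encodings are our own illustrative simplifications, not Aroma's implementation.

```python
from collections import Counter
from functools import reduce

def size(c):                       # |X| for a feature multi-set
    return sum(c.values())

def commonality(feature_sets):     # |∩_j F(...)| over the snippets of a tuple
    return size(reduce(lambda a, b: a & b, feature_sets))

def is_valid_cluster(full_feats, pruned_feats, t2=1.5, t3=0.9):
    """Checks the two cluster constraints. full_feats[j] = F(N(ij)) and
    pruned_feats[j] = F(Prune(F(q), N(ij))), both Counters in tuple order."""
    cs, csq = commonality(full_feats), commonality(pruned_feats)
    if csq == 0:
        return False
    return cs / csq > t2 and csq / size(pruned_feats[0]) > t3

def intersect(indices, q_feats, snippets, feats, prune):
    """Recursive intersection of a cluster (i1, ..., ik): repeatedly prune
    against the next snippet's features plus the query's, so query code is
    never dropped. prune(target_feats, tree) plays the role of Algorithm 1."""
    if len(indices) == 1:
        return prune(q_feats, snippets[indices[0]])
    acc = snippets[indices[0]]
    for i in indices[1:]:
        acc = prune(feats[i] + q_feats, acc)   # Prune(F(N(i)) ⊎ F(q), acc)
    return acc
```

With a toy prune that simply intersects feature multisets, intersecting snippets with features {a, b, c, x} and {a, b, c, y} under query features {a, b} keeps {a, b, c} and drops the parts unique to either snippet.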
Our goal in this section is to assess how Aroma's code recommendations can be useful to programmers. To do so, we collected real-world code snippets from Stack Overflow, used them as query snippets, and inspected the code recommendations provided by Aroma to understand the ways in which they can add value for programmers.
We instantiated Aroma on 5,417 GitHub projects where Java is the main language and Android is the project topic. We ensured the quality of the corpus by picking projects that are not forked from other projects and have at least 5 stars. A previous study [Lopes et al. 2017] shows that duplication is pervasive on GitHub. To make sure Aroma recommendations are created from multiple different code snippets, rather than from the same code snippet duplicated in multiple locations, we removed duplicates at the project level, file level, and method level. We do this by taking hashes of these entities and comparing the hashes. After removing duplicates, the corpus contains 2,417,125 methods.

For evaluation, we picked the 500 most popular questions on Stack Overflow with the android tag. From these questions, we only considered the top voted answers. From each answer, we extracted all Java code snippets containing at least 3 tokens and a method call, and fewer than 20 lines, excluding comments. We randomly picked 64 from this set of Java code snippets. We then used these code snippets to carry out the experimental evaluations in the following two sections. In these experiments, we found that on average Aroma takes 1.6 seconds end-to-end to create recommendations on a 24-core CPU. The median response time is 1.3s, and 95% of queries complete in 4 seconds. A 24-core server was not necessary to achieve reasonable response time: we reran our experiments on a 4-core desktop machine, and the average response time was 2.9 seconds. We believe this makes Aroma suitable for integration into the development environment as a code recommendation tool.

(Listing 6 is adapted from https://github.com/AppLozic/Applozic-Android-Chat-Sample/blob/master/Applozic-Android-AV-Sample/app/src/main/java/com/applozic/mobicomkit/sample/LoginActivity.java)
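The hash-based deduplication described above can be sketched as follows. The whitespace normalization is our own illustrative choice; the paper only states that hashes of projects, files, and methods are compared.

```python
import hashlib

def dedupe(methods):
    """Keep one copy of each method body, comparing SHA-1 hashes of
    whitespace-normalized source text. The same idea applies at the
    project and file level by hashing those entities instead."""
    seen, unique = set(), []
    for src in methods:
        key = hashlib.sha1(" ".join(src.split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(src)
    return unique
```

Two methods that differ only in whitespace hash to the same key under this normalization and are treated as duplicates.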
In this experiment, we manually created partial code snippets by taking the first half of the statements from each of the 64 code snippets. Since each full code snippet from Stack Overflow represents a popular coding pattern, we wanted to check whether Aroma could recommend the missing statements given the partial query code snippet. We always selected the first half of each code snippet in order to avoid subjective bias: since we know how the tool works, we might otherwise be inclined to pick the lines that we think will produce better results. On average, the query code snippets were 1 to 5 lines long and contained 10 to 100 features.

We could not extract partial query code snippets from 14 of the 64 code snippets because they contained a single statement. Single-statement snippets do get recommendations, but since we do not have a ground truth, we cannot judge their quality objectively. For the remaining 50 query code snippets, Aroma recommendations fall into the following two categories.
Exact recommendations. In 37 cases (74%), one of the top 5 Aroma recommendations matched the original code snippet. Example D in Table 1 shows a partial query snippet which included the first two statements in a try-catch block of a Stack Overflow code snippet; Aroma recommended the same error handling code as in the original snippet.
Alternative recommendations. In the other 13 cases (26%), none of the Aroma recommended code snippets matched the original snippet. While in each case the recommended snippets did not contain the original usage pattern, they still fall into some of the categories in Table 3, which we discuss in the next section. Example E in Table 1 shows a partial code snippet which included one of two common ways to send an object with an Intent. Given the statement, Aroma did not recommend the other way to serialize an object in the original code snippet, but instead suggested a customary way to start an activity with an Intent containing a serialized object.
In this experiment, we used the 64 code snippets as queries to evaluate the quality of Aroma's recommendations. While the experiment in the previous section used partial snippets extracted from each of the 64 code snippets, here we used the full code snippets. This meant that we could use all 64 snippets rather than just the 50 used in Section 4.2, as we did not have to filter out single-statement code snippets.

We manually inspected the recommended code snippets and determined whether they were useful. We considered a recommended code snippet to be "useful" if, in a programming scenario where a programmer writes the query code, they would benefit from seeing the related methods or common usage patterns in the code recommendations. We classified the recommended snippets into several categories according to how the recommended code relates to the query snippet. The classification is subjective, because there is no ground truth for what the recommended code should be, and the actual usefulness depends on how familiar the programmer is with the language and framework. Nevertheless, we present the categories and some examples in Table 1 to demonstrate the variety
of code recommendations Aroma can provide. Two of the authors did the manual inspection and categorization, and two other authors verified the results.
Configuring Objects. In this category, the recommended code suggests additional configurations on objects that already appear in the query code. Examples include adding callback handlers and setting additional flags and properties of an existing object. Listings 1 and 2 in the introduction, as well as Example A in Table 1, show examples of this category. These recommendations can be helpful to programmers who are unfamiliar with the idiomatic usages of library methods.
Error Checking and Handling. In this category, the recommended code adds null checks and other checks before using an object, or adds a try-catch block that guards the original code snippet. Such additional statements are useful reminders to programmers that the program might enter an erroneous state or even crash at runtime if exceptions and corner cases are not carefully handled. Listings 1 and 3 in the introduction show an example of this category.
Post-processing. The recommended code extends the query code to perform common operations on the objects or values computed by the query code. For example, the recommended code can show API methods that are commonly called. Example B in Table 1 shows an example of this category, where the recommendation applies Gaussian blurring to the decoded bitmap image. This pattern is not obligatory, but demonstrates a possible effect that can be applied to the original object. This category of recommendations can help programmers discover related methods for achieving certain tasks.
Correlated Statements. The recommended code adds statements that do not affect the original functionality of the query code, but rather suggests related statements that commonly appear alongside the query code. In Example C in Table 1, the original code moves the cursor to the end of the text in an editable text area, while the recommended code also configures the Android Support Action Bar to show the home button and hide the activity title in order to create a more focused view. These statements are not directly related to the text view, but are common in real-world code.
Unclustered Recommendations. In rare cases, the query code snippet matches only method bodies that are mostly different from each other. This results in clusters of size 1. In these cases, Aroma performs no intersection and recommends the full method bodies without any pruning.

The number of recommended code snippets in each category is listed in Table 3. Recommendations that belong to multiple categories were counted once for each category. We believe the first four categories can all be useful to programmers in different ways, whereas the unclustered recommendations may not be. For 59 of the 64 query code snippets (92%), Aroma generated at least one useful recommended snippet that falls into the first four categories.
Table 3. Categories of Aroma code recommendations
Configuring Objects            17
Error Checking and Handling    14
Post-processing                16
Correlated Statements          21
Unclustered Recommendations     5
Pattern-oriented code completion tools [Mover et al. 2018; Nguyen et al. 2012, 2009] could also be used for code recommendation. For example, GraPacc [Nguyen et al. 2012] proposed using mined API usage patterns for code completion. To compare GraPacc's code recommendation capabilities to Aroma's, we took the dataset of 15 Android API usage patterns manually curated from Stack Overflow posts and Android documentation by the authors of BigGroum [Mover et al. 2018]. We used BigGroum's dataset because this tool extends the pattern-mining tool Groum to scale to large corpora with over 1000 repositories. While there are more recent ML-based code completion tools, they focus on completing the next token or predicting the correct API method to invoke, which is not directly comparable to Aroma. Among the 15 snippets in this dataset, 11 were found in the BigGroum mining results. Therefore, if GraPacc were instantiated on the patterns mined by BigGroum, 11 of the 15 patterns could be recommended by GraPacc.

To evaluate Aroma, we followed the same methodology as in Section 4.2 to create a partial query snippet from each of the 15 full patterns, and checked whether any of the Aroma recommended code snippets contained the full pattern. For 14 of the 15 patterns, Aroma recommended code containing the original usage pattern, i.e. these are exact recommendations as defined in Section 4.2.1. An advantage of Aroma is that it can recommend code snippets that do not correspond to any pattern previously mined by BigGroum. Moreover, Aroma can recommend code which may not contain any API usage.
One of the most important and novel phases of Aroma's code recommendation algorithm is phase II: prune and rerank, which produces the reranked search results. The purpose of this phase is to rank the search results from phase I (i.e., the light-weight search phase) so that any method containing most parts of the query code is ranked higher than a method body containing a smaller part of the query code. Therefore, if a method contains the entire query code snippet, it should be ranked at the top of the reranked search result list. However, in rare cases this property of Aroma may not hold, for two reasons: 1) Aroma's pruning algorithm is greedy and approximate for efficiency reasons, and 2) the kinds of features that we extract may not be sufficient.

To evaluate the recall of the prune and rerank phase, we created a micro-benchmark dataset by extracting partial query code snippets from existing method bodies in the corpus. On each of these query snippets, Aroma should rank the original method body as number 1 in the reranked search result list, or the original method body should be 100% similar to the first code snippet in the ranked results. We created two kinds of query code snippets for this micro-benchmark:

• Contiguous code snippets. We randomly sampled 1000 method bodies with at least 12 lines of code. From each method body we take the first 5 lines to form a partial query code snippet.

• Non-contiguous code snippets. We again randomly sampled 1000 method bodies with at least 12 lines of code. From each method body we randomly sample 5 lines to form a partial query code snippet.

We first evaluated Aroma's search recall on this dataset. We employed statistical bootstrapping to minimize sampling bias from the dataset. Then, we compared it with alternative setups using clone detectors and conventional search techniques. The results are reported in Table 4.
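The construction of these two kinds of queries can be sketched as follows; `QuerySampler`, `methodLines`, and the parameter names are illustrative, not taken from the Aroma implementation:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class QuerySampler {
    // Contiguous query: the first k lines of a method body.
    static List<String> contiguousQuery(List<String> methodLines, int k) {
        return new ArrayList<>(methodLines.subList(0, k));
    }

    // Non-contiguous query: k randomly chosen lines, kept in their original order.
    static List<String> nonContiguousQuery(List<String> methodLines, int k, Random rng) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < methodLines.size(); i++) indices.add(i);
        Collections.shuffle(indices, rng);
        List<Integer> chosen = new ArrayList<>(indices.subList(0, k));
        Collections.sort(chosen);  // preserve the source order of the sampled lines
        List<String> query = new ArrayList<>();
        for (int i : chosen) query.add(methodLines.get(i));
        return query;
    }
}
```

In both schemes the original enclosing method is known, which is what makes the recall measurement below possible.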
Recall@n is defined as the percentage of query code snippets for which the original method body is found in the top n methods in the reranked search result list. In addition to Recall@1, we considered
Table 4. Comparison of recall between a clone detector, conventional search techniques, and Aroma
                     Contiguous               Non-contiguous
                     Recall@1   Recall@100    Recall@1   Recall@100
SCC                  (12.2%)                  (7.7%)
Keywords Search      78.3%      96.9%         93.0%      99.9%
Features Search      78.3%      96.8%         88.1%      98.6%
Aroma                99.1%      100%          98.3%      100%
Recall@100 because the first 100 methods in the reranked list are used in the clustering phase to create recommended code.

The result in the last row of Table 4 shows that Aroma is always able to retrieve the original method in the top 100 methods in the reranked search result list. For 99.1% of contiguous code queries and for 98.3% of non-contiguous query code snippets, Aroma was able to retrieve the original method as the top-ranked result in the reranked search result list.

Listing 8 demonstrates a rare case where the original method was not retrieved as the top result. Since Aroma's pruning algorithm is greedy (Section 3.3.2), it erroneously decided to pick the statement at line 7, because that statement contains many features which overlap with those of the query code, despite the absence of that statement from the query code. This results in an imperfect similarity score of 0.984. Since there are other code snippets with similar structures which achieve a perfect similarity score of 1.0, the original method was not retrieved at rank 1. Fortunately, this scenario happens rarely enough that it does not affect overall recall.

int result = 0;
int cur;
int count = 0;
do {
    cur = in.readByte() & 0xff;
    result |= (cur & 0x7f) << (count * 7);
    count++;
} while (((cur & 0x80) == 0x80) && count < 5);
if ((cur & 0x80) == 0x80) {
    throw new DexException("invalid LEB128 sequence");
}
return result;

Listing 8. A method body in the GitHub corpus. Lines 1–5 and 8 were used as the query. The pruning algorithm selected the wrong statement (line 7), which does not belong to the query, resulting in an imperfect similarity score of 0.984.
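Recall@n as defined above can be computed directly from the ranked result lists; the sketch below assumes each query records the id of the method it was extracted from (all names are illustrative, not from the Aroma implementation):

```java
import java.util.List;

public class RecallAtN {
    // rankedResults.get(q) is the ranked list of method ids returned for query q;
    // expected.get(q) is the id of the method that query q was extracted from.
    // Returns recall as a percentage.
    static double recallAtN(List<List<String>> rankedResults, List<String> expected, int n) {
        int hits = 0;
        for (int q = 0; q < expected.size(); q++) {
            List<String> results = rankedResults.get(q);
            int limit = Math.min(n, results.size());
            if (results.subList(0, limit).contains(expected.get(q))) hits++;
        }
        return 100.0 * hits / expected.size();
    }
}
```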
We employed the statistical bootstrapping method and randomly sampled the dataset 30 times, each time from the entire code corpus (i.e., with replacement), in order to test the robustness of our approach and to reduce sampling bias. We ran the recall test on each of the 30 samples, and calculated a bootstrapped confidence interval as θ̂ ± z_{α/2} · σ̂/√B, where θ̂ is the average recall rate, σ̂ is the standard deviation of the recall rates, α is the confidence level set to 0.05, and B is the number of resamplings, 30. For contiguous code snippets, the bootstrapped confidence interval is 99.x% ± 0.25% for Recall@1, and 99.x% ± 0.03% for Recall@100. For non-contiguous code snippets, the bootstrapped confidence interval is 98.x% ± 0.49% for Recall@1, and 99.x% ± 0.x% for Recall@100. These confidence intervals suggest that Aroma is able to robustly recall the original method bodies given partial code snippets.

(Listing 8 is adapted from https://github.com/iqiyi/dexSplitter/blob/master/extra/hack_dx/src/com/android/dex/Leb128.java.)
Aroma's search and pruning phases are somewhat related to clone detection and conventional code search. In principle, Aroma could use a clone detector or a conventional code search technique to first retrieve a list of methods that contain the query code snippet, and then cluster and intersect the methods to get recommendations. We tested these alternative setups for search recall on the same micro-benchmark dataset.
SourcererCC [Sajnani et al. 2016] is a state-of-the-art clone detector that supports Type-3 clone detection. We wanted to compare Aroma with SourcererCC to examine whether a current-generation clone detector can be used as the light-weight search phase in Aroma. We instantiated SourcererCC on the same corpus indexed by Aroma. We then used SourcererCC to find clones of the same 1000 contiguous and non-contiguous queries in the micro-benchmark suite. SourcererCC retrieved all similar methods above a certain similarity threshold, which is 0.7 by default. However, it does not provide a similarity score between two code snippets, so we were unable to rank the retrieved results and report recall at a specific ranking. We could modify SourcererCC to return the similarity scores, but we do not expect the results to change.

SourcererCC's recall was 12.2% and 7.7% for contiguous and non-contiguous code queries, respectively. SourcererCC indexes at method-level granularity, and only returns methods whose entire body matches the query code. We also found that in many cases SourcererCC found code snippets shorter than the query snippet. While these are Type-3 clones by definition, they are not useful for generating code recommendations in Aroma. Extending SourcererCC to return the methods enclosing the found clone snippets would not work, because it does not consider the methods enclosing the target snippets as "clones" in the first place. We worked closely with a member of the SourcererCC team, and found that making SourcererCC find all occurrences of an arbitrary code snippet, contiguous or non-contiguous, would require significant reengineering. Therefore, we conclude that current-generation clone detectors may not suit Aroma's requirements for light-weight search.
Features search. We implemented a conventional code search technique using classic TF-IDF [Salton and McGill 1986]. Specifically, instead of creating a binary vector in the featurization stage, we created a normalized TF-IDF vector. We then created the sparse index matrix by combining the sparse vectors for every method body. The (i, j)-th entry in the matrix is defined as:

tfidf(i, j) = (1 + log tf(i, j)) · log(J / df(i))

where tf(i, j) is the count of occurrences of feature i in method j, and df(i) is the number of methods in which feature i exists. J is the total number of methods. During retrieval, we created a normalized TF-IDF sparse vector from the query code snippet, and then took its dot product with the feature matrix. Since all vectors are normalized, the result contains the cosine similarity between the feature vectors of the query and of every method. We then returned the list of methods ranked by their cosine similarities.

Keywords search. We implemented another conventional code search technique by simply treating a method body as a bag of words and using the standard TF-IDF technique for retrieval. To do so, we extracted words instead of structural features from each token, and used the same vectorization technique as in Section 5.2.2.
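The TF-IDF weighting and cosine retrieval described above can be sketched as follows, using a sparse map from feature id to weight (illustrative names, not the actual implementation):

```java
import java.util.HashMap;
import java.util.Map;

public class TfIdf {
    // tfidf(i, j) = (1 + log tf(i, j)) * log(J / df(i)), then L2-normalized per method.
    static Map<Integer, Double> tfidfVector(Map<Integer, Integer> termCounts,
                                            Map<Integer, Integer> docFreq, int totalDocs) {
        Map<Integer, Double> vec = new HashMap<>();
        for (Map.Entry<Integer, Integer> e : termCounts.entrySet()) {
            double tf = 1.0 + Math.log(e.getValue());
            double idf = Math.log((double) totalDocs / docFreq.get(e.getKey()));
            vec.put(e.getKey(), tf * idf);
        }
        double norm = 0;
        for (double v : vec.values()) norm += v * v;
        norm = Math.sqrt(norm);
        if (norm > 0) {
            for (Map.Entry<Integer, Double> e : vec.entrySet()) e.setValue(e.getValue() / norm);
        }
        return vec;
    }

    // Cosine similarity of two L2-normalized sparse vectors is just their dot product.
    static double cosine(Map<Integer, Double> a, Map<Integer, Double> b) {
        double dot = 0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        }
        return dot;
    }
}
```

A feature that occurs in every method gets idf = log(1) = 0, so it contributes nothing to the similarity, which is the intended effect of the df term.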
As shown in Table 4, the recall rates of both conventional search techniques are considerably lower than Aroma's. We observed that in many cases, though the original method was present in the top 100 results, it was not the top result, because other methods had higher similarity scores due to more overlapping features or keywords. Without pruning, there is no way to determine how well a method contains the query code snippet. This experiment shows that pruning is essential for creating a precise ranked list of search results.
For deployment, we have implemented Aroma for three additional languages: Hack, JavaScript, and Python. One advantage of Aroma is its language-agnostic nature: accommodating a new language requires only implementing a parser that parses code from the target language into a simplified parse tree (as defined in Section 3). The rest of the algorithm, including pruning, clustering, and intersecting, works on the generic form of simplified parse trees. This makes Aroma suitable for real-world deployment on a codebase that consists of many different programming languages. The code recommendations for these languages are similar to those for Java, as shown in Table 1. We have also conducted the same recall experiment as described in Section 5. This ensures that Aroma instantiations on different languages all have high retrieval recall rates and thus are capable of creating code recommendations pertinent to a query. The results are shown in Table 5.
Table 5. Aroma recall performance on different languages
                      Contiguous               Non-contiguous
                      Recall@1   Recall@100    Recall@1   Recall@100
Aroma for Hack        98.5%      100%          98.3%      99.9%
Aroma for JavaScript  93.9%      99.6%         not applicable
Aroma for Python      97.5%      99.4%         not applicable

The recall rates are on par with the original Aroma version on the open-source Java corpus. Non-contiguous test samples were not generated for JavaScript or Python for practical reasons: JavaScript code is often embedded with HTML tags, and Python code structure depends on indentation levels. Both language features made generating non-contiguous code queries that resemble real-world code queries particularly difficult. Nevertheless, the high recall rates suggest that Aroma's algorithm works well across different languages, and that Aroma is capable of creating useful code recommendations for each language.

We have implemented Aroma as an IDE plugin for all four languages (Hack, Java, JavaScript, and Python). A screenshot of the integrated development environment is shown in Figure 5. In our setup, we deployed Aroma on a dedicated set of servers to respond to queries from all developers. Compared to a distributed setup, where Aroma runs on individuals' laptops, this setup allowed for easier control and delivery of new search models. Our search server has 24 cores, and on average took 1.6 seconds end-to-end to create a code recommendation. However, a 24-core server was not necessary to achieve reasonable response time: we reran our experiments in Section 4.2 on a 4-core desktop machine, and the average response time was 2.9 seconds.

The indexing stage (i.e., the creation of the feature vectors, as described at the beginning of Section 3.3) was also designed for scalability. In our setup, we rebuilt the feature vectors for all methods every day. The size of the codebase was comparable to the evaluation dataset. On a 24-core server,
Fig. 5. Aroma code recommendation plugin in an IDE. Recommended code snippets are shown in the side pane for code selected in the main editor.

building the feature vectors took 20 minutes on average. If Aroma were deployed on a larger-scale codebase, it would be possible to implement incremental indexing for only the changed files, without rebuilding the entire feature matrix.
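Incremental indexing of the kind suggested above could be sketched as follows; `IncrementalIndexer` and its methods are hypothetical, assuming methods in changed files are re-featurized and swapped into the index:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class IncrementalIndexer {
    // Per-method binary feature sets, keyed by method id.
    private final Map<String, Set<Integer>> methodFeatures = new HashMap<>();

    // Full rebuild: re-featurize every method (what a daily job would do).
    void rebuild(Map<String, Set<Integer>> allMethods) {
        methodFeatures.clear();
        methodFeatures.putAll(allMethods);
    }

    // Incremental update: replace entries only for methods in changed files,
    // and drop entries for methods that no longer exist.
    void update(Map<String, Set<Integer>> changedMethods, Set<String> deletedMethods) {
        for (String id : deletedMethods) methodFeatures.remove(id);
        methodFeatures.putAll(changedMethods);
    }

    int size() { return methodFeatures.size(); }
}
```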
We asked 12 Hack programmers to each complete 4 simple programming tasks. For 2 randomly selected tasks, they were allowed to use Aroma; for the other 2, they were not. After each programmer completed the tasks, we gave them a brief follow-up survey. For each task, we provided a description of the functionality to achieve, and some incomplete code to start with. The participants were asked to write between 4 and 10 lines of code to implement the desired functionality.

Measuring the time taken to complete these tasks with versus without Aroma was also initially of interest to us; however, we found that the time taken varied greatly, depending mostly on the experience level of the participant and how familiar they were with the particular frameworks being used in the tasks. It had little correlation with the choice of tools. We therefore focused on getting initial feedback on developers' experiences using Aroma. The survey began with 2 yes/no questions regarding Aroma's usefulness:

• Did you find Aroma useful in completing the programming tasks?

• Did you wish Aroma were available in the programming tasks where you were not permitted to use it?

Participants also gave free-form feedback, from which several positive themes emerged.

• Aroma is convenient for discovering usage patterns. Four participants stated they found Aroma "convenient" and that they were able to "quickly find answers". One participant stated that "being able to see general patterns is nice." Another participant commented that when they did not have Aroma, they "used BigGrep Code Search to achieve the same goal, but it took longer."

• Aroma is more capable. One participant commented that Aroma is "more capable at finding results" with multi-line queries, or queries that do not have exact string matches.

• Aroma is as useful as documentation. Two participants commented that the Aroma code recommendations helped them write correct code in the same way as manually curated examples in documentation. One said "it would be a nice backup if there were no documentation," and another said "otherwise I would have to read wikis; that would be more tedious."

We also found some common reasons why participants did not feel Aroma added any additional value.

• Familiarity with the libraries. Three participants said they did not find Aroma code recommendations to be useful because they either "had just been working on something like that," or "already knew what to call." In these cases they completed the code without the help of Aroma.

• Simple code search sufficed. One participant claimed they could get the same amount of information using BigGrep code search. When asked whether clustered results provided additional value, they answered that "the code search results were sufficient for completing the tasks at hand." One other participant stated they "got lucky to find a relevant example using BigGrep, but that might not always be the case."

Based on this initial developer survey, we found that sentiment towards Aroma is generally positive, as the participants found Aroma useful for conveniently identifying common coding patterns and integrating them into their own code.
Code Search Engines.
Code-to-code search tools like FaCoY [Kim et al. 2018] and Krugle [Krugler 2013] take a code snippet as a query and retrieve relevant code snippets from the corpus. FaCoY aims to find semantically similar results for input queries. Given a code query, it first searches a Stack Overflow dataset to find natural language descriptions of the code, and then finds related posts and similar code. While these code-to-code search tools retrieve similar code at different syntactic and semantic levels, they do not attempt to create concise recommendations from their search results. Further, most of these search engines cannot be instantiated on our code corpus, so we could not experimentally compare Aroma with them. For instance, FaCoY only provides a VM-based demo that is instantiated on a fixed corpus which is not available publicly, and we were unable to instantiate FaCoY on our corpus for a direct comparison. Most other open-source code search tools, including Krugle and searchcode.com, suffer from the same problem. Instead, we compared Aroma with two conventional code search techniques based on featurization and TF-IDF in Section 5.2, and found that Aroma's pruning-based search technique in Phase II outperforms both techniques.

Many efforts have been made to improve keyword-based code search [Bajracharya et al. 2006; Chan et al. 2012; Martie et al. 2015; McMillan et al. 2012; Sachdev et al. 2018]. CodeGenie [Lemos et al. 2007] uses test cases to search for and reuse source code; SNIFF [Chatterjee et al. 2009] works by inlining API documentation in its code corpus. SNIFF also intersects the search results to provide recommendations, but only targets natural language queries. The clustering algorithm in SNIFF is limited and does not take structure into account: two statements are considered similar if they are syntactically similar after replacing variable names with types, and the intersection of two code snippets is the set of statements that appear in both snippets. Due to this strict definition of similarity, SNIFF cannot find large clusters that contain approximately similar code snippets. Also, SNIFF uses the longest common subsequence algorithm, whose limitations we discuss in Section 3.3.2. MAPO [Zhong et al. 2009] recommends code examples by mining and indexing associated API usage patterns. Portfolio [McMillan et al. 2011] retrieves functions and visualizes their usage chains. CodeHow [Lv et al. 2015] augments the query with API calls retrieved from documentation to improve search results. CoCaBu [Sirres et al. 2018] augments the query with structural code entities. A developer survey [Sadowski et al. 2015] reports that the top reason for code search is to find code examples or related APIs, and tools have been created for this need. While these code search techniques focus on creating code examples based on keyword queries, they do not support code-to-code search and recommendation.

(BigGrep refers to a version of grep that searches the entire codebase.)
Clone Detectors.
Clone detectors are designed to detect syntactically identical or highly similar code. SourcererCC [Sajnani et al. 2016] is a token-based clone detector targeting Type 1, 2, and 3 clones. Compared with other clone detectors that also support Type 3 clones, including NiCad [Cordy and Roy 2011], Deckard [Jiang et al. 2007], and CCFinder [Kamiya et al. 2002], SourcererCC has high precision and recall and also scales to large projects. One may repurpose a clone detector to find similar code, but since it is designed for finding highly similar code rather than code that contains the query code snippet (as demonstrated in Section 5.2), its results are not suitable for code recommendation.

Recent clone detection techniques have explored other research directions, from finding semantically similar clones [Kim et al. 2011, 2018; Saini et al. 2018; White et al. 2016], to finding gapped clones [Ueda et al. 2002] and gapped clones with a large number of edits (large-gapped clones) [Wang et al. 2018]. These techniques may excel at finding a particular type of clone, but they sacrifice precision and recall for Type 1 to 3 clones.
Pattern Mining and Code Completion.
Code completion can be achieved by different approaches, from extracting the structural context of the code to mining recent editing histories [Bruch et al. 2009; Hill and Rideout 2004; Holmes and Murphy 2005; Robbes and Lanza 2008]. GraPacc [Nguyen et al. 2012] achieves pattern-oriented code completion by first mining graph-represented coding patterns using GrouMiner [Nguyen et al. 2009], then searching the input code to produce code completion suggestions. More recent work [Nguyen et al. 2016a, 2018, 2016b] improves code completion by predicting the next API call given a code change. Pattern-oriented code completion requires mining usage patterns ahead of time, and cannot recommend any code outside of the mined patterns, while Aroma does not require pattern mining and recommends snippets on the fly.
API Documentation Tools.
More techniques exist for improving API documentation and examples. The work by Buse and Weimer [2012] synthesizes API usage examples through data flow analysis, clustering, and pattern abstraction. The work by Subramanian et al. [2014] augments API documentation with up-to-date source code examples. MUSE [Moreno et al. 2015] generates code examples for a specific method using static slicing. SWIM [Raghothaman et al. 2016] synthesizes structured call sequences based on a natural language query. The work by Treude and Robillard [2016] augments API documentation with insights from Stack Overflow. These tools are limited to API usages and do not generalize to structured code queries.
We presented Aroma, a new tool for code recommendation via structural code search. Aroma works by first indexing a large code corpus. It takes a code snippet as input, assembles a list of method bodies from the corpus that contain the snippet, and clusters and intersects those method bodies to offer several succinct code recommendations.

To evaluate Aroma, we indexed a code corpus with over 2 million Java methods, and performed Aroma searches with code snippets chosen from the 500 most popular Stack Overflow questions with the android tag. We observed that Aroma provided useful recommendations for a majority of these snippets. Moreover, when we used half of a snippet as the query, Aroma exactly recommended the second half of the code snippet in 37 out of 50 cases.

Further, we performed a large-scale automated evaluation to test the accuracy of Aroma search results. We extracted partial code snippets from existing method bodies in the corpus and performed Aroma searches with those snippets as the queries. We found that for 99.1% of contiguous queries and 98.3% of non-contiguous queries, Aroma retrieved the original method as the top-ranked result. We also showed that Aroma's search and pruning algorithms are decidedly better at finding methods containing a code snippet than conventional code search techniques.

Finally, we conducted a case study to investigate how programmers interact with Aroma, wherein participants completed two short programming tasks with Aroma and two without. We found that many participants successfully used Aroma to identify common patterns for libraries they were unfamiliar with. In a follow-up survey, a majority of participants stated that they found Aroma useful for completing the tasks.

Our ongoing work shows that Aroma has the potential to be a powerful developer tool. Though new code is frequently similar to existing code in a repository, currently available code search tools do not leverage this similar code to help programmers add to or improve their code.
Aroma addresses this problem by identifying common additions or modifications to an input code snippet and presenting them to the programmer in a concise, convenient way.
REFERENCES
Sushil Bajracharya, Trung Ngo, Erik Linstead, Yimeng Dou, Paul Rigor, Pierre Baldi, and Cristina Lopes. 2006. Sourcerer: A Search Engine for Open Source Code Supporting Structure-based Search. In Companion to the 21st ACM SIGPLAN Symposium on Object-oriented Programming Systems, Languages, and Applications (OOPSLA '06). ACM, New York, NY, USA, 681–682. https://doi.org/10.1145/1176617.1176671

Marcel Bruch, Martin Monperrus, and Mira Mezini. 2009. Learning from Examples to Improve Code Completion Systems. In Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on The Foundations of Software Engineering (ESEC/FSE '09). ACM, New York, NY, USA, 213–222. https://doi.org/10.1145/1595696.1595728

Raymond P. L. Buse and Westley Weimer. 2012. Synthesizing API Usage Examples. In Proceedings of the 34th International Conference on Software Engineering (ICSE '12). IEEE Press, Piscataway, NJ, USA, 782–792. http://dl.acm.org/citation.cfm?id=2337223.2337316

Wing-Kwan Chan, Hong Cheng, and David Lo. 2012. Searching Connected API Subgraph via Text Phrases. In Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering (FSE '12). ACM, New York, NY, USA, Article 10, 11 pages. https://doi.org/10.1145/2393596.2393606

Shaunak Chatterjee, Sudeep Juvekar, and Koushik Sen. 2009. SNIFF: A Search Engine for Java Using Free-Form Queries. In Fundamental Approaches to Software Engineering, Marsha Chechik and Martin Wirsing (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 385–400.

J. R. Cordy and C. K. Roy. 2011. The NiCad Clone Detector. In Proceedings of the 19th IEEE International Conference on Program Comprehension (ICPC 2011). 219–220. https://doi.org/10.1109/ICPC.2011.26

R. Hill and J. Rideout. 2004. Automatic Method Completion. In Proceedings of the 19th International Conference on Automated Software Engineering (ASE 2004).

R. Holmes and G. C. Murphy. 2005. Using Structural Context to Recommend Source Code Examples. In Proceedings of the 27th International Conference on Software Engineering (ICSE 2005).

L. Jiang, G. Misherghi, Z. Su, and S. Glondu. 2007. DECKARD: Scalable and Accurate Tree-based Detection of Code Clones. In Proceedings of the 29th International Conference on Software Engineering (ICSE '07). 96–105. https://doi.org/10.1109/ICSE.2007.30

T. Kamiya, S. Kusumoto, and K. Inoue. 2002. CCFinder: A Multilinguistic Token-based Code Clone Detection System for Large Scale Source Code. IEEE Transactions on Software Engineering 28, 7 (July 2002), 654–670.

H. Kim, Y. Jung, S. Kim, and K. Yi. 2011. MeCC: Memory Comparison-based Clone Detector. In Proceedings of the 33rd International Conference on Software Engineering (ICSE '11). 301–310. https://doi.org/10.1145/1985793.1985835

Kisub Kim, Dongsun Kim, Tegawendé F. Bissyandé, Eunjong Choi, Li Li, Jacques Klein, and Yves Le Traon. 2018. FaCoY: A Code-to-code Search Engine. In Proceedings of the 40th International Conference on Software Engineering (ICSE '18). ACM, New York, NY, USA, 946–957. https://doi.org/10.1145/3180155.3180187

Ken Krugler. 2013. Krugle Code Search Architecture. Springer New York, New York, NY, 103–120. https://doi.org/10.1007/978-1-4614-6596-6_6

Otávio Augusto Lazzarini Lemos, Sushil Krishna Bajracharya, Joel Ossher, Ricardo Santos Morla, Paulo Cesar Masiero, Pierre Baldi, and Cristina Videira Lopes. 2007. CodeGenie: Using Test-cases to Search and Reuse Source Code. In Proceedings of the Twenty-second IEEE/ACM International Conference on Automated Software Engineering (ASE '07). ACM, New York, NY, USA, 525–526. https://doi.org/10.1145/1321631.1321726

Cristina V. Lopes, Petr Maj, Pedro Martins, Vaibhav Saini, Di Yang, Jakub Zitny, Hitesh Sajnani, and Jan Vitek. 2017. DéjàVu: A Map of Code Duplicates on GitHub. Proc. ACM Program. Lang. 1, OOPSLA, Article 84 (Oct. 2017), 28 pages. https://doi.org/10.1145/3133908

Fei Lv, Hongyu Zhang, Jian-guang Lou, Shaowei Wang, Dongmei Zhang, and Jianjun Zhao. 2015. CodeHow: Effective Code Search Based on API Understanding and Extended Boolean Model (E). In Proceedings of the 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE '15). IEEE Computer Society, Washington, DC, USA, 260–270. https://doi.org/10.1109/ASE.2015.42

L. Martie, T. D. LaToza, and A. v. d. Hoek. 2015. CodeExchange: Supporting Reformulation of Internet-Scale Code Queries in Context (T). In Proceedings of the 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE 2015). 24–35. https://doi.org/10.1109/ASE.2015.51

C. McMillan, M. Grechanik, D. Poshyvanyk, C. Fu, and Q. Xie. 2012. Exemplar: A Source Code Search Engine for Finding Highly Relevant Applications. IEEE Transactions on Software Engineering 38, 5 (Sept 2012), 1069–1087. https://doi.org/10.1109/TSE.2011.84

Collin McMillan, Mark Grechanik, Denys Poshyvanyk, Qing Xie, and Chen Fu. 2011. Portfolio: Finding Relevant Functions and Their Usage. In Proceedings of the 33rd International Conference on Software Engineering (ICSE '11). ACM, New York, NY, USA, 111–120. https://doi.org/10.1145/1985793.1985809

Laura Moreno, Gabriele Bavota, Massimiliano Di Penta, Rocco Oliveto, and Andrian Marcus. 2015. How Can I Use This Method?. In Proceedings of the 37th International Conference on Software Engineering - Volume 1 (ICSE '15). IEEE Press, Piscataway, NJ, USA, 880–890. http://dl.acm.org/citation.cfm?id=2818754.2818860

S. Mover, S. Sankaranarayanan, R. B. Olsen, and B. E. Chang. 2018. Mining Framework Usage Graphs from App Corpora. In Proceedings of the 25th IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER 2018). 277–289. https://doi.org/10.1109/SANER.2018.8330216

Anh Tuan Nguyen, Michael Hilton, Mihai Codoban, Hoan Anh Nguyen, Lily Mast, Eli Rademacher, Tien N. Nguyen, and Danny Dig. 2016a. API Code Recommendation Using Statistical Learning from Fine-grained Changes. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2016). ACM, New York, NY, USA, 511–522. https://doi.org/10.1145/2950290.2950333

Anh Tuan Nguyen, Tung Thanh Nguyen, Hoan Anh Nguyen, Ahmed Tamrawi, Hung Viet Nguyen, Jafar Al-Kofahi, and Tien N. Nguyen. 2012. Graph-based Pattern-oriented, Context-sensitive Source Code Completion. In Proceedings of the 34th International Conference on Software Engineering (ICSE '12). IEEE Press, Piscataway, NJ, USA, 69–79. http://dl.acm.org/citation.cfm?id=2337223.2337232

Thanh Nguyen, Ngoc Tran, Hung Phan, Trong Nguyen, Linh Truong, Anh Tuan Nguyen, Hoan Anh Nguyen, and Tien N. Nguyen. 2018. Complementing Global and Local Contexts in Representing API Descriptions to Improve API Retrieval Tasks. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2018). ACM, New York, NY, USA, 551–562. https://doi.org/10.1145/3236024.3236036

Tung Thanh Nguyen, Hoan Anh Nguyen, Nam H. Pham, Jafar M. Al-Kofahi, and Tien N. Nguyen. 2009. Graph-based Mining of Multiple Object Usage Patterns. In Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on The Foundations of Software Engineering (ESEC/FSE '09). ACM, New York, NY, USA, 383–392. https://doi.org/10.1145/1595696.1595767

Tam The Nguyen, Hung Viet Pham, Phong Minh Vu, and Tung Thanh Nguyen. 2016b. Learning API Usages from Bytecode: A Statistical Approach. In Proceedings of the 38th International Conference on Software Engineering (ICSE '16). ACM, New York, NY, USA, 416–427. https://doi.org/10.1145/2884781.2884873

Terence Parr. 2013. The Definitive ANTLR 4 Reference (2nd ed.). Pragmatic Bookshelf.

M. F. Porter. 1997. An Algorithm for Suffix Stripping. In Readings in Information Retrieval. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 313–316. http://dl.acm.org/citation.cfm?id=275537.275705

Mukund Raghothaman, Yi Wei, and Youssef Hamadi. 2016. SWIM: Synthesizing What I Mean: Code Search and Idiomatic Snippet Synthesis. In Proceedings of the 38th International Conference on Software Engineering (ICSE '16). ACM, New York, NY, USA, 357–367. https://doi.org/10.1145/2884781.2884808

R. Robbes and M. Lanza. 2008. How Program History Can Improve Code Completion. In Proceedings of the 23rd IEEE/ACM International Conference on Automated Software Engineering (ASE 2008). 317–326. https://doi.org/10.1109/ASE.2008.42

Saksham Sachdev, Hongyu Li, Sifei Luan, Seohyun Kim, Koushik Sen, and Satish Chandra. 2018. Retrieval on Source Code: A Neural Code Search. In Proceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages (MAPL 2018). ACM, New York, NY, USA, 31–41. https://doi.org/10.1145/3211346.3211353

Caitlin Sadowski, Kathryn T. Stolee, and Sebastian Elbaum. 2015. How Developers Search for Code: A Case Study. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2015). ACM, New York, NY, USA, 191–201. https://doi.org/10.1145/2786805.2786855

Vaibhav Saini, Farima Farmahinifarahani, Yadong Lu, Pierre Baldi, and Cristina V. Lopes. 2018. Oreo: Detection of Clones in the Twilight Zone. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2018). ACM, New York, NY, USA, 354–365. https://doi.org/10.1145/3236024.3236026

Hitesh Sajnani, Vaibhav Saini, Jeffrey Svajlenko, Chanchal K. Roy, and Cristina V. Lopes. 2016. SourcererCC: Scaling Code Clone Detection to Big-code. In Proceedings of the 38th International Conference on Software Engineering (ICSE '16). ACM, New York, NY, USA, 1157–1168. https://doi.org/10.1145/2884781.2884877

Gerard Salton and Michael J. McGill. 1986. Introduction to Modern Information Retrieval. McGraw-Hill, Inc., New York, NY, USA.

Raphael Sirres, Tegawendé F. Bissyandé, Dongsun Kim, David Lo, Jacques Klein, Kisub Kim, and Yves Le Traon. 2018. Augmenting and Structuring User Queries to Support Efficient Free-form Code Search. Empirical Software Engineering 23, 5 (Oct 2018), 2622–2654. https://doi.org/10.1007/s10664-017-9544-y

Siddharth Subramanian, Laura Inozemtseva, and Reid Holmes. 2014. Live API Documentation. In Proceedings of the 36th International Conference on Software Engineering (ICSE 2014). ACM, New York, NY, USA, 643–652. https://doi.org/10.1145/2568225.2568313

Christoph Treude and Martin P. Robillard. 2016. Augmenting API Documentation with Insights from Stack Overflow. In Proceedings of the 38th International Conference on Software Engineering (ICSE '16). ACM, New York, NY, USA, 392–403. https://doi.org/10.1145/2884781.2884800

Y. Ueda, T. Kamiya, S. Kusumoto, and K. Inoue. 2002. On Detection of Gapped Code Clones Using Gap Locations. In Ninth Asia-Pacific Software Engineering Conference (APSEC 2002).

Pengcheng Wang, Jeffrey Svajlenko, Yanzhao Wu, Yun Xu, and Chanchal K. Roy. 2018. CCAligner: A Token Based Large-Gap Clone Detector. In Proceedings of the 40th International Conference on Software Engineering (ICSE '18). ACM, New York, NY, USA, 1066–1077. https://doi.org/10.1145/3180155.3180179

Martin White, Michele Tufano, Christopher Vendome, and Denys Poshyvanyk. 2016. Deep Learning Code Fragments for Code Clone Detection. In
Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering(ASE 2016) . ACM, New York, NY, USA, 87–98. https://doi.org/10.1145/2970276.2970326Hao Zhong, Tao Xie, Lu Zhang, Jian Pei, and Hong Mei. 2009. MAPO: Mining and Recommending API Usage Patterns. In