Recommending API Function Calls and Code Snippets to Support Software Development

Phuong T. Nguyen, Juri Di Rocco, Claudio Di Sipio, Davide Di Ruscio, and Massimiliano Di Penta

P.T. Nguyen, J. Di Rocco, C. Di Sipio, and D. Di Ruscio are with Università degli Studi dell'Aquila, 67100 L'Aquila, Italy. E-mail: {phuong.nguyen, juri.dirocco, claudio.disipio, davide.diruscio}@univaq.it
M. Di Penta is with Università degli Studi del Sannio, Benevento, Italy. E-mail: [email protected]
Abstract—Software development activity has reached a high degree of complexity, guided by the heterogeneity of the components, data sources, and tasks. The proliferation of open-source software (OSS) repositories has stressed the need to reuse available software artifacts efficiently. To this aim, it is necessary to explore approaches to mine data from software repositories and leverage it to produce helpful recommendations. We designed and implemented FOCUS as a novel approach to provide developers with API calls and source code while they are programming. The system works on the basis of a context-aware collaborative-filtering technique to extract API usages from OSS projects. In this work, we show the suitability of FOCUS for Android programming by evaluating it on a dataset of 2,600 mobile apps. The empirical evaluation results show that our approach outperforms two state-of-the-art API recommenders, UP-Miner and PAM, in terms of prediction accuracy. We also point out that there is no significant relationship between the categories defined for apps in Google Play and their API usages. Finally, we show that participants of a user study positively perceive the API calls and source code recommended by FOCUS as relevant to the current development context.
Index Terms—Recommender Systems, API Calls, Source Code Recommendations, Android Programming.
1 INTRODUCTION
When dealing with certain programming tasks, rather than implementing new systems from scratch, developers often make use of third-party libraries that provide the desired functionalities. Such libraries expose their functionality through Application Programming Interfaces (APIs), which govern the interaction between a client project and its incorporated libraries. To use a library properly, it is necessary to use the correct sequence of API calls, known as API usage patterns. The knowledge needed to manipulate an API can be extracted from various sources: the API source code itself, the official website and documentation, Q&A websites such as StackOverflow, forums and mailing lists, bug trackers, other projects using the same API, etc. However, official documentation often merely reports the API description without providing non-trivial example usages. Besides, querying informal sources such as StackOverflow might become time-consuming and error-prone [37]. Also, API documentation may be ambiguous, incomplete, or erroneous [51], while API examples found on Q&A websites may be of poor quality [24]. In this respect, we come across the following motivating question:

"Which API calls should this piece of code invoke, given that it has already invoked these API calls?"

The problem of recommending API function calls and usage patterns has garnered considerable efforts and attention of the research community in recent years [23], [58]. Several techniques have been developed to automate the extraction of API usage patterns [38], reducing developers' burden when manually searching these sources, and providing them with high-quality code examples. However, these techniques, based on clustering [54], [58] or predictive modeling [11], still suffer from high redundancy and poor run-time performance. Moreover, most of the existing approaches are based on clustering data from code snippets to recommend API usage, which sustains redundancy.
In an attempt to transcend these limitations, we proposed FOCUS [26] as a novel approach to mining open-source software repositories to provide developers with API FunctiOn Calls and USage patterns. We aim to suggest to developers highly relevant API usages that ease the development process. Our tool distinguishes itself from other tools that recommend API usages as it can provide both function calls and real code snippets that match well with the developer's context. FOCUS has been built on the basis of a collaborative-filtering recommender system [8], whose fundamental principle is to recommend to users the items that have been bought by similar users in similar contexts. By considering API methods as products and client code as customers, we reformulate the problem of usage pattern recommendation in terms of a collaborative-filtering recommender system. We modeled the mutual relationships among projects using a tensor and mined API usage from the most similar projects.

Implementing a collaborative-filtering recommender system requires assessing the similarity of two customers, i.e., two projects. Existing approaches assume that any two projects using an API of interest are equally valuable sources of knowledge. On the contrary, we postulate that not all projects are equal when it comes to recommending usage patterns: a project that is highly similar to the project currently being developed should provide higher quality patterns than a highly dissimilar one does. Our recommender system attempts to narrow down the search scope by considering only the projects that are the most similar to the active project. Thus, methods that are typically used conjointly by similar projects in similar contexts tend to be recommended first.

The first prototype of FOCUS [26] has been successfully realized and integrated into the Eclipse IDE, and it is available for download. An empirical evaluation has been conducted on a large number of Java projects extracted from GitHub and the Maven Central repository to study FOCUS's performance, and to compare it with a state-of-the-art tool for API usage pattern mining, i.e., PAM [11]. We simulated different stages of a development process by removing portions of client code and assessing how FOCUS can recommend snippets with API invocations to complete the code being developed. The experiments showed that FOCUS outperforms PAM with regard to success rate, accuracy, and execution time.

In this paper, we further extend the evaluation to study if FOCUS can assist mobile developers in finding relevant API function calls as well as real code snippets by means of an Android dataset. In this sense, the main contributions of our work are summarized as follows: (i) We propose FOCUS as a practical solution to API recommendation employing a context-aware collaborative-filtering recommender system;
(ii) Through a comprehensive evaluation on a dataset collected by mining GitHub and Google Play, we show that FOCUS is applicable to Android programming, outperforming two state-of-the-art API recommenders; (iii) We investigate how the calibration of the various FOCUS parameters can influence its performance; (iv) We show that the system is capable of achieving good performance regardless of the availability of an extensive training set within the same app category; (v) By means of a clone detection evaluation and a user study, we show that FOCUS can be used to recommend real code snippets relevant for the program artifact being developed; (vi) Finally, we make available both the FOCUS tool and the datasets related to our paper to allow for future replication [27].

The paper is organized as follows. Section 2 introduces a motivating example and background notions. Section 3 brings in FOCUS, our proposed solution to API recommendation. The materials and methods used to evaluate the approach are presented in Section 4, while Section 5 analyzes the key results. The lessons learned and the threats to validity are discussed in Section 6. We present related work and conclude the paper in Section 7 and Section 8, respectively.
2 BACKGROUND
We describe a motivating example to introduce the problem addressed by FOCUS in Section 2.1. Afterwards, Section 2.2 introduces the main notions underpinning our approach by reviewing the previous work by Schafer et al. [45] and Chen [8].
2.1 Motivating Example

The typical setting considered in the paper is shown in Fig. 1(a): a programmer is implementing some methods to satisfy the requirements of the system being developed. The development is at an early stage, and the developer has already used some methods of the chosen API to realize the required functionality. However, she is not sure how to proceed from this point. Under such circumstances, the programmer may browse different sources of information, including Stack Overflow, video tutorials, official API documentation, etc.

Figure 1(b) depicts the final version of the snippet in Fig. 1(a). In the framed code, the findBoekrekeningen method queries the available entities and retrieves those of type Boekrekening. To this end, the Criteria API library is used, as it provides useful interfaces for querying system entities according to the defined criteria. FOCUS has been conceptualized to do exactly the same thing: it is able to suggest to developers recommendations consisting of a list of API method calls that should be used next. Furthermore, it also recommends real code snippets that can be used as a reference to support developers in finalizing the method definition under development.

1. https://mdegroup.github.io/FOCUS-Appendix/

2.2 Context-Aware Collaborative Filtering

"Collaborative Filtering (CF) is the process of filtering or evaluating items through the opinions of other people" [45]. In a CF system, a user who buys or uses an item attributes a rating to it based on her experience and perceived value. Therefore, a rating is the association of a user and an item through a value in a given unit (usually in scalar, binary, or unary form). The set of all ratings of a given user is also known as a user profile [8]. Moreover, the set of all ratings given in a system by existing users can be represented in a so-called rating matrix, where a row represents a user and a column represents an item.

The expected outcome of a CF system is a set of predicted ratings (aka. recommendations) for a specific user and a subset of items [45]. The recommender system considers the users most similar (aka. neighbors) to the active user to suggest new ratings. A similarity function sim_usr(u_a, u_j) computes the weight of the active user profile u_a against each of the user profiles u_j in the system. Finally, to suggest a recommendation for an item i based on this subset of similar profiles, the CF system computes a weighted average r(u_a, i) of the existing ratings, where r(u_a, i) varies with the value of sim_usr(u_a, u_j) obtained for all neighbors [8], [45].

Context-aware CF systems compute recommendations based not only on neighbors' profiles but also on the context where the recommendation is demanded. Each rating is associated with a context [8]. Therefore, for a tuple C modeling different contexts, a context similarity metric sim_ctx(c_a, c_i), for c_a, c_i ∈ C, is computed to identify relevant ratings according to a given context. Then, the weighted average is reformulated as r(u_a, i, c_a) [8].
3 PROPOSED APPROACH

To tackle the problem of recommending API function calls and usage patterns, we leverage the wisdom of the crowd and existing recommender system techniques. In particular, we hypothesize that API calls and usages can be mined from existing codebases, prioritizing the projects that are similar to the one from where the recommendation is demanded. We start with a definition of the main components of our approach in Section 3.1. Afterwards, Section 3.2 presents in detail the conceived architecture.

2. https://docs.oracle.com/javaee/6/tutorial/doc/gjivm.html
Fig. 1. Motivating example.

3.1 Definitions

A software project is a standalone source code unit that performs a set of tasks. Furthermore, an API is like a black-box, i.e., an interface that abstracts the piece of functionality offered by a project by hiding its implementation details. This interface is meant to support reuse and modularity [30], [37]. An API X built in an object-oriented programming language, e.g., the Criteria API in Fig. 1(a), consists of a set T_X of public types, e.g., CriteriaBuilder and CriteriaQuery. Each type in T_X consists of a set of public methods and fields that are available to client projects, e.g., the method createQuery of the type CriteriaQuery.

A method declaration consists of a name, a (possibly empty) list of types of parameters, a return type, and a (possibly empty) body, for example the findBoekrekeningen method in Fig. 1(b). Given a set of declarations D in a project P, an API method invocation i is a call made from a declaration d ∈ D to another declaration m. Similarly, an API field access is an access to a field f ∈ F from a declaration d in P. API method invocations M_I and field accesses F_A in P form the set of API usages U = M_I ∪ F_A. Finally, an API usage pattern (or code snippet) is a sequence (u_1, u_2, ..., u_n), ∀ u_k ∈ U.

For the sake of presentation, in the scope of this paper the following terms are used interchangeably: method declaration vs. declaration, and API vs. invocation. For each declaration, we extract its method name, a list of types of the parameters, and a list of API function calls. In this way, a project is represented as a set of declarations from its constituent classes.

Our tool makes use of a context-aware collaborative-filtering technique to search for invocations from highly relevant projects. This allows us to consider both project and declaration similarities to recommend APIs and code snippets. Following the terminology of recommender systems [8], we treat projects as the enclosing contexts, method declarations as users, and method invocations as items. Intuitively, we recommend a method invocation for a declaration in a given project, which is analogous to recommending an item to a customer in a specific context. For instance, the set of method invocations and the usage pattern (cf. framed code in Fig. 1(b)) recommended for the declaration findBoekrekeningen can be obtained from a set of similar projects and declarations in a codebase. The collaborative aspect of the approach enables us to extract recommendations from the most similar projects, while the context-awareness aspect enables us to narrow down the search space further to similar declarations.

3.2 Architecture

The architecture of FOCUS is depicted in Fig. 2.
To provide its recommendations, FOCUS considers a set of OSS Repositories, which are analyzed by the Code Parser to extract method declarations and invocations. The Project Comparator, a subcomponent of the Similarity Calculator, computes the similarity between the active project and the projects in the corpus; using the data extracted by the Code Parser, the Data Encoder builds the corresponding rating matrices, and the Declaration Comparator computes the similarities between declarations. From the similarity scores, the Recommendation Engine produces its outcomes, either as a ranked list of API calls by means of the API Generator, or as usage patterns using the Code Builder, which are presented to the developer. In the remainder of this section, we present each of these components in greater detail.

Fig. 2. Overview of the FOCUS architecture.
3.2.1 Code Parser

FOCUS depends on Rascal M3 [4] to function. Rascal M3 is an intermediate model obtained by performing static analysis on source code, capturing method declarations and invocations from a set of projects. This model is an extensible and composable algebraic data type that captures both language-agnostic and Java-specific facts in immutable binary relations. These relations represent program information such as existing declarations, method invocations, field accesses, interface implementations, and class extensions, among others [4]. To gather relevant data, Rascal M3 leverages the Eclipse JDT Core Component to build and traverse the abstract syntax trees of the target Java projects.

We consider the data provided by the declarations and methodInvocation relations of the M3 model [4]. Both of them contain a set of pairs ⟨v_1, v_2⟩, where v_1 and v_2 are values representing locations. These locations are uniform resource identifiers that represent artifact identities (aka. logical locations) or physical pointers on the file system to the corresponding artifacts (aka. physical locations). The declarations relation maps the logical location of an artifact (e.g., a method) to its physical location. The methodInvocation relation maps the logical location of a caller to that of a callee.

Listing 1 depicts an excerpt of the M3 model extracted from the code presented in Fig. 1(a). The declarations relation links the logical location of the method findBoekrekeningen to its corresponding physical location in the file system. The methodInvocation relation states that the getCriteriaBuilder method of the EntityManager type is invoked by the findBoekrekeningen method in the current project.

m3.declarations = {
  <|java+method://StandaardBoekrekeningService/findBoekrekeningen|,
   |file://.../StandaardBoekrekeningService.java(501,531,<17,4>,<33,5>)|>, [...]}
m3.methodInvocation = {
  <|java+method://StandaardBoekrekeningService/findBoekrekeningen|,
   |java+method://EntityManager/getCriteriaBuilder|>, [...]}

Listing 1. Excerpt of the M3 model extracted from Fig. 1(a).
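To illustrate the shape of these two relations (see Listing 1), the following Java sketch mimics them with plain maps. This is an illustrative stand-in with our own names, not the actual Rascal API.

import java.util.*;

// Illustrative stand-in for the two M3 relations used by FOCUS (not the Rascal API).
public class M3Relations {
    // declarations: logical location of a method -> its physical location on disk.
    final Map<String, String> declarations = new HashMap<>();
    // methodInvocation: logical location of a caller -> logical locations of its callees.
    final Map<String, Set<String>> methodInvocation = new HashMap<>();

    void addDeclaration(String logical, String physical) {
        declarations.put(logical, physical);
    }

    void addInvocation(String caller, String callee) {
        methodInvocation.computeIfAbsent(caller, k -> new HashSet<>()).add(callee);
    }

    // Example, mirroring Listing 1:
    // m3.addDeclaration("java+method://StandaardBoekrekeningService/findBoekrekeningen",
    //                   "file://.../StandaardBoekrekeningService.java");
    // m3.addInvocation("java+method://StandaardBoekrekeningService/findBoekrekeningen",
    //                  "java+method://EntityManager/getCriteriaBuilder");
}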
3.2.2 Data Encoder

Once all the method declarations and invocations have been parsed with Rascal, FOCUS represents the relationships among them using a rating matrix. Given a project, each row in the matrix corresponds to a declaration, and each column corresponds to an API call. A cell is set to 1 if the declaration in the corresponding row contains the invocation in the corresponding column; otherwise, it is set to 0. In Fig. 3, we show an example of the rating matrix for an explanatory project p_1 with four declarations (d_1, d_2, d_3, d_4) and four invocations (i_1, i_2, i_3, i_4). In practice, a matrix is generally big, containing a large number of methods and invocations.

Fig. 3. Rating matrix for a project with 4 declarations and 4 invocations.
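As an illustration of this encoding, a 0/1 matrix like the one in Fig. 3 can be derived from per-declaration invocation sets as follows (a minimal sketch with illustrative names, not the actual FOCUS implementation):

import java.util.*;

// Sketch: binary rating matrix of one project, built from declaration -> invocations data.
public class RatingMatrix {
    // Builds a |declarations| x |allInvocations| 0/1 matrix, as in Fig. 3.
    static int[][] build(List<Set<String>> declInvocations, List<String> allInvocations) {
        int[][] m = new int[declInvocations.size()][allInvocations.size()];
        for (int d = 0; d < declInvocations.size(); d++)
            for (int i = 0; i < allInvocations.size(); i++)
                m[d][i] = declInvocations.get(d).contains(allInvocations.get(i)) ? 1 : 0;
        return m;
    }
}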
We conceptualized a 3D context-based ratings matrix to model the intrinsic relationships among various projects, declarations, and invocations. The third dimension of this matrix represents a project, which is analogous to the so-called context in context-aware CF systems. For example, Fig. 4(a) depicts three projects P = (p_a, p_1, p_2) represented by three slices with four method declarations and four method invocations. Project p_1 has already been introduced in Fig. 3 and, for the sake of readability, the column and row labels are omitted from all the slices in Fig. 4(a). There, p_a is the active project and it has an active declaration d_a. Active here means the artifact (project or declaration) being considered or developed. Both p_1 and p_2 are complete projects similar to the active project p_a. The former projects, i.e., p_1 and p_2, are also called background data since they are already available and serve as a base for the recommendation process. In practice, the more background projects we have, the better the chance that we recommend relevant API invocations. By exploiting the context-aware CF technique, the presence of additional invocations is deduced from similar declarations and projects. Given an active declaration in an active project, it is essential to find the subset of the most similar projects, and then the most similar declarations in that set of projects.

Fig. 4. Matrix and graph representation for a set of three OSS projects: (a) 3D context-based rating matrix; (b) graph representation of projects and invocations.

3.2.3 Similarity Calculator

To compute similarities, we devised a weighted directed graph that models the relationships among projects and invocations. Each node in the graph represents either a project or an invocation. If project p contains invocation i, then there is a directed edge from p to i. The weight of an edge p → i represents the number of times project p performs the invocation i. Fig. 4(b) depicts the graph for the set of projects in Fig. 4(a). For instance, p_a has four declarations and all of them invoke i_4. As a result, the edge p_a → i_4 has a weight of 4. In the graph, a question mark represents missing information. For the active declaration in p_a, it is not known yet whether invocations i_1 and i_2 should be included.

Considering (i_1, i_2, .., i_l) as the set of neighbor nodes of p, the feature set of p is the vector φ⃗ = (φ_1, φ_2, .., φ_l), with φ_k being the weight of node i_k. Each constituent weight is computed as the term-frequency inverse document frequency value, i.e., φ_k = f_{i_k} × log(|P| / a_{i_k}), where f_{i_k} is the weight of the edge p → i_k; |P| is the number of all considered projects; and a_{i_k} is the number of projects connected to i_k. Eventually, the similarity between p and q is computed as the cosine between their corresponding feature vectors φ⃗ = {φ_t}_{t=1,..,l} and ω⃗ = {ω_t}_{t=1,..,m}, given below:

sim_α(p, q) = ( Σ_{t=1}^{n} φ_t × ω_t ) / ( √(Σ_{t=1}^{n} (φ_t)²) × √(Σ_{t=1}^{n} (ω_t)²) )   (1)

Given that F(d) and F(e) are the sets of invocations for declarations d and e, respectively, the similarity between d and e is calculated using the Jaccard similarity index as follows:

sim_β(d, e) = |F(d) ∩ F(e)| / |F(d) ∪ F(e)|   (2)
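The following sketch shows how Eqs. (1) and (2) can be computed. The class and method names are ours, and the code is only meant to illustrate the two measures, not to reproduce the FOCUS implementation.

import java.util.*;

// Sketch of the two similarity measures of the Similarity Calculator.
public class Similarities {

    // TF-IDF weight phi_k = f_ik * log(|P| / a_ik) for each invocation reached by project p.
    static Map<String, Double> featureVector(Map<String, Integer> edgeWeights,   // f_ik
                                             int totalProjects,                  // |P|
                                             Map<String, Integer> projectsUsing) // a_ik
    {
        Map<String, Double> phi = new HashMap<>();
        edgeWeights.forEach((inv, f) ->
            phi.put(inv, f * Math.log((double) totalProjects / projectsUsing.get(inv))));
        return phi;
    }

    // Eq. (1): cosine similarity between two project feature vectors.
    static double simAlpha(Map<String, Double> phi, Map<String, Double> omega) {
        double dot = 0, n1 = 0, n2 = 0;
        for (Map.Entry<String, Double> e : phi.entrySet()) {
            n1 += e.getValue() * e.getValue();
            dot += e.getValue() * omega.getOrDefault(e.getKey(), 0.0);
        }
        for (double w : omega.values()) n2 += w * w;
        return (n1 == 0 || n2 == 0) ? 0 : dot / (Math.sqrt(n1) * Math.sqrt(n2));
    }

    // Eq. (2): Jaccard similarity between the invocation sets of two declarations.
    static double simBeta(Set<String> fd, Set<String> fe) {
        Set<String> inter = new HashSet<>(fd); inter.retainAll(fe);
        Set<String> union = new HashSet<>(fd); union.addAll(fe);
        return union.isEmpty() ? 0 : (double) inter.size() / union.size();
    }
}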
3.2.4 API Generator

This component is part of the Recommendation Engine, and it is used to generate a ranked list of API function calls. As shown in Fig. 4(a), the active project p_a already includes three declarations and, at the time of consideration, the developer is working on the fourth declaration, corresponding to the last row of the matrix. p_a has only two invocations, represented in the last two columns of the matrix, i.e., the cells marked with 1. The first two cells are filled with a question mark (?), implying that it is not clear whether these two invocations should also be integrated into p_a. The API Generator predicts additional invocations for the active declaration by computing the missing ratings with the following collaborative-filtering formula [8]:

r_{d,i,p} = r̄_d + ( Σ_{e ∈ topsim(d)} (R_{e,i,p} − r̄_e) · sim_β(d, e) ) / ( Σ_{e ∈ topsim(d)} sim_β(d, e) )   (3)

Equation 3 computes a score for the cell representing method invocation i in declaration d of project p, where topsim(d) is the set of top similar declarations of d; sim_β(d, e) is the similarity between d and a declaration e, computed using Eq. (2); r̄_d and r̄_e are the mean ratings of d and e, respectively; and R_{e,i,p} is the rating of e for i combined over all the similar projects, computed as follows [8]:

R_{e,i,p} = ( Σ_{q ∈ topsim(p)} r_{e,i,q} · sim_α(p, q) ) / ( Σ_{q ∈ topsim(p)} sim_α(p, q) )   (4)

where topsim(p) is the set of top similar projects of p, k = |topsim(p)| is the number of neighbor projects, and sim_α(p, q) is the similarity between p and a project q, computed using Eq. (1). Equation 4 implies that a higher weight is given to projects with higher similarity. This is reasonable in practice since, given a project, its more similar projects contain more relevant API calls than less similar ones. Using Eq. (3), we compute all the missing ratings in the active declaration and obtain a list of invocations with scores in descending order, which is then suggested to the developer. In Eq. (4), a set of k projects is used to compute the ranking, and no matter how large k is, we eventually obtain a real-valued score for each API. Therefore, the final list always contains N items, regardless of k.

In our implementation, we employed a sparse matrix to store the 3D tensor. This allows us to optimize both the storage and the computation, thus increasing the number of neighbor projects usable for the recommendation. In its current version, FOCUS is able to efficiently compute the recommendations while maintaining a trade-off between computational complexity and effectiveness.
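A minimal sketch of how the predictions of Eqs. (3) and (4) can be computed is given below. The maps encode the quantities named in the equations; all identifiers are ours, and the actual FOCUS implementation operates on the sparse 3D tensor instead.

import java.util.*;

// Sketch of the rating prediction of Eqs. (3) and (4); simplified, illustrative names.
public class ApiScorer {

    // Eq. (4): rating of declaration e for invocation i, combined over the top similar projects q.
    static double combinedRating(Map<String, Double> ratingPerProject,    // r_{e,i,q}
                                 Map<String, Double> projectSimilarity) { // sim_alpha(p, q)
        double num = 0, den = 0;
        for (Map.Entry<String, Double> q : projectSimilarity.entrySet()) {
            num += ratingPerProject.getOrDefault(q.getKey(), 0.0) * q.getValue();
            den += q.getValue();
        }
        return den == 0 ? 0 : num / den;
    }

    // Eq. (3): predicted score of invocation i for the active declaration d.
    static double predict(double meanRatingD,                        // mean rating of d
                          Map<String, Double> meanRatingE,           // mean rating per neighbor e
                          Map<String, Double> combinedRatingE,       // R_{e,i,p} per neighbor e
                          Map<String, Double> declSimilarity) {      // sim_beta(d, e)
        double num = 0, den = 0;
        for (Map.Entry<String, Double> e : declSimilarity.entrySet()) {
            num += (combinedRatingE.get(e.getKey()) - meanRatingE.get(e.getKey())) * e.getValue();
            den += e.getValue();
        }
        return den == 0 ? meanRatingD : meanRatingD + num / den;
    }
}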
3.2.5 Code Builder

This subcomponent is responsible for recommending real code snippets to developers. From the ranked list, the top-N invocations are selected as a query to search the corpus for relevant declarations. To limit the search scope, we consider only the most similar projects. Using the Jaccard index as the similarity metric, for each query we search for declarations that contain as many invocations of the query as possible. Once the corresponding declarations are identified, their source code is retrieved using the declarations relation of the Rascal M3 model. Thanks to its modularity, Rascal is able to decompile and analyze projects written in different programming languages [5], e.g., Java [4], C/C++ [2], PHP [16]. Rascal also allows us to compute the M3 model from both source code folders and binaries, e.g., JAR files, independently. Thus, we implemented a dedicated function that extracts the real source code of a method declaration by means of the computed M3 model and the project location. Finally, the resulting code snippet is suggested to the developer.
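The core of this search step can be sketched as follows. The names are illustrative; the real Code Builder additionally restricts candidates to the most similar projects and retrieves the actual source code via the M3 declarations relation.

import java.util.*;
import java.util.stream.Collectors;

// Sketch of the Code Builder search: rank candidate declarations by Jaccard overlap with the query.
public class SnippetSearch {
    static List<String> topDeclarations(Set<String> queryInvocations,
                                        Map<String, Set<String>> candidateDecls, // name -> invocations
                                        int topN) {
        return candidateDecls.entrySet().stream()
            .sorted((a, b) -> Double.compare(jaccard(queryInvocations, b.getValue()),
                                             jaccard(queryInvocations, a.getValue())))
            .limit(topN)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }

    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a); inter.retainAll(b);
        Set<String> union = new HashSet<>(a); union.addAll(b);
        return union.isEmpty() ? 0 : (double) inter.size() / union.size();
    }
}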
3.3 Use Cases

This section describes two use cases that illustrate how FOCUS works in practice. Section 3.3.1 presents the final result produced by FOCUS for the motivating example in Section 2.1, while Section 3.3.2 describes the FOCUS IDE through a real development scenario, where we recommend both a list of API function calls and real source code.

3.3.1 Recommendations for the Motivating Example

In Fig. 1(a), given that findBoekrekeningen is the active declaration, the invocations it contains are used, together with the other declarations in the current project, as the query to feed the recommendation engine. The produced outcome is a ranked list of real code snippets, and we show the top one, named findByIdentifier, in Listing 2.
Listing 2. Recommended source code for the snippet in Fig. 1(a).
By comparing the recommended code with the original one in Fig. 1(b), we realize that, though they are not the same, they indeed share several method calls and a common intent: both snippets exploit a CriteriaBuilder object to build and perform a query, and eventually retrieve some results. Furthermore, the outcome of both declarations is of the List type. More importantly, compared to the original code in Fig. 1(b), the recommended snippet appears to be of higher quality and robustness. We conclude that, for the motivating example, FOCUS is helpful since the recommended code, together with the corresponding list of function calls, i.e., get, equal, where, select, etc., provides the developer with practical instructions on how to use the API at hand to implement the desired functionality.

Fig. 5. FOCUS IDE.
3.3.2 The FOCUS IDE

As shown in Fig. 5, FOCUS has been integrated into the Eclipse IDE. The figure depicts a real development scenario where a developer is improving the existing code of the SQLDump project with recommendations provided by FOCUS. SQLDump is a simple command-line utility that exploits the apache-cli library to execute an SQL query and export the results as a CSV file. The first implementation of the main method prints parameter errors to the console by using Java I/O facilities, i.e., System.out.println. FOCUS recommends a snippet based on the HelpFormatter class provided by apache-cli: the catch statement block is completely defined, and the System.out.println invocation is replaced by a HelpFormatter. Meanwhile, printHelp is a method of HelpFormatter that prints both the possible parameter errors as well as an introduction on how to run SQLDump from the command line. As a result, with the help of FOCUS, the developer can learn how to use the method both from the code snippets (marker 3 in Fig. 5) and from the list of API calls (marker 4).

4. Instructions to install the IDE: https://bit.ly/3joJpnT
5. https://github.com/aparsons/SQLDump
6. http://commons.apache.org/proper/commons-cli/
4 EVALUATION
The goal of this study is to evaluate FOCUS and compare it with two state-of-the-art tools, i.e., UP-Miner [54] and PAM [11], with the purpose of determining the extent to which it can provide a developer with accurate and useful recommendations, featuring code snippets containing API usage patterns relevant for the developer's context. The quality focus relates to the API recommendation accuracy and completeness, the time required to provide a recommendation, and the extent to which developers perceive the recommendations as useful.

PAM has been chosen as a baseline for comparison since it is among the state-of-the-art tools in API recommendation: it has been shown [11] to outperform other similar tools such as MAPO [58] and UP-Miner [54]. To conduct the comparison with PAM, we exploited its original source code, which has been made available online by its authors. Furthermore, to facilitate future replications, we published all the artifacts together with the tools used in our evaluation on GitHub [27].

7. https://github.com/mast-group/api-mining

After formulating the research questions in Section 4.1, the following subsections describe the datasets, the analysis methodology, and the evaluation metrics used to evaluate FOCUS.

4.1 Research Questions

Our study aims to address the following research questions:

RQ1: How does FOCUS compare with UP-Miner and PAM?
Both UP-Miner [54] and PAM [11] are well-founded API recommendation tools. UP-Miner has been shown to outperform MAPO [58], while PAM achieves superior performance compared to both UP-Miner and MAPO. In our previous work [26], we showed that FOCUS outperforms PAM on different datasets collected from GitHub and MVN. In this work, we compare FOCUS with UP-Miner and PAM on an Android dataset to further study their performance on a new application domain.

RQ2: How successful is FOCUS at providing recommendations at different stages of a development process? For a recommender system, it is essential to return relevant recommendations, indicated by a high number of true positives as well as a low number of both false positives and false negatives. This research question evaluates to which extent our tool can provide accurate and complete results.

RQ3: Is there a significant correlation between the cardinality of a category and accuracy? We examine whether, given a testing app, having more apps of the same category is beneficial to the recommendation outcome.

RQ4: Can FOCUS recommend relevant code snippets? We study if the recommended code snippets provided by FOCUS are relevant to support developers in fulfilling their tasks.

RQ5: How are FOCUS recommendations perceived by software engineers during a development task? Finally, we are interested in investigating whether FOCUS is useful from a developer's point of view. To this end, we conducted a user study to evaluate the relevance of API calls and code snippets provided by FOCUS to support a particular development context. A group of 16 Master's students in Computer Engineering was involved to assess two real-world development scenarios.
In the following, we describe the dataset used to address RQ1–RQ4, as well as the data extraction method. As explained in Section 4.3, for RQ5 we rely on different datasets, because the aim is to let developers leverage FOCUS recommendations, and the tasks should be simple enough for an experimental setting.

While FOCUS is able to work with different data sources as well as programs written in various languages, the evaluation context for this paper focuses on the applicability to a specific domain, i.e., Android programming. Although Android development is per se not very different from the development of other kinds of applications, after the evaluation reported in our previous paper featuring heterogeneous Java programs [26], the aim of this evaluation is to show how, by learning from a training set belonging to applications from the same ecosystem, FOCUS is capable of providing accurate recommendations. We have chosen Android not only because of the large availability of data needed to perform an empirical evaluation, but also because recommending API calls and usage patterns is deemed to be important in Android programming [10].

Since FOCUS accepts as input data extracted by Rascal, which in turn requires a specific format, we devised our own method to acquire an Android dataset eligible for the evaluation. The extraction process needs to comply with certain requirements, and it is illustrated in Fig. 6. First, we exploited the AndroidTimeMachine platform [12] to crawl open source projects. The platform fetches apps from the Google Play store and associates them with their open source counterparts hosted on GitHub. The crawling process resulted in a set of 7,968 open source Android apps. Most of the apps (82%) in the dataset are written in Java; 4% in Kotlin, 4% in JavaScript, 2% in C++, and 1% in C.

8. https://play.google.com/
Fig. 6. The data extraction process.

Given this distribution, we considered only the Java and Kotlin apps, which account for the majority of the dataset. Afterwards, we retrieved the corresponding compiled APK files by querying the Apkpure platform using some tailored Python scripts [46]. The process culminated in the final corpus consisting of 2,600 APK binary files (mined from Apkpure) together with additional metadata (mined from Google Play), including authors, categories, star rating, price, and the number of downloads. By carefully inspecting the data, we realized that most of the apps are highly rated and have a high number of downloads.

We decompiled the APKs into the JAR format by means of the dex2jar tool [1]. The JAR files were then fed as input to Rascal to convert them into the M3 format, which can eventually be consumed by FOCUS.

In total, there are 26,854 API functions in the whole dataset, and most of them are invoked by a small number of declarations (and thus projects): 15,731 APIs are called in only one project. Only a tiny fraction of the APIs is extremely popular, being included in a large number of projects: ten APIs are called in more than 1,900 projects and 15,000 declarations. The most popular API call is java/lang/StringBuilder/append(java.lang.String), and it appears in 2,512 projects and 54,828 declarations.

Altogether, this reflects the long tail effect which has already been encountered in third-party library recommendation [25]. Such an effect can be expressed as follows: for many outcomes, about 80% of consequences originate from 20% of the causes [19]. Applied to API recommendation, it is interpreted as: "About 80% of the APIs come from 20% of the apps." As has been shown in various studies [25], [52], providing products in the long tail is beneficial to the final recommendations. In a similar fashion, we suppose that the ability to suggest APIs rarely included by apps is of particular importance, as this may help discover useful APIs that have normally been obscured from search engines.

A summary of the categories and their corresponding number of items in the considered dataset is also provided. Due to the space limit, we cannot show and discuss all the figures here; please refer to the online appendix for more details.

With this dataset, we aim at evaluating if the proposed approach is able to support mobile developers in diverse application domains as well as with various levels of app maturity, thereby attempting to resemble real-world development scenarios. We use the collected dataset in RQ1, RQ2, RQ3, and RQ4 to evaluate FOCUS as well as to compare it with the two baselines.
9. https://apkpure.com/
10. For the sake of presentation, from now on the two terms "app" and "project" are used interchangeably.
11. https://mdegroup.github.io/FOCUS-Appendix/

Fig. 7. The extraction of data for a testing project.
Finally, the following main steps are conducted to create the required metadata, which can then be used to feed FOCUS:
• the corresponding Rascal M3 model is generated for every project in the dataset;
• the corresponding ARFF representation of each M3 model is generated, to be used as input for applying FOCUS and PAM during the actual evaluation steps discussed in the next sections.

To evaluate FOCUS in RQ1–RQ4, we simulate the behavior of a developer who is programming a project and needs practical recommendations to complete it. Figure 7 provides an intuition of how the extraction of an active/testing project p_a is done. The project consists of a set of declarations, divided into three parts, namely P1, P2, and P3, which are explained as follows.
• P1: A set of complete declarations, e.g., Declaration 1, Declaration 2, etc.;
• P2: A testing declaration: for this declaration, only a portion of code is available to feed the recommendation engine, while the rest is removed and saved as ground-truth data. This corresponds to the scenario in Fig. 1(a), where the developer is implementing the active declaration d_a, and she needs recommendations on the next APIs to be added;
• P3: Removed declarations: a certain part consisting of some declarations is removed. This aims to simulate the scenario when the developer is only at an early stage of the project.

Correspondingly, there are the following parameters:
• Δ is the number of declarations in p_a (Δ > 0);
• only δ declarations (δ < Δ) are used as input for the recommendation, and the rest is discarded;
• in total, d_a has Π invocations; however, only the first π invocations (π < Π) are selected for testing, and the rest is ground-truth data;
• k is the number of neighbor projects (cf. Section 3.2.4);
• given a ranked list of APIs, the developer typically pays attention to the top-N items only, i.e., N is the cut-off value for the list.

For d_a, only half of the code lines of the method's body is selected to feed the recommendation engine. In fact, Rascal can parse only compilable code, thus there might be compilation errors at certain points where the code is incomplete. As a result, in practice we suppose that FOCUS can provide recommendations only when the developer temporarily stops at a point where the whole declaration is compilable. Thus, to increase the applicability of FOCUS, developers should try to make the code compilable as soon as they can, by closing open loops, try/catch blocks, return statements, etc. This is supported quite well by IDEs such as Eclipse, which automatically recommend and insert closing constructs for loops and try/catch blocks. In this respect, we suppose that in most cases the code is compilable, though not yet complete.

Table 1 shows four configurations, i.e., C1.1, C1.2, C2.1, and C2.2, corresponding to different combinations of δ and π. Furthermore, C1.1 and C1.2, as well as C2.1 and C2.2, are pairwise related. For example, both C1.1 and C1.2 retain the same number of method declarations (δ); they differ in the number of invocations in the testing declaration (π).

For the purpose of validation, the original dataset (cf. Section 4.2.1) was split into two independent parts, namely a training set and a testing set. In practice, the training set represents the OSS projects that have been collected ex-ante; they are available at the developer's disposal, ready to be exploited for any mining purposes. The testing set represents the project being developed, i.e., the active project. We opted for k-fold cross-validation [20] as it has been widely chosen to study machine learning models. Depending on the availability of input data, the dataset with n elements is divided into f equal parts, so-called folds. For each validation round, one fold is used as testing data and the remaining f−1 folds are used as training data. In our evaluation, two values were selected, i.e., f=10 and f=n. The former corresponds to ten-fold cross-validation while the latter corresponds to leave-one-out cross-validation [56], and they are exploited depending on the purpose as well as the availability of data. With ten-fold cross-validation, we shuffle the list of the apps considered in the evaluation, and then randomly split them into ten equal parts. In the evaluation, we attempt to equally distribute the projects into the folds, so as to maintain a balance among the folds with respect to the projects' size. For every experiment, the execution is done ten times: each time, one fold is used for testing and the remaining nine folds are used as training data. Eventually, we averaged out the metrics obtained from the ten folds to get the final results.
TABLE 1
Experimental configurations.

Conf. | δ | π | Description
C1.1 | Δ/2 − 1 | 1 | Nearly the first half of the declarations is used and the second half is discarded. The last declaration of the first half is selected as the active declaration d_a. For d_a, only the first invocation is provided as query, and the rest is used as ground-truth data, i.e., GT(p). This configuration represents an early stage of the development process and, therefore, only limited context data is available to feed the recommendation engine.
C1.2 | Δ/2 − 1 | 4 | Similarly to C1.1, almost the first half of the declarations is retained and the second half is removed. d_a is the last declaration of the first half. For d_a, the first four invocations are provided as query, and the rest is GT(p).
C2.1 | Δ − 1 | 1 | The last method declaration is selected as testing, i.e., d_a, and all the remaining declarations are used as training data. In d_a, the first invocation is kept and all the others are taken out as ground-truth data GT(p). This mimics a scenario where the developer is almost finished implementing p.
C2.2 | Δ − 1 | 4 | Similar to C2.1, d_a is selected as the last method declaration, and all the remaining declarations are used as training data. The only difference with C2.1 is that in d_a, the first four invocations are used as query and all the remaining ones are used as GT(p).

For a testing project p, the outcome of a recommendation process is a ranked list of invocations, i.e., REC(p). We believe that the ability to provide accurate invocations is important in the context of software development. Thus, we are interested in how well a system can recommend API invocations that eventually match those stored in GT(p). To measure the performance of UP-Miner, PAM, and FOCUS, we utilize Success rate, Precision, Recall, Levenshtein distance, and Time. Given that REC_N(p) is the set of top-N items and match_N(p) = GT(p) ∩ REC_N(p) is the set of items in the top-N list that match those in the ground-truth data, the metrics are defined as follows.
Success rate. For a set P of testing projects, this metric measures the rate at which a recommender can return at least one match among the top-N items for each project p ∈ P:

success rate = |{p ∈ P : match_N(p) ≠ ∅}| / |P| × 100%   (5)

Precision and recall. Precision P@N is the fraction of the top-N recommended items that are found in the ground-truth data, while recall R@N is the ratio of the ground-truth items being found in the top-N items:

P@N = |match_N(p)| / N   (6)

R@N = |match_N(p)| / |GT(p)|   (7)
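The three accuracy metrics can be computed as sketched below (illustrative Java, assuming recommendations and ground truth are given as sets of invocation identifiers; not the actual evaluation harness):

import java.util.*;

// Sketch of the accuracy metrics of Eqs. (5)-(7).
public class Metrics {
    // Eq. (6): fraction of the top-N recommended items found in the ground truth.
    static double precisionAtN(Set<String> recTopN, Set<String> groundTruth) {
        return recTopN.isEmpty() ? 0 : (double) matches(recTopN, groundTruth).size() / recTopN.size();
    }

    // Eq. (7): fraction of the ground-truth items found among the top-N recommendations.
    static double recallAtN(Set<String> recTopN, Set<String> groundTruth) {
        return groundTruth.isEmpty() ? 0 : (double) matches(recTopN, groundTruth).size() / groundTruth.size();
    }

    // Eq. (5): share of projects with at least one top-N match, as a percentage.
    static double successRate(List<Set<String>> recPerProject, List<Set<String>> gtPerProject) {
        int hits = 0;
        for (int p = 0; p < recPerProject.size(); p++)
            if (!matches(recPerProject.get(p), gtPerProject.get(p)).isEmpty()) hits++;
        return 100.0 * hits / recPerProject.size();
    }

    static Set<String> matches(Set<String> rec, Set<String> gt) {
        Set<String> m = new HashSet<>(rec);
        m.retainAll(gt);
        return m;
    }
}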
Levenshtein distance. Given two strings s_1 and s_2, the Levenshtein edit distance between them corresponds to the minimum number of single-character edits (insertions, deletions, and substitutions) needed to transform s_1 into s_2. The metric is defined as follows:

L_{s_1,s_2}(i, j) = max(i, j)   if min(i, j) = 0;
L_{s_1,s_2}(i, j) = min{ L_{s_1,s_2}(i−1, j) + 1,  L_{s_1,s_2}(i, j−1) + 1,  L_{s_1,s_2}(i−1, j−1) + 1_(s_1[i] ≠ s_2[j]) }   otherwise   (8)

where i and j are character positions in strings s_1 and s_2, respectively, and the indicator term 1_(s_1[i] ≠ s_2[j]) is 1 when the two characters differ and 0 when they match.

13. https://dzone.com/articles/the-levenshtein-algorithm-1
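A standard dynamic-programming implementation of Eq. (8), which we sketch here over the encoded API-call strings, is the following:

// Sketch of Eq. (8) over API-call strings, where each call is first encoded as one character.
public class Levenshtein {
    static int distance(String s1, String s2) {
        int[][] L = new int[s1.length() + 1][s2.length() + 1];
        for (int i = 0; i <= s1.length(); i++) L[i][0] = i;   // min(i, j) = 0 cases
        for (int j = 0; j <= s2.length(); j++) L[0][j] = j;
        for (int i = 1; i <= s1.length(); i++)
            for (int j = 1; j <= s2.length(); j++)
                L[i][j] = Math.min(
                    Math.min(L[i - 1][j] + 1, L[i][j - 1] + 1),   // deletion / insertion
                    L[i - 1][j - 1] + (s1.charAt(i - 1) == s2.charAt(j - 1) ? 0 : 1)); // substitution
        return L[s1.length()][s2.length()];
    }
}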
Recommendation time. The time needed for the systems to generate predictions is measured using a laptop with an Intel Core i5-7200U CPU @ 2.50GHz × 4, 8GB RAM, and Ubuntu 16.04.

RQ1: To address RQ1, we compare the performance of FOCUS with that of UP-Miner and PAM. Our experience [26] reveals that PAM cannot scale well with large datasets, i.e., it suffers from a high computational complexity. Meanwhile, FOCUS is more efficient as it is capable of incorporating a large number of background projects and swiftly producing recommendations. In particular, both systems were experimented on a mainstream laptop using a set of 549 training projects of 80MB in size to measure the execution time [26]. On average, PAM requires 320 seconds to provide a recommendation, while FOCUS needs just 1.80 seconds. Through a careful observation of the Android dataset (cf. Section 4.2.1), we realized that many of the apps are big in size, and a training set of 2,360 apps may add up to more than 2.0GB. This essentially means that it is infeasible to run PAM on the entire dataset, since the execution time would become prohibitive. Thus, for RQ1 we can leverage only a portion of the original corpus. To be more precise, we selected 500 apps of average size. There are 39 categories in total and most of them contain a small number of apps, while
Tools is still the biggest category, with 151 apps, accounting for 30.20% of the total amount. We opted for leave-one-out cross-validation [56], aiming to exhaustively exploit the background data. We study the performance of FOCUS by considering all four configurations listed in Table 1, i.e., C1.1, C1.2, C2.1, and C2.2. The cut-off value N is used to investigate how accurately the system is able to provide recommendations with respect to different lengths of the ranked list. In RQ1, we set N to 30, attempting to study the three systems on a long list of recommendations. We also consider, as can be seen in Eq. (4), different values of the number of neighbor apps, i.e., k = {1, 2, 3, 4}. The evaluation was executed 500 times: in each validation round, one app is used for testing and all the remaining 499 apps are used for training. To aim for a reliable comparison, we ran UP-Miner and PAM using their original settings in our evaluation.

RQ2: For this research question, we made use of the whole corpus introduced in Section 4.2.1, which contains all the 2,600 collected apps. Moreover, since we have a larger amount of data compared to RQ1, we employ ten-fold cross-validation in this research question. We analyze the performance of FOCUS for combinations of: (i) different configurations, i.e., C1.1, C1.2, C2.1, and C2.2; (ii) different values of N; and (iii) different values of k. The rationale behind the selection of such values is as follows. We should incorporate only a certain number of neighbor projects k when computing recommendations, otherwise the matrix will become big (cf. Fig. 4(a)), which possibly induces an expensive computational cost. While a large value of N seems to be unrealistic, in the scope of our evaluation we have to consider it to ensure the generalizability of our final conclusions. In practice, a small enough number of N items should be presented to the developers, so as to avoid overwhelming them. We report, for different configurations and values of N and k, the success rate and performance gain. Also, we plot the precision/recall curves for different configurations and values of k.

RQ3: To address RQ3, we perform controlled experiments on the whole dataset described in Section 4.2.1. Similar to RQ2, we conducted the experiments following the ten-fold cross-validation methodology. The apps collected in the corpus span a total of 47 categories, such as Productivity, Communication, Music & Audio, or Business. The cardinality (i.e., the number of apps within a category) of the categories varies considerably: most of them contain a small number of apps, i.e., ranging from 1 to 20 items for almost half of the topics. The biggest category, with 659 apps, is Tools, while there are three categories with only two apps, i.e., Trivia, Music, and Parenting.
With this research question, we aim at examining if there is a strong positive correlation between two variables, i.e., the cardinality of a category and the corresponding precision. In other words, we hypothesize that apps belonging to populous categories might possibly get better recommendations since they have more, presumably relevant, background data, i.e., projects coming from the same domains. This would have the following impact in practice: once the developer specifies one or more domains for her app, we can search for recommendations just by looking at apps within the same domains, aiming to narrow down the search scope. This is useful since it contributes to a reduction in the overall execution time. However, this is a pure assumption, which needs to be carefully studied through concrete experiments. For each category, we computed the precision for all of its constituent apps following Eq. (6), and the precision of a category was averaged out over the apps. Eventually, the correlation between cardinality and precision is computed using the Spearman's rank and Kendall's correlation coefficients, i.e., ρ and τ, respectively. The coefficients range from -1 (perfect negative correlation) to +1 (perfect positive correlation), while ρ=0 or τ=0 implies that the variables are not correlated at all. The reason why we compute both Spearman's and Kendall's correlation is that the number of categories is relatively small, and the Spearman's correlation may be more suitable in this case. We do not use Pearson's correlation as we cannot assume the presence of a linear relationship between categories and precision.
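For illustration, both coefficients can be computed with Apache Commons Math, assuming one entry per category with its cardinality and average precision (a minimal sketch, not the actual analysis script):

import org.apache.commons.math3.stat.correlation.KendallsCorrelation;
import org.apache.commons.math3.stat.correlation.SpearmansCorrelation;

// Sketch of the RQ3 correlation analysis; inputs are one value per category.
public class CategoryCorrelation {
    static void report(double[] cardinality, double[] avgPrecision) {
        double rho = new SpearmansCorrelation().correlation(cardinality, avgPrecision);
        double tau = new KendallsCorrelation().correlation(cardinality, avgPrecision);
        System.out.printf("Spearman rho = %.3f, Kendall tau = %.3f%n", rho, tau);
    }
}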
RQ4: In this research question, we study if FOCUS is able to recommend source code relevant to the method declaration under development, exploiting the ten-fold cross-validation technique. As an example, we assume that the developer is working on the incomplete code snippet depicted in Fig. 1(a), and FOCUS is expected to suggest real code such as the one in Fig. 1(b), or the one in Listing 2. To evaluate the similarity between two declarations, we compare their constituent APIs. This comparison is based on the observation, coming from an existing work [22], that if projects or declarations share API calls implementing the same requirements, then they are considered to be more similar than those that do not have similar API usage. Following the same line of reasoning, we evaluate the similarity/relevance between two snippets by examining if they share common API function calls and have the same sequence of these calls.

To address this research question, we leverage the dataset of 500 apps also used to address RQ1. We deliberately make use of such a small dataset for the following reason: with this dataset, we analyze the ability of FOCUS to recommend relevant code snippets given a fairly small amount of training data. We conjecture that, as confirmed later in the paper, if FOCUS works effectively on a small dataset, it will perform well on bigger ones. To evaluate if a recommended snippet is relevant to the query, we measure the level of similarity between them using the Levenshtein edit distance [21], which has been used by prior work for similar purposes, e.g., tracking source code clones [49]. Given the source code of a declaration d, we parse it using Rascal to get the API invocations. Afterwards, we encode each of the invocations using a unique character, resulting in a string s_1. Thus, the evaluation of the similarity between two declarations d_1 and d_2 boils down to comparing the corresponding strings s_1 and s_2, by counting the number of replacements needed to convert s_1 into s_2 using Eq. (8). Such a metric takes into account not only the common characters between s_1 and s_2, but also the order in which they appear. Correspondingly, this means that two code snippets are similar/relevant if they share common API function calls and have the same sequence of the calls. In this sense, the smaller the distance, the more similar the two snippets are, and vice versa.

To simplify the comparison performed in RQ4, we only used configuration C1.2 (cf. Table 1). The rationale behind this selection is that it represents a more authentic development scenario, corresponding to the situation where the developer has already finished a part of the declaration and expects to get recommendations. To be more concrete, given a testing project, we kept the first half of the declarations and removed the second half; the last declaration of the first half is selected as the testing one, d_a. For d_a, the first four invocations are provided as query, and the rest is GT(p).
CodeBuilder subcomponent (cf. Section 3.2.5), we extracted thereal source code of a declaration by means of the computedM model and the project location.In fact, APK files do not contain source code, thus it isnot possible to directly mine real code snippets from theapps. However, FOCUS allows us to extract the methodcanonical name of a recommended code snippet within theproject scope. Moreover, since the dataset is extracted from AndroidTimeMachine , there is a mapping between open-sourceGoogle store apps with their corresponding repositories. Tolocate the right pair of APK file and G IT H UB repository, wecheck the snapshot date when the mapping was created. Inthis way, we are able to trace back to the original source codefor those apps that have a counterpart in G IT H UB . Eventually,FOCUS is able to recommend source code, as long as thecorresponding app is associated with a source project rootedin G IT H UB . In this section, we study FOCUS’s usefulness of code andAPI recommendations by means of a task-based user studyto address RQ .The goal of this study is to evaluate FOCUS, with thepurpose of understanding whether it could help developerswith their implementation tasks. The quality target of thestudy is the perceived usefulness that developers haveof recommendations (code snippets and APIs) providedby FOCUS. The context consists of participants, i.e., 16Master’s students in Computer Engineering, and objects,i.e., programs involving command line argument parsingand HTML download/parsing. As shown in Table 2, the experimental design is a crossoverdesign in which participants were split into four groups (each TABLE 2Task assignments to the evaluator groups.
TABLE 2
Task assignments to the evaluator groups.

Group | T1 | T2
Group I | Apache-cli, using FOCUS | gson
Group II | Apache-cli | gson, using FOCUS
Group III | gson | Apache-cli, using FOCUS
Group IV | gson, using FOCUS | Apache-cli

Each participant had to carry out two implementation exercises, one using FOCUS recommendations and another without the availability of FOCUS. Different groups featured a different ordering of the treatments, to mitigate any ordering/learning effect.

The two tasks focus on the usage of different libraries, i.e., commons-cli and jsoup, and require the completion of three partially implemented methods. commons-cli provides APIs for parsing command-line options passed to programs, while jsoup is a library for parsing and manipulating HTML pages using the best of DOM, CSS, and jquery-like methods.

/**
 * Create apache-cli options for the following elements:
 * url (Mandatory),
 * username (Mandatory): -user
 * [...]
 */

Listing 3. Partial implementation and requirements of the method for specifying the command-line options (excerpt).
Listing 4. The unit tests for checking the correctness of the task.
For the tasks with commons-cli, the participants completed three methods by: (i) implementing a method for specifying the command-line options (we provided the evaluator with the parameter list); (ii) parsing the command-line parameters and throwing an exception if the mandatory ones are missing; and (iii) handling parsing exceptions by printing the possible options to the console. Listing 3 shows an example of the partial implementation and the method requirements for specifying options. For a detailed description of the two performed tasks, due to the space limit, interested readers are kindly referred to our online appendix.

14. https://commons.apache.org/proper/commons-cli/
15. https://jsoup.org/
16. https://mdegroup.github.io/FOCUS-Appendix/tasks.html

For each method to be completed, we provided (for treatments having the availability of FOCUS) each evaluator with the top-5 snippets and top-20 method invocations recommended by FOCUS, giving the initial and partial method implementation as input.
Under the circumstances in which the experiment was conducted (the COVID-19 emergency in 2020), it was neither possible to perform the experiment in a laboratory nor to ask participants to return the results immediately. Instead, each participant could perform the tasks offline and return them to us. Before the study, we held an introductory session in video conference, in which we introduced to participants the goals and tasks of the laboratory (without details about our research question, to avoid biasing them), and left them a detailed instruction document. During the tasks, participants could access any resource available on the Internet, besides the FOCUS recommendations when available according to the study design. Once a participant finished the tasks, s/he had to complete a questionnaire (https://forms.gle/uoqSTaQ94PArdUST6) consisting of the following questions: (i) three general questions about their experience with programming and code search engines; (ii) four questions, on a 5-level Likert scale [29], related to the understandability and complexity of the assigned tasks; and (iii) four questions to evaluate the relevance and usefulness of the recommendations provided by FOCUS.

Moreover, we asked the participants to submit their implementations. Such implementations have been used to assess the correctness of the resulting code. For each method to be completed, we defined a specific JUnit (http://junit.org/junit4) test to check its correctness. We did not provide the evaluators with the test methods, to avoid biasing the experiment. Listing 4 reports the simple testing methods used to check the correctness of the submitted tasks. Although the unit tests are rather simple, they have been able to effectively catch possible implementation failures. Then, we involved a senior developer experienced with Java programming and the jsoup (https://jsoup.org/) and commons-cli libraries to further investigate the method implementations for which the unit tests fail. The senior developer checked the severity of the identified errors and discarded those that are not related to the usage of the involved library. For instance, some evaluators named the parameters differently, e.g., they used password instead of pass or username instead of user. Consequently, the dedicated parseOKTest test fails because of a wrong parameter naming. We marked this type of failure as a minor one, and we considered the implementation as correct for the evaluation scope.

We performed the following analyses to address this research question:
• We perform a Wilcoxon signed-rank test [55] to determine whether there is any statistically significant difference between the number of passed tests for tasks implemented with and without FOCUS (H0: there is no significant difference between the percentage of tests passed with and without the availability of FOCUS). Also, we compute the Cliff's delta effect size [13]; a minimal sketch of both computations is given below.
• As for the questionnaire results, we report them using diverging stacked bar charts and discuss them.
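For illustration, the statistical analysis described above can be carried out as in the following sketch, which uses the WilcoxonSignedRankTest class from the Apache Commons Math library and a direct implementation of Cliff's delta; the sample data are hypothetical.

import org.apache.commons.math3.stat.inference.WilcoxonSignedRankTest;

public class StatsAnalysis {
    // Cliff's delta: (#pairs where x > y - #pairs where x < y) / (n*m).
    static double cliffsDelta(double[] x, double[] y) {
        int more = 0, less = 0;
        for (double a : x)
            for (double b : y) {
                if (a > b) more++;
                else if (a < b) less++;
            }
        return (more - less) / (double) (x.length * y.length);
    }

    public static void main(String[] args) {
        // Hypothetical percentages of passed tests per participant.
        double[] withFocus    = {100, 66.7, 100, 33.3, 100, 66.7};
        double[] withoutFocus = {66.7, 33.3, 66.7, 66.7, 66.7, 100};
        WilcoxonSignedRankTest test = new WilcoxonSignedRankTest();
        // Two-sided p-value; 'false' selects the normal approximation.
        double p = test.wilcoxonSignedRankTest(withFocus, withoutFocus, false);
        System.out.printf("p-value=%.3f, Cliff's delta=%.3f%n",
                p, cliffsDelta(withFocus, withoutFocus));
    }
}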
5 RESULTS
This section analyzes the experimental results obtained through the evaluation, referring to the research questions mentioned in Section 4.1.

RQ1: How does FOCUS compare with UP-Miner and PAM?
TABLE 3
Success rate of UP-Miner, PAM, and FOCUS (columns: UP-Miner, PAM, and FOCUS with k=1, k=2, k=3, and k=4; rows: the configurations C1.1, C1.2, C2.1, and C2.2).
Table 3 reports the success rate of UP-Miner, PAM, and FOCUS, considering different configurations and values of k, with k representing the number of neighbor apps. The cut-off value N was set to 30, attempting to investigate the systems' performance for a long list of recommendations. The table shows an evident outcome: FOCUS always achieves a much better success rate than PAM and UP-Miner across all configurations. For instance, with C1.2, FOCUS gets 89.81%, 91.10%, 92.80%, and 93.22% as success rate with k=1, k=2, k=3, and k=4, respectively, while PAM and UP-Miner get 52.20% and 37.33%, respectively. With C2.2, FOCUS gets a maximum success rate of 92.10%, which is superior to the 58.40% and 40.66% obtained by PAM and UP-Miner, respectively. We further confirm the claim by Fowkes and Sutton [11], i.e., PAM outperforms UP-Miner also in our setting. Concerning the execution time, for the given dataset, both FOCUS and UP-Miner provide a recommendation in less than 0.01 seconds, with UP-Miner being the fastest of the two, while PAM takes 1.6 seconds.

The performance gain obtained by FOCUS is understandable in light of the following arguments. UP-Miner works on the basis of clustering techniques, and it depends on the similarity among groups of APIs. In other words, UP-Miner computes similarity at the sequence level, i.e., among invocations that are usually found together. PAM is a complex system consisting of six building blocks, i.e., probabilistic model, inference, learning, inferring new patterns, candidate generation, and mining interesting patterns. The system defines a probability distribution over all possible API patterns present in client code, based on a set of API patterns. It also employs a generative model to infer the most probable patterns from ARFF files. Finally, the system generates candidate patterns by relying on the highest-support-first rule, i.e., searching for the best candidates earlier. Due to these technical details, both UP-Miner and PAM can recommend APIs that commonly appear in different code snippets. In contrast, FOCUS is able to consider similarity both at the project level and at the declaration level. Therefore, given an active project, FOCUS mines API calls from the most similar declarations in the most similar projects. As a result, FOCUS outperforms both UP-Miner and PAM in finding invocations that fit well a given context.

It is worth noting that FOCUS achieves a considerably high performance, given that the dataset is fairly small. The maximum success rate obtained by C1.2 and C2.2 is 93.22% and 92.10%, respectively. Compared to our previous work [26], where a set of 200 GitHub projects was considered to compare FOCUS with PAM, we see that FOCUS substantially improves its recommendations when more data is incorporated into the training. A feature of the considered datasets that may affect the results obtained by FOCUS is the level of dependencies in Android apps compared to that of the GitHub projects. In particular, by counting the number of unique APIs in each app/project for both the Android dataset and the GitHub dataset, we see that the former contains more APIs than the latter. Many apps have more than 400 unique APIs, whereas most of the GitHub projects have fewer than 200 unique APIs. This is further supported by previous work [39], [53], which gives evidence that Android projects make heavy use of third-party libraries as well as native libraries.

Answer to RQ1. While UP-Miner is the fastest tool in terms of recommendation time, FOCUS is the most effective one, as it substantially outperforms both UP-Miner and PAM with respect to prediction performance, while keeping the recommendation time below 0.01 seconds. Moreover, FOCUS mines Android apps better than GitHub projects.

RQ2: How successful is FOCUS at providing recommendations at different stages of a development process?
In this research question, we are interested in understanding the completeness and accuracy of FOCUS's recommendations at different stages of a project's development. For the former, we analyze the corresponding success rate and performance gain, while for the latter, we take into consideration the obtained precision and recall values. Furthermore, we investigate the system's ability to recommend APIs in the long tail.

Success rate.
Table 4 compares the success rates obtained by the considered experimental settings. For the smallest cut-off value N, i.e., N=1, FOCUS is still able to provide matches. For instance, with C1.1 when k=2, the system gets 67.46% as success rate, and this score increases steadily with N: FOCUS gets a success rate of 76.84% and 82.80% when N=5 and N=20, respectively. With configuration C1.2, compared to C1.1, we see a sharp increase in performance for all the cut-off values. For example, with k=2 we get 91.11% as success rate for N=5, and the score goes up to 94.07% when N=20. This demonstrates that FOCUS is capable of providing good matches even when the developer wants to see a fairly short ranked list. Similarly, with C2.1 and C2.2, FOCUS enhances its success rate as k and N grow.

Next, we investigate the effect of changing the number of neighbor apps used in computing recommendations, i.e., k, on the final outcome by comparing the results columnwise. It is evident that when incorporating more neighbors for computing recommendations, FOCUS yields a better success rate. For instance, with C1.1, considering the success rate obtained with k=2 and k=3, we see that there is always a gain in performance: for N=5, FOCUS obtains 76.84% and 78.42%, respectively. This score improves substantially when we use more neighbor projects to compute recommendations. As an example, FOCUS has a success rate of 71.34% when k=4 and 75.76% when k=10. In summary, FOCUS is more accurate if additional apps are considered for computing the missing ratings in the 3D matrix.

TABLE 4
Success rate (%) for k = {2, 3, 4, 6, 10} and different cut-off values N from 1 to 20, grouped by the configurations C1.1, C1.2, C2.1, and C2.2.

TABLE 5
Performance gain (%) among the configurations: gain of C1.2 w.r.t. C1.1 and gain of C2.2 w.r.t. C2.1, for k = {2, 3, 4, 6, 10}.
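For clarity, the success rate metric used throughout this section can be sketched as follows: a testing declaration counts as a hit if at least one ground-truth invocation appears among its top-N recommendations. The method and type names below are hypothetical.

import java.util.List;
import java.util.Set;

public class Metrics {
    // Fraction of queries (in %) for which at least one of the top-N
    // recommendations occurs in the ground-truth invocations.
    static double successRateAtN(List<List<String>> rankedLists,
                                 List<Set<String>> groundTruths, int n) {
        int hits = 0;
        for (int i = 0; i < rankedLists.size(); i++) {
            List<String> topN = rankedLists.get(i)
                    .subList(0, Math.min(n, rankedLists.get(i).size()));
            Set<String> truth = groundTruths.get(i);
            if (topN.stream().anyMatch(truth::contains)) hits++;
        }
        return 100.0 * hits / rankedLists.size();
    }
}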
Performance gain.
Referring to Table 1, we see that C1.1 and C1.2, as well as C2.1 and C2.2, are pairwise comparable. For instance, both C1.1 and C1.2 share the same amount of method declarations (δ); they only differ in the number of invocations used in the testing declaration (π). Thus, to investigate the effect of changing π on the recommendations, we consider each pair of related configurations. The results in Table 4 indicate a sharp rise in performance when the configuration changes from C1.1 to C1.2. For example, when k=2 and N=1, i.e., when we consider only the first item in the ranked list, FOCUS obtains 85.69% as success rate, which is much better than 67.46%, the score yielded by C1.1. When k=10 and N=20, the maximum success rate for C2.1 and C2.2 is 90.11% and 96.92%, respectively. This suggests that incorporating more invocations, e.g., four instead of one, helps FOCUS significantly enhance its overall performance. In practice, this means that, given a declaration, the system is able to provide more accurate recommendations proportionally to the project's maturity.

Given the results in Table 4, we analyze the performance gain in percentage (%) and report it in Table 5. The green color and various levels of density are employed to represent the corresponding magnitude. From the table, it is evident that the color gradually fades when moving from left to right and from top to bottom, implying that the enhancement decreases when we increase k and N. For example, the gain of C1.2 w.r.t. C1.1 is as follows: for N=1 the gain is 27.02% with k=2, and it decreases to 26.67% and 23.62% with k=3 and k=4, respectively; when k=10, the gain drops to 18.74%. The same trend can be seen for other values of k and N. Likewise, the improvement obtained by C2.2 in comparison to C2.1 shares a similar pattern: it is big with low k and N, and small with higher k and N. For instance, it reaches 25.14% for N=1 and k=2, and shrinks to 7.95% for N=20 and k=10. Overall, this essentially means that while we get a performance gain by incorporating more neighbors, at a certain point the gain becomes saturated and there is no further improvement.
Accuracy.
We report the accuracy achieved by all configurations using the precision-recall curves (PRCs) depicted in Fig. 8(a), Fig. 8(b), Fig. 8(c), and Fig. 8(d). The cut-off value N has been varied from 1 to 30, aiming to study FOCUS's performance further down in the ranked list. First, we examine the effect of changing k on the precision-recall curves. A system obtains a good performance if its precision and recall are high at the same time, which corresponds to a PRC close to the upper-right corner of the diagram. From the figures, it is clear that incorporating more neighbor apps in computing recommendations results in a better accuracy for all configurations. For instance, with C1.1, we see a performance gain when increasing the number of neighbor apps: the best precision and recall are 0.75 and 0.63, respectively, and they are obtained when k=10, while for the other values of k, i.e., k = {2, 3, 4, 6}, the system gets a lower precision and recall. Also for the other configurations, k=10 is the number of neighbor apps that contributes to the best accuracy: with C1.2, FOCUS achieves 0.92 as precision and 0.84 as recall. With C2.1 and C2.2, the gain in performance when using 10 apps for computing recommendations becomes even more evident, in comparison to the other values of k. This is consistent with the outcomes obtained with the success rate scores presented in Table 4: the system achieves a better performance if it incorporates more similar apps for computing recommendations.

In conclusion, we see that the performance of FOCUS using C1.2 is superior to that using C1.1. Similarly, compared to C2.1, the accuracy obtained by FOCUS using C2.2 improves substantially, i.e., by equipping the query with more invocations. These facts further confirm that FOCUS is able to recommend more relevant invocations when the developer enriches the declaration by adding more code. As can be seen in Eq. 4, when more invocations are available, the similarity among declarations can be better determined, resulting in a gain in performance.
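Analogously, each point of a PRC can be derived from per-query precision@N and recall@N values, as in the following sketch; names are hypothetical and the aggregation may differ from the exact procedure used in our evaluation.

import java.util.List;
import java.util.Set;

public class PrecisionRecall {
    // Precision@N: matched items in top-N divided by N;
    // Recall@N: matched items in top-N divided by the ground-truth size.
    static double[] precisionRecallAtN(List<String> ranked,
                                       Set<String> truth, int n) {
        long matched = ranked.stream().limit(n)
                .filter(truth::contains).count();
        double precision = matched / (double) n;
        double recall = truth.isEmpty() ? 0.0 : matched / (double) truth.size();
        return new double[] {precision, recall};
    }
}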
The long tail.
We counted the APIs that are recommended most often by FOCUS. By carefully checking the top 20 recommended items, we realized that most of them reside in the long tail. For example, the java/lang/StringBuilder/toString() API has been provided 190 times by FOCUS, being the most recommended item. However, this invocation is only ranked 646th in the list of all the APIs in the dataset. Altogether, this shows that while recommending very popular APIs may make sense, FOCUS goes far beyond that by also recommending items in the long tail. This is achieved because FOCUS mines APIs from highly similar projects, given an active project.

Fig. 8. Precision-recall curves obtained for the configurations: (a) C1.1, (b) C1.2, (c) C2.1, (d) C2.2; each panel plots precision against recall for k = {2, 3, 4, 6, 10}.

Fig. 9. Bivariate analysis of precision and cardinality.

Answer to RQ2. FOCUS provides more accurate predictions when more similar projects are used for recommendation. It is capable of suggesting APIs in the long tail. The system improves its accuracy while the developer keeps coding.

RQ3: Is there a significant correlation between the cardinality of a category and accuracy?
Table 6 depicts the Spearman coefficients for all configurations, with respect to different values of N. The Kendall coefficients (τ) are comparable to the Spearman ones (ρ), so we omit them from the table for the sake of clarity. By examining the results in Table 6 we see that, despite some fluctuations, mainly with C2.1 and C2.2, ρ is considerably small, i.e., the maximum value is ρ=0.160, obtained for C1.2 and N=25. More importantly, most of the scores are close to 0, indicating an extremely low or almost no correlation, e.g., for several values of N with C1.1 and C2.2.

TABLE 6
Correlation (ρ) between cardinality and precision for different values of N, per configuration (C1.1, C1.2, C2.1, C2.2).
As an example, Fig. 9 depicts precision and cardinality as well as their correlation for N=25. The variables are shown both on the x-axis and the y-axis, albeit on different parts of the axes. This allows us to comprehensively represent the relationship between the two variables for all four configurations. In particular, the top-left corner shows the histogram of precision with respect to cardinality, while the other bar charts at the bottom show the histogram for each variable individually. The middle frame in the top row reports the correlation coefficients between precision and cardinality for all the configurations. The results show that there is a very weak correlation between the two variables. For instance, the coefficient is 0.032 for C1.1 and 0.036 for C2.2. As a whole, this unfortunately contradicts our initial conjecture: apps belonging to major categories do not get better recommendations, although they have, in principle, more background data. This means that searching for recommendations just by looking at apps of the same domain(s) does not guarantee that we will gain a benefit. We attempt to ascertain the possible causes in the following.

According to previous work [22], if projects share API calls implementing the same requirements, then they are considered more similar than projects that do not have similar API usage. We computed similarity among apps using the Similarity Calculator component presented in Section 3.2.3. Such a similarity is measured based on the constituent API function calls of an app (cf. Fig. 4(b) and Eq. 1). By carefully examining the final results, we realized that, in general, similar apps do not originate from the same domain. To be concrete, considering a ranked list with five items for all 2,600 apps, i.e., N=5, the percentage of items whose similar apps come from 1, 2, 3, 4, and 5 categories is 1.14%, 6.6%, 21.42%, 41.8%, and 29.0%, respectively. For instance, machinekit.appdiscover belongs to Libraries & Demo, but its highly similar apps are from Education, Books & Reference, Health & Fitness, and Tools. Since FOCUS relies on the similarity function (cf. Section 3.2.3), it may retrieve invocations from projects in completely different domains to generate recommendations. This explains why projects of a category with a low number of items still get a good accuracy, resulting in a weak correlation between the cardinality of a category and accuracy. In a nutshell, there exists no correlation because even apps belonging to different categories still contain similar API usage.
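To give an intuition of similarity based on constituent API calls, the following sketch uses a Jaccard measure over the API sets of two apps; this is an illustration only, not necessarily the exact similarity function of Eq. 1.

import java.util.HashSet;
import java.util.Set;

public class AppSimilarity {
    // Jaccard similarity between the API sets of two apps:
    // |A ∩ B| / |A ∪ B|, yielding a value in [0, 1].
    static double jaccard(Set<String> apisA, Set<String> apisB) {
        Set<String> intersection = new HashSet<>(apisA);
        intersection.retainAll(apisB);
        Set<String> union = new HashSet<>(apisA);
        union.addAll(apisB);
        return union.isEmpty() ? 0.0
                : intersection.size() / (double) union.size();
    }
}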
Though the experiment suggests that we cannot save time by looking only into certain categories, on the bright side, it reveals an interesting feature of FOCUS: the tool is able to discover API calls from a wide range of apps, regardless of their origins.
Answer to RQ3. There is no direct correlation between the cardinality of a category and prediction accuracy. Moreover, FOCUS is capable of mining API calls from apps belonging to various application domains.

RQ4: Can FOCUS recommend relevant code snippets?
As shown in Section 3.3.1, by using the incomplete code in Fig. 1(a) together with other testing declarations as a query to FOCUS, we obtained the relevant snippet depicted in Listing 2, and this is just one of the many good matches we got. To provide a concrete analysis, Fig. 10 depicts the distribution of the 500-app dataset with respect to the number of projects (x-axis) and the Levenshtein distance between the testing declaration and the corresponding recommended snippet (y-axis).
Fig. 10. Levenshtein distance for the set of 500 apps (x-axis: number of apps; y-axis: distance).
To facilitate a better view, we mark the apps as four separate clusters. Almost a quarter of the projects, i.e., 24% corresponding to 120 projects, get zero as the final result, i.e., the distance between the recommended snippet and the original one is zero. This means that for each of these projects, the recommended declaration perfectly matches the original one. Among the remaining ones, 23 projects, accounting for 4.6%, have a distance of one, which also indicates a high level of code similarity. Almost half of the dataset, i.e., 233 apps corresponding to 46.60%, have a distance larger than nine.
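For reference, the distance underlying Fig. 10 is the classic Levenshtein edit distance [21]; a textbook dynamic-programming implementation, computed here over characters for simplicity, is sketched below.

public class Levenshtein {
    // Minimum number of insertions, deletions, and substitutions
    // needed to turn s into t.
    static int distance(String s, String t) {
        int[][] d = new int[s.length() + 1][t.length() + 1];
        for (int i = 0; i <= s.length(); i++) d[i][0] = i;
        for (int j = 0; j <= t.length(); j++) d[0][j] = j;
        for (int i = 1; i <= s.length(); i++)
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + cost);
            }
        return d[s.length()][t.length()];
    }
}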
20. https://bit.ly/3pGKKIL

Fig. 11. Results for evaluating the usefulness of the recommendations: (a) Q1: Does FOCUS retrieve code snippets relevant to the context? (b) Q2: Do the recommended code snippets help you complete the lab assignments? (c) Q3: Does FOCUS retrieve invocations relevant to the context? (d) Q4: Do the recommended invocations help you complete the lab assignments?
Figure 10 shows that, while FOCUS attains a good recommendation performance for a considerably large number of apps, it fails to retrieve matches for some others, i.e., the corresponding Levenshtein distance is large, meaning that the recommended snippets are not relevant to the ground-truth ones. For instance, one project has a distance of 52, and another has a distance of 43. We attempt to find out the rationale behind this outcome. Our main intuition is as follows: for those projects with a large Levenshtein distance, there is a lack of relevant training data. In other words, if there are not enough similar projects, FOCUS cannot discover API invocations that eventually fit the active declaration. To validate this hypothesis, the following test was conducted: we computed the precision scores for all projects and compared them with the Levenshtein distances using Spearman's rank correlation coefficient (similar to RQ3). The resulting score is ρ=-0.514, with p-value < 2.2e-16. This can be interpreted as follows: the obtained precision is inversely related to the Levenshtein distance; put another way, the higher the precision, the shorter the distance, and vice versa. This finding consolidates our assumption: if FOCUS achieves a high precision, it will be able to recommend more relevant code snippets. Furthermore, as we already showed in RQ2, FOCUS gets a higher precision if we use more similar apps for computing recommendations. Altogether, we conclude that our proposed approach is able to return relevant code snippets if it is fed with more training data. From the set of apps with a Levenshtein distance of 0, we enumerated the APIs and sorted them in descending order to see which invocations have been recommended most. We got an outcome similar to that of RQ2 in Section 5.2: FOCUS recommends several APIs that appear late in the ranked list of the most popular invocations.

Answer to RQ4. FOCUS can provide relevant source code snippets for a testing declaration, as long as we feed it with a rich training dataset, i.e., there are more projects similar to the one being considered.

RQ5: How are FOCUS recommendations perceived by software engineers during a development task?
As explained in Section 4, 16 participants took part in the user study. Among them, 30% and 50% have three years and more than four years of programming experience, respectively. Most of them use a code search engine on a daily basis. Moreover, 80% of the participants agree that the tasks are clear and easy.

First, we analyzed whether the use of FOCUS could help participants produce more correct code. The median number of passed tests was 2 out of 3 both with and without FOCUS. We compared (pairwise, by participant) the percentage of passed test cases using a Wilcoxon signed-rank test. The test did not indicate a statistically significant difference (p-value=0.88), i.e., the correctness of the produced implementations did not change with and without FOCUS. Also, the Cliff's d effect size is negligible (d=0.01).

We therefore looked at the perceived usefulness of the recommendations, in terms of API invocations and code snippets. The results of the questionnaire related to the task assignments in Table 2 are shown in Fig. 11(a), Fig. 11(b), Fig. 11(c), and Fig. 11(d).

Concerning the first question, "Q1: Does FOCUS retrieve code snippets relevant to the context?", following Fig. 11(a), 69% of the participants agree or strongly agree that the snippets are relevant, while the remaining 31% have no concrete judgment on the results, i.e., they are neutral. This means that most of the developers find that the recommended code snippets fit their programming tasks. Regarding the second question, "Q2: Do the recommended code snippets help you complete the lab assignments?", as shown in Fig. 11(b), most of the participants find that the snippets recommended by FOCUS are helpful to solve the tasks. In particular, 73% of them agree or strongly agree. With the third question, "Q3: Does FOCUS retrieve invocations relevant to the context?", we are interested in understanding whether FOCUS can fetch invocations related to the given context. The results in Fig. 11(c) suggest that more than half of the evaluators, i.e., 56%, think that the provided APIs are relevant, while 44% have no concrete judgment. Finally, the results in Fig. 11(d), corresponding to the last question, "Q4: Do the recommended invocations help you complete the lab assignments?", show that 7% of the participants disagree that the APIs are useful, while 27% feel neutral about the results. Still, most of them, i.e., 67%, appreciate the recommended APIs, which are helpful to solve their tasks.

Altogether, the Likert scores indicate that FOCUS provides decent recommendations: both the suggested APIs and code snippets are meaningful in the given contexts.

Answer to RQ5. The majority of the study participants positively perceived the context-specific relevance and the usefulness of the recommendations (APIs and code snippets) provided by FOCUS.
6 DISCUSSIONS
In this section, we discuss the experience gained from the experiments in Section 6.1. The threats that might hamper the validity of our findings are discussed in Section 6.2.
With FOCUS, by choosing a suitable number of similar projects used for computing recommendations, we get a satisfactory prediction accuracy, while still maintaining a reasonable computational complexity.
We suppose that for a training set with big projects, i.e., those that have a large number of declarations and APIs, resulting in a very large 3D tensor, it is necessary to apply matrix factorization techniques. This allows us to represent the tensor in a lower-dimensional latent space, so as to efficiently handle high-dimensional data as well as to increase the prediction performance.

The recommended snippet in Listing 2 is of high quality, as it matches the developer's context well. Nevertheless, there is no guarantee that such a good quality holds for all possible cases, as this depends a lot on the training data. Thus, we plan to implement a module that asks developers to rate and provide feedback on the given recommendations. By doing this, we would be able to collect information about the quality of a recommended snippet, which can then be used to reinforce the learning of FOCUS.

Through the results obtained for RQ2, we confirm the importance of computing similarity among OSS projects [22], [25]. FOCUS improves its prediction performance substantially when we incorporate more similar projects to compute the missing ratings. This implies that, given a project for which we would like to obtain recommendations, if we cannot find any similar projects, then it is not possible to recommend relevant API calls. This, along with the results obtained for RQ4, lets us conclude that FOCUS relies on the availability of enough training data from similar projects in order to provide relevant API calls as well as code snippets.
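To make the factorization idea mentioned at the beginning of this section concrete, a standard rank-R CP decomposition (a textbook formulation given here for illustration, not part of the current FOCUS implementation) would approximate the project-declaration-invocation tensor as:

\[
\mathcal{X} \;\approx\; \sum_{r=1}^{R} \lambda_r \, \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r,
\qquad\text{i.e.,}\qquad
x_{ijk} \;\approx\; \sum_{r=1}^{R} \lambda_r \, a_{ir}\, b_{jr}\, c_{kr},
\]

where \(\mathbf{a}_r\), \(\mathbf{b}_r\), and \(\mathbf{c}_r\) are latent factor vectors for projects, declarations, and invocations, respectively, and the rank \(R\) is much smaller than each tensor dimension; missing ratings would then be estimated from the reconstructed entries \(x_{ijk}\).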
The results of RQ4 also imply that it would become more difficult for FOCUS to recommend code samples when the context is not fully available or, worse, is missing. Under such circumstances, we can apply code synthesis approaches [59] to generate code for a location where there is no concrete context. Moreover, we believe that it is important to improve the way FOCUS computes the similarity among projects and declarations, for instance by optimizing the global mapping using the Hungarian algorithm [57]. We consider all these issues as future improvements for FOCUS.

The RQ3 results reveal two findings about FOCUS. First, FOCUS can provide good recommendations regardless of the categories used for training. Second, given a project in a category, we can find very similar projects in different categories. This outcome provides valuable insight into the meaning of the categories extracted from Google Play: the categories specified by developers provide a rough abstract description of an app, rather than an informative summary of what the app does. This essentially means that these categories do not have much to do with similarity in API usage.
Recently, attempts have been made to automatically assign a category to projects/apps [7], [43]. Among others, supervised learning techniques perform this computation by exploiting labeled data, e.g., the apps and their corresponding categories specified by developers. However, we suppose that one may fail when trying to classify apps according to API calls together with the categories specified in Google Play. Thus, FOCUS currently makes use of similarity techniques working at the API level, without considering any induced categorization. Nevertheless, for a huge amount of training data, we anticipate that the categorization of apps may help increase efficiency as follows. We can perform preprocessing steps to group apps that are similar in terms of API usage into the same cluster by means of unsupervised clustering techniques [28]. Given such clusters, every time there is an active app, FOCUS looks only into the cluster(s) containing similar apps, by computing similarity with some of the most representative apps in the clusters. In this way, we can considerably narrow down the search scope for the active app, thereby speeding up the search.

In our empirical study, we focused on Java and Kotlin projects. However, FOCUS can be used to recommend APIs and source code for projects written in other languages, such as PHP and C/C++, since Rascal also supports them [5]. Moreover, there are various reverse engineering tools that can extract declarations and invocations from source code. For example, the Eclipse JDT parser has been widely used to parse Java source code in related studies [54], [58]. Similarly, dotPeek uses a combination of debug information and web services to reconstruct C# code (see for instance https://bit.ly/3d3i2hY).

Internal validity.
Threats to internal validity are related to confounding factors, internal to our study, that could have influenced the results. One probable threat can be seen in the results obtained for the smaller dataset, consisting of 500 apps, in RQ1. As already mentioned, this dataset was used to compare FOCUS with UP-Miner and PAM due to the limited scalability of PAM. Moreover, we also deliberately made use of such a small dataset in RQ4 to study the extent to which FOCUS is able to recommend relevant source code. The intuition is that if it performs well on a limited amount of training data, it will also be effective on a larger one.

In the comparison between UP-Miner, PAM, and FOCUS, we used the implementation of PAM published online by its authors. Since the original implementation of UP-Miner is no longer available, we made use of the source code re-implemented by the authors of PAM. To mitigate the threats that may affect internal validity, we also evaluated the systems using exactly the same dataset and evaluation metrics. Furthermore, we ran several trials and counter-checks to validate the evaluation outcomes.

Concerning the user study, we (i) limited the extent to which results depend on personal skills by involving 16 Master's students with similar development background and experience; and (ii) did not disclose the goals of the experiment, to avoid hypothesis guessing. Another threat is that, to simplify the setting, developers used FOCUS recommendations as HTML pages instead of having them in the IDE. However, we evaluated the perceived usefulness of the recommendations, not of the tool.
External validity.
The main threat to external validity is that our proposed approach is currently limited to Java and Kotlin programs. As stated in Section 3, however, FOCUS makes few assumptions about the underlying language and only requires information about method declarations and invocations to build the 3D matrix. This information could be extracted from programs written in any object-oriented programming language, and we wish to generalize FOCUS to other languages in the future. Also, in the future FOCUS may benefit from an in-field evaluation in an industrial setting.
Construct validity.
The main threat to construct validity concerns the simulated setting used to evaluate the approaches, as opposed to performing a user study. We mitigated this threat by introducing four configurations that simulate different stages of the development process. In a real development setting, however, the order in which one writes statements might not fully reflect our simulation. Also, in a realistic usage setting, there may be cases in which an API recommender turns out to be more useful (e.g., when recommending API usages for which the developer has limited skills or knowledge), and cases (obvious code completion, or recommending usage scenarios for commonly-used APIs) where it is less useful. This threat has been mitigated with the user study, in which participants evaluated the recommendations provided by FOCUS.

In the user study, we evaluated the perceived usefulness of the recommendations, which may or may not correspond to the actual usefulness. The outcome of the performed task did not show any significant difference in terms of the correctness of the produced artifacts. This could possibly depend on the study setting (i.e., offline), in which participants had the time to properly implement the task and, possibly, search for plausible solutions and/or API documentation.
7 RELATED WORK
The adoption of recommender systems in software engineering aims at supporting developers in navigating large information spaces and getting instant recommendations, which can provide guidance for the particular development task at hand [44]. In this section, we present an overview of some representative recommender systems, focusing on those specifically conceived to support software development activities.
API usage pattern recommendation.
Acharya et al. [3] present a framework to extract API patterns as partial orders from client code. To this aim, control-flow-sensitive static API traces are extracted from source code and sequential patterns are computed. In contrast, our approach is able to recommend both a list of API calls and related source code.

Zhong et al. implemented MAPO, a tool to retrieve API usage patterns from client projects [58]. MAPO extracts API usages from source files and groups API methods into clusters. Afterwards, it mines API usage patterns from the clusters, ranks them according to their similarity with the current development context, and eventually suggests code snippets to developers. Similarly, UP-Miner [54] extracts API usage patterns by relying on SeqSim, a clustering strategy that reduces pattern redundancy and improves coverage. While these approaches are based on clustering techniques and consider all client projects in the mining, regardless of their similarity with the current project, FOCUS narrows down the search scope by looking into similar projects. In this work, we have seen that FOCUS clearly outperforms UP-Miner.

PAM (Probabilistic API Miner) mines API usage patterns based on a parameter-free probabilistic algorithm [11]. The tool uses the structural Expectation-Maximization (EM) algorithm to infer the most probable API patterns from client code. PAM outperforms both MAPO and UP-Miner (lower redundancy and higher precision). Through a comparison of FOCUS with PAM, we have seen that our approach obtains a better performance with respect to various metrics.

NCBUP-miner (Non Client-based Usage Patterns) [42] is a technique that identifies unordered API usage patterns from the API source code, based on both structural (methods that modify the same object) and semantic (methods that have the same vocabulary) relations. The same authors also propose MLUP [41], which is based on vector representation and clustering, but in this case client code is also considered.

XSnippet [40] suggests relevant code snippets starting from the developer's context. The system invokes different queries that consider both the parents of the class and the lexically visible types. Then, the queries computed in such a way are passed to a module that mines relevant paths by relying on a graph-based structure. The ranking module eventually ranks the obtained snippets by employing six different heuristics.

DeepAPI [15] is a deep-learning method used to generate API usage sequences given a query in natural language. The learning problem is encoded as a machine translation problem, where queries are considered the source language and API sequences the target language. Only commented methods are considered during the search. The same authors [14] present CODEnn (COde-Description Embedding Neural Network), where, instead of API sequences, code snippets are retrieved for the developer based on semantic aspects such as API sequences, comments, method names, and tokens.

With respect to the aforementioned approaches, FOCUS uses CF techniques to recommend API calls and usage patterns from a set of similar projects. In the end, not only are relevant API invocations recommended, but code snippets are also returned to the developer as usage examples.

API-related code search engines.
Strathcona [17] is an Eclipse plug-in that extracts the structural context of code and uses it as a query to request API usages from a remote repository. The system performs the match by employing six heuristics associated with class inheritance, method calls, and field types. In a similar fashion, the technique proposed by Buse and Weimer [6] synthesizes API examples for a given data type. An algorithm based on data-flow analysis, k-Medoids clustering, and pattern abstraction is designed. Its outcome is a set of syntactically correct and well-typed code snippets in which example length, exception handling, variable initialization and naming, and abstract uses are considered.

MUSE (Method USage Examples) is an approach proposed by Moreno et al. [23] for recommending code examples for a given API method. MUSE extracts API usages from client code, simplifies code examples with static slicing, and detects clones to group similar snippets. It also ranks examples according to certain properties, i.e., reusability, understandability, and popularity.

SWIM (Synthesizing What I Mean) [33] seeks API structured call sequences (control- and data-flows are considered) and then synthesizes API-related code snippets according to a query in natural language. The underlying learning model is also built with the EM algorithm. Similarly, Raychev et al. [35] propose a code completion approach based on natural language processing, which receives as input a partial program and outputs a set of API call sequences filling the gaps of the input. Both invocations and invocation arguments are synthesized considering multiple types of an API.

Thummalapenta and Xie propose SpotWeb [48], an approach that provides starting points (hotspots) for understanding a framework, and highlights where finding examples could be more challenging (coldspots). Other tools exploit StackOverflow discussions to suggest code snippets and documentation [9], [31], [32], [34], [36], [47], [50].
8 CONCLUSIONS
We presented FOCUS, a recommender system to provide developers with suitable API function calls and code snippets while they are programming. A thorough evaluation has been conducted (i) on an Android dataset to study the approach's performance, and (ii) in a user study with 16 participants to assess the perceived usefulness of FOCUS recommendations. We succeeded in integrating FOCUS into the Eclipse IDE, and we made the developed tool together with the parsed metadata available online [27]. This aims at providing the research community at large with a sound replication package, which allows one to seamlessly reproduce the experiments presented in our paper.

Future research in this area includes (i) replicating the empirical evaluation on further projects, possibly also supporting further programming languages, and (ii) updating the code base of the Eclipse Scava project (which embraces all the development outcomes produced in the context of the EU CROSSMINER project) with the FOCUS tool as presented in this paper.

ACKNOWLEDGMENTS
The research described has been carried out as part of the CROSSMINER Project, which has received funding from the European Union's Horizon 2020 Research and Innovation Programme under grant agreement No. 732223. We thank Gian Luca Scoccia for providing us with the tool to extract APK files from Apkpure. We also thank the students who kindly participated in the user study, despite difficulties caused by the unprecedented pandemic. Finally, we are grateful to the anonymous reviewers for their valuable comments and suggestions that helped us improve the paper.

REFERENCES

[1] "dex2jar," library catalog: tools.kali.org. [Online]. Available: https://tools.kali.org/reverse-engineering/dex2jar
[2] R. Aarssen, "cwi-swat/clair: v0.1.0," Sep. 2017. [Online]. Available: https://doi.org/10.5281/zenodo.891122
[3] M. Acharya, T. Xie, J. Pei, and J. Xu, "Mining API Patterns As Partial Orders from Source Code: From Usage Scenarios to Specifications." New York: ACM, 2007, pp. 25-34.
[4] B. Basten, M. Hills, P. Klint, D. Landman, A. Shahi, M. J. Steindorfer, and J. J. Vinju, "M3: A General Model for Code Analytics in Rascal." Piscataway: IEEE, 2015, pp. 25-28.
[5] B. Basten, J. van den Bos, M. Hills, P. Klint, A. Lankamp, B. Lisser, A. van der Ploeg, T. van der Storm, and J. Vinju, "Modular language implementation in Rascal - experience report," Science of Computer Programming, vol. 114, pp. 7-19, 2015, LDTA (Language Descriptions, Tools, and Applications) Tool Challenge.
[6] R. P. L. Buse and W. Weimer, "Synthesizing API Usage Examples." Piscataway: IEEE, 2012, pp. 782-792.
[7] A. Capiluppi, D. Di Ruscio, J. Di Rocco, P. T. Nguyen, and N. Ajienka, "Detecting Java software similarities by using different clustering techniques," Information and Software Technology, vol. 122, p. 106279, 2020.
[8] A. Chen, "Context-Aware Collaborative Filtering System: Predicting the User's Preference in the Ubiquitous Computing Environment," in First International Conference on Location- and Context-Awareness. Berlin, Heidelberg: Springer, 2005, pp. 244-253.
[9] J. Cordeiro, B. Antunes, and P. Gomes, "Context-Based Recommendation to Support Problem Solving in Software Development," in Third International Workshop on Recommendation Systems for Software Engineering. Piscataway: IEEE, 2012, pp. 85-89.
[10] M. Fazzini, Q. Xin, and A. Orso, "Automated API-usage update for Android apps," in Proceedings of the 28th ISSTA, ser. ISSTA 2019. New York, NY, USA: Association for Computing Machinery, 2019, pp. 204-215.
[11] J. Fowkes and C. Sutton, "Parameter-free Probabilistic API Mining Across GitHub." New York: ACM, 2016, pp. 254-265.
[12] F. Geiger, I. Malavolta, L. Pascarella, F. Palomba, D. Di Nucci, and A. Bacchelli, "A graph-based dataset of commit history of real-world Android apps," 2018, pp. 30-33.
[13] R. J. Grissom and J. J. Kim, Effect Sizes for Research: A Broad Practical Approach, 2nd ed. Lawrence Earlbaum Associates, 2005.
[14] X. Gu, H. Zhang, and S. Kim, "Deep Code Search." New York: ACM, 2018, pp. 933-944.
[15] X. Gu, H. Zhang, D. Zhang, and S. Kim, "Deep API Learning." New York: ACM, 2016, pp. 631-642.
[16] M. Hills and P. Klint, "PHP AiR: Analyzing PHP systems with Rascal." IEEE, 2014, pp. 454-457.
[17] R. Holmes and G. C. Murphy, "Using Structural Context to Recommend Source Code Examples." New York: ACM, 2005, pp. 117-125.
[18] H. Jang, B. Jin, S. Hyun, and H. Kim, "Kerberoid: A practical Android app decompilation system with multiple decompilers," in Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, ser. CCS '19. New York, NY, USA: Association for Computing Machinery, 2019, pp. 2557-2559.
[19] R. Koch, The 80/20 Principle: The Secret of Achieving More with Less, ser. A Currency book. Doubleday, 1999.
[20] R. Kohavi, "A Study of Cross-validation and Bootstrap for Accuracy Estimation and Model Selection." San Francisco: Morgan Kaufmann Publishers Inc., 1995, pp. 1137-1143.
[21] V. Levenshtein, "Binary Codes Capable of Correcting Deletions, Insertions and Reversals," Soviet Physics Doklady, vol. 10, p. 707, 1966.
[22] C. McMillan, M. Grechanik, and D. Poshyvanyk, "Detecting similar software applications," in Proceedings of the 34th International Conference on Software Engineering, ser. ICSE '12. Piscataway, NJ, USA: IEEE Press, 2012, pp. 364-374.
[23] L. Moreno, G. Bavota, M. Di Penta, R. Oliveto, and A. Marcus, "How Can I Use This Method?" Piscataway: IEEE, 2015, pp. 880-890.
[24] S. M. Nasehi, J. Sillito, F. Maurer, and C. Burns, "What Makes a Good Code Example?: A Study of Programming Q&A in StackOverflow." Piscataway: IEEE, 2012, pp. 25-34.
[25] P. T. Nguyen, J. Di Rocco, D. Di Ruscio, and M. Di Penta, "CrossRec: Supporting Software Developers by Recommending Third-party Libraries," Journal of Systems and Software, p. 110460, 2019.
[26] P. T. Nguyen, J. Di Rocco, D. Di Ruscio, L. Ochoa, T. Degueule, and M. Di Penta, "FOCUS: A Recommender System for Mining API Function Calls and Usage Patterns," in Proceedings of the 41st International Conference on Software Engineering, ser. ICSE '19. Piscataway, NJ, USA: IEEE Press, 2019, pp. 1050-1060.
[27] P. T. Nguyen, J. Di Rocco, C. Di Sipio, D. Di Ruscio, and M. Di Penta, "TSE FOCUS replication package," Jan. 2021. [Online]. Available: https://doi.org/10.5281/zenodo.4415618
[28] P. T. Nguyen, K. Eckert, A. Ragone, and T. Di Noia, "Modification to K-Medoids and CLARA for Effective Document Clustering," in Foundations of Intelligent Systems. Cham: Springer International Publishing, 2017, pp. 481-491.
[29] A. N. Oppenheim, Questionnaire Design, Interviewing and Attitude Measurement. Pinter Publishers, 1992.
[30] D. L. Parnas, "Information Distribution Aspects of Design Methodology," Department of Computer Science, Carnegie Mellon University, Pittsburgh, Tech. Rep., 1971.
[31] L. Ponzanelli, G. Bavota, M. Di Penta, R. Oliveto, and M. Lanza, "Mining StackOverflow to Turn the IDE into a Self-confident Programming Prompter." New York: ACM, 2014, pp. 102-111.
[32] L. Ponzanelli, S. Scalabrino, G. Bavota, A. Mocci, R. Oliveto, M. Di Penta, and M. Lanza, "Supporting Software Developers with a Holistic Recommender System." Piscataway: IEEE, 2017, pp. 94-105.
[33] M. Raghothaman, Y. Wei, and Y. Hamadi, "SWIM: Synthesizing What I Mean: Code Search and Idiomatic Snippet Synthesis." New York: ACM, 2016, pp. 357-367.
[34] M. Rahman, S. Yeasmin, and C. Roy, "Towards a Context-Aware IDE-Based Meta Search Engine for Recommendation about Programming Errors and Exceptions," in Conference on Software Maintenance, Reengineering, and Reverse Engineering. Piscataway: IEEE, 2014, pp. 194-203.
[35] V. Raychev, M. Vechev, and E. Yahav, "Code Completion with Statistical Language Models." New York: ACM, 2014, pp. 419-428.
[36] P. C. Rigby and M. P. Robillard, "Discovering Essential Code Elements in Informal Documentation." Piscataway: IEEE, 2013, pp. 832-841.
[37] M. P. Robillard, "What Makes APIs Hard to Learn? Answers from Developers," IEEE Software, vol. 26, no. 6, pp. 27-34, 2009.
[38] M. P. Robillard, E. Bodden, D. Kawrykow, M. Mezini, and T. Ratchford, "Automated API Property Inference Techniques," IEEE Transactions on Software Engineering, vol. 39, no. 5, pp. 613-637, 2013.
[39] I. J. M. Ruiz, M. Nagappan, B. Adams, and A. E. Hassan, "Understanding reuse in the Android Market." Passau, Germany: IEEE, Jun. 2012, pp. 113-122.
[40] N. Sahavechaphan and K. Claypool, "XSnippet: Mining for sample code," SIGPLAN Not., vol. 41, no. 10, pp. 413-430, Oct. 2006.
[41] M. A. Saied, O. Benomar, H. Abdeen, and H. Sahraoui, "Mining Multi-level API Usage Patterns." Piscataway: IEEE, 2015, pp. 23-32.
[42] M. A. Saied, H. Abdeen, O. Benomar, and H. Sahraoui, "Could We Infer Unordered API Usage Patterns Only Using the Library Source Code?" Piscataway: IEEE, 2015, pp. 71-81.
[43] B. Sanz, I. Santos, C. Laorden, X. Ugarte-Pedrero, and P. G. Bringas, "On the automatic categorisation of Android applications," 2012, pp. 149-153.
[44] J. B. Schafer, D. Frankowski, J. Herlocker, and S. Sen, Collaborative Filtering Recommender Systems. Berlin, Heidelberg: Springer Berlin Heidelberg, 2007, pp. 291-324.
[45] ——, "The Adaptive Web: Methods and Strategies of Web Personalization," P. Brusilovsky, A. Kobsa, and W. Nejdl, Eds. Berlin, Heidelberg: Springer, 2007, ch. Collaborative Filtering Recommender Systems, pp. 291-324.
[46] G. L. Scoccia, S. Ruberto, I. Malavolta, M. Autili, and P. Inverardi, "An investigation into Android run-time permissions from the end users' perspective," 2018, pp. 45-55.
[47] W. Takuya and H. Masuhara, "A Spontaneous Code Recommendation Tool Based on Associative Search." New York: ACM, 2011, pp. 17-20.
[48] S. Thummalapenta and T. Xie, "SpotWeb: Detecting Framework Hotspots and Coldspots via Mining Open Source Code on the Web." Washington: IEEE, 2008, pp. 327-336.
[49] S. Thummalapenta, L. Cerulo, L. Aversano, and M. Di Penta, "An empirical study on the maintenance of source code clones," Empirical Softw. Engg., vol. 15, no. 1, pp. 1-34, Feb. 2010.
[50] C. Treude and M. P. Robillard, "Augmenting API Documentation with Insights from Stack Overflow." New York: ACM, 2016, pp. 392-403.
[51] G. Uddin and M. P. Robillard, "How API Documentation Fails," IEEE Software, vol. 32, no. 4, pp. 68-75, 2015.
[52] S. Vargas and P. Castells, "Improving sales diversity by recommending users to items," in Proceedings of the 8th ACM Conference on Recommender Systems, ser. RecSys '14. New York, NY, USA: Association for Computing Machinery, 2014, pp. 145-152.
[53] N. Viennot, E. Garcia, and J. Nieh, "A measurement study of Google Play," in The 2014 ACM International Conference on Measurement and Modeling of Computer Systems - SIGMETRICS '14. Austin, Texas, USA: ACM Press, 2014, pp. 221-233.
[54] J. Wang, Y. Dang, H. Zhang, K. Chen, T. Xie, and D. Zhang, "Mining Succinct and High-coverage API Usage Patterns from Source Code." Piscataway: IEEE, 2013, pp. 319-328.
[55] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics Bulletin, vol. 1, no. 6, pp. 80-83, 1945.
[56] T.-T. Wong, "Performance Evaluation of Classification Algorithms by K-fold and Leave-one-out Cross Validation," Pattern Recognition, vol. 48, no. 9, pp. 2839-2846, 2015.
[57] H. Zhong and N. Meng, "Towards reusing hints from past fixes: An exploratory study on thousands of real samples," in Proceedings of the 40th International Conference on Software Engineering, ser. ICSE '18. New York, NY, USA: Association for Computing Machinery, 2018, p. 885. [Online]. Available: https://doi.org/10.1145/3180155.3182550
[58] H. Zhong, T. Xie, L. Zhang, J. Pei, and H. Mei, "MAPO: Mining and Recommending API Usage Patterns." Berlin, Heidelberg: Springer, 2009, pp. 318-343.
[59] S. Zhou, B. Shen, and H. Zhong, "Lancer: Your code tell me what you need," in 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2019.