Mining API Usage Scenarios from Stack Overflow
Gias Uddin, Foutse Khomh, and Chanchal K Roy
McGill University, Polytechnique Montreal, and University of Saskatchewan, Canada.
Abstract
Context:
APIs play a central role in software development. The seminal research of Carroll et al. [15] on minimal manuals and subsequent studies by Shull et al. [79] showed that developers prefer task-based API documentation instead of traditional hierarchical official documentation (e.g., Javadoc). The Q&A format in Stack Overflow offers developers an interface to ask and answer questions related to their development tasks.
Objective:
With a view to producing API documentation, we study automated techniques to mine API usage scenarios from Stack Overflow.
Method:
We propose a framework to mine API usage scenarios from Stack Overflow. Each usage scenario consists of a code example, the task description, and the reactions of developers towards the code example. First, we present an algorithm to automatically link a code example in a forum post to an API mentioned in the textual contents of the forum post. Second, we generate a natural language description of the task by summarizing the discussions around the code example. Third, we automatically associate developers' reactions (i.e., positive and negative opinions) towards the code example to offer information about code quality.
Results:
We evaluate the algorithms using three benchmarks and compare them against seven baselines. Our algorithms outperformed each baseline. We developed an online tool, Opiner, by automatically mining API usage scenarios from Stack Overflow. A user study of 31 software developers shows that the participants preferred the usage scenarios mined in Opiner over the API official documentation. The tool is available online at: http://opiner.polymtl.ca/.
Conclusion:
With a view to producing API documentation, we propose a framework to automatically mine API usage scenarios from Stack Overflow, supported by three novel algorithms. We evaluated the algorithms against a total of eight state-of-the-art baselines. We implemented and deployed the framework in our proof-of-concept online tool, Opiner.
Keywords:
API, Mining, Usage, Documentation.
1. Introduction
In 1987, the seminal research of Carroll et al. [15] introduced the 'minimal manual' by advocating the redesign of traditional documentation around tasks, i.e., describing the software components within the contexts of development tasks. They observed that developers are more productive while using those manuals. Since then, this format has proven to work better than traditional API documentation [8, 77, 50]. APIs (Application Programming Interfaces) offer interfaces to reusable software components. In 2000, Shull et al. [79] compared traditional hierarchical API documentation (e.g., Javadocs) against example-based documentation, each example corresponding to a development task. They observed that the participants quickly moved to task-based documentation to complete their development tasks. However, the task-based documentation format is still not adopted in API official documentation (e.g., Javadocs).

Indeed, despite developers' reliance on API official documentation as a major resource for learning and using APIs [68], the documentation can often be incomplete, incorrect, and not usable [93]. This observation leads to the question of how we can improve API documentation if the only people who can accomplish this task are unavailable to do it. One potential way is to produce API documentation by leveraging the crowd [84], such as mining API usage scenarios from online Q&A forums where developers discuss how they can complete development tasks using APIs. Although these kinds of solutions do not have the benefit of authoritativeness, recent research shows that developers leverage the reviews about APIs to determine how and whether an API can be selected and used, as well as whether a provided code example is good enough for the task for which it was given [90, 89, 45]. Thus, the combination of API reviews and code examples posted in forum posts may constitute an acceptable expedient in cases of rapid evolution or depleted development resources, offering ingredients for on-demand task-centric API documentation [75].

In this paper, with a view to assisting in the automatic generation of task-based API documentation, we propose to automatically mine code examples
associated to different APIs and their relevant task-based usage discussions from Stack Overflow.

Figure 1: Our API usage scenario mining framework from Stack Overflow with the three proposed algorithms. Algorithm 1 links a code example to the API name mentioned in the post text about which the code example is discussed; Algorithm 2 generates a natural language description of the task (i.e., the problem and the solution addressed by the code example) by analyzing the post texts; Algorithm 3 informs developers of the quality and specific potential usage constraints of the code example associated to an API, based on the expert comments of other developers. Together they produce the mined API usage scenarios.

We propose an automated mining framework that can be leveraged to automatically mine API usage scenarios from Stack Overflow. To effectively mine API usage scenarios from Stack Overflow with high performance, we have designed and developed three algorithms within our proposed framework. In Figure 1, we offer an overview of the three algorithms and show how they are used in sequence to automatically mine API usage scenarios from Stack Overflow.

• Algorithm 1. Associate Code Examples to API Mentions.
A code snippet is provided in a forum post to complete a development task. Given a code snippet found in a forum post, we first need to link the snippet to the API about which the snippet is provided. Consider the two snippets presented in Figure 2. Both snippets use multiple types and methods from the java.util API. In addition, the first snippet uses the java.lang API. However, both snippets are related to the conversion of JSON data to a JSON object. As such, the two snippets introduce two open source Java APIs to complete the task (Google GSON in snippet 1 and org.json in snippet 2). The state-of-the-art traceability techniques to link code examples in forum posts [84, 19, 67] will link the scenarios to both the utility APIs (i.e., java.util, java.lang) and the open source APIs. For example, the techniques will link the first scenario to all three APIs (java.util, java.lang, and GSON), even though the scenario is actually provided to discuss the usage of the GSON API. This focus is easier to understand when we look at the textual contents that describe the usage scenario.

Our algorithm links a code example to an API mentioned in the textual contents of the forum post. For example, we link the first snippet in Figure 2 to the API GSON and the second to the API org.json. We do this by observing that both GSON and org.json are mentioned in the textual contents of the post, and that the code examples consist of classes and methods from the two APIs, respectively. We adopt the definition of an API as originally proposed by Martin Fowler, i.e., a "set of rules and specifications that a software program can follow to access and make use of the services and resources provided by its one or more modules" [98]. This definition allows us to consider a Java package as an API. For example, in Figure 2, we consider the following as APIs: 1. Google GSON, 2. Jackson, 3. org.json, 4. java.util, and 5. java.lang. Each API package thus can contain a number of modules and elements (e.g., classes, methods, etc.). This abstraction is also consistent with the Java official documentation. For example, the java.time packages are denoted as the Java date APIs in the new Java SE official tutorial [59]. As we observe in Figure 2, this is also how APIs can be mentioned in online forum posts.

• Algorithm 2. Generate Textual Task Description.
Given that each code snippet is provided to complete a development task, a textual description of the task, as provided in the forum posts, is necessary to learn about the task as well as the underlying contexts (e.g., a specific API version). To offer task-based documentation for a given code snippet that is linked to an API, we made two design decisions: 1. Title. We associate each code example with the title of the question, e.g., the title of a thread in Stack Overflow. 2. Description. We associate relevant texts from both the answer (where the code example is found) and the question posts. For example, in Figure 2, the first sentence ("check website . . . ") is not important to learn about the task (i.e., JSON parsing). However, for the first snippet, all the other sentences before snippet 1 are necessary to learn about the solution (because they are all related to the API GSON that is linked to snippet 1). In addition, the problem description addressed by the task can be found in the question.
Figure 2: How API usage scenarios are discussed in Stack Overflow.

Question title: How to convert JSON data to JSON object

Question: Check Java JSON website for competing APIs, such as Jackson, Gson, org.json.

Answer: Google Gson supports generics and nested beans that should map to a Java collection such as List. It's pretty simple! You have a JSON object with several properties of which the groups property represents an array of nested objects of the very same type. This can be parsed with Gson the following way:

Code example 1:
import java.util.*;
import java.lang.reflect.Type;
import com.google.gson.Gson;
import com.google.gson.reflect.TypeToken;
class Data {
    private String title;
    private long id;
    private List groups;
}
Type listType = new TypeToken

If you don't need object de-serialization but to simply get an attribute, you can try org.json.

Code example 2:
import java.util.*;
JSONObject obj = new JSONObject(jsonString);
System.out.println(obj.toString());

Comments:
C1. The code is buggy. In the new version of GSON, TypeToken is not public, hence you will get a constructor error.
C2. Using the actual version of GSON (2.2.4) it works perfectly!
C3. I found org.json a bit buggy when converting a Json Array.
C4. The code using org.json worked for me flawlessly!
C5. I would recommend using the Jackson API.
C6. Thank you!

• Algorithm 3. Associate Reactions to a Code Example.
As noted before, reviews about APIs can be useful to learn about specific nuances and usage of the provided code examples [90, 89]. Consider the reactions in the comments in Figure 2. Out of the six comments, two (C1, C2) are associated with the first scenario and two others (C3, C4) with the second scenario. The first comment (C1) complains that the provided scenario is buggy in the newer version of the GSON API. The second comment (C2) confirms that the usage scenario is only valid for GSON version 2.2.4. The third comment (C3) complains that the conversion of a JsonArray using the org.json API is a bit buggy, but the next comment (C4) confirms that scenario 2 (i.e., the one related to the org.json API) works flawlessly. Given a code example, our proposed algorithm associates relevant reactions based on heuristics, such as mentions of the linked API in a reaction (e.g., in Figure 2, C1 mentions the API GSON, which is linked to code snippet 1).

We evaluated the algorithms using three benchmarks that we created based on inputs from a total of six different human coders. The first benchmark consists of 730 code examples from Stack Overflow forum posts, each manually associated with an API mentioned in the post where the code example was found. We use the first benchmark to evaluate our Algorithm 1, i.e., associate code examples to API mentions. A total of three coders participated in the benchmark creation process. We use the second benchmark to evaluate our proposed Algorithm 2, i.e., generate the textual task description addressed by a code example in Stack Overflow. The second benchmark consists of 216 code examples out of the 730 code examples that we used for the first benchmark.
Figure 3: The major components of our API usage scenario mining framework.

The 216 code examples were found in answer posts in Stack Overflow. The natural language summary of each of the 216 code examples was manually created based on consultations with two human coders. We use the third benchmark to evaluate our Algorithm 3, i.e., associate positive and negative reactions to a code example. The third benchmark was created by manually associating all the reactions to each of the 216 code examples that we used for the second benchmark. A total of three human coders participated in the benchmark creation process. The first author was the first coder for all three benchmarks.

We observed precisions of 0.96, 0.96, and 0.89 and recalls of 1.0, 0.98, and 0.94 for the linking of a code example to an API mention, the produced summaries, and the association of reactions to the code examples, respectively. We compared the algorithms against seven state-of-the-art baselines. Our algorithms outperformed all the baselines. We deployed the algorithms in our online tool to mine task-based documentation from Stack Overflow. We evaluated the effectiveness of the tool by conducting a user study with 31 developers, each of whom completed four coding tasks using our tool, the API official documentation, Stack Overflow, and a search engine. The developers wrote more correct code in less time and with less effort using our tool.
2. The Mining Framework
We designed our framework to mine task-based API documentation by analyzing Stack Overflow, a popular forum to discuss API usage. The framework takes as input a forum post and outputs the usage scenarios found in the post. For example, given as input the forum post in Figure 2, the framework returns two task-based API usage scenarios: (1) code example 1, by associating it to the API Google GSON, with the two comments (C1, C2) as reactions and a description of the code example in natural language to inform of the specific development task addressed by the code example; and (2) code example 2, by associating it to the API org.json, with the two comments (C3, C4) as reactions and a summary description.

Our framework consists of five major components (Figure 3):
1. An API database to identify the API mentions.
2. A suite of Parsers to preprocess the forum post contents.
3. A Linker to associate a code example to an API mention.
4. A Generator to produce a textual task description.
5. An Associator to find reactions towards code examples.
An API database is required to infer the association between a code example and an API mentioned in the forum post text. Our database consists of open source and official Java APIs. An open-source API is identified by a name. An API consists of one or more modules. Each module can have one or more packages. Each package contains code elements (classes, methods). As noted in Section 1, we consider an official Java package as an API. For each API, we record the following meta-information: (1) the name of the API, (2) the dependency of the API on other APIs, (3) the names of the modules of the API, (4) the package names under each module, (5) the type names under each package, and (6) the method names under each type. The last three items (package, type, and method names) can be collected from either the binary file of an API (e.g., a jar) or the Javadoc of the API. We obtained the first three items from the pom.xml files of the open-source APIs hosted in the online Maven Central repository. Maven Central is the primary source for hosting and searching for Java APIs, with over 70 million downloads every week [22].
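To make the stored meta-information concrete, the following minimal sketch models one record of such an API database. The class and field names are our own illustration (not the actual Opiner schema); the six fields mirror the six items listed above.

import java.util.List;
import java.util.Map;

// A minimal sketch of one API record in the database (assumed names,
// not the actual Opiner schema).
final class ApiRecord {
    String name;                        // (1) API name, e.g., "com.google.code.gson"
    List<String> dependencies;          // (2) APIs this API depends on (from pom.xml)
    List<String> modules;               // (3) module names, e.g., "gson"
    Map<String, List<String>> packages; // (4) module name -> package names
    Map<String, List<String>> types;    // (5) package name -> type names
    Map<String, List<String>> methods;  // (6) type name -> method names
}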
Given as input a forum post, we preprocess its content as follows: (1) We categorize the post content into two types: (a) code snippets, and (b) sentences in the natural language text. We detect code snippets as the tokens wrapped in the <code> tag. (2) Following Dagenais and Robillard [19], we discard the following invalid code examples based on language-specific naming conventions: (a) non-code snippets (e.g., XML), and (b) non-Java snippets (e.g., JavaScript). We consider the rest of the code examples as valid.

Figure 4: A popular scenario with a syntax error (Line 1) [60].

• Hybrid Code Parser.
We parse each valid code snippet using a hybrid parser combining ANTLR [66] and an Island Parser [55]. We observed that code examples in forum posts can contain syntax errors, which an ANTLR parser is not designed to parse. However, such errors can be minor and the code example can still be useful. Consider the code example in Figure 4. An ANTLR Java parser fails at line 1 and stops there. However, the post was still considered helpful by others (upvoted 11 times). Our hybrid parser works as follows: 1. We split the code example into individual lines. For this paper, we focused only on Java code examples. Therefore, we use the semi-colon as the line separator indicator. 2. We parse each line using the ANTLR parser by feeding it the Java grammar provided by the ANTLR package. If the ANTLR parser throws an exception citing a parsing error, we use our Island Parser.
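The sketch below illustrates this parse-then-fall-back control flow. It assumes ANTLR-generated JavaLexer/JavaParser classes for the Java grammar, plus stand-in IslandParser and CodeLine types; none of these names come from the actual Opiner implementation.

import java.util.ArrayList;
import java.util.List;
import org.antlr.v4.runtime.BailErrorStrategy;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.misc.ParseCancellationException;

final class HybridParser {
    private final IslandParser islandParser = new IslandParser(); // assumed lenient parser

    List<CodeLine> parse(String codeExample) {
        List<CodeLine> parsed = new ArrayList<>();
        // Step 1: Java statements end with semicolons, so split on ';'.
        for (String line : codeExample.split(";")) {
            try {
                // Step 2: try a strict ANTLR parse of the single line.
                JavaLexer lexer = new JavaLexer(CharStreams.fromString(line + ";"));
                JavaParser parser = new JavaParser(new CommonTokenStream(lexer));
                parser.setErrorHandler(new BailErrorStrategy()); // throw on syntax errors
                parsed.add(CodeLine.fromParseTree(parser.statement()));
            } catch (ParseCancellationException e) {
                // ANTLR rejects the (possibly erroneous) line: the island parser
                // still extracts recognizable fragments (types, calls) and skips the rest.
                parsed.add(islandParser.parseLeniently(line));
            }
        }
        return parsed;
    }
}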
We identify API elements (types and methods) in a code example in three steps.

1. Detect API Elements: We detect API elements using Java naming conventions, similar to previous approaches (e.g., camel case for class names) [19, 73]. We collect types that are not declared by the user. Consider the first code example in Figure 2. We add Type, Gson, and TypeToken, but not Data, because it was declared in the same post: class Data.
2. Infer Code Types From Variables: An object instance of a code type declared in another post can be used without any explicit mention of the code type. For example, consider: Wrapper wrapper = mapper.readValue(jsonStr, Wrapper.class). We associate the mapper object to the ObjectMapper type, because it was defined in another post of the same thread as: ObjectMapper mapper = new ObjectMapper().
3. Generate Fully Qualified Names (FQNs): For each valid type detected in the parsing, we attempt to get its fully qualified name by associating it to an import statement in the same code example. Consider the following example:

import com.restfb.json.JsonObject;
JsonObject json = new JsonObject(jsonString);

We associate JsonObject to com.restfb.json.JsonObject. We leverage both the fully and the partially qualified names in our algorithm to associate code examples to API mentions.

Figure 5: The components to link a scenario to an API mention (API mentions are detected in the post text, then associated to code examples via proximity-based and probabilistic learning, using the type, method, and dependency filters and the API database).

Figure 6: Partial Mention Candidate List of GSON in Figure 2. The exact match is the 'gson' module of the API com.google.code.gson; fuzzy matches include org.immutables (module gson), org.easygson (module easy-gson), and org.nd4j (module nd4j-gson), among others.

Given as input a code example in a forum post, we associate it to an API mentioned in the post in two steps (Figure 5):
Step 1. Detect API Mentions
We detect API mentions in the textual contents of forum posts following Uddin and Robillard [94]. Therefore, each API mention in our case is a token (or a series of tokens) that matches at least one API or module name. Similar to [94], we apply both exact and fuzzy matching. For example, for the API mention 'Gson' in Figure 2, an exact match would be the 'gson' module in the API 'com.google.code.gson' and a fuzzy match would be the 'org.easygson' API. For each such API mention, we produce a Mention Candidate List (MCL) by creating a list of all exact and fuzzy matches. For example, in Figure 6, we show a partial Mention Candidate List for the mention 'gson'. Each rectangle denotes an API candidate with its name at the top and one or more module names at the bottom (if module names matched).

For each code example, we create three buckets of API mentions:
(1) Same post, before (B_b): each mention found in the same post, but before the code snippet. (2) Same post, after (B_a): each mention found in the same post, but after the code snippet. (3) Same thread (B_t): all the mentions found in the title and in the question. Each mention is accompanied by a Mention Candidate List, i.e., a list of APIs from our database.
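As an illustration, the sketch below builds a Mention Candidate List for one detected mention against the API database. The matching is simplified: exact equality for exact matches and substring containment as a stand-in for the fuzzy matching of [94]; the ApiRecord type reuses the database sketch shown earlier.

import java.util.ArrayList;
import java.util.List;

final class MentionResolver {
    // Build the Mention Candidate List (MCL) for one API mention.
    List<ApiRecord> buildMcl(String mention, List<ApiRecord> database) {
        String m = mention.toLowerCase();
        List<ApiRecord> mcl = new ArrayList<>();
        for (ApiRecord api : database) {
            boolean exact = api.name.equalsIgnoreCase(mention)
                    || api.modules.stream().anyMatch(mod -> mod.equalsIgnoreCase(mention));
            // Simplified fuzzy match: the mention occurs inside an API or module
            // name, e.g., 'gson' inside 'org.easygson' (cf. Figure 6).
            boolean fuzzy = api.name.toLowerCase().contains(m)
                    || api.modules.stream().anyMatch(mod -> mod.toLowerCase().contains(m));
            if (exact || fuzzy) {
                mcl.add(api);
            }
        }
        return mcl;
    }
}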
Step 2. Associate Code Examples to API Mentions

We associate a code example in a forum post to an API mention by learning how API elements in the code example may be connected to a candidate API in the mention candidate lists of the API mentions. We call this proximity-based learning, because we start to match with the API mentions that are closer to the code example in the forum post before considering the API mentions that are further away. For well-known APIs, we observed that developers sometimes do not mention any API name in the forum texts. In such cases, we apply probabilistic learning, by assigning the code snippet to the API that is most likely to be discussed in the snippet based on the observations in other posts.

• Proximity-Based Learning uses Algorithm 1 to associate a code example to an API mention. The algorithm takes as input two items: 1. the code example C, and 2. the API mentions in the three buckets: before the code example in the post (B_b), after the code example in the post (B_a), and in the question post of the same thread (B_t). The output from the algorithm is an association decision as a tuple (d_mention, d_api), where d_mention is the API mention as found in the forum text (e.g., GSON for the first code example in Figure 2) and d_api is the name of the API in the mention candidate list of the API mention that is used in the code example (e.g., com.google.code.gson for the first code example in Figure 2). The algorithm uses three filters (L1, discussed below).

input: (1) Code Example C = (T, E), (2) API Mentions in buckets B = (B_b, B_a, B_t)
output: Association decision, D = {d_mention, d_api}
1   Proximity Filters F = [F_type, F_method, F_dep];
2   D = ∅, N = length(B), K = length(F);
3   for i ← 1 to N do
4       B_i = B[i], H = getMentionApiTuples(B_i);
5       for k ← 1 to K do
6           Filter F_k = F[k], H = getHits(F_k, C, H);
7           if |H| = 1 then D = H[1]; break;
8   procedure getMentionApiTuples(B)
9       List<MentionAPI> M = ∅;
10      foreach Mention m ∈ B do
11          MCL = {a_1, a_2, . . . , a_n}   ▷ MCL of m
12          foreach API a_i ∈ MCL do
13              MentionAPI ma = {m, a_i};
14              M.add(ma); return M;
15  procedure getHits(F_k, C, H)
16      S = ∅;
17      for i ← 1 to length(H) do
18          S[i] = compute score of H[i] for C using F_k;
19      if max(S) = 0 then return H;
20      else
21          H_new = ∅;
22          for i ← 1 to length(H) do
23              if S[i] = max(S) then H_new.add(H[i]);
24          return H_new
25  return D
Algorithm 1: Associate a code example to an API mention.

Each filter takes as input a list of tuples of the form (mention, candidate API). The output from the filter is the set of tuples that are ranked the highest based on the filter. The higher the ranking of a tuple, the more likely it is associated to the code example based on the filter. For each mention bucket (starting with B_b, then B_a, followed by B_t), we first create a list of tuples H using getMentionApiTuples (L4, L8-14). Each tuple is a pair of an API mention and a candidate API. We apply the three filters on this list of tuples. Each filter produces a list of hits (L6) using the getHits procedure (L15-24). The output from a filter is passed as input to the next filter, following the principle of deductive learning [84]. If the list of hits has only one tuple, the algorithm stops and the tuple is returned as the association decision (L7).
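The deductive chaining of the filters can be pictured as follows; this is a sketch of the getHits logic only, with assumed CodeExample and MentionApi types, not the actual implementation.

import java.util.ArrayList;
import java.util.List;

interface ProximityFilter {
    double score(CodeExample c, MentionApi candidate); // F_type, F_method, or F_dep
}

final class FilterCascade {
    // Keep only the highest-scoring tuples; pass the list through unchanged
    // when the filter cannot discriminate (all scores are 0), as in L19.
    List<MentionApi> getHits(ProximityFilter f, CodeExample c, List<MentionApi> hits) {
        double max = 0.0;
        for (MentionApi h : hits) max = Math.max(max, f.score(c, h));
        if (max == 0.0) return hits;
        List<MentionApi> best = new ArrayList<>();
        for (MentionApi h : hits) {
            if (f.score(c, h) == max) best.add(h);
        }
        return best;
    }
}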
F1. Type Filter. For each code type (e.g., a class) in the code example, we search for its occurrence in the candidate APIs from the Mention Candidate List. We compute the type similarity between a snippet s_i and a candidate c_i as follows:

\[ \text{Type Similarity} = \frac{|\text{Types}(s_i) \cap \text{Types}(c_i)|}{|\text{Types}(s_i)|} \quad (1) \]

Types(s_i) is the list of types for s_i in the bucket. Types(c_i) is the list of the types in Types(s_i) that were found in the types of the API. We associate the snippet to the API with the maximum type similarity. In case of more than one such API, we create a hit list by putting all those APIs in the list. Each entry is considered as a potential hit.
F2. Method Filter. For each of the candidate APIs returned in the list of hits from the type filter, we compute the method similarity between a snippet s_i and a candidate c_i:

\[ \text{Method Similarity} = \frac{|\text{Methods}(s_i) \cap \text{Methods}(c_i)|}{|\text{Methods}(s_i)|} \quad (2) \]

We associate the snippet to the API with the maximum similarity. In case of more than one such API, we create a hit list of all such APIs and pass it to the next filter.
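Equations (1) and (2) share the same shape, so a single set-based helper suffices. The sketch below (with assumed inputs produced by the hybrid parser) computes the fraction of the snippet's types, or methods, that the candidate API also declares.

import java.util.HashSet;
import java.util.Set;

final class SimilarityFilters {
    // Implements |Elems(s_i) ∩ Elems(c_i)| / |Elems(s_i)| for Equations (1) and (2):
    // pass the snippet's type names for the type filter, method names for the method filter.
    static double similarity(Set<String> snippetElems, Set<String> apiElems) {
        if (snippetElems.isEmpty()) return 0.0;
        Set<String> common = new HashSet<>(snippetElems);
        common.retainAll(apiElems); // set intersection
        return (double) common.size() / snippetElems.size();
    }
}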
F3. Dependency Filter. We create a dependency graph by consulting the dependencies of the APIs in the hit list. Each node corresponds to an API from the hit list. An edge is established if one API depends on another API. From this graph, we find the API with the maximum number of incoming edges, i.e., the API on which most of the other APIs depend. If there is just one such API, we assign the snippet to that API. This filter is developed based on the observation that developers mention a popular API (e.g., one on which most other APIs depend) more frequently in the forum post than its dependents.

Figure 7: Dependency graph given a hit list. The left side shows an example graph of five APIs (C1-C5); the right side shows a partial dependency graph for the four candidate APIs from Figure 6, in which org.easygson, org.immutables, and org.nd4j all depend on com.google.code.gson.

In Figure 7, we show an example dependency graph (left) and a partial dependency graph for the four candidate APIs from Figure 6 (right). In the left graph, both C2 and C5 have incoming edges, but C2 has the maximum number of incoming edges. In addition, C5 depends on C2. Therefore, C2 is most likely the core and most popular API among the five APIs. The dependency filter is useful when a code example is short, with generic type and method names. In such cases, the code example can potentially match many APIs. Consider a shortened version of the first code example in Figure 2:

import com.google.code.Gson;
Data json = new Gson().fromJson(string, Data.class);

Both the type (com.google.code.Gson) and the methods (Gson() and fromJson(. . . )) can be found in two of the APIs in Figure 6: org.immutables and com.google.code.gson. However, as we see in Figure 7 (right), all the APIs depend on com.google.code.gson. Therefore, we assign the snippet to the mention Gson and the API com.google.code.gson.

• Probabilistic Learning is used when an API mention is not found in the post texts, i.e., when we cannot link a code example to an API using proximity learning. In such cases, we associate the code example to the API that was most frequently associated with other code examples. We do this by computing the coverage of an API across the code examples linked by the proximity learning. The coverage is the total number of times the types of an API are found in those snippets. Suppose that, for four code examples C1-C4, C1 and C2 are already linked to API A1, and C3 to API A2, but no API is mentioned in the post where C4 is found. In such a case, we compute the coverage of the types in C4 (say T1, T2) in the linked snippets. If T1 is present in C1 and C2, and T2 in C3, we have a coverage of 2 for API A1 and a coverage of 1 for API A2. Thus, we link C4 to API A1. This learning is based on two observations: (1) developers tend to refer to the same API types in many different forum posts, and (2) when an API type is well-known, developers tend to refer to it in the code examples without mentioning the API (see for example [61]).
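The coverage computation can be sketched as follows, assuming the types of already-linked snippets are available per API; the names here are our own illustration, not Opiner's.

import java.util.List;
import java.util.Map;
import java.util.Set;

final class ProbabilisticLinker {
    // For an unlinked snippet, count how often each of its types appears in
    // snippets already linked to each API (the "coverage"), and pick the API
    // with the highest total.
    String linkByCoverage(Set<String> unlinkedTypes,
                          Map<String, List<Set<String>>> linkedSnippetsByApi) {
        String bestApi = null;
        int bestCoverage = -1;
        for (Map.Entry<String, List<Set<String>>> e : linkedSnippetsByApi.entrySet()) {
            int coverage = 0;
            for (Set<String> snippetTypes : e.getValue()) {
                for (String t : unlinkedTypes) {
                    if (snippetTypes.contains(t)) coverage++; // type seen in a linked snippet
                }
            }
            if (coverage > bestCoverage) {
                bestCoverage = coverage;
                bestApi = e.getKey();
            }
        }
        return bestApi; // e.g., C4 -> A1 in the example above (coverage 2 vs. 1)
    }
}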
Figure 8: Steps to produce the summary description of a scenario.

We produce a textual description for code examples that are found in the answer posts, because such a code example needs to be understood in the context of a development task [84]. Our algorithm is based on the TextRank algorithm [53] and operates in four steps (Figure 8):

1. Associate Relevant Texts. We produce an input as a list of sentences from the forum post where the code example is found. Each sentence is selected by considering its proximity to the API mention linked to the code example. For example, for the first code example in Figure 2, linked to the API Gson, we pick all the sentences before the code example except the first one. To pick the sentences, we apply beam search. We start with the first sentence in the forum post where the API is mentioned. We then pick the next possible sentence by looking for two types of signals: (a) it refers to the API (e.g., using a pronoun), and (b) it refers to an API feature. To identify features, we use noun phrases based on shallow parsing [41]. By adhering to the principle of task-oriented documentation, we organize the relevant texts into three parts: (a) Task Title: the one-line description of the task, as found in the title of the question; (b) Problem: the relevant texts obtained from the question that describe the specific problem related to the task; and (c) Solution: the relevant texts obtained from the answer where the code example is found. We produce a summarized description by applying Steps 2 and 3 once for the 'Problem' texts and once for the 'Solution' texts.
2. Develop Undirected Weighted Text Graph. We remove stopwords from each input sentence and then vectorize the sentence into textual units (e.g., n-grams). We compute the distance between two sentences, where the distance is defined as (1 - similarity). Similarity can be computed using standard metrics, such as cosine similarity. An edge is established between two sentences if they show some similarity between them. The weight of each edge is the computed distance.

3. Detect Important Nodes in Graph. We traverse the text graph using the PageRank algorithm to find the optimal weight for each node in the graph by repeatedly iterating over the following equation (until no further optimization is possible):

\[ WS(V_i) = (1 - d) + d \times \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} WS(V_j) \quad (3) \]

Here d is the damping factor, the V are nodes, WS are the node weights, and In(V_i) denotes the nodes with edges incoming to V_i.

4. Produce Summary Using Important Nodes. In order to produce the summary using the important nodes, we first pick the top N nodes with the highest weights among all the nodes. We then rank the nodes based on their appearance in the original post (i.e., problem or solution). Each node essentially corresponds to a sentence in the post. We then combine all the ranked sentences to produce the summary.

Finally, we produce a description by combining the three items in order, i.e., the Title, Problem, and Solution summaries.
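A compact sketch of Steps 2-4 follows. For simplicity it uses bag-of-words sentence vectors, cosine similarity directly as the edge weight, a fixed iteration count instead of a convergence test, and a damping factor of 0.85; all of these concrete choices are our assumptions for illustration, not the exact Opiner configuration.

import java.util.*;

final class ScenarioSummarizer {
    // Cosine similarity between two bag-of-words sentence vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            na += e.getValue() * e.getValue();
        }
        for (int v : b.values()) nb += v * v;
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Returns the indices of the top-N sentences, restored to post order.
    static List<Integer> rank(List<Map<String, Integer>> sents, int topN) {
        int n = sents.size();
        double[][] w = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                if (i != j) w[i][j] = cosine(sents.get(i), sents.get(j)); // edge weight
        double d = 0.85; // damping factor
        double[] ws = new double[n];
        Arrays.fill(ws, 1.0);
        for (int iter = 0; iter < 50; iter++) { // iterate Equation (3)
            double[] next = new double[n];
            for (int i = 0; i < n; i++) {
                double sum = 0;
                for (int j = 0; j < n; j++) {
                    double out = 0;
                    for (int k = 0; k < n; k++) out += w[j][k]; // sum of V_j's outgoing weights
                    if (j != i && out > 0) sum += w[j][i] / out * ws[j];
                }
                next[i] = (1 - d) + d * sum;
            }
            ws = next;
        }
        // Step 4: pick the top-N nodes by weight, then restore original post order.
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < n; i++) idx.add(i);
        final double[] fw = ws;
        idx.sort((x, y) -> Double.compare(fw[y], fw[x]));
        List<Integer> top = new ArrayList<>(idx.subList(0, Math.min(topN, n)));
        Collections.sort(top);
        return top;
    }
}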
The final part of our proposed framework is to associate reactions to the usage scenarios. To do this, we first gather all the comments on the post where the code example is found. We then use the principles of discourse learning [46] to associate the reactions in the comments (i.e., negative and positive opinions) to the code examples. The inputs to the algorithm are all the comments on the post where the code example is found. Our algorithm works as follows. 1. We sort the comments by time of posting. The earliest comment is placed at the top. We identify opinionated sentences in each comment. 2. We identify the API mentions in each comment. 3. We label an opinionated comment as relevant to an API mention if it refers to the API mention by name or by pronoun. To determine whether a pronoun refers to an API mention, we determine the distance between the API mention and the pronoun and whether another API was mentioned in between. If the opinionated comment is related to the API mention associated to the code example, we associate the comment to the code example. For example, in Figure 2, the comment C4 is not considered relevant to code example 1, because the closest and most recent API name to the comment is the org.json API in comment C3. 4. For opinionated comments that do not directly or indirectly refer to an API mention (e.g., using a pronoun), we associate those to the code example based on a notion called implicit reference. We consider a comment as implicitly related to the code example if no other APIs are mentioned at least two comments above it.

To analyze the opinionated sentences, our algorithm can use the output of any sentiment detection tool. The current framework uses an adaptation of the Domain Sentiment Orientation (DSO) algorithm as originally proposed by Hu et al. [34]. The algorithm was previously adopted by Google to analyze local service reviews [6]. Our adaptation is called 'OpinerDSO'. Given as input a sentence, the algorithm assigns it a polarity label (i.e., positive, negative, or neutral) in three steps:
1. Detect potential sentiment words. We identify adjectives in the sentence and match those against a list of sentiment words. Each word in the list corresponds to either a positive or a negative polarity. The list consists of 2,746 words (all adjectives) collected from three publicly available datasets (the original publications of DSO [34], MPQA [99], and AFINN [57]). In addition, the list contains 750 software domain-specific sentiment words that we collected by automatically crawling Stack Overflow based on two approaches, Gradability [31] and Coherency [47]. Each matched adjective with a positive polarity is given a score of +1 and each adjective with a negative polarity is given a score of -1. Each score is called a sentiment orientation.

2. Handle negations. We flip the sign of a matched adjective in the presence of a negation word around the adjective, e.g., 'not good' is given a score of -1 instead of +1.

3. Label sentence. We take the sum of all the sentiment orientations. If the sum is greater than 0, we label the sentence as 'positive'. If the sum is less than 0, we label it as 'negative'. Otherwise, we label it as 'neutral'.

We evaluated the performance of OpinerDSO on a benchmark of 4,522 sentences that we collected from 1,338 Stack Overflow posts. A total of eight human coders labeled each sentence for polarity. The details of the benchmark creation process are described in [92]. We compared OpinerDSO against three sentiment detection tools developed for software engineering: Senti4SD [9], SentiCR [4], and SentistrengthSE [35]. The overall precision and recall of OpinerDSO (Macro F1-score = 0.495) are comparable to Senti4SD (Macro F1-score = 0.510), SentiCR (Macro F1-score = 0.430), and SentistrengthSE (Macro F1-score = 0.454). The macro average is useful when we would like to emphasize the performance on classes with few instances, which were the positive and negative polarity classes in our benchmark. The details of the evaluation are provided in [92].
3. Evaluation
We extensively evaluated the feasibility of our mining framework by investigating the accuracy of the three proposed algorithms. In particular, we answer the following three research questions:
1. What is the performance of the algorithm to link code examples to APIs mentioned in forum texts?
2. What is the performance of generating the natural language summary for a scenario?
3. What is the performance of linking the reactions (the positive and negative opinions) to a scenario?

Both high precision and high recall are required in the mining of scenarios. A high precision in the linking of a scenario to an API mention ensures that we do not link a code example to a wrong API, while a high recall ensures that we do not miss many usage scenarios relevant to an API. Similarly, a high precision and a high recall are required to associate reactions to a code example. For the summary description of a code example, a high precision is more important, because otherwise we would associate a wrong description to a code example.

Given that all three of our proposed algorithms are information retrieval in nature, we report four standard evaluation metrics (Precision P, Recall R, F1-score F1, and Accuracy A) as follows:

\[ P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad F1 = \frac{2 \times P \times R}{P + R}, \quad A = \frac{TP + TN}{TP + FP + TN + FN} \]

where TP is the number of true positives, FP the number of false positives, TN the number of true negatives, and FN the number of false negatives.
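For instance, with hypothetical counts TP = 48, FP = 2, FN = 4, and TN = 46 (illustrative numbers only, not from our benchmarks), the metrics evaluate to:

\[ P = \frac{48}{48+2} = 0.96, \qquad R = \frac{48}{48+4} \approx 0.92, \]
\[ F1 = \frac{2 \times 0.96 \times 0.92}{0.96 + 0.92} \approx 0.94, \qquad A = \frac{48+46}{100} = 0.94 \]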
Evaluation Corpus.
We analyze the Stack Overflow threads tagged 'Java+JSON', i.e., the threads containing discussions related to JSON-based software development tasks using Java APIs. We selected the Java JSON-based APIs because JSON-based techniques support diverse development scenarios, both specialized (e.g., serialization) and utility-based (e.g., lightweight communication). We used the 'Java+JSON' threads from the Stack Overflow dump of 2014 for the following reasons:
1. It offers a rich set of competing APIs with diverse usage discussions, as reported previously by other authors [90].
2. It allowed us to also check whether the API official documentation was updated with scenarios from the dataset (see Section 4). Intuitively, our mining framework is more useful when it can be used to update API official documentation by automatically mining the API usage scenarios, such as when the official documentation is not updated with the API usage scenarios discussed in Stack Overflow even though sufficient time has passed between when such a scenario was discussed in Stack Overflow and when the API official documentation was last updated.

In Table 1 we show descriptive statistics of the dataset. There were 22,733 posts from 3,048 threads with scores greater than zero. Even though the questions were introduced during or before 2014, each question is still active in Stack Overflow, i.e., the underlying tasks addressed by the questions are still relevant. There were 8,596 valid code snippets and 4,826 invalid code snippets. On average, each valid snippet contained at least 7.9 lines. The last column, "Users", shows the total number of distinct users that posted at least one answer, comment, or question in those threads.

We evaluated our three proposed algorithms by creating three benchmarks out of our evaluation corpus.

Table 1: Descriptive statistics of the dataset (Valid Snippets)
Threads Posts Sentences Words Snippet Lines Users
Average
Table 2: Distribution of Code Snippets by APIs

                 Overall                        Top 5
APIs   Snippets   Avg    STD        Snippets   Avg      Max    Min
175    8596       49.1   502.7      5196       1039.2   1951   88

In our previous research of two surveys of 178 software developers, we found that developers consider the combination of code examples and the reviews from other developers towards the code examples in online developer forums (e.g., Stack Overflow) as a form of API documentation. We also found that developers use such documentation to support diverse development tasks (e.g., bug fixing, API selection, feature usage, etc.) [88]. Therefore, it is necessary that our mining framework be capable of supporting any development scenario. This can be done by linking any code example to an API mention, and by producing task-based documentation of an API to support any development task. Therefore, to create the benchmarks from the evaluation corpus, we picked code examples using random sampling, which offers a representation of the diverse development scenarios in online developer forums in general, without focusing on a specific development scenario (e.g., how-to, bug-fixing) [36, 105].

The 8,596 code examples are associated with 175 distinct APIs using our linking algorithm (see Table 2). The majority (60%) of the code examples were associated to five APIs for parsing JSON-based files and texts in Java: java.util, org.json, Google Gson, Jackson, and java.io. Some API types are more widely used in the code examples than others. For example, the
Gson class from the Google Gson API was found in 679 code examples out of the 1,053 code examples linked to the API (i.e., 64.5%). Similarly, the
JSONObject class from the org.json API was found in 1,324 of the 1,498 code examples linked to the API (i.e., 88.3%). Most of those code examples also contained other types of the APIs. Therefore, if we follow the documentation approach of Baker [84], we would have at least 1,324 code examples linked to the Javadoc of JSONObject for the API org.json.

Table 3: Distribution of Reactions in Scenarios with at least one reaction
Scenarios        Comments          Positive          Negative
w/reactions      Total     Avg     Total     Avg     Total     Avg
1,154            7,538     -       2,487     -       1,216     -
This is based on the parsing of our 3,048 Stack Overflow threads. Among the API usage scenarios in our study dataset, we found that 1,154 scenarios contained at least one reaction (i.e., positive or negative) using our proposed algorithm to associate reactions to an API usage scenario. In Table 3, we show the distributions of comments and reactions in the 1,154 scenarios. There are a total of 7,538 comments found in the corresponding posts of those scenarios, out of which 2,487 are sentences with positive polarity and 1,216 are sentences with negative polarity.

3.1. Performance of Linking Code Examples to API Mentions

3.1.1. Approach
We assess the performance of our algorithm to link code examples to API mentions using a benchmark that consists of 730 randomly selected code examples from our entire corpus: 375 code examples were sampled from the 8,589 valid code snippets and 355 from the 4,826 code examples that were labeled as invalid by the invalid code detection component of our framework. The size of each subset (i.e., the valid and invalid samples) was determined to capture a statistically significant snapshot of our entire dataset (95% confidence interval). The evaluation corpus was manually validated by three coders. The first two coders are the first two authors of this paper. The third coder is a graduate student and is not a co-author. The benchmark creation process involved three steps: (1) The three coders independently judged 80 randomly selected code examples out of the 730 code examples: 50 from the valid code examples and 30 from the invalid code examples. (2) The agreement among the coders was calculated, which was near perfect (Table 4): the pairwise Cohen's κ was 0.97 and the percent agreement was 99.4%. To resolve disagreements on a given code example, we took the majority vote. (3) Since the agreement level was near perfect, we considered that any of the coders could complete the rest of the coding without introducing any subjective bias. The first coder then labeled the rest of the code examples. The manual assessment found nine code examples as invalid. We labeled our algorithm

Table 4: Analysis of agreement among the coders to validate the association of APIs to code examples (using ReCal3 [24])

             Kappa (Pairwise)   Fleiss   Percent   Krippendorff α
Overall
Valid
Discarded

• Baselines.
We compare our algorithm against one baseline, (B1) Baker [84], which we describe below.
B1. Baker: As noted in Section 1, related techniques [84, 67, 19] find the fully qualified names of the API elements in code examples. Therefore, if a code example contains code elements from multiple APIs, these techniques link the code example to all of those APIs. We compare our algorithm against Baker because it is the state-of-the-art technique to leverage an API database in the linking process (unlike API usage patterns [67]). Given that Baker was not specifically designed to address the type of problem we attempt to address in this paper, we analyze both the originally proposed algorithm of Baker as well as an enhanced version of the algorithm to ensure a fair comparison.
Baker (Original). We apply the original version of the Baker algorithm [84] on our benchmark dataset as follows.
1. Code example consisting of code elements (types, methods) from only one API: we attempt to link it using the technique proposed in Baker [84].
2. Code example consisting of code elements from more than one API: even if the code example is associated to one of the APIs mentioned in the post, we leave it as 'undecided' by Baker.
Baker (Major API). For the API mentions left 'undecided' by Baker (Original), we further attempt to link an API as follows. For a code example where Baker (Original) could not decide on an API mention, we link the code example to the API that is used the most frequently in it. We do this by computing the call frequency of each API in the code example. Suppose we model a code example as an API call matrix A × T, where A stands for an API and T stands for a type (class, method) of the API that is reused in the code example. The cell (A_i, T_j) has the value 1 if type T_j from API A_i is called in the code example. We compute the reuse frequency of each API A_i using the matrix by summing the number of distinct calls (to different types) made in the code example, i.e., S_i = Σ_{j=1}^{m} T_j. We assign the code example to the API A_i with the maximum S_i among all the APIs reused.

Table 5: Performance of linking code examples to API mentions

Proposed Algorithm                Precision   Recall   F1 Score   Acc
Detect Invalid                    -           -        -          0.97
Link Valid w/Partial Info
Link Valid w/Full Info
Overall w/Partial Info
Overall w/Full Info
Baselines (applied to valid code examples)
B1a. Baker (Original)
B1b. Baker (Major API)

3.1.2. Results

We achieved a precision of 0.96 and a recall of 1.0 using our algorithm (Table 5). A recall of 1.0 was achieved due to the greedy approach of our algorithm, which attempts to find an association for each code example. The baseline Baker (Original) shows the best precision among all (0.97), but with the lowest recall (0.49). This level of precision corroborates the precision reported by Baker on Android SDKs [84]. The low recall is due to the inability of Baker to link a code example to an API mention when more than one API is used in the code example. For those code examples where Baker (Original) was undecided, we further attempted to improve Baker by finding the API that is the most frequently used in the code example. The Baker (Major API) baseline improves the recall of Baker (Original) from 0.49 to 0.66. However, the precision of Baker (Major API) drops from 0.97 to 0.88. The drop in precision is due to the fact that the major API is not always the API for which the code example is provided. This happened due to the extensive usage of the Java official APIs (e.g., java.util) in the code examples, while the mentioned API in the textual content referred to an open-source API (e.g., Jackson or org.json for JSON parsing). In some cases, the major API could not be determined due to multiple APIs having the maximum occurrence frequency, as well as the presence of ambiguous types in the code example. An API type is ambiguous in our case if more than one API can have a type with the same name. For example,
JSONObject is a popular class name among more than 900 APIs in Maven Central alone. Even the combination of a type and a method can be ambiguous. For example, the method getValue is common for a given type, such as JSONObject.getValue(. . . ). In such cases, the usage of API mentions in the textual contents offered our proposed algorithm an improvement in precision and recall over Baker.

We report the performance of our algorithm under different settings:
1. Detect Invalid. We observed an accuracy of 0.97 in detecting invalid code examples.

2. Link to Valid with Partial Info. We are able to link a valid code example to an API mention with a precision of 0.94 using only the type-based filter from the proximity learning and the probabilistic learning. This experimentation was conducted to demonstrate how much performance we can achieve with minimal information about the candidate APIs. Recall that the type-based filter only leverages API type names, unlike a combination of API type and method names (as used by API fully qualified name inference techniques [84, 19, 67]). Of the two learning rules in our algorithm, proximity learning shows better precision than probabilistic learning (2 vs. 14 wrong associations).

3. Link to Valid with Full Info. When we used all the filters under proximity learning, the precision increased to 0.96 for linking a valid code example to an API mention. The slight improvement in precision confirms previous findings that API types (and not methods) are the major indicators for such linking [84, 19].

4. Overall. We achieved an overall precision of 0.94 and a recall of 0.97 while using partial information.

Almost one-third of the misclassified associations happened due to the code example either being written in a programming language other than Java or being invalid. The following JavaScript code snippet was erroneously considered as valid and was then assigned to a wrong API: var jsonData; $.ajax(type: 'POST')... .

Five of the misclassifications occurred due to the code examples being very short. Short code examples lack sufficient API types to make an informed decision. Misclassifications also occurred due to the API mention detector not being able to detect all the API mentions in a forum post. For example, the following code example [62] was erroneously assigned to the com.google.code.gson
API. However, the correct association would be the com.google.gwt API. The forum post (answer id 20374750) contained both API mentions. However, com.google.gwt was mentioned using the acronym GWT, and the API mention detector missed it:
AutoBean
AutoBeanCodex.encode(b).getPayload();

3.2. Performance of Producing Textual Task Descriptions

3.2.1. Approach
The evaluation of a natural language summary description can be conducted in two ways [17]: 1. User study: participants are asked to rate the summaries. 2. Benchmark: the summaries are compared against a benchmark. We follow the benchmark-based setting, in which the produced summaries are compared against those in the benchmark using metrics such as the coverage of the sentences.

In our previous benchmark (RQ1), out of the 367 valid code examples, 216 code examples were found in the answer posts. The rest of the valid code examples (i.e., 151) were found in the question posts. We assess the performance of our summarization algorithm on the 216 code examples that are found in the answer posts, because each such code example is provided in an attempt to suggest a solution to a development task, and our goal is to create task-based documentation support for APIs.

We create another benchmark by manually producing a summary description for the 216 code examples using two types of information: 1. the description of the task that is addressed by the code example, and 2. the description of the solution as carried out by the code example. Both of these information types can be obtained from the forum posts, such as the problem definition from the question post and the solution description from the answer post. We picked sentences following the principles of extractive summarization [17] and the minimal manual [15], i.e., we picked only sentences that are related to the task. Consider a task description, "I cannot convert JSON string into Java object using Gson. I have previously used Jackson for this task". If the provided code example is linked to the API Gson, we pick the first sentence as relevant to describe the problem, but not the second sentence. A total of two human coders produced the benchmark. The first coder is the first author of this paper. The second coder is a graduate student and is not a co-author of this paper. The two coders followed these steps: 1. create a coding guide to determine how summaries can be produced and evaluated, 2. randomly pick n code examples out of the 216 code examples, 3. produce summary

Table 6: Agreement between the coders for the RQ2 benchmark

            Iteration 1 (5)   Iteration 2 (15)   Iteration 3 (30)
Problem
Solution
Overall

• Baselines.
We compare against four off-the-shelf extractive summarization algorithms [25]: (B1) Luhn, (B2) LexRank, (B3) TextRank, and (B4) Latent Semantic Analysis (LSA). The first three algorithms were previously used to summarize API reviews [90]. LSA-based algorithms are commonly used in information retrieval and software engineering, both for text summarization and query formulation [28]. Extractive summarization techniques are the most widely used automatic summarization algorithms [25]. Our proposed algorithm utilizes the TextRank algorithm. Therefore, by applying the TextRank algorithm without the adaptation that we proposed, we can estimate the impact of the proposed changes.
3.2.2. Results

Table 7 summarizes the performance of our algorithm and the baselines in producing the textual task description. We achieved the best precision (0.96) and recall (0.98) using our proposed engine that is built on top of the TextRank algorithm. Each summarization algorithm takes as input the following texts: 1. the title of the question, and 2. all the textual contents from both the question and the answer posts. By merely applying the TextRank algorithm

Table 7: Algorithms to produce summary description
Techniques             Precision   Recall   F1 Score   Acc
Proposed Algorithm
B1. Luhn
B2. Textrank
B3. Lexrank
B4. LSA

3.3. Performance of Linking Reactions to Code Examples

3.3.1. Approach
We assess the performance of our algorithm using a benchmark that was produced by manually associating reactions to the 216 code examples that we analyzed for RQ1 and RQ2. Our focus is to evaluate the performance of the algorithm in correctly associating a reaction (i.e., a positive or negative opinionated sentence) to a code example. As we noted in Section 2.5, our framework supports the adoption of any sentiment detection tool to detect the reactions. Given that the focus of this evaluation is on the correct association of reactions to code examples, we need to mitigate the threats in the evaluation that could arise due to inaccuracies in the detection of reactions by a sentiment detection tool [58]. We thus manually labeled the polarity (positive, negative, or neutral) of each sentence in our benchmark following standard guidelines in the literature [9, 35].

Out of the 216 code examples in our benchmark, 68 code examples from 59 answers had at least one comment (201 comments in total). The 201 comments had a total of 493 sentences (190 positive, 55 negative, 248 neutral). Four coders judged the association of each reaction (i.e., positive and negative sentences) towards the code examples.

Table 8: Analysis of agreement between coders to validate the association of reactions to code examples (using ReCal2 [23])

         Total   Percent   Kappa (pairwise)   Krippendorff α
C1-C2    174     83.9%     0.46               0.45
C2-C3    51      62.7%     0.12               0.05
C1-C3    103     84.5%     0.50               0.51

For each reaction, we label it either 1 (associated to the code example) or 0 (not associated). The association of each reaction to a code example was assessed by at least two coders. The first coder (C1) is the first author, the second (C2) is a graduate student, the third (C3) is an undergraduate student, and the fourth (C4) is the second author of the paper. The second and third coders are not co-authors of this paper. The first coder coded all the reactions. The second and third coders coded 174 and 103 reactions, respectively. For each reaction, we took the majority vote (e.g., if C2 and C3 labeled it as 1 but C1 as 0, we took 1, i.e., associated). The fourth coder (C4) was consulted when a majority was not possible. This happened for 22 reactions where two coders (C1 and C2/C3) were involved and they disagreed. The labeling was accompanied by a coding guide. Table 8 shows the agreement among the first three coders.
We compare against two baselines: (B1) All Comments: a code example is linked to all the comments. (B2) All Reactions: a code example is linked to all the positive and negative comments. The first baseline offers us insights into how well a blind association technique without sentiment detection may work. The second baseline includes only the subset of sentences from all sentences (i.e., B1) that are either positive or negative. However, not all the reactions may be related to a code example. Therefore, the second baseline (B2) offers us insights into whether simple reliance on sentiment detection would suffice, or whether we need a more sophisticated contextual approach like our proposed algorithm, which picks a subset of the positive and negative reactions out of all reactions.
3.3.2. Results

We observed the best precision (0.89) and recall (0.94) using our proposed algorithm to link reactions to code examples. The baseline 'All Reactions' shows much better precision than the other baseline, but still lower than ours.

Table 9: Performance of associating reactions to code examples

Technique                Precision   Recall   F1 Score   Acc
Proposed Algorithm
B1. All Comments
B2. All Reactions
4. Discussion
We implemented our framework in an online tool, Opiner [91]. Using the framework deployed in Opiner, a developer can search an API by its name to see all the usage scenarios of the API mined from Stack Overflow. We previously developed Opiner to mine positive and negative opinions about APIs from Stack Overflow. Our proposed framework in this paper extends Opiner by also allowing developers to search for API usage scenarios, i.e., code examples associated to an API and their relevant usage information. The current version shows results from our evaluation corpus. We present the usage scenarios by grouping code examples that use the same types (e.g., classes) of the API. As noted in Section 3, our evaluation corpus uses the Stack Overflow 2014 dataset. This choice was not random: we wanted to see whether, given sufficient time, the usage scenarios in our corpus were included in the API official documentation. We found a total of 8,596 valid code examples linked to 175 distinct APIs in our corpus. The majority of those (60%) were associated to five APIs: java.util, org.json, Gson, Jackson, and java.io. Most of the mined scenarios for those APIs were absent from their official documentation; e.g., for Gson, only 25% of its types are used in the code examples of its official documentation, but 81.8% of the types are discussed in our mined usage scenarios. Therefore, the automatic mining of the usage scenarios using our framework can assist the API authors who could not include those in the API official documentation.

In Figure 9, we show screenshots of our tool. A user can search an API by name in (1) to see the mined tasks of the API (3). An example task is shown in (4). Other relevant tasks (i.e., those that use the same classes and methods of the API) are grouped under 'See Also' (5). Each task under 'See Also' can be further explored (6). Each task is linked to the corresponding post in Stack Overflow where the code example was found (by clicking on the details label). The front page shows the top 10 APIs with the most mined tasks (2).

• Effectiveness of our Tool.
• Effectiveness of our Tool. Although we extensively evaluated the accuracy of our algorithms, we also measured the effectiveness of our tool with a user study. Given that the focus of evaluation in this paper is the accuracy of the three proposed algorithms in our mining framework, and not the effectiveness of Opiner as a tool, we briefly describe the user study design and results below.
Participants. We recruited 31 developers. Among them, 18 were recruited through the online professional developer site Freelancer.com. The other 13 participants were recruited from four universities, two in Canada and two in Bangladesh. Each participant had professional software development experience in Java. Each freelancer was remunerated with $20. Among the 31 participants, 88.2% were actively involved in software development (94.4% among the freelancers and 81.3% among the university participants). Each participant had a background in computer science and software engineering. The participants' experience in software development ranged from less than one year to more than 10 years: three (all of them students) had less than one year of experience, nine between one and two years, 12 between three and six, four between seven and 10, and the rest (nine) more than 10 years.
Tasks.
The developers each completed four coding tasks involving four APIs (one task for each of Jackson [21], Gson [26], Spring [80], and Xstream [95]). The four APIs were found in the list of the top 10 most discussed APIs in our evaluation corpus.
Figure 9: Screenshots of our online task-based API documentation tool. [Panels: the front page of the online API usage scenario search and summarizer engine; an example usage scenario for the API com.google.code.gson; each usage scenario has a 'See Also' section; a scenario in the 'See Also' section can be expanded upon click.]

The participants were divided into four groups (G1-G4); each group used a different resource for each of the four APIs:

• G1: Jackson (Stack Overflow), Gson (Javadoc), Xstream (Opiner), Spring (Everything including Search Engine)

• G2: Spring (Stack Overflow), Jackson (Javadoc), Gson (Opiner), Xstream (Everything including Search Engine)

• G3: Xstream (Stack Overflow), Spring (Javadoc), Jackson (Opiner), Gson (Everything including Search Engine)

• G4: Gson (Stack Overflow), Xstream (Javadoc), Spring (Opiner), Jackson (Everything including Search Engine)

We collected the time taken to complete each task and the effort spent using the NASA TLX index [30] (nasatlx.com). We assessed the correctness of a solution by computing the coverage of correct API elements. We summarize the major findings below; more details of the study and its results are provided in our online appendix [1].

While using our tool Opiner, the participants on average coded with more correctness and spent the least time and effort of all the resources. For example, using Opiner, the average time developers spent to complete a coding task was 18.6 minutes and the average effort, as reported in their TLX metrics, was 45.8. In contrast, participants spent the most time (23.7 minutes) and effort (63.9) per coding solution when using the official documentation. The difference between Opiner and the official documentation with regard to time spent is statistically significant (p-value = 0.049) with a medium effect size. We use the Mann-Whitney U test [78] to compute statistical significance, which is suitable for non-parametric data. We use Cliff's delta to compute the effect size and follow the effect size categorization of Romano et al. [76]. The differences between Opiner and the other resources are not statistically significant for the other metrics. Therefore, while the API usage scenarios in Opiner offer an improvement over the other resources, there is room for improvement.
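As an illustration of the statistical procedure above, the sketch below computes the Mann-Whitney U test (via SciPy) and Cliff's delta for two samples of task completion times; the sample values are invented for demonstration and are not the study data.

    from scipy.stats import mannwhitneyu

    def cliffs_delta(xs, ys):
        # Fraction of pairs with x > y minus fraction with x < y.
        gt = sum(1 for x in xs for y in ys if x > y)
        lt = sum(1 for x in xs for y in ys if x < y)
        return (gt - lt) / (len(xs) * len(ys))

    opiner_minutes = [15.2, 18.0, 19.5, 17.3, 21.1]
    javadoc_minutes = [22.4, 25.0, 21.7, 26.3, 23.9]

    stat, p = mannwhitneyu(opiner_minutes, javadoc_minutes,
                           alternative='two-sided')
    print(stat, p, cliffs_delta(opiner_minutes, javadoc_minutes))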
After completing the tasks, 29 participants completed a survey to share their experience. More than 80% of the participants considered the mined usage summaries an improvement over both the API official documentation and Stack Overflow, because our tool offered an increase in productivity and confidence in usage, and a reduction in time spent. According to one participant: "It is quicker to find solution in [tool] since the subject is well covered and useful information is collected." The participants considered that learning an API could be quicker while using our tool than while using the official documentation or Stack Overflow, because our tool synthesizes the information from Stack Overflow by API using both sentiment and source code analyses.

Of the participants, 87.1% wanted to use our tool either daily in their development tasks or whenever they have specific needs (e.g., learning a new API). All the participants (100%) rated our tool as usable, for being a single platform that provides insights about API usage and for being focused towards a targeted audience. The developers praised the usability, search, and analytics-driven approach of the tool. According to one participant: "In depth knowledge plus the filtered result can easily increase the productivity of daily development tasks, . . . with the quick glimpse of the positive and negative feedback."
As a future improvement, the developers wished for our tool to mine usage scenarios from multiple online forums.
5. Threats to Validity

• External Validity threats relate to the generalizability of our findings and our approach. In this paper, we focus on Stack Overflow, which is one of the largest and most popular Q&A websites for developers. Our findings may not generalize to other non-technical Q&A websites that do not focus on software development. While our evaluation corpus consists of 22.7K posts from Stack Overflow, the results do not carry the automatic implication that the same results can be expected in general.

• Internal Validity threats relate to experimenter bias and errors while conducting the analysis. We evaluated the performance of the three proposed algorithms in our framework by developing three benchmarks. We mitigated bias using manual validation (e.g., our benchmark datasets were assessed by multiple coders). In our user study, we assigned the study participants four tasks using four tools, including Opiner. Despite using a 'between-subject' setting following previous research [100], the assignments were not fully counterbalanced; e.g., one of the four groups had one more participant than the other groups. We computed the average of the effectiveness metrics (correctness, time, and effort spent). The absence of full counterbalancing may still introduce some unobserved bias or error.

• Construct Validity threats relate to the difficulty in finding data relevant to identifying rollback edits and ambiguities. Hence, we use revisions of the bodies of questions and answers from the Stack Exchange data dump, which we consider reasonable and reliable for capturing the reasons for and the ambiguities of revisions. We also parse the web pages to create a large dataset to apply our ambiguity detection algorithms. However, we discard incomplete and noisy records to keep our dataset clean and reliable.

• Reliability Validity threats concern the possibility of replicating this study. We provide the necessary data in an online appendix [1].
6. Related Work
Related work can broadly be divided into three areas: (1) research in software engineering related to our three proposed algorithms, (2) software code search tools and techniques, and (3) crowd-sourced documentation.
As we noted in Section 1, we are aware of no techniques that can associate reactions towards code examples in forums (Section 2.5).

Our algorithm to generate summary descriptions of tasks (Section 2.4) is different from the generation of natural language descriptions of API elements (e.g., classes [56], methods [82, 83]), which takes source code (e.g., class names, variable names) as input to produce a description. We take as input the textual discussions around code examples in forum posts. Our approach is also different from API review summaries [90], because our summary can contain both opinionated and neutral sentences.

Our approach to generating a task description from an answer differs from Xu et al. [101], who proposed AnswerBot to automatically summarize multiple answers relevant to a developer task. The input to AnswerBot is a natural language query describing a development task. Based on the query, AnswerBot first finds all the questions in Stack Overflow whose titles closely match the query. AnswerBot then applies a set of heuristics based on Maximal Marginal Relevance (MMR) [14] to find the most novel and diverse paragraphs in the answers. The final output is a ranked list of the paragraphs as a summary of the answers that could be used to complete the development task. Unlike Xu et al. [101], who focus on the summarization of multiple answers for a given task, we focus on summarizing the contents of one answer. Unlike Xu et al. [101], who utilize only the textual contents of answers to produce the summary, we utilize the contents of both the question and the answer. A summary of the relevant textual contents from the question provides an overview of the problem (i.e., the development task). Such a problem definition adds contextual information over the question title, which may not be enough to properly explain the development task. This assumption is consistent with our previous findings from surveys of software developers, who reported the necessity of adding contextual and situationally relevant information into summaries produced from developer forums [89].

Our algorithm to associate a code example with an API mentioned in a forum post (Section 2.3) differs from the existing traceability techniques for code examples in forum posts [84, 67, 104] as follows:

• As we noted in Section 3.1, the technique most directly comparable to ours is Baker [84], because both Baker and our proposed technique rely on a predefined database of APIs. Given a code example as input, our technique differs from Baker by considering both the code example and the textual contents of the forum post to learn which API from the API database to link to the code example. Baker does not consider the textual contents of the forum posts.

• As we noted in Section 3.1, given that our technique relies specifically on an API database similar to Baker [84], our algorithm is not directly comparable to StatType as proposed by Phan et al. [67]. StatType relies on API usage patterns, i.e., how frequently a method or class name is found to be associated with an API in different GitHub code repositories. We do not rely on the analysis of client software code to infer the usage patterns of an API.

• Unlike existing techniques [84, 19, 67], we can operate with both incomplete and complete API databases against which API mentions can be checked for traceability. This flexibility allowed us to use an online, incomplete API database (Maven Central) instead of constructing an offline database. The existing traceability techniques [84, 19] require the generation of an offline, complete API database to support traceability.

• Unlike Ye et al. [104], we link a code example in a forum post to an API mentioned in the textual contents of the forum post. Specifically, Ye et al. [104] focus on finding API method and type names in the textual contents of forum posts, e.g., identifying 'numpy', 'pandas', and 'apply' in the text 'While you can also use numpy, the documentation describes support for Pandas apply method using the following code example'. In contrast, our proposed algorithm links a provided code example to an API mentioned in the textual contents. For the above text, where Ye et al. [104] link both the 'Pandas' and 'numpy' APIs, our algorithm links the provided code example to only the 'Pandas' API (see the sketch below).

In Section 3.1, we compared our traceability algorithm with the state-of-the-art technique Baker [84]. The recall of Baker was 0.49, i.e., using Baker we could not link more than 50% of the code examples in our evaluation, because those examples contained references to multiple API types/methods while the textual contents referred to only one of those APIs. Our technique could find a link for all of them (i.e., 100% recall) with more than 96% precision. Our evaluation sample is statistically representative of our corpus of 8589 code examples. Therefore, using Baker we could have found links for only 4100 of those, while our technique could link all 8589 with very high precision. Stack Overflow contains millions of other code examples. Therefore, our technique significantly advances the state of the art in code example traceability to support task-based documentation.
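To make the contrast with Ye et al. [104] concrete, a toy version of our linking idea might score each mentioned API by how many of its known elements appear in the code example. This is a deliberate simplification with a hypothetical API database; it is not the algorithm we evaluated.

    import re

    # Hypothetical API database: API name -> known type/method names.
    api_db = {
        'Pandas': {'DataFrame', 'apply', 'read_csv'},
        'numpy': {'ndarray', 'array', 'dot'},
    }

    def link_code_to_api(code, mentioned_apis):
        tokens = set(re.findall(r'[A-Za-z_]\w*', code))
        # Pick the mentioned API whose known elements overlap the code most.
        return max(mentioned_apis,
                   key=lambda api: len(api_db.get(api, set()) & tokens))

    code = 'df = pandas.DataFrame(rows)\nresult = df.apply(len)'
    print(link_code_to_api(code, ['Pandas', 'numpy']))  # -> Pandas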
Software development requires writing code to complete development tasks. Finding code examples similar to the task at hand can help developers complete the task quickly and efficiently. As such, a large volume of research in software engineering has focused on the development and improvement of code search engines [40, 69, 27, 52, 16, 28, 48, 32, 39, 49, 7]. The engines vary in the nature of their input and output as well as in the underlying searching, ranking, and visualization techniques. Based on input and output, the techniques can broadly be divided into the following types: (1) code to code search, (2) code to relevant information search, (3) natural language query to code search, and (4) code snippet plus natural language query to code search.

Kim et al. [40] proposed FaCoY, a code-to-code search engine: given a code snippet as input, the engine finds other code snippets that are semantically similar to the input. While our goals and FaCoY's are the same, i.e., to help developers in their development tasks, we differ with regard to both the outputs and the approaches. For example, given as input a code example in a Stack Overflow post, we link it to an API name mentioned in the textual contents of the post; in contrast, given a code example as input, FaCoY finds other similar code examples. Ponzanelli et al. [69] developed Prompter, an Eclipse plug-in that takes the source code in a given file as a context and uses it to search Stack Overflow posts for relevant discussions (i.e., code to relevant information search). The relevant discussions are presented in a multi-faceted ranking model. In two empirical studies, Prompter's recommendations were found to be positive in 74% of cases. They also found that such recommendations are 'volatile' in nature, since the recommendations can change within a year.

Natural language queries are used by leveraging text retrieval techniques to find relevant code examples (i.e., natural language query to code search). McMillan et al. [52] developed Portfolio, a search engine that finds relevant code functions by taking as input a natural language search query that offers cues about the programming task at hand. To assist in the usage of the returned functions, Portfolio also visualizes their usages. Hill et al. [32] propose CONQUER, an Eclipse plug-in that takes a natural language query as input and finds relevant source code for maintenance by incorporating multiple feedback mechanisms into the search results view, such as the prevalence of the query words in the result set. Lv et al. [49] propose CodeHow, a code search technique that recognizes potential APIs related to an input user query. CodeHow first attempts to understand the input query and then expands the query with the potentially relevant APIs. CodeHow then performs code retrieval with the expanded query by applying an Extended Boolean Model, which considers the impact of both text similarity and potential APIs during code search. CodeHow was evaluated on a codebase of 26K C# projects.
The automated mining of crowd-sourced knowledge from developer forums has generated considerable attention in recent years. To offer a point of reference for our analysis of related work, we review the research papers listed in the Stack Exchange question 'Academic Papers Using Stack Exchange Data' [63] whose titles contain the keywords ('documentation' and/or 'API') [96, 38, 81, 85, 51, 104, 12, 13, 5, 3, 97, 18, 64, 65, 37, 11, 87, 20, 43, 42]. Existing research has focused on the following areas:

• Assessing the feasibility of forum contents for documentation and API design (e.g., usability) needs,
• Answering questions in Stack Overflow using formal documentation,

• Recommending new documentation by complementing both official and developer forum contents, and

• Categorizing forum contents (e.g., detecting issues).

Our work differs from the above work by proposing three novel algorithms that can be used to automatically generate task-based API documentation from Stack Overflow. As we noted in Section 1, we follow the concept of the 'minimal manual', which promotes task-centric documentation [15, 8, 77, 50]. We differ from the above work as follows: 1. We include comments posted in the forum as reactions to a code example in our usage scenarios. 2. We automatically mine API usage scenarios from online forums, thereby greatly reducing the time and complexity of producing a minimal manual.

Given the advances in techniques developed to automatically mine insights from crowd-sourced software forums, recent research on crowd-sourced API documentation has focused specifically on the analysis of the quality of the shared knowledge. A number of high-impact recent research papers [106, 103, 86] warn against directly copying code from Stack Overflow, because such code can have potential bugs or misuse patterns [106] and may not be directly usable (e.g., not compilable) [103, 86]. We observed both issues during the development of our proposed mining framework, and we attempted to offer solutions to both within the context of our goal, i.e., producing task-based documentation. For example, in Section 2.2, we discussed that shared code examples can have minor syntax problems (e.g., a missing semicolon at the end of a source code line in Java) but still be upvoted by Stack Overflow users, i.e., the users considered those code examples useful. Therefore, to ensure that such code examples can still be included in our task-based documentation, we developed a hybrid code parser that combines island parsing with an ANTLR grammar to parse code examples line by line. Based on the output of the parser, we can decide whether or not to include a code example with syntax errors: a code example with a minor error (e.g., a missing semicolon) can be included, whereas a code example where many lines have syntax errors (e.g., 50% of the source code lines) can be discarded (see the sketch below).

While the issues with code usability in crowd-sourced code examples [103, 86] could be addressed by converting those into compilable code examples, such an approach requires extensive research and technological advancement due to the diversity of such issues and the large number of programming languages in modern programming environments. As a first step in this direction, our framework includes the algorithm to associate the reactions of other developers towards a code example. The design and development of the algorithm was motivated by our findings from previous surveys of 178 software developers [89]. The developers reported that they consider the combination of a code example and the reviews about that code example in forum posts as a form of API documentation, and they especially leverage the reviews to understand the potential benefits and pitfalls of reusing the code example.
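The inclusion heuristic sketched above can be illustrated as follows; parses_as_java is a toy stand-in for the hybrid island/ANTLR parser, and the 50% threshold mirrors the example given in the text.

    def parses_as_java(line):
        # Toy check standing in for the hybrid island/ANTLR parser:
        # treat a statement lacking a terminator or brace as an error.
        return line.strip().endswith((';', '{', '}'))

    def keep_snippet(code, max_error_ratio=0.5):
        lines = [l for l in code.splitlines() if l.strip()]
        errors = sum(1 for l in lines if not parses_as_java(l))
        return bool(lines) and errors / len(lines) < max_error_ratio

    snippet = ('Gson gson = new Gson();\n'
               'String json = gson.toJson(obj)\n'   # missing semicolon
               'System.out.println(json);')
    print(keep_snippet(snippet))  # -> True: only 1 of 3 lines has an error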
7. Conclusions

• Summary.
APIs are central to modern rapid software development. However, APIs can be hard to use due to shortcomings in API official documentation, such as being incomplete or not usable [74]. This has resulted in a plethora of API discussions in forum posts. We present three algorithms to automatically mine API usage scenarios from forums that can be used to produce task-based API documentation. We developed an online task-based API documentation engine based on the three proposed algorithms. We evaluated the three algorithms using three benchmarks, each created by taking inputs from multiple human coders. We compared the algorithms against seven state-of-the-art baselines, and our proposed algorithms outperformed the baselines. • Future Work.
Our future work focuses on three major directions: (1) the extension of the proposed framework to include all API usage scenarios from diverse developer forums, (2) the improvement of the API usage scenario ranking in the Opiner online user interface, and (3) the utilization of the framework to produce, fix, and complement traditional API documentation.

The ranking of API usage scenarios on the Opiner website is currently based simply on 'recency', i.e., the most recent code example is put at the top. This approach may not be suitable when, for example, the most recent code example is not properly described or commented, or when the linking of the code example is wrong. Our future work will investigate an optimal ranking strategy for API usage scenarios in Opiner that can combine recency with additional contextual information. For example, between the two most recent API usage scenarios, we could place at the top the scenario that contains more description and more comments. We can also leverage current research efforts to detect low-quality posts in Stack Overflow during the ranking process [70, 71, 102, 29, 44, 10]. In addition, we will focus on improving the accuracy of the API-to-code-example linking.

In our user study, the participants suggested that the usage scenarios in Opiner could be integrated into traditional API documentation. Given that official API documentation can often be incomplete, incorrect, and obsolete [93, 74], we will focus on utilizing our proposed framework to improve API documentation resources, such as developing techniques to automatically recommend fixes to common API documentation problems (e.g., ambiguity, incorrectness) [93, 74], to associate the mined usage scenarios with specific API versions, and to produce on-demand developer documentation [75].
References

[1] Online appendix for Mining Task-Based API Documentation. https://github.com/anonsubmissions/ist2019, 5 August 2019 (last accessed).
[2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Data Bases, pages 192–202, 1994.
[3] M. Ahasanuzzaman, M. Asaduzzaman, C. K. Roy, and K. A. Schneider. Classifying Stack Overflow posts on API issues. In Proceedings of the IEEE 25th International Conference on Software Analysis, Evolution and Reengineering, pages 244–254, 2018.
[4] T. Ahmed, A. Bosu, A. Iqbal, and S. Rahimi. SentiCR: A customized sentiment analysis tool for code review interactions. In Proceedings of the 32nd International Conference on Automated Software Engineering, pages 106–111, 2017.
[5] S. Azad, P. C. Rigby, and L. Guerrouj. Generating API call rules from version history and Stack Overflow posts. ACM Transactions on Software Engineering and Methodology, 25(4):22, 2017.
[6] S. Blair-Goldensohn, K. Hannan, R. McDonald, T. Neylon, G. A. Reis, and J. Reyner. Building a sentiment summarizer for local search reviews. In WWW Workshop on NLP in the Information Explosion Era, page 10, 2008.
[7] J. Brandt, M. Dontcheva, M. Weskamp, and S. R. Klemmer. Example-centric programming: integrating web search into the development environment. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 513–522, 2010.
[8] I. Cai. Framework Documentation: How to Document Object-Oriented Frameworks. An Empirical Study. PhD thesis in Computer Science, University of Illinois at Urbana-Champaign, 2000.
[9] F. Calefato, F. Lanubile, F. Maiorano, and N. Novielli. Sentiment polarity detection for software development. Empirical Software Engineering, pages 2543–2584, 2017.
[10] F. Calefato, F. Lanubile, and N. Novielli. How to ask for technical help? Evidence-based guidelines for writing questions on Stack Overflow. Information and Software Technology, 94:186–207, 2018.
[11] J. C. Campbell, C. Zhang, Z. Xu, A. Hindle, and J. Miller. Deficient documentation detection: A methodology to locate deficient project documentation using topic analysis. In Proceedings of the 10th International Working Conference on Mining Software Repositories, pages 57–60, 2013.
[12] E. Campos, L. Souza, and M. Maia. Searching crowd knowledge to recommend solutions for API usage tasks. Journal of Software: Evolution and Process, 28(10):863–892, 2016.
[13] E. C. Campos, M. Monperrus, and M. A. Maia. Searching Stack Overflow for API-usage-related bug fixes using snippet-based queries. In Proceedings of the 26th Annual International Conference on Computer Science and Software Engineering, pages 232–242, 2016.
[14] J. Carbonell and J. Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 335–336, 1998.
[15] J. M. Carroll, P. L. Smith-Kerker, J. R. Ford, and S. A. Mazur-Rimetz. The minimal manual. Journal of Human-Computer Interaction, 3(2):123–153, 1987.
[16] W.-K. Chan, H. Cheng, and D. Lo. Searching connected API subgraph via text phrases. In Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, pages 1–11, 2012.
[17] A. Cohan and N. Goharian. Revisiting summarization evaluation for scientific articles. In Proceedings of Language Resources and Evaluation, page 8, 2016.
[18] B. Dagenais and M. P. Robillard. Creating and evolving developer documentation: Understanding the decisions of open source contributors. In Proceedings of the 18th International Symposium on the Foundations of Software Engineering, pages 127–136.
[19] B. Dagenais and M. P. Robillard. Recovering traceability links between an API and its learning resources. In , pages 45–57, 2012.
[20] F. Delfim, K. Paixão, D. Cassou, and M. Maia. Redocumenting APIs with crowd knowledge: a coverage analysis based on question types. Journal of the Brazilian Computer Society, 29(1), 2016.
[21] FasterXML. Jackson. https://github.com/FasterXML/jackson, 2016.
[22] B. Fox. Now Available: Central download statistics for OSS projects, 2017.
[23] D. Freelon. ReCal2: Reliability for 2 coders. http://dfreelon.org/utils/recalfront/recal2/, 2016.
[24] D. Freelon. ReCal3: Reliability for 3+ coders. http://dfreelon.org/utils/recalfront/recal3/, 2017.
[25] M. Gambhir and V. Gupta. Recent automatic text summarization techniques: a survey. Artificial Intelligence Review, 47(1):166, 2017.
[26] Google. Gson. https://github.com/google/gson, 2016.
[27] X. Gu, H. Zhang, and S. Kim. Deep code search. In Proceedings of the 40th International Conference on Software Engineering, pages 933–944, 2018.
[28] S. Haiduc, G. Bavota, A. Marcus, R. Oliveto, A. D. Lucia, and T. Menzies. Automatic query reformulations for text retrieval in software engineering. In Proceedings of the 35th IEEE/ACM International Conference on Software Engineering, pages 842–851, 2013.
[29] F. M. Harper, D. Raban, S. Rafaeli, and J. A. Konstan. Predictors of answer quality in online Q&A sites. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 865–874, 2008.
[30] S. G. Hart and L. E. Staveland. Development of NASA-TLX (Task Load Index): Results of empirical and theoretical research. Pages 139–183, 1988.
[31] V. Hatzivassiloglou and J. M. Wiebe. Effects of adjective orientation and gradability on sentence subjectivity. In Proceedings of the 18th Conference of the Association for Computational Linguistics, pages 299–305.
[32] E. Hill, M. Roldan-Vega, J. A. Fails, and G. Mallet. NL-based query refinement and contextualized code search results: A user study. In Proceedings of the IEEE Conference on Software Maintenance, Reengineering, and Reverse Engineering, pages 34–43, 2014.
[33] R. Holmes, R. Cottrell, R. J. Walker, and J. Denzinger. The end-to-end use of source code examples: An exploratory study. In Proceedings of the IEEE International Conference on Software Maintenance, pages 555–558, 2009.
[34] M. Hu and B. Liu. Mining and summarizing customer reviews. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 168–177, 2004.
[35] M. R. Islam and M. F. Zibran. Leveraging automated sentiment analysis in software engineering. In Proceedings of the 14th International Conference on Mining Software Repositories, pages 203–214, 2017.
[36] S. Iyer, I. Konstas, A. Cheung, and L. Zettlemoyer. Summarizing source code using a neural attention model. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 2073–2083, 2016.
[37] H. Jiau and F.-P. Yang. Facing up to the inequality of crowdsourced API documentation. ACM SIGSOFT Software Engineering Notes, 37(1):1–9, 2012.
[38] D. Kavaler, D. Posnett, C. Gibler, H. Chen, P. Devanbu, and V. Filkov. Using and asking: APIs used in the Android market and asked about in Stack Overflow. In Proceedings of the International Conference on Social Informatics, pages 405–418, 2013.
[39] I. Keivanloo, J. Rilling, and Y. Zou. Spotting working code examples. In Proceedings of the 36th International Conference on Software Engineering, pages 664–675, 2014.
[40] K. Kim, D. Kim, T. F. Bissyandé, E. Choi, L. Li, J. Klein, and Y. L. Traon. FaCoY: a code-to-code search engine. In Proceedings of the IEEE/ACM 40th International Conference on Software Engineering, pages 946–957, 2018.
[41] D. Klein and C. D. Manning. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 423–430, 2003.
[42] J. Li, A. Sun, and Z. Xing. Learning to answer programming questions with software documentation through social context embedding.
Information Sciences, 448–449:46–52, 2018.
[43] J. Li, Z. Xing, and A. Kabir. Leveraging official content and social context to recommend software documentation. IEEE Transactions on Services Computing, page 1, 2018.
[44] L. Li, D. He, W. Jeng, S. Goodwin, and C. Zhang. Answer quality characteristics and prediction on an academic Q&A site: A case study on ResearchGate. In Proceedings of the 24th International Conference on World Wide Web, pages 1453–1458, 2015.
[45] B. Lin, F. Zampetti, G. Bavota, M. D. Penta, and M. Lanza. Pattern-based mining of opinions in Q&A websites. In Proceedings of the 41st ACM/IEEE International Conference on Software Engineering, page 12, 2019.
[46] B. Liu. Sentiment Analysis and Subjectivity. CRC Press, Taylor and Francis Group, Boca Raton, FL, 2nd edition, 2010.
[47] B. Liu. Sentiment Analysis and Opinion Mining. Morgan & Claypool Publishers, 1st edition, May 2012.
[48] M. Lu, X. Sun, S. Wang, D. Lo, and Y. Duan. Query expansion via WordNet for effective code search. In Proceedings of the IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering, pages 545–549, 2015.
[49] F. Lv, H. Zhang, J.-G. Lou, S. Wang, D. Zhang, and J. Zhao. CodeHow: effective code search based on API understanding and extended Boolean model. In Proceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering, pages 260–270, 2015.
[50] H. van der Meij. A critical assessment of the minimalist approach to documentation. In Proceedings of the 10th ACM SIGDOC International Conference on Systems Documentation, pages 7–17, 1992.
[51] L. Mastrangelo, L. Ponzanelli, A. Mocci, M. Hauswirth, N. Nystrom, and M. Lanza. Use at your own risk: The Java Unsafe API in the wild. In Proceedings of the International Conference on Object-Oriented Programming Systems Languages & Applications, pages 695–710, 2015.
[52] C. McMillan, M. Grechanik, D. Poshyvanyk, Q. Xie, and C. Fu. Portfolio: Finding relevant functions and their usages. In Proceedings of the 33rd International Conference on Software Engineering, pages 111–120, 2011.
[53] R. Mihalcea and P. Tarau. TextRank: Bringing order into texts. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 404–411, 2004.
[54] G. A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41, 1995.
[55] L. Moonen. Generating robust parsers using island grammars. In Proceedings of the Eighth Working Conference on Reverse Engineering, pages 13–22, 2001.
[56] L. Moreno, J. Aponte, G. Sridhara, A. Marcus, L. Pollock, and K. Vijay-Shanker. Automatic generation of natural language summaries for Java classes. In Proceedings of the 21st IEEE International Conference on Program Comprehension, pages 23–32, 2013.
[57] F. Å. Nielsen. A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. In Proceedings of the 8th Extended Semantic Web Conference, pages 93–98.
[58] N. Novielli, F. Calefato, and F. Lanubile. The challenges of sentiment detection in the social programmer ecosystem. In Proceedings of the 7th International Workshop on Social Software Engineering, pages 33–40, 2015.
[59] Oracle. The Java Date API. http://docs.oracle.com/javase/tutorial/datetime/index.html, 2017.
[60] Stack Overflow. http://stackoverflow.com/questions/1688099/, 2010.
[61] Stack Overflow. Name/value pair loop of JSON Object with Java & JSNI. http://stackoverflow.com/questions/7141650/, 2010.
[62] Stack Overflow. Converting JSON to HashMap<String, POJO> using GWT. https://stackoverflow.com/questions/20374351, 2017.
[63] Stack Exchange. Academic Papers Using Stack Exchange Data. https://meta.stackexchange.com/questions/134495/academic-papers-using-stack-exchange-data, 2019. Last accessed on 12 May 2019.
[64] C. Parnin and C. Treude. Measuring API documentation on the web. In Proceedings of the 2nd International Workshop on Web 2.0 for Software Engineering, pages 25–30, 2011.
[65] C. Parnin, C. Treude, L. Grammel, and M.-A. Storey. Crowd documentation: Exploring the coverage and dynamics of API discussions on Stack Overflow. Technical Report GIT-CS-12-05, Georgia Tech, 2012.
[66] T. Parr. The Definitive ANTLR Reference: Building Domain-Specific Languages. Pragmatic Bookshelf, 1st edition, 2007.
[67] H. Phan, H. A. Nguyen, N. M. Tran, L. H. Truong, A. T. Nguyen, and T. N. Nguyen. Statistical learning of API fully qualified names in code snippets of online forums. In Proceedings of the 40th International Conference on Software Engineering, pages 632–642, 2018.
[68] L. Ponzanelli, G. Bavota, M. Di Penta, R. Oliveto, and M. Lanza. Prompter: Turning the IDE into a self-confident programming assistant. Empirical Software Engineering, 21(5):2190–2231, 2016.
[69] L. Ponzanelli, G. Bavota, M. Di Penta, R. Oliveto, and M. Lanza. Prompter: Turning the IDE into a self-confident programming assistant. Empirical Software Engineering, 21(5):2190–2231, 2016.
[70] L. Ponzanelli, A. Mocci, A. Bacchelli, and M. Lanza. Understanding and classifying the quality of technical forum questions. In Proceedings of the 14th International Conference on Quality Software, pages 343–352, 2014.
[71] L. Ponzanelli, A. Mocci, A. Bacchelli, M. Lanza, and D. Fullerton. Improving low quality Stack Overflow post detection. In Proceedings of the IEEE International Conference on Software Maintenance and Evolution, pages 541–544, 2014.
[72] M. Raghothaman, Y. Wei, and Y. Hamadi. SWIM: synthesizing what I mean: code search and idiomatic snippet synthesis. In Proceedings of the 38th International Conference on Software Engineering, pages 357–367, 2016.
[73] P. C. Rigby and M. P. Robillard. Discovering essential code elements in informal documentation. In Proceedings of the 35th IEEE/ACM International Conference on Software Engineering, pages 832–841, 2013.
[74] M. P. Robillard. What makes APIs hard to learn? Answers from developers. IEEE Software, 26(6):26–34, 2009.
[75] M. P. Robillard, A. Marcus, C. Treude, G. Bavota, O. Chaparro, N. Ernst, M. A. Gerosa, M. Godfrey, M. Lanza, M. Linares-Vasquez, G. C. Murphy, L. M. D. Shepherd, and E. Wong. On-demand developer documentation. In Proceedings of the 33rd IEEE International Conference on Software Maintenance and Evolution, New Ideas and Emerging Results, page 5, 2017.
[76] J. Romano, J. D. Kromrey, J. Coraggio, J. Skowronek, and L. Devine. Exploring methods for evaluating group differences on the NSSE and other surveys: Are the t-test and Cohen's d indices the most appropriate choices? In Proceedings of the Annual Meeting of the Southern Association for Institutional Research, page 51, 2006.
[77] M. B. Rosson, J. M. Carroll, and R. K. Bellamy. Smalltalk scaffolding: a case study of minimalist instruction. In Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems, pages 423–430, 1990.
[78] SciPy. Mann-Whitney U Test. https://docs.scipy.org/doc/scipy-0.19.1/reference/generated/scipy.stats.mannwhitneyu.html, 2017.
[79] F. Shull, F. Lanubile, and V. R. Basili. Investigating reading techniques for object-oriented framework learning. IEEE Transactions on Software Engineering, 26(11):1101–1118, 2000.
[80] Pivotal Software. Spring Framework. https://spring.io/, 2017.
[81] L. Souza, E. Campos, and M. Maia. On the extraction of cookbooks for APIs from the crowd knowledge. In Proceedings of the 28th Brazilian Symposium on Software Engineering, pages 21–30, 2014.
[82] G. Sridhara, E. Hill, D. Muppaneni, L. Pollock, and K. Vijay-Shanker. Towards automatically generating summary comments for Java methods. In Proceedings of the 25th IEEE/ACM International Conference on Automated Software Engineering, pages 43–52, 2010.
[83] G. Sridhara, L. Pollock, and K. Vijay-Shanker. Automatically detecting and describing high level actions within methods. In Proceedings of the 33rd International Conference on Software Engineering, pages 101–110, 2011.
[84] S. Subramanian, L. Inozemtseva, and R. Holmes. Live API documentation. In Proceedings of the 36th International Conference on Software Engineering, pages 643–652, 2014.
[85] J. Sunshine, J. D. Herbsleb, and J. Aldrich. Searching the state space: A qualitative study of API protocol usability. In Proceedings of the International Conference on Program Comprehension, pages 82–93, 2015.
[86] V. Terragni, Y. Liu, and S.-C. Cheung. CSNIPPEX: automated synthesis of compilable code snippets from Q&A sites. In Proceedings of the 25th International Symposium on Software Testing and Analysis, pages 118–129, 2016.
[87] C. Treude and M. P. Robillard. Augmenting API documentation with insights from Stack Overflow. In Proceedings of the 38th International Conference on Software Engineering, pages 392–403, 2016.
[88] G. Uddin, O. Baysal, and L. Guerrouj. Understanding how and why developers seek and analyze API-related opinions. IEEE Transactions on Software Engineering, page 13, 2017.
[89] G. Uddin, O. Baysal, L. Guerrouj, and F. Khomh. Understanding how and why developers seek and analyze API-related opinions. IEEE Transactions on Software Engineering, 2019.
[90] G. Uddin and F. Khomh. Automatic summarization of API reviews. In Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering, pages 159–170, 2017.
[91] G. Uddin and F. Khomh. Automatic summarization of API reviews. In Submitted to the 32nd IEEE/ACM International Conference on Automated Software Engineering, page 12, 2017.
[92] G. Uddin and F. Khomh. Automatic opinion mining from API reviews from Stack Overflow. IEEE Transactions on Software Engineering, pages 1–37, 2019.
[93] G. Uddin and M. P. Robillard. How API documentation fails. IEEE Software, 32(4):76–83, 2015.
[94] G. Uddin and M. P. Robillard. Automatic resolution of API mentions in informal documents. In McGill Technical Report, page 6, 2017.
[95] J. Walnes. Xstream. http://x-stream.github.io/, 2017.
[96] W. Wang and M. W. Godfrey. Detecting API usage obstacles: A study of iOS and Android developer questions. In Proceedings of the 10th Working Conference on Mining Software Repositories, pages 61–64, 2013.
[97] W. Wang and M. W. Godfrey. Detecting API usage obstacles: a study of iOS and Android developer questions. In Proceedings of the 10th International Working Conference on Mining Software Repositories, pages 61–64, 2013.
[98] Wikipedia. Application programming interface. http://en.wikipedia.org/wiki/Application_programming_interface, 2014.
[99] T. Wilson, J. Wiebe, and P. Hoffmann. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 347–354, 2005.
[100] C. Wohlin, P. Runeson, M. Höst, M. C. Ohlsson, B. Regnell, and A. Wesslén. Experimentation in Software Engineering: An Introduction. Kluwer Academic Publishers, Norwell, MA, USA, 2000.
[101] B. Xu, Z. Xing, X. Xia, and D. Lo. AnswerBot: automated generation of answer summary to developers' technical questions. In Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering, pages 706–716, 2017.
[102] Y. Ya, H. Tong, T. Xie, L. Akoglu, F. Xu, and J. Lu. Detecting high-quality posts in community question answering sites. Information Sciences, 302(1):70–82, 2015.
[103] D. Yang, A. Hussain, and C. V. Lopes. From query to usable code: an analysis of Stack Overflow code snippets. In Proceedings of the 13th International Conference on Mining Software Repositories, pages 391–402, 2016.
[104] D. Ye, Z. Xing, C. Y. Foo, J. Li, and N. Kapre. Learning to extract API mentions from informal natural language discussions. In Proceedings of the 32nd International Conference on Software Maintenance and Evolution, page 12, 2016.
[105] In Proceedings of the 15th International Conference on Mining Software Repositories, pages 476–486, 2018.
[106] T. Zhang, G. Upadhyaya, A. Reinhardt, H. Rajan, and M. Kim. Are code examples on an online Q&A forum reliable?: a study of API misuse on Stack Overflow. In Proceedings of the 40th International Conference on Software Engineering, pages 886–896, 2018.