Automatic Detection of Five API Documentation Smells: Practitioners' Perspectives
Junaed Younus Khan a, Md. Tawkat Islam Khondaker a, Gias Uddin b, and Anindya Iqbal a
a Bangladesh University of Engineering and Technology, b University of Calgary
Abstract—The learning and usage of an API is supported by official documentation. Like source code, API documentation is itself a software product. Several research results show that bad design in API documentation can make the reuse of API features difficult. Indeed, similar to code smells or code anti-patterns, poorly designed API documentation can also exhibit 'smells'. Such documentation smells can be described as bad documentation styles that do not necessarily produce an incorrect documentation but nevertheless make the documentation difficult to properly understand and to use. Recent research on API documentation has focused on finding content inaccuracies in API documentation and on complementing API documentation with external resources (e.g., crowd-shared code examples). We are aware of no research that focused on the automatic detection of API documentation smells. This paper makes two contributions. First, we produce a catalog of five API documentation smells by consulting literature on API documentation presentation problems. We create a benchmark dataset of 1,000 API documentation units by exhaustively and manually validating the presence of the five smells in Java official API reference and instruction documentation. Second, we conduct a survey of 21 professional software developers to validate the catalog. The developers agreed that they frequently encounter all five smells in API official documentation, and 95.2% of them reported that the presence of the documentation smells negatively affects their productivity. The participants wished for tool support to automatically detect and fix the smells in API official documentation. We develop a suite of rule-based, deep and shallow machine learning classifiers to automatically detect the smells. The best performing classifier, BERT, a deep learning model, achieves F1-scores of 0.75 - 0.97.
Index Terms—API Documentation, Smell, Benchmark, Survey, Shallow Learning, Deep Learning.
I. INTRODUCTION
APIs (Application Programming Interfaces) are interfaces to reusable software libraries and frameworks. Proper learning of APIs is paramount to support modern-day rapid software development. To support this, APIs are typically accompanied by official documentation. API documentation is a product itself, which warrants creation and maintenance principles similar to those of any existing software product. A good documentation can facilitate the proper usage of an API, while a bad documentation can severely harm its adoption [6], [61], [62]. A significant body of API documentation research has focused on studying API documentation problems based on surveys and interviews of software developers [6], [11], [13], [26], [27], [42], [61], [62], [64], [91]. Broadly, API documentation problems are divided into two types, what (i.e., what is documented) and how (i.e., how it is documented) [7], [81]. Tools and techniques have been developed to address the 'what' problems in API documentation, such as detection of code comment inconsistency [57], [75], [84], [92], natural language summary generation of source code [44], [47], [65], [71], adding descriptions of API methods by consulting external resources (e.g., online forums) [5], detecting obsolete API documentation by comparing API versions [16], [17], and complementing official documentation by incorporating insights and code examples from developer forums [72], [78].
Fig. 1. The three major phases used in this study: (1) create a benchmark of five API documentation smells by consulting literature and software practitioners; (2) conduct a survey of professional software developers on the prevalence and perceived impact of the API documentation smells; (3) investigate the effectiveness of automated documentation smell detection techniques using the benchmark.
In contrast, not much research has focused on the automatic detection of 'how' problems, e.g., bad design in API documentation that can make the reuse of API features difficult due to lack of usability [81]. Recently, Treude et al. [77] found that not all API documentation units are equally readable. This finding reinforces the need to automatically detect API documentation presentation issues as 'documentation smells', as previously highlighted by Aghajani et al. [6]. Unfortunately, we are not aware of any research on the automatic detection of such API documentation smells.

As a first step towards developing techniques to detect smells in API documentation, in this paper, we follow three phases (see Fig. 1). First, we identify five API documentation smells by consulting API documentation literature [7], [81] (Section II). Four of the smells (bloated, fragmented and tangled description of an API documentation unit, and excess structural info in the description) are reported as presentation problems by Uddin and Robillard [81]. The other smell is called 'Lazy documentation', and it refers to inadequate description of an API documentation unit (e.g., no explanation of method parameters). Such incomplete documentation is reported in literature [7] and in online discussions. We exhaustively explored official API documentation to find the occurrences of the five smells. The focus was to develop a benchmark of smelly
API documentation units. A total of 19 human coders participated in this exercise. This phase resulted in a benchmark of 1,000 API documentation units, where 778 units have at least one of the five smells. To the best of our knowledge, this is the first benchmark with real-world examples of the five documentation smells.

In the second phase (Section III), we conducted a survey of 21 professional software developers to validate our catalog of API documentation smells. All the participants reported that they frequently encounter the five API documentation smells. More than 95% of the participants (20 out of 21) reported that the presence of the five smells in API documentation negatively impacts their productivity (see Fig. 2). The participants asked for tool support to automatically detect and fix the smells in API official documentation. These findings corroborate previous research that design and presentation issues in API documentation can hinder API usage [6], [81].

Fig. 2. Survey responses from professional developers on whether the presence of the smells in API documentation hinders productivity.

In the third phase (Section IV), we investigate a suite of rule-based, shallow and deep machine learning models using the benchmark to investigate the feasibility of automatically detecting the five smells. The best performing classifier, BERT, a deep learning model, achieves F1-scores of 0.75 - 0.97. To the best of our knowledge, ours are the first techniques to automatically detect the five API documentation smells. The machine learning models can be used to monitor and warn about API documentation quality by automatically detecting the smells in real-time with high accuracy.
The replication package with the benchmark, code, and survey is shared at https://github.com/disa-lab/SANER2021-DocSmell.

II. A BENCHMARK OF API DOCUMENTATION SMELLS
We describe the methodology to create our benchmark of API documentation smells (Section II-A) and then present the benchmark with real-world examples (Sections II-B - II-C).
A. Benchmark Creation Methodology
Code and design smells are relatively well-studied fields of software engineering. However, to the best of our knowledge, this is the first research on API documentation smells. As such, we needed to investigate both the literature on API documentation [6], [7], [62], [81] and the diverse API documentation resources (e.g., Java SE docs) during the creation of our catalog of API documentation smells. We followed a three-step process, which closely mimics the standard approaches followed in code/design smell formulation studies [2], [3]. The three steps are outlined in Fig. 3 and are explained below.
Fig. 3. The three major steps in the benchmark creation process: (1) Knowledge Acquisition (literature on documentation problems, documentation sources, and documentation practices; code smells, design anti-patterns, and software development practices), producing a catalog of five documentation smells; (2) Feasibility Analysis (focus group over API documents and literature), producing several real examples of each of the five smells with complete agreement among team members; and (3) Benchmark Creation (collect large samples of API documentation units, filter potential smelly API documentation units, recruit coders, create coding guide), producing the benchmark of the five smells.
Knowledge Acquisition.
Similar to code and design smells that do not directly introduce a defect or a bug into a software system, documentation smells refer to presentation issues that do not make a documentation incorrect; rather, they hinder its proper usage due to the lack of quality in the design of the documented contents. As such, we extensively studied the API documentation literature that reported issues related to API documentation presentation and usability [7], [81]. For example, the most recent paper on this topic was by Aghajani et al. [6], [7], who divided the 'how' problems in API documentation into four categories: maintainability (e.g., lengthy files), readability (e.g., clarity), usability (e.g., information organization like dependency structure), and usefulness (e.g., content not useful in practice). Previously, Uddin and Robillard [81] studied 10 common problems in API documentation by surveying 323 IBM developers. They observed four common problems related to presentation, i.e., bloated (i.e., too long description), tangled (i.e., complicated documentation), fragmented (i.e., scattered description), and excessive structural information (i.e., information organization like dependency structure). Given that the four problems appeared in both studies, we included each as a documentation smell in our study. In addition, we added the lack of proper description of an API method as a 'lazy' documentation smell, because incomplete documentation problems are discussed in literature [6], [81] as well as in online developer discussions (see Fig. 4).
Feasibility Analysis.
Once we decided on the five smells, we conducted a feasibility study by looking for real-world examples of the smells in official and instructional API documentation. This was important to ensure that the smells are prevalent in API documentation and that we can find them with reasonable confidence, because otherwise there is no way we could design automated techniques to detect them. We combined our knowledge of the five smells gained from the API documentation literature with active exploration of the five smells in the API official documentation. We conducted multiple focus group discussions where all four authors together analyzed potential examples of the five smells in API documentation and mapped the characteristics of such API documentation to the descriptions of the smells in the literature/developer discussions. Before every such focus group meeting, the first two authors created a list of 50 API documentation units with their labels of the five smells in the units. The four authors discussed those labels together, refined the labels, and identified/refined the labeling criteria. This iterative process led to increased understanding among the group members of the specific characteristics of the five documentation smells. From multiple discussion sessions, the final output was a list of 50 labeled datapoints.

Fig. 4. Tweet complaining about lazy documentation of an API method.
Benchmark Creation.
In the last step of the benchmark creation process, we expanded our initial list of 50 API documentation units with smell labels as follows. We collected documentation of over 29K methods belonging to over 4K classes of 217 different packages. We extracted this documentation from the online Java API documentation website [1] through web crawling and text parsing techniques. Since a documentation can contain multiple smells at the same time, this is a multi-labeled dataset. We produced the benchmark as follows. First, all the authors mutually discussed the documentation smells. Then, we randomly selected 950 documentations from the total of 29K that we extracted. The first two authors labeled the first 50 documentations separately. When they finished, they consulted the other co-authors and resolved the disagreements based on the discussion. Then they continued with the next 50 documentations and repeated the same process. Their labeling agreement was recorded using Cohen's Kappa coefficient [45] for each iteration, i.e., for each batch of 50 documentations (Table I). After the third iteration, the two authors reached an almost perfect agreement level with a Cohen's Kappa coefficient of 0.83. They then prepared a coding guideline for the labeling task, which was later presented to 17 computer science undergraduate students. The students labeled the remaining 800 documentation units. During the entire coding sessions by the 17 coders, the first two authors remained available to them via Skype/Slack. Each coder consulted their labels with the two authors. This ensured quality and mitigated subjective bias in the manual labeling of the benchmark.
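The per-iteration agreement can be computed directly with scikit-learn; the sketch below is a minimal illustration with hypothetical labels for one smell, not our actual labeling data.

```python
# Sketch: inter-rater agreement for one iteration, assuming each coder's
# decisions for one smell are binary lists (the data below is hypothetical).
from sklearn.metrics import cohen_kappa_score

coder_a = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]  # labels by the first author
coder_b = [1, 0, 1, 1, 1, 0, 1, 0, 0, 0]  # labels by the second author

kappa = cohen_kappa_score(coder_a, coder_b)
print(f"Cohen's kappa for this iteration: {kappa:.2f}")
```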
TABLE I
MEASURE OF AGREEMENT BETWEEN TWO LABELERS
Iteration ID | Documentation Unit | κ

B. The Five Documentation Smells in the Benchmark
Bloated Documentation Smell.
By 'Bloated' we mean documentation whose description (of an API element) is verbose or excessively elaborate. It is difficult to understand or follow a lengthy documentation [81]. Moreover, it cannot be effectively managed, which makes it hard to modify when needed, e.g., in case of any update in the API source code. In our benchmark, we found many documentations that are larger than necessary. For example, the documentation shown in Fig. 5 is so verbose and lengthy that it is hard to follow and use. Hence, it is a bloated documentation.
Fig. 5. Example of Bloated Smell.
Excess Structural Information Smell.
Such a description of a documentation unit (e.g., a method) contains too much structural syntax or information; e.g., the Javadoc of the java.lang.Object class lists all the hundreds of subclasses of the class. In our study, we find this type of documentation to contain many class and package names. For instance, the documentation of Fig. 6 contains much structural information (marked in the red rectangle) that is quite unnecessary for the purpose of understanding and using the underlying method.

Fig. 6. Example of Excess Structural Information Smell.
Tangled Documentation Smell.
The documentation of an API element (method) is 'Tangled' if its description is tangled with various information (e.g., from other methods). This makes it complex and thereby reduces the readability and understandability of the description. Fig. 7 depicts an example of tangled documentation which is hard to follow and understand.
Fig. 7. Example of Tangled Smell.
Fragmented Documentation Smell.
Sometimes the information of a documentation (related to an API element) is scattered (i.e., fragmented) over too many pages or sections. In our empirical study, we found a good number of documentations that contain many URLs and references, which indicate a possible fragmentation smell. For example, the documentation of Fig. 8 is fragmented as it refers the readers to other pages or sections for details.
Fig. 8. Example of Fragmented Smell.
Lazy Documentation Smell.
We categorize a documentation as 'Lazy' if it contains very little information to convey to the readers. In many cases, the documentation does not contain any extra information except what can be perceived directly from the function name. Hence, this kind of documentation does not have much to offer to the readers. We see a lazy documentation in Fig. 9, where the documentation says nothing more about the underlying method than what is suggested by the prototype itself.
Fig. 9. Example of Lazy Smell.
C. Distribution of API Documentation Smells in Benchmark
We calculated the total number of smells in our dataset (Fig. 10). We found that 778 documentations (almost 78%) of our dataset contain at least one smell. While most (524) of the smelly documentations contain only one type of smell, a small number (19) of documentations show as many as four smells at the same time. We also determined the distribution of the different smells in our dataset (Fig. 11). It shows that all five types of smells occur in the dataset with a considerable frequency, where the most frequent smell in our dataset is 'Lazy' with 275 occurrences and the least frequent smell is 'Bloated' with 141 occurrences.

Fig. 10. Smell distribution by number of smells per documentation unit.

In multi-label learning, the labels might be interdependent and correlated [31]. We used Phi coefficients to determine such interdependencies and correlations between the different documentation smells. The Phi coefficient is a measure of association between two binary variables [15]. It ranges from -1 to +1, where ±1 indicates a perfect positive or negative correlation and 0 indicates no relationship. We report the Phi coefficients between each pair of labels in Fig. 12. We find that there is almost no correlation between 'Fragmented' and any other smell (except 'Lazy'). By definition, the information of fragmented documentation is scattered over many sections or pages. Hence, it has little to do with smells like 'Bloated', 'Excess Structural Information', or 'Tangled'. We also observe that there is a weak positive correlation (+0.2 to +0.4) among the 'Bloated', 'Excess Structural Information', and 'Tangled' smells. One possible reason might be that if a documentation is filled with complex and unorganized information (Tangled) or unnecessary structural information (Excess Structural Information), it might be prone to become bloated as well. On the other hand, the 'Lazy' smell has a weak negative correlation (-0.2 to -0.3) with all other groups, since these kinds of documentation are often too small to contain other smells. However, none of these coefficients is high enough to imply a strong or moderate correlation between any pair of labels. Hence, all types of smells in our study are more or less unique in nature.

Fig. 12. Correlation between different documentation smells in our benchmark. Red, blue, and gray mean positive, negative, and no correlation; the intensity of the color indicates the level of correlation.
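For binary variables, the Phi coefficient coincides with the Matthews correlation coefficient, so the pairwise values behind Fig. 12 can be computed as in the following sketch; the random label matrix is only a placeholder for our benchmark labels.

```python
# Sketch: pairwise Phi coefficients between smell labels. For binary
# variables, Phi equals the Matthews correlation coefficient.
import itertools
import numpy as np
from sklearn.metrics import matthews_corrcoef

smells = ["Bloated", "Excess Struct", "Tangled", "Fragmented", "Lazy"]
Y = np.random.randint(0, 2, size=(1000, 5))  # placeholder for the benchmark label matrix

for i, j in itertools.combinations(range(len(smells)), 2):
    phi = matthews_corrcoef(Y[:, i], Y[:, j])
    print(f"phi({smells[i]}, {smells[j]}) = {phi:+.2f}")
```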
III. DEVELOPERS' SURVEY OF DOCUMENTATION SMELLS
Four out of the five API documentation smells in our study were previously reported as commonly observed by IBM developers [81]. The other smell (lazy documentation) is reported as a problem in API documentation in multiple studies [6], [61]. Given that we extended previous studies by creating a benchmark of the smells with real-world examples, we needed to further ensure that our collected examples of smelly documentation units do resonate with software developers. We, therefore, conducted a survey of professional software developers (1) to validate our catalog of the five API documentation smells and (2) to understand whether, similar to previous research, developers agree with the negative impact
of the documentation smells. In particular, we explore the following two research questions:
RQ1. How do software developers agree with our catalog and examples of the five API documentation smells?
RQ2. How do software developers perceive the impact of the detected documentation smells?
A. Survey Setup
We recruited 21 professional software developers who are working in the software industry. We ensured that each developer is actively involved in daily software development activities like API reuse and documentation consultation. The participants were recruited through personal contacts. First, each participant had to answer two demographic questions: current profession and years of experience in software development. We then presented each participant with two Javadoc examples of each smell and asked whether they agreed that the documentation example belonged to that particular smell. Then, we asked them how frequently they faced these documentation smells. Finally, we asked them about the negative impact of the documentation smells on their overall productivity during software development. Out of the 21 participants, 14 had less than 5 years of experience and the rest had more than 5 years. The majority of the participants had less than 5 years of experience because such developers are likely to be more engaged in studying API documentation as part of their software programming responsibilities, while developers with more than five years of experience are more engaged in the design of the software and its architecture.
B. How do software developers agree with our catalog and examples of the five API documentation smells? (RQ1)
We showed each participant two examples of each smell, i.e., 10 examples in total. For each example, we asked two questions: (1) Do you think the documentation mentioned above is [smell, e.g., lazy]? The options are on a Likert scale, i.e., strongly agree, agree, neutral, disagree, and strongly disagree. (2) Based on your experience of the last three months, how frequently did you observe this [smell, e.g., lazy] in documentation? The options are: never, once or twice, occasionally, frequently, and no opinion. The options are picked from literature [81]. Two examples per smell ensure increased confidence in the feedback we get from each participant.

Fig. 13. Survey response on whether the software developers agreed with our labeled documentation smell examples.

Fig. 14. Survey response on how frequently the participants faced the documentation smells in the last three months.

Fig. 13 shows the responses of the participants to the first question. More than 75% of the participants agreed with the examples of three smells: bloated, tangled, and excess structural info. At least 50% of the participants agreed with the examples of the other two smells. Only 5-25% of the participants disagreed with the examples. Overall, each example of an API documentation smell was agreed to by at least 50% of the participants. This validates our catalog of API documentation smells based on feedback from the professional developers.

Fig. 14 shows the frequency of the documentation smells the developers observed in the last three months (second question). We found that 50% of the participants had faced all the smells, and the lazy smell was the most frequently encountered. On the other hand, half of the participants did not face the bloated documentation smell in the last three months, while 60%-65% of the participants faced tangled, excess structural info, and fragmented API documentation. This study reveals that API documentation becomes less explicable, more complex, and unnecessarily structured when trying to keep the documentation short. To solve this problem, API documentation needs to be more understandable and elaborated enough to explain the API functionality.
C. How do software developers perceive the impact of the detected documentation smells? (RQ2)
We asked the participants how severely the documentation smells impact their development tasks. The responses were taken on a scale of five degrees: "Blocker", "Severe", "Moderate", "Not a Problem", and "No opinion". The options were picked from similar questions on API documentation presentation problems from literature [81].
Fig. 15. The perceived impact of the five documentation smells by severity and frequency. Circle size indicates the percentage of participants who strongly agreed or agreed to the smells.
We analyzed the impact of the documentation smells with respect to the frequency of the smells the participants had observed over the past three months (see Fig. 15). For each smell, we compute the frequency scale (x-axis) as the percentage of responses "Frequently", "Occasionally", and "Once or twice". For example, regarding whether the participants had observed lazy documentation in the past three months, 25% answered "Frequently", 35% answered "Occasionally", and 30% answered "Once or twice", leading to a total of 90% on the frequency scale. We constructed the severity scale (y-axis) by combining the percentages of the participants who responded with "Blocker", "Severe", and "Moderate". For example, due to fragmented documentation smells, 5% of the participants could not use that particular API and picked another API ("Blocker"), 20% of the participants believed that they wasted a lot of time figuring out the API functionality ("Severe"), and 25% of the participants felt irritated ("Moderate") with the fragmented documentation. The circle size indicates the percentage of the participants who "Strongly Agree" or "Agree" with the examples containing documentation smells.

From Fig. 15, we observed that lazy documentation had the most frequent and the most negative impact (90%). Tangled documentation was identified as the second most severe smell (85%). Although bloated documentation was considered more severe (65%) than excess structural info (55% severity) and fragmented (50% severity) documentation, bloated occurred less frequently than the latter two. The most important finding of this survey is that the coordinates of all the circles (referring to documentation smells) in Fig. 15 were above or equal to 50%. This indicates that, according to the majority of the participants, these documentation smells occur frequently and hinder the productivity of the development tasks.

IV. AUTOMATIC DETECTION OF THE SMELLS
The responses from the survey validate our catalog of API documentation smells. The perceived negative impact of the smells on developers' productivity, as evidenced by the responses from our survey participants, necessitates the need to fix API documentation by removing the smells. To do that, we first need to detect the smells automatically in the API documentation. The automatic detection offers two benefits: (1) we can use the techniques to automatically monitor and warn about bad documentation quality, and (2) we can design techniques to fix the smells based on the detection. In addition, manual effort can also be made to improve detected examples. To determine the feasibility of techniques to detect API documentation smells using our benchmark, we answer three research questions:
RQ3. How accurate are rule-based classifiers to automatically detect the documentation smells?
RQ4. Can the shallow machine learning models outperform the rule-based classifiers?
RQ5. Can the deep machine learning models outperform the other models?

The shallow and deep learning models are supervised, for which we used 5-fold iterative stratified cross-validation as recommended for a multi-label dataset in [67]. Traditional k-fold cross-validation is a statistical method of evaluating machine learning algorithms which divides data into k equally sized folds and runs for k iterations [59]. In each iteration, each of the k folds is used as the held-out set for validation while the remaining k - 1 folds are used for training. For multi-label data with rare label combinations, traditional k-fold cross-validation is impractical since most groups might consist of just a single example. Iterative stratification, proposed by [67], solves this issue by employing a greedy approach of selecting the rarest groups first and adding them to the smallest folds while splitting.

We report the performances using four standard metrics in information retrieval [43]. Accuracy (A) is the ratio of correctly predicted instances out of all the instances. Precision (P) is the ratio between the number of correctly predicted instances and all the predicted instances for a given smell. Recall (R) represents the ratio of the number of correctly predicted instances and all instances belonging to a given class. F1-score (F1) is the harmonic mean of precision and recall.

P = TP / (TP + FP), R = TP / (TP + FN), F1 = (2 * P * R) / (P + R), A = (TP + TN) / (TP + FP + TN + FN)
TP = correctly classified as a smell, FP = incorrectly classified as a smell, TN = correctly classified as not a smell, FN = incorrectly classified as not a smell.
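The evaluation loop can be sketched as follows, assuming a feature matrix X, a binary label matrix Y (one column per smell), and any scikit-learn-style multi-label model; IterativeStratification comes from the scikit-multilearn package.

```python
# Sketch: 5-fold iterative stratified cross-validation with per-smell
# A/P/R/F1, under the assumption that X and Y are dense NumPy arrays.
from skmultilearn.model_selection import IterativeStratification
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(model, X, Y, n_splits=5):
    folds = IterativeStratification(n_splits=n_splits, order=1)
    scores = []
    for train_idx, test_idx in folds.split(X, Y):
        model.fit(X[train_idx], Y[train_idx])
        pred = model.predict(X[test_idx])  # assumed dense; use .toarray() for sparse outputs
        for label in range(Y.shape[1]):    # one binary evaluation per smell
            p, r, f1, _ = precision_recall_fscore_support(
                Y[test_idx, label], pred[:, label], average="binary", zero_division=0)
            a = accuracy_score(Y[test_idx, label], pred[:, label])
            scores.append((label, a, p, r, f1))
    return scores
```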
A. Performance of Rule-Based Classifiers (RQ3)
Based on manual analysis of a statistically significant random sample of our benchmark dataset (95% confidence level and 5% confidence interval), we designed six metrics to establish five rule-based classifiers, as described below.
1) Rule-based Metrics:
(a) Documentation Length. We use the length of every documentation in order to capture the extensiveness of bloated documentations.
(b) Readability Metrics. We measure the Flesch readability metrics [25] of the documentations to analyze their understandability. This feature might be useful to detect tangled documentations.
(c) Number of Acronyms and Jargons. Since acronyms and jargon increase the complexity of a reading passage [10], we use the number of acronyms and jargon terms in every documentation to detect tangled documentations.
(d) Number of URLs is computed because URLs are hints of possible fragmentation in the documentation.
(e) Number of function, class, and package names mentioned in the documentation is computed to capture the excess structural information smell.
(f) Edit Distance. The edit distance (i.e., a measure of dissimilarity) between the description of a lazy documentation and its corresponding unit definition (i.e., method prototype) can be smaller than that of non-lazy documentations. We calculate the Levenshtein distance [38] between the documentation description and the method prototype.
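A minimal sketch of the six metrics is shown below; the regular expressions for URLs and structural (dotted) names and the jargon lookup are simplified heuristics of our own, not exact re-implementations, and the Flesch score is taken from the textstat package.

```python
# Sketch: the six rule-based metrics for one documentation unit.
import re
import textstat

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def rule_metrics(description: str, prototype: str, jargon: set) -> dict:
    words = description.split()
    return {
        "doc_length": len(words),
        "readability": textstat.flesch_reading_ease(description),
        "num_jargon": sum(w.lower().strip(".,()") in jargon for w in words),
        "num_urls": len(re.findall(r"https?://\S+", description)),
        # Dotted identifiers (e.g., java.lang.Object) as a proxy for structural info.
        "num_struct": len(re.findall(r"\b[a-zA-Z_]\w*(?:\.\w+)+\b", description)),
        "edit_distance": levenshtein(description, prototype),
    }
```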
2) Rule-based Classifiers:
Fig. 16 shows the flowchart of the rule-based classification approach. For each metric, we study the average, 25th, 50th, 75th, and 90th percentiles as thresholds.

Fig. 16. Flowchart of the rule-based classification approach.
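The rule itself reduces to a percentile comparison, as in this sketch; for a metric where smaller values indicate the smell (edit distance for lazy documentation), the comparison is reversed.

```python
# Sketch: flag units whose metric value crosses a percentile threshold,
# `values` holding one rule-based metric over the whole corpus.
import numpy as np

def rule_classify(values, threshold_kind="90P", higher_is_smelly=True):
    values = np.asarray(values, dtype=float)
    if threshold_kind == "AVG":
        threshold = values.mean()
    else:                       # one of "25P", "50P", "75P", "90P"
        threshold = np.percentile(values, int(threshold_kind[:-1]))
    return values > threshold if higher_is_smelly else values < threshold
```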
3) Results:
In Table II, we report the performance of the baseline models for each documentation smell. Different thresholds of the features achieved higher performance for different documentation smells. For example, taking the 90th percentiles of the features' values, the baseline model achieved the highest performance for bloated documentation detection, while lazy and excess structural information smell detection required the 25th percentile and 50th percentile, respectively. Notably, the performance of the baseline models in detecting bloated (F1 = .90) and lazy (F1 = .95) documentation was higher than detecting excess structural info (F1 = .42), tangled (F1 = .41), and fragmented (F1 = .47) documentations.

TABLE II
CLASS-WISE PERFORMANCE OF RULE-BASED BASELINE MODELS BY THE METRIC THRESHOLDS (P STANDS FOR PERCENTILE)

Threshold | Bloated A/P/R/F1 | Lazy A/P/R/F1 | Excess Struct A/P/R/F1 | Tangled A/P/R/F1 | Fragmented A/P/R/F1
AVG | .77 .38 .86 .52 | .58 .39 .71 .51 | .68 .35 .34 .34 | .49 .13 .18 .15 | .65 .33 .52 .40
25P | .39 .18 .64 .29 | .96 .96 .93 .95 | .67 .38 .30 .34 | .54 .09 .09 .09 | .52 .31 .90 .47
50P | .64 .28 .79 .41 | .77 .55 .84 .66 | .75 .37 .50 .42 | .45 .20 .40 .26 | .61 .34 .71 .46
75P | .89 .56 .93 .70 | .52 .36 .67 .47 | .65 .32 .31 .31 | .37 .25 .75 .37 | .67 .29 .27 .28
90P | .95 .97 .85 .90 | .37 .30 .56 .39 | .75 .50 .17 .25 | .33 .26 .96 .41 | .72 .27 .10 .15

B. Performance of Shallow Learning Models (RQ4)

1) Shallow Learning Models:
Since documentation smell detection is a multi-label classification problem, we employed different decomposition approaches: One-Vs-Rest (OVR), Label Powerset (LPS), and Classifier Chains (CC) [18], [79], [80], [89], with Support Vector Machine (SVM) [14] as the base estimator. We chose SVM and OVR-SVM since those are successfully used for multi-label text classification [22], [23], [28], [35], [79]. Each model trains a single classifier per class, with the samples of that class as positive samples and all other samples as negatives. Each individual classifier then separately gives predictions for unseen data. We used a linear kernel for the SVM classifiers, as recommended by earlier works [33], [86], [87], [90]. We also evaluated adapted approaches like Multi-Label k-Nearest Neighbor (ML-kNN) [88] in this study. It finds the k nearest neighbors of an input instance using kNN, then uses Bayesian inference to determine the label set of the instance. We studied this method because it has been reported to achieve considerable performance for different multi-label classification tasks in previous studies [8], [88]. For each algorithm, we picked the best model using standard practices, e.g., hyperparameter tuning in SVM as recommended by Hsu et al. [83] and the choice of k in ML-kNN as recommended by [8].
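With scikit-learn and scikit-multilearn, the four shallow setups can be instantiated as in the sketch below; LinearSVC plays the role of the linear-kernel SVM base estimator, and k = 10 for ML-kNN is only an illustrative value.

```python
# Sketch: the four shallow multi-label setups, assuming scikit-learn
# and scikit-multilearn are available.
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import ClassifierChain
from skmultilearn.problem_transform import LabelPowerset
from skmultilearn.adapt import MLkNN

models = {
    "OVR-SVM": OneVsRestClassifier(LinearSVC()),  # one binary SVM per smell
    "CC-SVM": ClassifierChain(LinearSVC()),       # chained binary SVMs
    "LPS-SVM": LabelPowerset(LinearSVC()),        # one class per label combination
    "ML-kNN": MLkNN(k=10),                        # k is a hyperparameter to tune
}
```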
2) Studied Features:
We used two types of features: (1) the rule-based metrics (described in Section IV-A1) and (2) bag of words (BoW) [32]. Bag of words is a common feature extraction procedure for text data and has been successfully used for text classification problems [46], [66].
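A BoW matrix can be obtained with scikit-learn's CountVectorizer, as in this sketch over two illustrative descriptions.

```python
# Sketch: bag-of-words features; `descriptions` stands in for the
# documentation description strings of the benchmark.
from sklearn.feature_extraction.text import CountVectorizer

descriptions = ["Returns the hash code value for this map.",
                "Associates the specified value with the specified key."]
vectorizer = CountVectorizer(lowercase=True)
X_bow = vectorizer.fit_transform(descriptions)  # sparse document-term matrix
```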
3) Results:
Table III presents the performance of the shallow learning models. The best performer is OVR-SVM, followed closely by CC-SVM. CC-based models are generally superior to OVR-based models because of their capability of capturing label correlation [58]. Since the labels (types) of the presentation smells are not correlated (see Section II-C), the CC-based SVM could not exhibit higher performance than the OVR-based SVM. Using rule-based features, OVR-SVM achieved a higher F1-score (0.88) than the other models for bloated documentation detection, because documentation length (a rule-based feature) was more effective in detecting bloated documentation than bag of words. On the other hand, LPS-SVM achieved a higher F1-score (0.58) for fragmented documentation detection using bag of words, as bag of words more successfully determined whether the documentation was referring to other documentation than any rule-based feature. Overall, the shallow models outperformed the rule-based classifiers for four smell types (all except the lazy documentation smell). Therefore, documentation smell detection does not normally depend on a single rule-based metric; rather, it depends on a combination of different metrics and their thresholds. The shallow learning models attempted to capture this combination of thresholds and therefore achieved better performances than the baseline models.
4) Feature Importance Analysis:
We verified the importance of our rule-based features by applying the permutation feature importance technique [9], [24] to the best performing shallow model, i.e., OVR-SVM. We first train OVR-SVM with all the features. While testing, we randomly shuffle the values of one feature at a time while keeping the other feature values unchanged. A feature is important if shuffling its values affects the model performance. We calculate the change in performance in two ways. First, we measure the change in the average F1-score of the OVR-SVM model for the permutation of a feature. Second, we report the change for the specific class that the feature was intended for (e.g., 'Documentation Length' for 'Bloated'). We observe that the permutation of any of our rule-based features degrades the model performance (see Table IV). For example, after permuting the values of 'Documentation Length' in the test data, the average F1-score decreases by 0.17 (from 0.62 to 0.45) and the F1-score of the desired class (i.e., 'Bloated') decreases by 0.46 (from 0.88 to 0.42). This analysis confirms the importance of combining rule-based metrics as features in the models.
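The procedure can be sketched as below, assuming dense NumPy arrays and a fitted multi-label model; the returned drop corresponds to one cell of Table IV.

```python
# Sketch: manual permutation importance as described above -- shuffle one
# feature column in the held-out set and measure the F1 drop for one smell.
import numpy as np
from sklearn.metrics import f1_score

def permutation_drop(model, X_test, Y_test, feature_idx, label_idx, seed=0):
    rng = np.random.default_rng(seed)
    base = f1_score(Y_test[:, label_idx], model.predict(X_test)[:, label_idx])
    X_perm = X_test.copy()
    rng.shuffle(X_perm[:, feature_idx])   # break the feature-label association
    perm = f1_score(Y_test[:, label_idx], model.predict(X_perm)[:, label_idx])
    return base - perm                    # positive drop => the feature matters
```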
C. Performance of Deep Learning Models (RQ5)

1) Deep Learning Models:
We evaluated two deep learning models: Bidirectional LSTM (Bi-LSTM) and Bidirectional Encoder Representations from Transformers (BERT). We picked Bi-LSTM because it is more capable of exploiting contextual information than the unidirectional LSTM [30]. Hence, the Bi-LSTM network can detect a documentation smell by capturing the information of the API documentation from both directions. BERT is a pre-trained model which was designed to learn contextual word representations of unlabeled texts [21].
TABLE III
CLASS-WISE PERFORMANCE OF SHALLOW MACHINE LEARNING MODELS

Feature | Model | Bloated A/P/R/F1 | Lazy A/P/R/F1 | Excess Struct A/P/R/F1 | Tangled A/P/R/F1 | Fragmented A/P/R/F1
Rule-Based Feats | OVR-SVM | .96 .88 .89 .88 | .94 .86 .94 .90 | .74 .45 .23 .31 | .82 .67 .56 .61 | .80 .69 .25 .37
Rule-Based Feats | LPS-SVM | .94 .86 .70 .77 | .91 .77 .97 .86 | .74 .44 .21 .28 | .80 .70 .40 .51 | .81 .73 .32 .45
Rule-Based Feats | CC-SVM | .96 .88 .87 .88 | .92 .79 .97 .87 | .75 .47 .24 .32 | .82 .68 .54 .60 | .80 .71 .27 .39
Rule-Based Feats | ML-kNN | .93 .73 .89 .80 | .91 .86 .80 .83 | .75 .49 .31 .38 | .80 .63 .54 .58 | .79 .57 .50 .53
BoW Feats | OVR-SVM | .93 .84 .66 .74 | .95 .87 .96 .91 | .75 .49 .47 .48 | .78 .57 .54 .56 | .79 .55 .54 .55
BoW Feats | LPS-SVM | .93 .89 .63 .74 | .94 .83 .97 .89 | .75 .50 .49 .50 | .79 .59 .58 .58 | .80 .59 .58 .58
BoW Feats | CC-SVM | .93 .85 .67 .75 | .94 .85 .96 .90 | .74 .48 .47 .48 | .78 .57 .54 .56 | .78 .54 .54 .54
BoW Feats | ML-kNN | .93 .86 .60 .71 | .88 .75 .83 .79 | .73 .44 .29 .35 | .79 .59 .53 .56 | .80 .63 .41 .50

TABLE IV
OVR-SVM PERFORMANCE DECREASE IN FEATURE PERMUTATION

Permuted Feature | Desired Class C | Decrease in F1 (Overall) | Decrease in F1 (Desired C)
Doc Length | Bloated | .17 | .46
Readability | Tangled | .06 | .11

We picked BERT because it has been found to significantly outperform other models in various natural language processing and text classification tasks [4], [29], [39], [40], [48], [52], [76]. We constructed a Bi-LSTM model with 300 hidden states. We used the ADAM optimizer [37] with an initial learning rate of 0.001. We trained the model with batch size 256 over 10 epochs. We used BERT-Base for this study, which has 12 layers with 12 attention heads and 110 million parameters. We trained it on the benchmark for 10 epochs with a mini-batch size of 32. We used early stopping to avoid overfitting [56] and considered validation loss as the metric for early stopping [55]. The maximum length of the input sequence was set to 256. We used the AdamW optimizer [41] with the learning rate set to 4e-5, β1 to 0.9, β2 to 0.999, and ε to 1e-8 [21], [73]. We used binary cross-entropy to calculate the loss [63].
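A minimal multi-label fine-tuning setup in the Hugging Face Transformers style is sketched below with the stated hyperparameters; the single-example batch and its label vector are illustrative only, and the full training loop (batching over the benchmark, early stopping) is omitted.

```python
# Sketch: multi-label BERT fine-tuning setup (BERT-Base, max length 256,
# AdamW with lr 4e-5, BCE-with-logits loss via problem_type).
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=5,
    problem_type="multi_label_classification",  # binary cross-entropy loss
)
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-5, eps=1e-8)

# One illustrative optimization step.
batch = tokenizer(["Returns the hash code value for this map."],
                  truncation=True, max_length=256, padding=True,
                  return_tensors="pt")
labels = torch.tensor([[0., 0., 0., 0., 1.]])  # e.g., only 'Lazy' is present
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
```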
2) Studied Features:
We used word embeddings as features. A word embedding is a form of word representation that is capable of capturing the context of a word in a document by mapping words with similar meanings to similar representations. For Bi-LSTM, we used 100-dimensional pre-trained GloVe embeddings, which were trained on a dataset of one billion tokens (words) with a vocabulary of four hundred thousand words [53]. We used the pre-trained embeddings of the BERT model [21].
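A Bi-LSTM over frozen GloVe vectors can be set up as in this Keras sketch; the GloVe file path and the tiny vocabulary are assumptions for illustration, while the layer sizes (100-d embeddings, 300 hidden states, five sigmoid outputs with binary cross-entropy) follow the text.

```python
# Sketch: Bi-LSTM over pre-trained GloVe vectors with Keras.
import numpy as np
from tensorflow import keras

def load_glove(path, word_index, dim=100):
    # Build one embedding-matrix row per vocabulary word.
    matrix = np.zeros((len(word_index) + 1, dim))
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            word, *vec = line.split()
            if word in word_index:
                matrix[word_index[word]] = np.asarray(vec, dtype=float)
    return matrix

word_index = {"returns": 1, "the": 2, "value": 3}           # placeholder vocabulary
embedding = load_glove("glove.6B.100d.txt", word_index)     # assumed file path

model = keras.Sequential([
    keras.layers.Embedding(
        input_dim=embedding.shape[0], output_dim=100,
        embeddings_initializer=keras.initializers.Constant(embedding),
        trainable=False),
    keras.layers.Bidirectional(keras.layers.LSTM(300)),     # 300 hidden states
    keras.layers.Dense(5, activation="sigmoid"),            # one output per smell
])
model.compile(optimizer=keras.optimizers.Adam(1e-3), loss="binary_crossentropy")
```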
3) Results:
Table V shows the performance of the deep learning models. BERT outperformed Bi-LSTM, the shallow, and the rule-based classifiers in detecting each smell (F1-score). The increase in F1-score of BERT compared to the best performing shallow learning model per smell is as follows: bloated (5.7% over OVR-SVM Rule), lazy (6.6% over OVR-SVM BoW), excess structural information (52% over ML-kNN BoW), tangled (36.1% over OVR-SVM Rule), and fragmented (36.4% over OVR-SVM BoW). SVM and kNN-based models produced more false-negative results because the number of positive instances for an individual smell type is lower than the number of negative instances for that type. As a result, SVM and kNN-based models showed low recall for some types (excess structural information, tangled, and fragmented) and consequently resulted in low F1-scores. On the other hand, Bi-LSTM and BERT achieved better performance because they focused on capturing generalized attributes for each smell type. We manually analyzed the misclassified examples of excess structural information and fragmented documentation, where BERT achieved below 0.8 accuracy. For excess structural information smell detection, BERT falsely considered some Java objects and methods as structural information; therefore, the model produced some false positive cases. In some examples, BERT could not identify whether the information of a documentation was referring to other documentation. As a result, the model misclassified the fragmented documentation.

TABLE V
CLASS-WISE PERFORMANCE OF DEEP LEARNING MODELS

Feature | Model | Bloated A/P/R/F1 | Lazy A/P/R/F1 | Excess Struct A/P/R/F1 | Tangled A/P/R/F1 | Fragmented A/P/R/F1
Word Embed | Bi-LSTM | .92 .92 .92 .91 | .89 .90 .89 .90 | .76 .72 .76 .73 | .78 .74 .78 .74 | .67 .64 .67 .63
Word Embed | BERT | .93 .93 .93 .93 | .97 .97 .97 .97 | .76 .75 .76 .76 | .83 .83 .83 .83 | .75 .75 .75 .75

V. DISCUSSIONS
Implications of Findings.
Thanks to the significant research efforts to understand API documentation problems using empirical and user studies, we now know with empirical evidence that the quality of API official documentation is a concern both for open source and industrial APIs [7], [26], [27], [62], [81]. The five API documentation smells we studied in this paper are frequently referred to as documentation presentation/design problems in the literature [7], [81]. Our comprehensive benchmark of 1,000 API official documentation units has 778 units each exhibiting one or more of the smells. The validation of the smells by professional software developers shows that this benchmark can be used to foster a new area of research in software engineering on the automatic detection of API documentation quality, which is now an absolute must due to the growing importance of APIs and software in our daily lives [54], [62]. The superior performance of our machine learning classifiers, in particular the deep learning model BERT, offers promise that we can now use such tools to automatically monitor and warn about API documentation quality in real-time. Software companies and the open source community can leverage our developed model to analyze the quality of their API documentation. Software developers could save time by focusing on good quality API documentation instead of the bad ones as detected by our model. Based on such real-time feedback, tools can be developed to improve the documentation quality by fixing the smells. Indeed, when we asked our survey participants (Section III) whether the five smells need to be fixed, more than 90% responded with a 'Yes', 9.5% with a 'Maybe', and 0% with a 'No' (see Fig. 17).
Fig. 17. Survey responses on whether the five documentation smells should be fixed to improve API documentation quality.
Threats to Validity.
Internal validity threats relate to authors' bias while conducting the analysis. We mitigated the bias in our benchmark creation process by taking agreement from 17 coders and the co-authors and by consulting the API documentation literature. The machine learning models are trained, tested, and reported using standard practices. There was no common data between the training and test sets.
Construct validity threats relate to the difficulty in finding data to create our catalog of smells. Our benchmark creation process was exhaustive, as we processed more than 29K unit examples from official documentation.
External validity threats relate to the generalizability of our findings. We mitigated this threat by corroborating the five smells in our study with findings from state-of-the-art research in API documentation presentation and design problems. Our analysis focused on the validation and detection of five API documentation smells. Similar to the code smell literature, additional documentation smells can be added to our catalog as we continue to research this area.

VI. RELATED WORK
Related work is divided into studies on understanding (1) documentation problems and (2) how developers learn APIs using documentation, and on developing techniques (3) to detect errors in documentation and (4) to create documentation.
Studies.
Research shows that traditional Javadoc-type approaches to API official documentation are less useful than example-based documentation (e.g., the minimal manual [13]) [68]. Both code examples and textual descriptions are required for better quality API documentation [19], [26], [49]. Depending on the type of API documentation, the readability and understandability of the documentation can vary [77]. Broadly, problems in API official documentation can be about 'what' contents are documented and 'how' the contents are presented [6], [7], [61], [62], [81]. Literature on API documentation quality discusses four desired attributes of API documentation: completeness, consistency, usability, and accessibility [77], [91]. Several studies show that external informal resources can be consulted to improve API official documentation [20], [34], [36], [50], [74], [82], [85]. The five documentation smells studied in this paper are taken from five commonly discussed API documentation design and presentation issues in the literature [6], [81]. In contrast to the above papers that aim to understand API documentation problems, we focus on the development of techniques to automatically detect documentation smells.
Techniques.
Tools and techniques have been proposed to automatically add code examples and insights from external resources (e.g., online forums) into API official documentation [5], [72], [78]. Topic modeling is used to develop code books and to detect deficient documentation [12], [69], [70]. API official documentation and online forum data are analyzed together to recommend fixes for API misuse scenarios [60]. The documentation of an API method can become obsolete/inconsistent due to evolution in the source code [16], [84]. Several techniques have been proposed to automatically detect code comment inconsistency [57], [75], [92]. A large body of research is devoted to automatically producing natural language summary descriptions of source code methods [44], [47], [65], [71]. Unlike previous research, we focus on the detection of five API documentation smells that do not make a documentation inconsistent/incorrect, but nevertheless make the learning of the documentation difficult due to the underlying design/presentation issues. We advance state-of-the-art research on API documentation quality analysis by offering a benchmark of real-world examples of five documentation smells and a suite of techniques to automatically detect the smells.

VII. CONCLUSIONS
The learning of an API is challenging when the official documentation resources are of low quality. We identify five API documentation smells by consulting the API documentation literature on API documentation design and presentation issues. We present a benchmark of 1,000 API documentation units with five smells in API official documentation. Feedback from 21 industrial software developers shows that the smells can negatively impact the productivity of the developers during API documentation usage. We develop a suite of machine learning classifiers to automatically detect the smells. The best performing classifier, BERT, a deep learning model, achieves F1-scores of 0.75 - 0.97. The techniques can help automatically monitor and warn about API documentation quality.
REFERENCES

[1] Javadoc SE 7. https://docs.oracle.com/javase/7/docs/api/, 2020.
[2] M. Abidi, M. Grichi, F. Khomh, and Y. G. Guéhéneuc. Anti-patterns for multi-language systems. Article No. 42, 2019.
[3] M. Abidi, M. Grichi, F. Khomh, and Y. G. Guéhéneuc. Code smells for multi-language systems. Article No. 12, 2019.
[4] A. Adhikari, A. Ram, R. Tang, and J. Lin. DocBERT: BERT for document classification. arXiv preprint arXiv:1904.08398, 2019.
[5] E. Aghajani, G. Bavota, M. Linares-Vásquez, and M. Lanza. Automated documentation of Android apps. IEEE Transactions on Software Engineering, 2019.
[6] E. Aghajani, C. Nagy, M. Linares-Vásquez, L. Moreno, G. Bavota, M. Lanza, and D. C. Shepherd. Software documentation: The practitioners' perspective. In Proceedings of the 42nd International Conference on Software Engineering, 2020.
[7] E. Aghajani, C. Nagy, O. L. Vega-Márquez, M. Linares-Vásquez, L. Moreno, G. Bavota, and M. Lanza. Software documentation issues unveiled. In Proceedings of the 41st International Conference on Software Engineering, pages 1199–1210, 2019.
[8] W. Alkhatib, C. Rensing, and J. Silberbauer. Multi-label text classification using semantic features and dimensionality reduction with autoencoders. In International Conference on Language, Data and Knowledge, pages 380–394. Springer, 2017.
[9] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[10] O. M. Bullock, D. Colón Amill, H. C. Shulman, and G. N. Dixon. Jargon as a barrier to effective science communication: Evidence from metacognition. Public Understanding of Science, 28(7):845–853, 2019.
[11] I. Cai. Framework Documentation: How to Document Object-Oriented Frameworks. An Empirical Study. PhD thesis, University of Illinois at Urbana-Champaign, 2000.
[12] J. C. Campbell, C. Zhang, Z. Xu, A. Hindle, and J. Miller. Deficient documentation detection: A methodology to locate deficient project documentation using topic analysis. In Proceedings of the 10th International Working Conference on Mining Software Repositories, pages 57–60, 2013.
[13] J. M. Carroll, P. L. Smith-Kerker, J. R. Ford, and S. A. Mazur-Rimetz. The minimal manual. Journal of Human-Computer Interaction, 3(2):123–153, 1987.
[14] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
[15] H. Cramér. Mathematical Methods of Statistics, volume 43. Princeton University Press, 1999.
[16] B. Dagenais. Analysis and Recommendations for Developer Learning Resources. PhD thesis, McGill University, 2012.
[17] B. Dagenais and M. P. Robillard. Using traceability links to recommend adaptive changes for documentation evolution. IEEE Transactions on Software Engineering, 40(11):1126–1146, 2014.
[18] A. C. de Carvalho and A. A. Freitas. A tutorial on multi-label classification techniques. In Foundations of Computational Intelligence Volume 5, pages 177–195. Springer, 2009.
[19] S. C. B. de Souza, N. Anquetil, and K. M. de Oliveira. A study of the documentation essential to software maintenance. Pages 68–75, 2005.
[20] F. Delfim, K. Paixão, D. Cassou, and M. Maia. Redocumenting APIs with crowd knowledge: a coverage analysis based on question types. Journal of the Brazilian Computer Society, 29(1), 2016.
[21] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[22] S. Dumais et al. Using SVMs for text categorization. IEEE Intelligent Systems, 13(4):21–23, 1998.
[23] A. Elisseeff and J. Weston. A kernel method for multi-labelled classification. In Advances in Neural Information Processing Systems, pages 681–687, 2002.
[24] A. Fisher, C. Rudin, and F. Dominici. All models are wrong, but many are useful: Learning a variable's importance by studying an entire class of prediction models simultaneously. Journal of Machine Learning Research, 20(177):1–81, 2019.
[25] R. Flesch and A. J. Gould. The Art of Readable Writing, volume 8. Harper, New York, 1949.
[26] A. Forward and T. C. Lethbridge. The relevance of software documentation, tools and technologies: A survey. In Proceedings of the ACM Symposium on Document Engineering, pages 26–33, 2002.
[27] G. Garousi, V. Garousi-Yusifoğlu, G. Ruhe, J. Zhi, M. Moussavi, and B. Smith. Usage and usefulness of technical software documentation: An industrial case study. Information and Software Technology, 57:664–682, 2015.
[28] T. F. Gharib, M. B. Habib, and Z. T. Fayed. Arabic text classification using support vector machines. International Journal of Computers and Their Applications, 16(4):192–199, 2009.
[29] S. González-Carvajal and E. C. Garrido-Merchán. Comparing BERT against traditional machine learning text classification. arXiv preprint arXiv:2005.13012, 2020.
[30] A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5-6):602–610, 2005.
[31] Q. Gu, Z. Li, and J. Han. Correlated multi-label feature selection. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pages 1087–1096, 2011.
[32] Z. S. Harris. Distributional structure. Word, 10(2-3):146–162, 1954.
[33] C.-W. Hsu, C.-C. Chang, C.-J. Lin, et al. A practical guide to support vector classification, 2003.
[34] H. Jiau and F.-P. Yang. Facing up to the inequality of crowdsourced API documentation. ACM SIGSOFT Software Engineering Notes, 37(1):1–9, 2012.
[35] T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In European Conference on Machine Learning, pages 137–142. Springer, 1998.
[36] D. Kavaler, D. Posnett, C. Gibler, H. Chen, P. Devanbu, and V. Filkov. Using and asking: APIs used in the Android market and asked about in Stack Overflow. In Proceedings of the International Conference on Social Informatics, pages 405–418, 2013.
[37] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[38] V. I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady, volume 10, pages 707–710, 1966.
[39] X. Li, L. Bing, W. Zhang, and W. Lam. Exploiting BERT for end-to-end aspect-based sentiment analysis. arXiv preprint arXiv:1910.00883, 2019.
[40] Y. Liu. Fine-tune BERT for extractive summarization. arXiv preprint arXiv:1903.10318, 2019.
[41] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[42] H. van der Meij. A critical assessment of the minimalist approach to documentation. In Proceedings of the 10th ACM SIGDOC International Conference on Systems Documentation, pages 7–17, 1992.
[43] C. D. Manning, P. Raghavan, and H. Schütze. An Introduction to Information Retrieval. Cambridge University Press, 2009.
[44] P. W. McBurney and C. McMillan. Automatic documentation generation via source code summarization of method context. In Proceedings of the 22nd International Conference on Program Comprehension, pages 279–290, 2014.
[45] M. L. McHugh. Interrater reliability: the kappa statistic. Biochemia Medica, 22(3):276–282, 2012.
[46] M. McTear, Z. Callejas, and D. Griol. Spoken language understanding. In The Conversational Interface, pages 161–185. Springer International Publishing, 2016.
[47] L. Moreno, J. Aponte, G. Sridhara, A. Marcus, L. Pollock, and K. Vijay-Shanker. Automatic generation of natural language summaries for Java classes. In Proceedings of the 21st IEEE International Conference on Program Comprehension, pages 23–32, 2013.
[48] M. Munikar, S. Shakya, and A. Shrestha. Fine-grained sentiment classification using BERT. Volume 1, pages 1–5. IEEE, 2019.
[49] J. Nykaza, R. Messinger, F. Boehme, C. L. Norman, M. Mace, and M. Gordon. What programmers really want: Results of a needs assessment for SDK documentation. In Proceedings of the 20th Annual International Conference on Computer Documentation, pages 133–141, 2002.
[50] C. Parnin and C. Treude. Measuring API documentation on the web. In Proceedings of the 2nd International Workshop on Web 2.0 for Software Engineering, pages 25–30, 2011.
[51] V. L. Parsons. Stratified sampling. Wiley StatsRef: Statistics Reference Online, pages 1–11, 2014.
[52] Y. Peng, S. Yan, and Z. Lu. Transfer learning in biomedical natural language processing: An evaluation of BERT and ELMo on ten benchmarking datasets. arXiv preprint arXiv:1906.05474, 2019.
[53] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
[54] L. Ponzanelli, G. Bavota, M. Di Penta, R. Oliveto, and M. Lanza. Prompter: Turning the IDE into a self-confident programming assistant. Empirical Software Engineering, 21(5):2190–2231, 2016.
[55] L. Prechelt. Automatic early stopping using cross validation: quantifying the criteria. Neural Networks, 11(4):761–767, 1998.
[56] L. Prechelt. Early stopping - but when? In Neural Networks: Tricks of the Trade, pages 55–69. Springer, 1998.
[57] F. Rabbi and M. S. Siddik. Detecting code comment inconsistency using siamese recurrent network. In Proceedings of the 28th International Conference on Program Comprehension, pages 371–375, 2020.
[58] J. Read, B. Pfahringer, G. Holmes, and E. Frank. Classifier chains for multi-label classification. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 254–269. Springer, 2009.
[59] P. Refaeilzadeh, L. Tang, and H. Liu. Cross-validation. Encyclopedia of Database Systems, 5:532–538, 2009.
[60] X. Ren, J. Sun, Z. Xing, X. Xia, and J. Sun. Demystify official API usage directives with crowdsourced API misuse scenarios, erroneous code examples and patches. In Proceedings of the 42nd International Conference on Software Engineering, 2020.
[61] M. P. Robillard. What makes APIs hard to learn? Answers from developers. IEEE Software, 26(6):26–34, 2009.
[62] M. P. Robillard and R. DeLine. A field study of API learning obstacles. Empirical Software Engineering, 16(6):703–732, 2011.
[63] L. Rosasco, E. D. Vito, A. Caponnetto, M. Piana, and A. Verri. Are loss functions all the same? Neural Computation, 16(5):1063–1076, 2004.
[64] M. B. Rosson, J. M. Carroll, and R. K. Bellamy. Smalltalk scaffolding: a case study of minimalist instruction. In Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems, pages 423–430, 1990.
[65] S. Haiduc, J. Aponte, and A. Marcus. Supporting program comprehension with source code summarization. In Proceedings of the 32nd International Conference on Software Engineering, pages 223–226, 2010.
[66] F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys (CSUR), 34(1):1–47, 2002.
[67] K. Sechidis, G. Tsoumakas, and I. Vlahavas. On the stratification of multi-label data. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 145–158. Springer, 2011.
[68] F. Shull, F. Lanubile, and V. R. Basili. Investigating reading techniques for object-oriented framework learning. IEEE Transactions on Software Engineering, 26(11):1101–1118, 2000.
[69] L. Souza, E. Campos, and M. Maia. On the extraction of cookbooks for APIs from the crowd knowledge. In Proceedings of the 28th Brazilian Symposium on Software Engineering, pages 21–30, 2014.
[70] L. B. Souza, E. C. Campos, F. Madeiral, K. Paixão, A. M. Rocha, and M. de Almeida Maia. Bootstrapping cookbooks for APIs from crowd knowledge on Stack Overflow. Information and Software Technology, 111:37–49, 2019.
[71] G. Sridhara, E. Hill, D. Muppaneni, L. Pollock, and K. Vijay-Shanker. Towards automatically generating summary comments for Java methods. In Proceedings of the IEEE/ACM International Conference on Automated Software Engineering, pages 43–52, 2010.
[72] S. Subramanian, L. Inozemtseva, and R. Holmes. Live API documentation. In Proceedings of the 36th International Conference on Software Engineering, pages 643–652, 2014.
[73] C. Sun, X. Qiu, Y. Xu, and X. Huang. How to fine-tune BERT for text classification? In China National Conference on Chinese Computational Linguistics, pages 194–206. Springer, 2019.
[74] J. Sunshine, J. D. Herbsleb, and J. Aldrich. Searching the state space: A qualitative study of API protocol usability. In Proceedings of the International Conference on Program Comprehension, pages 82–93, 2015.
[75] S. H. Tan, D. Marinov, L. Tan, and G. T. Leavens. @tComment: Testing Javadoc comments to detect comment-code inconsistencies. In International Conference on Software Testing, Verification, and Validation, pages 260–269, 2012.
[76] I. Tenney, D. Das, and E. Pavlick. BERT rediscovers the classical NLP pipeline. arXiv preprint arXiv:1905.05950, 2019.
[77] C. Treude, J. Middleton, and T. Atapattu. Beyond accuracy: Assessing software documentation quality. In ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering - Visions and Reflections Track, 2020.
[78] C. Treude and M. P. Robillard. Augmenting API documentation with insights from Stack Overflow. In Proceedings of the IEEE 38th International Conference on Software Engineering, pages 392–402, 2016.
[79] G. Tsoumakas and I. Katakis. Multi-label classification: An overview. International Journal of Data Warehousing and Mining (IJDWM), 3(3):1–13, 2007.
[80] G. Tsoumakas and M.-L. Zhang. Learning from multi-label data. 2009.
[81] G. Uddin and M. P. Robillard. How API documentation fails. IEEE Software, 32(4):76–83, 2015.
[82] W. Wang and M. W. Godfrey. Detecting API usage obstacles: A study of iOS and Android developer questions. In Proceedings of the 10th Working Conference on Mining Software Repositories, pages 61–64, 2013.
[83] C.-W. Hsu, C.-C. Chang, and C.-J. Lin. A practical guide to support vector classification.
[84] F. Wen, C. Nagy, G. Bavota, and M. Lanza. A large-scale empirical study on code-comment inconsistencies. In Proceedings of the 27th International Conference on Program Comprehension, pages 53–64, 2019.
[85] D. Yang, A. Hussain, and C. V. Lopes. From query to usable code: an analysis of Stack Overflow code snippets. In Proceedings of the 13th International Conference on Mining Software Repositories, pages 391–402, 2016.
[86] Y. Yang and X. Liu. A re-examination of text categorization methods. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 42–49, 1999.
[87] B. Yekkehkhany, A. Safari, S. Homayouni, and M. Hasanlou. A comparison study of different kernel functions for SVM-based classification of multi-temporal polarimetry SAR data. The International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences, 40(2):281, 2014.
[88] M.-L. Zhang and Z.-H. Zhou. ML-kNN: A lazy learning approach to multi-label learning. Pattern Recognition, 40(7):2038–2048, 2007.
[89] M.-L. Zhang and Z.-H. Zhou. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 26(8):1819–1837, 2013.
[90] W. Zhang, T. Yoshida, and X. Tang. Text classification based on multi-word with support vector machine. Knowledge-Based Systems, 21(8):879–886, 2008.
[91] J. Zhi, V. Garousi-Yusifoğlu, B. Sun, G. Garousi, S. Shahnewaz, and G. Ruhe. Cost, benefits and quality of software development documentation: A systematic mapping. Journal of Systems and Software, 99:175–198, 2015.
[92] Y. Zhou, R. Gu, T. Chen, Z. Huang, S. Panichella, and H. Gall. Analyzing APIs documentation and code to detect directive defects. In Proceedings of the 39th International Conference on Software Engineering, 2017.