What's in a GitHub Repository? - A Software Documentation Perspective

Abstract

Developers use and contribute to repositories on GitHub. Documentation present in the repositories serves as an important source of information, helping developers to understand, maintain and contribute to the project. Currently, documentation in a repository is spread across various files, with most of it present in ReadMe files. However, other software artifacts in the repository, such as issue reports and pull requests, could also contribute to documentation without being explicitly designated as documentation. Hence, in this paper, we propose a taxonomy of documentation sources, derived by analyzing different software artifacts, conducting developer interviews and applying a card-sorting approach. We inspected multiple artifacts of 950 public GitHub repositories, written in four different programming languages, C++, C#, Python and Java, and analyzed the type and amount of documentation that could be extracted from these artifacts. We observe that about 25.93% of the information extracted from all sources proposed in the taxonomy contains error-related documentation, and that pull requests contribute around 18.21% of the extracted information.



Akhila Sri Manasa Venigalla, Sridhar Chimalakonda

Research in Intelligent Software & Human Analytics (RISHA) Lab
Department of Computer Science and Engineering
Indian Institute of Technology Tirupati

Tirupati, India
{cs19d504, ch}@iittp.ac.in

Index Terms: Software Artifacts, GitHub, Documentation, Issue Reports, Pull Requests, Commit Messages

I. INTRODUCTION

GitHub is one of the leading platforms for open source software projects [1]. GitHub enables developers to collaborate and contribute to projects through actions such as updating projects using commits, viewing changes made to projects through commit history [2], logging issues or defects through issue reports, and contributing to other projects through pull requests [3], [4]. Open source projects originating from well-known organizations, such as vscode (https://github.com/microsoft/vscode) from Microsoft, react-native (https://github.com/facebook/react-native) from Facebook and tensorflow (https://github.com/tensorflow/tensorflow) from Google, have more than 1K, 2K and 2.8K contributors respectively (as of January 2021). Developers also contribute to and reuse several other publicly available repositories on GitHub [5]. Documentation in the repositories helps developers understand a project and consequently helps them decide which projects they wish to contribute to [6]. Repositories contain Readme files that provide information on the purpose, requirements, usage instructions and various other aspects of the repositories [7], [8]. Several other files of the repository, such as License files and UML files, also present different types of repository information such as license permissions and design decisions [9], [10]. Though many developers are interested in contributing to GitHub repositories, they face multiple hurdles during this process, resulting in reduced motivation towards contributing to repositories [11], [12]. The existing insufficient and scattered documentation on these repositories makes it difficult for developers to understand the repositories, consequently reducing the advantages of the huge number of contributions [13]. Developers tend to visit other artifacts in the repositories to better understand the repository, which is effort-intensive [14], [15]. We believe that contribution efforts from a wide range of developers could be well leveraged if documentation is improved and consolidated.

Researchers have explored multiple dimensions of understanding, usage and ways to improve software documentation, as it plays a significant role in performing various tasks. These tasks include software development, testing, integration, maintenance and so on [16], [17], but the documentation is insufficient in the majority of projects [13], [18]. For example, documentation helps in improving software development and maintenance by providing necessary information to users in different roles such as system integrators, quality analysts, developers and so on [19]. Good documentation helps in re-engineering existing software during maintenance and migration [18], [20]. It has also been observed that developers regard documentation as an important aspect, even in projects with faster release cycles such as agile projects [21].

Considering the wide usage of GitHub and the activities performed by developers on GitHub, documentation of software repositories that are hosted on GitHub also plays a major role in various stages of the project such as development, deployment, maintenance and so on.
It has been observed that more popular projects tend to have better documentation [22]. Developers of popular projects tend to improve documentation through regular updates, to provide better insights to users [23]. Also, documentation is one of the factors considered by developers before contributing to repositories on GitHub [6]. Currently, documentation is spread across multiple files such as source code, design diagrams, readme files, license files and so on [7], [9], [24]. We hypothesize that other artifacts could also provide valuable information about a repository that could be considered as documentation, without documentation being explicitly present. Researchers have also observed that developers spend considerable effort on multiple artifacts, other than source code, in end-user oriented projects, supporting our idea of considering documentation from multiple sources [14], [15]. Currently, sources of documentation are scattered within a repository, and the documentation present in these sources is also unclear. Unifying the information from multiple sources could help in enhancing documentation and eventually reduce effort for developers, thus motivating them to contribute to GitHub repositories.

Though there are studies that identify different types of documentation present in software repositories [7], [9], [25], we are not aware of any work that integrates documentation from multiple artifacts, which motivates our work. Hence, in this paper, we aim to identify and gather different artifacts of software repositories that could contain information relevant to documentation. The main contributions of this paper are:

• A taxonomy of documentation sources in GitHub repositories, based on a card-sorting approach and interviews with 20 developers.
• An empirical analysis of 950 GitHub repositories, of four programming languages, C++, C#, Python and Java.
• Results of the empirical study, along with the percentage of available information that could contribute to documentation, in each of the software artifacts that are identified as potential sources of documentation. The results of the study and the dataset used for the study can be accessed at https://osf.io/dfx9r/?view_only=0954174b44054893a3dfbe6cc7dd5db5

Documentation Type - We define the type of documentation based on the content present in the documentation. For example, documentation that refers to reporting or fixing an error is considered as Error-related documentation type.

Documentation Source - We define a documentation source as a software artifact of the project, capable of providing potentially useful information related to software documentation.

II. RESEARCH METHODOLOGY

In this paper, we aim to analyze contents in GitHub repositories from a software documentation perspective, specifically focusing on various software artifacts that could serve as sources of documentation. We followed an approach that comprises the following six phases.

• Research Questions Definition - Four research questions aimed at identifying possible sources of software documentation have been defined.
• Card Sorting Approach and Developer Interviews - For manual analysis, 60 of the 950 GitHub repositories, which include 15 top trending repositories from each of C++, C#, Java and Python, were downloaded.
• Data Extraction - 950 public GitHub repositories, which include 233, 241, 253 and 223 repositories from C++, C#, Java and Python respectively, were considered for data extraction.
• Topic Modelling and Manual Inspection - LDA topic modelling techniques were applied to 20% of the scraped data to identify topics in the data. The topics identified were manually walked through to label the categories.
• Automated Analysis - Models to analyze extracted data against the research questions were developed.
• Result Comprehension - Research questions are answered as a result of the automated analysis.

Results & Dataset - https://osf.io/dfx9r/?view_only=0954174b44054893a3dfbe6cc7dd5db5

A. Research Questions

The main goal of this study is to propose a taxonomy of documentation sources in GitHub, by identifying different potential sources of software documentation in a GitHub repository, including the sources in which documentation is not explicitly specified. We conduct this study by answering the following research questions:

• RQ1: What are the types of documentation present in GitHub repositories?

To answer this question, we employed a card-sorting approach and interviews with 20 developers to understand the different documentation types present in GitHub repositories, as perceived by GitHub users.

Summary: We observe that the majority of GitHub repositories comprise six documentation types, of which at least three are not explicitly mentioned as documentation in the repositories.

• RQ2: What are the sources of documentation in GitHub repositories?

We answer this question by identifying various sources that are perceived to contribute to different documentation types, through interviews with 20 developers and a card-sorting approach with 3 participants.

Summary: We observe that the majority of developers use six artifacts of GitHub repositories to understand a repository and further contribute to it. These include issues and pull requests, which do not explicitly convey documentation-related information.

• RQ3: What is the distribution of different documentation types in each of the identified sources?

To answer this question, we categorize the information extracted from each source of 950 GitHub repositories into one of the documentation types identified, and calculate the percentage of each type in the sources.

Summary: Of the six documentation types identified, error-related documentation is present to a larger extent in the majority of the documentation sources.

• RQ4: What is the contribution of identified sources to different documentation types?

To answer this question, we calculate the percentage contribution of a source as the ratio of the amount of information of a given documentation type present in that source to the total amount of information of that documentation type present across all sources.

Summary: Following a well-established notion, the majority of the documentation was observed to be present in source-code comments and textual documents. However, commit logs and issues were also observed to contribute to documentation, with a minor variation in the percentage in comparison to the textual documentation.

Fig. 1. Proposed Taxonomy of Documentation Sources
Fig. 2. Proposed Taxonomy of Documentation Types
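The ratio described for RQ4 can be illustrated with a minimal Python sketch. The function name `source_contribution`, the nested-dictionary layout and the numbers in `example_amounts` are assumptions made purely for illustration; they are not taken from the study's scripts or data.

```python
def source_contribution(amounts, doc_type):
    """Percentage contribution of each source to one documentation type.

    `amounts` maps source name -> {documentation type -> amount of information}.
    The contribution of a source is its amount for `doc_type` divided by the
    total amount for `doc_type` across all sources, expressed as a percentage.
    """
    total = sum(per_type.get(doc_type, 0) for per_type in amounts.values())
    if total == 0:
        # No source carries this documentation type at all.
        return {source: 0.0 for source in amounts}
    return {
        source: 100.0 * per_type.get(doc_type, 0) / total
        for source, per_type in amounts.items()
    }


# Hypothetical amounts, invented for illustration only.
example_amounts = {
    "issues": {"error-related": 30, "project-related": 10},
    "pull requests": {"error-related": 50, "project-related": 30},
    "commits": {"error-related": 20, "project-related": 60},
}

print(source_contribution(example_amounts, "error-related"))
```

With the invented amounts above, pull requests would account for half of the error-related information, since 50 of the 100 error-related units come from that source.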

B. Card Sorting Approach and Developer Interviews

We observe that the card-sorting approach [26] has been widely used to arrive at various taxonomies and classifications in the software engineering literature, such as classifying requirements change [27] and research ideas [28], which motivated us to employ card-sorting as an initial step towards arriving at different types of documentation and their sources.

We downloaded 60 of the 950 GitHub repositories for manual analysis, such that they include 15 top trending repositories from each of the four programming languages being considered. Three individuals, comprising two researchers and one under-graduate student, were assigned the task of identifying different types and sources of documentation present in the repositories. Initially, all three individuals explored the readme files of the repositories, considering them to be a basic and easily available source of documentation. We employed an open card sorting approach, without pre-defining possible groups of documentation. In the first step, the three individuals came up with different numbers of groups (4, 6 and 10).

After further iterative discussions on the labels of each of the groups, we observed similarities in the information content among the 10 groups identified by the third individual, resulting in a set of 5 groups (T1 to T4 and T6 documentation types presented in Table I). We further compared and discussed the data content among the other groups identified by each individual and finally arrived at a common decision, with 6 possible types of documentation, accompanied by labels for these types, as mentioned in Table I. As a next step, all three individuals navigated through various artifacts within the repositories and attempted to identify documentation in those artifacts. On identification of documentation, each artifact was labelled with the corresponding documentation type (identified in the previous step) that it represents. While one individual narrowed down to five artifacts, the other two individuals pointed out six possible artifacts containing documentation. Each of the three individuals explained the reasons for selecting the respective number of artifacts, based on which we finally decided to consider six different artifacts (mentioned in Fig. 1) as part of the proposed taxonomy.

We later organised interviews with 20 professional developers to validate the types and sources of documentation obtained as a result of the card sorting approach. We approached 20 developers working in different organizations, with work experience ranging from 2 to 30 years. All 20 developers interviewed belonged to organizations with more than 1000 employees and worked in teams of at least 10 members. We selected developers with this profile to ensure that they are well-versed in collaborative project development and management. All 18 developers with less than 10 years of experience mentioned that they actively contribute to open-source projects. We asked them open-ended questions to understand their perception of the types of documentation and the sources of documentation present on GitHub. 12 of the developers, with work experience ranging between 3 and 10 years, explained that they find six documentation types (presented in Table I) in the projects they interact with on GitHub. 2 developers, with work experience of 29 and 30 years, pointed out that they find seven types of documentation in the projects, which included migration guidelines, apart from the six types identified in Table I.

Migration guidelines could, however, be considered as project-related documentation at a higher level of abstraction, as they specify details necessary for migrating the project. The rest of the developers, with less than three years of experience, expressed that they have observed only three types of documentation (T1, T2 and T5 in Table I). These varied insights from developers could also be due to the varied abstractions of projects they are exposed to in their respective organizations. We then discussed with them the documentation types identified previously through the card-sorting approach. 18 of the 20 developers later agreed to the identified six documentation types, while the other two developers suggested the inclusion of one more category, which corresponds to migration guidelines of the project. Considering the majority responses, we decided to proceed further with six documentation types. We further queried the developers to understand the different artifacts on GitHub that they use to understand a repository, which could be helpful if collated as a single documentation file. All 20 developers pointed out six sources of documentation in GitHub repositories, as shown in Fig. 1, and also mentioned that five of these six documentation sources (source code, textual documents, commits, pull requests and issue reports) are frequently referred to by them.

Based on the card sorting approach and developer interviews, we propose six potential documentation sources, including 3 software artifacts which do not have documentation explicitly specified, as shown in Fig. 1. Thus, we observe six types of documentation and six sources of documentation, as mentioned in Table I and Fig. 1 respectively.

TABLE I
OBSERVED DOCUMENTATION TYPES AND CORRESPONDING DEFINITIONS

S. no. | Types of Documentation | Definition
T1 | API-related documentation | Documentation capable of providing information about APIs used in the project
T2 | File-related documentation | Documentation capable of providing file-level information such as updates made to the files and dependencies of files
T3 | Project-related documentation | Documentation capable of providing project-level information such as installation instructions, branches in projects, enhancements to the projects and so on
T4 | License-related documentation | Documentation capable of providing information about licenses in the project
T5 | Error/Bug-related documentation | Documentation capable of providing information about errors or bugs encountered in multiple files of the project
T6 | Architecture-related documentation | Documentation capable of providing information about the architecture of the project, such as module interactions

RQ1:

We observed six different documentation types in GitHub repositories. API-related, file-related, project-related, license-related, error/bug-related and architecture-related documentation are the six documentation types identified.

Insights- Developers perceived file-related and project-related documentation types to be more useful in contributing to a project, among the six types.

RQ2:

We observed six different sources containing potential information related to documentation in GitHub repositories. Source code, textual documents, commits, pull requests, issue reports and design diagrams are the six artifacts observed to be potential sources of documentation in GitHub repositories.

Insights- Apart from the widely accepted documentation sources (source code comments, design diagrams and textual documents), our interviews with developers revealed that pull requests, commits and issues can also contribute to documentation.

C. Data Extraction

We identified the 300 top-starred GitHub repositories written in each of the four programming languages, C++, Python, C# and Java (https://githut.info/), as of 01 January 2021. We further filtered out forked repositories and repositories with zero pull requests, to exclude tutorial-based repositories. This resulted in 233, 241, 253 and 223 repositories corresponding to C++, C#, Java and Python respectively.

TABLE II
FILE CLASSIFICATION RULES

Category | File Format
Textual | ‘.txt’, ‘.md’, ‘readme’, ‘license’
Images | ‘.png’, ‘.jpg’, ‘.jpeg’
Design Diagrams | ‘.xmi’, ‘.uml’
Source Code | ‘.cpp’, ‘.cs’, ‘.py’, ‘.java’
Others | Any other extensions

Based on the findings from developer interviews and the card-sorting approach, we classified all files in the repositories into 5 categories: Textual Documentation, Images, Design Diagrams, Source Code and Others. The rules employed for this categorization are presented in Table II.

Files with the extensions ‘.txt’ and ‘.md’ are categorized as textual documentation. All files with the extensions ‘.xmi’ and ‘.uml’ are categorized as UML diagrams. Files with ‘.jpg’, ‘.jpeg’ and ‘.png’ extensions are primarily categorized as Images, which could be categorized as design diagrams through further analysis. Files with extensions of the programming languages considered, i.e., .cpp, .py, .java and .cs, are classified as source code files. Though the repositories were extracted based on the programming language in which they are written, a majority of the repositories also contain files written in other programming languages.

Considering the vast number of programming languages supported by GitHub, it is difficult to analyze files of other programming languages, as programming constructs differ from one language to another. Hence, files with formats not belonging to the above four classes are categorized as Others. Comments from files in the source code category have been extracted. This exploration of repository files has extracted information that could contribute to documentation from two documentation sources mentioned in the taxonomy, namely textual documents and source code files. Also, design diagrams for each of the repositories have been identified, and information from these diagrams could be extracted using image processing techniques. In view of the technical effort and difficulty involved in choosing and applying appropriate image processing techniques, we consider data extraction from design diagrams to be out of the scope of this study.

For each of the repositories, the other software artifacts that were identified as sources of documentation in the proposed taxonomy, i.e., pull requests, issues and commits, not older than three years from January 01, 2021, were extracted using the GitHub API. However, blank data was returned for 46 Python repositories, leaving us with pull request data for only 177 of the 223 Python repositories considered. This blank data could be due to the large number of pull requests in these repositories, which could not be extracted due to the rate limits of the GitHub API, or due to repositories whose pull requests are all older than three years from 01 January, 2021.

We have identified fields that are capable of containing documentation-related information through manual inspection of extracted artifacts in 50 of the 950 repositories, selected randomly. These fields are presented in Table III, and the data of these fields is stored as text documents. This accomplishes the task of extracting information useful for documentation from three sources of documentation mentioned in the proposed taxonomy, namely Pull Requests, Issues and Commits. The fields that contain textual data and contribute to documentation types are identified (as presented in Table III) and the corresponding text is stored for further analysis.

TABLE III
DOCUMENTATION SOURCES AND FIELDS CONSIDERED

Documentation Source | Fields Identified
Issues | title, body, comments
Pull Requests | title, body, comments
Commits | message, comments
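The file classification rules of Table II can be sketched in Python as below. This is only an illustrative reading of the table: the function name `classify_file` is hypothetical, and the handling of ‘readme’ and ‘license’ (matched against the lower-cased file name without its extension) is an assumption, since the text does not specify how these two entries are matched.

```python
import os

# Rule sets taken from Table II.
TEXTUAL_EXTENSIONS = {".txt", ".md"}
TEXTUAL_NAMES = {"readme", "license"}  # assumption: matched on the base name
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}
DESIGN_DIAGRAM_EXTENSIONS = {".xmi", ".uml"}
SOURCE_CODE_EXTENSIONS = {".cpp", ".cs", ".py", ".java"}


def classify_file(path):
    """Return the Table II category for a repository file path."""
    name = os.path.basename(path).lower()
    stem, ext = os.path.splitext(name)
    if ext in TEXTUAL_EXTENSIONS or stem in TEXTUAL_NAMES:
        return "Textual"
    if ext in IMAGE_EXTENSIONS:
        return "Images"
    if ext in DESIGN_DIAGRAM_EXTENSIONS:
        return "Design Diagrams"
    if ext in SOURCE_CODE_EXTENSIONS:
        return "Source Code"
    return "Others"


for example_path in ["docs/README.md", "LICENSE", "src/main.cpp", "lib/util.js"]:
    print(example_path, "->", classify_file(example_path))
```

Files in the Images category would still need the further analysis mentioned above before any of them could be treated as design diagrams; the sketch does not attempt that step.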

D. Topic Modelling and Manual Inspection

We observed that software artifacts that are identified as sources of documentation contain information that could support different documentation types. Hence, we analyze this information and categorize it into different documentation types. Topic modelling can identify keywords that belong to a specified number of topics in a given document. This feature fits well with our requirement of categorizing text into different documentation types. We applied Latent Dirichlet Allocation (LDA) topic modelling to categorize the information. Though we have consolidated different types of documentation, all sources of documentation need not contribute to all types of documentation. Hence, we first tried to determine the optimal number of topics into which information from each of the software artifacts could be classified. Towards this, LDA models with the number of topics ranging from 2 to 20 were generated for textual data from each of the software artifacts, for 50 randomly selected repositories in each programming language (LDA was applied on the data of each artifact individually, for each repository). Coherence scores for these models were calculated, and the number of topics of the model with the highest coherence score was identified as the optimal number of topics for the corresponding data source. It is to be noted that data of different repositories or different artifacts was not combined during this process, to ensure that the topic number is not influenced by the data of other repositories. We observed that the optimal number of topics for issue data varied between 4 and 5, with 5 occurring more frequently than 4. The optimal number of topics for commit data and pull request data was also observed to vary between 4 and 5, but with 4 occurring more frequently than 5. Hence, the optimal number of topics for commit data, issue data and pull request data is 4, 5 and 4 respectively.

LDA topic modelling was applied on the same data (the data used to identify the optimal number of topics) with the optimal number of topics, and the top 10 keywords for each of the topics were obtained.

A manual inspection of the keywords obtained for each artifact was performed repository-wise, for the 200 repositories (50 from each programming language) being considered. We manually walked through the keywords in each topic, compared them to the source code and other files of the repository to identify the object of reference, and labelled the topics accordingly. During the manual inspection of keywords, we did not find any keywords that explicitly correspond to architecture-related documentation. Considering that design diagrams contain more relevant architecture-related documentation, we assumed such documentation to be absent in the documentation sources being considered for this study. Based on the identified keywords, it could also be observed that though these sources might contain architecture-related documentation, its presence is almost negligible, thus resulting in a set of 5 documentation types being present in the considered documentation sources. Though the optimal number of topics for pull request data and commit data was observed to be 4 through the LDA approach, we observed the data to contain keywords from the 5th category as well during our manual analysis. This indicates that though the information present in pull requests and commit data can be classified into 4 categories, the set of these four categories differs across repositories. For example, some repositories might comprise API-related, File-related, Project-related and Error-related documentation, while some repositories might contain License-related, File-related, Project-related and Error-related documentation in the pull request and commit data.

While labelling, in some cases, more than one topic was observed to have the same label, which indicates that different repositories have varied distributions of information in view of documentation types. Keywords of similar topics identified for all the artifact data were compared, and the 10 most frequently repeated keywords for each documentation type were identified.
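The selection of the optimal number of topics described above can be sketched in Python as follows. The sketch assumes that coherence scores have already been computed for each candidate topic count by whatever topic-modelling library is in use; the function names and the scores in `example_scores` are hypothetical and only illustrate the two-step logic (highest coherence per repository, then the most frequent optimum across repositories).

```python
from collections import Counter


def best_topic_count(coherence_by_k):
    """Return the topic count with the highest coherence score.

    `coherence_by_k` maps a candidate number of topics to its coherence score.
    On ties, the first candidate in dictionary order is returned.
    """
    return max(coherence_by_k, key=coherence_by_k.get)


def most_frequent_optimum(per_repository_scores):
    """Return the optimal topic count that occurs most often across repositories."""
    optima = [best_topic_count(scores) for scores in per_repository_scores]
    return Counter(optima).most_common(1)[0][0]


# Hypothetical coherence scores for the issue data of three repositories.
# The study evaluates 2 to 20 topics; only two candidates are shown here.
example_scores = [
    {4: 0.41, 5: 0.47},
    {4: 0.52, 5: 0.50},
    {4: 0.39, 5: 0.44},
]

print(most_frequent_optimum(example_scores))
```

Keeping the per-repository optimum separate from the cross-repository vote mirrors the statement that data of different repositories was not combined while choosing the number of topics.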

E. Automated Analysis

A rule-based classification model is built by considering the 10 most frequent keywords of each topic and its corresponding label. The topics obtained as a result of topic modelling for each software artifact are labelled using the rule-based classifier. The rule-based classifier includes rules to label topics based on the similarity score of the topic with respect to the keyword sets of each of the 5 identified categories (error-related, file-related, project-related, license-related and API-related). Topics that have almost equal similarity scores (difference less than 0.05) for all categories are labelled into the others category. Thus, artifact data having an almost equal possibility of belonging to more than one category is classified into the others category. The percentage of each documentation type is calculated based on the topic frequency in the artifact. These two features, identifying documentation types and calculating percentages, are integrated into a result generator script written in the Python programming language. This result generator takes as input the list of information extracted from multiple software artifacts of all 950 repositories and automatically generates the percentage of different documentation types present in each of the software artifacts for all 950 repositories. Thus, the percentages generated reflect the frequency of related keywords of each documentation type in each of the artifacts.

III. RESULTS

The distribution of documentation types obtained as a result of automated analysis of each artifact's data is presented in Fig. 9. The contribution of sources to all documentation types is presented in Fig. 16. The results of artifact contributions to specific documentation types and the distribution of documents among specific sources across multiple programming languages are also presented in the form of plots.
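The automated analysis that produces these distributions (Section II-E) can be sketched in Python as below. Several details are assumptions rather than the study's actual implementation: the similarity measure is not named in the text, so Jaccard similarity is used as a stand-in; "almost equal for all categories" is interpreted as the gap between the highest and lowest score being below 0.05; and the keyword sets in `EXAMPLE_KEYWORDS` are invented, not the keyword sets derived in the study.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity between two keyword collections (assumed measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def label_topic(topic_keywords, category_keywords, threshold=0.05):
    """Label a topic with its most similar category, or 'others'.

    A topic whose scores differ by less than `threshold` across all
    categories is treated as ambiguous and labelled 'others'.
    """
    scores = {
        category: jaccard(topic_keywords, keywords)
        for category, keywords in category_keywords.items()
    }
    if max(scores.values()) - min(scores.values()) < threshold:
        return "others"
    return max(scores, key=scores.get)


def type_percentages(topic_labels):
    """Percentage of each documentation type among the labelled topics."""
    counts = Counter(topic_labels)
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


# Hypothetical keyword sets, invented for illustration only.
EXAMPLE_KEYWORDS = {
    "error-related": ["error", "bug", "fix", "crash", "fail"],
    "license-related": ["license", "copyright", "permission", "mit", "apache"],
    "project-related": ["install", "release", "branch", "build", "version"],
}

labels = [
    label_topic(["fix", "crash", "bug", "null", "pointer"], EXAMPLE_KEYWORDS),
    label_topic(["update", "readme"], EXAMPLE_KEYWORDS),
]
print(labels)
print(type_percentages(labels))
```

In this sketch the first topic shares three keywords with the error-related set and none with the others, so it is labelled error-related, while the second topic matches no set and falls into the others category.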

A. Distribution of Documentation Types in Documentation Sources

• Source Code Comments - The percentage distribution of all documentation types in source code comments is presented in Fig. 3. It has been observed that the majority of the information present in source code comments contributes to license-based documentation, in repositories of all programming languages considered. Also, we observed that information present in source code comments contributes the least to error-related documentation, in repositories of C [...] API-related documentation.

Fig. 3. Distribution of document types in source code comments in repositories of the four programming languages considered.
Fig. 4. Distribution of document types in textual documents in repositories of the four programming languages considered.

• Textual Documents - The plot shown in Fig. 4 indicates that most of the information in textual documents of the repositories contributes to license-related documentation, followed by file-related documentation in repositories of the C++ and Python programming languages.

Fig. 5. Distribution of document types in commits in repositories of the four programming languages considered.
Fig. 6. Distribution of document types in issues in repositories of the four programming languages considered.
Fig. 7. Distribution of document types in pull requests in repositories of the four programming languages considered.
Fig. 8. Average distribution of document types in all sources, across all 950 repositories.

• Commits - The plot displayed in Fig. 5 indicates that most of the information present in textual sentences of commit logs contributes to project-related documentation in the majority of the repositories, followed by error-related documentation. For repositories in the C++ programming language, the majority of the information in commit logs contributes to error-related documentation, followed by project-related documentation.

• Issues - The plot shown in Fig. 6 indicates that error-related documentation and project-related documentation are prominent in the textual sentence information of issues, with error-related documentation having the highest percentage distribution across repositories in all 4 programming languages.

• Pull Requests - Fig. 7 shows that most of the information in textual sentences of pull requests could potentially contribute to error-related documentation, with all repositories having an average of more than 30% of information that contributes to error-related documentation.

A consolidated distribution of documentation types across all identified sources of documentation is presented in Fig. 8, which indicates that a minimal amount of information in commits, issues and pull requests, and the majority of information in textual documents and source code comments, contributes to license-based documentation.

Fig. 9 indicates that error-related documentation forms the largest part of the information extracted. We further observed that about 12.75% of the total information did not contribute to any of the identified documentation types.

RQ3:

The majority of the total extracted documentation from multiple sources consists of error-related documentation (25.9%), followed by project-related documentation (23.6%). File-related documentation and license-related documentation account for 16.04% and 15.99% of the total extracted documentation respectively. API-related documentation makes up 5.63% of the total extracted documentation.

Insights-

Tagging information across documentation sources, based on the corresponding documentation types, could help developers better comprehend the project.
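The shares reported for RQ3 can be cross-checked with a few lines of arithmetic. The sketch below only uses the percentages stated in the text (including the 12.75% of information that matched no documentation type); the variable names are illustrative, and the small remainder to 100% is presumably due to rounding in the reported figures.

```python
# Shares of documentation types reported for RQ3
# (percent of all extracted information).
type_share = {
    "error-related": 25.9,
    "project-related": 23.6,
    "file-related": 16.04,
    "license-related": 15.99,
    "API-related": 5.63,
}
# Share reported as not contributing to any identified documentation type.
uncategorized = 12.75

categorized_total = sum(type_share.values())
overall_total = categorized_total + uncategorized
# Documentation types ordered from largest to smallest share.
ranked = sorted(type_share, key=type_share.get, reverse=True)

print(f"categorized: {categorized_total:.2f}%")
print(f"categorized + uncategorized: {overall_total:.2f}%")
print("ranking:", ranked)
```

The five types together cover roughly 87% of the extracted information, which is consistent with the reported uncategorized share.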

Fig. 9. Percentage of documentation types present in information extracted from all sources of documentation for all 950 repositories
Fig. 10. Contribution of documentation sources to file-related documentation in repositories of the four programming languages considered
Fig. 11. Contribution of documentation sources to error/bug-related documentation in repositories of the four programming languages considered
Fig. 12. Contribution of documentation sources to project-related documentation in repositories of the four programming languages considered

B. Contribution of Documentation Sources to Documentation Types

• File-related Documentation-

The plot in Fig. 10 represents the contribution of each documentation source to file-related documentation, which varied across repositories of the 4 programming languages. We observe that pull requests contribute to most of the file-related documentation in repositories of C# [...] textual documents contribute the most in repositories of the C++ programming language. We also observed that source-code comments contribute the least to file-related documentation in repositories of C++, C# [...]

Fig. 13. Contribution of documentation sources to API-related documentation in repositories of the four programming languages considered
Fig. 14. Contribution of documentation sources to license-related documentation in repositories of the four programming languages considered
Fig. 15. Average contribution of documentation sources to all documentation types, across all 950 repositories

• Error-related Documentation-

Fig. 11 shows that the maximum contribution to error-related documentation is from pull requests and issues. It can also be observed that textual documents and source-code comments contain the least amount of potential information with respect to error-related documentation.

• Project-related Documentation-

The plot in Fig. 12 indicates that most of the project-related documentation is contributed by information in commit logs, when compared to other artifacts being considered.

Fig. 16. Percentage of contribution of documentation sources to all documentation types for all 950 repositories.

• API-related Documentation-

From the plot in Fig. 13, we deduce that source-code comments in repositories contain the most information about API-related documentation, followed by textual documents in the repositories.

• License-related Documentation-

It could be observed from the plot shown in Fig. 14 that textual documents in the majority of the repositories in all four programming languages contribute to license-related documentation, followed by source code comments in repositories.

The consolidated contribution of documentation sources towards all documentation types is presented in Fig. 15, which indicates that most of the potential information for file-related documentation is obtained from data in textual documents.

Fig. 16 indicates that, after source code comments and textual documents, commit logs contribute the most towards documentation.

RQ4:

The majority of the total documentation is extracted from source code comments (23.04%), followed by textual documents (22.58%). Commit logs and pull requests contributed 18.5% and 18.21% of the total extracted documentation respectively. Issues in the repositories contributed 17.63% of the total extracted documentation.
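As a rough illustration of what these figures imply, the sketch below tallies the per-source shares. It assumes, following the abstract, that the 18.21% figure belongs to pull requests. The split into "file-based" and "process-based" sources is a hypothetical grouping introduced here for illustration and is not part of the proposed taxonomy.

```python
# Per-source shares of the total extracted documentation, as reported for RQ4.
# Assumption: 18.21% is attributed to pull requests, as stated in the abstract.
source_share = {
    "source code comments": 23.04,
    "textual documents": 22.58,
    "commit logs": 18.5,
    "pull requests": 18.21,
    "issues": 17.63,
}

# Hypothetical grouping (not from the paper): content stored in repository
# files versus artifacts produced by the development process.
in_repository = {"source code comments", "textual documents"}

file_based = sum(v for k, v in source_share.items() if k in in_repository)
process_based = sum(v for k, v in source_share.items() if k not in in_repository)

print(f"file-based sources:    {file_based:.2f}%")
print(f"process-based sources: {process_based:.2f}%")
print(f"total:                 {file_based + process_based:.2f}%")
```

Under these assumptions, commit logs, pull requests and issues together account for 54.34% of the extracted documentation, compared with 45.62% from source code comments and textual documents.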

Insights-

Structured formats for each of these sources could help in easy extraction and in the generation of unified documentation for a repository.

IV. DISCUSSION

In answering the proposed research questions, we realised that different documentation types could be derived from other software artifacts in GitHub repositories. Researchers could explore the direction of consolidating this information across multiple artifacts to arrive at documentation for the repository. Establishing traceability links among documentation types and documentation sources is an important future direction. Tools for generating documentation from multiple sources could be developed to ease the efforts of software practitioners. It might be difficult to generate meaningful documentation from the sources identified, which would require software practitioners to improve the textual information being logged through these artifacts. Also, there is a need for approaches and tools to migrate existing unstructured documentation into a structured form, for eliciting valuable information from multiple sources of software documentation. Furthermore, researchers could also explore specific formats for each of the identified documentation sources, to ensure extraction of information that could be easily converted to meaningful documentation.

V. THREATS TO VALIDITY

Internal Validity -

The categories in the proposed taxonomy are based on developer interviews and a card-sorting approach. Thus, there is a possibility of missing categories in the proposed taxonomy of types and sources of documentation that might not have been observed during our interviews with 20 developers or during the execution of the card-sorting approach. Moreover, as a first step towards considering documentation sources, we considered only pull requests, commits, issues, source code files and textual files, which could be obtained through the GitHub API or by downloading the repositories. Other sources such as Wiki, which are specific to GitHub projects and which require cloning separately from the repository, were not considered in this study. Such sources could be considered in future versions of this study to improve the documentation being extracted.

Except for the top 50 repositories in each programming language, topics of artifact data are labelled by comparing topic keywords with the most frequently occurring keywords of the respective labelled topics in the top 50 repositories in each language. This generalization of keywords might not consider prominent keywords that could occur in other repositories, and thus could be biased towards the top 20% of repositories in each programming language. In addition, as we only analyzed textual sentences of data from issues, pull requests and commits, we might have missed information from other data present in these artifacts such as date and status. Also, only comments in source code files with extensions .cpp, .py, .cs and .java were analyzed, considering the complexity involved in addressing files of all programming languages. Similarly, only the text of files in the formats specified in Table II has been analyzed. As a result, useful insights on documentation that might be present in files of other formats such as .html, .tex, .pdf, and in comments in files of other programming languages, may be missed.

Construct Validity -

All the results obtained are valid only for the dataset of 950 repositories considered. Performing this study on a different dataset might yield different results. However, the dataset contains repositories of varied programming languages, which indicates a representative sample of all repositories in the four programming languages. A similar distribution of documentation types and sources could therefore be identified in other repositories across GitHub, suggesting (but not proving) generalizability of the results.
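The keyword-based labelling described under Internal Validity (comparing a topic's keywords with the most frequent keywords of already labelled topics from the top 50 repositories) could be sketched as follows. This is a minimal illustration and not the implementation used in the study: the function names, the cut-off k, the "unlabelled" fallback and the toy keywords are all hypothetical.

```python
from collections import Counter


def top_keywords(labelled_topics, k=3):
    """Collect the k most frequent keywords per label from labelled topics."""
    counts = {}
    for label, keywords in labelled_topics:
        counts.setdefault(label, Counter()).update(keywords)
    return {label: {w for w, _ in c.most_common(k)} for label, c in counts.items()}


def assign_label(topic_keywords, reference, default="unlabelled"):
    """Pick the label whose reference keywords overlap most with the topic."""
    topic = set(topic_keywords)
    overlaps = {label: len(ref & topic) for label, ref in reference.items()}
    best = max(overlaps, key=overlaps.get)
    # Fall back to the default when no reference keyword matches at all.
    return best if overlaps[best] > 0 else default


# Toy, hypothetical labelled topics standing in for the top 50 repositories.
labelled = [
    ("error-related", ["bug", "fix", "crash"]),
    ("error-related", ["bug", "error", "fix"]),
    ("license-related", ["license", "copyright", "mit"]),
]
reference = top_keywords(labelled)

print(assign_label(["fix", "bug", "null"], reference))
print(assign_label(["docker", "build"], reference))
```

The second call illustrates the threat discussed above: a topic whose prominent keywords never occur in the reference repositories cannot be matched to any label.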

External Validity -

The results obtained are confined to the version of the repositories and their corresponding software artifacts as on 01 January 2021, as this empirical study is performed on locally downloaded repositories.

The accuracy of categorizing the extracted information into different documentation type classes is dependent on the efficiency of the LDA approach. If a different Natural Language Processing technique is employed, a different set of keywords might be identified for each class, consequently resulting in some sentences being classified into different classes.

TABLE IV
DOCUMENTATION SOURCES AND TYPES OBSERVED FROM LITERATURE

Documentation Type | Source of Documentation | Literature
API-related | example files, readme file | [29] [7]
Architecture-related | UML files, non-textual files | [9]
License-related | commit messages, issues, License files, readme | [30] [7]
Error/Bug-related | issue comments | [31]
Project installation-related | readme | [7]
Project Enhancement-related | readme | [7] [31]
Project Background-related | readme | [7]
Project Status-related | readme, commits, issues, pull requests | [7] [32]
Project Team-related | readme, commits, issues, pull requests | [7] [33] [32]
Project Advantage-related | readme | [7]
References-related | readme | [7]
Contribution Guidelines | readme | [7]

VI. RELATED WORK

Repositories on GitHub comprise various artifacts such as pull requests, issue reports, star-count, fork-count, number of watchers and so on, that provide different types of information about the repository. Stars, forks and watchers are numeric in nature, indicating information on popularity of repositories [6], whereas software artifacts such as issue reports, pull requests and commit logs contain information related to development and maintenance of the repositories [33]–[35]. Researchers have conducted several studies to understand the role of these software artifacts as an aid for developers in improving quality and better handling of the project. A consolidated list of documentation types observed from the literature, along with their sources, is presented in Table IV.

A. Using Individual Artifact Data

Issues, commits and pull requests data is used in the literature for multiple research studies. Kikas et al. have observed various features present in issue reports, as an aid in prediction of issue lifetimes [36]. They have observed that analysis of contextual information from commits of the repository such as last commit date, recent project activity, date of issue creation and other dynamic features such as number of actions performed on the issue, number of users who have worked on the issue, comment count and so on, could help in predicting the lifetime of the issue [36].

Liao et al. have observed that users who comment on issues and who create pull requests related to the issues contribute more towards project development. Zhang et al. have linked multiple related issues to support easy resolution and easy querying of issues by applying a combination of information retrieval and deep learning techniques [37]. They also developed a tool to calculate frequency, word and document similarity scores between the queried issue and pending issues, based on the mentioned techniques, to recommend related issues [37].

Michaud et al. have attempted to identify the branch of commits in the repository based on commit messages and types of merges to track the evolution history of a repository [38]. Tsay et al. have presented a study that analyzed various factors of pull requests that contribute towards their acceptance or rejection [2]. It has been observed that pull requests with a large number of changed files and a greater number of comments have a lower probability of being accepted, depending on other social factors [2].

B. Using Integrated Data from Multiple Software Artifacts

Zhou et al. have considered various features from issue reports and commits to detect security threats or vulnerabilities in the repository [39]. An optimal machine learning model has been built using commit messages from commits and multiple features of issue reports such as title, description, comments and so on, to detect unidentified security threats in the repository [39].

Considering the importance of information related to issue reports and commit history in assessing software quality and other factors of a repository, RCLinker has been proposed to link issue reports to their corresponding commits by comparing source code of two consecutive commits and summarizing the difference in source code [40].

Coelho et al. have considered features of multiple software artifacts such as issues, pull requests, commits and so on to identify the maintenance status of a project [32]. In an attempt to identify various issues in software documentation, Aghajani et al. have studied multiple sources that discuss software documentation, which include issue reports and pull requests on GitHub, apart from developer email lists and discussions on knowledge sharing platforms [13].

C. Documentation in Software Projects

A study has been performed by Borges et al. to identify various software artifacts that contribute to the star count of GitHub repositories [6]. It was observed that code quality and documentation largely influence star count. This study also re-emphasizes the importance of software documentation in GitHub repositories [6].

Considering this importance of documentation, several approaches to improve and generate different types of software documentation are being developed.

Quasoledo has been proposed to evaluate quality of documentation in Eclipse projects, based on quality metrics related to completeness and readability [41]. Fowkes et al. have proposed an approach using a probabilistic model of sequences to generate patterns of API, based on multiple usages in a project [29]. Source code examples have been linked to official API documentation based on method calls and references to enable better API usage and understanding [42]. A study has been performed by Hebig et al. to understand the number of projects in GitHub that use UML diagrams [9].

In a survey with 146 software practitioners, Aghajani et al. have presented 13 documentation types that were observed to be used by practitioners for accomplishing multiple tasks [18]. Of these 13 documentation types, around 6 correspond to end-user interactions with the project, such as video tutorials, installation guides, and so on. This set of documentation types also included Community Knowledge, which could correspond to artifacts such as pull requests, issues and commits [18].

Prana et al. have identified 8 categories of information present in ReadMe files through manual analysis of 50 ReadMe files, and annotated 150 ReadMe files based on these categories [7]. Information in artifacts of the repository such as commit messages and discussions on issue trackers has been analyzed to obtain insights about changes in licenses used in the software [10]. Vendome et al. have attempted to trace the reasons for change in licenses through investigation of commit history and discussions in issue trackers, consequently obtaining insights on when (from commit history) and why (from issue tracker discussions) the licenses have been changed [10].

The existing literature has emphasized the need for software documentation, analyzed the availability of documentation types in GitHub repositories and has also presented approaches that consider information present in software artifacts to generate respective documentation. In our analysis of the literature, we observe that the generation of documentation considers information from individual artifacts and is mostly limited to information present in bug reports, source code and readme files. We also observe that documentation is present in many software artifacts, but is not explicitly mentioned. Extracting documentation information present in various software artifacts could help in enhancing the documentation. It is important to identify software artifacts that serve as potential sources of documentation. However, we are not aware of any work in the literature that identifies and comprehends all the potential sources of different types of documentation. Hence, as a first step towards identifying potential documentation sources, we present a taxonomy of sources of documentation in GitHub.

VII. CONCLUSION AND FUTURE WORK

In this paper, we proposed a taxonomy of documentation sources in GitHub repositories. We identified sources of documentation for different documentation types through a card-sorting approach and developer interviews. This resulted in a taxonomy of six different documentation sources that could provide potential information for the different documentation types identified. We also performed an empirical analysis to understand the distribution of documentation types across multiple software artifacts that are identified as documentation sources, and the contribution of each source to each type of documentation. 950 GitHub repositories, with 233, 241, 253 and 223 repositories from C++, C# [...] issues, commits and pull requests of these repositories, not present explicitly in scraped data, were fetched using the GitHub API. Topic modelling has been applied on data extracted from all documentation sources and the obtained topics are labelled accordingly. The total extracted documentation from multiple sources consisted of 25.9% error-related documentation and 23.6% project-related documentation. File-related documentation and license-related documentation account for 16.04% and 15.99% of the total extracted documentation respectively. API-related documentation made up 5.63% of the total extracted documentation. Source code comments contributed 23.04% and textual documents contributed 22.58% of the total extracted documentation.

Commit logs and pull requests contributed 18.5% and 18.21% of the total extracted documentation respectively. Issues in the repositories contributed 17.63% of the total extracted documentation.

As a part of future work, we plan to identify other software artifacts that could be considered as potential sources of documentation. We also plan to improve the current taxonomy by adding more levels to it. In future, information obtained from sources in the proposed taxonomy could be processed and used to generate documentation of different types. Also, techniques to extract text from design diagrams could be implemented, to leverage other documentation information that could be present in design diagrams.

REFERENCES

[1] Gousios, G., Vasilescu, B., Serebrenik, A., and Zaidman, A., “Lean ghtorrent: Github data on demand,” in Proceedings of the 11th working conference on mining software repositories, 2014, pp. 384–387.
[2] Tsay, J., Dabbish, L., and Herbsleb, J., “Influence of social and technical factors for evaluating contribution in github,” in Proceedings of the 36th international conference on Software engineering, 2014, pp. 356–366.
[3] Jiang, J., Lo, D., He, J., Xia, X., Kochhar, P. S., and Zhang, L., “Why and how developers fork what from whom in github,” Empirical Software Engineering, vol. 22, no. 1, pp. 547–578, 2017.
[4] Jiang, J., Lo, D., Ma, X., Feng, F., and Zhang, L., “Understanding inactive yet available assignees in github,” Information and Software Technology, vol. 91, pp. 44–55, 2017.
[5] Zagalsky, A., Feliciano, J., Storey, M.-A., Zhao, Y., and Wang, W., “The emergence of github as a collaborative platform for education,” in Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing, 2015, pp. 1906–1917.
[6] Borges, H. and Valente, M. T., “What’s in a github star? understanding repository starring practices in a social coding platform,” Journal of Systems and Software, vol. 146, pp. 112–129, 2018.
[7] Prana, G. A. A., Treude, C., Thung, F., Atapattu, T., and Lo, D., “Categorizing the content of github readme files,” Empirical Software Engineering, vol. 24, no. 3, pp. 1296–1327, 2019.
[8] Perez-Riverol, Y., Gatto, L., Wang, R., Sachsenberg, T., Uszkoreit, J., da Veiga Leprevost, F., Fufezan, C., Ternent, T., Eglen, S. J., Katz, D. S. et al., “Ten simple rules for taking advantage of git and github,” PLoS computational biology, vol. 12, no. 7, 2016.
[9] Hebig, R., Quang, T. H., Chaudron, M. R., Robles, G., and Fernandez, M. A., “The quest for open source projects that use uml: mining github,” in Proceedings of the ACM/IEEE 19th International Conference on Model Driven Engineering Languages and Systems, 2016, pp. 173–183.
[10] Vendome, C., Linares-Vásquez, M., Bavota, G., Di Penta, M., German, D., and Poshyvanyk, D., “License usage and changes: a large-scale study of java projects on github,” in . IEEE, 2015, pp. 218–228.
[11] Mendez, C., Padala, H. S., Steine-Hanson, Z., Hilderbrand, C., Horvath, A., Hill, C., Simpson, L., Patil, N., Sarma, A., and Burnett, M., “Open source barriers to entry, revisited: A sociotechnical perspective,” in Proceedings of the 40th International Conference on Software Engineering, 2018, pp. 1004–1015.
[12] Steinmacher, I., Conte, T. U., Treude, C., and Gerosa, M. A., “Overcoming open source project entry barriers with a portal for newcomers,” in Proceedings of the 38th International Conference on Software Engineering, 2016, pp. 273–284.
[13] Aghajani, E., Nagy, C., Vega-Márquez, O. L., Linares-Vásquez, M., Moreno, L., Bavota, G., and Lanza, M., “Software documentation issues unveiled,” in . IEEE, 2019, pp. 1199–1210.
[14] Fronchetti, F., Wiese, I., Pinto, G., and Steinmacher, I., “What attracts newcomers to onboard on oss projects? tl; dr: Popularity,” in IFIP International Conference on Open Source Systems. Springer, 2019, pp. 91–103.
[15] Robles, G., Gonzalez-Barahona, J. M., and Merelo, J. J., “Beyond source code: the importance of other artifacts in software development (a case study),” Journal of Systems and Software, vol. 79, no. 9, pp. 1233–1248, 2006.
[16] Garousi, G., Garousi-Yusifoğlu, V., Ruhe, G., Zhi, J., Moussavi, M., and Smith, B., “Usage and usefulness of technical software documentation: An industrial case study,” Information and Software Technology, vol. 57, pp. 664–682, 2015.
[17] Mahmood, S. and Khan, A., “An industrial study on the importance of software component documentation: A system integrator’s perspective,” Information Processing Letters, vol. 111, no. 12, pp. 583–590, 2011.
[18] Aghajani, E., Nagy, C., Linares-Vásquez, M., Moreno, L., Bavota, G., Lanza, M., and Shepherd, D. C., “Software documentation: the practitioners’ perspective,” in . IEEE, 2020, pp. 590–601.
[19] Kipyegen, N. J. and Korir, W. P., “Importance of software documentation,” International Journal of Computer Science Issues (IJCSI), vol. 10, no. 5, p. 223, 2013.
[20] de Souza, S. C. B., Anquetil, N., and de Oliveira, K. M., “A study of the documentation essential to software maintenance,” in Proceedings of the 23rd annual international conference on Design of communication: documenting & designing for pervasive information, 2005, pp. 68–75.
[21] Stettina, C. J. and Heijstek, W., “Necessary and neglected? an empirical study of internal documentation in agile software development teams,” in Proceedings of the 29th ACM international conference on Design of communication, 2011, pp. 159–166.
[22] Cosentino, V., Izquierdo, J. L. C., and Cabot, J., “A systematic mapping study of software development with github,” IEEE Access, vol. 5, pp. 7173–7192, 2017.
[23] Aggarwal, K., Hindle, A., and Stroulia, E., “Co-evolution of project documentation and popularity within github,” in Proceedings of the 11th Working Conference on Mining Software Repositories, 2014, pp. 360–363.
[24] Ma, Y., Fakhoury, S., Christensen, M., Arnaoudova, V., Zogaan, W., and Mirakhorli, M., “Automatic classification of software artifacts in open-source applications,” in . IEEE, 2018, pp. 414–425.
[25] Chimalakonda, S. and Venigalla, A. S. M., “Software documentation and augmented reality: love or arranged marriage?” in Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2020, pp. 1529–1532.
[26] Spencer, D. and Warfel, T., “Card sorting: a definitive guide,” Boxes and arrows, vol. 2, pp. 1–23, 2004.
[27] Nurmuliani, N., Zowghi, D., and Williams, S. P., “Using card sorting technique to classify requirements change,” in Proceedings. 12th IEEE International Requirements Engineering Conference, 2004. IEEE, 2004, pp. 240–248.
[28] Lo, D., Nagappan, N., and Zimmermann, T., “How practitioners perceive the relevance of software engineering research,” in Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, 2015, pp. 415–425.
[29] Fowkes, J. and Sutton, C., “Parameter-free probabilistic api mining across github,” in Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, 2016, pp. 254–265.
[30] Vendome, C., Bavota, G., Di Penta, M., Linares-Vásquez, M., German, D., and Poshyvanyk, D., “License usage and changes: a large-scale study on github,” Empirical Software Engineering, vol. 22, no. 3, pp. 1537–1577, 2017.
[31] Rastkar, S., Murphy, G. C., and Murray, G., “Summarizing software artifacts: a case study of bug reports,” in , vol. 1. IEEE, 2010, pp. 505–514.
[32] Coelho, J., Valente, M. T., Silva, L. L., and Shihab, E., “Identifying unmaintained projects in github,” in Proceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, 2018, pp. 1–10.
[33] Rahman, M. M. and Roy, C. K., “An insight into the pull requests of github,” in Proceedings of the 11th Working Conference on Mining Software Repositories, 2014, pp. 364–367.
[34] Dabbish, L., Stuart, C., Tsay, J., and Herbsleb, J., “Social coding in github: transparency and collaboration in an open software repository,” in Proceedings of the ACM 2012 conference on computer supported cooperative work, 2012, pp. 1277–1286.
[35] Sheoran, J., Blincoe, K., Kalliamvakou, E., Damian, D., and Ell, J., “Understanding ‘watchers’ on github,” in Proceedings of the 11th working conference on mining software repositories, 2014, pp. 336–339.
[36] Kikas, R., Dumas, M., and Pfahl, D., “Using dynamic and contextual features to predict issue lifetime in github projects,” in . IEEE, 2016, pp. 291–302.
[37] Zhang, Y., Wu, Y., Wang, T., and Wang, H., “ilinker: a novel approach for issue knowledge acquisition in github projects,” World Wide Web, pp. 1–31, 2020.
[38] Michaud, H. M., Guarnera, D. T., Collard, M. L., and Maletic, J. I., “Recovering commit branch of origin from github repositories,” in . IEEE, 2016, pp. 290–300.
[39] Zhou, Y. and Sharma, A., “Automated identification of security issues from commit messages and bug reports,” in Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, 2017, pp. 914–919.
[40] Le, T.-D. B., Linares-Vásquez, M., Lo, D., and Poshyvanyk, D., “Rclinker: Automated linking of issue reports and commits leveraging rich contextual information,” in . IEEE, 2015, pp. 36–47.
[41] Schreck, D., Dallmeier, V., and Zimmermann, T., “How documentation evolves over time,” in Ninth international workshop on Principles of software evolution: in conjunction with the 6th ESEC/FSE joint meeting, 2007, pp. 4–10.
[42] Subramanian, S., Inozemtseva, L., and Holmes, R., “Live api documentation,” in