Sosed: a tool for finding similar software projects

Egor Bogomolov, JetBrains
Yaroslav Golubev, JetBrains
Artyom Lobanov, JetBrains
Vladimir Kovalenko, JetBrains Research, JetBrains
Timofey Bryksin, JetBrains
ABSTRACT
In this paper, we present Sosed, a tool for discovering similar software projects. We use fastText to compute the embeddings of sub-tokens into a dense space for 120,000 GitHub repositories in 200 languages. Then, we cluster the embeddings to identify groups of semantically similar sub-tokens that reflect topics in source code. We use a dataset of 9 million GitHub projects as a reference search base. To identify similar projects, we compare the distributions of clusters among their sub-tokens. The tool receives an arbitrary project as input, extracts sub-tokens in the 16 most popular programming languages, computes the cluster distribution, and finds projects with the closest distribution in the search base. We labeled the sub-token clusters with short descriptions to enable Sosed to produce interpretable output. Sosed is available at https://github.com/JetBrains-Research/sosed/. The tool demo is available at . The multi-language extractor of sub-tokens is available separately at https://github.com/JetBrains-Research/buckwheat/.

ACM Reference Format:
Egor Bogomolov, Yaroslav Golubev, Artyom Lobanov, Vladimir Kovalenko, and Timofey Bryksin. 2020. Sosed: a tool for finding similar software projects. In Proceedings of ACM Conference (Conference'17). ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
INTRODUCTION
Identification of similar projects in a large set of open-source repositories can help in several software engineering tasks: rapid prototyping, program understanding, and plagiarism detection [25]. Additionally, it requires the development of new approaches to understand the meaning behind code and to represent software projects at a large scale. In turn, if the developed methods can detect similar projects, they might also be applied to other software engineering tasks. While popular search engines provide an option to search for web pages or images similar to the input, there is no common approach to finding similar software projects. For instance, prior work on similar project detection leveraged several sources of data:
Java API calls [24], contents of README files [32], user reactions in the form of GitHub stars [32], and tags on SourceForge [29].

Recently, several papers proposed to split code tokens into sub-tokens to improve results in method name prediction [8], variable misuse identification [15], and source code topic modeling [23]. Following these advances, we suggest a novel approach to represent arbitrary fragments of code based on sub-token embeddings, i.e., numerical representations in a dense space. We train sub-token embeddings with fastText [10], an algorithm for training word embeddings that takes into account both words and their subparts.

As prior work demonstrated, words with similar embeddings tend to be semantically related [27]. We retrieve groups of related sub-tokens by clustering their embeddings with the spherical K-means algorithm [16], a modification of the regular K-means [21] that works with cosine distance. These clusters represent topics that occur in a large corpus of source code. We represent code as a distribution of clusters among its sub-tokens.

We implemented the suggested approach as a tool for detecting similar projects called Sosed. We define the similarity of projects as the similarity of the corresponding cluster distributions. To measure it, we suggest using either KL-divergence [18] or cosine similarity of the distribution vectors.

Sosed identifies similar projects based solely on their codebase and supports the 16 most popular languages. It does not make use of collaboration data (e.g., GitHub stars) to avoid popularity bias. Currently, Sosed supports the search for similar repositories across 9 million repositories that comprise all unique public projects on GitHub as of the end of 2016. In the future, we plan to update the dataset to use an up-to-date snapshot of GitHub.

An important feature of Sosed is the explainability of its output. We manually labeled the sub-token clusters with short descriptions of their topics. For each query result, we can provide descriptions of the topics that contributed the most to the similarity measure.

The main contribution of our work is Sosed, an open-source tool for finding similar repositories based on the novel code representation. Sosed provides explainable output, supports 16 programming languages, and searches across millions of reference projects. The tool is available on GitHub [4]. The part of Sosed used for sub-token extraction and language identification is also available as a standalone tool [3].
RELATED WORK
Previous work on detecting similar repositories leveraged several sources of data. McMillan et al. [24] suggested CLAN, a Java-specific approach that detects similar Java applications by analyzing their API calls. The authors applied Latent Semantic Indexing [12] to an occurrence matrix, where columns represent projects and rows represent API calls. The authors obtained vector representations of Java applications and defined the similarity of two projects as the cosine similarity of the corresponding vectors.

Aside from analyzing the code, several approaches to similarity search used data specific to code hosting platforms (e.g., SourceForge [29] or GitHub [32]). Thung et al. [29] used SourceForge's tag system to define the similarity of projects. Tags are short descriptions of project characteristics: category, language, user interface, and so on. Since some tags are more descriptive than others, the authors proposed to assign a weight to each tag. Then, they computed the similarity of two projects from their sets of tags and their intersection. Zhang et al. [32] measured the similarity of projects hosted on GitHub based on the stars given by the same user in a short period of time and the contents of the projects' README files.

The problem of detecting similar applications is also actively researched in the domain of mobile apps [11, 14, 19, 20]. The main difference from open-source software projects is the data associated with each app. For apps in app stores, source code is often not openly accessible, but there are multiple other kinds of data available: description, images, permissions, user reviews, and download size.

Another method related to measuring the similarity of projects is topic modeling on code. The goal of topic modeling is to automatically detect topics in a corpus of unlabeled data, e.g., software projects. The output of a topic modeling algorithm is a set of topics and a distribution of topics in each item from the corpus. A topic is usually represented by a group of reference words or labels that are most frequent across the data comprising the topic. According to the survey by Sun et al. [28], the most popular approach to topic modeling in software engineering is LDA [9]. It treats source code as a bag of tokens, such as variable names, function names, and other identifiers. Markovtsev et al. [23] used ARTM [31], an algorithm similar to LDA, to identify topics across 9 million GitHub projects, which makes it, to the best of our knowledge, the largest study of topic modeling on source code.
In this work, we present Sosed, a tool for finding similar software projects based on a novel representation of code.

SOSED'S INTERNALS
Figure 1 provides an overview of Sosed's internals. To find similar projects, we should define a search space, represent projects in a way suitable for searching, and set up a similarity measure.

As for the search space, we use the dataset of 9 million GitHub repositories collected by Markovtsev et al. [23]. To the best of our knowledge, it is the largest deduplicated dataset of software projects, which makes it suitable for our task straight away.

As a preprocessing step, we transform projects into numerical vectors. Firstly, we train embeddings of sub-tokens on a large corpus of code [22] with fastText [10]. Secondly, we find K clusters of sub-tokens with the spherical K-means algorithm [16], where K is a manually selected parameter. Finally, for each repository, we compute the distribution of clusters among its sub-tokens. The distribution for a project is a K-dimensional vector, where component c is the probability of cluster c appearing among the project's sub-tokens.

We implement two methods for measuring the similarity of projects: explicitly computing the KL-divergence [18] (i.e., a measure of distribution similarity) of their cluster distributions, or computing the cosine similarity of the distribution vectors. In both cases, we use the Faiss [17] library to find the closest distributions.

In the rest of this section, we describe the parts of the tool in more detail.

[Figure 1: Overview of the algorithm to compute projects' similarity (figure omitted).]

Reference projects.
For each repository, the dataset introduced by Markovtsev et al. [23] contains a set of all sub-tokens found in the project. We describe the process of extracting sub-tokens later in this section. The dataset is already cleared of both explicit and implicit forks (i.e., copies of other projects that are not marked as forks on GitHub by their authors). It contains all the GitHub projects as of the end of 2016. Even though the projects in the dataset are not up-to-date, it allows us to implement the search across a vast number of projects. In the future, we plan to create an up-to-date version of the dataset.
Training sub-token embeddings.
For training sub-token embeddings, we use a dataset of identifiers extracted from 120,000 GitHub repositories [22]. It contains sequences of sub-tokens from files in approximately 200 programming languages. We use fastText [10] to compute embeddings of sub-tokens into a 100-dimensional space. Alongside the embeddings of input words, fastText also computes embeddings of the encountered n-grams. This is helpful in the source code domain, because even at the sub-token level there are some highly repetitive n-grams. Another important feature of fastText is its ability to compute embeddings for out-of-vocabulary (OOV) tokens: sub-tokens of reference projects not encountered in the corpus used for training the embeddings. We computed embeddings for OOV sub-tokens with the trained fastText model, which gave us a set of 40 million known sub-tokens.
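fastText's ability to embed OOV sub-tokens comes from its use of character n-gram vectors: an unseen sub-token is represented by combining the vectors of its n-grams. The sketch below illustrates this mechanism in plain Python; it is a simplification, not Sosed's training code, and the function names and toy vectors are ours.

```python
import numpy as np

def char_ngrams(token, n_min=3, n_max=6):
    """Character n-grams of a token with boundary markers, fastText-style."""
    word = f"<{token}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams

def oov_embedding(token, ngram_vectors, dim=100):
    """Approximate an OOV sub-token's embedding by averaging the vectors
    of its known character n-grams (zeros if none are known)."""
    known = [ngram_vectors[g] for g in char_ngrams(token) if g in ngram_vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)
```

A trained model stores a vector per n-gram; an OOV sub-token such as a new library name still shares many n-grams with known identifiers, which is what makes its composed embedding meaningful.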
Extracting sub-tokens from repositories.
A part of this work used for sub-token extraction and language identification might be useful for other tasks as well. To share it with the community and facilitate its reuse, we make it available as a separate project [3]. The input of the sub-token extractor is a list of either links to GitHub repositories or paths to local directories. The output is a list of all extracted sub-tokens and their counts for each project.

In the first step of tokenization, we use enry [2] to recognize the languages of the files in each project. enry is a Go-based language tool that employs several strategies to determine the language of a given file, including its name, extension, and content. enry supports 382 languages, is fast, and does not require a git repository to work, meaning that the input project can be any collection of files.

When run on a directory, enry outputs a JSON file with the recognized languages as keys and lists of files as values. Using these keys, we filter the languages that we are interested in. Based on the statistics on programming language popularity [6], we currently support 16 languages. Twelve of them are parsed with Tree-sitter [7], a fast parsing tool that uses language-specific grammars to parse a given file into an abstract syntax tree (AST). We then filter the AST leaves to obtain various kinds of identifiers, names, constants, etc.

The four remaining languages (Scala, Swift, Kotlin, and Haskell) either do not have a Tree-sitter grammar at the time of writing, or the grammar is in development. The files in these languages are passed on to Pygments [5] lexers. A Pygments lexer splits the code into tokens, each of which also has a certain type. From the list of tokens, we extract those that are of interest to us: this includes the token.Name type by default, but for some languages it also makes sense to gather other types.

The last step of tokenization is splitting each token into sub-tokens. Following Markovtsev et al. [23], we split the tokens by camel case and snake case, append short sub-tokens (less than three characters) to the adjacent longer ones, and stem sub-tokens longer than 6 characters using the Snowball stemmer [26].

For a given project, we carry out identifier extraction and sub-tokenization for all files written in the supported languages and accumulate the results: in the end, the repository is represented as a dictionary with sub-tokens as keys and their counts as values.
Clustering sub-token embeddings.
We use the spherical K-means algorithm [16] to find clusters of similar sub-tokens. The algorithm is similar to the regular K-means [21], but it works with cosine distance instead of the Euclidean distance. Since we work with millions of high-dimensional vectors and cosine distance, other approaches like DBSCAN [13] turn out to be too computationally expensive.

Spherical K-means requires choosing the number of clusters K beforehand. We estimate the optimal number of clusters with the gap statistic [30], a technique based on comparing the distribution of the inner-cluster distances with a uniform distribution. It did not show any significant difference for numbers of clusters above 256, so we set K to 256 to reduce the dimensionality of project representations at the next step.

Clusters represent groups of semantically similar sub-tokens. They can be seen as topics at the sub-token level. As in topic modeling, the topic can be guessed from a set of representatives. In our case, the representatives are the most frequent sub-tokens in the cluster and the sub-tokens closest to the cluster center. To further elevate this information and make Sosed's output explainable, we manually labeled the clusters with short descriptions by looking both at the representatives and at the projects where they are frequently used.
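Spherical K-means can be sketched as regular Lloyd iterations on L2-normalized vectors: assignment maximizes the inner product (which equals cosine similarity on the unit sphere), and centroids are re-normalized after each update. Below is a minimal NumPy sketch under these assumptions, not the implementation used by the tool [16].

```python
import numpy as np

def spherical_kmeans(X, k, n_iter=20, seed=0):
    """Lloyd-style spherical K-means: points and centroids live on the unit
    sphere, so maximizing cosine similarity replaces minimizing Euclidean
    distance. Returns cluster labels and centroids."""
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)      # project onto sphere
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = np.argmax(X @ centers.T, axis=1)          # nearest by cosine
        for c in range(k):
            members = X[labels == c]
            if len(members):
                mean = members.sum(axis=0)
                centers[c] = mean / np.linalg.norm(mean)   # renormalize centroid
    return labels, centers
```

Because cosine similarity on normalized vectors is a single matrix product, each iteration scales well to the millions of 100-dimensional embeddings mentioned above.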
Project representations.
From the previous step, we get a mapping from sub-tokens to clusters. Then, we compute the distribution of clusters among sub-tokens in each project. For each repository, we get a K-dimensional vector where the coordinate along dimension c is equal to the probability of cluster c appearing among the project's sub-tokens.

We applied the described technique to compute representations of 9 million repositories from the dataset of Markovtsev et al. [23], which includes all unique projects (excluding both the explicit and implicit forks) on GitHub as of the end of 2016. This large set of projects forms Sosed's search space.
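Given the sub-token-to-cluster mapping, a project's representation is just a normalized histogram of cluster occurrences. A small sketch with illustrative toy data (the mapping, counts, and function name are ours):

```python
import numpy as np

def project_vector(subtoken_counts, subtoken_to_cluster, k=256):
    """Turn a project's sub-token counts into a K-dimensional cluster
    distribution: component c is the probability of cluster c among the
    project's sub-tokens. Sub-tokens without a cluster index are skipped."""
    vec = np.zeros(k)
    for sub, count in subtoken_counts.items():
        cluster = subtoken_to_cluster.get(sub)
        if cluster is not None:
            vec[cluster] += count
    total = vec.sum()
    return vec / total if total > 0 else vec

# Toy example with K = 2 clusters.
mapping = {"parse": 0, "tree": 1, "token": 1}
counts = {"parse": 2, "tree": 1, "token": 1, "brandnewlib": 3}  # last one is OOV
vec = project_vector(counts, mapping, k=2)
```

Here `vec` is `[0.5, 0.5]`: the two known occurrences of cluster 0 and the two of cluster 1, with the unmapped sub-token ignored.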
Searching for similar repositories.
To find repositories similar to a given one, we should compute a cluster distribution for it. Firstly, we tokenize the project as previously described. Then, we collect the pre-computed cluster indices for the sub-tokens encountered in the reference projects. We do not compute embeddings for OOV sub-tokens in the new projects for two reasons. Firstly, their number is small, because the reference projects contain 40 million different sub-tokens. Secondly, OOV sub-tokens may refer to libraries and technologies that emerged after the reference dataset had been collected, i.e., after the end of 2016. In this case, the embeddings would not reflect the underlying semantics of the sub-tokens.

We implement two methods to compare cluster distributions between query projects and reference projects: direct computation of the KL-divergence [18] between two distributions and the cosine similarity of the distribution vectors. Cosine similarity equals the inner product of the normalized distribution vectors. KL-divergence is expressed by the following formula:

D_KL(P_Q || P_R) = Σ_{c ∈ Clusters} P_Q(c) · log(P_Q(c) / P_R(c)),

where P_Q and P_R are the cluster distributions for a query and a reference project, respectively. Finding a reference project R that minimizes the KL-divergence for the given query project is equivalent to maximizing the following function:

Σ_{c ∈ Clusters} P_Q(c) · log(P_R(c)).

This function is an inner product of the cluster distribution P_Q and a point-wise logarithm of the distribution P_R. Thus, both for KL-divergence and cosine similarity, the search for similar projects reduces to maximizing an inner product between two vectors.

We utilize the Faiss [17] library to find the vectors giving the maximal inner product. Faiss transforms the reference vectors into an indexing structure that can be further used for querying. The indexing structure used in our work does not introduce a significant memory overhead, which allows us to use it with a large search space.
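Both measures thus reduce to a maximum-inner-product search. The sketch below performs it as a brute-force NumPy scan; in Sosed this step is delegated to a Faiss index to scale to millions of reference vectors, and the function below is our illustration, not the tool's code.

```python
import numpy as np

def top_k_similar(query, references, k=5, measure="kl"):
    """Rank reference projects by inner-product scores.

    For KL-divergence, minimizing D_KL(P_Q || P_R) over R is equivalent to
    maximizing sum_c P_Q(c) * log P_R(c): an inner product between the query
    distribution and the point-wise logarithm of each reference distribution.
    """
    eps = 1e-10  # avoid log(0) for clusters absent from a reference project
    if measure == "kl":
        scores = np.log(references + eps) @ query
    else:  # cosine similarity of the distribution vectors
        refs = references / np.linalg.norm(references, axis=1, keepdims=True)
        scores = refs @ (query / np.linalg.norm(query))
    return np.argsort(scores)[::-1][:k]
```

With either measure, the closest reference vectors are found by a single matrix-vector product followed by a top-k selection, which is exactly the operation an exact inner-product Faiss index accelerates.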
To enable the tool to provide explanations for project similarity, we find the sub-token clusters corresponding to the terms that contributed the most to the vectors' inner product. Within the tool's output, we display their contributions alongside the manually given labels and the sub-tokens from these clusters.
EVALUATION
To the best of our knowledge, the only approach used in previous work [24, 29, 32] to evaluate the output of algorithms for finding similar projects is conducting a survey of developers. Since Sosed works with projects in 16 programming languages, thoroughly evaluating its performance without diving deep into specific ecosystems is challenging. We plan to conduct a survey of a large group of programmers with different expertise in order for its results to be reliable.

For now, we evaluated Sosed's output on a set of 94 GitHub projects that comprises top-starred repositories in different languages. The results are available on our GitHub page [4]. For example, the top-5 most similar projects to TensorFlow are deep learning and machine learning frameworks. For Bitcoin, Sosed detected other open-source cryptocurrencies. Among the projects similar to Python, we found Brython, a Python implementation running in a browser.

CONCLUSION
Finding similar software projects among a large set of repositories might be beneficial for practical software engineering tasks like quick prototyping and program understanding. Aside from that, it requires the development of new methods for representing source code, which can find application in other software-related tasks.

We created a novel approach to represent code based on the topic distribution among its sub-tokens. We implemented it as a tool for finding similar software repositories called Sosed. The main features of Sosed are the explainability of its output, the support of 16 programming languages, and independence from project popularity. Sosed is available on GitHub [3, 4].

For now, Sosed searches among a set of 9 million GitHub projects. While this is a large set of data, the open-source community has grown rapidly over the recent years [1]. In order to catch up with the growth of the open-source ecosystem, we plan to collect a new dataset, which will contain an up-to-date set of GitHub projects.

Implementing open-source tools for novel ideas has several benefits. This way, we can quickly evaluate the method's performance, check its practical applicability, and gather feedback from the tool's users. We encourage others to create open-source software based on the developed methods in order to speed up communication and evolution in the research community.
REFERENCES
[1] 2019. The State of the Octoverse. https://octoverse.github.com/
[2] 2020. go-enry GitHub: enry. https://github.com/go-enry/enry/
[3] 2020. JetBrains Research GitHub: Identifiers Extractor. https://github.com/JetBrains-Research/buckwheat/
[4] 2020. JetBrains Research GitHub: Sosed. https://github.com/JetBrains-Research/sosed/
[5] 2020. Pygments: Python syntax highlighter. https://pygments.org/
[6] 2020. The most popular languages of GitHub's pull requests, 1 quarter, 2020. https://madnight.github.io/githut/
[7] tree-sitter GitHub: tree-sitter. https://github.com/tree-sitter/tree-sitter/
[8] Uri Alon, Omer Levy, and Eran Yahav. 2018. code2seq: Generating Sequences from Structured Representations of Code. CoRR abs/1808.01400 (2018). arXiv:1808.01400 http://arxiv.org/abs/1808.01400
[9] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. J. Mach. Learn. Res. 3 (March 2003), 993–1022.
[10] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics.
[11] 2015. In WSDM 2015 - Proceedings of the 8th ACM International Conference on Web Search and Data Mining. https://doi.org/10.1145/2684822.2685305
[12] Scott C. Deerwester, Susan T. Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman. 1990. Indexing by Latent Semantic Analysis. J. Am. Soc. Inf. Sci. 41 (1990), 391–407.
[13] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD'96). AAAI Press, 226–231.
[14] Hugo Gonzalez, Natalia Stakhanova, and Ali Ghorbani. 2014. DroidKin: Lightweight Detection of Android Apps Similarity. Vol. 152. https://doi.org/10.1007/978-3-319-23829-6_30
[15] Vincent J. Hellendoorn, Charles Sutton, Rishabh Singh, Petros Maniatis, and David Bieber. 2020. Global Relational Models of Source Code. In International Conference on Learning Representations. https://openreview.net/forum?id=B1lnbRNtwr
[16] Kurt Hornik, Ingo Feinerer, Martin Kober, and Christian Buchta. 2012. Spherical k-Means Clustering. Journal of Statistical Software 50 (2012), 1–22. https://doi.org/10.18637/jss.v050.i10
[17] Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734 (2017).
[18] S. Kullback and R. A. Leibler. 1951. On Information and Sufficiency. Ann. Math. Statist. 22, 1 (1951), 79–86. https://doi.org/10.1214/aoms/1177729694
[19] L. Li, T. F. Bissyandé, and J. Klein. 2017. SimiDroid: Identifying and Explaining Similarities in Android Apps. 136–143.
[20] M. Linares-Vásquez, A. Holtzhauer, and D. Poshyvanyk. 2016. On automatically detecting similar Android apps. 1–10.
[21] S. Lloyd. 1982. Least squares quantization in PCM. IEEE Transactions on Information Theory 28, 2 (1982), 129–137. https://doi.org/10.1109/tit.1982.1056489
[22] Vadim Markovtsev. 2017. GitHub word2vec 120k. https://data.world/vmarkovtsev/github-word-2-vec-120-k
[23] Vadim Markovtsev and Eiso Kant. 2017. Topic modeling of public repositories at scale using names in source code. arXiv preprint arXiv:1704.00135 (2017).
[24] Collin McMillan, Mark Grechanik, and Denys Poshyvanyk. 2012. Detecting Similar Software Applications. In Proceedings of the 34th International Conference on Software Engineering (ICSE '12). IEEE Press, 364–374.
[25] Tom Mens, Alexander Serebrenik, and Anthony Cleve. 2014. Evolving Software Systems. Springer Publishing Company, Incorporated.
[26] Martin F. Porter. 2001. Snowball: A language for stemming algorithms.
[27] Tobias Schnabel, Igor Labutov, David Mimno, and Thorsten Joachims. 2015. Evaluation methods for unsupervised word embeddings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 298–307. https://doi.org/10.18653/v1/D15-1036
[28] Xiaobing Sun, Xiangyue Liu, Li Bin, Yucong Duan, Hui Yang, and Jiajun Hu. 2016. Exploring topic models in software engineering data analysis: A survey. 357–362. https://doi.org/10.1109/SNPD.2016.7515925
[29] F. Thung, D. Lo, and L. Jiang. 2012. Detecting similar applications with collaborative tagging. 600–603.
[30] Robert Tibshirani, Guenther Walther, and Trevor Hastie. 2001. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63, 2 (2001), 411–423. https://doi.org/10.1111/1467-9868.00293
[31] Konstantin Vorontsov and Anna Potapenko. 2015. Additive regularization of topic models.