Predicting the Programming Language of Questions and Snippets of StackOverflow Using Natural Language Processing
Kamel Alreshedy, Dhanush Dharmaretnam, Daniel M. German, Venkatesh Srinivasan, T. Aaron Gulliver
Department of Computer Science, University of Victoria
PO Box 1700, STN CSC, Victoria BC, Canada V8W 2Y2
Kamel, Dhanushd, dmg, [email protected], [email protected]
Abstract — Stack Overflow is the most popular Q&A website among software developers. As a platform for knowledge sharing and acquisition, the questions posted in Stack Overflow usually contain a code snippet. Stack Overflow relies on users to properly tag the programming language of a question, and it simply assumes that the programming language of the snippets inside a question is the same as the tag of the question itself. In this paper, we propose a classifier to predict the programming language of questions posted in Stack Overflow using Natural Language Processing (NLP) and Machine Learning (ML). The classifier achieves an accuracy of 91.1% in predicting the 24 most popular programming languages by combining features from the title, body and code snippets of the question. We also propose a classifier that only uses the title and body of the question and has an accuracy of 81.1%. Finally, we propose a classifier that uses only code snippets and achieves an accuracy of 77.7%. These results show that deploying machine learning techniques on the combination of the text and the code snippets of a question provides the best performance. They also demonstrate that it is possible to identify the programming language of a snippet of a few lines of source code. We visualize the feature space of two programming languages, Java and SQL, in order to identify some special properties of the information inside Stack Overflow questions corresponding to these languages.
Index Terms — Stack Overflow, Machine Learning, Programming Languages and Natural Language Processing.
I. INTRODUCTION

In the last decade, Stack Overflow has become a widely used resource in software development. Today, inexperienced programmers rely on Stack Overflow to address questions they have regarding their software development activities. Along with the growth of Stack Overflow, the number of programming languages in use has increased. In its 2018 Developer Survey, Stack Overflow lists 38 different programming languages in its list of most “loved”, “dreaded” and “wanted” languages. The TIOBE Programming Language index tracks more than 100 languages [22].

Forums like Stack Overflow rely on the tags of questions to match them to users who can provide answers. However, new users of Stack Overflow or novice developers may not tag their posts correctly. This leads to posts being downvoted and flagged by moderators even though the question may be relevant and add value to the community. In some cases, Stack Overflow questions that are related to programming languages may lack a programming language tag. For example, Pandas is a popular Python library that provides data structures and powerful data analysis tools; however, its Stack Overflow questions usually do not include a Python tag. This can create confusion among developers who may be new to a programming language and might not be familiar with all of its popular libraries. The problem of missing language tags could be addressed if posts were automatically tagged with their associated programming languages.

Another related problem is the identification of the programming language of a snippet of code. Given a few lines of code, it is often necessary to identify the language in which they are written. Stack Overflow relies on the tag of a question to determine how to typeset any snippet in it. If a question is not tagged with any programming language, then the code is not typeset; however, the moment a tag is added, the code is rendered with different colours for the different syntactic constructs of the language.
If a question has snippets in two or more languages, Stack Overflow will only use the first programming language tag to typeset all the snippets of the question.

The problem of identifying the language of snippets extends beyond Stack Overflow. Snippets are widely included in blog posts and stored in cut-and-paste tools (such as GitHub's Gists and Pastebin). They might also be stored locally by the user in tools such as QSnippets or Dash. Gists require the user to give each snippet a filename, which is used to classify the programming language. Pastebin, like most snippet management tools, requires the user to select the language of the snippet manually. In both cases, the onus is on the developer to specify the language. Most tools that use source code in documentation (such as recommenders) typically require that the document is already tagged with a programming language to be processed (and assume all the snippets in it are written in that language).

In this paper, we evaluate the use of Machine Learning (ML) models to predict the programming languages of Stack Overflow questions. Our research questions are:
RQ1. Can we predict the programming language of a question in Stack Overflow?

RQ2. Can we predict the programming language of a question in Stack Overflow without using the code snippets inside it?

RQ3. Can we predict the programming language of code snippets in Stack Overflow questions?
For the first research question, we are interested in evaluating how machine learning performs when trying to identify the language of a question using all the information in a Stack Overflow question; this includes its title, body (textual information) and the code snippets in it. The purpose of the second research question is to determine whether the inclusion of code snippets is an essential factor in determining the programming language that a question refers to. Finally, the purpose of the third research question is to evaluate the ability of machine learning to predict the language of a snippet of source code; a successful predictor will have applications beyond Stack Overflow, as it could also be applied to snippet management tools and code search engines that scan documentation and blogs for relevant information.

The main contributions of the paper are as follows:

1) A prediction method that uses a combination of code snippets and textual information in a Stack Overflow question. This classifier achieves an accuracy of 91.1%, a precision of 0.91 and a recall of 0.91 in predicting the programming language tag.
2) A classifier that uses only the textual information in Stack Overflow questions to predict their programming language. This classifier achieves an accuracy of 81.1%, a precision of 0.83 and a recall of 0.81, which is much higher than the previous best model (Baquero et al. [13]).
3) A prediction model based on Random Forest [24] and XGBoost [23] classifiers that predicts the programming language using only a code snippet of a Stack Overflow question. This model is shown to provide an accuracy of 77.7%, a precision of 0.79 and a recall of 0.77.
4) Use of Word2Vec [15] to study the features of two programming languages, Java and SQL; these features are projected into a vector space using Word2Vec and visualized using t-SNE.

The rest of the paper is organized as follows.
We begin by describing the dataset extraction and processing in Section II. Our methodology is described in Section III. The results and discussion are presented in Sections IV and V respectively. Sections VI and VII discuss related and future work. Finally, threats to validity and conclusions are outlined in the last two sections of the paper (Sections VIII and IX).

II. DATASET EXTRACTION AND PROCESSING
In this section, the details of the Stack Overflow dataset are discussed. Then, the preprocessing steps used for data extraction and processing are explained.
A. Stack Overflow Selection
As of July 2017, Stack Overflow had 37.21 million posts, of which 14.45 million are questions with 50.9k different tags. In this paper, the programming language tags in Stack Overflow are of interest. The 24 most popular programming languages as per the 2017 Stack Overflow developer survey were selected for analysis [17]. They constitute about 93% of the questions in Stack Overflow. The languages selected for our study are: Assembly, C, C
B. Extraction and Processing of Stack Overflow Questions
The Stack Overflow July 2017 data dump was used for the analysis. In our study, questions with more than one programming language tag were removed to avoid potential problems during training. The questions chosen contained at least one code snippet, and the code snippet had at least 10 characters. For each programming language, 10,000 random questions were extracted; however, two programming languages had fewer than 10,000 questions: CoffeeScript (4,267) and Lua (8,460). The total number of questions selected was 232,727.

Fig. 1(a) shows an example of a Stack Overflow post. It contains (1) the title of the post, (2) the text body, (3) the code snippet and (4) the tags of the post. It should be noted that the tags of the question were removed and not included as part of the text features during the training process, to eliminate any inherent bias.

Fig. 1: An example of a Stack Overflow question: (a) before applying NLP techniques; (b) after applying NLP techniques.

The .xml data was parsed using xmltodict and the Python Beautiful Soup library to extract the code snippets and the text from each question separately (see Fig. 2). A Stack Overflow question consists of a title, a body and code snippets. In some cases, a question contained multiple code snippets; these were combined into one. The questions were divided into three datasets: their titles, their bodies and their snippets. The title and body (to which we refer as textual information) together with the code snippet were used to answer the first research question. The textual information alone was used to answer the second research question. Finally, the code snippets alone were used to answer the last research question.

Machine learning models cannot be trained on raw text because their performance is affected by noise present in the data. The textual information (title and body) needs to be cleaned and prepared before the machine learning model can be trained to provide a better prediction.
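As a rough illustration of this extraction step, the sketch below pulls the title, textual body, and combined code snippets out of a single row of the Stack Exchange Posts.xml dump. It uses only the standard library (xml.etree and regular expressions) instead of the xmltodict and Beautiful Soup libraries used in our pipeline, and the sample row is invented for the example.

```python
import re
import xml.etree.ElementTree as ET
from html import unescape

# A single row in the style of the Stack Exchange data dump's Posts.xml;
# the question body is HTML stored in the "Body" attribute. This row is
# a made-up example, not taken from the real dump.
ROW = ('<row Id="1642697" PostTypeId="1" Tags="&lt;maven&gt;&lt;tomcat&gt;" '
       'Title="Maven Tomcat version" '
       'Body="&lt;p&gt;Which Tomcat version?&lt;/p&gt;'
       '&lt;pre&gt;&lt;code&gt;mvn tomcat:run&lt;/code&gt;&lt;/pre&gt;"/>')

def split_question(row_xml):
    """Separate a question into title, textual body, and combined code snippet."""
    attrs = ET.fromstring(row_xml).attrib
    body_html = unescape(attrs.get("Body", ""))
    # Collect every <code> block; multiple snippets in one question are
    # merged into one, as described in Section II.
    snippets = re.findall(r"<code>(.*?)</code>", body_html, flags=re.DOTALL)
    # Drop the <pre> blocks, then strip the remaining tags to keep only text.
    text_only = re.sub(r"<pre>.*?</pre>", " ", body_html, flags=re.DOTALL)
    text_only = re.sub(r"<[^>]+>", " ", text_only)
    return {
        "title": attrs.get("Title", ""),
        "body": " ".join(text_only.split()),
        "code": "\n".join(snippets),
    }

q = split_question(ROW)
```

The three returned fields correspond to the title, body and snippet datasets used to answer the three research questions.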
Fig. 2: The dataset extraction process.

A few preprocessing steps were required to clean the text. First, the non-alphanumeric characters such as punctuation, numbers and symbols were removed. Second, the entity names were identified using the dependency parsing of the spaCy library [18]. An entity name is a proper noun (for example, the name of an organization, company, library or function). Third, stop words such as after, about, all and from were removed. Fourth, since a word can have different forms (such as study, studies, studied and studious), it is useful to train using one of these forms and still predict text containing any of the others. To achieve this goal, stemming and lemmatization were performed using the NLTK library in Python [8]. At the end of all the preprocessing steps, the remaining words were used as features to train the machine learning models. Fig. 1(a) shows the original Stack Overflow post (https://stackoverflow.com/questions/1642697/) and Fig. 1(b) shows the same post after application of the NLP techniques.

The extracted set of questions provides good coverage of the different versions of the programming languages. For example, code snippets were extracted for the Python tags python-3.x, python-2.7, python-3.5, python-2.x, python-3.6, python-3.3 and python-2.6, for the Java tags java-8 and java-7, and for the C++ tags c++11, c++03, c++98 and c++14. The extracted snippets had a significant variation in the number of lines of code, as shown in Fig. 3.

III. METHODOLOGY
The textual information (title and body) and the code snippets extracted from the Stack Overflow questions were vectorized using the Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer from the Scikit-learn library [9]. The minimum document frequency (min-df) was set to 10, which means that only words present in at least ten documents were selected (a document can be either the code snippet, the textual information, or both the code snippet and the textual information, depending on the dataset). This step eliminates infrequent words, which helps the machine learning models learn from the most important vocabulary. The maximum document frequency (max-df) was left at its default value because the stop words had already been removed in the data preprocessing step discussed in Section II.
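The effect of the min-df threshold can be seen in a toy re-implementation of TF-IDF. This is not scikit-learn's TfidfVectorizer (whose smoothing and normalization details differ); it only mirrors the behaviour described above: terms appearing in fewer than min-df documents are dropped from the vocabulary. The three-document corpus and the min-df value of 2 are illustrative.

```python
import math
from collections import Counter

def tfidf(docs, min_df=2):
    """Toy TF-IDF with a min_df cutoff: terms that occur in fewer than
    min_df documents are excluded from the vocabulary."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))          # document frequency per term
    vocab = sorted(t for t, c in df.items() if c >= min_df)
    n = len(docs)
    vectors = []
    for doc in docs:
        tf = Counter(doc.split())
        # Smoothed idf; scikit-learn uses a similar (not identical) formula.
        vectors.append([tf[t] * (math.log((1 + n) / (1 + df[t])) + 1)
                        for t in vocab])
    return vocab, vectors

docs = ["tomcat maven plugin", "maven version plugin", "python pandas dataframe"]
vocab, vecs = tfidf(docs, min_df=2)
```

With min_df=2, only "maven" and "plugin" survive; singleton terms such as "tomcat" are discarded, which is exactly the pruning role min-df plays in our pipeline.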
A. Classifiers
The ML algorithms Random Forest Classifier (RFC) and XGBoost (a gradient boosting algorithm) were employed. These algorithms provided higher accuracy than the other algorithms we explored, ExtraTrees and MultinomialNB. The performance metrics used in this paper for the classifiers are precision, recall, accuracy, F1 score and the confusion matrix.
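For concreteness, the three headline per-language metrics reduce to the following computation from raw counts; the counts below are invented purely for illustration.

```python
# Precision, recall and F1 from true-positive, false-positive and
# false-negative counts for a single class (here, one language tag).
def metrics(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Say a classifier labels 50 questions as "python": 40 correctly (TP),
# 10 wrongly (FP), and it misses 10 real python questions (FN).
p, r, f1 = metrics(tp=40, fp=10, fn=10)   # all three come out to 0.8
```

The per-language tables later in the paper report exactly these three quantities, and the confusion matrices show where the false positives and false negatives go.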
1) Random Forest Classifier (RFC):
RFC [24] is an ensemble algorithm which combines more than one classifier. It generates a number of decision trees from randomly selected subsets of the training dataset. Each subset produces a decision tree that votes to make the final decision at test time; the final decision depends on the majority of the trees. One advantage of this classifier is that if one or a few of the trees make a wrong decision, the accuracy of the result is not affected significantly. It also avoids the overfitting problem seen in the Decision Tree model. The total number of trees in the forest is an important parameter because a larger number of trees tends to give higher accuracy.
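The majority-vote step described above can be shown in isolation; the per-tree predictions below are hard-coded stand-ins for the outputs of fitted decision trees.

```python
from collections import Counter

# Each decision tree in the forest votes for a label; the forest returns
# the most common vote.
def forest_predict(tree_predictions):
    return Counter(tree_predictions).most_common(1)[0][0]

# Two of the five trees are wrong, but the ensemble still answers correctly.
label = forest_predict(["java", "java", "sql", "java", "c++"])   # "java"
```

This is why a few erroneous trees do not change the final decision: the wrong votes are outnumbered.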
2) XGBoost:
XGBoost [23], standing for “Extreme Gradient Boosting”, is a tree-based model similar to Decision Trees and RFC. The idea behind boosting is to modify a weak learner to become a better learner. Recall that Random Forest is a simple ensemble algorithm that generates many subtrees, and each tree predicts the output independently; the final output is decided by the majority of the votes from the subtrees. XGBoost, however, is more intelligent because the subtrees make their predictions sequentially. Hence, each subtree learns from the mistakes made by the previous subtree. The idea of XGBoost comes from gradient boosting, but XGBoost uses a regularized model to help control overfitting and give better performance.

Fig. 3: Box plots showing the number of lines of code in the extracted code snippets for all the languages. It should be noted that at least 400 posts had more than 200 lines of code but were not included when making this plot.

The machine learning models were tuned using RandomizedSearchCV, a tool for parameter search in the Scikit-learn library. The XGBoost algorithm has many parameters, such as minimum child weight, max depth, and L1 and L2 regularization, and evaluation metrics such as Receiver Operating Characteristic (ROC), accuracy and F1 score. RFC is a bagging classifier and has a number-of-estimators parameter, which is the number of subtrees used to fit the model. It is important to tune the models by varying these parameters. However, parameter tuning is computationally expensive with a technique such as grid search. Therefore, Random Search (RS) tuning was used to tune the models. All model parameters were fixed after RS tuning on the cross-validation sets (stratified ten-fold cross-validation).
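The random-search tuning loop can be sketched as follows. This is not scikit-learn's RandomizedSearchCV: the cross-validated score is replaced here by a stand-in scoring function so that the search loop itself is visible, and the parameter grid values are illustrative rather than the ones used in our experiments.

```python
import random

# An illustrative search space over a few XGBoost-style parameters.
PARAM_SPACE = {
    "max_depth": [3, 6, 10],
    "min_child_weight": [1, 5, 10],
    "reg_lambda": [0.1, 1.0, 10.0],   # L2 regularization strength
}

def random_search(score_fn, n_iter=20, seed=0):
    """Sample random parameter combinations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {k: rng.choice(v) for k, v in PARAM_SPACE.items()}
        score = score_fn(params)   # would be the mean CV accuracy in practice
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Stand-in score: pretend deeper trees with mild regularization do best.
toy_score = lambda p: p["max_depth"] - abs(p["reg_lambda"] - 1.0)
best, score = random_search(toy_score)
```

Compared to grid search, which evaluates every combination, random search evaluates only n_iter sampled combinations, which is what makes the tuning computationally affordable.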
For this purpose, the datasets were split into training and test data using an 80:20 ratio.

An important contribution of this paper is the study of the vocabulary and feature space of two programming languages (Java and SQL). A Word2Vec model [15] was used to visualize the features from the code snippet and textual information (title and body) datasets using Gensim, a Python framework for vector space modelling [14]. The resulting model represents each word in the vocabulary as a vector; the number of dimensions for the trained word vectors follows the recommendation of the original paper by T. Mikolov et al. [21]. It is impossible to visualize concepts in such a large space, so t-SNE [16] was used to reduce the number of dimensions to two. The most frequent 3% of words were selected from the vectors for Java and SQL and analyzed using word similarity and cosine distance. The code snippet and textual information features which are close to each other in the vector space were selected for Java and SQL using the cosine distance. Then, these features were visualized to understand the similarities and differences between Java and SQL.

IV. RESULTS
In this section, the results obtained for the three research questions are described in detail.
RQ1. Can we predict the programming language of a question in Stack Overflow?
To answer this question, the XGBoost and RFC classifiers were trained on the combination of the textual information and code snippet datasets. The XGBoost classifier achieves an accuracy of 91.1%, and the average precision, recall and F1 score are 0.91, 0.91 and 0.91 respectively; the RFC achieves an accuracy of 86.3%, with average precision, recall and F1 score of 0.87, 0.86 and 0.86 respectively. The results for the XGBoost classifier are discussed in further detail because it provides the best performance. In Table I, the performance metrics for each programming language with respect to precision, recall and F1 score are given for XGBoost. Most programming languages have a high F1 score: Swift (0.97), Go (0.97), Groovy (0.97) and CoffeeScript (0.97) had the highest, while Java (0.75), SQL (0.78) and C
RQ2. Can we predict the programming language of a question in Stack Overflow without using the code snippets inside it?
To answer this research question, two machine learning models were trained using the XGBoost and RFC classifiers on the dataset that contained only the textual information. The XGBoost classifier achieved an accuracy of 81.1%, and the average precision, recall and F1 score were 0.83, 0.81 and 0.81 respectively. In Table II, the performance metrics for each programming language with respect to precision, recall and F1 score are given for XGBoost. RFC achieved slightly lower performance than XGBoost, with average precision, recall and F1 score of 0.76, 0.74 and 0.75 respectively. Note that the accuracy of XGBoost using only textual information decreased by about 10% compared to using both the textual information and the code snippets. The top performing languages based on the F1 score are CoffeeScript (0.94), JavaScript (0.92), Swift (0.92), Go (0.92), Haskell (0.92), C (0.91), Objective-C (0.90) and Assembly (0.89). Note further that the F1 scores of most of the programming languages in the table decreased by approximately 5%, with a few exceptions (such as vb.net, vba, PHP, Lua and C). The top performing languages have a high F1 score and performed very well in both (RQ1) and (RQ2). Java and SQL have the worst performance metrics in (RQ1) and (RQ2); in (RQ2) especially, the F1 score decreased by as much as 25%.

Fig. 4: Confusion matrix for the XGBoost classifier trained on code snippet and textual information features. The diagonal represents the percentage of each programming language that was correctly predicted.

Programming    Precision  Recall  F1-score
Swift          0.98       0.96    0.97
Go             0.98       0.96    0.97
Groovy         0.99       0.95    0.97
Coffeescript   0.98       0.96    0.97
Javascript     0.97       0.95    0.96
C              0.98       0.95    0.96
C++            0.97       0.93    0.95
Objective-c    0.97       0.94    0.95
Assembly       0.96       0.95    0.95
Haskell        0.95       0.95    0.95
Python         0.97       0.91    0.94
Vb.net         0.95       0.93    0.94
PHP            0.94       0.91    0.93
Ruby           0.89       0.93    0.91
Perl           0.91       0.91    0.91
Matlab         0.92       0.90    0.91
R              0.91       0.89    0.90
Lua            0.94       0.86    0.90
Typescript     0.90       0.88    0.89
Vba            0.85       0.91    0.88
Scala          0.85       0.92    0.88
C

TABLE I: Performance for the proposed classifier trained on textual information and code snippet features.
RQ3. Can we predict the programming language of code snippets in Stack Overflow questions?
To predict the programming language from a given code snippet, the two ML classifiers were trained on the code snippet dataset. XGBoost achieved an accuracy of 77.7%, and the average precision, recall and F1 score are 0.79, 0.77 and 0.77 respectively. In Table III, the performance metrics for each programming language with respect to precision, recall and F1 score are given for XGBoost. RFC achieved an accuracy of 70.1%, and the average precision, recall and F1 score are 0.72, 0.72 and 0.70 respectively. The results obtained for the code snippet dataset show the worst performance for both classifiers. The programming languages JavaScript (0.91), CoffeeScript (0.89) and PHP (0.88) had a good F1 score. The F1 score of PHP in (RQ1) is close to the average and in (RQ2) is one of the worst; however, in (RQ3) PHP has the third highest F1 score. Objective-C has the worst F1 score and precision (0.56 and 0.42), but its recall is extremely high (0.85). When the programming language of a code snippet is extremely hard to identify, XGBoost frequently misclassified it as Objective-C, while we observed that RFC misclassified such snippets as TypeScript. We manually examined some of these code snippets and were not able
Programming    Precision  Recall  F1-score
Coffeescript   0.96       0.91    0.94
Javascript     0.94       0.89    0.92
Swift          0.94       0.89    0.92
Go             0.95       0.89    0.92
Haskell        0.92       0.91    0.92
C              0.93       0.88    0.91
Objective-c    0.94       0.87    0.90
Assembly       0.92       0.87    0.89
Python         0.95       0.82    0.88
Groovy         0.95       0.82    0.88
C++            0.92       0.83    0.87
Ruby           0.86       0.88    0.87
R              0.88       0.82    0.85
Perl           0.88       0.81    0.84
Matlab         0.88       0.80    0.84
Scala          0.80       0.90    0.84
Typescript     0.86       0.80    0.83
Vb.net         0.82       0.82    0.82
Vba            0.76       0.81    0.78
PHP            0.82       0.72    0.77
Lua            0.73       0.59    0.65
C
TABLE II: Performance for the proposed classifier trained on textual information features.

to identify the programming language easily. This is the main motivation for combining the textual information and code snippets in (RQ1). If the classifier gets confused when predicting the programming language from a code snippet, the textual information (title and body) helps the machine learning model make a better prediction. Fig. 5 shows the confusion matrix for the XGBoost classifier. Table V shows how the accuracy improves as the minimum size of the code snippet in the dataset is increased from 10 to 100 characters.

The comparison between the results of the textual information dataset and the code snippet dataset shows a slight difference in accuracy of 5% (on average). On the other hand, using the combination of textual information and code snippets significantly increased the accuracy, by 10% compared to using only the textual information and by 14% compared to using only code snippets. Since many Stack Overflow posts can have a large amount of textual information and a small code snippet, or vice versa, combining the two gives a high accuracy in (RQ1).

V. DISCUSSION
The most important observation in the previous section is that for (RQ1), XGBoost achieves a high accuracy of 91.1%, while for (RQ2) and (RQ3) it only achieves accuracies of 81.1% and 77.7% respectively. This observation highlights the importance of combining textual information and code snippets when predicting tags, in comparison to using textual information or code snippets only. In some cases, Stack Overflow posts contain a very small code snippet (e.g., https://stackoverflow.com/questions/855360/, https://stackoverflow.com/questions/942772/, https://stackoverflow.com/questions/9986404/ and https://stackoverflow.com/questions/2115227/), making it extremely hard to identify its language, as many programming languages share the same syntax.

Fig. 5: Confusion matrix for the XGBoost classifier trained on code snippet features. The diagonal represents the percentage of each programming language that was correctly predicted.

Dependency parsing and the extraction of entity names using a Neural Network (NN) through spaCy appeared to help reduce noise and extract important features from the Stack Overflow questions. This is likely the main reason for the significant improvement in performance compared to previous approaches in the literature.

The analysis of the feature space of the top performing languages indicates that these languages have unique code snippet features (keywords/identifiers) and textual information features (libraries, functions).
Programming    Precision  Recall  F1-score
Javascript     0.94       0.88    0.91
Coffeescript   0.92       0.86    0.89
PHP            0.91       0.85    0.88
Go             0.92       0.84    0.87
Groovy         0.91       0.84    0.87
Swift          0.91       0.82    0.86
C              0.86       0.82    0.84
Vb.net         0.89       0.81    0.84
Haskell        0.87       0.80    0.83
C++            0.86       0.78    0.82
Vba            0.82       0.77    0.80
Lua            0.87       0.74    0.80
Assembly       0.76       0.76    0.76
Python         0.85       0.67    0.75
Ruby           0.72       0.79    0.75
Matlab         0.79       0.72    0.75
Scala          0.71       0.79    0.75
SQL            0.70       0.77    0.73
C

TABLE III: Performance for the proposed classifier trained on code snippet features.

For example, when the textual information based features were visualized for Haskell, words such as ‘GHC’, ‘GHCI’, ‘Yesod’ and ‘Monad’ were obtained. ‘GHC’ and ‘GHCI’ are compilers for Haskell, ‘Yesod’ is a web-based framework, and ‘Monad’ is a functional programming concept (Haskell is a functional programming language). Most of the top performing languages have a small feature space (vocabulary) compared to more popular languages such as Java, Vba and C
VI. RELATED WORK
Baquero et al. [13] proposed a classifier to predict the programming language of a Stack Overflow question. They extracted a set of questions from Stack Overflow that contained text and code snippets, with a fixed number of questions for each of the programming languages considered. They trained two classifiers using a Support Vector Machine model on two different datasets, text body features and code snippet features. Their evaluation achieved an accuracy of 60% for text body features and 44% for code snippet features, which are much lower than the results obtained in this paper. Table IV summarizes our results in comparison to [13], as, to the best of our knowledge, it is the only previous work in the literature that tackles the problem of predicting programming language tags for Stack Overflow questions.

Kennedy et al. [20] studied the problem of using natural language identification to identify the programming language of entire source code files from GitHub (rather than questions from Stack Overflow). Their classifier is based on five statistical language models from NLP and achieves a high accuracy of 97.5%. In our work, programming languages are predicted using small code snippets rather than source code files. Similarly, Khasnabish et al. [19] proposed a model to detect programming languages using source code files. Four algorithms based on Bayesian learning techniques were used to train and test the model, including Naive Bayes (NB), Bayesian Network (BN) and Multinomial Naive Bayes (MNB). It was shown that MNB provides the highest accuracy.

Some editors such as Sublime and Atom add highlighting to code based on the programming language. However, this requires an explicit extension, e.g. .html, .css, .py. Portfolio [3] is a search engine that supports programmers in finding functions that implement high-level requirements in query terms. This engine does not identify the language, but it analyzes code snippets and extracts functions which can be reused. Holmes et al.
[5] developed a tool called Strathcona that can find similar snippets of code.

Rekha et al. [7] proposed a hybrid auto-tagging system that suggests tags to users who create questions. When the post contains a code snippet, the system detects the programming language based on the code snippet and suggests tags to the user. A Multinomial Naive Bayes (MNB) classifier was trained and tested, achieving 72% accuracy. Saha et al. [1] converted Stack Overflow questions into vectors, then trained a Support Vector Machine using these vectors and suggested tags using the model obtained. The tag prediction accuracy with this model is 68.47%. Although it works well for some specific tags, it is not effective with some popular tags such as Java.

Model                                          Description                                                                                                     Accuracy  Precision  Recall  F1 score
Previous:
Baquero [13], code snippet features            A model trained using a Support Vector Machine on Stack Overflow questions using code features                  44.6%     0.45       0.44    0.44
Baquero [13], textual information features     A model trained using a Support Vector Machine on Stack Overflow questions using text features                  60.8%     0.68       0.60    0.60
Proposed:
Code snippet features                          XGBoost classifier trained on Stack Overflow questions using code snippet features                              77.7%     0.79       0.77    0.78
Textual information features                   XGBoost classifier trained on Stack Overflow questions using textual information features                       81.1%     0.83       0.81    0.81
Code snippet and textual information features  XGBoost classifier trained on Stack Overflow questions using code snippets and textual information features    91.1%     0.91       0.91    0.91
TABLE IV: A Comparison of Previous and Proposed classifiers.

(a) Java code snippet features. (b) Java textual information features.
Fig. 6: Code snippet and textual information features of Java represented in two dimensions after using t-SNE on a trained Word2Vec model.
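The cosine distance used to select nearby Java and SQL features in the Word2Vec space (Section III) reduces to a short computation. The two-dimensional vectors below are made-up examples; real word vectors from the trained Gensim model have many more dimensions.

```python
import math

# Cosine distance between two vectors: 0 for identical directions,
# 1 for orthogonal ones. Words whose vectors are close by this measure
# are treated as related features.
def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

d_same = cosine_distance([1.0, 2.0], [2.0, 4.0])   # parallel vectors
d_orth = cosine_distance([1.0, 0.0], [0.0, 1.0])   # orthogonal vectors
```

Because cosine distance depends only on direction and not on magnitude, it is well suited to comparing word embeddings, where vector length is not semantically meaningful.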
Minimum characters  Accuracy  Precision  Recall  F1-score
More than 10        77.7%     0.79       0.77    0.78
More than 25        79.1%     0.80       0.79    0.79
More than 50        81.7%     0.82       0.81    0.81
More than 75        83.1%     0.83       0.83    0.83
More than 100       84.7%     0.85       0.84    0.84
TABLE V: Effect of the minimum number of characters in a code snippet on accuracy.

Stanley and Byrne [2] used a cognitive-inspired Bayesian probabilistic model to choose the most suitable tag for a post. This is the tag with the highest probability of being correct given the a priori tag probabilities. However, this model normalizes the top predictions for all questions, so it is unable to differentiate between a post where the top predicted tag is certain and a post where the top predicted tag is questionable. As a consequence, the accuracy is only 65%.

VII. FUTURE WORK
Fig. 7: Code snippet and textual information features of SQL represented in two dimensions using t-SNE on a trained Word2Vec model. (a) SQL code snippet features. (b) SQL textual information features.

The study of programming language prediction from textual information and code snippets is still new, and much remains to be done. Most of the existing tools focus on file extensions rather than the code itself. In recent years, there has been tremendous progress in the field of deep learning, especially for time series or sequence-based models such as Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks. RNN and LSTM models can be trained using source code one character at a time as input, but they can have a high computational cost.

NLP and ML techniques perform much better in predicting languages compared to tools that predict directly from code snippets. Stack Overflow text is somewhat unique in the sense that it captures the tone, sentiments and vocabulary of the developer community. This vocabulary varies depending on the programming language. Therefore, it is important that the vocabulary for each programming language is captured, understood and separated. It is worth exploring whether a CNN combined with Word2Vec can be used for this task.

In the future, our model will be evaluated using programming blog posts, library documentation and bug repositories. This would help us understand how general the model is.

VIII. THREATS TO VALIDITY
Construct validity: In creating the datasets from Stack Overflow, only the most popular programming languages were extracted, and this was based solely on the programming language tag. However, some tags synonymous with languages were not included in the extraction process. For example, ‘SQL SERVER’, ‘PLSQL’ and ‘MICROSOFT SQL SERVER’ are related to ‘SQL’ but were discarded.

Internal validity: After the datasets were extracted, dependency parsing was used to select entity names so as to include only the most relevant code snippet and text features. The use of dependency parsing can result in the loss of critical vocabulary and might affect our results. However, we manually analyzed the vocabulary before and after the dependency parsing to ensure that information related to the languages was not lost. Further, selecting additional features such as lines of code and programming paradigm could have improved our results but was not considered.

External validity: The focus of this paper was to obtain a classifier for predicting languages due to the lack of open source tools for this task. Stack Overflow was used in this study as the data source, but other sources such as GitHub repositories were not explored. Therefore, no conclusions can be made about the results with other sources of code snippets and text on programming languages. Furthermore, some common programming languages such as Cobol and Pascal were not considered in this study.

IX. CONCLUSIONS
This work tackles the important problem of predicting programming languages from code snippets and textual information. In particular, it focuses on predicting the programming language of Stack Overflow questions. Our results show that training and testing the classifier on the combination of textual information and code snippets achieves the highest accuracy of 91.1%. Experiments using only textual information or only code snippets achieve accuracies of 81.1% and 77.7%, respectively. This implies that information from textual features is easier for a machine learning model to learn than information from code snippet features. Our results also show that it is possible to identify the programming language of a snippet of only a few lines of source code. We believe that our classifier could be applied in other scenarios such as code search engines and snippet management tools.
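As a rough illustration of the combined text-plus-snippet approach, the following sketch trains a bag-of-words classifier on toy Stack Overflow-style questions, where the title, body, and code snippet are concatenated into a single document. This is a minimal sketch assuming scikit-learn is available; the example questions, labels, and the choice of a TF-IDF representation with a linear SVM are illustrative placeholders, not the paper's actual dataset or tuned pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy examples: question text concatenated with its code snippet,
# mimicking the combined-features setting. Data and labels are
# illustrative only, not drawn from the actual dataset.
questions = [
    "How do I join two tables? SELECT * FROM a JOIN b ON a.id = b.id",
    "Filter rows by date SELECT name FROM users WHERE created > '2017-01-01'",
    "NullPointerException in main public class Main { public static void main(String[] args) {} }",
    "How to read a file in Java BufferedReader br = new BufferedReader(new FileReader(path));",
]
labels = ["sql", "sql", "java", "java"]

# TF-IDF over word unigrams and bigrams feeding a linear SVM;
# hyperparameters here are defaults, not the paper's tuned values.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(questions, labels)

print(clf.predict(["SELECT id FROM users WHERE name = 'x'"]))
```

In the paper's setting the same idea is applied at scale across the 24 most popular languages, which is where combining title, body, and snippet features yields the reported 91.1% accuracy.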