Predicting the Programming Language of Questions and Snippets of StackOverflow Using Natural Language Processing
Kamel Alreshedy, Dhanush Dharmaretnam, Daniel M. German, Venkatesh Srinivasan, T. Aaron Gulliver
Department of Computer Science, University of Victoria
PO Box 1700, STN CSC, Victoria BC, Canada V8W 2Y2
Kamel, Dhanushd, dmg, [email protected], [email protected]
Abstract — Stack Overflow is the most popular Q&A website among software developers. As a platform for knowledge sharing and acquisition, the questions posted in Stack Overflow usually contain a code snippet. Stack Overflow relies on users to properly tag the programming language of a question, and it simply assumes that the programming language of the snippets inside a question is the same as the tag of the question itself. In this paper, we propose a classifier to predict the programming language of questions posted in Stack Overflow using Natural Language Processing (NLP) and Machine Learning (ML). The classifier achieves an accuracy of 91.1% in predicting the 24 most popular programming languages by combining features from the title, body and code snippets of the question. We also propose a classifier that only uses the title and body of the question and has an accuracy of 81.1%. Finally, we propose a classifier that uses only code snippets and achieves an accuracy of 77.7%. These results show that deploying machine learning techniques on the combination of the text and the code snippets of a question provides the best performance. They also demonstrate that it is possible to identify the programming language of a snippet of a few lines of source code. We visualize the feature space of two programming languages, Java and SQL, in order to identify some special properties of the information inside Stack Overflow questions corresponding to these languages.
Index Terms — Stack Overflow, Machine Learning, Programming Languages and Natural Language Processing.
I. INTRODUCTION

In the last decade, Stack Overflow has become a widely used resource in software development. Today, inexperienced programmers rely on Stack Overflow to address questions they have regarding their software development activities. Along with the growth of Stack Overflow, the number of programming languages in use has increased. In its 2018 Developer Survey, Stack Overflow lists 38 different programming languages in its list of most “loved”, “dreaded” and “wanted” languages. The TIOBE Programming Language index tracks more than 100 languages [22].

Forums like Stack Overflow rely on the tags of questions to match them to users who can provide answers. However, new users of Stack Overflow or novice developers may not tag their posts correctly. This leads to posts being downvoted and flagged by moderators even though the question may be relevant and add value to the community. In some cases, Stack Overflow questions that are related to programming languages may lack a programming language tag. For example, Pandas is a popular Python library that provides data structures and powerful data analysis tools; however, its Stack Overflow questions usually do not include a Python tag. This can create confusion among developers who may be new to a programming language and might not be familiar with all of its popular libraries. The problem of missing language tags could be addressed if posts were automatically tagged with their associated programming languages.

Another related problem is the identification of the programming language of a snippet of code. Given a few lines of code, it is often necessary to identify the language in which they are written. Stack Overflow relies on the tag of a question to determine how to typeset any snippet in it. If a question is not tagged with any programming language, then the code is not typeset; however, the moment a tag is added, the code is rendered with different colours for the different syntactic constructs of the language.
If a question has snippets in two or more languages, Stack Overflow will only use the first programming language tag to typeset all the snippets of the question.

The problem of identifying the language of snippets extends beyond Stack Overflow. Snippets are widely included in blog posts and stored in cut-and-paste tools (such as GitHub's Gists and Pastebin). They might also be stored locally by the user in tools such as QSnippets or Dash. Gists require the user to give each snippet a filename, which is used to classify the programming language. Pastebin, like most snippet management tools, requires the user to select the language of the snippet manually. In both cases, the onus is on the developer to specify the language. Most tools that use source code in documentation (such as recommenders) typically require that the document is already tagged with a programming language to be processed (and assume all the snippets in it are written in that language).

In this paper, we evaluate the use of Machine Learning (ML) models to predict the programming languages of Stack Overflow questions. Our research questions are:
RQ1. Can we predict the programming language of a question in Stack Overflow?

RQ2. Can we predict the programming language of a question in Stack Overflow without using the code snippets inside it?

RQ3. Can we predict the programming language of code snippets in Stack Overflow questions?
For the first research question, we are interested in evaluating how machine learning performs when trying to identify the language of a question using all the information in a Stack Overflow question; this includes its title, body (textual information) and the code snippets in it. The purpose of the second research question is to determine whether the inclusion of code snippets is an essential factor in determining the programming language that a question refers to. Finally, the purpose of the third research question is to evaluate the ability of machine learning to predict the language of a snippet of source code; a successful predictor will have applications beyond Stack Overflow, as it could also be applied to snippet management tools and code search engines that scan documentation and blogs for relevant information.

The main contributions of the paper are as follows:

1) A prediction method that uses a combination of code snippets and textual information in a Stack Overflow question. This classifier achieves an accuracy of 91.1%, a precision of 0.91 and a recall of 0.91 in predicting the programming language tag.
2) A classifier that uses only the textual information in Stack Overflow questions to predict their programming language. This classifier achieves an accuracy of 81.1%, a precision of 0.83 and a recall of 0.81, which is much higher than the previous best model (Baquero et al. [13]).
3) A prediction model based on Random Forest [24] and XGBoost [23] classifiers that predicts the programming language using only a code snippet of a Stack Overflow question. This model is shown to provide an accuracy of 77.7%, a precision of 0.79 and a recall of 0.77.
4) Use of Word2Vec [15] to study the features of two programming languages, Java and SQL; these features are projected into a vector space using Word2Vec and visualized using t-SNE.

The rest of the paper is organized as follows.
We begin by describing the dataset extraction and processing in Section II. Our methodology is described in Section III. The results and discussion are presented in Sections IV and V respectively. Sections VI and VII discuss related and future work. Finally, threats to validity and conclusions are outlined in the last two sections of the paper (Sections VIII and IX).

II. DATASET EXTRACTION AND PROCESSING
In this section, the details of the Stack Overflow dataset are discussed. Then, the preprocessing steps used for data extraction and processing are explained.
A. Stack Overflow Selection
As of July 2017, Stack Overflow had 37.21 million posts, of which 14.45 million are questions with 50.9k different tags. In this paper, the programming language tags in Stack Overflow are of interest. The 24 most popular programming languages as per the 2017 Stack Overflow developer survey were selected for analysis [17]. They constitute about 93% of the questions in Stack Overflow. The languages selected for our study are: Assembly, C, C
B. Extraction and Processing of Stack Overflow Questions
The Stack Overflow July 2017 data dump was used for the analysis. In our study, questions with more than one programming language tag were removed to avoid potential problems during training. The questions chosen contained at least one code snippet, and the code snippet had at least 10 characters. For each programming language, 10,000 random questions were extracted; however, two programming languages had fewer than 10,000 questions: CoffeeScript (4,267) and Lua (8,460). The total number of questions selected was 232,727.

Fig. 1(a) shows an example of a Stack Overflow post. It contains (1) the title of the post, (2) the text body, (3) the code snippet and (4) the tags of the post. It should be noted that the tags of the question were removed and not included as part of the text features during the training process, to eliminate any inherent bias.

Fig. 1: An example of a Stack Overflow question: (a) before applying NLP techniques; (b) after applying NLP techniques.

The .xml data was parsed using xmltodict and the Python Beautiful Soup library to extract the code snippets and the text from each question separately (see Fig. 2). A Stack Overflow question consists of a title, a body and code snippets. In some cases, a question contained multiple code snippets; these were combined into one. The questions were divided into three datasets: their titles, their bodies and their snippets. The title and body (to which we refer as textual information) together with the code snippet were used to answer the first research question. The textual information alone was used to answer the second research question. Finally, the code snippets alone were used to answer the last research question.

Machine learning models cannot be trained on raw text because their performance is affected by noise present in the data. The textual information (title and body) needs to be cleaned and prepared before the machine learning model can be trained to provide a better prediction.
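As a rough illustration of this extraction step, the sketch below pulls the title, textual body, and combined code snippets out of a single row of the Stack Exchange Posts.xml dump. It uses only the standard library (xml.etree and regular expressions) instead of the xmltodict and Beautiful Soup libraries used in our pipeline, and the sample row is invented for the example.

```python
import re
import xml.etree.ElementTree as ET
from html import unescape

# A single row in the style of the Stack Exchange data dump's Posts.xml;
# the question body is HTML stored in the "Body" attribute. This row is
# a made-up example, not taken from the real dump.
ROW = ('<row Id="1642697" PostTypeId="1" Tags="&lt;maven&gt;&lt;tomcat&gt;" '
       'Title="Maven Tomcat version" '
       'Body="&lt;p&gt;Which Tomcat version?&lt;/p&gt;'
       '&lt;pre&gt;&lt;code&gt;mvn tomcat:run&lt;/code&gt;&lt;/pre&gt;"/>')

def split_question(row_xml):
    """Separate a question into title, textual body, and combined code snippet."""
    attrs = ET.fromstring(row_xml).attrib
    body_html = unescape(attrs.get("Body", ""))
    # Collect every <code> block; multiple snippets in one question are
    # merged into one, as described in Section II.
    snippets = re.findall(r"<code>(.*?)</code>", body_html, flags=re.DOTALL)
    # Drop the <pre> blocks, then strip the remaining tags to keep only text.
    text_only = re.sub(r"<pre>.*?</pre>", " ", body_html, flags=re.DOTALL)
    text_only = re.sub(r"<[^>]+>", " ", text_only)
    return {
        "title": attrs.get("Title", ""),
        "body": " ".join(text_only.split()),
        "code": "\n".join(snippets),
    }

q = split_question(ROW)
```

The three returned fields correspond to the title, body and snippet datasets used to answer the three research questions.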
Fig. 2: The dataset extraction process.

A few preprocessing steps were required to clean the text. First, the non-alphanumeric characters such as punctuation, numbers and symbols were removed. Second, the entity names were identified using the dependency parsing of the spaCy library [18]. An entity name is a proper noun (for example, the name of an organization, company, library or function). Third, stop words such as after, about, all and from were removed. Fourth, since a word can have different forms (such as study, studies, studied and studious), it is useful to train using one of these forms and still predict text containing any of the others. To achieve this goal, stemming and lemmatization were performed using the NLTK library in Python [8]. At the end of all the preprocessing steps, the remaining words were used as features to train the machine learning models. Fig. 1(a) shows the original Stack Overflow post (https://stackoverflow.com/questions/1642697/) and Fig. 1(b) shows the same post after application of the NLP techniques.

The extracted set of questions provides good coverage of the different versions of the programming languages. For example, code snippets were extracted for the Python tags python-3.x, python-2.7, python-3.5, python-2.x, python-3.6, python-3.3 and python-2.6, for the Java tags java-8 and java-7, and for the C++ tags c++11, c++03, c++98 and c++14. The extracted snippets had a significant variation in the number of lines of code, as shown in Fig. 3.

III. METHODOLOGY
The textual information (title and body) and the code snippets extracted from the Stack Overflow questions were vectorized using the Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer from the Scikit-learn library [9]. The minimum document frequency (min-df) was set to 10, which means that only words present in at least ten documents were selected (a document can be either the code snippet, the textual information, or both the code snippet and the textual information, depending on the dataset). This step eliminates infrequent words, which helps the machine learning models learn from the most important vocabulary. The maximum document frequency (max-df) was left at its default value because the stop words had already been removed in the data preprocessing step discussed in Section II.
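The effect of the min-df threshold can be seen in a toy re-implementation of TF-IDF. This is not scikit-learn's TfidfVectorizer (whose smoothing and normalization details differ); it only mirrors the behaviour described above: terms appearing in fewer than min-df documents are dropped from the vocabulary. The three-document corpus and the min-df value of 2 are illustrative.

```python
import math
from collections import Counter

def tfidf(docs, min_df=2):
    """Toy TF-IDF with a min_df cutoff: terms that occur in fewer than
    min_df documents are excluded from the vocabulary."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))          # document frequency per term
    vocab = sorted(t for t, c in df.items() if c >= min_df)
    n = len(docs)
    vectors = []
    for doc in docs:
        tf = Counter(doc.split())
        # Smoothed idf; scikit-learn uses a similar (not identical) formula.
        vectors.append([tf[t] * (math.log((1 + n) / (1 + df[t])) + 1)
                        for t in vocab])
    return vocab, vectors

docs = ["tomcat maven plugin", "maven version plugin", "python pandas dataframe"]
vocab, vecs = tfidf(docs, min_df=2)
```

With min_df=2, only "maven" and "plugin" survive; singleton terms such as "tomcat" are discarded, which is exactly the pruning role min-df plays in our pipeline.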
A. Classifiers
The ML algorithms Random Forest Classifier (RFC) and XGBoost (a gradient boosting algorithm) were employed. These algorithms provided higher accuracy than the other algorithms we explored, ExtraTrees and MultinomialNB. The performance metrics used in this paper for the classifiers are precision, recall, accuracy, F1 score and the confusion matrix.
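For concreteness, the three headline per-language metrics reduce to the following computation from raw counts; the counts below are invented purely for illustration.

```python
# Precision, recall and F1 from true-positive, false-positive and
# false-negative counts for a single class (here, one language tag).
def metrics(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Say a classifier labels 50 questions as "python": 40 correctly (TP),
# 10 wrongly (FP), and it misses 10 real python questions (FN).
p, r, f1 = metrics(tp=40, fp=10, fn=10)   # all three come out to 0.8
```

The per-language tables later in the paper report exactly these three quantities, and the confusion matrices show where the false positives and false negatives go.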
1) Random Forest Classifier (RFC):
RFC [24] is an ensemble algorithm which combines more than one classifier. It generates a number of decision trees from randomly selected subsets of the training dataset. Each subset produces a decision tree that votes to make the final decision at test time; the final decision depends on the majority of the trees. One advantage of this classifier is that if one or a few of the trees make a wrong decision, the accuracy of the result is not affected significantly. It also avoids the overfitting problem seen in the Decision Tree model. The total number of trees in the forest is an important parameter because a larger number of trees tends to give higher accuracy.
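The majority-vote step described above can be shown in isolation; the per-tree predictions below are hard-coded stand-ins for the outputs of fitted decision trees.

```python
from collections import Counter

# Each decision tree in the forest votes for a label; the forest returns
# the most common vote.
def forest_predict(tree_predictions):
    return Counter(tree_predictions).most_common(1)[0][0]

# Two of the five trees are wrong, but the ensemble still answers correctly.
label = forest_predict(["java", "java", "sql", "java", "c++"])   # "java"
```

This is why a few erroneous trees do not change the final decision: the wrong votes are outnumbered.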
2) XGBoost:
XGBoost [23], standing for “Extreme Gradient Boosting”, is a tree-based model similar to Decision Trees and RFC. The idea behind boosting is to modify a weak learner to become a better learner. Recall that Random Forest is a simple ensemble algorithm that generates many subtrees, and each tree predicts the output independently; the final output is decided by the majority of the votes from the subtrees. XGBoost, however, is more intelligent because the subtrees make their predictions sequentially. Hence, each subtree learns from the mistakes made by the previous subtree. The idea of XGBoost comes from gradient boosting, but XGBoost uses a regularized model to help control overfitting and give better performance.

Fig. 3: Box plots showing the number of lines of code in the extracted code snippets for all the languages. It should be noted that at least 400 posts had more than 200 lines of code but were not included when making this plot.

The machine learning models were tuned using RandomizedSearchCV, a tool for parameter search in the Scikit-learn library. The XGBoost algorithm has many parameters, such as minimum child weight, max depth, and L1 and L2 regularization, and evaluation metrics such as Receiver Operating Characteristic (ROC), accuracy and F1 score. RFC is a bagging classifier and has a number-of-estimators parameter, which is the number of subtrees used to fit the model. It is important to tune the models by varying these parameters. However, parameter tuning is computationally expensive with a technique such as grid search. Therefore, Random Search (RS) tuning was used to tune the models. All model parameters were fixed after RS tuning on the cross-validation sets (stratified ten-fold cross-validation).
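The random-search tuning loop can be sketched as follows. This is not scikit-learn's RandomizedSearchCV: the cross-validated score is replaced here by a stand-in scoring function so that the search loop itself is visible, and the parameter grid values are illustrative rather than the ones used in our experiments.

```python
import random

# An illustrative search space over a few XGBoost-style parameters.
PARAM_SPACE = {
    "max_depth": [3, 6, 10],
    "min_child_weight": [1, 5, 10],
    "reg_lambda": [0.1, 1.0, 10.0],   # L2 regularization strength
}

def random_search(score_fn, n_iter=20, seed=0):
    """Sample random parameter combinations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {k: rng.choice(v) for k, v in PARAM_SPACE.items()}
        score = score_fn(params)   # would be the mean CV accuracy in practice
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Stand-in score: pretend deeper trees with mild regularization do best.
toy_score = lambda p: p["max_depth"] - abs(p["reg_lambda"] - 1.0)
best, score = random_search(toy_score)
```

Compared to grid search, which evaluates every combination, random search evaluates only n_iter sampled combinations, which is what makes the tuning computationally affordable.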
For this purpose, the datasets were split into training and test data using an 80:20 ratio.

An important contribution of this paper is the study of the vocabulary and feature space of two programming languages (Java and SQL). A Word2Vec model [15] was used to visualize the features from the code snippet and textual information (title and body) datasets using Gensim, a Python framework for vector space modelling [14]. The resulting model represents each word in the vocabulary as a vector; the number of dimensions for the trained word vectors follows the recommendation of the original paper by T. Mikolov et al. [21]. It is impossible to visualize concepts in such a large space, so t-SNE [16] was used to reduce the number of dimensions to two. The most frequent 3% of words were selected from the vectors for Java and SQL and analyzed using word similarity and cosine distance. The code snippet and textual information features which are close to each other in the vector space were selected for Java and SQL using the cosine distance. Then, these features were visualized to understand the similarities and differences between Java and SQL.

IV. RESULTS
In this section, the results obtained for the three research questions are described in detail.
RQ1. Can we predict the programming language of a question in Stack Overflow?
To answer this question, the XGBoost and RFC classifiers were trained on the combination of the textual information and code snippet datasets. The XGBoost classifier achieves an accuracy of 91.1%, and the average precision, recall and F1 score are 0.91, 0.91 and 0.91 respectively; the RFC achieves an accuracy of 86.3%, with average precision, recall and F1 score of 0.87, 0.86 and 0.86 respectively. The results for the XGBoost classifier are discussed in further detail because it provides the best performance. In Table I, the performance metrics for each programming language with respect to precision, recall and F1 score are given for XGBoost. Most programming languages have a high F1 score: Swift (0.97), Go (0.97), Groovy (0.97) and CoffeeScript (0.97) had the highest, while Java (0.75), SQL (0.78) and C
RQ2. Can we predict the programming language of a question in Stack Overflow without using the code snippets inside it?
To answer this research question, two machine learning models were trained using the XGBoost and RFC classifiers on the dataset that contained only the textual information. The XGBoost classifier achieved an accuracy of 81.1%, and the average precision, recall and F1 score were 0.83, 0.81 and 0.81 respectively. In Table II, the performance metrics for each programming language with respect to precision, recall and F1 score are given for XGBoost. RFC achieved slightly lower performance than XGBoost, with average precision, recall and F1 score of 0.76, 0.74 and 0.75 respectively. Note that the accuracy of XGBoost using only textual information decreased by about 10% compared to using both the textual information and the code snippets. The top performing languages based on the F1 score are CoffeeScript (0.94), JavaScript (0.92), Swift (0.92), Go (0.92), Haskell (0.92), C (0.91), Objective-C (0.90) and Assembly (0.89). Note further that the F1 scores of most of the programming languages in the table decreased by approximately 5%, with a few exceptions (such as vb.net, vba, PHP, Lua and C). The top performing languages have a high F1 score and performed very well in both (RQ1) and (RQ2). Java and SQL have the worst performance metrics in (RQ1) and (RQ2); in (RQ2) especially, the F1 score decreased by as much as 25%.

Fig. 4: Confusion matrix for the XGBoost classifier trained on code snippet and textual information features. The diagonal represents the percentage of each programming language that was correctly predicted.

Programming    Precision  Recall  F1-score
Swift          0.98       0.96    0.97
Go             0.98       0.96    0.97
Groovy         0.99       0.95    0.97
Coffeescript   0.98       0.96    0.97
Javascript     0.97       0.95    0.96
C              0.98       0.95    0.96
C++            0.97       0.93    0.95
Objective-c    0.97       0.94    0.95
Assembly       0.96       0.95    0.95
Haskell        0.95       0.95    0.95
Python         0.97       0.91    0.94
Vb.net         0.95       0.93    0.94
PHP            0.94       0.91    0.93
Ruby           0.89       0.93    0.91
Perl           0.91       0.91    0.91
Matlab         0.92       0.90    0.91
R              0.91       0.89    0.90
Lua            0.94       0.86    0.90
Typescript     0.90       0.88    0.89
Vba            0.85       0.91    0.88
Scala          0.85       0.92    0.88
C

TABLE I: Performance for the proposed classifier trained on textual information and code snippet features.
RQ3. Can we predict the programming language of code snippets in Stack Overflow questions?
To predict the programming language from a given code snippet, the two ML classifiers were trained on the code snippet dataset. XGBoost achieved an accuracy of 77.7%, and the average precision, recall and F1 score are 0.79, 0.77 and 0.77 respectively. In Table III, the performance metrics for each programming language with respect to precision, recall and F1 score are given for XGBoost. RFC achieved an accuracy of 70.1%, and the average precision, recall and F1 score are 0.72, 0.72 and 0.70 respectively. The results obtained for the code snippet dataset show the worst performance for both classifiers. The programming languages JavaScript (0.91), CoffeeScript (0.89) and PHP (0.88) had a good F1 score. The F1 score of PHP in (RQ1) is close to the average and in (RQ2) is one of the worst; however, in (RQ3) PHP has the third highest F1 score. Objective-C has the worst F1 score and precision (0.56 and 0.42), but its recall is extremely high (0.85). When the programming language of a code snippet is extremely hard to identify, XGBoost frequently misclassified it as Objective-C, while we observed that RFC misclassified such snippets as TypeScript. We manually examined some of these code snippets and were not able
Programming    Precision  Recall  F1-score
Coffeescript   0.96       0.91    0.94
Javascript     0.94       0.89    0.92
Swift          0.94       0.89    0.92
Go             0.95       0.89    0.92
Haskell        0.92       0.91    0.92
C              0.93       0.88    0.91
Objective-c    0.94       0.87    0.90
Assembly       0.92       0.87    0.89
Python         0.95       0.82    0.88
Groovy         0.95       0.82    0.88
C++            0.92       0.83    0.87
Ruby           0.86       0.88    0.87
R              0.88       0.82    0.85
Perl           0.88       0.81    0.84
Matlab         0.88       0.80    0.84
Scala          0.80       0.90    0.84
Typescript     0.86       0.80    0.83
Vb.net         0.82       0.82    0.82
Vba            0.76       0.81    0.78
PHP            0.82       0.72    0.77
Lua            0.73       0.59    0.65
C
TABLE II: Performance for the proposed classifier trained on textual information features.

to identify the programming language easily. This is the main motivation for combining the textual information and code snippets in (RQ1). If the classifier gets confused when predicting the programming language from a code snippet, the textual information (title and body) helps the machine learning model make a better prediction. Fig. 5 shows the confusion matrix for the XGBoost classifier. Table V shows how the accuracy improves as the minimum size of the code snippet in the dataset is increased from 10 to 100 characters.

The comparison between the results of the textual information dataset and the code snippet dataset shows a slight difference in accuracy of 5% (on average). On the other hand, using the combination of textual information and code snippets significantly increased the accuracy, by 10% compared to using only the textual information and by 14% compared to using only code snippets. Since many Stack Overflow posts can have a large amount of textual information and a small code snippet, or vice versa, combining the two gives a high accuracy in (RQ1).

V. DISCUSSION
The most important observation in the previous section is that for (RQ1), XGBoost achieves a high accuracy of 91.1%, while for (RQ2) and (RQ3) it only achieves accuracies of 81.1% and 77.7% respectively. This observation highlights the importance of combining textual information and code snippets when predicting tags, in comparison to using textual information or code snippets only. In some cases, Stack Overflow posts contain a very small code snippet (e.g., https://stackoverflow.com/questions/855360/, https://stackoverflow.com/questions/942772/, https://stackoverflow.com/questions/9986404/ and https://stackoverflow.com/questions/2115227/), making it extremely hard to identify its language, as many programming languages share the same syntax.

Fig. 5: Confusion matrix for the XGBoost classifier trained on code snippet features. The diagonal represents the percentage of each programming language that was correctly predicted.

Dependency parsing and the extraction of entity names using a Neural Network (NN) through spaCy appeared to help reduce noise and extract important features from the Stack Overflow questions. This is likely the main reason for the significant improvement in performance compared to previous approaches in the literature.

The analysis of the feature space of the top performing languages indicates that these languages have unique code snippet features (keywords/identifiers) and textual information features (libraries, functions).
Programming    Precision  Recall  F1-score
Javascript     0.94       0.88    0.91
Coffeescript   0.92       0.86    0.89
PHP            0.91       0.85    0.88
Go             0.92       0.84    0.87
Groovy         0.91       0.84    0.87
Swift          0.91       0.82    0.86
C              0.86       0.82    0.84
Vb.net         0.89       0.81    0.84
Haskell        0.87       0.80    0.83
C++            0.86       0.78    0.82
Vba            0.82       0.77    0.80
Lua            0.87       0.74    0.80
Assembly       0.76       0.76    0.76
Python         0.85       0.67    0.75
Ruby           0.72       0.79    0.75
Matlab         0.79       0.72    0.75
Scala          0.71       0.79    0.75
SQL            0.70       0.77    0.73
C

TABLE III: Performance for the proposed classifier trained on code snippet features.

For example, when the textual information based features were visualized for Haskell, words such as ‘GHC’, ‘GHCI’, ‘Yesod’ and ‘Monad’ were obtained. ‘GHC’ and ‘GHCI’ are compilers for Haskell, ‘Yesod’ is a web-based framework, and ‘Monad’ is a functional programming concept (Haskell is a functional programming language). Most of the top performing languages have a small feature space (vocabulary) compared to more popular languages such as Java, Vba and C
VI. RELATED WORK
Baquero et al. [13] proposed a classifier to predict the programming language of a Stack Overflow question. They extracted a set of questions from Stack Overflow that contained text and code snippets, with a fixed number of questions for each of the programming languages considered. They trained two classifiers using a Support Vector Machine model on two different datasets, text body features and code snippet features. Their evaluation achieved an accuracy of 60% for text body features and 44% for code snippet features, which are much lower than the results obtained in this paper. Table IV summarizes our results in comparison to [13], as, to the best of our knowledge, it is the only previous work in the literature that tackles the problem of predicting programming language tags for Stack Overflow questions.

Kennedy et al. [20] studied the problem of using natural language identification to identify the programming language of entire source code files from GitHub (rather than questions from Stack Overflow). Their classifier is based on five statistical language models from NLP and achieves a high accuracy of 97.5%. In our work, programming languages are predicted using small code snippets rather than source code files. Similarly, Khasnabish et al. [19] proposed a model to detect programming languages using source code files. Four algorithms based on Bayesian learning techniques were used to train and test the model, including Naive Bayes (NB), Bayesian Network (BN) and Multinomial Naive Bayes (MNB). It was shown that MNB provides the highest accuracy.

Some editors such as Sublime and Atom add highlighting to code based on the programming language. However, this requires an explicit extension, e.g. .html, .css, .py. Portfolio [3] is a search engine that supports programmers in finding functions that implement high-level requirements in query terms. This engine does not identify the language, but it analyzes code snippets and extracts functions which can be reused. Holmes et al.
[5] developed a tool called Strathcona that can find similar snippets of code.

Rekha et al. [7] proposed a hybrid auto-tagging system that suggests tags to users who create questions. When the post contains a code snippet, the system detects the programming language based on the code snippet and suggests tags to the user. A Multinomial Naive Bayes (MNB) classifier was trained and tested, achieving 72% accuracy. Saha et al. [1] converted Stack Overflow questions into vectors, then trained a Support Vector Machine using these vectors and suggested tags using the model obtained. The tag prediction accuracy with this model is 68.47%. Although it works well for some specific tags, it is not effective with some popular tags such as Java.

Model                                          Description                                                                                                     Accuracy  Precision  Recall  F1 score
Previous:
Baquero [13], code snippet features            A model trained using a Support Vector Machine on Stack Overflow questions using code features                  44.6%     0.45       0.44    0.44
Baquero [13], textual information features     A model trained using a Support Vector Machine on Stack Overflow questions using text features                  60.8%     0.68       0.60    0.60
Proposed:
Code snippet features                          XGBoost classifier trained on Stack Overflow questions using code snippet features                              77.7%     0.79       0.77    0.78
Textual information features                   XGBoost classifier trained on Stack Overflow questions using textual information features                       81.1%     0.83       0.81    0.81
Code snippet and textual information features  XGBoost classifier trained on Stack Overflow questions using code snippets and textual information features    91.1%     0.91       0.91    0.91
TABLE IV: A Comparison of Previous and Proposed classifiers.

(a) Java code snippet features. (b) Java textual information features.
Fig. 6: Code snippet and textual information features of Java represented in two dimensions after using t-SNE on a trained Word2Vec model.
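The cosine distance used to select nearby Java and SQL features in the Word2Vec space (Section III) reduces to a short computation. The two-dimensional vectors below are made-up examples; real word vectors from the trained Gensim model have many more dimensions.

```python
import math

# Cosine distance between two vectors: 0 for identical directions,
# 1 for orthogonal ones. Words whose vectors are close by this measure
# are treated as related features.
def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

d_same = cosine_distance([1.0, 2.0], [2.0, 4.0])   # parallel vectors
d_orth = cosine_distance([1.0, 0.0], [0.0, 1.0])   # orthogonal vectors
```

Because cosine distance depends only on direction and not on magnitude, it is well suited to comparing word embeddings, where vector length is not semantically meaningful.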
Minimum characters  Accuracy  Precision  Recall  F1-score
More than 10        77.7%     0.79       0.77    0.78
More than 25        79.1%     0.80       0.79    0.79
More than 50        81.7%     0.82       0.81    0.81
More than 75        83.1%     0.83       0.83    0.83
More than 100       84.7%     0.85       0.84    0.84
TABLE V: Effect of the minimum number of characters in a code snippet on accuracy.

Stanley and Byrne [2] used a cognitive-inspired Bayesian probabilistic model to choose the most suitable tag for a post. This is the tag with the highest probability of being correct given the a priori tag probabilities. However, this model normalizes the top predictions for all questions, so it is unable to differentiate between a post where the top predicted tag is certain and a post where the top predicted tag is questionable. As a consequence, the accuracy is only 65%.

VII. FUTURE WORK
Fig. 7: Code snippet and textual information features of SQL represented in two dimensions using t-SNE on a trained Word2Vec model. (a) SQL code snippet features. (b) SQL textual information features.

The study of programming language prediction from textual information and code snippets is still new, and much remains to be done. Most of the existing tools focus on file extensions rather than the code itself. In recent years, there has been tremendous progress in the field of deep learning, especially for time series or sequence-based models such as Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks. RNN and LSTM models can be trained using source code one character at a time as input, but they can have a high computational cost.

NLP and ML techniques perform much better in predicting languages compared to tools that predict directly from code snippets. Stack Overflow text is somewhat unique in the sense that it captures the tone, sentiments and vocabulary of the developer community. This vocabulary varies depending on the programming language. Therefore, it is important that the vocabulary for each programming language is captured, understood and separated. It is worth exploring whether a CNN combined with Word2Vec can be used for this task.

In the future, our model will be evaluated using programming blog posts, library documentation and bug repositories. This would help us understand how general the model is.

VIII. THREATS TO VALIDITY
Construct validity: In creating the datasets from Stack Overflow, only the most popular programming languages were extracted, and this was based solely on the programming language tag. However, some tags synonymous with languages were not included in the extraction process. For example, ‘SQL SERVER’, ‘PLSQL’ and ‘MICROSOFT SQL SERVER’ are related to ‘SQL’ but were discarded.

Internal validity: After the datasets were extracted, dependency parsing was used to select entity names so as to include only the most relevant code snippet and text features. The use of dependency parsing can result in the loss of critical vocabulary and might affect our results. However, we manually analyzed the vocabulary before and after the dependency parsing to ensure that information related to the languages was not lost. Further, selecting additional features such as lines of code and programming paradigm could have improved our results but was not considered.

External validity: The focus of this paper was to obtain a classifier for predicting languages due to the lack of open source tools for this task. Stack Overflow was used in this study as the data source, but other sources such as GitHub repositories were not explored. Therefore, no conclusions can be made about the results with other sources of code snippets and text on programming languages. Furthermore, some common programming languages such as Cobol and Pascal were not considered in this study.

IX. CONCLUSIONS
This work tackles the important problem of predicting programming languages from code snippets and textual information. In particular, it focuses on predicting the programming language of Stack Overflow questions. Our results show that training and testing the classifier on the combination of textual information and code snippets achieves the highest accuracy of 91.1%. Experiments using only textual information or only code snippets achieve accuracies of 81.1% and 77.7%, respectively. This implies that information from textual features is easier for a machine learning model to learn than information from code snippet features. Our results also show that it is possible to identify the programming language of a snippet of only a few lines of source code. We believe that our classifier could be applied in other scenarios such as code search engines and snippet management tools.
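As a rough illustration of the combined text-plus-snippet approach, the following sketch trains a bag-of-words classifier on toy Stack Overflow-style questions, where the title, body, and code snippet are concatenated into a single document. This is a minimal sketch assuming scikit-learn is available; the example questions, labels, and the choice of a TF-IDF representation with a linear SVM are illustrative placeholders, not the paper's actual dataset or tuned pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy examples: question text concatenated with its code snippet,
# mimicking the combined-features setting. Data and labels are
# illustrative only, not drawn from the actual dataset.
questions = [
    "How do I join two tables? SELECT * FROM a JOIN b ON a.id = b.id",
    "Filter rows by date SELECT name FROM users WHERE created > '2017-01-01'",
    "NullPointerException in main public class Main { public static void main(String[] args) {} }",
    "How to read a file in Java BufferedReader br = new BufferedReader(new FileReader(path));",
]
labels = ["sql", "sql", "java", "java"]

# TF-IDF over word unigrams and bigrams feeding a linear SVM;
# hyperparameters here are defaults, not the paper's tuned values.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(questions, labels)

print(clf.predict(["SELECT id FROM users WHERE name = 'x'"]))
```

In the paper's setting the same idea is applied at scale across the 24 most popular languages, which is where combining title, body, and snippet features yields the reported 91.1% accuracy.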