On Using Active Learning and Self-Training when Mining Performance Discussions on Stack Overflow
Markus Borg, Iben Lennerstad, Rasmus Ros, Elizabeth Bjarnason
(Preprint of paper accepted for the Proc. of the 21st International Conference on Evaluation and Assessment in Software Engineering, 2017)
Markus Borg, Software and Systems Engineering Lab., RISE SICS AB, Lund, Sweden, [email protected]
Iben Lennerstad, Dept. of Computer Science, Lund University, Lund, Sweden, [email protected]
Rasmus Ros and Elizabeth Bjarnason, Dept. of Computer Science, Lund University, Lund, Sweden, [email protected]
Abstract—Abundant data is the key to successful machine learning. However, supervised learning requires annotated data that are often hard to obtain. In a classification task with limited resources, Active Learning (AL) promises to guide annotators to examples that bring the most value for a classifier. AL can be successfully combined with self-training, i.e., extending a training set with the unlabelled examples for which a classifier is the most certain. We report our experiences on using AL in a systematic manner to train an SVM classifier for Stack Overflow posts discussing performance of software components. We show that the training examples deemed as the most valuable to the classifier are also the most difficult for humans to annotate. Despite carefully evolved annotation criteria, we report low inter-rater agreement, but we also propose mitigation strategies. Finally, based on one annotator's work, we show that self-training can improve the classification accuracy. We conclude the paper by discussing implications for future text miners aspiring to use AL and self-training.
Keywords—text mining, classification, active learning, self-training, human annotation.
I. INTRODUCTION
Large datasets are key to successful machine learning and text mining. For example, applying natural language related machine learning to text at web scale [1] has enabled many of the advances in the last decade. It is well known that an algorithm that works well on small datasets might be beaten by simpler alternatives as more data are used for training [2]. However, while the web contains huge amounts of text, supervised learning requires annotated data – data that are hard to obtain.

A common solution to acquiring enough annotated data is crowdsourcing using services such as Amazon Mechanical Turk. The possibility to employ a massive, distributed, anonymous crowd of individuals to perform general human-intelligence micro-tasks for micro-payments has radically changed the way many researchers work [3]. However, when annotation requires more than general human intelligence, i.e., for non-trivial micro-tasks, such crowdsourcing solutions might not work. Annotation of developers' posts on Stack Overflow is an example of non-trivial classification for which successful crowdsourcing cannot be expected.

Active Learning (AL) is a semi-automated approach to establishing a training set. The idea is to reduce the overall human effort by focusing on annotating examples that maximize the gained learning, i.e., the examples for which the classifier is the most uncertain. AL has been used for software fault prediction, successfully reducing the need for human intervention [4], [5], [6]. AL has also been used in several other fields of research, e.g., for creating large training sets for speech recognition and information extraction [7]. Several studies show that AL can successfully be combined with self-training, which is a method to extend the training set by automatic labeling using a trained classifier [8], [9], [10], but the techniques have not previously been used for text mining Stack Overflow.

In this study, our target training set is Stack Overflow discussions on the performance of software components. Our work is part of the ORION project, in which we aim at developing a decision-support system for software component selection [11]. One aspect under study is how to collect and store experiences from previous decisions [12]. The ORION project proposes collecting experiences from both internal and external sources, i.e., both from the company and from other organizations. In this paper, we address using machine learning to extract external experiences from the software engineering community by text mining Stack Overflow, the leading technical Q&A platform for software developers [13].

We report our experiences from using AL and an SVM classifier in a systematic way consisting of 16 iterations. Our findings show that not only is the classifier uncertain regarding the borderline cases – the human annotators also display limited agreement. Consequently, we stress that annotation criteria must continuously evolve during AL. Moreover, we suggest that AL with multiple annotators should be designed with partly overlapping iterations to enable detection of different interpretations. Finally, we demonstrate that self-training has the potential to improve classification accuracy.

The rest of the paper is organized as follows: Section II introduces background and related work, Section III presents the design of our study, and Section IV discusses our findings. Finally, we summarize our implications for future mining operations in Section V.

Fig. 1. Example of a performance question with an answer on Stack Overflow. We highlight the parts of the text that are particularly relevant. The figure also shows the question's three tags.

II. BACKGROUND AND RELATED WORK
Stack Overflow is the dominant technical Q&A platform for software developers, with 101 million monthly unique visitors (March 2017). The information available on Stack Overflow has been studied extensively in the software engineering community, mostly through text mining, but also through qualitative analysis. Fig. 1 shows an example of a Stack Overflow question with an answer, in which we highlight text chunks related to performance.

Treude et al. investigated the type of questions asked and the quality of the answers, and found that the information is particularly useful for code reviews, for conceptual questions, and for novice developers [14]. Soliman et al. found that Stack Overflow contains information relevant to and useful for decisions within software architectural design, and have identified a list of words that may be used to automatically classify such information [15]. Topic modelling has been used to identify which topics are discussed and the relationships between them. In this way, Barua et al. identified a number of current trends within software development, e.g., that mobile app development is increasing faster than web development [13]. It has been suggested that knowledge mined from Stack Overflow can be used to provide context-relevant hints in IDEs [16], [17] and for filtering out off-topic posts, e.g., in chat channels [18].

AL is a semi-supervised machine learning approach in which a learning algorithm interactively queries the human to obtain labels for specific examples, typically the most difficult ones. The method for selecting examples to query should be optimized to maximize the gained learning. Uncertainty sampling is a simple technique that selects the examples for which the classifier is least certain which label to apply [7]. This has the effect of separating the examples into two distinct groups and thus removing borderline cases, see the horizontal histograms in Fig. 2. AL enables a shift of focus from momentary data analysis to a process with a feedback loop [19], [20].

When mining crowdsourced data, there are usually too many unlabelled examples to annotate them all manually.
Fig. 2. Target effect on unlabelled examples when applying AL and SVM. To the left, many examples appear around the hyperplane (the horizontal line). To the right, after completing the AL iterations, the remaining unlabelled examples are separated, i.e., fewer borderline cases remain.
Semi-supervised learning refers to methods that also use the remaining unlabelled examples to improve the classifier. Self-training (or bootstrap learning [21]) is one such method; it extends the training set with the unlabelled examples classified with the highest degree of certainty. This complements AL with uncertainty sampling well, since it maximizes the available confident labels [7]. To the best of our knowledge, we present the first application of both AL and self-training for Stack Overflow mining.
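To make the relationship concrete, the sketch below contrasts the two selection strategies for a linear SVM, using the absolute distance to the hyperplane as a certainty score. This is a minimal illustration under our own naming and batch-size assumptions, not code from the study.

```python
import numpy as np

def uncertainty_sample(svm, pool, batch_size=100):
    """Active learning query: the unlabelled examples the classifier
    is least certain about, i.e., closest to the hyperplane."""
    distance = np.abs(svm.decision_function(pool))
    return np.argsort(distance)[:batch_size]   # smallest |distance| first

def self_training_sample(svm, pool, batch_size=100):
    """Self-training selection: the unlabelled examples the classifier
    is most certain about, i.e., farthest from the hyperplane."""
    distance = np.abs(svm.decision_function(pool))
    return np.argsort(distance)[-batch_size:]  # largest |distance| last
```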
III. METHOD
We designed a study to evaluate AL when mining Stack Overflow. Fig. 3 shows an overview of the research design, which consisted of a preparation step and two iterative training steps. In the preparation step, we downloaded the dataset used for the MSR Mining Challenge in 2015, containing 43,336,603 posts [22]. We extracted all posts that were tagged with 'performance' and at least one of the following tags: 'apache', 'nginx', or 'rails' – an attempt to get an initial dataset related to components we know well, resulting in 2,304 posts in total.
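As an illustration, the sketch below shows how such a tag filter can be applied to a Stack Overflow data dump in the standard Posts.xml format, where tags are encoded as a string such as '<performance><apache>'. The file layout is an assumption based on the public Stack Exchange dump schema; the paper does not describe its extraction code.

```python
import xml.etree.ElementTree as ET

TARGET_TAGS = {'apache', 'nginx', 'rails'}

def candidate_posts(path='Posts.xml'):
    """Yield (id, body) for posts tagged 'performance' plus a target tag."""
    for _, row in ET.iterparse(path, events=('end',)):
        if row.tag == 'row':
            raw = row.get('Tags') or ''          # e.g. '<performance><apache>'
            tags = set(raw.strip('<>').split('><')) if raw else set()
            if 'performance' in tags and tags & TARGET_TAGS:
                yield row.get('Id'), row.get('Body')
            row.clear()                          # keep memory use bounded
```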
Preparation
To assist the manual annotation task, we developed a prototype tool integrating an SVM classifier from scikit-learn [23], i.e., the classifier finds the optimal hyperplane separating two categories of examples [24]. In our application, we trained an SVM classifier with n-grams as features (n=1-5) to separate Stack Overflow posts related to performance discussions of software components from other posts. We refer to the two categories as positive and negative examples, respectively.

During the tool development, the first and second authors alternated annotating posts and evolving initial annotation criteria – note that this initial step was done without AL. In total, we annotated 970 posts (25.4% positive) and the criteria evolved into "a positive post discusses the performance of a software component, rather than programming languages, the development environment, or measurement tools". While manually annotating these initial posts, we identified 67 additional component names that also had explicit Stack Overflow tags. We used these to extend our dataset, i.e., we complemented 'apache', 'nginx', and 'rails' with 67 new tags to obtain a larger dataset of Stack Overflow posts. In total, we collected 15,287 Stack Overflow posts potentially related to performance of software components.

Fig. 3. Overview of the research design. Smileys depict where in the process the first author (A) and the second author (B) were involved. Note that for the self-training step, the smiley shows that the second author's annotated data was used in the automatic process.
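A minimal scikit-learn pipeline in the spirit of this setup is sketched below. The paper specifies n-grams with n = 1–5 and an SVM; the use of TfidfVectorizer, word-level n-grams, and LinearSVC with default parameters are our assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Word n-grams of length 1-5 as features, linear SVM as classifier.
classifier = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 5)),
    LinearSVC(),
)
# posts: list of post texts; labels: 1 = positive example, 0 = negative
# classifier.fit(posts, labels)
```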
Active learning

After the preparation, the first and second authors alternated manual annotation of the next 100 posts closest to the SVM hyperplane – we refer to each such annotation batch [25] as an AL iteration (a reasonable annotation task that requires roughly 90 min). For each iteration, we measured the classification accuracy complemented by precision, recall, and F-score using 5-fold cross-validation. Furthermore, we calculated the distance from each post, both labelled and unlabelled, to the SVM hyperplane. We visualize the distribution of posts at different distances from the SVM hyperplane using histograms and beanplots.
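One AL iteration then amounts to querying the 100 unlabelled posts closest to the hyperplane, labelling them manually, and re-evaluating the classifier. The sketch below captures that loop; the `annotate` callback stands in for the manual work, and all names are ours.

```python
import numpy as np
from sklearn.model_selection import cross_validate

def al_iteration(classifier, X_train, y_train, pool, annotate, batch=100):
    """One active-learning iteration with uncertainty sampling."""
    classifier.fit(X_train, y_train)

    # Query the unlabelled posts closest to the SVM hyperplane.
    distance = np.abs(classifier.decision_function(pool))
    queried = np.argsort(distance)[:batch]

    # Manual annotation step, then extend the training set.
    X_train = X_train + [pool[i] for i in queried]
    y_train = y_train + [annotate(pool[i]) for i in queried]
    pool = [post for i, post in enumerate(pool) if i not in set(queried)]

    # Accuracy, precision, recall, and F-score via 5-fold cross-validation.
    scores = cross_validate(classifier, X_train, y_train, cv=5,
                            scoring=('accuracy', 'precision', 'recall', 'f1'))
    return X_train, y_train, pool, scores
```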
Self-training

We investigated self-training based on the second author's annotation activity (cf. 'Self-train.' in Fig. 3) by adding unlabelled examples as if they were manually annotated. We explored extending the training set with different percentages of unlabelled data, corresponding to different distances to the SVM hyperplane. Our ambition was to identify a successful application of self-training, useful as a proof-of-concept, rather than finding the optimal parameter settings for this particular case.
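A sketch of this procedure is given below: the unlabelled posts that the classifier scores farthest from the hyperplane are pseudo-labelled with the predicted class, a fraction per side. The default fractions match the values reported in Section IV; the implementation details are our assumptions.

```python
import numpy as np

def self_train_extend(classifier, X_train, y_train, pool,
                      pos_frac=0.05, neg_frac=0.50):
    """Extend the training set with the most confidently classified
    unlabelled posts, taking a fraction per predicted class."""
    scores = classifier.decision_function(pool)   # signed distances

    pos = np.where(scores > 0)[0]                 # predicted positive
    neg = np.where(scores <= 0)[0]                # predicted negative

    # Most confident = largest |distance| on each side of the hyperplane.
    k_pos = int(len(pos) * pos_frac)
    k_neg = int(len(neg) * neg_frac)
    pos_keep = pos[np.argsort(scores[pos])[len(pos) - k_pos:]]
    neg_keep = neg[np.argsort(scores[neg])[:k_neg]]

    X_new = [pool[i] for i in np.concatenate([pos_keep, neg_keep])]
    y_new = [1] * len(pos_keep) + [0] * len(neg_keep)
    return X_train + X_new, y_train + y_new
```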
Human annotation
To measure the uncertainty in classifying Stack Overflow posts close to the SVM hyperplane, we evaluated the inter-rater reliability of human annotators. The first and second author discussed experiences after each completed iteration, and the annotation criteria evolved. After 8 iterations, halfway into the study, we considered the criteria mature enough for evaluation. The criteria were then:

"A positive post (both questions and answers) addresses the performance of a specific software component (incl. frameworks, platforms, and libraries) that could be used to evolve a software-intensive system. Examples: database management systems (MySQL, Oracle, ..), content management systems (Drupal, Joomla, ..), web servers.

A post is negative if it discusses performance of/from:
• programming languages (e.g., Java, PHP)
• operational environments (e.g., Windows, Linux)
• development tools (e.g., compilers, IDEs, build systems)
• alternative detailed implementations (e.g., formulation of SQL queries, parsing of XML/JSON structures)
• tweaking of components

or if the post discusses components used to measure performance (e.g., JMeter, SQLTest). The exclusion criteria apply, unless such a discussion clearly originates in poor performance of a specific component".

(Replication package: URL)

We designed a hands-on annotation exercise during a research workshop with 12 senior software engineering researchers (cf. 'Group annotation' in Fig. 3). First, we introduced the exercise, showed some examples, and provided the above criteria. Second, everyone independently annotated 11 posts, printed on paper, during a 20 minute session. In total, 66 posts were distributed using pairwise assignment: two annotators per post, with each possible human pair represented once. Finally, we calculated Krippendorff's α to assess inter-rater reliability, as recommended for difficult nominal tasks [26].

After the group annotation, we discussed the outcome to better understand our differences. We continued the annotation activity, following the same process and expecting a growing shared understanding, until iteration 16. Once finished, we had 2,567 annotated posts (32.6% positive). To check our hypothesis of improving agreement, we randomly selected 50 posts among the already annotated (cf. 'Pair annotation' in Fig. 3). Again we calculated Krippendorff's α, both 1) between the first and second authors (referred to as A and B), and 2) between the new labels and the previous labels. For each post annotated differently, we quantified the certainty of the set label (1-5) and provided a rationale.
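For reference, nominal Krippendorff's α for the two-coders-per-post design can be computed from the coincidence matrix as in the minimal sketch below; this is our own illustration, not the implementation used in the study.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Nominal Krippendorff's alpha; `units` holds one
    (label_coder1, label_coder2) pair per annotated post."""
    coincidence = Counter()
    for pair in units:
        # Both ordered value pairs per unit, each weighted 1/(m-1) = 1.
        for a, b in permutations(pair, 2):
            coincidence[(a, b)] += 1

    marginal = Counter()
    for (a, _), count in coincidence.items():
        marginal[a] += count
    n = sum(marginal.values())

    observed = sum(c for (a, b), c in coincidence.items() if a != b) / n
    expected = sum(marginal[a] * marginal[b]
                   for a in marginal for b in marginal if a != b) / (n * (n - 1))
    return 1.0 - observed / expected

# Example: three posts, one disagreement.
# krippendorff_alpha_nominal([(1, 1), (1, 0), (0, 0)])
```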
IV. RESULTS AND LESSONS LEARNED

Human annotation
We begin this section by reporting on the inter-rater reliability. The results from our group annotation exercise after 8 iterations confirmed the challenge of annotating posts close to the SVM hyperplane. Despite annotation criteria that evolved during 8 AL iterations, the 12 annotators obtained a Krippendorff's α of 0.126 (37/66 shared labels, 56%) – a poor agreement. The first and second authors analyzed the discrepancies, along with posts for which there was agreement, without identifying any concrete patterns. The presence of borderline cases is obvious, but we hypothesized that the alignment between the first and second authors was stronger than within the whole group, and that it would continue improving during the remaining iterations.

After 16 AL iterations, we calculated the inter-rater reliability between A and B for a random sample of 50 previously annotated posts. The exercise yielded a Krippendorff's α of 0.028 (29/50 shared labels, 58%), considerably lower than from the group exercise. We also calculated the inter-rater reliability against our previous annotations of the 50 posts, obtaining a Krippendorff's α of 0.768 (18/20 shared labels, 90%) for A and 0.577 (24/30 shared labels, 80%) for B. Our results show that while our individual annotations remained stable over time, our shared view still differed after 16 iterations.

In most cases, at least one of us was very uncertain, expressing a certainty level of 1 or 2, which means the post was more or less randomly labelled. More alarming, however, was that in several cases both annotators felt certain but used different labels. An analysis of the latter cases revealed that A was more inclusive regarding posts related to implementation details and component tweaking, whereas B was more inclusive concerning quality attributes not necessarily related to performance. Furthermore, B did not include posts that could be interpreted as anecdotal experiences. We conclude that AL for text classification is difficult; even after annotating 2,674 posts with several intermediate discussions, our inter-rater reliability was low.

Fig. 4. Learning curves for the SVM classifier. Note that the 0th iteration contained 373 labelled examples for A, and 600 for B. Each subsequent iteration adds another 100 examples.
Active learning
Since our annotation criteria did not properly align our annotation activity, we hesitated to pool our training data. Instead, we trained three separate SVM classifiers using: 1) A data, 2) B data, and 3) A+B data – we refer to these as SVM A, SVM B, and SVM A+B, respectively. Note that we also split the training data from iteration 0 into either A or B, resulting in initial training sets of different sizes.

Fig. 4 shows the mean value from five runs of 5-fold cross-validation for each iteration. The solid lines with markers show accuracy and F-score for SVM A, the dashed lines with markers represent SVM B, and the solid lines without markers illustrate SVM A+B. Regarding accuracy, all three classifiers show similar behavior: the accuracy decreases as additional iterations are added, but the differences are minor. The curves do not resemble typical learning curves; instead they appear to stabilize between 0.7 and 0.8. We explain this by the posts annotated for iteration 0, i.e., clearly positive and negative examples were selected to span the document space, followed by nothing but borderline cases selected using AL. Looking at the F-score, SVM B and SVM A+B remain fairly stable around 0.5. On the other hand, SVM A improves considerably as more iterations are added. This is likely due to the distribution of examples in the small SVM A iteration 0 training set, containing only 373 examples and yielding a recall of only 0.18 – even adding borderline cases was useful in this case.
TABLE I. RESULTS FROM SELF-TRAINING AS A COMPLEMENT TO B'S TRAINING SET. BASELINE SHOWS THE RESULTS FROM ITERATION 16, OTHER ROWS SHOW DIFFERENCES IN ABSOLUTE VALUES.

Approach       Accuracy   Precision   Recall   F-score
Baseline       0.700      0.574       0.344    0.428
1) +5% pos.    +0.019     +0.024      +0.096   +0.080
2) +50% neg.   +0.025     +0.069      +0.037   +0.046
1) and 2)      +0.043     +0.103      +0.067   +0.079
Fig. 5 depicts distances between annotated posts and the SVM hyperplanes (SVM A and SVM B) after the preparation step and after the final iterations. The vertical histograms show frequency distributions of posts with distances from the hyperplane on the y-axis, where the sign denotes positive and negative classifications, respectively. Moreover, the figure displays the number of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). We notice that as more posts are annotated, the distribution around the hyperplane increases, which is particularly evident for the true negatives. This shows that, from the perspective of the SVM classifiers, 16 AL iterations did not reduce the number of borderline posts.

Fig. 6 presents an analogous view for the unlabelled posts, also separating SVM A and SVM B. However, in this figure we show beanplots, i.e., the frequencies are mirrored on the y-axis. We also report the number of unlabelled posts on both sides of the hyperplanes (cf. |p|). SVM A suggests that there are 716 positive posts remaining in the set of 13,745 posts, whereas SVM B gives 259 remaining positive posts – these figures reflect A's more inclusive interpretation of the annotation criteria. The goal of AL is to focus annotation efforts on borderline cases to create two clearly separated clusters of examples (cf. Fig. 2). This phenomenon is not obvious in Fig. 6, although we observe that SVM B indeed has fewer negative examples close to the hyperplane, i.e., the beanplot close to 0 is thinner after iteration 16. The pattern for SVM A is less clear, and we aim at investigating this in future work by conducting additional iterations.

Fig. 5. Distribution of distances from labelled posts to SVM hyperplanes: SVM A iteration 0 and 15 (left), SVM B iteration 0 and 16 (right).

Fig. 6. Distribution of distances from unlabelled posts to SVM hyperplanes: SVM A iteration 0 and 15 (left), SVM B iteration 0 and 16 (right). Note that the scale on the negative side represents 10x as many posts as the positive side.
Self-training

The rightmost part of Fig. 6 also illustrates how we evaluated self-training using data annotated by B. As depicted by the dashed horizontal line, we explored adding different fractions of the most confidently classified examples (cf. the white bars) to the training set, annotated with the label predicted by the classifier. As a proof-of-concept, we report our results from adding the following unlabelled examples: 1) 5% positive examples, 2) 50% negative examples, and 3) 5% positive examples and 50% negative examples. These additions represent adding unlabelled examples farther away from the hyperplane than 1.76 on the positive side, and 0.88 on the negative side.

Table I shows our results, compared to the baseline provided by iteration 16 without any self-training. Our results show that active learning combined with self-training can be used to improve an SVM classifier for Stack Overflow posts. Adding both positive and negative examples from the unlabelled set can improve classification accuracy. We obtained the best results when adding both types of data, resulting in improvements over the baseline corresponding to +4.3% accuracy, +10.3% precision, +6.7% recall, and +7.9% F-score.

Limitations
Finally, we briefly discuss two aspects of threats to validity. First, we stress that we have populated Table I by cherry-picking results from successful self-training runs. Most of our trial runs with self-training generated similar or worse results. Using an approach to semi-exhaustively evaluate different self-training settings, in total running about 50 experimental runs, Table I shows the best results we obtained. However, our work is not a case of publication bias, as we aim only to exhibit the existence of a phenomenon [27] – a beneficial application of self-training when text mining software repositories. Most self-training settings might deteriorate the accuracy, and a more systematic approach to parameter tuning [28] would probably identify even better settings.

Second, the external validity [29] of our work is limited. AL might be better suited for other software engineering text annotation tasks that involve less human interpretation. It is probable that another set of annotators, guided by other annotation guidelines, would arrive at a different inter-rater reliability. As highlighted by Settles [25], while evolving annotation criteria is often a practical reality when applying AL, such changes are a violation of the basic stability assumption. We also cannot claim that self-training is beneficial to all types of text mining tasks in software engineering. What we can say, however, is that for our particular task of classifying Stack Overflow posts related to performance of software components, self-training yielded improvements – and that is enough to recommend further research.
V. CONCLUSION AND IMPLICATIONS FOR FUTURE TEXT MINING
We explored using AL and an SVM classifier for Stack Overflow posts with two alternating annotators. The primary lesson learned is that AL and text mining appears to be a difficult combination, at least for short texts such as Stack Overflow posts. In contrast to image classification tasks (cf. Karen Zack's viral tweets, e.g., "chihuahua or muffin?": http://ow.ly/zpF1308F7kK), human Stack Overflow annotators must interpret incomplete information presented with limited context – differences in annotations are inevitable. However, we argue that awareness of this intrinsic challenge of AL can be used to complement a traditional annotation process, i.e., AL can be used to identify the borderline cases that are worthwhile to discuss.

Based on our experiences, we present two recommendations for using AL when text mining software repositories. First, the annotation criteria must continuously evolve, in parallel with the annotators' interpretation of them, in line with coding guidelines for qualitative research [29]. It is not enough to simply count the number of differing labels; instead, qualitative analysis is needed to identify any potential systematic differences – before it is too late. Second, we suggest that AL settings with multiple annotators should be designed with partly overlapping iterations to enable early detection of discrepancies. The size of the labelled training set would increase at a slower rate with overlapping iterations, thus this must be balanced against the value of better annotator alignment. In future attempts with AL, we plan to initially design iterations with 25% overlap, and then gradually decrease it to 5% as consensus increases.

Based on the second author's AL process, we evaluated complementing the training set using self-training. Our results are promising: we show that adding both positive and negative examples to the training set can increase the classification accuracy. In a semi-structured approach, we achieved improvements of 4.3% accuracy and 7.9% F-score. We stress that our findings do not suggest that self-training is generally a good idea; rather, our results constitute a proof-of-concept that self-training can be successfully combined with AL. Furthermore, we expect that further improvements from self-training would be possible, and plan to conduct systematic parameter optimization as the next step [28].

ACKNOWLEDGMENT
The work is partially supported by a research grant for the ORION project (reference number 20140218) from The Knowledge Foundation in Sweden, the Wallenberg Autonomous Systems and Software Program (WASP), and the Industrial Excellence Center EASE – Embedded Applications Software Engineering (http://ease.cs.lth.se).

REFERENCES

[1] F. Pereira, P. Norvig, and A. Halevy, "The Unreasonable Effectiveness of Data," IEEE Intelligent Systems, vol. 24, no. 2, pp. 8–12, 2009.
[2] M. Banko and E. Brill, "Scaling to Very Very Large Corpora for Natural Language Disambiguation," in Proc. of the 39th Annual Meeting on Association for Computational Linguistics, 2001, pp. 26–33.
[3] T. Mitra, C. Hutto, and E. Gilbert, "Comparing Person- and Process-centric Strategies for Obtaining Quality Data on Amazon Mechanical Turk," in Proc. of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 2015, pp. 1345–1354.
[4] B. Sun, G. Shu, A. Podgurski, and S. Ray, "CARIAL: Cost-Aware Software Reliability Improvement with Active Learning," in Proc. of the 5th International Conference on Software Testing, Verification and Validation, 2012, pp. 360–369.
[5] H. Lu and B. Cukic, "An Adaptive Approach with Active Learning in Software Fault Prediction," in Proc. of the 8th International Conference on Predictive Models in Software Engineering, 2012, pp. 79–88.
[6] H. Lu, E. Kocaguneli, and B. Cukic, "Defect Prediction between Software Versions with Active Learning and Dimensionality Reduction," in Proc. of the 25th International Symposium on Software Reliability Engineering, 2014, pp. 312–322.
[7] B. Settles, "Active Learning Literature Survey," University of Wisconsin-Madison, Computer Sciences Technical Report 1648, 2010.
[8] Y. Lin, C. Sun, W. Xiaolong, and W. Xuan, "Combining Self Learning and Active Learning for Chinese Named Entity Recognition," Journal of Software, vol. 5, no. 5, pp. 530–537, 2010.
[9] Z. Zhang, E. Pasolli, M. Crawford, and J. C. Tilton, "An Active Learning Framework for Hyperspectral Image Classification Using Hierarchical Segmentation," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 9, no. 2, pp. 640–654, 2016.
[10] J. Richarz, S. Vajda, and G. Fink, "Annotating Handwritten Characters with Minimal Human Involvement in a Semi-supervised Learning Strategy," in Proc. of the International Conference on Frontiers in Handwriting Recognition, 2012, pp. 23–28.
[11] C. Wohlin, K. Wnuk, D. Smite, U. Franke, D. Badampudi, and A. Cicchetti, "Supporting Strategic Decision-Making for Selection of Software Assets," in Software Business, ser. Lecture Notes in Business Information Processing, A. Maglyas and A. Lamprecht, Eds. Springer, Cham, 2016, no. 240, pp. 1–15.
[12] A. Cicchetti, M. Borg, S. Sentilles, K. Wnuk, J. Carlsson, and E. Papatheocharous, "Towards Software Assets Origin Selection Supported by a Knowledge Repository," in Proc. of the 1st International Workshop on Decision Making in Software Architecture, 2016.
[13] A. Barua, S. Thomas, and A. Hassan, "What are Developers Talking About? An Analysis of Topics and Trends in Stack Overflow," Empirical Software Engineering, vol. 19, no. 3, pp. 619–654, 2012.
[14] C. Treude, O. Barzilay, and M. Storey, "How do Programmers Ask and Answer Questions on the Web?: NIER Track," in Proc. of the 33rd International Conference on Software Engineering, 2011, pp. 804–807.
[15] M. Soliman, M. Galster, A. Salama, and M. Riebisch, "Architectural Knowledge for Technology Decisions in Developer Communities: An Exploratory Study with StackOverflow," in Proc. of the 13th Working IEEE/IFIP Conference on Software Architecture, 2016, pp. 128–133.
[16] M. Allamanis and C. Sutton, "Why, When, and What: Analyzing Stack Overflow Questions by Topic, Type, and Code," in Proc. of the 10th Working Conference on Mining Software Repositories, 2013, pp. 53–56.
[17] L. Ponzanelli, G. Bavota, M. Di Penta, R. Oliveto, and M. Lanza, "Mining StackOverflow to Turn the IDE into a Self-confident Programming Prompter," in Proc. of the 11th Working Conference on Mining Software Repositories, 2014, pp. 102–111.
[18] S. Chowdhury and A. Hindle, "Mining StackOverflow to Filter out Off-topic IRC Discussion," in Proc. of the 12th Working Conference on Mining Software Repositories, 2015, pp. 422–425.
[19] A. Hassan and T. Xie, "Software Intelligence: The Future of Mining Software Engineering Data," in Proc. of the FSE/SDP Workshop on Future of Software Engineering Research, 2010, pp. 161–166.
[20] T. Menzies, C. Bird, T. Zimmermann, W. Schulte, and E. Kocaganeli, "The Inductive Software Engineering Manifesto: Principles for Industrial Data Mining," in Proc. of the International Workshop on Machine Learning Technologies in Software Engineering, 2011, pp. 19–26.
[21] D. Yarowsky, "Unsupervised Word Sense Disambiguation Rivaling Supervised Methods," in Proc. of the 33rd Annual Meeting on Association for Computational Linguistics, 1995, pp. 189–196.
[22] A. Ying, "Mining Challenge 2015: Comparing and Combining Different Information Sources on the Stack Overflow Data Set," in Proc. of the 12th Working Conference on Mining Software Repositories, 2015.
[23] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, and V. Dubourg, "Scikit-learn: Machine Learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[24] V. Vapnik, The Nature of Statistical Learning Theory. Springer New York, 2000.
[25] B. Settles, "From Theories to Queries: Active Learning in Practice," in Proc. of the JMLR Workshop on Active Learning and Experimental Design, 2011.
[26] G. Feng, "Mistakes and How to Avoid Mistakes in Using Intercoder Reliability Indices," Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, vol. 11, no. 1, pp. 13–22, 2015.
[27] J. Hannay and M. Jorgensen, "The Role of Deliberate Artificial Design Elements in Software Engineering Experiments," IEEE Transactions on Software Engineering, vol. 34, no. 2, pp. 242–259, 2008.
[28] M. Borg, "TuneR: A Framework for Tuning Software Engineering Tools with Hands-On Instructions in R," Journal of Software: Evolution and Process, vol. 28, no. 6, pp. 427–459, 2016.
[29] P. Runeson, M. Höst, A. Rainer, and B. Regnell, Case Study Research in Software Engineering: Guidelines and Examples. Wiley, 2012.