Eliminating self-selection: Using data science for authentic undergraduate research in a first-year introductory course
Lior Shamir
Department of Computer Science, Kansas State University
1701D Platt St., Manhattan, KS [email protected]
Abstract
Research experience and mentoring have been identified as an effective intervention for increasing student engagement and retention in the STEM fields, with high impact on students from underserved populations. However, one-on-one mentoring is limited by the number of available faculty, and in certain cases also by the availability of funding for stipends. One-on-one mentoring is further limited by the selection and self-selection of students. Since research positions are often competitive, they are often taken by the best-performing students. More importantly, many students who do not see themselves as the top students of their class, or do not identify themselves as researchers, might not apply, and that self-selection can have the highest impact on non-traditional students. To address the obstacles of scalability, selection, and self-selection, we designed a data science research experience for undergraduates as part of an introductory computer science course. Through the intervention, the students are exposed to authentic research as early as their first semester. The intervention is inclusive in the sense that all students registered for the course participate in the research, with no process of selection or self-selection. The research is focused on analytics of large text databases. Using discovery-enabling software tools, the students analyze a corpus of congressional speeches, and identify patterns of differences between Democratic speeches and Republican speeches, differences between speeches for and against certain bills, and differences between speeches about bills that passed and bills that did not pass. At the beginning of the research experience all students follow the same protocol and use the same data, and then each group of students works on its own research project as the final project of the course. Several students continued to work on the research after the semester ended, and two teams also submitted scientific papers describing their findings.
Introduction
In recent years, undergraduate research experience has become increasingly prevalent, and different models of undergraduate research experience have been proposed and implemented (Russell, Hancock, and McCullough 2007; PCAST 2012; Linn et al. 2015). Learning through research exposes undergraduate students to educational aspects and hands-on experiences they cannot gain effectively through traditional lecture-based education (Hunter, Laursen, and Seymour 2007). That includes skills such as making connections among seemingly disparate pieces of information, evaluation of evidence, and bringing the requisite expertise to address complex issues (American Association for the Advancement of Science 2009; Brownell et al. 2015).

In addition to its academic advantages, undergraduate research experience is an effective tool for student engagement (Seymour et al. 2004) and consequently student retention (Hippel et al. 1998; Kinkel and Henke 2006; Lopatto 2007; Braxton, Hirschy, and McClendon 2011; PCAST 2012). Research experience also leads to higher grades (Kinkel and Henke 2006; Barlow and Villarejo 2004). Undergraduate research experience was found highly effective for attracting and retaining underrepresented minority students in STEM (Barlow and Villarejo 2004; Tsui 2007; Johnson and Okoro 2016; Collins et al. 2017), and the impact of undergraduate research experience on retention and graduation of underrepresented minorities is higher than the average impact of research experience on the general student population (Russell, Hancock, and McCullough 2007; Villarejo et al. 2008; Jones, Barlow, and Villarejo 2010; Chemers et al. 2011).

While undergraduate research experience is a proven effective intervention, exposing all students to research introduces several obstacles. One-on-one mentoring is often limited by the availability of faculty who can mentor undergraduate students. In institutions that focus on undergraduate education, the number of research labs is limited, as is the number of faculty with active research programs who can mentor undergraduate students and lead them to authentic research.
That situation can be solved partially by models such as the NSF's Research Experience for Undergraduates (REU), according to which students can spend a summer at a research institution. However, that model depends on the availability of funding, and therefore just a few of the students can benefit from it. More importantly, joining a research lab requires a student to actively apply and sometimes compete for the research position (Bangera and Brownell 2014). That process might leave many students who do not see themselves as competitive or do not see themselves as researchers without access to the intervention, while sometimes rewarding the more privileged students who use these research experience opportunities to further enhance their skills (?). Therefore, the students who are stronger academically, have higher GPA, and see themselves as researchers are much more likely to be exposed to research experience (Sell, Naginey, and Stanton 2018; Cooper et al. 2019), while students who are less confident and might benefit from the intervention the most practically do not have access to it (Salgueira et al. 2012). Another obstacle is the time commitment for extracurricular activities, which might make research practically inaccessible to non-traditional students such as commuting students or students who have full-time or part-time jobs. Moreover, it has been shown that a continuous intervention is required to achieve integration of underrepresented minorities in a STEM career (?; ?; ?), and therefore a time-limited research experience might not achieve the full impact of undergraduate research experience.

Here I describe a model of authentic undergraduate research experience in data science that is part of a regular first-year course. The students are exposed to the research as part of the course, and therefore all students taking the course participate in the research activities.
Because the research is part of the curriculum, no selection or self-selection is applied, and no extra-curricular time commitment is needed from the students. As a first-year, first-semester introductory course, it exposes the students to research as early as possible, rather than in the junior or senior years when retention becomes less critical.
Institutional context
Kansas State University is a public research-oriented (Carnegie classification R1) land grant university. Its undergraduate computer science program enrolls ∼550 students. While undergraduate research opportunities do exist, most of the research is performed by graduate students. Class sizes of required courses typically range from 40 students to ∼
Educational goals
As described above, the purpose of the course is to introduce students to basic terms and topics in computer science, the history of computing, and also basic programming in Python. It briefly covers topics such as file systems, databases, boolean algebra, computer organization, programming languages, theory of computation, algorithms, information theory, and operating systems.

The purpose of the research experience is to introduce students to research methodology, which is another skill they can gain in addition to the technical and programming skills covered in the course. By the end of the course, the students are expected to be able to define a research problem, design and test solutions, perform basic statistical inference, critically evaluate their results, and describe their work in a research paper.

As mentioned above, research experience has substantial impact on the engagement and retention of students, especially students from underrepresented minorities (Barlow and Villarejo 2004; Tsui 2007; Collins et al. 2017). Therefore, exposing students to research as early as their first semester is expected to have the highest impact on retention, as retention is critical in the first year. Exposing students to research in their first semester also allows them to join a research lab or pursue other research opportunities as early as possible, maximizing their exposure to research experience.

However, research lab positions are often selective, and faculty mentors often prefer to recruit students who are better prepared. Additionally, many first-year students might not see themselves as competitive for earning a research position. The extra-curricular time commitment might also intimidate some of the students. Therefore, applying for a research position might not seem a high-priority option for first-semester students. Also, by joining a research lab students normally work with a mentor on the mentor's research program rather than developing their own research. In that case, the students are not able to fully express their interests and identity through their research.
Research experience design
The research experience is done entirely as part of the course, and is designed in two phases. In the first phase all students in the course work on the same research, and follow the same research protocol with the same data. While each group of students works independently, applying the same protocols to the same data naturally leads to the same results for all students. Although all students make the same discoveries, these discoveries are new and relevant knowledge: they are known neither to the instructor nor to the students, and do not appear in any textbook or other existing literature. During that phase the students are introduced to the data, basic research practices, the research tools, and methods of statistical inference. That phase prepares the students for the second phase of the research experience, in which the students choose their own research problem and use the same data and same tools to make discoveries.

Because these are first-year students, no assumptions can be made about their level of preparation, which introduces a challenge to performing authentic research activities leading to meaningful discoveries. Another challenge is that all activities need to take place as part of a course, and need to scale to a large number of students. To address these challenges, data science is used to turn existing databases into new knowledge by using existing computational tools. That can be done without strong programming skills or other previous knowledge in computer science, and is therefore suitable for freshman-level research.

The research project is 45% of the grade. The assignments of Phase 1 are 30% of the grade, and the paper and final presentation make up another 15% of the grade. The research activities consume a total of six meetings during the semester.
Data
While any type of data can be used, text data is selected for the research experience. Text data is normally small compared to other types of data such as image or audio. It is also often easier to use, as no strong computing facilities are typically needed to process text data, as opposed to image data where GPUs or strong processors are required to analyze images. Text can be pre-processed when needed by simple string analysis, while image or audio files are more difficult to open and read, and are therefore less suitable for research performed by first-semester students.

Many publicly accessible text databases can be used for the research experience. To further engage the students in the research, data that the students are familiar with from their personal lives should be preferred. Examples include popular music lyrics (Napier and Shamir 2018), and different data can be used in each semester to make different discoveries rather than repeat the exact results of previous semesters.

In this course, the dataset that was selected was a corpus of several thousand congressional speeches (Thomas, Pang, and Lee 2006). Each speech is labeled with the party of the speaker and their vote (for or against the discussed bill). The annotation of the data into Democratic and Republican speeches allows the identification of possible differences between Republicans and Democrats using data-driven discovery tools. That was done through the steps described in the next section. Before the students start their research, a short discussion about the type of research takes place in class. The discussion provides a summary of the skills that the students will gain from the research experience, such as analytical thinking, critical thinking, and the ability to make connections between different pieces of information. The purpose of the discussion is to justify the time the students spend on research, and to explain the motivation for research experience at an early stage.
Due to the specific topic of the project, the students are also asked not to confuse the research findings with political views or statements that can lead to conflict or division in the classroom. The students honored the request, and no inappropriate political or divisive comments were made during the research experience by any of the students.

While a certain corpus was used in the semester, many other text datasets available online (e.g., Project Gutenberg) could have been used. As will be described in the next section, the text analysis tool is comprehensive, and the protocol can be used to analyze other datasets of text. Also, the congressional speeches used in the study were made around 2005, and therefore the exact same protocol can be used to analyze speeches from different years.
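To illustrate how little code a simple text measurement requires, the following minimal Python sketch computes the mean word length of a text using basic string analysis. It is only an illustration under stated assumptions: the function name `mean_word_length` and the definition of a word as a run of ASCII letters are hypothetical choices made here, and the sketch does not reproduce how UDAT measures its text descriptors.

```python
import re


def mean_word_length(text: str) -> float:
    """Return the average number of letters per word in `text`.

    Assumption for this sketch: a "word" is any run of ASCII letters.
    Returns 0.0 when the text contains no words.
    """
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)


if __name__ == "__main__":
    # Word lengths are 3, 5, 5, and 3, so the mean is 4.0.
    print(mean_word_length("The quick brown fox"))
```

The same function can be applied to each speech file after reading it into a string, which is the kind of measurement that can later be compared between classes of speeches.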
Phase 1: Collaborative research
As described above, in the first phase of the research experience all students work on the same research project using the same protocol and same data, and therefore also get the same results. Because first-semester students are not experienced in research, having all students do the same research scales to a large number of students, and does not require the one-on-one attention that undergraduate research experience often requires. Using the same protocol and getting the same results also makes it much easier for the teaching assistants to grade the assignments. The teaching assistants just need to repeat the same experiments with the class, and compare the results of the students to their own results. The research has four steps, each summarized in a protocol and an assignment that the students need to submit.

Four meetings are dedicated to the first phase of the research during the semester. The first meeting takes place in the second week of the semester, in which the research goals are described, as well as the research data that will be used in the semester. During the semester, the students work in teams of two to three students. The research process of the first phase of the undergraduate research experience is summarized in Figure 1, which shows the different steps and the deliverable of each step.
Step 1: Organizing the data
The first step takes place in the second week of the semester. The students are requested to download and extract the dataset, and sort the dataset into two folders, such that one folder contains speeches made by Democratic legislators and the other contains speeches made by Republican legislators. The students also need to remove files that are shorter than 700 characters, in order to avoid speeches that are just short comments or welcome messages, and are too short to be informative for automatic analysis. The students need to submit the number of Republican and Democratic speeches they have left, and their submission is graded. Through this step, the students learn about the dataset while using some basic file system tools.

Because the data is relatively clean, this step is not as comprehensive as typical data wrangling steps in other cases of data-driven research. On the other hand, the students do not need to spend a substantial amount of time collecting and organizing data, and therefore can use more of their time on the analytics part of the research.

The step consumes one meeting, in which the students are introduced to the research topic and dataset, and start working on their research in class. The students follow the instructions they are provided through an MS-Word file, which they also submit after they fill in the number of speeches that they have for each political party.
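The data organization described in this step can also be automated. The following Python sketch is a minimal illustration, not the protocol the students follow in the course. The 700-character threshold comes from the text, but the function names `party_from_name` and `organize_speeches`, the folder names, and especially the file naming scheme (the party encoded as the first letter of the last underscore-separated token of the file name) are assumptions made here and should be adjusted to the actual dataset.

```python
import shutil
from pathlib import Path

MIN_CHARS = 700  # speeches shorter than this are discarded, as in Step 1


def party_from_name(path: Path):
    """Guess the party from the file name.

    Hypothetical naming scheme (an assumption of this sketch): the last
    '_'-separated token of the file stem starts with 'D' or 'R'.
    Returns None for any other file.
    """
    token = path.stem.split("_")[-1]
    return {"D": "democrat", "R": "republican"}.get(token[:1].upper())


def organize_speeches(source: Path, target: Path, party_of=party_from_name):
    """Copy sufficiently long speeches into one folder per party.

    Returns the number of speeches kept for each party.
    """
    counts = {"democrat": 0, "republican": 0}
    for folder in counts:
        (target / folder).mkdir(parents=True, exist_ok=True)
    for path in sorted(source.glob("*.txt")):
        party = party_of(path)
        if party is None:
            continue  # not labeled as one of the two parties
        text = path.read_text(encoding="utf-8", errors="ignore")
        if len(text) < MIN_CHARS:
            continue  # too short to be informative
        shutil.copy(path, target / party / path.name)
        counts[party] += 1
    return counts
```

Calling `organize_speeches(Path("raw"), Path("sorted"))` with hypothetical folder names returns the per-party counts, which correspond to the numbers the students report in their submission.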
Step 2: Classification and features
In the second step of the research, the students start to analyze the data. For that purpose, they use UDAT (Shamir 2017), which is an open source command-line tool designed to make discoveries in data. UDAT can be used without strong programming skills, and is freely available online, also in the form of Windows binaries (http://people.cs.ksu.edu/~lshamir/downloads/udat/). Unlike document classifiers that can just assign documents to classes, UDAT also has explainable AI aspects that provide information about the relationship between the different classes, and identify certain features that are similar or different between classes. UDAT measures text descriptors such as readability indices, sounds of words, use of punctuation characters, use of different parts of speech, re-use of words, sentiments, and more. More information about UDAT text analysis can be found in (Shamir, Diamond, and Wallin 2015; Alluqmani and Shamir 2018).

Figure 1: The four steps of the first phase of the research process. The phase starts in the second week of the semester and ends after the eighth week. During that part of the semester, all students use the same protocol, same tools, and same data, leading to the same results for all students. (Diagram: Week 2, data preparation, from text files to files organized in folders; Week 4, classification, delivering classification accuracy and confusion matrix; Week 7, statistical inference, delivering specific statistically significant differences; Week 8, CoreNLP, delivering classification accuracy and statistically significant differences with CoreNLP features.)

By following a detailed protocol, the students use supervised machine learning to classify between Democratic and Republican speeches, create the confusion matrix of the classification, create the similarity matrix, use bootstrapping, and change the number of training and test samples to learn how the classification accuracy changes with different sizes of data. Additionally, the students use feature selection to identify individual text measurements that provide discriminating information between Democratic and Republican speeches. UDAT can provide a list of the features that have the highest discriminating power according to their LDA scores, as well as the means of the features in each class.

The students need to follow the detailed protocol, and then submit the classification accuracy under different sizes of the training set.
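For readers less familiar with these quantities, the following Python sketch shows how a confusion matrix and the classification accuracy relate to each other. It is a generic illustration and not UDAT's implementation: the function names are hypothetical, and the labels in the usage example are made up for demonstration and are not results from the course.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows are true classes, columns are predicted classes."""
    pairs = Counter(zip(true_labels, predicted_labels))
    return [[pairs[(t, p)] for p in classes] for t in classes]


def accuracy(matrix):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


if __name__ == "__main__":
    # Made-up labels for five test speeches (D = Democratic, R = Republican).
    true = ["D", "D", "D", "R", "R"]
    predicted = ["D", "R", "D", "R", "D"]
    matrix = confusion_matrix(true, predicted, classes=["D", "R"])
    print(matrix)            # [[2, 1], [1, 1]]
    print(accuracy(matrix))  # 0.6
```

Repeating such a computation for several training set sizes gives the accuracy-versus-size values that the students report in this step.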
The students also need to provide the confusion matrix and similarity matrix that they generate using UDAT. Finally, the students need to identify the seven text measurements that have the highest discriminating power between Democratic and Republican speeches.
Step 3: Statistical significance
In Step 2, the experiment showed that UDAT was able to identify the party of the speaker in 61% of the cases. However, the ability to classify between Republican and Democratic speeches merely shows that there are differences between Democratic and Republican speeches, but does not identify what these differences are. The identification of discriminating features is therefore an important part of the discovery aspect of the research.

In the third step of the process the students need to perform a basic statistical inference to determine whether the text features that were identified in Step 2 show statistically significant differences between Democrats and Republicans. UDAT shows the means and standard errors of the means of the text features measured for each class. Using that information, the students can use a statistical calculator to compute the t-test of the difference between the mean of the feature values of the Democratic speeches and the mean measured for the same feature in the Republican speeches. Another topic covered is the correction of P values for multiple tests.

The mathematics of the Student t-test is not covered in class, as the students have not yet taken calculus and are not prepared to understand the statistics, but the concept of P values is discussed in the context of a discovery. The activity starts by describing its goals and discussing P values, and then the groups of students work in class to determine the P values of the different features. The outcome of the assignment that the students submit is 10 text features whose means differ significantly between Democratic and Republican speeches, and five text features whose differences are not statistically significant.
If fewer than 10 text features are found to be statistically significant, the students need to mention that in the assignment they submit.

Through that part of the research the students discovered that Democratic legislators tend to use longer words in their speeches compared to Republican legislators, and the difference is statistically significant. The mean length of a word in a Democratic speech was 4.72 ± ±
Step 4: Adding CoreNLP
In the final step of the first phase the students repeat Steps 2 and 3, but using CoreNLP (Manning et al. 2014). UDAT can work with CoreNLP to identify parts of speech, as well as sentiments expressed in the speeches. After Step 2 and Step 3, the students are more experienced, and can perform that part of the research independently.

Through that step the class discovered that Democrats use more nouns in their speeches, and that Republican speeches express more positive sentiments than Democratic speeches.
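The statistical inference used in Steps 3 and 4 can be sketched in a few lines of Python. This is a hedged illustration rather than the statistical calculator used in class: it computes the t statistic from the two means and their standard errors, and then, as a simplifying assumption of this sketch, approximates the two-sided P value with the normal distribution instead of the Student t distribution, which is reasonable only when each class contains many speeches. The Bonferroni correction is shown as one common way to correct for multiple tests; the text does not state which correction the course uses. All function names and the numbers in the usage example are hypothetical.

```python
import math


def two_sample_t(mean_a, sem_a, mean_b, sem_b):
    """t statistic for the difference of two means, given their standard errors."""
    return (mean_a - mean_b) / math.sqrt(sem_a ** 2 + sem_b ** 2)


def two_sided_p_normal(t):
    """Two-sided P value using the normal approximation (large samples only)."""
    return math.erfc(abs(t) / math.sqrt(2))


def bonferroni(p, n_tests):
    """Bonferroni-corrected P value for n_tests comparisons, capped at 1."""
    return min(1.0, p * n_tests)


if __name__ == "__main__":
    # Made-up feature means and standard errors for two classes.
    t = two_sample_t(10.0, 0.3, 9.0, 0.4)
    p = two_sided_p_normal(t)
    print(round(t, 3), round(p, 4))       # 2.0 0.0455
    print(round(bonferroni(p, 10), 3))    # 0.455
```

In this made-up example a difference that looks significant on its own (P ≈ 0.046) is no longer significant after correcting for 10 tested features, which is the reason the correction is discussed with the students.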
Phase 2: Individual research
In the second phase of the research, each team of students needs to define its own research project. That is done in the last four weeks of the semester, and provides students with the opportunity for ownership of the research, which is an important element of undergraduate research (Lopatto 2003). That part of the research is performed by the students through discussions in the classroom, and each team presents its ideas (each presentation is about 5-7 minutes) to the class, followed by a short discussion and comments from the instructor and the other students. In each week, the beginning of one meeting is dedicated to brief update reports from each team. The students were encouraged to use the same data, as well as the same data analysis tools, but were also given the option to use other data relevant to their project.

In Phase 2, the students have sufficient knowledge about the analysis tools, and can perform simple research tasks leading to basic discoveries from data. The primary outcome of this phase is a research paper of 2000-5000 words. The students also make a short presentation about their research. The requirement to submit an original research paper replaces the previous final paper requirement of the course. That is, instead of submitting a paper that summarizes a topic of their choice, the students submit a paper about their research. Unlike the assignments submitted by the students in Phase 1, the final research papers are graded by the instructor and not by the teaching assistants.

Several students chose to use public sources of congressional speeches and obtain much larger datasets, a task that involves substantial labor not required by the course. In particular, analyzing data over a period of more than 100 years of congressional speeches led to interesting insights about how legislators express themselves through speeches, and a student-authored paper based on this study was submitted for publication in a peer-reviewed journal.
Other students associated each speech file with the legislator, and identified differences between different legislators reflected by their speeches. Another example is a study by another team of students identifying differences between speeches of legislators who voted for the bill and legislators who voted against it.
Results
At the end of the course, all students submitted their final papers, and completed the course successfully. No student dropped the course, failed it, or avoided submitting the research outcomes. The student evaluation for the question “Increased desire to learn about the subject” was 4.9 (out of 5). As a first-semester course, the objective of the course is to increase student interest in computer science to engage and retain the students in the field, and therefore student interest in the field is critical to the outcomes of the course. Fairness of grading was rated at 4.3, showing that no major concern was expressed about grading, despite the fact that 45% of the grade was the research project. Two teams of students continued to work during the winter break of 2019-2020, which led to the completion of a research paper that was submitted to a peer-reviewed journal, and another paper is in preparation.

The impact of the intervention was also tested by using pre and post student surveys. The survey includes 15 questions adapted from the Lopatto CURE survey (Denofrio et al. 2007), focusing on experience and self-efficacy, and measured by a forced-choice 1-5 Likert scale. The questions are the following:
1. “Even if I forget the facts, I'll still be able to use the thinking skills I learn in science”
2. “You can rely on scientific results to be true and correct”
3. “The process of writing in science is helpful for understanding scientific ideas”
4. “Students who do not major/concentrate in science should not have to take science courses.”
5. “I wish science instructors would just tell us what we need to know so we can learn it.”
6. “Creativity does not play a role in science.”
7. “Science is not connected to non-science fields such as history, literature, economics, or art.”
8. “I get personal satisfaction when I solve a scientific problem by figuring it out myself.”
9. “I can do well in science courses.”
10. “Scientists know what the results of their experiments will be before they start.”
11. “If an experiment shows that something doesn't work, the experiment was a failure.”
12. “I prefer hands-on activities in the course.”
13. “Sometimes in my classes I noticed unfair treatment related to race/ethnicity.”
14. “I prefer open-ended projects over textbook assignments.”
15. “The textbook is an important part of the course.”

Figure 2 shows the average score of the answers of the students to each of the 15 questions. Interestingly, the question that showed the highest change between the pre and post surveys was “Sometimes in my classes I noticed unfair treatment related to race/ethnicity”. The average answer to that question dropped from 1.94 ± . in the pre survey to 1.15 ± . (P < ± ± <
Figure 2: Average score for each of the 15 questions. (Bar chart: average score for questions Q1 through Q15, pre and post surveys.)
“I can do well in science courses” also showed an increase from 4.01 ± ±
Conclusion
Mentoring undergraduate students for research is a proven high-impact practice. However, it is difficult to scale one-on-one mentoring to a large number of undergraduate students given the typical student-faculty ratio and availability of funding. It is further limited by self-selection, as many non-traditional students do not see themselves as researchers, and might therefore not apply for research opportunities that involve mentoring.

By using course-based undergraduate research experience, all students are exposed to research, and perform research as part of the regular course that they take. That kind of research experience does not involve extra-curricular time commitment, and all students are exposed to it without a process of application or selection. It can also scale to a much larger number of students compared to the number of students that can be mentored by a single faculty member using the traditional one-on-one mentoring model.

The model proposed here uses data science foundations to make discoveries in data, and it is implemented as part of a first-semester computer science course. The intervention is divided into two phases, such that first the entire class does the same research project, and then students work in teams on their own research ideas as part of the course.

Although the research is performed with first-semester students, it leads to authentic discoveries that were not known before. Choosing a topic that the students understand, such as political speeches, allows the students to better understand the research goals and discoveries, and also helps the students to express their own interests and identity as they work on their individual research. Other topics that can connect data science and culture are sports, music, and art.
These topics can also be linked to data science (Strange and Shamir 2014; Yaldo and Shamir 2017; Soares and Shamir 2016; George and Shamir 2014; George and Shamir 2015; Shamir and Tarakhovsky 2012; Burcoff and Shamir 2017), allowing students to experience data science research while expressing their culture and identity through the research. Future work will aim at expanding the research topics outside the scope of data science or artificial intelligence, as well as increasing the class size. The nature of the research activities allows scaling the research to introductory courses with larger enrollment, and therefore potentially provides a solution to the inclusion of students in undergraduate research.
Acknowledgments
This work is supported in part by NSF grant AST-1903823.
References
[Alluqmani and Shamir 2018] Alluqmani, A., and Shamir, L. 2018. Writing styles in different scientific disciplines: a data science approach.
Scientometrics
CBE-Life Sciences Education
Journal of Research inScience Teaching
Understanding and Reducing College Student Departure: ASHE-ERIC Higher Education Report, volume 30. John Wiley & Sons.
[Brownell et al. 2015] Brownell, S. E.; Hekmat-Scafe, D. S.; Singla, V.; Chandler Seawell, P.; Conklin Imam, J. F.; Eddy, S. L.; Stearns, T.; and Cyert, M. S. 2015. A high-enrollment course-based undergraduate research experience improves student conceptions of scientific thinking and ability to interpret data.
CBE-Life Sciences Education
International Journal of Art, Culture and Design Technologies (IJACDT)
Journal of Social Issues
Journal of college student development
PloS One
Science
Pattern Recognition Letters
Artificial Intelligence Research
The Review of Higher Education
Science Education
American Scientist
The Journal of Higher Education
Journal of Natural Resources & Life Sciences Education
Computer
Science
Council on Undergraduate Research Quarterly
CBE-Life Sciences Education
Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 55–60.
[Napier and Shamir 2018] Napier, K., and Shamir, L. 2018. Quantitative sentiment analysis of lyrics in popular music.
Journal of Popular Music Studies
Science
BMC Medical Education
Scholarship and Practice of Undergraduate Research
Science Education
ACM Journal on Computing and Cultural Heritage
IEEE Transactions on Human-Machine Systems
Astrophysics Source Code Library
American Journal of Sports Science
International Journal of Computer Science in Sport
Proceedings of Empirical Methods in Natural Language Processing, 327–335.
[Tsui 2007] Tsui, L. 2007. Effective strategies to increase diversity in STEM fields: A review of the research literature.
The Journal of Negro Education
CBE-Life Sciences Education