To get good student ratings should you only teach programming courses? Investigation and implications of student evaluations of teaching in a software engineering context
Antti Knutas
LUT University
Lappeenranta, Finland
antti.knutas@lut.fi
Timo Hynninen
South-Eastern Finland University of Applied Sciences
Mikkeli, Finland
timo.hynninen@xamk.fi
Maija Hujala
LUT University
Lappeenranta, Finland
maija.hujala@lut.fi
Abstract—Student evaluations of teaching (SET) are commonly used in universities for assessing teaching quality. However, previous literature shows that software engineering students tend to rate certain topics higher than others: in particular, students tend to value programming and software construction over software design, software engineering models and methods, or soft skills. We hypothesize that these biases also play a role in SET responses collected from students. The objective of this study is to investigate how the topic of a software engineering course affects SET metrics. We accomplish this by performing multilevel regression analysis on SET data collected in a software engineering programme. We analyzed a total of 1295 student evaluations from 46 university courses at a Finnish university. The results of the analysis verify that student course evaluations exhibit biases similar to those identified in previous software engineering education research. The type of the course can predict a higher SET rating: in our dataset, software construction and programming courses received higher SET ratings compared to courses on software engineering processes, models, and methods.
Index Terms—quality of teaching, student evaluation of teaching, software engineering education, multilevel modelling
I. INTRODUCTION
There are established recommendations of what should be included in a software engineering curriculum [1], and professionals have established a rough, evolving consensus of what is included in the field of software engineering [2]. However, while the software engineering education community and practitioners might agree on the content of the curriculum, students completing these programs may not share this point of view. In fact, studies have shown that in software engineering and computing related fields students emphasize the importance of programming, especially at the start of their studies, and might devalue other parts of degree programs [3]–[6].

Student perceptions of the usefulness of course topics are important, not only for the student's professional growth and understanding of the field, but also for meaningful dialogue about the content of the study program. One of the most common ways universities engage in this dialogue is through their quality control processes, which often use student evaluations of teaching (SET) as the main source of data. It is common for universities to use student evaluations of teaching as an indicator of quality for both the teaching material and the teachers themselves [7], [8]. Sometimes universities connect teaching faculty performance evaluation directly to student evaluation of teaching [8]. The evaluations can affect the career prospects of teaching personnel because the administration can use the data collected from SET questionnaires for decisions such as tenure, promotion and merit pay [9]. But if students value certain courses and topics over others, is there a built-in bias in the SET metrics?

In addition to being treated as an objective data source by administrative departments, SET is also used by teachers to reflect on their teaching practices [10]. For both reasons, it is an important field of study. However, while SET has been researched in general, it has received little attention in the field of software engineering education (SEE). In this paper, we address the research gap by exploring the statistical connection between a software engineering course type and the SET. We accomplish our research goal by performing multilevel regression analysis on 1295 course evaluations from 46 software engineering courses, gathered between autumn 2017 and spring 2020 at a Finnish university. Our main research question is as follows:
How does the type of course affect student evaluation of teaching in software engineering courses?
The rest of this paper is structured as follows. Section II presents the related work on student evaluation of teaching and discusses the state of the art of SET in software engineering education; Section II also establishes the research gap. Section III presents the research approach, detailing the hypotheses, data collection, and data analysis methods. The main results of the study are presented in Section IV, while Section V discusses these results. Finally, Section VI concludes the paper.

II. BACKGROUND
A. Student evaluation of teaching (SET)
SET is a commonly used measure of teaching quality in higher education [11]–[13]. In fact, according to many articles, SET is the most common method to evaluate faculty's teaching performance in higher education institutions [13]–[18]. It is surprising that SET is the only widely used method for assessing teaching quality, as there exist many other methods of assessing teacher and teaching quality besides student evaluations, including peer rating, self-evaluation, student interviews, learning outcome measures and teaching portfolios [19].

SET is also a controversial measure of teaching quality, as student ratings of teaching and student learning are not related [12]. Previous research has shown that SET is a multidimensional concept [11], [20]–[24], whose validity for formative or summative purposes remains questioned [9], [17]. There is ample evidence that various student, teacher and course characteristics play a role in SET [9], [11], [25], [26]: for example, on average female students provide higher SET ratings than males [27]. Some evidence also indicates that older students appear to provide higher SET ratings [17]. Teachers' charisma appears to be strongly associated with perceived teaching ability [7], and physically attractive teachers are likely to receive higher SET ratings [28]–[30].

As for how teaching methods affect SET ratings, previous studies' results are somewhat ambiguous. Some evidence indicates that students rate online courses lower than face-to-face courses [31], whereas results from Carle [32] indicate no differences between instruction methods except for teachers with racial minority status. A characteristic often found to be important is course rigor, which Clayson [14] states is associated negatively with SET ratings in general. Rigor has been measured, e.g., through students' perceptions of course difficulty [33]–[36], course workload [22], [33], [34], and course pace [33], [34].
B. SET in computer science and software engineering education
Existing work on SET in computer science education (CSE) or SEE is scarce [37], and to our knowledge no notable research efforts in the cross-section of SET and computing education have been made in the past decade. There are some recent works that deal with SET in software engineering and computing education. Kavalchuk et al. [38] analyzed data from RateMyProfessor.com to distinguish the qualities of popular CS and SWE instructors. In a similar vein, Carbone and Ceddia mined student evaluations for improvement areas in the ICT field [39].

The concept of SET has been used implicitly in many computing education papers: in these works, teaching tools, pedagogical interventions, or curricular implementations have been validated using student feedback data. Often student evaluations are used by researchers in the CSE/SEE communities to validate the design of courses. For example, among the many recent software engineering papers, the study by Ralph [40] evaluates the implementation of a course in software project management, and Angeli et al. [41] a graduate course in web service design.

The low number of studies related to the use of SET in SEE is an essential research gap, since previous studies have shown that software engineering and computer science students value different areas of their fields differently. Research on student misconceptions shows that students emphasize hands-on programming over other subfields, such as design or engineering processes [3]. Furthermore, existing research presents evidence that students consider particular skills more central to software engineering or computing. For example, Ivins et al. [3] found that writing computer programs was emphasized as a skill compared to requirements engineering or design. Similarly, Gold-Veerkamp [6] found that implementation is considered strongly part of software engineering, whereas some parts of design, requirements engineering, and quality assurance were not. Hewner [4] had similar outcomes in a related field, computer science, where the role of programming was emphasized over computer science theory.

III. METHODS
A. Research approach and hypotheses
In this study we examine whether student evaluations of software engineering courses vary between different course types. More specifically, we address a part of the research gap presented in Section II-B by investigating whether the student evaluations of teaching in software engineering courses reflect the fact that students tend to value certain course topics over others.

We base our course type categorization on the Guide to the Software Engineering Body of Knowledge (SWEBOK) [2]. SWEBOK was selected because the software engineering curriculum at the studied university follows the ACM/IEEE 2014 joint task force guidelines [1], which in turn have been based on empirical research and existing knowledge bases, such as SWEBOK [2]. Furthermore, several other analyses in the field apply SWEBOK [42]. Our hypotheses are as follows:
Hypothesis 1. The type of course (based on SWEBOK categorization) affects student evaluation of teaching in software engineering courses.
Hypothesis 2. Courses related to software construction and programming receive higher SET ratings than courses related to other knowledge areas.

B. Data
We test our hypotheses using student feedback data from the feedback surveys carried out at a Finnish university between the academic years 2017-2018 and 2019-2020. The data was collected through two slightly different student feedback questionnaires: one for the academic year 2017-2018 and the other for 2018-2019 and 2019-2020. (The survey questions are available at https://doi.org/10.5281/zenodo.4519256.)

The first questionnaire (2017-2018) comprised five Likert-scale questions assessing students' motivation, effort put into learning, workload, and teaching methods and course implementation in relation to perceived learning. Five open-ended questions were included as well. The second questionnaire (2018-2019 and 2019-2020) comprised four Likert-scale questions assessing students' motivation, workload, and teaching methods and the course as a whole in relation to perceived learning. In addition, four open-ended questions were included in the questionnaire.

The survey questionnaires were sent to students via email after they completed the courses. The surveys were mostly sent to all students enrolled in the courses, but teachers could collect attendance and limit the feedback surveys to only those students who attended classes. Responding was anonymous and voluntary for all students.

The sample is restricted to student feedback from software engineering courses with ten or more student feedback questionnaires filled out. The sample includes 415 responses from 16 courses in 2017-2018, 395 responses from 15 courses in 2018-2019, and 485 responses from 15 courses in 2019-2020. As four of the Likert-scale questions were identical, or very similar, between the two student feedback questionnaires used, we combined the responses from all three academic years into one data set. The combined data set consists of student feedback collected from 22 courses taught one to three times over the three academic years studied. The total number of course implementations is 46 and the total number of student feedback questionnaires filled out is 1295.
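The pooling and filtering steps described above are straightforward to reproduce in code. The Python sketch below is purely illustrative: the file names and column names (e.g. implementation_id) are assumptions, since the underlying survey exports are not public.

```python
# Illustrative only: file and column names are assumptions, not the authors' actual data layout.
import pandas as pd

# Load the per-year survey exports and pool them into one response-level data set.
years = ("2017-2018", "2018-2019", "2019-2020")
responses = pd.concat(
    [pd.read_csv(f"feedback_{year}.csv") for year in years],
    ignore_index=True,
)

# Keep only course implementations with ten or more filled-out questionnaires,
# mirroring the sample restriction described above.
counts = responses.groupby("implementation_id")["implementation_id"].transform("size")
responses = responses[counts >= 10]

print(len(responses), "responses from",
      responses["implementation_id"].nunique(), "course implementations")
```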
C. Measures

1) Dependent variable: We carried out an exploratory factor analysis of the four SET items of the combined data set. Factors were extracted using principal factor analysis with promax rotation. A scree plot of eigenvalues was used to determine the optimal number of factors. One factor was identified. Two items reflecting students' perceptions of the course and its teaching methods in relation to perceived learning had high loadings on this factor. The factor was labeled learning experience and is measured on a scale from 1 (the worst) to 5 (the best). This is our dependent variable in Hypotheses 1 and 2.

Two items - 'My motivation in this course was (1 = very low; 5 = very high)' and 'The workload relative to the study credits awarded was (1 = very light; 5 = very heavy)' - did not load on the factor and were used as single-item measures of student's motivation and perceived workload. These items serve as control variables in the analysis because previous literature has found evidence that students' motivation and perceived workload play a role in SET. According to, for example, Griffin [43] and Wachtel [26], students' pre-course motivation or prior subject interest is positively associated with SET: interested students appear to give higher ratings. In turn, a "just right" level of workload leads to better SET ratings [22], [33].
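As a rough illustration of the scree-plot step, the Python sketch below computes the eigenvalues of the item correlation matrix for four Likert items. It does not reproduce the full principal-factor extraction with promax rotation reported above, and the synthetic data and column names are ours, not the study's.

```python
# Minimal sketch: eigenvalues of the item correlation matrix for a scree check.
# Synthetic data stands in for the real four SET items, which are not public.
import numpy as np
import pandas as pd

def scree_eigenvalues(items: pd.DataFrame) -> np.ndarray:
    """Return the eigenvalues of the item correlation matrix, largest first."""
    corr = items.corr().to_numpy()
    return np.linalg.eigvalsh(corr)[::-1]  # eigvalsh returns ascending order

rng = np.random.default_rng(0)
items = pd.DataFrame(rng.integers(1, 6, size=(200, 4)), columns=["q1", "q2", "q3", "q4"])
print(scree_eigenvalues(items))
# A single dominant eigenvalue (e.g. only one value above 1.0, the Kaiser criterion)
# would point to a one-factor solution, as found in the actual analysis.
```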
2) Course type: We classified the 22 courses into three categories according to the course content. The categories are based on SWEBOK knowledge areas (see Table I) and labeled as A) Software construction and programming, B) Software engineering process, models and methods, and C) Professional practices for software engineering. These categories are referred to here as course types. Course type is used as an independent variable in the analysis.
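For illustration, the categorization in Table I can be expressed as a simple lookup table. The Python dictionary below is a hypothetical encoding of that mapping; the actual assignment of courses to categories was done manually from course contents.

```python
# Hypothetical encoding of the Table I mapping from SWEBOK knowledge area to course type.
COURSE_TYPE_BY_KNOWLEDGE_AREA = {
    "Software construction": "A",             # A: SW construction and programming
    "Software testing": "A",
    "Software maintenance": "A",
    "Software design": "B",                    # B: SW engineering process, models and methods
    "Software engineering models and methods": "B",
    "Software requirements": "B",
    "Software engineering process": "B",
    "Software quality": "B",
    "Software engineering professional practice": "C",  # C: professional practices for SW engineering
    "Software engineering economics": "C",
    "Software engineering management": "C",
}
```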
D. Analysis methods
We use multilevel regression analysis [44] to address the research questions. The main reason for employing multilevel analysis is that the observations of the SET data presumably are not independent. Student evaluations are nested within course implementations, and course implementations are nested within courses (see Figure 1). In other words, SETs from the same course implementation presumably share more similarities with each other than with SETs from other course implementations. Ignoring data clustering may lead to underestimated standard errors of regression coefficients and, thus, overly small p-values. Multilevel analysis takes this clustered structure of the data into account. In addition, it allows us to examine the relationships between variables at different levels of the data (course type and student's learning experience).

Multilevel models allow for residual components at all levels: at the course level (level 3 in Figure 1), the course implementation level (level 2 in Figure 1) and the student level (level 1 in Figure 1). However, in preliminary analyses we found that the amount of level-3 variation is very small: only 0.86 percent of the variance in the learning experience was situated at level 3 (course level). The rule of thumb is that if 5 percent or more of the variance is attributable to a level, it should not be ignored [45]. Thus, we chose to ignore the third level and fit two-level (student level and course implementation level) models instead.

Hypotheses 1 and 2 are jointly tested by fitting the following random coefficient model with student's learning experience (LE) as the outcome variable and student's motivation (MO), perceived workload (WL) and course type (COURSETYPE) as predictors:

LE_{ij} = \beta_0 + \beta_1 MO_{ij} + \beta_2 WL_{ij} + \beta_3 WL_{ij}^2 + \beta_4 COURSETYPE_j + u_{0j} + u_{1j} WL_{ij} + u_{2j} WL_{ij}^2 + e_{ij}   (1)

The relationship between learning experience and perceived workload is assumed to be curvilinear, as suggested, for example, by Centra [33]. In addition, the impact of perceived workload on the learning experience is assumed to differ between courses. We used Stata/SE 16.1 software for all analyses.
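The analysis itself was carried out in Stata, but as a hedged sketch, model (1) can also be expressed with Python's statsmodels as below. The data frame and column names (learning_experience, motivation, workload, course_type, implementation_id) are assumptions, and estimation details may differ from the Stata run.

```python
# Sketch of the two-level random coefficient model in Eq. (1): students (level 1)
# nested in course implementations (level 2), with a random intercept and random
# linear and quadratic workload slopes. Not the authors' actual Stata code.
import statsmodels.formula.api as smf

def fit_random_coefficient_model(df):
    model = smf.mixedlm(
        "learning_experience ~ motivation + workload + I(workload**2) + C(course_type)",
        data=df,
        groups=df["implementation_id"],
        re_formula="~workload + I(workload**2)",
    )
    return model.fit(reml=True)

# Usage (assuming `responses` from the earlier sketch, with the columns above):
# result = fit_random_coefficient_model(responses)
# print(result.summary())
```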
TABLE I
COURSE TYPES AND CORRESPONDING SWEBOK KNOWLEDGE AREAS

Course type                                   | SWEBOK knowledge areas
A. SW construction and programming            | SW construction; SW testing; SW maintenance
B. SW engineering process, models and methods | SW design; SW engineering models and methods; SW requirements; SW engineering process; SW quality
C. Professional practices for SW engineering  | SW engineering professional practice; SW engineering economics; SW engineering management

Figure 1. Illustration of the data structure.
IV. FINDINGS
Descriptive statistics of the variables are presented in Tables II and III. As shown in Table II, the mean learning experience was 3.46 (SD = 1.13), mean motivation 3.67 (SD = 1.06), and mean perceived workload 3.61 (SD = 0.92), each measured on a scale from 1 to 5. Table III shows that most course implementations (58.70 percent) belong to course type A, software construction and programming, which also has the highest mean learning experience (3.63).
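The descriptive statistics in Tables II and III can be reproduced from the response-level data with a few lines of pandas; the sketch below again assumes the hypothetical column names used earlier rather than the authors' actual data.

```python
# Assumed column names; mirrors the layout of Tables II and III rather than reproducing them exactly.
import pandas as pd

def describe_set_data(responses: pd.DataFrame) -> None:
    # Student-level variables (cf. Table II).
    student_vars = responses[["learning_experience", "motivation", "workload"]]
    print(student_vars.agg(["mean", "std", "min", "max", "count"]).round(2))
    # Mean learning experience by course type (cf. Table III).
    print(responses.groupby("course_type")["learning_experience"].mean().round(2))
```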
TABLE II
DESCRIPTIVE STATISTICS OF THE STUDENT-LEVEL VARIABLES

Variable            | Mean | SD   | Min | Max | n
Learning experience | 3.46 | 1.13 | 1   | 5   | 1282
Motivation          | 3.67 | 1.06 | 1   | 5   | 1287
Perceived workload  | 3.61 | 0.92 | 1   | 5   | 1280
TABLE III
DESCRIPTIVE STATISTICS OF THE COURSE TYPES

Course type                                   | No. of course implementations | Percent | Mean learning experience
A. SW construction and programming            | 27 | 58.70 | 3.63
B. SW engineering process, models and methods | 10 | 21.74 | 3.08
C. Professional practices for SW engineering  |  9 | 19.57 | 3.27
TABLE IV
ESTIMATED PARAMETERS OF THE TWO-LEVEL RANDOM COEFFICIENT MODEL PREDICTING STUDENT'S PERCEPTIONS OF LEARNING

                        Learning experience
Fixed effects
  Intercept
  Motivation
  Workload              -0.062 (0.040)
  Workload²             -0.121*** (0.028)
  Course type (ref. A)
    B                   -0.353* (0.149)
    C                   -0.165 (0.147)
Random effects
  var(workload)
  var(workload²)
  var(u)
  var(e)

Standard errors in parentheses. * p < 0.05, ** p < 0.01, *** p < 0.001.

Table IV presents the estimation results for the random coefficient model predicting student's learning experience. As shown, the coefficient for course type B is negative and statistically significant (b = -0.353, p < 0.05), indicating that courses on software engineering processes, models and methods received lower learning-experience ratings than the reference category, software construction and programming courses. The coefficient for course type C is negative but not statistically significant. The coefficient for the squared workload term is also negative and statistically significant (b = -0.121, p < 0.001), consistent with the assumed curvilinear relationship between perceived workload and learning experience.

V. DISCUSSION
The objective of this study was to investigate the effect of the type (or topic) of a course on the student evaluations of teaching in a software engineering programme. We accomplished this goal by performing a multilevel modeling analysis on a set of 1295 evaluations collected from SE students across a total of 46 course implementations. Previous research maintains that students value certain course topics, namely programming and software construction, over others. At the same time, universities often use SETs as their primary (or sole) metric for teaching quality. Therefore, establishing how the type (or topic) of a course affects the SETs is an important research topic.

In summary, our results suggest that the course type plays an essential role in SET responses for software engineering courses. It seems that students give higher ratings to software construction and programming related courses compared to some other knowledge areas when evaluating their learning experience. This result is in line with the works on student perceptions and expectations of the different SE topics.

We used the SWEBOK knowledge areas as the basis of our course categorization. Regarding our hypothesis, however, there was no statistically significant difference between courses on the professional practices for software engineering (category C) and programming courses. This finding was somewhat surprising, since based on the established literature we expected that programming courses always receive more positive ratings. This suggests that while the course type does have an effect on the SET ratings, there must also be other contributing factors that have a significant effect on the SETs. During preliminary analyses we tested the effect of two other level-2 factors, teaching language and course size, on learning experience. However, these variables were excluded from the final model due to collinearity with course type.

In the following sections, we first discuss the implications of our findings and then address threats to validity.
A. Implications
Our findings have implications both for research on student perceptions of software engineering education and for software engineering education practice.

First, our findings connect the field of student evaluation of teaching to the existing line of research in software engineering education that has examined bias in what students consider to be essential courses [3]–[6].

Second, the findings matter for the practice of software engineering education, because many universities use quantitative results from student surveys in their quality control and lecturer job performance processes. Our findings indicate that, within our geographically limited dataset, students have a bias towards programming courses and systematically give higher "learning experience" ratings to practical programming courses. This bias cannot be attributed to the lecturer as a predictor, since examining the course descriptions showed that lecturers taught multiple types of courses and sometimes switched courses over the duration of the dataset. Based on this, we publish a series of recommendations for practitioners who use SET in software engineering education:

• Consider the effect of student bias when evaluating lecturers. When using quantitative student evaluation of teaching to evaluate teachers, bias [27], [46] should be corrected for, and the current critique of SET methods should be considered before use.

• Evaluate the course evaluation instruments, and consider utilizing qualitative metrics in addition to numeric data. As established in SET research, SET metrics are seldom objective. Other, perhaps more qualitative metrics to evaluate teaching quality should also be included in a holistic quality process.

• Give students a comprehensive vision of software engineering work. As part of introductory courses, software engineering students should be better introduced to the entire field, and any misconceptions should be addressed.

• Connect software engineering theory to practice. Fairly or not, students in the studied organization currently indicate that their "learning experience" was lower in theory-based courses. This might be because, in the studied organisation, most courses concentrate on a single topic. Can best practices from the software engineering education research community, such as problem-based learning [47], be applied to connect theory with practice better?

The main limitation of this study is that the dataset is geographically limited to one organization, so the findings cannot be generalized quantitatively. However, Urquhart [48] synthesizes a line of thinking and presents the concept of theoretical generalization (also known as analytical generalization [49]), where several qualitative or theory-based contributions are related to each other. From this perspective, our findings have wider utility in supporting similar individual findings from Ivins et al. [3], Hewner [4], and Gold-Veerkamp [50]. What is still required from future research is confirming that the student bias, shown to exist, affects student evaluations of teachers in other organizations as well.

B. Threats to validity
In this section, we categorize and address threats to validity, following the recommendations that Wohlin et al. [49] have summarized from the seminal work by Campbell et al. [51] and Yin [52].
Conclusion validity:
The statistical analysis methods used are among the best practices of survey outcome analysis [49]. The analysis outcomes have sufficient statistical significance.
Construct validity:
The student feedback questionnaires used at the organization are based on accepted SET literature and constructs such as learning experience [14], motivation [24], [43] and perceived workload [33]. Additionally, the course type categorization is based on SWEBOK and ACM curriculum recommendations.
Internal validity:
The survey process itself is guided by the organization's quality assurance department and is independent of the course lecturers. Student motivation and workload were controlled factors. While lecturer demographics and teaching methods were not controlled for in the model, the department did not have a large number of lecturers at the time and many lecturers taught courses across the SWEBOK categories. Furthermore, lecturers swapped courses during the data collection period, increasing diversity.
External validity:
The findings have been related to other studies in the field and confirm their findings using the principles of theoretical and analytical generalization.
Reliability:
The data analysis process was cross-checked by a team of three researchers with experience in the field. The data has been collected and validated by an independent quality assurance department.
VI. CONCLUSION
To answer our research question, how does the type of a course affect student evaluation of teaching in software engineering courses: the type (or topic) of the course can predict a higher SET rating. Software construction and programming courses receive higher SET ratings compared to some other topics. However, programming courses do not always receive a better SET rating. SET is a complex, multidimensional concept, and its validity for evaluating teaching quality is debated in the education research literature.

This paper establishes, to our knowledge, the first explicit steps towards understanding the dimensions of SET in the software engineering education context. Our study extends the state of the art by synthesizing the established knowledge on how students tend to value some knowledge areas over others, and by showing evidence of this in practice by analysing SET data. The results should provide insights about the use of SET for both software engineering educators and faculty administrators. In an increasingly data-driven world, we call on the education community to acknowledge the limitations and biases that exist in the way teaching quality is most commonly measured.

In this paper, we follow Garcia-Martinez's 2010 call [37], which has mostly gone unanswered, for more research on student evaluation of teaching in fields related to computing. We extend the state of the art by connecting findings in SET to the previous research on student bias by Ivins et al., Hewner, and Gold-Veerkamp [3]–[6]. In this manner, we extend the scope of investigation from evaluating teacher characteristics [38] to the systematic evaluation of course characteristics.

The main limitation of this paper is the geographically limited scope of the dataset. While the study covers data from multiple years, collecting data from other organizations would support generalizing the findings. For future research, we recommend wider replication studies to investigate whether the phenomenon can be replicated in other software engineering programs and closely related fields.
REFERENCES

[1] M. Ardis, D. Budgen, G. W. Hislop, J. Offutt, M. Sebern, and W. Visser, "SE 2014: Curriculum guidelines for undergraduate degree programs in software engineering," Computer, no. 11, pp. 106–109, 2015.
[2] P. Bourque, R. E. Fairley et al., Guide to the Software Engineering Body of Knowledge (SWEBOK(R)): Version 3.0. IEEE Computer Society Press, 2014.
[3] J. Ivins, B. R. Von Konsky, D. Cooper, and M. Robey, "Software engineers and engineering: A survey of undergraduate preconceptions," in Proceedings, Frontiers in Education, 36th Annual Conference. IEEE, 2006, pp. 6–11.
[4] M. Hewner, "Undergraduate conceptions of the field of computer science," in Proceedings of the Ninth Annual International ACM Conference on International Computing Education Research - ICER '13. San Diego, California, USA: ACM Press, 2013, p. 107. [Online]. Available: http://dl.acm.org/citation.cfm?doid=2493394.2493414
[5] ——, "How CS undergraduates make course choices," in Proceedings of the Tenth Annual Conference on International Computing Education Research - ICER '14. Glasgow, Scotland, United Kingdom: ACM Press, 2014, pp. 115–122. [Online]. Available: http://dl.acm.org/citation.cfm?doid=2632320.2632345
[6] C. Gold-Veerkamp, "A software engineer's competencies: Undergraduate preconceptions in contrast to teaching intentions," in Proceedings of the 52nd Hawaii International Conference on System Sciences, 2019.
[7] M. Shevlin, P. Banyard, M. Davies, and M. Griffiths, "The validity of student evaluation of teaching in higher education: Love me, love my lectures?" Assessment & Evaluation in Higher Education, vol. 25, no. 4, pp. 397–405, 2000.
[8] F. Zabaleta, "The use and misuse of student evaluations of teaching," Teaching in Higher Education, vol. 12, no. 1, pp. 55–76, 2007.
[9] P. Spooren, B. Brockx, and D. Mortelmans, "On the validity of student evaluation of teaching: The state of the art," Review of Educational Research, vol. 83, no. 4, pp. 598–642, 2013.
[10] T. M. Winchester and M. Winchester, "Exploring the impact of faculty reflection on weekly student evaluations of teaching," International Journal for Academic Development, vol. 16, no. 2, pp. 119–131, 2011.
[11] H. W. Marsh, "Students' evaluations of university teaching: Research findings, methodological issues, and directions for future research," International Journal of Educational Research, vol. 11, no. 3, pp. 253–388, 1987.
[12] B. Uttl, C. A. White, and D. W. Gonzalez, "Meta-analysis of faculty's teaching effectiveness: Student evaluation of teaching ratings and student learning are not related," Studies in Educational Evaluation, vol. 54, pp. 22–42, 2017.
[13] S. L. Wallace, A. K. Lewis, and M. D. Allen, "The state of the literature on student evaluations of teaching and an exploratory analysis of written comments: Who benefits most?" College Teaching, vol. 67, no. 1, pp. 1–14, 2019.
[14] D. E. Clayson, "Student evaluations of teaching: Are they related to what students learn? A meta-analysis and review of the literature," Journal of Marketing Education, vol. 31, no. 1, pp. 16–30, 2009.
[15] A. Hoel and T. I. Dahl, "Why bother? Student motivation to participate in student evaluations of teaching," Assessment & Evaluation in Higher Education, vol. 44, no. 3, pp. 361–378, 2019.
[16] D. Kember, D. Y. Leung, and K. Kwan, "Does the use of student feedback questionnaires improve the overall quality of teaching?" Assessment & Evaluation in Higher Education, vol. 27, no. 5, pp. 411–425, 2002.
[17] P. Spooren, "On the credibility of the judge: A cross-classified multilevel analysis on students' evaluation of teaching," Studies in Educational Evaluation, vol. 36, no. 4, pp. 121–131, 2010.
[18] P. Spooren and F. Van Loon, "Who participates (not)? A non-response analysis on students' evaluations of teaching," Procedia - Social and Behavioral Sciences, vol. 69, pp. 990–996, 2012.
[19] R. A. Berk, "Survey of 12 strategies to measure teaching effectiveness," International Journal of Teaching and Learning in Higher Education, vol. 17, no. 1, pp. 48–62, 2005.
[20] H. W. Marsh, "The influence of student, course, and instructor characteristics in evaluations of university teaching," American Educational Research Journal, vol. 17, no. 2, pp. 219–237, 1980.
[21] ——, "Students' evaluations of university teaching: Dimensionality, reliability, validity, potential biases, and utility," Journal of Educational Psychology, vol. 76, no. 5, p. 707, 1984.
[22] ——, "Distinguishing between good (useful) and bad workloads on students' evaluations of teaching," American Educational Research Journal, vol. 38, no. 1, pp. 183–212, 2001.
[23] ——, "Students' evaluations of university teaching: Dimensionality, reliability, validity, potential biases and usefulness," in The Scholarship of Teaching and Learning in Higher Education: An Evidence-Based Perspective. Springer, 2007, pp. 319–383.
[24] H. W. Marsh, B. Muthén, T. Asparouhov, O. Lüdtke, A. Robitzsch, A. J. Morin, and U. Trautwein, "Exploratory structural equation modeling, integrating CFA and EFA: Application to students' evaluations of university teaching," Structural Equation Modeling: A Multidisciplinary Journal, vol. 16, no. 3, pp. 439–476, 2009.
[25] J. S. Pounder, "Is student evaluation of teaching worthwhile?" Quality Assurance in Education, 2007.
[26] H. K. Wachtel, "Student evaluation of college teaching effectiveness: A brief review," Assessment & Evaluation in Higher Education, vol. 23, no. 2, pp. 191–212, 1998.
[27] J. Kohn and L. Hatfield, "The role of gender in teaching effectiveness ratings of faculty," Academy of Educational Leadership Journal, vol. 10, no. 3, p. 121, 2006.
[28] R. A. Gurung and K. M. Vespia, "Looking good, teaching well? Linking liking, looks, and learning," Teaching of Psychology, vol. 34, no. 1, pp. 5–10, 2007.
[29] D. S. Hamermesh and A. Parker, "Beauty in the classroom: Instructors' pulchritude and putative pedagogical productivity," Economics of Education Review, vol. 24, no. 4, pp. 369–376, 2005.
[30] T. C. Riniolo, K. C. Johnson, T. R. Sherman, and J. A. Misso, "Hot or not: Do professors perceived as physically attractive receive higher student evaluations?" The Journal of General Psychology, vol. 133, no. 1, pp. 19–35, 2006.
[31] P. Lowenthal, C. Bauer, and K.-Z. Chen, "Student perceptions of online learning: An analysis of online course evaluations," American Journal of Distance Education, vol. 29, no. 2, pp. 85–97, 2015.
[32] A. C. Carle, "Evaluating college students' evaluations of a professor's teaching effectiveness across time and instruction mode (online vs. face-to-face) using a multilevel growth modeling approach," Computers & Education, vol. 53, no. 2, pp. 429–435, 2009.
[33] J. A. Centra, "Will teachers receive higher student evaluations by giving higher grades and less course work?" Research in Higher Education, vol. 44, no. 5, pp. 495–518, 2003.
[34] H. W. Marsh and L. A. Roche, "Effects of grading leniency and low workload on students' evaluations of teaching: Popular myth, bias, validity, or innocent bystanders?" Journal of Educational Psychology, vol. 92, no. 1, p. 202, 2000.
[35] R. Remedios and D. A. Lieberman, "I liked your course because you taught me well: The influence of grades, workload, expectations and goals on students' evaluations of teaching," British Educational Research Journal, vol. 34, no. 1, pp. 91–115, 2008.
[36] K.-F. Ting, "A multilevel perspective on student ratings of instruction: Lessons from the Chinese experience," Research in Higher Education, vol. 41, no. 5, pp. 637–661, 2000.
[37] S. Garcia-Martinez, "Evaluation of teaching effectiveness in computer science education and related fields: A brief review," in EdMedia + Innovate Learning. Association for the Advancement of Computing in Education (AACE), 2010, pp. 1948–1953.
[38] A. Kavalchuk, A. Goldenberg, and I. Hussain, "An empirical study of teaching qualities of popular computer science and software engineering instructors using RateMyProfessor.com data," in Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering: Software Engineering Education and Training, 2020, pp. 61–70.
[39] A. Carbone and J. Ceddia, "Common areas for improvement in ICT units that have critically low student satisfaction," in Proceedings of the Fourteenth Australasian Computing Education Conference - Volume 123, ser. ACE '12. Australian Computer Society, Inc., 2012, pp. 167–176.
[40] P. Ralph, "Re-imagining a course in software project management," in Proceedings of the 40th International Conference on Software Engineering: Software Engineering Education and Training, 2018, pp. 116–125.
[41] L. Angeli, J. J. J. Laconich, and M. Marchese, "A constructivist redesign of a graduate-level CS course to address content obsolescence and student motivation," in Proceedings of the 51st ACM Technical Symposium on Computer Science Education, 2020, pp. 1255–1261.
[42] V. Garousi, G. Giray, and E. Tuzun, "Understanding the knowledge gaps of software engineers: An empirical analysis based on SWEBOK," ACM Transactions on Computing Education, vol. 20, no. 1, pp. 1–33, Feb. 2020. [Online]. Available: https://dl.acm.org/doi/10.1145/3360497
[43] B. W. Griffin, "Grading leniency, grade discrepancy, and student ratings of instruction," Contemporary Educational Psychology, vol. 29, no. 4, pp. 410–425, 2004.
[44] S. W. Raudenbush and A. S. Bryk, Hierarchical Linear Models: Applications and Data Analysis Methods. Sage, 2002, vol. 1.
[45] M. Mehmetoglu and T. G. Jakobsen, Applied Statistics Using Stata: A Guide for the Social Sciences. Sage, 2016.
[46] D. Feistauer and T. Richter, "Validity of students' evaluations of teaching: Biasing effects of likability and prior subject interest," Studies in Educational Evaluation, vol. 59, pp. 168–178, 2018.
[47] S. Ouhbi and N. Pombo, "Software engineering education: Challenges and perspectives." IEEE, 2020, pp. 202–209.
[48] C. Urquhart, H. Lehmann, and M. D. Myers, "Putting the 'theory' back into grounded theory: Guidelines for grounded theory studies in information systems," Information Systems Journal, vol. 20, no. 4, pp. 357–381, 2010.
[49] C. Wohlin, P. Runeson, M. Höst, M. C. Ohlsson, B. Regnell, and A. Wesslén, Experimentation in Software Engineering. Springer Science & Business Media, 2012.
[50] C. Gold-Veerkamp, "Using grounded theory methodology to discover undergraduates' preconceptions of software engineering." IEEE, 2018, pp. 707–711.
[51] N. L. Gage and J. C. Stanley, Experimental and Quasi-Experimental Designs for Research. Chicago: R. McNally, 1963.
[52] R. K. Yin.