Detecting Objectifying Language in Online Professor Reviews
Angie Waller and Kyle Gorman
Graduate Center, City University of New York
Abstract
Student reviews often make reference to professors' physical appearances. Until recently, RateMyProfessors.com, the website of this study's focus, used a design feature to encourage a "hot or not" rating of college professors. In the wake of recent criticism, this feature was removed.
1 Introduction

Natural language processing techniques have long been used to study subjectivity and sentiment in media and product reviews. In this study, we employ these technologies to study objectifying language in reviews of professors, using archival data from RateMyProfessors.com (RMP). Detecting such language is difficult because it is somewhat rare, making up only a small proportion of reviews (Davison and Price, 2009), and because references to physical appearance show enormous linguistic variation (discussed in Section 2.2), making them difficult to detect accurately using simple text features.

This study provides insights into bias in professor reviews and its interaction with the design of the web user interface. We propose two models, a chunk tagger and a document classifier, which we combine into an ensemble to detect objectifying reviews at scale. This approach could be applied to many other domains where noisy user-generated reviews may contain harassment or exhibit harm.

We focus on the RMP website because it has been active for over twenty years, giving us ample data to study trends across time. The website has long been associated with students commenting on their professors' appearances (Lagorio, 2006) and has been the subject of many prior studies on bias in course reviews. Recent changes to the website interface allow us to consider how text reviews may have been influenced by its design feature for rating professor "hotness".

We look to previous work on bias in professor reviews, effects of interface design on internet discourse, and detecting subjectivity and opinions in online reviews.

1.1 Bias in professor reviews

Prior studies address bias among students' reviews of teachers. Freng and Webber (2009) find a positive correlation between "hotness" and quality scores of professors on RMP, accounting for 8% of variance. Chang and McKeown (2019) report gendered differences in students' descriptions of computer science professors on RMP; this is also reflected in visualizations by Schmidt (2015) showing that words like genius are more frequently attributed to male professors and words like nurturing to female professors. This is supported in work by Boring et al. (2016) and Boring (2017), where in-class reviews show higher ratings for leadership skills among male professors and for "being warm" among female professors. Noting that perceptions of "easiness" predict overall ratings, Davison and Price (2009) recommend an RMP interface change replacing the site's "easiness" rating with better-defined terms such as "amount learned".

1.2 Interface design and online discourse

The interaction between interface design and online discourse is a central focus in computer-mediated discourse analysis (Herring, 2004; Herring and Androutsopoulos, 2015) and critical technocultural discourse analysis (Brock, 2018). Both consider not only how people express themselves in online environments, but also how elements like the interface design of a website shape people into "users", affecting how they express themselves. Because we are interested in the relationship between attractiveness commentary and interface design, we constrain this study to the RMP website and its interface elements, including its professor rating form (Appendix, Figure 3), featuring the "hot or not" chili pepper rating.

Interface design is also considered quantitatively, and at scale, in company-led user experience studies.
For example, Facebook found that by curating users' News Feeds toward positive or negative posts, it influenced the emotional tenor of users' own posts (Kramer et al., 2014). NextDoor, a popular neighborhood classifieds website, made its web form for reporting suspicious activity more detailed, and inadvertently arduous, successfully decreasing suspicious-activity posts and therefore decreasing posts with racial profiling (Hempel, 2017). In an effort to combat online harassment, Twitter introduced interface elements that warn users before they post tweets with inflammatory language (Statt, 2020). These interventions suggest that small interface changes may produce measurable effects on online discourse.

1.3 Detecting subjectivity and opinions in online reviews

Like all genres of review, professor reviews interweave subjective and objective statements (e.g., "the class was poorly attended"). We consider commentary on a professor's physical attractiveness, which we refer to as objectifying or attractiveness commentary, to be subjective content.

Wiebe et al. (2001) discuss a method of labeling spans of subjective text within news corpora so that opinion phrases, even those that occur infrequently, can be detected using collocation clues. To determine review sentiment, Pang and Lee (2004) automatically segment movie reviews into subjective and objective portions, discarding the objective portions before attempting to determine overall sentiment. Here we are also interested primarily in a subjective portion of reviews; but whereas subjectivity is an expected feature of other review genres, comments about a professor's "hotness" may constitute workplace harassment (Flaherty, 2018), among other harms. To our knowledge, this is the first study to target objectifying commentary in professor reviews and its relationship to website design.
2 Methods

Our classification scheme is tailored for a low-resource setting with a limited amount of labeled data. The goal is to construct classifiers which achieve sufficient accuracy to allow extrapolation to a much larger set of unlabeled reviews. To achieve our goal of analyzing large-scale trends, we train two models for identifying objectifying commentary in RMP reviews: (1) a token-level chunk tagger similar to those used for named entity recognition; and (2) a review-level text classifier similar to those used for document classification. Unlike the chunk tagger, the document classifier can take into account a wider variety of features and can account for attractiveness commentary that occurs multiple times in a document: multiple spans can "gang up", making such reviews easier to detect at the document level. In contrast, the chunk tagger treats objectifying language as a highly local phenomenon and is therefore better able to detect attractiveness commentary in the context of longer reviews covering a range of topics.

We then build ensembles of these models. We anticipate that ensembling will be useful because we hypothesize that the two classifiers' patterns of errors will be only weakly correlated (van Halteren et al., 1998), and because labeled data for this task is limited, high-variance, and class-imbalanced (Brill and Wu, 1998).

2.1 Data

For this study, anonymous RMP reviews of professors were scraped on two occasions. The first scrape, in July 2018, paired textual data with the professor's "hotness" rating, defined as the number of times a student rated the professor as "hot" minus the number of times they rated them as "not hot" (Felton et al., 2008); scraping was seeded using a list of professors and their chili pepper scores (http://morph.io/chrisguags). In the web interface, the names of professors receiving positive attractiveness scores are marked with a chili pepper emoji (see Appendix, Figure 5). The second scrape, in August 2019, targeted a broader set of regions and schools. Test data was drawn from this latter dataset, which was also used for trend analysis. By this later date, the chili pepper emoji had been removed from the website in response to public criticism (McLaughlin, 2018), so it was no longer possible to extract hotness scores. In addition to text, both scrapes also collected the names of professors, student-reported quality and difficulty scores (averaged by professor, on a five-point scale), subject area, and the name of the school. See Table 8 and Table 9 in the Appendix for the full list of schools.

2.2 Attractiveness commentary

We define objectifying or attractiveness commentary as reviews that describe a professor's physical appearance, demeanor, clothing style, or resemblance. In contrast to prior work (e.g., Felton et al., 2008), we also include language disparaging a professor's appearance. Although previous work has considered objectifying comments in limited RMP datasets (Davison and Price, 2009; Kindred and Mohammed, 2005), there are no previous annotation guidelines to follow for labeling these expressions. Kindred and Mohammed (2005) find that, of the 788 RMP ratings in their sample, only 3.6% describe teacher attractiveness. Given the low frequency of these reviews and their informal qualities, creating instructions that cover attractiveness commentary in all of its variations is not possible. We acknowledge that some reviews, like the ones described in Section 2.3, will be more subjective than others.
RMP reviews contain stylistic flourishes common to online discourse: slang and non-standard language, typographical errors, expressive punctuation and capitalization, and emoticons. The examples below are fragments from 30-to-50-word reviews representing attractiveness commentary. See Figure 4 in the Appendix for additional examples in screen-capture format.

• Everyone LOOOOOVES sexy Jeff!
• ...he doesn't assume students understand complex stuff like other math teachers do. Plus, hello, HOT!
• He's also pretty cute which helps. :)
• ...when he talked about vector space he almost saw my O-face.

2.3 Challenging examples

This section describes reviews that pose challenges in labeling attractiveness spans and the process by which distinctions are made. Annotators were instructed that, when in doubt, reviews that imply romantic interest, or lack thereof, are considered objectifying.
Flirtation but no attractiveness commentary
Examples where the review may be flirtatious but does not directly describe the professor's appearance present a grey area.

• Damn, I love that man. (None)
Referring to a professor as "that man" borders on objectifying, but without additional context it is not considered attractiveness commentary.

• I love him so much, I would totally marry him if I could. (Obj.)
However, we consider references to marriage or dating the professor, as in the example above, to be objectifying commentary, because the element of fantasy in samples like these is taken to indicate attraction to the professor.

• He is a math god! (None)
Reviews that compare the professor to a deity are also difficult to distinguish. If the focus could be the professor's expertise, the review is not considered attractiveness commentary.

Accents
The most common challenging examples refer to the professor's voice or accent. These reviews primarily fall into two categories: (1) the accent is sexy, charming, or appealing; and (2) professors who are non-native English speakers are described as difficult to understand. Reviews in the latter category can be considered denigrating of the professor but are not necessarily attractiveness commentary. We consider the intent of the student: if the comment is personally derogatory, such as "horrible accent", it is considered objectifying. The following examples illustrate these distinctions:

• And he's British, such a charmer! Love his accent! (Obj.)
• He has the cutest accent. (Obj.)
• His accent was difficult to understand. (None)

word            He       is     CUT    for    a        Stanford   professor
word-lower      he       is     cut    for    a        stanford   professor
lemma           he       is     cut    for    a        stanford   professor
pos             PRON     AUX    NOUN   ADP    DET      PROPN      NOUN
has-hot         false    false  false  false  false    false      false
next-word       is       CUT    for    a      stanford professor  [END]
next-pos        AUX      NOUN   ADP    DET    PROPN    NOUN       [END]
prev-word       [START]  He     is     cut    for      a          stanford
prev-pos        [START]  PRON   AUX    NOUN   ADP      DET        PROPN
prev-iob        O        O      O      B      O        O          O
all-caps        false    false  true   false  false    false      false
prev-all-caps   false    false  false  true   false    false      false
next-all-caps   false    true   false  false  false    false      false

Table 1: Example feature vector for the chunk tagger, using a review snippet commenting on a professor's physique.
3 Models

We implement classification techniques with unique strengths for capturing the qualities and contexts of objectifying comments. The first, a chunk tagger, represents a bottom-up strategy, whereas the second, a document classifier, uses top-down processing and a richer feature set.
Because discussion of a professor’s attractivenessmay only be a small portion of any given review,we annotate spans of tokens which refer to attrac-tiveness. These labels can then be automaticallypropagated from spans to the document level. Thatis, if a review contains any spans tagged as con-taining objectifying language, the whole review islabeled objectifying. We employ a chunk taggercustomized to identify these spans within reviews.During preprocessing, labeled data is tagged forpart of speech (POS) using the spaCy tagger (Hon-nibal and Montani, 2017). Text spans that refer toattractiveness are tagged using the CoNLL-2003IOB format (Tjong Kim Sang and De Meulder,2003). The chunker is built using the nltk.chunk library (Bird et al., 2009, ch. 7); it uses a multi-nomial logistic regression classifier and a greedyleft-to-right decoding strategy.
Attractiveness features
In addition to token features, we develop a dictionary of words describing attractiveness (see Appendix, Table 10); these are matched using regular expressions so that alternative spellings (e.g., hoooottt, hotttttt) are also captured. See Table 1 for an example token feature vector.

3.2 Document classifier

We also develop a model that can take advantage of features extracted from the entire review. The document classifier is built using a linear-kernel support vector machine classifier from sklearn (Pedregosa et al., 2011). The primary features are term frequency-inverse document frequency weighted unigrams and bigrams. Several other types of features, described below, are used to improve classifier accuracy.
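A minimal sketch of this setup, combining tf-idf n-grams with dictionary-match counts in a single scikit-learn pipeline; the two-entry pattern list is an illustrative stand-in for the full dictionary in Appendix Table 10, and the class name is ours.

```python
import re

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.svm import LinearSVC

# "+" quantifiers absorb elongated spellings such as hoooottt, hotttttt.
ATTRACTIVENESS_PATTERNS = [
    re.compile(r"\bh+o+t+\b", re.IGNORECASE),
    re.compile(r"\bs+e+x+y+\b", re.IGNORECASE),
]

class DictionaryMatches(BaseEstimator, TransformerMixin):
    """Counts matches of each dictionary pattern in each review."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array(
            [[len(p.findall(doc)) for p in ATTRACTIVENESS_PATTERNS]
             for doc in X],
            dtype=float,
        )

classifier = make_pipeline(
    FeatureUnion([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),  # unigrams, bigrams
        ("dictionary", DictionaryMatches()),
    ]),
    LinearSVC(),  # linear-kernel SVM
)
# classifier.fit(train_reviews, train_labels)
# predictions = classifier.predict(test_reviews)
```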
Formality
Impressionistically, RMP reviews that discuss teacher appearance tend to be less formal than those that focus on the quality of instruction. To capture this distinction, we use features proposed by Pavlick and Tetreault (2016) to measure textual formality. These include average word and sentence length, the ratio of nouns to verbs, and the proportion of words over 4 characters. We also add one-hot features for the use of non-standard punctuation and capitalization. Finally, we extract features tracking the use of titles such as Dr., Professor, Mrs., and Mr.
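One way these formality measures might be computed; this is our simplification of the Pavlick and Tetreault (2016) feature set, not the paper's exact implementation, and any spaCy pipeline with a tagger and parser would do.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # one possible POS-tagging pipeline

def formality_features(text: str) -> dict:
    doc = nlp(text)
    words = [t for t in doc if t.is_alpha]
    sents = list(doc.sents)
    nouns = sum(t.pos_ == "NOUN" for t in words)
    verbs = sum(t.pos_ == "VERB" for t in words)
    return {
        "avg-word-length": sum(len(t) for t in words) / max(len(words), 1),
        "avg-sentence-length": len(words) / max(len(sents), 1),
        "noun-verb-ratio": nouns / max(verbs, 1),
        "long-word-proportion":
            sum(len(t) > 4 for t in words) / max(len(words), 1),
        "has-title":
            any(t.text in {"Dr.", "Professor", "Mrs.", "Mr."} for t in doc),
    }
```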
Gender
We extract professor gender by tracking third-person singular pronouns (e.g., he, his, she, her) in reviews; gender-non-specific pronouns like they and neo-pronouns like ze were not present in the labeled data and therefore were not tracked. We also do not track the gender of reviewers, as all reviews are submitted anonymously.

            Reviews    Tokens    Words
Labeled     4,050      12,209    139,091
Unlabeled   358,970    71,700    15m

Table 2: Summary statistics for datasets.
Subjectivity
Davison and Price (2009) and Ritter (2008) argue that student reviews largely follow a transactional consumerist discourse similar to customer service reviews. We hypothesize that this would be reflected in the ratio of first-person to third-person pronouns; a greater proportion of first-person pronouns may indicate a review about personal opinions and feelings (consumerist) rather than about instruction. We also reuse the attractiveness dictionary regular expression patterns from the chunk tagger, expanding them to include common idioms such as easy on the eyes and good looking. Additionally, each review is scored for its sentiment and subjectivity using the textblob sentiment classifier (https://textblob.readthedocs.org).
Style

We consider features measuring the use of text properties characteristic of internet discourse, including the use of emoticons, repeated exclamation points, and words in all uppercase letters.
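The subjectivity and style features reduce to a few lines of Python; the regular expressions and function name below are our simplified illustrations rather than the paper's exact patterns.

```python
import re

from textblob import TextBlob  # pip install textblob

FIRST_PERSON = re.compile(r"\b(i|me|my|mine|we|us|our)\b", re.IGNORECASE)
THIRD_PERSON = re.compile(r"\b(he|him|his|she|her|hers)\b", re.IGNORECASE)
EMOTICON = re.compile(r"[:;]-?[)(DPp]")  # :) ;-) :D and the like

def opinion_style_features(text: str) -> dict:
    blob = TextBlob(text)
    first = len(FIRST_PERSON.findall(text))
    third = len(THIRD_PERSON.findall(text))
    return {
        # High values suggest "consumerist" first-person framing.
        "first-to-third": first / max(third, 1),
        "polarity": blob.sentiment.polarity,          # in [-1, 1]
        "subjectivity": blob.sentiment.subjectivity,  # in [0, 1]
        "emoticons": len(EMOTICON.findall(text)),
        "repeated-bangs": text.count("!!"),
        "all-caps-words": sum(
            w.isupper() and len(w) > 1 for w in text.split()
        ),
    }
```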
Feature ablation

For the document classifier, a feature ablation study on the development data (Appendix, Table 11) shows that accuracy depends on the custom "hotness" dictionaries, including the pattern matching for idiomatic expressions. Omitting the formality and stylistic features, by contrast, does not impact performance.
3.3 Ensemble

After training the chunk tagger and document classifier on the labeled data, a simple document-level ensemble of these models is applied to the unlabeled data. Since only two weak classifiers are available, we use two forms of voting: ensemble 1 treats reviews on which the classifiers disagree as non-objectifying; ensemble 2 discards reviews on which the classifiers disagree entirely, trading coverage for higher accuracy.
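Concretely, the two voting rules reduce to the following (the function names are ours, not the paper's):

```python
def ensemble_1(chunk_pred: bool, doc_pred: bool) -> bool:
    """Disagreements are treated as non-objectifying."""
    return chunk_pred and doc_pred

def ensemble_2(chunk_pred: bool, doc_pred: bool):
    """Returns the shared label, or None to discard the review."""
    return chunk_pred if chunk_pred == doc_pred else None
```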
4 Experiments and results

The labeled data of 4,050 reviews is randomly split into training (80%) and development (20%) sets, the latter used for feature ablation (Appendix, Table 11). During annotation, professors labeled "hot" were deliberately oversampled. Review and token counts can be found in Table 2.

                        Doc. classifier
Chunk tagger     Targeted      None
Targeted         8,573         9,858
None             4,295         336,242

Table 3: Confusion matrix for the chunk tagger and document classifier models; Targeted: reviews which contain attractiveness commentary.
To estimate interannotator agreement, a subset of the labeled data was independently labeled by a second annotator, a graduate student in linguistics, according to the authors' guidelines. This gave a span-level Cohen's κ = .785 and a document-level κ = .

After applying the chunk tagger and document classifier to the unlabeled data, we find the classifiers disagree on 4.1% of the reviews (see Table 3). This is roughly what one might expect given the overall low proportion of true positive samples. We determine accuracy by creating a test set from 600 of these reviews. This set includes reviews with classifier agreement: 150 documents predicted to contain, and 150 documents predicted not to contain, objectifying language. We also sample 300 reviews on which the chunk tagger and document classifier disagree, oversampling from recent date ranges to better capture any new trends in reviews. These 600 samples are randomly sorted and then adjudicated by a human judge to create the test set.

The results for the chunk tagger and document classifier are shown in Table 4. As can be seen, both classifiers have relatively high accuracy but significantly lower precision and recall. Table 5 shows example reviews on which the classifiers disagree.
Classifier    Prec.   Rec.   F     Acc.   κ
Chunk tag.    .42     .21    .28   .89    .23
Doc. class.   .44     .23    .30   .93    .26

Table 4: Weak classifier results.
Review (chunk tagger / doc. classifier):

• Not a great teacher (in fact pretty awful) but she's looking GOOD. (FN / TP)
• he is now bald, but he still has the look ;) (FN / TP)
• His classes are worthwhile because he's a good teacher, but mostly because he has the most awesome accent in the world. Rawr. (FN / TP)
• the WORST ****ING TEACHER EVER. WORST CLASS, WORST PERSON. NOT PROFESSIONAL AT ANYTHING, DOES NOT KNOW PHYSICS FROM THE HOLE IN HIS ASS. AVOID! (FP / TN)
• My experience with this professor was awful. He wasn't helpful and I ended up learning everything on my own without his help. I should have just stared at the wall rather than wasting my time in this class. He did not BUMP my grade up! (FP / TN)
• Probably the BEST Org Chem prof out of all the ones I've had. His slides are actually notes, not just pictures with lines on the side for you to write on. The exam is based on the notes, but you also need to read the book. Def didn't mind looking at him for 1 hour 25 mins either. (TP / FN)
• not a bad prof. has a nice smile. class discussions were pretty interesting. grades are based on ur attendance, and ur blog entries (they are not hard, but be careful, cuz her way of grading is kinda picky). overall, not a hard class. kinda interesting. take it if u want, but if u cant stand reading don't. TONS of reading. (TP / FN)
• Jason's a fantastic section leader–some of the best classes I've had here were in section for this class. Plus, he knows his stuff, is super eloquent, and kicks ass in suits (just sayin'). I will say that he can come off as cold and intimidating at first, but he actually cares and is really willing to help you. (TP / FN)
• I can't understand her heavy accent. I found her subject boring. (TN / FP)
• Jenny is an extraordinary professor- she truly cares about how you do in her class, and does her best to help you in whatever fashion she can. (TN / FP)
• Very easy going. Knows what he is doing from lived experience. The powerpoints are very good. You can skip class and just follow along on the slides and get the idea of things (although probably not a good grade). Hot daughter. (FP / FP)
• worst prof ever, and i really mean that. she's not even a professor, just some plant biologist hired as a lecturer. she is completely inept as a lab manager and universally hated by the students. oh, and very not hot (TP / TP)

Table 5: Examples of classifier errors and disagreements; reviews have been modified to reflect their original format while protecting the identity of professors.

The chunk tagger performs better on reviews with higher word counts. In some cases, the chunk tagger avoids false positives of the document classifier where keywords from the custom dictionaries appear but in a context that is not objectifying. The document classifier performs well on lower word count reviews and where words from the custom dictionaries and regular expression patterns are present. In Table 6, we show the same results for the two ensembles; note that results for ensemble 2 do not include the 300 samples from the test data on which the two weak models disagree.

We see that both ensemble classifiers achieve greatly improved results compared to either the chunk tagger or the document classifier alone, and as expected, error can be further reduced in ensemble 2 by discarding data on which they disagree. We conclude that ensemble methods are effective for detecting objectifying commentary in student reviews in the face of unbalanced data. In what follows, the ensemble 2 classifier is used to analyze trends in attractiveness commentary on 344,815 reviews.
5 Trends in attractiveness commentary

Building on previous RMP research studying bias in student reviews, we continue this inquiry, focusing on how attractiveness commentary is distributed by teacher gender and by quality and difficulty scores.

Figure 1: Log-odds of attractiveness commentary in reviews from 2010 to August 2019.
Classifier    Prec.   Rec.    F     Acc.   κ
Ensemble 1    .72     .44     .55   .93    .50
Ensemble 2    .72     1.00    .84   .99    .83
Table 6: Ensemble classifier results.

We then turn to a logistic regression analysis using generalized estimating equations (GEE) to determine whether there was a decrease in attractiveness commentary following the removal of the chili pepper feature from the web interface.

Our dataset contains 39.7% female professors (11 , ). The difference is significant (χ² = , p < ). The difference could be attributed to the distinction between the low-effort act of clicking "hot" on the review form versus actually writing commentary on the teacher's appearance. Also, unlike chili pepper ratings, our counts include reviews with negative commentary.

We deploy logistic generalized estimating equations (GEE; Liang and Zeger 1986), an extension of generalized linear models that takes into account correlation between observations. A logistic GEE accommodates the unequal number of observations across professors and conditions, as well as the variation in review volume over quarterly time intervals; this suits the noise in the dataset and allows us to use the entire collection of reviews. The final model parameters are determined by the best goodness-of-fit score computed using the full log quasi-likelihood function. School size and tuition did not have significant effects and were discarded. The final model includes the presence or absence of the chili pepper interface feature, teacher quality and difficulty scores, and professor gender. Time is input as an interval covariate by quarter, while the chili pepper condition is a binary factor; final parameters and their outcomes are given in Table 7.
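The paper does not specify its GEE software; as one possibility, a comparable logistic GEE can be fit with statsmodels, where the input file and column names below are hypothetical.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# One row per review; the file and column names are illustrative.
df = pd.read_csv("reviews.csv")

# Binary outcome, binary pepper condition, quarterly time covariate;
# "*" expands to main effects plus the quality:gender interaction.
model = smf.gee(
    "objectifying ~ pepper_present + time_in_quarters"
    " + difficulty_high + quality_high * gender_female",
    groups="professor_id",  # reviews are correlated within professor
    data=df,
    family=sm.families.Binomial(),  # logistic GEE
)
result = model.fit()
print(result.summary())
# Recent statsmodels versions expose a quasi-likelihood criterion
# for model comparison as result.qic().
```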
Chili pepper and time interval

First, we focus on our primary question concerning the proportion of objectifying comments and the removal of the chili pepper. We observe a downward trend over the period prior to the interface change; however, the log-odds of attractiveness commentary after the chili pepper was removed on June 28, 2018 are lower than the time variable alone can account for (see Table 7). Our analysis finds significant effects of both time and condition (with vs. without the chili pepper).

                          Estimate (log-odds)  Std. err.  Wald χ²  p(χ²)
(Intercept)               −3.111               .143       476.18   < .001
pepperAbsent              −.428                .136       9.93     .002
timeInQuarters            −.020                .002       79.44    < .001
difficultyHigh            −.075                .022       11.49    < .001
qualityHigh               .051                 .026       3.76     .053
genderFemale              −.528                .174       9.19     .002
qualityHigh:genderFemale  .097                 .043       5.09     .024

Table 7: GEE model parameter estimates with attractiveness commentary as the dependent variable. The intercept represents pepperPresent, timeInQuarters = 0, difficultyLow, qualityLow, genderMale. N = 344,815.

These findings support our hypothesis: RMP's removal of the chili pepper coincides with a decline in reviews mentioning professor attractiveness.
Quality and teacher gender
We compare the proportions of attractiveness commentary in relation to the quality and difficulty rating scales (Figure 2). There is a significant interaction between teacher quality and gender: female professors rated high quality are significantly more likely to receive attractiveness commentary than male professors rated high quality (see Table 7). Difficulty is also a significant factor: the higher the difficulty score, the less likely a professor's reviews are to contain attractiveness commentary.

Figure 2: Proportion of reviews with attractiveness commentary by quality and difficulty ratings.

6 Discussion

While our work has focused on the text contents of reviews, our analysis of objectifying comments follows previous findings about biases of the original chili pepper rating, correlating with teacher gender, quality, and difficulty ratings. This is the first study to find a correlation between attractiveness commentary and the website interface.

More research is needed to understand the observed steady eight-year decline. As this was an observational study rather than a controlled experiment, there are many uncontrolled variables. For instance, we cannot compare attractiveness commentary by the size of a professor's class or by attributes of the reviewer. We tried to estimate these factors with proxies such as university size, geographic area, and tuition amounts, but these provide only rough estimates and did not have a significant effect on the presence of attractiveness commentary. McNeil (2020) reflects on how users' perceptions of anonymity have changed, from posting to online bulletin boards in the late 1990s to present-day "sharing" on corporate-owned, heavily surveilled social network sites like Facebook. This turn from anonymity to self-awareness is observed by Marwick and boyd (2011) in their study of Twitter users, who describe their own self-censoring behaviors by imagining their audiences to include not only friends but also parents and employers. The decline in attractiveness commentary on RMP may reflect broader internet trends, corresponding with internet users being more conscious of their perceived audience and realizing that true online anonymity is impossible.

7 Conclusion

We find that a small change to the RMP website, the removal of the chili pepper rating, is associated with a lower likelihood of comments on professor attractiveness. Our experiments show that an ensemble of classifiers can accurately detect objectifying language in online professor reviews, allowing us to analyze trends in a large unlabeled dataset.

One area where the classifiers disagreed was the "fuzzy samples", such as accents and godliness, discussed in Section 2.3. Breitfeller et al. (2019) describe similar challenges in classifying microaggressions and label themes within their dataset to better define these utterances. Given our classifiers' success in pulling objectifying comments out of large datasets, we can identify enough examples to consider labeling categories such as accent criticism and comments about unattractiveness. Finally, one could apply an active learning approach (Yarowsky, 1995) to label and train on examples where the classifiers disagreed.

With further exploration, it is hoped these techniques can be applied to detecting other forms of abusive language in online reviews. Insofar as the removal of the chili pepper feature correlated with a significant decrease in attractiveness commentary, we suggest that web interface design may positively influence online discourse.
As the scope of the gig economy continues to expand and more workers find themselves evaluated by anonymous online reviews, we hope these findings will inspire future research on potential biases in online reviews related to gender, appearance, and the design of the online interfaces used.

Acknowledgments

We would like to thank Martin Chodorow for his guidance in statistical analysis and Deepali Advani for her assistance with data preparation. We appreciate Jonathan Butterick's help with data collection. We would also like to acknowledge Sara Morini for her assistance with data annotation, William Jordan for proofreading, and the anonymous reviewers for their helpful feedback.
References
Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O'Reilly.

Anne Boring. 2017. Gender biases in student evaluations of teaching. Journal of Public Economics, 145:27–41.

Anne Boring, Kellie Ottoboni, and Philip B. Stark. 2016. Student evaluations of teaching (mostly) do not measure teaching effectiveness. ScienceOpen Research, 1:1–11.

Luke Breitfeller, Emily Ahn, David Jurgens, and Yulia Tsvetkov. 2019. Finding microaggressions in the wild: A case for locating elusive phenomena in social media posts. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1664–1674.

Eric Brill and Jun Wu. 1998. Classifier combination for improved lexical disambiguation. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, pages 191–195.

André Brock. 2018. Critical technocultural discourse analysis. New Media & Society, 20(3):1012–1030.

Serina Chang and Kathy McKeown. 2019. Automatically inferring gender associations from language. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5746–5752.

Elizabeth Davison and Jammie Price. 2009. How do we rate? An evaluation of online student evaluations. Assessment & Evaluation in Higher Education, 34(1):51–65.

James Felton, Peter T. Koper, John Mitchell, and Michael Stinson. 2008. Attractiveness, easiness and other issues: student evaluations of professors on RateMyProfessors.com. Assessment & Evaluation in Higher Education, 33(1):45–61.

Colleen Flaherty. 2018. Bye, bye, chili pepper: Rate My Professors ditches its chili pepper "hotness" quotient. Inside Higher Ed. Accessed 10 / / .

Scott Freng and David Webber. 2009. Turning up the heat on online teaching evaluations: Does "hotness" matter? Teaching of Psychology, 36(3):189–193.

Hans van Halteren, Jakub Zavrel, and Walter Daelemans. 1998. Improving data driven wordclass tagging by system combination. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, pages 491–497.

Jessi Hempel. 2017. For Nextdoor, eliminating racism is no quick fix. Wired.

Susan C. Herring. 2004. Computer-mediated discourse analysis. In Sasha Barab, Rob Kling, and James H. Gray, editors, Designing for Virtual Communities in the Service of Learning, pages 338–376. Cambridge University Press.

Susan C. Herring and Jannis Androutsopoulos. 2015. Computer-mediated discourse 2.0. In Deborah Tannen, Heidi E. Hamilton, and Deborah Schiffrin, editors, The Handbook of Discourse Analysis, pages 127–151. John Wiley & Sons.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. Accessed 1 / / .

Jeannette Kindred and Shaheed N. Mohammed. 2005. "He will crush you like an academic ninja!": Exploring teacher ratings on RateMyProfessors.com. Journal of Computer-Mediated Communication, 10(3).

Adam D. I. Kramer, Jamie E. Guillory, and Jeffrey T. Hancock. 2014. Experimental evidence of massive-scale emotional contagion through social networks. Proceedings of the National Academy of Sciences, 111(24):8788–8790.

Christine Lagorio. 2006. Hot for teacher. The Village Voice.

J. Richard Landis and Gary G. Koch. 1977. The measurement of observer agreement for categorical data. Biometrics, 33(1):159–174.

Kung-Yee Liang and Scott L. Zeger. 1986. Longitudinal data analysis using generalized linear models. Biometrika, 73(1):13–22.

Alice E. Marwick and danah boyd. 2011. I tweet honestly, I tweet passionately: Twitter users, context collapse, and the imagined audience. New Media & Society, 13(1):114–133.

BethAnn McLaughlin. 2018. I killed the chili pepper on Rate My Professors. Accessed 1 / / .

Joanne McNeil. 2020. Lurking: How a Person Became a User. Macmillan.

Bo Pang and Lillian Lee. 2004. A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pages 271–278.

Ellie Pavlick and Joel Tetreault. 2016. An empirical analysis of formality in online communication. Transactions of the Association for Computational Linguistics, 4:61–74.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Kelly Ritter. 2008. E-valuating learning: Rate My Professors and public rhetorics of pedagogy. Rhetoric Review, 27(3):259–280.

Andrew S. Rosen. 2018. Correlations, trends and potential biases among publicly accessible web-based student evaluations of teaching: a large-scale study of ratemyprofessors.com data. Assessment & Evaluation in Higher Education, 43(1):31–44.

Ben Schmidt. 2015. Gendered language in teacher reviews. Accessed 7 / / .

Nick Statt. 2020. Twitter tests a warning message that tells users to rethink offensive replies. The Verge.

TextBlob. 2018. TextBlob: simplified text processing. Accessed 1 / / .

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147.

Janyce Wiebe, Theresa Wilson, and Matthew Bell. 2001. Identifying collocations for recognizing opinions. In Proceedings of the ACL-01 Workshop on Collocation: Computational Extraction, Analysis, and Exploitation, pages 24–31.

David Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pages 189–196.
Appendix
Figure 3: Screen capture of the review form on RateMyProfessors.com, 2014.

Figure 4: Screen captures of attractiveness ratings on RateMyProfessors.com, 2019.

Carnegie Mellon University, Duke University, Harvard University, Massachusetts Institute of Technology, Princeton University, Rice University, Stanford University, Tufts University, University of Chicago, University of Texas at Austin, Yale University

Table 8: Universities sampled in the labeled dataset.
Tuition   Enrollment   Universities
high      low          Drexel University, Emory University, Fairfield University, Lawrence Technological University, Loyola Marymount University, Pennsylvania State University, Pomona College, Princeton University, Rice University, Trinity University, University of Chicago, University of Tulsa, Villanova University, Wesleyan University, Yale University
high      medium       Northwestern University, Stanford University
low       high         Iowa State University, University of California Los Angeles, University of South Carolina, University of Texas at Austin, University of Wisconsin
low       low          College of Charleston, Evergreen State College, Montclair State University, New Mexico State University, Southern Utah University, University of Montana, University of Wyoming
low       medium       Boise State University, Brigham Young University, Georgia Institute of Technology, Mississippi State University, Oklahoma State University, University of Illinois at Chicago, University of Northern Iowa, University of Oregon, Washington State University, Reed College
medium    high         Rutgers State University
medium    low          Austin College, Berry College, Bradley University, Newberry College, Oklahoma Baptist University
medium    medium       Temple University

Table 9: Universities appearing in the unlabeled dataset. Tuition levels are binned as high ($42,000–59,000), medium ($27,000–41,000), and low ($5,500–14,800). School enrollment is binned as high (35,000–52,000), medium (17,000–34,500), and low (2,000–17,000).

Dictionary      Word list
hot dictionary  adorable, alluring, appealing, athletic, attractive, babe, bangin, banging, beaut, beautiful, beauty, becoming, beguiling, bewitched, bewitching, bootylicious, breathtaking, buxom, charming, chili, comely, cute, dainty, dazzling, divine, doll, dork, dorky, dreamboat, dreamy, enchanting, fetching, fire, flaming, fox, foxy, gentle, gentleness, glamorous, glorious, gorgeous, graceful, handsome, hottie, hubba, hunk, hunky, hypnotic, irresistible, looker, lovely, luscious, magnetic, marry, nerdy, ravishing, seductive, sensuous, sexy, smokin, smoking, soothing, spiffy, striking, stunning, sublime
fashion list    boots, clothes, clothing, dress, dressed, dresses, fashion, hip, hipster, jacket, outfit, outfits, shoes, socks, stylish, wardrobe, wear, wears
hair words      bald, baldness, baldspot, beard, blond, blonde, brunette, curly, dreadlocks, hair, haircut, moustache, mustache, shave, sideburns, toupee, wavy

Table 10: Domain-specific word lists used in feature selection.

Figure 5: Screen capture of a professor profile on RateMyProfessors.com, 2014.

Features                            Ablation configurations
Familiarity and first person        •   –   •   •   –   –   –   –   –
Lexical: has "hot"                  •   –   •   •   •   •   •   •   •
Lexical: has "accent"               •   –   •   •   •   •   •   –   •
Lexical: age, body part, clothing   •   –   •   •   •   •   •   –   –
Readability                         •   –   –   –   –   –   –   –   –
Sentiment polarity                  •   –   •   •   •   –   –   –   –
Subjectivity score                  •   –   •   •   •   –   –   –   –
Formality                           •   –   •   –   –   –   –   –   –
Pronouns                            •   –   •   •   –   –   –   –   –
Internet style                      •   –   •   •   •   •   –   –   –
F1                                  .90 .27 .90 .90 .90 .90 .90 .79 .90

Table 11: Feature ablation results for the document classifier on development data (• = feature included, – = feature omitted).