Predicting gender and age categories in English conversations using lexical, non-lexical, and turn-taking features

Abstract

This paper examines gender and age salience and (stereo)typicality in British English talk with the aim to predict gender and age categories based on lexical, phrasal and turn-taking features. We examine the SpokenBNC, a corpus of around 11.4 million words of British English conversations and identify behavioural differences between speakers that are labelled for gender and age categories. We explore differences in language use and turn-taking dynamics and identify a range of characteristics that set the categories apart. We find that female speakers tend to produce more and slightly longer turns, while turns by male speakers feature a higher type-token ratio and a distinct range of minimal particles such as "eh", "uh" and "em". Across age groups, we observe, for instance, that swear words and laughter characterize young speakers' talk, while old speakers tend to produce more truncated words. We then use the observed characteristics to predict gender and age labels of speakers per conversation and per turn as a classification task, showing that non-lexical utterances such as minimal particles that are usually left out of dialog data can contribute to setting the categories apart.


Andreas Liesenfeld, Gábor Parti, Yu-Yin Hsu, Chu-Ren Huang

Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Hong Kong
amliese;gabor.parti;yu-yin.hsu;[email protected]


Author's note (Oct 2020): statement on the use of social categories in this study

This work involves the labelling of participants for social categories related to gender and age. We caution against the use of this heuristic due to the risk of promoting a biased view on the topics. We would like to encourage those interested in the computational modelling of social categories to join the discussion on these concerns and consider participating in efforts to build more inclusive resources for the study of the topics.

A long-standing topic in language studies is how speakers' gender and age influence their communicative behaviour. Transcriptions of real-world, naturally occurring conversations provide a window onto such differences in talk-in-interaction.

Gendered and age-salient elements of talk have long been studied from a range of perspectives, including linguistics (e.g. Lakoff 1973, Tannen 1990), psychology (for an overview see, e.g. James and Drakich 1993, Tannen 1993), and conversation analysis (e.g. Jefferson 1988). This topic has also been extensively studied from a computational perspective, focusing on how the differences can be formally described and modelled. In recent years, research interest has extended to various applications using different types of data, such as movie subtitles to identify gender-distinguishing features (Schofield and Mehr, 2016); email interactions to study gender and power dynamics (Prabhakaran and Rambow, 2017); video recordings of human-robot interactions to study gendered and age-related differences in turn-taking dynamics (Skantze, 2017); literary and weblog data to study differences between male and female writing (Herring and Paolillo 2006; Argamon et al. 2003); and multimodal audiovisual and thermal sensor data for gender detection (Abouelenien et al., 2017).

Recent studies on gendered and age-salient behaviour in conversations also focus on the use of specific constructions or classes of constructions, such as swear words (McEnery and Xiao, 2004), amplifiers (Xiao and Tao, 2007), do constructions (Oger, 2019), and minimal particles (Acton, 2011). In the current study, we use transcriptions of recordings of naturally occurring talk in British English to explore distributional differences in language use across gender and age groups, testing well-known tropes such as tendencies that women speak more politely, or that men use more swear words (Baker 2014, Lakoff 1973, Tannen 1990), and also shedding light on other under-explored aspects of gendered and age-salient elements in talk such as the use of non-lexical vocalizations, laughter and other turn-taking dynamics. Our interest in this topic derives from work in computational modelling of dialog and conversation, especially studies aiming to automatically identify speaker properties for use in voice technology and user modeling (Joshi et al. 2017, Wolters et al. 2009, Liesenfeld and Huang 2020).

This pilot study explores whether non-lexical vocalizations and turn-taking properties can contribute to the prediction of age and gender categories. We investigate this question using a dataset of naturalistic talk that includes a range of elements other than words, such as laughter, pauses, overlaps and minimal particles. Can authentic and often "disfluent" and "messy" transcriptions of natural talk be used for a classification task? How will different behavioural cues contribute to a statistical investigation and prediction of gender and age salience and typicality?

Our dataset comes from the Spoken BNC2014 (Love et al., 2017), a corpus of contemporary British English conversations recorded between 2012 and 2016. It consists of transcriptions of talk on a range of topics covering everyday life in casual settings between around 2 to 4 speakers with a wide variety of social relationships such as between family members or good friends, and among colleagues or acquaintances.
For classification, we extract two subcorpora from the SpokenBNC using the speaker labels "female" and "male" as well as age labels for speakers above 70 years and under 18 years. Table 1 provides an overview of the two subcorpora. For age, we chose to only include the youngest and oldest speakers to tease out more significant differences by removing the bulk of middle-aged speakers. The downside of this approach is that this subcorpus is relatively small.

Feature                          Female       Male        Old       Young
Speakers                         365          305         56        49
Words                            6,671,774    4,080,524   737,398   792,039
Turns                            742,973      478,851     96,994    102,433
Average turn length (in words)   9.42         8.95        8.05      8.18
Type-token ratio                 0.0073       0.011       0.0231    0.0235

Table 1: Properties of the dataset obtained from the SpokenBNC2014 corpus. "Old" refers to speakers above 69 years of age, "young" includes speakers up to 18 years old.
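The turn-level statistics reported in Table 1 can be illustrated with a short sketch. It assumes whitespace tokenization and lower-casing, which is a simplification and not necessarily the tokenization used in the study; the function name `turn_statistics` is hypothetical.

```python
def turn_statistics(turns):
    """Compute basic statistics for a list of transcribed turns (strings).

    Assumption: tokens are whitespace-separated and compared case-insensitively.
    """
    tokenized = [turn.lower().split() for turn in turns]
    tokens = [tok for turn in tokenized for tok in turn]
    n_tokens = len(tokens)
    return {
        "turns": len(turns),
        "words": n_tokens,
        # Mean number of tokens per turn.
        "avg_turn_length": n_tokens / len(turns) if turns else 0.0,
        # Number of distinct word forms divided by number of tokens.
        "type_token_ratio": len(set(tokens)) / n_tokens if n_tokens else 0.0,
    }


example = ["you are so good", "mm", "you are"]
print(turn_statistics(example))
```

A type-token ratio computed over a whole subcorpus tends to fall as the subcorpus grows, so values for subcorpora of very different sizes are not directly comparable without some form of length normalization.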

Comparing the behaviour of speakers across categories, we first look at lexical and phrasal differences. Then we examine non-lexical vocalizations such as laughter, minimal particles, and turn-taking dynamics such as overlaps and pauses. For both parts, we tokenize the corpora and remove stopwords using the NLTK and spaCy libraries (Bird et al. 2009, Honnibal and Montani 2017).

We select a number k_i of speakers of each label from conversation i and build a language model with the n-gram frequencies for all turns per category. We then examine the characteristic differences in the use of lexical items using Scaled F-score. Scaled F-score is a modified metric based on the F-score, the harmonic mean of precision and recall. It addresses the problem of harmonic means being dominated by precision and gives a better representation of low-frequency terms.

We plot gender and age categories using the Scattertext library (Kessler 2017) to visualize the cross-categorical differences at the n-gram level. Figures 1 and 2 show words and phrases that are more characteristic of each category, while also reporting their frequencies and a list of top characteristic items.

We observe that a range of terms reflect (stereo)typicality of gender and age categories in our corpus. For instance, top characteristic terms of male speakers feature the nouns "mate", "game", "cards", "quid" and "football", while female speakers' talk more prominently features "baby", "weekend", "hair", "birthday" and "cake" (see Figure 1). More interesting for us, the characteristic terms per gender category also feature a number of verb constructions, exclamations and minimal particles such as "ain't", "innit", "eh" and "uh" for male speakers and "my God", "mhm", "blah", "huh" and "hm" for female speakers.

For age categories, notably a much smaller corpus, we also observe that a range of items features more prominently in talk of speakers labelled as young or old (see Figure 2). Likewise, we observe that some non-lexical utterances, exclamations and particles exhibit category salience, such as "em", "innit" and "oh dear". Based on these observations we decide to further explore the role of non-lexical vocalizations, exclamations and minimal particles in the corpus.
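The idea behind a scaled F-score can be sketched as follows. This is a simplified illustration and not the exact computation performed by Scattertext or by the authors: it assumes that term precision and term frequency within the category are each rescaled with a normal CDF fitted to their own distribution before the harmonic mean is taken. All function names are hypothetical.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def normal_cdf(x, mu, sigma):
    """CDF of a normal distribution; returns 0.5 if all values are identical."""
    if sigma == 0:
        return 0.5
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def scaled_f_scores(cat_counts, other_counts, beta=1.0):
    """Score each term by how characteristic it is of the category.

    cat_counts / other_counts: term -> positive count in the category and in
    the rest of the corpus. Assumes cat_counts is not empty.
    """
    vocab = sorted(set(cat_counts) | set(other_counts))
    total_cat = sum(cat_counts.values())
    # Precision: share of a term's occurrences that fall in the category.
    precision = {t: cat_counts.get(t, 0) / (cat_counts.get(t, 0) + other_counts.get(t, 0))
                 for t in vocab}
    # Recall-like frequency: share of the category's tokens taken by the term.
    recall = {t: cat_counts.get(t, 0) / total_cat for t in vocab}
    p_mu, p_sd = mean(precision.values()), pstdev(precision.values())
    r_mu, r_sd = mean(recall.values()), pstdev(recall.values())
    scores = {}
    for t in vocab:
        p = normal_cdf(precision[t], p_mu, p_sd)
        r = normal_cdf(recall[t], r_mu, r_sd)
        denom = beta ** 2 * p + r
        # Harmonic mean of the two rescaled quantities.
        scores[t] = (1 + beta ** 2) * p * r / denom if denom > 0 else 0.0
    return scores


male = Counter({"mate": 10, "game": 5, "the": 50})
female = Counter({"baby": 10, "the": 50, "mate": 1})
for term, score in sorted(scaled_f_scores(male, female).items(), key=lambda kv: -kv[1]):
    print(f"{term}: {score:.3f}")
```

Rescaling both quantities before averaging keeps a very frequent but uninformative word such as "the" from outranking a rarer word that occurs almost exclusively in one category.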

Scattertext: https://github.com/JasonKessler/scattertext

Positive responses and continuers (mm, uhu)
  S1: you're so good at hair
  S2: really?
  S1: mm
  S2: hair is my weakness I feel like I'm really bad

  S1: no I'd rather sit inside
  S2: uhu
  S1: if it was just a little bit sunnier

Turn stalling (hmm)
  S1: you don't like riding them?
  S2: I do but [short pause] hmm [short pause] you don't really
  S3: [overlapping] I don't have a bike

Turn management (erm, er)
  S1: oh lemon balm yeah you can do that as well
  S2: erm what what is very good for colds i [truncated] is er erm purple sage
  S1: yeah pur [truncated] yeah [short pause] I know that one yeah

Repair initiators (hm?)
  S1: are all all the actors are redubbed for the songs aren't they
  S2: hm?
  S1: are the all the actors redubbed for the songs? I can't remember

Change-of-state token (oh)
  S1: she was just awake screaming for hours
  S2: oh
  S1: so that took its toll

Table 2: Overview of minimal particle types

Figure 1: Overview of characteristic terms by gender category, blue=Male, Red=Female, plotted by frequency

Figure 2: Overview of characteristic terms by age category, blue=Old, Red=Young, plotted by frequency

Next, we move beyond lexis and examine non-lexical vocalizations and a range of other aspects of turn-taking such as laughter, pauses, overlaps, and truncation. Our dataset contains non-lexical vocalizations of different functions, such as the minimal particles (also known as interjections) "hm", "mhm", "hmm", "er", "erm", "um", "aha", "oh". In fact, this type of utterance ranks among the most frequent in the corpus. These utterances can format a wide range of functions that may be relevant to gender and age category prediction. Unfortunately, our corpus does not annotate functional information for these utterances, which makes it difficult to consistently group this type of utterance into functional categories in retrospect (Liesenfeld 2019b). Inspired by Couper-Kuhlen and Selting (2017), we therefore decided to only group these particles into five broad form-function mappings based on their typical forms. This way we aim to capture at least some functional variety, even though this unfortunately does not accommodate inter-speaker variation. Table 2 shows the functions we differentiate.

In addition, laughter, truncation, pauses and overlap are also annotated in our corpus as single labels that indicate the occurrence of laughter-related sounds, abandoned words, as well as the occurrence of overlap between two turns by speakers. Table 3 provides an overview of these non-lexical vocalizations and turn-taking properties.

The cells in dark blue show the highest occurrence of a feature per category as relative frequencies. For example, laughter is most prominent among young speakers, while turn management tokens ("er", "erm", "um") are typical for old speakers. The lighter blue cells compare the prominence of the same item across categories, displayed as the percentage of the highest ranking category.

First we look at minimal particles that typically format positive responses (as for acknowledgments) and continuers. This includes nasal utterances such as "mm", "mhm", "mm hm", as well as vocalic utterances such as "aha", "uhu", "uhuh", "uh huh". Turns by female and old speakers feature these utterances more often than those of other speakers. Second, we examine utterances typically related to turn stalling or management. We distinguish two types of forms that typically format this, nasal "hmm" and "hmmm" sounds as well as vocalic sounds "um", "er", and "erm". Turn stalling tokens appear most frequently in turns by female speakers while turn management tokens appear predominantly in turns by old speakers. Next, we examine nasal utterances featuring a rising pitch which are annotated as "hm?". This type of utterance can format doubt, disbelief, or serve as a repair initiator. It appears predominantly in turns by female speakers (notably, raw counts for this token are very low). Lastly, the utterance "oh", which (as a response and with rising pitch) commonly formats a change-of-state token that expresses an insight or understanding, features most prominently in talk by old speakers.
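The grouping of particles by form and the "percentage of the highest-ranking category" presentation can be sketched as below. The form inventory follows the forms named in the text, but the sketch only matches single whitespace-separated tokens (multi-word forms such as "mm hm" or "uh huh" would need phrase matching), and normalizing by the number of turns is an assumption, not necessarily the normalization used for Table 3. The function names are hypothetical.

```python
from collections import Counter

# Form-function groups as described in the text (single-token forms only).
PARTICLE_GROUPS = {
    "positive_response_or_continuer": {"mm", "mhm", "aha", "uhu", "uhuh"},
    "turn_stalling": {"hmm", "hmmm"},
    "turn_management": {"um", "er", "erm"},
    "repair_initiator": {"hm?"},
    "change_of_state": {"oh"},
}
FORM_TO_GROUP = {form: group for group, forms in PARTICLE_GROUPS.items() for form in forms}


def group_rates(turns_by_category):
    """Occurrences of each particle group per turn, for every speaker category."""
    rates = {}
    for category, turns in turns_by_category.items():
        counts = Counter()
        for turn in turns:
            for token in turn.lower().split():
                group = FORM_TO_GROUP.get(token)
                if group is not None:
                    counts[group] += 1
        rates[category] = {g: counts[g] / len(turns) for g in PARTICLE_GROUPS}
    return rates


def as_share_of_highest(rates, group):
    """Label the top category 'highest' and the others as a percentage of it."""
    top = max(r[group] for r in rates.values())
    if top == 0:
        return {category: "n/a" for category in rates}
    return {category: "highest" if r[group] == top else f"{100 * r[group] / top:.1f}%"
            for category, r in rates.items()}


toy = {
    "old": ["erm yes", "er I know", "oh"],
    "young": ["erm what", "yeah", "no", "ok"],
}
print(as_share_of_highest(group_rates(toy), "turn_management"))
```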

Can we predict the speaker's gender and age category based on lexical, non-lexical and turn-taking features alone? And how does including non-lexical vocalizations impact the binary classification task?

One challenge of working with transcriptions of unscripted conversations is that the subcorpora one would like to compare are rarely of the same size. For binary classification it is therefore essential to select equal numbers of speakers of each category. We also checked the number of utterances of each speaker and removed those who only produced a minimal amount of talk.

Furthermore, we considered controlling for gender pairs, making sure our subcorpora feature male-male, female-male, and female-female talk in equal measure, but ultimately we decided that, in this case, the resulting dataset would not be big enough for the task. Similarly, we decided against using a leave-one-label-out split to control for the language of a particular conversation.
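The balancing step described above can be sketched as follows. The record layout (dictionaries with the keys `id`, `label` and `turns`) and the `min_turns` threshold are hypothetical; the text does not state the cut-off that was actually used.

```python
import random
from collections import defaultdict


def balance_speakers(speakers, min_turns=5, seed=0):
    """Drop speakers with very little talk, then downsample to equal class sizes.

    speakers: list of dicts with hypothetical keys 'id', 'label' and 'turns'.
    """
    by_label = defaultdict(list)
    for speaker in speakers:
        if len(speaker["turns"]) >= min_turns:
            by_label[speaker["label"]].append(speaker)
    # The smallest category determines how many speakers each category keeps.
    n = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], n))
    return balanced


toy = (
    [{"id": f"F{i}", "label": "female", "turns": ["mhm"] * 10} for i in range(7)]
    + [{"id": f"M{i}", "label": "male", "turns": ["eh"] * 10} for i in range(3)]
    + [{"id": "M3", "label": "male", "turns": ["uh"]}]  # too little talk, removed
)
print([s["id"] for s in balance_speakers(toy)])
```

Downsampling the larger category keeps the chance baseline at 50%, at the cost of discarding data from the better-represented group.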

First, we predicted the label of a single speaker based on all their utterances. We obtained 305 speakers each for classifying gender and around 50 each for age. The age corpus in particular is therefore almost too small for a classification task, so we caution the reader to take the resulting classification accuracy with a grain of salt. Using the features discussed in Section 3, we began by considering only lexical features, and then considered both lexical and non-lexical features. We then trained and tested (50/50) a logistic regression classifier to predict the gender and age of each speaker with 10-fold cross-validation. We obtained a classification accuracy of 71% for gender labels and 90% for age labels using only lexical features; after adding non-lexical features, the accuracy increased to 81% and 92% respectively.

Second, we also tried to predict the label of a speaker per individual turn. Similarly, we split the dataset into equal amounts of turns per category (around a million turns for gender, and around 200,000 for age), and trained a classifier using a …

Feature; counts by subcorpus; rank cells (table columns: Female, Male, Old, Young)

Minimal particles
  Positive responses and continuers (mm, mhm, mm hm, aha, uhu, uhuh, uh huh): gender n=86,098; age n=10,506; highest
  Turn stalling (hmm, hmmm): gender n=2,722; age n=132; highest
  Turn management (um, er, erm): gender n=98,442; age n=16,777; 62.8%, 77.3%, highest
  Repair initiators (hm?): gender n=195; age n=15; highest
  Change-of-state token (oh): gender n=96,566; age n=15,852; 80.6%, 62.4%, highest
Laughter: gender n=92,417; age n=15,603; 72.5%, 59.1%, 58.6%, highest
Pause (short, 1-5 sec): gender n=236,885; age n=29,703; highest
Truncated words: gender n=68,122; age n=11,065; 80%, 89%, highest
Overlaps (by total turns ratio): gender n=250,628; age n=46,285; 87%, 80%, 90%, highest

Table 3: Overview of non-lexical features in the dataset: minimal particles, turn-taking properties and other vocalizations (in relative frequencies, first rank is displayed as "highest" and rank 2-4 in percentage of first rank, blue and teal intensity indicates rank, n = counts of each feature by subcorpus)
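The per-speaker setup (bag-of-words features, logistic regression, k-fold cross-validation, with and without non-lexical tokens) can be sketched without external dependencies as below. This is an illustrative toy under stated assumptions, not the authors' code: the `NON_LEXICAL` inventory, the plain stochastic gradient descent training loop, the hyperparameters and the toy speakers are all made up for the example, and in practice one would likely use an established machine-learning library.

```python
import math
import random
from collections import Counter, defaultdict

# Hypothetical inventory of non-lexical tokens, for illustration only.
NON_LEXICAL = frozenset({"mm", "mhm", "hm", "hmm", "er", "erm", "um", "uh", "eh", "oh"})


def featurize(turns, keep_non_lexical=True):
    """Relative token frequencies over all turns of one speaker."""
    tokens = " ".join(turns).lower().split()
    if not keep_non_lexical:
        tokens = [t for t in tokens if t not in NON_LEXICAL]
    counts = Counter(tokens)
    total = sum(counts.values()) or 1
    return {tok: c / total for tok, c in counts.items()}


def train_logreg(X, y, epochs=200, lr=0.5, seed=0):
    """Binary logistic regression trained with stochastic gradient descent."""
    rng = random.Random(seed)
    weights, bias = defaultdict(float), 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            z = bias + sum(weights[f] * v for f, v in X[i].items())
            z = max(-30.0, min(30.0, z))  # guard against overflow in exp
            error = 1.0 / (1.0 + math.exp(-z)) - y[i]
            bias -= lr * error
            for f, v in X[i].items():
                weights[f] -= lr * error * v
    return weights, bias


def predict(model, x):
    weights, bias = model
    return 1 if bias + sum(weights.get(f, 0.0) * v for f, v in x.items()) >= 0 else 0


def cross_val_accuracy(X, y, k=5, seed=0):
    """Mean held-out accuracy over k folds."""
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)
    scores = []
    for fold in (order[i::k] for i in range(k)):
        held_out = set(fold)
        train = [i for i in order if i not in held_out]
        model = train_logreg([X[i] for i in train], [y[i] for i in train])
        scores.append(sum(predict(model, X[i]) == y[i] for i in fold) / len(fold))
    return sum(scores) / len(scores)


# Toy speakers whose talk differs only in minimal particles.
speakers = [(["eh yeah I know", "uh right"], 1)] * 10 + [(["mhm yeah I know", "hm right"], 0)] * 10
labels = [label for _, label in speakers]
acc_with = cross_val_accuracy([featurize(t) for t, _ in speakers], labels)
acc_without = cross_val_accuracy([featurize(t, keep_non_lexical=False) for t, _ in speakers], labels)
print(f"lexical + non-lexical: {acc_with:.2f}, lexical only: {acc_without:.2f}")
```

In this toy setting the two categories are distinguishable only through minimal particles, so filtering them out removes the entire signal. Real data is far less clear-cut, but the comparison mirrors the design of running the same classifier with and without non-lexical features.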

We examined gender and age salience and (stereo)typicality in British English conversation. The results of this pilot study show that a range of lexical, phrasal, non-lexical, and turn-taking-related features exhibit a tendency to appear more prominently across binary gender and age categories. We were especially interested in the use of non-lexical vocalizations, particles, exclamations and other turn-taking dynamics. Here, we found that female speakers produce significantly more and slightly longer turns. Talk by female speakers also tends to feature the minimal particles "huh", "hm" and "mm" more prominently. In contrast, male speakers' talk tends to be characterized by the minimal particles "eh", "uh" and "em". Overall, male speakers tend to produce shorter turns with fewer words and a higher type-token ratio.

Looking at generational differences, we found that young speakers laugh more, and that their turns overlap more and tend to feature more words typically related to swearing such as "shit", "fuck" or "fucking". Talk by old speakers tends to feature more truncated words and turn management tokens.

Based on such observations of characteristics across categories, we set up a classification task to predict gender and age labels of both single speakers and individual turns. We found that predicting speaker labels per conversation yields significantly higher classification accuracy in comparison to the prediction of labels for individual turns. This is likely due to the high number of very short turns that don't feature utterances with label bias. The classification results of around 80% for predicting speaker labels per conversation show that a simple logistic regression classifier does a reasonably good job even when confronted with "unstructured" and "messy" transcribed speech. Notably, we show that non-lexical utterances and minimal particles, which are often filtered out in dialog and speech corpus datasets, contribute to more accurate prediction.

Table 4: Results of category prediction as a binary classification task of "male" - "female", "young" - "old" labels, per single speaker and per turn. Columns: Features, Gender (Accuracy ± Std. Error), Age (Accuracy ± Std. Error). Rows: per conversation and per turn, each with a baseline of 50.

Despite the significant conceptual pitfalls that come with labelling participants for gender and age categories, we hope that our preliminary results yielded some interesting insights on gender and age (stereo)typicality in contemporary British English talk and will draw more attention to much-needed computational work based on authentic, real-world recordings instead of sterile, polished datasets.

In the real world, gender and age performances in talk-in-interaction are not classification tasks. Challenges for the big data approach to user modeling are plenty. For instance, a more comprehensive model needs to take into account that speakers perform gender and age differently across various conversational settings. When more datasets become available, a natural extension to the existing prediction study would be explorations of differences across various conversational compositions. Would we observe similar patterns in conversations with speakers of the same or different gender and age?

Another important extension to the current type of study is a more detailed exploration of turn-taking dynamics that looks into more fine-grained aspects of different types of actions in conversation. User modeling as a text classification problem yields good results for broad categories. But humans are often able to make informed guesses on very specific speaker traits based on the style or format of just one turn. Modelling this requires a more sophisticated model of turn types, conversational moves, and the fine-grained systematics of talk-in-interaction. However, quantitative methods are often not well suited to capture more subtle differences in how speakers format various action types, which leads to challenges of how to model the sequential unfolding of action in more detail (Liesenfeld, 2019a).

A critical challenge for the data-driven prediction of gender and age salience in talk is therefore how to take variation in formats of specific actions and activities into account, especially those that have been described as gendered or age-salient such as hedging or "troubles talk" (Lakoff 1973; Jefferson 1988). Focusing on specific actions would enable a more fine-grained analysis of how speakers negotiate their concepts of gender and age in interaction as part of specific sequences in conversation and how navigating these concepts in interaction relates to (stereo)typical gender and age salience.

References

Abouelenien, M., Pérez-Rosas, V., Mihalcea, R., and Burzo, M. (2017). Multimodal gender detection. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, pages 302–311.

Acton, E. K. (2011). On gender differences in the distribution of um and uh. University of Pennsylvania Working Papers in Linguistics, 17(2):2.

Argamon, S., Koppel, M., Fine, J., and Shimoni, A. R. (2003). Gender, genre, and writing style in formal written texts. Text & Talk, 23(3):321–346.

Baker, P. (2014). Using corpora to analyze gender. A&C Black.

Bird, S., Klein, E., and Loper, E. (2009). Natural language processing with Python: analyzing text with the natural language toolkit. O'Reilly Media, Inc.

Couper-Kuhlen, E. and Selting, M. (2017). Interactional linguistics: Studying language in social interaction. Cambridge University Press.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol 1, pages 4171–4186.

Herring, S. C. and Paolillo, J. C. (2006). Gender and genre variation in weblogs. Journal of Sociolinguistics, 10(4):439–459.

Honnibal, M. and Montani, I. (2017). Spacy 2: Natural language understanding with bloom embeddings, convolutional neural networks and incremental parsing. To appear, 7(1).

James, D. and Drakich, J. (1993). Understanding gender differences in amount of talk: A critical review of research. In Tannen, D., editor, Oxford studies in sociolinguistics. Gender and conversational interaction, page 281–312. Oxford University Press.

Jefferson, G. (1988). On the sequential organization of troubles-talk in ordinary conversation. Social problems, 35(4):418–441.

Joshi, C. K., Mi, F., and Faltings, B. (2017). Personalization in goal-oriented dialog. In NIPS 2017 Conversational AI Workshop, 4-9 Dec 2017.

Kessler, J. S. (2017). Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ. In Proceedings of ACL-2017 System Demonstrations, 30 July - 4 August 2017, Vancouver, Canada. Association for Computational Linguistics.

Lakoff, R. (1973). Language and woman's place. Language in society, 2(1):45–79.

Liesenfeld, A. (2019a). Action formation with janwai in Cantonese Chinese conversation. PhD thesis, Nanyang Technological University.

Liesenfeld, A. (2019b). Cantonese turn-initial minimal particles: annotation of discourse-interactional functions in dialog corpora. Proceedings of the 33rd Pacific Asia Conference on Language, Information and Computation (PACLIC 33).

Liesenfeld, A. and Huang, C. R. (2020). NameSpec Asks: What's Your Name in Chinese? A Voice Bot to Specify Chinese Personal Names through Dialog. In Proceedings of the 2nd Conference on Conversational User Interfaces, CUI '20, New York, NY, USA. Association for Computing Machinery.

Love, R., Dembry, C., Hardie, A., Brezina, V., and McEnery, T. (2017). The Spoken BNC2014: Designing and building a spoken corpus of everyday conversations. International Journal of Corpus Linguistics, 22(3):319–344.

McEnery, A. and Xiao, Z. (2004). Swearing in modern British English: the case of fuck in the BNC. Language and Literature, 13(3):235–268.

Oger, K. (2019). A Study of Non-Finite Forms of Anaphoric do in the Spoken BNC. Anglophonia. French Journal of English Linguistics, 28.

Prabhakaran, V. and Rambow, O. (2017). Dialog structure through the lens of gender, gender environment, and power. Dialogue & Discourse, 8(2):21–55.

Schofield, A. and Mehr, L. (2016). Gender-distinguishing features in film dialogue. In Proceedings of the Fifth Workshop on Computational Linguistics for Literature, 16 June, 2016, pages 32–39.

Skantze, G. (2017). Predicting and regulating participation equality in human-robot conversations: Effects of age and gender. In , pages 196–204. IEEE.

Tannen, D. (1990). You just don't understand: Women and men in conversation. Morrow New York.

Tannen, D. (1993). Gender and conversational interaction. Oxford University Press.

Wolters, M., Vipperla, R., and Renals, S. (2009). Age recognition for spoken dialogue systems: Do we need it? In Tenth Annual Conference of the International Speech Communication Association, 6-10 September 2009.

Xiao, R. and Tao, H. (2007). A corpus-based sociolinguistic study of amplifiers in British English.