Pardon the Interruption: An Analysis of Gender and Turn-Taking in U.S. Supreme Court Oral Arguments
Haley Lepp, Gina-Anne Levow
Educational Testing Service, University of Washington
[email protected], [email protected]

(This work was completed while the first author was a graduate student at the University of Washington.)

Abstract
This study presents a corpus of turn changes between speakers in U.S. Supreme Court oral arguments. Each turn change is labeled on a spectrum of "cooperative" to "competitive" by a human annotator with legal experience in the United States. We analyze the relationship between speech features, the nature of exchanges, and the gender and legal role of the speakers. Finally, we demonstrate that the models can be used to predict the label of an exchange with moderate success. The automatic classification of the nature of exchanges indicates that future studies of turn-taking in oral arguments can rely on larger, unlabeled corpora.

Index Terms: gender, speech, emotion recognition, computational linguistics
1. Introduction
The Supreme Court plays a key role in defining, identifying, and rooting out gender discrimination by hearing the cases that will determine the way gender rights are evaluated across the United States. However, there are few checks on the presence of gender bias within the court itself. This study offers a novel corpus of annotated speech from Supreme Court oral arguments and proposes a framework to analyze gender biases in turn changes.

For decades, scientists have argued that women are interrupted more than men in professional settings, indicating that this speech act could be an indicator of gender bias. The New York Times has described "being interrupted, talked over, shut down or penalized for speaking out" as "nearly a universal experience for women when they are outnumbered by men" [1]. Interruptions correlating with gender within Supreme Court oral arguments have occurred consistently over time, and are not necessarily due to political polarization or the personalities of justices [2]. However, in conversational turn-taking, an interruption is not inherently a negative act. As demonstrated by Tannen [3] in her research on gender and language, interruptions cannot be defined categorically as acts of rudeness or dominance. Interruptions can be part of regular discourse depending on the context of a conversation, and are especially common among speakers of certain social groups in the United States.

Furthermore, the term "interruption" is not a clear-cut linguistic term. Interruptions have variously been described as an overlap in speech between two speakers [4], possibly including backchannels [5], a "power type event" to wrest the discourse from the speaker [6], a "topic change attempt" [6], an event to "bolster the interruptee's positive face" [6], or a syntactically incomplete turn [7].

To address this, we annotate a corpus of turn changes as audio segments on a spectrum of cooperative to competitive. To demonstrate the utility of the corpus, we extract speech features and show that classifiers can automatically predict the human labels of the turn changes with relative success.
2. Audio and transcription retrieval
The transcripts and audio recordings of all U.S. Supreme Court oral arguments since October 2006 are publicly available online. The transcripts, written by court stenographers, are formatted like the script of a play, with the name of the speaker followed by the transcribed speech. The transcripts include disfluencies and speech that ends mid-sentence or mid-word [8]. The transcriptions do not include the time at which each statement is said, so we retrieve time stamps of turns or sentences (whichever are shorter) from The Oyez Project [9].
We define a turn change as an event in which one speaker stops speaking and a second speaker starts speaking according to the transcript. For example:

    Hannah S. Jurss: And so we're certainly asking for this Court's --
    John G. Roberts, Jr.: But I'm not faulting them for that. [10]

Using the start and end time-stamp of each speaker's turn, we segment each argument into short audio clips around turn changes. Multiple studies have demonstrated that listeners can perceive significant social and emotional information from a short slice of audio, despite not knowing the greater context of a conversation [11, 12]. The brevity of clips also ensures that the annotators, who are busy professionals, do not lose interest in the task. Also, because this study aims to find patterns in speech without regard to the subject matter of the case, limiting the content which an annotator can listen to can help avoid annotator bias.

The default length of a segment is six seconds: two seconds before the end label of the first speaker and four seconds after the start label of the next speaker. If the turn of the first speaker is less than two seconds long, then we use the start of the first speaker's turn as the start of the turn change, instead of a full two seconds of audio. If the turn of the second speaker is less than four seconds long, then we use the end of that speaker's turn as the end of the turn change.
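To make the windowing rule above concrete, here is a minimal sketch in Python; the function and argument names are hypothetical, and all times are assumed to be in seconds:

    def turn_change_window(first_start, first_end, second_start, second_end):
        """Audio window around a turn change (all times in seconds).

        Default: 2 s before the first speaker's end label and 4 s after
        the second speaker's start label, clipped so the window never
        reaches outside either speaker's own turn.
        """
        clip_start = max(first_end - 2.0, first_start)   # short first turn
        clip_end = min(second_start + 4.0, second_end)   # short second turn
        return clip_start, clip_end

    # A 1.2 s first turn followed by a long second turn:
    print(turn_change_window(10.0, 11.2, 11.3, 20.0))    # (10.0, 15.3)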
We manually check each segment and remove those in which at least one speaker is inaudible (probably due to the stenographer hearing something not picked up by the microphone), turns that are listed as separate in the transcripts but are the same person with a pause, and turns that are scripted, such as "Mr. Chief Justice, and may it please the court." We trim recordings by no more than one second if another adjacent turn change occurs that makes it unclear what an annotator might be labeling. We extend recordings by no more than one second if the change becomes more clear with extension; this is usually due to timestamp rounding cutting off a very short turn.

We update speaker names if the ordering is wrong or names were incorrect. For example, if a number of turns occur in quick succession or there are two or more speakers talking at the same time, we change the label so that the first and second speakers heard are the first and second speakers listed. Exchanges with fully overlapped speech are checked to ensure the order of speakers aligns with human perception. The corpus of turn changes and annotations is available for public use at https://github.com/hlepp/pardontheinterruption.

The corpus includes 711 turn changes from four oral arguments: Kahler v. Kansas [13], Mitchell v. Wisconsin [10], Virginia House of Delegates v. Bethune-Hill [14], and Washington State Dept. of Licensing v. Cougar Den Inc. [15]. A typical case is heard by nine justices (three female, six male) and two or three attorneys (the corpus includes seven female and five male). Each of the selected trials occurred in 2018 or 2019, covers a unique topic, and includes at least one female arguing before the court. (The latter qualification narrows the selection considerably: in 2018, only 15% of the people who argued before the court were women [16, 17]. Gender information was gathered from public profiles of speakers; an expansion of the corpus should ensure such characteristics align with speakers' self-identities.)

The number of turns per attorney in the corpus ranges from 27 to 128. For justices, each of whom appears in every oral argument, the number of turns per individual per trial ranges from 10 to 42, with one exception: Justice Clarence Thomas does not speak in any of the four arguments. Among justices, Justice Sonia Sotomayor and Justice Stephen Breyer are most represented, with over 130 turns each.

Table 1: Information about Annotated Segments

Corpus Component                  Number
Male Participants                 11 (with Justice Thomas)
Female Participants               9
Justice to Attorney exchanges     338
Attorney to Justice exchanges     351
Justice to Justice exchanges      22
Attorney to Attorney exchanges    0
Female to Female exchanges        127
Male to Male exchanges            269
Female to Male exchanges          165
Male to Female exchanges          150
3. Corpus annotation
The rules of conversational speech within a courtroom setting are not the same as those in informal conversational speech; power relationships, formal rules and procedures, and field-specific argument strategies are among the many factors that influence the ways that speakers interact within an oral argument. In the annotation process, all annotators are required to be U.S.-based and to identify as an attorney, judge, legal scholar, or law student in their second year or above.
We design a brief, easy-to-complete, anonymous online survey for legal professionals. The survey is built on the JavaScript library jsPsych [18]. We instruct participants to categorize the short clips on a spectrum of cooperative to competitive. Before beginning annotation, the participants are given descriptions of each category:

    By cooperative, we mean that to your ears, the first speaker expects a turn change and gives the floor to the second speaker. The second speaker might leave space for the first speaker to finish their turn. Or, the second speaker might talk at the same time as the first speaker, providing short spurts of feedback, for example saying "mhmm" or "yes".

    By competitive, we mean that to your ears, the second speaker competes with the first speaker for the chance to speak. You, the first speaker, or listeners might perceive this as a disruption to the previous speaker's speech. The second speaker may cause the first speaker to stop speaking, or talk over the first speaker to compete to be heard.

as well as trial exercises and example audio clips that could be classified into each category. The descriptions demonstrate a division between the two categories of turn changes, but make clear that the participant should use their training and experience to evaluate turn changes with nuance.

After this short training, the annotators proceed to the tasks. Each task includes the prompt "How competitive or cooperative do you perceive this exchange to be?", which emphasizes to the participant that the annotation should be based solely on their perception. Below this prompt is an audio element which the user can control and a slider showing a spectrum from Competitive to Cooperative with Likert-style category labels. The participants are instructed to leave the slider in the middle if the category of the clip is unclear. Each participant is given up to 26 segments to listen to (a number selected to keep the entire annotation exercise under five minutes). If a participant leaves the survey before completion, all results are still saved.
Each segment is annotated twice, by 2 of 77 unique annotators. Each annotator labeled between 1 and 26 segments, with a mean of 18 per person and a standard deviation of 6.2.

As the way listeners perceive speech differs depending on many factors, we ask participants to optionally share demographic data to demonstrate that the age, gender, ethnic, political, and linguistic diversity of listeners is relatively representative of the diversity of the United States. This information is included in the corpus.
Annotators score each audio segment on a visual spectrum, as seen in Figure 1. The location at which an annotator places the slider is codified as a score between 0 and 100, in which 0 represents the most competitive turn change and 100 the most cooperative turn change.

Figure 1: Slider given to annotators.
The distribution of labels in Figure 2 reflects the layout of the web interface. The highest peaks are at either end of the spectrum and directly in the middle. This phenomenon indicates that annotators move the slider all the way to one end when an audio clip clearly sounds competitive or cooperative. The annotators leave the slider in place if the audio does not clearly fall into a category. There are also middling peaks around where the survey interface has labels of "slightly competitive" and "slightly cooperative". These peaks show that annotators make use of the Likert-style guidelines, despite having the ability to drop the button anywhere on the slider.

Figure 2: The distribution of labels in the annotated corpus.
We evaluate the annotations under the assumption that if the audio files receive similar annotations from different people, then the annotation process can be considered reproducible [19]. Inter-annotator agreement on the raw labels, as well as on the labels categorized into five equally spaced bins, is shown in Table 2 and indicates moderate agreement on this task.

Table 2: Annotator agreement.

Labels       Metric
Raw          Spearman's ρ
Five bins    κ
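For illustration, agreement statistics of this kind can be computed as follows; this sketch assumes Cohen's κ for the binned labels (the paper does not name the specific κ statistic) and uses invented scores for two annotators:

    import numpy as np
    from scipy.stats import spearmanr
    from sklearn.metrics import cohen_kappa_score

    # Hypothetical raw 0-100 slider scores from two annotators per segment.
    annotator_a = np.array([0, 12, 50, 88, 100, 47, 75, 22])
    annotator_b = np.array([5, 20, 55, 90, 95, 50, 60, 30])

    # Agreement on the raw labels.
    rho, _ = spearmanr(annotator_a, annotator_b)

    # Agreement after categorizing into five equally spaced bins (0-4).
    edges = [20, 40, 60, 80]
    kappa = cohen_kappa_score(np.digitize(annotator_a, edges),
                              np.digitize(annotator_b, edges))

    print(f"Spearman's rho = {rho:.2f}; kappa on five bins = {kappa:.2f}")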
4. Data analysis
To investigate the relationship between gender and competitive turn-taking, we explore the distribution of turn change scores with respect to the gender of the first speaker in the turn. For comparison, we also consider the distribution of these turn change scores with respect to the role of the speaker (i.e., justice or attorney).

The mean score of an exchange when a woman is the first speaker is more competitive (45.0, with a standard deviation of 28.1) than when a man is the first speaker (49.7, with a standard deviation of 26.7). Alternatively, the mean label for a turn in which a woman is the second speaker is slightly more cooperative (51.5) than when a man is (45.0). More extreme is the difference between roles: the mean label for an attorney first speaker is 36.3, while a justice first speaker has a much more cooperative mean of 59.0. This is predictable considering the power differential and a culture of deference; attorneys would avoid speaking competitively to a justice, while justices would be much more likely to speak to an attorney competitively. There are no instances of attorney-to-attorney speech.

Figure 3 shows the distribution of labels for each speaker in the corpus who is the first speaker in a turn. The distribution illuminates the severity with which role aligns with turn change type. The eight speakers who have the highest means, i.e., who are spoken to most cooperatively, are all justices; those with the lowest means are all attorneys.

Figure 3: Distribution of labels for the first speaker in a turn.
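These per-group summaries are straightforward to reproduce with pandas; a sketch with hypothetical column names and invented scores:

    import pandas as pd

    # Hypothetical per-exchange records: mean annotator score
    # (0 = most competitive, 100 = most cooperative) plus metadata.
    df = pd.DataFrame({
        "score":        [12.0, 88.5, 45.0, 60.0, 30.5, 75.0],
        "first_gender": ["F", "M", "F", "M", "F", "M"],
        "first_role":   ["attorney", "justice", "attorney",
                         "justice", "attorney", "justice"],
    })

    # Mean and spread of scores by the first speaker's gender, then by role.
    print(df.groupby("first_gender")["score"].agg(["mean", "std"]))
    print(df.groupby("first_role")["score"].agg(["mean", "std"]))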
Based on the non-parametric Kruskal-Wallis test, we confirm a significant effect of both speaker gender and speaker role on competitiveness score. A Wilcoxon rank-sum test also shows this difference, with comparably low p-values.
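A sketch of both tests with SciPy, using invented score arrays grouped by the gender of the first speaker:

    import numpy as np
    from scipy.stats import kruskal, ranksums

    # Hypothetical competitiveness scores grouped by first-speaker gender.
    female_first = np.array([31, 45, 20, 55, 40, 38, 62, 27])
    male_first   = np.array([52, 60, 44, 71, 49, 58, 65, 47])

    h_stat, p_kw = kruskal(female_first, male_first)    # Kruskal-Wallis
    w_stat, p_rs = ranksums(female_first, male_first)   # Wilcoxon rank-sum

    print(f"Kruskal-Wallis p = {p_kw:.4f}; rank-sum p = {p_rs:.4f}")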
5. Turn Classification Experiments

We also investigate the automatic classification of turn changes based on acoustic and speaker cues. Effective classification could enable analysis of the relationship between gender and turn change at scale.

Using openSMILE, we extract time-aggregated features for each speaker in each audio segment in the corpus. We use two feature sets: the eGeMaPS collection of 88 psychologically-informed features, and the Speech Prosody collection of 36 pitch- and loudness-related features [20, 21]. We select eGeMaPS because of its demonstrated success in emotion recognition studies, and we select the Speech Prosody set because patterns in pitch and amplitude have been shown linguistically to differentiate between cooperative and competitive turn-taking [22, 23, 4, 7]. In each feature set, we also include the gender and role of the first and second speakers. We do not normalize pitch features for speaker gender. (We find that normalization of features by speaker gender harms predictive results, and leave it to future research to explore this phenomenon further.)

We divide the labeled corpus into two subsets: 80% of the corpus into a training set and 20% into an evaluation set. Each set has a comparable gender and role distribution across turn type; for example, 21% of the turns in the full corpus, training set, and evaluation set are male-to-female, and 49% of all turns in each set are attorney-to-justice.

To group the labels into classes, we take the mean of a segment's two raw labels given by annotators on a 0 to 100 scale, then categorize each segment into one of three quantile-based classes based on that mean: the most competitive, the most cooperative, and the middling exchanges.

We measure the effectiveness of Random Forest (RF) and Support Vector Machine classifiers (SVC with RBF kernel) in predicting whether an audio segment falls into the competitive and cooperative classes. We use the SciKit Learn Laboratory toolkit version 2.0 (SKLL; https://skll.readthedocs.io/en/latest/index.html), with features scaled by standard deviation and centered around a mean, and a micro-averaged F1 score as a grid search objective [24]. The training data is divided randomly into subsets for grid search for hyperparameter optimization.

We report two baselines. The first predicts a competitive turn in every instance in which the transcription of the first speaker's speech ends in a dash ("-"), indicating syntactic incompleteness, and a cooperative turn when there is no dash. Second, we report the micro-F1 score obtained by predicting the target class for all instances.

The classifiers are most successful at predicting the competitive turns, with the highest micro-F1 resulting from the Speech Prosody feature set. This may be due to the fact that high pitch and loudness are defining elements of a competitive turn change, while there may be more variation in what could define a cooperative turn change. The results without adding gender and role as features, as well as with prosodic features normalized by gender, are generally lower, but within 0.1 of the listed respective scores. The added performance due to these features could be due to the fact that gender and role do correlate with the class of segments.
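As an illustration of the extraction and binning steps, the sketch below assumes the opensmile Python package (the paper does not specify which openSMILE interface was used, and the Speech Prosody set would require its own openSMILE configuration, so only eGeMaPS is shown), together with pandas' quantile binning for the three classes:

    import opensmile
    import pandas as pd

    # Time-aggregated (functional) eGeMaPS features: one row per audio clip.
    smile = opensmile.Smile(
        feature_set=opensmile.FeatureSet.eGeMAPSv02,
        feature_level=opensmile.FeatureLevel.Functionals,
    )
    features = smile.process_file("clips/turn_change_0001.wav")  # hypothetical path

    # Three quantile-based classes from the mean of each segment's two labels.
    mean_labels = pd.Series([12.5, 48.0, 90.5, 33.0, 71.0, 55.5])  # illustrative
    classes = pd.qcut(mean_labels, q=3,
                      labels=["competitive", "middle", "cooperative"])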
Table 3: Micro-F1 score for baseline predictions of classes

Class          Dash    Target Class
Competitive
Cooperative
Table 4: Micro-F1 score for SVC and RF predictions of classes

Class          Model    eGeMaPS    Prosody
Competitive    SVC      0.636
               RF       0.617      0.640
Cooperative    SVC      0.551      0.561
               RF       0.593
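The paper runs its classifiers through SKLL; the sketch below reproduces a comparable protocol directly in scikit-learn, with a synthetic stand-in feature matrix and illustrative hyperparameter grids:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Synthetic stand-in for the 711 x (acoustic + gender/role) feature
    # matrix and the three quantile-based classes.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(711, 90))
    y = np.repeat(["competitive", "middle", "cooperative"], 237)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)

    # Features centered and scaled; micro-F1 as the grid-search objective.
    svc = GridSearchCV(
        make_pipeline(StandardScaler(), SVC(kernel="rbf")),
        param_grid={"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01]},
        scoring="f1_micro", cv=5,
    ).fit(X_train, y_train)

    rf = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid={"n_estimators": [100, 500], "max_depth": [None, 10]},
        scoring="f1_micro", cv=5,
    ).fit(X_train, y_train)

    print("SVC micro-F1:", svc.score(X_test, y_test))
    print("RF micro-F1:", rf.score(X_test, y_test))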
6. Conclusion
This study introduces a corpus of segments of speech from U.S. Supreme Court oral arguments that include a turn change between speakers. The segments, annotated by legal practitioners for competitiveness and cooperativeness, provide insight into the ways that justices and attorneys speak with one another in this unique speech setting. We find that as the first person in an exchange, female speakers and attorneys are spoken to more competitively than are male speakers and justices. We also find that female speakers and attorneys speak more cooperatively as the second person in an exchange than do male speakers and justices. We demonstrate that classifiers trained only on phonetic and acoustic features extracted from the audio segments can achieve a level of predictive accuracy above multiple baselines.

In-depth studies of gender bias and inequality are critical to the oversight of an institution as influential as the Supreme Court. While the models presented in this study analyze linguistic trends in relation to gender, the labeled corpus could be integrated with other demographic or content-related information to provide a fine-grained analysis of intersectional fields. There is demand in the social sciences for even broader analysis; within the first few months of 2020, several cross-cutting studies have criticized increasing bias in the Supreme Court and federal appeals courts, especially in regard to poverty and race [25, 26]. With improved predictive models, a larger set of turn changes across all Supreme Court oral argument recordings, and possibly other court recordings, could provide fodder for future statistical social science studies of speech trends in the U.S. judicial system.
7. Acknowledgments
We are grateful to the scholars who supported this cross-disciplinary study: Richard Wright, Keelan Evanini, Vikram Ramanarayanan, Victoria Zayats, and the board, reviewers, and participants in Widening NLP at ACL 2019. We also thank the legal professionals who annotated data and helped test the annotation survey. Finally, we express our appreciation for the public servants and activists who spend their daily lives fighting for equality and fairness in the U.S. Judicial System.

8. References

[1] S. Chira, "The Universal Phenomenon of Men Interrupting Women," The New York Times, Jun. 2017.
[2] T. Jacobi and D. Schweers, "Justice, Interrupted: The Effect of Gender, Ideology and Seniority at Supreme Court Oral Arguments," Virginia Law Review, vol. 103, no. 7, pp. 1379–1496, Nov. 2017.
[3] D. Tannen, Gender and Discourse. New York: Oxford University Press, 1994.
[4] L. Yang, "Visualizing Spoken Discourse," in Current and New Directions in Discourse and Dialogue, ser. Text, Speech and Language Technology, vol. 22. Springer, 2003.
[5] K. Laskowski, "Modeling norms of turn-taking in multi-party conversation," in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2010, pp. 999–1008.
[6] J. A. Goldberg, "Interrupting the discourse on interruptions: An analysis in terms of relationally neutral, power- and rapport-oriented acts," Journal of Pragmatics, vol. 14, pp. 883–903, Dec. 1990.
[7] A. Wichmann and J. Caspers, "Melodic cues to turn-taking in English: Evidence from perception," in Proceedings of the Second SIGdial Workshop on Discourse and Dialogue, 2001.
[8] Supreme Court of the United States, "Argument Transcripts." [Online]. Available: https://www.supremecourt.gov/oral_arguments/
[9] The Oyez Project. [Online]. Available: https://www.oyez.org
[10] Mitchell v. Wisconsin, No. 18-6210. United States Supreme Court, Jan. 21, 2019.
[11] N. Ambady, M. A. Krabbenhoft, and D. Hogan, "The 30-Sec Sale: Using Thin-Slice Judgments to Evaluate Sales Effectiveness," Journal of Consumer Psychology, vol. 16, pp. 4–13, 2006.
[12] N. Ambady and R. Rosenthal, "Half a minute: Predicting teacher evaluations from thin slices of nonverbal behavior and physical attractiveness," Journal of Personality and Social Psychology, vol. 64, pp. 431–441, 1993.
[13] Kahler v. Kansas, No. 18-6135. United States Supreme Court, Oct. 7, 2019.
[14] Virginia House of Delegates v. Bethune-Hill, No. 18-281. United States Supreme Court, Mar. 18, 2019.
[15] Washington State Dept. of Licensing v. Cougar Den Inc., No. 16-1498. United States Supreme Court, Oct. 30, 2018.
[16] K. S. Robinson and J. S. Rubin, "Women Argue Only a Fraction of Supreme Court Cases," Jan. 30, 2019.
[17] M. Walsh, "Number of Women Arguing Before the Supreme Court Has Fallen Off Steeply," American Bar Association Journal, Aug. 1, 2018.
[18] J. R. de Leeuw, "jsPsych: A JavaScript library for creating behavioral experiments in a web browser," Behavior Research Methods, vol. 47, pp. 1–12, 2015.
[19] R. Artstein, "Inter-annotator Agreement," in Handbook of Linguistic Annotation, N. Ide and J. Pustejovsky, Eds. Springer, Dordrecht, 2017, ch. 11, pp. 297–313.
[20] F. Eyben and B. Schuller, "openSMILE:) The Munich Open-Source Large-Scale Multimedia Feature Extractor," SIGMultimedia Records, vol. 6, no. 4, pp. 4–13, Jan. 2015. [Online]. Available: https://doi.org/10.1145/2729095.2729097
[21] F. Eyben, K. R. Scherer, B. W. Schuller, J. Sundberg, E. André, C. Busso, L. Y. Devillers, J. Epps, P. Laukka, S. S. Narayanan, and K. P. Truong, "The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for voice research and affective computing," IEEE Transactions on Affective Computing, vol. 7, no. 2, pp. 190–202, 2016.
[22] J. Gorisch, B. Wells, and G. Brown, "Pitch Contour Matching and Interactional Alignment Across Turns: An Acoustic Investigation," Language and Speech, vol. 55, pp. 57–76, Mar. 2012.
[23] K. Truong, "Classification of Cooperative and Competitive Overlaps in Speech Using Cues from the Context, Overlapper, and Overlappee," in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2013, pp. 1404–1408.
[24] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine Learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[25] R. R. Ruiz, R. Gebeloff, S. Eder, and B. Protess, "A Conservative Agenda Unleashed on the Federal Courts," The New York Times, Mar. 2020.
[26] A. Cohen, Supreme Inequality: The Supreme Court's Fifty-Year Battle for a More Unjust America. New York: Penguin Press, 2020.