Exploring Speech Cues in Web-mined COVID-19 Conversational Vlogs
Kexin Feng, [email protected]
Department of Computer Science and Engineering, Texas A&M University, College Station, Texas

Preeti Zanwar, [email protected]
Department of Epidemiology & Biostatistics, Texas A&M University, College Station, Texas

Amir H. Behzadan, [email protected]
Department of Construction Science, Texas A&M University, College Station, Texas

Theodora Chaspari, [email protected]
Department of Computer Science and Engineering, Texas A&M University, College Station, Texas
ABSTRACT
The COVID-19 pandemic caused by the novel SARS-Coronavirus-2 (n-SARS-CoV-2) has impacted people's lives in unprecedented ways. During the pandemic, vloggers have used social media to actively share their opinions or experiences in quarantine. This paper collected videos from YouTube to track emotional responses in conversational vlogs and their potential associations with events related to the pandemic. In particular, vlogs uploaded from locations in New York City were analyzed, given that this was one of the first epicenters of the pandemic in the United States. We observed common patterns in vloggers' acoustic and linguistic features across the time span of the quarantine, which are indicative of changes in emotional reactivity. Additionally, we investigated fluctuations of acoustic and linguistic patterns in relation to COVID-19 events in the New York area (e.g., the number of daily new cases, the number of deaths, and the extension of the stay-at-home order and state of emergency). Our results indicate that acoustic features, such as zero-crossing rate, jitter, and shimmer, can be valuable for analyzing emotional reactivity in social media videos. Our findings further indicate that some of the peaks of the acoustic and linguistic indices align with COVID-19 events, such as the peak in the number of deaths and the emergency declaration.
CCS CONCEPTS
• Applied computing → Sociology; • Human-centered computing → Empirical studies in collaborative and social computing.

KEYWORDS
COVID-19, pandemic, vlogging, YouTube, social media, speech
ACM Reference Format:
Kexin Feng, Preeti Zanwar, Amir H. Behzadan, and Theodora Chaspari. 2020. Exploring Speech Cues in Web-mined COVID-19 Conversational Vlogs. In Joint Workshop on Aesthetic and Technical Quality Assessment of Multimedia and Media Analytics for Societal Trends (ATQAM/MAST'20), October 12–16, 2020, Seattle, WA, USA. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3423268.3423584
INTRODUCTION

The COVID-19 disease caused by the novel SARS-Coronavirus-2 (n-SARS-CoV-2) emerged in late 2019, grew into a pandemic, and has infected tens of millions of people around the world [1]. The exact date of the first report of the COVID-19 disease and the origins of n-SARS-CoV-2 are under scientific investigation. Throughout this pandemic, governments have been encouraging people to stay home and reduce physical contact with others. This prolonged confinement and absence of face-to-face interaction, in combination with negative feelings of anxiety caused by the pandemic, is expected to result in significant emotional strain. Social media platforms, such as Weibo and Twitter, can potentially reveal individuals' emotional reactions to such impactful events and have been actively explored by researchers in social media analytics [12, 13].

In contrast to social networks and blogs, which rely mostly on written communication, conversational vlogs provide a valuable source of multimodal data for understanding subtle facets of emotion in communities and societies through the integration of spoken language and visual information. Conversational vlogs refer to a specific type of vlogging in which all or part of the shots depict a single person facing and talking to the camera [5]. The richness of multimodal information presented in conversational vlogs can potentially provide a better understanding of the vlogger's attitude, feelings, and emotions compared to written text. Additionally, interactive cues around vlog videos, such as comments and the number of upvotes or downvotes, can help identify how attitudes and emotions in the vlogs propagate to the world. While a few previous studies have conducted public sentiment analysis over a short period of the pandemic based on written text in social media [16], to the best of our knowledge, multimodal analysis of conversational vlogs aimed at a better understanding of public sentiment during the pandemic has not been examined.
To fill this gap, we collected 463 conversational vlogs from New York City, a major epicenter of the pandemic in the United States (U.S.). The examined vlogs were uploaded between March 13, 2020 (the date of the national emergency declaration by the U.S. government) and June 1, 2020 [2, 8]. We then applied a speech pre-processing pipeline to obtain acoustic features indicative of prosodic changes on a weekly basis. We further analyzed the frequency of the words in the title and description of the YouTube videos to obtain a set of linguistic descriptors. Analysis of the acoustic and linguistic data revealed significant fluctuations across the span of the 11 weeks, some of which aligned with significant COVID-19 events, such as the peak in the number of deaths in New York City, as well as the stay-at-home order. Our pilot study provides preliminary insights into vloggers' emotional reactions during the period of the COVID-19 pandemic and contributes to a better understanding of emotion type and propagation in social media videos.
RELATED WORK

YouTube has been widely used by researchers in computing and social science because it is a rich source of naturalistic and diverse real-life data [15, 20]. Vlogging became a social trend after 2010 [10]. Conversational vlogs are a specific type of vlogging in which an individual typically talks to a camera to share their ideas, views, or expertise about a topic of interest. This has rendered the content of vlogs a valuable source for researchers, enabling a better understanding of people's behavior in social media [10]. Biel et al. used verbal and non-verbal cues of the vlogger to estimate the amount of social attention that a vlog would receive [5]. Integrating vloggers' personality scores can further increase the accuracy of this task [4]. Biel et al. further performed crowdsourcing experiments to investigate how vloggers were perceived by their audience [6]. Researchers have also performed sentiment analysis using comments posted under YouTube videos and reviews posted for movies [7, 11, 17, 22].

Although previous studies have performed sentiment analysis on YouTube, conversational vlogs and the tracking of vloggers' emotions remain underexplored. This motivated us to explore the possible impact of social events on vloggers. To the best of our knowledge, our study is the first to investigate vloggers' emotions during the period of COVID-19 and their potential association with significant events of the pandemic. Our pilot study examines data collected from vloggers in New York, an early epicenter of the pandemic in the U.S. [8]. Findings from our work could provide a better understanding of how emotion propagates in social media during large-scale emergencies and life-changing events.
DATA COLLECTION AND PROCESSING

In this section, we discuss the methods used to collect and process conversational vlogs from YouTube.
Data Collection

Given the recency of the COVID-19 pandemic and the absence of existing datasets of vlogs recorded during this period, we collected a new dataset from YouTube. Due to limitations of the YouTube API for collecting videos at a large scale (e.g., many retrieved vlogs are from people who do not reside in the U.S., which is outside the scope of our goal), we applied the Selenium WebDriver browser-automation tool to obtain YouTube videos [3]. Selenium is able to simulate human search behavior in a browser window on the YouTube website by scrolling down to the end of the search results while tracking them. In this way, the number of overseas videos is significantly decreased, since the results of such a traditional search are biased toward the searcher's region. To maximize the number of retrieved videos, our queries combined three keyword components related to the event (e.g., COVID-19), the behavior (e.g., vlog), and the location (e.g., New York), as listed in Table 1, resulting in a total of 18 combinations.

Table 1: Components relevant to keywords used in the Selenium tool for video mining.

Event       Behavior   Location
quarantine  vlog       New York
covid-19    vlogger    NY
pandemic    vlogging

After removing duplicate videos from the search results, we obtained 4,265 videos potentially relevant to COVID-19 vlogs in New York City. We further collected video information, including title, description, duration, date published, number of views, and number of upvotes and downvotes, which can serve as cues based on previous research [5]. We then filtered out videos published before March 13th, the date when the U.S. national emergency was declared, resulting in a total of 3,021 videos. The end date is June 1st, when we performed the data collection. We manually examined each of those videos to make sure it satisfies the following requirements:

• Part of the video displays a conversational shot.
• The video has low or no background music or noise.
• The video is not recorded overseas.

This process resulted in a total of 463 valid videos. Due to the subjectivity of this task, 13.24% of the 3,021 videos (400 videos) were cross-examined by an additional annotator, yielding a Cohen's kappa coefficient of 0.703. The main reasons for label disagreement were the annotators' subjective judgments about the background music level and the proportion of conversational shots in each video. In order to obtain the videos recorded within New York, we further selected the videos that included "NY," "NYC," or "New York" in the title or corresponding description, which yielded a final number of 278 videos used in the rest of the analysis.
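For concreteness, a minimal sketch of this mining step is given below. It assumes the Selenium 4 Python API and that YouTube result links carry the `video-title` element id; the latter reflects the page markup at the time of writing and may change. This is an illustration under those assumptions, not the exact crawler used in the study.

```python
from itertools import product
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

# The three keyword components of Table 1: 3 x 3 x 2 = 18 queries.
events = ["quarantine", "covid-19", "pandemic"]
behaviors = ["vlog", "vlogger", "vlogging"]
locations = ["New York", "NY"]

driver = webdriver.Chrome()
videos = {}  # url -> title, deduplicated across the 18 queries

for event, behavior, location in product(events, behaviors, locations):
    query = f"{event} {behavior} {location}"
    driver.get("https://www.youtube.com/results?search_query="
               + query.replace(" ", "+"))
    # Simulate a human scrolling toward the end of the result list so
    # that YouTube keeps loading additional results.
    for _ in range(20):
        driver.execute_script("window.scrollBy(0, 10000);")
        time.sleep(1.0)
    # 'video-title' is an assumption about YouTube's result-page DOM.
    for link in driver.find_elements(By.ID, "video-title"):
        url = link.get_attribute("href")
        if url:
            videos[url] = link.get_attribute("title")

driver.quit()
print(f"collected {len(videos)} unique candidate videos")
```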
Speech Pre-Processing

The audio data obtained in our study may contain multiple speakers, as well as non-speech segments corresponding to background noise and music. To address this challenge, we manually labeled a 5-second reference audio of the target speaker within each video and performed speaker diarization by calculating the similarity between the reference audio and each target analysis window. Similarity metrics were computed in a 256-dimensional d-vector space derived from a deep learning model [21]. A similarity score of 0.65 was used as the threshold, and the window size was set to 125 milliseconds. By introducing the reference audio, this speaker diarization step can effectively identify non-speech segments or speech segments that do not belong to the target speaker.
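A minimal sketch of this diarization step follows. The `embed` function is a hypothetical placeholder for a 256-dimensional d-vector speaker-embedding model such as the GE2E network of Wan et al. [21]; the 125 ms windowing and the 0.65 similarity threshold follow the description above.

```python
import numpy as np

SAMPLE_RATE = 16000
WIN = int(0.125 * SAMPLE_RATE)   # 125 ms analysis windows
THRESHOLD = 0.65                 # similarity cutoff from the paper

def embed(wav: np.ndarray) -> np.ndarray:
    """Hypothetical placeholder: return a 256-dim d-vector for a
    waveform, e.g. from a GE2E model as in Wan et al. [21]."""
    raise NotImplementedError

def target_speaker_mask(audio: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Boolean mask over 125 ms windows: True where the window is
    similar enough to the 5-second reference of the target speaker."""
    ref_vec = embed(reference)
    ref_vec = ref_vec / np.linalg.norm(ref_vec)
    keep = []
    for start in range(0, len(audio) - WIN + 1, WIN):
        win_vec = embed(audio[start:start + WIN])
        win_vec = win_vec / np.linalg.norm(win_vec)
        # Cosine similarity between unit-normalized embeddings.
        keep.append(float(ref_vec @ win_vec) >= THRESHOLD)
    return np.array(keep)
```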
Acoustic Features

The acoustic features focus on capturing prosodic changes, which are indicative of emotional information [19]. We extracted four prosodic features, namely loudness, zero-crossing rate (ZCR), jitter, and shimmer, along with four statistical descriptors of these features: the mean, standard deviation, skewness, and slope coefficient of a linear regression fit. For ZCR, we also extracted the minimum and maximum values, since these two descriptors have been used in other emotion-related tasks [18]. To avoid the influence of possible extreme values, we first segmented the speech signal into 125 ms windows and subsequently obtained the statistics of the prosodic descriptors using a 25 ms frame with a 10 ms shift, computed with the OpenSmile toolbox [9]. Finally, we averaged each descriptor over all windows. As a result, an 18-dimensional speech feature vector was extracted for each video.
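The statistics computation can be sketched as follows, assuming the per-frame low-level descriptors (25 ms frames, 10 ms shift) have already been extracted, e.g., with OpenSmile. This is an illustrative reconstruction of the procedure described above, not the exact code used in the study.

```python
import numpy as np
from scipy.stats import skew

def window_stats(lld: np.ndarray, frames_per_window: int = 12) -> np.ndarray:
    """Given one low-level descriptor value per frame (25 ms frames,
    10 ms shift, so ~12 frames per 125 ms window), compute the mean,
    standard deviation, skewness, and linear-regression slope inside
    each 125 ms window, then average these statistics across windows."""
    stats = []
    for start in range(0, len(lld) - frames_per_window + 1, frames_per_window):
        w = lld[start:start + frames_per_window]
        # polyfit returns [slope, intercept] for a degree-1 fit.
        slope = np.polyfit(np.arange(len(w)), w, deg=1)[0]
        stats.append([w.mean(), w.std(), skew(w), slope])
    return np.mean(stats, axis=0)  # 4 statistics per descriptor

# Applying window_stats to loudness, ZCR, jitter, and shimmer yields
# 4 x 4 = 16 values; adding the min and max of ZCR gives the
# 18-dimensional per-video feature described above.
```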
Linguistic Features

As an additional source of information, we analyzed the words in the title and video description of each video and measured the word frequency. We further considered the words with the highest frequency as linguistic measures.
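A minimal sketch of this word-frequency measure is shown below. It assumes, as the values in Table 3 suggest, that counts are normalized by the number of videos in the group (so a frequency above 1 means the word appears more than once per video on average); the stopword list is purely illustrative.

```python
import re
from collections import Counter

STOPWORDS = frozenset({"the", "a", "in", "of", "and", "to", "my", "i"})

def top_words(texts: list[str], k: int = 5) -> list[tuple[str, float]]:
    """Count word occurrences across a group of video titles and
    descriptions, normalize by the number of videos, and return the
    k most frequent words, as in Table 3."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z0-9]+", text.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return [(w, c / len(texts)) for w, c in counts.most_common(k)]
```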
COVID-19 Event Data

To explore the connections between social media reactions and the spread of COVID-19, we collected data related to COVID-19 statistics. In particular, we used NYC Open Data and measured the daily numbers of new cases, new deaths, and hospitalized patients [14]. We only included data after March 13th to match the start of the search period for the extracted videos, as visualized in Fig. 1.

Figure 1: Visualization of the number of daily (a) new COVID-19 cases and (b) new deaths from COVID-19 in New York City. The x-axis is the number of days from March 13th, while the y-axis indicates the number of people.
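For reference, the daily counts behind Fig. 1 can be retrieved programmatically through the Socrata endpoint that backs the NYC Open Data page cited above; the dataset id (rc75-m7u3) comes from that URL, while the field names below are assumptions about the dataset schema and should be checked against the live data dictionary.

```python
import requests

# Socrata JSON endpoint for the NYC daily-counts dataset [14].
url = "https://data.cityofnewyork.us/resource/rc75-m7u3.json"
rows = requests.get(url, params={"$limit": 5000}).json()

# Field names are assumed from the dataset's column labels.
for row in rows[:3]:
    print(row.get("date_of_interest"),
          row.get("case_count"),
          row.get("death_count"),
          row.get("hospitalized_count"))
```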
EXPERIMENTAL SETUP

Our experiment aims to determine whether there is a potential connection between people's reactions in social media vlogs and the spread of COVID-19. Since there is generally a delay between video recording and online posting, we clustered the data into weekly bins, which is likely to absorb this delay. More specifically, starting from March 13th, we grouped videos published every 7 days, which resulted in 11 time periods ending on June 1st. The start date and number of videos in each time period can be found in Table 2. For each week, we calculated the average of each acoustic feature to obtain an 18-dimensional weekly prosodic representation (Fig. 2). The most frequent words in the title and video description and their corresponding frequencies over each week were also examined. We list the most frequent words within each week in Table 3, and plot the largest frequency among the target words (e.g., the word "quarantine" is the most frequent in most weeks) across weeks to explore potential connections with the spread of COVID-19, as shown in Fig. 3.
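The weekly binning can be sketched with pandas as follows; the file and column names are hypothetical placeholders for the video metadata and the acoustic features described above.

```python
import pandas as pd

# One row per video; 'published' and the feature columns are
# hypothetical names standing in for the collected metadata and the
# 18 acoustic features.
df = pd.read_csv("vlogs.csv", parse_dates=["published"])

start = pd.Timestamp("2020-03-13")
df["week"] = (df["published"] - start).dt.days // 7 + 1  # weeks 1..11

# Average every acoustic feature within each weekly bin.
weekly = df.drop(columns=["published"]).groupby("week").mean()
print(weekly[["jitter_linreg", "shimmer_linreg", "zcr_linreg"]])
```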
RESULTS AND DISCUSSION

After plotting the variation of the 18 acoustic features across all the weeks, common patterns emerge in some of the extracted features. In particular, we found that the slope of the linear regression fit is the most consistent descriptor compared with the mean, skewness, or standard deviation, potentially due to its ability to capture temporal trends. Among the four acoustic features (loudness, ZCR, jitter, and shimmer), loudness was not able to reveal meaningful patterns. A possible reason is that loudness depends strongly on the recording conditions (e.g., the microphone and the distance from it), making it highly variable across videos. Based on Fig. 2, three peaks, at weeks 3, 5, and 8-9, can be observed in the other three acoustic features.

Next, we explored the change of word frequency across those weeks. As observed in Fig. 3, the frequency of negative words relevant to COVID-19 is higher in weeks 3, 5, and 8-9. In the remaining weeks, the frequency of such words is rarely greater than 1. The word frequency analysis is thus consistent with the change observed in the acoustic features.

Finally, we explored potential connections between the acoustic and linguistic trajectories and the spread of COVID-19. According to Fig. 1, the number of daily new cases in New York City peaked around week 3 (day 20), while the number of daily new deaths peaked around week 5 (day 35). During weeks 8 and 9, the Governor of New York extended the PAUSE order as well as the state of emergency for New York State, which lengthened the quarantine period [23]. Although we cannot draw direct comparisons, since the acoustic and linguistic measures might be confounded by multiple factors, we observed similar spikes in weeks 3, 5, and 8-9 in some of the acoustic data. These trajectory similarities might suggest that COVID-19-related events can influence vloggers' sentiments, but a more thorough analysis is needed to better understand the contextual factors causing such spikes in the acoustic and linguistic trends.

Even though our results indicate fluctuations in the acoustic and linguistic features that might be relevant to COVID-19 events, our study has several limitations. First, the data collection step relies heavily on the YouTube video search results, which may introduce bias, such as more popular videos being retrieved first. Also, even though we took measures to ensure that all the videos used in our study are from New York, we cannot verify whether a given video is from New York City or elsewhere in the state of New York. Finally, contextual factors that capture the content of each video have to be taken into account in a more thorough analysis.
Table 2: The starting date and number of videos within each group (week).

week number       1        2        3        4        5        6        7        8        9        10       11
starting date     3/13/20  3/20/20  3/27/20  4/3/20   4/10/20  4/17/20  4/24/20  5/3/20   5/11/20  5/18/20  5/26/20
number of videos  20       51       23       26       30       14       10       18       28       27       31
Table 3: Summary of the most frequently used words and their corresponding frequency over each week, in decreasing order of frequency.

week 1:  vlog 1.15, show 0.90, new 0.80, nyc 0.65, coronavirus 0.60
week 2:  new 1.12, vlog 1.08, show 1.02, quarantine 0.94, york 0.80
week 3:  vlog 1.13, coronavirus 1.13, show 1.00, new 1.00, nyc 0.87
week 4:  vlog 1.12, quarantine 1.00, show 1.00, new 0.69, nyc 0.62
week 5:  vlog 1.57, quarantine 1.37, new 1.167, show 1.07, york 0.90
week 6:  vlog 1.43, show 1.14, nyc 1.00, quarantine 0.93, 19 0.64
week 7:  new 1.50, vlog 1.40, york 1.30, quarantine 0.90, family 0.90
week 8:  vlog 1.44, quarantine 1.06, show 0.94, new 0.72, nyc 0.61
week 9:  new 2.11, life 2.04, day 1.75, york 1.75, quarantine 1.25
week 10: show 1.07, vlog 1.04, nyc 1.00, quarantine 0.81, new 0.70
week 11: vlog 1.06, show 1.06, new 0.87, nyc 0.84, quarantine 0.61

The COVID-19-related words are marked in italic in the original table.
Figure 2: Visualization of (a) jitter linreg, (b) shimmer linreg, (c) ZCR linreg, and (d) ZCR mean extracted from videos. The x-axis is the number of weeks from the initial video date (March 13th), while the y-axis indicates the value of the feature. Linreg refers to the slope of the linear regression to which the corresponding measure was fitted.
Figure 3: Visualization of the largest frequency of the target words across weeks. The x-axis is the number of weeks from March 13th, while the y-axis indicates the word frequency.
CONCLUSIONS

In this paper, we explored the possibility of understanding public sentiment during the COVID-19 pandemic through multimodal social media content. We selected New York City because it was one of the first epicenters of the COVID-19 pandemic. We collected our own dataset from YouTube and provided a complete pipeline for pre-processing and analyzing real-life audio data. We then extracted acoustic features and observed common patterns in three of them (jitter, shimmer, and ZCR). This pattern was also consistent with a word frequency analysis performed on the video titles and descriptions, and can potentially be explained by the timing of major COVID-19-related events occurring in New York City during this period.

As part of our future work, we plan to extend our study to additional geographical locations and to explore the influence of gender, age, and other potential factors on viewers' reactions to social media content. Finally, we will analyze additional cues from the videos, such as facial expressions and linguistic cues obtained from the vloggers' speech, to obtain a better understanding of social media videos.
ACKNOWLEDGMENTS

This work was supported by the Texas A&M Institute of Data Science (TAMIDS) through the Data Resource Development Program. The authors would like to thank Alexandria Curtis, Texas A&M Computer Science & Engineering student, for her help in annotating the conversational vlogs.
REFERENCES
[3] Selenium WebDriver Practical Guide. Packt Publishing Ltd.
[4] Joan-Isaac Biel, Oya Aran, and Daniel Gatica-Perez. 2011. You are known by how you vlog: Personality impressions and nonverbal behavior in YouTube. In Fifth International AAAI Conference on Weblogs and Social Media.
[5] Joan-Isaac Biel and Daniel Gatica-Perez. 2011. VlogSense: Conversational behavior and social attention in YouTube. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 7, 1 (2011), 1–21.
[6] Joan-Isaac Biel and Daniel Gatica-Perez. 2012. The good, the bad, and the angry: Analyzing crowdsourced impressions of vloggers. In Sixth International AAAI Conference on Weblogs and Social Media.
[7] Yen-Liang Chen, Chia-Ling Chang, and Chin-Sheng Yeh. 2017. Emotion classification of YouTube videos. Decision Support Systems.
[9] Florian Eyben, Martin Wöllmer, and Björn Schuller. 2010. openSMILE: The Munich versatile and fast open-source audio feature extractor. In Proceedings of the 18th ACM International Conference on Multimedia. 1459–1462.
[10] Wen Gao, Yonghong Tian, Tiejun Huang, and Qiang Yang. 2010. Vlogging: A survey of videoblogging technology on the web. ACM Computing Surveys (CSUR) 42, 4 (2010), 1–57.
[11] Mousannif Hajar et al. 2016. Using YouTube comments for text-based emotion recognition. Procedia Computer Science 83 (2016), 292–299.
[12] Bennett Kleinberg, Isabelle van der Vegt, and Maximilian Mozes. 2020. Measuring emotions in the COVID-19 real world worry dataset. arXiv preprint arXiv:2004.04225 (2020).
[13] Sijia Li, Yilin Wang, Jia Xue, Nan Zhao, and Tingshao Zhu. 2020. The impact of COVID-19 epidemic declaration on psychological consequences: A study on active Weibo users. International Journal of Environmental Research and Public Health 17, 6 (2020), 2032.
[14] Department of Health and Mental Hygiene (DOHMH). 2020. COVID-19 Daily Counts of Cases, Hospitalizations, and Deaths: NYC Open Data. https://data.cityofnewyork.us/Health/COVID-19-Daily-Counts-of-Cases-Hospitalizations-an/rc75-m7u3
[15] Esteban Real, Jonathon Shlens, Stefano Mazzocchi, Xin Pan, and Vincent Vanhoucke. 2017. YouTube-BoundingBoxes: A large high-precision human-annotated data set for object detection in video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5296–5305.
[16] Jim Samuel, GG Ali, Md Rahman, Ek Esawi, Yana Samuel, et al. 2020. COVID-19 public sentiment insights and machine learning for tweets classification. Information 11, 6 (2020), 314.
[17] Julio Savigny and Ayu Purwarianti. 2017. Emotion classification on YouTube comments using word embedding. IEEE, 1–5.
[18] Björn Schuller, Stefan Steidl, and Anton Batliner. 2009. The INTERSPEECH 2009 emotion challenge. In Tenth Annual Conference of the International Speech Communication Association.
[19] Björn Schuller, Bogdan Vlasenko, Florian Eyben, Gerhard Rigoll, and Andreas Wendemuth. 2009. Acoustic emotion recognition: A benchmark comparison of performances. IEEE, 552–557.
[20] Anjana Susarla, Jeong-Ha Oh, and Yong Tan. 2012. Social networks and the diffusion of user-generated content: Evidence from YouTube. Information Systems Research 23, 1 (2012), 23–41.
[21] Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno. 2018. Generalized end-to-end loss for speaker verification. In ICASSP. IEEE, 4879–4883.
[22] Martin Wöllmer, Felix Weninger, Tobias Knaup, Björn Schuller, Congkai Sun, Kenji Sagae, and Louis-Philippe Morency. 2013. YouTube movie reviews: Sentiment analysis in an audio-visual context.