HEMVIP: Human Evaluation of Multiple Videos in Parallel
PATRIK JONELL, Div. of Speech, Music and Hearing, KTH Royal Institute of Technology, Sweden
YOUNGWOO YOON, ETRI & KAIST, Republic of Korea
PIETER WOLFERT, IDLab, Ghent University – imec, Belgium
TARAS KUCHERENKO, Div. of Robotics, Perception and Learning, KTH Royal Institute of Technology, Sweden
GUSTAV EJE HENTER, Div. of Speech, Music and Hearing, KTH Royal Institute of Technology, Sweden
In many research areas, for example motion and gesture generation, objective measures alone do not provide an accurate impression of key stimulus traits such as perceived quality or appropriateness. The gold standard is instead to evaluate these aspects through user studies, especially subjective evaluations of video stimuli. Common evaluation paradigms either present individual stimuli to be scored on Likert-type scales, or ask users to compare and rate videos in a pairwise fashion. However, the time and resources required for such evaluations scale poorly as the number of conditions to be compared increases. Building on standards used for evaluating the quality of multimedia codecs, this paper instead introduces a framework for granular rating of multiple comparable videos in parallel. This methodology essentially analyses all condition pairs at once. Our contributions are 1) a proposed framework, called HEMVIP, for parallel and granular evaluation of multiple video stimuli and 2) a validation study confirming that results obtained using the tool are in close agreement with results of prior studies using conventional multiple pairwise comparisons.

CCS Concepts: • Human-centered computing → Human computer interaction (HCI).

Additional Key Words and Phrases: evaluation paradigms, video evaluation, conversational agents, gesture generation
Owing to the difficulties of objectively quantifying human perception and preference, user studies have become the canonical way to evaluate stimuli in many fields. This includes aspects of human-computer interaction such as video stimuli of synthetic gesture motion for avatars and social robots [1, 10, 13, 21]. There exist several methods for performing such evaluations, particularly Likert scales [15] and pairwise preference tests. These approaches do not, however, scale well when comparing many different conditions, for example when performing ablation studies or benchmarking a new system against other available approaches. This can hinder ablation studies and prevent proper benchmarking against the state of the art.

This paper proposes a novel method for evaluating comparable video stimuli from multiple conditions. Our method is inspired by an evaluation standard called MUSHRA (MUltiple Stimuli with Hidden Reference and Anchor) [7], which is widely used for identifying subtle differences in audio quality. We present HEMVIP (Human Evaluation of Multiple Videos in Parallel), which adapts MUSHRA to the evaluation of video stimuli, and validate our proposal by comparing results obtained from HEMVIP against a previous evaluation of videos of generated nonverbal behavior for a virtual agent [13]. We chose this particular study because it has already been reproduced once [11] and has video stimuli and study results publicly available (see https://svito-zar.github.io/gesticulator).
Authors’ addresses: Patrik Jonell, [email protected], Div. of Speech, Music and Hearing, KTH Royal Institute of Technology, Stockholm, Sweden; Youngwoo Yoon, [email protected], ETRI & KAIST, Daejeon, Republic of Korea; Pieter Wolfert, [email protected], IDLab, Ghent University – imec, Ghent, Belgium; Taras Kucherenko, [email protected], Div. of Robotics, Perception and Learning, KTH Royal Institute of Technology, Stockholm, Sweden; Gustav Eje Henter, [email protected], Div. of Speech, Music and Hearing, KTH Royal Institute of Technology, Stockholm, Sweden.
Subjective evaluations of video stimuli are often carried out using Mean Opinion Scores (MOS) [9], Likert-type scales [15, 18], or relative preference tests. In this paper, we use the problem of evaluating nonverbal-behavior-generation systems for embodied conversational agents (ECAs) as a running example, where both MOS (cf. [6, 12, 21]) and pairwise preference tests (cf. [2, 13]) are commonplace; see [20] for a comprehensive review. In MOS tests, participants rate individual stimuli (in our case, videos) on a discrete scale, e.g., 1 through 5. A proper Likert-scale evaluation requires many such judgments [18]. Since videos are rated in isolation, raters may struggle to notice minor differences between stimuli and to apply a consistent standard to stimuli presented at different points during the test. Relative preference tests are easier than MOS for participants, as selection is easier than scoring [3], and such tests are usually better at identifying subtle differences. However, relative preference tests introduce additional design choices, e.g., how many pairs to compare and in which combinations. The binary nature of many preference tests means that responses are relatively information-poor, which makes it harder to verify that two conditions are statistically different. Moreover, neither of these two evaluation schemes scales well with the number of systems to be evaluated.

The parallel field of audio-quality evaluation has long used both MOS and pairwise preference tests [19], but also a more recent standard for comparing multiple audio systems called MUSHRA [7]. While originally proposed for comparing audio coding systems, MUSHRA-like tests have also been shown to be more efficient for evaluating text-to-speech systems than comparable MOS tests [16]. In a MUSHRA test, multiple comparable stimuli are presented to the listener on the same page, and listeners rate each of the stimuli individually in the context of the other, comparable stimuli (e.g., stimuli generated from the same input text or audio). The main benefit of MUSHRA over the MOS methodology is that related stimuli are assessed together, which makes differences easier to spot. Unlike pairwise tests, many systems can be compared at once. In contrast to both MOS and preference tests, a high-resolution scale is used, which can resolve smaller differences and supports more sensitive statistical analyses. Recent software for conducting MUSHRA tests online [17] makes it easy to evaluate synthetic audio using crowdsourcing platforms.

Our proposal also has elements in common with the MUSHRA-derived ITU SAMVIQ (Subjective Assessment Methodology for Video Quality) standard [8], which, however, was withdrawn in early 2020. SAMVIQ was proposed for evaluation of video codec quality and functions similarly to MUSHRA, using hidden references and anchors. Our proposal instead focuses on adapting the MUSHRA approach to the evaluation of video stimuli, in order to meet the needs of subjective evaluations in, e.g., nonverbal behavior generation for ECAs and related problems in human-computer interaction, and on releasing an off-the-shelf web-based evaluation toolkit to the community in the future.
The HEMVIP framework is an extension of the MUSHRA standard [7], but instead of audio recordings we assess multiple comparable videos together, using the same setup of parallel rating sliders. The core aspects of the MUSHRA standard that HEMVIP inherits are 1) joint, parallel scoring of comparable stimuli, e.g., ones intended for the same context or corresponding to the same system input, which makes HEMVIP scale well for multiple comparisons; 2) blind judgments, in that none of the systems being compared are labeled and the order typically is random; and 3) the use of highly granular ratings entered via a slider GUI (0–100 being the default). There are also some differences from MUSHRA that go beyond the fact that HEMVIP uses video instead of audio. Unlike audio-codec evaluations, where the sampling frequency can be reduced to degrade quality, there is no canonical way of defining what should constitute objectively poor and degraded gesture motion, in order to define low-end anchor stimuli. HEMVIP thus does not mandate a specific low-end anchor, but one can easily be included if it exists. Other differences from MUSHRA are the absence of an explicit reference, since in many cases participants easily can identify the ground truth among the evaluation stimuli anyhow (cf. Table 1). Furthermore, there is no requirement to rate the perceived best stimulus on each page a perfect 100, since limitations in motion capture and visualization mean that ground-truth motion cannot always be claimed to appear perfectly natural.

Fig. 1. On the left: A screenshot of a page with stimuli from the HEMVIP evaluation interface. The blue border around the video means that the video for the blue slider is currently displayed; its play button is also highlighted. The question asked in the image ("How well do the character's movements reflect what the character says?") was designed to be as similar as possible to the question used in the previous, pairwise evaluation. On the right: A system overview showing how the web server interacts with the evaluation interface.

Fig. 1 shows an example of the HEMVIP evaluation interface (on the left). Each of the "Play" buttons corresponds to a video stimulus, which is played upon clicking the button. Clicking any other play button will immediately start playing its corresponding video instead. Below the play buttons are sliders for rating the different stimuli in response to the question shown below the video. Text anchors on the left-hand side, here adapted from common MOS-test terminology [9], are used to anchor different intervals on the scale. In order for the "Next" button to be enabled (clickable) and advance to the next page, all video clips have to be played.

We implemented the HEMVIP framework based on webMUSHRA [17], modified to support video material by using HTML5 video elements or a video provider, e.g., Vimeo (https://vimeo.com/) or YouTube (https://youtube.com). webMUSHRA operates based on configuration files that determine the content of each page using different templates, for example templates for text pages (e.g., instructions), MUSHRA comparisons, and questionnaires. We extended this with a template for video comparisons, as seen in Fig. 1 and described below. We also altered some of the base functionality of webMUSHRA, with the most salient such extensions described in this section.

Configuration files, in JSON, are used by experimenters to set up their user studies. These files for instance define the pages (including instructions and questionnaires) and questions, and which stimuli are shown. Counterbalancing is achieved by creating multiple configuration files, one per participant, which specify the order of pages and the stimuli on each page.
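As an illustration of this setup, the sketch below generates one configuration file per participant. It is only a minimal sketch: the field names, template name, and file paths are hypothetical (the actual HEMVIP/webMUSHRA schema is not reproduced here), and plain random shuffling is a simplification of the approximately balanced stimulus and condition orderings described in the validation study below.

```python
import json
import random

# Hypothetical condition names and question text, mirroring the running example.
CONDITIONS = ["GT", "Full", "NoAR", "NoPCA", "NoFiLM", "NoAudio", "NoText", "NoVel"]
QUESTION = "How well do the character's movements reflect what the character says?"

def make_config(participant_id: int, segments: list, n_pages: int = 10) -> dict:
    """Build one counterbalanced, per-participant configuration (illustrative schema)."""
    rng = random.Random(participant_id)      # seeded per participant
    pages = []
    for segment in rng.sample(segments, n_pages):
        order = CONDITIONS.copy()
        rng.shuffle(order)                   # blind, randomised slider order
        pages.append({
            "type": "video_comparison",      # hypothetical template name
            "question": QUESTION,
            "stimuli": [f"videos/{segment}_{c}.mp4" for c in order],
        })
    return {"testname": "Gesture Motion Experiment", "pages": pages}

if __name__ == "__main__":
    segments = [f"segment{i:02d}" for i in range(50)]
    for pid in range(46):                    # one configuration file per participant
        with open(f"config_{pid:03d}.json", "w") as f:
            json.dump(make_config(pid, segments), f, indent=2)
```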
In order to weed out inattentive or non-serious test-takers, we also implemented a mechanism for attention checks, where participants were required to input a specific value for a certain stimulus (within a tolerance of plus/minus 3). An instruction to set this value was communicated to the participant either through a text overlay or using a synthesized voice when playing the attention-check stimulus, which otherwise appeared similar to the other videos.

During the test, every click in the interface is recorded in an interaction log together with timestamps and which element was clicked. This could potentially be used for interaction analysis or for detecting cheating participants. The time a participant spends watching each video is also recorded.

To make it easier to remember which slider goes with which video, we apply individual colors to each slider and a colored border around the video, matching the color of the slider of the currently playing stimulus (see Fig. 1). These colors were assigned randomly for each page, and test participants were informed that the assignments were completely random.

In addition to extending webMUSHRA with additional JavaScript templates and functionality, our HEMVIP implementation incorporates a Python-based web server using FastAPI (https://fastapi.tiangolo.com) to provide configuration files to the frontend, block users that fail attention checks, and save result data in an external MongoDB database that we also added; see Fig. 1.
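A minimal sketch of this kind of backend, assuming the pre-generated per-participant configuration files from the previous sketch, is shown below. It uses FastAPI and pymongo, but the endpoint paths, payload fields, and collection names are illustrative rather than the actual HEMVIP API.

```python
# Illustrative backend sketch: serve per-participant configs, validate attention
# checks within the +/-3 tolerance described above, and store results in MongoDB.
import json
from fastapi import FastAPI, HTTPException
from pymongo import MongoClient

app = FastAPI()
db = MongoClient("mongodb://localhost:27017")["hemvip"]  # hypothetical database name

ATTENTION_TOLERANCE = 3

@app.get("/config/{participant_id}")
def get_config(participant_id: int):
    # Hand out the pre-generated, counterbalanced configuration for this participant.
    with open(f"config_{participant_id:03d}.json") as f:
        return json.load(f)

@app.post("/results/{participant_id}")
def save_results(participant_id: int, payload: dict):
    # Reject submissions in which any attention-check slider is off by more than 3.
    for check in payload.get("attention_checks", []):
        if abs(check["rating"] - check["expected"]) > ATTENTION_TOLERANCE:
            raise HTTPException(status_code=403, detail="Failed attention check")
    db.results.insert_one({"participant": participant_id, **payload})
    return {"status": "ok"}
```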
We validated the results obtained using HEMVIP by reproducing a previously conducted evaluation [13], where different speech-gesture generation models were compared. This previous evaluation contained six individual ablation studies and a study comparing the best ablation to the ground truth. Our validation study compared those six ablations (named NoAR, NoPCA, NoFiLM, NoAudio, NoText, and NoVel), the "Full" model, and the ground truth side by side (see [13] for details of the ablated and "Full" models).

Participants were asked to rate video stimuli produced by the various gesture-generation models. The stimuli from all conditions were presented together in parallel, and the participants were asked to rate them individually. The study was balanced such that each stimulus appeared on each page with approximately equal frequency (stimulus order), and each condition was associated with each slider with approximately equal frequency (condition order). For any given participant and study, each page used different speech segments. Every page contained the "Full" condition, the ground-truth condition, and additionally five or six of the ablated conditions, depending on whether an attention check was employed or not. Three attention checks were incorporated into the pages for each study participant, as described in Section 3. The attention-check target values were randomly selected between 5 and 95; however, some numbers were excluded due to acoustic ambiguity (e.g., 13 and 30). Which sliders on which pages were used for attention checks was uniformly random, except that no page had more than one attention check, and the "Full" and ground-truth conditions were never replaced by attention checks.

The question asked throughout the study was "How well do the character's movements reflect what the character says?", which is in close agreement with question Q2 asked in [13]. The reason only one question was chosen was the interface design, in which only one question can be asked at a time; multiple separate studies would be required to address more than one question. Furthermore, Q2 was chosen over the other questions for several reasons. To begin with, the purpose of Q1 was to evaluate motion quality, while Q2, Q3, and Q4 try to evaluate the link between speech and motion. We wanted to verify that participants could pay active attention to both speech and motion, leading us to not consider Q1 (motion quality) in this instance, since it can be assessed without considering audio at all. Q2, Q3, and Q4 elicited broadly similar responses in [13], but responses to Q2 showed the most pronounced differences between systems in the original study, and it was therefore selected for our validation.

After completing the 10 stimulus pages, the participants answered a questionnaire with demographic questions (age, gender, which continent they had lived on the most, whether English was their native language, and how they perceived the task difficulty) together with more qualitative questions not covered in this paper.

The video stimuli were the same as in [13], accessed from http://doi.org/10.6084/m9.figshare.13055609. The clips were 10 s long and comprised generated motion (or ground-truth motion capture) together with matching speech from a single actor originally recorded by [4]. There were 50 speech segments in total, each of which was associated with one motion clip from each of the eight conditions.

We used the crowdsourcing platform Prolific (formerly Prolific Academic) to recruit 46 participants (34 females and 12 males; mean age was 28 ±).

Table 1. System contrasts from the user studies, with 95% confidence intervals and Holm-Bonferroni-corrected p-values from Clopper-Pearson (C-P) tests (preference ignoring ties) and pairwise t-tests (mean rating). R stands for "rating". Daggers (†) mark non-significant differences. The "Like [13]?" columns report whether or not the findings of significance agree with the C-P tests on the data from [13] reported in the final two columns.

A → B | E[R_B − R_A] | p-val. | Like [13]? | P(R_B > R_A) | p-val. | Like [13]? | P(R_B > R_A) from [13] | p-val.
Full → NoAR | − ± | † | ✓ | ∈ (45,56) | 0.846† | ✓ | ∈ (43,55) | 0.747†
Full → NoPCA | 11.5 ± | − | ✓ | ∈ (70,79) | <10− | ✓ | ∈ (63,75) | <10−
Full → NoFiLM | 2.6 ± | | ✓ | ∈ (51,62) | 0.042 | ✓ | ∈ (51,63) | 0.039
Full → NoAudio | − ± | | ✓ | ∈ (40,50) | 0.119† | ✗ | ∈ (27,40) | <10−
Full → NoText | − ± | − | ✓ | ∈ (19,28) | <10− | ✓ | ∈ (7,15) | <10−
Full → NoVel | 1.0 ± | † | ✓ | ∈ (47,58) | 0.761† | ✓ | ∈ (50,63) | 0.067†
NoPCA → GT | 36.6 ± | − | ✓ | ∈ (92,98) | <10− | ✓ | ∈ (95,99) | <10−

p-values from Clopper-Pearson tests (denoted P(R_B > R_A) in the table) were calculated for the data from the HEMVIP validation study as well as for the data from [13] (obtained from https://doi.org/10.6084/m9.figshare.13055585) using the same analysis code. Since the HEMVIP responses are granular, as opposed to the binary preference judgments in [13], they can also be analyzed using pairwise t-tests (for differences in true mean rating, denoted E[R_B − R_A] in the table) and pairwise Wilcoxon signed-rank tests (for differences in true median), as is common with MUSHRA tests in audio. The numerical differences between systems are visualized in Fig. 2, and the results of the t-tests are reported in Table 1; pairwise Wilcoxon signed-rank tests found the same set of contrasts to be significant as the t-tests did.

Fig. 2. Boxplot visualizing the distribution of pairwise differences in rating for selected system contrasts. Red bars are the median ratings (each with a 95% confidence interval for the true median); yellow diamonds are mean ratings (with 95% confidence intervals). Box edges are at the 25th and 75th percentiles, while whiskers cover 95% of all rating differences.
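The three kinds of tests named above can be run with standard SciPy routines. The following is a minimal sketch assuming two aligned NumPy arrays of ratings for conditions A and B (one entry per participant and page); the data and array sizes are made up, and the released analysis code, not this sketch, is what produced the numbers in Table 1.

```python
# Paired t-test, Wilcoxon signed-rank test, and a Clopper-Pearson analysis of
# preference ignoring ties, for one contrast A -> B on hypothetical ratings.
import numpy as np
from scipy import stats

def compare_conditions(r_a: np.ndarray, r_b: np.ndarray) -> dict:
    diff = r_b - r_a
    t_res = stats.ttest_rel(r_b, r_a)          # mean rating difference E[R_B - R_A]
    w_res = stats.wilcoxon(r_b, r_a)           # median difference, no Gaussian assumption
    wins = int(np.sum(diff > 0))               # pages where B was rated above A
    n = int(np.sum(diff != 0))                 # ties are ignored
    binom = stats.binomtest(wins, n, p=0.5)    # exact binomial (Clopper-Pearson) test
    ci = binom.proportion_ci(confidence_level=0.95, method="exact")
    return {
        "mean_diff": float(diff.mean()),
        "t_p": t_res.pvalue,
        "wilcoxon_p": w_res.pvalue,
        "pref_ci_percent": (100 * ci.low, 100 * ci.high),
        "clopper_pearson_p": binom.pvalue,
    }

# Example with synthetic ratings on the 0-100 scale (not study data):
rng = np.random.default_rng(0)
a = rng.uniform(0, 100, size=400)
b = np.clip(a + rng.normal(5, 20, size=400), 0, 100)
print(compare_conditions(a, b))
```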
As the study in [13] uses pairwise comparisons, Clopper-Pearson (C-P) tests can be used to analyze and compare preference ignoring ties in the two studies. The results show a high correspondence between the two studies (only one contrast differed, namely "Full → NoAudio"). However, results obtained using HEMVIP provide more information than binary preference tests, since the responses fall on a granular scale instead of being a categorical system-preference measure. This in turn allows using a wider range of statistical analyses, such as pairwise t-tests (mean rating) or Wilcoxon signed-rank tests (median rating). These, rather than Clopper-Pearson tests, are standard for analyzing MUSHRA responses. Both pairwise t-tests and Wilcoxon signed-rank tests produce results that match the original study for all contrasts (see Table 1 for the t-test results), thus validating that the new methodology delivers the same conclusions as the conventional pairwise evaluation. We recommend these tests (especially the Wilcoxon test, which does not assume Gaussianity), as they leverage more of the information in the responses and are better at telling conditions apart. Additionally, HEMVIP allows comparing all systems against one another (28 pairs) at no additional cost, and not only the contrasts in Table 1. Analyzing the current study in this way finds 23 of the pairs to be statistically significantly different from one another at α = 0.05 after Holm-Bonferroni correction.
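As a sketch of this all-pairs analysis, the snippet below compares every pair of conditions with Wilcoxon signed-rank tests and applies Holm-Bonferroni correction [5] via statsmodels; the condition names match the running example, but the rating arrays are random placeholders, not the study data.

```python
# All 28 pairwise contrasts among eight conditions, with Holm-Bonferroni correction.
from itertools import combinations

import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

def all_pairs_significance(ratings: dict, alpha: float = 0.05) -> dict:
    pairs = list(combinations(sorted(ratings), 2))
    pvals = [stats.wilcoxon(ratings[a], ratings[b]).pvalue for a, b in pairs]
    reject, corrected, _, _ = multipletests(pvals, alpha=alpha, method="holm")
    return {pair: (p, bool(sig)) for pair, p, sig in zip(pairs, corrected, reject)}

# Example with made-up ratings for eight conditions on 400 aligned pages:
rng = np.random.default_rng(1)
conditions = ["GT", "Full", "NoAR", "NoPCA", "NoFiLM", "NoAudio", "NoText", "NoVel"]
ratings = {c: rng.uniform(0, 100, size=400) for c in conditions}
results = all_pairs_significance(ratings)
print(sum(sig for _, sig in results.values()), "of", len(results), "pairs significant")
```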
There were some differences between the described validation study and the original study in [13]. The number of participants in the current study was 46, while there were 143 participants in the other study; however, the number of ratings was 3542 (46 participants × 10 pages × 8 ratings per page, minus 138 attention checks) vs. 2860 (143 participants × 26 pages × 1 rating per page, minus 858 attention checks). Another difference was that the original study evaluated three additional questions at the same time, while the present study only evaluated one of them. There was also a difference in the attention checks employed, as the attention checks in [13] were based on detecting degraded videos and were more subtle. Furthermore, the study in [13] was conducted on Amazon Mechanical Turk (AMT), while the current study used Prolific.

One of the key benefits of HEMVIP is its efficiency, i.e., the time it takes to collect a certain number of ratings. Compared to pairwise comparisons, as employed in for example [11, 13], HEMVIP is more efficient because it evaluates multiple stimuli in parallel. To estimate efficiency, it is perhaps most meaningful to compare against the partial replication study in [11], because a) it only asked one question (as opposed to [13], which asked four), and b) it was conducted across three participant pools, one of them being Prolific, as also used here. While the replication study only considered a single contrast ("NoPCA" vs. "NoText"), making it unsuitable for evaluating HEMVIP properly, it provides test-taking duration information directly relevant for participants on the Prolific platform. The average time the participants spent on each page was 176 s. According to the results in [11], it took participants on Prolific approximately 32 s to complete one page in a pairwise comparison. This means that comparing one system to the other seven systems in a pairwise fashion takes 224 s (7 × 32 s), as opposed to 176 s using HEMVIP. Furthermore, if one wants to compare all systems with each other, HEMVIP still only needs 176 s (one page) to accomplish this, as opposed to 896 s (8 choose 2 = 28 pages) using a pairwise method.
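The scaling argument can be made concrete with a small calculation, using the approximate per-page timings cited above (32 s per pairwise page from [11] and 176 s per HEMVIP page from the present study); the condition counts other than eight are added only to illustrate the trend.

```python
# Time to cover all pairwise contrasts among n conditions vs. one HEMVIP page.
from math import comb

PAIRWISE_PAGE_S = 32   # approximate Prolific timing from [11]
HEMVIP_PAGE_S = 176    # average page duration observed in the present study

for n in (4, 8, 12):
    pages = comb(n, 2)
    print(f"{n} conditions: pairwise {pages} pages = {pages * PAIRWISE_PAGE_S} s, "
          f"HEMVIP 1 page = {HEMVIP_PAGE_S} s")
# For 8 conditions: pairwise 28 pages = 896 s, HEMVIP 1 page = 176 s.
```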
A limitation of HEMVIP is that it mainly focuses on one question at a time, which is also the reason why only one question from [13] was evaluated. A way of scoring multiple questions at once (e.g., for efficiently conducting proper Likert-scale evaluations; cf. [18]) could probably be added by, for example, adding a new set of sliders after one question has been rated, or including multiple sliders per stimulus. This, and related issues such as how many stimuli can be evaluated in parallel without exhausting the participant, are interesting questions that should be addressed in a future usability study, as it is currently unknown whether adding an extra question or increasing the number of stimuli would be too cognitively demanding for users. The MUSHRA standard for audio [7] recommends no more than 12 stimuli (and thus sliders) per page, but does not indicate the reasoning used to derive this number. Another limitation of this work is that we only evaluated video stimuli from one particular domain and task. While we evaluated the method using material that has been evaluated twice before, and found consistent results, future work should investigate different kinds of video stimuli to further study how well the method generalizes to different situations.
In this paper we proposed a framework for evaluating multiple comparable video stimuli in parallel, called HEMVIP. A validation experiment compared results obtained using the proposed method to results obtained using pairwise binary preference tests in earlier work in the domain of nonverbal behavior, finding high correspondence between the two experiments, but with greater efficiency and vastly better scaling properties for HEMVIP. We believe HEMVIP can be of great benefit to researchers performing thorough evaluations across multiple video stimuli, e.g., to evaluate ECAs. Future work includes validating HEMVIP on the remaining questions evaluated in [13], evaluating stimuli from other domains, for example videos of human behavior, along with usability testing of various aspects of the interface, such as the utility of color-coding sliders and videos, and how many videos users are able to rate effectively on a single page without incurring excessive cognitive load. Such validations can leverage the publicly available data from the GENEA Challenge 2020 [14], which used HEMVIP to compare a wide range of different gesture-generation methods in a large-scale evaluation and has made all its stimuli, subjective ratings, and analysis code available online (see https://zenodo.org/communities/genea2020).
ACKNOWLEDGMENTS
The authors wish to thank Jonas Beskow, Dmytro Kalpakchi, and Bram Willemsen for giving feedback on the paper. This research was partially supported by Swedish Foundation for Strategic Research contract no. RIT15-0107 (EACare), by IITP grant no. 2017-0-00162 (Development of Human-care Robot Technology for Aging Society) funded by the Korean government (MSIT), by the Flemish Research Foundation grant no. 1S95020N, and by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation.
REFERENCES
[1] Simon Alexanderson, Gustav Eje Henter, Taras Kucherenko, and Jonas Beskow. 2020. Style-controllable speech-driven gesture synthesis using normalising flows. Comput. Graph. Forum 39, 2 (2020), 487–496. https://doi.org/10.1111/cgf.13946
[2] Chung-Cheng Chiu and Stacy Marsella. 2014. Gesture generation with low-dimensional embeddings. In Proceedings of the International Conference on Autonomous Agents and Multi-agent Systems. ACM, 781–788. https://doi.org/10.5555/2615731.2615857
[3] Andrew P. Clark, Kate L. Howard, Andy T. Woods, Ian S. Penton-Voak, and Christof Neumann. 2018. Why rate when you could compare? Using the "EloChoice" package to assess pairwise comparisons of perceived physical strength. PLoS One 13, 1 (2018), 1–16. https://doi.org/10.1371/journal.pone.0190393
[4] Ylva Ferstl and Rachel McDonnell. 2018. Investigating the use of recurrent motion modelling for speech gesture generation. In Proceedings of the ACM International Conference on Intelligent Virtual Agents. ACM, 93–98. https://doi.org/10.1145/3267851.3267898
[5] Sture Holm. 1979. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6, 2 (1979), 65–70.
[6] Ryo Ishii, Taichi Katayama, Ryuichiro Higashinaka, and Junji Tomita. 2018. Generating body motions using spoken language in dialogue. In Proceedings of the ACM International Conference on Intelligent Virtual Agents. ACM, 87–92. https://doi.org/10.1145/3267851.3267866
[7] ITU-R BS.1534-3. 2015. Method for the Subjective Assessment of Intermediate Quality Level of Audio Systems.
[8] ITU-R BT.1788. Methodology for the Subjective Assessment of Video Quality in Multimedia Applications.
[9] ITU-T P.800. Methods for Subjective Determination of Transmission Quality.
[10] In Proceedings of the ICDL-EpiRob Workshop on Naturalistic Non-Verbal and Affective Human-Robot Interactions. 4. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-261242
[11] Patrik Jonell, Taras Kucherenko, Ilaria Torre, and Jonas Beskow. 2020. Can we trust online crowdworkers? Comparing online and offline participants in a preference test of virtual agents. In Proceedings of the ACM International Conference on Intelligent Virtual Agents. ACM, 30:1–30:8. https://doi.org/10.1145/3383652.3423860
[12] Taras Kucherenko, Dai Hasegawa, Gustav Eje Henter, Naoshi Kaneko, and Hedvig Kjellström. 2019. Analyzing input and output representations for speech-driven gesture generation. In Proceedings of the ACM International Conference on Intelligent Virtual Agents. ACM, 97–104. https://doi.org/10.1145/3308532.3329472
[13] Taras Kucherenko, Patrik Jonell, Sanne van Waveren, Gustav Eje Henter, Simon Alexanderson, Iolanda Leite, and Hedvig Kjellström. 2020. Gesticulator: A framework for semantically-aware speech-driven gesture generation. In Proceedings of the ACM International Conference on Multimodal Interaction. ACM, 242–250. https://doi.org/10.1145/3382507.3418815
[14] Taras Kucherenko, Patrik Jonell, Youngwoo Yoon, Pieter Wolfert, and Gustav Eje Henter. 2021. A large, crowdsourced evaluation of gesture generation systems on common data. Accepted for publication.
[15] Rensis Likert. 1932. A technique for the measurement of attitudes. Archives of Psychology 140 (1932), 1–55.
[16] Manuel Sam Ribeiro, Junichi Yamagishi, and Robert A. J. Clark. 2015. A perceptual investigation of wavelet-based decomposition of f0 for text-to-speech synthesis. In Proceedings of the Annual Conference of the International Speech Communication Association.
[17] Michael Schoeffler et al. 2018. webMUSHRA: A comprehensive framework for web-based listening tests. Journal of Open Research Software 6, 1 (2018), 8:1–8:8. https://doi.org/10.5334/jors.187
[18] Mariah L. Schrum, Michael Johnson, Muyleng Ghuy, and Matthew C. Gombolay. 2020. Four years in review: Statistical practices of Likert scales in human-robot interaction studies. In Companion of the 2020 ACM/IEEE International Conference on Human-Robot Interaction. ACM, 43–52. https://doi.org/10.1145/3371382.3380739
[19] Mirjam Wester, Cassia Valentini-Botinhao, and Gustav Eje Henter. 2015. Are we using enough listeners? No! An empirically-supported critique of Interspeech 2014 TTS evaluations. In Proceedings of the Annual Conference of the International Speech Communication Association.