Investigating Correlations of Automatically Extracted Multimodal Features and Lecture Video Quality
Jianwei Shi, Christian Otto, Anett Hoppe, Peter Holtz, Ralph Ewerth
Jianwei Shi, L3S Research Center, Leibniz Universität Hannover
Christian Otto, Leibniz Information Centre for Science and Technology (TIB)
Anett Hoppe, Leibniz Information Centre for Science and Technology (TIB)
Peter Holtz, Leibniz-Institut für Wissensmedien (IWM)
Ralph Ewerth, Leibniz Information Centre for Science and Technology (TIB) and L3S Research Center, Leibniz Universität Hannover
ABSTRACT
Ranking and recommendation of multimedia content such as videos is usually realized with respect to the relevance to a user query. However, for lecture videos and MOOCs (Massive Open Online Courses) it is not only required to retrieve relevant videos, but particularly to find lecture videos of high quality that facilitate learning, for instance, independent of the video's or speaker's popularity. Thus, metadata about a lecture video's quality are crucial features for learning contexts, e.g., lecture video recommendation in search as learning scenarios. In this paper, we investigate whether automatically extracted features are correlated with quality aspects of a video. A set of scholarly videos from a Massive Open Online Course (MOOC) is analyzed regarding audio, linguistic, and visual features. Furthermore, a set of cross-modal features is proposed, which are derived by combining transcripts, audio, video, and slide content. A user study is conducted to investigate the correlations between the automatically collected features and human ratings of quality aspects of a lecture video. Finally, the impact of our features on the knowledge gain of the participants is discussed.
KEYWORDS
multimodal, video assessment, correlation, knowledge gain
ACM Reference Format:
Jianwei Shi, Christian Otto, Anett Hoppe, Peter Holtz, and Ralph Ewerth. 2019. Investigating Correlations of Automatically Extracted Multimodal Features and Lecture Video Quality. In Proceedings of SALMM '19, October 21, 2019, Nice, France. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3347451.3356731
1 INTRODUCTION
Modern studying has left the classrooms – more and more, learners refer to the Web as a source of educational material for a certain learning need. This kind of informal learning scenario is, nowadays, only insufficiently supported by Web mechanisms such as search engines, which focus on satisfying information needs instead [2]. Recently, the research area of
Search as Learning (SaL) has gained momentum: it recognizes learning as an implicit component of Web search [1] and aims to detect learning needs and intents [10, 24] and to identify factors correlated with successful learning outcomes (e.g., [7]). While there are first interesting insights, the research field is still mostly geared towards the recommendation of textual learning resources [17]. This stands in contrast to current learning research, which suggests that users may prefer images and videos when tackling certain learning needs (e.g., procedural learning tasks [19, 25]). In consequence, this work sets out to establish a relationship between automatically extracted video features and successful learning. The objective is to facilitate more effective search for learning objects, be it for learners with a certain learning intent or for educators who seek high-quality material to enhance their own courses. Normally, there is a large amount of educational and lecture videos that cover similar content – from single-video tutorials to full-fledged Massive Open Online Courses (MOOCs). In such a large lecture video archive, it is desirable to find content of high quality regarding the knowledge presentation. However, objective features which assess, for instance, the nature of the presenter's voice or the design of the slides are not fully explored in automated systems in the current state of research. A human viewer evaluates the quality of a learning resource based on all available information. In common lecture videos, this includes the textual, oral, and visual modality. Viewers are supported in their learning by the visual elements on the slides, the words spoken, and the gestures of the lecturer.

Previous work, like Guo et al. [13], investigated how the design of lecture videos impacts viewer engagement and provided recommendations to optimize the content accordingly. Chen et al. [6] used multimodal sensing to assess the quality of a presentation. They extracted speech, body movement, and visual features from the shown slides. Principal Component Analysis was applied to human ratings in order to address the two main modalities of the presentation: 1) recital skills, including, for instance, voice information and body language, and 2) slide quality, with regard to grammar, readability, and visual design. Pearson correlation was used to measure the relation between the different features. Haider et al. [14] proposed a system for automatic video quality assessment, which is the most similar to our approach, focusing on prosodic and visual features. They extracted the complete set of audio features from the ComParE challenge [22] and a total of 42 features related to hand movements of the speaker. The employed Multimodal Learning Analytics (MLA) dataset [18] contains 416 oral presentations (in Spanish) and the respective metadata regarding speech, facial expressions, skeletal data extracted from a Microsoft Kinect, as well as the shown slides. Each of these videos was labeled with ten individual ratings and an overall score related to the quality of the slides. A correlation study (discriminant analysis) found that prosodic features are able to predict self-confidence and enthusiasm (of the speaker) as well as body language and pose, which is a quality measure their participants had to label.
Their visual features showed similar results, but with less accuracy.

In this paper, we go beyond previous work by (1) proposing a novel set of intuitive unimodal and cross-modal features that do not rely on skeletal data, which are hard to acquire, (2) conducting an empirical evaluation of the correlation of these features with quality aspects of a video, and (3) conducting an empirical evaluation of their correlation with participants' knowledge gain. Videos of a MOOC website are utilized and automatically annotated with our set of unimodal and multimodal features that address different quality aspects. Different modalities are exploited: audio and spoken text, video frame content, slide content, as well as cross-modal features. The experimental results reveal that several of our features show a significant correlation with the corresponding human assessment of the lecture videos.

The remainder of the paper is organized as follows. Section 2 describes the set of extracted unimodal as well as novel cross-modal features. The design of the user study and the experimental results are presented in Section 3, while Section 4 concludes the paper and outlines areas of future work.

2 FEATURE EXTRACTION
This section outlines our approach for the extraction of in total 22 features from lecture videos, including textual, audio, and linguistic features, as well as a set of chosen multimodal features. An overview of our feature set is depicted in Figure 1. Since we are dealing with educational videos, we assume that for each data sample a video file is available, together with a PDF file of the shown presentation as well as a speech transcript.
2.1 Unimodal Features
There are three kinds of unimodal features: audio, linguistic, and visual.
Audio Features.
The openSMILE toolkit [11] is used to extract the audio features, except for the pitch variation information, which is extracted according to Hincks [16]. We selected the feature subset of the ComParE challenge (6,373 dimensions, around 70 low-level descriptors with multiple features each) and reduced it to nine features and their arithmetic mean, see Table 1. For our study, we selected those features that have either shown an impact on the audio quality before (Jitter, F0 Harmonics ratio, ...) or are very likely to influence the perceived quality of the audio (Energy, Loudness, Harmonics-to-Noise Ratio, etc.).

Figure 1: Overview of the feature sets extracted for the automatic assessment algorithm.
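A minimal sketch of this extraction step is given below. It assumes the openSMILE Python wrapper (the `opensmile` package by audEERING) and a hypothetical file name lecture_audio.wav; the keyword filter that narrows the 6,373 ComParE functionals down to a Table 1-like subset is our own heuristic and not part of the paper's pipeline.

```python
import opensmile

# Extract the ComParE 2016 functionals (6,373 dimensions) for one lecture audio track.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.Functionals,
)
feats = smile.process_file("lecture_audio.wav")  # pandas DataFrame with one row

# Heuristic reduction to a small subset similar to Table 1 (loudness, energy,
# jitter, shimmer, HNR); the PVQ is computed separately following Hincks [16].
keywords = ("loudness", "rmsenergy", "jitter", "shimmer", "hnr")
subset = [c for c in feats.columns if any(k in c.lower() for k in keywords)]
print(feats[subset].T)
```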
Table 1: Automatically extracted audio features.
Loudness: Sum of a simplified auditory spectrum
Modulated Loudness: Sum of a simplified RASTA-filtered auditory spectrum (RelAtive Spectral TrAnsform, Hermansky et al. [15])
Root-Mean-Square Energy: Square root of the mean of the discrete values of the sound pressure
Jitter: Deviation from true periodicity of a presumably periodic signal
∆ Jitter: Normalized average length deviation from true periodicity of a presumably periodic signal
Shimmer: Amplitude variation of consecutive voice signal periods
Harmonicity (spectral): Ratio between the minima and the maxima in relation to the amplitude of the maxima from a magnitude spectrum
Logarithmic Harmonics-to-Noise Ratio: Logarithmic scale of the ratio of the harmonic to the noise component in the wave signal
Pitch Variation Quotient (PVQ): Standard deviation of the pitch divided by the mean of the pitch (cf. Hincks [16])

Linguistic Features.
The extraction of linguistic features aims to describe how the presenters articulate themselves by means of syllable duration and speaking rate. De Jong and Wempe's [9] Praat [4] script was used to extract these features, namely speech rate, articulation rate, and average syllable duration (ASD). All of them are derived from the number of vowels or syllables per time interval and indicate whether the speaker is talking too fast or too slowly. The video transcript also contains a lot of useful information regarding speech quality. Since the influence of textual content on knowledge gain has been examined extensively (e.g., by Gadiraju et al. [12]), this information is disregarded here and the focus is set on the less researched features. However, we make use of the content of the transcript in a different way in Section 2.2.
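The original features are produced by De Jong and Wempe's Praat script; the following is only a rough, simplified approximation of that procedure in Python, assuming the parselmouth bindings to Praat. The thresholds and the peak-picking rule are illustrative assumptions rather than the exact settings of [9].

```python
import numpy as np
import parselmouth
from scipy.signal import find_peaks

def speaking_rate_features(wav_path, silence_db=-25.0, min_dip_db=2.0):
    """Approximate speech rate, articulation rate and average syllable duration (ASD)."""
    snd = parselmouth.Sound(wav_path)
    intensity = snd.to_intensity()
    pitch = snd.to_pitch()
    times = intensity.xs()
    db = intensity.values[0]
    threshold = db.max() + silence_db              # e.g. 25 dB below the loudest frame
    # Candidate syllable nuclei: intensity peaks above the threshold with a minimum dip.
    peaks, _ = find_peaks(db, height=threshold, prominence=min_dip_db)
    # Keep only voiced peaks (Praat reports NaN for unvoiced frames).
    nuclei = [i for i in peaks if not np.isnan(pitch.get_value_at_time(times[i]))]
    n_syllables = len(nuclei)
    total_duration = snd.get_total_duration()
    frame_step = times[1] - times[0]
    phonation_time = float((db > threshold).sum()) * frame_step
    speech_rate = n_syllables / total_duration          # syllables per second, incl. pauses
    articulation_rate = n_syllables / max(phonation_time, 1e-6)
    asd = 1.0 / articulation_rate if n_syllables else float("nan")
    return speech_rate, articulation_rate, asd
```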
Visual Features.
For the visual content, we examine the PDF files of the presentation slides. With the command-line tool pdftotext, we extract text layout information from the PDF slides. The tool extracts the position of each text element as well as the size of the slide. The elements of a slide are stored in a hierarchical way, starting with the biggest text element, which contains multiple text lines, and each line consists of multiple words. This information is converted into an XHTML file. Similarly, the pdftohtml command is used to extract the image positions and sizes of the slide, which are stored in an XML file. The generated files are then parsed to JSON, since this format is more convenient for data handling. Based on this representation, we compute two features related to the design of the slides: text ratio and image ratio. They describe how much slide space is covered by each of the modalities according to Formula 1.
TextRatio = \frac{\sum_{i=1}^{n} TextArea_i}{Area_{slide}} . (1)

Also, for each file we store the mean and sample variance of the text ratio and image ratio values of all slides.
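Our pipeline stores the layout information as JSON; as a minimal, self-contained sketch of Formula 1, the snippet below instead computes the per-slide text ratio directly from pdftotext's -bbox XHTML output (poppler-utils). The image ratio is obtained analogously from the pdftohtml XML; the file name is a placeholder.

```python
import statistics
import subprocess
import xml.etree.ElementTree as ET

def text_ratios(pdf_path):
    """Per-slide text ratio: summed text-box area divided by slide area (Formula 1)."""
    xhtml = subprocess.run(["pdftotext", "-bbox", pdf_path, "-"],
                           capture_output=True, text=True, check=True).stdout
    root = ET.fromstring(xhtml)
    ratios = []
    for page in root.iter():
        if not page.tag.endswith("page"):
            continue
        slide_area = float(page.get("width")) * float(page.get("height"))
        text_area = sum(
            (float(w.get("xMax")) - float(w.get("xMin"))) *
            (float(w.get("yMax")) - float(w.get("yMin")))
            for w in page if w.tag.endswith("word"))
        ratios.append(text_area / slide_area)
    return ratios

ratios = text_ratios("slides.pdf")
print(statistics.mean(ratios), statistics.variance(ratios))
```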
2.2 Cross-modal Features
In this section, we present a set of cross-modal features which aim to model specific quality aspects of a presentation. We were inspired by criteria that are important to us humans, for instance, the way and frequency in which the presenter highlights important aspects on the slides. If we are able to capture these metrics, we can rank videos with similar content according to their presentation quality, providing an optimal recommendation to a learner.

Highlight of Important Statements. This feature is supposed to indicate how often important statements are emphasized per slide and over the complete slide set. To identify the text boxes most likely containing the key components of a slide, we use the information from the document layout analysis, which we stored in JSON format earlier. In this procedure, we use the following natural language processing functions, which were adapted from Bird et al. [3]:
• N(): return a list of nouns from a sentence
• LEM(): return a set of lemmas from a list of words
• SYN(): return a set of synonyms from a list of words
Our assumption for the identification of important text is that font size is often proportional to importance. Since we do not have the font size information for each slide, we sort the text lines by the area they cover. However, simply choosing the n largest text areas does not yield good results, because bullet points of similar importance often cover text areas of different size. Therefore, we cluster the text areas according to the following rule: for each text area, starting from the biggest one, if the area difference to the next biggest text area is smaller than n% of the slide size (1% in our experiments), add it to the existing cluster, otherwise create a new one. All text areas in the two biggest clusters are considered to contain important statements. This usually includes the text area of the title and all headlines of the highest category. For each selected text line, we first extract the sentence(s) (St_imp) from the respective JSON file. Then, the nouns and their synonyms are extracted from the text. Finally, we lemmatize the nouns and their synonyms:

St = LEM(N(St_{imp}) \cup SYN(N(St_{imp}))) . (2)
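One possible realisation of the clustering rule and of Formula 2, using NLTK for the N(), LEM(), and SYN() functions, could look as follows. It assumes each text line is available as a dict with its covered area and its text (as produced by our layout analysis); the names and the data layout are illustrative.

```python
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def nouns(sentence):                    # N(): list of nouns in a sentence
    return [w for w, tag in pos_tag(word_tokenize(sentence)) if tag.startswith("NN")]

def synonyms(words):                    # SYN(): WordNet synonyms of a list of words
    return {l.name() for w in words
            for syn in wn.synsets(w, pos=wn.NOUN) for l in syn.lemmas()}

def lemmas(words):                      # LEM(): lemma set of a list of words
    return {lemmatizer.lemmatize(w.lower()) for w in words}

def important_lines(text_lines, slide_area, frac=0.01):
    """Cluster text lines by covered area (1% rule); keep the two largest-area clusters."""
    lines = sorted(text_lines, key=lambda l: l["area"], reverse=True)
    clusters, current = [], [lines[0]]
    for prev, cur in zip(lines, lines[1:]):
        if prev["area"] - cur["area"] < frac * slide_area:
            current.append(cur)
        else:
            clusters.append(current)
            current = [cur]
    clusters.append(current)
    return [l["text"] for cluster in clusters[:2] for l in cluster]

def st_set(important_sentences):
    """St = LEM(N(St_imp) ∪ SYN(N(St_imp))), Formula 2."""
    n = [w for s in important_sentences for w in nouns(s)]
    return lemmas(set(n) | synonyms(n))
```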
Locating Emphasized Transcriptions. This feature is designed to capture the ability of the presenter to take up important statements shown on the slides and to emphasize them through his or her voice. If so, there should be a corresponding local maximum in the audio signal. To get this information, we need to align the speech transcript with the audio signal in the time frame in which the currently observed slide was covered. The segmentation of the video into single slides is done manually. For each slide and its associated time frame, we need to find the corresponding segment of the speech transcript. The transcript is segmented into blocks of 10 seconds, which is a common interval in speech analysis (cf. Hincks [16]). A slide segment of arbitrary length can contain multiple of these blocks, and it is important to find the correct one for each highlighted statement, see Figure 2. This is done using the audio signal. We encode the audio information via three metrics: F0, loudness, and energy. If all of them have a maximum at around the same timestamp, we assume the presenter emphasized the word said at that moment. Locating the exact words at that moment is done by choosing the speech transcript block whose temporal centre is closest to the found maximum. The list of these presumed important statements is stored in EmphasizedTranscriptions, and we lemmatize them again:

Tr = LEM(EmphasizedTranscriptions) . (3)

Finally, the fraction of highlighted statements is calculated as:

Highlight = \frac{|St \cap Tr|}{|St|} . (4)

Figure 2: Visualization of the overlap between the speech transcript blocks a, ..., i and the slides 1-3. Based on their percental overlap with the slide, a certain percentage of words is removed from each block.
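The alignment step can be compressed into the following sketch. The three audio metrics are assumed to be given as lists of (time, value) pairs for the slide's time frame, the 10-second transcript blocks as dicts with start, end, and a word list; the tolerance for "around the same timestamp" is an assumption.

```python
def emphasized_block(blocks, f0, loudness, energy, tolerance=1.0):
    """Return the transcript block whose temporal centre is closest to a joint
    maximum of F0, loudness and energy, or None if the maxima do not coincide."""
    t_f0 = max(f0, key=lambda p: p[1])[0]
    t_loud = max(loudness, key=lambda p: p[1])[0]
    t_energy = max(energy, key=lambda p: p[1])[0]
    if max(t_f0, t_loud, t_energy) - min(t_f0, t_loud, t_energy) > tolerance:
        return None                                   # no common emphasis point found
    t_star = (t_f0 + t_loud + t_energy) / 3.0
    return min(blocks, key=lambda b: abs((b["start"] + b["end"]) / 2.0 - t_star))

def highlight_ratio(st, tr):
    """Highlight = |St ∩ Tr| / |St| (Formula 4); st and tr are lemma sets."""
    return len(st & tr) / len(st) if st else 0.0
```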
Level of Detailing. Another possible measure of quality for the presentation is the overlap of the spoken text with the information on the slide. As literature on video learning suggests [5], audio and visual contents should provide complementary information rather than being overly redundant. Thus, we examine whether the speaker only reads the information already present on the slide, or whether the oral explanation provides further detail, giving an appropriate amount of additional information. For this purpose, we calculate the ratio of said words to shown words on the slide. Again, we use the speech transcript blocks from the previous metric to count the number of words said during the time frame in which the corresponding slide was visible. Every speech transcript block overlapping with the duration of the slide is considered. Blocks at the interval boundaries are cut off appropriately: if a transcript block overlapped 70% at the end of a slide and contained 10 words, we would consider the first seven words and dismiss the last three. All found words are stored in Words_said. The number of words on the slide, Words_slide, is gathered from the process explained in Section 2.1. The level of detailing is calculated as the ratio of the number of said words to the number of words on the slide, see Formula 5:

LevelOfDetailing = \frac{|Words_{said}|}{|Words_{slide}|} . (5)

Also, for each video the mean and sample variance of these values are calculated for later usage.
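A sketch of the block trimming and of Formula 5 follows; transcript blocks are again assumed to be dicts with start and end times and a word list, and the slide interval is given in seconds.

```python
def words_said(blocks, slide_start, slide_end):
    """Words spoken while a slide was visible; partially overlapping blocks are cut
    proportionally (e.g. 70% overlap of a 10-word block keeps 7 words)."""
    words = []
    for block in blocks:
        overlap = min(block["end"], slide_end) - max(block["start"], slide_start)
        if overlap <= 0:
            continue
        keep = round(overlap / (block["end"] - block["start"]) * len(block["words"]))
        if keep <= 0:
            continue
        if block["start"] >= slide_start:     # block runs past the end of the slide
            words.extend(block["words"][:keep])
        else:                                 # block started before the slide
            words.extend(block["words"][-keep:])
    return words

def level_of_detailing(words_said, words_slide):
    """LevelOfDetailing = |Words_said| / |Words_slide| (Formula 5)."""
    return len(words_said) / len(words_slide) if words_slide else 0.0
```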
Coverage of Slide Contents. This metric captures whether the speaker talked about all the information shown on the slide or whether some parts were skipped. Again, this gives an idea of whether the overall talk is well structured and timed, or rushed and hard to follow for an observer, since some of the information shown is left without explanation. We can reuse Words_slide from the previous section; for the words said by the speaker, we reuse the already established Words_said from the previous metric. The words in Words_slide and Words_said are lemmatized again, which enables an easier comparison. The coverage of slide content is calculated as the ratio of the number of common words in Words_said and Words_slide to the total number of words on the slide:

Coverage = \frac{|Words_{said} \cap Words_{slide}|}{|Words_{slide}|} . (6)

Similarly, the mean and sample variance of the values of all slides are calculated for later usage. The full list of manual and automatic features can be seen in Table 2.
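Formula 6 then reduces to a set intersection over the lemmatized word lists; the lemmatize argument below stands for a function mapping a single word to its lemma (cf. LEM() in Section 2.2).

```python
def coverage(words_said, words_slide, lemmatize):
    """Coverage = |Words_said ∩ Words_slide| / |Words_slide| (Formula 6)."""
    said = {lemmatize(w) for w in words_said}
    slide = {lemmatize(w) for w in words_slide}
    return len(said & slide) / len(slide) if slide else 0.0
```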
Table 2: Overview of all recorded features.
Human-rated quality aspects: Clear Language; Vocal Diversity; Filler Words; Speed of Presentation; Coverage of the Content; Level of Detail; Highlight of imp. Content; Summary; Text Design; Image Design; Formula Design; Table Design; Structure of Presentation; Entry Level; Overall Rating.
Automatic features: Loudness avg.; mod. Loudness avg.; RMS Energy avg.; f0 avg.; Jitter avg.; ∆ Jitter avg.; Shimmer avg.; Harmonicity avg.; log. HNR avg.; PVQ avg.; Speech Rate; Articulation Rate; avg. Syllable Duration; Text Ratio avg.; Text Ratio var.; Image Ratio avg.; Image Ratio var.; Highlight of imp. Statements; Level of Detailing avg.; Level of Detailing var.; Coverage of Slide Content avg.; Coverage of Slide Content var.
3 USER STUDY
To estimate the expressiveness of our automatically extracted features, we conducted a user study. We aimed to get human ratings for different quality aspects of lecture videos, while these aspects are covered by the respective feature sets. In addition, we asked for an overall rating of each lecture video. Furthermore, every participant was asked to fill in a knowledge test before (pre-test) and after (post-test) watching a video, aiming to measure the capability of a video to convey knowledge. We conducted a correlation analysis to find out which features are correlated with the quality aspects of lecture videos as well as with knowledge gain.

3.1 Dataset
Our dataset consists of 22 videos (with associated slides and speech transcripts) from edX (https://courses.edx.org/courses/course-v1:DelftX+GSE101x+1T2018/course/). The course materials are copyright of Delft University of Technology and are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License [8]. The available course materials on edX are provided in the following formats: videos in MP4, slides in PDF, and transcriptions in SRT. We chose this source since it does not require any further pre-processing, it is open access, and the speech transcripts are of high quality (presumably they have been manually reviewed). The subject of the 22 videos is software engineering. Each video has exactly one presenter, while the full dataset has nine different presenters with varying slide designs.

3.2 Participants
We employed 13 participants (10 men, 3 women) from our university with a computer science background and an average age of around 25 years.

3.3 Experimental Setting
A common way to estimate the knowledge gain of a participant during a learning session is to conduct a knowledge test before and after a controlled learning session (e.g., Yu et al. [27]). The resulting score difference indicates how much was learned. Even though the potential knowledge gain depends heavily on the participant, we try to circumvent this problem by choosing a subject that is most likely unfamiliar to a majority of people. It is, however, important to ensure that the participants have a chance to understand the content, otherwise the knowledge gain will again be low. Therefore, we selected the topic
Globally Distributed Software Engineering. It is, on the one hand, a computer science topic related to the studies of our participants, but also a very specific area which is not part of their curriculum. Thus, everyone had a chance to understand the topic based on their prior knowledge, favouring a positive knowledge gain. A negative effect of the pre-test is that it might influence the user behavior by providing hints on what to focus on in a video, because participants will try to get a good score on the post-test. We gathered a set of relevant questions inspired by the intermediate quizzes in the course material. However, we made sure to amend and change them since their reuse is restricted. We chose two to four questions intended to be asked after each video, together with a similar number of unrelated questions from other videos. Also, we put the videos in random order so that it was hard to guess which of the questions would be the relevant ones. In addition, the number of possible answers was different every time. As an example, the knowledge test for video 6_2a can be seen in Figure 3.

After filling out the pre-test, the participant was instructed to watch the entire video without pausing, rewinding, or taking notes. The reason is that we wanted the participants to get a full impression of the presentation instead of, again, just skipping to the parts relevant for the knowledge test to get a good score. Similarly, we assumed that if we allowed people to take notes, they would just write down reminders about the pre-test and focus solely on their appearance in the video. Admittedly, this is slightly different from a realistic setting, but we applied it in favor of the knowledge gain measurement. After watching the video, the person is asked to answer the same questions again and also to fill out an evaluation form with questions related to different quality aspects, see Table 3. The items are assessed using a Likert scale from 1 to 5.

This paragraph describes how we scored the knowledge test, with the goal of treating each video with similar importance, independent of 1) the number of relevant questions per quiz and 2) the number of possible answers per question. First, an unanswered question is scored as zero, since we gave the participants the option to skip a question in order to discourage random guessing. If the question was answered, we calculate the score for each answer option by increasing (decreasing) the score by 1 for a correct (false) answer. Thus, a question with five answer options can yield the scores −5, −3, −1, 1, 3, and 5. The knowledge gain of participant s after watching video v is then calculated as the difference between the pre-test score PB_vs and the post-test score PA_vs. Let n_v be the number of participants who watched video v.

Figure 3: The questionnaire for the pre- and post-test of video 6_2a. Questions 1 and 3 are relevant to this video.
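One possible reading of this scoring rule, which reproduces the score set {−5, ..., 5} for five answer options, is sketched below; an option counts as judged correctly if it is marked exactly when it belongs to the correct answers.

```python
def score_question(options, marked, correct):
    """Score one multiple-choice question: +1 for every answer option judged
    correctly, -1 for every option judged incorrectly; skipped questions score 0."""
    if not marked:                       # the participant skipped the question
        return 0
    return sum(1 if (opt in marked) == (opt in correct) else -1 for opt in options)

# Example: five options, two of them correct, participant marks one correct option.
print(score_question(["a", "b", "c", "d", "e"], {"a"}, {"a", "c"}))  # -> 3
```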
We start by computing the mean knowledge gain of the participants for each video:

\mu = \frac{\sum_{s=1}^{n_v} (PA_{vs} - PB_{vs})}{n_v} . (7)

Next is the standard deviation of the knowledge gain:

\sigma = \sqrt{\frac{1}{2 n_v} \left[ \sum_{s=1}^{n_v} (PB_{vs} - \mu)^2 + \sum_{s=1}^{n_v} (PA_{vs} - \mu)^2 \right]} . (8)

Based on the mean and standard deviation, the scores are normalized. PB'_{vs} is the normalized score, which is computed by subtracting the mean and dividing by the standard deviation; the same applies for PA'_{vs}:

PB'_{vs} = \frac{PB_{vs} - \mu}{\sigma} . (9)
Table 3: Automatically extracted features and correspondingitems in the evaluation form of the user study.Figure 4: The full evaluation form the users had to fill outfor each video.
Consequently, the normalized knowledge gain of participant s for video v is:

KG_{vs} = PA'_{vs} - PB'_{vs} . (10)

KG_v, the overall knowledge gain for video v, is finally calculated as the average over all participants' knowledge gains:

KG_v = \frac{1}{n_v} \sum_{s=1}^{n_v} KG_{vs} . (11)
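Formulas 7 to 11 translate directly into the following sketch, assuming the pre- and post-test scores of one video's participants are given as two equal-length lists.

```python
import numpy as np

def knowledge_gain(pre, post, eps=1e-9):
    """Normalized knowledge gain KG_v of one video (Formulas 7-11)."""
    pre, post = np.asarray(pre, dtype=float), np.asarray(post, dtype=float)
    n_v = len(pre)
    mu = np.sum(post - pre) / n_v                               # (7) mean raw gain
    sigma = np.sqrt((np.sum((pre - mu) ** 2) +                  # (8) pooled deviation
                     np.sum((post - mu) ** 2)) / (2 * n_v))
    pre_norm = (pre - mu) / max(sigma, eps)                     # (9) normalized pre-test
    post_norm = (post - mu) / max(sigma, eps)                   #     and post-test scores
    kg = post_norm - pre_norm                                   # (10) per-participant gain
    return float(np.mean(kg))                                   # (11) average over participants
```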
We conducted a correlation analysis to compare the automatically extracted features with their human-labeled counterparts. The following paragraphs discuss the results of this analysis.

For each pair of features, we compute a correlation coefficient r and a confidence value α (assuming a two-tailed hypothesis). r lies in the range [−1, +1], where r = −1 indicates a perfect negative correlation, r = 0 no correlation, and r = +1 a perfect positive correlation. α represents how extreme the observed correlation has to be in order to reject the null hypothesis ("the correlation happened by chance"). The α values are set according to the table of exact critical values for Spearman's correlation coefficient by Ramsey [21]. For Pearson's correlation coefficient, the α thresholds are given by Weathington et al. [26] (p. 452) and an extension table which is calculated according to the description in [26] (p. 451) using an R script [20].

In order to apply the appropriate correlation coefficient r together with the correct α value, multiple aspects have to be considered. First, based on the data type of the analyzed feature pair, we have to decide between Spearman's and Pearson's correlation coefficient. Spearman's coefficient is applicable to the majority of combinations, because all the automatic features are interval data and the manually labeled features are entirely ordinal data, except for the results of the knowledge gain test. The differences between post- and pre-knowledge tests are interval scaled, therefore Pearson analysis is the appropriate method for the correlation analysis between the knowledge gain score and the automatic features. The next important parameter to determine is the degree of freedom df = N − 2, where N is the sample size. There are 22 videos for the correlation analysis, so for the following analysis N = 22 and df = 20.

For each video, we have at least five annotations from which we compute the mean value. Table 4 shows the positive correlation results with a confidence level α < 0.05, so the effect is not coincidental with a confidence of at least 95%. Similarly, Table 5 shows the negative correlation results. The following subsections discuss the subset of human-labeled quality measures that had a statistically significant correlation with at least two automatically extracted features.
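In practice, the coefficients can be computed with SciPy, which also returns two-tailed p-values and thereby replaces the critical-value tables of Ramsey [21] and Weathington et al. [26]; the function below is a small convenience wrapper assumed for this study's setting (N = 22 videos).

```python
import numpy as np
from scipy import stats

def correlate(auto_feature, human_values, interval_scaled=False):
    """Spearman correlation for ordinal human ratings, Pearson for the
    interval-scaled knowledge gain scores; both over the 22 videos."""
    x = np.asarray(auto_feature, dtype=float)
    y = np.asarray(human_values, dtype=float)
    test = stats.pearsonr if interval_scaled else stats.spearmanr
    r, p = test(x, y)
    return float(r), float(p)
```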
Clear Language. Several of our automatically extracted features from the audio signal have a strong correlation (α < 0.05) with clear language, especially loudness, RMS energy, harmonicity (spectral), and F0. This is intuitive according to their definitions in Table 1 and shows that this quality aspect of a video can most likely be determined automatically when these features are regarded jointly. With respect to negative correlations, there is a strong connection (α < 0.005) between shimmer and clear language; shimmer also correlates with vocal diversity (α < 0.05). The negative correlation of the PVQ average contrasts with Hincks' [16] finding that a higher PVQ score is associated with a more lively voice. This inconsistency may be due to the data smoothing for F0 values: the smoothed values are less accurate and could lead to this effect.

Table 4: Positive correlation results with α < 0.1 (r_s: Spearman correlation coefficient, HNR: Harmonics-to-Noise Ratio, Detailing: Level of Detailing).
Clear Language: RMS Energy (r_s = 0.368, α < 0.1); Loudness (0.430, 0.05); Harmonicity (spectral) (0.435, 0.05); Detailing Mean (0.541, 0.02); Detailing Variance (0.448, 0.05)
Speed of Presentation: ∆ Jitter (0.408, 0.1); Jitter (0.423, 0.1)
Filler Words: Image Ratio Mean (0.394, 0.1); Articulation Rate (0.459, 0.05); Detailing Variance (0.493, 0.05); Detailing Mean (0.601, 0.005)
Image Design: ASD (0.371, 0.1)
Appropriate Detailing: Detailing Variance (0.378, 0.1)

Table 5: Negative correlation results with α < 0.1 (r_s: Spearman correlation coefficient, Coverage: Coverage of Slide Contents, Structure: Structuring of Presentation).
Clear Language: PVQ Average (r_s = -0.416, α < 0.1); Coverage Variance (-0.489, 0.05); Shimmer (-0.623, 0.005)
Vocal Diversity: Text Ratio Mean (-0.396, 0.1); Highlight (-0.406, 0.1); Shimmer (-0.437, 0.05)
Overall Rating: Speech Rate (-0.455, 0.05)
Image Design: Articulation Rate (-0.368, 0.1)
Speed of Presentation: Text Ratio Variance (-0.454, 0.05); Image Ratio Variance (-0.561, 0.01)
Text Design: Image Ratio Mean (-0.466, 0.05)
Structure: Coverage Variance (-0.402, 0.1)
Summary: Coverage Variance (-0.457, 0.05)
Vocal Diversity. The vocal diversity of the speaker had a strong correlation with three audio features and two multimodal features. While the latter lack an intuitive explanation, the features from the audio signal, namely logarithmic HNR, F0, and speech rate, most certainly contributed directly to this measurement. For instance, Yumoto et al. [28] found a negative correlation between the Harmonics-to-Noise Ratio and hoarseness, hinting that this metric reflects the voice quality of the speaker. For the negative correlations, shimmer, as mentioned before, severely impacts this metric and should definitely be considered when trying to predict it automatically.

Speed of Presentation. The rating regarding an appropriate presentation speed positively correlated with the two jitter features, which indicate changes in the speech rate. Possibly, the presenter regularly adapted the presentation speed based on the difficulty of the current slide, which would be an indicator for a good presentation. It is negatively correlated with the amount of images on the slides, which can be explained by the fact that an excessive amount of images on slides adds to the cognitive load: without sufficient time to process the visual contents, the viewer might feel overwhelmed.
Filler Words. Our filler words quality metric has a strong correlation with the articulation rate feature. This can be explained by the fact that a high articulation rate represents a high number of syllables per time unit, which is especially apparent if the speaker uses a lot of filler words like "uhhhmm" or "eehmm". The remaining features associated with it do not allow for a direct interpretation but might be correlated indirectly.
Appropriate Detailing. Our cross-modal feature appropriate detailing has a strong correlation with the variance of its automatic counterpart, namely detailing variance (α < 0.1).

One of the most important properties of an educational video is its ability to convey the intended information and, consequently, to improve the viewers' knowledge state. The following paragraphs highlight the automatically extracted features that had the biggest impact on the knowledge gain of the participants during the user study. Since both our extracted features and the knowledge gain scores are interval data, Pearson correlation analysis is used. First, Figure 5 shows the distribution of achieved knowledge gains for all 23 relevant questions answered during the user study. The data is normalized to prevent questions with more possible answers from skewing the results; the average knowledge gain over all questions is positive.

Figure 5: Knowledge gain distribution for all relevant questions in the user study (5 or 6 participants per question).

Table 6: Correlations between automatic features and the results of the knowledge gain tests, sorted by magnitude (r_p: Pearson correlation coefficient, α < 0.4).
Modulated Loudness -0.357 (0.2); Image Ratio Var. -0.282 (0.3); Coverage Avg. 0.278 (0.3); Highlight 0.264 (0.3); Coverage Var. 0.253 (0.3); Speech Rate -0.224 (0.4); Text Ratio Avg. 0.220 (0.4); Avg. Syllable Duration 0.218 (0.4); Detailing Var. -0.211 (0.4); RMS Energy 0.209 (0.4); Harmonicity 0.193 (0.4)

The feature with the strongest correlation with knowledge gain (r_p = -0.357, α < 0.2) is Modulated Loudness. This reflects the intuitive assumption that loud voices or unpleasant background noises hinder the learning experience and effect. Knowledge gain also correlates negatively with Image Ratio Variance (r_p = -0.282, α < 0.3), i.e., when the number of images varies significantly from slide to slide. This result indicates that a homogeneous slide design is to be preferred in a good presentation. Another intuitive result is that Highlight of Important Statements positively correlates with knowledge gain (r_p = 0.264, α < 0.3), as does Coverage of Slide Content (r_p = 0.278, α < 0.3). Speech rate (r_p = -0.224, α < 0.4) is negatively and average syllable duration (r_p = 0.218, α < 0.4) positively correlated with the knowledge gain of the learner; both reflect the situation when a speaker talks too fast. The remaining features, even though they correlate only slightly, showed effects as expected. For instance, RMS Energy and Harmonicity, which measure the liveliness of the speaker's voice, had a positive correlation with the knowledge gain.
4 CONCLUSIONS
In this paper, we have presented a set of unimodal and cross-modal features that can be automatically extracted from lecture videos. Furthermore, we have presented a user study that investigated the correlations between these features and quality aspects of the lecture videos. Also, the knowledge gain of users was measured and the correlations with the video features were evaluated. The results provided insights into a number of moderate to good correlations. We were able to represent the quality metrics Clear Language, Vocal Diversity, Speed of Presentation, and Filler Words each with at least three automatically extracted features and a confidence level of at least 95%. Additionally, we presented an approach to evaluate the level of Appropriate Detailing which correlated with human assessment.

In the future, we will investigate if and how these results generalize to other areas. Also, it is possible to measure the impact of user-related features like click behavior or gaze-tracking data on the learning process and their correlation to our quality metrics. Finally, we plan to utilize machine learning methods in order to predict video quality based on our set of features. This information could serve as an additional input for a recommender or retrieval system and add another dimension to distinguish between videos of similar content but differing presentation quality. In this way, the exploratory search capabilities of educational video portals and MOOCs could be improved.
ACKNOWLEDGMENTS
Part of this work is financially supported by the Leibniz Association, Germany (Leibniz Competition 2018, funding line "Collaborative Excellence", project SALIENT [K68/2017]).
REFERENCES
[1] Maristella Agosti, Norbert Fuhr, Elaine G. Toms, and Pertti Vakkari. 2013. Evaluation Methodologies in Information Retrieval (Dagstuhl Seminar 13441). Dagstuhl Reports 3, 10 (2013), 92–126. https://doi.org/10.4230/DagRep.3.10.92
[2] Ricardo A. Baeza-Yates and Berthier A. Ribeiro-Neto. 2011. Modern Information Retrieval: The Concepts and Technology Behind Search (2nd ed.).
[3] Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O'Reilly Media.
[4] Paul Boersma and David Weenink. Praat: doing phonetics by computer [Computer program].
[5] Cynthia J. Brame. 2016. Effective Educational Videos: Principles and Guidelines for Maximizing Student Learning from Video Content. CBE-Life Sciences Education 15, 4 (2016), es6. https://doi.org/10.1187/cbe.16-03-0125
[6] Lei Chen, Chee Wee Leong, Gary Feng, and Chong Min Lee. 2014. Using Multimodal Cues to Analyze MLA'14 Oral Presentation Quality Corpus: Presentation Delivery and Slides Quality. In Proceedings of the 2014 ACM Workshop on Multimodal Learning Analytics Workshop and Grand Challenge (MLA). ACM, New York, NY, USA, 45–52. https://doi.org/10.1145/2666633.2666640
[7] Kevyn Collins-Thompson, Soo Young Rieh, Carl C. Haynes, and Rohail Syed. 2016. Assessing Learning Outcomes in Web Search: A Comparison of Tasks and Query Strategies. In Proceedings of the 2016 ACM Conference on Human Information Interaction and Retrieval (CHIIR 2016). ACM, 163–172. https://doi.org/10.1145/2854946.2854972
[8] Creative Commons. 2019. Attribution-NonCommercial-ShareAlike 4.0 International. https://creativecommons.org/licenses/by-nc-sa/4.0. [Online; accessed 18-April-2019].
[9] Nivja H. de Jong and Ton Wempe. 2009. Praat Script to Detect Syllable Nuclei and Measure Speech Rate Automatically. Behavior Research Methods 41, 2 (2009), 385–390. https://doi.org/10.3758/BRM.41.2.385
[10] Carsten Eickhoff, Jaime Teevan, Ryen White, and Susan T. Dumais. 2014. Lessons from the Journey: A Query Log Analysis of Within-Session Learning. In Seventh ACM International Conference on Web Search and Data Mining (WSDM 2014). ACM, 223–232. https://doi.org/10.1145/2556195.2556217
[11] Florian Eyben, Felix Weninger, Florian Gross, and Björn Schuller. 2013. Recent Developments in openSMILE, the Munich Open-source Multimedia Feature Extractor. In Proceedings of the 21st ACM International Conference on Multimedia (MM '13). ACM, New York, NY, USA, 835–838. https://doi.org/10.1145/2502081.2502224
[12] Ujwal Gadiraju, Ran Yu, Stefan Dietze, and Peter Holtz. 2018. Analyzing Knowledge Gain of Users in Informational Search Sessions on the Web. In Proceedings of the 2018 Conference on Human Information Interaction and Retrieval (CHIIR 2018). 2–11. https://doi.org/10.1145/3176349.3176381
[13] Philip J. Guo, Juho Kim, and Rob Rubin. 2014. How Video Production Affects Student Engagement: An Empirical Study of MOOC Videos. In First (2014) ACM Conference on Learning @ Scale (L@S 2014). 41–50. https://doi.org/10.1145/2556325.2566239
[14] Fasih Haider, Loredana Cerrato, Nick Campbell, and Saturnino Luz. 2016. Presentation Quality Assessment Using Acoustic Information and Hand Movements. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, Shanghai, China, 2812–2816. https://doi.org/10.1109/ICASSP.2016.7472190
[15] Hynek Hermansky, Nathaniel Morgan, A. Bayya, and P. Kohn. 1992. RASTA-PLP Speech Analysis Technique. In ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 1. IEEE, San Francisco, CA, USA, 121–124. https://doi.org/10.1109/ICASSP.1992.225957
[16] Rebecca Hincks. 2005. Measures and Perceptions of Liveliness in Student Oral Presentation Speech: A Proposal for an Automatic Feedback Mechanism. System 33, 4 (2005), 575–591. https://doi.org/10.1016/j.system.2005.04.002
[17] Anett Hoppe, Peter Holtz, Yvonne Kammerer, Ran Yu, Stefan Dietze, and Ralph Ewerth. 2018. Current Challenges for Studying Search as Learning Processes. In Linked Learning Workshop - Learning and Education with Web Data (LILE), in conjunction with the ACM Conference on Web Science.
[18] Xavier Ochoa, Marcelo Worsley, Katherine Chiluiza, and Saturnino Luz. 2014. MLA'14: Third Multimodal Learning Analytics Workshop and Grand Challenges. In Proceedings of the 16th International Conference on Multimodal Interaction (ICMI 2014). ACM, Istanbul, Turkey, 531–532. https://doi.org/10.1145/2663204.2668318
[19] Georg Pardi, Yvonne Kammerer, and Peter Gerjets. 2019. Search and Justification Behavior During Multimedia Web Search for Procedural Knowledge. In Companion Publication of the 10th ACM Conference on Web Science (WebSci '19). ACM, New York, NY, USA, 17–20. https://doi.org/10.1145/3328413.3329405
[20] R Core Team. 2018. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
[21] Philip H. Ramsey. 1989. Critical Values for Spearman's Rank Order Correlation. Journal of Educational Statistics 14, 3 (1989), 245–253. https://doi.org/10.3102/10769986014003245
[22] Björn W. Schuller, Stefan Steidl, Anton Batliner, Alessandro Vinciarelli, Klaus R. Scherer, Fabien Ringeval, Mohamed Chetouani, Felix Weninger, Florian Eyben, Erik Marchi, Marcello Mortillaro, Hugues Salamin, Anna Polychroniou, Fabio Valente, and Samuel Kim. 2013. The INTERSPEECH 2013 Computational Paralinguistics Challenge: Social Signals, Conflict, Emotion, Autism. In INTERSPEECH 2013: 14th Annual Conference of the International Speech Communication Association. Lyon, France.
[23] João Paulo Teixeira, Carla Oliveira, and Carla Lopes. 2013. Vocal Acoustic Analysis - Jitter, Shimmer and HNR Parameters. Procedia Technology 9 (2013), 1112–1122.
[24] Pertti Vakkari. 2016. Searching as Learning: A Systematization Based on Literature. Journal of Information Science 42, 1 (2016), 7–18. https://doi.org/10.1177/0165551515615833
[25] Erlijn van Genuchten, Katharina Scheiter, and Anne Schüler. 2012. Examining Learning from Text and Pictures for Different Task Types: Does the Multimedia Effect Differ for Conceptual, Causal, and Procedural Tasks? Computers in Human Behavior 28, 6 (2012), 2209–2218. https://doi.org/10.1016/j.chb.2012.06.028
[26] Bart L. Weathington, Christopher J. L. Cunningham, and David J. Pittenger. 2012. Understanding Business Research. John Wiley & Sons, Inc., Hoboken, New Jersey. https://doi.org/10.1002/9781118342978
[27] Ran Yu, Ujwal Gadiraju, Peter Holtz, Markus Rokicki, Philipp Kemkes, and Stefan Dietze. 2018. Predicting User Knowledge Gain in Informational Search Sessions. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (SIGIR 2018). 75–84. https://doi.org/10.1145/3209978.3210064
[28] Eiji Yumoto, W. J. Gould, and T. Baer. 1982. Harmonics-to-Noise Ratio as an Index of the Degree of Hoarseness. Journal of the Acoustical Society of America 71, 6 (1982), 1544–1550.