A Clustering-Based Method for Automatic Educational Video Recommendation Using Deep Face-Features of Lecturers
Paulo R. C. Mendes, Eduardo S. Vieira, Álan L. V. Guedes, Antonio J. G. Busson, Sérgio Colcher
TeleMidia Lab, Department of Informatics
Pontifical Catholic University of Rio de Janeiro
Rio de Janeiro, Brazil
{paulo.mendes, eduardo, alan, busson}@telemidia.puc-rio.br, [email protected]
Abstract—Discovering and accessing specific content within educational video bases is a challenging task, mainly because of the abundance of video content and its diversity. Recommender systems are often used to enhance the ability to find and select content. But recommendation mechanisms, especially those based on textual information, exhibit some limitations, such as being error-prone due to manually created keywords or imprecise speech recognition. This paper presents a method for generating educational video recommendations using deep face-features of lecturers without identifying them. More precisely, we use an unsupervised face clustering mechanism to create relations among the videos based on the lecturers' presence. Then, for a selected educational video taken as a reference, we recommend the ones where the presence of the same lecturers is detected. Moreover, we rank these recommended videos based on the amount of time the referenced lecturers were present. For this task, we achieved a mAP value of 99.165%.
Index Terms—Multimedia Retrieval, Deep Learning, Clustering, Educational Video, Video Analysis.
I. INTRODUCTION
The traditional paradigm of classroom courses centered on the physical presence of a teacher has been gradually giving space to online and hybrid courses, which enables the emergence of VTEs (Virtual Teaching Environments) and MOOCs (Massive Open Online Courses), such as Udacity (https://udacity.com), Coursera (https://coursera.org), and EdX (https://edx.org). For example, in 2018, a study [1] has shown that almost 59% of people aged 14 to 23 prefer YouTube as a learning tool rather than printed books, with 55% of them also saying that YouTube has contributed to their education. Recently, due to the COVID-19 outbreak, the world has experienced an unprecedented usage of virtual education [2], and some say that this model of education came to stay.

If, on the one hand, the abundance of educational videos can contribute to and facilitate learning, on the other hand, it also makes it challenging to discover and access the content of interest [3]. This issue is usually addressed by a proactive user search (using queries, for example), or by automatic recommendations made by specialized systems.

Recommendation mechanisms are usually based on two methods: collaborative filtering and content-based filtering. In collaborative filtering, the system groups users based on their common interest in items, using users' preferences, ratings, purchases, or accesses to those items. With this approach, knowledge about the item's content is not needed; the recommendation is purely based on the relationship between users and items. Content-based filtering, differently, requires items' descriptions; similar items are the ones recommended to the user.

In general, current video recommendation methods are heavily dependent on textual information from the video, such as labels (i.e., keywords) [4], [5], or captions automatically generated [6] from the lecturer's speech. These systems face problems such as errors introduced by manually inserted labels and by imprecise speech recognition. In our research, we aim to investigate methods that are able to perform video recommendations that are not based on content nor on any error-prone textual descriptions, but solely on lecturers' presence. Notice that this approach does not necessarily have to completely replace textual-based recommendations; in fact, it can easily be used as an additional aid to enhance the ability to find content in any system.

Face detection methods have been attracting the attention of researchers for more than two decades [7]. Nowadays, they are used for surveillance, video analytics systems, smart shopping, automatic face tagging in photo collections, investigative tools that search for identities in social networks based on face images, and thousands of other applications in our daily lives. For instance, Facer [8] is Facebook's face detection and recognition framework; given a photograph, it first detects all the faces, and then runs a deep model to determine the likelihood of that face belonging to one of the top-N user friends. This allows Facebook to suggest which friends the user might want to tag within the uploaded photographs.

This work aims at recommending educational video content based on lecturers' presence. To do that, we take advantage of face detection methods. More precisely, we detect lecturers in a video taken as a reference and perform a clustering based on the faces of these lecturers in different videos. Given these clusters, we extract their centroids (explained in Section III), and perform another clustering step for creating a relationship between videos that share the presence of the same lecturers. Finally, we rank the recommended videos based on the amount of time the referenced lecturers were present.
A particular feature of this approach is that it can be done without supervision, allowing new videos to be automatically analyzed. Moreover, our approach permits the creation of timelines based on lecturers' presence, which can be used in the search for specific parts of a content where only specific lecturers are present. To evaluate our recommendation ranking, we use the mAP (Mean Average Precision) metric, which is commonly used in information retrieval evaluation tasks [9].

The remainder of this paper is structured as follows. Section II discusses some related work. Then, we present our method in Section III. Section IV presents the used dataset, followed by Section V, which shows the experiments to validate the face clustering and the video recommendation ranking mechanisms. Finally, Section VI brings our final remarks.

II. RELATED WORK
We have organized the related work into two groups. In the first, we grouped works that share our goal of educational video recommendation but do not necessarily use face-embeddings (deep face-features). In the second group, every work addresses the task of face recognition in videos.

Regarding Educational Video Recommendation, we cite works based on content-filtering. These works perform analyses and comparisons using the video textual description or speech recognition performed on them. Omisore et al. [5], for example, propose combining fuzzy techniques to recommend books with content suitable for students based on their reading histories in a digital library, while Mahajan et al. [4] propose, given a reference video, mining social media and the web for suggesting links for a student to visit. Moreover, Barrére et al. [6] use texts from speech recognition to create recommendations. These works rely only on textual characteristics (or content converted to text) for performing recommendations. Our work focuses on using a visual part of the video, more precisely the presence of lecturers.

Works based on Video Face Recognition usually apply deep learning models for the task. DeepFace [10] and DeepID [11], for example, use a CNN (Convolutional Neural Network) with a fully-connected layer output to produce a representation of high-level features (face embeddings) from an input image, followed by a softmax layer to indicate the identity classes. Other approaches, such as FaceNet [12], can directly measure the similarity among faces in Euclidean space. Yang et al. [13] proposed a deep network for video face recognition called NAN (Neural Aggregation Network). They use a CNN to generate the embeddings, followed by an aggregation module that consists of two attention blocks which adaptively aggregate the feature vectors to form a single feature inside the convex hull spanned by them. Rao et al. [14] proposed a method for video face recognition based on attention-aware deep reinforcement learning. They formulated the process of finding the attention of videos as a Markov decision process and trained the attention model without using extra labels. Unlike existing attention models, their method takes information from both the image space and the feature space as the input to make use of face information that is discarded in the feature learning process. Sohn et al. [15] proposed an adaptive deep learning framework for image-based and video-based face recognition. Given an embedding generated by a CNN, their framework adaptation is achieved by (1) distilling knowledge from the network to a video adaptation network through feature matching, (2) performing feature restoration through synthetic data augmentation, and (3) learning a domain-invariant feature through an adversarial domain discriminator.

Like [13], [14], [15], our method uses a CNN to generate face embeddings from face images, with the difference that we use an unsupervised cluster-based method to compare the similarity among faces extracted from videos.
III. METHOD
Our method intends to recommend educational videos based on the lecturers that appear in each video, so that, when a person watches a video, other videos containing the same lecturers are recommended. For didactic purposes, we divide our exposition into two phases: (i) video representation and (ii) video recommendation, which are described in Sections III-A and III-B, respectively.
A. Video Representation
The objective of this phase is to represent each video with vectors (centroids) of the lecturers that appear in it. Fig. 1 shows the pipeline we propose for this phase, described in the remainder of this subsection. It is divided into four steps: Frames Extraction, Face Detection, Embeddings Generation, and Clustering Representation.

Fig. 1: Lecturer representation process in video. This process receives a video file and returns the centroids of the clusters that ideally represent each of the lecturers present in the video file.

First, we perform the Frames Extraction by receiving a video file as input and extracting its frames according to a given frame rate. Next, the Face Detection step uses an object detection model for detecting faces in each of the frames. The face detection model is responsible for returning the bounding boxes of the faces present in the image, giving the x and y coordinates of the upper-left corner and of the lower-right corner of the rectangle that establishes the visual limits encapsulating each face. With these bounding boxes, we can isolate and extract the bounded images, obtaining a dataset composed of images of faces only.

The objective of the Embeddings Generation step is to represent each face image as a vector in R^n. To achieve that, it processes each of the faces generated in the previous step through a CNN that generates their embeddings. An embedding is a representation of the input in a lower-dimensionality space. Ideally, an embedding captures some semantics of the input, e.g., by placing semantically similar inputs close together in an embedding space. Therefore, at the end of this step, we have all faces represented as embeddings.
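To make these first steps concrete, the sketch below samples frames at a fixed rate with OpenCV and crops the detected face bounding boxes. It is a minimal sketch, not our exact implementation: it assumes the mtcnn package for detection (the models actually used in our experiments are given in Section V), and note that MTCNN reports boxes as an upper-left corner plus width and height rather than two corners.

import cv2
from mtcnn import MTCNN  # assumed third-party detector; see Section V

detector = MTCNN()

def extract_face_crops(video_path, rate_fps=1.0):
    # Sample frames at rate_fps and return the cropped face images.
    cap = cv2.VideoCapture(video_path)
    step = max(int(round(cap.get(cv2.CAP_PROP_FPS) / rate_fps)), 1)
    faces, idx = [], 0
    while True:
        ok, frame_bgr = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
            for det in detector.detect_faces(frame_rgb):
                x, y, w, h = det["box"]  # upper-left corner, width, height
                faces.append(frame_rgb[max(y, 0):y + h, max(x, 0):x + w])
        idx += 1
    cap.release()
    return faces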
In the Clustering Representation step, we group embeddings (e), and, consequently, faces that are close enough in the embedding space, using a clustering algorithm. Clustering is the task of dividing a set of data points (embeddings, in this case) into a number of groups (called clusters) such that data points in a given group are similar to other data points in the same group and dissimilar to the data points in other groups. The clustering process should produce a partition of the faces present in the frames, hopefully with each generated cluster representing a specific person; moreover, the union of all clusters covers the whole set of faces found in the video.

As most clustering algorithms require the number of clusters as a parameter, we use a strategy (defined in Algorithm 1) based on the Silhouette Score (s) [16], which corresponds to the mean of the Silhouette Coefficient (σ) of all samples. This coefficient for each sample is

    σ = (b − a) / max(a, b)    (1)

where a is the mean distance from a sample to all other samples in the same cluster, and b is the mean distance from a sample to all other samples in the closest cluster to that sample. In this way, the best value is 1 and the worst is −1. Values close to 0 indicate overlapping clusters, whereas negative values usually indicate that a sample has been assigned to the wrong cluster, since a different cluster is more similar.

Algorithm 1: Iteratively finding the best clustering configuration for an unknown number of clusters.

 1: procedure BLINDCLUSTERING(e, t, ω)
 2:   n_K ← 1
 3:   s_max ← −1
 4:   t_cur ← 0
 5:   while t_cur ≤ t and n_K < |e| do
 6:     n_K ← n_K + 1
 7:     K_cur ← Clustering(e, n_K)
 8:     s ← SilhouetteScore(K_cur)
 9:     if s < s_max then
10:       t_cur ← t_cur + 1
11:     else
12:       K ← K_cur
13:       t_cur ← 0
14:       if s > s_max then
15:         s_max ← s
16:       end if
17:     end if
18:   end while
19:   if s_max < ω then
20:     K ← OneCluster(e)
21:   end if
22:   return K
23: end procedure

With the strategy defined in Algorithm 1, we increase the number of clusters until the maximum Silhouette Score decreases more than t times in a row or until it reaches the maximum number of clusters (lines 5–18), which is the number of embeddings (|e|). The Clustering procedure (line 7) can be substituted by any clustering algorithm that requires the number of clusters in advance. When the iteration stops, we return the clustering configuration with the highest Silhouette Score. Since the Silhouette Coefficient requires at least two clusters, it would not be possible to compute the Silhouette Score for a clustering configuration with only one cluster (i.e., when there are only faces of a single person). To overcome this problem, we start with 2 clusters, consecutively increasing this number as described above. Then, if the returned clustering configuration has a Silhouette Score smaller than a threshold ω, which probably indicates overlapping, we say that all faces belong to one single cluster (lines 19–20).

Next, we compute the clusters' centroids for each of the clusters k ∈ K, where K is the best clustering configuration found with the Silhouette Score. A centroid c_k for each cluster is the mean of the elements present in the cluster, and can be defined as follows:

    c_k = (1/|k|) Σ_{a∈k} a    (2)

where a represents each element of a cluster k.

By the end of this phase, we have each video in the dataset represented by its centroids, where, ideally, each centroid represents a lecturer present in the video. We also record the frames where each lecturer is present.
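For concreteness, Algorithm 1 and the centroid computation of Eq. (2) can be rendered in a few lines of Python. This is a sketch built on scikit-learn's agglomerative clustering and silhouette score (the concrete choices reported in Section V); the default value of omega below is purely illustrative.

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

def blind_clustering(e, t=5, omega=0.25):
    # Algorithm 1: grow the number of clusters until the Silhouette
    # Score decreases more than t times in a row (omega is illustrative).
    labels, s_max, t_cur, n_k = np.zeros(len(e), dtype=int), -1.0, 0, 1
    while t_cur <= t and n_k + 1 < len(e):  # stop before 1 point per cluster
        n_k += 1
        cur = AgglomerativeClustering(n_clusters=n_k,
                                      linkage="ward").fit_predict(e)
        s = silhouette_score(e, cur)
        if s < s_max:
            t_cur += 1               # one more consecutive decrease
        else:
            labels, t_cur, s_max = cur, 0, max(s_max, s)
    if s_max < omega:                # likely overlap: a single lecturer
        labels = np.zeros(len(e), dtype=int)
    return labels

def centroids(e, labels):
    # Eq. (2): the centroid of a cluster is the mean of its embeddings.
    return {k: e[labels == k].mean(axis=0) for k in np.unique(labels)}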
Fig. 2: Video recommendation based on lecturer centroids clustering. This pipeline receives the centroids of lecturers from all the videos in the dataset, then it creates relationships among videos that share the presence of the same lecturers. Finally, it ranks the recommended videos for each of the videos in the dataset. This ranking is based on the number of lecturers shared and their time presence.
B. Video Recommendation
This phase aims at recommending videos based on the lecturers present in them. It is divided into two steps: Centroids Clustering and Ranking, as depicted in Fig. 2.

First, we gather the centroids from the videos of the dataset as one single set and perform the Centroids Clustering. For performing this clustering, we also use the strategy for an unknown number of clusters described in Algorithm 1. By doing that, we group centroids from the same lecturer that are in different videos. For instance, in Fig. 2, one can see that the purple lecturer is present in both Videos 1 and 2, while the orange lecturer is present in both Videos 2 and n. By the end of this step, we have the group L of lecturers present in the dataset of videos V, and we can also denote L_v as the group of lecturers present in video v.

Next, based on these relationships among different videos, we perform Ranking, by recommending videos in which lecturers of the current video are present. For doing that, we compute a similarity score using the presence of the lecturers in the current video and the presence of these same lecturers in the other video. Let p_{l,v} denote the percentage of frames in which the lecturer l ∈ L_v is present in video v ∈ V. For each video v ∈ V and u ∈ V − {v}, we compute a score of similarity S_{v,u}:

    S_{v,u} = Σ_{l∈L_v} p_{l,v} · p_{l,u}    (3)

Finally, using this score, for each video v we compute a ranking R_v, where R_{v,i} denotes the i-th greatest S_{v,u} and R_{v,i} ≥ R_{v,i+1} for all i ∈ 1...n_v, where n_v is the number of videos u for which S_{v,u} > 0. In this way, the more lecturers a video has in common with the reference video, and the more time these lecturers are present in both videos, the higher the video is positioned in the ranking of the reference video.

By the end of this phase, we have a ranking of recommended videos for each video in the dataset. It is important to notice that our method is unsupervised and does not require information about the lecturers in advance. Consequently, we do not store any information regarding the identity of the lecturers, respecting their privacy.
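A minimal sketch of Eq. (3) and the ranking step, assuming the presences p_{l,v} have already been computed; the nested-dictionary layout (video → lecturer label → presence) is adopted only for illustration:

def similarity_scores(presence):
    # presence[v][l] = fraction of frames of video v in which lecturer l
    # appears (p_{l,v} in Eq. (3)).
    scores = {}
    for v in presence:
        for u in presence:
            if u == v:
                continue
            s = sum(p_lv * presence[u].get(l, 0.0)
                    for l, p_lv in presence[v].items())
            if s > 0:
                scores.setdefault(v, {})[u] = s
    return scores

def ranking(scores, v):
    # R_v: videos u ordered by decreasing S_{v,u}.
    s_v = scores.get(v, {})
    return sorted(s_v, key=s_v.get, reverse=True)

# Toy example; lecturer ids are the labels from the centroids clustering.
presence = {"v1": {0: 0.8, 1: 0.2}, "v2": {0: 0.5}, "v3": {2: 1.0}}
print(ranking(similarity_scores(presence), "v1"))  # ['v2']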
IV. DATASET

The experiments were conducted using a dataset created in the context of this work. It is composed of 98 educational videos publicly available on YouTube.

Each video contains at least one lecturer; moreover, some videos may include special participations or collaborations. Thus, each video is annotated as containing between 1 and 5 people. In total, 16 people are present in the dataset. Each person has an average presence of 6.67% in the videos, and their identities are known and ready to be used to assist in the nominal dataset organization.

Regarding the duration, the videos vary between 00m:30s and 1h:49m:01s. The average duration of the videos is 23m:34s, with a standard deviation of 23m:05s. The high value of the standard deviation indicates that the videos are not in the same time range, and therefore have a wide variety of durations.
V. EXPERIMENTS
First, we compute the centroids that represent each lecturer in each video using the process described in Section III-A. Next, we perform the video recommendation using the process described in Section III-B.

For representing the video files in the dataset, we start by performing Frames Extraction for each video file using a frame rate of 1 frame per second (fps). Next, in the Face Detection step, we use MTCNN [17] (Multitask Cascaded Convolutional Networks), which is widely used for the face detection task [18], [19], [20]. Once we have detected the faces of lecturers in the video frames, we perform Embeddings Generation using SE-ResNet-50 [21] (SeNet-50 for short), which generates embeddings in an R^n feature space. We used the architecture and weights pre-trained on the VGGFace2 dataset [22], available in the keras-vggface library (https://github.com/rcmalli/keras-vggface). The VGGFace2 dataset contains 3.31 million images of 9,131 subjects and has large variations in pose, age, illumination, ethnicity, and profession. Finally, we use Algorithm 1 in the Clustering Representation step with the parameters t = 5 and ω = 0. , and the Ward Agglomerative Clustering [23] as the Clustering procedure, using its implementation in the scikit-learn [24] library. The Ward Agglomerative Clustering algorithm merges the pairs of clusters that minimize the Ward criterion, which is the variance of the clusters being merged. By using this method, at each step, the algorithm finds the pair of clusters that leads to a minimum increase in total within-cluster variance after merging. Finally, we compute the centroids of each of the clusters generated.

For performing the video recommendation task, we gather the centroids (that represent each lecturer in the video) from all videos in the dataset. Next, we perform the process described in Section III-B. For Centroids Clustering, we also use Algorithm 1 with the same parameters and the Ward Agglomerative Clustering as the Clustering procedure. Finally, based on the clusters generated, we perform the Ranking step.

The remainder of this section describes the evaluation of the centroids clustering (Section V-A) and the evaluation of the video recommendation (Section V-B).
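As a reference for this setup, a minimal sketch combining the two models; it assumes the mtcnn and keras-vggface packages, with the 224×224 input size and version=2 preprocessing that keras-vggface uses for its VGGFace2 models:

import cv2
import numpy as np
from mtcnn import MTCNN
from keras_vggface.vggface import VGGFace
from keras_vggface.utils import preprocess_input

detector = MTCNN()
# SeNet-50 pre-trained on VGGFace2; average pooling gives one
# embedding vector per face image.
encoder = VGGFace(model="senet50", include_top=False,
                  input_shape=(224, 224, 3), pooling="avg")

def embed_faces(frame_rgb):
    # Detect faces with MTCNN and encode each crop with SeNet-50.
    embeddings = []
    for det in detector.detect_faces(frame_rgb):
        x, y, w, h = det["box"]
        crop = frame_rgb[max(y, 0):y + h, max(x, 0):x + w]
        crop = cv2.resize(crop, (224, 224)).astype("float32")
        crop = preprocess_input(np.expand_dims(crop, axis=0), version=2)
        embeddings.append(encoder.predict(crop)[0])
    return embeddings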
A. Centroids Clustering Evaluation
Our evaluation aims at discovering the precision achieved by the clustering over the centroids. More precisely, we want to evaluate how well our approach identified that the same lecturer is present in different video files. For this task, we require some human feedback. To receive that feedback, we developed an application called VideoFacesTool, consisting of a graphical web interface. The tool allows participants to import a file, which contains information produced after the Centroids Clustering step, so that we have the set of centroids and a sample face image for each of them. Inside the tool, a cluster is called a group. After being imported, faces are visually organized according to the group to which they belong, and when a group is selected, all face centroids from that group are displayed, as shown in Fig. 3.

Fig. 3: Centroid image correction in VideoFacesTool. On the top, each image represents one lecturer. When a lecturer is selected, the tool displays all appearances (centroids) of that lecturer in different videos. The user can then mark each of these appearances as correct or wrong.

Each face centroid has a Boolean property, which indicates whether it is correctly grouped (it belongs to a group in which all the face centroids are from the same lecturer) or not. If there is an error, the participant can indicate it.

A total of 5 participants collaborated in the evaluation session. They were advised to mark a face centroid as wrong if it represents (a) an object, (b) a part of the human body, or (c) a lecturer other than the lecturer in the group. Fig. 4 shows examples of these types of errors, and Table I provides an overview of the evaluations obtained from each participant.

Fig. 4: Examples of wrong face centroids: (a) a part of an icon, (b) a hand, and (c) face centroids that are not from the same lecturer.

TABLE I: Results of the visual evaluation of face centroids showing the number of correct and wrong centroids classified by each participant.

It is important to notice that these results do not reflect the recommendation of educational videos; they only evaluate the Centroids Clustering step. For instance, we could have a group of face centroids of people that appear for a small amount of time in the videos and are not lecturers. This case, of course, would reduce the precision of the Centroids Clustering step. However, our method for video recommendation and ranking is robust to these cases, as it considers the amount of time that a person appears when scoring the recommended videos. With the evaluation completed, it is possible to export the analysis information with the number of right and wrong face centroids.
B. Recommendation Evaluation
We evaluate our approach based on the relevance of the videos recommended. A video is considered relevant to another if they have at least one lecturer in common. To verify that, we use the information of the lecturers' presence available in our dataset.

To evaluate our ranking, for each video we compute the Average Precision (AP), which evaluates how well a ranking of recommendations is based on each element's relevancy. This metric penalizes a ranking more if a non-relevant element is recommended in the first positions than if it were in the last ones. Let P_k be the precision of the first k elements of a ranking, which is the percentage of videos that are relevant in the sub-ranking that starts at position 1 and ends at position k. Let α_k denote the relevancy of the video in position k, where α_k = 1 if the video is relevant, and 0 otherwise. The AP of a given ranking is defined as follows:

    AP = (1/GTP) Σ_{k=1}^{n} P_k · α_k    (4)

where GTP refers to the total number of ground truth positives in the ranking, which is the total number of videos that are considered relevant in a ranking. Fig. 5 shows an example of how the AP is computed for a given ranking. In this case, GTP = 3 because the total number of relevant videos in the ranking is 3 (videos A, B, and D).

Fig. 5: Example of how the Average Precision (AP) is computed for a reference video and its recommended videos. For the five recommended videos A–E ranked 1st to 5th, with A, B, and D relevant, AP = 1/3 (1/1 + 2/2 + 0 + 3/4 + 0) = 0.91667.

In order to prevent outliers from having much influence in the recommendation (e.g., a person that is not a lecturer, and not relevant to the video, who appears for a short amount of time), we experimented with different thresholds of presence intervals in a video for a person to be considered as "present" when computing the score for the ranking. In this way, a p_{l,v} lesser than the threshold is considered as 0. Besides the Average Precision, we also compute the mean and minimum values of the recall (MeanR and MinR), precision (MeanP and MinP), and F1-score (MeanF1 and MinF1) for the recommendation generated for each of the videos, without considering the positioning of these videos in the rankings. The recall refers to the percentage of relevant videos that are present in the ranking. The F1-score represents an overall performance metric based on the harmonic mean of the precision and recall and is defined as follows:

    F1 = (2 · P · R) / (P + R)    (5)

Table II shows the thresholds used, the values of recall, precision, F1-score, and the mean and minimum Average Precision (mAP and MinAP).

TABLE II: Results obtained with our approach with different thresholds of time presence for a lecturer to be considered as present in a video.

Threshold  MeanR    MinR     MeanP    MinP     MeanF1   MinF1    mAP      MinAP
0%         0.88851  0.45455  0.64681  0.20370  0.70971  0.33333  0.98641  0.58597
1%         0.88851  0.45455  0.64681  0.20370  0.70971  0.33333  0.98641  0.58597
2%         0.88851  0.45455  0.64885  0.20370  0.71171  0.33333  0.98641  0.58597
3%         0.88851  0.45455  0.67368  0.21569  0.73086  0.34921  0.98642  0.58597
4%         0.88851  0.45455  0.67368  0.21569  0.73086  0.34921  0.98642  0.58597
5%         0.88851  0.45455  0.69923  0.22449  0.74930  0.36066  0.98642  0.58597
6%         0.88851  0.45455  0.73615  0.25714  0.77648  0.40000  0.98642  0.58597
7%         0.88742  0.45455  0.77849  0.31429  0.80768  0.45833  0.98642  0.58597
8%         0.88742  0.45455  0.78408  0.31429  0.81111  0.45833  0.98643  0.58597
9%         0.88742  0.45455  0.80171  0.33333  0.82410  0.47826  0.98643  0.58597
10%        0.88742  0.45455  0.83165  0.42308  0.84306  0.51613  0.98643  0.58597
11%        0.88742  0.45455  0.85306  0.45714  0.85693  0.53333  0.98643  0.58597
12%        0.88616  0.45455  0.85956  0.44118  0.86018  0.50847  0.98662  0.58597
13%        0.88490  0.45455  0.88216  0.43750  0.87305  0.49123  0.98688  0.58597
14%        0.88289  0.45455  0.90265  0.47368  0.88450  0.51064  0.98884  0.58597
15%        0.88289  0.45455  0.90265  0.47368  0.88450  0.51064  0.98884  0.58597
16%        0.88163  0.44000  0.91327  0.47368  0.88980  0.47826  0.98908  0.58597
17%        0.87197  0.44000  0.91538  0.47368  0.88580  0.48889  0.98912  0.58597
18%        0.87197  0.44000  0.91538  0.47368  0.88580  0.48889  0.98912  0.58597
19%        0.87086  0.44000  0.93165  0.47368  0.89476  0.50000  0.98946  0.58597
20%        0.86130  0.35484  0.93645  0.44444  0.89218  0.46809  0.99000  0.58597
21%        0.86130  0.35484  0.93645  0.44444  0.89218  0.46809  0.99000  0.58597
22%        0.86130  0.35484  0.95054  0.61111  0.89886  0.46809  0.99000  0.58597
23%        0.86130  0.35484  0.95054  0.61111  0.89886  0.46809  0.99000  0.58597
24%        0.85805  0.32000  0.95718  0.64286  0.90046  0.43243  0.99165  0.58597
25%        0.85805  0.32000  0.95718  0.64286  0.90046  0.43243  0.99165  0.58597

One can observe from Table II that the precision clearly increased with the use of the threshold. Different from the precision, the recall decreased as the threshold increased. It means that with a greater threshold, more videos that should be recommended were not chosen by our method. It is important to notice that these two metrics (precision and recall) do not consider the ordering of the recommendations. Different from them, the Mean Average Precision (mAP) has high values for all thresholds, especially because the score for computing the ranking takes into consideration the percentage of time that a person appears in the reference and recommended videos. Then, we can conclude that our proposed approach for ordering the recommended videos tends to recommend more suitable videos first, with a high mAP of approximately 0.99. Moreover, despite the precision of the Centroids Clustering step shown in Table I, our method for ranking was robust to outliers and was able to correctly recommend and rank relevant videos.
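For reference, Eq. (4) in code, reproducing the Fig. 5 example (a small sketch; the argument lists the relevancy flags α_k in ranking order):

def average_precision(relevant_flags):
    # AP of a ranking (Eq. (4)); relevant_flags[k-1] is alpha_k.
    gtp = sum(relevant_flags)        # ground truth positives
    if gtp == 0:
        return 0.0
    ap, hits = 0.0, 0
    for k, alpha in enumerate(relevant_flags, start=1):
        if alpha:
            hits += 1
            ap += hits / k           # P_k contributes only when alpha_k = 1
    return ap / gtp

# Fig. 5: videos A, B, D relevant at positions 1, 2, and 4.
print(average_precision([1, 1, 0, 1, 0]))  # 0.91666...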
VI. FINAL REMARKS
In this paper, we presented a method for educational video recommendation using deep face-features of lecturers. More precisely, we use an unsupervised clustering-based method and a heuristic for ranking. It takes advantage of face detection mechanisms to perform educational video recommendation based on the lecturers' presence. Besides the face detection, we also perform face clustering of the lecturers in each video, and, given these clusters, we extract their centroids to perform another clustering step that creates a relationship among videos that share the presence of the same lecturers. Finally, we rank the recommended videos based on the amount of time each lecturer is present. It is worth mentioning that our method is completely automatic and does not require any information about the video files in advance. Moreover, our approach does not need to know or store the identity of the lecturers for performing recommendation, preserving their privacy.

A collateral contribution of our paper is video segmentation by lecturer. As illustrated in Fig. 6, we can create a timeline based on lecturers' presence, which can be used to help students find moments where specific lecturers are present. With this segmentation, we could recommend specific parts of the video to the student.

Fig. 6: Educational video timeline tagged by the lecturers' presence. Notice that the frames with the lecturer on the left are tagged in red, while the frames with the lecturer on the right are tagged in yellow. This timeline results from the Video Representation phase, described in Section III-A.
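A sketch of how such a timeline can be derived from the per-frame cluster labels produced by the Video Representation phase; the helper and its input layout are hypothetical:

def lecturer_timeline(frame_labels, fps=1.0):
    # Group consecutive frames with the same lecturer label into
    # (label, start_s, end_s) segments; None marks frames with no face.
    segments, start = [], 0
    for i in range(1, len(frame_labels) + 1):
        if i == len(frame_labels) or frame_labels[i] != frame_labels[start]:
            if frame_labels[start] is not None:
                segments.append((frame_labels[start], start / fps, i / fps))
            start = i
    return segments

print(lecturer_timeline([0, 0, None, 1, 1, 1]))
# [(0, 0.0, 2.0), (1, 3.0, 6.0)]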
The main limitation of our work is that we can only recommend videos in which the lecturers are visually present. As future work, we intend to use a hybrid recommendation approach that combines both textual and audiovisual information from the video to create clusters. Video summarization is also a technique that can be explored to enhance video content searching and selection.
REFERENCES

[2] Nature Materials, vol. 19, no. 6, pp. 687–687, 2020.
[3] L. L. Dias, J. S. Barbosa, E. Barrére, and J. F. de Souza, "An approach to identify similarity among educational resources using external knowledge bases," Brazilian Journal of Computers in Education, vol. 25, no. 02, p. 18, 2017.
[4] R. Mahajan, J. Sodhi, and V. Mahajan, "Optimising web usage mining for building adaptive e-learning site: a case study," International Journal of Innovation and Learning, vol. 18, no. 4, pp. 471–486, 2015.
[5] M. Omisore and O. Samuel, "Personalized recommender system for digital libraries," International Journal of Web-Based Learning and Teaching Technologies (IJWLTT), vol. 9, no. 1, pp. 18–32, 2014.
[6] E. Barrére, J. F. de Souza, M. A. Vitor, and M. A. de Almeida, "Utilização de enriquecimento semântico para a recomendação automática de videoaulas no moodle," Revista Brasileira de Informática na Educação, vol. 28, p. 319, 2020.
[7] I. Masi, Y. Wu, T. Hassner, and P. Natarajan, "Deep face recognition: A survey," 2018, pp. 471–478.
[8] K. Hazelwood, S. Bird, D. Brooks, S. Chintala, U. Diril, D. Dzhulgakov, M. Fawzy, B. Jia, Y. Jia, A. Kalro et al., "Applied machine learning at Facebook: A datacenter infrastructure perspective," IEEE, 2018, pp. 620–629.
[9] C. D. Manning, "Introduction to Information Retrieval: CS 276 lecture slides," Stanford University, 2009.
[10] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, "DeepFace: Closing the gap to human-level performance in face verification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1701–1708.
[11] Y. Sun, X. Wang, and X. Tang, "Deep learning face representation from predicting 10,000 classes," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1891–1898.
[12] F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: A unified embedding for face recognition and clustering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 815–823.
[13] J. Yang, P. Ren, D. Zhang, D. Chen, F. Wen, H. Li, and G. Hua, "Neural aggregation network for video face recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4362–4371.
[14] Y. Rao, J. Lu, and J. Zhou, "Attention-aware deep reinforcement learning for video face recognition," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 3931–3940.
[15] K. Sohn, S. Liu, G. Zhong, X. Yu, M.-H. Yang, and M. Chandraker, "Unsupervised domain adaptation for face recognition in unlabeled videos," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 3210–3218.
[16] P. J. Rousseeuw, "Silhouettes: a graphical aid to the interpretation and validation of cluster analysis," Journal of Computational and Applied Mathematics, vol. 20, pp. 53–65, 1987.
[17] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, "Joint face detection and alignment using multitask cascaded convolutional networks," IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499–1503, Oct. 2016.
[18] E. Jose, M. Greeshma, M. H. TP, and M. Supriya, "Face recognition based surveillance system using FaceNet and MTCNN on Jetson TX2," IEEE, 2019, pp. 608–613.
[19] A. Ghofrani, R. M. Toroghi, and S. Ghanbari, "Realtime face-detection and emotion recognition using MTCNN and MiniShuffleNet V2," IEEE, 2019, pp. 817–821.
[20] G. Bezerra and R. Gomes, "Recognition of occluded and lateral faces using MTCNN, dlib and homographies," 2018.
[21] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141.
[22] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman, "VGGFace2: A dataset for recognising faces across pose and age," IEEE, 2018, pp. 67–74.
[23] J. H. Ward Jr., "Hierarchical grouping to optimize an objective function," Journal of the American Statistical Association, vol. 58, no. 301, pp. 236–244, 1963.
[24] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.