A Multimodal CNN-based Tool to Censor Inappropriate Video Scenes
Pedro V. A. de Freitas, Paulo R. C. Mendes, Gabriel N. P. dos Santos, Antonio José G. Busson, Álan Livio Guedes, Sérgio Colcher, Ruy Luiz Milidiú
Pedro V. A. de Freitas, PUC-Rio, [email protected]
Paulo R. C. Mendes, PUC-Rio, [email protected]
Gabriel N. P. dos Santos, PUC-Rio, [email protected]
Antonio José G. Busson, PUC-Rio, [email protected]
Álan Livio Guedes, PUC-Rio, [email protected]
Sérgio Colcher, PUC-Rio, [email protected]
Ruy Luiz Milidiú, PUC-Rio, [email protected]
November 12, 2019

Abstract
Due to the extensive use of video-sharing platforms and services for their storage, the amount of such media on the internet has become massive. This volume of data makes it difficult to control the kind of content that may be present in such video files. One of the main concerns regarding video content is whether it has an inappropriate subject matter, such as nudity, violence, or other potentially disturbing content. More than telling whether a video is appropriate or inappropriate, it is also important to identify which parts of it contain such content, preserving parts that would otherwise be discarded in a simple broad analysis. In this work, we present a multimodal (using audio and image features) architecture based on Convolutional Neural Networks (CNNs) for detecting inappropriate scenes in video files. In the task of classifying video files, our model achieved 98.95% and 98.94% of F1-score for the appropriate and inappropriate classes, respectively. We also present a censoring tool that automatically censors inappropriate segments of a video file.

Keywords: Inappropriate Video · CNN · Deep Learning
1 Introduction

The amount of multimedia content on the internet, especially video, is increasing each year. More than 300 hours of video are uploaded to YouTube every minute (https://biographon.com/youtube-stats/). Due to the amount of material uploaded, controlling the content that is loaded is quite challenging even for large companies. For example, Facebook and YouTube are being sued for hosting videos from the Christchurch shootings.

The word Inappropriate is often used as a reference to media that contain content such as nudity, intense sexuality, violence, gore, or other potentially disturbing subject matter. On the other hand, Appropriate means that the content is suitable for most viewers. Figure 1 illustrates these two categories: there are three scenes with appropriate content on the left, and three scenes with inappropriate content on the right.
Figure 1: Examples of each category. (a) Appropriate videos; (b) Inappropriate videos.

Besides controlling the content of entire video files, it is also important to identify which of their scenes are inappropriate. For example, YouTube only considers the entire video when telling whether it is appropriate or inappropriate: only one inappropriate scene is necessary for banning an entire video from the platform. For that reason, an ongoing problem is that, in some cases, a video has its entire content labeled as inappropriate when only specific scenes contain inappropriate content (e.g. movies and documentaries on wars and conflicts). For instance, suppose that students are looking for a documentary about the Vietnam War for their history class, but they cannot find any because the storage service in which they are searching has banned all documentaries on that subject for containing violent scenes. Thus, a tool that not only labels videos but also provides information on which scenes are inappropriate would allow access to the video content while preventing exposure to these scenes. This work presents an approach for detecting and censoring inappropriate scenes in video files. This is done in a way that appropriate scenes are not censored, while the inappropriate ones pass through a process that makes them presentable.

Controlling the type of content loaded to a video storage service requires an automatic analysis that is efficient and fast. Methods based on Deep Learning (DL) became the state of the art in various segments related to automatic video analysis. Convolutional Neural Network (CNN) architectures, or ConvNets, have become the primary method used for audio-visual pattern recognition.

Other works share the motivation of classifying video files in the categories we mentioned [1, 2, 3]. However, most of them do not use both audio and image for classification, use hand-crafted feature extraction methods, or do not use the latest feature extraction CNNs, which have been showing great potential in video recognition and classification. Our work uses two deep CNNs, one to extract image sequence features and another to extract audio features. We combine those features to create a single feature vector for the entire video (or for video segments), which is then used as input to the classifier. It is a rather simple method for video classification, and yet it still yields better results than the related work.

Similar to ours, [4] proposes a model for detecting inappropriate scenes in video files. They extract the frames from the video file and use them as input to a CNN model to classify it. Their work differs from ours in the sense that they extract features only from images (video frames) and do not consider the audio track of the video. Instead, we divide the video into smaller segments and extract features from both their frames and their audio track.

To present our proposal, this paper is organized as follows. Section 2 discusses the model used for classifying a video as either appropriate or inappropriate. Section 3 presents the tool we developed for censoring inappropriate video scenes while preserving the appropriate ones. Section 4 shows the results we obtained both in video classification and with our censorship tool. Finally, Section 5 presents our final remarks and future work.
2 Classification Model

Our CNN-based classifier is composed of two modules. The first module is what researchers call the backbone, which acts as a feature extractor from which the model draws its discriminating power. The second module, the classifier, operates over the features extracted by the backbone to aggregate and classify them. We opt for a bimodal approach that uses two backbones to extract the audio and image features from videos. Once we have extracted the features from the video, we then use a shallow model to perform the video classification. In the remainder of this section, we detail the embeddings extractor and the algorithms used for classification.

Figure 2: Multimodal architecture for inappropriate video classification.

A CNN, when trained, tends to learn low-level features at its first layers (e.g., in the visual domain: edges, corners, contours). At the intermediate and final layers, the combination of these filters helps to extract more complex features, resulting in a vector of continuous numbers called embeddings. In this work, we use two benchmark CNNs to extract both image and audio embeddings from videos by using the transfer learning technique [5].

First, we extract both visual and audio embeddings. Based on the work of Abu-El-Haija et al. [6], we decode each video at 1 frame per second, up to the first 360 seconds, and feed an InceptionV3 [7] with network weights pre-trained on ImageNet to extract the image embeddings. We also feed an AudioVGG [8] with network weights pre-trained on AudioSet (https://research.google.com/audioset) to extract the audio embeddings. Next, we apply PCA (with whitening) to transform the dimensions of the image embeddings to size 1024 and of the audio embeddings to size 128. Finally, we concatenate both image and audio embeddings to compose the final video embeddings with 1152 dimensions.

The video embeddings are then fed to a Support Vector Machine (SVM) [9] for classification. In an SVM, the data is mapped into a higher-dimensional input space where an optimal separating hyperplane is constructed. These decision surfaces are found by solving a linearly constrained quadratic programming problem. The architecture of our Inappropriate classifier is illustrated in Figure 2.
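As a minimal sketch of this pipeline (not our released code), the steps above can be reproduced with off-the-shelf components: Keras' InceptionV3 with ImageNet weights as the image backbone, audio embeddings assumed to be already computed by the AudioVGG backbone, and scikit-learn's PCA (with whitening) and SVC. The mean pooling over time used to obtain a single vector per video or segment is an assumption made for illustration.

```python
# Sketch of the bimodal embedding + SVM pipeline (assumptions noted above).
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# Image backbone: InceptionV3 pre-trained on ImageNet, global-average-pooled (2048-d).
image_backbone = InceptionV3(weights="imagenet", include_top=False, pooling="avg")

# PCA with whitening, fitted beforehand on training-set frame/audio embeddings.
image_pca = PCA(n_components=1024, whiten=True)   # 2048 -> 1024
audio_pca = PCA(n_components=128, whiten=True)    # raw audio embedding -> 128

def video_embedding(frames, audio_embeds):
    """frames: (T, 299, 299, 3) array sampled at 1 fps (first 360 s at most).
    audio_embeds: (T', D) per-second embeddings from the audio backbone."""
    img = image_backbone.predict(preprocess_input(frames.astype("float32")))
    img = image_pca.transform(img).mean(axis=0)            # (1024,)
    aud = audio_pca.transform(audio_embeds).mean(axis=0)   # (128,)
    return np.concatenate([img, aud])                      # (1152,) video embedding

# Shallow classifier: X is (N, 1152) video embeddings, y is 0/1 labels.
svm = SVC()
# svm.fit(X, y); svm.predict(video_embedding(frames, audio_embeds).reshape(1, -1))
```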
We evaluate the model by the Precision (P), Recall (R), and F1-score for the appropriate and inappropriate classes:

P = \frac{TP}{TP + FP}    (1)

R = \frac{TP}{TP + FN}    (2)

F1 = 2 \times \frac{P \times R}{P + R}    (3)

where TP, TN, FP, and FN denote the examples that are true positives, true negatives, false positives, and false negatives, respectively. The F1-score, defined in Equation 3, measures how precise the classifier is through the harmonic mean of Precision (Equation 1) and Recall (Equation 2). The F1-score represents an overall performance metric, while precision and recall can give insights on where the classification model is doing better.
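For reference, the per-class metrics of Equations (1)-(3) can be computed with scikit-learn; the labels below are toy values used only for illustration.

```python
# Per-class Precision, Recall, and F1 (Equations 1-3) via scikit-learn; toy data.
from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 0, 0, 1, 1, 1, 1, 0]   # 0 = appropriate, 1 = inappropriate
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]
p, r, f1, support = precision_recall_fscore_support(y_true, y_pred, labels=[0, 1])
for name, i in (("appropriate", 0), ("inappropriate", 1)):
    print(f"{name}: P={p[i]:.2f} R={r[i]:.2f} F1={f1[i]:.2f} support={support[i]}")
```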
3 Censorship Tool

We designed a tool that receives a video and automatically censors the parts of it that may contain inappropriate content. Figure 3 shows how it works, and the process is summarized in the following steps (a code sketch of this flow is given after the list):

1. Split: the received video is split into video segments (5 seconds each at most).

2. Classification: each segment is labeled as either appropriate or inappropriate.

3. Censorship: if a segment is labeled as inappropriate, its audio is removed and its image is blurred.

4. Merge: finally, the video segments are merged (the ones first labeled as appropriate and the ones that passed through the censorship process).
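The four steps map directly onto a small driver loop. The sketch below is illustrative only: split_video, classify_segment, censor_segment, and merge_segments are hypothetical helper names standing in for the concrete implementations described in the remainder of this section.

```python
# Hypothetical driver for the four-step flow; the helper functions are placeholders.
def censor_video(input_path, output_path, segment_seconds=5):
    segments = split_video(input_path, segment_seconds)    # 1. Split (5 s each at most)
    processed, inappropriate_scenes = [], []
    for segment in segments:
        label = classify_segment(segment)                  # 2. Classification (Section 2 model)
        if label == "inappropriate":
            segment = censor_segment(segment)              # 3. Censorship: blur frames, drop audio
            inappropriate_scenes.append((segment.start, segment.duration))
        processed.append(segment)
    merge_segments(processed, output_path)                 # 4. Merge appropriate + censored parts
    return inappropriate_scenes                            # censored spans, e.g. for reporting
```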
Figure 3: Censorship tool overview.

Our censorship tool is implemented in the Python programming language. This choice comes from the fact that the language is broadly used for deep learning solutions and provides a myriad of libraries for audio, image, and video processing.

The Split and Merge steps are performed using the MoviePy library (https://zulko.github.io/moviepy/). For the Classification step, the classification model detailed in Section 2 is used. The Censorship step is carried out by extracting all the frames of the video segment, applying a Gaussian blur to each frame, and then replacing the frames with the processed ones. The Gaussian blur formula is the following:

G(x, y) = \frac{1}{2\pi\sigma^{2}} e^{-\frac{x^{2} + y^{2}}{2\sigma^{2}}}

In that formula, x is the distance from the origin on the horizontal axis, y is the distance from the origin on the vertical axis, and σ is the standard deviation of the Gaussian distribution. Figure 4 shows an example of such image processing.

Figure 4: Example of Gaussian blur. (a) Original image; (b) Blurred image.

The audio track from the original video segment is not attached to the new one, so the audio censorship is performed by removing it.

Besides automatically censoring inappropriate scenes from the video file, the tool we designed also returns an XML file that gives the beginning and duration of each inappropriate scene, so that the video service has the autonomy to decide what to do with such scenes.
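A concrete version of the Split, Censorship, and Merge steps could look like the minimal sketch below. MoviePy is the library named above; the SciPy Gaussian blur, the sigma value, and the is_inappropriate callback (in practice, the Section 2 classifier) are assumptions made for illustration, not details taken from our implementation.

```python
# Sketch: split into 5-second segments, blur and mute inappropriate ones, merge.
from moviepy.editor import VideoFileClip, concatenate_videoclips
from scipy.ndimage import gaussian_filter

def blur_frame(frame, sigma=15):
    # Gaussian blur applied per color channel (sigma = 0 on the channel axis).
    return gaussian_filter(frame, sigma=(sigma, sigma, 0))

def censor(segment):
    # Replace every frame with its blurred version and drop the audio track.
    return segment.fl_image(blur_frame).without_audio()

def censor_file(input_path, output_path, is_inappropriate, seconds=5):
    clip = VideoFileClip(input_path)
    starts = range(0, int(clip.duration), seconds)
    segments = [clip.subclip(t, min(t + seconds, clip.duration)) for t in starts]  # 1. Split
    processed = [censor(s) if is_inappropriate(s) else s for s in segments]        # 2-3. Classify + censor
    concatenate_videoclips(processed).write_videofile(output_path)                 # 4. Merge
```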
4 Results

For model training, we used a video dataset with 33,500 inappropriate videos and 33,500 appropriate videos. The inappropriate video files consist of pornographic and violent content.

The pornographic content was obtained from the XVideos database (https://info.xvideos.com/db). This database was chosen as a source of pornographic content because of its size (7 million videos) and variety of annotations (minor and major tags: each video has one major tag and many other minor tags). To select a sample of this database, we selected videos from the 70 major tags to maintain a distribution similar to that of the original database. In particular, to prevent lower-quantity major tags from disappearing, we defined a minimum of 10 videos for each major tag.

The violent content (hereafter referred to as gore) was collected from specialized websites through web crawlers (e.g. bestgore). It is mainly composed of videos depicting deaths, exposed injuries, diseases, accidents, and other mentally disturbing imagery.

As for the appropriate videos, we collected them from the YouTube-8M dataset (https://research.google.com/youtube8m). This choice comes from the size of the dataset (almost 8 million videos), its diverse tagging, and the video classification challenges it holds, which make the dataset widely available. We also added the Cholec80 [10] dataset to ours; it contains 80 videos of cholecystectomy surgeries performed by 13 surgeons. All videos from the Cholec80 dataset were labeled as appropriate, since videos of surgery are usual in some specific contexts (e.g. the educational context).

We split the dataset into 90% for training and 10% for testing. Then, we perform a 20-fold cross-validation and use the F1-score, recall, and precision metrics for model evaluation. In Subsection 4.1, we present the results of the model in both the training and test steps. Next, in Subsection 4.2, we present a usage scenario to attest the applicability of our trained model.
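The evaluation protocol can be reproduced roughly as follows. This is a sketch: scikit-learn, the stratified splits, and the fixed random seed are assumptions, and the synthetic arrays stand in for the real 1152-dimensional video embeddings and their labels.

```python
# Sketch of the evaluation protocol: 90/10 train/test split, then 20-fold
# cross-validation scored with F1, recall, and precision.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate, train_test_split
from sklearn.svm import SVC

# Synthetic stand-ins for the real video embeddings and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 1152))
y = rng.integers(0, 2, size=400)          # 0 = appropriate, 1 = inappropriate

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=42)

cv = StratifiedKFold(n_splits=20, shuffle=True, random_state=42)
scores = cross_validate(SVC(), X_train, y_train, cv=cv,
                        scoring=("f1", "recall", "precision"))
for metric in ("test_f1", "test_recall", "test_precision"):
    print(metric, scores[metric].mean(), "+/-", scores[metric].std())
```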
4.1 Model Results

Table 1 shows the performance of our model in the 20-fold cross-validation. One can observe that our approach achieved good results in all metrics, with a small standard deviation. In the test step, our model achieved similar results, as can be seen in Table 2.

Table 1: F1-score, Precision, Recall, and Support (mean ± standard deviation over the 20 folds) for the Appropriate and Inappropriate classes.

Table 2: F1-score, Precision, Recall, and Support on the test set for the Appropriate and Inappropriate classes.

In our application, a higher inappropriate F1-score is more desirable than a higher appropriate F1-score.
4.2 Usage Scenario

We propose a usage scenario in which a video file needs to be verified for detecting and censoring any inappropriate scenes in it. In this scenario, one is presented with an hour-long video of an interview. To make sure the entire video is safe for uploading to social media, one chooses to use our tool for automatically detecting and censoring any inappropriate scenes it may contain.

To accomplish this task, our tool first splits the video file into 5-second segments. Then, it uses our classification model to label each segment as either appropriate or inappropriate. Next, it performs the censorship by applying a Gaussian filter to each frame of the segments predicted as inappropriate and removing their audio. Finally, all resulting segments of the video are merged, generating a new video in which the inappropriate scenes are censored. Figure 5 illustrates how the video is split, labeled, censored, and merged, resulting in a censored video and accomplishing the proposed task.

Figure 5: Censorship tool use example.
5 Final Remarks

In this work, we presented a CNN-based model for detecting inappropriate scenes in video files. Our approach achieved high performance, with an F1-score of 98.95% for appropriate videos and 98.94% for inappropriate videos. To attest to the applicability of our proposal, we created a usage scenario for our model, while also creating a tool for automatically censoring inappropriate scenes in video files.

It is important to point out that the tool we proposed only works with video files that are already completely available. To extend our tool to work with online content (e.g. video streams and live broadcasts), the content would first have to be buffered into video segments, processed (which takes some time), and then made available.
References

[1] Kwang Ho Song and Yoo-Sung Kim. Pornographic video detection scheme using multimodal features. Journal of Engineering and Applied Sciences, 13(5):1174–1182, 2018.

[2] Manuel Torres Castro. Automatic flagging of offensive video content using deep learning. Master's thesis, Universitat Politècnica de Catalunya, 2018.

[3] Jônatas Wehrmann, Gabriel S. Simões, Rodrigo C. Barros, and Victor F. Cavalcante. Adult content detection in videos with convolutional and recurrent neural networks. Neurocomputing, 272:432–438, 2018.

[4] Kamrun Nahar Tofa, Farhana Ahmed, Arif Shakil, et al. Inappropriate scene detection in a video stream. PhD thesis, BRAC University, 2017.

[5] Chuanqi Tan, Fuchun Sun, Tao Kong, Wenchang Zhang, Chao Yang, and Chunfang Liu. A survey on deep transfer learning. In International Conference on Artificial Neural Networks, pages 270–279. Springer, 2018.

[6] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. YouTube-8M: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675, 2016.

[7] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[8] Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, et al. CNN architectures for large-scale audio classification. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 131–135. IEEE, 2017.

[9] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

[10] Andru P. Twinanda, Sherif Shehata, Didier Mutter, Jacques Marescaux, Michel de Mathelin, and Nicolas Padoy. EndoNet: A deep architecture for recognition tasks on laparoscopic videos. IEEE Transactions on Medical Imaging, 36(1):86–97, 2017.