A Multimodal CNN-based Tool to Censor Inappropriate Video Scenes
Pedro V. A. de Freitas, Paulo R. C. Mendes, Gabriel N. P. dos Santos, Antonio José G. Busson, Álan Livio Guedes, Sérgio Colcher, Ruy Luiz Milidiú
Pedro V. A. de Freitas, PUC-Rio, [email protected]
Paulo R. C. Mendes, PUC-Rio, [email protected]
Gabriel N. P. dos Santos, PUC-Rio, [email protected]
Antonio José G. Busson, PUC-Rio, [email protected]
Álan Livio Guedes, PUC-Rio, [email protected]
Sérgio Colcher, PUC-Rio, [email protected]
Ruy Luiz Milidiú, PUC-Rio, [email protected]
November 12, 2019

Abstract
Due to the extensive use of video-sharing platforms and services for their storage, the amount of such media on the internet has become massive. This volume of data makes it difficult to control the kind of content that may be present in such video files. One of the main concerns regarding video content is whether it has an inappropriate subject matter, such as nudity, violence, or other potentially disturbing content. More than telling whether a video is appropriate or inappropriate, it is also important to identify which parts of it contain such content, preserving parts that would otherwise be discarded in a simple broad analysis. In this work, we present a multimodal (using audio and image features) architecture based on Convolutional Neural Networks (CNNs) for detecting inappropriate scenes in video files. In the task of classifying video files, our model achieved 98.95% and 98.94% of F1-score for the appropriate and inappropriate classes, respectively. We also present a censoring tool that automatically censors inappropriate segments of a video file.

Keywords: Inappropriate Video · CNN · Deep Learning
1 Introduction

The amount of multimedia content on the internet, especially video, is increasing each year. More than 300 hours of video are uploaded to YouTube every minute (https://biographon.com/youtube-stats/). Due to the amount of material uploaded, controlling the content that is loaded is quite challenging even for large companies. For example, Facebook and YouTube are being sued for hosting videos from the Christchurch shootings.

The word Inappropriate is often used as a reference to media that contain content such as nudity, intense sexuality, violence, gore, or other potentially disturbing subject matter. On the other hand, Appropriate means that the content is suitable for most viewers. Figure 1 illustrates these two categories: there are three scenes with appropriate content on the left, and three scenes with inappropriate content on the right.
Figure 1: Examples of each category. (a) Appropriate videos; (b) Inappropriate videos.

Besides controlling the content of entire video files, it is also important to identify which of their scenes are inappropriate. For example, YouTube only considers the entire video when telling whether it is appropriate or inappropriate: only one inappropriate scene is necessary for banning an entire video from the platform. For that reason, an ongoing problem is that, in some cases, a video has its entire content labeled as inappropriate when only specific scenes contain inappropriate content (e.g. movies and documentaries on wars and conflicts). For instance, suppose that students are looking for a documentary about the Vietnam War for their history class, but they cannot find any because the storage service in which they are searching has banned all documentaries on that subject for containing violent scenes. Thus, a tool that not only labels videos but also provides information on which scenes are inappropriate would allow access to the video content while preventing exposure to these scenes. This work presents an approach for detecting and censoring inappropriate scenes in video files. This is done in a way that appropriate scenes are not censored, while the inappropriate ones pass through a process that makes them presentable.

Controlling the type of content loaded to a video storage service requires an automatic analysis that is efficient and fast. Methods based on Deep Learning (DL) became the state of the art in various segments related to automatic video analysis. Convolutional Neural Network (CNN) architectures, or ConvNets, have become the primary method used for audio-visual pattern recognition.

Other works share the motivation of classifying video files in the categories we mentioned [1, 2, 3]. However, most of them do not use both audio and image for classification, use hand-crafted feature extraction methods, or do not use the latest feature extraction CNNs, which have been showing great potential in video recognition and classification. Our work uses two deep CNNs, one to extract image sequence features and another to extract audio features. We combine those features to create a single feature vector for the entire video (or for video segments), which is then used as input to the classifier. It is a rather simple method for video classification, and yet it still yields better results than the related work.

Similar to ours, [4] proposes a model for detecting inappropriate scenes in video files. They extract the frames from the video file and use them as input to a CNN model to classify it. Their work differs from ours in the sense that they extract features only from images (video frames) and do not consider the audio track of the video. Instead, we divide the video into smaller segments and extract features from both their frames and their audio track.

To present our proposal, this paper is organized as follows. Section 2 discusses the model used for classifying a video as either appropriate or inappropriate. Section 3 presents the tool we developed for censoring inappropriate video scenes while preserving the appropriate ones. Section 4 shows the results we obtained both in video classification and with our censorship tool. Finally, Section 5 presents our final remarks and future work.
2 Classification Model

Our CNN-based classifier is composed of two modules. The first module is what researchers call the backbone, which acts as a feature extractor from which the model draws its discriminating power. The second module, the classifier, operates over the features extracted by the backbone to aggregate and classify them. We opt for a bimodal approach that uses two backbones to extract the audio and image features from videos. Once we have extracted the features from the video, we then use a shallow model to perform the video classification. In the remainder of this section, we detail the embeddings extractor and the algorithms used for classification.

Figure 2: Multimodal architecture for inappropriate video classification.

A CNN, when trained, tends to learn low-level features at its first layers (e.g., in the visual domain: edges, corners, contours). At the intermediate and final layers, the combination of these filters helps to extract more complex features, resulting in a vector of continuous numbers called embeddings. In this work, we use two benchmark CNNs to extract both image and audio embeddings from videos by using the transfer learning technique [5].

First, we extract both visual and audio embeddings. Based on the work of Abu-El-Haija et al. [6], we decode each video at 1 frame per second, up to the first 360 seconds, and feed an InceptionV3 [7] with network weights pre-trained on ImageNet to extract the image embeddings. We also feed an AudioVGG [8] with network weights pre-trained on AudioSet (https://research.google.com/audioset) to extract the audio embeddings. Next, we apply PCA (with whitening) to transform the dimensions of the image embeddings to size 1024 and of the audio embeddings to size 128. Finally, we concatenate both image and audio embeddings to compose the final video embeddings with 1152 dimensions.

The video embeddings are then fed to a Support Vector Machine (SVM) [9] for classification. In an SVM, the data is mapped into a higher-dimensional input space where an optimal separating hyperplane is constructed. These decision surfaces are found by solving a linearly constrained quadratic programming problem. The architecture of our Inappropriate classifier is illustrated in Figure 2.
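As a minimal sketch of this pipeline (not our released code), the steps above can be reproduced with off-the-shelf components: Keras' InceptionV3 with ImageNet weights as the image backbone, audio embeddings assumed to be already computed by the AudioVGG backbone, and scikit-learn's PCA (with whitening) and SVC. The mean pooling over time used to obtain a single vector per video or segment is an assumption made for illustration.

```python
# Sketch of the bimodal embedding + SVM pipeline (assumptions noted above).
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# Image backbone: InceptionV3 pre-trained on ImageNet, global-average-pooled (2048-d).
image_backbone = InceptionV3(weights="imagenet", include_top=False, pooling="avg")

# PCA with whitening, fitted beforehand on training-set frame/audio embeddings.
image_pca = PCA(n_components=1024, whiten=True)   # 2048 -> 1024
audio_pca = PCA(n_components=128, whiten=True)    # raw audio embedding -> 128

def video_embedding(frames, audio_embeds):
    """frames: (T, 299, 299, 3) array sampled at 1 fps (first 360 s at most).
    audio_embeds: (T', D) per-second embeddings from the audio backbone."""
    img = image_backbone.predict(preprocess_input(frames.astype("float32")))
    img = image_pca.transform(img).mean(axis=0)            # (1024,)
    aud = audio_pca.transform(audio_embeds).mean(axis=0)   # (128,)
    return np.concatenate([img, aud])                      # (1152,) video embedding

# Shallow classifier: X is (N, 1152) video embeddings, y is 0/1 labels.
svm = SVC()
# svm.fit(X, y); svm.predict(video_embedding(frames, audio_embeds).reshape(1, -1))
```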
We evaluate the model by the Precision (P), Recall (R), and F1-score for the appropriate and inappropriate classes:

P = \frac{TP}{TP + FP}    (1)

R = \frac{TP}{TP + FN}    (2)

F1 = 2 \times \frac{P \times R}{P + R}    (3)

where TP, TN, FP, and FN denote the examples that are true positives, true negatives, false positives, and false negatives, respectively. The F1-score, defined in Equation 3, measures how precise the classifier is through the harmonic mean of Precision (Equation 1) and Recall (Equation 2). The F1-score represents an overall performance metric, while precision and recall can give insights on where the classification model is doing better.
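For reference, the per-class metrics of Equations (1)-(3) can be computed with scikit-learn; the labels below are toy values used only for illustration.

```python
# Per-class Precision, Recall, and F1 (Equations 1-3) via scikit-learn; toy data.
from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 0, 0, 1, 1, 1, 1, 0]   # 0 = appropriate, 1 = inappropriate
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]
p, r, f1, support = precision_recall_fscore_support(y_true, y_pred, labels=[0, 1])
for name, i in (("appropriate", 0), ("inappropriate", 1)):
    print(f"{name}: P={p[i]:.2f} R={r[i]:.2f} F1={f1[i]:.2f} support={support[i]}")
```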
3 Censorship Tool

We designed a tool that receives a video and automatically censors the parts of it that may contain inappropriate content. Figure 3 shows how it works, and the process is summarized in the following steps (a code sketch of this flow is given after the list):

1. Split: the received video is split into video segments (5 seconds each at most).

2. Classification: each segment is labeled as either appropriate or inappropriate.

3. Censorship: if a segment is labeled as inappropriate, its audio is removed and its image is blurred.

4. Merge: finally, the video segments are merged (the ones first labeled as appropriate and the ones that passed through the censorship process).
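The four steps map directly onto a small driver loop. The sketch below is illustrative only: split_video, classify_segment, censor_segment, and merge_segments are hypothetical helper names standing in for the concrete implementations described in the remainder of this section.

```python
# Hypothetical driver for the four-step flow; the helper functions are placeholders.
def censor_video(input_path, output_path, segment_seconds=5):
    segments = split_video(input_path, segment_seconds)    # 1. Split (5 s each at most)
    processed, inappropriate_scenes = [], []
    for segment in segments:
        label = classify_segment(segment)                  # 2. Classification (Section 2 model)
        if label == "inappropriate":
            segment = censor_segment(segment)              # 3. Censorship: blur frames, drop audio
            inappropriate_scenes.append((segment.start, segment.duration))
        processed.append(segment)
    merge_segments(processed, output_path)                 # 4. Merge appropriate + censored parts
    return inappropriate_scenes                            # censored spans, e.g. for reporting
```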
Figure 3: Censorship tool overview.

Our censorship tool is implemented in the Python programming language. This choice comes from the fact that the language is broadly used for deep learning solutions and provides a myriad of libraries for audio, image, and video processing.

The Split and Merge steps are performed using the MoviePy library (https://zulko.github.io/moviepy/). For the Classification step, the classification model detailed in Section 2 is used. The Censorship step is carried out by extracting all the frames of the video segment, applying a Gaussian blur to each frame, and then replacing the frames with the processed ones. The Gaussian blur formula is the following:

G(x, y) = \frac{1}{2\pi\sigma^{2}} e^{-\frac{x^{2} + y^{2}}{2\sigma^{2}}}

In that formula, x is the distance from the origin on the horizontal axis, y is the distance from the origin on the vertical axis, and σ is the standard deviation of the Gaussian distribution. Figure 4 shows an example of such image processing.

Figure 4: Example of Gaussian blur. (a) Original image; (b) Blurred image.

The audio track from the original video segment is not attached to the new one, so the audio censorship is performed by removing it.

Besides automatically censoring inappropriate scenes from the video file, the tool we designed also returns an XML file that gives the beginning and duration of each inappropriate scene, so that the video service has the autonomy to decide what to do with such scenes.
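A concrete version of the Split, Censorship, and Merge steps could look like the minimal sketch below. MoviePy is the library named above; the SciPy Gaussian blur, the sigma value, and the is_inappropriate callback (in practice, the Section 2 classifier) are assumptions made for illustration, not details taken from our implementation.

```python
# Sketch: split into 5-second segments, blur and mute inappropriate ones, merge.
from moviepy.editor import VideoFileClip, concatenate_videoclips
from scipy.ndimage import gaussian_filter

def blur_frame(frame, sigma=15):
    # Gaussian blur applied per color channel (sigma = 0 on the channel axis).
    return gaussian_filter(frame, sigma=(sigma, sigma, 0))

def censor(segment):
    # Replace every frame with its blurred version and drop the audio track.
    return segment.fl_image(blur_frame).without_audio()

def censor_file(input_path, output_path, is_inappropriate, seconds=5):
    clip = VideoFileClip(input_path)
    starts = range(0, int(clip.duration), seconds)
    segments = [clip.subclip(t, min(t + seconds, clip.duration)) for t in starts]  # 1. Split
    processed = [censor(s) if is_inappropriate(s) else s for s in segments]        # 2-3. Classify + censor
    concatenate_videoclips(processed).write_videofile(output_path)                 # 4. Merge
```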
4 Results

For model training, we used a video dataset with 33,500 inappropriate videos and 33,500 appropriate videos. The inappropriate video files consist of pornographic and violent content.

The pornographic content was obtained from the XVideos database (https://info.xvideos.com/db). This database was chosen as a source of pornographic content because of its size (7 million videos) and variety of annotations (minor and major tags: each video has one major tag and many other minor tags). To select a sample of this database, we selected videos from the 70 major tags to maintain a distribution similar to that of the original database. In particular, to prevent lower-quantity major tags from disappearing, we defined a minimum of 10 videos for each major tag.

The violent content (hereafter referred to as gore) was collected from specialized websites through web crawlers (e.g. bestgore). It is mainly composed of videos depicting deaths, exposed injuries, diseases, accidents, and other mentally disturbing imagery.

As for the appropriate videos, we collected them from the YouTube-8M dataset (https://research.google.com/youtube8m). This choice comes from the size of the dataset (almost 8 million videos), its diverse tagging, and the video classification challenges it holds, which make the dataset widely available. We also added the Cholec80 [10] dataset to ours; it contains 80 videos of cholecystectomy surgeries performed by 13 surgeons. All videos from the Cholec80 dataset were labeled as appropriate, since videos of surgery are usual in some specific contexts (e.g. the educational context).

We split the dataset into 90% for training and 10% for testing. Then, we perform a 20-fold cross-validation and use the F1-score, recall, and precision metrics for model evaluation. In Subsection 4.1, we present the results of the model in both the training and test steps. Next, in Subsection 4.2, we present a usage scenario to attest the applicability of our trained model.
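The evaluation protocol can be reproduced roughly as follows. This is a sketch: scikit-learn, the stratified splits, and the fixed random seed are assumptions, and the synthetic arrays stand in for the real 1152-dimensional video embeddings and their labels.

```python
# Sketch of the evaluation protocol: 90/10 train/test split, then 20-fold
# cross-validation scored with F1, recall, and precision.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate, train_test_split
from sklearn.svm import SVC

# Synthetic stand-ins for the real video embeddings and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 1152))
y = rng.integers(0, 2, size=400)          # 0 = appropriate, 1 = inappropriate

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=42)

cv = StratifiedKFold(n_splits=20, shuffle=True, random_state=42)
scores = cross_validate(SVC(), X_train, y_train, cv=cv,
                        scoring=("f1", "recall", "precision"))
for metric in ("test_f1", "test_recall", "test_precision"):
    print(metric, scores[metric].mean(), "+/-", scores[metric].std())
```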
4.1 Model Results

Table 1 shows the performance of our model in the 20-fold cross-validation. One can observe that our approach achieved good results in all metrics, with a small standard deviation. In the test step, our model achieved similar results, as can be seen in Table 2.

Table 1: F1-score, Precision, Recall, and Support (mean ± standard deviation over the 20 folds) for the Appropriate and Inappropriate classes.

Table 2: F1-score, Precision, Recall, and Support on the test set for the Appropriate and Inappropriate classes.

In our application, a higher inappropriate F1-score is more desirable than a higher appropriate F1-score.
4.2 Usage Scenario

We propose a usage scenario in which a video file needs to be verified for detecting and censoring any inappropriate scenes in it. In this scenario, one is presented with an hour-long video of an interview. To make sure the entire video is safe for uploading to social media, one chooses to use our tool for automatically detecting and censoring any inappropriate scenes it may contain.

To accomplish this task, our tool first splits the video file into 5-second segments. Then, it uses our classification model to label each segment as either appropriate or inappropriate. Next, it performs the censorship by applying a Gaussian filter to each frame of the segments predicted as inappropriate and removing their audio. Finally, all resulting segments of the video are merged, generating a new video in which the inappropriate scenes are censored. Figure 5 illustrates how the video is split, labeled, censored, and merged, resulting in a censored video and accomplishing the proposed task.

Figure 5: Censorship tool use example.
5 Final Remarks

In this work, we presented a CNN-based model for detecting inappropriate scenes in video files. Our approach achieved high performance, with an F1-score of 98.95% for appropriate videos and 98.94% for inappropriate videos. To attest to the applicability of our proposal, we created a usage scenario for our model, while also creating a tool for automatically censoring inappropriate scenes in video files.

It is important to point out that the tool we proposed only works with video files that are already completely available. To extend our tool to work with online content (e.g. video streams and live broadcasts), the content would first have to be buffered into video segments, processed (which takes some time), and then made available.
References

[1] Kwang Ho Song and Yoo-Sung Kim. Pornographic video detection scheme using multimodal features. Journal of Engineering and Applied Sciences, 13(5):1174–1182, 2018.

[2] Manuel Torres Castro. Automatic flagging of offensive video content using deep learning. Master's thesis, Universitat Politècnica de Catalunya, 2018.

[3] Jônatas Wehrmann, Gabriel S. Simões, Rodrigo C. Barros, and Victor F. Cavalcante. Adult content detection in videos with convolutional and recurrent neural networks. Neurocomputing, 272:432–438, 2018.

[4] Kamrun Nahar Tofa, Farhana Ahmed, Arif Shakil, et al. Inappropriate scene detection in a video stream. PhD thesis, BRAC University, 2017.

[5] Chuanqi Tan, Fuchun Sun, Tao Kong, Wenchang Zhang, Chao Yang, and Chunfang Liu. A survey on deep transfer learning. In International Conference on Artificial Neural Networks, pages 270–279. Springer, 2018.

[6] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. YouTube-8M: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675, 2016.

[7] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[8] Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, et al. CNN architectures for large-scale audio classification. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 131–135. IEEE, 2017.

[9] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

[10] Andru P. Twinanda, Sherif Shehata, Didier Mutter, Jacques Marescaux, Michel de Mathelin, and Nicolas Padoy. EndoNet: A deep architecture for recognition tasks on laparoscopic videos. IEEE Transactions on Medical Imaging, 36(1):86–97, 2017.