CNN-Based Prediction of Frame-Level Shot Importance for Video Summarization
Mohaiminul Al Nahian, A. S. M. Iftekhar, Mohammad Tariqul Islam, S. M. Mahbubur Rahman, Dimitrios Hatzinakos
Mohaiminul Al Nahian∗, A. S. M. Iftekhar∗, Mohammad Tariqul Islam∗, S. M. Mahbubur Rahman†, and Dimitrios Hatzinakos‡
∗Department of EEE, Bangladesh University of Engineering and Technology, Dhaka 1205, Bangladesh
†Department of EEE, University of Liberal Arts Bangladesh, Dhaka 1209, Bangladesh
‡Department of ECE, University of Toronto, Toronto, ON, Canada, M5S 2E4
Email: [email protected], [email protected], [email protected]@ulab.edu.bd, [email protected]
Abstract—On the Internet, the ubiquitous presence of redundant, unedited, raw videos has made video summarization an important problem. Traditional methods of video summarization employ a heuristic set of hand-crafted features, which in many cases fail to capture the subtle abstraction of a scene. This paper presents a deep learning method that maps the context of a video to the importance of a scene, similar to that perceived by humans. In particular, a convolutional neural network (CNN)-based architecture is proposed to mimic the frame-level shot importance for user-oriented video summarization. The weights and biases of the CNN are trained extensively through off-line processing, so that it can provide the importance of a frame of an unseen video almost instantaneously. Experiments on estimating the shot importance are carried out using the publicly available database TVSum50. It is shown that the performance of the proposed network is substantially better than that of commonly referred feature-based methods for estimating the shot importance in terms of mean absolute error, absolute error variance, and relative F-measure.
1. Introduction
With the development of comfortable and user-friendly devices for capturing and storing multimedia content, a huge amount of video is being shot at every moment. Nearly 60 hours worth of footage is uploaded to YouTube every minute [1]. Finding and analyzing this huge amount of video has become an extremely tedious task. The generation of a compact, comprehensive, and automated summary of a video can facilitate an effective way to utilize videos for various real-life applications, such as classifying a huge number of online videos, removing redundant videos, or highlighting sports matches or trailers of feature films. Also, a semantically relevant position can be located using video summaries, which can be essential for surveillance systems [2]. Fig. 1 shows an illustrative example of a summary of a video titled "Reuben Sandwich with Corned Beef & Sauerkraut" available on YouTube. A number of frames of the video are grouped together based on the noticeable contexts as a summary of the video. It is evident from Fig. 1 that a context-dependent
[Figure 1 panels: "Uniform Sampling of Video Frames" and "Context-Based Summary of Video Frames"]
Figure 1. An illustrative example showing the effectiveness of context-based summarization instead of uniform sampling of frames to review a video.

video summary can have a better representation as compared to uniform sampling of video frames.
In general, there are three major approaches to summarize videos [3]: object-based, event-based, and feature-based methods. The object-based approach mainly depends on detecting the highlighted objects. The underlying assumption is that these objects are the key elements for a summary. In other words, the frames in which these objects are found can be considered as the important frames to be presented in the summary [4]. Lee et al. [5] used such object-based detection of key frames to summarize egocentric videos. Though this approach is effective for certain types of videos, its success largely depends on the content of the videos. If a highlighted object is not present throughout the entire video, or the highlighted object is present in every frame of a video, then object-based detection methods will not be able to summarize the video effectively.

In the event-based methods, an important event is determined by the use of previously defined bags-of-words. The events can be detected by the change in various low-level factors, e.g., a change in colors or an abrupt change in camera direction. These methods are used in many works in the literature when the goal as well as the environment of summarization is very specific, e.g., surveillance videos [6], sports videos [7], or coastal-area videos [8]. This approach fails to represent the overall generality, as similar events can have contrasting significance in different environments. For example, in a video of a football match, the scene of scoring a goal is considered important, but a similar event in a surveillance video can be useless.

The most popular methods for video summarization are based on suitable features. In this approach, certain features are used to detect important frames termed key frames. In most cases, a large number of features are combined together to detect important frames. These features are selected by judging the content of the videos.
Different types of features, including visual attention [9] and singular value decomposition (SVD) [10], have been used for key-frame detection. Recently, machine learning techniques have been introduced to select suitable features [11]. However, the success of such methods seriously depends on the number of selected features and the way the features are combined. Hence, the methods fail to map individual perception in a generalized framework.

Most of the existing methods for video summarization focus on detecting key frames based on some sort of fixed parameters. This type of parameter-based detection is not suitable for an overall general platform of video summarization. In this work, a convolutional neural network (CNN)-based architecture is proposed to deal with the overall generality of the problem and to estimate the importance of each frame in a video. This can be used to develop a platform in which a user has the freedom to select the length of the summary as applicable. To the best of our knowledge, estimating the frame-level shot importance using a CNN is not present in the current literature.
The main objective of the paper is to present a CNN model to estimate the shot-by-shot importance in a video. The overall contributions of the paper are:
• Developing a CNN-based algorithm to estimate frame-level shot importance.
• Generating a platform for the summarization of any kind of video using the estimated frame-level shot importance.
The rest of the paper is organized as follows. Section 2 provides a description of the proposed architecture. The experimental setup and the results obtained are described in Section 3. Finally, Section 4 provides the conclusion.
2. Proposed Method
In this paper, a feed-forward CNN is employed to determine the frame-level shot importance of a video. In the proposed multilayer CNN, the first layer is the input layer, which uses the raw video frame X as the input. The last layer is the output layer that predicts the importance score y (0 ≤ y ≤ L, y ∈ R) of the input frame in that particular video, where L is a positive integer corresponding to the highest score. A low value of y indicates a less important frame, while a high value implies an important one. In this section, first the proposed CNN model is described. Then, the training and optimization schemes are detailed.

In order to estimate the shot importance of a frame, we train an end-to-end CNN that automatically learns visual contexts to predict the score in the output. The proposed CNN architecture is a six-stage model employing learnable convolution and fully connected layers, as shown in the stick diagram in Fig. 2. The convolution and fully connected operations are followed by ReLU activation for its ability to help neural networks attain a better sparse representation [12]. The first stage of the network performs the pre-processing tasks needed to normalize the dimension of the data. The pre-processing stage can be written as

X_1 = preprocess(X)   (1)

This task involves frame resizing and cropping, which are applied sequentially. In the stick diagram of Fig. 2, the frame resizing is shown using a rectangle with a single stripe, and the cropping operation is shown by a diverging trapezoid. The second stage performs a convolution operation followed by ReLU activation on X_1, which is given by

X_2 = max(0, W_2 ∗ X_1 + b_2)   (2)

where ∗ is the convolution operation and max(0, ·) is the ReLU operation. In Fig. 2, the convolution layer is shown as a rectangle and the ReLU layer as a solid line. The third and fourth stages use the convolution, ReLU, and max-pooling operations serially.
These operations are given by the equations

X_3 = MP(max(0, W_3 ∗ X_2 + b_3))   (3)
X_4 = MP(max(0, W_4 ∗ X_3 + b_4))   (4)

where MP(·) is the max-pool operation. This operation reduces the spatial dimension by half and is represented by a converging trapezoid (see Fig. 2). The fifth stage consists of the fully connected operation, the ReLU, and dropout layers. First, the output of the fourth stage X_4 is flattened to a 1-D vector X̂_4, and then this vector is fed into the fifth stage to provide the output X_r given by

X_r = Drop(max(0, W_5^T X̂_4 + b_5))   (5)

where Drop(·) is the dropout operation [13]. In Fig. 2, the fully connected layer and the dropout layer are represented by a rectangle with three stripes in the middle and a parallelogram, respectively. The final part of the CNN is the regressor
Figure 2. Proposed CNN model for predicting the frame-level shot importance of a video. The input to the model is raw video frames, and the output is the score of importance.

which is a fully connected layer that outputs the estimation of the frame importance as a scalar from X_r, given by

ŷ = W_r^T X_r + b_r   (6)

In many cases, the frame-level importance can be averaged over a few neighboring frames using a smoothing filter (shown as a rectangle with a diagonal stripe in Fig. 2). Overall, the learnable parameters of the network are the filter sets W_2, W_3, W_4, W_5, and W_r, and their corresponding bias terms b_2, b_3, b_4, b_5, and b_r, respectively.

There are a number of training and optimization schemes that can be chosen for attaining good results from the network. An effective choice of initialization of the weights and biases can significantly reduce training time by making the network converge faster. In this context, we have explored the work of Glorot et al. [14] and initialized all the biases with zeros and the weights W_i at each layer by taking samples from a uniform distribution

W_i ∼ U[−1/√M, 1/√M]

where M (M ∈ Z) is the size of the previous layer. In order to apply back-propagation [15] for training the network, a loss function is required that is easily differentiable. For regression-based tasks such as the estimation of scores, the most common choices are the ℓ1-norm, ℓ2-norm, or Frobenius norm. In the proposed method, we choose an ℓ1-norm-based loss function given by

C = Σ_{n=1}^{N} ||y_n − ŷ_n||_1   (7)

where y_n is the ground-truth value of the shot importance, ŷ_n is the predicted score, and N (N ∈ Z) is the number of training inputs fed into the back-propagation process in each iteration for mini-batch optimization [16].
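The stage-wise operations and the training objective described above can be sketched with minimal NumPy building blocks. This is an illustrative sketch, not the authors' implementation: the function names are ours, and the pooling operates on a 2-D (single-channel) map for simplicity.

```python
import numpy as np

def relu(x):
    """The max(0, .) activation applied after every convolution and FC stage."""
    return np.maximum(0.0, x)

def max_pool(x, k=2):
    """Non-overlapping k x k max-pooling on a 2-D map, halving each spatial
    dimension for k = 2 as in the MP(.) operation of Eqs. (3)-(4).
    Assumes both dimensions are divisible by k."""
    h, w = x.shape
    return x.reshape(h // k, k, w // k, k).max(axis=(1, 3))

def glorot_uniform(fan_in, shape, rng=None):
    """Draw weights from U[-1/sqrt(M), 1/sqrt(M)], where M (fan_in) is the
    size of the previous layer, following the initialization described above."""
    rng = np.random.default_rng() if rng is None else rng
    limit = 1.0 / np.sqrt(fan_in)
    return rng.uniform(-limit, limit, size=shape)

def l1_loss(y_true, y_pred):
    """l1-norm mini-batch loss of Eq. (7) over N scalar importance scores."""
    return float(np.sum(np.abs(np.asarray(y_true) - np.asarray(y_pred))))
```

A 4×4 map pooled with k = 2 yields the 2×2 block maxima, and the sampled weights stay within ±1/√M, matching the distribution above.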
During the training period, this function is optimized by using the contemporary Adam stochastic optimization technique [17]. The weights of the filter sets, denoted by w, are updated based on the first moment m̂ and second moment v̂ of the gradient of the loss function C with respect to the weights. Overall, the update process of the optimization can be written as [17]

m̂(t) = β_1 m̂(t−1) + (1 − β_1) ∂C(t)/∂w(t)   (8)
v̂(t) = β_2 v̂(t−1) + (1 − β_2) (∂C(t)/∂w(t))²   (9)
w(t) = w(t−1) − α m̂(t) / (√v̂(t) + ε)   (10)

where α (α > 0) is the step size, β_1 and β_2 (β_1, β_2 > 0) are the decay rates for the first and second moments, respectively, and ε (ε > 0) is a factor of numerical stability.
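The update of Eqs. (8)–(10) can be sketched as a single NumPy step. Note that the bias-correction terms come from the original Adam paper [17] rather than the equations above, and the default hyperparameters shown are the typical Adam defaults, not values reported by the authors.

```python
import numpy as np

def adam_step(w, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the gradient (first
    moment) and squared gradient (second moment), bias correction, then the
    step w <- w - alpha * m_hat / (sqrt(v_hat) + eps)."""
    m = beta1 * m + (1 - beta1) * grad          # Eq. (8)
    v = beta2 * v + (1 - beta2) * grad ** 2     # Eq. (9)
    m_hat = m / (1 - beta1 ** t)                # bias correction [17]
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)   # Eq. (10)
    return w, m, v
```

Iterating this step on a simple convex loss such as (w − 3)² drives w toward the minimizer, which is a quick sanity check of the update direction.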
3. Experiments and Results
Experiments are carried out to evaluate the performance of the proposed CNN architecture as compared to existing methods for predicting the score of frame importance in videos. In this section, first we give an overview of the video dataset used in the experiments; then we describe our training and testing data partitions, data augmentation techniques, parameter settings of the proposed architecture, and the scheme for matching the estimated scores of importance with the ground truth. Then, the methods compared for performance evaluation are introduced. Finally, the results are presented and evaluated in terms of commonly referred performance metrics of regression.
In the experiments, we have used the TVSum50 database [18], which includes 50 video sequences. These videos are categorized into ten different genres, including flash mob, news, and video blog. Each genre contains five videos of independent scenes. The duration of the videos varies from 2 to 10 minutes. Each frame of these videos has been annotated with an importance score of continuous values ranging from 1 to 5 by using crowd-sourcing. It is found empirically that a shot length of two seconds is able to reflect the local context of a video [18]. By adopting this rule, each video is divided into segments, where each segment has a duration of two seconds. These segments are first annotated by 20 users. A ground truth of importance scores has been produced by regularizing and combining these annotated scores.
Out of the 50 videos of the dataset, 35 videos are chosen for training and the mutually exclusive rest of the 15 videos are kept for the testing phase. In order to design a fair evaluation process, at least three videos for the training set and one video for the testing set are included from each of the ten genres. In order to achieve computational efficiency and to reduce the training period, a subset of frames from the videos is considered for learning. In particular, a single frame from each strip of five consecutive frames is considered for the training scheme. This is mainly due to the fact that the visual contents of five consecutive frames are almost the same in a video. This ensures that the training data has a smaller amount of redundant information, and thus the approach significantly reduces the training period. On the other hand, no frames are discarded from the test set; instead, the importance score of every frame of a video is predicted.
Data augmentation helps to achieve generalized results in CNN-based learning [19]. It reduces overfitting by virtually increasing the training data size. In general, a larger network can be trained by augmenting a dataset without losing validation accuracy. This scheme has been adopted in our experiments. The augmentation techniques used in the training include the transpose, horizontal flip, and vertical flip of the frames. One or more of these operations are chosen randomly in each stage of the training step. In other words, seven new variants of the original data are obtained, and our training set virtually increases by up to 8 times. During each iteration, a random integer is generated between 1 and 8 inclusive that corresponds to a specific combination of data augmentation techniques. Based on the generated integer, the selected operations are performed on the data prior to feeding it to the following stage.
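The random selection of one of the eight variants can be sketched as follows. The bit-wise mapping of the drawn integer to a combination of operations is our assumption (any fixed one-to-one mapping would do), and the sketch operates on 2-D single-channel square frames for simplicity.

```python
import numpy as np

def random_augment(frame, rng=None):
    """Randomly apply a combination of transpose, horizontal flip, and
    vertical flip, producing one of 8 possible variants of the input frame
    (the identity included). The combination is chosen by an integer drawn
    uniformly from [1, 8], interpreted bit-wise (an assumed mapping)."""
    rng = np.random.default_rng() if rng is None else rng
    choice = rng.integers(1, 9)          # random integer in [1, 8]
    if choice & 1:
        frame = frame.T                  # transpose
    if choice & 2:
        frame = frame[:, ::-1]           # horizontal flip
    if choice & 4:
        frame = frame[::-1, :]           # vertical flip
    return frame                         # choice == 8 leaves the frame as-is
```

Over many draws on an asymmetric frame, all eight variants appear, which is consistent with the 8-fold virtual increase of the training set described above.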
The network parameters of the CNN model described in Section 2 are chosen based on the dimensions of the input and the required output in different layers. Since the size of the input video frames varies among different videos, first the video frames are resized and then cropped centrally to obtain fixed-size images of three channels corresponding to the RGB components of a color image. The numbers of filters in the sets W_2, W_3, W_4, W_5, and W_r and the corresponding numbers of bias terms b_2, b_3, b_4, b_5, and b_r are set such that the choice provides an overall good performance. The kernel size of the convolution filters and that of the max-pool operation are fixed across layers. The dropout parameter is chosen empirically during training, and no units are dropped during testing. The parameters α, β_1, and β_2 of the Adam optimizer are found empirically, and the numerical stability factor ε is set to a small constant.

A single value has been assigned as the shot importance for 50 neighboring frames in the ground truth. Since the proposed model predicts the shot importance for each of the frames in a video, a scheme for matching the importance has been employed in order to be consistent with the ground truth of the dataset. In particular, first the predicted output values for 50 consecutive frames are considered, then the minimum and maximum of the predictions are discarded, and finally the root mean squared (RMS) value of the remaining data is assigned as the fixed-level shot importance for the 50 neighboring frames.
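The grouping scheme can be sketched as below. The number of extreme values discarded on each side is not fully specified in the text, so the sketch assumes a single minimum and a single maximum are dropped.

```python
import numpy as np

def shot_score(frame_scores, drop=1):
    """Aggregate per-frame predictions for one 50-frame shot: sort the
    predictions, discard the `drop` smallest and `drop` largest values
    (an assumed count), and assign the RMS of the remaining values as the
    fixed-level shot importance."""
    s = np.sort(np.asarray(frame_scores, dtype=float))
    trimmed = s[drop:len(s) - drop]
    return float(np.sqrt(np.mean(trimmed ** 2)))
```

For example, the scores [1, 2, 2, 2, 3] reduce to [2, 2, 2] after trimming, whose RMS is 2.0, so a single outlier at either end does not perturb the shot-level value.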
The proposed CNN is a learning-based method, where the importance of frames is predicted automatically by the network. In the experiments, we select three feature-based approaches reported for video summarization. Originally, the methods are concerned with the selection of key frames. The methods are briefly described as follows:
• Visual attention [9]: In this method, the visual attention extracted from spatial and temporal saliency is used to extract key frames from a video.
• Motion attention [20]: The video features extracted from motions are employed for video summarization.
• Singular value decomposition (SVD) [10]: The minimization of the cross-correlation of the features extracted in terms of the SVD of frames is used to identify the key frames for video summarization.
To compare these methods with the proposed one, they are invoked to predict the shot importance for each of the frames of a video. In particular, the features are used in a support vector regression technique to predict the frame-level shot importance using the same training and testing sets described in Section 3.2.
The performance of the proposed CNN-based method and the three comparing methods is evaluated in terms of three metrics, namely, mean absolute error (MAE), absolute error variance (AEV), and relative F-measure. The MAE indicates how much the predicted values deviate from the ground truth on average, and the AEV reveals the fluctuations of the absolute errors. Thus, a lower value of MAE means
Figure 3. Frame-level scores of shot importance predicted by using the experimental methods. The predicted scores are compared with the ground truth. The comparisons are shown for (a) the motion attention-based method, (b) the visual attention-based method, (c) the SVD-based method, and (d) the proposed CNN-based method.

TABLE 1. Performance of prediction of shot importance in terms of MAE, AEV, and relative F-measure
Methods                 MAE    AEV    Relative F-measure
Motion Attention [20]   .      .      .
Visual Attention [9]    .      .      .
SVD [10]                .      .      .
Proposed CNN            .      .      .

that the predicted value is very close to the actual one. Similarly, a small AEV is a good sign, implying that the errors do not fluctuate significantly.

The F-measure gives an idea about the close matching between the video summary prepared by the predicted shot importance and that by the ground truth. In order to compute the F-measure, a threshold is selected for each of the comparing methods as well as for the ground truth. The threshold maps the continuous values of frame importance into binary values denoting the selected and non-selected frames for a summary, preferably with a length of a fraction of the original video. The metric F-measure is given by

F-measure = (2 × Precision × Recall) / (Precision + Recall)   (11)

where
Precision is the fraction of matched frames with respect to the ground truth, and Recall implies the fraction of matched frames with respect to the total number of frames. To find out how well the proposed CNN-based method performs as compared to others, the relative F-measure is evaluated by normalizing the metric with the same measure calculated from the annotated ground truths of the fifteen test videos.

In the experiments, the shot importance of all the frames of the test videos is predicted using the proposed as well as the three comparing methods. Then, the importance values are grouped for local neighborhoods of 50 frames as described in Section 3.5. Table 1 shows the overall prediction performance on the testing videos in terms of the metrics MAE, AEV, and relative F-measure. It is seen from the table that the proposed CNN-based method performs the best by providing the lowest MAE, showing an improvement over the most competitive method reported in [10], which uses the SVD of frames as features. The proposed method also outperforms the comparing methods in robustness by providing the lowest AEV. It can further be found from Table 1 that our method provides the highest relative F-measure as compared to the others. In other words, our proposed method performs significantly better than the others for predicting the shot importance. This is evident because the method consistently provides low absolute errors throughout the entire frames of a video and thus results in a video summarization close to the ground truth.

Fig. 3 shows the frame-level scores of shot importance predicted for the first two thousand frames of a test video of the flash-mob genre titled "ICC World Twenty20 Bangladesh 2014 Flash Mob Pabna University of Science & Technology (PUST)". This video was shot by a group of Bangladeshi students as a promotional video for the 2014 ICC World Twenty20 event. It is seen from Fig. 3 that the predicted scores of importance provided by the proposed CNN-based method tend to follow the ground truth more closely than those provided by the three comparing methods. The motion-based method [20] shows sudden changes in the scores of importance, which appear even in the opposite direction to the trend of the ground truth. Though the visual attention-based [9] and SVD-based [10] methods follow the trend of the ground truth closely in a few regions, the deviations are significant in most of the regions. Evidently, the above two limitations are nearly absent in the prediction scores of the proposed method, and hence, the CNN-based prediction appears to be accurate and robust.
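The thresholded F-measure of Eq. (11) can be sketched as below. The summary-length fraction used to set the threshold is an assumed illustrative value (the paper's exact fraction is not stated here), and the precision/recall definitions follow the text above.

```python
import numpy as np

def f_measure(pred_scores, gt_scores, length_frac=0.15):
    """F-measure between the frames selected by thresholding the predicted
    importance and those selected from the ground truth, with thresholds set
    so that roughly `length_frac` of the frames are kept (assumed fraction)."""
    p = np.asarray(pred_scores, dtype=float)
    g = np.asarray(gt_scores, dtype=float)
    k = max(1, int(length_frac * len(p)))
    sel_p = p >= np.sort(p)[-k]          # frames selected by the prediction
    sel_g = g >= np.sort(g)[-k]          # frames selected by the ground truth
    matched = int(np.sum(sel_p & sel_g))
    precision = matched / max(int(sel_p.sum()), 1)
    recall = matched / max(int(sel_g.sum()), 1)
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)   # Eq. (11)
```

A perfect prediction yields an F-measure of 1, and the relative F-measure of the paper would then normalize this value by the F-measure computed between annotated ground truths.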
4. Conclusion
In this paper, a CNN-based architecture has been proposed to predict the frame-level shot importance of videos. The predicted scores of shot importance can be used for the development of a platform, which can provide a user-oriented automated summary of a video. Thus, our work successfully converts the subjective video summarization into a measurable objective framework. To evaluate the proposed CNN-based method, the annotated importance of ten genres of videos of the TVSum50 database has been used as the ground truth. Experiments have been conducted by adopting mutually exclusive training and testing sets that encompass the available genres of the dataset. The proposed method has been compared with the methods based on visual attention, motion attention, and SVD features. Experimental results reveal that the proposed CNN-based method outperforms the existing feature-based methods in terms of three evaluation metrics, namely, MAE, AEV, and relative F-measure.

References

[1] S. Brain, "YouTube statistics," 2014.
[2] A. G. Money and H. Agius, "Video summarisation: A conceptual framework and survey of the state of the art," J. Visual Communication and Image Representation, vol. 19, no. 2, pp. 121–143, 2008.
[3] W. Ding and G. Marchionini, "A study on video browsing strategies," University of Maryland at College Park, College Park, MD, Tech. Rep. UMIACS-TR-97-40, 1998.
[4] J. Meng, H. Wang, J. Yuan, and Y.-P. Tan, "From keyframes to key objects: Video summarization by representative object proposal selection," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Las Vegas, NV, 2016, pp. 1039–1048.
[5] Y. J. Lee, J. Ghosh, and K. Grauman, "Discovering important people and objects for egocentric video summarization," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Providence, RI, 2012, pp. 1346–1353.
[6] W. Lin, Y. Zhang, J. Lu, B. Zhou, J. Wang, and Y. Zhou, "Summarizing surveillance videos with local-patch learning-based abnormality detection, blob sequence optimization, and type-based synopsis," Neurocomputing, vol. 155, pp. 84–98, 2015.
[7] B. Li and M. I. Sezan, "Event detection and summarization in sports video," in Proc. IEEE Work. Content-Based Access of Image and Video Libraries, Kauai, HI, 2001, pp. 132–138.
[8] D. Cullen, J. Konrad, and T. D. Little, "Detection and summarization of salient events in coastal environments," in IEEE Int. Conf. Advanced Video and Signal-Based Surveillance, Beijing, China, 2012, pp. 7–12.
[9] N. Ejaz, I. Mehmood, and S. W. Baik, "Efficient visual attention based framework for extracting key frames from videos," Signal Processing: Image Communication, vol. 28, no. 1, pp. 34–44, 2013.
[10] K. S. Ntalianis and S. D. Kollias, "An optimized key-frames extraction scheme based on SVD and correlation minimization," in IEEE Int. Conf. Multimedia and Expo., Amsterdam, The Netherlands, 2005, pp. 792–795.
[11] S. Lu, Z. Wang, T. Mei, G. Guan, and D. D. Feng, "A bag-of-importance model with locality-constrained coding based feature learning for video summarization," IEEE Trans. Multimedia, vol. 16, no. 6, pp. 1497–1509, 2014.
[12] X. Glorot, A. Bordes, and Y. Bengio, "Deep sparse rectifier neural networks," in Proc. Int. Conf. Artificial Intelligence and Statistics, vol. 15, Fort Lauderdale, FL, 2011, pp. 315–323.
[13] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," J. Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[14] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proc. Int. Conf. Artificial Intelligence and Statistics, vol. 9, Sardinia, Italy, 2010, pp. 249–256.
[15] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, pp. 533–536, 1986.
[16] M. Li, T. Zhang, Y. Chen, and A. J. Smola, "Efficient mini-batch training for stochastic optimization," in Proc. ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, New York, NY, 2014, pp. 661–670.
[17] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[18] Y. Song, J. Vallmitjana, A. Stent, and A. Jaimes, "TVSum: Summarizing web videos using titles," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Boston, MA, 2015, pp. 5179–5187.
[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Proc. Int. Conf. Neural Information Processing Systems, Lake Tahoe, NV, 2012, pp. 1097–1105.
[20] Y.-F. Ma, L. Lu, H.-J. Zhang, and M. Li, "A user attention model for video summarization," in