CNN-Based Prediction of Frame-Level Shot Importance for Video Summarization
Mohaiminul Al Nahian, A. S. M. Iftekhar, Mohammad Tariqul Islam, S. M. Mahbubur Rahman, Dimitrios Hatzinakos
Mohaiminul Al Nahian∗, A. S. M. Iftekhar∗, Mohammad Tariqul Islam∗, S. M. Mahbubur Rahman†, and Dimitrios Hatzinakos‡
∗Department of EEE, Bangladesh University of Engineering and Technology, Dhaka 1205, Bangladesh
†Department of EEE, University of Liberal Arts Bangladesh, Dhaka 1209, Bangladesh
‡Department of ECE, University of Toronto, Toronto, ON, Canada, M5S 2E4
Email: [email protected], [email protected], [email protected]@ulab.edu.bd, [email protected]
Abstract—On the Internet, the ubiquitous presence of redundant, unedited, raw videos has made video summarization an important problem. Traditional methods of video summarization employ a heuristic set of hand-crafted features, which in many cases fail to capture the subtle abstraction of a scene. This paper presents a deep learning method that maps the context of a video to the importance of a scene, similar to that perceived by humans. In particular, a convolutional neural network (CNN)-based architecture is proposed to mimic the frame-level shot importance for user-oriented video summarization. The weights and biases of the CNN are trained extensively through off-line processing, so that it can provide the importance of a frame of an unseen video almost instantaneously. Experiments on estimating the shot importance are carried out using the publicly available database TVSum50. It is shown that the performance of the proposed network is substantially better than that of commonly referred feature-based methods for estimating the shot importance in terms of mean absolute error, absolute error variance, and relative F-measure.
1. Introduction
With the development of comfortable and user-friendly devices for capturing and storing multimedia content, a huge amount of video is being shot at every moment. Nearly 60 hours worth of footage is uploaded to YouTube every minute [1]. Finding and analyzing this huge amount of video has become an extremely tedious task. The generation of a compact, comprehensive, and automated summary of a video can facilitate an effective way to utilize videos for various real-life applications, such as classifying a huge number of online videos, removing redundant videos, or highlighting sports matches or trailers of feature films. Also, a semantically relevant position can be located using video summaries, which can be essential for surveillance systems [2]. Fig. 1 shows an illustrative example of a summary of a video titled "Reuben Sandwich with Corned Beef & Sauerkraut" available on YouTube. A number of frames of the video are grouped together based on the noticeable contexts as a summary of the video. It is evident from Fig. 1 that a context-dependent
[Figure 1 panels: "Uniform Sampling of Video Frames" and "Context-Based Summary of Video Frames"]
Figure 1. An illustrative example showing the effectiveness of context-based summarization instead of uniform sampling of frames to review a video.

video summary can have a better representation as compared to uniform sampling of video frames.
In general, there are three major approaches to summarize videos [3]: object-based, event-based, and feature-based methods. The object-based approach mainly depends on detecting the highlighted objects. The underlying assumption is that these objects are the key elements for a summary. In other words, the frames in which these objects are found can be considered as the important frames to be presented in the summary [4]. Lee et al. [5] used such object-based detection of key frames to summarize egocentric videos. Though this approach is effective for certain types of videos, its success largely depends on the content of the videos. If a highlighted object is not present throughout the entire video, or the highlighted object is present in every frame of a video, then object-based detection methods will not be able to summarize the video effectively.

In the event-based methods, an important event is determined by the use of previously defined bags-of-words. The events can be detected by the change in various low-level factors, e.g., a change in colors or an abrupt change in camera direction. These methods are used in many works in the literature when the goal as well as the environment of summarization is very specific, e.g., surveillance videos [6], sports videos [7], or coastal-area videos [8]. This approach fails to represent the overall generality, as similar events can have contrasting significance in different environments. For example, in a video of a football match, the scene of scoring a goal is considered important, but a similar event in a surveillance video can be useless.

The most popular methods for video summarization are based on suitable features. In this approach, certain features are used to detect important frames termed key frames. In most cases, a large number of features are combined together to detect important frames. These features are selected by judging the content of the videos.
Different types of features, including visual attention [9] and singular value decomposition (SVD) [10], have been used for key-frame detection. Recently, machine learning techniques have been introduced to select suitable features [11]. However, the success of such methods seriously depends on the number of selected features and the way the features are combined. Hence, the methods fail to map individual perception in a generalized framework.

Most of the existing methods for video summarization focus on detecting key frames based on some sort of fixed parameters. This type of parameter-based detection is not suitable for an overall general platform of video summarization. In this work, a convolutional neural network (CNN)-based architecture is proposed to deal with the overall generality of the problem and to estimate the importance of each frame in a video. This can be used to develop a platform in which a user has the freedom to select the length of the summary as applicable. To the best of our knowledge, estimating the frame-level shot importance using a CNN is not present in the current literature.
The main objective of the paper is to present a CNN model to estimate the shot-by-shot importance in a video. The overall contributions of the paper are:
• Developing a CNN-based algorithm to estimate frame-level shot importance.
• Generating a platform for the summarization of any kind of video using the estimated frame-level shot importance.
The rest of the paper is organized as follows. Section 2 provides a description of the proposed architecture. The experimental setup and the results obtained are described in Section 3. Finally, Section 4 provides the conclusion.
2. Proposed Method
In this paper, a feed-forward CNN is employed to determine the frame-level shot importance of a video. In the proposed multilayer CNN, the first layer is the input layer, which uses the raw video frame X as the input. The last layer is the output layer that predicts the importance score y (0 ≤ y ≤ L, y ∈ R) of the input frame in that particular video, where L is a positive integer corresponding to the highest score. A low value of y indicates a less important frame, while a high value implies an important one. In this section, first the proposed CNN model is described. Then, the training and optimization schemes are detailed.

In order to estimate the shot importance of a frame, we train an end-to-end CNN that automatically learns visual contexts to predict the score in the output. The proposed CNN architecture is a six-stage model employing learnable convolution and fully connected layers, as shown in the stick diagram in Fig. 2. The convolution and fully connected operations are followed by ReLU activation for its ability to help neural networks attain a better sparse representation [12]. The first stage of the network performs the pre-processing tasks needed to normalize the dimension of the data. The pre-processing stage can be written as

X_1 = preprocess(X)   (1)

This task involves frame resizing and cropping, which are applied sequentially. In the stick diagram of Fig. 2, the frame resizing is shown using a rectangle with a single stripe, and the cropping operation is shown by a diverging trapezoid. The second stage performs a convolution operation followed by ReLU activation on X_1, which is given by

X_2 = max(0, W_2 ∗ X_1 + b_2)   (2)

where ∗ is the convolution operation and max(0, ·) is the ReLU operation. In Fig. 2, the convolution layer is shown as a rectangle and the ReLU layer as a solid line. The third and fourth stages use the convolution, ReLU, and max-pooling operations serially.
These operations are given by the equations

X_3 = MP(max(0, W_3 ∗ X_2 + b_3))   (3)
X_4 = MP(max(0, W_4 ∗ X_3 + b_4))   (4)

where MP(·) is the max-pool operation. This operation reduces the spatial dimension by half and is represented by a converging trapezoid (see Fig. 2). The fifth stage consists of the fully connected operation, the ReLU, and dropout layers. First, the output of the fourth stage X_4 is flattened to a 1-D vector X̂_4, and then this vector is fed into the fifth stage to provide the output X_r given by

X_r = Drop(max(0, W_5^T X̂_4 + b_5))   (5)

where Drop(·) is the dropout operation [13]. In Fig. 2, the fully connected layer and the dropout layer are represented by a rectangle with three stripes in the middle and a parallelogram, respectively. The final part of the CNN is the regressor
Figure 2. Proposed CNN model for predicting the frame-level shot importance of a video. The input to the model is raw video frames, and the output is the score of importance.

which is a fully connected layer that outputs the estimation of the frame importance as a scalar from X_r, given by

ŷ = W_r^T X_r + b_r   (6)

In many cases, the frame-level importance can be averaged over a few neighboring frames using a smoothing filter (shown as a rectangle with a diagonal stripe in Fig. 2). Overall, the learnable parameters of the network are the filter sets W_2, W_3, W_4, W_5, and W_r, and their corresponding bias terms b_2, b_3, b_4, b_5, and b_r, respectively.

There are a number of training and optimization schemes that can be chosen for attaining good results from the network. An effective choice of initialization of the weights and biases can significantly reduce training time by making the network converge faster. In this context, we have explored the work of Glorot et al. [14] and initialized all the biases with zeros and the weights W_i at each layer by taking samples from a uniform distribution

W_i ∼ U[−1/√M, 1/√M]

where M (M ∈ Z) is the size of the previous layer. In order to apply back-propagation [15] for training the network, a loss function is required that is easily differentiable. For regression-based tasks such as the estimation of scores, the most common choices are the ℓ1-norm, ℓ2-norm, or Frobenius norm. In the proposed method, we choose an ℓ1-norm-based loss function given by

C = Σ_{n=1}^{N} ||y_n − ŷ_n||_1   (7)

where y_n is the ground-truth value of the shot importance, ŷ_n is the predicted score, and N (N ∈ Z) is the number of training inputs fed into the back-propagation process in each iteration for mini-batch optimization [16].
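The stage-wise operations and the training objective described above can be sketched with minimal NumPy building blocks. This is an illustrative sketch, not the authors' implementation: the function names are ours, and the pooling operates on a 2-D (single-channel) map for simplicity.

```python
import numpy as np

def relu(x):
    """The max(0, .) activation applied after every convolution and FC stage."""
    return np.maximum(0.0, x)

def max_pool(x, k=2):
    """Non-overlapping k x k max-pooling on a 2-D map, halving each spatial
    dimension for k = 2 as in the MP(.) operation of Eqs. (3)-(4).
    Assumes both dimensions are divisible by k."""
    h, w = x.shape
    return x.reshape(h // k, k, w // k, k).max(axis=(1, 3))

def glorot_uniform(fan_in, shape, rng=None):
    """Draw weights from U[-1/sqrt(M), 1/sqrt(M)], where M (fan_in) is the
    size of the previous layer, following the initialization described above."""
    rng = np.random.default_rng() if rng is None else rng
    limit = 1.0 / np.sqrt(fan_in)
    return rng.uniform(-limit, limit, size=shape)

def l1_loss(y_true, y_pred):
    """l1-norm mini-batch loss of Eq. (7) over N scalar importance scores."""
    return float(np.sum(np.abs(np.asarray(y_true) - np.asarray(y_pred))))
```

A 4×4 map pooled with k = 2 yields the 2×2 block maxima, and the sampled weights stay within ±1/√M, matching the distribution above.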
During the training period, this function is optimized by using the contemporary Adam stochastic optimization technique [17]. The weights of the filter sets, denoted by w, are updated based on the first moment m̂ and second moment v̂ of the gradient of the loss function C with respect to the weights. Overall, the update process of the optimization can be written as [17]

m̂(t) = β_1 m̂(t−1) + (1 − β_1) ∂C(t)/∂w(t)   (8)
v̂(t) = β_2 v̂(t−1) + (1 − β_2) (∂C(t)/∂w(t))²   (9)
w(t) = w(t−1) − α m̂(t) / (√v̂(t) + ε)   (10)

where α (α > 0) is the step size, β_1 and β_2 (β_1, β_2 > 0) are the decay rates for the first and second moments, respectively, and ε (ε > 0) is a factor of numerical stability.
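The update of Eqs. (8)–(10) can be sketched as a single NumPy step. Note that the bias-correction terms come from the original Adam paper [17] rather than the equations above, and the default hyperparameters shown are the typical Adam defaults, not values reported by the authors.

```python
import numpy as np

def adam_step(w, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the gradient (first
    moment) and squared gradient (second moment), bias correction, then the
    step w <- w - alpha * m_hat / (sqrt(v_hat) + eps)."""
    m = beta1 * m + (1 - beta1) * grad          # Eq. (8)
    v = beta2 * v + (1 - beta2) * grad ** 2     # Eq. (9)
    m_hat = m / (1 - beta1 ** t)                # bias correction [17]
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)   # Eq. (10)
    return w, m, v
```

Iterating this step on a simple convex loss such as (w − 3)² drives w toward the minimizer, which is a quick sanity check of the update direction.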
3. Experiments and Results
Experiments are carried out to evaluate the performance of the proposed CNN architecture as compared to existing methods for predicting the score of frame importance in videos. In this section, first we give an overview of the video dataset used in the experiments; then we describe our training and testing data partitions, data augmentation techniques, parameter settings of the proposed architecture, and the scheme for matching the estimated scores of importance with the ground truth. Then, the methods compared for performance evaluation are introduced. Finally, the results are presented and evaluated in terms of commonly referred performance metrics of regression.
In the experiments, we have used the TVSum50 database [18], which includes 50 video sequences. These videos are categorized into ten different genres, including flash mob, news, and video blog. Each genre contains five videos of independent scenes. The duration of the videos varies from 2 to 10 minutes. Each frame of these videos has been annotated with an importance score of continuous values ranging from 1 to 5 by using crowd-sourcing. It is found empirically that a shot length of two seconds is able to reflect the local context of a video [18]. By adopting this rule, each video is divided into segments, where each segment has a duration of two seconds. These segments are first annotated by 20 users. A ground truth of importance scores has been produced by regularizing and combining these annotated scores.
Out of the 50 videos of the dataset, 35 videos are chosen for training and the mutually exclusive rest of the 15 videos are kept for the testing phase. In order to design a fair evaluation process, at least three videos for the training set and one video for the testing set are included from each of the ten genres. In order to achieve computational efficiency and to reduce the training period, a subset of frames from the videos is considered for learning. In particular, a single frame from each strip of five consecutive frames is considered for the training scheme. This is mainly due to the fact that the visual contents of five consecutive frames are almost the same in a video. This ensures that the training data has a smaller amount of redundant information, and thus the approach significantly reduces the training period. On the other hand, no frames are discarded from the test set; instead, the importance score of every frame of a video is predicted.
Data augmentation helps to achieve generalized results in CNN-based learning [19]. It reduces overfitting by virtually increasing the training data size. In general, a larger network can be trained by augmenting a dataset without losing validation accuracy. This scheme has been adopted in our experiments. The augmentation techniques used in the training include the transpose, horizontal flip, and vertical flip of the frames. One or more of these operations are chosen randomly in each stage of the training step. In other words, seven new variants of the original data are obtained, and our training set virtually increases by up to 8 times. During each iteration, a random integer is generated between 1 and 8 inclusive that corresponds to a specific combination of data augmentation techniques. Based on the generated integer, the selected operations are performed on the data prior to feeding it to the following stage.
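The random selection of one of the eight variants can be sketched as follows. The bit-wise mapping of the drawn integer to a combination of operations is our assumption (any fixed one-to-one mapping would do), and the sketch operates on 2-D single-channel square frames for simplicity.

```python
import numpy as np

def random_augment(frame, rng=None):
    """Randomly apply a combination of transpose, horizontal flip, and
    vertical flip, producing one of 8 possible variants of the input frame
    (the identity included). The combination is chosen by an integer drawn
    uniformly from [1, 8], interpreted bit-wise (an assumed mapping)."""
    rng = np.random.default_rng() if rng is None else rng
    choice = rng.integers(1, 9)          # random integer in [1, 8]
    if choice & 1:
        frame = frame.T                  # transpose
    if choice & 2:
        frame = frame[:, ::-1]           # horizontal flip
    if choice & 4:
        frame = frame[::-1, :]           # vertical flip
    return frame                         # choice == 8 leaves the frame as-is
```

Over many draws on an asymmetric frame, all eight variants appear, which is consistent with the 8-fold virtual increase of the training set described above.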
The network parameters of the CNN model described in Section 2 are chosen based on the dimensions of the input and the required output in different layers. Since the size of the input video frames varies among different videos, first the video frames are resized and then cropped centrally to obtain fixed-size images of three channels corresponding to the RGB components of a color image. The numbers of filters in the sets W_2, W_3, W_4, W_5, and W_r and the corresponding numbers of bias terms b_2, b_3, b_4, b_5, and b_r are set such that the choice provides an overall good performance. The kernel size of the convolution filters and that of the max-pool operation are fixed across layers. The dropout parameter is chosen empirically during training, and no units are dropped during testing. The parameters α, β_1, and β_2 of the Adam optimizer are found empirically, and the numerical stability factor ε is set to a small constant.

A single value has been assigned as the shot importance for 50 neighboring frames in the ground truth. Since the proposed model predicts the shot importance for each of the frames in a video, a scheme for matching the importance has been employed in order to be consistent with the ground truth of the dataset. In particular, first the predicted output values for 50 consecutive frames are considered, then the minimum and maximum of the predictions are discarded, and finally the root mean squared (RMS) value of the remaining data is assigned as the fixed-level shot importance for the 50 neighboring frames.
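The grouping scheme can be sketched as below. The number of extreme values discarded on each side is not fully specified in the text, so the sketch assumes a single minimum and a single maximum are dropped.

```python
import numpy as np

def shot_score(frame_scores, drop=1):
    """Aggregate per-frame predictions for one 50-frame shot: sort the
    predictions, discard the `drop` smallest and `drop` largest values
    (an assumed count), and assign the RMS of the remaining values as the
    fixed-level shot importance."""
    s = np.sort(np.asarray(frame_scores, dtype=float))
    trimmed = s[drop:len(s) - drop]
    return float(np.sqrt(np.mean(trimmed ** 2)))
```

For example, the scores [1, 2, 2, 2, 3] reduce to [2, 2, 2] after trimming, whose RMS is 2.0, so a single outlier at either end does not perturb the shot-level value.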
The proposed CNN is a learning-based method, where the importance of frames is predicted automatically by the network. In the experiments, we select three feature-based approaches reported for video summarization. Originally, the methods are concerned with the selection of key frames. The methods are briefly described as follows:
• Visual attention [9]: In this method, the visual attention extracted from spatial and temporal saliency is used to extract key frames from a video.
• Motion attention [20]: The video features extracted from motions are employed for video summarization.
• Singular value decomposition (SVD) [10]: The minimization of the cross-correlation of the features extracted in terms of the SVD of frames is used to identify the key frames for video summarization.
To compare these methods with the proposed one, they are invoked to predict the shot importance for each of the frames of a video. In particular, the features are used in a support vector regression technique to predict the frame-level shot importance using the same training and testing sets described in Section 3.2.
The performance of the proposed CNN-based method and the three comparing methods is evaluated in terms of three metrics, namely, mean absolute error (MAE), absolute error variance (AEV), and relative F-measure. The MAE indicates how much the predicted values deviate from the ground truth on average, and the AEV reveals the fluctuations of the absolute errors. Thus, a lower value of MAE means
Figure 3. Frame-level scores of shot importance predicted by using the experimental methods. The predicted scores are compared with the ground truth. The comparisons are shown for (a) the motion attention-based method, (b) the visual attention-based method, (c) the SVD-based method, and (d) the proposed CNN-based method.

TABLE 1. Performance of prediction of shot importance in terms of MAE, AEV, and relative F-measure
Methods                 MAE    AEV    Relative F-measure
Motion Attention [20]   .      .      .
Visual Attention [9]    .      .      .
SVD [10]                .      .      .
Proposed CNN            .      .      .

that the predicted value is very close to the actual one. Similarly, a small AEV is a good sign, implying that the errors do not fluctuate significantly.

The F-measure gives an idea about the close matching between the video summary prepared by the predicted shot importance and that by the ground truth. In order to compute the F-measure, a threshold is selected for each of the comparing methods as well as for the ground truth. The threshold maps the continuous values of frame importance into binary values denoting the selected and non-selected frames for a summary, preferably with a length of a fraction of the original video. The metric F-measure is given by

F-measure = (2 × Precision × Recall) / (Precision + Recall)   (11)

where
Precision is the fraction of matched frames with respect to the ground truth, and Recall implies the fraction of matched frames with respect to the total number of frames. To find out how well the proposed CNN-based method performs as compared to others, the relative F-measure is evaluated by normalizing the metric with the same measure calculated from the annotated ground truths of the fifteen test videos.

In the experiments, the shot importance of all the frames of the test videos is predicted using the proposed as well as the three comparing methods. Then, the importance values are grouped for local neighborhoods of 50 frames as described in Section 3.5. Table 1 shows the overall prediction performance on the testing videos in terms of the metrics MAE, AEV, and relative F-measure. It is seen from the table that the proposed CNN-based method performs the best by providing the lowest MAE, showing an improvement over the most competitive method reported in [10], which uses the SVD of frames as features. The proposed method also outperforms the comparing methods in robustness by providing the lowest AEV. It can further be found from Table 1 that our method provides the highest relative F-measure as compared to the others. In other words, our proposed method performs significantly better than the others for predicting the shot importance. This is evident because the method consistently provides low absolute errors throughout the entire frames of a video and thus results in a video summarization close to the ground truth.

Fig. 3 shows the frame-level scores of shot importance predicted for the first two thousand frames of a test video of the flash-mob genre titled "ICC World Twenty20 Bangladesh 2014 Flash Mob Pabna University of Science & Technology (PUST)". This video was shot by a group of Bangladeshi students as a promotional video for the 2014 ICC World Twenty20 event. It is seen from Fig. 3 that the predicted scores of importance provided by the proposed CNN-based method tend to follow the ground truth more closely than those provided by the three comparing methods. The motion-based method [20] shows sudden changes in the scores of importance, which appear even in the opposite direction to the trend of the ground truth. Though the visual attention-based [9] and SVD-based [10] methods follow the trend of the ground truth closely in a few regions, the deviations are significant in most of the regions. Evidently, the above two limitations are nearly absent in the prediction scores of the proposed method, and hence, the CNN-based prediction appears to be accurate and robust.
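The thresholded F-measure of Eq. (11) can be sketched as below. The summary-length fraction used to set the threshold is an assumed illustrative value (the paper's exact fraction is not stated here), and the precision/recall definitions follow the text above.

```python
import numpy as np

def f_measure(pred_scores, gt_scores, length_frac=0.15):
    """F-measure between the frames selected by thresholding the predicted
    importance and those selected from the ground truth, with thresholds set
    so that roughly `length_frac` of the frames are kept (assumed fraction)."""
    p = np.asarray(pred_scores, dtype=float)
    g = np.asarray(gt_scores, dtype=float)
    k = max(1, int(length_frac * len(p)))
    sel_p = p >= np.sort(p)[-k]          # frames selected by the prediction
    sel_g = g >= np.sort(g)[-k]          # frames selected by the ground truth
    matched = int(np.sum(sel_p & sel_g))
    precision = matched / max(int(sel_p.sum()), 1)
    recall = matched / max(int(sel_g.sum()), 1)
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)   # Eq. (11)
```

A perfect prediction yields an F-measure of 1, and the relative F-measure of the paper would then normalize this value by the F-measure computed between annotated ground truths.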
4. Conclusion
In this paper, a CNN-based architecture has been proposed to predict the frame-level shot importance of videos. The predicted scores of shot importance can be used for the development of a platform, which can provide a user-oriented automated summary of a video. Thus, our work successfully converts the subjective video summarization into a measurable objective framework. To evaluate the proposed CNN-based method, the annotated importance of ten genres of videos of the TVSum50 database has been used as the ground truth. Experiments have been conducted by adopting mutually exclusive training and testing sets that encompass the available genres of the dataset. The proposed method has been compared with the methods based on visual attention, motion attention, and SVD features. Experimental results reveal that the proposed CNN-based method outperforms the existing feature-based methods in terms of three evaluation metrics, namely, MAE, AEV, and relative F-measure.

References

[1] S. Brain, "YouTube statistics," 2014.
[2] A. G. Money and H. Agius, "Video summarisation: A conceptual framework and survey of the state of the art," J. Visual Communication and Image Representation, vol. 19, no. 2, pp. 121–143, 2008.
[3] W. Ding and G. Marchionini, "A study on video browsing strategies," University of Maryland at College Park, College Park, MD, Tech. Rep. UMIACS-TR-97-40, 1998.
[4] J. Meng, H. Wang, J. Yuan, and Y.-P. Tan, "From keyframes to key objects: Video summarization by representative object proposal selection," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Las Vegas, NV, 2016, pp. 1039–1048.
[5] Y. J. Lee, J. Ghosh, and K. Grauman, "Discovering important people and objects for egocentric video summarization," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Providence, RI, 2012, pp. 1346–1353.
[6] W. Lin, Y. Zhang, J. Lu, B. Zhou, J. Wang, and Y. Zhou, "Summarizing surveillance videos with local-patch learning-based abnormality detection, blob sequence optimization, and type-based synopsis," Neurocomputing, vol. 155, pp. 84–98, 2015.
[7] B. Li and M. I. Sezan, "Event detection and summarization in sports video," in Proc. IEEE Work. Content-Based Access of Image and Video Libraries, Kauai, HI, 2001, pp. 132–138.
[8] D. Cullen, J. Konrad, and T. D. Little, "Detection and summarization of salient events in coastal environments," in IEEE Int. Conf. Advanced Video and Signal-Based Surveillance, Beijing, China, 2012, pp. 7–12.
[9] N. Ejaz, I. Mehmood, and S. W. Baik, "Efficient visual attention based framework for extracting key frames from videos," Signal Processing: Image Communication, vol. 28, no. 1, pp. 34–44, 2013.
[10] K. S. Ntalianis and S. D. Kollias, "An optimized key-frames extraction scheme based on SVD and correlation minimization," in IEEE Int. Conf. Multimedia and Expo., Amsterdam, The Netherlands, 2005, pp. 792–795.
[11] S. Lu, Z. Wang, T. Mei, G. Guan, and D. D. Feng, "A bag-of-importance model with locality-constrained coding based feature learning for video summarization," IEEE Trans. Multimedia, vol. 16, no. 6, pp. 1497–1509, 2014.
[12] X. Glorot, A. Bordes, and Y. Bengio, "Deep sparse rectifier neural networks," in Proc. Int. Conf. Artificial Intelligence and Statistics, vol. 15, Fort Lauderdale, FL, 2011, pp. 315–323.
[13] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," J. Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[14] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proc. Int. Conf. Artificial Intelligence and Statistics, vol. 9, Sardinia, Italy, 2010, pp. 249–256.
[15] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, pp. 533–536, 1986.
[16] M. Li, T. Zhang, Y. Chen, and A. J. Smola, "Efficient mini-batch training for stochastic optimization," in Proc. ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, New York, NY, 2014, pp. 661–670.
[17] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[18] Y. Song, J. Vallmitjana, A. Stent, and A. Jaimes, "TVSum: Summarizing web videos using titles," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Boston, MA, 2015, pp. 5179–5187.
[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Proc. Int. Conf. Neural Information Processing Systems, Lake Tahoe, NV, 2012, pp. 1097–1105.
[20] Y.-F. Ma, L. Lu, H.-J. Zhang, and M. Li, "A user attention model for video summarization," in