An Effective Way to Improve YouTube-8M Classification Accuracy in Google Cloud Platform
Zhenzhen Zhong, Shujiao Huang, Cheng Zhan, Licheng Zhang, Zhiwei Xiao, Chang-Chun Wang, Pei Yang
Zhenzhen Zhong [email protected]
Shujiao Huang [email protected]
Cheng Zhan [email protected]
Licheng Zhang [email protected]
Zhiwei Xiao [email protected]
Chang-chun Wang [email protected]
Pei Yang [email protected]
Abstract
Large-scale datasets have played a significant role in the progress of neural networks and deep learning. YouTube-8M is such a benchmark dataset for general multi-label video classification. It was created from over 7 million YouTube videos (450,000 hours of video) and includes video labels from a vocabulary of 4716 classes (3.4 labels per video on average). It also comes with pre-extracted audio and visual features from every second of video (3.2 billion feature vectors in total).
Google Cloud recently released the dataset and organized the 'Google Cloud & YouTube-8M Video Understanding Challenge' on Kaggle. Competitors were challenged to develop classification algorithms that assign video-level labels using the new and improved YouTube-8M V2 dataset.
Inspired by the competition, we started exploring audio understanding and classification using deep learning algorithms and ensemble methods. We built several baseline predictions according to the benchmark paper [4] and the public TensorFlow code on GitHub. Furthermore, we improved the global average precision (GAP) from a base level of 77% to 80.7% through ensemble approaches.
1. Introduction
There is an English idiom: "a picture is worth a thousand words". This notion has held in human society for many years, as it reflects the way our brain functions. With the development of neural networks and deep learning, it can be applied to machines as well. In other words, we are now able to train computers to recognize objects in pictures and even describe them in our natural language. We thank Google for organizing the 'Google Cloud & YouTube-8M Video Understanding Challenge', which gave us a wonderful opportunity to test new ideas and implement them on the Google Cloud Platform. In this paper, we first review the baseline algorithms and then introduce our ensemble experiments.
2. Data
There are two types of data: video-level and frame-level features. The data and detailed information are well explained on the YouTube-8M dataset webpage (https://research.google.com/youtube8m/download.html). Both the video-level and the frame-level data are stored as tensorflow.Example protocol buffers, saved as 'tfrecord' files. Each type of data has been split into three sets: train, validate, and test. The number of observations in each set is given in the following table:

                 Train       Validate    Test
Number of Obs    4,906,660   1,401,828   700,640

The video-level data example proto is given in the following text format:
• "video id": an id string;
• "labels": a list of integers (given for the train and validation sets but missing for the test set);
• "mean rgb": a 1024-dim float list;
• "mean audio": a 128-dim float list.

A few examples of video-level data are given in the following table:

video id (ID)    labels (y)          mean rgb (X)                             mean audio (X)
'-09K4OPZSSo'    [66]                [0.11, -0.87, -0.19, -0.22, 0.23, ...]   [0.89, 1.31, -0.13, 0.20, -1.76, ...]
'-0MDly IiNM'    [37, 101, 29, 23]   [-0.99, 1.02, -0.74, 0.09, 0.56, ...]    [-0.66, -1.12, 0.61, -1.37, -0.01, ...]
...

Frame-level data has the same 'video id' and 'labels' features, but 'rgb' and 'audio' are given for each frame:
• "video id": e.g. '-09K4OPZSSo';
• "labels": e.g. [66];
• Feature list "rgb": e.g. [[a 1024-dim float list], [a 1024-dim float list], ...];
• Feature list "audio": e.g. [[a 128-dim float list], [a 128-dim float list], ...].

Note: each frame represents one second of the video, with up to 300 frames per video.

The data can be represented as $(X_{train}, y_{train})$, $(X_{val}, y_{val})$, and $(X_{test})$, where $X$ is the feature information of each video and $y$ is the corresponding labels. In video-level data, $X = (\text{mean rgb}, \text{mean audio})$.
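As a concrete illustration, the following minimal sketch (ours, not part of the official starter code) iterates over the video-level examples in one tfrecord file. It assumes TensorFlow 2 eager mode; note that the protocol-buffer feature keys are written with underscores ('video_id', 'mean_rgb', etc.) in the released files.

```python
import tensorflow as tf

def read_video_level_examples(path):
    """Yield (video_id, labels, mean_rgb, mean_audio) tuples from one
    video-level tfrecord file, following the proto layout above."""
    for record in tf.data.TFRecordDataset(path):
        example = tf.train.Example.FromString(record.numpy())
        f = example.features.feature
        video_id = f['video_id'].bytes_list.value[0].decode()
        labels = list(f['labels'].int64_list.value)          # absent in the test set
        mean_rgb = list(f['mean_rgb'].float_list.value)      # 1024 floats
        mean_audio = list(f['mean_audio'].float_list.value)  # 128 floats
        yield video_id, labels, mean_rgb, mean_audio
```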
3. Baseline Approaches

[4] gave a detailed introduction to several baseline approaches, including logistic regression and mixture of experts for video-level data, as well as frame-level logistic regression, deep bag of frames, and long short-term memory models for frame-level data.
Given the video-level representations, we train independent binary classifiers for each label using all the data. Exploiting the structural information between the various labels is left for future work. A key challenge is to train these classifiers at the scale of this dataset. Even with a compact video-level representation of the 6M training videos, it is infeasible to train batch-optimization classifiers such as SVMs. Instead, we use online learning algorithms, applying Adagrad to update the weight vectors given a small mini-batch of examples (each example is associated with a binary ground-truth value).
The average values of the rgb and audio representations are extracted from each video for model training. We focus on the performance of the logistic regression model and the mixture of experts (MoE) model, both of which can be trained within 2 hours on the Google Cloud Platform.
Logistic regression computes a weighted combination of the inputs, as in linear regression, and obtains a probability by passing the output through the logistic function [1]. Its cost function is the regularized log-loss:

$$\lambda \|W_e\|_2^2 + \sum_{i=1}^{N} \mathcal{L}\big(y_{i,e}, \sigma(W_e^T x_i)\big), \qquad (1)$$

where $\sigma(\cdot)$ is the standard logistic function, $\sigma(z) = 1/(1 + \exp(-z))$. The optimal weights can be found with a gradient descent algorithm.
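To make the training procedure concrete, here is a minimal sketch (hypothetical names, not the benchmark code) of one-vs-all logistic regression trained with the regularized log-loss of Eq. (1) and per-coordinate Adagrad updates on mini-batches, as described above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_ova(X, Y, lr=0.1, l2=1e-5, batch_size=256,
                       steps=1000, seed=0):
    """One-vs-all logistic regression with mini-batch Adagrad.

    X: (N, D) video-level features (e.g. mean rgb + mean audio).
    Y: (N, E) binary ground-truth matrix over the E labels.
    """
    rng = np.random.default_rng(seed)
    N, D = X.shape
    W = np.zeros((D, Y.shape[1]))
    G = np.zeros_like(W)                      # accumulated squared gradients
    for _ in range(steps):
        idx = rng.integers(0, N, size=batch_size)
        xb, yb = X[idx], Y[idx]
        p = sigmoid(xb @ W)                   # predicted probabilities
        grad = xb.T @ (p - yb) / batch_size + 2 * l2 * W
        G += grad ** 2
        W -= lr * grad / (np.sqrt(G) + 1e-8)  # per-coordinate Adagrad update
    return W
```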
The mixture of experts (MoE) was first proposed by Jacobs and Jordan [3]. Given a set of training examples $(x_i, g_i)$, $i = 1, \ldots, N$ for a binary classifier, where $x_i$ is the feature vector and $g_i \in \{0, 1\}$ is the ground-truth, let $\mathcal{L}(p_i, g_i)$ be the log-loss between the predicted probability and the ground-truth:

$$\mathcal{L}(p, g) = -g \log p - (1 - g) \log(1 - p). \qquad (2)$$

The probability of each entity is calculated from a softmax distribution over a set of hidden states, and the cost function over the whole dataset is the log-loss.
Video-level models trained with the RGB feature alone achieve 70%-74% precision. Adding the audio feature raises the score to 78%. We also tested including the validation dataset as part of the training data. With this larger training set, prediction accuracy on the test set is 2.5% higher than when using the original training set alone. In other words, we used 90% of the YouTube-8M dataset to obtain better model parameters.
Videos are decoded at one frame per second to extract frame-level representations. In our experiments, most frame-level models achieve performance similar to the video-level models.
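The forward pass of a per-label MoE can be sketched as follows (a minimal illustration under our reading of [4], where the gate is a softmax over the experts plus one dummy 'abstain' state; all names are hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def moe_predict(x, W_gate, W_expert):
    """P(label = 1 | x) for one label with K experts.

    x:        (D,) video-level feature vector.
    W_gate:   (D, K+1) gating weights (K experts + 1 dummy state).
    W_expert: (D, K) logistic weights of the K experts.
    """
    gate = softmax(x @ W_gate)        # (K+1,) mixing proportions
    experts = sigmoid(x @ W_expert)   # (K,) per-expert probabilities
    return float(np.dot(gate[:-1], experts))
```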
The frame-level features in the training dataset are obtained by randomly sampling 20 frames from each video. Frames from the same video are all assigned the ground-truth of the corresponding video, giving about 120 million frames in total. The frame-level training examples are therefore $(x_i, y_i^e)$ for each entity $e$ and frame $i = 1, \cdots, M$, where $x_i \in \mathbb{R}^D$ and $y_i^e \in \{0, 1\}$. The logistic models are trained in a "one-vs-all" sense; hence there are 4800 models in total. For inference on test data, we compute the probability that label $e$ exists in each video $\nu$ as

$$p_\nu(e \mid x_1^\nu, \ldots, x_{F_\nu}^\nu) = \frac{1}{F_\nu} \sum_{j=1}^{F_\nu} p_\nu(e \mid x_j^\nu), \qquad (3)$$

where $F_\nu$ is the number of frames in the video. That is, video-level probabilities are obtained by simply averaging frame-level probabilities.
The deep bag of frames (DBoF) model is a convolutional neural network. The main idea is to design two layers in the convolutional part. The first layer, the up-projection layer, applies weights to individual frames, with all selected frames sharing the same parameters. The second layer pools the previous layer to the video level. The approach enjoys the computational benefits of a CNN, while the weights of the up-projection layer still provide a strong representation of the input features at the frame level.
In implementation and testing, more features and more input data slightly improve the results. For example, combining the training and validation data improves the score, and adding both features (RGB + audio) boosts the result by around 0.4%. The computational cost is quite low compared to other frame-level models: training took 36 hours on a single GPU. However, the benchmark model is not well parallelized in the prediction stage, and using more GPUs does not substantially boost training speed; with 4 GPUs the total time is only reduced by 10%.
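The two frame-level steps described above, sampling 20 frames per video for training and averaging frame probabilities at inference (Eq. (3)), can be sketched as follows (our illustration, hypothetical names):

```python
import numpy as np

def sample_frames(frames, k=20, seed=0):
    """Randomly sample k frames from one video's (F, D) feature array;
    each sampled frame inherits the video's ground-truth labels."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(frames), size=min(k, len(frames)), replace=False)
    return frames[idx]

def video_probs_from_frames(frame_probs):
    """Average frame-level probabilities into a video-level
    prediction, as in Eq. (3).

    frame_probs: (F, E) array; row j holds p(e | x_j) over all E entities.
    """
    return frame_probs.mean(axis=0)
```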
Long short-term memory (LSTM) [5] is a recurrent neural network (RNN) architecture. In contrast to a conventional RNN, an LSTM uses memory cells to store, modify, and access internal state, allowing it to better discover long-range temporal relationships. As shown in Figure 1 ([6]), the LSTM cell stores a single floating-point value and maintains it unless it is added to by the input gate or diminished by the forget gate. The emission of the memory value from the LSTM cell is controlled by the output gate. The hidden layer $H$ of the LSTM is computed as follows [6]:

$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i) \qquad (4)$$
$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f) \qquad (5)$$
$$c_t = f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \qquad (6)$$
$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o) \qquad (7)$$
$$h_t = o_t \tanh(c_t) \qquad (8)$$

Figure 1. The LSTM memory cell (from [6]).

Here $x$ denotes the input, the $W$ are weight matrices (e.g. the subscript $hi$ marks the hidden-input weight matrix), the $b$ terms are bias vectors, $\sigma$ is the logistic sigmoid function, and $i$, $f$, $o$, and $c$ are respectively the input gate, forget gate, output gate, and cell activation vectors.
In the current project, the LSTM model was built following an approach similar to [6]. Based on the best performance on the validation set, 2 stacked LSTM layers with 1024 hidden units and 60 unrolling iterations were used [4].
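A single step of the cell in Eqs. (4)-(8) can be transcribed directly (a sketch for illustration only; the weight and bias containers are hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step, Eqs. (4)-(8).

    W: dict of weight matrices keyed by the subscripts in the text
       ('xi', 'hi', 'ci', ...); b: dict of bias vectors.
    """
    i = sigmoid(W['xi'] @ x_t + W['hi'] @ h_prev + W['ci'] @ c_prev + b['i'])
    f = sigmoid(W['xf'] @ x_t + W['hf'] @ h_prev + W['cf'] @ c_prev + b['f'])
    c = f * c_prev + i * np.tanh(W['xc'] @ x_t + W['hc'] @ h_prev + b['c'])
    o = sigmoid(W['xo'] @ x_t + W['ho'] @ h_prev + W['co'] @ c + b['o'])
    h = o * np.tanh(c)                 # Eq. (8): gated emission of the cell
    return h, c
```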
4. Ensemble Approaches
Several predictions at the base level have been generated. Most of the models perform reasonably well, and aggregating the predictions yields even better results. Combining base-level models and training a second level on top of them to improve the prediction is known as ensemble learning.
There are several approaches to ensemble learning, such as blending, averaging, bagging, and voting. Blending [2] is a powerful method for model ensembling. Averaging, the simplest solution, works well and is also applied in this case. Bagging (short for bootstrap aggregating) trains the same model on different random subsets of the training set and aggregates the results; the improvement from bagging can be limited. Majority voting is not appropriate in this case, since the outputs here are confidence-level probabilities instead of label ids.
A detailed introduction to popular Kaggle ensembling methods is given at https://mlwave.com/kaggle-ensembling-guide. Define $f_1, f_2, \cdots$ to be different classification models, like logistic and MoE in this case. The general idea of the blending method is as follows:
• Create a small holdout set, like 10% of the train set.
• Build the prediction model with the rest of the train set.
• Train the stacker model on this holdout set only.

Training the blend model for the YouTube-8M data requires a large amount of memory; the Google Cloud Platform provides sufficient memory and computing resources for blending. We tried this blending method on the video-level logistic and MoE base models, with the validation set as the holdout set:
• Build the logistic and MoE models on the video-level train data, $y_{train} \sim X_{train}$.
• Do the inference and print the output predictions on the validation set ($\hat{y}^{logistic}_{val}, \hat{y}^{moe}_{val}$) and the test set ($\hat{y}^{logistic}_{test}, \hat{y}^{moe}_{test}$).
• The predictions on dataset $p$ ($p$ = val or test) with model $q$ ($q$ = logistic or MoE), $\hat{y}^q_p$, give the top 20 labels and their probabilities, e.g. (a parsing sketch for this format is given at the end of this subsection):

VideoId    LabelConfidencePairs
100011194  1 0.991708  4 0.830637  1833 0.781667  2292 0.730538  297 0.718730  3547 0.465280  34 0.396639  1511 0.371649  2 0.351788  0 0.303522  92 0.169908  933 0.164513  198 0.145657  202 0.143494  658 0.106776  74 0.089043  167 0.088266  33 0.052943  332 0.049101  360 0.045714
100253546  77 0.996484  21 0.987201  142 0.971881  59 0.931193  112 0.817585  0 0.445608  8 0.112624  11 0.100307  17 0.025623  262 0.021074  1 0.020778  312 0.020060  75 0.017796  57 0.011925  60 0.005532  67 0.004512  69 0.004346  575 0.004044  3960 0.003965  710 0.003961
...
To apply the second-stage stacker model, we define the stacking feature $X^{q*}_p$ to be a 4716-dimensional vector, in which each dimension represents a label and carries the corresponding probability. Every dimension of the stacking feature defaults to 0, and the 20 estimated probabilities are filled in. For simplicity, if there are 3 estimated labels with probabilities, the new feature is defined as follows:

$$(\hat{y}^q_p)_i = [1\ 0.99\quad 4\ 0.83\quad 5\ 0.5] \Rightarrow (X^{q*}_p)_i = (0.99, 0, 0, 0.83, 0.5, 0, \cdots, 0),$$

where $i = 1, \cdots, 1401828$ if $p = val$ and $i = 1, \cdots, 700640$ if $p = test$. Write the new stacking features as $X^*_{val} = (X^{logistic*}_{val}, X^{moe*}_{val})$ and $X^*_{test} = (X^{logistic*}_{test}, X^{moe*}_{test})$.
• The stacker model is hard to run on a local machine with the new feature $X^*_{val}$, since it costs too much memory. Hence, we save the new data as tfrecord files so that the model can be run on Google Cloud. The new data example proto is given as follows:
a. "video id"
b. "labels"
c. "logistic newfeature": float array of length 4716
d. "moe newfeature": float array of length 4716
• Train the stacker model on the new validation features, $y_{val} \sim X^*_{val}$. In this case we use only logistic and MoE as stacker models, since these two are well defined in the package; other models would need to be defined if desired.
• Do the inference on the test set $X^*_{test}$ to obtain the final prediction $\hat{y}^*_{test}$.

Note that since the new feature has length 4716 instead of 1024, train.py and test.py in the Google Cloud scripts need small corrections for this method: the defined feature names and feature sizes must match the new feature names and their corresponding lengths. By adding the new features, the score improved from 0.760 to 0.775; see Table 1. The overall flow is illustrated in Figure 2.

Figure 2. Stacking Diverse Predictors Flow Chart
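As a concrete sketch of the two mechanical steps above, reading a model's top-20 predictions (in the format shown earlier) and expanding them into the 4716-dimensional stacking feature, consider the following (our illustration; all names are hypothetical, and the file is assumed to be a comma-separated file with a 'VideoId,LabelConfidencePairs' header as in the Kaggle submission format):

```python
import csv
import numpy as np

def load_predictions(path):
    """Parse a prediction file with 'VideoId' and 'LabelConfidencePairs'
    columns (space-separated label/probability pairs) into
    {video_id: {label: prob}}."""
    preds = {}
    with open(path, newline='') as f:
        for row in csv.DictReader(f):
            tokens = row['LabelConfidencePairs'].split()
            preds[row['VideoId']] = {int(l): float(p)
                                     for l, p in zip(tokens[0::2], tokens[1::2])}
    return preds

def stacking_feature(label_probs, num_classes=4716):
    """Expand a model's top-20 {label: prob} predictions into the dense
    4716-dim stacking feature; unlisted labels keep the default value 0."""
    x = np.zeros(num_classes, dtype=np.float32)
    for label, prob in label_probs.items():
        x[label] = prob   # label ids are used directly as 0-based indices
    return x
```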
The idea of averaging is to generate a smooth separation between different predictors and to reduce over-fitting. For each video, if one label is predicted several times among these models, the mean of the confidence values is used as the final prediction. On the other hand, if a label id is rarely predicted among the base models, the final prediction is calculated as the sum of its confidence values divided by the number of base models, which lowers the confidence level. On a computer with 8 GB of physical memory and 30 GB of virtual memory, the averaging process for the whole test dataset, which includes 700,640 YouTube video records, can be finished within 15 minutes, which is quite efficient considering the size of the files.
Several strategies were tested for the averaging:
Strategy A: 3 frame-level models, 2 video-level models, and 2 blending models (the video-level logistic model blended with the MoE prediction, and the video-level MoE model blended with the MoE prediction). The GAP score is 0.80396 after averaging, higher than the score of any individual base prediction.
Strategy B: 3 frame-level models, 2 video-level models, and 2 blending models (the video-level logistic model blended with both the logistic and MoE predictions, and the video-level MoE model blended with both the logistic and MoE predictions). The basis of the 2 blending predictions is the 2 video-level models, so the 2 blending models are highly correlated with the 2 video-level models. The video-level logistic model is a weaker predictor than the video-level MoE predictor.
The usage of the logistic-blended models increases the weight of this weaker predictor. Therefore, the GAP score is 0.80380 after averaging, slightly lower than Strategy A.
Strategy C: the base models are the same as in Strategy A, except that the deep bag of frames pooling model prediction is replaced by one generated after fine-tuning the hyper-parameters. Including this individual model with better performance, the score after averaging improves to 0.80424.
Strategy D: the base models are the same as in Strategy C, but the weight of the blending model (the video-level MoE model blended with both the logistic and MoE predictions) is increased to 2. Among all the base models, the GAP score of the MoE blending model is the highest. The score rises to 0.80692 with this weighted-average strategy.
Strategy E: similar to Strategy D, the weight of the MoE blending model is 2, but the logistic model blended with the logistic plus MoE predictions is replaced by the logistic model blended only with the MoE prediction, which reduces the logistic components compared to Strategy D. This approach yields the best GAP score, 0.80695.
This method boosts the final result by 7-8%, which is quite remarkable considering its simplicity. Table 1 shows the results of all base models, where the scores are calculated using the GAP metric on the Kaggle platform.
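The weighted averaging used in Strategies D and E can be sketched as follows (our illustration; `predictions` would come from a loader like the `load_predictions` sketch above, and a label absent from a model contributes 0, lowering its final confidence as described):

```python
def weighted_average(predictions, weights):
    """Weighted average of base-model confidences.

    predictions: list of {video_id: {label: prob}} dicts, one per base model.
    weights: per-model weights (e.g. 2 for the MoE blending model).
    """
    total = sum(weights)
    merged = {}
    for preds, w in zip(predictions, weights):
        for vid, label_probs in preds.items():
            acc = merged.setdefault(vid, {})
            for label, prob in label_probs.items():
                acc[label] = acc.get(label, 0.0) + w * prob
    # dividing by the total weight lowers rarely predicted labels
    return {vid: {l: p / total for l, p in lp.items()}
            for vid, lp in merged.items()}
```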
5. Conclusion
Utilizing the open resources of the YouTube-8M train and evaluation datasets, we trained baseline models in a fast-track fashion. Video-level representations are trained with a) logistic models and b) the MoE model, while frame-level features are trained with a) the LSTM model, b) the DBoF model, and c) the frame-level logistic model. We also demonstrated the efficiency of blending and averaging in improving the accuracy of prediction. Blending plays a key role in raising the performance over the baseline; we were able to train the blender model on Google Cloud, where the computing resources can digest thousands of features. The averaging solution, in addition, aggregates the wisdom of all predictions at low computational cost.

Table 1. Predictions from Individual Base Models

Model                                              Features                 Datasets for model fitting   Score
Frame-Level LSTM Model                             rgb                      validate                     0.7457
Frame-Level Deep Bag of Frames Pooling Model I     audio                    train+validate               0.77
Frame-Level Deep Bag of Frames Pooling Model II    audio                    train                        0.767
Video-Level Logistic Model                         audio/rgb                train+validate               0.76036
Video-Level Mixture of Experts Model               audio/rgb                train+validate               0.78453
Video-Level Logistic Model Blending I              audio/rgb/logistic/moe   validate                     0.77518
Video-Level Logistic Model Blending II             audio/rgb/moe            validate                     0.76873
Video-Level Mixture of Experts Model Blending      audio/rgb/logistic/moe   validate                     0.78617
References

[1] A. Géron, Hands-On Machine Learning with Scikit-Learn and TensorFlow.
[2] A. Töscher and M. Jahrer, The BigChaos solution to the Netflix Grand Prize, 2009.
[3] M. I. Jordan and R. A. Jacobs, Hierarchical mixtures of experts and the EM algorithm, Neural Computation, 6(2), 181-214, 1994.
[4] S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S. Vijayanarasimhan, YouTube-8M: A large-scale video classification benchmark, arXiv preprint arXiv:1609.08675, 2016.
[5] S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural Computation, 9(8), 1735-1780, 1997.
[6] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, Beyond short snippets: Deep networks for video classification, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4694-4702, 2015.