Improvements of Motion Estimation and Coding using Neural Networks
Inter-Prediction with Deep Learning
Abstract – Inter-Prediction is used effectively in multiple standards, including H.264 and HEVC (also known as H.265). It leverages the correlation between blocks of consecutive video frames to perform motion compensation, thereby predicting block pixel values and reducing transmission bandwidth. To reduce the magnitude of the transmitted Motion Vector (MV), and thus the bandwidth, the encoder uses a Predicted Motion Vector (PMV), which is derived by taking the median of the MVs of the neighboring blocks. In this research, we propose innovative methods, based on neural network prediction, for improving the accuracy of the calculated PMV. We begin with a straightforward approach that calculates the best matching PMV and signals its neighbor block index to the decoder, while reducing the number of bits required to represent the result without adding any computational complexity. We then use a classification Fully Connected Neural Network (FCNN) to estimate the PMV from the neighbors without requiring signaling, and show the advantage of this approach for high-motion movies. We demonstrate the advantages using fast forward movies; however, the same improvements apply to camera streams of autonomous vehicles, drone cameras, Pan-Tilt-Zoom (PTZ) cameras, and similar applications where MV magnitudes are expected to be large. We also introduce a regression FCNN to predict the PMV. We calculate Huffman coded streams and demonstrate a reduction on the order of 34% in the number of bits required to transmit the best matching calculated PMV, without reducing quality, for fast forward movies with high motion.
Index Terms – Motion Vectors, Inter Prediction, Video Encoding, Deep Learning, H.264, VP10.
I. INTRODUCTION
VIDEO coding has undergone substantial performance improvements during the last two decades. Ever-increasing demand for video content, resulting from intensive use of smartphones and a wide range of other rich-media consumer devices, coupled with expensive channel bandwidth and higher consumer expectations, creates the need to constantly reduce bandwidth requirements while retaining or improving quality. The prevailing standards used today are H.264, along with its improved successor H.265 (or HEVC – High Efficiency Video Coding) [1], [2], and AV1 from the Alliance for Open Media (AOMedia) [3]. The most fundamental algorithms used in these compression schemes are Intra-Prediction [4][5] and Inter-Prediction [1]. While Intra-Prediction takes advantage of spatial redundancies between pixel values of the same frame, Inter-Prediction leverages temporal inter-frame redundancies, i.e., the similarity between pixels in consecutive video frames. Inter-Prediction is based on the fact that frame regions often remain very similar, or even identical, between consecutive video frames. Instead of coding the original pixel block values, they are predicted from similar blocks of previous and/or future encoded frames and only the residual error is coded, thus reducing the required bandwidth. To reconstruct the original block pixel values at the decoder end, a Motion Vector (MV) is transmitted, which indicates the block used for the prediction. To further increase compression, it is desirable to reduce the magnitude of the transmitted MV, transforming the “MV-signal” into one that carries less information and therefore costs fewer bits to code. This is accomplished by predicting its value (Predicted Motion Vector or PMV) from MVs available at the encoder end and by coding only the error between the PMV and the ground truth MV, which is transmitted to the decoder end.
As MVs are transmitted losslessly (without quantization), the decoder recovers the current motion vector from its prediction and the decoded error. The calculated PMV is also used in Motion Estimation (ME) at the encoder: for a given block, the ME algorithm searches for the best matching block in the reference frame, thus estimating the best Ground Truth block/MV (GTMV). The search can begin in the region indicated by the PMV and proceed by refining it. The search algorithm is not specified in the standard and is left to the implementation. Reducing the search area reduces computational complexity and improves performance when searching for the GTMV. Therefore, reducing the error between the PMV and the ground truth MV allows better computational efficiency when searching for the GTMV at the encoder side. In this paper, we propose new methods for improving the prediction of PMVs compared to the Median vector method. Since the Median vector is not necessarily the best matching PMV, we propose to use the neighboring MV that best approximates the GTMV. We examine three new approaches for predicting the PMV. The first two select the best match among the 3 neighboring MVs, while the third uses the neighbors to estimate an approximated PMV: (1) signaling from encoder to decoder, with minimal added bandwidth; (2) training a classification neural network to predict the best PMV from the three neighboring blocks’ MVs; and (3) a regression Fully Connected Neural Network (FCNN) that takes as inputs the (𝑥, 𝑦) coordinate values of the motion vectors of neighboring blocks and predicts the PMV of the block. We compare our proposed
Raz Birman, Yoram Segal, Ofer Hadar, Senior Member, IEEE Department of Communication Systems Engineering, BGU Jenny Benois-Pineau, member IEEE, LABRI/ University of Bordeaux
methods to the Median vector, which is the prevailing method used by the standards, and demonstrate substantial improvements in the Mean Squared Error (MSE) between our PMV and the GTMV, compared to the MSE between the Median vector values and the GTMV. We also calculate the entropy of the difference between our PMV and the GTMV and use Huffman coding to calculate the number of bits required for transmission. We demonstrate a promising reduction in entropy as well as in the number of required bits, due to the better accuracy of our proposed PMV. Reducing the number of bits required to transmit the PMV typically has an impact of up to 30% on the overall encoded stream bit rate. This impact increases at higher values of the quantization parameter (Q), due to the stronger effect that Q has on coded DCT coefficients. The decision as to which blocks are estimated with intra-prediction and which with inter-prediction is made before quantization and is therefore not affected by Q. While quantization is applied to DCT coefficients, it is not applied to the transmission of MV errors. Larger Q values, which reduce the bit rate of DCT coefficients (while decreasing quality), do not affect the number of bits required to code MV errors, so the relative impact of MV error coding on bit rate increases with higher Q values. Moreover, the impact of PMV coding will be higher for high-motion movies, where the magnitudes of the MVs are relatively high. Using deep learning for video coding is still an emerging research area [6][7][8]. In most research efforts, Convolutional Neural Networks (CNN) have been used to capture features of matching blocks and use them to improve the predicted block, thus reducing the inter-prediction residual error [10][11][12][14].
This paper is the first research work that we are aware of that employs neural networks and deep learning for predicting MVs. The remainder of the paper is organized as follows. In section II we provide an overview of related research work in the area of Inter-Prediction and of the PMV estimation method commonly used by the prevailing video coding standards. In section III we present our proposed analytic method to calculate the best matching PMV. In section IV we present our proposed method of estimating the best matching MV using a classification neural network. In section V we present our proposal to estimate the PMV value using a regression neural network. In section VI we present the results of our numeric calculations. Section VII concludes the paper with a summary and future work.
II. RELATED WORK
Various efforts have been invested in improving MV prediction accuracy and search algorithms. In [13] the authors have tackled the same challenge and have indicated the drawback of having to add signaling. They offered one selection method which is based on the content statistics, thus allowing the decoder to perform the selection without adding signaling. We use neural networks in order to accomplish that. In [15] the authors present a new technique for motion vectors prediction based on spatial and temporal prediction. The motion vector of a moving object is tracked using spatial and temporal prediction and used as a starting point for the ME searching algorithm at the encoder end. The predicted motion vector is selected from several candidate motion vectors according to the block matching criterion. Experiments show that this spatial-temporal prediction reduces the number of computations performed by the motion search algorithm by 30% for MPEG2 encoding and by 40% for H.263 encoding. The Median method has typically yielded sufficiently accurate results of the GTMV for coding purposes; therefore the majority of the research efforts have been invested in improving the efficiency of the motion estimation itself. In [16] a MV prediction method is presented. It is a Prediction Search Algorithm (PSA) for block motion estimation. The proposed method utilizes a linear combination of the motion vectors of the three adjacent blocks to obtain a predicted motion vector, namely, the initial search point. Simulation results show that the proposed PSA is better than the three-step search algorithm [17] and the four-step search algorithm [18] in terms of MSE with smaller computational workload. To improve the accuracy of the fast Block Matching Algorithms (BMAs), in [20], a new adaptive motion tracking search algorithm is proposed. 
Based on the spatial correlation of motion blocks, a predicted starting search point, which reflects the motion trend of the current block, is adaptively chosen. Experimental results show that the proposed algorithm enhances the accuracy of the fast center-biased Block-Matching Algorithms (BMAs), such as the new three-step search [9], the four-step search [18], and the block-based gradient descent search [19], and also reduces their computational cost. As in [17], L. Luo et al. [21] propose a new prediction search algorithm for block motion estimation that uses a linear weighting of the MVs of the 3 adjacent blocks. In [22] E. Kaminsky and O. Hadar propose a method for effectively analyzing and selecting the most suitable motion estimation algorithm. All these methods use conventional prediction approaches, such as least squares estimation of weights in a linear combination of neighboring vectors, Median prediction, and so on. Due to substantial improvements in compression efficiency, PMV error coding is becoming more attractive as a means of obtaining further bit rate reduction while retaining the same quality (since it is not quantized). In recent years there have been efforts to harness the power of Neural Networks for improving predictions for video coding. More effort has been invested in improving Intra-Prediction [9][23][24][25][26]. However, there has also been some research on improving motion compensation. The primary focus of these papers was motion compensation at the single-pixel level, which corresponds to optical flow. Thus, the authors of [27] use Convolutional Neural Networks (CNN) for predicting a heat map optical flow from consecutive video frames.
A. Calculation of Predicted Motion Vector (PMV)
The prevailing algorithms developed for video coding standards calculate the PMV by taking the Median vector of the neighboring blocks’ MVs. This assumes that the neighboring blocks have been predicted with Motion Estimation. However, some blocks are predicted with Intra-Prediction rather than Inter-Prediction, so in some cases there are fewer than 3 motion-estimated neighbors. Therefore, the prediction applies only to a subset of the frame blocks. The Median vector is calculated for blocks that have three, two and one neighboring motion-estimated blocks respectively.
B. Primary benefits of more accurately predicted PMVs
A more accurate prediction of the PMVs can improve two different aspects of the compression algorithm: (1) More accurately predicted PMVs are expected to produce a lower error relative to the ground truth MV, thus reducing the number of bits required to represent the difference. Some caution should be exercised due to Huffman coding [29]. The important criterion for determining bit rate, when using Huffman coding, is the probability distribution of the coded values, which can be expressed by the normalized histogram of the PMV residual values. If the motion vectors are accurately predicted, low error values will be more probable, and thus the number of bits required by Huffman entropy coding will be lower. (2) The value of the PMV is usually used for the motion compensation calculation at the encoder end, which identifies the block with the greatest pixel similarity to the predicted block. When the PMV is closer to the ground truth vector, it is possible to define more efficient search regions and to increase the accuracy of finding the best matching vector, thus improving computational efficiency as well as reducing residual error.
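The dependence of bit cost on the residual distribution can be illustrated with a minimal Python sketch. It is not part of the proposed method: the function name `empirical_entropy` and the two sample residual lists are hypothetical, chosen only to show that residuals concentrated near zero have a lower entropy (in bits per symbol) than dispersed ones.

```python
import math
from collections import Counter


def empirical_entropy(residuals):
    """Shannon entropy, in bits per symbol, of the normalized residual histogram."""
    counts = Counter(residuals)
    n = len(residuals)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical residuals (PMV minus GTMV, one component) for two predictors.
accurate = [0, 0, 0, 1, 0, -1, 0, 0]    # mostly zero: peaked histogram
dispersed = [3, -2, 0, 5, -4, 1, 2, -1]  # all values distinct: flat histogram

print(empirical_entropy(accurate), empirical_entropy(dispersed))
```

The entropy is a lower bound on the average Huffman code length per symbol, which is why a peaked residual histogram translates into fewer coded bits.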
Fig. 1. Illustration of Inter-Prediction calculated PMV from 3 neighboring Motion-Estimated blocks
C. Evaluation criteria
The common method for assessing compression performance improvement is Rate-Distortion (RD) curves. RD curves present the relation between bit rate and quality as calculated for different values of the quantization parameter (Q). They provide an overall compression performance criterion that encompasses spatial as well as temporal prediction algorithms. This paper deals with improving the compression associated with encoding motion compensation MV errors. Quantization is not used for transmitting PMV errors. Moreover, as explained earlier, their impact changes with Q due to the dominant effect Q has on the encoding of DCT coefficients. To isolate the contribution of our proposed algorithms and evaluate their performance, we have chosen to use entropy and Huffman coding. Using them as criteria, we can focus on motion estimation performance and compare different algorithms for improving it.
III. CALCULATING BEST PMV
To predict the PMV of the current block, the Inter-Prediction schemes use motion vectors of three surrounding blocks as illustrated in
Fig. 1. The standard calculation of the PMV derives the Median vector of the MVs of surrounding motion estimated blocks, as indicated in equation (1).
$$\overrightarrow{PMV_i} = \overrightarrow{Median_i}, \qquad \overrightarrow{MV_i} = \overrightarrow{Median_i} + \overrightarrow{Err_i}, \qquad i = 1 \ldots k \tag{1}$$
where $k$ is the number of motion compensated predicted blocks in the frame. The median is calculated for the $x$ as well as for the $y$ component, as given in equation (2).
$$\begin{aligned} x_{PMV} &= \mathrm{Median}\{x_{MV1}, x_{MV2}, x_{MV3}\} \\ y_{PMV} &= \mathrm{Median}\{y_{MV1}, y_{MV2}, y_{MV3}\} \end{aligned} \tag{2}$$
where MV1, MV2 and MV3 represent the three surrounding MVs, as illustrated in
Fig. 1. However, as it turns out, the median is not necessarily the best approximation out of the three. Let $MV_{i,j} = \{x_{i,j}, y_{i,j}\}$ be the GTMV of the block at frame position $(i, j)$. Let $MV_{i-1,j} = \{x_{i-1,j}, y_{i-1,j}\}$, $MV_{i-1,j-1} = \{x_{i-1,j-1}, y_{i-1,j-1}\}$, and $MV_{i,j-1} = \{x_{i,j-1}, y_{i,j-1}\}$ be the MVs of the 3 neighboring blocks: the block on the right, the block on the top corner, and the block on the top, respectively. Let the median of the neighboring MVs in the $x$ and in the $y$ coordinates respectively be represented by equation (3):
$$\begin{aligned} \mathrm{Median}X_{i,j} &= \mathrm{Median}\{x_{i-1,j}, x_{i-1,j-1}, x_{i,j-1}\} \\ \mathrm{Median}Y_{i,j} &= \mathrm{Median}\{y_{i-1,j}, y_{i-1,j-1}, y_{i,j-1}\} \end{aligned} \tag{3}$$
Let the difference between the median and the GTMV of block $(i, j)$ be as described in equation (4):
$$\begin{aligned} \Delta^{\mathrm{median}\_x}_{i,j} &= \mathrm{Median}X_{i,j} - x_{i,j} \\ \Delta^{\mathrm{median}\_y}_{i,j} &= \mathrm{Median}Y_{i,j} - y_{i,j} \end{aligned} \tag{4}$$
Instead of the median, we propose an analytic method to find the neighbor MV with the lowest difference from the GTMV along the $x$ and the $y$ coordinates respectively (see equation (5)):
$$\begin{aligned} \Delta^{\mathrm{Best}\_x}_{i,j} &= \min\{|x_{i-1,j} - x_{i,j}|,\ |x_{i-1,j-1} - x_{i,j}|,\ |x_{i,j-1} - x_{i,j}|\} \\ \Delta^{\mathrm{Best}\_y}_{i,j} &= \min\{|y_{i-1,j} - y_{i,j}|,\ |y_{i-1,j-1} - y_{i,j}|,\ |y_{i,j-1} - y_{i,j}|\} \end{aligned} \tag{5}$$
The Mean Square Error (MSE) between the median and the GTMV for the frame is as described in equation (6):
$$\begin{aligned} MSE_{\mathrm{median}\_x} &= \frac{1}{N} \sum_{i,j} \left(\Delta^{\mathrm{median}\_x}_{i,j}\right)^2 \\ MSE_{\mathrm{median}\_y} &= \frac{1}{N} \sum_{i,j} \left(\Delta^{\mathrm{median}\_y}_{i,j}\right)^2 \end{aligned} \tag{6}$$
where $N$ represents the number of blocks. The MSE between the GTMV and the lowest difference neighbor is defined in equation (7):
$$\begin{aligned} MSE_{\mathrm{Best}\_x} &= \frac{1}{N} \sum_{i,j} \left(\Delta^{\mathrm{Best}\_x}_{i,j}\right)^2 \\ MSE_{\mathrm{Best}\_y} &= \frac{1}{N} \sum_{i,j} \left(\Delta^{\mathrm{Best}\_y}_{i,j}\right)^2 \end{aligned} \tag{7}$$
By definition, as indicated in equation (8):
$$\begin{aligned} MSE_{\mathrm{Best}\_x} &\le MSE_{\mathrm{median}\_x} \\ MSE_{\mathrm{Best}\_y} &\le MSE_{\mathrm{median}\_y} \end{aligned} \tag{8}$$
Therefore, our best PMV is derived by finding the MV with the lowest MSE when compared to the GTMV. The same MV will also provide the best coding efficiency, since it will also yield lower entropy than the median, which in turn results in fewer required coding bits. In the results section below, we show the entropy and also perform Huffman coding to calculate the number of bits required to code the residual error between the GTMV and the best PMV. The drawback of this approach is that we have to add signaling to the code stream in order to notify the decoder of the selected PMV for each and every motion compensated block. However, since the median can be selected as the default, we only need to signal when one of the other neighboring MVs is selected, so the required signaling can be reduced to less than one bit per block on average. Since we deal with blocks that have three motion compensated neighbors, we need to select one out of the three. The decoder can calculate the median, so we only need to signal whether the PMV is larger or smaller than the median. Such signaling can be accomplished with one extra bit for non-median blocks.
IV. USING CLASSIFICATION NN TO PREDICT BEST PMV
Signaling can be avoided altogether by training a classification neural network to predict the best PMV. The network is trained once on a dataset of blocks retrieved from multiple movies. Once trained, all network weights are fixed and can be used efficiently at runtime by the encoder and the decoder to perform the prediction from the MVs of neighboring blocks. The inputs of the network are the 3 coordinate pairs of the neighboring MVs and the argument of the median, which is the index of the corresponding MV, as indicated in equation (9).
$$input_{i,j} = \{x_{i-1,j},\ y_{i-1,j},\ x_{i-1,j-1},\ y_{i-1,j-1},\ x_{i,j-1},\ y_{i,j-1},\ Arg_{\mathrm{median}\_x},\ Arg_{\mathrm{median}\_y}\} \tag{9}$$
The Softmax layer of the classification network outputs three probabilities, and the most probable MV is selected as the best PMV. We also provide the indexes of the median MV as inputs to help the network distinguish between close classification probabilities, since the probabilities may be close in some cases and since we know that the median is, in some cases, the best PMV. The output neuron of the network with the highest probability of being classified as the best PMV is indicated in equation (10) for the $x$ and $y$ coordinates respectively, where $\phi_{x_{i,j}}$ and $\phi_{y_{i,j}}$ represent the index of the neuron with the highest probability for block $(i, j)$ in the $x$ and $y$ coordinates respectively.
$$\begin{aligned} \phi_{x_{i,j}}(input) &\approx \mathrm{Argmin}\{|x_{i-1,j} - x_{i,j}|,\ |x_{i-1,j-1} - x_{i,j}|,\ |x_{i,j-1} - x_{i,j}|\} \\ \phi_{y_{i,j}}(input) &\approx \mathrm{Argmin}\{|y_{i-1,j} - y_{i,j}|,\ |y_{i-1,j-1} - y_{i,j}|,\ |y_{i,j-1} - y_{i,j}|\} \end{aligned} \tag{10}$$
The classification network that we have used has an input layer, 5 hidden layers with 8 neurons each, and a softmax layer with 3 outputs. The input is specified in equation (9). The outputs are 3 probabilities corresponding to the 3 neighboring MVs. The highest probability output indicates the predicted best PMV, as indicated in equation (11).
$$\begin{aligned} Arg_{PMV\_x} &= \mathrm{Argmax}\{Pr^{X}_{MV1},\ Pr^{X}_{MV2},\ Pr^{X}_{MV3}\} \\ Arg_{PMV\_y} &= \mathrm{Argmax}\{Pr^{Y}_{MV1},\ Pr^{Y}_{MV2},\ Pr^{Y}_{MV3}\} \end{aligned} \tag{11}$$
Since MV values can be positive or negative, and since the argument coordinates of the median vector are different by nature from the continuum of the MV values, we have normalized all MV coordinates to the range −0.8 to 0.8 and have mapped the 3 possible median argument values to 0.85, 0.90, and 0.95 respectively, as indicated in equation (12). With this mapping scheme the input range is −0.8 to 0.95, and the median coordinates are clearly and consistently distinguished from the $(x, y)$ coordinates of the input MVs, thus improving the network learning capacity.
$$\begin{aligned} Input_{MV_{x_{i,j}}} &= 0.8 \cdot MV_{x_{i,j}} / \max_{\forall i,j}\{MV_{x_{i,j}}\} \\ Input_{MV_{y_{i,j}}} &= 0.8 \cdot MV_{y_{i,j}} / \max_{\forall i,j}\{MV_{y_{i,j}}\} \\ Input_{Arg_{\mathrm{median}\_x}} &= Arg_{\mathrm{median}\_x} \cdot 0.05 + 0.85 \\ Input_{Arg_{\mathrm{median}\_y}} &= Arg_{\mathrm{median}\_y} \cdot 0.05 + 0.85 \end{aligned} \tag{12}$$
We further used the $\tanh$ activation function, which has a suitable dynamic range of $[-1, 1]$. The classification network is illustrated in Fig. 2.
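The input construction of equations (9) and (12) can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the median argument is taken to be a zero-based neighbor index (so that 0, 1, 2 map to 0.85, 0.90, 0.95), and the normalization divides by the maximum absolute MV component over the dataset so that negative values stay within −0.8.

```python
def median_index(values):
    """Zero-based index of the neighbor holding the median of three values."""
    return sorted(range(3), key=lambda k: values[k])[1]


def classifier_input(neighbors, max_abs_x, max_abs_y):
    """Build the 8-element input of eq. (9), scaled as in eq. (12).

    neighbors: three (x, y) MVs, ordered as in eq. (9).
    max_abs_x, max_abs_y: assumed dataset-wide maxima of |MV_x| and |MV_y|.
    """
    xs = [mv[0] for mv in neighbors]
    ys = [mv[1] for mv in neighbors]
    features = []
    for x, y in neighbors:
        # MV coordinates are scaled into [-0.8, 0.8].
        features += [0.8 * x / max_abs_x, 0.8 * y / max_abs_y]
    # Median indexes are mapped to 0.85, 0.90 or 0.95.
    features.append(0.85 + 0.05 * median_index(xs))
    features.append(0.85 + 0.05 * median_index(ys))
    return features


# Hypothetical neighbor MVs and dataset maxima.
example = classifier_input([(4, -2), (-8, 6), (0, 1)], max_abs_x=8, max_abs_y=6)
print(example)
```

Keeping the median indexes in a band above 0.8 means no scaled MV coordinate can be confused with an index value, which is the separation the mapping in equation (12) is designed to provide.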
The classification accuracy achieved with 3 neighboring MVs varies between 70% and 80%, depending on the movements of dominant objects between video frames. We have found that a sufficient improvement of the PMV is less likely when testing movies with relatively small movements, i.e., small MV magnitudes. To further substantiate this observation, we have tested fast forward (FFW) movies.
Fig. 2: Classification network to find the best PMV $(x_p, y_p)$
FFW movies are created by skipping frames. The resulting MVs between 2 consecutive frames are calculated as an aggregation of the MVs between the skipped frames, as described in [30]. Therefore, the resulting MVs are of higher magnitudes. We further substantiated this observation by calculating MV statistics of our datasets (see section VI-D below). We show a 35% - 66% improvement of MSE and a corresponding 3% - 7% saving of required coding bits when running our algorithms on selected FFW movies.
V. USING REGRESSION NN TO PREDICT THE PMV
An additional method that we propose is predicting the PMV using a trained regression FCNN. We have used an FCNN that includes 1 to 5 hidden layers. The network is fed with 6 numbers, representing the $\{x; y\}$ values of the surrounding block MVs, and outputs a value representing the $x$ or $y$ value of the PMV of the current block. The network architecture (with an example of 3 hidden layers) is depicted in Fig. 3. It works as a regressor using a Euclidean loss function which, in our case, is the mean squared error between the predicted and ground-truth motion vectors resulting from the motion estimation algorithm (see equation (13)). Two different networks of identical architecture were trained to predict the $x$ and $y$ values of the PMV respectively.
Fig. 3: FCNN architecture used to predict the value of the PMV $(x_p, y_p)$
VI. RESULTS
A. Dataset
The dataset used for training is extracted from the Table Tennis movies of the UCF101 videos dataset [28]. Strong and irregular motions are observed by visual inspection of selected videos. An illustration of 2 arbitrary frames from the dataset is provided in Fig. 4. We scanned the data URLs and extracted six different datasets from multiple randomly selected Table Tennis movies. We used as many video files as necessary to reach the specified number of blocks for training and for testing respectively. Motion-compensated blocks with zero movement were ignored when compiling the blocks’ dataset. In all cases we used one dataset for training and tested with the remaining 5 datasets, averaging the performance results. An illustration of the MVs superimposed on the frame is provided in Fig. 5.
The videos we worked with have blocks of sizes 16x16, 8x8, 8x16 and 16x8, due to the target compression standard (H.264). These 4 different block sizes create a variety of $4^3 = 64$ permutations of neighboring cases. A subset of the cases, which demonstrates the considered neighbors, is depicted in Fig. 6. To assess the accuracy of the proposed prediction scheme, we calculate the Mean Squared Error (MSE) of equation (13) over the full dataset.
Fig. 4: Sample frames from the UCF101 Table Tennis Dataset
We have used FFMPEG to extract MVs from the movies. In videos with strong object motions, such as sports movies with large body part movements, the MV distribution is dispersed, as illustrated in an example from the UCF101 table tennis sequences in Fig. 4, with motion vectors illustrated in Fig. 5.
Fig. 5: Sample partial frames with superimposed MVs
$$MSE = \frac{1}{N} \sum^{N} \left\{ (\hat{x} - x)^2 + (\hat{y} - y)^2 \right\} \tag{13}$$
where:
$N$ is the number of samples in the dataset
$\hat{x}, \hat{y}$ are the predicted values of the block’s MV (PMV)
$x, y$ are the ground truth values of the block’s MV (GTMV)
Using the ‘Table Tennis’ videos subset of the UCF101 dataset [28] we have extracted 50,000 blocks that satisfy a neighboring criterion of having 3, 2 or 1 motion compensated blocks respectively. All remaining blocks are estimated with Intra-Prediction and are therefore not relevant to our algorithms. Since not all blocks have neighbors predicted with motion compensation, we have divided the dataset into three different categories according to the following neighboring criteria:
1. Blocks with 3 motion compensation neighbors
2. Blocks with 2 motion compensation neighbors
3. Blocks with 1 motion compensation neighbor
For testing we have extracted a sample of 2,000 blocks from the same dataset, while ensuring that testing blocks and training blocks are extracted from different movies.
B. Calculating best PMV
At the encoder side, we have all the information necessary to calculate the best matching neighboring MV, which is the closest in value to the GTMV. We have performed these calculations and compared them to the median MV. We compared the MSE relative to the GTMV, the entropy of the difference from the GTMV (which is an indication of the coding bits requirement) and the actual number of bits, calculated using Huffman coding of all PMVs over the datasets.
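The comparison described above can be sketched in Python for a single MV component. This is a minimal illustration rather than the authors' implementation: the function names and the sample values are hypothetical, the best-neighbor residual keeps its sign so that it can be entropy coded, and the Huffman bit count ignores the cost of transmitting the code table and the signaling bits.

```python
import heapq
from collections import Counter
from statistics import median


def residuals(neighbors, truth):
    """Per-block residuals of the median and of the closest neighbor vs. the GTMV."""
    med = [median(n) - t for n, t in zip(neighbors, truth)]
    best = [min((v - t for v in n), key=abs) for n, t in zip(neighbors, truth)]
    return med, best


def mse(errors):
    """Mean squared error of a list of residuals."""
    return sum(e * e for e in errors) / len(errors)


def huffman_total_bits(symbols):
    """Total bits needed to Huffman-code `symbols` (code table overhead ignored)."""
    counts = list(Counter(symbols).values())
    if len(counts) == 1:
        return counts[0]  # assumption: a lone symbol still costs one bit each
    heapq.heapify(counts)
    total = 0
    while len(counts) > 1:
        a = heapq.heappop(counts)
        b = heapq.heappop(counts)
        total += a + b  # each merge adds one bit to every symbol beneath it
        heapq.heappush(counts, a + b)
    return total


# Hypothetical x components: three neighbor MVs per block and the block's GTMV.
neighbors_x = [(1, 5, 9), (2, 2, 7), (-3, 0, 4)]
truth_x = [9, 2, -2]
med_err, best_err = residuals(neighbors_x, truth_x)
print(mse(med_err), mse(best_err))
print(huffman_total_bits(med_err), huffman_total_bits(best_err))
```

Because the best neighbor is chosen per block with knowledge of the GTMV, its squared error can never exceed that of the median, which is itself one of the three neighbor values; this is the inequality of equation (8).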
Table 1 and Table 2 provide the calculated MSE, entropy and number of required coding bits for the $\Delta x$ and $\Delta y$ coordinates respectively. The average savings indicated by these numbers are consistent for both coordinates and yield a 75% improvement of MSE, a 42% improvement of the entropy, a 35% reduction of the bits required to code the $MV_x$ component and a 31% reduction of the bits required to code the $MV_y$ component. When averaging over the datasets, these improvements have a very small variation, on the order of ±0.005%, when considering the standard deviation of the results. The average savings percentages are depicted in Table 3.
Fig. 6: Subset of the possible block neighbors used to calculate the MVs
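Table 1: MSE, Entropy and number of Bits of the Median compared to the best PMV with respect to the GTMV. Calculated for Δx using 5 different datasets and averaged

| | Median_x: MSE | Median_x: Entropy | Median_x: Bits | Best PMV_x: MSE | Best PMV_x: Entropy | Best PMV_x: Bits |
|---|---|---|---|---|---|---|
| Dataset1 | | | | | | |
| Dataset2 | | | | | | |
| Dataset3 | | | | | | |
| Dataset4 | | | | | | |
| Dataset5 | | | | | | |
| Average ± Std Dev | | | | | | |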
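Table 2: MSE, Entropy and number of Bits of the Median compared to the best PMV with respect to the GTMV. Calculated for Δy using 5 different datasets and averaged

| | Median_y: MSE | Median_y: Entropy | Median_y: Bits | Best PMV_y: MSE | Best PMV_y: Entropy | Best PMV_y: Bits |
|---|---|---|---|---|---|---|
| Dataset1 | | | | | | |
| Dataset2 | | | | | | |
| Dataset3 | | | | | | |
| Dataset4 | | | | | | |
| Dataset5 | | | | | | |
| Average ± Std Dev | | | | | | |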
These improvements are substantial. However, they do not include the signaling bits required to indicate to the decoder which neighbor is the best PMV. The results of including the signaling bits for the cases in which the median is not the selected PMV are presented in Table 4. The results show an average reduction of 23% and 18% in the bits required to code the $MV_x$ component and the $MV_y$ component respectively.
Table 3: % improvement of MSE, Entropy and number of Bits with respect to the Median vector. Calculated for Δx (Best_x/∇Median_x) and for Δy (Best_y/∇Median_y), using 5 different datasets and averaged

| | Δx: MSE | Δx: Entropy | Δx: Bits | Δy: MSE | Δy: Entropy | Δy: Bits |
|---|---|---|---|---|---|---|
| Dataset1 | 75% | 42% | 36% | 76% | 42% | 31% |
| Dataset2 | 75% | 42% | 36% | 76% | 42% | 31% |
| Dataset3 | 75% | 42% | 35% | 76% | 42% | 31% |
| Dataset4 | 75% | 42% | 36% | 75% | 42% | 31% |
| Dataset5 | 74% | 42% | 35% | 76% | 43% | 31% |
| Average | | | | | | |
Table 4: Saving of required coding bits, compared to the Median vector, when using the best PMV and adding signaling

| | x: ∇Median | x: ∇Best | x: % Saving | y: ∇Median | y: ∇Best | y: % Saving |
|---|---|---|---|---|---|---|
| Dataset1 | 153,401 | 117,667 | 23% | 105,870 | 89,105 | 16% |
| Dataset2 | 146,390 | 116,712 | 20% | 125,586 | 103,100 | 18% |
| Dataset3 | 160,488 | 123,120 | 23% | 117,159 | 96,127 | 18% |
| Dataset4 | 170,982 | 128,340 | 25% | 123,469 | 100,573 | 19% |
| Dataset5 | 150,068 | 115,673 | 23% | 112,512 | 92,764 | 18% |
| Average | | | | | | |

C. Using a classification network to predict best PMV
We can avoid the signaling by training a classification neural network to predict the best PMV from the 3 neighboring motion compensated blocks, as described in section IV. We have trained two different networks with the architecture depicted in Fig. 2: one for the $PMV_x$ component and a second for the $PMV_y$ component. The training accuracy obtained is 70.4% for $PMV_x$ and 79% for $PMV_y$. The network convergence graphs over training epochs are depicted in Fig. 7 and Fig. 8 for the $PMV_x$ and $PMV_y$ components respectively. The training data is split with a 30% validation set, and the accuracy is calculated on the validation set during training. The MSE, entropy and number of required coding bits were calculated for the $PMV_x$ and $PMV_y$ components. The results are provided in Table 5 and Table 6 respectively. As can be seen in Table 7, the classification network accuracy is insufficient to obtain an improvement. The average degradation of MSE is 16% and 21%, and the average degradation in entropy is 7% and 9%, for the $MV_x$ and $MV_y$ components respectively. The average degradation in the number of bits required for coding is 6% and 7% for $MV_x$ and $MV_y$ respectively. Nevertheless, we have observed that the degradation depends on the magnitude of the MVs. To further substantiate this observation, we have applied the classification network to FFW movies.
Fig. 7: Classification network convergence when training for the $PMV_x$ component
Fig. 8: Classification network convergence when training for the $PMV_y$ component
Table 5: MSE, Entropy and number of Bits of the Median vector compared to the classification network predicted best PMV with respect to the GTMV. Calculated for Δx using 5 different datasets and averaged

| | Median_x: MSE | Median_x: Entropy | Median_x: Bits | Best PMV_x (Classification): MSE | Best PMV_x (Classification): Entropy | Best PMV_x (Classification): Bits |
|---|---|---|---|---|---|---|
| Dataset1 | 4.3528 | 1.707 | 22,222 | 5.3815 | 1.7781 | 23,066 |
| Dataset2 | 4.7668 | 1.3244 | 19,835 | 6.4904 | 1.5478 | 22,381 |
| Dataset3 | 8.304 | 1.7887 | 20,746 | 8.5806 | 1.8989 | 21,982 |
| Dataset4 | 8.0678 | 1.8207 | 19,301 | 8.5573 | 1.9172 | 20,275 |
| Dataset5 | 5.3619 | 1.7695 | 19,695 | 5.8626 | 1.7921 | 19,832 |
| Average ± Std Dev | 6.1707 ± 1.88 | 1.6821 ± 0.2 | 20,360 ± 1,168 | 6.9745 ± 1.51 | 1.7868 ± 0.15 | 21,507 ± 1,391 |

D. Application for Fast Forward movies
Since FFW movies are created by skipping frames, and since the final movie MVs are an aggregation of the MVs between the individual skipped frames [30], the magnitude of MVs in FFW movies is much larger than that of regular movies. For such movies, the accuracy of our classification network is therefore sufficient to obtain an improvement.
Table 6: MSE, Entropy and number of Bits of the Median vector compared to the classification network predicted best PMV with respect to the GTMV. Calculated for Δy using 5 different datasets and averaged

| | Median_y: MSE | Median_y: Entropy | Median_y: Bits | Best PMV_y (Classification): MSE | Best PMV_y (Classification): Entropy | Best PMV_y (Classification): Bits |
|---|---|---|---|---|---|---|
| Dataset1 | 2.6193 | 1.4594 | 19,379 | 3.2857 | 1.5642 | 20,516 |
| Dataset2 | 1.1909 | 0.9895 | 16,232 | 1.5902 | 1.1639 | 17,989 |
| Dataset3 | 4.517 | 1.562 | 18,348 | 4.6598 | 1.6541 | 19,308 |
| Dataset4 | 4.3742 | 1.5877 | 17,084 | 4.9874 | 1.6801 | 17,903 |
| Dataset5 | 3.9654 | 1.3801 | 15,922 | 5.1618 | 1.5087 | 17,032 |
| Average ± Std Dev | 3.3333 ± 1.41 | 1.3957 ± 0.24 | 17,393 ± 1,455 | 3.937 ± 1.5 | 1.5142 ± 0.21 | 18,550 ± 1,367 |
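Table 7: % degradation of MSE, entropy, and number of bits with respect to the median vector. Calculated for ∆x and for ∆y, using 5 different datasets and averaged.

| Dataset | Best_x/∇Median_x (Classification): MSE | Best_x/∇Median_x (Classification): Entropy | Best_x/∇Median_x (Classification): Bits | Best_y/∇Median_y (Classification): MSE | Best_y/∇Median_y (Classification): Entropy | Best_y/∇Median_y (Classification): Bits |
|---|---|---|---|---|---|---|
| Dataset1 | -24% | -4% | -4% | -25% | -7% | -6% |
| Dataset2 | -36% | -17% | -13% | -34% | -18% | -11% |
| Dataset3 | -3% | -6% | -6% | -3% | -6% | -5% |
| Dataset4 | -6% | -5% | -5% | -14% | -6% | -5% |
| Dataset5 | -9% | -1% | -1% | -30% | -9% | -7% |
| Average | -16% ± 14% | -7% ± 6% | -6% ± 4% | -21% ± 13% | -9% ± 5% | -7% ± 2% |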
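The three figures of merit reported in Tables 5 to 7 can be illustrated with a minimal Python sketch. It assumes that the coded symbol is the integer residual between a GTMV component and its predicted value, that the entropy is the empirical zeroth-order entropy of those residuals, and that the bit count is the length of a Huffman-coded stream [29] without the code table. The exact procedure is not restated in this section, so these are assumptions, and all function names are illustrative.

```python
import heapq
import math
from collections import Counter


def mse(predicted, ground_truth):
    """Mean squared error between predicted and ground-truth MV components."""
    return sum((p - g) ** 2 for p, g in zip(predicted, ground_truth)) / len(ground_truth)


def entropy_bits(symbols):
    """Empirical zeroth-order entropy of a symbol sequence, in bits per symbol."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def huffman_total_bits(symbols):
    """Total length in bits of the Huffman-coded sequence (code table excluded)."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return len(symbols)  # assumption: a lone symbol still costs 1 bit per occurrence
    heap = list(counts.values())
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        total += a + b  # each merge adds one bit to every symbol below it
        heapq.heappush(heap, a + b)
    return total


def relative_change(baseline, candidate):
    """Positive when the candidate is smaller (better) than the baseline."""
    return (baseline - candidate) / baseline


# Toy residuals (made-up values, not data from the paper).
toy = [0, 0, 0, 0, 1, 1, 2, 3]
toy_entropy = entropy_bits(toy)     # 1.75 bits per symbol
toy_bits = huffman_total_bits(toy)  # 14 bits in total

# Dataset1 MSE values of Table 5: median 4.3528, classification 5.3815.
dataset1_change = relative_change(4.3528, 5.3815)  # about -0.24, the -24% of Table 7
```

With this sign convention a negative value denotes a degradation with respect to the median, which is how the entries of Table 7 are read here.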
We used two different FFW movies of car traffic. One frame of each movie is depicted in Fig. 9. We compared the MV statistics of block subsets extracted from these two movies to those of our UCF101 Table Tennis dataset. The average MV component values and their standard deviations over 50,000 sampled blocks are listed in Table 8. As can be seen, the standard deviation of the MVs of the FFW movies is much larger than that of the Table Tennis movies, indicating much larger MV magnitudes.
Fig. 9: Sample of one frame from two different FFW movies

Table 8: MV statistics for datasets extracted from UCF101 Table Tennis movies compared to two FFW movies

| Movies | MV_x: Average | MV_x: Std Dev | MV_y: Average | MV_y: Std Dev |
|---|---|---|---|---|
| Table Tennis Movies | -0.0394 ± 0.079 | 3.1092 ± 0.37 | 0.1748 ± 0.25 | 2.438 ± 0.21 |
| FFW Movies | 3.1377 ± 3.23 | 22.7361 ± 10.41 | -0.6803 ± 0.49 | 4.4247 ± 1.76 |
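Per-component statistics of the kind summarized in Table 8 can be computed as in the following sketch. It assumes each sampled block contributes one `(mv_x, mv_y)` pair and uses the population standard deviation; whether the paper uses the population or the sample formula is not stated, and the function name is illustrative.

```python
import statistics


def mv_component_stats(mvs):
    """Return {"x": (mean, std dev), "y": (mean, std dev)} for (mv_x, mv_y) pairs."""
    xs = [mv[0] for mv in mvs]
    ys = [mv[1] for mv in mvs]
    return {
        "x": (statistics.mean(xs), statistics.pstdev(xs)),
        "y": (statistics.mean(ys), statistics.pstdev(ys)),
    }


# Made-up MVs for illustration only: small motion versus large motion.
low_motion = [(1, 0), (-1, 0), (1, 0), (-1, 0)]
high_motion = [(20, 2), (-20, -2), (20, 2), (-20, -2)]

low_stats = mv_component_stats(low_motion)
high_stats = mv_component_stats(high_motion)
```

In both toy sets the mean is zero, so only the standard deviation reveals the difference in MV magnitude, which is the property the comparison in Table 8 relies on.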
We trained the classification neural network of Fig. 2 for these movies and calculated the resulting MSE, entropy, and number of required bits compared to the median. The results are listed in Table 9 and Table 10 for PMV_x and PMV_y, respectively. The accuracy obtained during training was 84% and 80% for MV_x and MV_y, respectively. The improvements in MSE, entropy, and number of coding bits are listed in Table 11. As can be seen, the results vary widely between ∆x and ∆y as well as between the two movies. This is due to the content, which has different MV distributions in these dimensions.

Table 9:
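MSE, entropy, and number of bits of the median vector compared to the classification-network-predicted best PMV, with respect to the GTMV. Calculated for ∆x using 2 FFW movies.

| Movie | Median_x: MSE | Median_x: Entropy | Median_x: Bits | Best PMV_x (Classification): MSE | Best PMV_x (Classification): Entropy | Best PMV_x (Classification): Bits |
|---|---|---|---|---|---|---|
| Movie 1 | 87.3705 | 1.821 | 5,617 | 37.4915 | 1.678 | 5,156 |
| Movie 2 | 600.5415 | 2.3094 | 7,374 | 144.8425 | 2.1522 | 6,580 |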
Repeating the calculation for FFW movies with the analytically derived best PMV of section B above, we obtain reductions in the number of coding bits of 20% and 11% for PMV_x and PMV_y, respectively.

Table 10: MSE, entropy, and number of bits of the median vector compared to the classification-network-predicted best PMV, with respect to the GTMV. Calculated for ∆y using 2 FFW movies.

| Movie | Median_y: MSE | Median_y: Entropy | Median_y: Bits | Best PMV_y (Classification): MSE | Best PMV_y (Classification): Entropy | Best PMV_y (Classification): Bits |
|---|---|---|---|---|---|---|
| Movie 1 | 4.7835 | 1.096 | 3,779 | 4.6185 | 1.096 | 3,714 |
| Movie 2 | 26.346 | 1.6281 | 5,197 | 8.551 | 1.5354 | 4,720 |
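The signaled best-PMV scheme referred to above can be sketched as follows. The sketch assumes, hypothetically, that the selection is done per MV component and by the smallest absolute residual, and that the index of the chosen neighbor is sent next to the residual. The names `best_neighbor_index`, `encode_component`, and `decode_component` are illustrative and do not come from the paper.

```python
def best_neighbor_index(neighbor_values, gt_value):
    """Index of the neighbor MV component closest to the ground-truth component."""
    return min(range(len(neighbor_values)),
               key=lambda i: abs(neighbor_values[i] - gt_value))


def encode_component(neighbor_values, gt_value):
    """Encoder side: pick the best neighbor, return (signaled index, residual)."""
    idx = best_neighbor_index(neighbor_values, gt_value)
    return idx, gt_value - neighbor_values[idx]


def decode_component(neighbor_values, idx, residual):
    """Decoder side: rebuild the MV component from the signaled index and residual."""
    return neighbor_values[idx] + residual


# Made-up example: three neighbor MV x-components and a ground-truth value.
neighbors_x = [4, -2, 10]
gt_x = 9
idx, residual = encode_component(neighbors_x, gt_x)  # neighbor 10 is chosen, residual -1
reconstructed = decode_component(neighbors_x, idx, residual)
median_residual = gt_x - sorted(neighbors_x)[1]      # the median predictor would leave 5
```

The smaller residual is what saves bits, while the signaled index is the overhead that a classification network would try to avoid by inferring the index at the decoder.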
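Table 11: % improvement of MSE, entropy, and number of bits with respect to the median vector. Calculated for ∆x and for ∆y, for 2 FFW movies.

| Movie | Best_x/∇Median_x (Classification): MSE | Best_x/∇Median_x (Classification): Entropy | Best_x/∇Median_x (Classification): Bits | Best_y/∇Median_y (Classification): MSE | Best_y/∇Median_y (Classification): Entropy | Best_y/∇Median_y (Classification): Bits |
|---|---|---|---|---|---|---|
| Movie 1 | 57% | 8% | 8% | 3% | 0% | 2% |
| Movie 2 | 76% | 7% | 11% | 68% | 6% | 9% |

E. Using regression to predict the PMV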
We have implemented a regression network with the architecture depicted in Fig. 3 to predict the value of the PMV from the MVs of the neighboring blocks. The regression network learns to predict a PMV value that is not necessarily identical to one of the neighboring blocks' MVs, unlike the classification network of the previous method, which selects one of them. We trained the network with 50,000 blocks and a validation split of 30%. The loss function minimizes the MSE between the predicted PMV and the ground truth MV, for each PMV component separately. The regression network typically converges according to that criterion in fewer than 50 epochs, where each epoch processes the complete dataset (no batching was used in training). We trained two separate networks, predicting the x and y components of the PMV respectively, in each case using the extracted MV dataset. The participating movies were selected randomly from within the dataset. The test was run on a dataset of 2,000 blocks, taken from movies that were also selected randomly from the same dataset while ensuring that they always differ from the ones used for training. The test was run 5 times with different randomly selected datasets.

The results for blocks with 3 motion-compensated neighboring blocks and with 2 neighboring blocks are summarized in Table 12 and Table 13, respectively. As can be seen, the prediction results for 3 motion-compensated neighbors are ~20% better in terms of MSE than those of the median vector. The trained network results for blocks with 2 motion-compensated neighbors, where the median is actually the average of the two, are not as good but are still better than the median vector.
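Table 12: MSE of regression network-predicted versus median-predicted PMV for blocks with 3 motion-compensated neighbors

| Dataset | PMV_x/∇Median_x (Regression): Median | PMV_x/∇Median_x (Regression): Predicted | PMV_x/∇Median_x (Regression): Diff | PMV_y/∇Median_y (Regression): Median | PMV_y/∇Median_y (Regression): Predicted | PMV_y/∇Median_y (Regression): Diff |
|---|---|---|---|---|---|---|
| Dataset1 | | | | | | |
| Dataset2 | | | | | | |
| Dataset3 | | | | | | |
| Dataset4 | | | | | | |
| Dataset5 | | | | | | |
| Average | | | 21% ± 5% | | | 20% ± 3% |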
The trained network prediction results for blocks with only one motion-compensated neighbor, where the median vector is simply the neighboring block's vector itself, are not as good as the median vector; we therefore recommend using the baseline median vector in this case.
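The behavior of the baseline median predictor for 3, 2, and 1 motion-compensated neighbors, and the recommendation to fall back to it when only one neighbor is available, can be sketched as follows. The function names, the zero predictor for blocks without neighbors, and the `regression_model` callable are hypothetical placeholders rather than the paper's implementation.

```python
import statistics


def median_pmv(neighbor_mvs):
    """Component-wise median of the available neighbor MVs [(x, y), ...]."""
    if not neighbor_mvs:
        return (0, 0)  # assumption: zero predictor when no neighbor is motion compensated
    xs, ys = zip(*neighbor_mvs)
    # statistics.median returns the middle value for 3 inputs, the average of
    # the two for 2 inputs, and the value itself for a single input.
    return (statistics.median(xs), statistics.median(ys))


def predict_pmv(neighbor_mvs, regression_model=None):
    """Use the regression model for 2 or more neighbors, else the median baseline."""
    if regression_model is not None and len(neighbor_mvs) >= 2:
        return regression_model(neighbor_mvs)
    return median_pmv(neighbor_mvs)


three = median_pmv([(4, 1), (-2, 3), (10, 2)])  # (4, 2)
two = median_pmv([(2, 0), (5, 4)])              # (3.5, 2.0), the average of the two
one = median_pmv([(7, -1)])                     # (7, -1), the neighbor itself
```

Passing any callable that maps the neighbor list to an `(x, y)` pair as `regression_model` switches blocks with two or three neighbors to the learned predictor while leaving single-neighbor blocks on the median.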
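Table 13: MSE of regression network-predicted versus median-predicted PMV for blocks with 2 motion-compensated neighbors

| Dataset | PMV_x/∇Median_x (Regression): Median | PMV_x/∇Median_x (Regression): Predicted | PMV_x/∇Median_x (Regression): Diff | PMV_y/∇Median_y (Regression): Median | PMV_y/∇Median_y (Regression): Predicted | PMV_y/∇Median_y (Regression): Diff |
|---|---|---|---|---|---|---|
| Dataset1 | | | | | | |
| Dataset2 | | | | | | |
| Dataset3 | | | | | | |
| Dataset4 | | | | | | |
| Dataset5 | | | | | | |
| Average | | | 13% ± 4% | | | 8% ± 5% |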
We compared the entropy and the number of coding bits to the results of the previous method (the classification network) when running the prediction for the FFW movies. The comparison results are listed in Table 14 and Table 15 for MV_x and MV_y, respectively.

Table 14: Entropy and number of bits of the median vector compared to the regression-network-predicted best PMV, with respect to the GTMV. Calculated for ∆x using 2 FFW movies.

| Entropy | Median_x | PMV_x (Regression) | Improvement |
|---|---|---|---|
| Movie 1 | | | |
As can be seen, the regression network provides substantially better PMV accuracy for the FFW movies. This can be attributed to the relatively low classification accuracy that was obtained while training the classification neural networks.

The number of layers used for the neural network has a substantial impact on the number of matrix multiplications required to predict MVs; reducing the number of layers is therefore desirable in order to improve computational efficiency. We experimented with 1 to 5 hidden layers.
Table 15: Entropy and number of bits of the median vector compared to the regression-network-predicted best PMV, with respect to the GTMV. Calculated for ∆y using 2 FFW movies.

| Entropy | Median_y | PMV_y (Regression) | Improvement |
|---|---|---|---|
| Movie 1 | | | |
Our results show that 1 hidden layer is sufficient to obtain the improved prediction results. Increasing the number of layers does not necessarily improve the prediction accuracy, as can be seen in Table 16. For each run we used 5 different datasets and averaged the improvement results. A plausible explanation is that the nature of the movements requires a function that a 1-hidden-layer network can represent. As the number of layers increases unnecessarily, the number of redundant weights grows and, given the same dataset size, the network overfits, so the accuracy on the test dataset deteriorates. To obtain satisfactory training results with a larger number of layers, the dataset size has to be increased. More layers may be required when processing movies with higher motion activity.
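The effect of the number of hidden layers on the inference cost and on the number of trainable weights can be estimated with a short sketch. The input size of 3 (one MV component from each of three neighbors) and the hidden width of 16 are hypothetical values chosen only for illustration; the actual layer sizes are those of Fig. 3 and are not reproduced here.

```python
def fcnn_cost(n_inputs, hidden_width, n_hidden_layers, n_outputs=1):
    """Return (multiply-accumulates per prediction, trainable parameters)."""
    sizes = [n_inputs] + [hidden_width] * n_hidden_layers + [n_outputs]
    # One multiply-accumulate per weight of every fully connected layer.
    macs = sum(a * b for a, b in zip(sizes, sizes[1:]))
    # Trainable parameters: all weights plus one bias per non-input unit.
    params = macs + sum(sizes[1:])
    return macs, params


# Cost for 1 to 5 hidden layers under the assumed sizes.
costs = {layers: fcnn_cost(3, 16, layers) for layers in range(1, 6)}
```

Under these assumed sizes, one hidden layer needs 64 multiply-accumulates and 81 parameters, whereas five hidden layers need 1,088 and 1,169. The same training set therefore has to constrain roughly fourteen times more weights, which is consistent with the overfitting explanation given above.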
Table 16: MSE prediction accuracy vs. the number of hidden layers

| PMV_x/∇Median_x (Regression) | PMV_y/∇Median_y (Regression) |
|---|---|

F. Neural Network Training Optimization
The networks were implemented in Python using Keras. The networks were trained with the full dataset per epoch. The Keras early-stopping criterion was defined with a patience of 20 epochs without improvement and a minimum delta (min_delta) of 0.01 for the loss function. The FCNNs used for classification and for regression are depicted in Fig. 2 and Fig. 3, respectively. The networks were optimized using the Adam optimizer [31], an improvement of the Stochastic Gradient Descent (SGD) algorithm [32] that combines Momentum, which is effectively an exponentially weighted running average of the gradients over the steps so far, and RMSprop, which scales the step by an exponentially weighted running average of the squared gradient in order to damp the updates in steep directions in favor of more gradual and stable ones. The Momentum, RMSprop, and Adam optimizer formulas are described in the Appendix. We used the Adam optimizer with a constant learning rate of 0.001 and decay coefficients of 0.9 and 0.999 for β1 and β2, respectively, which are the defaults of the Keras Adam optimizer and provided a sufficiently fast convergence rate.

VII. CONCLUSIONS AND FUTURE WORK
In this paper we have proposed three algorithms to improve the Predicted Motion Vector (PMV). We have demonstrated potential savings on the order of 20% in the encoding of the PMV when selecting the best PMV analytically. We further used a classification neural network to predict which of the neighboring MVs is the best prediction of the GTMV, and demonstrated a bit rate reduction of 5%-9% for fast forward movies, which have high motion. The same applies to other high-motion movies in which the magnitude of the MVs is large, such as movies taken from the cameras of autonomous vehicles, drones, and PTZ cameras. We further used a Fully Connected Neural Network (FCNN) regression approach to predict the PMV for a target block from its neighboring motion-compensated blocks, and accomplished a substantial accuracy improvement compared to the commonly used median-based prediction.

Our classification network for selecting the best neighboring PMV was not sufficiently accurate to reduce the bit rates of standard movies. However, it does reduce the bit rates of FFW movies by ~7%. While the analytic calculation of the best PMV from the 3 neighboring blocks provides the largest reduction of bit rate (~34%) for all movies, having to add signaling reduces the bit savings to ~20%. The regression network accomplishes the best improvement for FFW movies, reducing the bit rate by ~34%. The bit reductions of the 3 methods when applied to the FFW movies are listed in Table 17.

Table 17:
Bits reduction comparison of the 3 proposed PMV methods

| Method | For PMV_x | For PMV_y |
|---|---|---|
| Best PMV | 20% | 11% |
| Classification | 9% | 5% |
| Regression | 36% | 31% |
The accuracy of the predicted PMVs can be further improved by incorporating additional 2nd-order neighboring motion-compensated MVs and/or inter-frame MVs, taking advantage of the temporal domain. The same added 2nd-order neighbors can also be used to improve the accuracy of the classification network and thus provide a closer prediction of the best neighbor MV, which clearly yields a large efficiency improvement. The improvements in PMV accuracy can also be leveraged for exploring more computationally efficient motion estimation algorithms. We also propose to improve the entropy of the predicted MVs, and therefore the coding efficiency, by incorporating a related criterion in the neural network loss function during training.

In this paper we have not considered computational complexity. While the Best PMV method retains the same computational complexity as the median, the classification and regression neural networks increase it. Assuming that training is performed in advance and that the runtime only performs parallel matrix operations on Graphics Processing Units (GPUs), we expect the added complexity to be acceptable and to still allow real-time processing. However, this matter can be further investigated in future research work.

APPENDIX
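As a numerical companion to the Adam update of equation (16) in this appendix, the following sketch applies the update to a single scalar weight. It uses the learning rate and decay coefficients quoted in the training section (0.001, 0.9, and 0.999). The value of ε, the toy loss (w - 3)^2, and the function name are assumptions made for illustration. The sketch follows the customary form in which the bias-corrected moments are applied right after they are updated.

```python
import math


def adam_step(w, g, m1, m2, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update of a scalar weight; t is the 1-based step count."""
    m1 = beta1 * m1 + (1 - beta1) * g      # moving estimate of the gradient
    m2 = beta2 * m2 + (1 - beta2) * g * g  # moving estimate of the squared gradient
    m1_hat = m1 / (1 - beta1 ** t)         # bias correction for the zero start
    m2_hat = m2 / (1 - beta2 ** t)
    w = w - alpha * m1_hat / (math.sqrt(m2_hat) + eps)
    return w, m1, m2


def minimize_toy_loss(steps, w=0.0):
    """Run Adam on the toy loss (w - 3)^2, whose gradient is 2 * (w - 3)."""
    m1 = m2 = 0.0
    for t in range(1, steps + 1):
        g = 2.0 * (w - 3.0)
        w, m1, m2 = adam_step(w, g, m1, m2, t)
    return w


w_after_1 = minimize_toy_loss(1)      # moves by about alpha, regardless of gradient size
w_after_100 = minimize_toy_loss(100)  # about 0.1 while the gradient stays nearly constant
```

The first step has a magnitude of approximately the learning rate whatever the size of the gradient, which illustrates the normalization by the squared-gradient estimate.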
The Momentum, RMSprop, and Adam optimizer formulas are provided in equations (14), (15), and (16), respectively.

$$
\begin{aligned}
v_{t+1} &= \rho v_t + g_t \\
w_{t+1} &= w_t - \alpha v_{t+1}
\end{aligned}
\tag{14}
$$

Where, for iteration time $t$, $w_t$ corresponds to the weights that are updated; $v_t$ corresponds to the velocity, i.e., the decayed running sum of the gradients; $g_t$ corresponds to the gradient; $\rho$ is the momentum (friction) coefficient hyperparameter; and $\alpha$ is the learning rate hyperparameter.

$$
\begin{aligned}
m^{(2)}_{t+1} &= \beta m^{(2)}_t + (1 - \beta) g_t^2 \\
w_{t+1} &= w_t - \frac{\alpha g_t}{\sqrt{m^{(2)}_t} + \varepsilon}
\end{aligned}
\tag{15}
$$

Where, for iteration time $t$, $w_t$ corresponds to the weights that are updated; $g_t$ corresponds to the gradient; $m^{(2)}_t$ corresponds to a moving estimate of the squared gradient; $\beta$ is a decay rate hyperparameter; $\alpha$ is the learning rate hyperparameter; and $\varepsilon$ is a small value that protects from dividing by zero.

$$
\begin{aligned}
m^{(1)}_{t+1} &= \beta_1 m^{(1)}_t + (1 - \beta_1) g_t \\
m^{(2)}_{t+1} &= \beta_2 m^{(2)}_t + (1 - \beta_2) g_t^2 \\
w_{t+1} &= w_t - \frac{\alpha \hat{m}^{(1)}_t}{\sqrt{\hat{m}^{(2)}_t} + \varepsilon} \\
\hat{m}^{(1)}_t &= \frac{m^{(1)}_t}{1 - \beta_1^t} \\
\hat{m}^{(2)}_t &= \frac{m^{(2)}_t}{1 - \beta_2^t}
\end{aligned}
\tag{16}
$$

Where, for iteration time $t$, $w_t$ corresponds to the weights that are updated; $g_t$ corresponds to the gradient; $m^{(1)}_t$ corresponds to a moving estimate of the gradient; $m^{(2)}_t$ corresponds to a moving estimate of the squared gradient; $\beta_1$ and $\beta_2$ are decay rate hyperparameters; $\alpha$ is the learning rate hyperparameter; and $\varepsilon$ is a small value that protects from dividing by zero.

ACKNOWLEDGMENT
This research work was partially supported by the Chateaubriand grant.

REFERENCES
[1] I. E. Richardson, The H.264 Advanced Video Compression Standard, 2nd Edition, April (2010).
[2] “ITU-T Recommendation H.265 / ISO/IEC 23008-2:2013 MPEG-H Part 2: High Efficiency Video Coding (HEVC)”, (2013).
[3] D. Mukherjee, H. Su, J. Bankoski, A. Converse, J. Han, Z. Liu, and Y. Xu, “An overview of new video coding tools under consideration for VP10: the successor to VP9”, SPIE, (2015).
[4] J. Lainema and K. Ugur, “Angular intra prediction in high efficiency video coding (HEVC)”, in Multimedia Signal Processing (MMSP), IEEE 13th International Workshop, (2011).
[5] O. Hadar, A. Shleifer, D. Mukherjee, U. Joshi, I. Mazar, M. Yuzvinsky, N. Tavor, N. Itzhak, and R. Birman, “Novel Modes and Adaptive Block Scanning Order for Intra Prediction in AV1”, in SPIE Optics + Photonics conference, San Diego, California (USA), (2017).
[6] R. Birman, Y. Segal, and O. Hadar, “Overview of Research in the field of Video Compression using Deep Neural Networks”, in Multimedia Tools and Applications, pp. 1-24, (2020).
[7] Y. Zhang, S. Kwong, and S. Wang, “Machine learning based video coding optimizations: A survey”, in Information Sciences, 506, pp. 395-423, (2020).
[8] D. Liu, Y. Li, J. Lin, H. Li, and F. Wu, “Deep Learning-Based Video Coding: A Review and A Case Study”, arXiv preprint arXiv:1904.12462, (2019).
[9] A. Shleifer, C. Lanka, M. Setia, S. Agarwal, O. Hadar, and D. Mukherjee, “Novel intra prediction modes for VP10 codec”, Proceedings of SPIE Vol. 9971, 997114, (2016).
[10] Z. Zhao, S. Wang, X. Zhang, S. Ma, and J. Yang, “CNN-based bi-directional motion compensation for high efficiency video coding”, in IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1-4, (2018).
[11] S. Huo, D. Liu, F. Wu, and H. Li, “Convolutional Neural Network-Based Motion Compensation Refinement for Video Coding”, in IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1-4, (2018).
[12] J.K. Lee, N. Kim, S. Cho, and J.W. Kang, “Convolution Neural Network based Video Coding Technique using Reference Video Synthesis”, in Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp. 505-508, IEEE, (2018).
[13] G. Laroche, J. Jung, and B. Pesquet-Popescu, “RD optimized coding for motion vector predictor selection”, in IEEE Transactions on Circuits and Systems for Video Technology, 18(9), pp. 1247-1257, (2008).
[14] Y. Wang, X. Fan, C. Jia, D. Zhao, and W. Gao, “Neural Network Based Inter Prediction for HEVC”, in IEEE International Conference on Multimedia and Expo (ICME), pp. 1-6, (2018).
[15] I. Ismaeil, A. Docef, F. Kossentini, and R. Ward, “Efficient Motion Estimation Using Spatial and Temporal Motion Vector Prediction”, in Proceedings 1999 International Conference on Image Processing (ICIP), 24-28 Oct., (1999).
[16] L. Lijun, Z. Cairong, G. Xiqi, and H. Zhenya, “A new prediction search algorithm for block motion estimation in video coding”, in IEEE Transactions on Consumer Electronics, Vol. 43, No. 1, pp. 56-61, Feb. (1997).
[17] R. Li, B. Zeng, and M.L. Liou, “A new three-step search algorithm for block motion estimation”, in IEEE Transactions on Circuits and Systems for Video Technology, 4(4), pp. 438-442, (1994).
[18] L.M. Po and W.C. Ma, “A novel four-step search algorithm for fast block motion estimation”, in IEEE Transactions on Circuits and Systems for Video Technology, 6(3), pp. 313-317, (1996).
[19] L.K. Liu and E. Feig, “A block-based gradient descent search algorithm for block motion estimation in video coding”, in IEEE Transactions on Circuits and Systems for Video Technology, 6(4), pp. 419-422, (1996).
[20] J. B. Xu, L. M. Po, and C. K. Cheung, “Adaptive motion tracking block matching algorithms for video coding”, in IEEE Transactions on Circuits and Systems for Video Technology, Vol. 9, No. 7, pp. 1025-1029, Oct. (1999).
[21] L. Luo, C. Zou, X. Gao, and Z. He, “A new prediction search algorithm for block motion estimation in video coding”, in IEEE Transactions on Consumer Electronics, 43(1), pp. 56-61, (1997).
[22] E. Kaminsky and O. Hadar, “Multiparameter method for analysis and selection of motion estimation algorithms for video compression”, in Multimedia Tools and Applications, 38(1), pp. 119-146, (2008).
[23] T. Laude and J. Ostermann, “Deep learning-based intra prediction mode decision for HEVC”, in Picture Coding Symposium (PCS), IEEE, (2016).
[24] W. Cui, T. Zhang, S. Zhang, F. Jiang, W. Zuo, and D. Zhao, “Convolutional neural networks based intra prediction for HEVC”, arXiv preprint arXiv:1808.05734, (2018).
[25] J. Li, B. Li, J. Xu, and R. Xiong, “Intra prediction using fully connected network for video coding”, in IEEE International Conference on Image Processing (ICIP), pp. 1-5, (2017).
[26] R. Birman, Y. Segal, A. D. Malka, and O. Hadar, “Intra prediction with deep learning”, in SPIE Optics + Photonics conference, San Diego, California (USA), (2018).
[27] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. Van Der Smagt, D. Cremers, and T. Brox, “Flownet: Learning optical flow with convolutional networks”, in Proceedings of the IEEE International Conference on Computer Vision, pp. 2758-2766, (2015).
[28] K. Soomro, A.R. Zamir, and M. Shah, “UCF101: A dataset of 101 human actions classes from videos in the wild”, arXiv preprint arXiv:1212.0402, (2012).
[29] D. A. Huffman, “A method for the construction of minimum redundancy codes”, in Proceedings of the IRE, vol. 40, no. 9, pp. 1098-1101, (1952).
[30] T.K. Lee, C.H. Fu, Y.L. Chan, and W.C. Siu, “A new motion vector composition algorithm for fast-forward video playback in H.264”, in Proceedings of 2010 IEEE International Symposium on Circuits and Systems, pp. 3649-3652, (2010).
[31] D. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization”, arXiv:1412.6980 [cs.LG], December (2014).
[32]