Attention-Based Neural Networks for Chroma Intra Prediction in Video Coding

Marc Górriz Blanch, Student Member, IEEE, Saverio Blasi, Alan F. Smeaton, Fellow, IEEE, Noel E. O'Connor, Member, IEEE, and Marta Mrak, Senior Member, IEEE

JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, OCTOBER 2020
Abstract—Neural networks can be successfully used to improve several modules of advanced video coding schemes. In particular, compression of colour components was shown to greatly benefit from usage of machine learning models, thanks to the design of appropriate attention-based architectures that allow the prediction to exploit specific samples in the reference region. However, such architectures tend to be complex and computationally intense, and may be difficult to deploy in a practical video coding pipeline. This work focuses on reducing the complexity of such methodologies, to design a set of simplified and cost-effective attention-based architectures for chroma intra-prediction. A novel size-agnostic multi-model approach is proposed to reduce the complexity of the inference process. The resulting simplified architecture is still capable of outperforming state-of-the-art methods. Moreover, a collection of simplifications is presented in this paper, to further reduce the complexity overhead of the proposed prediction architecture. Thanks to these simplifications, a reduction in the number of parameters of around 90% is achieved with respect to the original attention-based methodologies. Simplifications include a framework for reducing the overhead of the convolutional operations, a simplified cross-component processing model integrated into the original architecture, and a methodology to perform integer-precision approximations with the aim to obtain fast and hardware-aware implementations. The proposed schemes are integrated into the Versatile Video Coding (VVC) prediction pipeline, retaining compression efficiency of state-of-the-art chroma intra-prediction methods based on neural networks, while offering different directions for significantly reducing coding complexity.
Index Terms—Chroma intra prediction, convolutional neural networks, attention algorithms, multi-model architectures, complexity reduction, video coding standards.
I. INTRODUCTION

Efficient video compression has become an essential component of multimedia streaming. The convergence of digital entertainment followed by the growth of web services such as video conferencing, cloud gaming and real-time high-quality video streaming, prompted the development of advanced video coding technologies capable of tackling the increasing demand for higher quality video content and its consumption on multiple devices. New compression techniques enable a compact representation of video data by identifying
Manuscript submitted July 1, 2020. The work described in this paper has been conducted within the project JOLT funded by the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 765140. M. Górriz Blanch, S. Blasi and M. Mrak are with BBC Research & Development, The Lighthouse, White City Place, 201 Wood Lane, London, UK (e-mail: [email protected], [email protected], [email protected]). A. F. Smeaton and N. E. O'Connor are with Dublin City University, Glasnevin, Dublin 9, Ireland (e-mail: [email protected], [email protected]).

Fig. 1. Visualisation of the attentive prediction process. For each reference sample 0-16, the attention module generates its contribution to the prediction of individual pixels from a target N × N block.

and removing spatial-temporal and statistical redundancies within the signal. This results in smaller bitstreams, enabling more efficient storage and transmission as well as distribution of content at higher quality, requiring reduced resources. Advanced video compression algorithms are often complex and computationally intense, significantly increasing the encoding and decoding time. Therefore, despite bringing high coding gains, their potential for application in practice is limited. Among the current state-of-the-art solutions, the next generation Versatile Video Coding standard [1] (referred to as VVC in the rest of this paper) targets between 30-50% better compression rates for the same perceptual quality, supporting resolutions from 4K to 16K as well as 360° videos. One fundamental component of hybrid video coding schemes, intra prediction, exploits spatial redundancies within a frame by predicting samples of the current block from already reconstructed samples in its close surroundings. VVC allows a large number of possible intra prediction modes to be used
on the luma component at the cost of a considerable amount of signalling data. Conversely, to limit the impact of mode signalling, chroma components employ a reduced set of modes [1]. In addition to traditional modes, more recent research introduced schemes which further exploit cross-component correlations between the luma and chroma components. Such correlations motivated the development of the Cross-Component Linear Model (CCLM, or simply LM in this paper) intra modes. When using CCLM, the chroma components are predicted from already reconstructed luma samples using a linear model. Nonetheless, the limitation of simple linear predictions comes from their high dependency on the selection of predefined reference samples. Improved performance can be achieved using more sophisticated Machine Learning (ML) mechanisms [2], [3], which are able to derive more complex representations of the reference data and hence boost the prediction capabilities. Methods based on Convolutional Neural Networks (CNNs) [2], [4] provided significant improvements at the cost of two main drawbacks: the associated increase in system complexity and the tendency to disregard the location of individual reference samples. Related works deployed complex neural networks (NNs) by means of model-based interpretability [5]. For instance, VVC recently adopted simplified NN-based methods such as Matrix Intra Prediction (MIP) modes [6] and Low-Frequency Non-Separable Transform (LFNST) [7]. For the particular task of block-based intra-prediction, the usage of complex NN models can be counterproductive if there is no control over the relative position of the reference samples. When using fully-connected layers, all input samples contribute to all output positions, and after the consecutive application of several hidden layers, the location of each input sample is lost.
This behaviour clearly runs counter to the design of traditional approaches, in which predefined directional modes carefully specify which boundary locations contribute to each prediction position. A novel ML-based cross-component intra-prediction method is proposed in [4], introducing a new attention module capable of tracking the contribution of each neighbouring reference sample when computing the prediction of each chroma pixel, as shown in Figure 1. As a result, the proposed scheme better captures the relationship between the luma and chroma components, resulting in more accurate prediction samples. However, such NN-based methods significantly increase the codec complexity, increasing the encoder and decoder times by up to 120% and 947%, respectively. This paper focuses on complexity reduction in video coding with the aim to derive a set of simplified and cost-effective attention-based architectures for chroma intra-prediction. Understanding and distilling knowledge from the networks enables the implementation of less complex algorithms which achieve similar performance to the original models. Moreover, a novel training methodology is proposed in order to design a block-independent multi-model which outperforms the state-of-the-art attention-based architectures and reduces inference complexity.
The use of variable block sizes during training helps the model to better generalise on content variety while ensuring higher precision on predicting large chroma blocks. The main contributions of this work are the following:
• A competitive block-independent attention-based multi-model and training methodology;
• A framework for complexity reduction of the convolutional operations;
• A simplified cross-component processing model using sparse autoencoders;
• A fast and cost-effective attention-based multi-model with integer precision approximations.
This paper is organised as follows: Section II provides a brief overview of the related work, Section III introduces the attention-based methodology in detail and establishes the mathematical notation for the rest of the paper, Section IV presents the proposed simplifications and Section V shows experimental results, with conclusions drawn in Section VI.

II. BACKGROUND
Colour images are typically represented by three colour components (e.g. RGB, YCbCr). The YCbCr colour scheme is often adopted by digital image and video coding standards (such as JPEG, MPEG-1/2/4 and H.261/3/4) due to its ability to compact the signal energy and to reduce the total required bandwidth. Moreover, chrominance components are often subsampled by a factor of two to conform to the YCbCr 4:2:0 chroma format, in which the luminance signal contains most of the spatial information. Nevertheless, cross-component redundancies can be further exploited by reusing information from already coded components to compress another component. In the case of YCbCr, the Cross-Component Linear Model (CCLM) [8] uses a linear model to predict the chroma signal from a subsampled version of the already reconstructed luma block signal. The model parameters are derived at both the encoder and decoder sides without needing explicit signalling in the bitstream. Another example is the Cross-Component Prediction (CCP) [9], which resides at the transform unit (TU) level regardless of the input colour space. In the case of YCbCr, a subsampled and dequantised luma transform block (TB) is used to modify the chroma TB at the same spatial location based on a context parameter signalled in the bitstream. An extension of this concept modifies one chroma component using the residual signal of the other one [10]. Such methodologies significantly improved the coding efficiency by further exploiting the cross-component correlations within the chroma components. In parallel, the recent success of deep learning applications in computer vision and image processing influenced the design of novel video compression algorithms.
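To make the CCLM idea concrete, the following is a minimal NumPy sketch of a linear cross-component predictor. Note that VVC derives the model parameters from selected boundary sample pairs rather than a least-squares fit, so the fitting step here is an illustrative stand-in, not the standardised derivation.

```python
import numpy as np

def cclm_predict(ref_luma, ref_chroma, luma_block):
    """Predict a chroma block as alpha * luma + beta, with (alpha, beta)
    fitted on neighbouring reconstructed luma/chroma reference pairs."""
    # Least-squares fit of a straight line through the boundary pairs.
    A = np.vstack([ref_luma, np.ones_like(ref_luma)]).T
    alpha, beta = np.linalg.lstsq(A, ref_chroma, rcond=None)[0]
    # Apply the linear model to the (downsampled) co-located luma block.
    return alpha * luma_block + beta
```

Because the fit only uses already reconstructed samples, both encoder and decoder can derive the same parameters, which is why no model parameters need to be signalled in the bitstream.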
In particular, in the context of intra-prediction, a new algorithm [3] was introduced based on fully-connected layers and CNNs to map the prediction of block positions from the already reconstructed neighbouring samples, achieving BD-rate (Bjontegaard Delta rate) [11] savings of up to 3.0% on average over HEVC, for an approximately 200% increase in decoding time. The successful integration of CNN-based methods for luma intra-prediction into existing codec architectures has motivated research into alternative methods for chroma prediction, exploiting cross-component
Fig. 2. Baseline attention-based architecture for chroma intra prediction presented in [4] and described in Section III.

redundancies similar to the aforementioned LM methods. A novel hybrid neural network for chroma intra prediction was recently introduced in [2]. A first CNN was designed to extract features from reconstructed luma samples. This was combined with another fully-connected network used to extract cross-component correlations between neighbouring luma and chroma samples. The resulting architecture uses complex non-linear mapping for end-to-end prediction of chroma channels. However, this is achieved at the cost of disregarding the spatial location of the boundary reference samples and a significant increase in the complexity of the prediction process. As shown in [4], after a consecutive application of the fully-connected layers in [2], the location of each input boundary reference sample is lost. Therefore, the fully-convolutional architecture in [4] better matches the design of the directional VVC modes and is able to provide significantly better performance. The use of attention models enables effective utilisation of the individual spatial location of the reference samples [4]. The concept of "attention-based" learning is a well-known idea used in deep learning frameworks, to improve the performance of trained networks in complex prediction tasks [12], [13], [14]. In particular, self-attention is used to assess the impact of particular input variables on the outputs, whereby the prediction is computed focusing on the most relevant elements of the same sequence [15].
The novel attention-based architecture introduced in [4] reports average BD-rate reductions of -0.22%, -1.84% and -1.78% for the Y, Cb and Cr components, respectively, although it significantly impacts the encoder and decoder time. One common aspect across all related work is that whilst the result is an improvement in compression, this comes at the expense of increased complexity of the encoder and decoder. In order to address the complexity challenge, this paper aims to design a set of simplified attention-based architectures for performing chroma intra-prediction faster and more efficiently. Recent works addressed complexity reduction in neural networks using methods such as channel pruning [16], [17], [18] and quantisation [19], [20], [21]. In particular for video compression, many works used integer arithmetic in order to efficiently implement trained neural networks on different hardware platforms. For example, the work in [22] proposes a training methodology to handle low-precision multiplications, proving that very low precision is sufficient not just for running trained networks but also for training them. Similarly, the work in [23] considers the problem of using variational latent-variable models for data compression and proposes integer networks as a universal solution for range coding as an entropy coding technique. They demonstrate that such models enable reliable cross-platform encoding and decoding of images using variational models. Moreover, in order to ensure deterministic implementations on hardware platforms, they approximate non-linearities using lookup tables. Finally, an efficient implementation of matrix-based intra prediction is proposed in [24], where a performance analysis evaluates the challenges of deploying models with integer arithmetic in video coding standards. Inspired by this knowledge, this paper develops a fast and cost-effective implementation of the proposed attention-based architecture using integer precision approximations.
As shown in Section V-D, while such approximations can significantly reduce the complexity, the associated drop in performance is still not negligible.

III. ATTENTION-BASED ARCHITECTURES
This section describes in detail the attention-based approach proposed in [4] (Figure 2), which will be the baseline for the methodology presented in this paper. The section also provides the mathematical notation used for the rest of this paper. Without loss of generality, only square blocks of pixels are considered in this work. After intra-prediction and reconstruction of a luma block in the video compression chain, luma samples can be used for prediction of co-located chroma components. In this discussion, the size of a luma block is assumed to be (downsampled to) N × N samples, which is the size of the co-located chroma block. This may require the usage of conventional downsampling operations, such as in the case of using chroma sub-sampled picture formats such
Fig. 3. Proposed multi-model attention-based architectures with the integration of the simplifications introduced in this paper. More details about the model's hyperparameters and a description of the referred schemes can be found in Section V.

as 4:2:0. Note that a video coding standard treats all image samples as unsigned integer values within a certain precision range based on the internal bit depth. However, in order to utilise common deep learning frameworks, all samples are converted to floating point and normalised to values within the range [0, 1]. For the chroma prediction process, the reference samples used include the co-located luma block X ∈ IR^{N×N}, and the array of reference samples B_c ∈ IR^b, b = 4N + 1, from the left and from above the current block (Figure 1), where c = Y, Cb or Cr refers to the three colour components. B_c is constructed from samples on the left boundary (starting from the bottom-most sample), then the corner is added, and finally the samples on top are added (starting from the left-most sample). In case some reference samples are not available, these are padded using a predefined value, following the standard approach defined in VVC. Finally, S ∈ IR^{3×b} is the cross-component volume obtained by concatenating the three reference arrays B_Y, B_Cb and B_Cr. Similar to the model in [2], the attention-based architecture adopts a scheme based on three network branches that are combined to produce prediction samples, as illustrated in Figure 2. The first two branches work concurrently to extract features from the input reference samples. The first branch (referred to as the cross-component boundary branch) extracts cross-component features from S ∈ IR^{3×b} by applying I consecutive D_i-dimensional 1×1 convolutional layers to obtain the S_i ∈ IR^{D_i×b} output feature maps, where i = 1, ..., I.
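The ordering of the reference array B_c described above (left boundary bottom-up, then the corner, then the top boundary left-to-right) can be sketched as follows. Here `recon` is assumed to be a 2-D NumPy array holding one reconstructed colour plane, 2N samples are taken on each side, and the availability padding is omitted for brevity.

```python
import numpy as np

def build_reference_array(recon, x0, y0, N):
    """Assemble the reference array B (length 4N + 1) for the N x N block
    whose top-left sample sits at (x0, y0), with x0, y0 >= 1.
    Order: 2N left samples bottom-up, the corner, 2N top samples
    left-to-right, as described in the text."""
    left = recon[y0 + 2 * N - 1 : y0 - 1 : -1, x0 - 1]  # bottom-most first
    corner = recon[y0 - 1, x0 - 1]
    top = recon[y0 - 1, x0 : x0 + 2 * N]                # left-most first
    return np.concatenate([left, [corner], top])
```

Running this once per colour plane and stacking the three resulting arrays yields the 3 × b cross-component volume S.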
By applying 1×1 convolutions, the boundary input dimensions are preserved, resulting in a D_i-dimensional vector of cross-component information for each boundary location. The resulting volumes are activated using a Rectified Linear Unit (ReLU) non-linear function. In parallel, the second branch (referred to as the luma convolutional branch) extracts spatial patterns over the co-located reconstructed luma block X by applying convolutional operations. The luma convolutional branch is defined by J consecutive C_j-dimensional 3×3 convolutional layers with a stride of 1, to obtain X_j ∈ IR^{C_j×N²} feature maps from the N² input samples, where j = 1, ..., J. Similar to the cross-component boundary branch, in this branch a bias and a ReLU activation are applied within each convolutional layer. The feature maps (S_I and X_J) from both branches are each convolved using a 1×1 kernel, to project them into two corresponding reduced feature spaces. Specifically, S_I is convolved with a filter W_F ∈ IR^{h×D} to obtain the h-dimensional feature matrix F. Similarly, X_J is convolved with a filter W_G ∈ IR^{h×C} to obtain the h-dimensional feature matrix G. The two matrices are multiplied together to obtain the pre-attention map M = G^T F. Finally, the attention matrix A ∈ IR^{N²×b} is obtained by applying a softmax operation to each element of M, to generate the probability of each boundary location being able to predict a sample location in the block. Each value α_{j,i} in A is obtained as:

α_{j,i} = exp(m_{i,j} / T) / Σ_{n=0}^{b−1} exp(m_{n,j} / T),    (1)

where j = 0, ..., N²−1 represents the sample location in the predicted block, i = 0, ..., b−1 represents a reference sample location, and T is the softmax temperature parameter controlling the smoothness of the generated probabilities, with 0 < T ≤ 1. Notice that the smaller the value of T, the more localised are the obtained attention areas, resulting in correspondingly fewer boundary samples contributing to a given prediction location.
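Eq. 1 and the subsequent attention-weighted mixing of the boundary features can be sketched in NumPy as below. Shapes follow the notation of the text (D boundary feature channels, b boundary locations, N² block positions); this is an illustrative sketch of the operations, not the trained model.

```python
import numpy as np

def attention_map(M, T=0.5):
    """Softmax with temperature T (Eq. 1). M holds one row of pre-attention
    scores per predicted sample position and one column per boundary
    location, so each row of A sums to 1 over the boundary."""
    E = np.exp((M - M.max(axis=1, keepdims=True)) / T)  # stabilised exps
    return E / E.sum(axis=1, keepdims=True)

def attention_fusion(S_I, A, X_bar):
    """Mix the boundary features S_I (D x b) with the attention weights
    A (N*N x b), then refine element-wise with the transformed luma
    features X_bar (D x N*N)."""
    weighted = S_I @ A.T   # (D, N*N): per-position blend of boundary features
    return X_bar * weighted
```

Lowering T sharpens each row of A towards a one-hot vector, which is the localisation behaviour described in the text.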
The weighted sum of the contribution of each reference sample in predicting a given sample at a specific location is obtained by computing the matrix multiplication between the cross-component boundary features S_I and the attention matrix A, or formally S_I^T A. In order to further refine S_I^T A, this weighted sum can be multiplied by the output of the luma branch. To do so, the output of the luma branch must be transformed to change its dimensions by means of a 1×1 convolution using a matrix W_x̄ ∈ IR^{D×C} to obtain a transformed representation X̄, then O = X̄ ⊙ (S_I^T A), where ⊙ is the element-wise product. Finally, the output of the attention model is fed into the third network branch, to compute the predicted chroma samples. In this branch, a final CNN is used to map the fused features from the first two branches, as combined by means of the attention model, into the final chroma prediction. The prediction head branch is defined by two convolutional layers, applying E-dimensional 3×3 convolutional filters and then 2-dimensional 1×1 filters for deriving the two chroma components at once.

IV. MULTI-MODEL ARCHITECTURES
This section introduces a new multi-model architecture which improves the baseline attention-based approach (Section III, [4]). The main improvement comes from its block-size agnostic property, as the proposed approach only requires one model for all block sizes. Furthermore, a range of simplifications is proposed with the aim to reduce the complexity of related attention-based architectures while preserving prediction performance as much as possible. The proposed simplifications include a framework for complexity reduction of the convolutional operations, a simplified cross-component boundary branch using sparse autoencoders and insights for fast and cost-effective implementations with integer precision approximations. Figure 3 illustrates the proposed multi-model attention-based schemes with the integration of the simplifications described in this section.
A. Multi-model size-agnostic architecture
In order to handle variable block sizes, previous NN-based chroma intra-prediction methods employ different architectures for blocks of different sizes. These architectures differ in the dimensionality of the networks, which depends on the given block size, as a trade-off between model complexity and prediction performance [2]. Given a network structure, the depth of the convolutional layers is the most predominant factor when dealing with variable input sizes. This means that increasingly complex architectures are needed for larger block sizes, in order to ensure proper generalisation for these blocks, which have higher content variety. Such a factor significantly increases requirements for inference because of the number of multiple architectures. In order to streamline the inference process, this work proposes a novel multi-model architecture that is independent of the input block size. Theoretically, a convolutional filter can be applied over any input space. Therefore, the fully-convolutional nature of the proposed architecture (1×1 kernels for the cross-component boundary branch and 3×3 kernels for the luma convolutional one) allows the design of a size-agnostic architecture. As shown in Figure 4, the same task can be achieved using multiple models with different input sizes sharing the weights, such that a unified set of filters can be used a posteriori, during inference. The given architecture must employ a number of parameters that is sufficiently large to ensure proper performance for larger blocks, but not too large to incur overfitting for smaller blocks. Figure 5 describes the algorithmic methodology employed to train the multi-model approach. As defined in Section III, the co-located luma block X ∈ IR^{N×N} and the cross-component volume S ∈ IR^{3×b} are considered as inputs to the chroma prediction network. Furthermore, for training of a

Fig. 4. Illustration of the proposed multi-model training and inference methodologies.
Multiple block-dependent models θ_N(W^(t)) are used during training time. A size-agnostic model with a single set of trained weights W is then used during inference.

Require: {X_m^(N), S_m^(N), Z_m^(N)}, m ∈ [0, M), N ∈ {4, 8, 16}
Require: θ_N(W^(t)): N-th model with shared weights W^(t)
Require: L_reg^(t): objective function at training step t
t ← 0 (initialise timestep)
while θ_t not converged do
  for m ∈ [0, M) do
    for N ∈ {4, 8, 16} do
      t ← t + 1
      L_reg^(t) ← MSE(Z_m^(N), θ_N(X_m^(N), S_m^(N); W^(t−1)))
      g^(t) ← ∇_W L_reg^(t) (get gradients at step t)
      W^(t) ← optimiser(g^(t))
    end for
  end for
end while

Fig. 5. Training algorithm for the proposed multi-model architecture.

multi-model, the ground-truth is defined as Z_m^(N) for a given input {X_m^(N), S_m^(N)}, and the set of instances from a database of M samples or batches is defined as {X_m^(N), S_m^(N), Z_m^(N)}, where m = 0, ..., M−1 and N ∈ {4, 8, 16} is the set of supported square block sizes N × N (the method can be extended to a different set of sizes). As shown in Figure 4, multiple block-dependent models θ_N(W) with shared weights W are updated in a concurrent way following the order of supported block sizes. At training step t, the individual model θ_N(W^(t)) is updated, obtaining a new set of weights W^(t+1). Finally, a single set of trained weights W is used during inference, obtaining a size-agnostic model θ(W). Model parameters are updated by minimising the Mean Square Error (MSE) regression loss L_reg, as in:

L_reg^(t) = (1 / (C · N²)) ‖Z_m^(N) − θ_N(X_m^(N), S_m^(N); W^(t−1))‖²,    (2)

where C = 2 refers to the number of predicted chroma components, and θ_N(W^(t−1)) is the block-dependent model at training step t−1.
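The shared-weight update loop of Figure 5 can be sketched as follows; `grad_fn` stands in for backpropagation through the network and plain SGD stands in for the optimiser, both of which are illustrative assumptions rather than the paper's training setup.

```python
import numpy as np

def train_multimodel(batches, grad_fn, W, lr=1e-3, block_sizes=(4, 8, 16)):
    """One epoch of the multi-model loop: for every batch, the SAME weight
    set W is updated once per supported block size N, so that a single
    size-agnostic model emerges for inference."""
    for X, S, Z in batches:          # X, S, Z map block size N -> data
        for N in block_sizes:
            g = grad_fn(X[N], S[N], Z[N], W)  # gradient of the MSE loss
            W = W - lr * g                    # optimiser step on shared W
    return W
```

The key point is that W is never forked per block size: each size contributes gradient steps to the same parameter vector, in the order of the supported sizes.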
Fig. 6. Visualisation of the receptive field of a 2-layer convolutional branch with 3×3 kernels. Observe that an output pixel in layer 2 is computed by applying a 3×3 kernel over a field F_1 of 3×3 samples from the first layer's output space. Similarly, each of the F_1 values is computed by means of another 3×3 kernel, looking at a field F_2 of 5×5 samples over the input.

B. Simplified convolutions
Convolutional layers are responsible for most of the network's complexity. For instance, based on the network hyperparameters from the experiments in Section V, the luma convolutional branch and the prediction head branch (with 3×3 convolutional kernels) alone contain more than 90% of the parameters in the entire model. Therefore, the model complexity can be significantly reduced if convolutional layers can be simplified. This subsection explains how a new simplified structure beneficial for practical implementation can be devised by removing activation functions, i.e. by removing non-linearities. It is important to stress that such a process is devised only for application on carefully selected layers, i.e. for branches where such simplification does not significantly reduce the expected performance. Consider a specific two-layer convolutional branch (e.g. the luma convolutional branch from Figure 2) formulated as:

Y = R(W_2 ∗ R(W_1 ∗ X + b_1) + b_2),    (3)

where C_i is the number of features in layer i, b_i ∈ IR^{C_i} are biases, K_i × K_i are the square convolutional kernel sizes, W_1 ∈ IR^{K_1²×C_0×C_1} and W_2 ∈ IR^{K_2²×C_1×C_2} are the weights of the first (i = 1) and the second (i = 2) layers, respectively, C_0 is the dimension of the input feature map, R is a Rectified Linear Unit (ReLU) non-linear activation function and ∗ denotes the convolution operation. The input to the branch is X ∈ IR^{N²×C_0} and the result is a volume of features Y ∈ IR^{N²×C_2}, which correspond to X_0 and X_2 from Figure 2, respectively. Removing the non-linearities, the given branch can be simplified as:

Ŷ = W_2 ∗ (W_1 ∗ X + b_1) + b_2,    (4)

where it can be observed that a new convolution and bias terms can be defined using the trained parameters from the two initial layers, to form a new single layer:

Ŷ = W_c ∗ X + b_c,    (5)

Fig. 7. Visualisation of the learnt colour space resulting from encoding input YCbCr colours to the 3-dimensional hidden space of the autoencoder.
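The equivalence between Eq. 4 and Eq. 5 can be verified numerically. The sketch below uses the 1-D, single-channel case: merging two activation-free convolution layers amounts to convolving the two kernels (giving length K_1 + K_2 − 1) and folding the first bias through the second kernel. The 2-D multi-channel case follows the same algebra.

```python
import numpy as np

def merge_linear_convs(k1, b1, k2, b2):
    """Collapse two bias-adding linear convolution layers into one:
    the combined kernel is the convolution of the two kernels and the
    combined bias folds b1 through the second kernel (1-D analogue of Eq. 5)."""
    kc = np.convolve(k1, k2)       # length K1 + K2 - 1
    bc = b1 * k2.sum() + b2        # constant b1 passes through k2 as a sum
    return kc, bc
```

With 'valid' convolutions every output position sees the full kernel support, so the sequential two-layer result and the merged single-layer result match exactly.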
where W_c ∈ IR^{K̂²×C_0×C_2} is a function of W_1 and W_2, with K̂ = K_1 + K_2 − 1, and b_c is a constant vector derived from W_2, b_1 and b_2. Figure 6 (a) illustrates the operations performed in Eq. 4 for K_1 = K_2 = 3 and C_1 = 1. Analysing the receptive field of the whole branch, a pixel within the output volume Y is computed by applying a K_2 × K_2 kernel over a field F_1 from the first layer's output space. Similarly, each of the F_1 values is computed by means of another K_1 × K_1 kernel looking at a field F_2. Without the non-linearities, an equivalent of this process is simplified, as in Figure 6 (b) and Eq. 5. Notice that K̂ = K_1 + K_2 − 1 equals 5 in the example in Figure 6. For a variety of parameters, including the values of C_0, C_i and K_i used in [4] and in this paper, this simplification of concatenated convolutional layers allows a reduction of the model's parameters at inference time, which will be shown in Section V-C. Finally, it should be noted that we limit the removal of activation functions only to branches which include more than one layer, of which at least one layer has K_i > 1, and only the activation functions between layers in the same branch are removed (to be able to merge them as in Equation 5). In such branches with at least one K_i > 1 the number of parameters is typically very high, while the removal of non-linearities typically does not impact prediction performance. Activation functions are not removed from the remaining layers. It should be noted that in the attention module and at the intersections of various branches the activation functions are critical and are therefore left unchanged. Section V-C performs an ablation test to evaluate the effect of removing the non-linearities, and a test to evaluate how a convolutional branch directly trained with large-support kernels K̂ would perform.

C. Simplified cross-component boundary branch
In the baseline model, the cross-component boundary branch transforms the boundary inputs S ∈ IR^{3×b} into D-dimensional feature vectors. More specifically, after applying J = 2 consecutive 1×1 convolutional layers, the branch encodes each boundary colour into a high-dimensional feature space. It should be noted that a colour is typically represented by 3 components, indexed within a system of coordinates (referred to as the colour space). As such, a three-dimensional feature space can be considered as the space with minimum dimensionality that is still capable of representing colour information. Therefore, this work proposes the use of autoencoders (AE) to reduce the complexity of the cross-component boundary branch, by compacting the D-dimensional feature space into a reduced, 3-dimensional space. An AE tries to learn an approximation to the identity function h(x) ≈ x such that the reconstructed output x̂ is as close as possible to the input x. The hidden layer will have a reduced dimensionality with respect to the input, which also means that the transformation process may introduce some distortion, i.e. the reconstructed output will not be identical to the input. An AE consists of two networks, the encoder f which maps the input to the hidden features, and the decoder g which reconstructs the input from the hidden features. Applying this concept, a compressed representation of the input can be obtained by using the encoder part alone, with the goal of reducing the dimensionality of the input vectors. The encoder network automatically learns how to reduce the dimensions of the input vectors, in a similar fashion to what could be obtained by applying a manual Principal Component Analysis (PCA) transformation. The transformation learned by the AE can be trained using the same loss function that is used in the PCA process [25].
Figure 7 shows the mapping function of the resulting colour space when applying the encoder network over the YCbCr colour space. Overall, the proposed simplified cross-component boundary branch consists of two 1×1 convolutional layers using Leaky ReLU activation functions with a small negative slope α. First, a D-dimensional layer is applied over the boundary inputs S to obtain the S_1 ∈ IR^{D×b} feature maps. Then, S_1 is fed to the AE's encoder layer f with 3 output dimensions, to obtain the hidden feature maps S_2 ∈ IR^{3×b}. Finally, a third 1×1 convolutional layer (corresponding to the AE decoder layer g) is applied to generate the reconstructed maps S̃_1 with D dimensions. Notice that the decoder layer is only necessary during the training stage, to obtain the reconstructed inputs necessary to derive the values of the loss function. Only the encoder layer is needed when using the network, in order to transform the input feature vectors into the 3-dimensional, reduced vectors. Figure 3 illustrates the branch architecture and its integration within the simplified multi-model. Finally, in order to interpret the behaviour of the branch and to identify prediction patterns, a sparsity constraint can be imposed on the loss function. Formally, the following can be used:

L_AE = (λ_r / (D · b)) ‖S_1 − S̃_1‖² + (λ_s / (3 · b)) ‖S_2‖_1,    (6)

where the right-most term is used to keep the activations in the hidden space inactive most of the time, only returning non-zero values for the most descriptive samples. In order to evaluate the effect of the sparsity term, Section V-C performs an ablation test that shows its positive regularisation properties during training. The objective function in Equation 2 can be updated such that the global multi-model loss L considers both L_reg and L_AE, as:

L = λ_reg L_reg + λ_AE L_AE,    (7)

where λ_reg and λ_AE control the contribution of both losses.

D. Integer precision approximation
While the training algorithm results in IEEE-754 64-bit floating point weights and prediction buffers, an additional simplification is proposed in this paper whereby the network weights and prediction buffers are represented using fixed-point integer arithmetic. This is beneficial for the deployment of the resulting multi-models in efficient hardware implementations, in which complex operations such as the Leaky ReLU and softmax activation functions can become serious bottlenecks. All the network weights obtained after the training stage are therefore appropriately quantised to fit 32-bit signed integer values. It should be noted that integer approximation introduces quantisation errors, which may have an impact on the performance of the overall predictions.
In order to prevent arithmetic overflows after performing multiplications or additions, appropriate scaling factors are defined for each layer during each of the network prediction steps. To further reduce the complexity of the integer approximation, the scaling factor K_l for a given layer l is obtained as a power of 2, namely K_l = 2^{O_l}, where O_l is the respective precision offset. This ensures that multiplications can be performed by means of simple binary shifts. Formally, the integer weights W̃_l and biases b̃_l for each layer l in the network with weights W_l and bias b_l can be obtained as:

W̃_l = ⌊K_l · W_l⌋;   b̃_l = ⌊K_l · b_l⌋.   (8)

The offset O_l depends on the offset used in the previous layer, O_{l−1}, as well as on an internal offset O_x necessary to preserve as much decimal information as possible, compensating for the quantisation that occurred in the previous layer, namely O_l = O_x − O_{l−1}.
Furthermore, in this approach the values predicted by the network are also integers.
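The quantisation step of Equation 8 amounts to scaling by a power of two and taking the floor. A minimal sketch, where the precision offset O_l = 7 and the example weights are illustrative assumptions:

```python
import numpy as np

def quantise_layer(W, bias, O_l):
    """Quantise floating-point weights and biases to 32-bit integers using a
    power-of-two scaling factor K_l = 2**O_l, as in Equation 8."""
    K_l = 1 << O_l
    W_int = np.floor(W * K_l).astype(np.int32)
    b_int = np.floor(bias * K_l).astype(np.int32)
    return W_int, b_int

# Example with an assumed precision offset O_l = 7 (K_l = 128).
W_int, b_int = quantise_layer(np.array([0.5, -0.3]), np.array([0.125]), 7)
print(W_int, b_int)   # [64 -39] [16]
```

Because K_l is a power of two, the inverse scaling needed later in the pipeline reduces to a binary shift rather than a division.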
In order to avoid defining large internal offsets at each layer, namely having large values of O_x, an additional stage of compensation is applied to the predicted values, to keep them in the range of 32-bit signed integers. For this purpose, another offset O_y is defined, computed as O_y = O_x − O_l. The values generated by layer l are then computed as:

Y_l = ((W̃_l^T X_l + b̃_l) + (1 << (O_y − 1))) >> O_y,   (9)

where << and >> represent the left and right binary shifts, respectively, and the offset (1 << (O_y − 1)) is added to reduce the rounding error.
Complex operations requiring floating point divisions need to be approximated to integer precision. The Leaky ReLU activation functions applied in the cross-component boundary branch use a slope α which multiplies the negative values. Such an operation can be simply approximated by defining a new activation function Ã(x) for any input x as follows:

Ã(x) = x,            if x ≥ 0;
Ã(x) = (c · x) >> s, if x < 0,   (10)

where the integer constant c and the shift s are chosen such that c / 2^s approximates the slope α.
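Equations 9 and 10 can be sketched in a few lines of NumPy, relying on the arithmetic (sign-preserving) behaviour of the right shift on signed integers. The weight values, O_y, and the slope constants c and s below are placeholder assumptions, not the trained values:

```python
import numpy as np

def int_layer(W_int, b_int, x, O_y):
    """Fixed-point layer of Equation 9: integer matrix product plus bias,
    followed by a rounding right shift with offset 1 << (O_y - 1)."""
    acc = W_int.T @ x + b_int
    return (acc + (1 << (O_y - 1))) >> O_y

def int_leaky_relu(x, c=1, s=2):
    """Shift-based Leaky ReLU of Equation 10: c and s are placeholders
    chosen so that c / 2**s approximates the trained slope alpha."""
    return np.where(x >= 0, x, (c * x) >> s)

y = int_layer(np.array([[2], [3]]), np.array([3]), np.array([1, 1]), 2)
print(y)                                     # (5 + 3 + 2) >> 2 = [2]
print(int_leaky_relu(np.array([4, -8])))     # [ 4 -2]
```

Note that the right shift on a negative integer rounds towards minus infinity, which is why the rounding offset in Equation 9 matters for keeping the quantisation error small.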
Conversely, the softmax operations used in the attention module are approximated following a more complex methodology, similar to the one used in [26]. Consider the matrix M as defined in Equation 1, a given row j in M, and a vector m_j as input to the softmax operation. First, all elements m_{i,j} in a row are subtracted by the maximum element in the row, namely:

m̂_{i,j} = m_{i,j}/T − max_i(m_{i,j}/T),   (11)

where T is the temperature of the softmax operation, set as previously mentioned. The transformed elements m̂_{i,j} range between the minimum signed integer value and zero, because the arguments are obtained by subtracting the elements in M by the maximum element in each row. To further reduce the possibility of overflows, this range is further clipped to a minimum negative value, set to a pre-determined number V_e, so that any m̂_{i,j} < V_e is set equal to V_e.
The elements m̂_{i,j} are negative integer numbers within the range [V_e, 0], meaning there is a fixed number of N_e = |V_e| + 1 possible values they can assume. To further simplify the process, the exponential function is replaced by a pre-computed look-up table containing N_e integer elements. To minimise the approximation error, the exponentials are scaled by a given scaling factor before being approximated to integer values and stored in the corresponding look-up table LUT-EXP. Formally, for a given index k, where 0 ≤ k ≤ N_e − 1, the k-th integer input is obtained as s_k = V_e + k.
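The table of scaled exponentials can be sketched as follows; the clipping bound V_e = −16 and the scale exponent O_e = 10 are illustrative assumptions, not the values used in the codec integration:

```python
import numpy as np

V_e, O_e = -16, 10              # clipping bound and exponent scale (assumed values)
K_e = 1 << O_e
N_e = abs(V_e) + 1

# Entry k holds floor(K_e * exp(s_k)) with s_k = V_e + k, covering every
# possible clipped integer input in [V_e, 0].
LUT_EXP = np.floor(K_e * np.exp(V_e + np.arange(N_e))).astype(np.int64)

def exp_lookup(m_hat):
    """Scaled exponential for a shifted score m_hat, clipped to V_e and
    indexed as k = |V_e - m_hat|."""
    m_hat = max(int(m_hat), V_e)
    return int(LUT_EXP[abs(V_e - m_hat)])

print(exp_lookup(0))     # floor(K_e * e^0) = 1024
print(exp_lookup(-100))  # clipped to V_e: floor(1024 * e^-16) = 0
```

Since the clipped inputs can only take N_e distinct values, the table fully replaces the exponential at prediction time.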
The k-th element in the look-up table can then be computed as the approximated, scaled exponential value for s_k, or:

LUT-EXP(k) = ⌊K_e · e^{s_k}⌋,   (12)

where K_e = 2^{O_e} is the scaling factor, chosen in a way to maximise the preservation of the original decimal information. When using the look-up table during the prediction process, given an element m̂_{i,j}, the corresponding index k can be retrieved as k = |V_e − m̂_{i,j}|, to produce the numerator in the softmax function.
The integer approximation of the softmax function can then be written as:

α̂_{j,i} = LUT-EXP(|V_e − m̂_{i,j}|) / D(j),   (13)

where:

D(j) = Σ_{n=0}^{b−1} LUT-EXP(|V_e − m̂_{n,j}|).   (14)

Equation 13 implies performing an integer division between the numerator and the denominator. This is not ideal, as integer divisions are typically avoided in low-complexity encoder implementations. A simple solution to remove the integer division would be to replace it with a binary shift. However, a different approach is proposed in this paper to provide a more robust approximation that introduces smaller errors in the division. The denominator D(j) in Equation 14 is obtained as the sum of b values extracted from LUT-EXP, where b is the number of reference samples extracted from the boundary of the block. As such, the largest blocks under consideration (16×16) result in the largest possible number of reference samples, b_MAX. This means that the maximum value the denominator can assume, obtained when b = b_MAX and all inputs m̂_{i,j} = 0 (which corresponds to LUT-EXP(|V_e|) = K_e), is V_s = b_MAX · K_e. Similarly, the minimum value (obtained when all m̂_{i,j} = V_e) is 0. Correspondingly, D(j) can assume any positive integer value in the range [0, V_s].
Considering a given scaling factor K_s = 2^{O_s}, integer division by D(j) can be approximated using a multiplication by the factor M(j) = ⌊K_s / D(j)⌋.
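Tabulating the reciprocal factors M(j) turns the division of Equation 13 into a multiplication. A minimal sketch of the division-free softmax over one row of shifted, clipped scores, quantising the denominator range with a step Q; all numeric parameters (V_e, O_e, O_s, Q, b_max) are illustrative assumptions:

```python
import numpy as np

# Illustrative parameters: clipping bound, exponent/reciprocal scales,
# quantisation step and maximum boundary length are all assumptions.
V_e, O_e, O_s, Q, b_max = -16, 10, 16, 64, 32
K_e, K_s = 1 << O_e, 1 << O_s
V_s = b_max * K_e                         # largest possible denominator D(j)

LUT_EXP = np.floor(K_e * np.exp(V_e + np.arange(abs(V_e) + 1))).astype(np.int64)

# Quantised reciprocal table: entry l approximates K_s / (l * Q).
N_s = (V_s + 1) // Q + 1
LUT_SUM = np.zeros(N_s, dtype=np.int64)
LUT_SUM[1:] = K_s // (np.arange(1, N_s) * Q)

def int_softmax_row(m_hat):
    """Division-free integer softmax over one row of shifted, clipped
    attention scores; outputs are integer-scaled probabilities."""
    num = LUT_EXP[np.abs(V_e - np.maximum(m_hat, V_e))]
    D = int(num.sum())                    # denominator of Equation 14
    return num * LUT_SUM[D // Q]          # multiply instead of dividing by D

alpha = int_softmax_row(np.array([0, -1]))
print(alpha, alpha / alpha.sum())         # close to softmax([0, -1])
```

With these example parameters, normalising the integer outputs reproduces the floating-point softmax to within a fraction of a percent, illustrating why the quantised reciprocal introduces smaller errors than a plain binary shift.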
A given value of M(j) could be computed for all V_s + 1 possible values of D(j). Such values could then be stored in another look-up table, LUT-SUM. Clearly though, V_s is too large, which means LUT-SUM would be impractical to use due to storage and complexity constraints. For that reason, a smaller table is used, obtained by quantising the possible values of D(j). A pre-defined step Q is used, resulting in N_s = (V_s + 1)/Q quantised values of D(j). The table LUT-SUM of size N_s is then filled accordingly, where each element in the table is obtained as:

LUT-SUM(l) = ⌊K_s / (l·Q)⌋.   (15)

Finally, when using the table during the prediction process, given an integer sum D(j), the corresponding index l can be retrieved as l = ⌊D(j)/Q⌋. Following from these simplifications, given an input m̂_{i,j} obtained as in Equation 11, the integer sum D(j) obtained from Equation 14, and a quantisation step Q, the simplified integer approximation of the softmax function can eventually be obtained as:

α̃_{j,i} = LUT-EXP(|V_e − m̂_{i,j}|) · LUT-SUM(⌊D(j)/Q⌋).   (16)

Notice that the α̃_{j,i} values are finally scaled by K_o = K_e · K_s.

V. EXPERIMENTS
A. Training settings
Training examples were extracted from the DIV2K dataset [27], which contains high-definition, high-resolution content of large diversity. This database contains 800 training samples and 100 samples for validation, and provides lower-resolution versions obtained with downsampling factors of 2, 3 and 4, using bilinear and unknown filters. For each data instance, one resolution was randomly selected and then M blocks of each N×N size (N = 4, 8, 16) were chosen, ensuring balanced sets between block sizes and uniform spatial selection within each image. Moreover, 4:2:0 chroma sub-sampling is assumed, where the same downsampling filters implemented in VVC are used to downsample co-located luma blocks to the size of the corresponding chroma block. All the schemes were trained from scratch using the Adam optimiser [28] with a fixed learning rate.

B. Integration into VVC
The methods introduced in this paper were integrated within a VVC encoder, using the VVC Test Model (VTM) 7.0 [29]. The integration of the proposed NN-based cross-component prediction into the VVC coding scheme requires normative changes not only in the prediction process, but also in the way the chroma intra-prediction mode is signalled in the bitstream and parsed by the decoder.
TABLE I: Network hyperparameters (C_in, K×K, C_out) per branch during training, for Schemes 1 & 3 and Scheme 2 (cross-component boundary, luma convolutional, attention module and prediction head branches).

TABLE II: Network hyperparameters (C_in, K×K, C_out) per branch during inference, for Schemes 1 & 3 and Scheme 2.

A new block-level syntax flag is introduced to indicate whether a given block makes use of one of the proposed schemes. If the proposed NN-based method is used, a prediction is computed for the two chroma components. No additional information is signalled related to the chroma intra-prediction mode for the block. Conversely, if the method is not used, the encoder proceeds to signal the chroma intra-prediction mode as in conventional VVC specifications. For instance, a subsequent flag is signalled to identify whether conventional LM modes are used in the current block or not. The prediction path also needs to accommodate the new NN-based predictions. This largely reuses prediction blocks that are needed to perform conventional CCLM modes. In terms of mode selection at the encoder side, the new NN-based mode is added to the conventional list of modes to be tested in the full rate-distortion sense.
C. Architecture configurations
The proposed multi-model architectures and simplifications (Section IV) are implemented in 3 different schemes:
• Scheme 1: Multi-model architecture (Section IV-A) applying the methodology in Section IV-B to simplify the convolutional layers within the luma convolutional branch and the prediction branch, as illustrated in Figure 3.
• Scheme 2: The multi-model architecture in Scheme 1 applying the methodology in Section IV-C to simplify the cross-component boundary branch. As shown in Figure 3, the integration of the simplified branch requires modification of the initial architecture, with changes in the attention module and the prediction branch.
• Scheme 3: Architecture in Scheme 1 with the integer precision approximations described in Section IV-D.
In contrast to previous state-of-the-art methods, the proposed multi-model does not need to adapt its architecture to the input block size. Notice that the fully-convolutional architecture introduced in [4] enables this design and is able to significantly reduce the complexity of the cross-component boundary branch in [2], which uses size-dependent fully-connected layers. Table I shows the network hyperparameters of the proposed schemes during training, whereas Table II shows the resulting hyperparameters for inference after applying the proposed simplifications. As shown in Tables III and IV, the number of parameters employed in the proposed schemes represents a trade-off between complexity and prediction performance, within the order of magnitude of the related attention-based CNNs in [4]. The proposed simplifications significantly reduce (by around 90%) the original number of training parameters, achieving lighter architectures at inference time. Table III shows that the inference version of Scheme 2 reduces the complexity of the hybrid CNN models in [2] by around 85%, 96% and 99%, and the complexity of the attention-based models in [4] by around 82%, 96% and 98%, for 4×4, 8×8 and 16×16 input block sizes, respectively.
Finally, in order to provide more insight into the computational cost and compare the proposed schemes with the state-of-the-art methods, Table V shows the number of floating point operations (FLOPs) for each architecture per block size. The reduction of operations (e.g. additions and matrix multiplications) needed to arrive at the predictions is one of the predominant factors behind the given speedups. Notice the significant reduction of FLOPs for the proposed inference models.
In order to obtain a preliminary evaluation of the proposed schemes and to compare their prediction performance with the state-of-the-art methods, the trained models were tested on the DIV2K validation set (with 100 multi-resolution images) by means of averaged PSNR. Test samples were obtained with the same methodology used in Section V-A for generating the training dataset. Notice that this test uses the training version of the proposed schemes. As shown in Table IV, the multi-model approach introduced in Scheme 1 improves on the attention-based CNNs in [4] for 4×4 and 8×8 blocks, while only a small performance drop can be observed for 16×16 blocks. However, because it uses a fixed architecture for all block sizes, the proposed multi-model architecture averages the complexity of the individual models in [4] (Table III), slightly increasing the complexity of the 4×4 model and simplifying the 16×16 architecture. The complexity reduction in the 16×16 model leads to a small drop in performance. As can be observed from Table IV, the generalisation process induced by the multi-model methodology ([4] with multi-model, compared to [4]) can minimise such a drop by distilling knowledge from the rest of the block sizes, which is especially evident for 8×8 blocks, where a reduced architecture can improve the state-of-the-art performance.
Finally, the simplifications introduced in Scheme 2 (e.g. the architecture changes required to integrate the modified cross-component boundary branch within the original model) lower the prediction performance of Scheme 1.
However, the highly simplified architecture is still capable of outperforming the hybrid CNN models in [2], with training PSNR improvements of an additional 1.30, 2.21 and 2.31 dB for 4×4, 8×8 and 16×16 input block sizes, respectively.

TABLE III: Model complexity (number of parameters) per block size (4×4, 8×8, 16×16) for the Hybrid CNN [2], the attention-based CNN [4], Schemes 1 & 3 (train/inference) and Scheme 2 (train/inference).

TABLE IV: Prediction performance (PSNR) per block size (4×4, 8×8, 16×16) for the Hybrid CNN [2], the attention-based CNN [4], [4] with multi-model, Scheme 1 with single-layer training, Scheme 2 without sparsity, and the proposed Schemes 1 and 2.

The combination of attention-based architectures with the proposed multi-model methodology (Scheme 1) considerably improves on the NN-based chroma intra-prediction methods in [2], showing training PSNR improvements of an additional 1.93, 1.73 and 2.68 dB for the supported block sizes. Section V-D will show how these relatively small PSNR differences lead to significant differences in codec performance.
Several ablations were performed in order to evaluate the effects of the proposed simplifications. First, the effect of the multi-model methodology is evaluated by directly converting the models in [4] to the size-agnostic architecture in Scheme 1, but without the simplifications in Section IV-B ([4] with multi-model). As shown in Table IV, this methodology improves the 4×4 and 8×8 models, with special emphasis on the 8×8 case, where the number of parameters is smaller than in [4]. Moreover, the removal of non-linearities towards Scheme 1 does not significantly affect the performance, with a negligible PSNR loss of around 0.3 dB ([4] with multi-model compared with Scheme 1). Secondly, in order to evaluate the simplified convolutions methodology in Section IV-B, a version of Scheme 1 was trained with single-layer convolutional branches with large support kernels (i.e. instead of training 2 linear layers and then combining them into a single large kernel for inference, directly training a single-layer branch with the large kernels). Experimental results show the positive effects of the proposed methodology, with a significant drop in performance when a single-layer trained branch is applied (Scheme 1 with single-layer training compared with Scheme 1). Finally, the effect of the sparse autoencoder of Scheme 2 is evaluated by removing the sparsity term in Equation 7. As can be observed, the regularisation properties of the sparsity term, i.e. preventing large activations, boost the generalisation capabilities of the multi-model and slightly increase the prediction performance by around 0.2 dB (Scheme 2 without sparsity compared with Scheme 2).

TABLE V: FLOPs per block size (4×4, 8×8, 16×16) for the Hybrid CNN [2], the attention-based CNN [4], Schemes 1 & 3 (train/inference) and Scheme 2 (train/inference).
Scheme 1 & 3 (train/inference) / Scheme 2 (train/inference) / D. Simulation Results
The VVC reference software VTM-7.0 is used as the benchmark, and the proposed methodology is tested under the Common Test Conditions (CTC) [30], using the suggested all-intra configuration for VVC with QP values of 22, 27, 32 and 37. In order to fully evaluate the performance of the proposed multi-models, the encoder configuration is constrained to support only square blocks of 4×4, 8×8 and 16×16 pixels. A corresponding VVC anchor was generated under these conditions. BD-rate is adopted to evaluate the relative compression efficiency with respect to this VVC anchor. Test sequences include 26 video sequences of different resolutions: 3840×2160 (Classes A1 and A2), 1920×1080 (Class B), 832×480 (Class C), 416×240 (Class D), 1280×720 (Class E) and screen content (Class F). "EncT" and "DecT" denote "Encoding Time" and "Decoding Time", respectively.
A colour analysis is performed in order to evaluate the impact of the chroma channels on the final prediction performance. As suggested in previous colour prediction works [31], standard regression methods for chroma prediction may not be effective for content with wide distributions of colours. A parametric model trained to minimise the Euclidean distance between the estimations and the ground truth commonly tends to average the colours of the training examples and hence produce desaturated results. As shown in Figure 8, several CTC sequences are analysed by computing the logarithmic histogram of both chroma components. The width of the logarithmic histograms is compared to the compression performance in Table VI. The Gini index [32] is used to quantify the width of the histograms, obtained as:

Gini(H) = 1 − Σ_{b=0}^{B−1} ( H(b) / Σ_{k=0}^{B−1} H(k) )²,   (17)

where H is a histogram of B bins for a given chroma component. Notice that the average value between both chroma components is used in Table VI.
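The Gini index of Equation 17 is one minus the sum of squared normalised bin frequencies, so narrow (concentrated) histograms score near 0 and wide (uniform) ones near 1. A minimal sketch with illustrative histograms:

```python
import numpy as np

def gini_index(hist):
    """Gini index of a colour histogram as in Equation 17: one minus the
    sum of squared normalised bin frequencies; wider (more uniform)
    histograms give values closer to 1."""
    p = np.asarray(hist, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_index([1, 0, 0, 0]))   # all mass in one bin -> 0.0
print(gini_index([1, 1, 1, 1]))   # uniform over 4 bins -> 0.75
```

This matches the interpretation drawn from Table VI: sequences with narrow colour distributions (low Gini index, e.g. Tango2) benefit most from the proposed prediction.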
A direct correlation between the Gini index and coding performance can be observed in Table VI, suggesting that Scheme 1 performs better for narrower colour distributions. For instance, the Tango2 sequence, with a Gini index of 0.63, achieves average Y/Cb/Cr BD-rates of -0.46%/-8.13%/-3.13%, whereas Campfire, with wide colour histograms (Gini index of 0.98), obtains average Y/Cb/Cr BD-rates of -0.21%/0.14%/-0.88%. Although the distribution of the chroma channels can be a reliable indicator of prediction performance, wide colour distributions may not be the only factor restricting the chroma prediction capabilities of the proposed methods, which can be investigated in future work.

Fig. 8. Comparison of logarithmic colour histograms for different sequences.

TABLE VI: BD-rates (%) for Scheme 1, sorted by Gini index

Sequence        Y      Cb     Cr     Gini
Tango2         -0.46  -8.13  -3.13   0.63
MarketPlace    -0.59  -2.46  -3.06   0.77
FoodMarket4    -0.16  -1.60  -1.55   0.85
DaylightRoad2  -0.09  -5.74  -1.85   0.89
Campfire       -0.21   0.14  -0.88   0.98
ParkRunning3   -0.31  -0.73  -0.77   0.99

A summary of the component-wise BD-rate results for all the proposed schemes and the related attention-based approach in [4] is shown in Table VII for all-intra conditions. Scheme 1 achieves average Y/Cb/Cr BD-rates of -0.25%/-2.38%/-1.80% compared with the anchor, suggesting that the proposed size-agnostic multi-model methodology can improve the coding performance of the related block-dependent attention-based models. Besides improving the coding performance, Scheme 1 significantly reduces the encoding (from 212% to 164%) and decoding (from 2163% to 1302%) times, demonstrating the positive effect of the inference simplification.
Finally, the proposed simplifications introduced in Scheme 2 and Scheme 3 further reduce the encoding and decoding times at the cost of a drop in coding performance. In particular, Scheme 2, with the simplified cross-component boundary branch, achieves average Y/Cb/Cr BD-rates of -0.13%/-1.56%/-1.63% and, compared to Scheme 1, reduces the encoding (from 164% to 146%) and decoding (from 1302% to 665%) times. Scheme 3 achieves a smaller reduction of the encoding time (154%) than Scheme 2, while the integer approximations lower the performance, achieving average Y/Cb/Cr BD-rates of -0.16%/-1.72%/-1.38%.
As described in Section IV, the simplified schemes introduced here tackle the complexity reduction of Scheme 1 with two different methodologies. Scheme 2 proposes direct modifications of the original architecture, which needs to be retrained before being integrated in the prediction pipeline. Conversely, Scheme 3 directly simplifies the final prediction process by approximating the already trained weights from Scheme 1 with integer-precision arithmetic.
Therefore, the simulation results suggest that the methodology in Scheme 3 is better at retaining the original performance, since a retraining process is not required. However, the highly reduced architecture in Scheme 2 is capable of approximating the performance of Scheme 3 while further reducing the decoding time.
Overall, the comparison results in Table VII demonstrate that the proposed models offer various trade-offs between compression performance and complexity. While it has been shown that the complexity can be significantly reduced, it is still not negligible. Challenges for future work include the integerisation of the simplified scheme (Scheme 2) while preventing the compression drop observed for Scheme 3. Recent approaches, including a published one which focuses on intra prediction [24], demonstrate that sophisticated integerisation approaches can help retain the compression performance of originally trained models while enabling them to become significantly less complex, and thus be integrated into future video coding standards.

VI. CONCLUSION
This paper showcased the effectiveness of attention-based architectures in performing chroma intra-prediction for video coding. A novel size-agnostic multi-model and its corresponding training methodology were proposed to reduce the inference complexity of previous attention-based approaches. Moreover, the proposed multi-model was proven to better generalise to variable input sizes, outperforming state-of-the-art attention-based models with a fixed and much simpler architecture. Several simplifications were proposed to further reduce the complexity of the original multi-model. First, a framework for reducing the complexity of convolutional operations was introduced, deriving an inference model with around 90% fewer parameters than its corresponding training version. Furthermore, sparse autoencoders were applied to design a simplified cross-component processing model capable of further reducing the coding complexity of the preceding schemes. Finally, algorithmic insights were provided to approximate the multi-model schemes in integer-precision arithmetic, which can lead to fast and hardware-aware implementations of complex operations such as the softmax and Leaky ReLU activations.
The proposed schemes were integrated into the VVC anchor VTM-7.0, signalling the prediction methodology as a new chroma intra-prediction mode working in parallel with the traditional modes for predicting the chroma component samples. Experimental results show the effectiveness of the proposed methods, which retain the compression efficiency of previously introduced neural network models while offering two different directions for significantly reducing coding complexity, translated into reduced encoding and decoding times. As future work, we aim to implement a complete multi-model for all VVC block sizes in order to ensure full usage of the proposed approach, building on the promising results shown in the constrained test conditions. Finally, an improved approach for integer approximations may enable the fusion of all proposed simplifications, leading to a fast and powerful multi-model.

TABLE VII: BD-rate (%) of Y, Cb and Cr for all proposed schemes and [4] under all-intra Common Test Conditions

              Class A1              Class A2              Class B               Class C
              Y      Cb     Cr     Y      Cb     Cr     Y      Cb     Cr     Y      Cb     Cr
Scheme 1     -0.28  -3.20  -1.85  -0.25  -3.11  -1.54  -0.26  -2.28  -2.33  -0.30  -1.92  -1.57
Scheme 2     -0.08  -1.24  -1.26  -0.12  -1.59  -1.31  -0.15  -1.80  -2.21  -0.20  -1.41  -1.62
Scheme 3     -0.19  -2.25  -1.56  -0.13  -2.44  -1.12  -0.16  -1.78  -2.05  -0.20  -1.44  -1.29
Anchor + [4] -0.26  -2.17  -1.96  -0.22  -2.37  -1.64  -0.23  -2.00  -2.17  -0.26  -1.64  -1.41

              Class D               Class E               Class F               Overall              EncT[%]  DecT[%]
              Y      Cb     Cr     Y      Cb     Cr     Y      Cb     Cr     Y      Cb     Cr
Scheme 1     -0.29  -1.70  -1.77  -0.13  -1.59  -1.45  -0.50  -1.58  -1.99  -0.25  -2.38  -1.80   164     1302
Scheme 2     -0.18  -1.42  -1.73  -0.08  -1.67  -1.40  -0.34  -1.50  -1.90  -0.13  -1.56  -1.63   146      665
Anchor + [4] -0.25  -1.55  -1.67  -0.03  -1.35  -1.77  -0.44  -1.30  -1.55  -0.21  -1.90  -1.81
REFERENCES
[1] B. Bross, J. Chen, and S. Liu, "Versatile Video Coding (VVC) draft 7," Geneva, Switzerland, October 2019.
[2] Y. Li, L. Li, Z. Li, J. Yang, N. Xu, D. Liu, and H. Li, "A hybrid neural network for chroma intra prediction," IEEE, 2018, pp. 1797–1801.
[3] J. Pfaff, P. Helle, D. Maniry, S. Kaltenstadler, B. Stallenberger, P. Merkle, M. Siekmann, H. Schwarz, D. Marpe, and T. Wiegand, "Intra prediction modes based on neural networks," Document JVET-J0037-v2, Joint Video Exploration Team of ITU-T VCEG and ISO/IEC MPEG, 2018.
[4] M. Górriz, S. Blasi, A. F. Smeaton, N. E. O'Connor, and M. Mrak, "Chroma intra prediction with attention-based CNN architectures," arXiv preprint arXiv:2006.15349, accepted for publication at IEEE ICIP, October 2020.
[5] L. Murn, S. Blasi, A. F. Smeaton, N. E. O'Connor, and M. Mrak, "Interpreting CNN for low complexity learned sub-pixel motion compensation in video coding," arXiv preprint arXiv:2006.06392, 2020.
[6] P. Helle, J. Pfaff, M. Schäfer, R. Rischke, H. Schwarz, D. Marpe, and T. Wiegand, "Intra picture prediction for video coding with neural networks," IEEE, 2019, pp. 448–457.
[7] X. Zhao, J. Chen, A. Said, V. Seregin, H. E. Egilmez, and M. Karczewicz, "NSST: Non-separable secondary transforms for next generation video coding," IEEE, 2016, pp. 1–5.
[8] K. Zhang, J. Chen, L. Zhang, X. Li, and M. Karczewicz, "Enhanced cross-component linear model for chroma intra-prediction in video coding," IEEE Transactions on Image Processing, vol. 27, no. 8, pp. 3983–3997, 2018.
[9] T. Nguyen, A. Khairat, and D. Marpe, "Adaptive inter-plane prediction for RGB content," Document JCTVC-M0230, Incheon, April 2013.
[10] M. Siekmann, A. Khairat, T. Nguyen, D. Marpe, and T. Wiegand, "Extended cross-component prediction in HEVC," APSIPA Transactions on Signal and Information Processing, vol. 6, 2017.
[11] G. Bjontegaard, "Calculation of average PSNR differences between RD-curves," VCEG-M33, 2001.
[12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[13] Z. Lin, M. Feng, C. N. d. Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio, "A structured self-attentive sentence embedding," arXiv preprint arXiv:1703.03130, 2017.
[14] A. P. Parikh, O. Täckström, D. Das, and J. Uszkoreit, "A decomposable attention model for natural language inference," arXiv preprint arXiv:1606.01933, 2016.
[15] J. Cheng, L. Dong, and M. Lapata, "Long short-term memory-networks for machine reading," arXiv preprint arXiv:1601.06733, 2016.
[16] Y. He, X. Zhang, and J. Sun, "Channel pruning for accelerating very deep neural networks," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1389–1397.
[17] Z. Zhuang, M. Tan, B. Zhuang, J. Liu, Y. Guo, Q. Wu, J. Huang, and J. Zhu, "Discrimination-aware channel pruning for deep neural networks," in Advances in Neural Information Processing Systems, 2018, pp. 875–886.
[18] T.-W. Chin, R. Ding, C. Zhang, and D. Marculescu, "Towards efficient model compression via learned global ranking," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 1518–1528.
[19] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko, "Quantization and training of neural networks for efficient integer-arithmetic-only inference," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2704–2713.
[20] Y. Cai, Z. Yao, Z. Dong, A. Gholami, M. W. Mahoney, and K. Keutzer, "ZeroQ: A novel zero shot quantization framework," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 13169–13178.
[21] S. Xu, H. Li, B. Zhuang, J. Liu, J. Cao, C. Liang, and M. Tan, "Generative low-bitwidth data free quantization," arXiv preprint arXiv:2003.03603, 2020.
[22] M. Courbariaux, Y. Bengio, and J.-P. David, "Training deep neural networks with low precision multiplications," arXiv preprint arXiv:1412.7024, 2014.
[23] J. Ballé, N. Johnston, and D. Minnen, "Integer networks for data compression with latent-variable models," in International Conference on Learning Representations, 2018.
[24] M. Schäfer, B. Stallenberger, J. Pfaff, P. Helle, H. Schwarz, D. Marpe, and T. Wiegand, "Efficient fixed-point implementation of matrix-based intra prediction," IEEE, 2020, pp. 3364–3368.
[25] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
[26] X. Geng, J. Lin, B. Zhao, A. Kong, M. M. S. Aly, and V. Chandrasekhar, "Hardware-aware softmax approximation for deep neural networks," in Asian Conference on Computer Vision. Springer, 2018, pp. 107–122.
[27] R. Timofte, E. Agustsson, L. Van Gool, M.-H. Yang, and L. Zhang, "NTIRE 2017 challenge on single image super-resolution: Methods and results," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 114–125.
[28] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[29] J. Chen, Y. Ye, and S. Kim, "Algorithm description for Versatile Video Coding and Test Model 7 (VTM 7)," Document JVET-P2002, Geneva, October 2019.
[30] J. Boyce, K. Suehring, X. Li, and V. Seregin, "JVET common test conditions and software reference configurations," Document JVET-J1010, Ljubljana, Slovenia, July 2018.
[31] M. G. Blanch, M. Mrak, A. F. Smeaton, and N. E. O'Connor, "End-to-end conditional GAN-based architectures for image colourisation," IEEE, 2019, pp. 1–6.
[32] R. Davidson, "Reliable inference for the Gini index,"