Quality-Driven Dynamic VVC Frame Partitioning for Efficient Parallel Processing
Thomas Amestoy, Wassim Hamidouche, Cyril Bergeron, Daniel Menard
Thomas AMESTOY *†, Wassim HAMIDOUCHE *, Cyril BERGERON † and Daniel MENARD *
* Univ Rennes, INSA Rennes, CNRS, IETR - UMR 6164, Rennes, France. Emails: fi[email protected]
† Thales SIX GTS France, HTE/STR/MMP Gennevilliers, France. Emails: fi[email protected]
ABSTRACT
Versatile Video Coding (VVC) is the next-generation video coding standard, offering coding capability beyond the HEVC standard. The high computational complexity of the latest video coding standards requires high-level parallelism techniques in order to achieve real-time, low-latency encoding and decoding. HEVC and VVC include tile grid partitioning, which allows rectangular regions of a frame to be processed simultaneously by independent threads. The tile grid may be further partitioned into a horizontal sub-grid of Rectangular Slices (RSs), increasing the partitioning flexibility. The dynamic Tile and Rectangular Slice (TRS) partitioning solution proposed in this paper benefits from this flexibility. The TRS partitioning is carried out at the frame level, taking into account both the spatial texture of the content and the encoding times of previously encoded frames. The proposed solution searches for the partitioning configuration that offers the best trade-off between multi-thread encoding time and encoding quality loss. Experiments show that the proposed solution, compared to uniform TRS partitioning, significantly decreases multi-thread encoding time, with slightly better encoding quality.
Index Terms — Video Compression, VVC, High Level Parallelism, Rectangular Slices, VTM
1. INTRODUCTION
In recent years, the democratization of multimedia applications, coupled with the emergence of high resolution and new video formats (8K, 360°), has led to a drastic increase in the volume of exchanged video content [1]. This increasing need for higher compression rates prompted the Joint Video Exploration Team (JVET) to develop a new video coding standard called Versatile Video Coding (VVC), with coding capability beyond High Efficiency Video Coding (HEVC) [2]. The bit-rate savings brought by VVC [3] are however coupled with a considerable increase in encoding computational complexity, estimated at 10 and 27 times the HEVC computational complexity in Inter and Intra coding configurations, respectively [4]. In real-time implementations of the VVC codec, intense parallel processing will therefore be mandatory to achieve real-time encoding and decoding.

This project has received funding from Bpifrance Financement under grant DOS0061463/00 (EFIGI FUI project).

Techniques of video parallel processing essentially operate at three levels of parallelism: data level, frame level and high level. Data-level parallelism techniques are applied to elementary operations, and no encoding quality is lost compared to sequential encoding. They include, among others, techniques relying on Single Instruction on Multiple Data (SIMD) architectures [5]. Frame-level and high-level parallelism operate at thread level. Frame-level techniques encode a group of frames in parallel, where each thread is assigned to a single frame [6]. The encoding time of a single frame is not reduced with frame-level techniques, i.e. the latency is not reduced. In high-level parallelism techniques, the threads operate on continuous regions of the frame, such as tiles or slices [7]. Tiles and slices are independently encodable and decodable, allowing several threads to process the same frame simultaneously. These techniques improve both speed-up and latency equally. However, by enabling independent processing of frame regions, prediction dependencies across boundaries are broken and the entropy encoding state is reinitialized for each region. These restrictions lead to an encoding quality loss compared to the encoding of the non-partitioned sequence. The encoding quality decreases with the number of independent regions of the frame, as has been measured in HEVC by
Chi et al. [8].

In the HEVC and VVC standards, only grid-shaped tile partitioning is allowed, as shown in Fig. 1a. The tiles are delimited by the continuous black lines and the dashed lines correspond to the Coding Tree Unit (CTU) delimitation. The tile partitioning forms a 2x2 grid and tiles are labelled from 0 to 3. In order to increase the partitioning opportunities, VVC combines the tile partitioning with the new concept of Rectangular Slices (RSs). The partitioning combining tiles and RSs is hereafter called Tile and Rectangular Slice (TRS) partitioning. Fig. 1b shows the TRS partitioning of a frame into the same 2x2 tile grid as Fig. 1a, combined with 4 RSs. The RSs are delimited by the continuous red lines and are labelled from A to D. An RS may contain one or several complete tiles, forming together a rectangular region of the frame. Moreover, as shown by RSs C and D, an RS may be a rectangular sub-region of a tile, composed of a number of complete and consecutive CTU rows of that tile. In this latter case, the RSs further partition the tile grid into a horizontal sub-grid, greatly improving the flexibility of the tile grid partitioning.

Fig. 1: Illustration of tile partitioning in HEVC and TRS partitioning in VVC. (a) Grid of 4 tiles, labelled from 0 to 3. (b) Tiles combined with 4 RSs, labelled from A to D.

The partitioning of a frame into tiles and RSs raises two distinct optimization issues: on one side, the minimization of multi-thread encoding time (or speed-up maximization); on the other side, the minimization of the encoding quality loss caused by the partitioning. In the literature, both issues have been addressed for HEVC tile partitioning. The multi-thread encoding time minimization is investigated by Storch et al. [9] and
Koziri et al. [10]. They observe that the encoding time does not vary significantly from a CTU to the co-located CTU in the closest temporal frame. Considering this temporal stability, the authors use the encoding times of previous frames to determine the tile partitioning that minimizes the multi-thread encoding time. In [11], the time estimator for each CTU is computed from the CTU statistics of the previously encoded frame (number of Skip, Inter and Intra blocks, for instance). The authors of [12, 13] minimize the encoding quality loss induced by the tile partitioning by analyzing the CTU luminance variances of the frame. The technique proposed in [14] focuses on the particular case of a variable number of available cores. The encoding loss is lowered in some cases by setting a number of tiles lower than the number of available cores. However, the related works on HEVC tile partitioning address the minimization of encoding time and the minimization of encoding quality loss only independently.

In this work, we take advantage of the increased flexibility offered by the RSs in VVC to propose a dynamic TRS partitioning solution under the VVC Test Model (VTM)-6.2 software. Prior to the encoding of a frame, the TRS partitioning stage uses the spatial information and the times of previously encoded CTUs in order to optimize the TRS partitioning. The proposed solution optimizes a trade-off between encoding time and encoding video quality, which is a novel approach compared to related works. Moreover, to the best of our knowledge, this is the first work that implements a multi-thread VVC reference encoder, generating baseline results for future related works.

The rest of the paper is organized as follows. Section 2 describes the proposed solution, which establishes the trade-off between encoding time and encoding quality. Section 3 presents and analyses the experimental results on VTM-6.2. Finally, Section 4 concludes this paper.
2. DYNAMIC FRAME PARTITIONING FOR PARALLEL PROCESSING
As mentioned in Section 1, the proposed TRS partitioning solution simultaneously addresses the minimization of encoding time and the limitation of encoding quality loss. This section first describes the encoding time minimization of the current frame, using the times of previously encoded co-located CTUs. The second subsection introduces the clustering of spatial information into the RSs to limit the encoding quality loss. The last subsection describes the proposed solution, which establishes a trade-off between encoding time and encoding quality.
Let $P$ be the partitioning of the current frame into $n$ RSs: $P = \{s_0, ..., s_{n-1}\}$. In the following, $T(P)$ is the encoding time of the current frame partitioned with $P$ and processed simultaneously by $N$ threads in parallel (each thread entirely dedicated to encoding a single RS). In this case, $T(P)$ is equal to the time required by the slowest thread to encode its RS. Eq. 1 formally establishes $T(P)$, with $T(c_i)$ the encoding time of CTU $c_i$ and $T(s_j)$ the encoding time of the RS $s_j$.

$$T(s_j) = \sum_{c_i \in s_j} T(c_i), \qquad T(P) = \max_{s_j \in P} T(s_j) \tag{1}$$

Eq. 1 shows that $T(P)$ is directly determined by the CTU encoding times $T(c_i)$. However, during the TRS partitioning stage, these values are not available, since the TRS partitioning stage takes place before the encoding of the current frame. In order to overcome this lack of information, the values $T(c_i)$ are replaced during the TRS partitioning stage by estimated values noted $\tilde{T}(c_i)$.

Several related works [9, 10] define $\tilde{T}(c_i)$ as the encoding time of the co-located CTU (located at the same spatial coordinates) in the closest temporal frame previously encoded. This choice is motivated by the temporal continuity of the content of video sequences. In Random Access (RA) configuration, the authors of [15] have shown that $T(c_i)$ is more correlated with the time of the co-located CTU in the co-Temporal Layer (co-TL) frame than with the time of the co-located CTU in the closest temporal frame. The co-TL frame refers to the previously encoded frame belonging to the same temporal layer. This is caused by the shared coding parameters of frames at the same temporal level in the group of pictures structure defined by the Common Test Conditions (CTC) [16]. Following the results of [15], the selected estimator $\tilde{T}(c_i)$ is defined as the encoding time of the co-located CTU in the co-TL frame.
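As an illustration of Eq. 1, the following minimal Python sketch evaluates the estimated frame time of a candidate partitioning from a grid of co-located CTU times. It is not taken from VTM: the names `slice_time` and `frame_time`, the representation of an RS as a rectangle of CTU coordinates, and the timing values are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple

# Hypothetical RS representation: (row0, col0, row1, col1) in CTU units,
# with end-exclusive bounds. This is an assumption of the sketch.
Rect = Tuple[int, int, int, int]


def slice_time(ctu_times: Sequence[Sequence[float]], rect: Rect) -> float:
    """T(s_j): sum of the (estimated) CTU times inside one RS."""
    r0, c0, r1, c1 = rect
    return sum(ctu_times[r][c] for r in range(r0, r1) for c in range(c0, c1))


def frame_time(ctu_times: Sequence[Sequence[float]],
               partitioning: List[Rect]) -> float:
    """T(P): time of the slowest thread, one thread per RS."""
    return max(slice_time(ctu_times, rect) for rect in partitioning)


# Made-up co-located CTU times, standing in for the co-TL frame measurements.
co_tl_times = [
    [4.0, 4.0, 1.0, 1.0],
    [4.0, 4.0, 1.0, 1.0],
]

uniform = [(0, 0, 2, 2), (0, 2, 2, 4)]    # two RSs of equal area
balanced = [(0, 0, 2, 1), (0, 1, 2, 4)]   # narrower RS over the costly CTUs

print(frame_time(co_tl_times, uniform))   # slowest RS takes 16.0
print(frame_time(co_tl_times, balanced))  # slowest RS takes 12.0
```

In this toy case the uniform split leaves one thread with four times the work of the other, while the unequal split lowers the estimated frame time, which is the effect the encoding time minimization step looks for.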
The encoding time minimization technique consists in searching for the TRS partitioning $P$ that minimizes the estimated time $\tilde{T}(P)$, computed with the $\tilde{T}(c_i)$ values as input.

Fig. 2: TRS partitioning of BQTerrace frame [...]

[...] $P^*$ gathers similar spatial information inside the same RSs. This corresponds to a K-means clustering [17] of the spatial information into the RSs, hereafter called RS clustering. The RS clustering searches for the TRS partitioning $P^*$ that minimizes the sum of the luminance variances over all RSs. Eq. 2 computes the partitioning $P^*$, where $p_i$ is the value of a luminance sample and $\mu_j$ is the mean of the luminance samples of RS $s_j$.

$$P^* = \underset{P}{\operatorname{argmin}} \sum_{s_j \in P} \sum_{p_i \in s_j} (p_i - \mu_j)^2 \tag{2}$$

Fig. 2 shows the 8 RSs partitioning obtained by solving Eq. 2 for frame BQTerrace. In Fig. 2, regions of the frame with similar spatial information tend to be clustered into the same RSs. The dark water of the river is almost entirely contained in RSs 6 and 7, and the light homogeneous regions of the frame are mainly included in RSs 0, 3 and 5. On the other hand, RSs 1, 2 and 4 contain the regions with more complex spatial information.
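The objective of Eq. 2 can be evaluated for a given candidate partitioning as in the short Python sketch below. It only computes the cost of one candidate and does not perform the search itself; the function name `clustering_cost`, the rectangle representation of an RS and the luminance values are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# Hypothetical RS representation: (row0, col0, row1, col1), end-exclusive,
# here expressed directly in luminance sample coordinates.
Rect = Tuple[int, int, int, int]


def clustering_cost(luma: Sequence[Sequence[float]],
                    partitioning: List[Rect]) -> float:
    """Sum over all RSs of the squared deviations from the RS mean (Eq. 2)."""
    total = 0.0
    for r0, c0, r1, c1 in partitioning:
        samples = [luma[r][c] for r in range(r0, r1) for c in range(c0, c1)]
        mean = sum(samples) / len(samples)
        total += sum((p - mean) ** 2 for p in samples)
    return total


# Made-up frame: dark left half, bright right half.
luma = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
]

vertical_split = [(0, 0, 2, 2), (0, 2, 2, 4)]    # each RS is homogeneous
horizontal_split = [(0, 0, 1, 4), (1, 0, 2, 4)]  # each RS mixes dark and bright

print(clustering_cost(luma, vertical_split))    # 0.0
print(clustering_cost(luma, horizontal_split))  # 72200.0
```

The split that follows the content boundary has zero cost, whereas the split that cuts across it is heavily penalized, matching the intent of gathering similar spatial information inside the same RS.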
The TRS partitioning in Fig. 2 gathers similar spatial information inside the same RSs, but is far from optimal regarding the encoding time minimization. For instance, the encoding at QP = 27 of RS [...]

[...] Fig. 3. The TRS partitioning stage, enclosed in the blue dashed box, is applied prior to the parallel encoding of the current frame $F_{cur}$, enclosed in the red dashed box. The TRS partitioning stage is divided into 2 distinct steps. The first step is called the encoding time minimization step. This step computes the minimum estimated encoding time, defined by Eq. 3 and noted $\tilde{T}_{min}$.

$$\tilde{T}_{min} = \min_{P} \tilde{T}(P) \tag{3}$$

The encoding time minimization step takes the CTU times of the co-TL frame $F_{TL}$ as input.

Fig. 3: Proposed solution flowchart. The TRS Partitioning Stage comprises Step 1 (Enc Time Minimization) and Step 2 (RS Clustering), followed by Step 3 (encoding of $F_{cur}$ by the parallel VTM).

The second step of the TRS partitioning stage computes the RS clustering of $F_{cur}$ under an encoding time constraint. This step takes as inputs $\tilde{T}_{min}$ estimated during the previous step, the luminance samples of $F_{cur}$, and a Lagrangian parameter $\lambda$ that manages the trade-off between encoding time and encoding quality. The possible values for $\tilde{T}(P)$ are bounded by Eq. 4.

$$\tilde{T}(P) \leq \tilde{T}_{min} \cdot (1 + \lambda) \tag{4}$$

When $\lambda = 0$, only the partitioning $P$ that minimizes the estimated time is considered, since $\tilde{T}(P) = \tilde{T}_{min}$. When $\lambda$ increases, more partitioning opportunities are offered to the RS clustering, and therefore a higher weight is given to encoding quality compared to encoding time minimization. The parameter $\lambda$ is therefore a means for the encoder to manage the trade-off according to the requirements.

The aim of this paper is to show the relevance of a solution combining the 2 complementary steps previously presented. For this reason, a near-exhaustive search is conducted to compute both $\tilde{T}_{min}$ and the RS clustering. As shown in Fig. 3, the only constraint given to the search algorithm is $k \cdot A_{min}(P) > A_{max}(P)$, with $A_{min}$ and $A_{max}$ the areas of the smallest and the largest RSs, respectively. The constant $k$ is set to [...] in this work in order to contain the search complexity. The choice of less complex heuristics for the TRS partitioning stage is a distinct issue that will be part of future works. The global complexity overhead induced by the TRS partitioning stage is nonetheless measured and discussed further in this paper.
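The two steps of the TRS partitioning stage can be summarized by the following Python sketch, given a list of candidate partitionings whose estimated time, clustering cost and RS areas have already been computed. It is a simplified illustration, not the VTM implementation: the `Candidate` structure, the function names, the candidate values and the value `k = 4` are hypothetical.

```python
from typing import List, NamedTuple, Sequence


class Candidate(NamedTuple):
    """One candidate TRS partitioning with precomputed metrics (hypothetical)."""
    name: str
    est_time: float        # estimated encoding time of the partitioning
    cluster_cost: float    # objective of Eq. 2
    areas: Sequence[int]   # RS areas, e.g. in CTUs


def admissible(c: Candidate, k: float) -> bool:
    """Search restriction: k * A_min(P) > A_max(P)."""
    return k * min(c.areas) > max(c.areas)


def select_partitioning(cands: List[Candidate], lam: float, k: float) -> Candidate:
    pool = [c for c in cands if admissible(c, k)]
    if not pool:
        raise ValueError("no candidate satisfies the area restriction")
    # Step 1: minimum estimated encoding time (Eq. 3).
    t_min = min(c.est_time for c in pool)
    # Step 2: RS clustering restricted by the time bound of Eq. 4.
    feasible = [c for c in pool if c.est_time <= t_min * (1 + lam)]
    return min(feasible, key=lambda c: c.cluster_cost)


# Made-up candidates; D is the fastest but violates the area restriction.
candidates = [
    Candidate("A", est_time=10.0, cluster_cost=500.0, areas=[4, 4]),
    Candidate("B", est_time=10.8, cluster_cost=300.0, areas=[3, 5]),
    Candidate("C", est_time=12.5, cluster_cost=100.0, areas=[2, 6]),
    Candidate("D", est_time=9.0, cluster_cost=50.0, areas=[1, 7]),
]

for lam in (0.0, 0.1, 0.3):
    print(lam, select_partitioning(candidates, lam, k=4).name)
```

With these toy values the selection moves from A to B to C as λ grows from 0 to 0.1 to 0.3: a larger λ admits slower partitionings and lets the clustering cost decide among them.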
3. EXPERIMENTAL RESULTS
This section presents the experimental setup, as well as the performance of the proposed TRS partitioning solution.
The following experiments are conducted under the VTM-6.2 software, built with gcc compiler version 7.4.0, under Linux version 4.15.0-74-generic as distributed in Ubuntu-18.04.1. The platform is composed of Intel(R) Xeon(R) E5-2690 v3 Central Processing Units (CPUs) clocked at 2.60 GHz, each with 12 cores. The cores have each 768KB L1 cache, 3MB L2 cache and 30MB L3 cache.

The high-level parallelism structures included in the VVC standard make it possible to tackle the complexity increase on multi-core processors. This complexity increase is a critical issue mainly for high resolution video sequences. For this reason, the test sequences selected in this work contain 4 Ultra High Definition (UHD) and 5 Full High Definition (FHD) sequences included in the CTC [16]: CatRobot1, DaylightRoad2, FoodMarket4, Tango2 (UHD), and BQTerrace, Cactus, MarketPlace, RitualDance (FHD). The test sequences are encoded under RA configuration at four Quantization Parameter (QP) values: 22, 27, 32, 37. The performance of our TRS partitioning solution is assessed by measuring the trade-off between the encoding quality, using the Bjøntegaard Delta BitRate (BD-BR) [18], and the multi-thread speed-up $\sigma$, defined by Eq. 5.

$$\sigma = \frac{1}{4} \sum_{QP_i \in \{22, 27, 32, 37\}} \frac{T_O(QP_i)}{T_R(QP_i)} \tag{5}$$

$T_O(QP_i)$ and $T_R(QP_i)$ are the original time (encoded with 1 RS and 1 single thread) and the reduced time (encoded with $N$ RSs and $N$ threads) spent to encode the video sequence with $QP_i$, respectively. The overhead induced by the TRS partitioning stage is hereafter noted $\theta$ and measured in percentage of $T_R$.

The theoretical upper bound in terms of speed-up for the proposed solution, noted $\sigma_{max}$, is computed with Amdahl's law [19]. Let $s$ be the sequential part (in %) of an application. The upper bound $\sigma_{max}$ obtainable with $n$ threads is expressed by Eq. 6.

$$\sigma_{max}(n) = \frac{1}{s + \frac{1 - s}{n}} \tag{6}$$

In our case, the sequential portion of the VTM-6.2 encoder contains the data initialization, entropy, in-loop filter and bitstream writing stages. All together, these stages represent [...] of the encoding time on average across test sequences and QP values. Therefore, Eq. 6 provides the following upper bounds: $\sigma_{max}(4)$ = 3.[...], $\sigma_{max}(8)$ = 6.[...] and $\sigma_{max}(12)$ = 8.[...].

As mentioned in Section 2.3, the Lagrangian parameter $\lambda$ manages the trade-off between encoding quality and encoding time minimization induced by the TRS partitioning. Three values of the parameter $\lambda$ (0, 0.1 and 0.3) are tested, and the one offering the best trade-off is selected according to the number of threads and the resolution.
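The speed-up metric of Eq. 5 and the Amdahl bound of Eq. 6 are simple enough to check numerically. The Python sketch below does so under stated assumptions: the function names are illustrative, the encoding times are made up, and the sequential fraction `s = 0.04` is a hypothetical value chosen for the example, not the fraction measured on VTM-6.2.

```python
from typing import Dict


def average_speedup(t_orig: Dict[int, float], t_reduced: Dict[int, float]) -> float:
    """Eq. 5: mean over the QP values of T_O(QP) / T_R(QP)."""
    return sum(t_orig[qp] / t_reduced[qp] for qp in t_orig) / len(t_orig)


def amdahl_upper_bound(s: float, n: int) -> float:
    """Eq. 6: upper bound on the speed-up with n threads, s given as a fraction."""
    return 1.0 / (s + (1.0 - s) / n)


# Made-up encoding times (seconds) for the four QP values of the test conditions.
t_o = {22: 400.0, 27: 300.0, 32: 200.0, 37: 100.0}   # 1 RS, 1 thread
t_r = {22: 100.0, 27: 60.0, 32: 50.0, 37: 20.0}      # N RSs, N threads

print(average_speedup(t_o, t_r))  # 4.5

# Hypothetical sequential fraction, for illustration only.
s = 0.04
for n in (4, 8, 12):
    print(n, round(amdahl_upper_bound(s, n), 2))
```

Because the bound saturates at 1/s as n grows, even a sequential fraction of a few percent caps the achievable speed-up well below the number of threads, which explains why the upper bounds quoted for 8 and 12 threads are noticeably lower than 8 and 12.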
Table 1 presents the average results obtained with the selected $\lambda$ values, according to the resolution and the number of threads $n$. Moreover, the results of the uniform TRS partitioning applied to the test sequences are also presented, in order to evaluate the performance of the proposed solution. The uniform TRS partitioning is a usual and straightforward technique that partitions the frame into a grid of RSs of the same dimension.

Table 1: Average speed-up $\sigma$, BD-BR and overhead $\theta$ obtained by both uniform and proposed TRS partitioning, according to the resolution and number of threads $n$. Columns: FHD (Unif, Proposed) and UHD (Unif, Proposed). Rows for each $n$: selected $\lambda$, BD-BR (%), speed-up $\sigma$ and $\theta$ (%).
n = 4: $\lambda$ = 0 (FHD), $\lambda$ = 0 (UHD); BD-BR (%): 1.62 [...]
n = 8: $\lambda$ = 0.[...] (FHD), $\lambda$ = 0.[...] (UHD); BD-BR (%): 2.69, 2.80, 2.39 [...]
n = 12: $\lambda$ = 0.[...] (FHD), $\lambda$ = 0.[...] (UHD); BD-BR (%): 4.31 [...]

Table 1 shows that the proposed TRS partitioning solution achieves better results than uniform TRS partitioning in terms of $\sigma$, regardless of the resolution and number of threads $n$. The $\sigma$ increase ranges from [...] to [...], for UHD content with $n = 4$ and $n = 12$, respectively. The proposed TRS partitioning solution therefore significantly reduces the distance to the upper bounds $\sigma_{max}$ computed by Amdahl's law, compared to uniform TRS partitioning. This significant $\sigma$ increase demonstrates the efficiency of the encoding time minimization step, presented in Section 2.1. It is important to note that the encoding time of every frame is reduced. Therefore both speed-up and latency are improved equally by the proposed solution.

In terms of BD-BR, the results of the proposed solution with the selected $\lambda$ values are slightly better (around −[...]) compared to uniform TRS partitioning. Two exceptions are however noticeable. The BD-BR decrease is substantial (−[...]) for FHD content with $n = 12$, and the only case for which the BD-BR is slightly higher is FHD content with $n = 8$ (+[...]). The related works in HEVC minimizing the BD-BR reported [...] [12] and [...] [15] average BD-BR decrease with 8 threads on FHD and UHD content. Our results in terms of BD-BR are therefore close to the results of the previously mentioned works, even though these works minimize the BD-BR without taking the speed-up optimization into consideration.

The conclusion of Table 1 is that the proposed solution is able to keep the BD-BR increase at values close to those of uniform RS partitioning. The variation of the $\lambda$ value is however not sufficient to significantly decrease the BD-BR, except for FHD content with $n = 12$. On the other hand, the proposed solution is highly effective at increasing the speed-up offered by the TRS partitioning in VVC. Regarding the overhead $\theta$, half of it is induced by the encoding time minimization step and half by the encoding quality loss limitation step. The values are negligible when $n = 4$ and $n = 8$. For $n = 12$, $\theta$ is greater than [...] due to the almost exhaustive search implemented (see Section 2.3). We are confident that the investigation of simple heuristics in future works will greatly reduce $\theta$, without degrading the results presented in Table 1.

Table 2: Proposed solution with $\lambda$ = 0 and $\lambda$ = 0.[...], encoded with 8 threads, according to UHD sequence.

                 λ = 0                   λ = 0.[...]
Sequence         BD-BR (in %)   σ        BD-BR (in %)   σ
CatRobot1        [...]          [...]    [...]          [...]
DaylightRoad     [...]          [...]    [...]          [...]
FoodMarket       [...]          [...]    [...]          [...]
Tango2           [...]          [...]    [...]          [...]
Average          2.49           5.43     2.33           5.34

Table 2 shows the performance of the proposed solution with $\lambda$ = 0 and $\lambda$ = 0.[...] running with 8 threads, according to the UHD sequence. As explained in Section 2.3, the higher $\lambda$, the more importance is given to encoding quality with regard to the speed-up. The results of Table 2 are consistent with this explanation. Indeed, for every sequence the proposed solution with $\lambda$ = 0.[...] achieves a better BD-BR but a lower $\sigma$ compared to the proposed solution with $\lambda$ = 0. On average, the BD-BR is 0.16% better when selecting $\lambda$ = 0.[...], without significantly degrading $\sigma$ (-0.09). The results are particularly noticeable for sequence FoodMarket. For this sequence, the BD-BR is 0.24% better and $\sigma$ only decreases by 0.06% when selecting $\lambda$ = 0.[...], compared to the proposed solution with $\lambda$ = 0.
4. CONCLUSION
In this paper, a dynamic TRS partitioning is proposed for the next-generation video standard VVC. The proposed solution combines two techniques to minimize multi-thread encoding time and encoding quality loss, respectively. A Lagrangian parameter $\lambda$ is applied, making it possible to select a trade-off between encoding time and encoding quality. The experiments show that the proposed solution significantly decreases multi-thread encoding time, with slightly better encoding quality, compared to uniform RS partitioning. Future works will focus, among other points, on the improvement of the CTU time estimator used in the encoding time minimization step. Instead of simply relying on the co-located CTU times of the co-TL frame, future solutions will rely on CTUs deduced from motion information. The investigation of lightweight heuristics for the TRS partitioning stage will also be part of future works. We are confident they will drastically reduce the overhead, especially for 12-thread encodings of UHD content.

5. REFERENCES

[1] CISCO, “Global 2021 forecast highlights,” p. 6, 2016.
[2] Gary J. Sullivan, Jens-Rainer Ohm, Woo-Jin Han, and Thomas Wiegand, “Overview of the High Efficiency Video Coding (HEVC) Standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649–1668, Dec. 2012.
[3] Naty Sidaty, Wassim Hamidouche, Olivier Deforges, Pierrick Philippe, and Jerome Fournier, “Compression Performance of the Versatile Video Coding: HD and UHD Visual Quality Monitoring,” in [...], Ningbo, China, Nov. 2019, pp. 1–5, IEEE.
[4] Frank Bossen, Karsten Suehring, and Xiang Li, “JVET-P0003: AHG report: Test model software development (AHG3),” 2019.
[5] Benjamin Bross, Mauricio Alvarez-Mesa, Valeri George, Chi Ching Chi, Tobias Mayer, Ben Juurlink, and Thomas Schierl, “HEVC real-time decoding,” San Diego, California, United States, Sept. 2013, p. 88561R.
[6] Wassim Hamidouche, Mickael Raulet, and Olivier Deforges, “4K Real-Time and Parallel Software Video Decoder for Multilayer HEVC Extensions,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 26, no. 1, pp. 169–180, Jan. 2016.
[7] Kiran Misra, Andrew Segall, Michael Horowitz, Shilin Xu, Arild Fuldseth, and Minhua Zhou, “An Overview of Tiles in HEVC,” IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 6, pp. 969–977, Dec. 2013.
[8] Chi Ching Chi, M. Alvarez-Mesa, B. Juurlink, G. Clare, F. Henry, S. Pateux, and T. Schierl, “Parallel Scalability and Efficiency of HEVC Parallelization Approaches,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1827–1838, Dec. 2012.
[9] Iago Storch, Daniel Palomino, Bruno Zatt, and Luciano Agostini, “Speedup-aware history-based tiling algorithm for the HEVC standard,” in [...], Phoenix, AZ, USA, Sept. 2016, pp. 824–828, IEEE.
[10] Maria Koziri, Panos K. Papadopoulos, Nikos Tziritas, Nikos Giachoudis, Thanasis Loukopoulos, Samee U. Khan, and Georgios I. Stamoulis, “Heuristics for tile parallelism in HEVC,” in [...], Kos, Greece, Aug. 2017, pp. 1514–1518, IEEE.
[11] Yong-Jo Ahn, Tae-Jin Hwang, Dong-Gyu Sim, and Woo-Jin Han, “Complexity model based load-balancing algorithm for parallel tools of HEVC,” in [...], Kuching, Malaysia, Nov. 2013, pp. 1–5, IEEE.
[12] Cauane Blumenberg, Daniel Palomino, Sergio Bampi, and Bruno Zatt, “Adaptive content-based Tile partitioning algorithm for the HEVC standard,” in [...], San Jose, CA, USA, Dec. 2013, pp. 185–188, IEEE.
[13] Xin Jin and Qionghai Dai, “Clustering-Based Content Adaptive Tiles Under On-chip Memory Constraints,” IEEE Transactions on Multimedia, vol. 18, no. 12, pp. 2331–2344, Dec. 2016.
[14] Giovani Malossi, Daniel Palomino, Claudio Diniz, Altamiro Susin, and Sergio Bampi, “Adjusting video tiling to available resources in a per-frame basis in High Efficiency Video Coding,” in [...], Vancouver, BC, Canada, June 2016, pp. 1–4, IEEE.
[15] Chia-Hsin Chan, Chun-Chuan Tu, and Wen-Jiin Tsai, “Improve load balancing and coding efficiency of tiles in high efficiency video coding by adaptive tile boundary,” Journal of Electronic Imaging, vol. 26, no. 1, pp. 013006, Jan. 2017.
[16] Jill Boyce, Karsten Suehring, and Xiang Li, “JVET-J1010: JVET common test conditions and software reference configurations,” 2018.
[17] J. A. Hartigan and M. A. Wong, “Algorithm AS 136: A K-Means Clustering Algorithm,”