Fine granularity access in interactive compression of 360-degree images based on rate-adaptive channel codes
Navid Mahmoudian Bidgoli, Thomas Maugey, Aline Roumy
This is a pre-print of an article published in IEEE Transactions on Multimedia. The final authenticated version is available online at: https://doi.org/10.1109/TMM.2020.3017890
Abstract—In this paper, we propose a new interactive compression scheme for omnidirectional images. This requires two characteristics: efficient compression of data, to lower the storage cost, and random access ability to extract the part of the compressed stream requested by the user (for reducing the transmission rate). For efficient compression, data needs to be predicted by a series of references that have been pre-defined and compressed. This contrasts with the spirit of random accessibility. We propose a solution to this problem based on incremental codes implemented by rate-adaptive channel codes. This scheme encodes the image while adapting to any user request and leads to an efficient coding that is flexible in extracting data depending on the available information at the decoder. Therefore, only the information that needs to be displayed at the user's side is transmitted during the user's request, as if the request were already known at the encoder. The experimental results demonstrate that our coder obtains a better transmission rate than the state-of-the-art tile-based methods at a small cost in storage. Moreover, the transmission rate grows gradually with the size of the request and avoids a staircase effect, which shows the perfect suitability of our coder for interactive transmission.
Index Terms—360-degree content, intra prediction, incremental codes, interactive compression, user-dependent transmission.
I. INTRODUCTION
Omnidirectional or 360◦ images/videos are becoming increasingly popular as they offer an immersive experience in virtual environments, and are available with cheap devices, i.e., omnidirectional cameras and Head-Mounted Displays (HMD). As a result, a gigantic number of 360◦ videos has been uploaded to video streaming platforms. The specificity of 360◦ content is that it is accessed in an interactive way, i.e., only a small portion, called a viewport, of the whole image is requested by the user. This has two consequences. First, to reach a sufficient resolution at the observer's side, the global resolution of the input signal is significantly high (typically 4K or 8K). This underlines the need for efficient compression algorithms. Second, to save bandwidth, the coding algorithm should allow random access to the compressed data, and only transmit what is requested. However, efficient compression introduces dependencies in the compressed representation, which is in contrast to the need for random accessibility.

The authors are with Inria, Univ Rennes, CNRS, IRISA (e-mail: {navid.mahmoudian-bidgoli, thomas.maugey, aline.roumy}@inria.fr). This work was supported by the Cominlabs excellence laboratory with funding from the French National Research Agency (ANR-10-LABX-07-01) and by the Brittany Region (Grant No. ARED 9582 InterCOR).

Fig. 1. At a given instant, a user only observes a small portion of the 360◦ image. The goal of interactive coding schemes is to transmit only this useful information while keeping a good global compression efficiency.

Different approaches have been proposed to solve this tradeoff in the context of 360◦ data. All these solutions rely on powerful 2D video compression tools such as Versatile Video Coding (VVC), and thus on the projection of the acquired spherical content onto a 2D plane (equirectangular [1], cubemap [2], [3], dodecahedron [4], pyramid [3]).
A first compromise solution consists in sending the entire 360◦ image, achieving a high compression efficiency but a low random accessibility. A second approach consists in partitioning the 2D-projected 360◦ image into sub-parts, called tiles, and encoding them separately [5], [6]. The transmission rate is significantly reduced because only the tiles necessary to reconstruct the requested viewport are transmitted. However, this approach has two main drawbacks: first, the correlation between the tiles is not exploited, and second, the sent tiles usually contain significantly more data than the requested viewport. Therefore, the tile-based approach also fails to achieve the transmission rate of an oracle coder that knows the request in advance and encodes the requested viewport(s) only. To achieve this ideal transmission rate, one could encode and store all possible requested viewports, but this exhaustive solution leads to a huge storage cost.

In this paper, we propose a novel 360◦ image interactive coding scheme that achieves the oracle transmission rate, while keeping the storage cost reasonable. More precisely, the storage size is far less than the exhaustive storage solution and close to the solution without random access (compression of the entire 360◦ image).
Fig. 2. Interactive compression of a spherical image Γ. Mapping and encoding operations are performed offline whereas extraction, decoding and viewport construction are performed online, upon the user's request.

Classical compression schemes work block-wise, and use a fixed block scanning order for compression and decompression. Therefore, in 360◦ tile-based image compression, the smallest entity that can be accessed is a tile, a predefined set of blocks. By contrast, we propose to encode the blocks such that any decoding order is possible at the decoder's side. As a consequence, i) only the useful blocks are sent and decoded, and ii) the correlation between the blocks is still taken into account. Therefore, i) and ii) enable our coder to reach the same transmission rate as the oracle coder. In practice, for each block to be encoded, several predictions are generated (corresponding to different block decoding orders), and a bitstream is built based on the rate-adaptive channel codes studied in [7]–[9]. This bitstream compensates for the worst prediction, and a substream can be extracted from it to correct any other prediction.

Moreover, to build an efficient end-to-end coder, we also propose novel prediction modes to adapt not only to interactivity, but also to the geometrical properties and discontinuities of the 360◦ mapped data. Then, a new strategy is developed to spread the access points (independently decodable blocks) such that the decoding process can be triggered whatever the requested viewport is. Then, a new scanning order is proposed in order to further lower the transmission rate. The experimental results demonstrate the ability of our proposed coder to reach the same transmission rate as the oracle, while incurring a reasonable additional storage cost.

The remainder of this paper is organized as follows. The compression of 360◦ content with user-dependent transmission is formulated in Section II. The overview of the proposed coder is given in Section III.
In Section IV, the details of the proposed coder are explained. Finally, in Section V we compare our approach with baseline coders, including the tiling approach.

II. PROBLEM FORMULATION FOR INTERACTIVE COMPRESSION OF 360◦ CONTENT
Before describing our core contributions on interactive compression of 360◦ images (in Sec. III and IV), we first overview the interactive 360◦ coding model. This model, depicted in Fig. 2, is general and encompasses the state of the art as well as the proposed scheme. Then, the coding scheme optimization problem is formulated and different measures for evaluation are proposed.

A. Data model, projection and compression for storage on the server
A 360◦ data, denoted by Γ, is a spherical image, i.e., a set of pixels defined on the sphere. Each pixel is determined by its value (e.g., R,G,B) and its spherical coordinates θ, i.e., a 2D vector composed of the latitude ∈ [−π/2, π/2] and the longitude ∈ [−π, π). To be able to use classical compression algorithms, this data Γ is first mapped onto a Euclidean space. This leads to I, the Euclidean 2D representation, also called the
2D image. For instance, the equirectangular projection consists in uniformly sampling the spherical coordinates θ, and in replacing θ by Euclidean coordinates: (x, y) ≈ θ. The mapping function from the sphere to the plane is denoted by m : Γ ↦ I.

Then, compression is performed with the encoding function f. To allow interactivity, the compression leads to a set of independently extractable bitstreams (b_1, …, b_n):

(f ∘ m)(Γ) = (b_1, …, b_n),   (1)

where ∘ stands for the composition of functions. The storage cost, i.e., the size of the data stored on the server, is defined as:

S = |(f ∘ m)(Γ)| = Σ_{i=1}^{n} |b_i|.   (2)

B. Bitstream extraction and decoding upon user's request
In interactive compression, the viewer points in the direction θ to get a subset γ(θ) of the whole spherical content Γ. Upon request, the server extracts the bitstreams with indices i ∈ I_θ,

g_θ : (b_1, …, b_n) ↦ (b_i)_{i ∈ I_θ},

and sends them to the user at rate

R_θ = |g_θ(b_1, …, b_n)| = Σ_{i ∈ I_θ} |b_i|,   (3)

where R_θ is called the transmission rate.

At the user's side, the extracted bitstreams allow to reconstruct γ(θ), i.e., the requested subset of the spherical content:

m^{−1} ∘ h_θ : (b_i)_{i ∈ I_θ} ↦ γ̂(θ),

where h_θ is the decoding function of the received bitstreams. Note that due to lossy compression γ(θ) ≠ γ̂(θ).

Finally, a 2D image, called the viewport v(θ), is displayed to the user (Fig. 2). This image is the projection (denoted φ) of γ(θ), the requested portion of the spherical image, onto a plane orthogonal to the direction θ and tangent to the sphere. Note that the size and resolution of the viewport determine the subset γ(θ) of pixels requested.

Evaluating the quality of 360-degree images is a wide problem that has been intensively studied by the community. A comprehensive review and analysis is given in [10]. To measure a realistic quality of experience at the user side, the distortion is measured on the image displayed to the user [11]. Formally, the distortion is

D_θ = ||v(θ) − v̂(θ)|| = ||φ(γ(θ)) − φ(γ̂(θ))||.   (4)

C. Storage-Transmission trade-off in state-of-the-art methods
State-of-the-art 360◦ coders achieve interactivity, or equivalently separable bitstreams (1), by dividing the 2D image I into sub-elements and encoding each sub-element independently with f, the encoding function of a classical 2D video codec, such as VVC. For instance, in the case of the cubemap projection, the sphere is projected onto a cube and each sub-element is a face of the cube [2], [3]. Similarly, [12] divides the sphere into separate elements. In the case of the equirectangular projection [5], [6], [13], [14], the 2D image is partitioned into sub-images called tiles, and each tile is coded separately.

Thus, all state-of-the-art approaches use the same encoding function f, and they only differ in the mapping m. So, these methods attempt to solve

min_m λS + E_θ(R_θ), such that E_θ(D_θ) ≤ δ,   (5)

by proposing different mapping functions m, where E_θ is the expectation over all users' requests.

D. Our goal: oracle transmission rate with small storage cost
Our coder differs from state-of-the-art methods in several ways. First, our goal is to achieve the same transmission rate R* as the oracle scheme that knows the request in advance (i.e., upon encoding). Second, the storage should not be significantly penalized. Finally, to achieve this goal, the encoding and decoding functions (f, g_θ, h_θ) are modified, but not the mapping m. So, our method attempts to solve

min_{f, g_θ, h_θ} S, such that E_θ(D_θ) ≤ δ, E_θ(R_θ) = R*,   (6)

by proposing new functions (f, g_θ, h_θ). Since the proposed method is compatible with any mapping m, we adopt, without loss of generality, the equirectangular projection in the rest of this paper.

III. PRINCIPLE AND OVERVIEW OF THE PROPOSED INTERACTIVE CODING SCHEME
The strength of conventional coders resides in the block-wise compression of the image, where the correlation between the blocks is captured thanks to a predictive coding scheme [15]. Predictive coding is based on the principle of conditional coding, illustrated in Fig. 3. In this scheme, an i.i.d. source X̄ is compressed losslessly, while another i.i.d. source Ȳ, called side information (SI), is available at both encoder and decoder. The achievable compression rate is the conditional entropy H(X̄|Ȳ) [16], which is smaller than the rate H(X̄) achieved when no SI is available. This shows the benefit of the SI. Here, X̄ models the image block to be compressed, and Ȳ models the prediction of X̄ using another block.

Fig. 3. Lossless conditional source coding [16]: when the SI Ȳ is available at both encoder and decoder, the source X̄ is compressed at rate H(X̄|Ȳ), which is smaller than its entropy H(X̄).

This scheme is efficient but imposes a fixed pre-defined encoding and decoding order of blocks, which contradicts the idea of randomly accessing any part of an image. Therefore, to allow random access, state-of-the-art 360◦ coders (see Section II-C) partition the image into sub-elements, for instance tiles (blue rectangles in Fig. 4(a)). Each tile is coded separately and contains several blocks (red borders). In each tile, there is a reference block (red block with yellow border) that is coded independently of the other blocks. The remaining blocks are encoded/decoded in a fixed order depicted by the black arrows. Then, upon request of a viewport, whose projection on the 2D image is depicted in green, all tiles containing at least some of the requested green area are sent and decoded. Note that the correlation between the tiles is not exploited and also potentially many unrequested blocks are sent. Therefore, these schemes cannot achieve the oracle compression rate R*.

By contrast, in our scheme, the fixed decoding order constraint is relaxed.
A block (rectangle with a red border in Fig. 4(b)) can be decoded using any combination of neighboring blocks that are available. Then, for each requested viewport, only the displayed blocks are decoded, and the correlation between the blocks is optimally taken into account. Therefore, our scheme can reach the oracle compression rate R*, while maintaining the constraint that the request is not known in advance.

The aforementioned optimality can be explained from [7], which studies lossless coding of a source with several SI available at the encoder, see Fig. 5. The source is encoded in a single bitstream and, once it is known which SI is available at the decoder, a part of the compressed bitstream is extracted. In the 360◦ image context, X̄ models the image block to be compressed and each Ȳ_j corresponds to the prediction for X̄ using a combination of the neighboring blocks indexed by j. In

(a) tile-based approach (b) our coding scheme
Fig. 4. The green region is the requested portion of the 2D image. The transmitted blocks are represented by rectangles with red borders (including the filled ones, which correspond to the reference or access blocks), and the arrows depict the block decoding order. For the same request, the proposed scheme (b) reduces the amount of information sent but not displayed compared to the tile-based approach (a). Moreover, the correlation among the blocks is still taken into account with our coder, while the tile-based approach compresses each tile separately.

[7]–[9], it is shown that incremental coding optimally exploits the dependence between the sources because the extraction can be made at rate H(X̄|Ȳ_j), i.e., at the same rate as if the encoder knew in advance the index of the SI. Moreover, to achieve this optimality, there is no need to perform exhaustive storage, i.e., to store a bitstream for each possible SI with a rate Σ_j H(X̄|Ȳ_j).
Indeed, storing a bitstream for the least correlated SI, at rate max_j H(X̄|Ȳ_j), is sufficient. In particular, max_j H(X̄|Ȳ_j) ≪ Σ_j H(X̄|Ȳ_j), especially when the number of possible SI is large.

The proposed interactive image coding scheme works as follows. At the encoder, for each block (rectangle with a red border in Fig. 4(b)), a set of predictions is computed, one per set of neighboring blocks that might be available at the decoder. Then, each block is encoded with the incremental encoder, conditionally to the set of SI. Then, the reference blocks of the tile-based approach are replaced by so-called access blocks (red shaded), for which the set of predictions is complemented by Ȳ_∅, an empty prediction. Therefore, such a block can be decoded independently (as the first decoded block) or as a function of the neighboring blocks if they are already decoded. In other words, the access blocks are stored with a rate H(X̄) but can be transmitted at the proper rate, i.e., H(X̄) if the block is the first block to be decoded and H(X̄|Ȳ_j) otherwise.

At the decoder side, given a user request (green area in Fig. 4(b)), an access block is decoded. Then, from this block, a prediction for a neighboring block is calculated. This prediction is corrected thanks to the extracted bitstream that has been generated by the incremental source encoder, and this reconstructs the neighboring block, see Fig. 5. Decoding of the following blocks is done in the same way until all blocks in the requested region are reconstructed. The decoding order is depicted as black arrows in Fig. 4(b).

Fig. 5. Compression rates achieved by incremental coding. A source X̄ is compressed knowing that an SI Ȳ_j is available at the decoder. The encoder knows the set {Ȳ_i, i = 1, …, J} but does not know which SI of the set is available at the decoder.
The source can be compressed at rate max_i H(X̄|Ȳ_i), corresponding to the conditional encoding with respect to the least correlated SI. From this description, a sub-description can be extracted at rate H(X̄|Ȳ_j), i.e., at the same rate as if the SI were known by the encoder.

IV. DETAILED DESCRIPTION OF THE PROPOSED INTERACTIVE CODING SCHEME
We now detail our interactive coding scheme. We first present the overall architecture, as well as the issues raised by the construction of an interactive coding scheme, in Section IV-A. These issues are further elaborated in the following sections. For instance, we formulate the access block placement problem and propose a greedy solution in Section IV-B. We also explain the construction of the set of predictions in Section IV-C. We then detail the practical implementation of the incremental entropy coder, and its integration within an interactive image coder, in Section IV-D. Finally, we formulate the decoding order problem and propose a greedy solution in Section IV-E. The way our proposed scheme handles navigation within a still scene is also discussed in Section IV-E.
A. Proposed architecture
The proposed architecture is summarized in Fig. 6. The encoder first splits the input image into several blocks X_j. For each block, several predictions are computed: one prediction per possible set of neighboring blocks available at the decoder (see Section IV-C). For some of these blocks, the so-called access blocks ({X_k}_{k ∈ A}), an empty prediction is added to the prediction set, so that they can be decoded independently to start the decoding process. These blocks need to be carefully selected to ensure that any requested viewport can be decoded, while not impacting too much the storage cost and transmission rate. This trade-off is studied in Section IV-B. Then, each block X_j is encoded with an incremental coder as a function of the different predictions (see Section IV-D). This incremental coder, together with the quantizer and the transform, must be carefully integrated in the image coding scheme to avoid the propagation of quantization errors [17, Sec. 11.3]. This is presented in Section IV-D.

(a) Encoder (b) Extractor and decoder
Fig. 6. Proposed architecture.
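The storage/transmission accounting behind this architecture can be illustrated with a toy sketch (block names and rate values below are hypothetical, not from the paper): each block stores one incremental bitstream sized for its worst prediction context, and a request extracts only the prefix matching the context actually available at the decoder.

```python
# Toy accounting sketch (hypothetical rates, in bits). For each block, the
# list holds the rate needed under each possible prediction context, sorted
# from best (most correlated SI) to worst; an access block additionally
# carries an "empty prediction" entry equal to its standalone rate H(X).
rates = {
    "X1": [120, 150, 200, 480],   # access block: last entry = independent rate
    "X2": [100, 130, 180],
    "X3": [90, 110, 160],
}

# Storage (in the spirit of eq. (2)): one incremental bitstream per block,
# sized for the worst context -- not the sum over all contexts.
storage = sum(max(r) for r in rates.values())

# Transmission (in the spirit of eq. (3)): extract only the prefix needed
# for the context actually available at the decoder.
def transmission(request):
    """request: {block name: index of the available prediction context}."""
    return sum(rates[b][ctx] for b, ctx in request.items())

# A request decodes X1 independently, then X2 and X3 with good predictions.
r = transmission({"X1": 3, "X2": 0, "X3": 0})
assert r <= storage
```

The gap between `storage` and `r` is the price of interactivity: the server keeps the worst-case bitstream, but any single request only pays for the contexts it actually uses.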
At the decoder, given a requested viewport, one access block is first decoded. We optimize the block decoding order to reach a better storage and transmission efficiency (see Section IV-E for details). Then, each block is consecutively decoded by first generating a prediction based on the already decoded blocks, then extracting the necessary bitstream, and running the incremental decoder (see Section IV-D).

We now turn to the optimization of the two critical features identified above: the access block set A and the decoding order, denoted τ. Both features influence the storage cost S(A, τ) and the transmission rate R(A, τ), and must satisfy the constraint that any requested viewport can be decoded. To avoid sending unrequested blocks, and therefore minimize the transmission rate of the first request, we turn this constraint into a more stringent one:

Constraint 1: There is at least one access block per request. More precisely, for any rotation angle θ of the HMD device, there exists at least one access block belonging to the mapped visible region of the sphere (m ∘ γ)(θ), i.e.,

C(A) = {∀θ, ∃k ∈ A s.t. X_k ∈ (m ∘ γ)(θ)}.   (7)

Constraint 1 is more stringent than imposing that any request can be decoded, as it could increase the number of access blocks and therefore the storage. However, the proportion of access blocks among all blocks remains low (as shown in Fig. 7). Therefore, the storage is not significantly increased.

As a consequence, the access block/decoding order optimization problem can be formulated as

min_{A,τ s.t. C(A)} R(A, τ) + λS(A, τ),   (8)

where the constraint C(A), corresponding to Constraint 1, is defined in (7). Furthermore, the storage does not depend on the decoding order. Moreover, we have verified experimentally that, given a decoding order, the position of the access blocks has little influence on the transmission, see Section V-A1. Therefore, the optimization problem (8) can be simplified into two equivalent subproblems:

min_{A,τ s.t. C(A)} R(τ) + λS(A)  ⇔  min_{A s.t. C(A)} S(A),   (9a)
                                      min_τ R(τ).               (9b)

B. The access block placement problem
To solve the access block placement problem (9a), we propose two strategies. The first strategy is content independent and assumes that all blocks have the same storage cost. This algorithm is called the Fixed-based approach in the following. In other words, it approximates (9a) by

min_{A s.t. C(A)} |A|.   (10)

The advantage of this strategy is that the solution only depends on the spatial resolution and on the field of view of the HMD. As a consequence, there is no need to transmit the positions of the access blocks to the decoder. The new problem (10) is still complex since the constraint C(A) is a non-trivial function of the access block set. To solve it, we propose a greedy algorithm. Starting from the north pole, we change the angle θ to sweep the sphere down to the south pole, and at each step, we check if an access block exists in (m ∘ γ)(θ). If not, the block which contains the center of the area is added to the set A. The details of the algorithm can be found in Alg. 1. The obtained access block locations in the equirectangular image are shown in Fig. 7a.

The second strategy instead takes the visual content into account and is called Content-based. In this strategy, first, the number of access blocks needs to be sent, which requires log₂(N) bits if fixed-length encoding is used, where N represents the number of blocks in the image. Second, for each access block, the position of the block must be transmitted as well (again log₂(N) bits per access block). If we denote the cost to encode the block of index k independently of any other block by R_{k|∅}, (9a) can now be rewritten as

min_{A s.t. C(A)} log₂(N) + |A| log₂(N) + Σ_{k ∈ A} R_{k|∅}.   (11)

Again, to solve this problem, we propose a greedy algorithm. It is similar to the previous one. The only difference is that, if no access block has been found in the current area, then the block with minimum rate is chosen. If several such blocks are found, then the one which is closest to the center is chosen. The obtained access block positions are shown in Fig. 7b. The impact of both approaches on the storage cost is evaluated in Sec. V-A2.

C. Set of possible side information and prediction functions
The estimation of a given block by one or several of its neighbors is performed by adapting the intra-prediction module of conventional video coding standards [15]. More precisely, the boundary row/column pixels from the adjacent blocks are combined to form a directional or planar prediction. By construction, the prediction depends on which neighboring blocks are already decoded and available to serve as references. In conventional video coding, the blocks are encoded and decoded in the raster scanning order, which means that the reference blocks are on the left or top side. In our case, the order depends on the requested viewport and therefore the decoder should be able to handle any possible order. First, this decoding order influences which blocks are used as references (also called context or causal information). Second, given these references, new prediction functions have to be defined.

The blocks that can be used as references depend on the decoding order. We categorize them according to the number
Alg. 1. Algorithm for placing access blocks
Input:
  θ = (θ₁, θ₂); θ₁: latitude, θ₁ ∈ [−π/2, π/2]; θ₂: longitude, θ₂ ∈ [−π, π)
  ∆θ₁, ∆θ₂: predefined steps
  center(θ): returns the index of the block in m(Γ) which contains the center of γ(θ)
Output: A
Initialization: A = ∅
for θ₁ = −π/2 to π/2 do
  for θ₂ = −π to π do
    θ = (θ₁, θ₂)
    if ∄ k ∈ A s.t. X_k ∈ (m ∘ γ)(θ) then
      A ← A ∪ {center(θ)}
    end if
    θ₂ ← θ₂ + ∆θ₂
  end for
  θ₁ ← θ₁ + ∆θ₁
end for
return A

(a) (b)
Fig. 7. Access block placement. (a) Fixed-based approach, in which access blocks (in red) are selected without taking the content into account. (b) Content-based approach, which takes the content into account to minimize the storage.

of neighboring blocks available at the decoder. This leads to 3 types, as illustrated in Fig. 8:
• Type 1: Only one of the adjacent blocks has been decoded.
• Type 2: A vertical and a horizontal border at the boundaries are available to perform the prediction. This means that a block at the top or bottom and a block at the right or left side of the current block have already been decoded.
• Type 3: This type is similar to type 2 except that, in addition to the vertical and horizontal borders, the corner block between them has also been decoded.
For each type, there exist 4 states, which leads to 12 possible contexts in total. Each context gives rise to a possible prediction. Prediction functions are deduced from the prediction modes of conventional video coding standards by rotations of multiples of 90◦. Examples of type 3 SI predictions are shown in Fig. 8d with their corresponding prediction modes.

In general, if there is no limitation on the navigation, all of these predictions are generated and used to encode the block X. For the blocks which are on the top and bottom rows of the equirectangular image, since navigation cannot come from the top or bottom side respectively, the non-available side information is removed from the possible set.
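The 12 contexts described above can be enumerated mechanically; the short sketch below (the naming convention is ours, not the paper's) derives them from the set of decoded neighbors.

```python
from itertools import product

SIDES_V = ("top", "bottom")   # vertical position of a decoded neighbor
SIDES_H = ("left", "right")   # horizontal position of a decoded neighbor

# Type 1: exactly one adjacent block has been decoded (4 states).
type1 = [(s,) for s in SIDES_V + SIDES_H]

# Type 2: one top/bottom neighbor and one left/right neighbor have been
# decoded, providing a horizontal and a vertical border (4 states).
type2 = [(v, h) for v, h in product(SIDES_V, SIDES_H)]

# Type 3: as type 2, but the corner block between the two decoded
# neighbors is also available (4 states).
type3 = [(v, h, f"{v}-{h} corner") for v, h in product(SIDES_V, SIDES_H)]

contexts = type1 + type2 + type3
assert len(contexts) == 12   # 3 types x 4 states
```

Each of these contexts maps to one prediction Y_j of the block, so at most L = 12 side informations have to be considered per block (fewer at the top/bottom rows, where some contexts are impossible).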
By contrast, the blocks which are on the left and right boundaries of the equirectangular image are adjacent to each other once wrapped onto the surface of the sphere, so they can be used as references to predict the opposite side. For example, for a block which is located at the rightmost part of the equirectangular image, the leftmost block of the same row can be used as a reference for its prediction, and vice versa.

(a) Type 1 SI (b) Type 2 SI (c) Type 3 SI (d) Intra prediction example
Fig. 8. Set of possible SI per block. (a-c) Reference pixels (in green) used to predict the block X for each prediction type. The prediction serves as SI for the entropy coder. The available neighboring blocks are depicted in dark gray. (d) Intra-prediction examples for type 3.

The predictions for a block X are denoted by Y_1, Y_2, …, Y_L in the following, and each prediction corresponds to one possible SI. In order to be consistent at both encoder and decoder, and to avoid propagation of quantization errors, the predictions are generated from the lossy (quantized) versions of the neighboring blocks.

D. Practical implementation of the incremental source code
A practical implementation of incremental codes can be achieved with rate-adaptive channel codes such as rate-adaptive Low-Density Parity-Check (RA-LDPC) codes. Channel codes have been used in other interactive coding schemes such as stream-switching [18], [19], where distributed source coding (DSC) is used to merge multiple SI frames. More precisely, and similarly to our context, a set of SI is computed at the encoder and only one of them is available at the decoder. However, there the current block is encoded by exploiting the worst-case correlation between the set of potential SI frames and the target frame. Here, thanks to the theoretical results obtained in [7]–[9] for the incremental code, we propose to store the data w.r.t. the worst-correlated prediction (or equivalently SI), but to transmit optimally by extracting only the amount of data required to decode based on what is available at the decoder.

The proposed incremental coding scheme for encoding and decoding a block X is depicted in Fig. 9. The input block X is first transformed and quantized, resulting in a signal X̄ that is split into several bitplanes X̄^1, …, X̄^P. We now assume that the predictions Y_1, …, Y_L are sorted from the best to the worst, i.e., Y_1 is more correlated with X than Y_L. These predictions are also transformed and quantized, resulting in Ȳ_1, …, Ȳ_L. Note that, by doing so, encoder and decoder are consistent, which avoids the propagation of quantization errors [17, Sec. 11.3].

Each bitplane X̄^p is encoded thanks to a binary RA-LDPC encoder, with Ȳ_1, …, Ȳ_L as SI sequences. It is worth noting that, since the correlations between the source and its SI predictions are known in advance at the encoder, the symbol-based (Q-ary) and binary-based RA-LDPC (with bitplane extraction) perform equally well [20], but binary-based RA-LDPC is less complex than its symbol-based version. This generates L bitstreams (b^p_1, …, b^p_L) that are stored on the server.
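To see numerically why storing for the worst SI suffices, consider a toy binary model (our illustration, not the paper's setup): a bitplane X̄^p and an SI Ȳ_j differ through a binary symmetric channel with crossover probability p_j, so the ideal extracted rate is the binary entropy h(p_j).

```python
from math import log2

def h(p):
    """Binary entropy in bits; ideal rate H(X|Y) for a BSC with crossover p."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

# Hypothetical crossover probabilities of the SIs, best to worst correlated.
crossovers = [0.01, 0.05, 0.11, 0.25]
rates = [h(p) for p in crossovers]   # ideal rate H(X|Y_j) per SI, increasing

stored = max(rates)                  # one nested bitstream, sized for worst SI
exhaustive = sum(rates)              # naive alternative: one bitstream per SI

# Incremental code: if SI j is available, extract a prefix of the stored
# bitstream at rate H(X|Y_j) -- no penalty w.r.t. knowing the SI in advance.
extracted = rates[1]                 # e.g. the SI with p = 0.05 is available
assert extracted <= stored < exhaustive
```

With many possible SI, the gap between `stored` and `exhaustive` widens, which is exactly the max_j H(X̄|Ȳ_j) ≪ Σ_j H(X̄|Ȳ_j) argument of Section III.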
For the decoding of X, if the prediction Y_j is available at the decoder, the bitstreams that need to be transmitted are (b^p_1, …, b^p_j). Notice that a bitstream is not dedicated to one particular prediction only, but can serve the decoding of several predictions. This is the reason why the storage cost remains limited. Indeed, the stored bitstream does not depend on the number of possible SI L, but only on the correlation of the source X with the worst SI.

For the access blocks, an additional variable Y_∅ (a source of zero entropy) is considered as a possible SI. Then, the encoding process is identical, which means that on top of (b^p_1, …, b^p_L), a bitstream b^p_∅ is added to complement the decoding in the case where the block has to be decoded independently. Note that the bitstream b^p_∅ serves as a complement of the other bitstreams in case of independent decoding, which means that Σ_{j=1}^{L} |b^p_j| + |b^p_∅| = H(X̄^p).

E. Decoding order and navigation
Blocks are decoded one after the other until all blocks in the requested area are processed. The choice of the decoding order is important since it impacts the quality of the predictions that are made, and therefore the transmission rate. More formally, let J(θ) denote the set of blocks displayed by an HMD pointing in the direction θ, i.e., J(θ) = {j | X_j ∈ (m ∘ γ)(θ)}.

Fig. 9. Lossy coding scheme of an image block X as a function of its predictions Y_1, …, Y_L.

We seek a decoding order, i.e., a permutation τ of this set, such that the transmission rate is minimized, as mentioned in (9b); τ(j) stands for the position of j in the new arrangement. The difficulty of this problem is that the rate to encode a block from another block depends on the number of neighbors and therefore on the decoding order. Hence, the cost of the transition between two blocks is not fixed and the problem cannot be formulated as an optimization over a graph. So, to solve this difficult problem, we propose a greedy approach, called GreedyRate. At each iteration, given the set of already decoded blocks, the one-hop neighborhood around all decoded blocks is computed. This neighborhood corresponds to the blocks that can be decoded next. Among this set, the block with minimum rate is chosen. However,
GreedyRate needs to know the encoding rates of each block for each possible SI. Therefore, this algorithm can only be implemented at the encoder, and the decoding order needs to be sent to the decoder. To avoid sending the decoding order and further save bandwidth, we replace the rate in (9b) by a heuristic, which can be computed at both encoder and decoder. Indeed, the quality of the prediction, which impacts the transmission rate, increases with the number of neighboring blocks that serve as references (Type 1, 2 or 3 in Fig. 8). Therefore, we propose to find a decoding order that maximizes the overall number of blocks used for predictions, i.e.,

$$\max_\tau \sum_{j \in \mathcal{J}} \left| \{ j' \mid \tau(j') < \tau(j),\, X_j \sim X_{j'} \} \right|, \qquad (12)$$

where $\sim$ means that two blocks are neighbors.

Moreover, the experiments conducted in Sec. V-B1 show that the correlation between horizontal blocks is higher than between vertical blocks, i.e., the prediction generated by the left or right block generates less distortion than the prediction generated from the top or bottom block. Therefore, we augment the criterion to favor horizontal predictions:

$$\max_\tau \sum_{j \in \mathcal{J}} \left( \left| \{ j' \mid \tau(j') < \tau(j),\, X_j \sim X_{j'} \} \right| + \mathbb{1}_E(j) \right), \qquad (13)$$

$$E = \{ j \mid \exists j',\, X_{j'} \sim X_j,\, X_{j'} \in \text{same line as } X_j \},$$

where $\mathbb{1}_E(j)$ is the characteristic function of the set $E$. Note that if the image is rotated, then the highest correlation may not be between horizontal blocks but rather vertical ones, for instance. This direction of highest correlation can then easily be signaled with a single additional bit. We propose a greedy solution to (13), which is called GreedyCount. At each iteration, given the set of already decoded blocks, the one-hop neighborhood is computed. This neighborhood corresponds to the blocks that can be decoded next. Among this set, the blocks with the maximum number of neighbors are identified.
Among this subset, we select a block that has the maximum number of horizontal transitions.

To further reduce the complexity of the GreedyCount algorithm, we propose the SnakeLike algorithm, inspired by the one-scan connected component labeling technique [21], where we add a preference for horizontal transitions. More precisely, the input of the algorithm is the set of requested blocks and the output is the ordered sequence of these blocks. At each iteration, among the neighbors of the last selected block, a horizontal neighbor is chosen. If there is no horizontal neighbor (either because the horizontal neighbor is not requested or because it is already in the ordered list), then a vertical neighbor is chosen. If there is no neighbor at all, the previous process is repeated from the last selected block that has a neighbor. The proposed solution can start at any block position (not necessarily at a corner) and is adapted to any shape (not necessarily a rectangle). These latter adaptations are made possible through a recursive implementation of the scanning order. An example of the snake scan is shown with black arrows in Fig. 4(b).

Navigation within a 360° image is also handled by our interactive coder. Indeed, navigation consists in requesting a sequence of directions or, equivalently, a sequence of viewports. Therefore, for any newly requested viewport, the already decoded blocks of this viewport and its boundary are determined. One of these blocks is chosen as the starting point, and a decoding order is computed for the remaining non-decoded blocks. In conclusion, our scheme handles navigation efficiently by sending only the unknown blocks, and even avoids sending any block as an access block.

To summarize, three algorithms have been proposed to approach the difficult decoding order optimization problem (9b). GreedyRate is based on the rates and is thus content-based. The two other algorithms, GreedyCount and SnakeLike, use a heuristic and do not depend on the image.
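As an illustration, the SnakeLike scan can be sketched as follows (a simplified iterative version with backtracking; the paper's recursive implementation may differ in details):

```python
def snake_like(requested, start):
    """Order a set of requested blocks (given as (row, col) grid positions)
    with a preference for horizontal transitions; a vertical step is taken
    only when no horizontal neighbor is available, and we backtrack to the
    last ordered block that still has an unvisited neighbor."""
    order, seen = [start], {start}
    horiz = [(0, 1), (0, -1)]
    vert = [(1, 0), (-1, 0)]
    stack = [start]
    while stack:
        r, c = stack[-1]
        nxt = None
        for dr, dc in horiz + vert:      # horizontal neighbors first
            cand = (r + dr, c + dc)
            if cand in requested and cand not in seen:
                nxt = cand
                break
        if nxt is None:
            stack.pop()                  # backtrack to an earlier block
        else:
            seen.add(nxt)
            order.append(nxt)
            stack.append(nxt)
    return order

# A 2x3 rectangular request, decoding started from its top-left block:
blocks = {(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)}
print(snake_like(blocks, (0, 0)))
# → [(0, 0), (0, 1), (0, 2), (1, 2), (1, 1), (1, 0)]
```

Because the scan only follows neighbors within the requested set and backtracks when stuck, it works for any request shape and any starting block, as stated above.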
V. EXPERIMENTAL RESULTS
For the experiments, we selected 8 large equirectangular images from the Nantes University dataset [22]. We have named the images for the sake of clarity; these images and their corresponding names are shown in Fig. 10. In this dataset, each 360° image has been viewed by several observers. The head movements have been recorded as trajectories on the unit sphere. For each sampled head position, the longitude and latitude values of the center of the viewport are given. The collected head positions have been down-sampled over sequential windows of 200 msec. The head movement data can also be predicted using salient areas [23]. For our experiment, the image is divided into square blocks of fixed size. The rate-adaptive channel code used in our scheme is the RA-LDPC [24].

In the next three subsections, we present different experiments to validate the choices made in the design of our coder. After these ablation studies, we evaluate our coder against different baselines. For that, we consider two coding schemes as baselines, namely the tile-based scheme and exhaustive storage.

Different tile structures are used by partitioning the equirectangular image into 7×7, 7×5, 5×5, 5×4, 4×4, and 2×2 non-overlapping tiles. We denote them by "T. m×n", where m and n are the numbers of vertical and horizontal tiles, respectively. Inspired by [25], [26], we also consider an irregular tiling structure as another baseline, in which the top 25% of the image is partitioned into one tile, the bottom 25% into another, and the middle 50% is divided into four equal-area tiles. We denote this baseline by "T. opt structure". We also consider the case where the whole equirectangular image is sent in the first request (T. 1×1).

The exhaustive storage (ES) approach does not really exist in practice, but it illustrates the intuitions of some baselines adopted in other contexts [3], [27]–[30]. As in our coding scheme, the ES scheme considers all the predictions for each block, but stores a residue for each of them.
It is thus well adapted for transmitting at the oracle transmission rate R∗, but has a large storage expense.

The storage cost is computed for each image and each baseline. The transmission rate and distortion are computed per user and per request while users are navigating in the scene using the recorded head positions. As explained in (4), the distortion is computed at each viewport of a user's request [11].

A. Ablation study: access block

1) Access block position does not impact transmission rate:
In this experiment, we verify that the transmission rate does not depend on the location of the access blocks. In other words, changing the position of the access block inside the requested area, from which the decoding of the requested blocks starts, does not impact the transmission rate. For that, we take the first request of every user in all images, and we change the position of the access blocks inside the requested area. Note that in this scenario, we assume that every requested block can be an access block, so this is without optimizing the position of the access blocks, which we investigate in the next ablation study (Access block positioning). We choose K = 100 different positions for each request, and then start the decoding process from the chosen access block. We then compute the ratio between the Maximum Absolute Deviation of the transmission rates and the average transmission rate for two different quantization parameters (QP = 22, 42). More precisely, for each QP and each user, if we denote the set of transmission rates of the $K$ experiments by $\mathcal{R}_A = \{r_1, r_2, \dots, r_K\}$, we compute

$$\frac{\max\left(\{ |r_i - \bar{r}| : i = 1, \dots, K \}\right)}{\bar{r}}, \quad \text{where } \bar{r} = \frac{1}{K} \sum_{i=1}^{K} r_i. \qquad (14)$$

The results are shown in Table I. It can be seen that in all cases, the maximum absolute deviation of the transmission rate is negligible with respect to the average (at most 1.1e-03 of it). Therefore, we can conclude that the transmission rate is independent of the position of the access block.

TABLE I
THE IMPACT OF ACCESS BLOCK POSITION ON THE TRANSMISSION RATE. THE AVERAGE, MAXIMUM, AND MINIMUM VALUES OF (14) OVER ALL USERS ARE COMPUTED FOR EACH QP.

           Average   Maximum   Minimum
Market     3.7e-04   1.1e-03   1.5e-04
Street     4.6e-04   7.3e-04   1.6e-04
Mountain   2.1e-04   5.7e-04   1.1e-04
Church     3.4e-04   7.8e-04   1.9e-04
Seashore   1.3e-04   3.0e-04   5.7e-05
Park       3.9e-04   8.5e-04   2.0e-04
Jacuzzi    5.5e-04   9.8e-04   2.7e-04
Cafe       5.1e-04   8.7e-04   2.5e-04

TABLE II
STORAGE COST OF CONTENT-BASED AND FIXED-BASED ACCESS BLOCK POSITIONING. THE RESULTS ARE SHOWN IN KILOBYTE (KB) AND FOR TWO DIFFERENT QPS.

           QP=22                      QP=42
           Content-Based   Fixed      Content-Based   Fixed
Market     13875.44
Street     13768.33
Mountain   11833.89
Church     13102.20
Seashore   13070.45
Park        8597.98
Jacuzzi     7453.13
Cafe        8243.73
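For reference, the deviation measure in (14) is straightforward to compute; the rates below are hypothetical values for a single user and QP:

```python
def mad_ratio(rates):
    """Ratio between the maximum absolute deviation of the transmission
    rates and their average, as in (14)."""
    r_bar = sum(rates) / len(rates)
    return max(abs(r - r_bar) for r in rates) / r_bar

# Rates measured when starting decoding from K different access blocks:
# a near-zero ratio means the starting position does not matter.
print(mad_ratio([1002.0, 998.0, 1000.0, 1001.0]))
```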
2) Access block positioning:
We compare the storage cost of the two algorithms proposed for access block positioning (see Section IV-B). Results are shown in Table II for two QP values. The Content-based approach depends on the content, and therefore the position of the access blocks needs to be signaled. As can be seen from the table, even though the Content-based approach provides lower storage, when we add the signalization cost, the overall storage cost exceeds that of the Fixed-based approach. Therefore, the Fixed access block positioning is sufficient for our purpose, and it has the additional advantage of being identical for all images with the same resolution. For the rest of this paper, we consider the Fixed-based approach for our access block positioning.

Fig. 10. Images used in our experiment and their corresponding names: (a) Market, (b) Street, (c) Mountain, (d) Church, (e) Seashore, (f) Park, (g) Jacuzzi, (h) Cafe.

TABLE III
PERCENTAGE OF BLOCKS FOR WHICH THE AVERAGE COMPRESSION RATE OF THE HORIZONTAL PREDICTIONS (PREDICTIONS COMING FROM THE LEFT OR RIGHT SIDE) IS BETTER THAN THAT OF THE VERTICAL PREDICTIONS (PREDICTIONS COMING FROM THE TOP OR BOTTOM SIDES), AND VICE VERSA. THE REMAINING PERCENTAGE CORRESPONDS TO THE CASE WHERE BOTH ARE EQUAL. THE RESULTS ARE FOR QP=22.

           Horizontal is better (in %)   Vertical is better (in %)
Market
B. Ablation study: decoding order

1) Horizontal/vertical prediction:
To confirm the validity of (13), it is sufficient to compare the compression rate of a block when only one side of the block is available as a reference for prediction, i.e., to compare the compression rates between the predictions coming from the Type 1 SI of Fig. 8. For every block of all images, we measure the average compression rate of the block when the prediction comes from the left or right side (horizontal configuration) and compare it with the average compression rate when the prediction comes from the top or bottom side of the block (vertical configuration). Table III shows the percentage of blocks for which the average compression rate obtained from the right and left predictions is lower than the average compression rate obtained from the top and bottom predictions when QP=22. Similar behavior is observed for other QP values. It can be seen that, for all images, more than 50% of the blocks have lower compression rates when the predictions come from the left or right sides. This clearly shows that for the decoding order scanning, it is preferable to favor horizontal predictions over vertical ones when the camera orientation is horizontal, as is the case in our test experiments.
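The comparison described above can be sketched as follows (the per-block rates and the field names `left`, `right`, `top`, `bottom` are illustrative, not the paper's data structures):

```python
def horizontal_preference(rates_per_block):
    """Given, for each block, the coding rates when predicted from the
    left, right, top, and bottom neighbors (Type 1 SI), return the
    percentage of blocks for which the average horizontal rate is lower
    than the average vertical rate, and vice versa."""
    horiz_better = vert_better = 0
    for r in rates_per_block:
        h = (r["left"] + r["right"]) / 2
        v = (r["top"] + r["bottom"]) / 2
        if h < v:
            horiz_better += 1
        elif v < h:
            vert_better += 1
    n = len(rates_per_block)
    return 100 * horiz_better / n, 100 * vert_better / n

# Hypothetical per-block rates (in bits); the third block is a tie.
blocks = [
    {"left": 90, "right": 95, "top": 110, "bottom": 105},
    {"left": 80, "right": 82, "top": 79, "bottom": 81},
    {"left": 100, "right": 100, "top": 100, "bottom": 100},
]
print(horizontal_preference(blocks))
```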
2) Performance of different decoding order algorithms:
TABLE IV
COMPARISON BETWEEN THE DIFFERENT DECODING ORDERS PRESENTED IN SECTION IV-E. THE SHOWN RESULTS ARE THE TRANSMISSION RATE IN KILOBYTE (KB) AVERAGED OVER ALL USERS' REQUESTS FOR QP=22.

           GreedyRate   SnakeLike   GreedyCount
Market     1228.71      1217.31
Street     1453.37      1443.25
Mountain   1192.35      1186.69
Church     1307.32      1299.77
Seashore   1750.49      1745.28
Park        789.16
Cafe        919.89       913.44

We conduct an experiment to analyze the performance of the three different decoding orders presented in Section IV-E. For that, we compute the transmission rate for the first request of all users. The transmission rates averaged over all users are presented in Table IV. In the GreedyRate algorithm, we need to signal the decoding order. Therefore, although it tries to optimize the transmission rate, the addition of the decoding order signalization results in a drop in its transmission rate performance. Interestingly, this drop in performance is not only due to signalization: this greedy algorithm cannot always find the lowest transmission rate compared to the other two methods. In fact, even if we do not consider the signalization cost of the greedy algorithm, GreedyRate achieves a lower transmission rate than our SnakeLike approach in only 62% of the cases, and for the remaining 38% of the cases, the SnakeLike algorithm performs better. This clearly shows the complexity of this optimization problem. Furthermore, if we consider the signalization cost, the GreedyRate approach always performs worse. Regarding the comparison between the SnakeLike and GreedyCount approaches, we can see that, at the cost of a more complex algorithm, GreedyCount performs slightly better than SnakeLike. However, this gain is negligible. Therefore, for the rest of this paper, we consider the SnakeLike algorithm for our evaluations.
C. Achievability of the oracle transmission rate
We recall that the aim of the paper is to propose a coder that is able to achieve the oracle transmission rate R∗, obtained if the user head movement were known at the encoder. For that purpose, we compute the accumulated transmission rate for one user during successive requests, i.e.,

$$R(\theta^{T'}, u) = \sum_{t=1}^{T'} r(\theta_t, u),$$

where $r(\theta_t, u)$ is the transmission rate of user $u$ with request $\theta_t$. We plot in Fig. 11 the evolution of the accumulated transmission rate at a given PSNR (computed at the users' viewports) for image Jacuzzi. Similar behaviors were observed for other images and other PSNR values.

Fig. 11. Accumulated transmission rate per request at iso-distortion PSNR for image Jacuzzi (the lower, the better). The distortion is computed at the viewport of the users. a) For one user. b) Averaged over all users' requests. "T. m×n" stands for the tile-based approach, where m and n are the numbers of vertical and horizontal tiles, and ES stands for exhaustive storage.

We can see that, in theory, our proposed coder achieves the oracle transmission rate. In practice, the suboptimality of the adopted codes leaves a gap between the practical performance and the theoretical one. However, our implementation still obtains much better performance than the tile-based approaches, especially at the beginning of the request. Moreover, we can observe that our method has a behavior that is better suited to interactivity than the tile-based approach. Indeed, it has a smoother $R(\theta^{T'}, u)$ evolution during the user's head navigation, meaning that our scheme transmits only what is needed upon request.
By contrast, the tile-based approaches have a staircase behavior with big steps, meaning that there is a burst of rate at the moment when a new tile is needed (in which some blocks are useless).

Another way to analyze the performance of the interactive coder is by means of the usefulness of the transmitted data. In the following, the usefulness is defined as the proportion of displayed pixels among all the decoded ones for a request $\theta$:

$$U = \mathbb{E}_\theta(U_\theta), \quad \text{where } U_\theta = \frac{\#\,\text{displayed pixels for a request } \theta}{\#\,\text{decoded pixels for a request } \theta}. \qquad (15)$$

The closer this value is to 1, the closer the transmitted data are to what the user has requested, and vice versa. The usefulness averaged over all users is shown in Fig. 12 for image Jacuzzi. It confirms that our scheme sends more useful information at each request compared to the baselines.

In all these experiments, the ES scheme performs as well as our solution in terms of transmission rate. More exactly, it performs as well as our theoretical performance when using an optimal rate-adaptive code. In practice, the ES scheme uses an arithmetic coder that does not suffer from the sub-optimality of rate-adaptive implementations. However, the ES approach focuses on the transmission rate only, and this is why in the following we present an evaluation that includes the storage cost.
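As a toy illustration of (15), with hypothetical pixel sets for a single request, a block-level coder decodes a tight superset of the viewport while a coarse tile decodes many useless pixels:

```python
def usefulness(displayed, decoded):
    """Usefulness (15): fraction of decoded pixels actually displayed.
    `displayed` and `decoded` are sets of pixel coordinates."""
    return len(displayed & decoded) / len(decoded)

# A 10x30 viewport; block-level access decodes a slightly larger area,
# while one coarse 40x40 tile covers the viewport plus useless pixels.
viewport = {(r, c) for r in range(10) for c in range(30)}
block_decoded = {(r, c) for r in range(12) for c in range(32)}
tile_decoded = {(r, c) for r in range(40) for c in range(40)}
print(usefulness(viewport, block_decoded))  # close to 1
print(usefulness(viewport, tile_decoded))   # much lower
```

With tiles, every pixel of every intersecting tile is decoded, which drives $U_\theta$ down; block-level access keeps the decoded set close to the viewport.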
Fig. 12. Usefulness of the requested pixels per request, averaged over all users' requests, for image Jacuzzi. "T. m×n" stands for the tile-based approach, where m and n are the numbers of vertical and horizontal tiles, and ES stands for exhaustive storage.

D. Analysis of the Rate-storage trade-off
We now discuss the performance of the different coders in terms of rate and storage. For that, we use the weighted Bjontegaard metric introduced in [31]. In this metric, the weight λ balances the relative importance of the rate and the storage, as a function of the application:

$$\mathbb{E}_\theta(R_\theta) + \lambda S. \qquad (16)$$

In the following, we first study two extreme scenarios where either the rate or the storage is neglected. Then, we show results for more realistic scenarios where both matter.
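The trade-off in (16) can be sketched with hypothetical operating points: depending on λ, a different scheme minimizes the weighted cost (the numbers below are illustrative, not measured values).

```python
def weighted_cost(expected_rate, storage, lam):
    """Weighted objective (16): E_theta(R_theta) + lambda * S."""
    return expected_rate + lam * storage

# Hypothetical (expected rate, storage) operating points, arbitrary units:
# tiles transmit a lot but store little; ES transmits little but stores a
# residue per prediction; "ours" sits close to ES in rate at low storage.
schemes = {"tiles": (500.0, 10.0), "ES": (120.0, 11000.0), "ours": (150.0, 16.0)}
for lam in (0.0, 0.01, 100.0):   # from rate-only to storage-dominated
    best = min(schemes, key=lambda s: weighted_cost(*schemes[s], lam))
    print(lam, best)
```

This mirrors the discussion below: ES wins when storage is free, the proposed scheme wins when both matter, and tiling wins when storage dominates.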
1) RD and SD performances:
In Fig. 13, for queries of length 1 sec averaged over all users, we compare the performance when either the storage (λ = 0) or the transmission rate (λ = ∞) is neglected. We see that the proposed method with RA-LDPC codes performs better than the tiling approaches with only a small extra cost in storage. Compared to the ES approach, ES performs better on the transmission rate, essentially because rate-adaptive channel codes are not optimal for short-length sequences; but, as expected, the ES approach performs poorly in terms of storage. We also computed the SSIM value of the luma channel of the user's viewport. The same conclusion as for the storage and transmission rate-distortion trade-offs also holds for the SSIM value. The curves for transmission and storage versus SSIM are shown in Fig. 14. Similar behaviors are observed for other images as well.

The averaged bitrate differences in transmission rate and storage are summarized in Table V for different lengths of users' requests. The bitrate differences show the performance of each method relative to the no-tiling (T. 1×1) approach, i.e., when the whole image is stored and transmitted at once. Negative values indicate bitrate savings w.r.t. the no-tiling approach. To save space, intermediary tile sizes are removed from the table and only the important baselines are shown. It is observed that, while the performance of the RA-LDPC codes does not exactly reach the theoretical rates, they still perform better than the tile-based approaches in terms of transmission rate, even for a long duration of 2 secs of user navigation in a single image (it is worth noting that usually after almost 4 secs, users have viewed the whole image). In terms of storage performance, we observe that the proposed method performs significantly better than ES (the greater the positive value of the BD, the worse the performance). In other words, the proposed method loses 56.36% of storage w.r.t. the no-tiling approach, while exhaustive storage loses 1102.31%. If we look at the theoretical results of the proposed coder, we see that in theory this coder should only lose 6.14% in terms of storage and gain 89.72% in transmission rate for the first request of the user, which clearly shows that there is plenty of room for improvement in designing rate-adaptive channel codes.
2) Weighted BD:
The weighted BD is presented for more realistic scenarios, where both storage and transmission rate matter with a relative importance. Lower values of λ in (16) give more importance to the transmission rate; on the contrary, higher λ values give more importance to the storage. Weighted BD values of the different baselines are summarized in Table VI. In our experiments, for the tile-based and our approaches, the storage is some orders of magnitude larger than the transmission rate, and for exhaustive storage, the ratio of storage to transmission rate is larger still. Therefore, an intermediate value of λ gives the same importance to the transmission rate and the storage. It can be seen from Table VI that, for such an intermediate λ, the RA-LDPC coder performs better than the other baselines, which means that if the storage and transmission rate are of the same importance, the proposed method performs best. Decreasing λ puts the emphasis on the transmission rate, where exhaustive storage is more suitable, but it can be seen that even in this case the performance of our RA-LDPC coder is very close to that of the ES coder. Therefore, in terms of transmission rate, both our solution and the exhaustive storage approach have the same performance, but our coder has a much lower storage overhead. Finally, if the storage is more important than the transmission rate (large λ), the tiling approaches perform better.

VI. DISCUSSION
In the experimental section above, we have shown that it is possible to encode blocks once, taking into account the correlation with their neighbors, and still enable flexible extraction during the user's navigation within the 360° image. In particular, we have demonstrated that the tile-based approach is not suited for interactive compression because of its coarse access granularity: only whole tiles can be extracted. On the contrary, our coder enables a finer granularity (at the block level) while maintaining a similar storage cost. In Section VI-A, we discuss how our solution remains compatible with most of the optimization tools embedded in recent standards. In Section VI-B, we discuss how classical streaming methods could still work with, and even benefit from, our proposed solution. Indeed, our solution is complementary to, rather than a competitor of, these advanced tools.

A. Compatibility with video coding standards
As stated in the latest description of the new video coding standard [32], the improvements in VVC that bring large compression gains deal mostly with: (i) the partitioning, (ii) the intra prediction, (iii) the transform and quantization, and (iv) the entropy coding.
Block partitioning consists in optimizing the size of the blocks based on a rate-distortion criterion. A smooth region in the image is typically processed with a large block size, while textured and heterogeneous regions are split into small blocks. Whereas no block-size optimization has been considered in this paper (neither for the tile-based approaches nor for our coder), our solution could benefit from an optimized block partitioning exactly as the standard coders do. In particular, the very latest partitioning [32], which enables diagonal or non-equal block division, can be implemented similarly in our approach. All these advanced tools could be adapted to our method by replacing the classical rate with our storage and transmission rates.
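For illustration, the rate-distortion-driven split decision underlying such partitioning can be sketched as a recursive quadtree with a toy Lagrangian cost (this is a generic RDO sketch, not VVC's actual mode decision, and the cost function is hypothetical):

```python
def partition(x, y, size, cost, min_size=8):
    """Recursive quadtree split decision: keep the block whole when its
    Lagrangian cost J = D + lambda*R is lower than the summed cost of its
    four sub-blocks; `cost(x, y, size)` is assumed to return J."""
    whole = cost(x, y, size)
    if size <= min_size:
        return [(x, y, size)], whole
    h = size // 2
    leaves, split = [], 0.0
    for cx, cy in ((x, y), (x + h, y), (x, y + h), (x + h, y + h)):
        sub, c = partition(cx, cy, h, cost, min_size)
        leaves += sub
        split += c
    if split < whole:
        return leaves, split
    return [(x, y, size)], whole

# Toy cost: a textured top-left corner makes large blocks expensive there,
# so only that corner gets split while smooth regions stay coarse.
def toy_cost(x, y, size):
    textured = x < 16 and y < 16
    return size * size * (0.5 if textured else 0.03125)

leaves, _ = partition(0, 0, 64, toy_cost, min_size=16)
print(len(leaves))  # the textured corner splits, the rest stays whole
```

In our setting, the scalar `cost` would be replaced by the weighted storage and transmission objective of (16), as suggested above.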
Intra prediction has been intensively optimized in recent years, either by increasing the precision of the prediction (e.g., increasing the number of modes [32]) or by reducing the complexity of the mode search (e.g., most probable mode [33] or rough mode decision [34]). All these approaches can be implemented directly in our coder.
Fig. 13. Storage and transmission performance after 1 sec of users' requests for image Jacuzzi. a) Transmission rate-distortion curve. b) Storage-distortion curve. "T. m×n" stands for the tile-based approach, where m and n are the numbers of vertical and horizontal tiles, and ES stands for exhaustive storage.

Fig. 14. SSIM of the luma channel versus storage and transmission rate after 1 sec of users' requests for image Jacuzzi. a) Transmission rate versus SSIM curve. b) Storage versus SSIM curve. "T. m×n" stands for the tile-based approach, where m and n are the numbers of vertical and horizontal tiles, and ES stands for exhaustive storage.

TABLE V
BD MEASURES (IN PERCENT) AVERAGED OVER ALL USERS RELATIVE TO THE NO-TILING APPROACH (T. 1x1). BD-R (n) IS THE BD METRIC FOR TRANSMISSION RATE FOR REQUESTS OF LENGTH n. BD-S IS THE BD-STORAGE MEASURE.

                   Market   Street   Mountain  Church   Seashore  Park     Jacuzzi  Cafe     Average
BD-R, 1st request
T. 2x2             -50.66   -51.65   -49.74    -50.96   -46.56    -51.19   -53.13   -49.03   -50.36
T. 7x7             -79.77   -78.39   -78.25    -79.67   -74.96    -79.34   -80.34   -77.73   -78.56
ES                 -90.39   -89.36   -89.77    -89.59   -87.68    -90.22   -90.80   -88.93   -89.59
Ours theoretical   -90.52   -89.49   -89.91    -89.71   -87.81    -90.36   -90.94   -89.06   -89.72
Ours RA-LDPC       -85.80   -84.26   -84.89    -84.59   -81.74    -85.56   -86.43   -83.62   -84.61
BD-R, 1 sec
T. 2x2             -44.38   -39.25   -31.14    -48.04   -24.32    -44.52   -47.15   -39.73   -39.82
T. 7x7             -73.90   -73.39   -71.29    -73.39   -68.07    -73.07   -68.28   -72.32   -71.71
ES                 -87.29   -86.36   -84.38    -85.91   -82.60    -86.07   -83.47   -85.38   -85.18
Ours theoretical   -87.46   -86.52   -84.59    -86.08   -82.79    -86.27   -83.71   -85.56   -85.37
Ours RA-LDPC       -81.23   -79.82   -76.91    -79.15   -74.22    -79.45   -75.61   -78.38   -78.09
BD-R, 2 secs
T. 2x2             -28.41   -23.33   -9.40     -31.79   -10.63    -23.43   -28.95   -19.04   -21.87
T. 7x7             -69.91   -67.69   -64.30    -63.98   -59.89    -65.50   -57.19   -66.42   -64.36
ES                 -84.58   -82.16   -79.28    -78.50   -77.00    -80.93   -72.97   -81.17   -79.57
Ours theoretical   -84.78   -82.38   -79.55    -78.76   -77.26    -81.20   -73.36   -81.41   -79.84
Ours RA-LDPC       -77.21   -73.61   -69.38    -68.19   -65.94    -71.85   -60.12   -72.16   -69.81
BD-S
T. 2x2             0.01     0.02     -0.00     0.01     0.01      0.02     0.02     0.03     0.01
T. 7x7             0.08     0.07     0.02      0.08     0.03      0.06     0.06     0.09     0.06
ES                 1110.57  1105.01  1096.83   1103.01  1095.92   1106.26  1096.71  1104.18  1102.31
Ours theoretical   7.14     6.02     5.36      6.11     4.99      7.06     5.71     6.72     6.14
Ours RA-LDPC       57.86    56.35    55.15     56.48    54.75     57.55    55.58    57.14    56.36

TABLE VI
WEIGHTED BD FOR REQUESTS OF LENGTH … SEC AVERAGED OVER ALL USERS RELATIVE TO THE NO-TILING APPROACH (T. 1x1).

                   Market   Street   Mountain  Church   Seashore  Park     Jacuzzi  Cafe     Average
λ = 0.…
T. 2x2             -27.73   -24.53   -19.46    -30.02   -15.20    -27.82   -29.46   -24.82   -24.88
T. 7x7             -46.16   -45.84   -44.55    -45.84   -42.52    -45.65   -42.65   -45.16   -44.80
ES                 362.05   360.38   358.54    359.84   359.48    361.02   359.02   360.72   360.13
RA-LDPC            -29.05   -28.76   -27.39    -28.29   -25.82    -28.08   -26.42   -27.55   -27.67
λ = 0.…
T. 2x2             -41.87   -37.03   -29.37    -45.32   -22.95    -42.00   -44.48   -37.48   -37.56
T. 7x7             -69.71   -69.23   -67.26    -69.23   -64.21    -68.93   -64.41   -68.22   -67.65
ES                 -19.47   -18.93   -17.53    -18.62   -15.85    -18.59   -16.68   -18.04   -17.96
RA-LDPC            -73.35   -72.11   -69.44    -71.47   -66.90    -71.69   -68.18   -70.70   -70.48
λ = 1e−…
T. 2x2             -44.11   -39.02   -30.95    -47.75   -24.18    -44.26   -46.87   -39.50   -39.58
T. 7x7             -73.46   -72.95   -70.87    -72.95   -67.67    -72.64   -67.87   -71.89   -71.29
ES                 -80.15   -79.26   -77.34    -78.82   -75.56    -78.96   -76.43   -78.28   -78.10
RA-LDPC            -80.40   -79.00   -76.13    -78.34   -73.45    -78.63   -74.83   -77.57   -77.29

Transform and quantization have also been optimized in recent video standards (e.g., with the choice of new transforms and dependent quantization parameters (QPs) [32]). Even though most of these new tools can be introduced in our coder, it is important to note an important difference between our solution and classical coders. While in classical coders the signal that is transformed and quantized is the residue (i.e., the difference between the prediction and the true block), the transformation and quantization operations of our coder are applied on the original block and the prediction themselves, to avoid the propagation of quantization errors (see Section IV-D).
Entropy coding is definitely the major difference between our coder and the standard. Indeed, the traditional arithmetic coder implemented in video standards is replaced by a rate-adaptive channel coder. Therefore, the improvements brought to the arithmetic coder are not directly transferable to our solution. However, a solution similar to context-based adaptive binary arithmetic coding (CABAC) [35], which takes the context into account in the compression procedure, exists for channel codes [36], and can be considered in future work.
B. Streaming
As stated in [37], solutions for 360° image streaming are divided into two categories. The first one consists in storing different versions of the 360° content, each of them covering the entire sphere, but with heterogeneous resolutions over the directions (using, for example, the pyramidal representation [3]). In other words, the server stores different orientations that may be requested by a user and transmits the one that is closest to the requested view. The low-resolution portion of the sphere is used in case the head movement is too fast. The second approach consists in partitioning the 360° picture into different tiles, each of them being coded multiple times at different bitrates, as is done for classical 2D videos [38]. For example, a quality-emphasized region approach is used to stream the proper tile at a proper quality given a user's head movement and the available bandwidth [14], [39].

Even though the streaming of our coded bit-stream has not been tackled in this paper, the proposed coder is fully compatible with both approaches. Indeed, the blocks can be stored at different resolutions or different QPs. Our coder could even improve the aforementioned methods since, instead of storing different versions independently, one could take their correlation into account by simply adding the change of resolution/QP as a new side information. For that, we can benefit from saliency-based prediction methods such as [40] to determine important viewports and store the blocks in these regions at higher bitrates, and the blocks of less important regions at lower bitrates.

VII. CONCLUSION
We proposed a new coding scheme for the interactive compression of 360-degree images based on incremental rate-adaptive channel codes. Experimental results show that our scheme balances the trade-off between transmission rate and storage. In addition, the usefulness of the transmitted blocks shows that the proposed scheme adapts better to the queries of the users, and that the transmission rate increases gradually with the duration of the request. This avoids the staircase effect in the transmission rate that occurs in tile-based approaches and makes the coder perfectly suited for interactive transmission.

VIII. ACKNOWLEDGMENT
The authors would like to thank Laurent Guillo and Olivier Le Meur for their helpful advice on various technical issues related to this paper, and the reviewers for their careful, constructive and insightful comments on this work.

REFERENCES

[1] J. Snyder, Flattening the Earth: Two Thousand Years of Map Projections. University of Chicago Press, 1993. [Online]. Available: https://books.google.fr/books?id=0UzjTJ4w9yEC
[2] K.-T. Ng, S.-C. Chan, and H.-Y. Shum, "Data compression and transmission aspects of panoramic videos," IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, no. 1, pp. 82–95, Jan 2005.
[3] E. Kuzyakov and D. Pio, "Next-generation video encoding techniques for 360 video and VR," 2016. [Online]. Available: https://code.fb.com/virtual-reality/next-generation-video-encoding-techniques-for-360-video-and-vr/
[4] C. Fu, L. Wan, T. Wong, and C. Leung, "The rhombic dodecahedron map: An efficient scheme for encoding panoramic video," IEEE Transactions on Multimedia, vol. 11, no. 4, pp. 634–644, June 2009.
[5] A. Zare, A. Aminlou, M. M. Hannuksela, and M. Gabbouj, "HEVC-compliant tile-based streaming of panoramic video for virtual reality applications," in Proceedings of the 24th ACM International Conference on Multimedia, 2016, pp. 601–605.
[6] F. Qian, L. Ji, B. Han, and V. Gopalakrishnan, "Optimizing 360 video delivery over cellular networks," in Proceedings of the 5th Workshop on All Things Cellular: Operations, Applications and Challenges, 2016, pp. 1–6.
[7] A. Roumy and T. Maugey, "Universal lossless coding with random user access: The cost of interactivity," in , Sep. 2015, pp. 1870–1874.
[8] E. Dupraz, A. Roumy, T. Maugey, and M. Kieffer, "Rate-storage regions for extractable source coding with side information," Physical Communication, vol. 37, p. 100845, 2019.
[9] T. Maugey, A. Roumy, E. Dupraz, and M. Kieffer, "Incremental coding for extractable compression in the context of massive random access," IEEE Transactions on Signal and Information Processing over Networks, vol. 6, pp. 251–260, 2020.
[10] R. G. d. A. Azevedo, N. Birkbeck, F. De Simone, I. Janatra, B. Adsumilli, and P. Frossard, "Visual distortions in 360-degree videos," IEEE Transactions on Circuits and Systems for Video Technology, pp. 1–1, 2019.
[11] M. Yu, H. Lakshman, and B. Girod, "A framework to evaluate omnidirectional video coding schemes," in , Sep. 2015, pp. 31–36.
[12] M. Hosseini and V. Swaminathan, "Adaptive 360 VR video streaming: Divide and conquer," in , Dec 2016, pp. 107–110.
[13] S. Rossi and L. Toni, "Navigation-aware adaptive streaming strategies for omnidirectional video," in , Oct 2017, pp. 1–6.
[14] X. Corbillon, G. Simon, A. Devlic, and J. Chakareski, "Viewport-adaptive navigable 360-degree video delivery," in , May 2017, pp. 1–7.
[15] M. Wien, High Efficiency Video Coding: Coding Tools and Specification, ser. Signals and Communication Technology. Springer Berlin Heidelberg, 2014.
[16] T. Cover and J. Thomas, Elements of Information Theory, Second Edition. Wiley, 2006.
[17] K. Sayood,
Introduction to Data Compression , Elsevier, Ed. MorganKaufmann Publishers Inc., 2017.[18] N. Cheung, A. Ortega, and G. Cheung, “Distributed source codingtechniques for interactive multiview video streaming,” in , May 2009, pp. 1–4.[19] N.-M. Cheung and A. Ortega, “Compression algorithms for flexiblevideo decoding,” in
IS&T/SPIE Visual Communications and ImageProcessing (VCIP08) , Jan. 2008.[20] R. P. Westerlaken, S. Borchert, R. K. Gunnewiek, and R. L. Lagendijk,“Analyzing symbol and bit plane-based ldpc in distributed video coding,”in , vol. 2,Sep. 2007, pp. II – 17–II – 20.[21] A. AbuBaker, R. Qahwaji, S. Ipson, and M. Saleh, “One scan connectedcomponent labeling technique,” in , Nov 2007, pp. 1283–1286.[22] Y. Rai, J. Guti´errez, and P. Le Callet, “A dataset of head and eyemovements for 360 degree images,” in
Proceedings of the 8th ACM onMultimedia Systems Conference , ser. MMSys17. New York, NY, USA:Association for Computing Machinery, 2017, p. 205210. [Online].Available: https://doi.org/10.1145/3083187.3083218[23] Y. Zhu, G. Zhai, and X. Min, “The prediction of head andeye movement for 360 degree images,”
Signal Processing: ImageCommunication
IEEE Transactions on Communications , pp. 1–1, 2019.[25] M. Hosseini, “View-aware tile-based adaptations in 360 virtual realityvideo streaming,” in , March 2017, pp.423–424.[26] C. Ozcinar, J. Cabrera, and A. Smolic, “Visual attention-aware omnidi-rectional video streaming using optimal tiles for virtual reality,”
IEEEJournal on Emerging and Selected Topics in Circuits and Systems , vol. 9,no. 1, pp. 217–230, March 2019.[27] B. Motz, G. Cheung, and A. Ortega, “Redundant frame structureusing m-frame for interactive light field streaming,” in , Sep. 2016, pp.1369–1373.[28] W. Dai, G. Cheung, N. Cheung, A. Ortega, and O. C. Au, “Rate-distortion optimized merge frame using piecewise constant functions,”in , Sep. 2013,pp. 1787–1791.[29] G. Cheung, A. Ortega, and N. Cheung, “Interactive streaming of storedmultiview video using redundant frame structures,”
IEEE Transactionson Image Processing , vol. 20, no. 3, pp. 744–761, March 2011. [30] M. Karczewicz and R. Kurceren, “The SP- and SI-frames design forH.264/AVC,”
IEEE Transactions on Circuits and Systems for VideoTechnology , vol. 13, no. 7, pp. 637–644, July 2003.[31] N. M. Bidgoli, T. Maugey, and A. Roumy, “Evaluation framework for360-degree visual content compression with user-dependent transmis-sion,” in , Sep. 2019.[32] J. Chen, Y. Ye, and S. Kim, “Algorithm description for versatile videocoding and test model 5 (vtm 5),”
JVET-N1002, Joint Video ExplorationTeam (JVET) , 2019.[33] J. Tariq and S. Kwong, “Efficient intra and most probable mode(mpm) selection based on statistical texture features,” in . IEEE,2015, pp. 1776–1781.[34] K. Saurty, P. C. Catherine, and K. Soyjaudah, “A modified rough modedecision process for fast high efficiency video coding (hevc) intra pre-diction,” in . IEEE,2015, pp. 143–148.[35] D. Marpe, H. Schwarz, and T. Wiegand, “Context-based adaptive binaryarithmetic coding in the H.264/AVC video compression standard,”
IEEETransactions on Circuits and Systems for Video Technology , vol. 13,no. 7, pp. 620–636, Jul. 2003.[36] V. Toto-Zarasoa, A. Roumy, and C. Guillemot, “Source modelingfor Distributed Video Coding,”
IEEE Trans. on Circuits and Systemsfor Video Technology , vol. 22, no. 2, Feb. 2012. [Online]. Available:http://hal.inria.fr/inria-00632708/en/[37] M. M. Hannuksela, Y.-K. Wang, and A. Hourunranta, “An overview ofthe omaf standard for 360 video,” in . IEEE, 2019, pp. 418–427.[38] J. Le Feuvre and C. Concolato, “Tiled-based adaptive streaming usingmpeg-dash,” in
Proceedings of the 7th International Conference onMultimedia Systems , 2016, pp. 1–3.[39] S.-C. Yen, C.-L. Fan, and C.-H. Hsu, “Streaming 360 videos to head-mounted virtual reality using dash over quic transport protocol,” in
Proceedings of the 24th ACM Workshop on Packet Video , 2019, pp.7–12.[40] Y. Zhu, G. Zhai, X. Min, and J. Zhou, “The prediction of saliency mapfor head and eye movements in 360 degree images,”