Traffic and Statistical Multiplexing Characterization of 3D Video Representation Formats (Extended Version)
Akshay Pulipaka, Patrick Seeling, Martin Reisslein, Lina J. Karam
Abstract: The network transport of 3D video, which contains two views of a video scene, poses significant challenges due to the increased video data compared to conventional single-view video. Addressing these challenges requires a thorough understanding of the traffic and multiplexing characteristics of the different representation formats of 3D video. We examine the average bitrate-distortion (RD) and bitrate variability-distortion (VD) characteristics of three main representation formats. Specifically, we compare multiview video (MV) representation and encoding, frame sequential (FS) representation, and side-by-side (SBS) representation, whereby conventional single-view encoding is employed for the FS and SBS representations. Our results for long 3D videos in full HD format indicate that the MV representation and encoding achieves the highest RD efficiency, while exhibiting the highest bitrate variabilities. We examine the impact of these bitrate variabilities on network transport through extensive statistical multiplexing simulations. We find that when multiplexing a small number of streams, the MV and FS representations require the same bandwidth. However, when multiplexing a large number of streams or smoothing traffic, the MV representation and encoding reduces the bandwidth requirement relative to the FS representation.
I. INTRODUCTION
Multiview video provides several views taken from different perspectives, whereby each view consists of a sequence of video frames (pictures). Multiview video with two marginally different views of a given scene can be displayed to give viewers the perception of depth and is therefore commonly referred to as three-dimensional (3D) video or stereoscopic video [2]–[6]; for brevity we use the term "3D video" throughout. Providing 3D video services over transport networks requires efficient video compression (coding) techniques and transport mechanisms to accommodate the large volume of video data from the two views on bandwidth limited transmission links. While efficient coding techniques for multiview video have been researched extensively in recent years [7], [8], the network transport of encoded 3D video is largely an open research problem.
Technical Report, School of Electrical, Computer, and Energy Eng., Arizona State Univ., November 2012. This extended technical report accompanies [1]. Supported in part by the National Science Foundation through grant No. CRI-0750927. Please direct correspondence to M. Reisslein. A. Pulipaka, M. Reisslein, and L.J. Karam are with the School of Electrical, Computer, and Energy Engineering, Arizona State University, Tempe, AZ 85287-5706, http://trace.eas.asu.edu, Email: {akshay.pulipaka, reisslein, karam}@asu.edu. P. Seeling is with Central Michigan University, Mount Pleasant, MI 48859, Email: [email protected]
Previous studies on 3D video transport have primarily focused on the network and transport layer protocols and file formats [9]–[12]. For instance, [10], [13] examine the extension of common transport protocols, such as the datagram congestion control protocol (DCCP), the stream control transmission protocol (SCTP), and the user datagram protocol (UDP) to 3D streaming, while the use of two separate Internet Protocol (IP) channels for the delivery of multiview video is studied in [14]. Another existing line of research has studied prioritization and selective transport mechanisms for multiview video [15], [16].

In this study, we examine the fundamental traffic and statistical multiplexing characteristics of the main existing approaches for representing and encoding 3D video for long (54,000 frames) full HD (1920 × 1080) 3D videos. More specifically, we consider (i) multiview video (MV) representation and encoding, which exploits the redundancies between the two views, (ii) frame sequential (FS) representation, which merges the two views to form a single sequence with twice the frame rate and applies conventional single-view encoding, and (iii) side-by-side (SBS) representation, which halves the horizontal resolution of the views and combines them to form a single frame sequence for single-view encoding.

We find that the MV representation achieves the most efficient encoding, but generates high traffic variability, which makes statistical multiplexing more challenging. Indeed, for small numbers of multiplexed streams, the FS representation with conventional single-view coding has the same transmission bandwidth requirements as the MV representation with multiview coding. Only when smoothing the MV traffic or multiplexing many streams can transport systems benefit from the more efficient MV encoding.

In order to support further research on the network transport of 3D video, we make all video traces [17] used in this study publicly available in the video trace library http://trace.eas.asu.edu.
In particular, video traffic modeling [18]–[22] requires video traces for model development and validation. Thus, the traffic characteristics of 3D video covered in this study will support the nascent research area of 3D video traffic modeling [23]. Similarly, video traffic management mechanisms for a wide range of networks, including wireless and optical networks, are built on the fundamental traffic and multiplexing characteristics of the encoded video traffic [24]–[27]. Thus, the broad traffic and statistical multiplexing evaluations in this study provide a basis for the emerging research area on 3D video traffic management in transport networks [28], [29].

II. MULTIVIEW VIDEO REPRESENTATION, ENCODING, AND STREAMING
In this section, we provide a brief overview of the main representation formats for multiview video [4] as well as the applicable encoding and streaming approaches.
A. Multiview video representation formats
With the full resolution multiview format, which we refer to as multiview (MV) format for brevity, each view v, v = 1, ..., V, is represented with the full resolution of the underlying spatial video format. For instance, the MV format for the full HD resolution of 1920 × 1080 pixels consists of a sequence of 1920 × 1080 pixel frames for each view v. Each view has the same frame rate as the underlying temporal video format. For example, for a video with a frame rate of f = 24 frames/s, each view has a frame rate of f = 24 frames/s.

With the frame sequential (FS) representation, the video frames of the V views (at the full spatial resolution) are temporally multiplexed to form a single sequence of video frames with frame rate Vf. For instance, for V = 2 views, the video frames from the left and right views are interleaved in alternating fashion to form a single stream with frame rate 2f.

Frame-compatible representation formats have been introduced to utilize the existing infrastructure and equipment for the transmission of stereoscopic two-view video [4]. The V = 2 views are spatially sub-sampled and multiplexed to form a single sequence of video frames with the same temporal and spatial resolution as the underlying video format [30]. In the side-by-side (SBS) format, the left and right views are spatially sub-sampled in the horizontal direction and are then combined side-by-side. For instance, for the full HD format, the left and right views are sub-sampled to 960 × 1080 pixels. Thus, when they are combined in the side-by-side format, they still occupy the full HD resolution for every frame. However, each frame contains the left and right views at only half the original horizontal resolution. In the top-and-bottom format, the left and right views are sub-sampled in the vertical direction and combined in top-and-bottom (above-below) fashion. For other formats, we refer to [4], [30]–[32].
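The FS and SBS constructions described above can be illustrated with a small sketch. It is only a toy model under stated assumptions: frames are plain nested lists of luminance samples, the function names are hypothetical, and the horizontal sub-sampling simply keeps every second column, which is not the filtered down-sampling used for the encodings in this study.

```python
from typing import List

# Hypothetical toy frame type: a list of rows, each row a list of samples.
Frame = List[List[int]]


def frame_sequential(left: List[Frame], right: List[Frame]) -> List[Frame]:
    """Interleave the two views as L1, R1, L2, R2, ... (one stream at rate 2f)."""
    if len(left) != len(right):
        raise ValueError("both views need the same number of frames")
    merged: List[Frame] = []
    for l_frame, r_frame in zip(left, right):
        merged.append(l_frame)
        merged.append(r_frame)
    return merged


def decimate_columns(frame: Frame) -> Frame:
    """Halve the horizontal resolution.

    Assumption: plain decimation (keep every second column), no filtering.
    """
    return [row[::2] for row in frame]


def side_by_side(left: Frame, right: Frame) -> Frame:
    """Pack the two half-width views into one frame of the original width."""
    half_left = decimate_columns(left)
    half_right = decimate_columns(right)
    return [row_l + row_r for row_l, row_r in zip(half_left, half_right)]
```

For a full HD pair, `side_by_side` would place two 960-column halves next to each other, so the packed frame again has 1920 columns, while `frame_sequential` doubles the number of frames per second and leaves each frame untouched.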
We consider the side-by-side (SBS) representation format in our study, since it is one of the most widely used frame-compatible formats, e.g., it is currently being deployed in Japan to transmit 3D content for TV broadcasting over the BS11 satellite channel [30]. The major drawback of these frame-compatible formats is that the spatial sub-sampling requires interpolation (and concomitant quality degradation) to extract the left and right views at their original resolution.

B. Multiview video compression
We now proceed to briefly introduce the compression approaches that can be applied to the representation formats outlined in the preceding subsection. Building on the concept of inter-view prediction [33], multiview video coding [8] exploits the redundancies across different views of the same scene (in addition to the temporal and intra-view spatial redundancies exploited in single-view encoding). Multiview video coding is applicable only to the multiview (MV) representation format since this is the only format to retain distinct sequences of video frames for the views. For the case of 3D video, the recent official ITU multiview video coding reference software, referred to as JMVC, first encodes the left view, and then predictively encodes the right view with respect to the encoded left view.

The frame sequential (FS) and side-by-side (SBS) representation formats present a single sequence of video frames to the encoder. Thus, conventional single-view video encoders can be applied to the FS and SBS representations. We employ the state-of-the-art JSVM reference implementation [34] of the scalable video coding (SVC) extension of the advanced video coding (AVC) encoder in single-layer encoding mode.

For completeness, we briefly note that each view could also be encoded independently with a single-view encoder, which is referred to as simulcasting. While simulcasting has the advantage of low complexity, it does not exploit the redundancies between the views, resulting in low encoding efficiency [4]. A currently active research direction in multiview video encoding is asymmetric coding [10], [35], which encodes the left and right views with different properties, e.g., different quantization scales. For other ongoing research directions in encoding, we refer to the overviews in [4], [8], [10].
C. Multiview video streaming

a) SBS representation: The V = 2 views are integrated into one frame sequence with the spatial resolution and frame rate f of the underlying video. For frame-by-frame transmission of a sequence with M frames, frame m, m = 1, ..., M, of size $X_m$ [bytes] is transmitted during one frame period of duration $1/f$ at a bit rate of $R_m = 8 f X_m$ [bit/s].

b) MV representation: There are a number of streaming options for the MV representation with V views. First, the V streams resulting from the multiview video encoding can be streamed individually. We let $X_m(v)$, m = 1, ..., M, v = 1, ..., V, denote the size [bytes] of the encoded video frame m of view v and note that $R_m(v) = 8 f X_m(v)$ [bit/s] is the corresponding bitrate. The mean frame size of the encoded view v is

$$\bar{X}(v) = \frac{1}{M} \sum_{m=1}^{M} X_m(v) \tag{1}$$

and the corresponding average bit rate is $\bar{R}(v) = 8 f \bar{X}(v)$. The variance of these frame sizes is

$$S_X^2(v) = \frac{1}{M-1} \sum_{m=1}^{M} \left[ X_m(v) - \bar{X}(v) \right]^2. \tag{2}$$

The coefficient of variation of the frame sizes of view v [unit free] is the standard deviation of the frame sizes $S_X(v)$ normalized by the mean frame size

$$\mathrm{CoV}_X(v) = \frac{S_X(v)}{\bar{X}(v)} \tag{3}$$

and is widely employed as a measure of the variability of the frame sizes, i.e., the traffic bitrate variability. Plotting the CoV as a function of the quantization scale (or equivalently, the average PSNR video quality) gives the bitrate variability-distortion (VD) curve [36], [37].

Alternatively, the V streams can be merged into one multiview stream. We consider two elementary merging options, namely sequential (S) merging and aggregation (combining). With sequential merging, the M frames of the V views are temporally multiplexed in round-robin fashion, i.e., first view 1 of frame 1, followed by view 2 of frame 1, ..., followed by view V of frame 1, followed by view 1 of frame 2, and so on.
From the perspective of the video transport system, each of these VM video frames (pictures) can be interpreted as a video frame to be transmitted. In this perspective, the average frame size of the resulting multiview stream is

$$\bar{X} = \frac{1}{V} \sum_{v=1}^{V} \bar{X}(v). \tag{4}$$

Noting that this multiview stream has V frames to be played back in each frame period of duration $1/f$, the average bit rate of the multiview stream is

$$\bar{R} = 8 V f \bar{X}. \tag{5}$$

The variance of the frame sizes of the sequentially (S) merged multiview stream is

$$S_S^2 = \frac{1}{(M-1)(V-1)} \sum_{m,v=1}^{M,V} \left[ X_m(v) - \bar{X} \right]^2 \tag{6}$$

with the corresponding $\mathrm{CoV}_S = S_S / \bar{X}$.

With combining (C), the V encoded views corresponding to a given frame index m are aggregated to form one multiview frame of size $X_m = \sum_{v=1}^{V} X_m(v)$. For 3D video with V = 2, the pair of frames for a given frame index m (which corresponds to a given capture instant of the frame pair) constitutes the multiview frame m. Note that the average size of the multiview frames is $V\bar{X}$ with $\bar{X}$ given in (4). Further, note that these multiview frames have a rate of f multiview frames/s; thus, the average bit rate of the multiview stream resulting from aggregation is the same $\bar{R}$ as given in (5). However, the variance of the sizes of the (combined) multiview frames is different from (6); specifically,

$$S_C^2 = \frac{1}{M-1} \sum_{m=1}^{M} \left[ X_m - V\bar{X} \right]^2 \tag{7}$$

and $\mathrm{CoV}_C = S_C / (V\bar{X})$.

c) FS representation: Similar to the MV representation, the FS representation can be streamed sequentially (S) with the traffic characterizations given by (4)–(6). Or, the V encoded frames for a given frame index m can be combined (C), analogous to the aggregation of the MV representation, resulting in the frame size variance given by (7).

d) Frame size smoothing: The aggregate streaming approach combines all encoded video data for one frame period of playback duration $1/f$ [s] and transmits this data at a constant bitrate over the $1/f$ period.
Compared to the sequential streaming approach, the aggregate streaming approach thus performs smoothing across the V views, i.e., effectively smoothes the encoded video data over the duration of one frame period $1/f$. This smoothing concept can be extended to multiple frame periods, such as a Group of Pictures (GoP) of the encoder [38]. For GoP smoothing with a GoP length of G frames, the encoded views from G frames are aggregated and streamed at a constant bitrate over the period $G/f$ [s].

III. EVALUATION SET-UP

In this section, we describe our evaluation set-up, including the employed 3D video sequences, the encoding set-up, and the video traffic and quality metrics.
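The traffic metrics used in this evaluation build on the frame size statistics and smoothing options of Section II-C. The following sketch illustrates them for per-view frame size traces given in bytes. The function names and the toy trace in the usage note are hypothetical; the sketch covers the per-view and combined (C) statistics and GoP smoothing, and it deliberately leaves out the sequential (S) variance.

```python
import math
from typing import List, Sequence


def mean(values: Sequence[float]) -> float:
    """Arithmetic mean, as used for the mean frame size."""
    return sum(values) / len(values)


def cov(values: Sequence[float]) -> float:
    """Coefficient of variation with the 1/(n-1) sample variance.

    Applies to one view's frame sizes or to the combined multiview frames.
    """
    m = mean(values)
    variance = sum((x - m) ** 2 for x in values) / (len(values) - 1)
    return math.sqrt(variance) / m


def combine_views(sizes_per_view: Sequence[Sequence[float]]) -> List[float]:
    """Aggregate the V views of each frame index m: X_m = sum over v of X_m(v)."""
    return [sum(frame) for frame in zip(*sizes_per_view)]


def average_bit_rate(sizes_per_view: Sequence[Sequence[float]], f: float) -> float:
    """Average bit rate 8 * V * f * X_bar in bit/s for frame sizes in bytes."""
    num_views = len(sizes_per_view)
    x_bar = mean([mean(view) for view in sizes_per_view])  # mean over views
    return 8 * num_views * f * x_bar


def gop_smooth(frame_sizes: Sequence[float], gop_length: int) -> List[float]:
    """Replace each frame size by the mean of its GoP.

    This models streaming each GoP at a constant bit rate over G/f seconds.
    A trailing partial GoP is averaged over its own length (an assumption).
    """
    smoothed: List[float] = []
    for start in range(0, len(frame_sizes), gop_length):
        gop = frame_sizes[start:start + gop_length]
        smoothed.extend([mean(gop)] * len(gop))
    return smoothed
```

As a usage example with made-up numbers, take a left view of `[1000, 1200, 800, 1000]` bytes and a right view of `[200, 400, 200, 400]` bytes at f = 24 frames/s. `combine_views` gives `[1200, 1600, 1000, 1400]`, `average_bit_rate` gives 249,600 bit/s, and the CoV of the combined frames drops once `gop_smooth` is applied, because smoothing keeps the total volume but removes the variation within each GoP.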
A. Video sequences
For a thorough evaluation of the traffic characteristics, especially the traffic variability, the publicly available, relatively short 3D video research sequences [39] are not well suited. Therefore, we employ long 3D (V = 2) video sequences of M = 52 , frames each. That is, we employ 51,200 left-view frames (pictures) and 51,200 right-view frames (pictures) for each test video. We have conducted evaluations with Monsters vs Aliens and Clash of the Titans, which are computer-animated fiction movies, Alice in Wonderland, which is a fantasy movie consisting of a mix of animated and real-life content, and IMAX Space Station, a documentary. All videos are in the full HD 1920 × 1080 pixels format and have a frame rate of f = 24 frames/s for each view.

B. Encoding set-up
We encoded the multiview representation with the reference software JMVC (version 8.3.1). We encoded the FS and SBS representations with the broadly used H.264 SVC video coding standard using the H.264 reference software JSVM (version 9.19.10) [34], [40] in single-layer encoding mode. We set the GoP length to G = 16 frames for the MV and SBS encodings; for the FS encodings we set G = 32 so that all encodings have the same playback duration between intracoded (I) frames, i.e., support the same random access granularity. We employ two different GoP patterns: (B1) with one bi-directionally predicted (B) frame between successive intracoded (I) and predictive encoded (P) frames, and (B7) with seven B frames between successive I and P frames. We conducted encodings for the quantization parameter settings 24, 28, and 34.

C. Traffic and quality metrics
We employ the peak signal to noise ratio (PSNR) [41], [42] of the video luminance signal of a video frame m, m = 1, ..., M, of a view v, v = 1, ..., V, as the objective quality metric of video frame m of view v. We average these video frame PSNR values over the VM video frames of a given video sequence to obtain the average PSNR video quality. For the MV and FS representations, the PSNR evaluation is conducted over the full HD spatial resolution of each view of a given frame. We note that in the context of asymmetric 3D video coding [35], the PSNR values of the two views may be weighted unequally, depending on their relative scaling (bit rate reduction) [43]. We do not consider asymmetric video coding in this study and weight the PSNR values of both views equally.

For the SBS representation, we report for some encodings the PSNR values from the comparison of the uncompressed SBS representation with the encoded (compressed) and subsequently decoded SBS representation as SBS without interpolation (SBS-NI). We also report for all encodings the comparison of the original full HD left and right views with the video signal obtained after SBS representation, encoding, decoding, and subsequent interpolation back to full HD format as SBS with interpolation (SBS-I). Unless otherwise noted, the SBS results are for the SBS representation with interpolation. We employed the JSVM reference down-sampling with a Sine-windowed Sinc-function and the corresponding normative up-sampling using a set of 4-tap filters [34]. We plot the average PSNR video quality [dB] as a function of the average streaming bitrate $\bar{R}$ [bit/s] to obtain the RD curve and the coefficient of variation of the frame sizes CoV as a function of the average PSNR video quality to obtain the VD curve.

IV. TRAFFIC AND QUALITY RESULTS
In this section we present the RD and VD characteristics for the examined 3D video representation formats. We briefly note that generally the encodings with one B frame between successive I and P frames follow the same trends as observed for the encodings with seven B frames; the main difference is that the encodings with one B frame have slightly higher bitrates and slightly lower CoV values, which are effects of the lower level of predictive encoding.
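Each RD and VD point in this section uses the average luminance PSNR of Section III-C as its quality coordinate. As a minimal sketch of that metric, the code below assumes the common definition PSNR = 10 log10(peak^2 / MSE) with a peak value of 255 for 8-bit samples; the text above does not spell out this formula, so the definition, the toy frame type, and the function names are assumptions for illustration.

```python
import math
from typing import Sequence

# Hypothetical toy frame type: rows of 8-bit luminance samples.
Frame = Sequence[Sequence[int]]


def psnr(original: Frame, decoded: Frame, peak: int = 255) -> float:
    """Luminance PSNR in dB of one decoded frame against its original.

    Assumption: PSNR = 10 * log10(peak^2 / MSE), with peak = 255.
    """
    squared_error = 0
    count = 0
    for row_o, row_d in zip(original, decoded):
        for o, d in zip(row_o, row_d):
            squared_error += (o - d) ** 2
            count += 1
    if squared_error == 0:
        return math.inf  # identical frames have unbounded PSNR
    mse = squared_error / count
    return 10.0 * math.log10(peak ** 2 / mse)


def average_psnr(originals: Sequence[Frame], decodeds: Sequence[Frame]) -> float:
    """Mean of the per-frame PSNR values, all frames weighted equally.

    Passing the frames of both views in one list gives the equal weighting
    of the left and right views used in this study.
    """
    values = [psnr(o, d) for o, d in zip(originals, decodeds)]
    return sum(values) / len(values)
```

For the SBS-I results, `original` would be a full HD view and `decoded` the interpolated output; for SBS-NI, both arguments would be frames in the packed SBS layout.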
A. Bitrate-distortion (RD) characteristics
In Fig. 1, we plot the RD curves of the multiview representation encoded with the multiview video codec for streaming the left view only (MV-L), the right view only (MV-R), and the merged multiview stream (MV). Similarly, we plot the RD curves for the frame sequential (FS) representation and the side-by-side (SBS) representation encoded with the conventional single-view codec.

From the MV curves in Fig. 1, we observe that the right view has a significant RD improvement compared to the left view. This is because of the inter-view prediction of the multiview encoding, which exploits the inter-view redundancies by encoding the right view with prediction from the left view.

Next, turning to the side-by-side (SBS) representation, we observe that SBS with interpolation can achieve similar or even slightly better RD efficiency than FS for the low to medium quality range of videos with real-life content (Alice in Wonderland and IMAX Space Station). However, SBS has consistently lower RD efficiency than the MV representation. In additional evaluations for the B7 GoP pattern, we compared the uncompressed SBS representation with the encoded (compressed) and subsequently decoded SBS representation and found that the RD curve for this SBS representation without interpolation (SBS-NI) lies between the MV-L and MV RD curves. We observed from these additional evaluations that the interpolation to the full HD format (SBS-I) significantly reduces the average PSNR video quality, especially for encodings in the higher quality range.

Finally, we observe from Fig. 1 that the MV representation in conjunction with multiview encoding has consistently higher RD efficiency than the FS representation with conventional single-view encoding. The FS representation essentially translates the multiview encoding problem into a temporally predictive coding problem. That is, the FS representation temporally interleaves the left and right views and then employs state-of-the-art temporal predictive encoding. The results in Fig. 1 indicate that this temporal predictive encoding cannot exploit the inter-view redundancies as well as the state-of-the-art multiview encoder.
B. Bitrate variability-distortion (VD) characteristics
In Fig. 2, we plot the VD curves for the examined multiview (MV), frame sequential (FS), and side-by-side (SBS) representation formats; whereby, for MV and FS, we plot both VD curves for sequential (S) merging and aggregation (C). We first observe from Fig. 2 that the MV representation with sequential streaming (MV-S) has the highest traffic variability. This high traffic variability is primarily due to the size differences between successive encoded left and right views. In particular, the left view is encoded independently and is thus typically large. The right view is encoded predictively from the left view and thus typically small. This succession of large and small views (frames), whereby each view is treated as an independent video frame by the transmission system, i.e., is transmitted within half a frame period 1/(2f) in the sequential streaming approach, leads to the high traffic variability. Smoothing over one frame period 1/f by combining the two views of each frame from the MV encoding significantly reduces the traffic variability. In particular, the MV encoding with aggregation (MV-C) has generally lower traffic variability than the SBS streams.

We further observe from the MV results in Fig. 2 that the left view (MV-L) has significantly higher traffic variability than the right view (MV-R). The large MV-L traffic variabilities are primarily due to the typically large temporal variations in the scene content of the videos, which result in large size variations of the MV-L frames which are encoded with temporal prediction across the frames of the left view. In contrast, the right view is predictively encoded from the left view. Due to the marginal difference between the two perspectives of the scene employed for the two views of 3D video, the content variations between the two views (for a given fixed frame index m) are small relative to the scene content variations occurring over time.

Turning to the FS representation, we observe that FS with sequential streaming has CoV values near or below the MV representation with aggregation. Similarly to the MV representation, aggregation significantly reduces the traffic variability of the FS representation. In fact, we observe from Fig. 2 that the FS representation with aggregation has consistently the lowest CoV values. The lower traffic variability of the FS representation is consistent with its relatively less RD-efficient encoding. The MV representation and encoding exploits the inter-view redundancies and thus encodes the two views of each frame more efficiently, leading to relatively larger variations in the encoded frame sizes as the video content and scenes change and present varying levels of inter-view redundancy. The FS representation with single-view encoding, on the other hand, is not able to exploit these varying degrees of inter-view redundancy as well, resulting in less variability in the view and frame sizes, but also larger average frame sizes.

Fig. 1. RD curves for multiview (MV) representation, frame sequential (FS) representation, and side-by-side (SBS) representation for GoP patterns with one B frame between successive I and P frames (B1) as well as seven B frames between successive I and P frames (B7). Panels: (a) Monsters vs Aliens, B1; (b) Monsters vs Aliens, B7; (c) Alice in Wonderland, B1; (d) Alice in Wonderland, B7; (e) IMAX Space Station, B1; (f) IMAX Space Station, B7. Axes: average PSNR (dB) versus average bit rate (kbps). Curves: MV-R, MV-L, MV, FS, SBS for B1; MV-R, MV-L, SBS-NI, MV, FS, SBS-I for B7.

Fig. 2. VD curves for different representation formats and streaming approaches for B1 and B7 GoP patterns. Panels: (a) Monsters vs Aliens, B1; (b) Monsters vs Aliens, B7; (c) Alice in Wonderland, B1; (d) Alice in Wonderland, B7; (e) IMAX Space Station, B1; (f) IMAX Space Station, B7. Axes: CoV versus average PSNR (dB). Curves: MV-S, MV-L, SBS, MV-C, FS-S, MV-R, FS-C.

In additional evaluations that are not included here in detail due to space constraints, we found that frame size smoothing over one GoP reduces the traffic variability significantly, especially for the burstier MV representation. For instance, for Monsters vs Aliens, the CoV value of 1.05 for MV-C for the middle point in Fig. 2(a) is reduced to 0.65 with GoP smoothing. Similarly, the corresponding CoV value of 1.51 for
IMAX Space Station (Fig. 2(e)) is reduced to 0.77. The CoV reductions are less pronounced for the FS representation: the middle CoV value of 0.81 for FS-C in Fig. 2(a) is reduced to 0.58, while the corresponding CoV value of 0.82 in Fig. 2(e) is reduced to 0.70.

V. STATISTICAL MULTIPLEXING EVALUATIONS
In this section, we conduct statistical multiplexing simulations to examine the impact of the 3D video representations on the bandwidth requirements for streaming with minuscule loss probabilities [44]. For the MV and FS representations, we consider the combined (C) streaming approach where the pair of frames for each frame index m is aggregated and the GoP smoothing approach.

Fig. 3. Required minimum link transmission bit rate $C_{\min}$ to transmit J streams with an information loss probability $P_{\mathrm{loss}}^{\mathrm{info}}$ ≤ ε = 10 − . GoP structure B1 with one B frame between I and P frames. Panels: (a) Monsters vs Aliens, frame-by-frame; (b) Monsters vs Aliens, GoP smoothing; (c) IMAX Space Station, frame-by-frame; (d) IMAX Space Station, GoP smoothing. Axes: $C_{\min}$ (Mbps) versus average PSNR (dB). Curves: SBS, FS-C, and MV-C for J = 4, 16, and 64, plus J = 256 in the frame-by-frame panels.
1) Simulation Setup:
We consider a single "bufferless" statistical multiplexer [36], [44], [45] which reveals the fundamental statistical multiplexing behaviors without introducing arbitrary parameters, such as buffer sizes, cross traffic, or multi-hop routing paths. Specifically, we consider a link of transmission bitrate C [bit/s], preceded by a buffer of size C/f [bit], i.e., the buffer holds as many bits as can be transmitted in one frame period of duration 1/f. We let J denote the number of 3D video streams fed into the buffer. Each of the J streams in a given simulation is derived from the same encoded 3D video sequence; whereby, each stream has its own starting frame that is drawn independently, uniformly, and randomly from the set [1, 2, ..., M]. Starting from the selected starting frame, each of the J videos places one encoded frame of the SBS representation (multiview frame of the MV-C or FS-C representation) into the multiplexer buffer in each frame period. If the number of bits placed in the buffer in a frame period exceeds C/f, then there is loss. We count the number of lost bits to evaluate the information loss probability $P_{\mathrm{loss}}^{\mathrm{info}}$ [44] as the proportion of the number of lost bits to the number of bits placed in the multiplexer buffer. We conduct many independent replications of the stream multiplexing; each replication simulates the transmission of M frames (with "wrap-around" to the first frame when the end of the video is reached) for each stream, and each replication has a new independent set of random starting frames for the J streams.
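The multiplexer described above is simple enough to sketch directly. The code below is an illustration under stated assumptions, not the simulator used for the results: it runs a fixed number of replications, averages the per-replication loss ratios, and adds a bisection over the link rate C of the kind that could be used to search for a minimum capacity. All function and parameter names are hypothetical, and frame sizes are given in bits.

```python
import random
from typing import Sequence


def simulate_loss(frame_bits: Sequence[int], num_streams: int, link_rate: float,
                  frame_rate: float, rng: random.Random) -> float:
    """One replication: lost bits divided by offered bits."""
    num_frames = len(frame_bits)
    # Independent, uniformly distributed starting frames (zero-based indices).
    starts = [rng.randrange(num_frames) for _ in range(num_streams)]
    bits_per_period = link_rate / frame_rate  # C/f bits served per frame period
    offered = 0
    lost = 0.0
    for t in range(num_frames):
        # One frame per stream and period, wrapping around at the end of the trace.
        arriving = sum(frame_bits[(s + t) % num_frames] for s in starts)
        offered += arriving
        lost += max(0.0, arriving - bits_per_period)
    return lost / offered


def estimate_loss(frame_bits: Sequence[int], num_streams: int, link_rate: float,
                  frame_rate: float, replications: int = 100, seed: int = 0) -> float:
    """Average loss ratio over a fixed number of replications (an assumption)."""
    rng = random.Random(seed)
    runs = [simulate_loss(frame_bits, num_streams, link_rate, frame_rate, rng)
            for _ in range(replications)]
    return sum(runs) / len(runs)


def min_capacity(frame_bits: Sequence[int], num_streams: int, frame_rate: float,
                 eps: float, lo: float, hi: float, tol: float = 1e3) -> float:
    """Bisection for the smallest link rate in [lo, hi] with loss <= eps.

    Assumes hi is feasible. Because every call reuses the same seed, the
    starting frames are identical across link rates, so the estimated loss
    cannot increase when the link rate grows.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if estimate_loss(frame_bits, num_streams, mid, frame_rate) <= eps:
            hi = mid
        else:
            lo = mid
    return hi
```

Reusing one seed for every candidate link rate is a design choice of this sketch: it keeps the bisection well behaved at the cost of correlating the estimates. A stopping rule based on confidence intervals could replace the fixed replication count.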
2) Evaluation Results:
We conducted two types of evaluations. First, we determined the maximum number of streams J_max that can be transmitted over the link with prescribed transmission bit rate C = 10, 20, and 40 Mb/s such that P_loss^info is less than a prescribed small ε = 10^(−…). We terminated a simulation when the confidence interval for P_loss^info was less than 10% of the corresponding sample mean.

Second, we estimated the minimum link capacity C_min that accommodates a prescribed number of streams J while keeping P_loss^info ≤ ε = 10^(−…). For each C_min estimate, we performed 500 runs, each consisting of 1000 independent video streaming simulations. We discuss in detail the representative results from this evaluation of C_min for a given number of streams J. The results for the evaluation of J_max given a fixed link capacity C indicate the same tendencies.

Fig. 4. Required minimum link transmission bit rate C_min to transmit J streams with an information loss probability P_loss^info ≤ ε = 10^(−…). GoP structure B7 with seven B frames between I and P frames. Panels: (a) Monsters vs Aliens, frame-by-frame; (b) Monsters vs Aliens, GoP smoothing; (c) Alice in Wonderland, frame-by-frame; (d) Alice in Wonderland, GoP smoothing; (e) IMAX Space Station, frame-by-frame; (f) IMAX Space Station, GoP smoothing. Axes: C_min (Mbps) versus average PSNR (dB); curves for SBS, FS-C, and MV-C with J between 4 and 256 streams.

We observe from Figs. 4(a), (c), and (e) that for small numbers of multiplexed streams, namely J = 4 and 16 streams for Monsters vs Aliens and Alice in Wonderland, as well as J = 4 streams for IMAX Space Station, the MV and FS representations require essentially the same transmission bitrate. Even though the MV representation and encoding has higher RD efficiency, i.e., a lower average bit rate for a given average PSNR video quality, the higher MV traffic variability makes statistical multiplexing more challenging, so that MV requires the same transmission bit rate as the less RD efficient FS representation (which has lower traffic variability). We further observe from Figs. 4(a), (c), and (e) that strengthening the statistical multiplexing effect by multiplexing more streams reduces the effect of the traffic variability and, as a result, reduces the required transmission bit rate C_min for MV-C relative to FS-C.
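The C_min evaluation can be illustrated with a minimal sketch. It is not the simulation code used for the results above; it rests on the following assumptions: the link is bufferless, P_loss^info is estimated as lost bits divided by offered bits, every stream replays one frame size trace with wrap-around from an independent random starting frame, and C_min is located by bisection. The function names (gop_smooth, info_loss_probability, min_capacity), the synthetic trace, the GoP length of 16, and the threshold 1e-3 are all hypothetical choices for illustration, and the sketch uses far fewer runs than the 500 runs of 1000 simulations described above.

```python
import random


def gop_smooth(frame_bits, gop_len):
    """Replace each frame size by the mean frame size of its GoP."""
    smoothed = []
    for start in range(0, len(frame_bits), gop_len):
        gop = frame_bits[start:start + gop_len]  # last GoP may be shorter
        smoothed.extend([sum(gop) / len(gop)] * len(gop))
    return smoothed


def info_loss_probability(trace, num_streams, capacity, num_runs=20, seed=0):
    """Estimate (lost bits) / (offered bits) on a bufferless link.

    Assumption: every stream replays `trace` with wrap-around from an
    independent, uniformly random starting frame, and all bits above
    `capacity` (bits per frame period) in a frame period are lost.
    """
    rng = random.Random(seed)  # fixed seed: same offsets for every capacity
    n = len(trace)
    lost = 0.0
    offered = 0.0
    for _ in range(num_runs):
        offsets = [rng.randrange(n) for _ in range(num_streams)]
        for t in range(n):
            aggregate = sum(trace[(o + t) % n] for o in offsets)
            offered += aggregate
            lost += max(0.0, aggregate - capacity)
    return lost / offered


def min_capacity(trace, num_streams, eps, tol=1.0, **kwargs):
    """Bisect for the smallest capacity whose estimated loss is <= eps."""
    lo = num_streams * sum(trace) / len(trace)  # mean aggregate rate
    hi = num_streams * max(trace)               # peak aggregate rate, no loss
    if info_loss_probability(trace, num_streams, lo, **kwargs) <= eps:
        return lo
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if info_loss_probability(trace, num_streams, mid, **kwargs) <= eps:
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    # Synthetic stand-in for a frame size trace (bits); not data from the paper.
    rng = random.Random(1)
    trace = [rng.choice([20_000, 40_000, 200_000]) for _ in range(240)]
    for label, tr in (("frame-by-frame", trace),
                      ("GoP smoothing", gop_smooth(trace, 16))):
        c_min = min_capacity(tr, num_streams=4, eps=1e-3, tol=100.0)
        print(label, round(c_min), "bits per frame period")
```

Reusing the same seed for every candidate capacity keeps the estimated loss non-increasing in the capacity, which is what makes the bisection valid. A J_max search under the same assumptions would fix the capacity and increase num_streams until the estimated loss exceeds the threshold.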
Fig. 5. Maximum number of supported streams with an information loss probability P_loss^info ≤ ε = 10^(−…) for given link transmission bit rate C. GoP structure B1 with one B frame between I and P frames. Panels: (a) Monsters vs Aliens, frame-by-frame; (b) Monsters vs Aliens, GoP smoothing; (c) IMAX Space Station, frame-by-frame; (d) IMAX Space Station, GoP smoothing. Axes: J_max versus average PSNR (dB); curves for MV-C, FS-C, and SBS at C = 10 and 20 Mbps (Monsters vs Aliens) and at C = 20 and 40 Mbps (IMAX Space Station).

We observe from Figs. 4(b), (d), and (f) that GoP smoothing effectively reduces the MV traffic variability such that already for small numbers of multiplexed streams, i.e., a weak statistical multiplexing effect, the required transmission bitrate for MV is less than that for FS.

VI. CONCLUSION AND FUTURE WORK
We have compared the traffic characteristics and fundamental statistical multiplexing behaviors of state-of-the-art multiview (MV) 3D video representation and encoding with the frame sequential (FS) and side-by-side (SBS) representations encoded with state-of-the-art single-view video encoding. We found that the SBS representation, which permits transmission of two-view video with the existing single-view infrastructure, incurs significant PSNR quality degradations compared to the MV and FS representations due to the sub-sampling and interpolation involved with the SBS representation. We found that when transmitting small numbers of streams without traffic smoothing, the higher traffic variability of the MV encoding leads to the same transmission bitrate requirements as the less RD efficient FS representation with single-view coding. We further found that to reap the benefit of the more RD efficient MV representation and coding for network transport, traffic smoothing or the multiplexing of many streams in large transmission systems is required.

There are many important directions for future research on the traffic characterization and efficient network transport of encoded 3D video, and generally multiview video. One direction is to develop and evaluate smoothing and scheduling mechanisms that consider a wider set of network and receiver constraints, such as limited receiver buffer or varying wireless link bitrates, or collaborate across several ongoing streams [46], [47]. Another avenue is to exploit network and client resources, such as caches or cooperating peer clients, for efficient delivery of multiview video services. Broadly speaking, these effective transmission strategies are especially critical when relatively few video streams are multiplexed as, for instance, in access networks, e.g., [48]–[50], and metro networks, e.g., [51]–[54].
Moreover, the challenges are especially pronounced in networking scenarios in support of applications with tight real-time constraints, such as gaming [55]–[57] and real-time conferencing and tele-immersion [58]–[60].

Fig. 6. Maximum number of supported streams with an information loss probability P_loss^info ≤ ε = 10^(−…) for given link transmission bit rate C. GoP structure B7 with seven B frames between I and P frames. Panels: (a) Monsters vs Aliens, frame-by-frame; (b) Monsters vs Aliens, GoP smoothing; (c) IMAX Space Station, frame-by-frame; (d) IMAX Space Station, GoP smoothing. Axes: J_max versus average PSNR (dB); curves for MV-C, FS-C, and SBS at C = 10 and 20 Mbps (Monsters vs Aliens) and at C = 20 and 40 Mbps (IMAX Space Station).

REFERENCES

[1] A. Pulipaka, P. Seeling, M. Reisslein, and L. Karam, “Traffic and statistical multiplexing characterization of 3-D video representation formats,” IEEE Trans. Broadcasting, vol. 59, no. 2, pp. 382–389, 2013.
[2] P. Merkle, K. Muller, and T. Wiegand, “3D video: acquisition, coding, and display,” IEEE Trans. Consum. Electron., vol. 56, no. 2, pp. 946–950, May 2010.
[3] J. Morgade, A. Usandizaga, P. Angueira, D. de la Vega, A. Arrinda, M. Velez, and J. Ordiales, “3DTV roll-out scenarios: A DVB-T2 approach,”
IEEE Trans. Broadcast., vol. 57, no. 2, pp. 582–592, Jun. 2011.
[4] A. Vetro, A. M. Tourapis, K. Muller, and T. Chen, “3D-TV content storage and transmission,” IEEE Transactions on Broadcasting, vol. 57, no. 2, Part 2, pp. 384–394, 2011.
[5] A. Vetro, W. Matusik, H. Pfister, and J. Xin, “Coding approaches for end-to-end 3D TV systems,” in Proc. Picture Coding Symp., 2004.
[6] K. Willner, K. Ugur, M. Salmimaa, A. Hallapuro, and J. Lainema, “Mobile 3D video using MVC and N800 internet tablet,” in Proc. of 3DTV Conference, May 2008, pp. 69–72.
[7] Y. Chen, Y.-K. Wang, K. Ugur, M. M. Hannuksela, J. Lainema, and M. Gabbouj, “The emerging MVC standard for 3D video services,” EURASIP Journal on Advances in Signal Processing, 2009.
[8] A. Vetro, T. Wiegand, and G. J. Sullivan, “Overview of the stereo and multiview video coding extensions of the H.264/MPEG-4 AVC standard,” Proc. of the IEEE, vol. 99, no. 4, pp. 626–642, Apr. 2011.
[9] G. Akar, A. Tekalp, C. Fehn, and M. Civanlar, “Transport methods in 3DTV—a survey,” IEEE Trans. Circuits and Sys. for Video Techn., vol. 17, no. 11, pp. 1622–1630, Nov. 2007.
[10] C. G. Gurler, B. Gorkemli, G. Saygili, and A. M. Tekalp, “Flexible transport of 3-D video over networks,” Proceedings of the IEEE
Proceedings of the IEEE, vol. 99, no. 4, pp. 671–683, Apr. 2011.
[13] Y. Zikria, S. Malik, H. Ahmed, S. Nosheen, N. Azeemi, and S. Khan, “Video Transport over Heterogeneous Networks Using SCTP and DCCP,” in Wireless Networks, Information Processing and Systems, ser. Communications in Computer and Information Science. Springer Berlin Heidelberg, 2009, vol. 20, pp. 180–190.
[14] Y. Zhou, C. Hou, Z. Jin, L. Yang, J. Yang, and J. Guo, “Real-time transmission of high-resolution multi-view stereo video over IP networks,” in Proc. of 3DTV Conference, May 2009, pp. 1–4.
[15] E. Kurutepe, M. Civanlar, and A. Tekalp, “Client-driven selective streaming of multiview video for interactive 3DTV,” IEEE Trans. Circuits Sys. Video Techn., vol. 17, no. 11, pp. 1558–1565, Nov. 2007.
[16] D. Wu, L. Sun, and S. Yang, “A selective transport framework for delivery MVC video over MPEG-2 TS,” in Proc. of IEEE Int. Symp. on Broadband Multimedia Systems and Broadcasting, Jun. 2011, pp. 1–6.
[17] P. Seeling and M. Reisslein, “Video transport evaluation with H.264 video traces,” IEEE Communications Surveys and Tutorials, vol. 14, no. 4, pp. 1142–1165, Fourth Quarter 2012.
[18] A. Alheraish, S. Alshebeili, and T. Alamri, “A GACS modeling approach for MPEG broadcast video,” IEEE Transactions on Broadcasting, vol. 50, no. 2, pp. 132–141, Jun. 2004.
[19] N. Ansari, H. Liu, Y. Q. Shi, and H. Zhao, “On modeling MPEG video traffics,” IEEE Trans. Broadcast., vol. 48, no. 4, pp. 337–347, Dec. 2002.
[20] D. Fiems, B. Steyaert, and H. Bruneel, “A genetic approach to Markovian characterisation of H.264/SVC scalable video,” Multimedia Tools and Applications, vol. 58, no. 1, pp. 125–146, May 2012.
[21] A. Lazaris and P. Koutsakis, “Modeling multiplexed traffic from H.264/AVC videoconference streams,” Computer Communications, vol. 33, no. 10, pp. 1235–1242, Jun. 2010.
[22] K. Shuaib, F. Sallabi, and L. Zhang, “Smoothing and modeling of video transmission rates over a QoS network with limited bandwidth connections,” Int. Journal of Computer Networks and Communications, vol. 3, no. 3, pp. 148–162, May 2011.
[23] L. Rossi, J. Chakareski, P. Frossard, and S. Colonnese, “A non-stationary Hidden Markov Model of multiview video traffic,” in
Proceedings of IEEE Int. Conf. on Image Processing (ICIP), 2010, pp. 2921–2924.
[24] H. Alshaer, R. Shubair, and M. Alyafei, “A framework for resource dimensioning in GPON access networks,” Int. Journal of Network Management, vol. 22, no. 3, pp. 199–215, May/June 2012.
[25] J.-W. Ding, C.-T. Lin, and S.-Y. Lan, “A unified approach to heterogeneous video-on-demand broadcasting,” IEEE Transactions on Broadcasting, vol. 54, no. 1, pp. 14–23, Mar. 2008.
[26] L. Qiao and P. Koutsakis, “Adaptive bandwidth reservation and scheduling for efficient wireless telemedicine traffic transmission,” IEEE Trans. Vehicular Technology, vol. 60, no. 2, pp. 632–643, Feb. 2011.
[27] T. H. Szymanski and D. Gilbert, “Internet multicasting of IPTV with essentially-zero delay jitter,” IEEE Transactions on Broadcasting, vol. 55, no. 1, pp. 20–30, Mar. 2009.
[28] J. Cosmas, J. Loo, A. Aggoun, and E. Tsekleves, “Matlab traffic and network flow model for planning impact of 3D applications on networks,” in Proceedings of IEEE Int. Symp. on Broadband Multimedia Systems and Broadcasting (BMSB), Mar. 2010.
[29] N. Manap, G. DiCaterina, and J. Soraghan, “Low cost multi-view video system for wireless channel,” in Proceedings of IEEE 3DTV Conference
Proceedings of the IEEE, vol. 99, no. 4, pp. 684–693, Apr. 2011.
[32] E.-K. Lee, Y.-K. Jung, and Y.-S. Ho, “3D video generation using foreground separation and disocclusion detection,” in Proceedings of IEEE 3DTV Conference, Tampere, Finland, Jun. 2010.
[33] M. Lukacs, “Predictive coding of multi-viewpoint image sets,” in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, Tokyo, Japan, 1986, pp. 521–524.
[34] “JSVM encoder.” http://ip.hhi.de/imagecom_G1/savce/downloads/SVC-Reference-Software.htm.
[35] G. Saygili, C. G. Gurler, and A. Tekalp, “Evaluation of asymmetric stereo video coding and rate scaling for adaptive 3D video streaming,” IEEE Trans. Broadcasting, vol. 57, no. 2, pp. 593–601, Jun. 2011.
[36] P. Seeling and M. Reisslein, “The rate variability-distortion (VD) curve of encoded video and its impact on statistical multiplexing,” IEEE Transactions on Broadcasting, vol. 51, no. 4, pp. 473–492, Dec. 2005.
[37] M. Reisslein, J. Lassetter, S. Ratnam, O. Lotfallah, F. Fitzek, and S. Panchanathan, “Traffic and quality characterization of scalable encoded video: a large-scale trace-based study, Part 1: overview and definitions,” Arizona State Univ., Tech. Rep., 2002.
[38] G. Van der Auwera and M. Reisslein, “Implications of smoothing on statistical multiplexing of H.264/AVC and SVC video streams,” IEEE Trans. on Broadcasting, vol. 55, no. 3, pp. 541–558, Sep. 2009.
[39] “Microsoft Research 3D Test sequences.” http://research.microsoft.com/en-us/um/people/sbkang/3dvideodownload/.
[40] G. Van der Auwera, P. David, and M. Reisslein, “Traffic and quality characterization of single-layer video streams encoded with the H.264/MPEG-4 advanced video coding standard and scalable video coding extension,” IEEE Trans. Broadcast., vol. 54, no. 3, pp. 698–718, Sep. 2008.
[41] S. Chikkerur, V. Sundaram, M. Reisslein, and L. Karam, “Objective video quality assessment methods: A classification, review, and performance comparison,” IEEE Trans. Broadcast., vol. 57, no. 2, pp. 165–182, Jun. 2011.
[42] K. Seshadrinathan, R. Soundararajan, A. Bovik, and L. Cormack, “Study of subjective and objective quality assessment of video,” IEEE Trans. Image Processing, vol. 19, no. 6, pp. 1427–1441, Jun. 2010.
[43] N. Ozbek, A. Tekalp, and E. Tunali, “Rate allocation between views in scalable stereo video coding using an objective stereo video quality measure,” in Proc. of IEEE ICASSP, 2007, pp. I–1045–I–1048.
[44] J. Roberts, U. Mocci, and J. Virtamo, Broadband Network Traffic: Performance Evaluation and Design of Broadband Multiservice Networks, Final Report of Action COST 242, Lecture Notes in Computer Science Vol. 1155. Springer Verlag, 1996.
[45] S. Racz, T. Jakabfy, J. Farkas, and C. Antal, “Connection admission control for flow level QoS in bufferless models,” in
Proceedings of IEEE INFOCOM, Mar. 2005, pp. 1273–1282.
[46] S. Bakiras and V. Li, “Maximizing the number of users in an interactive video-on-demand system,” IEEE Trans. on Broadcasting, vol. 48, no. 4, pp. 281–292, Dec. 2002.
[47] M. Reisslein and K. Ross, “High-performance prefetching protocols for VBR prerecorded video,” IEEE Network, vol. 12, no. 6, pp. 46–55, Nov./Dec. 1998.
[48] F. Aurzada, M. Scheutzow, M. Reisslein, N. Ghazisaidi, and M. Maier, “Capacity and delay analysis of next-generation passive optical networks (NG-PONs),” IEEE Trans. Commun., vol. 59, no. 5, pp. 1378–1388, May 2011.
[49] H. Song, B. W. Kim, and B. Mukherjee, “Long-reach optical access networks: A survey of research challenges, demonstrations, and bandwidth assignment mechanisms,” IEEE Communications Surveys and Tutorials, vol. 12, no. 1, pp. 112–123, 1st Quarter 2010.
[50] J. Zheng and H. Mouftah, “A survey of dynamic bandwidth allocation algorithms for Ethernet Passive Optical Networks,” Optical Switching and Networking, vol. 6, no. 3, pp. 151–162, Jul. 2009.
[51] A. Bianco, T. Bonald, D. Cuda, and R.-M. Indre, “Cost, power consumption and performance evaluation of metro networks,” IEEE/OSA J. Opt. Comm. Netw., vol. 5, no. 1, pp. 81–91, Jan. 2013.
[52] M. Maier and M. Reisslein, “AWG-based metro WDM networking,” IEEE Commun. Mag., vol. 42, no. 11, pp. S19–S26, Nov. 2004.
[53] M. Scheutzow, M. Maier, M. Reisslein, and A. Wolisz, “Wavelength reuse for efficient packet-switched transport in an AWG-based metro WDM network,” IEEE/OSA Journal of Lightwave Technology, vol. 21, no. 6, pp. 1435–1455, Jun. 2003.
[54] M. Yuang, I.-F. Chao, and B. Lo, “HOPSMAN: An experimental optical packet-switched metro WDM ring network with high-performance medium access control,” IEEE/OSA Journal of Optical Communications and Networking, vol. 2, no. 2, pp. 91–101, Feb. 2010.
[55] M. Bredel and M. Fidler, “A measurement study regarding quality of service and its impact on multiplayer online games,” in Proc. NetGames, 2010.
[56] F. Fitzek, G. Schulte, and M. Reisslein, “System architecture for billing of multi-player games in a wireless environment using GSM/UMTS and WLAN services,” in Proc. ACM NetGames, 2002, pp. 58–64.
[57] C. Schaefer, T. Enderes, H. Ritter, and M. Zitterbart, “Subjective quality assessment for multiplayer real-time games,” in Proc. ACM NetGames, 2002, pp. 74–78.
[58] G. Kurillo and R. Bajcsy, “3D teleimmersion for collaboration and interaction of geographically distributed users,” Virtual Reality, vol. 17, no. 1, pp. 29–43, 2013.
[59] M. Pallot, P. Daras, S. Richir, and E. Loup-Escande, “3D-live: live interactions through 3D visual environments,” in Proc. Virtual Reality Int. Conf., 2012.
[60] R. Vasudevan, Z. Zhou, G. Kurillo, E. Lobaton, R. Bajcsy, and K. Nahrstedt, “Real-time stereo-vision system for 3D teleimmersive collaboration,” in