Towards a Perceived Audiovisual Quality Model for Immersive Content
Randy Frans Fela
SenseLab, FORCE Technology
Hørsholm, Denmark
[email protected]
Nick Zacharov
SenseLab, FORCE Technology
Hørsholm, Denmark
[email protected]
Søren Forchhammer
Dept. of Photonics Engineering, Technical University of Denmark
Kgs. Lyngby, Denmark
[email protected]
Abstract—This paper studies the quality of multimedia content, focusing on 360 video and ambisonic spatial audio reproduced using a head-mounted display and a multichannel loudspeaker setup. Encoding parameters following basic video quality test conditions for 360 videos were selected, and a low-bitrate codec was used for the audio encoder. Three subjective experiments were performed, for audio, video, and audiovisual quality respectively. Peak signal-to-noise ratio (PSNR) and its variants for 360 videos were computed as objective quality metrics and subsequently correlated with the subjective video scores. This study shows that the Cross-Format SPSNR-NN has a slightly higher linear and monotonic correlation over all video sequences. Among the audiovisual models, a power model shows the highest correlation between test data and predicted scores. We conclude that the development of a superior predictive model requires a high-quality, critical, synchronized audiovisual database. Furthermore, comprehensive assessor training prior to testing may improve the assessors' discrimination ability, particularly with respect to multichannel audio reproduction. To further improve the performance of audiovisual quality models for immersive content, in addition to developing broader and more critical audiovisual databases, the subjective testing methodology needs to evolve to provide greater resolution and robustness.
Index Terms—360 video, ambisonics, audiovisual quality, PSNR, design of experiment, perceptual evaluation.
I. INTRODUCTION
Since the introduction of virtual reality, 360 video has become a popular way of presenting immersive content, and owing to its potential applications, numerous efforts have been made to improve and assess its quality using objective and subjective measures [1]–[4]. Objective measures of 360 video aim to quantify the quality of the reconstructed video based on its distortion relative to the original video, in the form of a peak signal-to-noise ratio (PSNR) metric. Because 360 video is mapped between the sphere and a 2-dimensional plane through various projection formats (e.g., equirectangular (ERP), cubemap (CMP), etc.), a family of PSNR-variant metrics can be computed, as proposed in [5]–[8]. When considering the user experience, users typically view only a portion of the spherical projection of a 360 video, and a dynamic viewport-based PSNR metric has been proposed to represent user behavior throughout the video. The 360 video processing workflow and a study of its PSNR-related metrics are described in earlier studies and in JVET-J1012 for common test conditions and evaluation procedures [1], [9], [10]. The processing workflow enables PSNR metrics for 360 video to be measured in three phases, namely codec, cross-format spherical, and end-to-end spherical metrics. Similarly, earlier subjective evaluations have generated distorted videos using identical coding parameters, i.e., quantization parameter (QP) and resolution [9], [11]. These findings are in agreement with our own that perceived quality is proportional to resolution and inversely proportional to QP.

In recent years, the rise of 360 video on multimedia platforms (e.g., YouTube, Facebook, and Google) has been paired with spatial audio formats such as ambisonics, which provide a 3D auditory sensation while watching omnidirectional video. Low-bitrate audio codecs play an important role in streaming applications, and with this has come a growing interest in evaluating the quality of compressed ambisonic scenes. The findings in [12] state that lower-bitrate ambisonics yields lower quality scores and higher localization errors.

The use of 360 videos with ambisonics has been demonstrated for assessing specific perceptual attributes [13], [14]. However, to the best of our knowledge, perceived audiovisual quality and its interactions remain relatively unexplored. This paper describes a preliminary study investigating the perceptual quality of audio, video, and audiovisual quality of 360 videos with ambisonic reproduction. The aim of this study is to highlight the potential applications and limitations in this area, including an initial understanding of audiovisual interaction towards future multimodal audiovisual quality models. This work addresses the following questions:
• What are the effects of the video encoding parameters on the video quality score?
• What are the effects of the number of audio channels and bitrates on the audio quality score?
• How do perceived audio and video quality relate and combine into perceived audiovisual quality?
• What is an appropriate testing methodology to study these characteristics?

TABLE I: Characteristics of the audio and video contents.
Video           Lighting condition  Motion activity  Spatial complexity  Audio type                  Audio character    Audio source
1. Duomo        Low light           Low              Simple              Orchestra, solo             Reverb, clarity    Static
2. MotoCani     Bright, daylight    Fast             Medium              Natural, mechanical, human  Low-freq dominant  Dynamic
3. Autodafe     Low light           Low              Simple              Orchestra, group singer     Reverb             Static, dynamic
4. ParcoDucale  Bright, daylight    Low              Medium              Natural, human              Ambient dominant   Static, dynamic
5. Fisarmonica  Bright, daylight    Fast             Complex             Music, mechanical           Reverb-ambient     Dynamic

Fig. 1: Equirectangular view of the test videos: (a) Video 1, (b) Video 2, (c) Video 3, (d) Video 4, (e) Video 5.

Based on these questions, we established experiments following common approaches in audio and video quality evaluation. Because the number of experimental parameters makes a full factorial design impractical, we introduced an optimal custom design for the audiovisual quality assessment with a manageable trial size.

II. STUDY DESCRIPTION AND EXPERIMENTS
A. Stimuli
Audiovisual stimuli were provided by the Jump Video Dataset from the University of Parma [15]. This source contains 360 video with ambisonic spatial audio. Five 360 videos with bitrates of ∼30 Mbps in YUV color space with 4:2:0 chroma subsampling were selected. The raw 32-channel A-format audio was post-processed into AmbiX FOA (4-channel) B-format in PCM sampling format and temporally aligned with the video. The audio sampling rate was 48 kHz at 16 bits, giving a total bitrate of 3.072 Mbps (768 kbps/channel).
B. Encoding and Decoding
The video material was encoded using FFmpeg (with libx264) at a frame rate of 29.97 fps in an IBBP GOP structure with a GOP size of 16. Twenty encoding settings were applied to create quality degradations of each video, corresponding to combinations of the original quality and four quantization parameter (QP) values of 22, 27, 32, and 37 with four resolutions of 3840x1920 (4K), 2560x1280 (2.5K), 1920x1080 (fHD), and 1280x720 (HD) pixels. In total, 100 videos were used in this study.

Lossy audio coding was performed in FFmpeg with the low-complexity Advanced Audio Coding (AAC-LC) format to generate three low-bitrate ambisonic files at 64 kbps, 128 kbps, and 256 kbps. Note that each bitrate is the total for the 4 ambisonic channels. Uncompressed clips were also included in the test. Adobe Audition CC 2019 with the Spatial Audio Real-Time Application (SPARTA) VST plug-ins [16] was employed to decode each audio clip to the respective 5.0, 11.0, and 22.0 loudspeaker setups according to ITU-R BS.2051, generating 60 audio streams.
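The exact encoder invocations are not given in the paper; the following Python sketch illustrates how the degraded 4 QP x 4 resolution video matrix and the AAC-LC audio variants described above could be generated with FFmpeg. File names are hypothetical, and mapping the IBBP structure to two B-frames (`-bf 2`) is an assumption for illustration.

```python
import itertools
import subprocess

# Hypothetical input file names; the actual dataset paths differ.
VIDEOS = ["Duomo", "MotoCani", "Autodafe", "ParcoDucale", "Fisarmonica"]
QPS = [22, 27, 32, 37]
RESOLUTIONS = {"4K": "3840x1920", "2.5K": "2560x1280",
               "fHD": "1920x1080", "HD": "1280x720"}
AUDIO_BITRATES = ["64k", "128k", "256k"]

# Degraded video conditions (the originals would be added separately).
for name, qp, (label, res) in itertools.product(VIDEOS, QPS, RESOLUTIONS.items()):
    subprocess.run([
        "ffmpeg", "-y", "-i", f"{name}.mp4",
        "-an",                                   # video-only test clips
        "-c:v", "libx264",
        "-qp", str(qp),                          # constant QP, no rate control
        "-r", "29.97",                           # frame rate used in the study
        "-g", "16", "-bf", "2",                  # GOP size 16, IBBP-like structure
        "-vf", f"scale={res.replace('x', ':')}",
        f"{name}_{label}_qp{qp}.mp4",
    ], check=True)

# Low-bitrate ambisonic variants (total bitrate across the 4 FOA channels).
for name, br in itertools.product(VIDEOS, AUDIO_BITRATES):
    subprocess.run([
        "ffmpeg", "-y", "-i", f"{name}_foa.wav",  # 4-channel AmbiX FOA input
        "-c:a", "aac", "-b:a", br,                # AAC-LC encoder
        f"{name}_foa_{br}.m4a",
    ], check=True)
```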
C. Measures
Objective video quality metrics including PSNR and its variants were calculated using the HM16.16 reference software and the 360Lib software package [10]. The quality measure of a video stream is the average of its frame quality values (I, P, and B frames in YUV 4:2:0 color format). Following the PSNR processing chain for 360 video quality measures, these PSNRs cover three phases: codec PSNRs, Cross-Format (CF) PSNRs, and End-to-End (EE) PSNRs [9].

Subjective experiments were performed in FORCE Technology SenseLab's listening room, which fulfills the requirements of EBU Tech 3276 and ITU-R BS.1116-3 [17]. The test sequences were evaluated separately in terms of audio quality, video quality, and audiovisual quality, in this order. Prior to the test, the sound level (Leq) of all clips was calibrated and set to 65-70 dB (depending on the clip) at the sweet spot (the listener's head position), 1.2 meters above the floor and at a normal angle to the center-ceiling loudspeaker. We used a Samsung Odyssey+ head-mounted display (HMD), which has a display resolution of 1440x1600 per eye, a 110° horizontal field of view, and a 90 Hz refresh rate. The HMD was operated within the Windows Mixed Reality front-end platform connected to our SenseLabOnline system, providing an interactive user interface within the VR video [18]. A single-stimulus Absolute Category Rating (ACR) scale was used and customized with a continuous quality scale (CQS) without anchor or reference. CQS-ACR is introduced here, motivated by the single-stimulus rating found in SAMVIQ (Subjective Assessment Methodology for Video Quality) and the CQS found in MUSHRA and ITU-T P.800 [19]–[21].

Fig. 2: Codec PSNR (dB) vs. video bitrate (kbps) for (a) Video 1, (b) Video 3, and (c) Video 5, with curves for QP 22, 27, 32, and 37.
Fig. 3: MOS vs. PSNR-related metrics of the 360 videos used in the experiment: (a) Cross-Format (CF) PSNR, (b) End-to-End (EE) PSNR, and (c) average codec, CF, and EE PSNR.

A full factorial design was applied for the audio and video tests only. Due to the large number of audiovisual conditions, a full factorial audiovisual test was not feasible. Instead, an optimal custom design of experiment (DoE) was employed to yield a manageable trial size for the audiovisual quality test. A coordinate-exchange, D-optimal algorithm [22], [23] was run using Design-Expert 12 [24], since the goal is to find the factors important to the process.
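Design-Expert's implementation is not reproduced here; as a rough illustration of D-optimal selection, the following numpy sketch uses a simple point-exchange search (a simplified relative of the cited coordinate-exchange algorithm [23]) to pick a subset of candidate conditions that maximizes det(X^T X) under an assumed main-effects-plus-interaction model. The factors and levels are illustrative stand-ins, not the study's full condition set.

```python
import numpy as np
from itertools import product

# Illustrative candidate set: two coded factors (video QP, audio bitrate).
qps = [22, 27, 32, 37]
bitrates = [64, 128, 256]
candidates = np.array(list(product(qps, bitrates)), dtype=float)

def model_matrix(points):
    """Intercept, both main effects, and the two-factor interaction."""
    a, b = points[:, 0], points[:, 1]
    return np.column_stack([np.ones(len(points)), a, b, a * b])

def d_optimal(candidates, n_runs, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    design = rng.choice(len(candidates), n_runs, replace=False)
    for _ in range(n_iter):
        improved = False
        for i in range(n_runs):             # try exchanging each design point
            for j in range(len(candidates)):  # against every candidate point
                trial = design.copy()
                trial[i] = j
                X_new = model_matrix(candidates[trial])
                X_old = model_matrix(candidates[design])
                # Keep the exchange if it increases the D-criterion det(X'X).
                if np.linalg.det(X_new.T @ X_new) > np.linalg.det(X_old.T @ X_old):
                    design = trial
                    improved = True
        if not improved:
            break
    return candidates[design]

print(d_optimal(candidates, n_runs=8))
```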
D. Procedures
Twenty pre-screened consumers (15 males, 5 females; mean age 35.4 years, SD 7.8) participated in this paid 4-hour study, with sessions split across two days. Auditory and visual screening tests were performed prior to the experiment, following the ITU-T P.910 recommendation as closely as possible [25]. A Snellen chart and Ishihara plates were used to confirm visual acuity and normal color vision. Sixteen assessors were audiometrically normal and 4 self-reported normal hearing.

During the test, the assessors were required to provide ratings using customized modular buttons which controlled the user interface and rating scale. The rating interface was projected onto a screen for the audio quality test and shown directly in the virtual video for the video and audiovisual quality tests. The tests were presented in double-blind random order. The system automatically encouraged the assessors to take a short break every twenty minutes.

TABLE II: Pearson's and Spearman's correlations between objective video metrics and MOS_V.

Phase         Metric     PCC     SROCC   RMSE
Codec         PSNR       0.6592  0.6557  0.3774
Codec         WS-PSNR    0.6793  0.6725  0.3670
Cross-Format  SPSNR-NN
Cross-Format  SPSNR-I    0.8218  0.8105  0.2847
Cross-Format  CPP-PSNR   0.8215  0.8091  0.2840
End-to-End    SPSNR-NN   0.8181  0.8076  0.2861
End-to-End    SPSNR-I    0.8220  0.8113  0.2847
End-to-End    CPP-PSNR   0.8219  0.8104  0.2839
End-to-End    WS-PSNR    0.8185  0.8073  0.2850
III. RESULTS
A. Objective Quality of the Test Videos
Fig. 2 illustrates the codec PSNR values and video bitrates for the QP encoding settings. All test videos except Video 3 have PSNR values ranging from 30 dB to 45 dB. The PSNR of Video 3 starts above 35 dB and reaches nearly 50 dB at QP 22, whereas Videos 1 and 5 start below 35 dB. The plots clearly show that video resolution and QP contribute significantly to the video bitrate. Although the resolutions differ, PSNR remains nearly identical at the same QP values. All of the PSNR values follow a similar trend over bitrate, except for the 1280x720 resolution, whose curve differs slightly. Moreover, video at 3840x1920 resolution has higher bitrates relative to the lower resolutions.

In total, there are 9 different PSNR metrics, corresponding to different computation points in the 360 video processing chain [10]. They build on five basic objective quality metrics: PSNR, weighted-to-spherically-uniform PSNR (WS-PSNR) [5], [26], spherical PSNR based on nearest-neighbor position (S-PSNR-NN) [6], spherical PSNR with interpolation (S-PSNR-I) [7], and PSNR in Craster parabolic projection (CPP-PSNR) [8]. As depicted in Fig. 3 (a-b), almost all PSNR variants in the Cross-Format and End-to-End phases have identical values for each video, and the differences between PSNR metrics for a video coded with the same encoding parameters are relatively small.
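For intuition on the weighting idea behind these spherical metrics, WS-PSNR on an ERP frame scales each pixel's squared error by the cosine of its latitude, so that the oversampled polar rows do not dominate the average [5], [26]. A minimal sketch, assuming single-channel (luma) frames:

```python
import numpy as np

def ws_psnr_erp(ref, dist, max_val=255.0):
    """WS-PSNR for equirectangular frames using cosine-of-latitude weights."""
    h, w = ref.shape
    # Weight of each pixel row j: cos((j + 0.5 - h/2) * pi / h)
    weights = np.cos((np.arange(h) + 0.5 - h / 2) * np.pi / h)
    weights = np.repeat(weights[:, None], w, axis=1)
    err = ref.astype(np.float64) - dist.astype(np.float64)
    wmse = np.sum(weights * err ** 2) / np.sum(weights)
    return 10 * np.log10(max_val ** 2 / wmse)

# Example with random frames standing in for decoded luma planes.
ref = np.random.randint(0, 256, (960, 1920)).astype(np.float64)
dist = np.clip(ref + np.random.normal(0, 4, ref.shape), 0, 255)
print(ws_psnr_erp(ref, dist))
```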
By averaging the values across all videos, Fig. 3 (c) shows that both the Cross-Format and End-to-End metrics capture the influence of resolution, which, as expected, the codec PSNR clearly does not.

The correlations between PSNR-related metrics and the mean opinion subjective scores, expressed as the Pearson correlation coefficient (PCC), the Spearman rank-order correlation coefficient (SROCC), and the root mean square error (RMSE), are given in Table II. As can be seen in Table II, MOS_V shows a rather low correlation and high RMSE with the PSNRs measured on the coding distortion (PSNR and WS-PSNR), with PCC and SROCC values below 0.7, whereas the Cross-Format and End-to-End metrics reach PCC and SROCC values above 0.8 with lower RMSE.
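These statistics can be reproduced from paired metric/MOS vectors with standard tools. The sketch below uses hypothetical values and assumes, as is common practice, that RMSE is computed after a first-order fit of MOS against the metric; the paper does not state its exact fitting procedure.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical paired values: objective metric (dB) and MOS per processed sequence.
psnr = np.array([31.2, 33.5, 35.1, 37.8, 40.2, 42.6])
mos = np.array([1.4, 1.8, 2.1, 2.4, 2.7, 2.9])

pcc, _ = pearsonr(psnr, mos)
srocc, _ = spearmanr(psnr, mos)

# RMSE of MOS predicted from the metric via a simple linear mapping.
slope, intercept = np.polyfit(psnr, mos, 1)
rmse = np.sqrt(np.mean((mos - (slope * psnr + intercept)) ** 2))
print(f"PCC={pcc:.4f} SROCC={srocc:.4f} RMSE={rmse:.4f}")
```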
B. Perceptual Video Quality

Fig. 4: Perceptual video quality (MOS_V) vs. (a) video resolution and (b) QP.

Fig. 4 shows the results for perceived video quality MOS_V over video resolution and quantization parameter (QP). As shown in Fig. 4(a), MOS_V increases as the video resolution is increased, but it decreases when QP is increased. This is because QP regulates how much spatial detail is preserved: the QP value determines the quantization step size applied to the Discrete Cosine Transform (DCT) coefficients in the frequency domain, so small QP values approximate a block's spatial frequency spectrum more accurately.

For these videos, MOS_V lies within a narrow range, with a maximum score of only 3 (Fair) for the best-quality video presented. The confidence interval (CI) is relatively small at the lowest-quality resolution and QP settings, which indicates common agreement that the quality there is very poor and the degradation highly noticeable. Furthermore, although the impact of the video encoding parameters on the video quality score is clear, the CIs show that only small differences are noticeable. We argue that this result is due to the absence of reference content of excellent quality, which, had it been available, would have allowed the quality to span the complete MOS_V score range and improved the results.
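To make the step-size argument concrete: in H.264/AVC the quantization step size approximately doubles for every increase of 6 in QP, so the QP range used here spans a substantial quantization range.

```latex
% Approximate H.264/AVC quantization step size as a function of QP:
Q_{\mathrm{step}}(QP) \approx 2^{(QP-4)/6}
% Over the tested range: Q_step(37)/Q_step(22) = 2^{(37-22)/6} = 2^{2.5} \approx 5.7,
% i.e. roughly 5.7 times coarser quantization at QP 37 than at QP 22.
```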
C. Perceptual Audio Quality

Fig. 5: Perceptual audio quality (MOS_A) vs. (a) audio bitrate and (b) loudspeaker channel.

The mean opinion score (MOS_A) of perceived audio quality is presented in Fig. 5 for (a) audio bitrate and (b) loudspeaker channel. Audio bitrate has a positive correlation with MOS_A. The highest score is obtained for the original clips (3,072 kbps); however, even the original stimuli did not yield a maximum score. There is no difference between the audio clips at a given bitrate, and no stimulus dominates with the highest score across all bitrates. Statistically significant differences are found between the audio at 64 kbps and 128 kbps, with smaller differences relative to 256 kbps. Only a few samples differ significantly between 256 kbps and the original audio.

Fig. 5 (b) shows that there is no relationship between the number of loudspeaker channels and the subjective score, as seen from the overlapping confidence intervals. Whilst the quality and nature of the audiovisual content are well suited to this study, it would be desirable to have a larger, more critical, high-quality database for future studies. Non-critical material leads to low sensitivity to spatial changes, so assessors are unable to discriminate between different numbers of loudspeaker channels. Similar findings have been reported in other studies [27] with broadcast-quality programme material. Furthermore, this finding is in line with the study in [12], which described how lossy compression of ambisonics causes timbral distortion and thereby reduces localization accuracy. That study revealed that localization errors occur not only in 1st-order but also in 3rd- and 5th-order ambisonics. Moreover, it found no significant difference in median timbral-distortion scores between ambisonic orders and bitrates. It would appear that in complex tasks, where multiple characteristics are to be considered simultaneously by assessors, certain characteristics dominate. In this study, audio bitrate has a clearly significant impact on the perceived sound quality, which may also mask the small differences between numbers of loudspeaker channels. Therefore, to enhance assessors' sensitivity in such complex tasks and thus improve the obtained results, comprehensive training with multiple parameters could be considered prior to the test. Thereafter, further investigations might study the performance of test methods with and without a reference signal.
D. Audiovisual Quality Model

Here we present an initial investigation towards an audiovisual quality model of 360 video with ambisonic audio based on the subjective data. The correlations of the subjective data between audiovisual quality (MOS_AV) and audio quality (MOS_A), video quality (MOS_V), and their product (MOS_A·MOS_V) were evaluated, as shown in Table III. The correlation coefficients vary for each interaction and each video. Video 3 generally has the highest correlations of MOS_A, MOS_V, and MOS_A·MOS_V with MOS_AV. Overall, MOS_A·MOS_V shows the highest correlation with MOS_AV for all videos.

The audiovisual quality can be modelled as a combination of audio quality, video quality, and their interaction. Here we evaluated six models proposed in previous studies, as summarized in [28]. The models consist of two or three predictors, as shown in (1)-(4), or parametric function forms (5)-(6). The last two models were proposed by [29] and are called the weighted Minkowski and power models.

MOS_AV = α_0 + α_1 MOS_A + α_2 MOS_V + α_3 MOS_A MOS_V    (1)
MOS_AV = α_0 + α_1 MOS_A MOS_V                            (2)
MOS_AV = α_0 + α_1 MOS_V + α_2 MOS_A MOS_V                (3)
MOS_AV = α_0 + α_1 MOS_A + α_2 MOS_V                      (4)
MOS_AV = (α_1 MOS_A^P + α_2 MOS_V^P)^(1/P)                (5)
MOS_AV = α_0 + α_1 MOS_A^{P1} MOS_V^{P2}                  (6)

where α_0, α_1, α_2, and α_3 are weighting parameters and P, P1, and P2 are exponents; depending on the application, they may vary between studies (α_0 only improves the fit of the residuals and does not affect the correlation).

The subjective models were computed with an 80:20 ratio between training and test data. Pearson and Spearman correlations and the RMSE between predicted and test scores were calculated, as shown in Table IV. From Table IV, all models show a good fit to the data.
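Given the training split, the weighting parameters of each model can be estimated by least squares; the nonlinear forms (5)-(6) require a nonlinear solver. A minimal sketch for the power model (6) using scipy, with placeholder score vectors standing in for the actual training data:

```python
import numpy as np
from scipy.optimize import curve_fit

def power_model(X, a0, a1, p1, p2):
    """Eq. (6): MOS_AV = a0 + a1 * MOS_A^p1 * MOS_V^p2."""
    mos_a, mos_v = X
    return a0 + a1 * mos_a**p1 * mos_v**p2

# Placeholder subjective scores; the training split would come from the test data.
mos_a = np.array([2.1, 2.8, 3.2, 1.9, 2.5, 3.0])
mos_v = np.array([1.5, 2.2, 2.9, 1.4, 2.0, 2.7])
mos_av = np.array([1.6, 2.3, 2.9, 1.5, 2.1, 2.8])

# Nonlinear least-squares fit of the weighting parameters and exponents.
params, _ = curve_fit(power_model, (mos_a, mos_v), mos_av, p0=[0.0, 1.0, 1.0, 1.0])
pred = power_model((mos_a, mos_v), *params)
rmse = np.sqrt(np.mean((mos_av - pred) ** 2))
print(params, rmse)
```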
TABLE III: Correlations of MOS_A, MOS_V, and MOS_A·MOS_V with MOS_AV, over all AV contents and per video (1-5).

TABLE IV: Accuracy between predicted and test data for the audiovisual models (1)-(6) in terms of PCC, SROCC, and RMSE.
Although the power model (6) yields the highest correlation (PCC 0.930, SROCC 0.935) and the lowest RMSE (0.198), the differences between the models are relatively small. This result is nevertheless consistent with earlier studies [29], in which the power model showed the best fit across the tested models. It should be noted that, instead of using the original values of the weighting parameters from the referenced models as in [29], we adopt only the model forms and compute the weighting parameters of each model for this immersive application.

IV. CONCLUSION
We carried out a subjective experiment on audio, video, and audiovisual quality with 360 video displayed on an HMD and low-bitrate ambisonic loudspeaker reproduction, evaluated using a CQS-ACR methodology. The main findings can be summarized as follows:
• Besides the common relationship between video PSNR, encoding parameters, and subjective scores, we show that the Cross-Format and End-to-End PSNRs can predict the performance across resolutions and exhibit a linear relationship. In perceived quality, significant differences are noticeable. However, MOS_V has a considerably narrow range, with a maximum below 3, reflecting the quality limitation shown objectively in Fig. 2. This limited quality suggests an urgent need for synchronized 360 video with ambisonics for audiovisual research.
• There is a significant difference in MOS_A across audio bitrates, whereas different numbers of loudspeaker channels show no significant perceptual effect. The auditory stimuli (e.g., ambisonic order), rating methods, task complexity, and assessors' sensitivity could all contribute to this.
• The correlation between subjective scores shows that the multiplicative term MOS_A·MOS_V has a very high correlation compared with the others. Among the models, although the power model has the best accuracy, the differences between AV models are indistinguishable. However, the results imply that video quality dominates, which is consistent with findings for Internet Protocol Television [30] and high-motion video [31] applications. The proposed DoE performs well, indicating its potential for further investigation.

V. FUTURE WORK
To improve the performance of audiovisual quality models for immersive content, the development of a transparent, broader, high-quality, and critical audiovisual database is important. Regarding the test methodology, no standard approach currently exists for assessing the audiovisual quality of immersive content. The methodology employed in this study was able to resolve many important perceptual characteristics, but it also highlighted limitations to be overcome.

The use of reference-and-anchor methods such as MUltiple Stimuli with Hidden Reference and Anchor (MUSHRA) for audio and the Subjective Assessment Methodology for Video Quality (SAMVIQ) for video might provide new directions for our research. Furthermore, a joint study of objective and subjective measures of immersive audiovisual content could reveal the relationship between those metrics and help in finding a model with accurate prediction. This study, however, relies mainly on subjective models. Future work will consider objective models using the objective spatial audio metric AMBIQUAL [32]. A number of machine learning approaches to multimodal fusion offer the opportunity to learn and propose more accurate and faster predictions.
ACKNOWLEDGMENT
This research was supported by the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 765911 (RealVision). We thank Prof. Angelo Farina for allowing us to use his datasets. The authors also thank colleagues from FORCE Technology SenseLab who provided insight and expertise that greatly assisted the research.
REFERENCES
[1] C. Li, M. Xu, S. Zhang, and P. L. Callet, "State-of-the-art in 360° video/image processing: Perception, assessment and compression," arXiv preprint arXiv:1905.00161, 2019.
[2] P. Pérez and J. Escobar, "MIRO360: A tool for subjective assessment of 360 degree video for ITU-T P.360-VR," IEEE, 2019, pp. 1–3.
[3] R. Schatz, A. Sackl, C. Timmerer, and B. Gardlo, "Towards subjective quality of experience assessment for omnidirectional video streaming," IEEE, 2017, pp. 1–6.
[4] M. Xu, C. Li, Z. Chen, Z. Wang, and Z. Guan, "Assessing visual quality of omnidirectional videos," IEEE Transactions on Circuits and Systems for Video Technology, 2018.
[5] Y. Sun, A. Lu, and L. Yu, "AHG8: WS-PSNR for 360 video objective quality evaluation," document JVET-D0040, 2016.
[6] Y. He, B. Vishwanath, X. Xiu, and Y. Ye, "AHG8: InterDigital's projection format conversion tool," document JVET-D0021, 2016.
[7] M. Yu, H. Lakshman, and B. Girod, "A framework to evaluate omnidirectional video coding schemes," IEEE, 2015, pp. 31–36.
[8] V. Zakharchenko, E. Alshina, A. Singh, and A. Dsouza, "AHG8: Suggested testing procedure for 360-degree video," Joint Video Exploration Team of ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, JVET-D0027, Chengdu, 2016.
[9] P. Hanhart, Y. He, Y. Ye, J. Boyce, Z. Deng, and L. Xu, "360-degree video quality evaluation," IEEE, 2018, pp. 328–332.
[10] J. Boyce, E. Alshina, A. Abbas, and Y. Ye, "JVET common test conditions and evaluation procedures for 360 video," Joint Video Exploration Team of ITU-T SG 16, 2017.
[11] H. T. Tran, N. P. Ngoc, C. T. Pham, Y. J. Jung, and T. C. Thang, "A subjective study on QoE of 360 video for VR communication," IEEE, 2017, pp. 1–6.
[12] T. Rudzki, I. Gomez-Lanzaco, P. Hening, J. Skoglund, T. McKenzie, J. Stubbs, D. Murphy, and G. Kearney, "Perceptual evaluation of bitrate compressed ambisonic scenes in loudspeaker based reproduction," Audio Engineering Society, 2019.
[13] M. Kentgens, S. Kühl, C. Antweiler, and P. Jax, "From spatial recording to immersive reproduction: Design & implementation of a 3DOF audio-visual VR system," in Audio Engineering Society Convention 145, 2018.
[14] M. Olko, D. Dembeck, Y.-H. Wu, A. Genovese, and A. Roginska, "Identification of perceived sound quality attributes of 360º audio-visual recordings in VR using a free verbalization method," in Audio Engineering Society Convention 143, 2017.
[15] "Index of /Public/Jump-Videos," http://pcfarina.eng.unipr.it/Public/Jump-Videos/, 2019.
[16] L. McCormack and A. Politis, "SPARTA & COMPASS: Real-time implementations of linear and parametric spatial audio reproduction and processing methods," Audio Engineering Society, 2019.
[17] ITU-R Rec. BS.1116-3, Methods for the Subjective Assessment of Small Impairments in Audio Systems including Multichannel Sound Systems, International Telecommunication Union Std., 2015.
[18] "SenseLabOnline," https://senselabonline.com/SLO/4.0.3/.
[19] ITU-R Rec. BS.1534-3, Method for the Subjective Assessment of Intermediate Quality Levels of Coding Systems, International Telecommunication Union Std., 2015.
[20] ITU-R Rec. BT.1788, Methodology for the Subjective Assessment of Video Quality in Multimedia Applications, International Telecommunication Union Std., 2007.
[21] ITU-T P.800, Methods for Subjective Determination of Transmission Quality, International Telecommunication Union Std., 1996.
[22] R. S. John and N. R. Draper, "D-optimality for regression designs: A review," Technometrics, vol. 17, no. 1, pp. 15–23, 1975.
[23] R. K. Meyer and C. J. Nachtsheim, "The coordinate-exchange algorithm for constructing exact optimal experimental designs," Technometrics, vol. 37, no. 1, pp. 60–69, 1995.
[24] "Design-Expert," Stat-Ease, Inc.
[25] ITU-T P.910, Subjective Video Quality Assessment Methods for Multimedia Applications, International Telecommunication Union Std., 2008.
[26] Y. Sun, A. Lu, and L. Yu, "Weighted-to-spherically-uniform quality evaluation for omnidirectional video," IEEE Signal Processing Letters, vol. 24, no. 9, pp. 1408–1412, 2017.
[27] N. Zacharov, C. Pike, F. Melchior, and T. Worch, "Next generation audio system assessment using the multiple stimulus ideal profile method," IEEE, 2016, pp. 1–6.
[28] Z. Akhtar and T. H. Falk, "Audio-visual multimedia quality assessment: A comprehensive survey," IEEE Access, vol. 5, pp. 21090–21117, 2017.
[29] H. B. Martinez and M. C. Farias, "Full-reference audio-visual video quality metric," Journal of Electronic Imaging, vol. 23, no. 6, p. 061108, 2014.
[30] M. Garcia and A. Raake, "Impairment-factor-based audio-visual quality model for IPTV," IEEE, 2009, pp. 1–6.
[31] D. S. Hands, "A basic multimedia quality model," IEEE Transactions on Multimedia, vol. 6, no. 6, pp. 806–816, 2004.
[32] M. Narbutt, A. Allen, J. Skoglund, M. Chinen, and A. Hines, "AMBIQUAL: A full reference objective quality metric for ambisonic spatial audio," in 2018 Tenth International Conference on Quality of Multimedia Experience (QoMEX).