Kalman Filter-based Head Motion Prediction for Cloud-based Mixed Reality
Serhan Gül, Sebastian Bosse, Dimitri Podborski, Thomas Schierl, Cornelius Hellge
Serhan Gül
Fraunhofer HHI, Berlin, Germany
[email protected]

Sebastian Bosse
Fraunhofer HHI, Berlin, Germany
[email protected]

Dimitri Podborski
Fraunhofer HHI, Berlin, Germany
[email protected]

Thomas Schierl
Fraunhofer HHI, Berlin, Germany
[email protected]

Cornelius Hellge
Fraunhofer HHI, Berlin, Germany
[email protected]
ABSTRACT
Volumetric video allows viewers to experience highly-realistic 3D content with six degrees of freedom in mixed reality (MR) environments. Rendering complex volumetric videos can require a prohibitively high amount of computational power for mobile devices. A promising technique to reduce the computational burden on mobile devices is to perform the rendering at a cloud server. However, cloud-based rendering systems suffer from an increased interaction (motion-to-photon) latency that may cause registration errors in MR environments. One way of reducing the effective latency is to predict the viewer's head pose and render the corresponding view from the volumetric video in advance.

In this paper, we design a Kalman filter for head motion prediction in our cloud-based volumetric video streaming system. We analyze the performance of our approach using recorded head motion traces and compare its performance to an autoregression model for different prediction intervals (look-ahead times). Our results show that the Kalman filter predicts head orientations more accurately than the autoregression model over the considered look-ahead times.

KEYWORDS

volumetric video, augmented reality, mixed reality, cloud-based rendering, head motion prediction, Kalman filter, time series analysis
1 INTRODUCTION

With the advances in volumetric capture technologies, volumetric video has been gaining importance for the immersive representation of 3D scenes and objects for virtual reality (VR) and augmented reality (AR) applications [42]. Combined with highly accurate positional tracking technologies, volumetric video allows users to freely explore six degrees of freedom (6DoF) content and enables novel mixed reality (MR) applications where highly realistic virtual objects can be placed inside real environments and animated based on user interaction [14].

The geometry of volumetric objects is usually represented using meshes or point clouds. High-quality volumetric meshes typically contain thousands of polygons, and high-quality point clouds may contain millions to billions of points [41, 43]. Therefore, rendering complex volumetric content is still a very demanding task despite the remarkable computing power available in today's mobile devices [10]. Moreover, no efficient hardware implementations of mesh/point cloud decoders are available yet. Software-based decoding can be prohibitively expensive in terms of battery usage and may not be able to meet the real-time rendering requirements [37].

One way to avoid the complex rendering on mobile devices is to offload the processing to a powerful remote server which dynamically renders a 2D view from the volumetric video based on the user's actual head pose [46]. The server then compresses the rendered texture into a 2D video stream and transmits it over a network to the client. The client can then efficiently decode the video stream using its hardware decoder and display the dynamically updated content to the viewer. Moreover, the cloud-based rendering approach allows utilizing highly efficient 2D video coding techniques and thus can reduce the network bandwidth requirements by avoiding the transmission of the volumetric content [37].

However, one drawback of cloud-based rendering is the increased interaction latency, also known as the motion-to-photon (M2P) latency [47]. Due to the network round-trip time and the added processing delays, the M2P latency is higher than in a system that performs the rendering locally. Several studies show that an increased interaction latency may lead to a degraded user experience and motion sickness [8, 27, 30].

One way to reduce the latency is to predict the user's future head pose at the cloud server and render the corresponding view of the volumetric content in advance. Thereby, it is possible to significantly reduce or even eliminate the M2P latency, if the user pose is successfully predicted for a look-ahead time (LAT) equal to or larger than the M2P latency of the system. However, mispredictions of head motion may increase registration errors and degrade the user experience in AR environments [27]. Therefore, designing accurate head motion prediction algorithms is crucial for high-quality volumetric video streaming.

In this paper, we consider the problem of head motion prediction for cloud-based AR/MR applications. Our main contributions are as follows:

• We develop a Kalman filter-based predictor for head motion prediction in 6DoF space and analyze its performance compared to an autoregression model and a baseline (no prediction) model using recorded head motion traces.

• We present an architecture for the integration of the Kalman filter-based predictor into our existing cloud-based volumetric streaming framework.

The paper is structured as follows.
Sec. 2 gives an overview of the literature on volumetric video streaming, remote rendering and head motion prediction. Sec. 3 gives an overview of our cloud-based volumetric video streaming system. Sec. 4 describes the developed Kalman filter-based predictor and presents a framework for its integration into our volumetric streaming system. Sec. 5 presents our experimental setup and the evaluation results. Sec. 6 concludes the paper.
2 RELATED WORK

2.1 Volumetric Video Streaming

A few recent works deal with efficient streaming of volumetric videos in different content representations. Hosseini and Timmerer [17] extended the concepts of Dynamic Adaptive Streaming over HTTP (DASH) to point cloud streaming. They proposed different approaches for spatial subsampling of dynamic point clouds to decrease the density of points in the 3D space and thus reduce the bandwidth requirements. Park et al. [35] proposed using 3D tiles for streaming of voxelized point clouds. Their system selects 3D tiles and adjusts the corresponding levels of detail (LODs) using a rate-utility model that considers the user's viewpoint and distance to the object. Qian et al. [37] developed a point cloud streaming system that uses an edge proxy to convert point cloud streams into 2D video streams based on the user's viewpoint in order to enable efficient decoding on mobile devices. They also proposed various optimizations to reduce the M2P latency between the client and the edge proxy. Van der Hooft et al. [50] proposed an adaptive streaming framework compliant with the recent point cloud compression standard MPEG V-PCC [43]. Their framework, PCC-DASH, enables adaptive streaming of scenes with multiple dynamic point cloud objects. They also presented rate adaptation techniques that rely on the user's position and focus as well as the available bandwidth and the client's buffer status to select the optimal quality representation for each object. Petrangeli et al. [36] proposed a streaming framework for AR applications that dynamically decides which virtual objects should be fetched from the server as well as their LODs, depending on the proximity of the user and the likelihood of the user to view the object.
2.2 Remote Rendering

The idea of offloading the rendering process to a powerful remote server was first considered in the 1990s, when PCs did not have sufficient computational power for intensive graphics tasks [46]. A remote rendering system renders complex graphics on a powerful server and delivers the result over a network to a less powerful client device. With the advent of cloud gaming and Mobile Edge Computing (MEC), interactive remote rendering applications have started to emerge which allow the client device to control the rendering application based on user interaction [28, 37, 45]. Mangiante et al. [28] presented a MEC system for field-of-view (FoV) rendering of 360° videos that optimizes the required bandwidth and reduces the processing requirements and battery utilization. Shi et al. [45] developed a MEC system to stream AR scenes containing only the user's FoV plus a latency-adaptive margin around it. They deployed the prototype on a MEC node connected to an LTE network and evaluated its performance. A detailed survey of interactive remote rendering systems is given in [46].
2.3 Head Motion Prediction

Previous techniques for head motion prediction were mainly developed for dealing with the rendering and display delays of early AR systems. In his dissertation, Azuma [4] developed an AR system that relies on head motion prediction to reduce the dynamic registration errors of virtual objects. His results indicate that prediction is most effective for short prediction intervals of less than 80 ms. In a follow-up work, Azuma and Bishop [5] presented a frequency-domain analysis of head motion prediction and concluded that the error in predicted position grows rapidly with increasing prediction intervals and head motion signal frequencies. Van Rhijn et al. [51] proposed a framework for Bayesian predictive filtering algorithms and studied the effect of the filter parameters on the prediction performance. They compared the performances of different prediction methods using both synthetic and experimental data. LaViola [25] presented a comparison of the unscented Kalman filter (UKF) and the extended Kalman filter (EKF) for the prediction of head and hand orientation represented with quaternions. Kraft [21] proposed a quaternion-based UKF that extends the original UKF formulation to address the inherent properties of unit quaternions. The developed filter was applied to the prediction of head orientation; however, the evaluation is limited to simulated motion. Himberg and Motai [15] proposed an EKF that operates on the change of quaternions between consecutive time points (delta quaternion) and showed that their approach provides similar prediction performance to a quaternion-based EKF with less computational burden.

In recent years, with the resurgence of interest in VR, head motion prediction regained importance for the prediction of the future user viewport in 360° videos. Bao et al. [6] developed regression models to predict the user's viewport. They also used the same models to predict the accuracy of prediction in order to determine the size of the margins around the viewport for efficient transmission of 360° videos. Sanchez et al. [12] proposed angular velocity and angular acceleration based predictors to tackle the delay issue in tile-based viewport-dependent streaming. Qian et al. [38] analyzed the performance of several machine learning algorithms on head movement traces collected from 130 diverse users and employed different prediction algorithms depending on the prediction interval. The developed prediction framework was integrated into a streaming system for 360° videos to reduce the bandwidth usage or boost the video quality given the same bandwidth.

The principles of head motion prediction for 360° videos and volumetric videos in mixed reality are similar. However, volumetric videos are more complex and allow movement with a higher degree of freedom, which makes prediction a more difficult task. In our volumetric streaming system, we employ a Kalman filter-based prediction framework to jointly predict both translational and rotational head movements in 6DoF space.
Figure 1: High-level operation of our cloud-based volumetric streaming system.
3 SYSTEM OVERVIEW

Fig. 1 shows an overview of our cloud-based volumetric streaming system. We abstract the software components as functional blocks to focus on the prediction aspects. A detailed software architecture of our system is described in [1].

At the cloud server, we store our compressed volumetric video as a single MP4 file containing video and mesh tracks. Particularly, we encode the texture atlas using H.264/AVC and the mesh geometry using Google Draco, which implements the Edgebreaker algorithm [39]. The compressed mesh and texture data are multiplexed into different tracks of an MP4 file ready to be processed by the game engine (Unity) running at our server, by means of a native plug-in that demultiplexes and decodes the respective data streams.

Head movement of the user is described by the state vector $x_k$. The tracking system of the AR headset measures the head pose with a sampling interval of $t_s$. The measurement $z_k$ is then sent over a network to the cloud server, which renders the corresponding view from the volumetric content (textured meshes) according to $z_k$. Next, the rendered view is encoded as a video stream using the NVIDIA hardware encoder (NVENC) [33] and sent to the client using WebRTC [11]. We selected WebRTC as the delivery protocol since it provides low-latency (real-time) streaming capabilities and is already widely adopted by different web browsers, allowing our system to support several different platforms. After the transmission, the client decodes the received video stream and displays it to the viewer.

The time period between the head movement and the display of the decoded video frame to the viewer is the M2P latency of the system, which we aim to compensate by applying prediction.

4 HEAD MOTION PREDICTION

We propose to use a Kalman filter for 6DoF head motion prediction in our cloud-based volumetric streaming system. As benchmarks, we investigate the performance of a Baseline model and an autoregression model.
4.1 Baseline Model

The Baseline model represents the operation of the system without prediction. We assume that the prediction time is set equal to the M2P latency such that the prediction completely eliminates the latency. For a prediction time of $N$ samples, the measurement $z_k$ is simply propagated $N$ samples ahead in our simulations and set as the user pose at time $k+N$, i.e. $\hat{x}_{k+N} = z_k$.
4.2 Autoregression Model

Autoregressive (AutoReg) models use a linear combination of the past values of a variable to forecast its future values [18]. An AutoReg model of lag order $\rho$ can be written as

$$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_\rho y_{t-\rho} + \epsilon_t \qquad (1)$$

where $y_t$ is the true value of the time series $y$ at time $t$, $\epsilon_t$ is white noise, and $\phi_i$ are the coefficients of the model. Such a model with $\rho$ lagged values is referred to as an AR($\rho$) model. The optimal lag order for the model can be automatically determined using statistical tests such as the Akaike Information Criterion (AIC) [2]. AutoReg models must first be trained to learn the model coefficients, and the learned coefficients $\phi_i$ as well as the lag order $\rho$ may vary depending on the training data.

Typically, we need to predict not only the next sample but multiple samples ahead in the future to achieve a given look-ahead time (LAT). Therefore, we repeat the prediction step in a sliding-window fashion by using the just-predicted sample for the prediction of the next sample, and we iterate Eq. (1) until we reach the desired LAT. Fig. 2 shows an example that demonstrates the iterations of the AutoReg model for a history window of 3 samples and a LAT of 2 samples.

Figure 2: An example showing multi-step ahead prediction using an autoregression model for a history window of 3 and a look-ahead time (LAT) of 2 samples.
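As an illustration of this sliding-window iteration, the following minimal Python sketch fits an AR model with the statsmodels library (which we also use in Sec. 5) and feeds each prediction back into the window; the function names are illustrative and not part of our framework:

```python
import numpy as np
from statsmodels.tsa.ar_model import AutoReg

def fit_ar(train, lags=40):
    """Fit an AR(lags) model to a 1-D series; returns (c, phi)."""
    params = np.asarray(AutoReg(train, lags=lags).fit().params)
    return params[0], params[1:]  # intercept c, coefficients phi_1..phi_rho

def predict_lat(history, c, phi, n_steps):
    """Iterate Eq. (1) n_steps times, feeding each prediction back
    into the sliding window as in Fig. 2."""
    window = list(history[-len(phi):])        # last rho samples, oldest first
    for _ in range(n_steps):
        # phi[0] weights the most recent sample, phi[-1] the oldest
        y_next = c + float(np.dot(phi, window[::-1]))
        window = window[1:] + [y_next]        # slide the window forward
    return window[-1]                         # estimate n_steps ahead
```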
4.3 Kalman Filter

Basics.
The Kalman filter estimates the state $x \in \mathbb{R}^n$ of a discrete-time process expressed by the linear difference equation

$$x_k = F x_{k-1} + w_{k-1}, \qquad (2)$$

with a measurement (observation) $z \in \mathbb{R}^m$ expressed by

$$z_k = H x_k + v_k, \qquad (3)$$

where the random variables $w_k$ and $v_k$ represent the process and measurement noise, respectively. They are assumed to be independent of each other and to have the Gaussian distributions $p(w) \sim \mathcal{N}(0, Q)$ and $p(v) \sim \mathcal{N}(0, R)$, where $Q$ and $R$ are the process and measurement noise covariance matrices, respectively. The general formulation of the Kalman filter allows $Q$ and $R$ to change at each time step; however, we assume that they remain constant. The matrix $F \in \mathbb{R}^{n \times n}$ represents the state transition (process) model that relates the state at the previous time step $x_{k-1}$ to the current state $x_k$. The matrix $H \in \mathbb{R}^{m \times n}$ represents the observation model that relates the state to the measurement. Since no external control is involved in a subject's head movements, we do not include a control input in our framework [53].

Given the knowledge of the state before step $k$, we define $\hat{x}^-_k$ as our a priori state estimate, and given the measurement $z_k$ at step $k$, we define $\hat{x}_k$ as our a posteriori state estimate. We can then define the a priori and a posteriori estimate error covariances as

$$P^-_k = \mathrm{cov}(x_k - \hat{x}^-_k), \qquad (4)$$

$$P_k = \mathrm{cov}(x_k - \hat{x}_k), \qquad (5)$$

respectively.

A complete Kalman filter cycle consists of the time update ("prediction") and measurement update ("correction") steps. The time update projects the current state estimate ahead in time using the process model $F$. Given the previous state estimate $\hat{x}_{k-1}$, the a priori state and error covariance estimates are obtained by

$$\hat{x}^-_k = F \hat{x}_{k-1}, \qquad (6)$$

$$P^-_k = F P_{k-1} F^T + Q, \qquad (7)$$

respectively.

The measurement update applies a correction to the projected estimate using the actual measurement. First, the Kalman gain $K$ is computed, which is a ratio expressing how much the filter "trusts" the prediction vs. the measurement:

$$K_k = P^-_k H^T (H P^-_k H^T + R)^{-1}. \qquad (8)$$

Then, the actual measurement $z_k$ is incorporated to obtain the a posteriori state estimate

$$\hat{x}_k = \hat{x}^-_k + K_k (z_k - H \hat{x}^-_k), \qquad (9)$$

and the a posteriori error covariance is obtained by

$$P_k = (I - K_k H) P^-_k. \qquad (10)$$

A detailed derivation of the Kalman filter equations can be found in [24].
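For concreteness, one complete filter cycle of Eqs. (6)-(10) can be written in a few lines of NumPy; this is a generic textbook sketch rather than our actual implementation:

```python
import numpy as np

def kalman_cycle(x_hat, P, z, F, H, Q, R):
    """One Kalman filter cycle: time update (Eqs. 6-7) followed by
    measurement update (Eqs. 8-10)."""
    # Time update ("prediction"): project state and covariance ahead
    x_prior = F @ x_hat                                   # Eq. (6)
    P_prior = F @ P @ F.T + Q                             # Eq. (7)
    # Measurement update ("correction")
    S = H @ P_prior @ H.T + R                             # innovation covariance
    K = P_prior @ H.T @ np.linalg.inv(S)                  # Eq. (8), Kalman gain
    x_post = x_prior + K @ (z - H @ x_prior)              # Eq. (9)
    P_post = (np.eye(P.shape[0]) - K @ H) @ P_prior       # Eq. (10)
    return x_post, P_post
```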
Filter design.

We use a 14-dimensional state vector that consists of the position and orientation components as well as their first time derivatives:

$$x = \left[ x, \dot{x}, y, \dot{y}, z, \dot{z}, q_w, \dot{q}_w, q_x, \dot{q}_x, q_y, \dot{q}_y, q_z, \dot{q}_z \right]^T. \qquad (11)$$

We initialize the state as a zero vector, $x_0 = \mathbf{0}$, and the error covariance as an identity matrix, $P_0 = I$. Our state transition model $F \in \mathbb{R}^{14 \times 14}$ is a block diagonal matrix with the block

$$\begin{bmatrix} 1 & \Delta t \\ 0 & 1 \end{bmatrix}$$

repeated seven times on the diagonal, such that the same constant-velocity motion model is applied to all state variables. $\Delta t$ is the time step of the filter, set equal to the sampling time $t_s$ (5 ms) in our simulations. Our observation model $H \in \mathbb{R}^{7 \times 14}$ is also a block diagonal matrix, with the block $[\,1 \;\; 0\,]$ repeated seven times on the diagonal.

Measurement noise.

$R \in \mathbb{R}^{7 \times 7}$ models the noise in the sensors as a covariance matrix. The AR headset we use for data collection, the Microsoft HoloLens, uses advanced visual-inertial Simultaneous Localization and Mapping (SLAM) algorithms that track the user pose with high accuracy [26]. Therefore, we experimented with small noise variances and empirically constructed $R$ as a diagonal matrix with small diagonal values. In practice, there might be correlation between different sensors, and usually their noise is not purely Gaussian [24]. However, for the lack of a better sensor model, we employ a simplified measurement noise model.
Figure 3: Integration of the Kalman filter-based predictor into our cloud-based volumetric streaming system.
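The block diagonal structure makes $F$ and $H$ straightforward to assemble programmatically. The following sketch is an illustration under the stated assumptions, not our actual code; it builds both matrices and also shows the $N$-step propagation used later for prediction (see Sec. 4.4):

```python
import numpy as np
from scipy.linalg import block_diag

dt = 0.005  # filter time step, equal to the 5 ms sampling time

# Constant-velocity block, repeated once per tracked quantity
# (x, y, z, qw, qx, qy, qz): yields the 14x14 process model F.
cv_block = np.array([[1.0, dt],
                     [0.0, 1.0]])
F = block_diag(*[cv_block] * 7)

# Observation model H (7x14): only the value, not its derivative,
# is measured for each of the seven state components.
H = block_diag(*[np.array([[1.0, 0.0]])] * 7)

def predict_n_steps(x_post, N):
    """Propagate an a posteriori estimate N steps (N * dt) ahead."""
    return np.linalg.matrix_power(F, N) @ x_post
```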
Process noise.
Since we have a discrete-time process in which we sample the system at regular intervals, we need a discrete representation of the noise term $w$ given in Eq. (2). Therefore, we consider a discretized white noise model which assumes that the velocity remains constant during each $\Delta t$ but differs for each time period [7]. The process noise covariance matrix $Q \in \mathbb{R}^{14 \times 14}$ is then a block diagonal matrix with the block

$$\begin{bmatrix} \frac{\Delta t^4}{4} & \frac{\Delta t^3}{2} \\ \frac{\Delta t^3}{2} & \Delta t^2 \end{bmatrix} \sigma_\nu^2$$

repeated seven times on the diagonal, where the noise variance $\sigma_\nu^2$ is set empirically, using one value for the first three blocks (position) and a different value for the remaining blocks (orientation). A derivation of $Q$ for the discretized white noise model is given in [7].
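Such a $Q$ can be assembled, for example, with the Q_discrete_white_noise helper from the filterpy library that accompanies [24]; the variance values below are placeholders, since the exact values are tuned empirically:

```python
from scipy.linalg import block_diag
from filterpy.common import Q_discrete_white_noise

dt = 0.005
var_pos, var_ori = 1e-2, 1e-4   # placeholder variances, tuned empirically

# One 2x2 discretized-white-noise block per state component:
# three position blocks followed by four orientation blocks.
q_pos = Q_discrete_white_noise(dim=2, dt=dt, var=var_pos)
q_ori = Q_discrete_white_noise(dim=2, dt=dt, var=var_ori)
Q = block_diag(*([q_pos] * 3 + [q_ori] * 4))   # 14x14
```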
4.4 Multi-step Ahead Prediction

Each iteration of the Kalman filter results in an a posteriori state estimate $\hat{x}_k$. To obtain an $N$-step prediction $\hat{x}_{k+N}$, after each iteration we need to propagate $\hat{x}_k$ ahead by applying the process model $F$ multiple times, i.e. by iterating Eq. (6) $N$ times [20, 51].

We perform the prediction of orientations in the quaternion domain [13]. We readily obtain quaternions from the HoloLens and can thus avoid conversion from another representation such as Euler angles or rotation matrices. Quaternions allow smooth interpolation of orientations using techniques like Spherical Linear Interpolation of Rotations (SLERP) [48]. Moreover, quaternions do not suffer from gimbal lock, i.e. the loss of one degree of freedom that occurs with Euler angles when the pitch angle approaches ±90°, and offer a singularity-free description of orientation. They are more compact than rotation matrices and thus computationally more efficient [19]. The set of unit quaternions, i.e. quaternions of norm one, constitutes the unit sphere in 4D space. The three degrees of freedom remaining after applying the unity constraint are sufficient to represent any rotation in 3D space [13].

Fig. 3 shows the different steps of the Kalman filter-based predictor and presents the interfaces for its integration into our volumetric streaming system. After initialization, the state estimate from the previous step $\hat{x}_{k-1}$ is propagated in time using the process model $F$ and the time step of the filter $\Delta t = 1/f_s$. Then, the obtained initial state estimate $\hat{x}^-_k$ is converted to a measurement estimate $\hat{z}_k$ using the measurement model $H$. Finally, the actual measurement $z_k$ (sent over the network) is combined with $\hat{z}_k$ to obtain the corrected state estimate $\hat{x}_k$. This completes one cycle of the Kalman filter operation. At the end of each cycle, the process model $F$ is re-used to propagate the state estimate $\hat{x}_k$ in time by the LAT $t_p$ and obtain the predicted state $\hat{x}_{k+N}$ at time $t_k + t_p$. Finally, a corresponding view from the volumetric video is rendered based on the predicted user pose and transmitted to the client.

5 EVALUATION

To evaluate the proposed predictors for different head movements and LATs, we recorded user traces via Microsoft HoloLens and created a Python-based simulation framework that enables offline processing of the traces. Below, we first discuss the experimental setup and the evaluation metrics, before presenting the obtained results and discussing the limitations of our approach.

Figure 4: The virtual object used during trace collection, shown from different viewpoints.
5.1 Experimental Setup

Dataset.
We created a HoloLens application that overlays a virtual object (shown in Fig. 4) on the real world and collected 14 head movement traces. For each session, the user was asked to freely move around and explore the object for a duration of 60 s. We recorded the position samples $(x, y, z)$ and the rotation samples in the form of quaternions $(q_x, q_y, q_z, q_w)$ together with the corresponding timestamps. Since the raw sensor data we obtained from the HoloLens was unevenly sampled at 60 Hz (i.e. with different temporal distances between consecutive samples), we interpolated the data to obtain temporally equidistant samples. We upsampled the position data using linear interpolation and the quaternions using SLERP [48]. Thus, we obtained an evenly-sampled dataset with a sampling rate of 200 Hz. This resampling significantly simplifies the offline analysis of our traces and the implementation of the predictors. (In an online setting, where an interpolation may not be feasible due to the sequential arrival of the data points, a Kalman filter can instead be designed to handle varying time intervals between incoming samples [24].)
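This resampling step can be reproduced, for instance, with SciPy; the sketch below assumes timestamps in seconds and quaternions stored in scalar-last $(q_x, q_y, q_z, q_w)$ order, as recorded:

```python
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def resample_trace(t, pos, quat, rate=200.0):
    """Resample an unevenly sampled trace to a fixed rate.
    t: (N,) timestamps in seconds; pos: (N, 3) positions in meters;
    quat: (N, 4) unit quaternions in (qx, qy, qz, qw) order."""
    t_new = np.arange(t[0], t[-1], 1.0 / rate)
    # Linear interpolation for each position dimension
    pos_new = np.column_stack(
        [np.interp(t_new, t, pos[:, i]) for i in range(3)])
    # SLERP for the orientations (scipy expects scalar-last quaternions)
    slerp = Slerp(t, Rotation.from_quat(quat))
    quat_new = slerp(t_new).as_quat()
    return t_new, pos_new, quat_new
```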
Autoregression model settings.

We used one of our collected head motion traces (see Sec. 5.1) as training data and estimated the AutoReg model parameters for each time series $(x, y, z, q_w, q_x, q_y, q_z)$ separately using the Python library statsmodels [44]. To select the best AutoReg model, we trained different models using three different training traces and selected the best-performing model. Our models have an automatically determined lag order of 40 samples, i.e. they consider the past $40 \times 5\,\mathrm{ms} = 200$ ms and predict the next sample using Eq. (1).
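The automatic lag-order selection can be performed with the ar_select_order helper of statsmodels; in the sketch below, the training series is a synthetic stand-in and the search bound maxlag is an assumption:

```python
import numpy as np
from statsmodels.tsa.ar_model import ar_select_order

# Stand-in for one component (e.g. the x position) of a training
# trace sampled at 200 Hz; replace with real data.
rng = np.random.default_rng(0)
train_series = np.cumsum(0.001 * rng.standard_normal(12000))

# Select the lag order by minimizing the AIC up to an assumed bound.
selection = ar_select_order(train_series, maxlag=60, ic='aic')
print("selected lags:", selection.ar_lags)

# Fit the selected model; the intercept c and the coefficients phi_i
# are then iterated with Eq. (1) as described in Sec. 4.2.
result = selection.model.fit()
```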
Fig. 5 shows one of the traces in our dataset. For visualization purposes, orientations are given as Euler angles (yaw, pitch, roll), although we perform the prediction in the quaternion domain (see Sec. 4.4). Note that our framework uses a left-handed coordinate system where the +y axis points up and the +z axis lies in the viewing direction.

We can make two important observations based on the sample trace: firstly, the viewer rarely moves along the y-axis (except for the time period between 38-42 s, during which the viewer probably sat down and stood back up), which is understandable since it requires more effort to crouch down and stand up. Secondly, the orientation changes are typically due to yaw movements, whereas the magnitudes of the changes due to roll and pitch movements are much smaller. Our observations are also confirmed by visual inspection of the other recorded traces.
Head movement velocity.
We analyzed the peak and mean head movement velocities by computing the first-order difference for all degrees of freedom ($x$, $y$, $z$, yaw, roll, pitch). Since numerical differentiation using finite differences is a noisy operation that amplifies any noise present in the data [9], we used a Savitzky-Golay filter [40] to smooth the computed velocities.

Fig. 6 shows the cumulative distribution functions (CDFs) of the computed linear and angular velocities, respectively, for trace 1 (shown in Fig. 5). We observe that the 95th percentile of the linear velocity in the y dimension is considerably lower than the 95th percentiles in the x and z dimensions. We also observe that the pitch and roll velocities are mostly smaller than the yaw velocity, which can reach peak values of around 200 deg/s.

Fig. 7 shows the mean linear and angular velocities as well as the 95th percentile ranges for five different traces. In all traces, we observe that the linear velocities in the x and z dimensions are greater than those in y. Similarly, the angular velocities in the yaw dimension are greater than those in pitch and roll. Deviations from the mean values can be significant in all dimensions, as observed by inspecting the 95th percentiles (lightly shaded in the figure).

Figure 5: One sample trace from our dataset collected using Microsoft HoloLens. Top: position, bottom: orientation (shown as Euler angles).

Figure 6: CDF of linear velocity (left) and angular velocity (right) for trace 1.

5.2 Evaluation Metrics

For the evaluation of the prediction methods, we employed two objective error metrics: position error and angular error.
Figure 7: Mean linear velocity (left) and mean angular velocity (right) for five traces. Lighter shades show the 95th percentile.

Position error is the Euclidean distance (in meters) between the actual and the predicted position. It is defined as

$$d = \sqrt{(\hat{x} - x)^2 + (\hat{y} - y)^2 + (\hat{z} - z)^2}. \qquad (12)$$

Angular error is the spherical distance (in degrees) between the actual and the predicted orientation. Let $q$ be a measured and $\hat{q}$ a predicted unit quaternion. Then, the spherical distance $\phi$ between the two orientations can be computed as follows [49]:

$$r = q^* \hat{q}, \qquad (13)$$

$$\phi = \frac{360}{\pi} \arccos(r_w), \qquad (14)$$

where $q^*$ is the conjugate of the quaternion $q$ and $r_w$ is the scalar (real) part of the quaternion $r$.

After computing the position and angular errors for each time point, we compute the mean absolute error (MAE) over a trace as

$$\mathrm{MAE}(d) = \frac{1}{N} \sum_{i=1}^{N} d_i, \qquad \mathrm{MAE}(\phi) = \frac{1}{N} \sum_{i=1}^{N} \phi_i, \qquad (15)$$

where $d_i$ and $\phi_i$ are the position and angular errors at time $i$, respectively, and $N$ is the number of predicted samples in a trace.
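Both metrics are straightforward to compute; a minimal sketch, assuming both quaternions use the same component order (the scalar part of the product $q^* \hat{q}$ then reduces to their dot product):

```python
import numpy as np

def position_error(p_pred, p_true):
    """Euclidean distance in meters, Eq. (12)."""
    return np.linalg.norm(p_pred - p_true, axis=-1)

def angular_error_deg(q_true, q_pred):
    """Spherical distance in degrees, Eqs. (13)-(14), for unit
    quaternions with identical component ordering."""
    # The absolute value guards against the sign ambiguity q = -q
    # discussed in Sec. 5.4; clipping avoids arccos domain errors.
    r_w = np.clip(np.abs(np.sum(q_true * q_pred, axis=-1)), 0.0, 1.0)
    return (360.0 / np.pi) * np.arccos(r_w)
```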
5.3 Results

We evaluated the performance of our Kalman filter-based predictor for different LATs of {20, 40, 60, 80, 100} ms. This range was chosen considering the measured M2P latency of our cloud-based volumetric streaming system. Running our server on an Amazon Web Services (AWS) instance, we measured an average M2P latency of around 60 ms with a network latency of 13.3 ms [1].

First, we evaluate the performance of the Kalman filter for a fixed LAT, $t_p = 60$ ms. Fig. 8 shows the distributions of the position and angular errors for each trace. In the top plot, we observe that most of the position errors lie in a range of a few centimeters, with only a few outliers exceeding 10 cm. Thus, the worst outliers are approximately one order of magnitude greater than the median of the position errors. The bottom plot shows that most angular errors have a magnitude well below one degree.

Figure 8: Per-trace distribution of the position errors (top) and angular errors (bottom) of the Kalman filter for a given LAT of 60 ms. Whiskers represent the 1-99% range; outliers are shown by circles. Each trace contains 12k data points. Note the logarithmic scale of the y-axes.

Fig. 9 shows the MAE($d$) and MAE($\phi$) of the AutoReg, Kalman filter and Baseline models for each LAT. The results are averaged over all traces. We observe that both the position error and the angular error increase linearly with increasing LAT. Both AutoReg and the Kalman filter perform better than the Baseline in terms of position error, showing that predicting translational movement is better than doing no prediction. This is more evident for larger M2P latencies (assumed equal to the LATs in our simulations), for which the performance of the Baseline deteriorates more quickly than that of both predictors. However, the results for angular error show that AutoReg does not have a clear advantage over the Baseline. The Kalman filter, on the other hand, consistently decreases the angular errors compared to both.

Statistical significance.
To verify that the gain obtained by the Kalman filter over the Baseline is statistically significant, we applied a two-sample T-test on the angular errors obtained by the Baseline and the Kalman filter, for each trace and LAT combination. (A two-sample T-test checks the null hypothesis that two independent samples have identical average, i.e. expected, values [16].) Consequently, we reject the null hypothesis $H_0$ of identical means of the angular error of the Baseline model and the Kalman filter-based prediction.
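The test itself is a one-liner with SciPy; the error arrays below are synthetic stand-ins for the per-sample angular errors of one trace/LAT combination:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
err_baseline = np.abs(rng.normal(0.5, 0.2, 12000))  # stand-in data
err_kalman = np.abs(rng.normal(0.3, 0.2, 12000))    # stand-in data

t_stat, p_value = stats.ttest_ind(err_baseline, err_kalman)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")  # reject H0 if p is small
```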
5.4 Limitations

During the trace recordings, we observed that the HoloLens flips the sign of a quaternion when its real (scalar) component $q_w$ approaches ±0.5. Since a quaternion $q$ and its negative $-q$ correspond to the same orientation on the unit sphere [49], this behavior does not lead to a discontinuity in the rotation space. However, the Kalman filter assumes that all latent and observed variables have a Gaussian distribution, which cannot model the true topology of spherical quantities [22]. Therefore, these "jumps" in our data are detrimental to the performance of our predictor and cause large peak errors.

As a workaround, we compare each new measurement $z_k$ with the previous one $z_{k-1}$ at each iteration of the filter and reset the filter, by re-initializing the state $x$ and the error covariance $P$, whenever a sign flip is detected. This workaround partially alleviates the peak errors observed after quaternion sign flips. However, the filter requires a few iterations to output good predictions again, which causes a few large outliers after a re-initialization. In Sec. 6, we identify some techniques that correctly handle spherical distributions and may thus reduce the observed peak errors.
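The exact detection rule is not spelled out above, so the following sketch uses a plausible heuristic, flagging a flip when consecutive quaternion measurements fall on opposite hemispheres; this is an assumption, not necessarily the rule of our implementation:

```python
import numpy as np

def sign_flip(q_prev, q_curr):
    """Heuristic: consecutive unit quaternions pointing to opposite
    hemispheres (negative dot product) indicate a sign flip."""
    return float(np.dot(q_prev, q_curr)) < 0.0

# Inside the filter loop (cf. Sec. 4.3), the workaround resets the
# filter whenever a flip is detected:
# if sign_flip(z_prev[3:], z[3:]):       # quaternion part of z
#     x_hat = np.zeros(14)               # re-initialize the state
#     P = np.eye(14)                     # re-initialize the covariance
```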
Figure 9: Position error (left) and angular error (right) in terms of MAE. Averages over 14 traces are shown.
6 CONCLUSION

This paper presented a Kalman filter-based framework for the prediction of head motion in order to reduce the interaction (motion-to-photon) latency of a cloud-based volumetric streaming system. We evaluated our approach using real head motion traces for different look-ahead times. Our results show that the proposed approach predicts head orientation more accurately than an autoregression model for look-ahead times of up to 100 ms. As future work, filtering techniques that explicitly model the spherical topology of orientations, such as recursive Bayesian filtering in circular state spaces [22] or unscented von Mises-Fisher filtering [23], may reduce the peak errors observed after quaternion sign flips.

REFERENCES
[1] [n.d.]. Anonymous for peer review.
[2] Hirotugu Akaike. 1973. Maximum likelihood identification of Gaussian autoregressive moving average models. Biometrika 60, 2 (1973), 255–265. https://doi.org/10.1093/biomet/60.2.255
[3] A. Deniz Aladagli, Erhan Ekmekcioglu, Dmitri Jarnikov, and Ahmet Kondoz. 2017. Predicting head trajectories in 360 virtual reality videos. In 2017 International Conference on 3D Immersion (IC3D). IEEE, 1–6. https://doi.org/10.1109/IC3D.2017.8251913
[4] Ronald Azuma. 1995. Predictive tracking for augmented reality. Dissertation. University of North Carolina at Chapel Hill.
[5] Ronald Azuma and Gary Bishop. 1995. A frequency-domain analysis of head-motion prediction. In Proceedings of the 22nd Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '95). ACM Press, 401–408. https://doi.org/10.1145/218380.218496
[6] Yanan Bao, Huasen Wu, Tianxiao Zhang, Albara Ah Ramli, and Xin Liu. 2016. Shooting a moving target: Motion-prediction-based transmission for 360-degree videos. In 2016 IEEE International Conference on Big Data. IEEE, 1161–1170. https://doi.org/10.1109/bigdata.2016.7840720
[7] Yaakov Bar-Shalom, X. Rong Li, and Thiagalingam Kirubarajan. 2004. Estimation with Applications to Tracking and Navigation: Theory, Algorithms and Software. John Wiley & Sons.
[8] Tom Beigbeder, Rory Coughlan, Corey Lusher, John Plunkett, Emmanuel Agu, and Mark Claypool. 2004. The Effects of Loss and Latency on User Performance in Unreal Tournament 2003. In Proceedings of the 3rd ACM SIGCOMM Workshop on Network and System Support for Games (NetGames '04). ACM, 144–151. https://doi.org/10.1145/1016540.1016556
[9] Rick Chartrand. 2011. Numerical differentiation of noisy, nonsmooth data. ISRN Applied Mathematics (2011).
[10] IEEE Communications Magazine.
[11] [n.d.].
[12] Yago Sanchez de la Fuente, Gurdeep Singh Bhullar, Robert Skupin, Cornelius Hellge, and Thomas Schierl. 2019. Delay Impact on MPEG OMAF's Tile-Based Viewport-Dependent 360° Video Streaming. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 9, 1 (2019), 18–28. https://doi.org/10.1109/jetcas.2019.2899516
[13] James Diebel. 2006. Representing attitude: Euler angles, unit quaternions, and rotation vectors. Matrix 58, 15-16 (2006), 1–35.
[14] Peter Eisert and Anna Hilsmann. 2020. Hybrid Human Modeling: Making Volumetric Video Animatable. In Real VR - Immersive Digital Reality: How to Import the Real World into Head-Mounted Immersive Displays, Marcus Magnor and Alexander Sorkine-Hornung (Eds.). Springer International Publishing, Cham, 167–187. https://doi.org/10.1007/978-3-030-41816-8_7
[15] Henry Himberg and Yuichi Motai. 2009. Head Orientation Prediction: Delta Quaternions Versus Quaternions. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39, 6 (2009), 1382–1392. https://doi.org/10.1109/TSMCB.2009.2016571
[16] Robert V. Hogg, Joseph McKean, and Allen T. Craig. 2005. Introduction to Mathematical Statistics. Pearson Education.
[17] Mohammad Hosseini and Christian Timmerer. 2018. Dynamic Adaptive Point Cloud Streaming. In Proceedings of the 23rd Packet Video Workshop (PV '18). ACM, New York, NY, USA, 25–30. https://doi.org/10.1145/3210424.3210429
[18] Rob J. Hyndman and George Athanasopoulos. 2018. Forecasting: Principles and Practice. OTexts.
[19] Yan-Bin Jia. 2013. Quaternions and Rotations. http://graphics.stanford.edu/courses/cs348a-17-winter/Papers/quaternion.pdf. Online; accessed: 2020-05-23.
[20] Andrew Kiruluta, Moshe Eizenman, and Subbarayan Pasupathy. 1997. Predictive head movement tracking using a Kalman filter. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 27, 2 (1997), 326–331. https://doi.org/10.1109/3477.558841
[21] Edgar Kraft. 2003. A quaternion-based unscented Kalman filter for orientation tracking. In Proceedings of the Sixth International Conference of Information Fusion, Vol. 1. IEEE, 47–54. https://doi.org/10.1109/ICIF.2003.177425
[22] Gerhard Kurz, Igor Gilitschenski, and Uwe D. Hanebeck. 2016. Recursive Bayesian filtering in circular state spaces. IEEE Aerospace and Electronic Systems Magazine 31, 3 (2016), 70–87. https://doi.org/10.1109/MAES.2016.150083
[23] Gerhard Kurz, Igor Gilitschenski, and Uwe D. Hanebeck. 2016. Unscented von Mises-Fisher filtering. IEEE Signal Processing Letters 23, 4 (2016), 463–467.
[24] Roger Labbe. 2015. Kalman and Bayesian Filters in Python. https://github.com/rlabbe/Kalman-and-Bayesian-Filters-in-Python. Online; accessed: 2020-05-14.
[25] Joseph J. LaViola. 2003. A comparison of unscented and extended Kalman filtering for estimating quaternion motion. In Proceedings of the 2003 American Control Conference, Vol. 3. 2435–2440. https://doi.org/10.1109/ACC.2003.1243440
[26] Yang Liu, Haiwei Dong, Longyu Zhang, and Abdulmotaleb El Saddik. 2018. Technical Evaluation of HoloLens for Multimedia: A First Look. IEEE MultiMedia.
[27] In 2008 IEEE/ACM International Symposium on Mixed and Augmented Reality (ISMAR). IEEE, 77–86. https://doi.org/10.1109/ISMAR.2008.4637329
[28] Simone Mangiante, Guenter Klas, Amit Navon, Zhuang GuanHua, Ju Ran, and Marco Dias Silva. 2017. VR is on the edge: How to deliver 360 videos in mobile networks. In Proceedings of the Workshop on Virtual Reality and Augmented Reality Network. ACM, 30–35. https://doi.org/10.1145/3097895.3097901
[29] William R. Mark, Leonard McMillan, and Gary Bishop. 1997. Post-rendering 3D warping. In Proceedings of the 1997 Symposium on Interactive 3D Graphics. 7–ff. https://doi.org/10.1145/253284.253292
[30] Jeffrey W. McCandless, Stephen R. Ellis, and Bernard D. Adelstein. 2000. Localization of a Time-Delayed, Monocular Virtual Object Superimposed on a Real Environment. Presence: Teleoperators and Virtual Environments 9, 1 (2000), 15–24. https://doi.org/10.1162/105474600566583
[31] Microsoft. 2016. Hologram stabilization. https://microsoft.github.io/MixedRealityToolkit-Unity/Documentation/hologram-stabilization.html. Online; accessed: 2020-05-24.
[32] Anh Nguyen, Zhisheng Yan, and Klara Nahrstedt. 2018. Your attention is unique: Detecting 360-degree video saliency in head-mounted display for head movement prediction. In Proceedings of the 26th ACM International Conference on Multimedia. 1190–1198. https://doi.org/10.1145/3240508.3240669
[33] NVIDIA. 2019. NVIDIA CloudXR Delivers Low-Latency AR/VR Streaming Over 5G Networks to Any Device. https://blogs.nvidia.com/blog/2019/10/22/nvidia-cloudxr. Online; accessed: 2020-03-26.
[34] Cagri Ozcinar, Julian Cabrera, and Aljosa Smolic. 2019. Visual Attention-Aware Omnidirectional Video Streaming Using Optimal Tiles for Virtual Reality. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 9, 1 (2019), 217–230. https://doi.org/10.1109/jetcas.2019.2895096
[35] Jounsup Park, Philip A. Chou, and Jenq-Neng Hwang. 2019. Rate-utility optimized streaming of volumetric media for augmented reality. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 9, 1 (2019), 149–162. https://doi.org/10.1109/jetcas.2019.2898622
[36] Stefano Petrangeli, Gwendal Simon, Haoliang Wang, and Vishy Swaminathan. 2019. Dynamic Adaptive Streaming for Augmented Reality Applications. In 2019 IEEE International Symposium on Multimedia (ISM). IEEE, 56–567. https://doi.org/10.1109/ism46123.2019.00017
[37] Feng Qian, Bo Han, Jarrell Pair, and Vijay Gopalakrishnan. 2019. Toward practical volumetric video streaming on commodity smartphones. In Proceedings of the 20th International Workshop on Mobile Computing Systems and Applications. ACM, 135–140. https://doi.org/10.1145/3301293.3302358
[38] Feng Qian, Bo Han, Qingyang Xiao, and Vijay Gopalakrishnan. 2018. Flare: Practical Viewport-Adaptive 360-Degree Video Streaming for Mobile Devices. In Proceedings of the 24th Annual International Conference on Mobile Computing and Networking (MobiCom '18). ACM, New York, NY, USA, 99–114. https://doi.org/10.1145/3241539.3241565
[39] Jarek Rossignac. 1999. Edgebreaker: Connectivity compression for triangle meshes. IEEE Transactions on Visualization and Computer Graphics 5, 1 (1999), 47–61. https://doi.org/10.1109/2945.764870
[40] Abraham Savitzky and Marcel J. E. Golay. 1964. Smoothing and differentiation of data by simplified least squares procedures. Analytical Chemistry 36, 8 (1964), 1627–1639. https://doi.org/10.1021/ac60214a047
[41] O. Schreer, I. Feldmann, P. Kauff, P. Eisert, D. Tatzelt, C. Hellge, K. Müller, T. Ebner, and S. Bliedung. 2019. Lessons learnt during one year of commercial volumetric video production. In IBC 2019. IBC.
[42] Oliver Schreer, Ingo Feldmann, Sylvain Renault, Marcus Zepp, Markus Worchel, Peter Eisert, and Peter Kauff. 2019. Capture and 3D video processing of volumetric video. In 2019 IEEE International Conference on Image Processing (ICIP). IEEE, 4310–4314. https://doi.org/10.1109/icip.2019.8803576
[43] Sebastian Schwarz, Marius Preda, Vittorio Baroncini, Madhukar Budagavi, Pablo Cesar, Philip A. Chou, Robert A. Cohen, Maja Krivokuća, Sebastien Lasserre, Zhu Li, Joan Llach, Khaled Mammou, Rufael Mekuria, Ohji Nakagami, Ernestasia Siahaan, Ali Tabatabai, Alexis M. Tourapis, and Vladyslav Zakharchenko. 2019. Emerging MPEG Standards for Point Cloud Compression. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 9, 1 (2019), 133–148. https://doi.org/10.1109/JETCAS.2018.2885981
[44] Skipper Seabold and Josef Perktold. 2010. statsmodels: Econometric and statistical modeling with Python. In Proceedings of the 9th Python in Science Conference.
[45] Shu Shi, Varun Gupta, Michael Hwang, and Rittwik Jana. 2019. Mobile VR on edge cloud: a latency-driven design. In Proceedings of the 10th ACM Multimedia Systems Conference. ACM, 222–231. https://doi.org/10.1145/3304109.3306217
[46] Shu Shi and Cheng-Hsin Hsu. 2015. A survey of interactive remote rendering systems. Comput. Surveys 47, 4 (May 2015), 1–29. https://doi.org/10.1145/2719921
[47] Shu Shi, Klara Nahrstedt, and Roy Campbell. 2012. A real-time remote rendering system for interactive mobile graphics. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 8, 3s (2012), 1–20. https://doi.org/10.1145/2348816.2348825
[48] Ken Shoemake. 1985. Animating rotation with quaternion curves. In ACM SIGGRAPH Computer Graphics, Vol. 19. ACM, 245–254. https://doi.org/10.1145/325165.325242
[49] Joan Solà. 2017. Quaternion kinematics for the error-state Kalman filter. CoRR abs/1711.02508 (2017). arXiv:1711.02508 http://arxiv.org/abs/1711.02508
[50] Jeroen van der Hooft, Tim Wauters, Filip De Turck, Christian Timmerer, and Hermann Hellwagner. 2019. Towards 6DoF HTTP adaptive streaming through point cloud compression. In Proceedings of the 27th ACM International Conference on Multimedia. 2405–2413. https://doi.org/10.1145/3343031.3350917
[51] A. van Rhijn, R. van Liere, and J. D. Mulder. 2005. An analysis of orientation prediction and filtering methods for VR/AR. In IEEE Virtual Reality 2005 (VR 2005). IEEE, 67–74. https://doi.org/10.1109/VR.2005.1492755
[52] J. M. P. van Waveren. 2016. The asynchronous time warp for virtual reality on consumer hardware. In Proceedings of the 22nd ACM Conference on Virtual Reality Software and Technology. ACM, 37–46. https://doi.org/10.1145/2993369.2993375
[53] Greg Welch and Gary Bishop. 1995. An Introduction to the Kalman Filter. Technical Report TR 95-041. University of North Carolina at Chapel Hill.