IDOL: Inertial Deep Orientation-Estimation and Localization
Scott Sun, Dennis Melamed, Kris Kitani
Carnegie Mellon University
Abstract
Many smartphone applications use inertial measurement units (IMUs) to sense movement, but the use of these sensors for pedestrian localization can be challenging due to their noise characteristics. Recent data-driven inertial odometry approaches have demonstrated the increasing feasibility of inertial navigation. However, they still rely upon conventional smartphone orientation estimates that they assume to be accurate, while in fact these orientation estimates can be a significant source of error. To address the problem of inaccurate orientation estimates, we present a two-stage, data-driven pipeline using a commodity smartphone that first estimates device orientations and then estimates device position. The orientation module relies on a recurrent neural network and Extended Kalman Filter to obtain orientation estimates that are then used to rotate raw IMU measurements into the appropriate reference frame. The position module then passes those measurements through another recurrent network architecture to perform localization. Our proposed method outperforms state-of-the-art methods in both orientation and position error on a large dataset we constructed that contains 20 hours of pedestrian motion across 3 buildings and 15 subjects. Code and data are available at https://github.com/KlabCMU/IDOL.
Introduction
Inertial localization techniques typically estimate 3D motion from inertial measurement unit (IMU) samples of linear acceleration (accelerometer), angular velocity (gyroscope), and magnetic flux density (magnetometer). A well-known weakness that has plagued inertial localization is the dependence on accurate 3D orientation estimates (e.g., roll, pitch, yaw; quaternion; rotation matrix) to properly convert sensor-frame measurements to a global reference frame. Small errors in this component can result in substantial localization errors that have limited the feasibility of inertial pedestrian localization (Shen, Gowda, and Roy Choudhury 2018). Since orientation estimation plays a central role in inertial odometry, we hypothesize that improvements in 3D orientation estimation will result in improvements to localization performance.

Figure 1: Overview of our proposed method.

Existing localization methods typically rely on WiFi, Bluetooth, LiDAR, or camera sensors because of their effectiveness. However, WiFi and Bluetooth beacon-based solutions are costly due to requiring heavy instrumentation of the environment for accurate localization (Ahmetovic et al. 2016). While LiDAR-based localization is highly accurate, it is expensive and power-hungry (Zhang and Singh 2014). Image-based localization is effective with ample light and texture, but is also power-intensive and a privacy concern. IMUs would address many of these problems because they are infrastructure-free, highly energy-efficient, do not require line-of-sight with the environment, and are highly ubiquitous owing to their low cost and small size.
Recent deep-learning approaches, like IONet (Chen et al. 2018a) and RoNIN (Herath, Yan, and Furukawa 2020), have demonstrated the possibility of estimating 3D device (or user) motion using an IMU, but they do not directly address device orientation estimation. These new approaches have been able to address the problem of drift suffered by traditional inertial localization techniques through the use of supervised learning to directly estimate the spatial displacement of the device. However, most existing works use the 3D orientation estimates generated by the device (typically with conventional filtering-based approaches), which can be inaccurate (errors of several degrees with the iPhone 8). This orientation is typically used as an initial step to rotate the local IMU measurements to a common reference frame before applying a deep network (Chen et al. 2018a). This is flawed, as the deep network output can be corrupted by these orientation estimate errors, leading to significant error growth.
Our approach to this problem involves designing a supervised deep network architecture with an explicit orientation estimation module to complement a position estimation module, shown in Figure 1. In addition to the gyroscope, the 3D orientation module makes use of information encoded in
the accelerometer and magnetometer of the IMU by proxy of their measurements of gravitational acceleration and the Earth's magnetic field. By looking at a small temporal window of IMU measurements, this module learns to estimate a more accurate device orientation and, by extension, a more accurate device location. We train a two-stage deep network to estimate the 5D device pose: 3D orientation and 2D position (3D position is also possible).
The contributions of our work are: (i) a state-of-the-art deep network architecture to perform inertial orientation estimation, which we show leads to improved position estimates; (ii) an end-to-end model that produces more accurate position estimates than previous classical and learning-based techniques; and (iii) a building-scale inertial sensor dataset of annotated 6D pose (3D position + orientation) during human motion.

Related Work
We place inertial systems into two broad categories based on their approaches to localization and orientation: traditional methods and data-driven methods.
Dead reckoning with an IMU using the analytical solution consists of integrating gyroscopic readings to determine sensor orientation (e.g., via Rodrigues' rotation formula), using those orientations to rotate accelerometer readings into the global frame, removing gravitational acceleration, and then double-integrating the corrected accelerometer readings to determine position (Chen et al. 2018a). The multiple integrations lead to errors being magnified over time, resulting in an unusable estimate within seconds. However, additional system constraints on sensor placement and movement can be used to reduce the amount of drift, e.g., foot-mounted ZUPT inertial navigation systems that rely on the foot being stationary when in contact with the ground to reset errors (Jimenez et al. 2009). Extended Kalman Filters (EKFs) are often used to combine IMU readings, accurate in the near term, with other localization methods that are more accurate over the long term, like GPS (Caron et al. 2006), cameras (Qin, Li, and Shen 2018), and heuristic constraints (Solin et al. 2018).
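As a concrete illustration, the analytical dead-reckoning loop described above can be sketched as follows. This is a minimal sketch, not code from any of the cited systems: it assumes a z-up world frame, an accelerometer reporting specific force, and a constant sample period; all function names are ours.

```python
import numpy as np

def quat_mul(q, r):
    """Hamilton product of quaternions stored as [w, x, y, z]."""
    w1, x1, y1, z1 = q
    w2, x2, y2, z2 = r
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def rotate(q, v):
    """Rotate device-frame vector v into the world frame by quaternion q."""
    qv = np.concatenate([[0.0], v])
    q_conj = np.array([q[0], -q[1], -q[2], -q[3]])
    return quat_mul(quat_mul(q, qv), q_conj)[1:]

def gyro_quat(omega, dt):
    """Unit quaternion for the small rotation omega * dt."""
    theta = np.linalg.norm(omega) * dt
    if theta < 1e-12:
        return np.array([1.0, 0.0, 0.0, 0.0])
    axis = omega / np.linalg.norm(omega)
    return np.concatenate([[np.cos(theta / 2.0)], np.sin(theta / 2.0) * axis])

def dead_reckon(acc, gyro, dt=0.005, g=9.81):
    """Integrate gyro for orientation, rotate accelerometer samples into the
    world frame, subtract gravity, then double-integrate for position."""
    q = np.array([1.0, 0.0, 0.0, 0.0])  # device-to-world orientation
    v = np.zeros(3)
    p = np.zeros(3)
    traj = []
    for a, w in zip(acc, gyro):
        q = quat_mul(q, gyro_quat(w, dt))                  # orientation
        a_world = rotate(q, a) - np.array([0.0, 0.0, g])   # remove gravity
        v = v + a_world * dt                               # first integral
        p = p + v * dt                                     # second integral
        traj.append(p.copy())
    return np.array(traj)
```

Any constant accelerometer bias passes through both integrals, which is exactly the quadratic error growth the text describes.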
Recent years have seen the development of new data-driven methods for inertial odometry. Earlier works vary from approaches like RIDI (Yan, Shan, and Furukawa 2018), which relies on SVMs, to deep network approaches like Cortes, Solin, and Kannala (2018), IONet (Chen et al. 2018a), RoNIN (Herath, Yan, and Furukawa 2020), and TLIO (Liu et al. 2020).
RIDI and Cortes, Solin, and Kannala (2018) fundamentally rely on dead reckoning, i.e., double-integrating acceleration, but differ in how they counteract the resulting extreme drift. RIDI uses a two-stage system to regress low-frequency corrections to the acceleration. A Support Vector Machine (SVM) classifies the motion as one of four holding modalities (e.g., in the hand, in a purse) from the accelerometer/gyroscope measurements. These measurements are then fed to a modality-specific Support Vector Regressor (SVR) to determine a correction applied before the acceleration is double-integrated to determine user position. Cortes, Solin, and Kannala (2018) use a convolutional neural network (CNN) to regress momentary user speed as a constraint. The speed acts as a pseudo-measurement update to an EKF that performs accelerometer double-integration as its process update.
IONet and the remaining works forgo dead reckoning and instead rely on deep networks to bypass one set of integrations, thereby limiting error growth. In IONet, a bidirectional long short-term memory (BiLSTM) network is given a window of world-frame accelerometer and gyroscope measurements (with the reference frame conversion done by the phone API), from which it sequentially regresses a polar displacement vector describing the device's motion in the ground plane. This single integration helps minimize the error magnification.
The bulk of their work is evaluated in a small Vicon motion capture studio with different motion modalities, e.g., in the pocket.
RoNIN builds on the IONet approach and presents three different neural network architectures to tackle inertial localization: an LSTM network, a temporal convolutional network (TCN), and a residual network (ResNet). These models regress user velocity/displacement estimates in the ground (x-y) plane.
TLIO is a recent work that uses a ResNet-style architecture from RoNIN to estimate positional displacements. It fuses these deep estimates with the raw IMU measurements using a stochastic-cloning EKF, which estimates position, orientation, and the sensor biases.
The present work suggests current data-driven inertial localization approaches lack a robust device orientation estimator. Previous networks rely heavily on direct gyroscope integration or the device's estimate, which fuses accelerometer, gyroscope, and magnetometer readings using classical methods. While these estimates may be accurate over the short term, they are prone to discontinuities, unstable readings, and drift over time. The success of data-driven approaches in localization suggests similar possibilities for orientation estimation.
Prior work on device orientation estimation is primarily based on traditional filtering techniques. The Madgwick filter (Madgwick, Harrison, and Vaidyanathan 2011) is used widely in robotics. In the Madgwick filter, gyroscope readings are integrated over time to determine an orientation estimate. This is accurate in the short term but drifts due to gyroscope bias. To correct the bias, a minimization is performed between two vectors: (1) the static world gravity vector, rotated into the device frame using the current estimated orientation, and (2) the acceleration vector. The major component of the acceleration vector is assumed to be gravity, so the filter calculates a gradient to bring the gravity vector closer to the acceleration vector in the current frame. The orientation estimate consists of a weighted combination of this gradient and the gyroscope integration. This assumes the non-gravitational acceleration components are small, which is impractical for pedestrian motion.
Complementary filters are also used in state-of-the-art orientation estimation systems like MUSE (Shen, Gowda, and Roy Choudhury 2018). MUSE behaves similarly to the Madgwick filter, but uses the acceleration vector as the target of the orientation update only when the device is static. Instead, it mainly uses the magnetic north vector as the basis of the gradient calculation. This removes the issue of large non-gravitational accelerations causing erroneous updates, since when the device is static the acceleration vector consists mostly of gravity. However, a static device is rare during pedestrian motion, and magnetic fields can vary significantly within the same building due to local disturbances which are difficult to characterize.
Extended Kalman Filter (EKF) approaches (Bar-Itzhack and Oshman 1985; Marins et al.
2002; Sabatini 2006) follow a similar approach to the previously mentioned filters, but use a more statistically rigorous method of combining gyroscope integration with accelerometer/magnetometer observations. An estimate of the orientation error can also be extracted from this type of filter. We take advantage of such a filter in our work, but replace the gravity-vector or magnetic-north measurement update with the output of a learned model to provide a less noisy estimate of the true orientation and simplify the Kalman update equations.
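The complementary-filter idea underlying Madgwick-style methods can be made concrete with a minimal gravity-correction step: integrate the gyroscope for the short term, then nudge the estimate so the predicted gravity direction matches the accelerometer direction. This is a simplified sketch of the general idea only, not the Madgwick or MUSE implementation; the gain `beta` and all helper names are our assumptions.

```python
import numpy as np

def quat_mul(q, r):
    """Hamilton product of quaternions [w, x, y, z]."""
    w1, x1, y1, z1 = q
    w2, x2, y2, z2 = r
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def quat_conj(q):
    return np.array([q[0], -q[1], -q[2], -q[3]])

def rotate(q, v):
    """Rotate device-frame vector v into the world frame by quaternion q."""
    qv = np.concatenate([[0.0], v])
    return quat_mul(quat_mul(q, qv), quat_conj(q))[1:]

def rotvec_quat(u):
    """Unit quaternion for rotation vector u (rotation angle = |u|)."""
    n = np.linalg.norm(u)
    if n < 1e-12:
        return np.array([1.0, 0.0, 0.0, 0.0])
    return np.concatenate([[np.cos(n / 2.0)], np.sin(n / 2.0) / n * u])

def complementary_update(q, gyro, acc, dt, beta=0.1):
    # Short term: integrate angular velocity.
    q = quat_mul(q, rotvec_quat(gyro * dt))
    # Long term: rotate the estimate so the predicted world "up", seen in the
    # device frame, moves toward the accelerometer direction. This assumes
    # the measured acceleration is mostly gravity.
    up_pred = rotate(quat_conj(q), np.array([0.0, 0.0, 1.0]))
    a_dir = acc / np.linalg.norm(acc)
    q = quat_mul(q, rotvec_quat(beta * np.cross(a_dir, up_pred)))
    return q / np.linalg.norm(q)
```

The gravity-alone update leaves yaw unobserved, which is why MUSE turns to the magnetic north vector for the heading correction.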
Recent literature like OriNet (Esfahani et al. 2020) and Brossard, Bonnabel, and Barrau (2020) (abbreviated Brossard et al. (2020)) has begun utilizing deep networks to regress orientation from IMU measurements. OriNet uses a recurrent neural architecture based on LSTMs to propagate state. It corrects for gyroscopic bias via a genetic algorithm and for sensor noise via additive Gaussian noise during training. Brossard et al. (2020) estimate orientation via gyroscopic integration, but use a CNN to perform a correction to the angular velocity that filters out unwanted noise and bias prior to integration. These methods have primarily focused on filtering gyroscopic data using deep networks and estimating correction factors to reduce bias and noise. Our method directly estimates an orientation from all IMU channels using a deep network to capture all error sources for long-term accuracy, while fusing gyroscope data in the short term via an EKF. The prior data-driven approaches have yet to include magnetic observations, which leaves performance on the table given the success of incorporating magnetic observations in classical approaches.
Method
We aim to develop a method for 3D orientation and 2D position estimation of a smartphone IMU held by a pedestrian through the use of supervised learning. Our model is designed based on the knowledge that the accelerometer contains information about the gravitational acceleration of the Earth and that the magnetometer contains information about the Earth's magnetic field. Thus, it should be possible to infer the absolute 3D orientation of the device using a deep network with higher accuracy than that achievable with heuristic-based traditional filtering methods.
We propose a network architecture for estimating device orientation. The network consists of two components: (1) an orientation network that estimates a device orientation from the provided acceleration, angular velocity, and magnetometer readings, and (2) an Extended Kalman Filter to further stabilize the network output with the gyroscope readings. The resulting 3D orientation is used to rotate the accelerometer and gyroscope channels from the phone's coordinate system to a world coordinate system. The corrected measurements are then passed as inputs to the position network.
We use a neural network, referred to as the Orientation Network (OrientNet), to convert IMU measurements to a 3D orientation and a corresponding covariance estimate. Instead of directly converting the magnetic field or acceleration vector to orientation, as is done in traditional filtering methods, we use a neural network to learn a data-driven mapping of the sensor measurements to orientation. We find that the magnetic field measurements contribute most reliably to this estimate (much more than gravity), in agreement with the claims by Shen, Gowda, and Roy Choudhury (2018). Formally, we estimate the instantaneous 3D orientation as

θ̂_t, Σ̂_t = g(a_t, ω_t, B_t, h'_{t-1}),   (1)

where the function g consists of a 2-layer LSTM with 100 hidden units and h'_{t-1} is the hidden state produced by the LSTM at the last time step. At each time step, the accelerometer, gyroscope, and magnetometer readings (a, ω, B) are taken as input. The hidden state is then fed through 2 fully-connected layers to produce an absolute orientation θ̂ in the global reference frame, and through two other fully-connected layers to produce an orientation covariance Σ̂.
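A hedged sketch of an OrientNet-style module in PyTorch follows: a 2-layer LSTM with 100 hidden units feeding two small fully-connected heads, one for a unit quaternion and one for the six covariance parameters. The head sizes, the tanh activation, and the quaternion parameterization of θ̂ are our assumptions; the paper's exact layer configuration may differ.

```python
import torch
import torch.nn as nn

class OrientNet(nn.Module):
    """Sketch of an orientation network: IMU window -> orientation + covariance."""
    def __init__(self, hidden=100):
        super().__init__()
        # 9 input channels: accelerometer, gyroscope, magnetometer (3 axes each).
        self.lstm = nn.LSTM(input_size=9, hidden_size=hidden,
                            num_layers=2, batch_first=True)
        # Head 1: absolute orientation as a (normalized) quaternion.
        self.quat_head = nn.Sequential(nn.Linear(hidden, 50), nn.Tanh(),
                                       nn.Linear(50, 4))
        # Head 2: 6 covariance parameters (3 log-std + 3 correlations).
        self.cov_head = nn.Sequential(nn.Linear(hidden, 50), nn.Tanh(),
                                      nn.Linear(50, 6))

    def forward(self, imu, state=None):
        # imu: (batch, time, 9); state carries h'_{t-1} between windows.
        h, state = self.lstm(imu, state)
        q = self.quat_head(h)
        q = q / q.norm(dim=-1, keepdim=True)  # project onto unit quaternions
        return q, self.cov_head(h), state

net = OrientNet()
q, cov, _ = net(torch.randn(2, 16, 9))  # q: (2, 16, 4), cov: (2, 16, 6)
```

Returning the LSTM state makes the recurrence over windows explicit, matching the h'_{t-1} term in Equation 1.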
This covariance represents the auto-covariance of the 3-dimensional orientation error, determined using a boxminus operation between the true and estimated orientations (as defined in Equation 3), and is thus a 3x3 matrix.
The EKF's process update propagates the state using the gyroscope:

x̂_{k|k-1} = x̂_{k-1|k-1} + B_k ω_k,   P̂_{k|k-1} = P̂_{k-1|k-1} + Q_k,   (2)

where the quaternion state x̂_{k-1|k-1} is the a posteriori estimate of the state at k-1 given observations up to k-1,
Figure 2: Detailed system diagram. IMU readings are first passed to the orientation module, which is trained to estimate the orientation quaternion q. This orientation is used to convert accelerometer/gyroscope readings from the device frame to the world frame. These readings are passed to the position module, which is trained to minimize displacement error per window, for localization.

and x̂_{k|k-1} is the propagated orientation estimate at timestep k given observations up to k-1. The motion model is parameterized by B_k, which converts the current gyroscope measurement ω_k into a quaternion representing the rotation achieved by ω_k. The process update is applied via simple addition, which approximates quaternion rotations at high sample rates. P̂_{k|k-1} is the estimate of the covariance of the propagated state vector at time k given observations up to k-1, with P̂_{k-1|k-1} again the a posteriori estimate of the covariance at time k-1. Q_k is a static diagonal propagation noise matrix for the gyroscope, which we set to a fixed multiple of the identity based on experimentation with our training data.
The EKF's measurement updates correct the propagated state with the network-predicted orientation. Using normal addition and subtraction as orientation operators becomes inaccurate since there is no guarantee the predicted and propagated quaternions lie close together. Thus, we treat the difference between orientations as a distance on the quaternion manifold instead of as a vector-space distance, using the methods presented by Hertzberg et al. (2013). Boxplus (⊞) and boxminus (⊟), which respect the manifold, replace addition and subtraction of quaternions:

q_2 ⊟ q_1 = 2 loḡ(q_1^{-1} ⊗ q_2) = δ,
q_1 ⊞ δ = q_1 ⊗ exp(δ/2) = q_2,
exp(δ) = [cos(‖δ‖), sinc(‖δ‖) δ],
loḡ([w, v]) = 0 if v = 0, else (atan(‖v‖/w)/‖v‖) v.   (3)

q_1 and q_2 are unit quaternions; δ is the three-dimensional manifold difference between them. w and v are the real and imaginary parts of a quaternion, respectively. These operators maintain the unit norm and validity of the resulting quaternions. The norm of δ between two quaternions describes the distance along a unit sphere between the orientations. Adding δ to a quaternion using ⊞ results in another valid quaternion displaced from the initial quaternion; this displacement is encoded in δ. With these operators, the measurement update for our method becomes

K_k = P_{k|k-1}(P_{k|k-1} + R_k)^{-1},
x̂_{k|k} = x̂_{k|k-1} ⊞ K_k (q_k ⊟ x̂_{k|k-1}),
P_{k|k} = (I - K_k) P_{k|k-1},   (4)

where q_k and R_k are the network-predicted orientation and covariance for timestep k. The result of the measurement update is the final orientation estimate for timestep k, x̂_{k|k}, and the estimated state covariance P_{k|k}.
To train the orientation module, we first perform a traditional ellipsoid-fit calibration on the raw magnetometer values, which also serves to scale the network inputs to a confined range (Kok and Schön 2016). From here on, we refer to these coarsely calibrated magnetometer readings as part of the raw IMU measurements. To obtain the mean and covariance needed to parameterize a Gaussian estimate, a negative log-likelihood (NLL) loss is used to train the covariance estimator for each orientation output. This loss seeks to maximize the probability of the given ground truth, assuming a Gaussian distribution parameterized by the estimated orientation and covariance:

L_orient = ½ (q ⊟ q̂)^T Σ̂^{-1} (q ⊟ q̂) + ½ ln|Σ̂|.   (5)
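The manifold operators of Equation 3 and the measurement update of Equation 4 can be sketched in NumPy as follows. The helper names are ours, and `arctan2` is used in place of `atan` so the log also handles quaternions with a negative real part.

```python
import numpy as np

def quat_mul(q, r):
    """Hamilton product of quaternions [w, x, y, z]."""
    w1, x1, y1, z1 = q
    w2, x2, y2, z2 = r
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def quat_conj(q):
    return np.array([q[0], -q[1], -q[2], -q[3]])

def quat_exp(u):
    """exp: R^3 -> unit quaternions (Eq. 3)."""
    n = np.linalg.norm(u)
    if n < 1e-12:
        return np.array([1.0, 0.0, 0.0, 0.0])
    return np.concatenate([[np.cos(n)], np.sin(n) / n * u])

def quat_log(q):
    """log-bar: unit quaternion -> R^3 (Eq. 3)."""
    v = q[1:]
    n = np.linalg.norm(v)
    if n < 1e-12:
        return np.zeros(3)
    return np.arctan2(n, q[0]) / n * v

def boxminus(q2, q1):
    return 2.0 * quat_log(quat_mul(quat_conj(q1), q2))

def boxplus(q, delta):
    return quat_mul(q, quat_exp(delta / 2.0))

def measurement_update(x, P, q_meas, R):
    """Eq. 4: fuse the propagated quaternion x (3x3 covariance P) with the
    network-predicted quaternion q_meas (3x3 covariance R)."""
    K = P @ np.linalg.inv(P + R)              # Kalman gain
    x_new = boxplus(x, K @ boxminus(q_meas, x))
    P_new = (np.eye(3) - K) @ P
    return x_new, P_new
```

By construction, `boxplus(q, boxminus(p, q))` recovers `p`, so the update interpolates between the propagated and predicted orientations along the manifold, weighted by the Kalman gain.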
The output of the covariance estimation head of the network is a six-dimensional vector, following a standard parameterization of a covariance matrix (Russell and Reale 2019). The first three elements are the logs of the standard deviations, which are exponentiated and squared to form the diagonal elements of the covariance matrix. The remaining 3 are correlation coefficients between variables, which are multiplied by the relevant exponentiated standard deviations to form the off-diagonal covariance elements of the matrix.

Analytically, to convert from world-frame accelerometer values to position, one would need to perform two integrals. However, any offsets in the acceleration values would result in quadratic error growth. Therefore, we again adopt the use of a neural network to learn an appropriate approximation that is less susceptible to error accumulation, as has been demonstrated successfully by Herath, Yan, and Furukawa (2020) and Chen et al. (2018a). The position estimation network takes world-frame gyroscope and accelerometer channels as inputs and outputs the final user position in the same global reference frame as the orientation module. We use a Cartesian parameterization of the user position to match that of the rotated accelerometer.
The position module's architecture is depicted in Figure 2. We primarily rely on a 2-layer BiLSTM with a hidden size of 100. The input is a sequence of 6-DOF IMU measurements in the world frame. At each timestep, the hidden state is passed to 2 fully-connected layers with a tanh activation between them and hidden sizes of 50 and 20, respectively. The resulting vector is then passed to a linear layer that converts it to a two-dimensional Cartesian displacement relative to the start of each window. These are summed over time to form Cartesian positions relative to the start of the sequence, with each window's final position serving as the initial offset for the next window.
During test time, the LSTM hidden states are not propagated between sequence windows, as this periodic resetting helps to limit the accumulation of drift.
Over the course of training, progressively longer batch sequences are provided. We start with sequences of length 100 and progressively increase this over training to length 2000. We find this type of curriculum learning greatly reduces drift accumulation, as the overall error must be kept low over a longer time period. After this routine, the sequence length may be dropped back down to a shorter length to reduce latency. We use an MSE loss over the displacement of each LSTM window. In other words,

L_position = L_MSE(x_t - x_0, x̂_t - x̂_0).   (6)

To collect trajectories through the narrow hallways of a typical building, we rely on a SLAM rig (Kaarta Stencil) for ground truth. To obtain the phone's ground-truth orientation and position, we rigidly mount it to the rig, which uses a LiDAR sensor, video camera, and Xsens IMU to estimate its pose at 200 Hz. From testing in a Vicon motion capture studio, we measured low RMS orientation and position error for the rig.

Experiments
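The windowed loss of Equation 6 and the curriculum schedule can be sketched as follows. The five-step schedule is our assumption; the paper only specifies the range from 100 to 2000 samples.

```python
import numpy as np

def window_loss(x_true, x_pred):
    """Eq. 6: MSE between ground-truth and predicted displacements,
    each measured relative to the first sample of the window."""
    d_true = x_true - x_true[0]
    d_pred = x_pred - x_pred[0]
    return float(np.mean(np.sum((d_true - d_pred) ** 2, axis=-1)))

# Curriculum: train on progressively longer windows so low error must be
# maintained over longer horizons. The number of stages is an assumption.
curriculum_lengths = np.linspace(100, 2000, num=5, dtype=int)
```

Because both trajectories are re-referenced to the window start, a constant positional offset between them incurs no loss; only disagreement in displacement (i.e., drift within the window) is penalized.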
To demonstrate the effectiveness and utility of our inertial odometry system, we set three main goals for evaluation: (i) verify that our model produces better orientation estimates than the baselines, (ii) show that our model is able to achieve higher position localization accuracy than previous methods, and (iii) demonstrate that orientation error is a major source of final position error by showing that other inertial odometry methods benefit from our orientation module.
The main metric used to evaluate the orientation module is root-mean-squared error (RMSE) of orientation, measured as the direct angular distance between the estimated and ground-truth orientations. We evaluate the accuracy of our position estimate using metrics defined by Sturm et al. (2011) and used by RoNIN:
• Absolute Trajectory Error (ATE): the RMSE between corresponding points in the estimated and ground-truth trajectories. The error is defined as E_i = x_i - x̂_i, where i corresponds to the timestep. This is a measure of global consistency and usually increases with trajectory length.
• Time-Normalized Relative Trajectory Error (T-RTE): the RMSE between the displacements over all corresponding 1-minute windows in the estimated and ground-truth trajectories. The error is defined as E_i = (x_{i+t} - x_i) - (x̂_{i+t} - x̂_i), where i is the timestep and t is the interval. This measures local consistency between trajectories.
• Distance-Normalized Relative Trajectory Error (D-RTE): the RMSE between the displacements over all corresponding windows in the estimated and ground-truth trajectories where the ground-truth trajectory has traveled 1 meter. The error is defined as E_i = (x_{i+t_d} - x_i) - (x̂_{i+t_d} - x̂_i), where i corresponds to the timestep and t_d is the interval length required to traverse a distance of 1 m.
The RMSE for these metrics is calculated using the following equation, where E_i is the i-th error term out of m total:

RMSE = sqrt( (1/m) Σ_{i=1}^{m} ‖E_i‖² ).   (7)

We implemented our model in PyTorch 1.15 (Paszke et al. 2019) and trained it using the Adam optimizer (Kingma and Ba 2015) on an Nvidia RTX 2080Ti GPU. The orientation network is first individually trained using a fixed seed and a learning rate of 0.0005. Then, using these initialized weights, the position network is attached and trained using a learning rate of 0.001. We use a batch size of 64, with the network reaching convergence within 20 epochs. Each epoch involves a full pass through all training data.
At test time, an initial orientation can be provided or, assuming a calibrated magnetometer, the initial orientation can be estimated by the network directly with high accuracy relative to a predefined global frame.
This cannot be said for systems that rely solely on gyroscope integration, which produces an orientation relative to the initialization. As this system is meant to aid pedestrian navigation using a smartphone, this absolute initialization is a practical advantage.
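The trajectory metrics above can be computed as follows. This is a simplified sketch: a fixed index window is used for both RTE variants, whereas D-RTE as defined uses a per-point window covering 1 m of travel.

```python
import numpy as np

def rmse(errors):
    """Eq. 7: root mean square of per-sample error-vector norms."""
    errors = np.asarray(errors, dtype=float)
    return float(np.sqrt(np.mean(np.sum(errors ** 2, axis=-1))))

def ate(x_true, x_pred):
    """Absolute Trajectory Error: RMSE of pointwise position differences."""
    return rmse(x_true - x_pred)

def rte(x_true, x_pred, window):
    """Relative Trajectory Error: RMSE of the difference between true and
    predicted displacements over a sliding window of fixed length."""
    d_true = x_true[window:] - x_true[:-window]
    d_pred = x_pred[window:] - x_pred[:-window]
    return rmse(d_true - d_pred)
```

Note that omitting the norm inside Equation 7, as in the RoNIN evaluation bug discussed above, averages signed error components and systematically under-reports the error.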
System                   Bldg 1   Bldg 2   Bldg 3
iOS CoreMotion            0.39     0.37     0.40
MUSE                      0.21     0.25     0.45
Brossard et al. (2020)    0.23     0.30     0.47
OrientNet only (ours)     0.21     0.44     0.49
OrientNet+EKF (ours)

Table 1: Orientation RMSE comparison (in radians). Each building is separately trained and tested; building test sets are of similar length.

To evaluate our orientation module, we compare it against the iOS CoreMotion API, Brossard et al. (2020), and MUSE (Shen, Gowda, and Roy Choudhury 2018). The CoreMotion estimate is selected for its ubiquity; Brossard et al. (2020) is the most competitive deep-learning estimator, since it outperforms OriNet; MUSE is a high-performance traditional approach. As a reminder, CoreMotion and MUSE fuse magnetic readings.
To show the performance of our inertial odometry pipeline, we compare it against several different baseline inertial odometry methods. Pedestrian Dead Reckoning (PDR) is chosen as the representative of traditional odometry methods. We use a PDR baseline similar to that of Herath, Yan, and Furukawa (2020) that involves regressing a heading and distance every physical step. We assume a stride length of 0.67 m/step and use the heading from iOS CoreMotion.
The main data-driven inertial localization methods explored in prior work are IONet, RoNIN, and TLIO, all of which take orientation estimates directly from the phone API. For IONet, we use our own implementation as the original code is not publicly available. IONet was primarily evaluated in a small Vicon motion capture room. We have found, however, that IONet does not perform very well in large indoor environments, which is consistent with experiments run by Herath, Yan, and Furukawa (2020). We evaluate all three RoNIN variants (LSTM, TCN, and ResNet) using their exact open-source implementation. In our evaluations using their code, we noticed a bug in their evaluation metric, where they omitted the L2 norm in their calculation of RMSE when deriving ATE and RTE (see Equation 7). Because of this error, their metrics consistently under-report the true error; however, the relative comparisons between their models and the conclusions are still valid because the bug is applied consistently.
We use the correct method for these metrics, which explains the discrepancies between the relative sizes of our errors (in addition to trajectories being from different buildings and of different lengths). TLIO uses RoNIN-ResNet with a stochastic-cloning EKF to refine the orientation estimates; we use their released code for evaluation.

Figure 4: Comparison of model CDFs for orientation error.

Figure 5: Orientation module performance. (a) Model error between true and predicted orientations. (b) Correlation between orientation error and the covariance estimate of the predicted error distribution.
We now seek to answer the question of whether our orientation pipeline is worth using, i.e., does it outperform the systems that others use? We perform evaluations on our building-scale dataset, where each trajectory occurs over a long time period of 10 minutes, which allows one to more easily discern accumulated drift. Table 1 demonstrates that our model outperforms competing approaches by a considerable margin when trained separately on trajectories from each building. Averaging across all three buildings, our estimate is 0.28 radians (16.04°) more accurate than CoreMotion's estimate, 0.22 radians (12.83°) more accurate than Brossard et al.'s (2020), and 0.20 radians (11.50°) more accurate than MUSE's. While MUSE and Brossard et al. (2020) slightly outperform the base OrientNet, the OrientNet+EKF maintains a significant lead. In fact, at 0.08 radians (4.6°) in Bldg 1, our method nearly reaches the ground-truth rig's accuracy.
Figure 4's comparison of the error CDFs is particularly useful in understanding relative performance. We can see that the EKF addresses one of the main limitations of using only the OrientNet, namely that the outliers with high errors are eliminated. While the base OrientNet performs better than the other approaches most of the time, it has a larger proportion of large error terms due to jitter and occasional discontinuities that appear in the output, which reduces its RMSE down to the level of the others. Our full orientation module significantly outperforms other methods in all metrics, with better performance and fewer outliers.
The error growth over time is evident in Figure 5a, where all other methods exhibit a steeper error growth than ours, which stays relatively flat. While MUSE and the iOS esti-
Model          Bldg 1, Known Subjects     Bldg 1, Unknown Subjects
               ATE     T-RTE   D-RTE      ATE     T-RTE   D-RTE
PDR            26.98   16.49   2.26       24.29   12.65   2.77
IONet          33.42   22.97   2.47       31.28   24.04   2.29
RoNIN-LSTM     18.62    7.02   0.53       18.17    6.51   0.51
RoNIN-TCN      12.00    6.41   0.48       13.41    5.82   0.48
RoNIN-ResNet    9.03    6.43   0.56       12.07    5.95   0.49
TLIO            4.62    2.52   0.31        6.34    4.22   0.46
Ours

Table 2: Model position generalization across subjects. Known subjects (2.4 hr) were present in the train split; unknown subjects (2.2 hr) were not.
Model       Building 1              Building 2              Building 3
            ATE    T-RTE  D-RTE    ATE    T-RTE  D-RTE    ATE    T-RTE  D-RTE
PDR         25.70  14.66  2.50     21.86  19.48  1.66     12.66  12.74  1.09
R-LSTM      18.41   6.78  0.52     29.81  18.67  0.75     33.69  13.14  0.62
R-TCN       12.67   6.13  0.48     22.52  13.69  0.73     24.79  12.48  0.59
R-ResNet    10.48   6.20  0.53     35.44  15.71  0.49     14.11  11.78  0.60
TLIO         5.44   3.33  0.38      8.69   8.86
Table 3: Comparison across buildings using separately-trained models. RoNIN models are abbreviated with "R-".

mate can sometimes eventually recover from such drastic error growth via magnetic observations, our pipeline quickly and frequently adapts to keep the orientation estimate accurate in the face of constant device motion. Figure 5b shows that the predicted standard deviation correlates well with the actual error. The square root of the trace of the predicted orientation covariance matrix is, due to the manifold structure of our loss, the standard deviation of the absolute angular error. Overall, 60% of our estimates lie within one predicted standard deviation (a new covariance is predicted for each timestep) of the true orientation, 90% lie within 2 standard deviations, and 97% lie within 3. This approximately matches the expected probabilities of a Gaussian distribution, which suggests our network is producing reasonable covariance estimates.
Tables 2 and 3 compare our end-to-end model against a mix of traditional and deep-learning baselines. Table 2 demonstrates that a single model trained on Building 1 generalizes well to both the Known Subjects and Unknown Subjects test sets; furthermore, it outperforms all other methods on both. TLIO is the closest in performance because its EKF helps reduce drift by estimating sensor biases. One point to note is the consistent performance of PDR on the ATE metric. It achieves (relatively) low errors on this metric because step counting produces fewer updates, so the overall trajectory tends to stay in the same general region. Some of the other models drift slowly over time until the trajectory is no longer centered on the original location, despite almost always producing more accurate trajectory shapes, as reflected by their lower RTE metrics. IONet does not perform well on these large buildings, so it is omitted from the remaining results.
Table 3 presents the results of training a separate model per building. Here, our position estimate outperforms all other methods, especially in RTE. Lower RTE means the trajectory shape is more similar to ground truth, while lower ATE means the position has deviated less overall. Note that Buildings 2 and 3 yield larger errors due to their size and the increased presence of magnetic distortions, which degrade orientation estimates that rely on magnetic readings.
Figure 6 compares example trajectories in all three buildings among our method, the best-performing variant of RoNIN, and TLIO. It succinctly illustrates the importance of a good orientation estimator: TLIO's and RoNIN's use of the phone orientation estimate results in rotational drift that compromises the position estimate, to the extent that the trajectories can leave the floorplan entirely.
Table 4 examines the impact on position error of using the phone orientation, our orientation, and the ground-truth orientation with TLIO and RoNIN. Not only does our orientation module improve the performance of all the other position models (quite significantly for RoNIN), it also nearly reaches the theoretical maximum performance obtained when ground-truth orientations are provided directly.

Orientation   Metric   R-LSTM   R-TCN   R-ResNet   TLIO
API           ATE      18.41    12.67   10.48      5.44
              T-RTE     6.78     6.13    6.20      3.33
              D-RTE     0.52     0.48    0.53      0.38
Ours          ATE       7.03     6.04    5.66      4.67
              T-RTE     2.71     2.56    2.63      2.39
              D-RTE     0.35     0.30    0.39      0.29
True          ATE       6.53     5.69    4.49      4.53
              T-RTE     2.33     2.17    2.17      2.30
              D-RTE     0.28     0.26    0.38      0.27

Table 4: Localization using different orientation estimates on Building 1. RoNIN models abbreviated with "R-".
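For readers unfamiliar with the metrics in Tables 2 through 4, a rough sketch of how ATE and a time-based RTE can be computed follows. This assumes common definitions from the inertial-odometry literature (ATE as the RMSE of per-timestep position error, T-RTE as the RMSE of drift accumulated over fixed-length time windows); the exact windows and alignment used in the paper's evaluation may differ.

```python
import numpy as np

def ate(pred, gt):
    """Absolute Trajectory Error: RMSE of per-timestep position error (meters)."""
    d = np.linalg.norm(pred - gt, axis=1)
    return float(np.sqrt(np.mean(d ** 2)))

def t_rte(pred, gt, window):
    """Time-based Relative Trajectory Error: RMSE of the drift accumulated
    over windows of `window` samples, re-anchoring each window's start."""
    errs = []
    for s in range(0, len(pred) - window, window):
        dp = pred[s + window] - pred[s]  # predicted displacement over window
        dg = gt[s + window] - gt[s]      # true displacement over window
        errs.append(np.linalg.norm(dp - dg))
    return float(np.sqrt(np.mean(np.square(errs))))

# Toy check: constant-velocity ground truth vs. an estimate with 10% scale drift.
t = np.arange(0, 600)[:, None]
gt = np.hstack([0.01 * t, np.zeros_like(t, dtype=float)])
pred = np.hstack([0.011 * t, np.zeros_like(t, dtype=float)])
print(ate(pred, gt), t_rte(pred, gt, window=60))
```

Because T-RTE re-anchors each window, a method with an accurate trajectory shape but slow global drift (as described for some baselines above) can show low RTE while its ATE keeps growing.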
In our experiments, we notice two distinct failure modes for model generalization to new environments. The first is orientation failure due to reliance on magnetic readings, which degrades performance in environments whose magnetic fields differ wildly from those seen in training, e.g., in new buildings. The second is position failure due to variations in building shape and size: buildings in our dataset vary in dimension from tens to hundreds of meters and in their mix of sharp versus rounded turns. While the first mode affects our model because of its reliance on magnetic data, we found that all data-driven methods suffer from the second failure mode, regressing position inaccurately in cross-building evaluations (train on one location, test on a held-out location). This is perhaps unsurprising, as data-driven methods are known to fail on out-of-distribution examples. Regarding magnetic field variations, as long as our model is trained on a building with magnetic distortions, it can produce accurate orientation estimates that far exceed existing methods. This stems from its ability to capture magnetic field variation, but comes at the cost of degraded generalization in novel environments.

Figure 6: Trajectory comparison among ours, TLIO, and RoNIN. All use the same initial pose; trajectories are truncated slightly for clarity.
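Both failure modes ultimately act through the same operation: each sensor-frame IMU sample is rotated into the global frame by the current orientation estimate before the position network sees it, so any orientation error corrupts every downstream measurement. A minimal sketch of that rotation with a unit quaternion (the helper below is illustrative, not the paper's implementation):

```python
import numpy as np

def quat_rotate(q, v):
    """Rotate 3-vector v by unit quaternion q = (w, x, y, z),
    i.e. compute q * v * q^-1."""
    w, x, y, z = q
    u = np.array([x, y, z])
    # Standard identity: v' = v + 2*u x (u x v + w*v)
    return v + 2.0 * np.cross(u, np.cross(u, v) + w * v)

# 90-degree yaw (rotation about z): sensor-frame x maps to global y.
theta = np.pi / 2
q = np.array([np.cos(theta / 2), 0.0, 0.0, np.sin(theta / 2)])
accel_sensor = np.array([1.0, 0.0, 0.0])
accel_global = quat_rotate(q, accel_sensor)
print(accel_global)
```

A yaw error of even a few degrees in `q` smears the rotated acceleration across global axes, which is why the orientation-swap experiment in Table 4 produces such large position differences.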
In this work, we present a novel data-driven model that is one of the first to integrate deep orientation estimation with deep inertial odometry. Our orientation module uses an LSTM and EKF to form a stable, accurate orientation estimate that outperforms traditional and data-driven techniques such as CoreMotion, MUSE, and Brossard et al. (2020). Combined with a position module, this end-to-end system localizes better than previous methods across multiple buildings and users. In addition, our orientation module is a swap-in component capable of giving existing systems orientation performance comparable to visual-inertial systems in known environments. Lastly, we build a large dataset of device pose spanning 20 hours of pedestrian motion across 3 buildings and 15 people. Existing traditional inertial odometry methods rely on assumptions or constraints about the user's motion, while previous data-driven techniques rely on classical orientation estimates. A pertinent issue for future work is generalization across buildings, whether through further data collection in unique environments, data augmentation, or architectural modifications.
References
Ahmetovic, D.; Gleason, C.; Ruan, C.; Kitani, K.; Takagi, H.; and Asakawa, C. 2016. NavCog: A Navigational Cognitive Assistant for the Blind. In Proceedings of the 18th International Conference on Human-Computer Interaction with Mobile Devices and Services.
Brossard, M.; Bonnabel, S.; and Barrau, A. 2020. Denoising IMU Gyroscopes With Deep Learning for Open-Loop Attitude Estimation. IEEE Robotics and Automation Letters.
Chen, C.; Lu, X.; Markham, A.; and Trigoni, N. 2018a. IONet: Learning to Cure the Curse of Drift in Inertial Odometry. In Thirty-Second AAAI Conference on Artificial Intelligence.
Chen, C.; Zhao, P.; Lu, C. X.; Wang, W.; Markham, A.; and Trigoni, N. 2018b. OxIOD: The Dataset for Deep Inertial Odometry. CoRR abs/1809.07491. URL http://arxiv.org/abs/1809.07491.
Cortes, S.; Solin, A.; and Kannala, J. 2018. Deep Learning Based Speed Estimation for Constraining Strapdown Inertial Navigation on Smartphones.
Kingma, D. P.; and Ba, J. 2014. Adam: A Method for Stochastic Optimization. CoRR abs/1412.6980. URL http://arxiv.org/abs/1412.6980.
Kok, M.; and Schön, T. B. 2016. Magnetometer calibration using inertial sensors. CoRR abs/1601.05257. URL http://arxiv.org/abs/1601.05257.
Liu, W.; Caruso, D.; Ilg, E.; Dong, J.; Mourikis, A.; Daniilidis, K.; Kumar, V.; Engel, J.; Valada, A.; and Asfour, T. 2020. TLIO: Tight Learned Inertial Odometry. IEEE Robotics and Automation Letters.
Sabatini, A. M. 2006. Quaternion-based extended Kalman filter for determining orientation by inertial and magnetic sensing. IEEE Trans. Biomed. Engineering.
Shen, S.; Gowda, M.; and Roy Choudhury, R. 2018. Closing the Gaps in Inertial Motion Tracking. In Proceedings of the 24th Annual International Conference on Mobile Computing and Networking.
Yan, H.; Shan, Q.; and Furukawa, Y. 2018. RIDI: Robust IMU Double Integration. In Proceedings of the European Conference on Computer Vision (ECCV).
Zhang, J.; and Singh, S. 2014. LOAM: Lidar Odometry and Mapping in Real-time. In Robotics: Science and Systems.