Optimization-Based Visual-Inertial SLAM Tightly Coupled with Raw GNSS Measurements
Jinxu Liu, Wei Gao* and Zhanyi Hu
Abstract — Fusing vision, Inertial Measurement Unit (IMU) and Global Navigation Satellite System (GNSS) information is a promising solution for accurate global positioning in complex urban scenes, because of the complementarity of the different sensors. Unlike the loose coupling approaches and the EKF-based approaches in the literature, we propose an optimization-based visual-inertial SLAM tightly coupled with raw GNSS measurements, including pseudoranges and Doppler shift, which is the first of such approaches to our knowledge. Reprojection error, IMU pre-integration error and raw GNSS measurement error are jointly optimized using bundle adjustment in a sliding window, and the asynchronism between images and raw GNSS measurements is considered. Marginalization is performed in the sliding window, and some methods dealing with noisy measurements and vulnerable situations are employed. Experimental results on a public dataset in complex urban scenes prove that our proposed approach outperforms state-of-the-art visual-inertial SLAM, GNSS single point positioning, as well as a loose coupling approach, both in scenes that mainly contain low-rise buildings and in scenes that contain urban canyons.
I. INTRODUCTION

Positioning in outdoor scenes has long been a task of concern for researchers in the realms of autonomous driving, unmanned aerial vehicles and large-scale augmented reality. Global Navigation Satellite System (GNSS) has been widely exploited for decades, due to its ability to acquire geodetic coordinates and its stable performance in open areas. However, in urban canyons where there are many skyscrapers in the surroundings, GNSS positioning results dramatically deteriorate because most of the GNSS signals are blocked or reflected. On the other hand, cameras and Inertial Measurement Units (IMUs) do not suffer from urban canyons, and visual SLAM (V-SLAM) and visual-inertial SLAM (VI-SLAM) perform well if the scene has abundant texture and not too many moving objects. Moreover, the local accuracy of V-SLAM or VI-SLAM approaches is much higher than that of single point GNSS positioning, due to the fact that the current pose is always computed depending on previous poses in SLAM approaches.

Nevertheless, V-SLAM and VI-SLAM approaches also have their inherent shortcomings. Besides the deterioration of accuracy in low-textured scenes and scenes with many moving objects, SLAM approaches suffer from drift and thus have lower global accuracy, as some directions of the parameters are unobservable. More importantly, since global position is unobservable for SLAM approaches, what can be obtained from them are poses relative to the starting point, which need to be transformed into the geodetic coordinate frame with the aid of additional global information provided by other sensors, such as GNSS. Hence GNSS positioning and V-SLAM or VI-SLAM approaches are complementary, because of their different global and local accuracy, as well as their different vulnerable scenes.

This work was supported in part by the National Key R&D Program of China (2016YFB0502002), and in part by the Natural Science Foundation of China (61872361). *Corresponding Author: Wei Gao ([email protected]). All authors are with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, and with the School of Artificial Intelligence, University of Chinese Academy of Sciences, China.

II. RELATED WORK
A. Loose Coupling Approaches
Loose coupling approaches first derive poses from feature points, or positions and velocities from raw GNSS measurements, or both of the above two quantities. Then the derived quantities, rather than raw measurements, are involved in the fusion stage, which generally adopts either an Extended Kalman Filter (EKF) or an optimization framework. Some approaches [1], [2] derive poses from feature points and fuse them with raw GNSS measurements such as pseudorange and Doppler shift, where loose coupling is applied on the vision side. Conversely, some approaches [3], [4] derive positions from raw GNSS measurements and fuse them with feature points from the images, where loose coupling is applied on the GNSS side. Other approaches [5], [6], [7], [8] apply loose coupling on both the vision side and the GNSS side, i.e. the derived visual quantities and the derived GNSS quantities are involved in the fusion stage. In addition, there are approaches [2], [9] where visual information is exploited to aid GNSS positioning by judging which satellites are visible. The above loose coupling approaches are not optimal. For example, when the GNSS measurements are of good quality but the GNSS satellites are poorly distributed, the loose coupling approaches will deteriorate because the derived GNSS quantities are inaccurate, whereas tight coupling approaches are not affected under such conditions.
B. Tight Coupling Approaches
This category of approaches employs tight coupling on both the vision side and the GNSS side. That is to say, both the feature points from the image and the raw GNSS measurements such as pseudorange and Doppler shift serve as measurements, either in an EKF-based or in an optimization-based framework. The EKF-based approaches [10], [11] tightly couple vision, IMU and single point GNSS. They perform state prediction using the IMU measurements, and update the states using the feature points as well as the pseudorange and Doppler shift measurements. A similar EKF-based framework is adopted in [12], while [12] utilizes double-differenced GNSS instead of single point GNSS, and hence double-differenced pseudorange and double-differenced carrier phase are utilized as measurements for state update, instead of the directly measured pseudorange and Doppler shift. Double-differenced GNSS provides measurements for more accurate positioning, but requires an additional base station in comparison with single point GNSS.

Besides EKF-based approaches, optimization-based sensor fusion approaches are emerging as well. Compared with EKF, batch optimization allows for the reduction of error through relinearization in visual-inertial navigation systems [13] and proves to have better performance in the GNSS-IMU fusion task [14]. Therefore, optimization is a promising solution to fuse vision, IMU and GNSS measurements. The optimization-based approach [15] tightly couples feature points with pseudorange from single point GNSS in bundle adjustment. However, Doppler shift is not utilized, and IMU measurements are not tightly coupled in [15]; only the magnetometer in the IMU is exploited, to determine the direction of the local world frame.
Nevertheless, tightly coupling IMU measurements can eliminate the effect of the asynchronism between vision and GNSS measurements, thus the low-speed motion assumption applied in [15] is not needed in this case. In this paper we propose an approach that tightly couples vision, IMU and single point raw GNSS measurements including pseudoranges and Doppler shift in an optimization-based framework, which is the first of such approaches to our knowledge.

III. PRELIMINARIES

A. Frames and Notations
In this paper, the world frames include the earth-centered, earth-fixed (ECEF) frame, the ground east-north-up (ENU) frame, and the local world frame for SLAM. The sensor frames include the camera frame, the IMU frame and the GNSS receiver frame. The above world frames and sensor frames are illustrated in Fig. 1. We use $(\cdot)^{W_E}$, $(\cdot)^{W_G}$ and $(\cdot)^{W_L}$ to denote the ECEF frame, the ground ENU frame and the local world frame for SLAM respectively, and use $(\cdot)^{c}$, $(\cdot)^{b}$ and $(\cdot)^{g}$ to denote the camera frame, IMU frame and GNSS receiver frame respectively. Specially, we use $(\cdot)^{b_k}$ and $(\cdot)^{g_k}$ to denote the IMU frame and GNSS receiver frame corresponding to the $k$-th image, and slightly abuse the symbols using $(\cdot)^{b_t}$ and $(\cdot)^{g_t}$ to denote the IMU frame and GNSS receiver frame at some moment $t$.

Let $R_{BA}$ denote the rotation matrix that takes a vector in frame $\{A\}$ to frame $\{B\}$, and $q_{BA}$ is its quaternion form. $p_{BA}$ is the coordinate of the origin point of frame $\{A\}$ in frame $\{B\}$, and $v_{BA}$ is the velocity of the origin point of frame $\{A\}$ measured in frame $\{B\}$. Note that for the GNSS receiver frame, only the coordinate of its origin point in frame $\{W_L\}$ matters in our approach, and the orientations of its axes can be regarded as arbitrary. Let $b_{a_k}$ and $b_{\omega_k}$ denote the accelerometer bias and gyroscope bias corresponding to image $k$ respectively. Moreover, we use $[\cdot]_\times$ to denote the skew-symmetric matrix corresponding to a vector.

Fig. 1. Illustration of world frames and sensor frames. The globe in the upper left of the figure is the earth globe. The origin of the ECEF frame $\{W_E\}$ is at the center of the earth. The X axis of the ECEF frame points to the point of intersection between the prime meridian and the equator, and the Z axis points to the North Pole. The ground ENU frame $\{W_G\}$ and the local world frame for SLAM $\{W_L\}$ are both located on the ground, with their Z axes pointing upward.

B. GNSS Single Point Positioning and Velocity Measurement
GNSS single point positioning and velocity measurement are used in the GNSS-SLAM initialization in Sect. IV-A and in filtering out noisy GNSS measurements in Sect. IV-D. For single point positioning, the states to be estimated include

$$X_P = \left[\, p^{W_E}_g,\; \delta t_r^0 \ldots \delta t_r^{J-1} \,\right], \quad (1)$$

where $J$ is the number of different GNSS constellations, such as GPS, GLONASS, etc., and $\delta t_r^j$ for $j = 0 \ldots J-1$ is the receiver clock bias w.r.t. constellation $j$. The cost function $c_P(X_P)$ that is to be minimized can be written as

$$c_P(X_P) = \sum_{i=0}^{N-1} \| e_i^P \|^2, \quad (2)$$

where $N$ is the number of observed satellites and

$$e_i^P = \| p^{W_E}_{s_i} - p^{W_E}_g \| + c\,\delta t_r^{j(i)} - c\,\delta t^{s_i} - \rho_i, \quad (3)$$

where $c$ is the speed of light, and $p^{W_E}_{s_i}$ and $\delta t^{s_i}$ are the position in the ECEF frame and the clock bias of satellite $i$ respectively, which are computed directly from the broadcast ephemeris. The subscript $j(i)$ means the constellation corresponding to satellite $i$, and $\rho_i$ is the pseudorange measurement after ionospheric delay correction and tropospheric delay correction.

For velocity measurement, the states to be estimated include

$$X_D = \left[\, v^{W_E}_g,\; \dot{\delta t}_r^0 \ldots \dot{\delta t}_r^{J-1} \,\right], \quad (4)$$

where $J$ is the number of different GNSS constellations, and $\dot{\delta t}_r^j$ for $j = 0 \ldots J-1$ is the receiver clock drift w.r.t. constellation $j$. The cost function $c_D(X_D)$ can be written as

$$c_D(X_D) = \sum_{i=0}^{N-1} \| e_i^D \|^2, \quad (5)$$

where $N$ is the number of observed satellites and

$$e_i^D = l_i^T \left( v^{W_E}_{s_i} - v^{W_E}_g \right) + c\,\dot{\delta t}_r^{j(i)} - c\,\dot{\delta t}^{s_i} - \left( -\lambda_i D_i \right), \quad (6)$$

where $v^{W_E}_{s_i}$ and $\dot{\delta t}^{s_i}$ are the velocity in the ECEF frame and the clock drift of satellite $i$ respectively, which are computed directly from the broadcast ephemeris. $\lambda_i$ and $D_i$ are the wavelength of the carrier and the Doppler shift measurement respectively. $l_i$ is computed as $l_i = \frac{p^{W_E}_{s_i} - p^{W_E}_g}{\| p^{W_E}_{s_i} - p^{W_E}_g \|}$, leveraging $p^{W_E}_g$ computed through single point positioning. The meanings of $c$ and the subscript $j(i)$ are the same as those in (3).

IV. METHOD

A. GNSS-SLAM Initialization
After our system starts, the initialization for VI-SLAM is performed, and the nonlinear optimization for VI-SLAM starts running temporarily following the routine of [16]. Meanwhile, GNSS single point positioning is performed as described in Sect. III-B. Hence we obtain the VI-SLAM trajectory and the GNSS single point positioning trajectory respectively. At the GNSS-SLAM initialization stage we align the above two trajectories, in order to compute the values of $R^{W_E}_{W_G}$ and $p^{W_E}_{W_G}$, as well as the initial values of $R^{W_G}_{W_L}$ and $p^{W_G}_{W_L}$.

Firstly, we select one position calculated by GNSS single point positioning as the reference point $p^{W_E}_{ref}$, which serves as the origin point of the ground ENU frame. In our implementation, the third GNSS single point positioning result since the system starts is selected as the reference point. Given the reference point, $R^{W_E}_{W_G}$ and $p^{W_E}_{W_G}$ can be computed straightforwardly. We have $p^{W_E}_{W_G} = p^{W_E}_{ref}$, and

$$R^{W_E}_{W_G} = \begin{bmatrix} -\sin\lambda & \cos\lambda & 0 \\ -\sin\varphi\cos\lambda & -\sin\varphi\sin\lambda & \cos\varphi \\ \cos\varphi\cos\lambda & \cos\varphi\sin\lambda & \sin\varphi \end{bmatrix}^T, \quad (7)$$

where $\varphi$ and $\lambda$ are the latitude and longitude of the reference point respectively, which can be easily converted from the ECEF position $p^{W_E}_{ref}$ [17]. Then the GNSS single point positioning trajectory is converted from the ECEF frame to the ground ENU frame as

$$p^{W_G}_g = {R^{W_E}_{W_G}}^T \left( p^{W_E}_g - p^{W_E}_{W_G} \right). \quad (8)$$

Hence we have two sets of positions $\{ p^{W_G}_{g_l} \mid l = 0 \ldots L-1 \}$ and $\{ p^{W_L}_{g_l} \mid l = 0 \ldots L-1 \}$, of which the former includes the positions from GNSS single point positioning in frame $\{W_G\}$, and the latter includes the positions estimated by VI-SLAM in frame $\{W_L\}$. Note that the positions from VI-SLAM are in fact $p^{W_L}_{b_l}$, but at the initialization stage we neglect the translation from the IMU frame to the GNSS receiver frame, thus we regard $p^{W_L}_{b_l}$ as $p^{W_L}_{g_l}$. At this stage interpolation is performed to align the timestamps of the above two sets of positions.
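The frame construction in (7) and the conversion in (8) can be sketched in a few lines. This is our own illustrative reconstruction; the function names and the NumPy dependency are assumptions, not the authors' code:

```python
import numpy as np

def enu_to_ecef_rotation(lat, lon):
    """R^{WE}_{WG} of (7): columns are the east, north and up axes
    of the ground ENU frame expressed in the ECEF frame."""
    sl, cl = np.sin(lon), np.cos(lon)
    sp, cp = np.sin(lat), np.cos(lat)
    east = np.array([-sl, cl, 0.0])
    north = np.array([-sp * cl, -sp * sl, cp])
    up = np.array([cp * cl, cp * sl, sp])
    return np.stack([east, north, up], axis=1)

def ecef_to_enu(p_ecef, p_ref_ecef, lat, lon):
    """Eq. (8): express an ECEF position in the ground ENU frame
    anchored at the reference point p_ref_ecef."""
    R = enu_to_ecef_rotation(lat, lon)
    return R.T @ (np.asarray(p_ecef, float) - np.asarray(p_ref_ecef, float))
```

Since the rotation is orthonormal by construction, the ECEF-to-ENU direction is simply its transpose, as used in (8).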
At last a 5-Degrees-of-Freedom (DoF) alignment between the above two sets is performed to compute the initial values of $R^{W_G}_{W_L}$ and $p^{W_G}_{W_L}$, which writes as

$$\min_{s,\, R^{W_G}_{W_L},\, p^{W_G}_{W_L}} \sum_{l=0}^{L-1} \left\| p^{W_G}_{g_l} - s R^{W_G}_{W_L} p^{W_L}_{g_l} - p^{W_G}_{W_L} \right\|^2. \quad (9)$$

The scale parameter $s$ is estimated because VI-SLAM is prone to suffer from scale drift in degenerate cases [18]. Note that $R^{W_G}_{W_L}$ only contains one DoF, which is the rotation around the Z axis, because the Z axes of both frame $\{W_G\}$ and frame $\{W_L\}$ point upward. The initial value of $p^{W_G}_{W_L}$ is solved by centering the two sets of positions, and the initial value of $R^{W_G}_{W_L}$ is obtained by searching every 1 degree after centering, followed by solving the minimization problem (9) iteratively. After minimization, $p^{W_G}_{W_L}$ is adjusted according to the estimated $s$, because the scaling factor $s$ is not applied in the following stages.

B. Tightly Coupled Optimization
After GNSS-SLAM initialization, raw GNSS measurements, i.e. pseudoranges and Doppler shift, are integrated into the optimization. The states to be estimated include

$$X = \left[\, x_0 \ldots x_{K-1},\; f_0 \ldots f_{M-1},\; T^b_c,\; T^{W_G}_{W_L},\; \delta t_{r_0} \ldots \delta t_{r_{E-1}},\; \dot{\delta t}_{r_0} \ldots \dot{\delta t}_{r_{E-1}} \,\right], \quad (10)$$

where $K$ is the number of images in the sliding window, $M$ is the number of landmarks, and $E$ is the number of GNSS epochs in the sliding window. $f$ is the inverse depth of one landmark in the camera frame, and

$$x_k = \left[\, p^{W_L}_{b_k}, v^{W_L}_{b_k}, q^{W_L}_{b_k}, b_{a_k}, b_{\omega_k} \,\right], \quad k = 0 \ldots K-1,$$
$$T^b_c = \left[\, R^b_c, p^b_c \,\right], \quad T^{W_G}_{W_L} = \left[\, R^{W_G}_{W_L}, p^{W_G}_{W_L} \,\right],$$
$$\delta t_{r_e} = \left[\, \{ \delta t^j_{r,e} \mid \forall j \in O(e) \} \,\right], \quad e = 0 \ldots E-1,$$
$$\dot{\delta t}_{r_e} = \left[\, \{ \dot{\delta t}^j_{r,e} \mid \forall j \in O(e) \} \,\right], \quad e = 0 \ldots E-1, \quad (11)$$

where $O(e)$ is the set of observable GNSS constellations at epoch $e$, $\delta t^j_{r,e}$ and $\dot{\delta t}^j_{r,e}$ are the receiver clock bias and clock drift w.r.t. GNSS constellation $j$ at epoch $e$ respectively, and $\delta t_{r_e}$ and $\dot{\delta t}_{r_e}$ are the vectors containing the receiver clock biases and clock drifts for all the observable constellations at epoch $e$ respectively. Compared with VINS-mono [16], the introduced additional parameters include the transformation from the local world frame to the ground ENU frame, as well as the clock biases and clock drifts. The transformation from the ground ENU frame to the ECEF frame $(R^{W_E}_{W_G}, p^{W_E}_{W_G})$ is treated as a known quantity rather than a parameter, because the ground ENU frame can be located anywhere on the ground that is not too far away from the current position. Although GNSS measurements and images are collected at different moments, we do not introduce additional parameters concerning pose and velocity at the GNSS measurement moments. In other words, we incorporate intermediate GNSS measurements, which is inspired by [19].
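The parameter blocks of (10) and (11) can be mirrored in a small container sketch. The names below are our own, not from the paper's implementation; the point it illustrates is that the number of clock-bias and clock-drift entries per epoch varies with the observable constellations $O(e)$:

```python
from dataclasses import dataclass, field
from typing import Dict, List

import numpy as np

@dataclass
class FrameState:
    # x_k in (11): position, velocity, orientation quaternion (w, x, y, z),
    # accelerometer bias and gyroscope bias.
    p: np.ndarray = field(default_factory=lambda: np.zeros(3))
    v: np.ndarray = field(default_factory=lambda: np.zeros(3))
    q: np.ndarray = field(default_factory=lambda: np.array([1.0, 0.0, 0.0, 0.0]))
    ba: np.ndarray = field(default_factory=lambda: np.zeros(3))
    bg: np.ndarray = field(default_factory=lambda: np.zeros(3))

@dataclass
class EpochClockState:
    # One clock bias (s) and clock drift (s/s) per constellation in O(e).
    bias: Dict[str, float] = field(default_factory=dict)
    drift: Dict[str, float] = field(default_factory=dict)

@dataclass
class SlidingWindowState:
    frames: List[FrameState] = field(default_factory=list)       # x_0 ... x_{K-1}
    inv_depths: List[float] = field(default_factory=list)        # f_0 ... f_{M-1}
    yaw_WG_WL: float = 0.0                                       # 1-DoF rotation of the {W_L}-to-{W_G} transform
    p_WG_WL: np.ndarray = field(default_factory=lambda: np.zeros(3))
    epochs: List[EpochClockState] = field(default_factory=list)  # one entry per GNSS epoch in the window
```

A window holding $K$ frames and $E$ epochs would then simply carry $K$ `FrameState` and $E$ `EpochClockState` objects, each of the latter keyed by whichever constellations were observable at that epoch.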
Note that just like in the GNSS-SLAM initialization stage, $R^{W_G}_{W_L}$ only has 1 DoF, so it is parameterized only by the rotation angle around the Z axis, which we write as $\kappa^{W_G}_{W_L}$. The cost function $c(X)$ that is to be minimized writes as

$$c(X) = c_{reproj}(X) + c_{IMU}(X) + \sum_{e=0}^{E-1} \sum_{i \in S(e)} {e^P_{i,e}}^T W_P\, e^P_{i,e} + \sum_{e=0}^{E-1} \sum_{i \in S(e)} {e^D_{i,e}}^T W_D\, e^D_{i,e} + e_{marg}^T e_{marg}, \quad (12)$$

where $E$ is the number of GNSS epochs in the sliding window and $S(e)$ is the set of observable GNSS satellites at epoch $e$. $e^P_{i,e}$ and $e^D_{i,e}$ are the pseudorange residual and the Doppler shift residual for satellite $i$ observed at epoch $e$ respectively. $e_{marg}$ denotes the marginalization residual. The reprojection factors $c_{reproj}(X)$ and IMU factors $c_{IMU}(X)$ are identical to those in VINS-mono [16], and the readers may refer to (14), (16) and (17) in VINS-mono [16] for details. The IMU pre-integration performed in our proposed approach is also identical to that in VINS-mono [16].

As for the GNSS factors, the pseudorange residual $e^P_{i,e}$ and the Doppler shift residual $e^D_{i,e}$ write as

$$e^P_{i,e} = \left\| p^{W_E}_{s_{i,e}} - \left( R^{W_E}_{W_G} R^{W_G}_{W_L} p^{W_L}_{g_{m(e)}} + R^{W_E}_{W_G} p^{W_G}_{W_L} + p^{W_E}_{W_G} \right) \right\| + c\,\delta t^{j(i)}_{r,e} - c\,\delta t^{s_i}_e - \rho_{i,e},$$
$$e^D_{i,e} = l^T_{i,e} \left( v^{W_E}_{s_{i,e}} - R^{W_E}_{W_G} R^{W_G}_{W_L} v^{W_L}_{g_{m(e)}} \right) + c\,\dot{\delta t}^{j(i)}_{r,e} - c\,\dot{\delta t}^{s_i}_e - \left( -\lambda_{i,e} D_{i,e} \right), \quad (13)$$

where

$$l_{i,e} = \frac{ p^{W_E}_{s_{i,e}} - \left( R^{W_E}_{W_G} R^{W_G}_{W_L} p^{W_L}_{g_{m(e)}} + R^{W_E}_{W_G} p^{W_G}_{W_L} + p^{W_E}_{W_G} \right) }{ \left\| p^{W_E}_{s_{i,e}} - \left( R^{W_E}_{W_G} R^{W_G}_{W_L} p^{W_L}_{g_{m(e)}} + R^{W_E}_{W_G} p^{W_G}_{W_L} + p^{W_E}_{W_G} \right) \right\| },$$
$$p^{W_L}_{g_{m(e)}} = p^{W_L}_{b_k} + v^{W_L}_{b_k} \Delta t_{k,m(e)} + R^{W_L}_{b_k} \left( \alpha^{b_k}_{b_{m(e)}} + R(\gamma^{b_k}_{b_{m(e)}})\, p^b_g \right) - \tfrac{1}{2} g^{W_L} \Delta t_{k,m(e)}^2,$$
$$v^{W_L}_{g_{m(e)}} = v^{W_L}_{b_k} + R^{W_L}_{b_k} \left( \beta^{b_k}_{b_{m(e)}} - R(\gamma^{b_k}_{b_{m(e)}}) [p^b_g]_\times \left( \hat{\omega}_{m(e)} - b_{\omega_k} \right) \right) - g^{W_L} \Delta t_{k,m(e)}.$$
(14)

In (13) and (14), $p^{W_E}_{s_{i,e}}$ and $\delta t^{s_i}_e$ are the position and clock bias of satellite $i$ at GNSS epoch $e$ respectively, and $v^{W_E}_{s_{i,e}}$ and $\dot{\delta t}^{s_i}_e$ are the velocity and clock drift of satellite $i$ at epoch $e$ respectively. The above four quantities are computed directly from the broadcast ephemeris. The subscript $m(e)$ denotes the measurement moment of GNSS epoch $e$. The subscript $j(i)$ is the GNSS constellation corresponding to satellite $i$, hence the parameters $\delta t^{j(i)}_{r,e}$ and $\dot{\delta t}^{j(i)}_{r,e}$ are the receiver clock bias and clock drift w.r.t. GNSS constellation $j(i)$ at epoch $e$ respectively. $\rho_{i,e}$, $\lambda_{i,e}$ and $D_{i,e}$ are the pseudorange measurement, wavelength of the carrier, and Doppler shift measurement of satellite $i$ at epoch $e$ respectively. $c$ is the speed of light, and $\Delta t_{k,m(e)}$ is the time interval from the exposure moment of image $k$ to the moment $m(e)$. $p^b_g$ is the translational component of the IMU-GNSS receiver extrinsic parameters, which is assumed to have been accurately calibrated beforehand in this paper. $\alpha^{b_k}_{b_{m(e)}}$, $\beta^{b_k}_{b_{m(e)}}$ and $\gamma^{b_k}_{b_{m(e)}}$ are the nominal states coming from IMU pre-integration from the exposure moment of image $k$ to the moment $m(e)$, and they correspond to displacement, change in velocity, and relative rotation respectively. $R(\gamma^{b_k}_{b_{m(e)}})$ is the rotation matrix converted from the rotation quaternion $\gamma^{b_k}_{b_{m(e)}}$. For the details of IMU pre-integration the readers may refer to (3) and (5) in VINS-mono [16]. $\hat{\omega}_{m(e)}$ is the angular velocity measurement from the IMU at moment $m(e)$, and $g^{W_L}$ is the gravity vector in frame $\{W_L\}$, which writes as $[0, 0, g]^T$ with $g$ being the magnitude of gravity. In (14), an approximation is applied in that the gyroscope bias at the image exposure moment, $b_{\omega_k}$, substitutes for the gyroscope bias at the GNSS measurement moment $m(e)$, because the gyroscope bias is a slowly time-varying quantity.

In fact, $W_P$ and $W_D$ in (12) are both $1 \times 1$ matrices. In our implementation $W_P = 1$ and $W_D = 4$, supposing the median errors of the pseudorange measurements and of the pseudorange rates converted from the Doppler shift measurements to be 1 m and 0.5 m/s respectively. The optimization is performed in a sliding window minimizing the cost function $c(X)$ using Ceres Solver [20].

C. Sliding Window and Marginalization
In the sliding window in our approach, illustrated in Fig. 2, the parameters related to GNSS, i.e. the GNSS receiver clock bias and clock drift, are attached to the image frame just before them. There may exist more than one GNSS parameter node attached to a certain image frame.

Fig. 2. Marginalization in the sliding window. The hollow circles represent the parameters (nodes), and the solid rectangles represent the factors (edges). The camera-IMU extrinsic parameters and the marginalization factor are not shown here for clarity of explanation.
If the second latest frame is a keyframe, we keep it in the sliding window, and marginalize the oldest keyframe together with the GNSS parameters and feature points attached to it, as well as the reprojection factors, IMU factors and GNSS factors related to them. If the second latest frame is not a keyframe, we simply remove the frame and all its corresponding visual measurements. However, the GNSS parameters attached to the second latest frame will not be removed; they will be attached to the third latest frame instead, i.e. the "remove" and "add" operations in Fig. 2. The GNSS factors that were originally attached to the second latest frame will be changed, for the reason that the IMU pre-integration in the above GNSS factors needs to be recomputed, because its starting moment switches from the second latest frame to the third latest frame. The GNSS measurements in the above GNSS factors remain unchanged.

The marginalization in our approach is carried out using the Schur complement. The readers may refer to [21] and [16] for details of applying marginalization.
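The Schur-complement step can be illustrated on a linear-Gaussian toy system. The sketch below is our own (see [21] and [16] for the full formulation used in practice); it marginalizes the leading block of an information matrix $H$ and information vector $b$:

```python
import numpy as np

def marginalize(H, b, m):
    """Marginalize the first m variables of the Gaussian system H x = b
    via the Schur complement, returning the prior (H', b') on the rest:
        H' = Hrr - Hrm Hmm^{-1} Hmr,   b' = br - Hrm Hmm^{-1} bm."""
    Hmm, Hmr = H[:m, :m], H[:m, m:]
    Hrm, Hrr = H[m:, :m], H[m:, m:]
    Hmm_inv = np.linalg.inv(Hmm)
    H_prior = Hrr - Hrm @ Hmm_inv @ Hmr
    b_prior = b[m:] - Hrm @ Hmm_inv @ b[:m]
    return H_prior, b_prior
```

For a linear system this is exact: the resulting prior encodes the same marginal mean and covariance for the remaining variables as the full system did, which is why the removed frames and GNSS parameters can be dropped without losing their information.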
D. Methods Dealing with Noisy Measurements and Vulnerable Situations

• In urban canyons, raw GNSS measurements can be pretty noisy. Therefore, before the GNSS measurements enter the sliding window for optimization, we perform single point positioning and velocity measurement as described in Sect. III-B, and compute the residuals of the pseudoranges and Doppler shift measurements. The pseudoranges and Doppler shifts are filtered out if their corresponding residuals exceed the thresholds $T_P$ and $T_D$ respectively. If too few GNSS measurements are left after filtering out the noisy measurements at a certain GNSS epoch, we filter out all the GNSS measurements at that epoch altogether. In our implementation, $T_P$ and $T_D$ are fixed thresholds expressed in meters and in m/s respectively. Fig. 3 shows the GNSS single point positioning results and the results of our filtering operation. From Fig. 3 we can see that the GNSS epochs at which the measurements are all filtered out generally experience less accurate single point positioning results than the epochs at which the measurements are partially filtered out or not filtered out.

Fig. 3. GNSS single point positioning results w.r.t. the east and north directions (a), and w.r.t. the height direction (b). The red, green and blue markers are epochs at which GNSS measurements are all filtered out, partially filtered out and not filtered out respectively. The single point positioning is performed as described in Sect. III-B.

• From the above single point positioning, the GNSS receiver clock bias can be estimated. The estimated clock bias serves as the initial value for the clock bias parameters in (10).

• At the initial stage just after GNSS-SLAM initialization, a prior factor constraining the parameters $(R^{W_G}_{W_L}, p^{W_G}_{W_L})$ is added to the sliding window, because at the initial stage the marginalization factor does not contain enough information from past GNSS measurements, and thus the estimation of $(R^{W_G}_{W_L}, p^{W_G}_{W_L})$ is susceptible to noise in the GNSS measurements. After the GNSS factors in more than $T_N$ GNSS epochs have been marginalized, the prior factor is permanently removed from the sliding window. Note that the prior factor is not marginalized, hence it takes no effect once it is removed. The prior residual writes as

$$e_{prior} = \begin{bmatrix} w \left( p^{W_G}_{W_L} - \hat{p}^{W_G}_{W_L} \right) \\ w \left( \kappa^{W_G}_{W_L} - \hat{\kappa}^{W_G}_{W_L} \right) \end{bmatrix},$$

where $p^{W_G}_{W_L}$ is the parameter in (10), and $\kappa^{W_G}_{W_L}$ is the rotation around the Z axis used to parameterize $R^{W_G}_{W_L}$ in (10). $\hat{p}^{W_G}_{W_L}$ and $\hat{\kappa}^{W_G}_{W_L}$ are estimated from the GNSS-SLAM initialization in Sect. IV-A. In our implementation, $T_N = 30$.

• Since the accuracy of IMU pre-integration decreases as the time interval for pre-integration increases, any GNSS factor that is more than $T_t$ seconds later than the image frame it is attached to is simply removed from the sliding window. In this way we avoid inaccurate IMU pre-integration results in the GNSS factors (see (14)). In our implementation the raw GNSS measurements come at a frequency of about 4 Hz, and $T_t$ is chosen accordingly.

• According to [18] and [22], a VI-SLAM system is not well constrained before the platform has made a turn for the first time. Therefore, we employ the strategy proposed in [22], but in a much simpler way: we start to adjust the camera-IMU extrinsic parameters $(R^b_c, p^b_c)$ in the optimization only after the platform has made a turn.

• In our experiments, we find that it is a vulnerable scene for our approach when the car stops at a crossroad, which may be due to the fact that the accuracy of IMU pre-integration between two keyframes decreases as the time interval between the two keyframes increases. In our implementation, when the estimated speed of the second latest frame falls below a small threshold (a fraction of a meter per second), the GNSS factors in the sliding window are temporarily removed, but they still participate in marginalization as if they were in the sliding window.

V. EXPERIMENTS

The experiments are conducted on the public dataset UrbanLoco [23], which was collected in highly urbanized areas in San Francisco and Hong Kong. The images in the Hong Kong data are collected using a fisheye sky-pointing camera, and the images in the San Francisco data are collected using six 360-degree-view cameras. Our approach is evaluated on 4 sequences collected in San Francisco, because VINS-mono [16] evaluation results on only the above 4 sequences are reported in the paper [23]. The 4 sequences contain both urban canyons and scenes with low-rise buildings, which are shown in Fig. 4. All the experiments presented are performed on a PC with an Intel Core i7 3.6 GHz CPU.

TABLE I
COMPARISON OF MEAN ABSOLUTE ERROR (MAE) OF RESULTING TRAJECTORIES FROM DIFFERENT APPROACHES
Data Sequence      Length (km)  Method                      MAE in Translation (m)   MAE in Rotation (deg)
                                                            X      Y      Z          Z      Y       X
CA 20190828155828  5.9          VINS-mono[16]               42.68  42.68  23.24      6.04   0.69
                                GNSS SPP
CA 20190828184706  1.8          VINS-mono[16]               32.52  31.47  26.89
                                GNSS SPP                    2.270  2.190  5.514      *      *       *
                                VINS-Fusion (with GNSS)[8]  2.120  2.729  4.519      6.130  15.166  9.207
                                Proposed
Here GNSS SPP means GNSS single point positioning. Proposed means the proposed approach in this paper. The symbol * means the result is not available. GNSS single point positioning only provides positioning results, hence its rotational accuracy is not available.

Fig. 4. Urban canyon (a) and low-rise buildings (b) in the sequences on which the experiments are conducted.

The other three approaches are evaluated by ourselves.
GNSS SPP is the single point positioning result produced using RTKLIB [24]. For VINS-Fusion (with GNSS) [8], the inputted GNSS positions are the single point positioning results from RTKLIB [24], i.e. the results of GNSS SPP. For fairness of comparison, we employ the frontal camera and the IMU in both VINS-Fusion (with GNSS) [8] and our proposed approach. Also for fairness of comparison, because the resulting trajectory of VINS-mono [16] is aligned to the ground truth by a rigid body transformation [25] before evaluation in paper [23], the other three approaches are also aligned to the ground truth by a rigid body transformation before evaluation. For each approach the loop closure option is turned off. For GNSS SPP and our proposed approach, the same raw GNSS measurements are utilized, including the measurements from both GPS and GLONASS. Same as the evaluation of VINS-mono [16] in paper [23], our proposed approach is also evaluated at about half real-time playback rate. The comparisons of trajectories among the different approaches w.r.t. the east and north directions, as well as w.r.t. the height direction, are shown in Fig. 5 and Fig. 6 respectively. From Table I, Fig. 5 and Fig. 6 we can see that our proposed approach outperforms the other three approaches w.r.t. the accuracy of both translation and rotation.

Fig. 5. Comparison of trajectories in the east and north directions for (a) CA 20190828155828, (b) CA 20190828173350, (c) CA 20190828184706 and (d) CA 20190828190411. In figure (b) a few very noisy GNSS SPP results that lie outside the figure are omitted. All trajectories are aligned with the ground truth.

Fig. 6. Comparison of trajectories in the height direction for (a) CA 20190828155828, (b) CA 20190828173350, (c) CA 20190828184706 and (d) CA 20190828190411. In figures (a) and (b) a few very noisy GNSS SPP results that lie outside the figure are omitted. All trajectories are aligned with the ground truth.

Tightly coupling raw GNSS measurements, our approach achieves global positioning results. For our proposed approach, the resulting trajectories of two sequences are projected onto Google Maps, as shown in Fig. 7.

Fig. 7. Our resulting trajectories of (a) CA 20190828184706 and (b) CA 20190828190411 projected onto Google Maps.

VI. CONCLUSIONS

In this paper, we propose an optimization-based visual-inertial SLAM tightly coupled with raw GNSS measurements. Feature points and IMU measurements, as well as pseudorange and Doppler shift measurements from GNSS, which are captured at different moments, are integrated by means of optimization in a sliding window. Experimental results prove that our proposed approach outperforms standard visual-inertial SLAM, GNSS single point positioning, as well as the loose coupling approach [8] on a public dataset. In the future, approaches that aid the mapping process in the backend with raw GNSS measurements may be explored.

REFERENCES

[1] M. Schreiber, H. Königshof, A. Hellmund, and C. Stiller. Vehicle localization with tightly coupled GNSS and visual odometry. In , pages 858–863, 2016.
[2] P. V. Gakne and K. O'Keefe. Tightly-coupled GNSS/vision using a sky-pointing camera for vehicle navigation in urban areas. Sensors, 18(4):1244, 2018.
[3] Y. Yu, W. Gao, C. Liu, S. Shen, and M. Liu. A GPS-aided omnidirectional visual-inertial state estimator in ubiquitous environments. In , pages 7750–7755, 2019.
[4] W. Lee, K. Eckenhoff, P. Geneva, and G. Huang. Intermittent GPS-aided VIO: Online initialization and calibration. In , pages 5724–5731, 2020.
[5] J. Rehder, K. Gupta, S. Nuske, and S. Singh. Global pose estimation with limited GPS and long range visual odometry. In , pages 627–633, 2012.
[6] R. Mascaro, L. Teixeira, T. Hinzmann, R. Siegwart, and M. Chli. GOMSF: Graph-optimization based multi-sensor fusion for robust UAV pose estimation. In , pages 1421–1428, 2018.
[7] W. Yan, L. Bastos, J. A. Gonçalves, A. Magalhães, and T. Xu. Image-aided platform orientation determination with a GNSS/low-cost IMU system using robust-adaptive Kalman filter. GPS Solutions, 22(1):12, 2018.
[8] T. Qin, S. Cao, J. Pan, and S. Shen. A general optimization-based framework for global pose estimation with multiple sensors, 2019.
[9] W. Wen, X. Bai, Y. C. Kan, and L. Hsu. Tightly coupled GNSS/INS integration via factor graph and aided by fish-eye camera. IEEE Transactions on Vehicular Technology, 68(11):10651–10662, 2019.
[10] D. H. Won, E. Lee, M. Heo, S. Sung, J. Lee, and Y. J. Lee. GNSS integration with vision-based navigation for low GNSS visibility conditions. GPS Solutions, 18(2):177–187, 2014.
[11] D. H. Won, E. Lee, M. Heo, S.-W. Lee, J. Lee, J. Kim, S. Sung, and Y. J. Lee. Selective integration of GNSS, vision sensor, and INS using weighted DOP under GNSS-challenged environments. IEEE Transactions on Instrumentation and Measurement, 63(9):2288–2298, 2014.
[12] T. Li, H. Zhang, Z. Gao, X. Niu, and N. El-Sheimy. Tight fusion of a monocular camera, MEMS-IMU, and single-frequency multi-GNSS RTK for precise navigation in GNSS-challenged environments. Remote Sensing, 11(6):610, 2019.
[13] G. Huang. Visual-inertial navigation: A concise review. In , pages 9572–9582, 2019.
[14] W. Wen, T. Pfeifer, X. Bai, and L.-T. Hsu. It is time for factor graph optimization for GNSS/INS integration: Comparison between FGO and EKF. CoRR, 2020.
[15] Z. Gong, R. Ying, F. Wen, J. Qian, and P. Liu. Tightly coupled integration of GNSS and vision SLAM using 10-DoF optimization on manifold. IEEE Sensors Journal, 19(24):12105–12117, 2019.
[16] T. Qin, P. Li, and S. Shen. VINS-Mono: A robust and versatile monocular visual-inertial state estimator. IEEE Transactions on Robotics, 34(4):1004–1020, 2018.
[17] G. Xie et al. Principles of GPS and Receiver Design. Publishing House of Electronics Industry, Beijing, 7:61–63, 2009.
[18] K. J. Wu, C. X. Guo, G. Georgiou, and S. I. Roumeliotis. VINS on wheels. In , pages 5155–5162, 2017.
[19] S. Ch'ng, A. Khosravian, A. Doan, and T. Chin. Outlier-robust manifold pre-integration for INS/GPS fusion. In , pages 7489–7496, 2019.
[20] S. Agarwal, K. Mierle, and others. Ceres Solver. http://ceres-solver.org.
[21] S. Leutenegger, S. Lynen, M. Bosse, R. Siegwart, and P. Furgale. Keyframe-based visual-inertial odometry using nonlinear optimization. The International Journal of Robotics Research, 34(3):314–334, 2015.
[22] J. Liu, W. Gao, and Z. Hu. Visual-inertial odometry tightly coupled with wheel encoder adopting robust initialization and online extrinsic calibration. In , pages 5391–5397, 2019.
[23] W. Wen, Y. Zhou, G. Zhang, S. Fahandezh-Saadi, X. Bai, W. Zhan, M. Tomizuka, and L. T. Hsu. UrbanLoco: A full sensor suite dataset for mapping and localization in urban scenes. In , pages 2310–2316, 2020.
[24] T. Takasu. RTKLIB: An open source program package for GNSS positioning. Tech. Rep., 2013. Software and documentation, 2011.
[25] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems.