A Hybrid Learner for Simultaneous Localization and Mapping
Thangarajah Akilan, Member, IEEE, Edna Johnson, Japneet Sandhu, Ritika Chadha, Gaurav Taluja
Abstract—Simultaneous localization and mapping (SLAM) is used to predict the dynamic motion path of a moving platform based on the location coordinates and the precise mapping of the physical environment. SLAM has great potential in augmented reality (AR), autonomous vehicles, viz. self-driving cars, drones, and autonomous navigation robots (ANR). This work introduces a hybrid learning model that explores beyond feature fusion and conducts a multimodal weight sewing strategy towards improving the performance of a baseline SLAM algorithm. It carries out weight enhancement of the front-end feature extractor of the SLAM via mutation of different deep networks' top layers. At the same time, the trajectory predictions from independently trained models are amalgamated to refine the location detail. Thus, the integration of the aforesaid early and late fusion techniques under a hybrid learning framework minimizes the translation and rotation errors of the SLAM model. This study exploits some well-known deep learning (DL) architectures, including ResNet18, ResNet34, ResNet50, ResNet101, VGG16, VGG19, and AlexNet, for experimental analysis. An extensive experimental analysis proves that the hybrid learner (HL) achieves significantly better results than the unimodal approaches and the multimodal approaches with early or late fusion strategies. To the best of our knowledge, the Apolloscape dataset used in this work has never before been used in the literature for SLAM with fusion techniques, which makes this work unique and insightful.
Index Terms —SLAM, deep learning, hybrid learning
I. INTRODUCTION
SLAM is a technological process that enables a device to build a map of the environment and, at the same time, compute its relative location on that map. It can be used for a range of applications, from self-driving vehicles (SDVs) to space and maritime exploration, and from indoor positioning to search and rescue operations. The primary responsibility of a SLAM algorithm is to produce an understanding of a moving platform's environment and of the location of the vehicle by providing the values of its coordinates, thus improving the formation of a trajectory to determine the view at a particular instance. As SLAM is one of the emerging technologies, numerous implementations have been introduced, but the DL-based approaches surmount the others by their efficiency in extracting the finest features and giving better results even in a feature-scarce environment.

This study aims to improve the performance of a self-localization module based on the PoseNet [1] architecture through the concept of hybrid learning, which performs a multimodal weight mutation to enhance the weights of a feature extractor layer and refines the trajectory predictions using an amalgamation of multimodal scores. The ablation study is carried out on Apolloscape [2], [3]; to our knowledge, no prior research work has been performed on the self-localization repository of the Apolloscape dataset, on which the proposed HL has been evaluated extensively. The experimental analysis presented in this work consists of three parts, of which the first two form the base for the third. The first part concentrates on an extensive evaluation of several DL models as feature extractors. The second part analyzes two proposed multimodal fusion approaches: (i) an early fusion via layer weight enhancement of the feature extractor, and (ii) a late fusion via score refinement of the trajectory (pose) regressor. Finally, the third part aims at the combination of the early and late fusion models, forming a hybrid learner with an addition or multiplication operation. Here, the late fusion model harnesses five pretrained deep convolutional neural networks (DCNNs), viz. ResNet18, ResNet34, ResNet101, VGG16, and VGG19, as the feature extractors for the pose regressor module, while the early fusion model and the HL focus on exploiting the best DCNNs, ResNet101 and VGG19, based on their individual performance on the Apolloscape self-localization dataset.

When analyzing the results of the early and late fusion models, it is observed that both markedly reduce the translation error with respect to the unimodal baselines. On analyzing the hybrid learners, the additive hybrid learner (AHL) and the multiplicative hybrid learner (MHL) also achieve low translation errors, and fusing the predictions of the AHL and the MHL, called the hybrid learner full-fusion (HLFF), produces better results than all the other models; the exact translation and rotation errors of each model are reported in Table I.

The rest of the paper is organized as follows. Section II reviews relevant SLAM literature and provides basic details of the PoseNet, unimodality, and multimodality. Section III elaborates the proposed hybrid learner, including the required preprocessing operations. Section IV describes the experimental setup and analyzes the results obtained from the various models. Section V concludes the research work.

II. BACKGROUND
A. SLAM
Simultaneous localization and mapping is an active research domain in robotics and artificial intelligence (AI). It enables a remotely automated moving vehicle to be placed in an unknown environment and location. According to Bailey and Durrant-Whyte [4] and Montemerlo et al. [5], SLAM should build a consistent map of this unknown environment and determine the location relative to the map. Through SLAM, robots and vehicles can be truly and completely automated with minimal or no human intervention. But the estimation of maps involves various other issues, such as large storage requirements and the need for precise location coordinates, which makes SLAM a rather intriguing task, especially in the real-time domain.

Much research has been conducted worldwide to determine an efficient method to perform SLAM. In [6], Montemerlo et al. propose a model named FastSLAM as an efficient solution to the problem. FastSLAM is a recursive algorithm that calculates the posterior distribution spanning the autonomous vehicle's pose and the landmark locations, and it scales logarithmically with the total number of landmarks. The algorithm relies on an exact factorization of the posterior into a product of landmark distributions and a distribution over the paths of the robot. The research on SLAM originates in the work of Smith and Cheeseman [7], who proposed the use of the extended Kalman filter (EKF). It is based on the notion that pose errors and errors in the map are correlated, and the covariance matrix obtained by the EKF represents this covariance. There are two main approaches for the localization of an autonomous vehicle: metric SLAM and appearance-based SLAM [1]. This research focuses on appearance-based SLAM, which is trained by giving a set of visual samples collected at multiple discrete locations.

B. PoseNet
The neural network (NN) comprises several interconnected nodes and associated parameters, like weights and biases. The weights are adjusted through a series of trials and experiments in the training phase so that the network learns and can be used to predict outcomes at a later stage. There are various kinds of NNs available, for instance, the feed-forward neural network (FFNN), the radial basis neural network (RBNN), the DCNN, and the recurrent neural network (RNN). Among them, DCNNs have been highly regarded for their adaptability and finer interpretability, with accurate and justifiable predictions in applications ranging from finance to medical analysis and from science to engineering. Thus, the PoseNet model for SLAM shown in Fig. 1 harnesses a DCNN to be robust against difficult lighting, motion blur, and varying camera intrinsics [1].

Fig. 1: PoseNet Architecture Subsuming a Feature Extractor (front-end: a pretrained CNN) and a Pose Regressor Subnetwork (back-end: dropout, pooling, and dense layers) that outputs the translation and rotation poses.

Figure 1 depicts the underlying architecture of the PoseNet. It subsumes a front-end with a feature extractor and a back-end with a regression subnetwork. The feature extractor can be a pretrained DCNN, like ResNet, VGG, or AlexNet. The regression subnetwork consists of three stages interconnected sequentially: a dropout, an average pooling, and a dense layer. It receives the high-dimensional vector from the feature extractor, which is then reduced to a lower dimension through the average pooling and dropout layers for generalization and faster computation [8]. The predicted poses are in six degrees of freedom (6-DoF), which define the six parameters of translation and rotation [1]. The translation consists of forward-backward, left-right, and up-down parameters forming the axes of 3D space as the $x$-axis, $y$-axis, and $z$-axis, respectively. Likewise, the rotation includes the yaw, pitch, and roll parameters of the same 3D space about the normal, transverse, and longitudinal axes, respectively. These six core parameters are then converted to seven coordinates: $x_1$, $x_2$, and $x_3$ of translation, and $x_0$, $y_1$, $y_2$, $y_3$ of rotation. This is because the actual rotation poses are in Euler angles; thus, a preprocessing operation converts the Euler angles into quaternions. A quaternion is a set of four values ($x_0$, $y_1$, $y_2$, and $y_3$), where $x_0$ represents a scalar rotation of the vector ($y_1$, $y_2$, $y_3$). This conversion is governed by the expressions given in Eqs. (1)-(4):

$x_0 = \frac{1}{2}\sqrt{1 + c_1 c_2 + c_1 c_3 - s_1 s_2 s_3 + c_2 c_3}$, (1)

$y_1 = (c_2 s_3 + c_1 s_3 + s_1 s_2 c_3)/(4 x_0)$, (2)

$y_2 = (s_1 c_2 + s_1 c_3 + c_1 s_2 s_3)/(4 x_0)$, (3)

$y_3 = (-s_1 s_3 + c_1 s_2 c_3 + s_2)/(4 x_0)$, (4)

where $c_1 = \cos(roll/2)$, $c_2 = \cos(yaw/2)$, $c_3 = \cos(pitch/2)$, $s_1 = \sin(roll/2)$, $s_2 = \sin(yaw/2)$, and $s_3 = \sin(pitch/2)$.
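For illustration, the following is a minimal sketch of this preprocessing step using the standard ZYX (yaw-pitch-roll) half-angle identities, to which Eqs. (1)-(4) correspond algebraically; the exact axis convention of the dataset toolkit is an assumption here.

```python
import math

def euler_to_quaternion(roll: float, pitch: float, yaw: float):
    """Convert Euler angles (radians) to a unit quaternion.

    Returns (w, x, y, z), where the scalar w plays the role of x0 in
    Eq. (1) and (x, y, z) the role of (y1, y2, y3) in Eqs. (2)-(4).
    Assumes the ZYX (yaw-pitch-roll) rotation order.
    """
    cr, sr = math.cos(roll / 2), math.sin(roll / 2)
    cp, sp = math.cos(pitch / 2), math.sin(pitch / 2)
    cy, sy = math.cos(yaw / 2), math.sin(yaw / 2)

    w = cr * cp * cy + sr * sp * sy  # scalar part
    x = sr * cp * cy - cr * sp * sy
    y = cr * sp * cy + sr * cp * sy
    z = cr * cp * sy - sr * sp * cy
    return w, x, y, z
```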
The pose regressor subnetwork is trained to minimize the translation and rotation errors. These errors are combined into a single objective function, $L_\beta$, as defined in Eq. (5) [9]:

$L_\beta(I) = L_x(I) + \beta L_q(I)$, (5)

where $L_x$ and $L_q$ are the losses of translation and rotation, respectively, and $I$ is the input vector representing the discrete location in the map. $\beta$ is a scaling factor used to balance both losses and is calculated using homoscedastic uncertainty, which combines the losses as defined in Eq. (6):

$L_\sigma(I) = L_x(I)\,\hat{\sigma}_x^{-2} + \log\hat{\sigma}_x^2 + L_q(I)\,\hat{\sigma}_q^{-2} + \log\hat{\sigma}_q^2$, (6)

where $\hat{\sigma}_x$ and $\hat{\sigma}_q$ are the uncertainties for translation and rotation, respectively. Here, the regularizers $\log\hat{\sigma}_x^2$ and $\log\hat{\sigma}_q^2$ prevent the values from becoming too big [9]. The loss can be calculated using a more stable form, as in Eq. (7), which is very handy for training the PoseNet:

$L_\sigma(I) = L_x(I)\exp(-\hat{s}_x) + \hat{s}_x + L_q(I)\exp(-\hat{s}_q) + \hat{s}_q$, (7)

where the learnable parameter $\hat{s} = \log\hat{\sigma}^2$. Following [9], in this work, $\hat{s}_x$ and $\hat{s}_q$ are initialized to 0 and -3.0, respectively.
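A minimal PyTorch sketch of the stable objective in Eq. (7) is given below, assuming L1 pose distances for $L_x$ and $L_q$; the initial values of the learnable parameters follow the ones quoted above.

```python
import torch
import torch.nn as nn

class HomoscedasticPoseLoss(nn.Module):
    """Stable learnable-weighting objective of Eq. (7):
    L = Lx * exp(-sx) + sx + Lq * exp(-sq) + sq."""

    def __init__(self, sx_init: float = 0.0, sq_init: float = -3.0):
        super().__init__()
        # s = log(sigma^2); trained jointly with the network weights.
        self.sx = nn.Parameter(torch.tensor(sx_init))
        self.sq = nn.Parameter(torch.tensor(sq_init))

    def forward(self, pred_t, gt_t, pred_q, gt_q):
        # L1 distances per sample for translation (x) and rotation (q).
        loss_t = (pred_t - gt_t).abs().sum(dim=-1).mean()
        loss_q = (pred_q - gt_q).abs().sum(dim=-1).mean()
        return (loss_t * torch.exp(-self.sx) + self.sx
                + loss_q * torch.exp(-self.sq) + self.sq)
```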
C. The Front-end Feature Extractor

As discussed earlier, the PoseNet takes advantage of transfer learning (TL), whereby it uses a pretrained DCNN as the feature extractor. TL differs from traditional learning: in the latter, the models or tasks are isolated and function separately, retaining no knowledge, whereas TL learns from an older problem and leverages that knowledge on a new set of problems [10]. Thus, in this work, versions of ResNet, versions of VGG, and AlexNet are investigated; a loading sketch follows the subsections below. Some basic information on these DCNNs is given in the following subsections.
1) AlexNet:
It was the winner of the 2012 ImageNet Large Scale Visual Recognition Competition (ILSVRC'12) with a breakthrough performance [11]. It consists of five convolution (Conv) layers, taking up to 60 million trainable parameters and 650,000 neurons, making it one of the large models among DCNNs. The first and second Conv layers are followed by a max pooling operation, while the third, fourth, and fifth Conv layers are connected directly; the final stage is a dense layer and a thousand-way Softmax layer. It was the first DCNN to adopt rectified linear units (ReLU) instead of the tanh activation function and to use a dropout layer to eradicate the overfitting issues of DL.
2) VGG (16, 19):
Simonyan and Zisserman [12] proposed the first version of the VGG network, named VGG16, for ILSVRC'14. It stood second in the image classification challenge with a top-5 error of 7.3%. VGG16 and VGG19 consist of 16 and 19 weight layers, respectively, with a max pooling layer after every set of two or three Conv layers. Each comprises two fully connected layers and a thousand-way Softmax top layer, similar to AlexNet. The main drawbacks of the VGG models are their high training time and heavy network weights.
3) ResNet (18, 34, 50, 101):
ResNet18 [13] was introduced to compete in ILSVRC'15, where the ResNet family outperformed other models, like VGG, GoogLeNet, and Inception. All the ResNet models used in this work are trained on the ImageNet database, which consists of more than a million images. Experiments have shown that even though ResNet18 is a subspace of ResNet34, its performance is more or less equivalent to that of ResNet34. ResNet18, 34, 50, and 101 consist of 18, 34, 50, and 101 layers, respectively. This paper first evaluates the performance of the PoseNet individually using the above-mentioned ResNet models besides the other feature extractors, and consequently chooses the best ones to be used in the fusion modalities and in the hybrid learner, thereby establishing a good trade-off between depth and performance. The ResNet models are constituted of residual blocks: ResNet18 and 34 stack residual blocks of two convolutional layers each, while ResNet50 and 101 stack bottleneck blocks of three convolutional layers. Each network comprises five convolutional stages followed by an average pooling layer, and has a fully connected layer followed by a thousand-way Softmax layer to generate the thousand class labels.
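As a concrete illustration of the TL setup described above, the sketch below loads the investigated ImageNet-pretrained backbones from torchvision and strips their thousand-way classifier heads so they can serve as PoseNet front-ends. This is a minimal sketch under our own naming, not the authors' exact code.

```python
import torch.nn as nn
from torchvision import models

def build_feature_extractor(name: str = "resnet101") -> nn.Module:
    """Load an ImageNet-pretrained backbone and drop its classifier head."""
    if name.startswith("resnet"):
        net = getattr(models, name)(pretrained=True)
        net.fc = nn.Identity()  # keep pooled features, drop the 1000-way head
    elif name.startswith("vgg") or name == "alexnet":
        net = getattr(models, name)(pretrained=True)
        net.classifier = net.classifier[:-1]  # drop the final 1000-way layer
    else:
        raise ValueError(f"unsupported backbone: {name}")
    return net

# Example: the two best-performing extractors used later for fusion.
resnet101_fe = build_feature_extractor("resnet101")
vgg19_fe = build_feature_extractor("vgg19")
```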
D. Multimodal Feature Fusion
Much existing research has taken advantage of various strategies for feature extraction and fusion. For instance, Xu et al. [14] modify the Inception-ResNet-v1 model to have four layers followed by a fully connected layer in order to reduce the effect of overfitting, as their problem domain has a small number of samples and fifteen classes. On the other hand, Akilan et al. [15] follow a TL technique in feature fusion, whereby they extract features using multiple DCNNs, namely AlexNet, VGG16, and Inception-v3. As these extractors result in varied feature dimensions and subspaces, feature-space transformation and energy-level normalization are performed to embed the features into a common subspace using dimensionality reduction techniques like PCA. Finally, the features are fused together using fusion rules, such as concatenation, feature product, summation, mean-value pooling, and maximum-value pooling.

Fu et al. [16] also consider dimension normalization techniques to produce a consistently uniform dimensional feature space. Their work presents supervised and unsupervised subspace learning methods for dimensionality reduction and multimodal feature fusion, and introduces a new technique called tensor-based discriminative subspace learning. This technique gives better results, as it produces a final fused feature vector of adequate length, i.e., a long vector if the number of features is large and a shorter vector if the number of features is small. Similarly, Bahrampour et al. [17] introduce a multimodal task-driven dictionary learning algorithm for information that is obtained either homogeneously or heterogeneously; these multimodal task-driven dictionaries produce the features from the input data for classification problems.
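To make the fusion rules of [15] concrete, here is a minimal sketch of the element-wise rules applied to two feature vectors already embedded in a common subspace (e.g., after PCA); the function name and the equal-dimension assumption are ours.

```python
import torch

def fuse_features(f1: torch.Tensor, f2: torch.Tensor,
                  rule: str = "mean") -> torch.Tensor:
    """Combine two equal-dimensional feature vectors with a fusion rule."""
    if rule == "concat":
        return torch.cat([f1, f2], dim=-1)  # doubles the feature length
    if rule == "product":
        return f1 * f2
    if rule == "sum":
        return f1 + f2
    if rule == "mean":
        return (f1 + f2) / 2.0
    if rule == "max":
        return torch.max(f1, f2)
    raise ValueError(f"unknown rule: {rule}")
```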
E. Hybrid Learning
Sun et al. [18] propose a hybrid convolutional neural network for face verification in wild conditions. Instead of extracting the features separately from the images, the features of the two images are jointly extracted by filter pairs. The extracted features are then processed through multiple layers of the DCNN to extract high-level and global features. The higher layers of the DCNN discussed in their work locally share the weights, which is quite contrary to conventional CNNs. In this way, feature extraction and recognition are combined under the hybrid model.

Similarly, Pawar et al. [19] develop an efficient hybrid approach involving invariant scale features for object recognition. In the feature extraction phase, invariant features, like color, shape, and texture, are extracted and subsequently fused together to improve the recognition performance. The fused feature set is then fed to pattern recognition algorithms, such as the support vector machine (SVM), discriminant canonical correlation, and locality preserving projections, which likely produce either three distinct or identical numbers of false positives. To hybridize the process entirely, a decision module is developed using NNs that takes the match values from the chosen pattern recognition algorithm as input and returns the result based on those match values.

Fig. 2: Operational Flow of the Proposed Hybrid Learner with a Weight Sewing Strategy and a Late Fusion Phase of the Predicted Poses Towards Improving the Localization Capability of a SLAM Model - PoseNet.

However, the hybrid learner (Fig. 2) introduced in this work is more unique and insightful than the existing hybrid fusion approaches. It focuses on enhancing and updating the weights of the pretrained unimodals before using them as front-end feature extractors of the PoseNet. Besides mutating the multimodal weights of the feature-extracting dense layer, it also fuses the predicted scores of the pose regressor.

III. PROPOSED METHOD
A. Hybrid Weight Sewing and Score Fusion Model
Fig. 2 shows a detailed flow diagram of the hybrid learner. It consists of two parts: the first part (Step 1, weight enhancement) carries out an early fusion by layer-weight enhancement of the feature extractor, and the second part (Step 3) performs a late fusion via score refinement of the models involved in the early fusion. The two best feature extractors, chosen based on their individual performances, are used for forming the hybrid learner. The early fusion models are obtained by fusing the dense-layer weights of ResNet101 and VGG19 by addition or multiplication. In late fusion, the predicted scores of multiple pose regressors with the above weight-enhanced feature extractors are amalgamated using average filtering to achieve better results.
B. Preprocessing

Before passing the images and poses to the PoseNet model, the data must be preprocessed adequately. The preprocessing involves checking the consistency of the images, resizing and center cropping the images, extracting the mean and standard deviation, and normalizing the poses. The images are resized and then center cropped to fixed dimensions. The translation values are used to obtain the minimum, maximum, mean, and standard deviation. The rotation values are read as Euler angles, which suffer from wrap-around infinities, gimbal lock, and interpolation problems. To overcome these challenges, the Euler rotations are converted to quaternions [20].
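A minimal torchvision sketch of this pipeline follows; the resize/crop sizes shown are placeholders (the exact values are not recoverable from the text), and ImageNet statistics stand in for the dataset-computed mean and standard deviation.

```python
from torchvision import transforms

# Assumed sizes and statistics for illustration only; the paper computes
# the mean/std from the training split of the Apolloscape dataset.
preprocess = transforms.Compose([
    transforms.Resize(256),        # placeholder resize dimension
    transforms.CenterCrop(224),    # placeholder crop dimension
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```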
C. Multimodal Weight Sewing via Early Fusion (EF)

The preprocessed data is fed to the feature extractors: ResNet101 and VGG19. These two models are selected based on their individual performance on the Apolloscape test dataset: using these two feature extractors, the PoseNet produces the minimum translation and rotation errors, as recorded in Table I. The weights of the top feature-extracting dense layers of the two feature extractors are fused via an addition or multiplication operation. The fused values are used to update the weights of the respective dense layers of the ResNet101 and VGG19 feature extractors. The updated models are then used as new feature extractors for the regressor subnetwork, and the regressor is trained on the training dataset.
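A minimal sketch of the weight sewing step is given below, under two assumptions of ours: both extractors expose their top feature-extracting dense layer under the hypothetical attribute name `fc`, and those layers have identical shapes so that element-wise addition or multiplication is defined.

```python
import copy
import torch

@torch.no_grad()
def sew_dense_weights(model_a, model_b, op: str = "add"):
    """Early fusion (Step 1): fuse the top dense-layer weights of two
    feature extractors and write the fused values back into both models."""
    if op == "add":
        w = model_a.fc.weight + model_b.fc.weight
        b = model_a.fc.bias + model_b.fc.bias
    else:  # element-wise multiplication
        w = model_a.fc.weight * model_b.fc.weight
        b = model_a.fc.bias * model_b.fc.bias
    new_a, new_b = copy.deepcopy(model_a), copy.deepcopy(model_b)
    for m in (new_a, new_b):
        m.fc.weight.copy_(w)  # both updated models carry the sewed values
        m.fc.bias.copy_(b)
    return new_a, new_b
```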
D. Pose Refinement via Late Fusion (LF)
The trained models with the updated ResNet101 and VGG19, obtained through the early fusion addition and multiplication operations, are moved on to the late fusion phase, shown as Step 3 in Fig. 2. Here, the loaded weight-enhanced early fusion models simultaneously predict the poses for each input visual. The predicted scores from these models (in this case, ResNet101 and VGG19) are amalgamated with average filtering; this way of fusion is denoted as AHL. Similarly, the predicted scores of the early fusion models can be refined using multiplication, denoted as MHL. Finally, the predicted scores of the four early fusion models using addition and multiplication are fused together using an averaging operation to obtain the predicted scores of the full hybrid fusion model stated earlier (Section I) as HLFF. These predicted poses are then compared with the ground-truth poses to calculate the mean and median of the translation and rotation errors.
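A minimal sketch of this score refinement follows, assuming each trained model maps an image tensor to a 7-D pose vector (three translation values followed by a quaternion); the re-normalization of the quaternion part is our addition to keep the fused rotation valid.

```python
import torch

@torch.no_grad()
def fuse_poses(models, image, mode: str = "avg") -> torch.Tensor:
    """Late fusion (Step 3): refine the poses predicted by several models,
    by averaging (AHL/HLFF-style) or element-wise product (MHL-style)."""
    preds = torch.stack([m(image) for m in models])  # (num_models, 7)
    fused = preds.mean(dim=0) if mode == "avg" else preds.prod(dim=0)
    t, q = fused[:3], fused[3:]
    q = q / q.norm()  # keep the rotation a unit quaternion
    return torch.cat([t, q])
```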
IV. EXPERIMENTAL SETUP AND RESULTS
A. Dataset
The Apolloscape dataset is used for many computer vision tasks related to autonomous driving. It consists of modules, including instance segmentation, scene parsing, lane-mark parsing, and self-localization, that are used to understand a scene and to make a vehicle act and reason accordingly [2], [3]. The Apolloscape self-localization dataset is made up of images and poses recorded on different roads at discrete locations; the images and poses are created from video recordings, and each record of the dataset has multiple images with a pose corresponding to every image. The road chosen for the ablation study of this research is zpark, which consists of a total of 3000 stereo-vision road scenes. For each image, there is a ground-truth pose with 6-DoF. From this entire dataset, mutually exclusive training and test sets are created at a fixed split ratio.

B. Evaluation Metric
Measuring the performance of a machine learning model is pivotal to comparing the various CNN models. Since every CNN model is trained and tested on different datasets with varied hyperparameters, it is necessary to choose the right evaluation metric. As the domain of this work is a regression problem, the mean absolute error (MAE) is used to measure the performance of the set of models, ranging from the unimodals to the proposed hybrid learner. MAE is a linear score, calculated as the average of the absolute differences between the target variables and the predicted variables using the formula given in Eq. (8):

$MAE = \frac{1}{n}\sum_{i=1}^{n} |x_i - \hat{x}_i|$, (8)

where $n$ is the total number of samples in the validation dataset, and $x_i$ and $\hat{x}_i$ are the predicted and ground-truth poses, respectively. Since it is an average measure, all the distinct values are weighted equally, ignoring any bias involvement.
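To make the metric concrete, a minimal NumPy sketch of Eq. (8) and its median counterpart is given below; the stacked pose arrays and their layout are assumptions for illustration.

```python
import numpy as np

def mean_absolute_error(pred: np.ndarray, gt: np.ndarray) -> float:
    """Eq. (8): average absolute difference between predicted and
    ground-truth pose values over the validation set."""
    return float(np.mean(np.abs(pred - gt)))

def median_absolute_error(pred: np.ndarray, gt: np.ndarray) -> float:
    """Median counterpart, as reported alongside the mean in Table I."""
    return float(np.median(np.abs(pred - gt)))
```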
TABLE I: Performance Analysis of Various Models: $e_t$ - translation error, $e_r$ - rotation error, MAPST - mean average per-sample processing time.

Model | Median $e_t$ (m) | Mean $e_t$ (m) | Median $e_r$ (°) | Mean $e_r$ (°) | MAPST (s)
M1 - ResNet18 | 21.194 | 24.029 | 0.778 | 0.900 | 0.
M2 - ResNet34 | 20.990 | 23.597 | 0.673 | 0.824 | 0.
M3 - ResNet50 | 18.583 | 20.803 | 0.903 | 1.434 | 0.
M4 - ResNet101 | 16.227 | 19.427 | 0.966 | 1.230 | 0.
M5 - VGG16 | 17.150 | 21.571 | 1.079 | 1.758 | 0.
M6 - VGG19 | 16.820 | 19.935 | 0.899 | 1.378 | 0.
M7 - AlexNet | .992 | 53.004 | 4.282 | 7.177 | 0.
M8 - LF | .763 | 10.561 | 0.945 | 4.645 | 0.
M9 - AEF-ResNet | .870 | 18.256 | 0.673 | 0.784 | 0.
M10 - MEF-ResNet | .842 | 18.013 | 0.779 | 0.977 | 0.
M11 - AEF-VGG | .047 | 13.840 | 0.742 | 1.024 | 0.
M12 - MEF-VGG | .730 | 14.181 | 0.756 | 1.141 | 0.
M13 - AHL | .400 | 12.193 | 0.828 | 5.155 | 0.
M14 - MHL | .307 | 11.420 | 1.206 | 5.455 | 0.
M15 - HLFF | .762 | 8.829 | 1.008 | 4.618 | 0.

C. Performance Analysis
This section elaborates on the results obtained from each of the models introduced earlier in this paper.
1) Translation and Rotation Errors:
Table I tabulates the performance of the PoseNet with various front-end unimodal and multimodal feature extractors, along with the proposed hybrid learners. The results in this table can be described in three subdivisions. The primary section, from M1 to M7, gives the outcomes of the PoseNet with unimodality-based feature extractors. The subsequent section, extending from M8 to M12, depicts the performances of five multimodality-based learners: M8 represents late fusion (LF); M9 (AEF-ResNet) and M10 (MEF-ResNet) represent early fusion on ResNet101 as the feature extractor with addition and multiplication, respectively; and M11 (AEF-VGG) and M12 (MEF-VGG) are the results for early fusion on VGG19 with addition and multiplication, respectively. The third section consists of the proposed hybrid learners: M13 stands for the AHL, which combines the early fusion models M9 and M11; M14 for the MHL, which combines the early fusion models M10 and M12; and M15 for the HLFF, obtained by averaging the predicted scores of the four models M9, M10, M11, and M12.

Fig. 3: Performance Analysis of All the Different Models: (a) error in terms of translation, (b) error in terms of rotation, (c) average processing time per sample.

The results are computed as the mean and median values of the translation and rotation errors; the translation errors are measured in meters (m), while the rotation errors are measured in degrees (°). Considering the unimodal-based PoseNet implementations, it is very apparent that ResNet101 and VGG19 give the two best outcomes among the others. In terms of translation error, the late fusion model shows better performance than the unimodality-based learners, but not as good as the early fusion models. The ResNet101-based PoseNet (M4) is taken as the baseline model for the rest of the comparative analysis, because it has the best performance amongst all the unimodals. Compared to this baseline, the late fusion lowers both the median and mean translation errors and slightly lowers the median rotation error, but its mean rotation error increases.

Comparing the baseline model (M4) with the early fusion model using addition and ResNet101 as the feature extractor (M9) shows decreases in both the median and mean translation errors, along with decreases in the median and mean rotation errors. The comparison with the early fusion model using multiplication on VGG19 (M12) likewise exhibits decreases in the median and mean translation errors, and drops in both the median and mean rotation errors.

It is evident from Table I that the hybrid learners show much better performance than the unimodal and early fusion models in terms of translation: they substantially lower the translation's median and mean errors when compared to the ResNet101-based PoseNet, while their rotation errors increase. In a holistic analysis, summarized in Table II, the late fusion improves the translation by 37% and the rotation by 3% at a timing overhead of 48 ms, and the early fusion using VGG19 as a feature extractor improves the translation and rotation by 31% and 24%, respectively, at an overhead of 39 ms. It is quite evident from Table II that the proposed HL suffers a slight degradation in rotation; nevertheless, it is the best model, considering its large improvement in translation across all the modalities.

TABLE II: Performance Improvement of the Proposed Hybrid Learner When Compared to the Baseline PoseNet with ResNet101 as Front-end.

Model Name | Improvement in $e_t$ (%) | Improvement in $e_r$ (%) | Timing Overhead (ms)
LF | 37 | 3 | 48
EF-ResNet | | |
EF-VGG | 31 | 24 | 39
HLFF | | − |

D. Timing Analysis
The timing analysis is conducted on the Google Colaboratory platform, using a machine with an Intel Core i5 class CPU, a Tesla K80 GPU with 2496 CUDA cores, 319 GB of disk space, and 12.6 GB of RAM. Table I shows the mean average processing time calculated for processing a batch of ten samples.

As seen from Table I and Fig. 3, the fusion models, which involve the early, late, and hybrid learners, take slightly more time than the unimodality-based baseline PoseNet. The late fusion model (M8) takes the most processing time in comparison to all the other models, as it uses five pretrained modalities that are trained and tested individually, thereby increasing the time overhead. The early fusion models also show an increase in processing time in comparison to the unimodels, but less than the late fusion model, as training the pretrained model after weight enhancement takes more time. The hybrid learners show the same trend, because of the underlying fact that they are a combination of the early and late fusion methods: these models employ weight-enhanced early fusion models, adding to the time overhead, besides fusing the scores from different models after validation.

Note that the hyperparameters have been fixed throughout the experimental analysis on the various models to avoid uncertainties in the comparative study: the learning rate, the dropout rate of the PoseNet's dropout layer, and the training batch size are identical across all models, and every model is trained for the same number of epochs with the Adam optimizer.

V. CONCLUSION
This work introduces a hybrid learner to improve the localization accuracy of a pose regressor model for SLAM. The hybrid learner is a combination of multimodal early and late fusion algorithms designed to harness the best properties of both. The extensive experiments on the Apolloscape self-localization dataset show that the proposed hybrid learner is capable of reducing the translation error substantially, although the rotation error gets negligibly worse when compared to the unimodal PoseNet with ResNet101 as a feature extractor. Thus, future work aims at minimizing the rotation errors and overcoming the small overhead in the processing time.

ACKNOWLEDGMENT
This work acknowledges Google for its generosity in providing HPC via the Colab machine learning platform, and the organizers of the ApolloScape dataset.

REFERENCES
[1] A. Kendall, M. Grimes, and R. Cipolla, "PoseNet: A convolutional network for real-time 6-DOF camera relocalization," in Proceedings of the IEEE International Conference on Computer Vision, pp. 2938-2946, 2015.
[2] P. Wang, R. Yang, B. Cao, W. Xu, and Y. Lin, "DeLS-3D: Deep localization and segmentation with a 3D semantic map," in CVPR, pp. 5860-5869, 2018.
[3] P. Wang, X. Huang, X. Cheng, D. Zhou, Q. Geng, and R. Yang, "The ApolloScape open dataset for autonomous driving and its application," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[4] T. Bailey and H. Durrant-Whyte, "Simultaneous localization and mapping (SLAM): Part II," IEEE Robotics & Automation Magazine, vol. 13, no. 3, pp. 108-117, 2006.
[5] M. Montemerlo, S. Thrun, D. Koller, B. Wegbreit, et al., "FastSLAM 2.0: An improved particle filtering algorithm for simultaneous localization and mapping that provably converges," in IJCAI, pp. 1151-1156, 2003.
[6] M. Montemerlo, S. Thrun, D. Koller, B. Wegbreit, et al., "FastSLAM: A factored solution to the simultaneous localization and mapping problem," Aaai/iaai, vol. 593598, 2002.
[7] R. C. Smith and P. Cheeseman, "On the representation and estimation of spatial uncertainty," The International Journal of Robotics Research, vol. 5, no. 4, pp. 56-68, 1986.
[8] F. Walch, C. Hazirbas, L. Leal-Taixé, T. Sattler, S. Hilsenbeck, and D. Cremers, "Image-based localization with spatial LSTMs," CoRR, vol. abs/1611.07890, 2016.
[9] A. Kendall and R. Cipolla, "Geometric loss functions for camera pose regression with deep learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5974-5983, 2017.
[10] M. E. Taylor and P. Stone, "Transfer learning for reinforcement learning domains: A survey," Journal of Machine Learning Research, vol. 10, pp. 1633-1685, 2009.
[11] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, pp. 1097-1105, 2012.
[12] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in The International Conference on Learning Representations, pp. 1-5, 2015.
[13] P. Napoletano, F. Piccoli, and R. Schettini, "Anomaly detection in nanofibrous materials by CNN-based self-similarity," MDPI, 2018.
[14] J. Xu, Y. Zhao, J. Jiang, Y. Dou, Z. Liu, and K. Chen, "Fusion model based on convolutional neural networks with two features for acoustic scene classification," in Proc. of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), Munich, Germany, 2017.
[15] T. Akilan, Q. M. J. Wu, and H. Zhang, "Effect of fusing features from multiple DCNN architectures in image classification," The Institution of Engineering and Technology, Feb. 2018.
[16] Y. Fu, L. Cao, G. Guo, and T. S. Huang, "Multiple feature fusion by subspace learning," in Proceedings of the 2008 International Conference on Content-Based Image and Video Retrieval, pp. 127-134, 2008.
[17] S. Bahrampour, N. Nasrabadi, A. Ray, and W. Jenkins, "Multimodal task-driven dictionary learning for image classification," IEEE Transactions on Image Processing, 2015.
[18] Y. Sun, X. Wang, and X. Tang, "Hybrid deep learning for face verification," in Proceedings of the IEEE International Conference on Computer Vision, pp. 1489-1496, 2013.
[19] V. Pawar and S. Talbar, "Hybrid machine learning approach for object recognition: Fusion of features and decisions," Machine Graphics and Vision, vol. 19, no. 4, pp. 411-428, 2010.
[20] P. Bouthellier, "Rotations and orientations in R3," 27th International Conference on Technology in Collegiate Mathematics.