A Hybrid Learner for Simultaneous Localization and Mapping
Thangarajah Akilan, Member, IEEE, Edna Johnson, Japneet Sandhu, Ritika Chadha, Gaurav Taluja
Abstract—Simultaneous localization and mapping (SLAM) is used to predict the dynamic motion path of a moving platform based on the location coordinates and the precise mapping of the physical environment. SLAM has great potential in augmented reality (AR), autonomous vehicles, viz. self-driving cars, drones, and autonomous navigation robots (ANR). This work introduces a hybrid learning model that explores beyond feature fusion and conducts a multimodal weight sewing strategy towards improving the performance of a baseline SLAM algorithm. It carries out weight enhancement of the front-end feature extractor of the SLAM via mutation of different deep networks' top layers. At the same time, the trajectory predictions from independently trained models are amalgamated to refine the location detail. Thus, the integration of the aforesaid early and late fusion techniques under a hybrid learning framework minimizes the translation and rotation errors of the SLAM model. This study exploits some well-known deep learning (DL) architectures, including ResNet18, ResNet34, ResNet50, ResNet101, VGG16, VGG19, and AlexNet, for experimental analysis. An extensive experimental analysis proves that the hybrid learner (HL) achieves significantly better results than the unimodal approaches and the multimodal approaches with early or late fusion strategies. To the best of our knowledge, the Apolloscape dataset used in this work has never before been used in the literature for SLAM with fusion techniques, which makes this work unique and insightful.
Index Terms —SLAM, deep learning, hybrid learning
I. INTRODUCTION
SLAM is a technological process that enables a device to build a map of the environment and, at the same time, compute its relative location on that map. It can be used for a range of applications, from self-driving vehicles (SDVs) to space and maritime exploration, and from indoor positioning to search and rescue operations. The primary responsibility of a SLAM algorithm is to produce an understanding of a moving platform's environment and of the location of the vehicle by providing the values of its coordinates, thus improving the formation of a trajectory to determine the view at a particular instance. As SLAM is one of the emerging technologies, numerous implementations have been introduced, but the DL-based approaches surmount the others by their efficiency in extracting the finest features and giving better results even in a feature-scarce environment.

This study aims to improve the performance of a self-localization module based on the PoseNet [1] architecture through the concept of hybrid learning, which performs a multimodal weight mutation to enhance the weights of a feature extractor layer and refines the trajectory predictions using an amalgamation of multimodal scores. The ablation study is carried out on Apolloscape [2], [3]; to our knowledge, no prior research work has been performed on the self-localization repository of the Apolloscape dataset, on which the proposed HL has been evaluated extensively. The experimental analysis presented in this work consists of three parts, of which the first two form the base for the third. The first part concentrates on an extensive evaluation of several DL models as feature extractors. The second part analyzes two proposed multimodal fusion approaches: (i) an early fusion via layer weight enhancement of the feature extractor, and (ii) a late fusion via score refinement of the trajectory (pose) regressor. Finally, the third part aims at the combination of the early and late fusion models, forming a hybrid learner with an addition or multiplication operation. Here, the late fusion model harnesses five pretrained deep convolutional neural networks (DCNNs), viz. ResNet18, ResNet34, ResNet101, VGG16, and VGG19, as the feature extractors for the pose regressor module, while the early fusion model and the HL focus on exploiting the best DCNNs, ResNet101 and VGG19, based on their individual performance on the Apolloscape self-localization dataset.

When analyzing the results of the early and late fusion models, it is observed that both markedly reduce the translation error with respect to the unimodal baselines. On analyzing the hybrid learners, the additive hybrid learner (AHL) and the multiplicative hybrid learner (MHL) also achieve low translation errors, and fusing the predictions of the AHL and the MHL, called the hybrid learner full-fusion (HLFF), produces better results than all the other models; the exact translation and rotation errors of each model are reported in Table I.

The rest of the paper is organized as follows. Section II reviews relevant SLAM literature and provides basic details of the PoseNet, unimodality, and multimodality. Section III elaborates the proposed hybrid learner, including the required preprocessing operations. Section IV describes the experimental setup and analyzes the results obtained from the various models. Section V concludes the research work.

II. BACKGROUND
A. SLAM
Simultaneous localization and mapping is an active research domain in robotics and artificial intelligence (AI). It enables a remotely automated moving vehicle to be placed in an unknown environment and location. According to Bailey and Durrant-Whyte [4] and Montemerlo et al. [5], SLAM should build a consistent map of this unknown environment and determine the location relative to the map. Through SLAM, robots and vehicles can be truly and completely automated with minimal or no human intervention. But the estimation of maps involves various other issues, such as large storage requirements and the need for precise location coordinates, which makes SLAM a rather intriguing task, especially in the real-time domain.

Much research has been conducted worldwide to determine an efficient method to perform SLAM. In [6], Montemerlo et al. propose a model named FastSLAM as an efficient solution to the problem. FastSLAM is a recursive algorithm that calculates the posterior distribution spanning the autonomous vehicle's pose and the landmark locations, and it scales logarithmically with the total number of landmarks. The algorithm relies on an exact factorization of the posterior into a product of landmark distributions and a distribution over the paths of the robot. The research on SLAM originates in the work of Smith and Cheeseman [7], who proposed the use of the extended Kalman filter (EKF). It is based on the notion that pose errors and errors in the map are correlated, and the covariance matrix obtained by the EKF represents this covariance. There are two main approaches for the localization of an autonomous vehicle: metric SLAM and appearance-based SLAM [1]. This research focuses on appearance-based SLAM, which is trained by giving a set of visual samples collected at multiple discrete locations.

B. PoseNet
The neural network (NN) comprises several interconnected nodes and associated parameters, like weights and biases. The weights are adjusted through a series of trials and experiments in the training phase so that the network learns and can be used to predict outcomes at a later stage. There are various kinds of NNs available, for instance, the feed-forward neural network (FFNN), the radial basis neural network (RBNN), the DCNN, and the recurrent neural network (RNN). Among them, DCNNs have been highly regarded for their adaptability and finer interpretability, with accurate and justifiable predictions in applications ranging from finance to medical analysis and from science to engineering. Thus, the PoseNet model for SLAM shown in Fig. 1 harnesses a DCNN to be robust against difficult lighting, motion blur, and varying camera intrinsics [1].

Fig. 1: PoseNet Architecture Subsuming a Feature Extractor (front-end: a pretrained CNN) and a Pose Regressor Subnetwork (back-end: dropout, pooling, and dense layers) that outputs the translation and rotation poses.

Figure 1 depicts the underlying architecture of the PoseNet. It subsumes a front-end with a feature extractor and a back-end with a regression subnetwork. The feature extractor can be a pretrained DCNN, like ResNet, VGG, or AlexNet. The regression subnetwork consists of three stages interconnected sequentially: a dropout, an average pooling, and a dense layer. It receives the high-dimensional vector from the feature extractor, which is then reduced to a lower dimension through the average pooling and dropout layers for generalization and faster computation [8]. The predicted poses are in six degrees of freedom (6-DoF), which define the six parameters of translation and rotation [1]. The translation consists of forward-backward, left-right, and up-down parameters forming the axes of 3D space as the $x$-axis, $y$-axis, and $z$-axis, respectively. Likewise, the rotation includes the yaw, pitch, and roll parameters of the same 3D space about the normal, transverse, and longitudinal axes, respectively. These six core parameters are then converted to seven coordinates: $x_1$, $x_2$, and $x_3$ of translation, and $x_0$, $y_1$, $y_2$, $y_3$ of rotation. This is because the actual rotation poses are in Euler angles; thus, a preprocessing operation converts the Euler angles into quaternions. A quaternion is a set of four values ($x_0$, $y_1$, $y_2$, and $y_3$), where $x_0$ represents a scalar rotation of the vector ($y_1$, $y_2$, $y_3$). This conversion is governed by the expressions given in Eqs. (1)-(4):

$x_0 = \frac{1}{2}\sqrt{1 + c_1 c_2 + c_1 c_3 - s_1 s_2 s_3 + c_2 c_3}$, (1)

$y_1 = (c_2 s_3 + c_1 s_3 + s_1 s_2 c_3)/(4 x_0)$, (2)

$y_2 = (s_1 c_2 + s_1 c_3 + c_1 s_2 s_3)/(4 x_0)$, (3)

$y_3 = (-s_1 s_3 + c_1 s_2 c_3 + s_2)/(4 x_0)$, (4)

where $c_1 = \cos(roll/2)$, $c_2 = \cos(yaw/2)$, $c_3 = \cos(pitch/2)$, $s_1 = \sin(roll/2)$, $s_2 = \sin(yaw/2)$, and $s_3 = \sin(pitch/2)$.
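For illustration, the following is a minimal sketch of this preprocessing step using the standard ZYX (yaw-pitch-roll) half-angle identities, to which Eqs. (1)-(4) correspond algebraically; the exact axis convention of the dataset toolkit is an assumption here.

```python
import math

def euler_to_quaternion(roll: float, pitch: float, yaw: float):
    """Convert Euler angles (radians) to a unit quaternion.

    Returns (w, x, y, z), where the scalar w plays the role of x0 in
    Eq. (1) and (x, y, z) the role of (y1, y2, y3) in Eqs. (2)-(4).
    Assumes the ZYX (yaw-pitch-roll) rotation order.
    """
    cr, sr = math.cos(roll / 2), math.sin(roll / 2)
    cp, sp = math.cos(pitch / 2), math.sin(pitch / 2)
    cy, sy = math.cos(yaw / 2), math.sin(yaw / 2)

    w = cr * cp * cy + sr * sp * sy  # scalar part
    x = sr * cp * cy - cr * sp * sy
    y = cr * sp * cy + sr * cp * sy
    z = cr * cp * sy - sr * sp * cy
    return w, x, y, z
```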
The pose regressor subnetwork is trained to minimize the translation and rotation errors. These errors are combined into a single objective function, $L_\beta$, as defined in Eq. (5) [9]:

$L_\beta(I) = L_x(I) + \beta L_q(I)$, (5)

where $L_x$ and $L_q$ are the losses of translation and rotation, respectively, and $I$ is the input vector representing the discrete location in the map. $\beta$ is a scaling factor used to balance both losses and is calculated using homoscedastic uncertainty, which combines the losses as defined in Eq. (6):

$L_\sigma(I) = L_x(I)\,\hat{\sigma}_x^{-2} + \log\hat{\sigma}_x^2 + L_q(I)\,\hat{\sigma}_q^{-2} + \log\hat{\sigma}_q^2$, (6)

where $\hat{\sigma}_x$ and $\hat{\sigma}_q$ are the uncertainties for translation and rotation, respectively. Here, the regularizers $\log\hat{\sigma}_x^2$ and $\log\hat{\sigma}_q^2$ prevent the values from becoming too big [9]. The loss can be calculated using a more stable form, as in Eq. (7), which is very handy for training the PoseNet:

$L_\sigma(I) = L_x(I)\exp(-\hat{s}_x) + \hat{s}_x + L_q(I)\exp(-\hat{s}_q) + \hat{s}_q$, (7)

where the learnable parameter $\hat{s} = \log\hat{\sigma}^2$. Following [9], in this work, $\hat{s}_x$ and $\hat{s}_q$ are initialized to 0 and -3.0, respectively.
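A minimal PyTorch sketch of the stable objective in Eq. (7) is given below, assuming L1 pose distances for $L_x$ and $L_q$; the initial values of the learnable parameters follow the ones quoted above.

```python
import torch
import torch.nn as nn

class HomoscedasticPoseLoss(nn.Module):
    """Stable learnable-weighting objective of Eq. (7):
    L = Lx * exp(-sx) + sx + Lq * exp(-sq) + sq."""

    def __init__(self, sx_init: float = 0.0, sq_init: float = -3.0):
        super().__init__()
        # s = log(sigma^2); trained jointly with the network weights.
        self.sx = nn.Parameter(torch.tensor(sx_init))
        self.sq = nn.Parameter(torch.tensor(sq_init))

    def forward(self, pred_t, gt_t, pred_q, gt_q):
        # L1 distances per sample for translation (x) and rotation (q).
        loss_t = (pred_t - gt_t).abs().sum(dim=-1).mean()
        loss_q = (pred_q - gt_q).abs().sum(dim=-1).mean()
        return (loss_t * torch.exp(-self.sx) + self.sx
                + loss_q * torch.exp(-self.sq) + self.sq)
```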
C. The Front-end Feature Extractor

As discussed earlier, the PoseNet takes advantage of transfer learning (TL), whereby it uses a pretrained DCNN as the feature extractor. TL differs from traditional learning: in the latter, the models or tasks are isolated and function separately, retaining no knowledge, whereas TL learns from an older problem and leverages that knowledge on a new set of problems [10]. Thus, in this work, versions of ResNet, versions of VGG, and AlexNet are investigated; a loading sketch follows the subsections below. Some basic information on these DCNNs is given in the following subsections.
1) AlexNet:
It was the winner of the 2012 ImageNet Large Scale Visual Recognition Competition (ILSVRC'12) with a breakthrough performance [11]. It consists of five convolution (Conv) layers, taking up to 60 million trainable parameters and 650,000 neurons, making it one of the large models among DCNNs. The first and second Conv layers are followed by a max pooling operation, while the third, fourth, and fifth Conv layers are connected directly; the final stage is a dense layer and a thousand-way Softmax layer. It was the first DCNN to adopt rectified linear units (ReLU) instead of the tanh activation function and to use a dropout layer to eradicate the overfitting issues of DL.
2) VGG (16, 19):
Simonyan and Zisserman [12] proposed the first version of the VGG network, named VGG16, for ILSVRC'14. It stood second in the image classification challenge with a top-5 error of 7.3%. VGG16 and VGG19 consist of 16 and 19 weight layers, respectively, with a max pooling layer after every set of two or three Conv layers. Each comprises two fully connected layers and a thousand-way Softmax top layer, similar to AlexNet. The main drawbacks of the VGG models are their high training time and heavy network weights.
3) ResNet (18, 34, 50, 101):
ResNet18 [13] was introduced to compete in ILSVRC'15, where the ResNet family outperformed other models, like VGG, GoogLeNet, and Inception. All the ResNet models used in this work are trained on the ImageNet database, which consists of more than a million images. Experiments have shown that even though ResNet18 is a subspace of ResNet34, its performance is more or less equivalent to that of ResNet34. ResNet18, 34, 50, and 101 consist of 18, 34, 50, and 101 layers, respectively. This paper first evaluates the performance of the PoseNet individually using the above-mentioned ResNet models besides the other feature extractors, and consequently chooses the best ones to be used in the fusion modalities and in the hybrid learner, thereby establishing a good trade-off between depth and performance. The ResNet models are constituted of residual blocks: ResNet18 and 34 stack residual blocks of two convolutional layers each, while ResNet50 and 101 stack bottleneck blocks of three convolutional layers. Each network comprises five convolutional stages followed by an average pooling layer, and has a fully connected layer followed by a thousand-way Softmax layer to generate the thousand class labels.
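As a concrete illustration of the TL setup described above, the sketch below loads the investigated ImageNet-pretrained backbones from torchvision and strips their thousand-way classifier heads so they can serve as PoseNet front-ends. This is a minimal sketch under our own naming, not the authors' exact code.

```python
import torch.nn as nn
from torchvision import models

def build_feature_extractor(name: str = "resnet101") -> nn.Module:
    """Load an ImageNet-pretrained backbone and drop its classifier head."""
    if name.startswith("resnet"):
        net = getattr(models, name)(pretrained=True)
        net.fc = nn.Identity()  # keep pooled features, drop the 1000-way head
    elif name.startswith("vgg") or name == "alexnet":
        net = getattr(models, name)(pretrained=True)
        net.classifier = net.classifier[:-1]  # drop the final 1000-way layer
    else:
        raise ValueError(f"unsupported backbone: {name}")
    return net

# Example: the two best-performing extractors used later for fusion.
resnet101_fe = build_feature_extractor("resnet101")
vgg19_fe = build_feature_extractor("vgg19")
```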
D. Multimodal Feature Fusion
Much existing research has taken advantage of various strategies for feature extraction and fusion. For instance, Xu et al. [14] modify the Inception-ResNet-v1 model to have four layers followed by a fully connected layer in order to reduce the effect of overfitting, as their problem domain has a small number of samples and fifteen classes. On the other hand, Akilan et al. [15] follow a TL technique in feature fusion, whereby they extract features using multiple DCNNs, namely AlexNet, VGG16, and Inception-v3. As these extractors result in varied feature dimensions and subspaces, feature-space transformation and energy-level normalization are performed to embed the features into a common subspace using dimensionality reduction techniques like PCA. Finally, the features are fused together using fusion rules, such as concatenation, feature product, summation, mean-value pooling, and maximum-value pooling.

Fu et al. [16] also consider dimension normalization techniques to produce a consistently uniform dimensional feature space. Their work presents supervised and unsupervised subspace learning methods for dimensionality reduction and multimodal feature fusion, and introduces a new technique called tensor-based discriminative subspace learning. This technique gives better results, as it produces a final fused feature vector of adequate length, i.e., a long vector if the number of features is large and a shorter vector if the number of features is small. Similarly, Bahrampour et al. [17] introduce a multimodal task-driven dictionary learning algorithm for information that is obtained either homogeneously or heterogeneously; these multimodal task-driven dictionaries produce the features from the input data for classification problems.
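To make the fusion rules of [15] concrete, here is a minimal sketch of the element-wise rules applied to two feature vectors already embedded in a common subspace (e.g., after PCA); the function name and the equal-dimension assumption are ours.

```python
import torch

def fuse_features(f1: torch.Tensor, f2: torch.Tensor,
                  rule: str = "mean") -> torch.Tensor:
    """Combine two equal-dimensional feature vectors with a fusion rule."""
    if rule == "concat":
        return torch.cat([f1, f2], dim=-1)  # doubles the feature length
    if rule == "product":
        return f1 * f2
    if rule == "sum":
        return f1 + f2
    if rule == "mean":
        return (f1 + f2) / 2.0
    if rule == "max":
        return torch.max(f1, f2)
    raise ValueError(f"unknown rule: {rule}")
```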
E. Hybrid Learning
Sun et al. [18] propose a hybrid convolutional neural network for face verification in wild conditions. Instead of extracting the features separately from the images, the features of the two images are jointly extracted by filter pairs. The extracted features are then processed through multiple layers of the DCNN to extract high-level and global features. The higher layers of the DCNN discussed in their work locally share the weights, which is quite contrary to conventional CNNs. In this way, feature extraction and recognition are combined under the hybrid model.

Similarly, Pawar et al. [19] develop an efficient hybrid approach involving invariant scale features for object recognition. In the feature extraction phase, invariant features, like color, shape, and texture, are extracted and subsequently fused together to improve the recognition performance. The fused feature set is then fed to pattern recognition algorithms, such as the support vector machine (SVM), discriminant canonical correlation, and locality preserving projections, which likely produce either three distinct or identical numbers of false positives. To hybridize the process entirely, a decision module is developed using NNs that takes the match values from the chosen pattern recognition algorithm as input and returns the result based on those match values.

Fig. 2: Operational Flow of the Proposed Hybrid Learner with a Weight Sewing Strategy and a Late Fusion Phase of the Predicted Poses Towards Improving the Localization Capability of a SLAM Model - PoseNet.

However, the hybrid learner (Fig. 2) introduced in this work is more unique and insightful than the existing hybrid fusion approaches. It focuses on enhancing and updating the weights of the pretrained unimodals before using them as front-end feature extractors of the PoseNet. Besides mutating the multimodal weights of the feature-extracting dense layer, it also fuses the predicted scores of the pose regressor.

III. PROPOSED METHOD
A. Hybrid Weight Sewing and Score Fusion Model
Fig. 2 shows a detailed flow diagram of the hybrid learner. It consists of two parts: the first part (Step 1, weight enhancement) carries out an early fusion by layer-weight enhancement of the feature extractor, and the second part (Step 3) performs a late fusion via score refinement of the models involved in the early fusion. The two best feature extractors, chosen based on their individual performances, are used for forming the hybrid learner. The early fusion models are obtained by fusing the dense-layer weights of ResNet101 and VGG19 by addition or multiplication. In late fusion, the predicted scores of multiple pose regressors with the above weight-enhanced feature extractors are amalgamated using average filtering to achieve better results.
B. Preprocessing

Before passing the images and poses to the PoseNet model, the data must be preprocessed adequately. The preprocessing involves checking the consistency of the images, resizing and center cropping the images, extracting the mean and standard deviation, and normalizing the poses. The images are resized and then center cropped to fixed dimensions. The translation values are used to obtain the minimum, maximum, mean, and standard deviation. The rotation values are read as Euler angles, which suffer from wrap-around infinities, gimbal lock, and interpolation problems. To overcome these challenges, the Euler rotations are converted to quaternions [20].
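A minimal torchvision sketch of this pipeline follows; the resize/crop sizes shown are placeholders (the exact values are not recoverable from the text), and ImageNet statistics stand in for the dataset-computed mean and standard deviation.

```python
from torchvision import transforms

# Assumed sizes and statistics for illustration only; the paper computes
# the mean/std from the training split of the Apolloscape dataset.
preprocess = transforms.Compose([
    transforms.Resize(256),        # placeholder resize dimension
    transforms.CenterCrop(224),    # placeholder crop dimension
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```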
C. Multimodal Weight Sewing via Early Fusion (EF)

The preprocessed data is fed to the feature extractors: ResNet101 and VGG19. These two models are selected based on their individual performance on the Apolloscape test dataset: using these two feature extractors, the PoseNet produces the minimum translation and rotation errors, as recorded in Table I. The weights of the top feature-extracting dense layers of the two feature extractors are fused via an addition or multiplication operation. The fused values are used to update the weights of the respective dense layers of the ResNet101 and VGG19 feature extractors. The updated models are then used as new feature extractors for the regressor subnetwork, and the regressor is trained on the training dataset.
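A minimal sketch of the weight sewing step is given below, under two assumptions of ours: both extractors expose their top feature-extracting dense layer under the hypothetical attribute name `fc`, and those layers have identical shapes so that element-wise addition or multiplication is defined.

```python
import copy
import torch

@torch.no_grad()
def sew_dense_weights(model_a, model_b, op: str = "add"):
    """Early fusion (Step 1): fuse the top dense-layer weights of two
    feature extractors and write the fused values back into both models."""
    if op == "add":
        w = model_a.fc.weight + model_b.fc.weight
        b = model_a.fc.bias + model_b.fc.bias
    else:  # element-wise multiplication
        w = model_a.fc.weight * model_b.fc.weight
        b = model_a.fc.bias * model_b.fc.bias
    new_a, new_b = copy.deepcopy(model_a), copy.deepcopy(model_b)
    for m in (new_a, new_b):
        m.fc.weight.copy_(w)  # both updated models carry the sewed values
        m.fc.bias.copy_(b)
    return new_a, new_b
```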
D. Pose Refinement via Late Fusion (LF)
The trained models with the updated ResNet101 and VGG19, obtained through the early fusion addition and multiplication operations, are moved on to the late fusion phase, shown as Step 3 in Fig. 2. Here, the loaded weight-enhanced early fusion models simultaneously predict the poses for each input visual. The predicted scores from these models (in this case, ResNet101 and VGG19) are amalgamated with average filtering; this way of fusion is denoted as AHL. Similarly, the predicted scores of the early fusion models can be refined using multiplication, denoted as MHL. Finally, the predicted scores of the four early fusion models using addition and multiplication are fused together using an averaging operation to obtain the predicted scores of the full hybrid fusion model stated earlier (Section I) as HLFF. These predicted poses are then compared with the ground-truth poses to calculate the mean and median of the translation and rotation errors.
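A minimal sketch of this score refinement follows, assuming each trained model maps an image tensor to a 7-D pose vector (three translation values followed by a quaternion); the re-normalization of the quaternion part is our addition to keep the fused rotation valid.

```python
import torch

@torch.no_grad()
def fuse_poses(models, image, mode: str = "avg") -> torch.Tensor:
    """Late fusion (Step 3): refine the poses predicted by several models,
    by averaging (AHL/HLFF-style) or element-wise product (MHL-style)."""
    preds = torch.stack([m(image) for m in models])  # (num_models, 7)
    fused = preds.mean(dim=0) if mode == "avg" else preds.prod(dim=0)
    t, q = fused[:3], fused[3:]
    q = q / q.norm()  # keep the rotation a unit quaternion
    return torch.cat([t, q])
```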
IV. EXPERIMENTAL SETUP AND RESULTS
A. Dataset
The Apolloscape dataset is used for many computer vision tasks related to autonomous driving. It consists of modules, including instance segmentation, scene parsing, lane-mark parsing, and self-localization, that are used to understand a scene and to make a vehicle act and reason accordingly [2], [3]. The Apolloscape self-localization dataset is made up of images and poses recorded on different roads at discrete locations; the images and poses are created from video recordings, and each record of the dataset has multiple images with a pose corresponding to every image. The road chosen for the ablation study of this research is zpark, which consists of a total of 3000 stereo-vision road scenes. For each image, there is a ground-truth pose with 6-DoF. From this entire dataset, mutually exclusive training and test sets are created at a fixed split ratio.

B. Evaluation Metric
Measuring the performance of a machine learning model is pivotal to comparing the various CNN models. Since every CNN model is trained and tested on different datasets with varied hyperparameters, it is necessary to choose the right evaluation metric. As the domain of this work is a regression problem, the mean absolute error (MAE) is used to measure the performance of the set of models, ranging from the unimodals to the proposed hybrid learner. MAE is a linear score, calculated as the average of the absolute differences between the target variables and the predicted variables using the formula given in Eq. (8):

$MAE = \frac{1}{n}\sum_{i=1}^{n} |x_i - \hat{x}_i|$, (8)

where $n$ is the total number of samples in the validation dataset, and $x_i$ and $\hat{x}_i$ are the predicted and ground-truth poses, respectively. Since it is an average measure, all the distinct values are weighted equally, ignoring any bias involvement.
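To make the metric concrete, a minimal NumPy sketch of Eq. (8) and its median counterpart is given below; the stacked pose arrays and their layout are assumptions for illustration.

```python
import numpy as np

def mean_absolute_error(pred: np.ndarray, gt: np.ndarray) -> float:
    """Eq. (8): average absolute difference between predicted and
    ground-truth pose values over the validation set."""
    return float(np.mean(np.abs(pred - gt)))

def median_absolute_error(pred: np.ndarray, gt: np.ndarray) -> float:
    """Median counterpart, as reported alongside the mean in Table I."""
    return float(np.median(np.abs(pred - gt)))
```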
TABLE I: Performance Analysis of Various Models: $e_t$ - translation error, $e_r$ - rotation error, MAPST - mean average per-sample processing time.

Model | Median $e_t$ (m) | Mean $e_t$ (m) | Median $e_r$ (°) | Mean $e_r$ (°) | MAPST (s)
M1 - ResNet18 | 21.194 | 24.029 | 0.778 | 0.900 | 0.
M2 - ResNet34 | 20.990 | 23.597 | 0.673 | 0.824 | 0.
M3 - ResNet50 | 18.583 | 20.803 | 0.903 | 1.434 | 0.
M4 - ResNet101 | 16.227 | 19.427 | 0.966 | 1.230 | 0.
M5 - VGG16 | 17.150 | 21.571 | 1.079 | 1.758 | 0.
M6 - VGG19 | 16.820 | 19.935 | 0.899 | 1.378 | 0.
M7 - AlexNet | .992 | 53.004 | 4.282 | 7.177 | 0.
M8 - LF | .763 | 10.561 | 0.945 | 4.645 | 0.
M9 - AEF-ResNet | .870 | 18.256 | 0.673 | 0.784 | 0.
M10 - MEF-ResNet | .842 | 18.013 | 0.779 | 0.977 | 0.
M11 - AEF-VGG | .047 | 13.840 | 0.742 | 1.024 | 0.
M12 - MEF-VGG | .730 | 14.181 | 0.756 | 1.141 | 0.
M13 - AHL | .400 | 12.193 | 0.828 | 5.155 | 0.
M14 - MHL | .307 | 11.420 | 1.206 | 5.455 | 0.
M15 - HLFF | .762 | 8.829 | 1.008 | 4.618 | 0.

C. Performance Analysis
This section elaborates on the results obtained from each of the models introduced earlier in this paper.
1) Translation and Rotation Errors:
Table I tabulates the performance of the PoseNet with various front-end unimodal and multimodal feature extractors, along with the proposed hybrid learners. The results in this table can be described in three subdivisions. The primary section, from M1 to M7, gives the outcomes of the PoseNet with unimodality-based feature extractors. The subsequent section, extending from M8 to M12, depicts the performances of five multimodality-based learners: M8 represents late fusion (LF); M9 (AEF-ResNet) and M10 (MEF-ResNet) represent early fusion on ResNet101 as the feature extractor with addition and multiplication, respectively; and M11 (AEF-VGG) and M12 (MEF-VGG) are the results for early fusion on VGG19 with addition and multiplication, respectively. The third section consists of the proposed hybrid learners: M13 stands for the AHL, which combines the early fusion models M9 and M11; M14 for the MHL, which combines the early fusion models M10 and M12; and M15 for the HLFF, obtained by averaging the predicted scores of the four models M9, M10, M11, and M12.

Fig. 3: Performance Analysis of All the Different Models: (a) error in terms of translation, (b) error in terms of rotation, (c) average processing time per sample.

The results are computed as the mean and median values of the translation and rotation errors; the translation errors are measured in meters (m), while the rotation errors are measured in degrees (°). Considering the unimodal-based PoseNet implementations, it is very apparent that ResNet101 and VGG19 give the two best outcomes among the others. In terms of translation error, the late fusion model shows better performance than the unimodality-based learners, but not as good as the early fusion models. The ResNet101-based PoseNet (M4) is taken as the baseline model for the rest of the comparative analysis, because it has the best performance amongst all the unimodals. Compared to this baseline, the late fusion lowers both the median and mean translation errors and slightly lowers the median rotation error, but its mean rotation error increases.

Comparing the baseline model (M4) with the early fusion model using addition and ResNet101 as the feature extractor (M9) shows decreases in both the median and mean translation errors, along with decreases in the median and mean rotation errors. The comparison with the early fusion model using multiplication on VGG19 (M12) likewise exhibits decreases in the median and mean translation errors, and drops in both the median and mean rotation errors.

It is evident from Table I that the hybrid learners show much better performance than the unimodal and early fusion models in terms of translation: they substantially lower the translation's median and mean errors when compared to the ResNet101-based PoseNet, while their rotation errors increase. In a holistic analysis, summarized in Table II, the late fusion improves the translation by 37% and the rotation by 3% at a timing overhead of 48 ms, and the early fusion using VGG19 as a feature extractor improves the translation and rotation by 31% and 24%, respectively, at an overhead of 39 ms. It is quite evident from Table II that the proposed HL suffers a slight degradation in rotation; nevertheless, it is the best model, considering its large improvement in translation across all the modalities.

TABLE II: Performance Improvement of the Proposed Hybrid Learner When Compared to the Baseline PoseNet with ResNet101 as Front-end.

Model Name | Improvement in $e_t$ (%) | Improvement in $e_r$ (%) | Timing Overhead (ms)
LF | 37 | 3 | 48
EF-ResNet | | |
EF-VGG | 31 | 24 | 39
HLFF | | − |

D. Timing Analysis
The timing analysis is conducted on the Google Colaboratory platform, using a machine with an Intel Core i5 class CPU, a Tesla K80 GPU with 2496 CUDA cores, 319 GB of disk space, and 12.6 GB of RAM. Table I shows the mean average processing time calculated for processing a batch of ten samples.

As seen from Table I and Fig. 3, the fusion models, which involve the early, late, and hybrid learners, take slightly more time than the unimodality-based baseline PoseNet. The late fusion model (M8) takes the most processing time in comparison to all the other models, as it uses five pretrained modalities that are trained and tested individually, thereby increasing the time overhead. The early fusion models also show an increase in processing time in comparison to the unimodels, but less than the late fusion model, as training the pretrained model after weight enhancement takes more time. The hybrid learners show the same trend, because of the underlying fact that they are a combination of the early and late fusion methods: these models employ weight-enhanced early fusion models, adding to the time overhead, besides fusing the scores from different models after validation.

Note that the hyperparameters have been fixed throughout the experimental analysis on the various models to avoid uncertainties in the comparative study: the learning rate, the dropout rate of the PoseNet's dropout layer, and the training batch size are identical across all models, and every model is trained for the same number of epochs with the Adam optimizer.

V. CONCLUSION
This work introduces a hybrid learner to improve the localization accuracy of a pose regressor model for SLAM. The hybrid learner is a combination of multimodal early and late fusion algorithms designed to harness the best properties of both. The extensive experiments on the Apolloscape self-localization dataset show that the proposed hybrid learner is capable of reducing the translation error substantially, although the rotation error gets negligibly worse when compared to the unimodal PoseNet with ResNet101 as a feature extractor. Thus, future work aims at minimizing the rotation errors and overcoming the small overhead in the processing time.

ACKNOWLEDGMENT
This work acknowledges Google for its generosity in providing HPC via the Colab machine learning platform, and the organizers of the ApolloScape dataset.

REFERENCES
[1] A. Kendall, M. Grimes, and R. Cipolla, "PoseNet: A convolutional network for real-time 6-DOF camera relocalization," in Proceedings of the IEEE International Conference on Computer Vision, pp. 2938-2946, 2015.
[2] P. Wang, R. Yang, B. Cao, W. Xu, and Y. Lin, "DeLS-3D: Deep localization and segmentation with a 3D semantic map," in CVPR, pp. 5860-5869, 2018.
[3] P. Wang, X. Huang, X. Cheng, D. Zhou, Q. Geng, and R. Yang, "The ApolloScape open dataset for autonomous driving and its application," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[4] T. Bailey and H. Durrant-Whyte, "Simultaneous localization and mapping (SLAM): Part II," IEEE Robotics & Automation Magazine, vol. 13, no. 3, pp. 108-117, 2006.
[5] M. Montemerlo, S. Thrun, D. Koller, B. Wegbreit, et al., "FastSLAM 2.0: An improved particle filtering algorithm for simultaneous localization and mapping that provably converges," in IJCAI, pp. 1151-1156, 2003.
[6] M. Montemerlo, S. Thrun, D. Koller, B. Wegbreit, et al., "FastSLAM: A factored solution to the simultaneous localization and mapping problem," Aaai/iaai, vol. 593598, 2002.
[7] R. C. Smith and P. Cheeseman, "On the representation and estimation of spatial uncertainty," The International Journal of Robotics Research, vol. 5, no. 4, pp. 56-68, 1986.
[8] F. Walch, C. Hazirbas, L. Leal-Taixé, T. Sattler, S. Hilsenbeck, and D. Cremers, "Image-based localization with spatial LSTMs," CoRR, vol. abs/1611.07890, 2016.
[9] A. Kendall and R. Cipolla, "Geometric loss functions for camera pose regression with deep learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5974-5983, 2017.
[10] M. E. Taylor and P. Stone, "Transfer learning for reinforcement learning domains: A survey," Journal of Machine Learning Research, vol. 10, pp. 1633-1685, 2009.
[11] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, pp. 1097-1105, 2012.
[12] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in The International Conference on Learning Representations, pp. 1-5, 2015.
[13] P. Napoletano, F. Piccoli, and R. Schettini, "Anomaly detection in nanofibrous materials by CNN-based self-similarity," MDPI, 2018.
[14] J. Xu, Y. Zhao, J. Jiang, Y. Dou, Z. Liu, and K. Chen, "Fusion model based on convolutional neural networks with two features for acoustic scene classification," in Proc. of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), Munich, Germany, 2017.
[15] T. Akilan, Q. M. J. Wu, and H. Zhang, "Effect of fusing features from multiple DCNN architectures in image classification," The Institution of Engineering and Technology, Feb. 2018.
[16] Y. Fu, L. Cao, G. Guo, and T. S. Huang, "Multiple feature fusion by subspace learning," in Proceedings of the 2008 International Conference on Content-Based Image and Video Retrieval, pp. 127-134, 2008.
[17] S. Bahrampour, N. Nasrabadi, A. Ray, and W. Jenkins, "Multimodal task-driven dictionary learning for image classification," IEEE Transactions on Image Processing, 2015.
[18] Y. Sun, X. Wang, and X. Tang, "Hybrid deep learning for face verification," in Proceedings of the IEEE International Conference on Computer Vision, pp. 1489-1496, 2013.
[19] V. Pawar and S. Talbar, "Hybrid machine learning approach for object recognition: Fusion of features and decisions," Machine Graphics and Vision, vol. 19, no. 4, pp. 411-428, 2010.
[20] P. Bouthellier, "Rotations and orientations in R3," 27th International Conference on Technology in Collegiate Mathematics.