Autonomous Navigation in Dynamic Environments: Deep Learning-Based Approach
Institute of Aviation Engineering and Technology
Department of Electronics and Communications Engineering
Autonomous Navigation in Dynamic Environments: Deep Learning-Based Approach
Authors: Omar Mohamed, Zeyad Mohsen, Mohamed Wageeh, Mohamed Hegazy
Internal supervisor: Dr. Mohamed S. Elbakry, Institute of Aviation Engineering & Technology
External supervisor: Prof. Moustafa Elshafei, Zewail City of Science & Technology
Co-supervisor: Mr. Ihab S. Mohamed
INRIA Sophia Antipolis - Méditerranée, France
Giza, Egypt

Acknowledgements

In the Name of ALLAH, the Most Merciful, the Most Compassionate; all praise be to ALLAH, and prayers and peace be upon the Prophet Mohamed. First and foremost, we are certain that this work would never have become a reality without the help of ALLAH.

Secondly, we would like to express our deepest appreciation to our supervisors, Professor Moustafa Elshafei and Dr. Mohamed Sobhy, for their continuous stimulating suggestions and encouragement. We would like to express our special thanks of gratitude to Professor Moustafa Elshafei, who granted us the opportunity to carry out our graduation project at Zewail City, and to Dr. Mohamed Sobhy for his patient academic guidance throughout the research and the preparation of this thesis. Because of their invaluable advice and constructive direction, we have been able to finish this dissertation for our graduation project.

Third, we would like to acknowledge with much appreciation the crucial role of Eng. Ihab S. Mohamed. This work would not have been finalized at this level without his continuous support in every aspect of the project, as well as his fruitful discussions and bright suggestions from the start. He has taught us so much about robotics and has helped us develop a stronger desire to continue pursuing research in this field. Thanks a lot for his careful reviewing of our thesis.

Finally, we must express our very profound gratitude to our parents, sisters, brothers, and families for providing us with unfailing support and continuous encouragement throughout our years of study and through the process of working on this project. This accomplishment would not have been possible without them. Thank you.

Abstract
Mobile robotics is a research area that has witnessed incredible advances over the last decades. Navigation is an essential task for mobile robots, and many methods have been proposed to allow robots to navigate in different environments. This thesis studies different deep learning-based approaches, highlighting the advantages and disadvantages of each scheme. These approaches are promising in that some of them can navigate a robot in unknown and dynamic environments. In this thesis, one of the deep learning methods, based on a convolutional neural network (CNN), is realized in software. Several preparation studies were required to complete this work, including an introduction to Linux, the Robot Operating System (ROS), C++, Python, and the Gazebo simulator. Within this work, we modified the drone network (DroNet) approach so that it can be used in an indoor environment by a ground robot in different cases. The DroNet approach, however, suffers from the absence of goal-oriented motion. Therefore, this thesis mainly focuses on tackling this problem via mapping, using simultaneous localization and mapping (SLAM), and path planning, using Dijkstra's algorithm. The combination of the ground-robot-based DroNet, mapping, and path planning yields goal-oriented motion that follows the shortest path while avoiding dynamic obstacles. Finally, we propose a low-cost approach for indoor applications, such as restaurants and museums, based on using a monocular camera instead of a laser scanner.
Chapter 1
Introduction

Recently, mobile robots have started to work in real-world scenarios. The applications of mobile robots are immense and are acquiring importance. These applications include agricultural robotics, such as fertilizing and planting; support to medical services, such as transportation of medication; client support, such as museum tours and exhibition guides; and military missions, such as surveillance and monitoring. A group of mobile robots can work in parallel, which gives it advantages over single-robot systems. Multi-mobile-robot systems can complete a given task faster than a single robot. In tasks where multiple mobile robots are involved, all robots must navigate and avoid each other to reach their goal positions. Multi-mobile-robot systems can be used for material transportation in factories, defense, agricultural robotics, and service support.
A mobile robot is an autonomous agent capable of navigating intelligently anywhere using sensor-actuator control techniques. The applications of autonomous mobile robots in fields such as industry, space, defence, transportation, and other social sectors are growing day by day. Furthermore, navigation from one point to another is one of the most basic tasks in almost every robotic system nowadays, and many methods have been proposed throughout the last century to achieve this fundamental operation [1]. Several challenges are faced during navigation, including fluctuations in navigation accuracy depending on the complexity of the environment, as well as problems in mapping precision, localization accuracy, actuator efficiency, and so on. To date, navigation in dynamic environments remains one of the most important challenges in mobile robot systems, and it is currently a hot research area, with many approaches proposed to achieve this task with the highest possible accuracy. This thesis therefore focuses on deep learning approaches, as they have shown the most auspicious results of all the investigated methods.
Autonomous navigation means that a robot is able to plan its path and execute its plan without human intervention. In some cases remote navigation aids are used in the planning process, while at other times the only information available to compute a path comes from sensors aboard the robot itself. An autonomous robot is one that can not only maintain its own stability as it moves but also plan its movements. Autonomous robots use navigation aids when possible, but can also rely on visual, auditory, and olfactory cues. Once basic position information is gathered, in the form of triangulated signals or environmental perception, machine intelligence must be applied to translate some basic motivation (a reason for leaving the present position) into a route and motion plan. This plan may have to accommodate the estimated or communicated intentions of other autonomous robots in order to prevent collisions, while considering the dynamics of the robot's own movement envelope.
The core of this project is studying and evaluating the state-of-the-art deep learning approaches for robotic navigation that have recently been proposed for both static and dynamic environments. After studying the implementation of each approach and the advantages and disadvantages of these algorithms, we decided to study and modify the DroNet approach [2]. DroNet was proposed to meet the requirements of civilian drones, which are soon expected to be used in a wide variety of tasks such as aerial surveillance, delivery, or monitoring of existing architectures. DroNet is a convolutional neural network (CNN) that can safely drive a drone through the streets of a city. The approach works in both outdoor and indoor environments, but it suffers from the absence of goal-oriented motion. This thesis aims at autonomous mobile robot navigation in dynamic environments.
The main objective of this thesis is autonomous mobile robot navigation in dynamic environments. This is achieved by modifying the DroNet approach proposed in [2] to navigate in an indoor environment using a ground robot, and then retraining the CNN to enhance the performance of DroNet in this environment. In addition, to generate a path to the target we need a map and a path planning technique: for mapping we use simultaneous localization and mapping (SLAM) (gmapping), and for path planning we use Dijkstra's algorithm. Finally, the combination of the modified DroNet and the generated paths yields goal-oriented motion with the shortest path and dynamic obstacle avoidance at low cost. We have implemented and tested the method in both the ROS environment and the Gazebo robotic simulator.
This thesis is organized as follows: Chapter 1 presents the thesis motivation, the definition of autonomous navigation, the problem statement, the thesis objectives, and the thesis structure. In Chapter 2, a literature review of deep learning-based schemes is given. Chapter 3 introduces the convolutional neural network (CNN). Chapter 4 studies the tools used in this graduation project, such as Linux, the Robot Operating System (ROS), C++, Python, and the Gazebo simulator. Chapter 5 presents our proposed approach and its simulation results. Finally, Chapter 6 includes the conclusions and future work directions. For a supplementary video see: https://youtu.be/ow7aw9H4BcA

Chapter 2
Literature Review

The approach in [3] presents a learning-based mapless motion planner that takes sparse 10-dimensional range findings and the target position, expressed in the mobile robot coordinate frame, as input and produces continuous steering commands as output. Traditional motion planners for mobile ground robots with a laser range sensor mostly depend on an obstacle map of the navigation environment, where both a highly precise laser sensor and the map-building work are indispensable. The authors show that, through an asynchronous deep reinforcement learning method, a mapless motion planner can be trained end-to-end without any manually designed features or prior demonstrations. The trained planner can be directly applied in unseen virtual and real environments. The experiments show that the proposed mapless motion planner can navigate a nonholonomic mobile robot to the desired targets without colliding with any obstacles.
In this approach, a mapless motion planner is trained end-to-end from scratch through continuous-control deep reinforcement learning (RL). The authors revised a state-of-the-art continuous deep-RL method so that training and sample collection can be executed in parallel. By taking the 10-dimensional sparse range findings and the target position relative to the mobile robot coordinate frame as input, the proposed motion planner can be directly applied in unseen real environments without fine-tuning, even though it is trained only in a virtual environment. When compared to a low-dimensional map-based motion planner, the approach proved to be more robust in extremely complicated environments.
Robot navigation using deep neural networks has been drawing a great deal of attention. Although reactive neural networks easily learn expert behaviors and are computationally efficient, they suffer from poor generalization of policies learned in specific environments. Reinforcement learning and value iteration approaches for learning generalized policies have therefore been proposed; however, these approaches are more costly. The approach in [4] tackles the problem of learning reactive neural networks that are applicable to general environments. The key concept is to crop, rotate, and resize an obstacle map according to the goal location and the agent's current location, so that the map representation is better correlated with self-movement in the general navigation task rather than with the layout of the environment. Furthermore, in addition to the obstacle map, a map of visited locations containing the movement history of the agent is given as input, in order to avoid failures in which the agent travels back and forth repeatedly over the same location, as shown in Figure 2.1. Experimental results reveal that the proposed network outperforms the state-of-the-art value iteration network in the grid-world navigation task. The authors also demonstrate that the proposed model generalizes well to unseen obstacles and unknown terrain. Finally, they demonstrate that the proposed system enables a mobile robot to successfully navigate in a real dynamic environment.

Figure 2.1: GOSELO [4].

The proposed method is based on a convolutional neural network (CNN) that estimates the next best step among neighboring pixels in a grid map. Such a CNN is referred to as a "reactive CNN" because it reacts to specific patterns on a map in order to determine the movement of the agent. Navigation based on a reactive CNN has three main advantages. First, a reactive CNN estimates the next best step in constant time in any situation. In contrast, the computational time of most existing path planning methods, such as A* search and the rapidly exploring random tree (RRT), depends on the scale and complexity of the map. Furthermore, such classical path planning methods fail when there is no path to the goal, whereas a CNN-based method can suggest a plausible direction in which to proceed at every moment, regardless of the existence of a path, which is important for navigation in cluttered, dynamic environments. Second, a reactive CNN can use graphics processing unit (GPU) acceleration due to its high potential for parallelization. This is a major advantage over many classical path planning methods that cannot be wholly parallelized, because every point on a path depends on other locations. Finally, a reactive CNN can efficiently learn expert behaviors, e.g., human controls, without modeling the rewards and the policy behind the behaviors. The authors proposed a novel navigation method for an online-editable 2D map via an image classification technique. The computation time required by the proposed method to estimate the best direction for the agent remains constant at each step. Another significant advantage of the proposed method is that the agent preferentially moves to new locations, which helps it avoid the local-minima trap. Experimental results demonstrated the effectiveness of the proposed goal-directed map representation, i.e., GOSELO, as well as its superiority to existing neural-network-based methods (such as the VIN method) in terms of both success rate and computational cost.
They also demonstrated that the proposed method can be generalized to avoid unseen obstacles and navigate unknown terrain. Experiments using the Peacock mobile robot demonstrated the robustness of the proposed navigation system in dynamic scenarios involving crowds of people. Peacock successfully moved continuously, all day long for two days, while avoiding people, and demonstrated the advantage of the proposed method over classical path planning methods, such as A* search, which fail to predict the next step when there is no path to the goal. Although a CPU was used for the prediction of a single future step, there would be more room for predicting dozens of future steps on a GPU, and the authors plan such an extension to predict a more reliable direction to proceed. Extending GOSELO from 2D to 3D is another area for future study.

The paper [5] proposes an end-to-end method for training convolutional neural networks for autonomous navigation of a mobile robot. The traditional approach to robot navigation consists of three steps. The first step is extracting visual features from the scene using the camera input. The second step is estimating the current position by applying a classifier to the extracted visual features. The last step is defining a rule for the direction of motion manually, or training a model to handle the direction. In contrast to the traditional multi-step method, the proposed visuo-motor navigation system directly outputs the linear and angular velocities of the robot from an input image in a single step. The trained model gives wheel velocities for navigation as outputs in real time, making it possible to deploy it on mobile robots such as robotic vacuum cleaners. The experimental results show an average linear velocity error of 2.2 cm/s and an average angular velocity error of 3.03 degree/s. A robot deployed with the proposed model can navigate in a real-world environment using only the camera, without relying on any other sensors such as LiDAR, radar, IR, GPS, or IMU. The proposed system architecture is shown in Figure 2.2.

Figure 2.2: End-to-end deep architecture [5].

The input of the proposed architecture is a red-green-blue (RGB) image and the outputs are linear and angular velocities. The system does not require separate detection, localization, or planning modules for navigation. The CNN architecture used in this paper is AlexNet. Even though other well-known architectures such as VGGNet, GoogLeNet, or ResNet could be used, these networks are not suitable for real-time robot navigation due to their slow inference speed. The network performs multi-label regression, giving two real values as outputs. The ground-truth velocities are in the range of 0 to 0.5 m/s for linear velocity and -1.5 to 1.5 rad/s for angular velocity. The two velocities were normalized to values between 0 and 1 for the CNN. Since the output values oscillate, using the raw output values makes the robot's movement unstable. To obtain consistent outputs, post-processing for noise reduction was applied to make the movement of the robot stable.
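To make the normalization and post-processing steps above concrete, the following is a minimal sketch (not the authors' code) of how outputs in [0, 1] could be mapped back to the velocity ranges reported for [5] and then smoothed; the exponential-moving-average filter and all numeric values are illustrative assumptions.

```python
# Velocity ranges reported in [5]; the smoothing filter is only one plausible
# form of the post-processing described above, not the authors' exact method.
LIN_MIN, LIN_MAX = 0.0, 0.5     # linear velocity range [m/s]
ANG_MIN, ANG_MAX = -1.5, 1.5    # angular velocity range [rad/s]

def denormalize(lin_norm, ang_norm):
    """Map network outputs in [0, 1] back to physical velocity ranges."""
    lin = LIN_MIN + lin_norm * (LIN_MAX - LIN_MIN)
    ang = ANG_MIN + ang_norm * (ANG_MAX - ANG_MIN)
    return lin, ang

def smooth(prev_cmd, new_cmd, alpha=0.3):
    """Exponential moving average to damp oscillating raw outputs."""
    return (1.0 - alpha) * prev_cmd + alpha * new_cmd

lin, ang = denormalize(0.8, 0.4)               # e.g. raw CNN outputs
print(smooth(0.30, lin), smooth(-0.10, ang))   # smoothed velocity commands
```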
Traditional methods for robot navigation or path planning require multiple complex algorithms for localization, navigation, and action planning. The proposed end-to-end deep learning approach makes it possible to control the robot motors directly from the visual input, much as a human does. A human can decide on a path by seeing only the local scene, without any information about the global map. This result verified the potential of the proposed system as a local path planner. As future work, the visuo-motor system could be developed into a global path planner. Moreover, the model could be compressed for direct deployment of the visuo-motor system on an embedded board without a server.
The paper [6] presents a model that learns the complex mapping from raw 2D laser range finder readings and a target position to the steering commands required for the robot. A data-driven end-to-end motion planner based on a CNN model is proposed. The supervised training data is based on expert demonstrations generated using an existing motion planner. The system can navigate the robot safely through a cluttered environment to reach the goal. The proposed solution does not require any global map for the robot to navigate: given the sensor data and the relative target position, the robot is able to navigate to the desired location while avoiding the surrounding obstacles.

Figure 2.3: DNN architecture.

By design, the approach is not limited to any kind of environment; however, in this paper only navigation in static environments is considered. The main contribution can be summarized in two points: first, a data-driven end-to-end motion planner from laser range findings to motion commands; second, deployment and tests on a real robotic platform in unknown environments. The end-to-end relationship between input data and steering commands can result in an arbitrarily complex model. Among different machine learning approaches, DNNs/CNNs are well known for their capabilities as hyper-parametric function approximators that model complex and highly non-linear dependencies. To avoid the problem of generating training data by hand, simulation data is collected, with a global motion planner used as an expert. Since no pre-processing of the laser data is required, the computational complexity, and therefore the query time for a steering command, depends only on the complexity of the model, which is constant once it is trained. The paper also considers the case in which the robot is faced with a suddenly appearing object blocking its path. The proposed deep planner reacted clearly to the object by swerving to the right and, after the obstacle was removed, corrected its course of motion to reach the target as fast as possible.
This work showed some limitations in real-world experimentation. The robot has weaknesses when it comes to wide open spaces with clutter around; the authors suggest that this potentially results from the fact that the model was trained purely on perfect simulation data, and that re-training using real sensor data might reduce this undesirable effect. In addition, once the robot enters a convex dead-end region, it is not capable of freeing itself. Moreover, the motion of the robot sometimes fluctuates, apparently because the architecture does not contain any memory of past visited locations. Another drawback of the work is that the motion is not fully autonomous and sometimes needs help from a human with a joystick when the robot gets stuck.

Chapter 3
Convolutional Neural Networks
Even though computers are designed by and for humans, it is clear that the concept of a computer is very different from a human brain. The human brain is a complex and non-linear system, and on top of that its way of processing information is highly parallelized. It is based on structural components known as neurons, which are all designed to perform certain types of computations. The brain can be applied to a huge number of recognition tasks, and usually performs these within 100 to 200 ms; tasks of this kind are still very difficult for computers, and just a few years ago performing such computations on a CPU could take days [7]. Inspired by this amazing system, a new way of handling these problems arose in order to make computers more suitable for these kinds of tasks: the artificial neural network (ANN). An ANN is a model based on a potentially massive interconnected network of processing units, suitably called neurons. In order for the network and its neurons to know how to handle incoming information, the model has to acquire knowledge, which is done through a learning process. The connections between the neurons in the network are represented by weights, and these weights store the knowledge learned by the model. This kind of structure results in high generalization, and the fact that the way the neurons handle data can be non-linear is beneficial for a whole range of different applications. This opens up completely new approaches for input-output mapping and enables the creation of highly adaptive models for computation [7]. The learning process itself generally becomes a case of what is called supervised learning, which is described in the next section.
A convolutional neural network (CNN, or ConvNet) is a type of artificial neural network inspired by biological processes [8]. In machine learning, it is a class of deep, feed-forward artificial neural networks that has successfully been applied to analyzing visual imagery. It can be seen as a variant of the multilayer perceptron (MLP). In computer vision, a traditional MLP connects each hidden neuron with every pixel in the input image, trying to find global patterns. However, such connectivity is not efficient, because pixels distant from each other are often less correlated, so the patterns found are less discriminative when fed to a classifier. In addition, due to this dense connectivity, the number of parameters grows quickly as the size of the input image increases, resulting in substantial increases in both computational complexity and memory usage. These problems are alleviated in CNNs. A hidden neuron in a CNN only connects to a local patch in the input image. This type of sparse connectivity is more effective for discovering local patterns, and the local patterns learned from one part of an image are also applicable to other parts of the image. CNNs have been widely used for vision-based classification applications. In recent years, a series of R-CNN methods have been proposed to apply CNNs to object detection tasks [9, 10, 11, 12]. In [9], the original version of R-CNN takes the full image and object proposals as input. The regional object proposals can come from a variety of methods; in their work the authors use Selective Search [13]. Each proposed region is then cropped from the original image and warped to a unified fixed pixel size. A 4096-dimensional feature vector is extracted by forward-propagating the mean-subtracted region through a fine-tuned CNN with five convolutional layers and two fully connected layers. With these feature vectors, a set of class-specific linear support vector machines (SVMs) is trained for classification. R-CNN achieves excellent object detection accuracy; however, it has notable drawbacks. First, training and testing have multiple stages, including fine-tuning the CNN with
softmax loss, training SVMs, and learning bounding-box regressors. Second, the CNN part is slow because it performs a forward pass for each object proposal without sharing computation. To address the speed problem, the Spatial Pyramid Pooling network (SPPnet) [14] and Fast R-CNN [11] were proposed. Both methods compute a single convolutional feature map for the entire input image, do the cropping on the feature map instead of on the original image, and then extract feature vectors for each region. For feature extraction, SPPnet pools the feature maps into multiple sizes and concatenates them as a spatial pyramid [15], while Fast R-CNN only uses a single scale of the feature maps. The feature sharing of SPPnet accelerates R-CNN by 10 to 100x at test time and 3x in training. However, it still has the same multiple-stage pipeline as R-CNN. In Fast R-CNN, Girshick proposes a new type of layer, the region of interest (RoI) pooling layer, to bridge the gap between feature maps and classifiers. With this layer, a semi end-to-end training framework is built that relies only on the full image input and object proposals. All the mentioned methods rely on external object proposal input. In [10], the authors proposed a proposal-free framework called Faster R-CNN. In Faster R-CNN, a region proposal network (RPN) slides over the last convolutional feature maps to generate bounding-box proposals at different scales and aspect ratios. These proposals are then fed back to Fast R-CNN as input. Another proposal-free work,
You Only Look Once (YOLO), is proposed in [16]. This network uses features from the entire image to predict object bounding boxes. Instead of sliding windows over the last convolutional feature maps, the network connects the feature map output to a 4096-dimensional fully connected layer, followed by another fully connected layer whose output is reshaped into a grid-shaped tensor that maps onto the input image. Each grid cell of the tensor is a 24-dimensional vector which encodes the bounding boxes and class probabilities of the object whose center falls into that grid cell in the original image. The YOLO network is 100 to 500x faster than Fast R-CNN based methods, though with less than an 8% mean average precision (mAP) drop on the VOC 2012 test set [17]. Some other specific R-CNN variants have also been proposed to solve different problems. The paper [18] presents an R-CNN based network with three loss functions combined, for the tasks of keypoint prediction (as a representation of pose) and action classification of people. R*CNN [19] adapts R-CNN to use not only one region but also contextual subregions for human detection and action classification. In [20], the authors proposed DeepID-Net with a deformation-constrained pooling layer, which models the deformation of object parts with geometric constraints and penalties. Furthermore, a broad survey of the recent advances in CNNs and their applications in computer vision, speech, and natural language processing is presented in [21]. On the other hand, there are also some effective laser-based methods for object detection, estimation, and tracking using machine learning approaches [22, 23, 24, 25]. A multi-modal system for detecting, tracking, and classifying objects in outdoor environments was presented in [26]. In this section, some important concepts related to general CNNs, including the structure of networks and the essential layers, will be covered.
In general, the architecture of a CNN can be decomposed into two stages: a hierarchical feature extraction stage and a classification stage. A typical CNN architecture is shown in Figure 3.1. An input image is convolved with a set of trainable filters (kernels), each followed by a nonlinear mapping (e.g. ReLU [27]), to produce so-called feature maps. Each feature map, containing particular features, is then partitioned into equal-sized, non-overlapping regions, and the maximum (or average) of each region is passed to the next layer (the sub-sampling layer), resulting in resolution-reduced feature maps with the depth unchanged. This operation tolerates small translations of the input image, so robust features that are invariant to translation are more likely to be found [28]. These two steps, convolution and subsampling, are alternated for two iterations in the CNN in Figure 3.1, and the resulting feature maps are fully connected with an MLP to perform classification. In some applications, the final fully connected layer that performs classification is replaced with another classifier, e.g. an SVM. For example, the state-of-the-art object detector R-CNN [9] extracts high-level features from the penultimate fully connected layer and feeds them to SVMs for classification [29].
Figure 3.1: A typical architecture of CNN (Wikipedia).
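To make the typical convolution–pooling–fully-connected structure of Figure 3.1 concrete, the following is a minimal, hedged sketch in Keras; the input size (64 × 64 × 3), filter counts, and number of classes (4) are illustrative assumptions, not the networks used in this thesis.

```python
import tensorflow as tf

# Two convolution + sub-sampling stages followed by a fully connected classifier,
# mirroring the generic architecture sketched in Figure 3.1.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, (3, 3), activation='relu', input_shape=(64, 64, 3)),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(4, activation='softmax'),   # classification stage
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()
```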
As mentioned in Section 3.3.1, CNNs are commonly made up of three main layer types: the convolutional layer, the pooling layer (usually subsampling), and the fully connected layer. These layers, along with other auxiliary layers not shown in Figure 3.1, are introduced below.
• Convolution Layer
The convolutional layer is the core building block of a CNN. The layer's parameters consist of a set of filters or kernels, which have a small receptive field but extend through the full depth of the input volume or image. The convolution operation replicates a filter across the entire image field to get the response of each location and form a response feature map. Given multiple filters, the network obtains a stack of feature maps forming a new 3D volume.

Formally, the convolution layer accepts a volume or image of size W × H × D from the previous layer as input, where W, H, and D are the image width, image height, and number of channels (depth), respectively. The layer defines K filters, each of shape F × F × D, where F is the kernel size. The convolution of the input volume with the filters produces an output volume of size W' × H' × K, where the new W' and H' depend on the filter size, stride, and padding settings of the convolution operation. In general, the output size of a convolution layer is given by:

– width: W' = (W − F + 2P)/S + 1,
– height: H' = (H − F + 2P)/S + 1,

where F is the filter size, P is the padding, and S is the stride. For instance, Figure 3.2 illustrates a 2D convolution in which an input volume is convolved with a single filter, producing an output volume whose size follows the formulas above.

Figure 3.2: An example of convolution operation in 2D [30].

– Stride and Padding
Two main parameters must be tuned, after choosing the filter size F, in order to modify the behavior of each layer: the stride and the padding. The stride, S, controls how the filter convolves around the input volume. In the previous example, S = 1, meaning that the filter convolves around the input volume by shifting one unit at a time; the amount by which the filter shifts is the stride. Moreover, as convolution layers are stacked, the spatial size of the volume decreases faster than we would like. Therefore, in order to preserve as much information about the original input volume as possible, so that low-level features can still be extracted, zero-padding is applied. Suppose we want to apply the same convolution layer but want the output volume to keep the same spatial size as the input. To do this, we can apply a zero-padding of size 1 to that layer. Zero-padding pads the input volume with zeros around the border. For a stride of 1, the zero-padding that preserves the input size is:

P = (F − 1)/2.    (3.1)

• Pooling Layer
Another important concept of CNNs is pooling, which is a form of non-linear down-sampling. It partitions the input image into a set of non-overlapping rectangles and, for each such sub-region, outputs the maximum (in the case of max-pooling). The function of the pooling layer is to reduce the spatial size of the representation, and hence reduce the number of parameters and the amount of computation in the network, and also to control over-fitting. There are several non-linear functions used to implement pooling, such as max-pooling, average-pooling, and stochastic pooling. The pooling layer operates independently on every depth slice of the input and resizes it spatially. Pooling is a translation-invariant operation, and the pooled image keeps the structural layout of the input image. Formally, a pooling layer accepts a volume of size W × H × D as input and outputs a volume of size W' × H' × D. The output width W' and height H' depend on the kernel size, stride, and padding settings, as shown in Figure 3.3. The produced output has dimensions:

– width: W' = (W − F)/S + 1, and
– height: H' = (H − F)/S + 1.
Figure 3.3: An example of average-pooling and max-pooling with a 2 × 2 filter and a stride of 2 [30].

• ReLU Layer

Figure 3.4: The ReLU activation function.

The rectified linear unit (ReLU) is one of the most notable non-saturating activation functions, and can be used by neurons just like any other activation function. The ReLU activation function is defined as (Figure 3.4):

f(x) = max(0, x).    (3.2)

ReLU is an element-wise operation (applied per pixel) and replaces all negative pixel values in the feature map by zero. It increases the nonlinear properties of the decision function and of the overall network without affecting the receptive fields of the convolution layer. For that reason, it is conventional to apply a ReLU layer immediately after each convolution layer. The main reason it is used is that it can be computed much more efficiently than conventional activation functions such as the sigmoid function f(x) = 1/(1 + e^(−x)) and the hyperbolic tangent function f(x) = tanh(x), without making a significant difference to generalization accuracy. Many works have shown empirically that ReLU works better than other activation functions [31, 32]. Moreover, recently used activation functions in CNNs based on ReLU, such as Leaky ReLU [31], Parametric ReLU [32], Randomized ReLU [33], and the Exponential Linear Unit (ELU) [34], were introduced in [21].

• Fully Connected Layer
Eventually, after several convolutional and max-pooling layers, the high-level reasoning in the neural network is done via fully connected layers. Neurons in a fully connected layer have full connections to all neurons in the previous layer. This provides a form of dense connectivity but loses the structural layout of the input image. Fully connected layers are usually inserted after the last convolutional layer to reduce the number of features and create a vector-like representation.
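The formulas in the preceding subsections can be checked with a short sketch; the sizes below are illustrative, and the loop-based pooling is written for clarity rather than speed.

```python
import numpy as np

def conv_output_size(w, f, p, s):
    """Spatial output size of a convolution: (W - F + 2P)/S + 1."""
    return (w - f + 2 * p) // s + 1

def relu(x):
    """Element-wise ReLU: f(x) = max(0, x)."""
    return np.maximum(0, x)

def max_pool2d(x, f=2, s=2):
    """Naive max-pooling over f x f windows with stride s on a 2D feature map."""
    h, w = x.shape
    out_h, out_w = (h - f) // s + 1, (w - f) // s + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i * s:i * s + f, j * s:j * s + f].max()
    return out

print(conv_output_size(w=32, f=3, p=1, s=1))   # 32: padding of 1 preserves the size
fmap = np.random.randn(4, 4)
print(max_pool2d(relu(fmap)))                  # 2 x 2 pooled, non-negative map
```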
• Loss Layer
It is important to choose an appropriate loss function for a specific task. The loss layer specifies the learning process by comparing the output of the network with the true label (or target) and minimizing the cost. Generally, the loss is calculated in the forward pass, and the gradient of the network parameters with respect to the loss is calculated by backpropagation. For multi-class classification problems, a softmax classifier with cross-entropy loss is commonly used. First, it takes the multi-class scores as input and uses the softmax function to normalize them into a distribution-like output. Then, the loss is computed as the cross-entropy between the target class probability distribution and the estimated distribution. The softmax function is defined as:

y(x)_i = exp(x_i) / Σ_{j=1}^{n} exp(x_j),    (3.3)

where 0 ≤ y(x)_i ≤ 1, Σ_{j=1}^{n} y(x)_j = 1, and i = 1, ..., n, with n the number of classes. The cross-entropy between the target distribution p and the estimated distribution q is given by

H(p, q) = − Σ_i p_i log q_i.    (3.4)

The purpose of the softmax classification layer is simply to transform all the network activations in the final output layer into a series of values that can be interpreted as probabilities; the softmax function is also known as the normalized exponential function. Recently used loss layers (e.g. hinge loss [35], L-Softmax loss [36], contrastive loss [37, 38], and triplet loss [39]) were presented in [21].

Chapter 4
Preparation Studies

Mobile robots are vehicles with the ability to change their positions. These robots can move on the ground, on the surface of water, under water, and in the air. Two modes can be used to operate mobile robots. One is the tele-operated mode, where movement instructions are given externally. The other mode is autonomous, where robots operate on the information they get from sensors and no external instructions are given. Wheeled mobile robots are one of the types of mobile robots extensively used in research and industry, as the wheel is the most popular locomotion mechanism in mobile robotics. One of the advantages of wheeled robots is that balancing is not a problem, as the robots are designed in such a way that all wheels are on the ground. Figure 4.1 shows examples of mobile robots.
Figure 4.1: Mobile robots (image source: https://robohub.org/robot-teams-create-supply-chain-to-deliver-energy-to-explorer-robots/).

The differential drive robot is probably the most common and most used mobile robot at the current time. A differential drive robot consists of two independently driven wheels that rotate about the same axis, as well as one or more caster wheels, ball casters, or low-friction sliders that keep the robot horizontal. This is the case for the robot in our simulation. Figure 4.2 shows a visual representation of the system.

Figure 4.2: Differential robot model [40].

For the differential drive system, two parameters need to be known:

• L: the distance between the wheels of the robot, also known as the wheel base.
• R: the radius of the wheels of the robot.

These parameters are relatively easy to measure in any system. On a real robot, they can simply be measured with a ruler; on a simulated robot, they can be extracted from the unified robot description format (URDF) file of the robot. From the visual representation above, the two inputs of the system are:

• v_r: the rate at which the right wheel is turning,
• v_l: the rate at which the left wheel is turning.

So, in order to have the kinematic model of the system, we need a set of equations that connect the inputs of the system with its outputs. For the differential drive robot, these are:

ẋ = (R/2)(v_r + v_l) cos(θ),
ẏ = (R/2)(v_r + v_l) sin(θ),
θ̇ = (R/L)(v_r − v_l).    (4.1)

The Robot Operating System (ROS) allows you to stop reinventing the wheel, and reinventing the wheel is one of the main killers of new innovative applications. The goal of ROS is to provide a standard for robotics software development that can be used on any robot. Whether you are programming a mobile robot, a robotic arm, a drone, a boat, or a vending machine, you can use ROS. This standard allows you to focus on the key features of your application, using an existing foundation, instead of trying to do everything yourself. ROS is more of a middleware, something like a low-level "framework" built on top of an existing operating system; the main supported operating system for ROS is Ubuntu, and ROS has to be installed on the operating system in order to be used. ROS is mainly composed of two things:

• a core (middleware) with communication tools, and
• a set of plug & play libraries.
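Returning briefly to the differential-drive model (4.1) above, the following is a minimal numerical sketch of its forward kinematics; the wheel radius, wheel base, wheel speeds, and time step are illustrative assumptions, not the parameters of the robot used in this thesis.

```python
import math

R = 0.033   # wheel radius [m] (illustrative value)
L = 0.160   # wheel base  [m] (illustrative value)

def step(x, y, theta, v_r, v_l, dt):
    """One Euler-integration step of the kinematic model (4.1)."""
    x_dot = (R / 2.0) * (v_r + v_l) * math.cos(theta)
    y_dot = (R / 2.0) * (v_r + v_l) * math.sin(theta)
    theta_dot = (R / L) * (v_r - v_l)
    return x + x_dot * dt, y + y_dot * dt, theta + theta_dot * dt

pose = (0.0, 0.0, 0.0)
for _ in range(100):                               # 1 s of motion at dt = 0.01 s
    pose = step(*pose, v_r=5.0, v_l=4.0, dt=0.01)  # unequal wheel speeds -> gentle turn
print(pose)
```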
In more detail, the middleware is responsible for handling the communication between programs in a distributed system (as shown in Figure 4.3).

Figure 4.3: ROS with libraries [41] (source: https://roboticsbackend.com/what-is-ros/).

ROS comes with three main communication tools:

• Topics: these are used mainly for sending data streams between nodes. Example: you are monitoring the temperature of a motor on the robot. The node monitoring this motor sends a data stream with the temperature, and any other node can subscribe to this topic and get the data.
• Services: these allow you to create a simple synchronous client/server communication between nodes. They are very useful for changing a setting on the robot or asking for a specific action: enabling freedrive mode, asking for specific data, etc.
• Actions: a little more complex, these are in fact built on topics. They provide an asynchronous client/server architecture, where the client can send a request that takes a long time (e.g. asking the robot to move to a new location). The client can asynchronously monitor the state of the server and cancel the request at any time.

Robots must sense the world around them in order to react to variations in tasks and environments. The sensors can range from minimalist setups designed for quick installation to highly elaborate and tremendously expensive sensor rigs. Many successful industrial deployments use surprisingly little sensing. A remarkable number of complex and intricate industrial manipulation tasks can be performed through a combination of clever mechanical engineering and limit switches, which close or open an electrical circuit when a mechanical lever or plunger is pressed, in order to start execution of a pre-programmed robotic manipulation sequence. Through careful mechanical setup and tuning, these systems can achieve amazing levels of throughput and reliability. It is important, then, to consider these binary sensors when enumerating the world of robotic sensing. These sensors are typically either "on" or "off." In addition to mechanical limit switches, other binary sensors include optical limit switches, which use a mechanical "flag" to interrupt a light beam, and bump sensors, which channel mechanical pressure along a relatively large distance to a single mechanical switch. These relatively simple sensors are a key part of modern industrial automation equipment, and their importance can hardly be overstated. Another class of sensors returns scalar readings. For example, a pressure sensor can estimate the mechanical or barometric pressure and will typically output a scalar value within some range of sensitivity chosen at the time of manufacture. Range sensors can be constructed from many physical phenomena (sound, light, etc.) and will also typically return a scalar value in some range, which seldom includes zero or infinity! Each sensor class has its own quirks that distort its view of reality and must be accommodated by sensor-processing algorithms. These quirks can often be surprisingly severe. For example, a range sensor may have a "minimum distance" restriction: if an object is closer than that minimum distance, it will not be sensed. As a result of these quirks, it is often advantageous to combine several different types of sensors in a robotic system.

Higher-order animals tend to rely on visual data to react to the world around them. If only robots were as smart as animals!
Unfortunately, using camera data intelligently is surprisingly difficult. However, cameras are cheap and often useful for tele-operation, so it is common to see them on robot sensor heads. Interestingly, it is often more mathematically robust to describe robot tasks and environments in three dimensions (3D) than to work with 2D camera images, because the 3D shapes of tasks and environments are invariant to changes in scene lighting, shadows, occlusions, and so on. In fact, in a surprising number of application domains the visual data is largely ignored; the algorithms are interested in 3D data. As a result, intense research efforts have been expended on producing 3D data of the scene in front of the robot. When two cameras are rigidly mounted to a common mechanical structure, they form a stereo camera. Each camera sees a slightly different view of the world, and these slight differences can be used to estimate the distances to various features in the image. This sounds simple, but as always, the devil is in the details. The performance of a stereo camera depends on a large number of factors, such as the quality of the camera's mechanical design, its resolution, its lens type and quality, and so on. Equally important are the qualities of the scene being imaged: a stereo camera can only estimate the distances to mathematically discernible features in the scene, such as sharp, high-contrast corners. A stereo camera cannot, for example, estimate the distance to a featureless wall, although it can most likely estimate the distance to the corners and edges of the wall if they intersect a floor, ceiling, or another wall of a different color. Many natural outdoor scenes possess sufficient texture that stereo vision can be made to work quite well for depth estimation; uncluttered indoor scenes, however, can often be quite difficult. Several conventions have emerged in the ROS community for handling cameras. The canonical ROS message type for images is sensor_msgs/Image, and it contains little more than the size of the image, its pixel encoding scheme, and the pixels themselves. To describe the intrinsic distortion of the camera resulting from its lens and sensor alignment, the sensor_msgs/CameraInfo message is used. Often, these ROS images need to be sent to and from OpenCV, a popular computer vision library.
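As a small illustration of that last point, the following is a hedged sketch of a ROS node that converts incoming sensor_msgs/Image messages into OpenCV arrays with cv_bridge; the topic name /camera/image_raw is an assumption and depends on the camera driver in use.

```python
import rospy
import cv2
from sensor_msgs.msg import Image
from cv_bridge import CvBridge

bridge = CvBridge()

def image_callback(msg):
    # Convert the ROS image message into an OpenCV BGR array
    frame = bridge.imgmsg_to_cv2(msg, desired_encoding='bgr8')
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    rospy.loginfo("Got %dx%d image, mean intensity %.1f", msg.width, msg.height, gray.mean())

rospy.init_node('image_listener')
rospy.Subscriber('/camera/image_raw', Image, image_callback)  # topic name is an assumption
rospy.spin()
```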
Figure 4.4: An image processing example of ROS architecture.

The camera node publishes images on a topic named image_data, which is subscribed to by both the image display node and the image processing node. The ROS master tracks publishers and subscribers, enabling individual nodes to locate and message each other.

As discussed in the previous section, even though visual camera data is intuitively appealing, and seems like it should be useful somehow, many perception algorithms work much better with 3D data. Fortunately, the past few years have seen massive progress in low-cost depth cameras. Unlike the passive stereo cameras described in the previous section, depth cameras are active devices: they illuminate the scene in various ways, which greatly improves the system performance. For example, a completely featureless indoor wall or surface is essentially impossible to detect using passive stereo vision. However, many depth cameras will shine a texture pattern on the surface, which is subsequently imaged by the camera. The texture pattern and camera are typically set to operate at near-infrared wavelengths, to reduce the system's sensitivity to the colors of objects as well as to avoid distracting people nearby.

Figure 4.5: Depth camera.

Some common depth cameras, such as the Microsoft Kinect shown in Figure 4.5, project a structured light image. The device projects a precisely known pattern into the scene, its camera observes how this pattern is deformed as it lands on the various objects and surfaces of the scene, and finally a reconstruction algorithm estimates the 3D structure of the scene from this data. It is hard to overstate the impact that the Kinect has had on modern robotics! It was designed for the gaming market, which is orders of magnitude larger than the robotics sensor market and could justify massive expenditures for the development and production of the sensor. The launch price of $150 was incredibly cheap for a sensor capable of outputting so much useful data. Many robots were quickly retrofitted to hold Kinects, and the sensor continues to be used across research and industry. Although the Kinect is the most famous (and certainly the most widely used) depth camera in robotics, many other depth-sensing schemes are possible. For example, unstructured light depth cameras employ "standard" stereo-vision algorithms with random texture injected into the scene by some sort of projector. This scheme has been shown to work far better than passive stereo systems in feature-scarce environments, such as many indoor scenes. A different approach is used by time-of-flight depth cameras. These imagers rapidly blink an infrared light-emitting diode (LED) or laser illuminator, while using specially designed pixel structures in their image sensors to estimate the time required for these light pulses to fly into the scene and bounce back to the depth camera. Once this "time of flight" is estimated, the (constant) speed of light can be used to convert the estimates into a depth image, as illustrated in Figure 4.6.
Figure 4.6: Principle of operation of a time-of-flight camera (source: https://en.wikipedia.org/wiki/Time-of-flight_camera).

Intense research and development is occurring in this domain, due to the enormous existing and potential markets for depth cameras in video games and other mass-market user-interaction scenarios. It is not yet clear which (if any) of the schemes discussed previously will end up being best suited for robotics applications; at the time of writing, cameras using all of the previous modalities are in common usage in robotics experiments. Just like visual cameras, depth cameras produce an enormous amount of data. This data is typically in the form of point clouds: the 3D points estimated to lie on the surfaces facing the camera. The fundamental point cloud message is sensor_msgs/PointCloud2 (so named purely for historical reasons). This message allows for unstructured point cloud data, which is often advantageous, since depth cameras often cannot return valid depth estimates for each pixel in their images. As such, depth images often have substantial "holes," which processing algorithms must handle gracefully.

Although depth cameras have greatly changed the depth-sensing market in the last few years due to their simplicity and low cost, there are still applications in which laser scanners (Figure 4.7) are widely used due to their superior accuracy and longer sensing range. There are many types of laser scanners, but one of the most common schemes used in robotics involves shining a laser beam on a rotating mirror spinning around 10 to 80 times per second (typically 600 to 4,800 RPM). As the mirror rotates, the laser light is pulsed rapidly, and the reflected waveforms are correlated with the outgoing waveform to estimate the time of flight of the laser pulse for a series of angles around the scanner.

Figure 4.7: Laser scanner diagram.

Laser scanners used for autonomous vehicles are considerably different from those used for indoor or slow-moving robots. Vehicle laser scanners made by companies such as Velodyne must deal with the significant aerodynamic forces, vibrations, and temperature swings common to the automotive environment. Since vehicles typically move much faster than smaller robots, vehicle sensors must also have considerably longer range so that sufficient reaction time is possible. Additionally, many software tasks for autonomous driving, such as detecting vehicles and obstacles, work much better when multiple laser scan lines are received each time the device rotates, rather than just one. These extra scan lines can be extremely useful when distinguishing between classes of objects, such as between trees and pedestrians. To produce multiple scan lines, automotive laser scanners often have multiple lasers mounted together in a rotating structure, rather than simply rotating a mirror. All of these additional features naturally add to the complexity, weight, size, and thus the cost of the laser scanner. The complex signal processing steps required to produce range estimates are virtually always handled by the firmware of the laser scanner itself. The devices typically output a vector of ranges several dozen times per second, along with the starting and stopping angles of each measurement vector. In ROS, laser scans are stored in sensor_msgs/LaserScan messages, which map directly from the output of the laser scanner.
Each manufacturer, of course, has its own raw message format, but ROS drivers exist to translate between the raw output of many popular laser scanner manufacturers and the sensor_msgs/LaserScan message format.

The TurtleBot is the robot used in this thesis. The TurtleBot was designed in 2011 as a minimalist platform for ROS-based mobile robotics education and prototyping. It has a small differential-drive mobile base with an internal battery, power regulators, and charging contacts. Atop this base is a stack of laser-cut "shelves" that provide space to hold a netbook computer and a depth camera, plus lots of open space for prototyping. To control cost, the TurtleBot relies on a depth camera for range sensing; it does not have a laser scanner. Despite this, mapping and navigation can work quite well for indoor spaces. TurtleBots are available from several manufacturers for less than $2,000. More information is available at http://turtlebot.org.

Figure 4.8: TurtleBot Burger.

Because the shelves of the TurtleBot (Figure 4.8) are covered with mounting holes, many owners have added additional subsystems to their TurtleBots, such as small manipulator arms, additional sensors, or upgraded computers. However, the "stock" TurtleBot is an excellent starting point for indoor mobile robotics. Many similar systems exist from other vendors, such as the Pioneer and Erratic robots, and thousands of custom-built mobile robots around the world. The examples here use the TurtleBot, but any other small differential-drive platform could easily be substituted.
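Since the navigation stack discussed next consumes planar laser data, the following is a minimal sketch of reading sensor_msgs/LaserScan messages in rospy; the /scan topic name is an assumption (it follows the common TurtleBot convention), and the node simply reports the closest valid return.

```python
import rospy
from sensor_msgs.msg import LaserScan

def scan_callback(msg):
    # msg.ranges holds one distance per beam; angle_min/angle_increment give the beam angles
    valid = [r for r in msg.ranges if msg.range_min < r < msg.range_max]
    if valid:
        rospy.loginfo("Closest obstacle: %.2f m (%d valid returns)", min(valid), len(valid))

rospy.init_node('scan_listener')
rospy.Subscriber('/scan', LaserScan, scan_callback)  # '/scan' is an assumed topic name
rospy.spin()
```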
ROS has a set of resources that enable a robot to navigate through its environment; in other words, the robot is capable of planning and following a path while deviating from obstacles that appear along the course. These resources are found in the navigation stack. Among the many resources needed for completing this task, and present in the navigation stack, are the localization systems, which allow a robot to locate itself, whether a static map is available or simultaneous localization and mapping is required. Adaptive Monte Carlo Localization (AMCL) is a tool that allows the robot to locate itself in an environment using a static, previously created map. The disadvantage of this resource is that, because a static map is used, the environment surrounding the robot cannot undergo any modification: a new map would have to be generated for each modification, and this task would consume computational time and effort. Being able to navigate only in modification-free environments is not enough, since robots should be able to operate in places like industrial facilities and schools, where there is constant movement. To bypass the lack of flexibility of static maps, two other localization systems are offered by the navigation stack: gmapping and hector mapping. Both gmapping and hector mapping are based on simultaneous localization and mapping (SLAM), a technique that consists of mapping an environment at the same time the robot is moving; in other words, while the robot navigates through an environment, it gathers information from the environment through its sensors and generates a map. This way, the mobile base is able not only to generate a map of an unknown environment but also to update an existing map, thus enabling the use of the device in more generic environments that are not immune to changes. The difference between gmapping and hector mapping is that the former takes odometry information into account to generate and update the map and the robot's pose; however, the robot needs to have encoders, preventing some robots (e.g. flying robots) from using it. Odometry information is valuable because it helps generate more precise maps, since by understanding the robot dynamics we can estimate its pose. The dynamic behaviour of the robot is also known as its kinematics. Kinematics is influenced, basically, by the way the devices that produce the robot's movement are assembled. Some examples of mechanical features that influence the kinematics are the wheel type, the number of wheels, the wheel positioning, and the angle at which the wheels are mounted. However, as useful as odometry information can be, it is not immune to faults. The faults are caused by a lack of precision in the measurements, friction, slip, drift, and other factors, and, over time, they may accumulate, producing inconsistent data and degrading the map being built, which tends to be distorted under these circumstances. Other data indispensable for generating a map are the sensors' distance readings, because they are responsible for detecting the external world and thus serve as a reference for the robot. Nonetheless, the data gathered by the sensors must be adjusted before being used by the device.
These adjustments are needed because the sensors measure the environment relative to themselves, not relative to the robot; in other words, a geometric conversion is needed. To make this conversion simpler, ROS offers the TF tool, which makes it possible to express the sensors' positions relative to the robot and, in this way, adapt the measurements for the robot's navigation.
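As a small illustration of this idea, the sketch below (our own, not part of the original system) broadcasts a fixed transform from a base_link frame to a laser frame using tf, so that laser readings can later be expressed in the robot's frame; the frame names and the offsets are illustrative assumptions.

```python
#!/usr/bin/env python
# Minimal sketch: broadcast a fixed transform base_link -> laser with tf.
# Frame names and the 20 cm forward / 10 cm up offset are illustrative assumptions.
import rospy
import tf

if __name__ == "__main__":
    rospy.init_node("laser_tf_broadcaster")
    broadcaster = tf.TransformBroadcaster()
    rate = rospy.Rate(10)  # re-publish periodically, as tf consumers expect
    while not rospy.is_shutdown():
        broadcaster.sendTransform(
            (0.20, 0.0, 0.10),     # translation (x, y, z) in metres
            (0.0, 0.0, 0.0, 1.0),  # identity rotation quaternion
            rospy.Time.now(),
            "laser",               # child frame
            "base_link")           # parent frame
        rate.sleep()
```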
The Navigation Stack
The ROS Navigation Stack is generic. That means it can be used with almost any type of mobile robot, but there are some hardware considerations that will help the whole system perform better, so they must be taken into account. These are the requirements:
1. The Navigation package works better with differential-drive and holonomic robots. Also, the mobile robot should be controlled by sending velocity commands of the form x, y (linear velocity) and z (angular velocity).
2. The robot should mount a planar laser somewhere on its body. It is used to build the map of the environment and perform localization.
3. Performance is better for square- and circular-shaped mobile bases.
Figure 4.9: The navigation stack diagram.
According to the diagram shown, we must provide some functional blocks in order to work and communicate with the Navigation stack. The following are brief explanations of all the blocks which need to be provided as input to the ROS Navigation stack:
• Odometry source: Odometry data gives the robot position with respect to its starting position. The main odometry sources are wheel encoders, inertial measurement units (IMU), and 2D/3D cameras (visual odometry). The odom value should be published to the Navigation stack with the message type nav_msgs/Odometry. The odom message can hold the position and the velocity of the robot.
• Sensor source: Sensors are used for two tasks in navigation: localizing the robot in the map (using, for example, the laser) and detecting obstacles in the path of the robot (using the laser, sonars, or point clouds).
• Sensor transforms (tf): The data captured by the different robot sensors must be referenced to a common frame of reference (usually base_link) in order to be able to compare data coming from different sensors. The robot should publish the relationship between the main robot coordinate frame and the different sensors' frames using ROS transforms.
• base_controller: The main function of the base controller is to convert the output of the Navigation stack, which is a Twist (geometry_msgs/Twist) message, into corresponding motor velocities for the robot; a minimal sketch of such a conversion is given below.
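The following is a minimal, hedged sketch of how such a base controller could look (our own assumption, not the TurtleBot driver): it subscribes to cmd_vel and converts a Twist into left and right wheel speeds using standard differential-drive kinematics; the wheel separation and radius values are placeholders.

```python
#!/usr/bin/env python
# Sketch of a base controller: Twist (cmd_vel) -> wheel angular velocities.
# Wheel geometry values and motor command topics are illustrative placeholders.
import rospy
from geometry_msgs.msg import Twist
from std_msgs.msg import Float64

WHEEL_SEPARATION = 0.23  # metres (assumed)
WHEEL_RADIUS = 0.035     # metres (assumed)

def cmd_vel_callback(msg):
    v = msg.linear.x    # forward velocity [m/s]
    w = msg.angular.z   # yaw rate [rad/s]
    # Differential-drive kinematics: wheel linear speeds, then angular speeds.
    v_left = v - w * WHEEL_SEPARATION / 2.0
    v_right = v + w * WHEEL_SEPARATION / 2.0
    left_pub.publish(Float64(v_left / WHEEL_RADIUS))
    right_pub.publish(Float64(v_right / WHEEL_RADIUS))

if __name__ == "__main__":
    rospy.init_node("base_controller_sketch")
    left_pub = rospy.Publisher("left_wheel_cmd", Float64, queue_size=1)
    right_pub = rospy.Publisher("right_wheel_cmd", Float64, queue_size=1)
    rospy.Subscriber("cmd_vel", Twist, cmd_vel_callback)
    rospy.spin()
```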
The move_base node
This is the most important node of the Navigation Stack; it is where most of the "magic" happens. The main function of the move_base node is to move a robot from its current position to a goal position with the help of the other Navigation nodes. This node links the global planner and the local planner for path planning, connects to the rotate recovery package if the robot gets stuck near an obstacle, and connects to the global costmap and local costmap for obtaining the obstacle map of the environment. The following is the list of all the packages which are linked by the move_base node (a minimal sketch of sending a goal to this node is given after these lists):
• global-planner
• local-planner
• rotate-recovery
• costmap-2D
The following are the other packages which are interfaced to the move_base node:
• map-server
• AMCL
• gmapping
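As referenced above, the sketch below (our own hedged example) sends a single goal pose to move_base through its action interface; the frame name "map" and the goal coordinates are illustrative assumptions.

```python
#!/usr/bin/env python
# Minimal sketch: send one navigation goal to the move_base action server.
# Goal coordinates and the "map" frame are illustrative assumptions.
import rospy
import actionlib
from move_base_msgs.msg import MoveBaseAction, MoveBaseGoal

if __name__ == "__main__":
    rospy.init_node("send_goal_sketch")
    client = actionlib.SimpleActionClient("move_base", MoveBaseAction)
    client.wait_for_server()

    goal = MoveBaseGoal()
    goal.target_pose.header.frame_id = "map"
    goal.target_pose.header.stamp = rospy.Time.now()
    goal.target_pose.pose.position.x = 1.0     # one metre ahead in the map frame
    goal.target_pose.pose.orientation.w = 1.0  # no rotation

    client.send_goal(goal)
    client.wait_for_result()
    rospy.loginfo("Navigation finished with state: %d", client.get_state())
```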
Figure 4.10: The move_base node.
Robot localization is the process of determining where a mobile robot is located with respect to its environment. Localization is one of the most fundamental competencies required by an autonomous robot, as knowledge of the robot's own location is an essential precursor to making decisions about future actions. In a typical robot localization scenario, a map of the environment is available and the robot is equipped with sensors that observe the environment as well as monitor its own motion. The localization problem then becomes one of estimating the robot's position and orientation within the map using information gathered from these sensors. Robot localization techniques need to be able to deal with noisy observations and to generate not only an estimate of the robot's location but also a measure of the uncertainty of that estimate. Robot localization provides an answer to the question: where is the robot now? A reliable answer to this question is required for performing useful tasks, as knowledge of the current location is essential for deciding what to do next.
Monte Carlo Localization
Because the robot may not always move as expected, it generates many random guesses as to where it will move next. These guesses are known as particles. Each particle contains a full description of a possible future pose. When the robot observes the environment (via sensor readings), it discards particles that do not match these readings and generates more particles close to those that look more probable. This way, in the end, most of the particles converge on the most probable pose of the robot. So the more the robot moves, the more data it gets from its sensors, and hence the more precise the localization becomes. These particles are the arrows shown in RViz in the next figure.
Figure 4.11: Monte Carlo localization.
Monte Carlo localization (MCL) [42], also known as particle filter localization, is an algorithm for robots to localize using a particle filter. Given a map of the environment, the algorithm estimates the position and orientation of a robot as it moves and senses the environment. The algorithm uses a particle filter to represent the distribution of likely states, with each particle representing a possible state, i.e., a hypothesis of where the robot is. The algorithm typically starts with a uniform random distribution of particles over the configuration space, meaning the robot has no information about where it is and assumes it is equally likely to be at any point in space. Whenever the robot moves, it shifts the particles to predict its new state after the movement. Whenever the robot senses something, the particles are resampled based on recursive Bayesian estimation, i.e., on how well the actual sensed data correlate with the predicted state. Ultimately, the particles should converge towards the actual position of the robot.
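To make the predict/weight/resample cycle just described concrete, the following is a deliberately simplified, self-contained sketch of an MCL-style update for a robot moving in one dimension along a corridor with known landmark walls; everything here (the toy map, the noise values) is our own assumption for illustration, not the amcl implementation.

```python
# Toy illustration of the MCL predict / weight / resample cycle in 1D.
# The "map" is a set of known wall positions; the sensor measures the distance
# to the nearest wall ahead. All values are assumptions made for illustration.
import math
import random

WALLS = [2.0, 5.0, 9.0]  # known landmark positions along the corridor (assumed map)

def nearest_wall_distance(x):
    ahead = [w - x for w in WALLS if w >= x]
    return min(ahead) if ahead else float("inf")

def mcl_step(particles, motion, measurement, motion_noise=0.1, sensor_noise=0.3):
    # 1) Predict: shift every particle by the commanded motion plus noise.
    particles = [p + motion + random.gauss(0.0, motion_noise) for p in particles]
    # 2) Weight: particles whose expected measurement matches the real one score higher.
    weights = []
    for p in particles:
        error = nearest_wall_distance(p) - measurement
        weights.append(math.exp(-0.5 * (error / sensor_noise) ** 2))
    total = sum(weights)
    if total == 0.0:
        weights = [1.0 / len(particles)] * len(particles)  # fall back to uniform
    else:
        weights = [w / total for w in weights]
    # 3) Resample: draw a new particle set proportionally to the weights.
    return random.choices(particles, weights=weights, k=len(particles))

# Usage: start with a uniform guess over the corridor, then fuse one motion + one reading.
particles = [random.uniform(0.0, 10.0) for _ in range(500)]
particles = mcl_step(particles, motion=0.5, measurement=1.5)
print("Estimated position:", sum(particles) / len(particles))
```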
The AMCL Package
In order to navigate around a map autonomously, a robot needs to be able to localize itself in the map, and this is precisely the functionality that the amcl node (of the amcl package) provides. To achieve this, the amcl node uses the MCL (Monte Carlo Localization) algorithm. The AMCL (Adaptive Monte Carlo Localization) package provides the amcl node, which uses the MCL system in order to track the localization of a robot moving in a 2D space. This node subscribes to the laser data, the laser-based map, and the transformations of the robot, and publishes its estimated pose in the map. On startup, the amcl node initializes its particle filter according to the parameters provided. Basically, the amcl node takes data from the laser and the odometry of the robot, as well as from the map of the environment, and outputs an estimated pose of the robot. The more the robot moves around the environment, the more data the localization system gets, and the more precise the returned pose estimate becomes.
The Navfn planner is probably the most commonly used global planner in ROS Navigation. It uses Dijkstra's algorithm to calculate the shortest path between the initial pose and the goal pose. navfn provides a fast, interpolated navigation function that can be used to create plans for a mobile base. The planner assumes a circular robot and operates on a costmap to find a minimum-cost plan from a start point to an end point in a grid. The navigation function is computed with Dijkstra's algorithm, but support for an A* heuristic may also be added in the near future. (For supplementary reading visit http://wiki.ros.org/navfn.)
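To illustrate the kind of computation such a grid-based global planner performs, here is a small, self-contained sketch of Dijkstra's algorithm on a 2D occupancy grid (our own toy example under stated assumptions, not the navfn code); cells marked 1 are obstacles and 4-connectivity with uniform cost is assumed.

```python
# Toy sketch of Dijkstra's algorithm on an occupancy grid (0 = free, 1 = obstacle).
# This only illustrates the idea behind a grid-based global planner, not navfn itself.
import heapq

def dijkstra(grid, start, goal):
    rows, cols = len(grid), len(grid[0])
    dist = {start: 0}
    parent = {}
    queue = [(0, start)]
    while queue:
        d, cell = heapq.heappop(queue)
        if cell == goal:
            break
        if d > dist.get(cell, float("inf")):
            continue  # stale queue entry
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                nd = d + 1  # uniform cost per move
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    parent[(nr, nc)] = cell
                    heapq.heappush(queue, (nd, (nr, nc)))
    if goal not in dist:
        return None
    # Reconstruct the path by walking back through the parent links.
    path, cell = [goal], goal
    while cell != start:
        cell = parent[cell]
        path.append(cell)
    return list(reversed(path))

grid = [[0, 0, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 0, 0]]
print(dijkstra(grid, start=(0, 0), goal=(2, 0)))
```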
Carrot Planner
The carrot planner takes the goal pose and checks whether this goal lies inside an obstacle. If it does, the planner walks back along the vector between the goal and the robot until a goal point that is not inside an obstacle is found. It then passes this goal point on as a plan to a local planner or controller. Therefore, this planner does not do any global path planning. It is helpful if you require your robot to move close to the given goal, even if the goal is unreachable. In complicated indoor environments this planner is not very practical. The algorithm can be useful if, for instance, you want your robot to move as close as possible to an obstacle, but we use another global planner instead. (For supplementary reading visit http://wiki.ros.org/carrot_planner?distro=noetic.)
Global Planner
The global planner is a more flexible replacement for the navfn planner. It allows you to change the algorithm used by navfn (Dijkstra's algorithm) to calculate paths. The options include support for A* [43], toggling quadratic approximation, and toggling grid path. (For supplementary reading visit http://wiki.ros.org/global_planner?distro=noetic.)
Figure 4.12: (a) Standard behavior and (b) simple potential calculation paths.
Chapter 5
Proposed Approach and Simulation Results
ROS [44] is a flexible platform for building robotics software applications. Its collection of tools, libraries and conventions greatly simplifies the task of building complex and robust robotic behaviours. In addition, ROS was created to encourage collaborative robotics software development across the world. The ecosystem of ROS is illustrated in Fig. 5.2. The file system and node representation in ROS are extremely helpful in organizing and building robotics tasks.
Figure 5.1: Difference between containers and virtual machines.
ROS offers a message-passing interface that provides inter-process communication, referred to as middleware. The middleware provides facilities such as publish/subscribe anonymous message passing, recording and playback of messages, remote procedure calls, and a distributed parameter system. In addition, ROS provides common robot-specific features that help in running basic and core robotics functions. The offered features include standard message definitions for robots, a robot geometry library, a robot description language (URDF), and pose estimation and localization tools. Perhaps the most well-known tool in ROS is Rviz. Rviz provides general-purpose, three-dimensional visualization of many sensor data types and of any URDF-described robot. We can easily visualize the laser scan data, the robot's odometry, the environment map, and many other topics that the robot subscribes to. Rviz can be seen as a tool to visualize what your robot can see. Another useful tool in ROS is rqt. Using the rqt_graph plugin we can introspect and visualize a live ROS system, showing nodes and the connections between them, making it easy to debug and understand the running system and how it is structured. For all the mentioned features, we have chosen ROS as the software platform to develop and test our robot's navigation system on.
Figure 5.2: ROS ecosystem.
The ROS version we use is Kinetic Kame, which is the 10th official ROS release and is supported on our operating system, Ubuntu Xenial.
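As a small, hedged illustration of the publish/subscribe middleware mentioned above (our own example; the "chatter" topic name is an assumption), two minimal rospy nodes can exchange messages like this:

```python
#!/usr/bin/env python
# Minimal publish/subscribe sketch with rospy; the "chatter" topic name is an assumption.
import rospy
from std_msgs.msg import String

def talker():
    rospy.init_node("talker")
    pub = rospy.Publisher("chatter", String, queue_size=10)
    rate = rospy.Rate(1)  # publish at 1 Hz
    while not rospy.is_shutdown():
        pub.publish(String(data="hello from the talker"))
        rate.sleep()

def listener():
    rospy.init_node("listener")
    rospy.Subscriber("chatter", String, lambda msg: rospy.loginfo(msg.data))
    rospy.spin()

if __name__ == "__main__":
    talker()  # run listener() instead in a second process
```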
Rviz stands for ROS visualization. It is a general-purpose 3D visualization environment for robots, sensors, and algorithms. Like most ROS tools, it can be used with any robot and rapidly configured for a particular application. rviz can plot a variety of data types streaming through a typical ROS system, with heavy emphasis on the three-dimensional nature of the data. In ROS, all forms of data are attached to a frame of reference. For example, the camera on a TurtleBot is attached to a reference frame defined relative to the center of the TurtleBot's mobile base. The odometry reference frame, often called odom, is taken by convention to have its origin at the location where the robot was powered on, or where its odometers were most recently reset. Each of these frames can be useful for teleoperation, but it is often desirable to have a "chase" perspective, which is immediately behind the robot, looking over its "shoulders". This is because simply viewing the robot's camera frame can be deceiving: the field of view of a camera is often much narrower than what we are used to as humans, and thus it is easy for teleoperators to bump the robot's shoulders when turning corners. A sample view of rviz configured to generate a chase perspective is shown in Figure 5.3. Observing the sensor data in the same 3D view as a rendering of the robot's geometry can make teleoperation more intuitive.
Figure 5.3: Sample view of rviz configured to generate a chase perspective [45].
In general, robot motions can be divided into mobility and manipulation. The mobility aspects can be handled by two- or three-dimensional simulations in which the environment around the robot is static. Simulating manipulation, however, requires a significant increase in the complexity of the simulator, which must handle the dynamics not just of the robot but also of the dynamic models in the scene. For example, at the moment a simulated household robot is picking up a handheld object, contact forces must be computed between the robot, the object, and the surface the object was previously resting upon.
Figure 5.4: Gazebo [45].
Simulators often use rigid-body dynamics, in which all objects are assumed to be incompressible, as if the world were a giant pinball machine. This assumption drastically improves the computational performance of the simulator, but often requires clever tricks to remain stable and realistic, since many rigid-body interactions become point contacts that do not accurately model the true physical phenomena. The art and science of managing the tension between computational performance and physical realism are highly nontrivial. There are many approaches to this trade-off, each well suited to some domains and ill suited to others.
Civilian drones are soon expected to be used in a wide variety of tasks, such as aerial surveillance, delivery, or monitoring of existing architectures. Nevertheless, their deployment in urban environments has so far been limited. Indeed, in unstructured and highly dynamic scenarios, drones face numerous challenges to navigating autonomously in a feasible and safe way, in contrast to the settings assumed by traditional "map-localize-plan" methods. This is addressed by DroNet: a convolutional neural network that can safely drive a drone through the streets of a city. Designed as a fast 8-layer residual network, DroNet produces two outputs for each single input image: a steering angle to keep the drone navigating while avoiding obstacles, and a collision probability to let the unmanned aerial vehicle (UAV) recognize dangerous situations and promptly react to them. The challenge, however, is to collect enough data in an unstructured outdoor environment such as a city.
Figure 5.5: DroNet architecture [2]. DroNet is a forked convolutional neural network that predicts, from a single gray-scale frame, a steering angle and a collision probability. The shared part of the architecture consists of a ResNet-8 with three residual blocks, followed by dropout and a ReLU non-linearity. Afterwards, the network branches into two separate fully-connected layers, one to carry out steering prediction and the other to infer collision probability. In the notation of the figure, each convolution is indicated first by the kernel size, then by the number of filters, and finally by the stride if it is different from 1.
The approach aims at reactively predicting a steering angle and a probability of collision from the drone's on-board forward-looking camera. These are later converted into flight control commands which enable a UAV to safely navigate while avoiding obstacles. Since the aim is to reduce the bare image processing time, the authors advocate a single convolutional neural network (CNN) with a relatively small size; the resulting network is called DroNet. The architecture is partially shared by the two tasks to reduce the network's complexity and processing time, but is separated into two branches at the very end. Steering prediction is a regression problem, while collision prediction is addressed as a binary classification problem. Due to their different nature and output range, the last fully-connected layer of the network is separated. During the training procedure, only images recorded by manned vehicles are used: steering angles are learned from images captured from a car, while the probability of collision is learned from a bicycle.
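As a rough, hedged illustration of such a forked architecture (a sketch under our own assumptions about input size and layer widths, not the published DroNet implementation), a Keras model with a shared residual trunk and two output heads could look like this:

```python
# Sketch of a DroNet-like forked CNN: shared residual trunk, two heads
# (steering regression, collision classification). Layer sizes are illustrative.
from tensorflow.keras import layers, Model, Input

def residual_block(x, filters, stride):
    # Projection shortcut so the two branches can be added together.
    shortcut = layers.Conv2D(filters, 1, strides=stride, padding="same")(x)
    y = layers.BatchNormalization()(x)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, strides=stride, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    return layers.Add()([shortcut, y])

def build_dronet_like(input_shape=(200, 200, 1)):  # gray-scale input (size assumed)
    inp = Input(shape=input_shape)
    x = layers.Conv2D(32, 5, strides=2, padding="same")(inp)
    x = layers.MaxPooling2D(3, strides=2)(x)
    for filters in (32, 64, 128):          # three residual blocks
        x = residual_block(x, filters, stride=2)
    x = layers.Flatten()(x)
    x = layers.ReLU()(x)
    x = layers.Dropout(0.5)(x)
    steer = layers.Dense(1, name="steering")(x)                        # regression head
    coll = layers.Dense(1, activation="sigmoid", name="collision")(x)  # classification head
    return Model(inp, [steer, coll])

model = build_dronet_like()
model.compile(optimizer="adam",
              loss={"steering": "mse", "collision": "binary_crossentropy"})
model.summary()
```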
The outputs of DroNet are used to command the UAV to move on a plane with a forward velocity and a steering angle θ_k. More specifically, the probability of collision p_t provided by the network is used to modulate the forward velocity: the vehicle is commanded to go at maximal speed v_max when the probability of collision is zero, and to stop whenever it is close to 1. A low-pass filtered version of the modulated forward velocity v_k provides the controller with smooth, continuous inputs (0 ≤ α ≤ 1):

v_k = (1 − α) v_{k−1} + α (1 − p_t) v_max    (5.1)

where:
• v_k: the required forward velocity.
• v_{k−1}: the forward velocity from the previous iteration (zero for the first iteration).
• p_t: the probability of collision provided by the neural network.
• v_max: the maximum forward velocity of the robot.

Similarly, the predicted scaled steering s_k is mapped into a rotation around the body z-axis (yaw angle θ), corresponding to the axis orthogonal to the propellers' plane. Concretely, s_k is converted from the [−1, 1] range into a desired yaw angle θ_k in the range [−π/2, π/2] and low-pass filtered (0 ≤ β ≤ 1):

θ_k = (1 − β) θ_{k−1} + β (π/2) s_k    (5.2)

where:
• θ_k: the required steering angle.
• θ_{k−1}: the steering angle from the previous iteration (zero for the first iteration).
• s_k: the steering angle provided by the neural network.

In all the experiments, α = 0.3 and β = 0.5, while v_max was changed according to the testing environment. These constants were selected empirically, trading off smoothness for reactiveness of the drone's flight. As a result, a reactive navigation policy is obtained that can reliably control a drone from a single forward-looking camera. An interesting aspect of this approach is that a collision probability can be produced from a single image without any information about the platform's speed. Indeed, the authors conjecture that the network makes decisions based on the distance to observed objects in the field of view; convolutional networks are in fact well known to be successful at monocular depth estimation.
DroNet control is implemented as a node that receives (steering angle, probability of collision) from the "/cnn_out/predictions" topic, which is the output of the neural network. The neural network runs in another node, "/dronet_perception", which receives images from the camera, treats them as input to the neural network, and then transmits the output (probability of collision, steering angle) to the "/cnn_out/predictions" topic.
Figure 5.6: The rqt_graph of dronet.
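As a hedged sketch of what such a control node could look like (our own illustration using the topic name given above; the message layout and the v_max value are assumptions), the two low-pass filters of Equations 5.1 and 5.2 can be applied and the result published as a Twist command:

```python
#!/usr/bin/env python
# Sketch of a DroNet-style control node: filter (steering, collision probability)
# predictions into smooth velocity commands. The message type and field layout of
# the prediction topic, and V_MAX, are assumptions made for illustration.
import math
import rospy
from geometry_msgs.msg import Twist
from std_msgs.msg import Float32MultiArray  # assumed layout: [steering s_k, collision p_t]

ALPHA, BETA, V_MAX = 0.3, 0.5, 0.3  # filter constants from Eq. 5.1/5.2; V_MAX assumed

class DronetControl(object):
    def __init__(self):
        self.v = 0.0      # v_{k-1}, zero for the first iteration
        self.theta = 0.0  # theta_{k-1}, zero for the first iteration
        self.cmd_pub = rospy.Publisher("cmd_vel", Twist, queue_size=1)
        rospy.Subscriber("/cnn_out/predictions", Float32MultiArray, self.callback)

    def callback(self, msg):
        s_k, p_t = msg.data[0], msg.data[1]
        # Equation 5.1: low-pass filtered forward velocity modulated by collision probability.
        self.v = (1.0 - ALPHA) * self.v + ALPHA * (1.0 - p_t) * V_MAX
        # Equation 5.2: map steering in [-1, 1] to a yaw command in [-pi/2, pi/2] and filter.
        self.theta = (1.0 - BETA) * self.theta + BETA * (math.pi / 2.0) * s_k
        cmd = Twist()
        cmd.linear.x = self.v
        cmd.angular.z = self.theta
        self.cmd_pub.publish(cmd)

if __name__ == "__main__":
    rospy.init_node("dronet_control_sketch")
    DronetControl()
    rospy.spin()
```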
In this chapter, we discuss the simulation results acquired throughout the whole project. We divide the project into six stages, and each stage contains a result sample and a description of the problem faced. The simulation tool we use is GAZEBO. In the following subsections, the different scenarios for the ground-robot-based DroNet are simulated.
The first step is converting DroNet from drone-based autonomous navigation into ground-robot (TurtleBot) based autonomous navigation, and then creating an environment that looks like the one the DroNet neural network was trained on. The DroNet neural network was trained on real-world data aimed at navigating through streets, so the simulated environment we created is a single-lane road, as shown in Figure 5.7.
Figure 5.7: Single-lane road environment.
As shown in Figure 5.8, the DroNet-based ground robot collides with the wall; this problem is solved in the next subsection.
Figure 5.8: Robot collision.
This is achieved by tuning the low-pass filter (LPF) parameters α and β, which govern the linear and angular velocity in Equation 5.1 and Equation 5.2, respectively. The parameters are set to α = 0.3 and β = 0.5, while v_max is changed according to the testing environment.
This scenario is concerned with trying the tuned ground-based DroNet in an environment, shown in Figure 5.9, that the neural network was not trained on before, to test how it would behave.
Figure 5.9: Indoor environment.
Figure 5.10: (a) Start position of the robot. (b) Robot direction.
From Figure 5.10 (a), the blue beam from the laser scanner above the TurtleBot shows its field of view. The ideal behaviour would be for the TurtleBot to take the path that perfectly fits the robot. However, the robot did not go through it; it considered the whole area blocked and turned away, as shown in Figure 5.10 (b).
After retraining the neural network using a dataset generated from the same environment used in the preceding scenario, the behaviour improves, as shown below.
Figure 5.11: (a) Start position. (b) Heading to the perfectly fitting path.
Figure 5.11 (a) again shows the blue beam from the laser scanner indicating the robot's field of view. Figure 5.11 (b), on the other hand, shows the TurtleBot moving onto the path that perfectly fits the robot.
In order to perform autonomous navigation, the robot must have a map of the environment. The robot uses this map for many things, such as planning trajectories and avoiding obstacles. The mapping and localization are achieved as follows:
Simultaneous Localization and Mapping (SLAM) is the name given to the robotic problem of building a map of an unknown environment while simultaneously keeping track of the robot's location on the map being built.
The gmapping ROS package is an implementation of a specific SLAM algorithm called gmapping. This means that somebody has already implemented the gmapping algorithm for us to use inside ROS (see http://wiki.ros.org/slam_gmapping), without our having to code it ourselves. So, if we use the ROS Navigation stack, we only need to know (and worry about) how to configure gmapping for our specific robot, in our case the TurtleBot. The gmapping package contains a ROS node called slam_gmapping, which allows us to create a 2D map using the laser and pose data that the mobile robot provides while moving around an environment. This node basically reads data from the laser and the transforms of the robot, and turns it into an occupancy grid map (OGM).
Another of the packages available in the ROS Navigation Stack is the map_server package. This package provides the map_saver node, which allows us to access the map data through a ROS service and save it to disk. When you request that map_saver save the current map, the map data are written to two files: a YAML file, which contains the map metadata and the image name, and the image itself, which encodes the occupancy grid map.
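To give an idea of how such a saved map can be consumed later, the sketch below (our own example; the file names are placeholders) loads the YAML metadata and the image and converts the pixels into an occupancy array, following the map_server file conventions:

```python
# Sketch: load a map saved by map_saver (YAML metadata + image) into an occupancy array.
# File names are placeholders; thresholds are read from the keys stored in the YAML file.
import yaml
from PIL import Image

with open("my_map.yaml") as f:
    meta = yaml.safe_load(f)          # keys include: image, resolution, origin, thresholds

img = Image.open(meta["image"]).convert("L")
occupied_thresh = meta.get("occupied_thresh", 0.65)
free_thresh = meta.get("free_thresh", 0.196)

grid = []
for y in range(img.height):
    row = []
    for x in range(img.width):
        # map_server convention (negate = 0): darker pixels mean higher occupancy probability.
        p = (255 - img.getpixel((x, y))) / 255.0
        if p > occupied_thresh:
            row.append(100)   # occupied
        elif p < free_thresh:
            row.append(0)     # free
        else:
            row.append(-1)    # unknown
    grid.append(row)

print("Map %dx%d cells at %.3f m/cell" % (img.width, img.height, meta["resolution"]))
```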
Moving from one place to another is a trivial task for humans: one decides how to move in a split second. For a robot, such an elementary and basic task is a major challenge. In autonomous robotics, path planning is a central problem. The typical problem is to find a path for a robot, whether it is a vacuum-cleaning robot, a robotic arm, or a flying vehicle, from a starting position to a goal position safely. This problem has been addressed in multiple ways in the literature depending on the environment model, the type of robot, the nature of the application, and so on. Safe and effective mobile robot navigation needs an efficient path planning algorithm, since the quality of the generated path strongly affects the robotic application. Typically, minimizing the travelled distance is the principal objective of the navigation process, as it influences other metrics such as processing time and energy consumption. Path planning is divided into global and local path planning.
Figure 5.12: Global and local path planning.
In the following subsection, the mapping and localization scenario for the TurtleBot is presented.
In this scenario, the TurtleBot moves along a pre-defined path to the target without needing a laser range sensor. This is achieved through the following sequence:
1. Generating an obstacle map of the environment using the gmapping package, as shown in Figure 5.13.
Figure 5.13: Mapping of the environment used.
2. Generating the shortest path to the target using the Dijkstra path planner, then saving the obtained path, as shown in Figure 5.14.
Figure 5.14: Path planning for the environment used.
3. Finally, creating a script that retrieves the path from the file and moves the robot along it without a laser range sensor, as shown in Figure 5.15 and sketched below.
Figure 5.15: Moving the TurtleBot along the path without a laser sensor.
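A minimal sketch of such a script is shown below (our own illustration; the waypoint file format, topic names, controller gain and tolerances are assumptions): it loads the saved waypoints, compares them with the robot's odometry, and publishes simple velocity commands until each waypoint is reached.

```python
#!/usr/bin/env python
# Sketch of a laser-free path follower: load saved (x, y) waypoints and drive through
# them using odometry feedback only. File format, gain and tolerances are assumptions.
import math
import rospy
from geometry_msgs.msg import Twist
from nav_msgs.msg import Odometry
from tf.transformations import euler_from_quaternion

pose = {"x": 0.0, "y": 0.0, "yaw": 0.0}

def odom_callback(msg):
    pose["x"] = msg.pose.pose.position.x
    pose["y"] = msg.pose.pose.position.y
    q = msg.pose.pose.orientation
    pose["yaw"] = euler_from_quaternion([q.x, q.y, q.z, q.w])[2]

if __name__ == "__main__":
    rospy.init_node("path_follower_sketch")
    rospy.Subscriber("odom", Odometry, odom_callback)
    cmd_pub = rospy.Publisher("cmd_vel", Twist, queue_size=1)

    # Assumed file format: one "x y" pair per line, written when the planned path was saved.
    with open("saved_path.txt") as f:
        waypoints = [tuple(map(float, line.split())) for line in f if line.strip()]

    rate = rospy.Rate(10)
    for wx, wy in waypoints:
        while not rospy.is_shutdown():
            dx, dy = wx - pose["x"], wy - pose["y"]
            if math.hypot(dx, dy) < 0.1:        # waypoint reached (10 cm tolerance)
                break
            heading_error = math.atan2(dy, dx) - pose["yaw"]
            heading_error = math.atan2(math.sin(heading_error), math.cos(heading_error))
            cmd = Twist()
            cmd.linear.x = 0.15                  # slow constant forward speed (assumed)
            cmd.angular.z = 1.0 * heading_error  # simple proportional steering
            cmd_pub.publish(cmd)
            rate.sleep()
    cmd_pub.publish(Twist())                     # stop at the end of the path
```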
In this scenario, the combination of the DroNet-based navigation from the previous scenarios with the mapping and path planning stages is considered, providing goal-oriented motion while DroNet continues to handle collision avoidance.
Chapter 6
Conclusions and Future Work
In this project, a deep learning method based on a convolutional neural network is considered. Software tools were learned and used to achieve the main objectives of the project, including Linux, the robot operating system (ROS), C++, Python, and the GAZEBO simulator. The software simulation reproduces the output of the method using the laser sensor data, preprocessed in such a way that the network can decide which direction to follow in order to move nearer to the target. The goal-oriented motion problem of the DroNet approach has been solved using mapping and path planning. In addition, this thesis proposes a low-cost service application for indoor environments such as restaurants. Finally, the simulation results are very promising and the robot performance is good.
Now, we list some possible directions for future work:
• Realizing the project in a hardware implementation.
• Extending this work to agricultural applications by combining it with an Internet of Things (IoT) approach.
• Solving the goal-oriented motion problem of the DroNet approach via a potential field approach.

Bibliography
[1] Ihab S. Mohamed, Guillaume Allibert, and Philippe Martinet. "Model predictive path integral control framework for partially observable navigation: A quadrotor case study". In: 2020, pp. 196–203.
[2] Antonio Loquercio et al. "Dronet: Learning to fly by driving". In: IEEE Robotics and Automation Letters. IEEE. 2017, pp. 31–36.
[4] Asako Kanezaki, Jirou Nitta, and Yoko Sasaki. "Goselo: Goal-directed obstacle and self-location map for robot navigation using reactive neural networks". In: IEEE Robotics and Automation Letters. IEEE. 2018, pp. 1–6.
[6] Mark Pfeiffer et al. "From perception to decision: A data-driven approach to end-to-end motion planning for autonomous ground robots". In: IEEE. 2017, pp. 1527–1533.
[7] Simon Haykin. "Neural networks: A comprehensive foundation". In: Neural Networks.
[8] Yann LeCun and M. Ranzato. "Deep learning tutorial". In: Tutorials in International Conference on Machine Learning (ICML'13). Citeseer. 2013.
[9] Ross Girshick et al. "Rich feature hierarchies for accurate object detection and semantic segmentation". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014, pp. 580–587.
[10] Shaoqing Ren et al. "Faster R-CNN: Towards real-time object detection with region proposal networks". In: Advances in Neural Information Processing Systems. 2015, pp. 91–99.
[11] Ross Girshick. "Fast R-CNN". In: Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 1440–1448.
[12] Ihab S. Mohamed et al. "Detection, localisation and tracking of pallets using machine learning techniques and 2D range data". In: Neural Computing and Applications (2019), pp. 1–18.
[13] Jasper R. R. Uijlings et al. "Selective search for object recognition". In: International Journal of Computer Vision.
[14] In: European Conference on Computer Vision. Springer. 2014, pp. 346–361.
[15] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories". In: Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on. Vol. 2. IEEE. 2006, pp. 2169–2178.
[16] Joseph Redmon et al. "You only look once: Unified, real-time object detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 779–788.
[17] Mark Everingham et al. "The Pascal visual object classes challenge: A retrospective". In: International Journal of Computer Vision.
[18] arXiv preprint arXiv:1406.5212 (2014).
[19] Georgia Gkioxari, Ross Girshick, and Jitendra Malik. "Contextual action recognition with R*CNN". In: Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 1080–1088.
[20] Wanli Ouyang et al. "DeepID-Net: Deformable deep convolutional neural networks for object detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, pp. 2403–2412.
[21] Jiuxiang Gu et al. "Recent advances in convolutional neural networks". In: arXiv preprint arXiv:1512.07108 (2015).
[22] Thibault Barbié et al. "Real time object position estimation by convolutional neural networks". In: ().
[23] Alex Teichman and Sebastian Thrun. "Practical object recognition in autonomous driving and beyond". In: Advanced Robotics and its Social Impacts (ARSO), 2011 IEEE Workshop on. IEEE. 2011, pp. 35–38.
[24] Wen Xiao et al. "Simultaneous detection and tracking of pedestrian from panoramic laser scanning data". In: ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences (2016), pp. 295–302.
[25] Andry Maykol Pinto, Luís F. Rocha, and A. Paulo Moreira. "Object recognition using laser range finder and machine learning techniques". In: Robotics and Computer-Integrated Manufacturing.
[26] In: Intelligent Transportation Systems Conference, 2007 (ITSC 2007). IEEE. 2007, pp. 1044–1049.
[27] Vinod Nair and Geoffrey E. Hinton. "Rectified linear units improve restricted Boltzmann machines". In: Proceedings of the 27th International Conference on Machine Learning (ICML-10). 2010, pp. 807–814.
[28] Ian Goodfellow, A. Courville, and Y. Bengio. "Deep learning, book in preparation for MIT Press (2016)". In: (2016).
[29] Markus Braun et al. "Pose-RCNN: Joint object detection and pose estimation using 3D object proposals". In: Intelligent Transportation Systems (ITSC), 2016 IEEE 19th International Conference on. IEEE. 2016, pp. 1546–1551.
[30] Ihab S. Mohamed. "Detection and tracking of pallets using a laser rangefinder and machine learning techniques". PhD thesis. European Master on Advanced Robotics+ (EMARO+), University of Genova, Italy, 2017.
[31] Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. "Rectifier nonlinearities improve neural network acoustic models". In: Proc. ICML. Vol. 30. 1. 2013.
[32] Kaiming He et al. "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification". In: Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 1026–1034.
[33] Bing Xu et al. "Empirical evaluation of rectified activations in convolutional network". In: arXiv preprint arXiv:1505.00853 (2015).
[34] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. "Fast and accurate deep network learning by exponential linear units (ELUs)". In: arXiv preprint arXiv:1511.07289 (2015).
[35] Tong Zhang. "Solving large scale linear prediction problems using stochastic gradient descent algorithms". In: Proceedings of the Twenty-First International Conference on Machine Learning. ACM. 2004, p. 116.
[36] Weiyang Liu et al. "Large-margin softmax loss for convolutional neural networks". In: ICML. 2016, pp. 507–516.
[37] Sumit Chopra, Raia Hadsell, and Yann LeCun. "Learning a similarity metric discriminatively, with application to face verification". In: Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on. Vol. 1. IEEE. 2005, pp. 539–546.
[38] Raia Hadsell, Sumit Chopra, and Yann LeCun. "Dimensionality reduction by learning an invariant mapping". In: Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on. Vol. 2. IEEE. 2006, pp. 1735–1742.
[39] Florian Schroff, Dmitry Kalenichenko, and James Philbin. "FaceNet: A unified embedding for face recognition and clustering". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, pp. 815–823.
[40] Mofeed Turky Rashid, Huda Ameer Zaki, and Rana Jassim Mohammed. "Simulation of autonomous navigation mobile robot system". In: Journal of Engineering and Sustainable Development.
[41] Structured Programming. London, UK: Academic Press Ltd., 1972. ISBN: 0-12-200550-3.
[42] Dieter Fox et al. "Monte Carlo localization: Efficient position estimation for mobile robots". In: AAAI/IAAI.
[43] In: ACM SIGART Bulletin 37 (1972), pp. 28–29.
[44] Morgan Quigley et al. "ROS: an open-source Robot Operating System". In: ICRA Workshop on Open Source Software. Vol. 3. 3.2. Kobe, Japan. 2009, p. 5.
[45] Morgan Quigley, Brian Gerkey, and William D. Smart.