SA-Net: Deep Neural Network for Robot Trajectory Recognition from RGB-D Streams
Nihal Soans, Yi Hong, and Prashant Doshi
Dept. of Computer Science, University of Georgia, Athens, GA 30606
{yihong, pdoshi}@cs.uga.edu

Abstract
Learning from demonstration (LfD) and imitation learning offer new paradigms for transferring task behavior to robots. A class of methods that enable such online learning require the robot to observe the task being performed and decompose the sensed streaming data into sequences of state-action pairs, which are then input to the methods. Thus, recognizing the state-action pairs correctly and quickly in sensed data is a crucial prerequisite for these methods. We present
SA-Net, a deep neural network architecture that recognizes state-action pairs from RGB-D data streams.
SA-Net performed well in two diverse robotic applications of LfD – one involving mobile ground robots and another involving a robotic manipulator – which demonstrates that the architecture generalizes well to differing contexts. Comprehensive evaluations, including deployment on a physical robot, show that
SA-Net significantly improves on the accuracy of the previous method that utilizes traditional image processing and segmentation.
Introduction

Recent robot learning methods such as learning from demonstration (LfD) Argall et al. [2009] and imitation learning allow a transfer of preferences and policy from the expert performing the task to the learner. These methods have allowed learning robots to successfully perform difficult acrobatic aerial maneuvers Abbeel et al. [2007], carry out nontrivial manipulation tasks Pollard and Hodgins [2004], penetrate patrols Bogert and Doshi [2014], and merge autonomously into a congested freeway Nishi et al. [2017]. An important way by which this transfer occurs is the learner simply observing the expert perform the task.

Observations of the expert engaged in the task are expected to yield trajectories of state-action pairs, which are then given as input to the algorithms that drive these methods. Consequently, recognizing the expert's state and action accurately from observations is crucial for the learner. If the learner is a robot, its observations are sensor streams. Very likely, these will be streams from range and camera sensors yielding RGB and depth (RGB-D) data. Therefore, the learning robot must recognize sequences
of state-action pairs quickly and accurately from RGB-D streams. This is a critical component of the LfD and imitation learning pipelines.

Figure 1: An overview of the input and output of SA-Net for state-action recognition from Kinect 360 streams of a TurtleBot patrolling a corridor.

In this paper, we present
SA-Net, a deep neural network that recognizes state-action pairs from RGB-D data streams with high accuracy. This supervised learning method offers a general deep learning alternative to the current ad hoc techniques, which often rely on problem-specific implementations using OpenCV. Figure 1 gives an overview of how
SA-Net is deployed.
SA-Net aims to recognize, from a sensor stream, the expert's state and action. The state is the 2D or 3D coordinates in a global reference frame and the orientation. For example, the state of a ground mobile robot is its 2D coordinates and the angle it is facing, measured counterclockwise from the positive x-axis. The action is derived from the motion performed by the robot.

As the learner's position may not be fixed,
SA-Net seeks to recognize the coordinates and orientation of the observed object(s) relative to the learner's location. While the RGB frame offers context, the depth data is relative to the observer. Coordinates are recognized by interleaving convolutional neural nets (CNN) and pooling layers followed by fully connected networks input to a softmax. This allows the use of all four channels, RGB-D, in recognizing the coordinates. Identifying the expert's orientation and action is more challenging. Both of these rely on temporal data, and
SA-Net utilizes frames from time steps t − 2, t − 1, and the current time step t. Each frame is first cropped by a network such as Faster R-CNN Ren et al. [2015] or YOLO Redmon and Farhadi [2018] to focus attention on the expert. The network backtracks the movement inside the bounding box for time steps t − 1 and t − 2 using a layer of time-distributed CNNs followed by two convolutional long short-term memory nets (LSTM) Hochreiter and Schmidhuber [1997]. SA-Net continues to utilize the depth channel here as well by running an intercept to the previously described fully connected nets that provide the relative distance. We evaluate
SA-Net in two diverse domains. It is used to identify the state-action sequences of two TurtleBots that are simultaneously but independently patrolling a hallway Bogert and Doshi [2014]. In another application,
SA-Net is used to identify the state-action sequences of a PhantomX robotic arm that is performing pick-and-place operations. In both domains,
SA-Net exhibits high accuracy while being able to run on computing machines on board a robot with limited processing power and memory. Ablation and robustness studies demonstrate that the architecture is necessary and sufficient and that
SA-Net can handle typical adverse conditions as well. Consequently,
SA-Net offers high-accuracy trajectory recognition to facilitate robots engaged in LfD or imitation learning in various domains.
Related Work

Traditionally, the state and action of an observed agent are recognized by tracking a marker associated with the agent. For example, Bogert and Doshi [2014] make use of a colored box placed on the TurtleBot, which simplifies the detection of the robot and the estimation of its state and action. A limitation of such methods is a lack of robustness to occlusion of the object and to noise in the context.

Recently, deep neural networks have demonstrated significantly improved performance on tasks involving image and video analysis Yue-Hei Ng et al. [2015], He et al. [2016]. Related to our method are the neural network architectures utilized for recognizing human gestures and activities. For example, Ji et al. [2013] recognize human actions in surveillance videos using a 3D CNN. A recurrent neural network (RNN) combined with a 3D CNN is utilized by Montes et al. [2016] to classify and temporally localize activities in untrimmed videos. To leverage depth information in gesture recognition, two separate CNN streams are used by Eitel et al. [2015] with a late fusion network. Recently, the RGB and depth modalities were considered as one entity to extract features for action recognition with CNNs Wang et al. [2017]. In general, these action recognition methods treat input videos either as 3D volumes with multiple adjacent frames Ji et al. [2013], as one or multiple compact images Wang et al. [2017], or as a sequence of image frames Montes et al. [2016]. Our proposed method belongs to the last category and handles the image sequence with LSTMs, which are capable of learning temporal dependencies. Furthermore, in contrast to these methods for recognizing actions,
SA-Net is tasked with recognizing the state and action pairs simultaneously for use in online learning.

SA-Net Architecture
As SA-Net is tasked with recognizing state-action pairs, this motivates a network design that efficiently mixes convolutional and recurrent NNs, which we describe below.
We aim to automatically estimate the state and action pairs of an expert from RGB and depth streams using deep neural networks. Given the expert's three video frames captured by a learner at time points t − 2, t − 1, and t, our network jointly predicts the state (X, Y, Z, θ) and action (A) of the expert at the current time point t. Here, the tuple (X, Y, Z) in the state representation describes the location coordinate of the expert in a 3D environment; the Z dimension is ignored for 2D cases. The θ describes the orientation of the expert. In this paper, we consider discrete states and actions, which
allows us to formulate our task as a multi-label classification problem. Formally, our problem can be formulated as

(X, Y, Z, θ, A) = f(I_{t−2}, I_{t−1}, I_t; Θ),
X ∈ {0, 1, ..., N_X − 1}, Y ∈ {0, 1, ..., N_Y − 1}, Z ∈ {0, 1, ..., N_Z − 1},
θ ∈ {0, 1, ..., N_θ − 1}, A ∈ {0, 1, ..., N_A − 1},

where f indicates the mapping function learned by our classification network; I_{t−2}, I_{t−1}, and I_t are the three frame inputs; Θ represents the parameter set of the network for classifying the state and action jointly; N_X, N_Y, and N_Z are the discretized dimensions in each coordinate; N_θ is the number of the expert's orientations – for instance, we have four orientations (north, south, east, and west) in our TurtleBot application; and N_A is the number of actions, e.g., four different actions: move forward, stop, move right, and move left. In general, the network includes two coupled components for the state and action recognition, which are learned simultaneously. The architecture of SA-Net is shown in Fig. 2.

Figure 2: An overview of the SA-Net architecture. This network jointly predicts the state and action of an expert using the observed RGB-D data streams and corresponding sequential data cropped by an object detection model. The final outputs of the network include the coordinates (X, Y, Z), the orientation θ, and the action. The additional output of the relative coordinates (∆X, ∆Y, ∆Z) is used in the training procedure.
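As an illustration of this multi-label formulation, one softmax head per output can be trained jointly with a summed cross-entropy objective. The sketch below is ours, not the authors' code; the equal weighting of the five terms and the TensorFlow/Keras phrasing are assumptions.

    import tensorflow as tf

    def joint_state_action_loss(y_true, y_pred):
        # y_true: dict of integer class labels per output head;
        # y_pred: dict of softmax distributions per output head.
        # One cross-entropy term per head in (X, Y, Z, theta, action), summed.
        ce = tf.keras.losses.SparseCategoricalCrossentropy()
        heads = ("X", "Y", "Z", "theta", "action")
        return tf.add_n([ce(y_true[k], y_pred[k]) for k in heads])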
State Recognition

The state recognition aims to determine the expert's coordinates (X, Y, Z) and its orientation θ. Typically, the expert's coordinates can be identified on the basis of its surrounding environment. Therefore, we use one image frame, without considering temporal information, in our coordinate recognition module. Different from coordinate recognition, orientation recognition requires more than one image frame to recognize hard-to-distinguish orientations. As shown in Fig. 3, the TurtleBot is facing in different directions in the two images, but the image difference is too subtle to correctly separate these two orientations of the TurtleBot. In such situations, the image sequence plays an important role in recognizing the orientation. Therefore, in the state recognition of SA-Net, we separate the prediction of the coordinates (X, Y, Z) from that of the orientation θ, as one network stream takes the static image as input while the other takes the image sequence as input.

Figure 3: An example of a TurtleBot in two similar images but having different orientations.

Coordinate recognition
As shown in the top stream of the network in Fig. 2, only the image frame at time point t is used to predict the expert's location coordinates. We assign a pre-defined coordinate system to each environment; that is, each image frame is classified into a unique coordinate, which is represented by an absolute location (X, Y, Z) with respect to the origin of the coordinate system. The expert's coordinates are learned from images captured by the learner; however, the learner's location may change in different situations. To improve the generalization of the network, we leverage the relative distance between the expert and the learner to help in recognizing the expert's coordinates.

In the coordinate recognition branch, we have two sets of coordinate-related predictions: the relative distance (∆X, ∆Y, ∆Z) and the absolute coordinates (X, Y, Z). These two prediction tasks share the same image feature extraction process, which includes five convolutional layers and three max pooling layers. The convolutional layers each use 32 filters with the same kernel size and stride. The three pooling layers are located after the first, third, and fifth convolutional layers, respectively. Following the convolutional and pooling layers, two fully connected (FC) layers are used for classification. Because the prediction of the relative distance contributes to the coordinate prediction, we add an FC layer in the coordinate classification stream after concatenating the pre-activation of the softmax function from the relative distance classification.
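A minimal Keras sketch of this coordinate-recognition branch, as we read it from the description above, follows. It is not the authors' code: the 3×3 kernels, pooling sizes, dense widths, input resolution, and label-space sizes are all assumptions; only the layer types and counts follow the text.

    import tensorflow as tf
    from tensorflow.keras import layers, Model

    # Example label-space sizes; the true grid resolution depends on the environment.
    # The coordinate classes could equally be split into separate X/Y/Z heads.
    N_REL, N_COORD = 32, 48

    def conv_block(x, pool=False):
        # 32 filters per convolutional layer, as in the text; kernel/pool sizes assumed.
        x = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
        return layers.MaxPooling2D(2)(x) if pool else x

    rgbd = layers.Input(shape=(480, 640, 4), name="rgbd_frame_t")  # RGB + depth at time t

    x = conv_block(rgbd, pool=True)   # conv 1 + pool
    x = conv_block(x)                 # conv 2
    x = conv_block(x, pool=True)      # conv 3 + pool
    x = conv_block(x)                 # conv 4
    x = conv_block(x, pool=True)      # conv 5 + pool
    feat = layers.Flatten()(x)

    # Relative-distance head (dX, dY, dZ), expressed here as one flat label space.
    rel_hidden = layers.Dense(256, activation="relu")(feat)
    rel_logits = layers.Dense(N_REL, name="relative_logits")(rel_hidden)   # pre-activation
    rel_out = layers.Softmax(name="relative_distance")(rel_logits)

    # Absolute-coordinate head, fed the pre-activations of the relative-distance head.
    abs_hidden = layers.Dense(256, activation="relu")(feat)
    abs_hidden = layers.Concatenate()([abs_hidden, rel_logits])
    abs_hidden = layers.Dense(256, activation="relu")(abs_hidden)  # additional FC after concatenation
    coord_out = layers.Dense(N_COORD, activation="softmax", name="coordinate")(abs_hidden)

    coordinate_branch = Model(rgbd, [coord_out, rel_out])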
Orientation recognition

Different from the coordinate parameter, the orientation of the expert guides its movement regardless of the environment – similar to the action parameter discussed below. Therefore, in both orientation and action recognition we would like the network to focus its attention on the expert itself, especially when the expert is far away from the learner and relatively small in the whole image frame. To achieve this goal, we adopt object detection to make the expert stand out; more details are given under Object Detection below. After object detection, we have three new sequential frames, which are cropped from the original RGB-D image inputs and resized to a fixed resolution to facilitate orientation and action recognition of the expert. The sequential frames are essential in orientation recognition to differentiate hard examples, as shown in Fig. 3.

To handle the sequential image inputs, we use time-distributed convolutional (TD-Conv) layers in the network stream for orientation recognition. These layers collect image features required for orientation recognition from all three time steps. In particular, we have two TD-Conv layers, followed by one time-distributed max pooling layer and another TD-Conv layer; each TD-Conv layer has 32 filters. In addition, we observe that orientation and action recognition are connected to coordinate recognition, albeit loosely. For instance, the TurtleBot is less likely to turn left or right if it is in the middle of a corridor. Thus, we concatenate the whole-image features extracted in coordinate recognition with the spatio-temporal features extracted from the cropped image sequence to predict the expert's orientation. A similar operation is performed in action recognition, as discussed next.

Action Recognition

Similar to orientation recognition, actions are recognized using the same three sequential cropped images produced by the object detection. The goal is to determine the expert's action – for example, in which cardinal direction the expert is moving. Because orientation and action recognition work on the same input, they share the first three layers for extracting lower-level features from the cropped images at all time steps. The action recognition then uses two convolutional LSTM layers to further compose higher-level features and capture temporal changes in the image sequence; these two layers also use 32 kernels. In this branch, we leverage all features extracted in state recognition (both coordinate and orientation) to support the action recognition.
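The shared temporal branch can be sketched as below. Again, this is our reconstruction rather than the authors' code: the crop size, kernel sizes, feature widths, and the exact point at which the coordinate features are fused are assumptions; only the layer types and counts follow the description above.

    import tensorflow as tf
    from tensorflow.keras import layers, Model

    N_THETA, N_A = 4, 4   # example numbers of orientations and actions (patrolling domain)

    # Three cropped RGB-D frames (t-2, t-1, t); the 64x64 crop size is an assumption.
    crops = layers.Input(shape=(3, 64, 64, 4), name="cropped_sequence")
    # Whole-image features handed over from the coordinate stream (placeholder width).
    coord_feat = layers.Input(shape=(256,), name="coordinate_features")

    # Shared lower-level extractor: two TD-Conv layers, TD max pooling, one more TD-Conv.
    x = layers.TimeDistributed(layers.Conv2D(32, 3, padding="same", activation="relu"))(crops)
    x = layers.TimeDistributed(layers.Conv2D(32, 3, padding="same", activation="relu"))(x)
    x = layers.TimeDistributed(layers.MaxPooling2D(2))(x)
    shared = layers.TimeDistributed(layers.Conv2D(32, 3, padding="same", activation="relu"))(x)

    # Orientation head: fuse the spatio-temporal features with the coordinate features.
    o = layers.Flatten()(shared)
    o = layers.Concatenate()([o, coord_feat])
    orientation = layers.Dense(N_THETA, activation="softmax", name="theta")(o)

    # Action head: two convolutional LSTM layers on top of the shared features.
    a = layers.ConvLSTM2D(32, 3, padding="same", return_sequences=True)(shared)
    a = layers.ConvLSTM2D(32, 3, padding="same")(a)
    a = layers.Concatenate()([layers.Flatten()(a), coord_feat])
    action = layers.Dense(N_A, activation="softmax", name="action")(a)

    temporal_branch = Model([crops, coord_feat], [orientation, action])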
Object Detection

The object detection module provides the inputs for the orientation and action recognition of the expert. We use the RGB data stream at the current time point t to perform object detection using an existing model, YOLO Redmon and Farhadi [2018]. Using the predicted bounding box for the expert, [(x1, y1), (x2, y2)], we crop the images from the frames at t − 2, t − 1, and t. When cropping, we keep a small amount of the surrounding environment background; this buffered cropping is calculated by the linear equation ∆a = r × ∆b + c_min. Here, r ≥ 1 is the cropping factor that determines how aggressively the user wants to crop the image; ∆b is the width |x2 − x1| or the height |y2 − y1| of the bounding box before the buffered cropping, while ∆a is the corresponding value after the cropping; and c_min is the minimum amount of cropping, e.g., 10 pixels. We use a fixed value of r in all of our experiments.

Masking for Multiple Experts

In practice, we may have more than one expert in the scene for learning. However, the network is not explicitly designed to recognize the state and action pairs for multiple experts. To address this issue, we use a masking strategy that ensures only one expert is present in the images used for recognition. In particular, we leverage the object detection described above to separate the experts and generate a new image for each of them. When generating the image for one expert, we remove all other experts using their detected bounding boxes and replace the removed regions with a background image stored in memory. In this way, we obtain new images for each expert to pass through the network for its state and action recognition.
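The buffered cropping and the masking of additional experts can be sketched as follows (our reconstruction, not the authors' code). The default cropping factor below is only an example, since the exact value used in the experiments is not reproduced here, and centering the enlarged box is our choice.

    import numpy as np

    def buffered_crop(frame, box, r=1.5, c_min=10):
        # Enlarge the detected box per da = r * db + c_min before cropping;
        # r=1.5 is an example value, and centering the enlarged box is our assumption.
        x1, y1, x2, y2 = box
        w_new = r * abs(x2 - x1) + c_min
        h_new = r * abs(y2 - y1) + c_min
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        xa, xb = int(max(cx - w_new / 2, 0)), int(min(cx + w_new / 2, frame.shape[1]))
        ya, yb = int(max(cy - h_new / 2, 0)), int(min(cy + h_new / 2, frame.shape[0]))
        return frame[ya:yb, xa:xb]

    def mask_other_experts(frame, background, boxes, keep_index):
        # Replace every detected expert except the one at keep_index with the stored
        # background, so only a single expert remains visible to the network.
        out = frame.copy()
        for i, (x1, y1, x2, y2) in enumerate(boxes):
            if i != keep_index:
                out[y1:y2, x1:x2] = background[y1:y2, x1:x2]
        return out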
Experiments

SA-Net exhibits a general architecture useful in multiple domains. We evaluate it on two domains, offline and online on a physical robot, and report on our extensive experiments below.
We evaluated
SA-Net on two diverse domains. First, it was deployed on a TurtleBot tasked with penetrating cyclic patrols by two other TurtleBots in a hallway, as shown in Fig. 4 (a); this domain has been used previously to evaluate inverse reinforcement learning methods Bogert and Doshi [2014]. Each patroller can assume one of 4 orientations and 4 actions. The other domain involves observing a PhantomX arm mounted on a TurtleBot (Fig. 4 (b)), which is performing a pick-and-place task. The arm is observed from a Kinect 360 RGB-D sensor overlooking the arm. This domain adds a third dimension, the height of the end effector, to the state, and the arm has 6 possible orientations and 6 actions. For both domains, we evaluated
SA-Net using stratified 5-fold cross validation. 500 annotated RGB and depth image pairs were utilized to train a Faster R-CNN Ren et al. [2015], whose output then trained a YOLO network to obtain the bounding boxes for the cropped images. The whole data sets consist of 60K annotated sets of RGB and depth image frame pairs for the patrolling domain and 10K such sets for the manipulation domain. Each set consists of an uncropped pair and three cropped pairs for time points t − 2, t − 1, and t.

Table 1 (top and bottom) shows the prediction accuracy on the 2D or 3D coordinates and orientation that make up the state, and on the action, for each domain. We show the results for each of the 5 runs, the mean, and the standard deviation across the runs. Notice that in both domains, SA-Net generates predictions of state and action with very high accuracy, with those for the manipulation domain being slightly less accurate than those for the patrolling domain. This is generally consistent across all folds, which is why the standard deviations are low.

Figure 4: (a) A map of the hallway patrolled by two TurtleBots. The learner, shown in blue, observes the patrols from its vantage point using a Kinect 360 RGB-D sensor. A 2D grid is superimposed on the hallways. (b) SA-Net is deployed on a TurtleBot that observes a PhantomX arm mounted on top of a TurtleBot. A 3D grid is superimposed to represent the coordinates of its end effector.
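For concreteness, a stratified 5-fold split of this kind can be set up as sketched below. This is our illustration, not the authors' pipeline; stratifying on the action label and the scikit-learn phrasing are assumptions.

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    # frames: indices into the annotated RGB-D sets; labels: one label per set used
    # for stratification (the choice of the action label is our assumption).
    frames = np.arange(1000)                  # placeholder indices
    labels = np.random.randint(0, 4, 1000)    # placeholder action labels

    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    for fold, (train_idx, test_idx) in enumerate(skf.split(frames, labels)):
        # train on frames[train_idx], report accuracy on frames[test_idx]
        print(f"fold {fold}: {len(train_idx)} train / {len(test_idx)} test")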
Ablation Study

We performed an ablation study to understand the sensitivity of
SA-Net's performance to key components of the network. The ablation study removes a part of the network and conducts experiments on the revised model.
Relative X and Y
In this experiment on the patrolling domain, we eliminate the part of SA-Net that contributes to establishing the 2D grid coordinates of the observed robot relative to the observer's location. This part relies more heavily on the depth data. Consequently, we may expect the network to memorize locations by relying more on the RGB data, but to be unable to detect changes in its own deployed position. Row 1 of Table 2 shows a significant drop in the prediction accuracy of state and action, with a more pronounced drop in the accuracy of predicting the X-coordinate and the action. These two rely significantly more on the relative distances.
Temporal sequence data
In this experiment, we eliminate the part of
SA-Net responsible for processing temporal data from the previous time steps t − 1 and t − 2. This also eliminates those two input channels and keeps input from time step t only. We hypothesize this removal to significantly impact the recognition of the orientation θ and the action, both of which are thought to rely on sequence data. On the other hand, a single frame could be sufficient to identify the orientation in many cases. Table 2, row 2 presents prediction accuracies that are significantly lower for θ and the action, while recognizing the 2D coordinates is generally not affected. As such, the temporal data is indeed important for θ and for SA-Net in general.
Multimodal data
Next, we study whether depth data is needed for the predictions and how the network behaves when it is removed. Can we make the network learn the state and action from RGB data only? Row 3 of Table 2 shows that the predictions of the X-coordinate, θ, and the action are significantly degraded in the absence of the depth channel. The Y-coordinate is least impacted, as we may expect. As a patroller approaches the observer, there are multiple states for which the RGB frames are similar. In the absence of depth, the network memorizes certain features and overfits on those characteristics.
Patrolling       X        Y        θ        Action
Run 1            98.853   99.96    99.99    99.97
Run 2            98.853   99.96    99.99    99.95
Run 3            98.853   99.95    100      99.95
Run 4            98.885   99.97    99.97    99.94
Run 5            98.81    99.99    100      99.97
Mean             98.85    99.97    99.99    99.96

Manipulation     X        Y        Z        θ        Action
Run 1            97.63    95.23    96.54    98.19    99.12
Run 2            97.65    95.19    96.56    98.12    99.14
Run 3            97.62    95.2     96.58    98.23    99.1
Run 4            97.63    95.22    95.59    98.17    99.14
Run 5            97.66    95.21    95.55    98.15    99.16
Mean             97.64    95.21    96.16    98.17    99.13

Table 1: SA-Net results per run of a 5-fold cross validation for the patrolling (top) and manipulation (bottom) domains. We show the prediction accuracy of state and action for both domains.

Ablation                           X        Y    θ    Action
SA-Net w/o relative X & Y          81.378   –    –    –
SA-Net w/o data from t-1, t-2      96.56    –    –    –
SA-Net w/o depth channel           87.23    –    –    –
SA-Net w/o object detection        68.74    –    –    –

Table 2: Prediction accuracy of the ablated variants of SA-Net on the patrolling domain.

Object detection
Finally, we precluded the object recognition performed by YOLO, resulting in no cropped images as input. The drastic drop in the prediction quality of all coordinates, the orientation, and the action (row 4) gives evidence that object detection is required. Coordinate recognition suffers because object detection is needed for masking each expert in the context of multiple experts. In recognizing the orientation and action, object detection plays a more integral role in focusing SA-Net's attention, which is demonstrated by the larger degradation in their prediction accuracy.
Environment                     X        Y    Z      θ    Action
TurtleBot (patrolling)          –        –    N/A    –    –
Baseline: centroid method       94.15    –    –      –    –
PhantomX arm (manipulation)     87.56    –    –      –    –

Table 3: SA-Net's accuracy in physical experiments for the patrolling and manipulator arm domains under typical conditions.
SA-Net w/ Noise ± ± ± ± Centroid method w/ Noise 34.20 ± ± ± ± SA-Net w/ Occlusion ± ± ± ± Centroid method w/ Occlusion 18.23 ± ± ± ± SA-Net compared to the centroid method on the patrolling domainwith background noise and occlusion.
Experiments on a Physical Robot

We deployed the trained
SA-Net on a physical TurtleBot that observed two other TurtleBots patrolling the hallway, and on a TurtleBot that is connected to a Kinect 360 overlooking a PhantomX arm.
SA-Net can be used in ROS as a service, and the corresponding component architecture is shown in Fig. 5.
Figure 5: ROS nodes architecture for
SA-Net on a robot.
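A minimal sketch of wrapping the trained network as a ROS service is shown below. This is our illustration rather than the authors' node: the topic and service names are examples, and std_srvs/Trigger merely stands in for a custom service type that would carry the predicted state-action fields.

    #!/usr/bin/env python
    import json
    import rospy
    from cv_bridge import CvBridge
    from sensor_msgs.msg import Image
    from std_srvs.srv import Trigger, TriggerResponse

    bridge = CvBridge()
    latest = {"rgb": None, "depth": None}

    def on_rgb(msg):
        latest["rgb"] = bridge.imgmsg_to_cv2(msg, "bgr8")

    def on_depth(msg):
        latest["depth"] = bridge.imgmsg_to_cv2(msg, "passthrough")

    def predict_state_action(rgb, depth):
        # Placeholder: load the trained SA-Net model and run inference on the buffered
        # frames; dummy values are returned here so the node runs end to end.
        return {"X": 0, "Y": 0, "Z": 0, "theta": 0, "action": 0}

    def handle_request(req):
        pred = predict_state_action(latest["rgb"], latest["depth"])
        return TriggerResponse(success=True, message=json.dumps(pred))

    if __name__ == "__main__":
        rospy.init_node("sa_net_service")
        rospy.Subscriber("/camera/rgb/image_raw", Image, on_rgb)
        rospy.Subscriber("/camera/depth/image_raw", Image, on_depth)
        rospy.Service("get_state_action", Trigger, handle_request)
        rospy.spin()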
Although, in general, it is challenging to report the prediction accuracy in online physical experiments, we logged the RGB-D stream and SA-Net's predictions for each frame in the stream. These predictions were later verified manually. Table 3 reports the prediction accuracy of the observed state-action pairs for both domains. We compared
SA-Net's performance on the patrolling domain with a traditional OpenCV-based implementation that detects the centroid of the colored box on each robot. The extant method is particularly poor in recognizing the patroller's action, and SA-Net improves on it drastically. SA-Net's reduced accuracy on the manipulation domain is due to the increased complexity of a third dimension and the larger number of actions of the manipulator. Next,
SA-Net's prediction robustness was evaluated in various scenarios.
Noise test
In this experiment, we test whether background noise impacts the prediction accuracy of the network. The noise is defined as objects that look like, or have similar characteristics to, the target, and dimmed ambient light. Such background objects, shown in Fig. 6 (a), include a human wearing a similar-colored shirt and boxes of the same colors on the floor.

Occlusion test
In this experiment, the target is covered partially to approximate 50% occlusion; we cover the TurtleBot with a cardboard box or a white cloth, as shown in Fig. 6 (b). These robots then patrol the hallways as before.

Figure 6: (a) Robustness tests involving background noise via boxes on the floor, low ambient light, and a human sharing the space. (b) Observed robot partially occluded.

In Table 4, we show SA-Net's prediction accuracy in each context. For the noise test, the predictions are an average of 15 runs, split into 5 with a human, 5 with boxes, and 5 with dimmed ambient light. For the occlusion test, again an average of 15 runs is shown, with the object partially covered to approximate 50% occlusion. Notice that
SA-Net's predictions degrade, and rather dramatically under occlusion of the target object. The latter drop is because of SA-Net's reliance on RGB data, which gets curtailed under occlusion. Nevertheless, its predictions remain significantly better in both tests than those of the traditional centroid-based blob detection method. In particular, the centroid-based method fails to detect the observed robots under occlusion.
Memory usage                                     742 MB
Max. prediction time (Faster R-CNN → SA-Net)     –
Max. prediction time (YOLO2 → SA-Net)            –

Table 5: SA-Net resource utilization and the benefit of YOLO.
How much memory is consumed by the ROS deployment of
SA-Net? Table 5 reports the total amount of RAM held by the ROS service for good performance on state-action recognition. We also show the maximum time in seconds taken by SA-Net for prediction when paired with Faster R-CNN and when paired with YOLO2 for the patrolling domain, which has two targets. Notice that pairing with YOLO2 speeds up the prediction by a factor of more than five.
Concluding Remarks
SA-Net brings the recent advances in deep supervised learning to bear on a crucial step in LfD and imitation learning. It represents a general architecture for recognizing state-action pairs from RGB-D streams, which are then input to underlying methods for LfD such as inverse reinforcement learning. SA-Net demonstrates recognition accuracies on diverse robotics applications that are significantly better than those of previous conventional techniques. While minor changes in component layers may be beneficial, an ablation study revealed that the major architectural parts of the neural network are indeed needed. A low resource utilization signature allows SA-Net to be deployed using the relatively sparse computing resources on board robotic platforms.

SA-Net also brings another benefit to LfD. Recent techniques, such as maximum entropy deep inverse reinforcement learning Wulfmeier et al. [2015], utilize a neural network to perform inverse reinforcement learning. Consequently, this offers an opportunity to integrate SA-Net into the neural network for inverse reinforcement learning, optimizing synergies. This offers the potential for an end-to-end deep learning approach for LfD in the future.
References
Pieter Abbeel, Adam Coates, Morgan Quigley, and Andrew Y. Ng. An application of reinforcement learning to aerobatic helicopter flight. In Advances in Neural Information Processing Systems (NIPS), pages 1-8, 2007.

Brenna D. Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469-483, 2009.

Kenneth Bogert and Prashant Doshi. Multi-robot inverse reinforcement learning under occlusion with interactions. In International Conference on Autonomous Agents and Multi-Agent Systems, pages 173-180, 2014.

Andreas Eitel, Jost Tobias Springenberg, Luciano Spinello, Martin Riedmiller, and Wolfram Burgard. Multimodal deep learning for robust RGB-D object recognition. In Intelligent Robots and Systems (IROS), pages 681-687. IEEE, 2015.

Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 580-587, 2014.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.

S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):221-231, January 2013.

Alberto Montes, Amaia Salvador, Santiago Pascual, and Xavier Giro-i Nieto. Temporal activity detection in untrimmed videos with recurrent neural networks. arXiv preprint arXiv:1608.08128, 2016.

Tomoki Nishi, Prashant Doshi, and Danil V. Prokhorov. Freeway merging in congested traffic based on multipolicy decision making with passive actor critic. CoRR, abs/1707.04489, 2017.

Nancy S. Pollard and Jessica K. Hodgins. Generalizing demonstrated manipulation tasks. Algorithmic Foundations of Robotics V, pages 523-539, 2004.

Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), pages 91-99, 2015.

Pichao Wang, Wanqing Li, Zhimin Gao, Yuyao Zhang, Chang Tang, and Philip Ogunbona. Scene flow to action map: A new representation for RGB-D based action recognition with convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 416-425, 2017.

Markus Wulfmeier, Peter Ondruska, and Ingmar Posner. Maximum entropy deep inverse reinforcement learning. arXiv preprint arXiv:1507.04888, 2015.

Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. Beyond short snippets: Deep networks for video classification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.