Interpretability in Contact-Rich Manipulation via Kinodynamic Images
Ioanna Mitsioni, Joonatan Mänttäri, Yiannis Karayiannidis, John Folkesson, Danica Kragic
Abstract — Deep Neural Networks (NNs) have been widely utilized in contact-rich manipulation tasks to model the complicated contact dynamics. However, NN-based models are often difficult to decipher, which can lead to seemingly inexplicable behaviors and unidentifiable failure cases. In this work, we address the interpretability of NN-based models by introducing kinodynamic images. We propose a methodology that creates images from the kinematic and dynamic data of a contact-rich manipulation task. Our formulation visually reflects the task's state by encoding its kinodynamic variations and temporal evolution. By using images as the state representation, we enable the application of interpretability modules that were previously limited to vision-based tasks. We use this representation to train convolution-based networks, and we extract interpretations of the model's decisions with Grad-CAM, a technique that produces visual explanations. Our method is versatile and can be applied to any classification problem in manipulation that uses synchronous features, to visually interpret which parts of the input drive the model's decisions and to distinguish its failure modes. We evaluate this approach on two examples of real-world contact-rich manipulation: pushing and cutting, with known and unknown objects. Finally, we demonstrate that our method enables both detailed visual inspections of sequences in a task and high-level evaluations of a model's behavior and tendencies. Data and code for this work are available at [1].
I. INTRODUCTION
Contact-rich tasks emerge naturally in many robotic applications, from pushing and grasping to in-hand manipulation. Modelling and describing the contact dynamics analytically is especially challenging due to their frictional nature, discontinuities, and the variety of surfaces and objects. Moreover, the dynamics can vary throughout the task execution on the same object. In cutting, for example, changes in the knife contact surface area, the width of the inserted blade, and transient physical properties of the cut object result in temporally and spatially varying dynamics.

Models based on neural networks can handle the heterogeneous kinematic and dynamic data of a manipulation task and discover relevant patterns. They can account for variations in the objects and modes of interaction, generalize to unseen objects, and address the diverse dynamics [2]–[7]. However, small differences in training parameters or network architectures can result in significantly different behaviors. Additionally, due to the deep networks' black-box nature, the increased modelling performance comes at the cost of interpretability. For human operators, it is mostly unclear why an input leads to a specific output, which is crucial both for safety and performance when the models are deployed within a real robotic system.

Division of Robotics, Perception and Learning (RPL), CAS, EECS, KTH Royal Institute of Technology, Stockholm, Sweden. mitsioni, manttari, johnf, [email protected]
Division of Systems and Control, Dept. of Electrical Engineering, Chalmers University of Technology, Gothenburg, Sweden. [email protected]
Fig. 1: (a) Pushing task. (b) Cutting task. (c) System overview. To create the interpretable kinodynamic image, we transform state and input sequences of length $M$ into a pixelated image where the placement of the pixels is determined by the feature and the timestep, and the color by the magnitude. Every kinodynamic image corresponds to a block of measurements that starts at the current timestep $t$ and ends at $t + M$. After scaling the images for training and visualization accordingly, we use them as inputs to a CNN that predicts the class of a future sequence. Lastly, we extract saliency maps through Grad-CAM that answer the question "Which features led the network to this prediction?"

Consider the example of food-cutting tasks with objects of various shapes and stiffness. During execution, different motions might result in successful cuts, or in failed ones where the knife gets stuck in the object and potentially leads to a mechanical shutdown due to excessive forces. To describe this interaction, we need to take into account the robot's kinematics and the dynamics that depend on the knife's shape, the materials, and the controller parameters. Deriving an analytical model that determines the motion's outcome and encompasses all the possible variations is practically impossible. Nevertheless, we can train a network to timely predict an undesirable event and allow for corrective actions through a planning module or through different controller parameters. On one hand, these system reactions are helpful, but they should be triggered only when necessary, since otherwise they inhibit normal task execution. On the other hand, if not triggered in time, they will not prevent failure. In both cases, it is imperative to understand why they are either needlessly or untimely triggered, and to address the failure modes.
Fig. 2: Visualization of a failed cutting trial. The kinodynamic images correspond to the sequence (block) of measurements at the specific time instance. (a) While the robot is approaching the object, there is free motion on axis Z ($\Delta p_z$) and no contact forces. (b) When the robot is able to cut through the object, there is sawing motion ($\Delta p_y$), slicing motion ($\Delta p_z$), as well as forces on both axes, $F_y$, $F_z$. (c) When the controller gains are not appropriate for the contact's stiffness and friction, the knife is trapped inside the object, the forces $F_y$, $F_z$ are high, and a mechanical shutdown might occur. The colormaps represent the magnitude of every feature on a normalized scale spanning from blue for the largest negative to red for the largest positive value.

The ability to explain how the system's inputs trigger specific answers enhances the utility of data-driven models. Employing interpretability modules can reveal that a model is sensitive to a sensor's noise and will be biased by it during the task execution. Interpretability can also highlight that two models that seem identical are utilizing different features to make decisions, or why a model cannot generalize to new situations. Concretely, it offers a deeper understanding of how the data-driven models operate and allows the user to thoroughly evaluate and improve them according to the task.

To explain how an answer was produced, it is necessary to inspect the network's activations with respect to its predictions. This inspection, however, is intractable for most models if the activations and the inputs are not in an image format. To address this, we transcribe the problem into an image domain to allow its visualization through the method summarized in Fig. 1. We propose a way of constructing the images so that they are visually intuitive, compact, easily modifiable to include more features, and allow us to trace the activations back to the input with no special operations that could limit their applicability. The generated images of kinematic and dynamic data, called kinodynamic images, depict the state of the system during a sequence of timesteps and permit individual visual inspection of the features, as shown in Fig. 2. Furthermore, we propose to use the images with a CNN architecture that is not constrained by the placement of the features in the image. To examine which parts of the input contribute to the network's answers, we employ Grad-CAM [8], a technique for producing visual explanations in the form of saliency maps.

In summary, we present a method for interpreting learned models in manipulation. Our method can be applied to any manipulation task that is formulated as a classification problem. Moreover, it can utilize a variety of features stemming from proprioception or kinesthetic sources, as long as they are synchronous. We demonstrate how to construct kinodynamic images and consequently train and evaluate models for classification in two examples of contact-rich manipulation: pushing and cutting. For both tasks, the class to predict denotes the continuation of motion at a future instance.
In our experiments, we show examples of different behaviors from seemingly identical models and how to detect them through visual interpretation of isolated sequences or comparisons of the overall feature utilisation. The main contributions of this work are listed below:
(a) It is the first work that treats the interpretability of neural networks in contact-rich manipulation tasks.
(b) It produces qualitative, visual interpretations of isolated sequences in a task's evolution, such as transitions between states, for thorough inspection.
(c) It can also be used to produce high-level, statistical results for a model's behavior, such as overall feature utilization, on the entirety of the collected data.

II. RELATED WORK
Vision techniques in contact-rich manipulation:
Image-like structures have been explored in contact-rich tasks to leverage the modelling capabilities of CNNs, usually in the form of tactile data, allowing vision techniques to be applied to touch. In [9], the spatial signals from a BioTac sensor are used to estimate contact forces, while in [10], the authors employ an optical tactile sensor to train a CNN for edge perception and contour following. A more general approach is presented in [11], where the observations from a GelSight sensor are used to plan explicitly in the raw tactile observation space and manipulate an object through touch. A multimodal representation for contact-rich tasks has been proposed in [12] to encode heterogeneous sensory input that includes forces, end-effector positions and velocities, as well as visual data, to learn policies for a peg-insertion task. While all of these works address contact-rich tasks, they lack explanations about the underlying decision-making of the networks and only utilize them to obtain generalizable solutions for manipulation. Lastly, the work in [13] directly addresses the transformation of non-image data to an image to leverage a CNN architecture in several non-robotic tasks. This work does not focus on interpretability either, and for the image construction, a dimensionality reduction technique is first used to encode the data into a 2-dimensional format. As opposed to reducing the data dimensions, we construct the kinodynamic images to preserve the original dimensionality, which is important for visual interpretation in the Cartesian task space. When encoding data to a latent space that is produced by linear combinations of all the input features, the representation of the spatial and color dimensions of the corresponding pixels lacks physical meaning w.r.t. the interaction controller quantities. Even if the operation is fully reversible and we can return to the original raw data format, it is not straightforward to track which units have contributed to the network's decision, and we thus lose the ability to readily extract an explanation in the world frame.
Interpretability methods:
Interpretability of deep neural networks has recently become a focus point in the deep learning community. The furthest progress has been made for classification networks operating on single images [14]–[17]. Interpretability methods can be divided into two categories: network-centric and data-centric. Network-centric methods focus on specific units in the model, for example Activity Maximization [18]. This method aims to find what types of inputs would maximize the activation of a specific unit by using gradient ascent w.r.t. the general input space, not necessarily the current input. In this sense, it is more appropriate for examining network architectures and discerning the optimal parts in the entirety of the input space, as opposed to inspecting the actual inputs for a set architecture, which is our goal. Data-centric methods, on the other hand, focus on examining or manipulating input data to determine which patterns the model has learned for the task. Some examples of these methods are Layer-wise Relevance Propagation [16], Excitation Backprop [19], and Grad-CAM [8]. The first two utilize activations and weights which are normalized and backpropagated, while Grad-CAM calculates saliency maps using activations and gradients. A side-effect of how the first two methods operate is that they produce more fine-grained answers, in contrast to Grad-CAM, which is more appropriate for detecting regions of interest in the input space. Because we are interested in considering the whole sequence of a feature (a row in the image), this is beneficial for the images we construct. In addition, in a work evaluating the scope and quality of explanation methods [20], Grad-CAM was found to be one of the few methods that take into account both the input data and the model parameters and do not operate similarly to an edge detector.
Interpretability in robotic tasks:
Interpretability in robotics has mainly been explored in human-robot interaction as a tool to align human intuition with what the robot learns. A method utilising user-defined symbols during learning from demonstration to produce a semantically aligned latent space has been proposed in [21], while in [22], a way of designing the environment of the robot to allow for long-term interpretable plans is introduced. In another example of human-robot collaboration [23], the authors consider interpretability in the context of human motion recognition. They employ skeletal and object position data and introduce a human activity classifier that allows queries over anomalies in the input signal and can provide feedback in a human-interpretable manner. The concept of interpretability has also been examined in [24] as part of a hierarchical reinforcement learning structure with two agents. The high-level agent deconstructs the current problem into sub-goals for the low-level agent to reach. By using this hierarchy, they are able to visualize the parts of the environment that are considered prime candidates for the sub-goal and evaluate their intuitiveness. In a block-stacking setup, [25] presented an inference method for interpretable plans from natural language, proprioceptive, and image data. In [26], the authors explained an object's motion in human-object interaction by training networks to predict contact forces and contact points. Lastly, [27] proposed a network design methodology for end-to-end robotic control that produces compact networks with interpretable, dedicated structures. In contrast, we focus on low-level interpretations to understand why and how a trained model works within a manipulation task.

III. METHOD
We assume a robotic manipulator equipped with external force sensing, and that the task evolution depends on the state $x$ and the control action $u$. We further assume that the learning objective can be formulated as a classification problem of the form: given the current state and control input, what is the class of the future state? To model the state $x$ of the interaction task, we use the displacement of the end-effector and the sensed forces, as they can encapsulate the variations in the friction and stiffness of the contact. The control input for our set of experiments was measured from an admittance controller presented in [28]. To produce explanations for a prediction, sequences of raw data are first transformed into images, which are scaled for training and used as inputs to a CNN. A saliency map of the feature importance is then produced by Grad-CAM. To facilitate the interpretation, the saliency maps are presented together with images that have been scaled to be visually intuitive. Detailed descriptions of the individual components are presented in the following subsections.

Creating the Kinodynamic Image:
To create the network's input, we form blocks $b$ of the tuple $(x, u)$ that consist of non-overlapping sequences of measurements of length $M \in \mathbb{N}$. We denote the sequence of end-effector positions during a block $b$ as $p_b$, the sequence of sensed forces as $F_b$, and the control input as $u_b$. We then define the state as a combination of the relative change in positions $\Delta p_b \in \mathbb{R}^{3 \times M}$ between blocks and the sensed forces $F_b \in \mathbb{R}^{3 \times M}$, such that $x = [\Delta p_b, F_b]^T \in \mathbb{R}^{M \times 6}$, while the control input is $u = u_b^T \in \mathbb{R}^{M \times 3}$. Every input to the network thus consists of $M \times L$ measurements, with $L = 9$. To form the kinodynamic images, we transform these inputs into an image with width $M$ for the temporal evolution and height $L$ for the total number of features. Consequently, we create an $L \times M$ grid of pixels whose color is given by encoding the magnitude of the feature during that timestep into a 3-channel RGB value via a colormap. This procedure does not depend on the specific features or controller used and can be applied to a multitude of tasks. In addition, it can be used for inputs of higher dimensionality, such as the joint states of a dual-arm manipulator, simply by adjusting the height $L$ of the image.

Fig. 3: Scaler differences in signal form (top) and image form (bottom). (a) Raw data, training scaler, and visualization scaler on the original signal for a pushing task. (b) A block of measurements taken from the above signal, scaled with the training scaler (left) and the visualization scaler (right).

In the data preprocessing step, we utilize two different scalers: one for training and one for visualization. The first one is a standard scaler that scales every feature separately across the training set to ensure equal consideration by the network. The standard scaling operation has the disadvantage of producing images that are not visually intuitive, as every component of the features will now be in the same value range. This implies that noise on axes irrelevant to the task will now visually appear significant. An example of this can be seen in Fig. 3. The actual force signals resulting from motion on axis Y are shown in the leftmost plot of Fig. 3a. In the middle plot, the signals have been scaled for training, and whatever noise resided on axes X and Z is now independently scaled to match the feature range, which produces visually uninformative images such as the left one in Fig. 3b. The second scaler is strictly used for visualization purposes and scales the features of every group according to the absolute maximum value observed for the group across the training data. Concretely, during training the features are scaled row-wise (with respect to the image), while for visualization, we scale the groups of features that correspond to displacements, forces, and inputs based on the largest value measured within the same group. In both cases the data are scaled to the interval (0, 1), since we employ heterogeneous features with different ranges. The rightmost plot in Fig. 3a depicts the same force signals scaled to preserve their relative magnitude, resulting in the more informative and intuitive kinodynamic image on the right in Fig. 3b.
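To make the construction concrete, below is a minimal sketch of the image-building step, assuming NumPy and Matplotlib; the function name, the `coolwarm` diverging colormap, and the exact mapping into (0, 1) are illustrative choices on our part, not the released implementation at [1].

```python
import numpy as np
from matplotlib import cm

def make_kinodynamic_image(block, mode, stats):
    """Turn one block of measurements into an (L, M, 3) RGB image.

    block: (L, M) array whose L = 9 rows hold [dp_x, dp_y, dp_z,
    F_x, F_y, F_z, u_x, u_y, u_z] over the M timesteps of the block.
    mode="train": standard-scale each feature row using mean/std
    statistics precomputed over the whole training set (stats = (mu, sd)).
    mode="viz": scale each group of three rows (displacements, forces,
    inputs) by the largest absolute value observed for that group in the
    training data (stats = gmax, a length-3 array).
    """
    block = np.asarray(block, dtype=float)
    if mode == "train":
        mu, sd = stats                     # each of shape (L, 1)
        scaled = (block - mu) / sd
    else:
        gmax = stats
        scaled = np.vstack([block[3 * g: 3 * (g + 1)] / gmax[g]
                            for g in range(3)])
    # Map magnitudes into (0, 1) and encode them as 3-channel RGB pixels.
    # A diverging colormap renders negatives blue and positives red.
    unit = np.clip((scaled + 1.0) / 2.0, 0.0, 1.0)
    return cm.coolwarm(unit)[..., :3]      # drop the alpha channel
```

Because the pixel grid directly mirrors the feature rows and timesteps, a saliency map over the image can be read back axis by axis in the world frame.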
Classification Network: We used two variants of CNN networks: a purely convolutional model and a Convolutional Long Short-Term Memory (C-LSTM) model. The convolutional architecture consists of three 1D convolutional layers followed by a fully-connected layer, with a Leaky ReLU after the second layer. The convolutional layers have 32, 64, and 128 output channels, with filter sizes (1,5), (1,3), (1,1), single stride, and no padding. The C-LSTM architecture has 3 C-LSTM layers, each with 32 channels, filter size (1,5), and (0,2) padding, operating on a sequence of the 3 most recent images. The filter sizes and their strides were chosen to reflect the shape of the individual features in the image, produce crisp saliency maps, and avoid constraints on the feature placement. Consecutive features can still contribute simultaneously to the prediction, but they produce individual salient regions. For training, we used a 75-25 train-test split and the Cross-Entropy loss. During every pass, the convolutional layers treat temporal sub-sequences of individual features, and their output is then used to produce saliency maps centered at those features. This is a result of Grad-CAM using the gradient of the target class logit in the fully-connected layer w.r.t. the activations in the final convolutional layer to generate a heat map that highlights the regions of the image that are important for a prediction. In essence, these saliency maps refer to the future block and show the parts of the current one that contributed the most to the prediction.
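The following PyTorch sketch shows a classifier with the layer sizes quoted above, together with the Grad-CAM computation just described. The class name, the flattening step, and the way the final activations are cached are assumptions of ours; the C-LSTM variant and the training loop are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EventNet(nn.Module):
    """Convolutional classifier over (3, L, M) kinodynamic images."""
    def __init__(self, L=9, M=10, n_classes=2):
        super().__init__()
        # Height-1 kernels act as 1D convolutions along time, so every
        # feature row yields its own salient regions.
        self.conv1 = nn.Conv2d(3, 32, kernel_size=(1, 5))    # width 10 -> 6
        self.conv2 = nn.Conv2d(32, 64, kernel_size=(1, 3))   # width 6 -> 4
        self.conv3 = nn.Conv2d(64, 128, kernel_size=(1, 1))  # width 4 -> 4
        self.fc = nn.Linear(128 * L * (M - 6), n_classes)

    def forward(self, x):
        x = self.conv1(x)
        x = F.leaky_relu(self.conv2(x))  # Leaky ReLU after the second layer
        self.last_act = self.conv3(x)    # cached for Grad-CAM
        return self.fc(self.last_act.flatten(1))

def grad_cam(model, image, target_class):
    """Saliency over the final conv layer; gradients must be enabled."""
    logits = model(image)                               # image: (1, 3, L, M)
    grads, = torch.autograd.grad(logits[0, target_class], model.last_act)
    weights = grads.mean(dim=(2, 3), keepdim=True)      # pooled gradients
    cam = F.relu((weights * model.last_act).sum(dim=1))
    return (cam / (cam.max() + 1e-8)).squeeze(0)        # (L, M - 6), in [0, 1]
```

The resulting map has the convolved width and would be upsampled back to $L \times M$ before being overlaid on the input image.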
IV. EVALUATION
Dataset Collection:
To collect data, we employ the same controller as in [28]. Specifically, denote the desired positions and velocities of a trajectory as $p_d, \dot{p}_d \in \mathbb{R}^3$, and the sensed and reference forces as $f_s$, $f_r$. With the goal of tracking the trajectory compliantly, we can define the velocity control input as $u = K_a(f_s - f_r)$. We can then define the desired compliant behavior as $f_r = K_a^{-1}(K_p e - \dot{p}_d)$, where $K_p, K_a \in \mathbb{R}^{3 \times 3}$ are the stiffness and compliance gain matrices and $e = p - p_d$ is the position error. The controller operates at  Hz, so a block of length $M$ corresponds to a duration of $M/$  s.
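As a worked instance of the control law, the sketch below computes the velocity command from one set of measurements; it assumes NumPy and well-conditioned gain matrices, and it mirrors the equations rather than the controller implementation of [28].

```python
import numpy as np

def admittance_velocity(p, p_d, pd_dot, f_s, K_p, K_a):
    """u = K_a (f_s - f_r), with f_r = K_a^{-1} (K_p e - pd_dot)."""
    e = p - p_d                                   # position error
    f_r = np.linalg.solve(K_a, K_p @ e - pd_dot)  # desired compliant behavior
    return K_a @ (f_s - f_r)                      # velocity control input
```

Substituting $f_r$ back into the control law gives $u = \dot{p}_d - K_p e + K_a f_s$, i.e., trajectory tracking with a compliant force-feedback term.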
Fig. 4: A sequence of input frames at blocks $b \pm i$ during a transition for the pushing task, from before the transition in (a), to the transition block $b$ in (b), and block $b + 4$ in (c). The class prediction shifts from True Positive, through False Positive, until it reaches True Negative.

TABLE I: Training and performance details.

(a) Training parameters.

Task | Network       | Learning rate | Epochs | $H_{block}$
Push | EventNetP     |               |        |
Cut  | EventNetC1    |               |        |
Cut  | EventNetC2    |               |        |
Cut  | EventNetCLSTM |               |        |
Cut  | EventNetGen   |               |        |

(b) Training and testing scores.

Network       | $F_1$ (Train) $C = 0$ | $F_1$ (Train) $C = 1$ | $F_1$ (Test) $C = 0$ | $F_1$ (Test) $C = 1$
EventNetP     |                       |                       |                      |
EventNetC1    |                       |                       |                      |
EventNetC2    |                       |                       |                      |
EventNetCLSTM |                       |                       |                      |
EventNetGen   |                       |                       |                      |

For the pushing task, we keep the compliance at $K_a = 0.$  and the desired trajectory consists of three pushing phases that start and end with zero velocity, for a total of  s. The pushed box weighs  g and, to create variation in the dynamics, we add weights to the box ( , ,  g) and use five different materials on the surface of the motion.
The surfaces used were cork, sandpaper, felt, gouache paper, and smooth crafting paper, for a total of 48 datasets. To ensure that the object will not rotate during the execution or tip upwards, all the experiments are performed inside a track that constrains the motion on the other axes, and the pushing point is kept as close to the center of the object as possible. Datasets that include accidental rotations or significant frictional forces on the other axes were discarded. For the cutting task, the trajectories consist of periodic sawing motions on axis Y while moving downwards on axis Z. The datasets collected for this task comprise 43 cutting trials on 3 different objects: zucchinis, potatoes, and carrots. To create variation in the dynamics, we vary the compliance gain $K_a$ of the controller in the range [0. , 0. ], as well as the commanded trajectories, to reflect a range of more to less aggressive cutting strategies, by changing the duration of each motion ( , , or  s), the sawing range (3, ,  cm), the number of sawing-slicing repetitions ( , , ), and the slicing distance in every repetition (1,  cm).

In the remainder of this section, we demonstrate how to utilize the proposed method to visualize and evaluate models in two examples of real-world manipulation: pushing and cutting. Pushing is a task that has often been treated with data-driven methods in the literature [29], [30]. Its dynamics agree with intuition, so we employ it as an instructional example. For this task, we introduce the kinodynamic images by examining the interpretations of single image frames for different classification results. Additionally, we demonstrate how to detect the features that may lead to misclassifications during a label transition.

As opposed to pushing, which is easy to interpret but has no interaction between the axes of motion, we also need to consider tasks with complicated, coupled dynamics, such as cutting. In most cases, cutting requires the knife to first break friction by sawing and then to slice downwards until the object has been cut. The coupling of the motions creates a larger degree of variation in the interaction dynamics, as both are affected by the variety in object, controller, and trajectory parameters. Within this task we perform three experiments. Initially, we qualitatively show that without visualizing the decisions of trained models, the effect of small model differences on the closed-loop system can be nearly impossible to identify solely from performance metrics. For this experiment, we compare the models EventNetC1 and EventNetC2 during a transition. The next two experiments are examples of quantitative, high-level evaluations of the networks' behavior. First, we compare the overall feature utilisation that the models EventNetC1 and EventNetC2 exhibit over the entirety of the testing set. For the last experiment, we explore the case of generalization to an unseen object and identify which features carry influence between the train and test objects. It should be noted that, since the focus of this work is visualizing network decisions to aid in their evaluation and not proposing the best models for these tasks, we did not carry out extensive hyperparameter optimization for any of the networks. In all of the following experiments, we assume that the blocks consist of $M = 10$ timesteps. Training details as well as training scores can be found in Tables Ia and Ib.

A. Qualitative model evaluation through frame inspection
Pushing: We first consider pushing a box-like object on a plane. Due to the geometry of the task, the kinematic and dynamic axes of interest are reflected by the Y-axis rows of the interpretable images in Fig. 4. To avoid making the task trivial, the network EventNetP is trained to detect whether the object will continue moving $H_{block} = 5$ blocks of measurement sequences in the future, countering stick-slip phenomena. The label for each input frame is determined by the relative displacement on the axis of motion during the last timestep of the block $H_{block}$ ahead. If, by the end of the prediction horizon, the displacement is not larger than . mm, we determine that there is no motion and assign the negative label $C = 1$. In any other case, we consider the label positive and assign it $C = 0$. A transition from True Positive to True Negative can be seen in Fig. 4.
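A minimal sketch of this labeling rule follows; the array layout, the axis index, and the threshold `eps` (the numerical value was lost in the source) are our assumptions.

```python
import numpy as np

def assign_label(dp_blocks, b, h_block=5, axis=1, eps=1e-4):
    """dp_blocks: (n_blocks, 3, M) per-block displacements on axes (x, y, z).

    Returns the class of block b: 0 (motion continues) or 1 (no motion),
    judged from the last timestep of the block h_block ahead on the axis
    of motion. eps is a placeholder displacement threshold in meters.
    """
    future = dp_blocks[b + h_block]
    return 0 if abs(future[axis, -1]) > eps else 1
```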
Cutting: The training goal is to predict if the cutting progress will be inhibited $H_{block}$ frames in the future, and the labels are assigned similarly to the pushing task. The positive class ($C = 0$) corresponds to normal downwards motion $H_{block}$ blocks ahead, while the negative class ($C = 1$) occurs when, during the last timestep of the block of interest, the displacement $\Delta p_z$ does not surpass . mm. In this experiment, we analyze the models EventNetC1, EventNetC2, and EventNetCLSTM during the label transition in Fig. 5. These models have very similar scores, and for the two convolutional models the only differences are the learning rate (Table Ia) and the weight initialization seed, which was random.

Fig. 5: Sequence of input frames at blocks $b \pm i$ during a transition for the cutting task. The actual class transition happens during block $b$. The top row corresponds to model EventNetC1, the middle row to EventNetC2, and the bottom row to EventNetCLSTM. As EventNetCLSTM has a sequence of 3 images as input, the Grad-CAM images shown for it are the average of the 3 images leading up to the displayed frame. Despite their similar training performance, interpreting the models' classification decisions reveals that they all react differently to the transition.
B. Quantitative model evaluation through feature utilisation
Another important aspect of this method is that it offers a high-level overview of a model's performance. This can easily be done by calculating how many times an input feature led to a specific prediction, or by setting a threshold for the Grad-CAM activations and examining which features surpass it and contribute more. In Fig. 6 we use the former method and compare the overall behavior of the models EventNetC1, EventNetC2, and EventNetCLSTM on the test set. The bar charts depict the percentage of occurrences in which each feature is the main reason behind a classification result, over the total cases for that category. Lastly, we explore the case of generalization to an unseen object by quantitatively examining the feature importance during testing on the new object.
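A sketch of the first counting scheme is given below; treating the feature row with the largest saliency mass as the "main reason" for a prediction is our reading of the text, and the outcome labels are the usual confusion-matrix categories.

```python
import numpy as np

FEATURES = ["dp_x", "dp_y", "dp_z", "F_x", "F_y", "F_z", "u_x", "u_y", "u_z"]

def feature_utilisation(cams, outcomes):
    """cams: iterable of (L, M) Grad-CAM maps; outcomes: 'TP'/'TN'/'FP'/'FN'.

    Returns, per outcome, the fraction of samples in which each feature
    row carried the most saliency mass, as in the bar charts of Fig. 6.
    """
    counts = {o: np.zeros(len(FEATURES)) for o in ("TP", "TN", "FP", "FN")}
    for cam, outcome in zip(cams, outcomes):
        top_row = int(np.argmax(cam.sum(axis=1)))  # dominant feature row
        counts[outcome][top_row] += 1
    return {o: c / max(c.sum(), 1) for o, c in counts.items()}
```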
V. DISCUSSION
A. Qualitative model evaluation through frame inspection
In the transition example of the pushing task in Fig. 4, the decision is initially based on the forces $F_y$, but eventually the attention shifts to the lack of motion $\Delta p_y$. In this sequence, the transition between classes happens at block $b$ and the correct detection at $b + 4$, but not neatly, as the irrelevant axis X also gained importance through $\Delta p_x$, $F_x$. Visual inspections such as this are useful to detect models that are affected by inconsequential features and may not be robust in the presence of disturbance or noise during deployment.

Motion inhibition in cutting can be a result of inappropriate controller and trajectory parameters or of the object's structure. The former case is the most common and is usually caused by parameters that are not enough to break friction on either axis. This includes high compliance to external forces or inadequate velocity to allow progress. The latter case includes objects that structurally stop the motion because parts of them are impossible to cut through. Nonetheless, depending on the hardware limitations, high exerted torques can also occur because of objects with dense cores, such as a carrot, or high overall density, such as a potato.

For this set of experiments, we trained the networks EventNetC1, EventNetC2, and EventNetCLSTM to detect motion inhibition in a future block. The models have very small differences in performance, as shown in Table Ib. However, visualizing and explaining their decisions shows that they respond to different patterns in the input image, reinforcing that interpretability is important to understand why two seemingly identical models can behave differently when deployed. Initially, in Fig. 5a, 5e, 5i, all networks are basing their decision to some degree on the downwards motion observed in $\Delta p_z$, with EventNetCLSTM placing more importance on $\Delta p_y$. One block before the transition ($b - 1$) (Fig. 5b, 5f, 5j), only EventNetC2 shifts its attention to the lack of motion on $\Delta p_y$ and the corresponding force $F_y$, which indicates that the blade is not able to saw through the object, while EventNetCLSTM continues to focus only on the displacement. At the time of the actual transition (Fig. 5c, 5g, 5k), no model detects it immediately.
EventNetCLSTM does not change its behaviour, and EventNetC2 is considering the less important lateral forces $F_x$ equally as much as $F_y$, while EventNetC1 is focusing on axis Z through $\Delta p_z$ and $u_z$. Finally, at the immediate next block (Fig. 5d, 5h, 5l), EventNetC1 falsely disregards $\Delta p_z$ and misclassifies the input, while EventNetC2 focuses on the lack of displacement and correctly classifies it. EventNetCLSTM also misclassifies the input, basing its decision mostly on the lack of displacement. It is not until block $b + 3$ that it considers the forces and correctly classifies the sample, potentially because the network operates recurrently on a sequence of the 3 most recent frames. This does not offer the model any more information closer to the prediction horizon than its CNN counterparts, but it does encourage it to focus on trends across longer sequences. In the integrated system, this difference in reasoning could be reflected in delayed response times or sensitivity to non-crucial features, which would in turn negatively affect any module determining the corrective action and the overall task success.

Fig. 6: A comparison of the features prioritized per classification result for (a) EventNetC1, (b) EventNetC2, and (c) EventNetCLSTM. Each bar represents the percentage of feature instances over the total cases of every category. Total occurrences TP:1166, TN:1235, FN:214, FP:211.

B. Quantitative model evaluation through feature utilisation
Fig. 6 depicts the overall feature utilization for the entirety of the testing set and concisely presents the different patterns favored by the models.

Fig. 7: Features prioritized per classification result for generalization to an unseen object. Each bar represents the percentage of feature instances over the total cases of every category. Total occurrences TP:73, TN:725, FN:32, FP:86.

For EventNetC1, it is clear from Fig. 6a that the most used feature is the downwards displacement $\Delta p_z$, both for True Negative (TN) and True Positive (TP) predictions. The control input on the same axis is the second most common indication for TP, while the sawing axis is only slightly used, indicating a one-sided view of the task. On the other hand,
EventNetC2 decidedly utilizes $\Delta p_z$ for TP. The obstructed motion (TN) is mostly detected through $u_z$, along with influence from the forces $F_y$, $F_z$ and the displacement $\Delta p_y$, which better reflects the nature of the task. As also exhibited in Fig. 5, EventNetCLSTM bases its predictions of unhindered movement (TP) mostly on $\Delta p_y$, $u_y$, and bases TN/FN samples almost solely on the Z-axis control input.

Lastly, we evaluated which features are important for generalization to an object with distinct dynamics. For this experiment, we excluded one object category (zucchinis) from the training set, trained EventNetGen on the remainder (potatoes, carrots), and then tested on the unseen object. Given the differences in texture and consistency, it is not surprising that the model has difficulty detecting when the motion ceases, as demonstrated by the training results in Table Ib. It is interesting to notice, however, that the different motion profiles are the main reason behind FN, but the control input $u_z$ effectively dominates the FP results (Fig. 7). This is due to the fact that the testing object is significantly easier to cut. Since it would not cause a mechanical shutdown, during data collection we were commanding more aggressive trajectories, which in turn led to higher control inputs. In the case of the training-set objects, these inputs would probably mean that the knife encountered too much resistance and was not able to move, which justifies the misclassification and the lack of generalization performance.

VI. CONCLUSION
In this work, we presented a framework that instills interpretability in manipulation tasks with kinematic and dynamic data. We achieved that by transcribing sequences of data into visually intuitive images that encode the temporal evolution of the features. Through our approach, it is possible to explain the decisions of learned models used in any robotic task that is formulated as a classification problem and utilizes synchronous features. Without loss of generality, we treated translational motion and forces and binary classification. However, tasks with different dimensions or types of features can be explained simply by modifying what comprises the height of the image, $L$. In addition, multi-label classification can be treated in exactly the same manner, as Grad-CAM is not constrained by the number of classes. In our experiments, we showed how to detect features that lead to misclassifications by inspecting isolated sequences during label transitions. Furthermore, we showed how interpretability can help distinguish models that are seemingly identical based on their learning scores but will behave differently during deployment. Lastly, we illustrated how the same pipeline can be used to derive quantitative results about feature utilization and how that can be applied to examine the model's generalization abilities. The method is versatile, however, and can be used for more purposes, for example to analyze whether architectures with different complexities are in essence deciding based on the same features, or to monitor whether the changes made in training consistently produce models with similar performance and feature utilization. Note that the colormap choice can affect the training results. For a task like pushing, all the quantities have the same sign and sequential colormaps are appropriate. However, in other cases the features take both positive and negative values that lead to different dynamics, and diverging colormaps are preferable. During training, we observed that the neutral-zone color could also bias the network and lead to different results. Currently, this method is only applicable to classification tasks. In the future, we aim to explore more interpretability modules and extend the method to continuous outputs, which will encompass regression tasks and learned dynamics models.

REFERENCES

[1] I. Mitsioni and J. Mänttäri, "Interpretability in contact-rich manipulation via kinodynamic images," https://github.com/imitsioni/interpretable_manipulation
[2] J. Bohg, A. Morales, T. Asfour, and D. Kragic, "Data-driven grasp synthesis—a survey," IEEE Transactions on Robotics, vol. 30, no. 2, pp. 289–309, 2014.
[3] I. Agriomallos, S. Doltsinis, I. Mitsioni, and Z. Doulgeri, "Slippage detection generalizing to grasping of unknown objects using machine learning with novel features," IEEE Robotics and Automation Letters, vol. 3, no. 2, pp. 942–948, 2018.
[4] J. Sung, J. K. Salisbury, and A. Saxena, "Learning to represent haptic feedback for partially-observable tasks," in , 2017, pp. 2802–2809.
[5] Z. Erickson, H. M. Clever, G. Turk, C. K. Liu, and C. C. Kemp, "Deep haptic model predictive control for robot-assisted dressing," in , 2018, pp. 4437–4444.
[6] A. Nagabandi, K. Konolige, S. Levine, and V. Kumar, "Deep dynamics models for learning dexterous manipulation," Conference on Robot Learning (CoRL), pp. 1–12, 2019.
[7] O. M. Andrychowicz, B. Baker, M. Chociej, R. Józefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, J. Schneider, S. Sidor, J. Tobin, P. Welinder, L. Weng, and W. Zaremba, "Learning dexterous in-hand manipulation," The International Journal of Robotics Research, vol. 39, no. 1, pp. 3–20, 2020.
[8] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, "Grad-CAM: Visual explanations from deep networks via gradient-based localization," International Journal of Computer Vision, vol. 128, no. 2, pp. 336–359, Oct 2019.
[9] B. Sundaralingam, A. S. Lambert, A. Handa, B. Boots, T. Hermans, S. Birchfield, N. Ratliff, and D. Fox, "Robust learning of tactile force estimation through robot interaction," in , 2019, pp. 9035–9042.
[10] N. F. Lepora, A. Church, C. de Kerckhove, R. Hadsell, and J. Lloyd, "From pixels to percepts: Highly robust edge perception and contour following using deep learning and an optical biomimetic tactile sensor," IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 2101–2107, 2019.
[11] S. Tian, F. Ebert, D. Jayaraman, M. Mudigonda, C. Finn, R. Calandra, and S. Levine, "Manipulation by feel: Touch-based control with deep predictive models," in , 2019, pp. 818–824.
[12] M. A. Lee, Y. Zhu, K. Srinivasan, P. Shah, S. Savarese, L. Fei-Fei, A. Garg, and J. Bohg, "Making sense of vision and touch: Self-supervised learning of multimodal representations for contact-rich tasks," in , 2019, pp. 8943–8950.
[13] A. Sharma, E. Vans, D. Shigemizu, K. A. Boroevich, and T. Tsunoda, "DeepInsight: A methodology to transform a non-image data to an image for convolution neural network architecture," Scientific Reports, vol. 9, art. no. 11399, Aug 2019.
[14] K. Simonyan, A. Vedaldi, and A. Zisserman, "Deep inside convolutional networks: Visualising image classification models and saliency maps," in Workshop at International Conference on Learning Representations, 2014.
[15] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," Lecture Notes in Computer Science, vol. 8689, part 1, pp. 818–833, 2013.
[16] G. Montavon, W. Samek, and K. R. Müller, "Methods for interpreting and understanding deep neural networks," Digital Signal Processing: A Review Journal, vol. 73, pp. 1–15, 2018.
[17] B. Kim, M. Wattenberg, J. Gilmer, C. Cai, J. Wexler, F. Viegas, and R. Sayres, "Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV)," ICML, 2018.
[18] D. Erhan, Y. Bengio, A. Courville, and P. Vincent, "Visualizing higher-layer features of a deep network," Bernoulli, no. 1341, pp. 1–13, 2009.
[19] J. Zhang, Z. Lin, J. Brandt, X. Shen, and S. Sclaroff, "Top-down neural attention by excitation backprop," CoRR, vol. abs/1608.00507, 2016. [Online]. Available: http://arxiv.org/abs/1608.00507
[20] J. Adebayo, J. Gilmer, M. Muelly, I. J. Goodfellow, M. Hardt, and B. Kim, "Sanity checks for saliency maps," CoRR, vol. abs/1810.03292, 2018. [Online]. Available: http://arxiv.org/abs/1810.03292
[21] Y. Hristov, D. Angelov, M. Burke, A. Lascarides, and S. Ramamoorthy, "Disentangled relational representations for explaining and learning from demonstration," in Conference on Robot Learning (CoRL), 2019.
[22] A. Kulkarni, S. Sreedharan, S. Keren, T. Chakraborti, D. Smith, and S. Kambhampati, "Designing environments conducive to interpretable robot behavior," in , 2020.
[23] B. Hayes and J. A. Shah, "Interpretable models for fast activity recognition and anomaly explanation during collaborative robotics tasks," in , 2017, pp. 6586–6593.
[24] B. Beyret, A. Shafti, and A. A. Faisal, "Dot-to-dot: Explainable hierarchical reinforcement learning for robotic manipulation," in , 2019, pp. 5014–5019.
[25] C. Paxton, Y. Bisk, J. Thomason, A. Byravan, and D. Fox, "Prospection: Interpretable plans from language by predicting the future," in , 2019, pp. 6942–6948.
[26] K. Ehsani, S. Tulsiani, S. Gupta, A. Farhadi, and A. Gupta, "Use the force, Luke! Learning to predict physical forces by simulating effects," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[27] M. Lechner, R. Hasani, M. Zimmer, T. A. Henzinger, and R. Grosu, "Designing worm-inspired neural networks for interpretable robotic control," in , 2019, pp. 87–94.
[28] I. Mitsioni, Y. Karayiannidis, and D. Kragic, "Modelling and learning dynamics for robotic food-cutting," arXiv preprint arXiv:2003.09179, 2020. [Online]. Available: https://arxiv.org/abs/2003.09179
[29] M. Bauza and A. Rodriguez, "A probabilistic data-driven model for planar pushing," in , 2017, pp. 3008–3015.
[30] J. Li, W. S. Lee, and D. Hsu, "Push-Net: Deep planar pushing for objects with unknown physical properties," in