Identifying Multiple Interaction Events from Tactile Data during Robot-Human Object Transfer
Mohammad-Javad Davari†, Michael Hegedus, Kamal Gupta∗ and Mehran Mehrandezh
Abstract — During a robot-to-human object handover task, several intended or unintended events may occur with the object - it may be pulled, pushed, bumped or simply held - by the human receiver. We show that it is possible to differentiate between these events solely via tactile sensors. Training data from tactile sensors were recorded during the interaction of human subjects with an object held by a 3-finger robotic hand. A Bag of Words approach was used to automatically extract effective features from the tactile data. A Support Vector Machine was used to distinguish between the four events with over 95 percent average accuracy.

∗This work was supported by an NSERC Discovery Grant to Kamal Gupta.
† In memoriam. Mohammad recently passed away.
School of Engineering Science, Simon Fraser University, Canada {mdavarid,kamal,mehegdus}@sfu.ca
Faculty of Engineering and Applied Science, University of Regina, Canada [email protected]
I. INTRODUCTION

Robots can aid humans in different ways. A common example, which occurs in a variety of scenarios such as assisted care or, more generally, collaborative robotics, is that of transferring an object from a robot to a human, i.e., the robot-human object handover task. A key aspect of the handover task is the human receiver's interaction with the object while the object is held in the robot's hand. For instance, during object handover, the receiver may pull the object as he/she grasps it, or the receiver may accidentally bump the object. Such events may cause potential slippage of the object. Identifying and classifying these events is essential to the proper and robust execution of the object handover task, since the robot must take appropriate action in response to such events. We have identified four such events, namely holding, bumping, pulling and pushing, that may occur during object handover, and in this paper we focus on identifying these four events using tactile sensors only.

Indeed, humans are good at the human-human handover task, and use multiple sensing modalities including haptic, force, visual and audio [1]. Some of these modalities have been used in robotics to perform handover tasks as well. For instance, in [2] the authors used vision to adjust the timing of the handover. In [3], strain-gauge-based force sensors were used to model handover grip force. In this paper, we show that it is possible to identify the above-mentioned four events solely based on data obtained via tactile sensors. We achieve this by using a machine-learning-based feature identification algorithm, Bag of Words (BoW) [4], on the tactile data stream, followed by a Support Vector Machine (SVM) classifier [5].

The tactile data is essentially a time series, i.e., a sequence of pressure readings from a set of spatially distributed tactile elements (taxels). It has both temporal and spatial aspects that can be exploited for detecting features. Temporal information is intrinsically relevant for detecting the events, since they result in a change in sensed pressure; hence we focused primarily on temporal features, because utilizing spatial features on top would result in more parameters and thus require more training data.

To generate features, one can either hand-design the features or automatically generate them. We first tested hand-designed features to assess classification feasibility. This produced very poor results (more detail in Section III), due to the high variability in the temporo-spatial tactile data encountered in the data collection process, even for the same subject. This motivated our decision to use automatic feature generation, the methods of choice for which are [6]: Neural Networks (NN), Hidden Markov Models (HMM) and BoW. BoW uses the data itself as features, whereas NNs and HMMs use the data to train their parameters; the latter therefore generally require more training data to generate effective features than BoW does. It is onerous and time-consuming to collect a large amount of real data in our case, since our experiments require human subjects; we therefore favor the BoW approach to generate features.
Furthermore, we chose SVM as a classifier because it requires data near the class boundaries (not the whole data set) [7]; thus it has a better chance of succeeding with less data.

Initially, simply for feasibility purposes, the BoW+SVM approach was tested on data obtained from one subject only (30 training samples for two classes), with acceptable results. Following this initial experimental confirmation, we increased the number of participants to five and added various test objects for the handover task, such as a ball, a can and a cuboid, in different grasp orientations to create variability via realistic scenarios. For example, through experiments we observed that the contact forces and contact locations on the tactile pads vary widely between a ball and a cuboid test object, which leads to a significantly more challenging scenario for the classifier.

Our paper presents results from five student volunteers interacting with a 3-finger mechanical Schunk Dexterous Hand (SDH) mounted on a six-DOF robotic arm. Each finger has two sensor pads, a proximal and a distal, giving us a total of six tactile sensor pads. The sensor pads are commercially available resistive tactile sensors [8]. Our approach obtained over 95% accuracy with a 0.5 second window of tactile data; hence it is suitable for real-time detection of these events. In summary, the key contributions of our work are: i) we show that it is possible to classify whether an object held in a robot's hand is being pushed, pulled, held, or bumped by a human receiver during a robot-human object transfer task, using only tactile data and standard machine learning techniques; ii) we present a single classifier for automatic detection of these multiple events - push, pull, bump or hold - using a BoW algorithm for automated feature generation on the temporal tactile data stream, followed by an SVM for classification; and iii) the detection is real time and hence can be used by the robot to ensure robust object transfer.

The organization of the paper is as follows. We describe related work in Section II, and detail the nature of the data and the algorithms used in our work in Section III. The experimental setup is described in Section IV, followed by the experimental procedure in Section V and the results in Section VI. Finally, conclusions and future work are discussed in Section VII.

II. RELATED WORK

Most previous work on detecting events using tactile data addresses only a single event, either slip detection [9]–[12] or grasp stability detection [13]–[15], primarily from a grasp stability perspective, and does not directly concern itself with object handover. Our work, on the other hand, provides a single classifier for the disparate multiple events that can occur during an object handover task. The previous work, however, does give us information about the types of features used for tactile sensor data, as outlined below.

Several spatial features have been used in the context of slip detection - from relatively simple ones like moments of inertia [14] to more involved ones such as optical-flow-based techniques [12]. [14] found that, for grasp stability assessment, using temporal information with a Hidden Markov Model (HMM) based classifier gives better results than a one-shot classification based on spatial features such as moments of inertia. [15] also used BoW for feature detection for grasp stability assessment, but only on spatial aspects, i.e., using only a single reading of the sensors acquired at the end of the grasping process.
Optical-flow-based techniques assume that the object, or its point of contact with the tactile sensor, is considerably smaller than the overall size of the tactile sensor; furthermore, they only work well with precise, high-resolution tactile sensors. Frequency-domain-based techniques have also been used [11], [16], [17]; however, they require sensors with a sampling rate much higher than that available on commercially available tactile sensors, such as the one we use.

For completeness, we mention that tactile data has also been used to perform object recognition [18], [19]. The former uses BoW features, but only on the spatial aspects; the latter does take into account temporal aspects of feature detection and uses a neural network for feature generation. [20] used spatial statistical features such as moments, along with an on-line learning algorithm that learns from the time series, for object recognition. Neither of these works, however, deals with multiple event classification, as we do in our work.

III. DATA TYPE AND ALGORITHM

A tactile sensor is typically a grid of individual sensors, called taxels, that provide pressure readings at their respective locations at a given frequency. The collected sensor data therefore consists of a stream of multiple taxel readings over a certain duration of time, represented as multivariate time series data (MTSD) [21]. One of the difficulties of working with MTSD is intra-class variability, which arises from the fact that the same event can occur at different speeds. To address this issue, methods such as BoW [4], recurrent neural networks [22], dynamic time warping [23] and HMMs [24] have been studied in the pertinent literature.

Initially, we hand-designed features and tested them on a limited number of samples. These features were: maximum, mean, median and standard deviation of taxel values, frequency content via FFT, derivative mean, first moment, and optical flow. The best performance was obtained with the standard-deviation feature, at 60% for pull vs. grip classification, which is clearly subpar.

For automatic feature generation, as mentioned in the introduction, we chose the BoW approach followed by an SVM classifier because it has a better chance of succeeding with less data. A simple description of BoW is given below.

Each experiment consists of multiple sequences (each of length T) of pressure readings from multiple tactile sensors. To compose the training data, only sequences with nonzero values are considered. The training data is given to the BoW algorithm as a series of tag-less sequences. Using a sliding window of length W, BoW divides these sequences into sub-sequences of length W. Each sub-sequence can be viewed as a point in the W-dimensional window space. By performing K-means clustering in this space, K clusters (or features) are generated, each represented by its center. Since each training or test sequence contains multiple sub-sequences, representing each sub-sequence by its closest center generates a histogram of centers for each sequence. This histogram is then represented as a single point in the K-dimensional feature space. For N training sequences, the output of the BoW algorithm is thus N points in the feature space.

In order to have a consistent data stream among different experiments, as a pre-processing step we feed the derivative of the sensor data stream to the BoW algorithm, as shown in Fig 1.
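To make the BoW step concrete, here is a minimal sketch of codebook construction and histogram extraction, assuming scikit-learn and treating each sequence as a 1-D taxel stream; the function names (subsequences, fit_codebook, bow_histogram) are illustrative, not from the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def subsequences(seq, W):
    """Slide a length-W window (stride 1) over a 1-D sequence."""
    return np.array([seq[i:i + W] for i in range(len(seq) - W + 1)])

def fit_codebook(train_seqs, W=7, K=10, seed=0):
    """Cluster all training sub-sequences into K 'words' via K-means."""
    windows = np.vstack([subsequences(s, W) for s in train_seqs])
    return KMeans(n_clusters=K, random_state=seed, n_init=10).fit(windows)

def bow_histogram(seq, codebook, W=7):
    """Map each sub-sequence to its nearest center and histogram the counts,
    giving one point in the K-dimensional feature space."""
    words = codebook.predict(subsequences(seq, W))
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / hist.sum()
```

With N training sequences, stacking the N histograms reproduces the N points in feature space described above.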
This eliminates constant biases in the data points, such as those due to gravity, a person's strength, or variations among sensors.

To classify the sequences using the features provided, we used an SVM [5] classifier. SVM classifies data by providing a hyperplane with the maximum margin from the closest data points of the opposing classes (i.e., the support vectors). With the aid of an appropriate kernel, SVM is able to classify data with non-linear boundaries. The block diagram for the overall process of classification is shown in Fig 1 (an end-to-end code sketch follows Fig. 2).

Fig. 1. Overall process of classification.

IV. HARDWARE SETUP

Our experimental setup consists of a three-fingered Schunk Dexterous Hand (SDH), called the robotic hand from now on. It has two resistive-type sensor pads - DSA 9205 class tactile sensors [8] - per finger, one on the proximal and the other on the distal link. Each pad has a 13x6 matrix of taxels, for a total of 78 taxels per pad. Although the raw data at the taxel level are sampled at 230 Hz, the firmware compresses the data by a factor of seven to provide a stable output consisting of data streams for all 6 pads at 32 Hz. In our experiments, only precision grasps that involve the distal part of the finger were utilized; hence only data from the 3 distal tactile sensors were collected in our studies.

The hand is mounted on a robotic arm, so we can control its orientation. Three different orientations (vertically up, vertically down and horizontal) were used in the experiments, as shown in Fig 2. Orienting the hand allows us to include the effect of gravity, and mimics the versatile hand orientations in object transfer tasks.

Fig. 2. Using a robotic arm to orient the SDH in different directions: a) vertically down, b) vertically up, c) horizontal.
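As promised above, here is a minimal end-to-end training sketch of the Fig. 1 pipeline - derivative pre-processing, BoW features, then an SVM - reusing the hypothetical helpers from the previous sketch. The RBF kernel is our assumption; the paper does not specify the kernel.

```python
import numpy as np
from sklearn.svm import SVC

def preprocess(raw_seq):
    """First-order temporal derivative: removes constant biases such as
    gravity, a subject's grip strength, or per-sensor offsets."""
    return np.diff(np.asarray(raw_seq, dtype=float))

def train_classifier(raw_seqs, labels, W=7, K=10):
    """Train the Fig. 1 pipeline on labeled 1-D taxel streams."""
    seqs = [preprocess(s) for s in raw_seqs]
    codebook = fit_codebook(seqs, W=W, K=K)    # hypothetical helper, above
    X = np.array([bow_histogram(s, codebook, W=W) for s in seqs])
    clf = SVC(kernel="rbf").fit(X, labels)     # kernel choice is an assumption
    return codebook, clf

def classify(raw_seq, codebook, clf, W=7):
    """Classify a new ~0.5 s window of tactile data."""
    return clf.predict([bow_histogram(preprocess(raw_seq), codebook, W=W)])[0]
```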
Three different objects were used in the experiments: a tennis ball, a cylindrical cardboard tube and a cuboid plank of wood, as shown in Fig 3. The contact surfaces of the objects are, respectively, fiber, paper and unpolished wood. These different object shapes were chosen to span different modalities of contact. For instance, the contact points on a sensor pad shift as the ball moves or slips away, but they stay relatively stationary for the plank of wood and the cylinder. The plank of wood usually produces a biased contact pressure during a push or grip, because a human subject will usually apply a small torque on the plank during the handover task. Using different objects makes event identification more challenging, yet more realistic, since it introduces variability similar to that found in real applications. Through preliminary experiments we realised that, for instance, an optical-flow-based method might be robust enough to detect slippage for the ball object, but it would not serve as a robust method for identifying slippage for the other object types.
Fig. 3. Different objects used in the experiments.
The tactile sensors were covered with tape (Fig 4) so that the experiments could be performed without damaging or wearing the hand. The tape also makes slipping easier and ensures a smooth movement of the object on the surface of the finger by decreasing friction, thus creating a consistent data stream. We have reason to believe that our algorithm would perform more accurately, and at a higher confidence level, if the tapes were removed.
Fig. 4. Protecting the tactile sensor with tape.
In each experiment, the data was acquired for 3 seconds. A subset of length T (less than 1 second) was used as the training and testing data for each event. The choice of T reflects a trade-off between classification performance and classification speed: for real-time application of our approach, a lower T is better, because less acquisition time is needed before classification. The start of the subset is determined when at least one of the taxel values for all pads exceeds a threshold, set manually in advance at about 15% of the maximum taxel value (a code sketch of this rule is given at the end of this section). We used a relatively low threshold to avoid falsely discarding samples. Manual thresholding could be avoided by introducing a fifth event, corresponding to "dormant", in the classification scheme.

V. EXPERIMENTAL PROCEDURE

Five graduate students, 4 males and 1 female, between 20-30 years old, volunteered as subjects. The purpose of the test was explained to the students beforehand, and they were shown example videos of each experiment so that they clearly understood its logistics. Finally, the subjects were asked to perform each of the following actions while the object was held in a precision grasp by the SDH:

1) pull: the subject pulls the object from the robotic hand. Subjects were free either to leave the object in the robot's hand or to take it out.
2) push: the subject pushes the object into the robotic hand. Subjects were asked to stop pushing when the object loses contact with the robot's distal pads.
3) hold (also called grip): the subject grips and holds the object. Subjects were instructed that they could grip in whatever way felt natural to them.
4) bump: the subject taps any side of the object once with the distal part of their fingers (to avoid inadvertent damage to the SDH).

A specific GUI, shown in Fig 5, was designed for data acquisition. The software automates hand opening and closing and the movement of the arm, and uses voice commands to inform the subject when to start interacting with the object and when to replace it. To obtain versatile training data, the participants were free to choose their own hand configuration to perform the required action. Finally, the software saves the tactile data stream for each person in a separate file.

The sequence of experiments is as follows. At the start, the investigator initializes the software; the arm moves to a preset pose and the SDH opens up to receive the ball object. A pre-recorded video pops up on the monitor showing a sample of a person pushing the object. The investigator places the ball in the SDH, which closes, holding the ball in a firm grasp. The software signals the subject to push the object. After 3 seconds, the software signals the subject to stop pushing. After about a second, the software signals the subject to re-position the ball in the SDH, and then signals the subject to push the ball into the hand again. This is repeated 5 times. The same sequence is then applied to the pull, hold and bump actions. Next, the entire sequence is repeated for the cylinder and plank objects. Then the robot arm moves to the two other poses, and the entire procedure repeats for each pose. Thus, overall, we collected data from 5 persons × 3 poses × 3 objects × 4 actions × 5 repetitions.

Fig. 5. The GUI developed for conducting experiments and data acquisition. This figure shows a sample bump test for demonstration to the subjects.
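As referenced earlier in this section, here is a minimal sketch of the window-extraction rule (start when at least one taxel on every pad exceeds ~15% of the maximum taxel value); the array layout and names are our assumptions.

```python
import numpy as np

def find_start(pads, frac=0.15):
    """pads: list of (time, taxels) arrays, one per distal pad. Returns the
    first time index at which every pad has some taxel above threshold."""
    above = [(p > frac * p.max()).any(axis=1) for p in pads]
    active = np.logical_and.reduce(above)
    idx = np.flatnonzero(active)
    return int(idx[0]) if idx.size else None

def extract_window(pads, T=15):
    """Cut the T-step training/testing subset from the 3-second recording."""
    start = find_start(pads)
    if start is None or start + T > pads[0].shape[0]:
        return None   # no event detected, or detected too late in the clip
    return [p[start:start + T] for p in pads]
```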
VI. RESULTS

The BoW algorithm has three key parameters: K, the dimension of the feature space; W, the window length; and T, the stream length. Later, we show the effect of these parameters on the accuracy of classification, but first we present results with the values for which we achieved the highest accuracy. These were K = 10, W = 7 and T = 15. Both W and T are measured in number of time steps. Each time step is about 0.03 seconds, which corresponds to the output frequency (32 Hz) of the sensor hardware. In seconds, therefore, T ≈ 0.5.

Fig. 6. The 10 centers determined by K-means clustering, each represented as a time series of length W = 7.
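As a worked restatement of that conversion, using the numbers above:

```latex
T_{\text{seconds}} \;=\; \frac{T}{f_s} \;=\; \frac{15\ \text{time steps}}{32\ \text{Hz}} \;\approx\; 0.47\ \text{s} \;\approx\; 0.5\ \text{s}
```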
We arranged the training and test data sets for our BoW+SVM algorithm in two different configurations:

1) The data of all 5 subjects is pooled, and the training and test sets are chosen randomly, with 80% of the whole data used for training and 20% for testing (sampled without replacement). The low percentage of test data is due to the low number of overall trials.
2) The training data is chosen from 4 participants and tested on the fifth one, resulting in 5 such combinations (a sketch of both splits follows Fig. 8).

Fig 7 shows the confusion matrix for the first configuration. The results for the second configuration are shown in Fig 8. These show that the classifier has learned a rich set of features and can classify events for interaction with a subject who is not in the training set.
Fig. 7. Confusion matrix for the Configuration 1 test, where 80% of the data is randomly chosen for training and the remaining 20% is used for testing.

Fig. 8. Confusion matrix for the Configuration 2 test, where the training data is chosen from 4 participants and testing is done on the fifth one.
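As noted in the list above, here is a minimal sketch of the two evaluation configurations, assuming scikit-learn; X is the (N, K) matrix of BoW histograms, y the event labels, and subject_ids (an assumed bookkeeping array) records which of the 5 subjects produced each sample.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, train_test_split

def configuration_1(X, y, seed=0):
    """Pooled data from all subjects, random 80/20 train/test split."""
    return train_test_split(X, y, test_size=0.2, random_state=seed)

def configuration_2(X, y, subject_ids):
    """Train on 4 subjects, test on the held-out fifth (5 combinations)."""
    for tr, te in LeaveOneGroupOut().split(X, y, groups=subject_ids):
        yield X[tr], X[te], y[tr], y[te]
```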
As mentioned earlier, the three key parameters in our BoW algorithm are K, W and T. We now report the effects of varying these parameters.

K is the number of words, or clusters, output by the k-means sub-algorithm embedded within the main algorithm. Using too many clusters results in overfitting, and using too few clusters results in reduced discrimination capability. Fig 9 shows the effect of changing K on the accuracy of classification, while keeping W and T constant at 7 and 15, respectively. K = 10 yields the best results, which explains our use of K = 10 in our BoW algorithm.

Fig. 9. Effect of the number of centers, K, on accuracy.

For real-time event classification in real object transfer scenarios, the data stream length T should not be too long; however, too short a sequence may not give rich enough features. Fig 10 shows the effect of T (measured in number of time steps) on the accuracy of the classification, while keeping K and W constant (10 and 7, respectively). This explains our use of T = 15 time steps in our experiments which, as mentioned before, corresponds to a duration of about 0.5 seconds for the data stream.
Fig. 10. Effect of the sequence length, T, on accuracy.
The sliding-window size W is also important for creating effective features. Fig 11 shows the effect of changing the window size on the accuracy of classification, while keeping K and T constant at 10 and 15, respectively. This explains our use of W = 7 in our BoW algorithm (a sketch of such a parameter sweep follows Fig. 11).
Fig. 11. Effect of the window size, W, on accuracy.
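A minimal sketch of the parameter sweeps behind Figs. 9-11, shown here for K with W and T held fixed, reusing the hypothetical helpers from the earlier sketches and scoring on a held-out split as in Configuration 1. Illustrative only.

```python
import numpy as np

def sweep_K(train_seqs, train_labels, test_seqs, test_labels,
            Ks=(2, 5, 10, 15, 20), W=7):
    """Accuracy as a function of the number of clusters K (cf. Fig. 9)."""
    scores = {}
    for K in Ks:
        codebook, clf = train_classifier(train_seqs, train_labels, W=W, K=K)
        X_test = np.array([bow_histogram(preprocess(s), codebook, W=W)
                           for s in test_seqs])
        scores[K] = clf.score(X_test, test_labels)
    return scores
```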
Finally, to visualize the separability of the features, we show the t-SNE [25] representation of the features generated by the BoW algorithm in Fig 12 (a plotting sketch follows the figure caption). t-SNE is essentially an unsupervised algorithm that projects high-dimensional data to lower dimensions (two in our case) in such a way that similar data remain clustered together and dissimilar data are separated; it is commonly used to visualize high-dimensional data. Each class is assigned a unique colour, and the data points in the figure are colored according to their respective classes. It is evident that the BoW features make it likely for the SVM to separate most of the data. There are 4 clearly separated clusters, one corresponding to each class, and one cluster formed from different classes. Note that this mixed group appears larger in this visualization because the t-SNE algorithm is unsupervised and has no information about the class tags; a supervised algorithm such as the SVM will generally have significantly lower misclassification.
Fig. 12. t-SNE representation of the features generated from different experiments. It shows the separability of 4 groups of data: slip (red), grip (black), bump (blue) and push (green).
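As referenced above, a minimal plotting sketch for Fig. 12, assuming scikit-learn and matplotlib; X is the (N, K) matrix of BoW histograms and y an array of class names matching the figure legend.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

def plot_tsne(X, y, seed=0):
    """Project BoW feature histograms to 2-D and color by class."""
    emb = TSNE(n_components=2, random_state=seed, init="pca").fit_transform(X)
    for name, color in [("slip", "red"), ("grip", "black"),
                        ("bump", "blue"), ("push", "green")]:
        m = np.asarray(y) == name
        plt.scatter(emb[m, 0], emb[m, 1], c=color, label=name, s=12)
    plt.legend()
    plt.show()
```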
VII. CONCLUSIONS AND FUTURE WORK
In this paper we showed that, using only tactile data, it is possible to classify whether an object held in a robot's hand is being pushed, pulled, held, or bumped by a human receiver during a robot-human object transfer task. These events are closely related and can commonly occur during an object transfer task. Our core algorithm uses standard machine learning techniques - a BoW algorithm for automated feature generation on the temporal tactile data stream, followed by an SVM for classification. We also empirically determined the best values for the key parameters in the BoW algorithm, which result in about 95 ± 2% average classification accuracy.

The experiments we reported in this paper were from a preliminary study. Our next step is to carry out a bigger and more formal study with a larger number of participants and more objects with different geometries and textures to confirm our preliminary findings. Subsequently, we intend to apply this method to a robot-human object transfer task, by combining it with an autonomous fetch-and-delivery system with a mobile manipulator [26] being developed in the RAMP Lab at SFU.
REFERENCES

[1] K. Strabala et al., "Toward seamless human-robot handovers," Journal of Human-Robot Interaction, vol. 2, no. 1, pp. 112–132, 2013.
[2] A. Moon et al., "Meet me where I'm gazing: How shared attention gaze affects human-robot handover timing," in ACM/IEEE International Conference on Human-Robot Interaction, 2014, pp. 334–341.
[3] W. P. Chan et al., "Grip forces and load forces in handovers: Implications for designing human-robot handover controllers," in Seventh Annual ACM/IEEE International Conference on Human-Robot Interaction, 2012, pp. 9–16.
[4] Z.-W. Gui and Y.-R. Yeh, "Time series classification with temporal bag-of-words model," in Technologies and Applications of Artificial Intelligence. Springer, 2014, pp. 145–153.
[5] V. Vapnik, "The support vector method of function estimation," in Nonlinear Modeling. Springer, 1998, pp. 55–85.
[6] M. Längkvist, L. Karlsson, and A. Loutfi, "A review of unsupervised feature learning and deep learning for time-series modeling," Pattern Recognition Letters, vol. 42, pp. 11–24, 2014.
[7] C. Bishop, Pattern Recognition and Machine Learning, ser. Information Science and Statistics. Springer New York, 2016. [Online]. Available: https://books.google.ca/books?id=kOXDtAEACAAJ
[8] Weiss-robotics.com, DSA 9205 tactile sensor specification PDF.
[9] M. T. Francomano, D. Accoto, and E. Guglielmelli, "Artificial sense of slip - a review," IEEE Sensors Journal, vol. 13, no. 7, pp. 2489–2498, 2013.
[10] R. Fernandez et al., "Slip detection in a novel tactile force sensor," in Robotics Research. Springer, 2016, pp. 237–252.
[11] E. Holweg, H. Hoeve, W. Jongkind, L. Marconi, C. Melchiorri, and C. Bonivento, "Slip detection by tactile sensors: Algorithms and experimental results," in IEEE International Conference on Robotics and Automation (ICRA), vol. 4, 1996, pp. 3234–3239.
[12] J. A. Alcazar and L. G. Barajas, "Estimating object grasp sliding via pressure array sensing," in IEEE International Conference on Robotics and Automation (ICRA), 2012, pp. 1740–1746.
[13] Y. Bekiroglu, D. Kragic, and V. Kyrki, "Learning grasp stability based on tactile data and HMMs," in IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), 2010, pp. 132–137.
[14] Y. Bekiroglu et al., "Assessing grasp stability based on learning and haptic data," IEEE Transactions on Robotics, vol. 27, no. 3, pp. 616–629, 2011.
[15] H. Dang and P. K. Allen, "Stable grasping under pose uncertainty using tactile feedback," Autonomous Robots, vol. 36, no. 4, pp. 309–330, 2014.
[16] J. A. Fishel, V. J. Santos, and G. E. Loeb, "A robust micro-vibration sensor for biomimetic fingertips," in IEEE RAS & EMBS International Conference on Biomedical Robotics and Biomechatronics (BioRob), 2008, pp. 659–663.
[17] B. Heyneman and M. R. Cutkosky, "Slip classification for dynamic tactile array sensors," The International Journal of Robotics Research, vol. 35, no. 4, pp. 404–421, 2016.
[18] A. Schneider et al., "Object identification with tactile sensors using bag-of-features," in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2009, pp. 243–248.
[19] M. Madry, L. Bo, D. Kragic, and D. Fox, "ST-HMP: Unsupervised spatio-temporal feature learning for tactile data," in IEEE International Conference on Robotics and Automation (ICRA), 2014, pp. 2262–2269.
[20] H. Soh, Y. Su, and Y. Demiris, "Online spatio-temporal Gaussian process experts with application to tactile classification," in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2012, pp. 4489–4496.
[21] B. Chakraborty, "Feature selection and classification techniques for multivariate time series," in IEEE ICICIC, 2007, p. 42.
[22] M. Hüsken and P. Stagge, "Recurrent neural networks for time series classification," Neurocomputing, vol. 50, pp. 223–235, 2003.
[23] T. Rakthanmanon et al., "Searching and mining trillions of time series subsequences under dynamic time warping," in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2012, pp. 262–270.
[24] H. Hermansky, D. P. Ellis, and S. Sharma, "Tandem connectionist feature extraction for conventional HMM systems," in IEEE ICASSP, 2000, pp. 1635–1638.
[25] L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2579–2605, 2008.
[26] M. Hegedus, K. Gupta, and M. Mehrandezh, "Towards an integrated autonomous data-driven grasping system with a mobile manipulator," in