Multimodal Contact Detection using Auditory and Force Features for Reliable Object Placing in Household Environments
Jaime L. Maldonado C., Asil Kaan Bozcuoğlu, and Christoph Zetzsche

Cognitive Neuroinformatics and Institute for Artificial Intelligence, University of Bremen, Bremen, Germany
[email protected], [email protected], [email protected]
Abstract.
Typical contact detection is based on monitoring a threshold value in the force and torque signals. The selection of a threshold is challenging for robots operating in unstructured or highly dynamic environments, such as in a household setting, due to the variability of the characteristics of the objects that might be encountered. We propose a multimodal contact detection approach using time and frequency domain features which model the distinctive characteristics of contact events in the auditory and haptic modalities. In our approach the monitoring of force and torque thresholds is not necessary, as detection is based on the characteristics of force and torque signals in the frequency domain together with the impact sound generated by the manipulation task. We evaluated our approach with a typical glass placing task in a household setting. Our experimental results show that robust contact detection (99.94% mean cross-validation accuracy) is possible independent of force/torque threshold values and is suitable for implementation in highly dynamic scenarios.
One of the major problems in autonomous robotics is operating in unstructured, highly dynamic or unknown environments, including the inherent challenge of reacting to unexpected situations during task execution [6]. The ability to appropriately react to such unexpected events requires continuous monitoring of the task progression. There is a variety of situations household robots could be confronted with, such as placing objects or using tools, and one typical element that exists in most of these action sequences is the contact event.

Contact events that happen during the execution of everyday activities produce characteristic sensory information in the visual, haptic and auditory modalities. Haptic information includes the changes in the load force and torques experienced at hand-held tools or manipulated objects. Regarding the acoustic information, discrete and continuous interactions, such as tapping or rubbing a surface with a tool, produce distinctive sounds which can be identified as the acoustic signature of an action-object interaction.
In order to make use of the sensory information of the individual modalities, a number of potential uncertainties must be considered during the design of contact detection systems. These uncertainties are related not only to the robot's perception capabilities but also to the characteristics of the manipulated objects and the strategy followed by the robot to execute the task. The processing of multimodal sensory information in robotic applications aims to compensate for the uncertainties or shortcomings of the individual sensor modalities. Monitoring multimodal signals during task execution enables the implementation of different functionalities such as task monitoring (e.g. for success-failure detection) [11], behavior switching [11], modeling the effects of different actions relevant for the task [3] and action adaptation [3].

Detection of contacts and collisions (i.e. unintended contacts) is typically focused on the haptic modality and is implemented by monitoring threshold values in the force and torque signals [6][7] or by monitoring filtered force/torque signals [2][8]. In this paper we present a multimodal contact detection approach which, instead of monitoring signals in the time domain, uses time and frequency domain features that model the distinctive characteristics of contact events in the auditory and haptic modalities. In particular, our approach addresses the uncertainties inherent to force and torque sensor readings, which depend on the characteristics of the manipulated object (such as weight and dimensions) and on the manipulation task. In the context of a household setting these uncertainties pose an enormous difficulty for achieving robust contact detection during the manipulation of diverse objects, including unknown objects and objects with dynamic characteristics (e.g. liquid containers where the amount of liquid is unknown).
The advantages of our approach can be summarized as follows: (i) we do not rely on the monitoring of force/torque threshold values, which are difficult to determine a priori; (ii) we do not rely on position information; (iii) we do not rely on velocity or acceleration information, which can be difficult to determine in real time.

We evaluated our contact detection approach with a prototypical glass placing task in a household setting. The experimental setup is shown in Figure 1. Our experimental results show that robust contact detection is possible independent of the monitoring of force/torque threshold values, and indicate that the selected features provide a good representation of the sensory signature of contact events and are thus well suited for contact detection applications.
In this section we focus on recent examples of multimodal task execution and monitoring applications, together with the properties of sound and force signals relevant for the detection of contacts.
Fig. 1.
Experimental setup: robot placing the glass on the table. The robot is equipped with a built-in head microphone and a force/torque sensor in the wrist.
Park et al. [11] used the relation between force and sound for execution monitoring and anomaly detection. They detected anomalies using haptic and auditory signals during the execution of a pushing task in which the robot closed a microwave oven. In their approach, anomaly detection is based on the comparison of the current signals with signal features associated with successful task executions, which are characterized by a stereotypical force and sound energy profile (a loud sound and a force reduction when the door is secured).

Chu et al. [3] investigated action adaptation to manipulate novel objects (i.e. not previously encountered) and adaptation to changes in previously learned objects. Their approach is based on the adaptation of an affordance model of different object-action pairs in the haptic, audio and visual modalities. They collected interactions of a robot opening a drawer and turning on a lamp. With these data they modeled action-object pairs for different segments of task execution (e.g. grasp and release segments). The adaptation to changes in previously learned objects was evaluated by letting a robot open a drawer in different states, ranging from fully closed to fully open. Manipulation of novel objects was evaluated by letting the robot turn on different lamps, which differed in shape, size and length of pull chain, whereby the robot had to determine the point at which to stop pulling the chain.
Impact sounds occur during the manipulation of objects and tools in everyday life. The characteristics of the sound depend on the objects interacting (e.g. when placing a glass on a wooden table) and the movement that generated the sound (e.g. a soft or harsh move). Thus impact sounds encode perceptual information related to the physical attributes of the interacting objects (material, shape, size) and the movement (impact force) [4][12]. Impact sounds are characterized by a short duration, an abrupt onset and a rapid decay [1]. These particular characteristics are taken into account in the field of audio analysis and computer audition. In particular, features based on the energy level of the audio signal can be used for the detection and classification of sounds characterized by significant and abrupt energy changes, such as gunshots, explosions or other environmental sounds [5].
The relevant phases involved in the detection and handling of collisions in robotic systems have been reviewed by Haddadin et al. [6]. Common approaches for collision detection include the monitoring of the measured currents in the electrical drives and the monitoring of the instantaneous torque [6]. A collision detection system should be fast, with a minimum occurrence of false detections [6]. The major design challenge is the selection of a threshold on the monitoring signals to avoid false positives while achieving high sensitivity. The selection of a monitoring threshold in any of the common approaches is difficult because of the highly varying dynamic characteristics of the control torques [6]. Furthermore, a robust detection system should take into account the following factors, which depend on the robot state, load torque, temperature, and time [6]: torque/current measurement noise, position and velocity sensor noise, and modeling errors in the estimated robot dynamics. Haddadin et al. [6] pointed out the similarities between collisions and impacts during robot manipulation tasks; thus the relevant aspects of collision detection also apply to object and tool manipulation scenarios.
In the field of human-robot interaction, the characteristics of the force and torque signals in the frequency domain have been considered for the detection of contacts and collisions. Cho et al. [2] implemented a collision detection algorithm based on the observation that the force during an unintended collision has a faster rate of change than the rate measured during intended contacts. The torque data is monitored by means of a high-pass filter, which enables the distinction between intended contacts and unexpected collisions. Following a similar approach, Li et al. [8] implemented a low-pass and band-pass filter observer for robot contact and collision detection. Since the frequency components of the collision force signals are located in higher bands than those of the intentional contact force signals, the two filters enable the distinction between contacts and collisions.
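The filter-based idea can be sketched as follows. This is a minimal illustration of the principle, not the exact observers of [2] or [8]; the cutoff frequency, filter order and decision threshold are illustrative assumptions, and the 100 Hz sampling rate matches the sensor described later in the text.

```python
import numpy as np
from scipy.signal import butter, sosfilt

FS = 100.0  # force/torque sampling rate in Hz (illustrative, matches the sensor used later)

def highpass_torque(torque, cutoff_hz=5.0, order=2):
    """High-pass filter a torque signal; collisions concentrate their
    energy in higher frequency bands than slow, intended contacts."""
    sos = butter(order, cutoff_hz, btype="highpass", fs=FS, output="sos")
    return sosfilt(sos, np.asarray(torque, dtype=float))

def collision_flag(torque, threshold=0.5):
    """Flag samples whose high-frequency torque content exceeds a threshold.
    Both the cutoff and the threshold here are hypothetical values."""
    return np.abs(highpass_torque(torque)) > threshold
```

A sudden step in the torque (a collision-like transient) passes through the high-pass filter almost unattenuated, while a slow ramp (an intended contact) produces only a small high-frequency residue and is not flagged.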
The task we are dealing with is placing objects on a surface in the context of everyday household activities. The properties of the manipulated objects differ due to the various shapes, sizes and weights available. In this context, the robot might be requested, for example, to set dishes and glasses on the table for dinner. The robot estimates the distance between the in-hand object and the support surface and is then commanded beyond a worst-case position. A force threshold is usually used to detect the contact of the object against the surface. The object can be released once contact has been detected.

Figure 2 illustrates the time course of the position, force, torque and audio signals registered during the manipulation task. The initial force magnitude corresponds to the weight of the grasped object. The initial amplitude of the audio signal corresponds to the background noise of the environment and the robot's ego noise. At the beginning of the movement a small perturbation of the baseline measured force can be observed. Correspondingly, the sound of the robot's actuators can be observed. At the moment of contact the characteristic abrupt onset of the audio signal and its rapid decay can be observed, together with the change in the load force and the torque registered at the robot's sensor. After the glass makes contact with the surface, the robot arm keeps moving toward the commanded position until the torque threshold is exceeded, indicating that the goal of the task has been accomplished.

Detection of contacts is typically realized by monitoring hard-coded thresholds in the force signal [6][7]. While this strategy can work in well-structured environments where the characteristics of the manipulated objects are well known, the establishment of suitable thresholds in unstructured environments is challenging due to the sources of uncertainty of the task.
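For contrast, the conventional threshold-based detection described above amounts to a single comparison per sample. A minimal sketch, where the threshold argument is exactly the quantity that is hard to choose a priori:

```python
import numpy as np

def detect_contact_by_threshold(force, threshold):
    """Return the index of the first sample whose force magnitude exceeds
    a hard-coded threshold, or None if it is never exceeded. The threshold
    must be tuned per object and surface, which makes this approach brittle."""
    force = np.asarray(force, dtype=float)
    idx = np.flatnonzero(np.abs(force) > threshold)
    return int(idx[0]) if idx.size else None
```

With a threshold of 1.0, a force trace `[0.1, 0.2, 2.5, 3.0]` triggers at index 2, while a trace that never crosses the threshold yields `None`.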
Apart from the uncertainties related to the robot and its sensors listed in section 2.3, the weight, shape and dimensions of the manipulated object influence the measured force/torque values. In particular, object weights can be difficult to determine a priori, as in the case of liquid containers. Furthermore, certain objects might be subject to contact force constraints, as encountered during the manipulation of fragile objects. In this case, the unnecessary exertion of force after contact might risk the object's integrity.
Figure 3 illustrates the force and sound signals during the manipulation task. We use these signals to obtain time and frequency domain features for contact detection. As described in section 2.4, force and torque signals show a specific behavior in the frequency domain during contact. This behavior is illustrated in the spectrograms shown in Figure 3. Before the movement onset, the signal shows a strong low frequency component. It can also be observed that the amplitude of the frequency components between 0 Hz and 25 Hz increases at movement onset and at the moment of contact. This behavior can be easily observed in
Fig. 2.
Multimodal signals as the robot places the glass on the table. (a)
Position and velocity. (b) Force. (c) Torque. (d)
Audio signal. Apart from the impact sounds, the audio signal also displays the ego noise of the robot and the background noise of the laboratory. The dashed semi-transparent vertical line indicates the moment in which the bottom of the glass makes contact with the table.

the torque signal. As opposed to the filter-based detection approaches used in [2] and [8], we propose to measure the changes in the frequency domain of the force and torque signals by means of spectral features, which are described in section 4.3.

The sound signal produced at the moment of contact is illustrated in Figure 3. As described in section 2.2, contact sounds can be modeled by means of features based on the energy level of the audio signal. The time-domain feature used to identify contact sounds is described in section 4.2.

The relation between the force, torque and audio signals and their corresponding features can be observed in Figure 3. In order to avoid the shortcomings and challenges of detection procedures based on position, velocity and acceleration data or on force threshold values, we propose to detect contacts based on frequency domain force features and time domain audio features. In order to detect contacts from the multimodal features we trained a
Random Forest classifier. The characteristics of the classifier and its parameters are described in section 4.4.
For the evaluation of our multimodal contact detection approach we used a Toyota Human Support Robot (HSR) (Fig. 1). The robot is designed to support human household activities and to provide assistance to handicapped persons. The HSR has a single arm with 5 DoF attached to a gripper equipped with tactile sensors and suction capability. The wrist is equipped with a 3-axis force/torque sensor sampled at 100 Hz. The HSR's base has omni-directional movement capabilities and can be lifted by a prismatic joint (movable range 0-1,350 mm). For sound sensing, the HSR employs a Playstation 3 Eye microphone mounted at the top of its head. The raw audio signal from the HSR's microphone was recorded at 44.1 kHz.

We recorded 60 task executions of the robot placing a glass on a table as shown in Figure 1. After grasping the glass, the robot was commanded to lift its arm to a start position of 36 cm. In order to put the glass on the table, the robot was commanded to move downwards and to monitor the torque signal in the y axis. The torque threshold was set at -3 N·m. After the contact of the glass with the table the robot kept moving downwards until the magnitude of the torque in the y axis exceeded the threshold. At this point the robot was commanded to move its arm upwards. The trial ended once the robot returned to the start position. The course of the trial enabled the recording of acoustic and force/torque signals at rest (at the beginning and end of each trial), during movement onset and offset, at the moment of the initial contact, and during the sustained contact between the glass and the table.

In order to imitate a real-life scenario in which impact sounds from other sources might be registered in the auditory modality, we added two types of exogenous impact sounds during the movement execution. In 20 trials the experimenter randomly hit the robot's hand, and in 20 trials the experimenter randomly hit the table.
Fig. 3.
Multimodal signals and features during the manipulation task. (a)
Position. The robot is commanded toward the position of the table. After the torque threshold is detected the robot moves its arm toward the start position. (b)
Magnitude of the force signal and its spectrogram. (c)
Frequency domain features computed from the force signal. (d)
Magnitude of the torque signal and its spectrogram. (e)
Frequency domain features computed from the torque signal. (f)
Audio signal and its corresponding audio power feature. The dashed semi-transparent vertical line indicates the moment in which the bottom of the glass makes contact with the table. Subfigures (b) and (d) illustrate the change in the frequency content of the force and torque signals at the initial contact and during the sustained contact of the glass with the table. Subfigure (f) illustrates two exogenous contact sounds that occurred during task execution (visible as peaks in the curve of the audio feature).
Features based on the energy level of the audio signal can be used for the detection and classification of sounds characterized by significant and abrupt energy changes [5], such as those produced during the initial contact of the glass and the table. In order to monitor the abrupt changes in the audio signal, the normalized power $P(i)$ of the signal was computed:

$$P(i) = \frac{1}{W_L} \sum_{n=1}^{W_L} |x_i(n)|^2 \qquad (1)$$

in which $x_i(n)$ is the sequence of audio samples of the $i$-th frame, with $n = 1, \dots, W_L$, where $W_L$ is the length of the frame. The power is normalized by dividing it by $W_L$ in order to remove the dependency on the frame length [5]. The frame length was 512 samples and the hop window 160 samples. The audio signal processing was implemented in HARK, an open-source robot audition software [10], and $P(i)$ was computed with a custom Python script executed in a HARK processing node.

As shown in section 2.4, force and torque signals show a specific behavior in the frequency domain during contacts. In order to measure the changes of the force and torque signals during task execution we computed the spectral centroid, the spectral spread and the spectral flux features. These features are computed on the Discrete Fourier Transform (DFT) coefficients obtained for each signal frame. In the following equations $X_i(k)$ is the magnitude of the DFT coefficients of the $i$-th signal frame, with $k = 1, \dots, W_{fL}$, where $W_{fL}$ is the number of DFT coefficients. The DFT was computed over 160 coefficients with a hop window of 128 samples.

The spectral centroid is the 'center of gravity' of the spectrum. The spectral centroid $C_i$ of the $i$-th signal frame is defined as:

$$C_i = \frac{\sum_{k=1}^{W_{fL}} k\, X_i(k)}{\sum_{k=1}^{W_{fL}} X_i(k)} \qquad (2)$$

The spectral spread $S_i$, which provides a measure of how the spectrum is distributed around $C_i$, is defined as:

$$S_i = \sqrt{\frac{\sum_{k=1}^{W_{fL}} (k - C_i)^2\, X_i(k)}{\sum_{k=1}^{W_{fL}} X_i(k)}} \qquad (3)$$

$C_i$ corresponds to the frequency content of the signal (e.g. high $C_i$ values correspond to a signal with dominant components in the high frequency bands) and $S_i$ is commonly associated with the bandwidth of the signal (i.e. low values correspond to a spectrum tightly concentrated around the spectral centroid). Both $C_i$ and $S_i$ were normalized to the range [0, 1] by dividing their values by $F_s/2$ ($F_s$ is the sampling rate of the force and torque signals).

The spectral flux $F_l$ is a measure of the change of the spectrum between two successive signal frames.
$F_l$ of the $i$-th frame is defined as:

$$F_l(i, i-1) = \sum_{k=1}^{W_{fL}} \left( EN_i(k) - EN_{i-1}(k) \right)^2 \qquad (4)$$

where $EN_i(k) = \frac{X_i(k)}{\sum_{l=1}^{W_{fL}} X_i(l)}$ is the $k$-th normalized DFT coefficient at the $i$-th frame.

Random forest is a supervised machine learning algorithm which uses decision trees as its main building block [9]. This algorithm does not require scaling of the data. The parameters of the algorithm are n_estimators (i.e. the number of trees) and max_features, which controls the number of features that are considered at each node of the decision tree [9]. The max_features parameter was set to the value recommended for classification tasks, max_features = sqrt(n_features) [9]. Regarding the selection of the number of trees to build the model, a larger number of trees provides smoother and more robust decision boundaries. However, a large number of trees increases the computational cost with marginal improvements in classification accuracy. The n_estimators parameter was determined by comparing the classification performance with different settings (see section 5.1). We used the implementation of the random forest classifier available in the Python sklearn library, version 0.20.3.

In order to detect the contact of the glass on the table we trained a random forest model with time and frequency domain features. The lengths of the feature sequences obtained from the audio and force/torque signals differed due to the differing sampling rates. Since the power of the audio signal is the feature with the highest resolution, the force/torque features were interpolated to match the length of the audio feature sequence. The interpolated feature samples were labeled as contact or not contact and were divided into training (56% of the data), validation (19% of the data) and test (25% of the data) sets. The performance of the trained model with the selected n_estimators is presented in section 5.3.

Figure 4 shows the audio and force features computed for the 60 executions of the task.
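The features of Equations (1)-(4) can be sketched compactly as follows. The frame and hop lengths default to the values given in the text; computing the spectral features over `rfft` bin indices and normalizing by the number of bins (equivalent to dividing frequency values by $F_s/2$) is an implementation assumption of this sketch, not a detail stated in the text.

```python
import numpy as np

def frame_power(x, frame_len=512, hop=160):
    """Normalized power P(i) per frame (Eq. 1): mean squared amplitude."""
    x = np.asarray(x, dtype=float)
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.array([np.mean(np.abs(x[i * hop:i * hop + frame_len]) ** 2)
                     for i in range(n_frames)])

def centroid_and_spread(frame):
    """Spectral centroid C_i (Eq. 2) and spread S_i (Eq. 3) of one frame,
    computed over DFT bin indices and normalized to [0, 1]."""
    X = np.abs(np.fft.rfft(frame))
    k = np.arange(1, len(X) + 1)       # bin indices k = 1..W_fL
    denom = np.sum(X) + 1e-12          # guard against all-zero frames
    C = np.sum(k * X) / denom
    S = np.sqrt(np.sum(((k - C) ** 2) * X) / denom)
    return C / len(X), S / len(X)      # same effect as dividing Hz values by Fs/2

def spectral_flux(frame_prev, frame_cur):
    """Spectral flux (Eq. 4): squared difference of the L1-normalized
    magnitude spectra of two successive frames."""
    Xp = np.abs(np.fft.rfft(frame_prev))
    Xc = np.abs(np.fft.rfft(frame_cur))
    ENp = Xp / (np.sum(Xp) + 1e-12)
    ENc = Xc / (np.sum(Xc) + 1e-12)
    return float(np.sum((ENc - ENp) ** 2))
```

As a sanity check, a high-frequency tone yields a larger normalized centroid than a low-frequency tone, and the flux between two identical frames is zero.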
The peaks in the audio power subplot show the occurrence of contact and non-contact related impact sounds. By visual inspection it can be observed that the force and torque features provide a good representation of the behavior of the signals during movement onset and at the moment of contact of the glass and the table. However, it is important to notice that the values of the force and torque features at the moment of contact are similar to those measured when the robot moves its arm upwards, returning to the start position. In this case, the audio power feature carries the information necessary to distinguish contacts from non-contacts whenever the force/torque features show ambiguous values.
Fig. 4.
Position, audio, force (F) and torque (T) features for the 60 executions of the task.
The number of trees (n_estimators) was determined by comparing the Area Under the Curve (AUC) score of the
Receiver Operating Characteristic (ROC) obtained with the validation set for models trained with different n_estimators values ranging from 1 to 10. The results are displayed in Table 1. The accuracy of all the n_estimators settings is above 98%. The AUC values, which express the average precision with a value between 0 (worst) and 1 (best) [9], are above 0.99 for all settings. As recommended in [9] for classification problems with imbalanced classes (in our dataset not contact samples occur more often), the AUC is used as the criterion for the selection of the n_estimators parameter. Thus the number of trees is set to n_estimators = 8.

Table 1. Comparison of accuracy and AUC scores with different n_estimators.

n_estimators | accuracy | AUC

In order to evaluate the extent to which the model trained with n_estimators = 8 generalizes over the whole data set we performed a stratified 10-fold cross-validation. The mean cross-validation accuracy of the trained model is 99.94% (SD = 0.…).

We conducted an experiment to evaluate the extent to which a random forest classifier can detect contact and non-contact events from auditory and force features. The classification results over the test data set are shown in Figure 5 and Table 2.
Fig. 5.
Confusion matrix.
Table 2.
Classification results

                 precision  recall  f1-score  support
not contact           1.00    1.00      1.00    81368
contact               0.95    0.97      0.96      146
weighted average      1.00    1.00      1.00    81514
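Metrics of the kind shown in Table 2 can be reproduced with sklearn as follows. The label counts below are toy values chosen only to mimic the heavy 'not contact' majority; they are not the paper's test set.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# toy ground truth and predictions with a heavy 'not contact' majority
# (the counts here are illustrative, not the reported support values)
y_true = np.array([0] * 990 + [1] * 10)
y_pred = y_true.copy()
y_pred[995] = 0   # one missed contact (false negative)

cm = confusion_matrix(y_true, y_pred)
report = classification_report(y_true, y_pred,
                               target_names=["not contact", "contact"])
```

`cm` holds the confusion matrix (rows: true class, columns: predicted class) and `report` the per-class precision, recall, f1-score and support in the same layout as Table 2.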
In addition to the confusion matrix and the classification results shown in Table 2, the random forest algorithm provides the feature importance, a summary statistic which quantifies how informative each feature is for the classification task. The feature importance is expressed as a number between 0 and 1 which describes the importance of each feature for the classification decision. The importances of all features sum to 1. The feature importance results shown in Figure 6 indicate that the audio power is the most informative feature for detecting contact events, followed by the torque centroid and the torque spread. Fig. 6.
Importance of audio, force (F) and torque (T) features.
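The feature importances come directly from the fitted model via `feature_importances_`. A sketch on synthetic data, where only a stand-in 'audio' feature is informative by construction (the data and labels are placeholders, not the paper's):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
audio = rng.random(500)
noise = rng.random((500, 6))          # uninformative stand-ins for F/T features
X = np.column_stack([audio, noise])
y = (audio > 0.7).astype(int)         # labels driven only by the audio feature

clf = RandomForestClassifier(n_estimators=8, max_features="sqrt",
                             random_state=0).fit(X, y)
importances = clf.feature_importances_   # non-negative, sums to 1
```

Because the labels depend only on the first column, its importance dominates, mirroring how the audio power dominates in Figure 6.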
We have introduced an approach to detect contacts during manipulation tasks using auditory and force signals. We selected time and frequency domain features that represent the signature characteristics of contacts in the auditory and haptic modalities as reported in the literature. We used these features to train a classifier model. The classification results indicate that the features are appropriate for the multimodal detection of contacts. Our results indicate that contact detection is possible independent of the use of force/torque threshold values, and we provide an alternative to the use of filters for processing the force/torque signals.

Regarding the auditory modality, our results show that the sound information is important for the detection of contacts (as quantified by the feature importance shown in Figure 6) and that, combined with force/torque features, it is robust against exogenous contact sounds that would otherwise generate false detections. The low feature importance of the force and torque features might indicate that they encode redundant information [9]. Therefore, classification with a reduced set of these features should be investigated in order to assess whether similar detection results can be achieved with fewer features.

We used frequency domain features of the force/torque signals to detect contacts. The use of these features can be extended in order to identify segments of task execution. A visual inspection of Figure 4 suggests that movement toward and away from the table, as well as the sustained contact of the glass with the table, can be identified based on the spectral features. This would enable task monitoring and the identification of sub-task completions independent of force or torque threshold values.
Acknowledgements
References
1. Aramaki, M., Kronland-Martinet, R.: Analysis-synthesis of impact sounds by real-time dynamic filtering. IEEE Transactions on Audio, Speech and Language Processing (2), 695-705 (Mar 2006). https://doi.org/10.1109/tsa.2005.855831
2. Cho, C.N., Kim, J.H., Kim, Y.L., Song, J.B., Kyung, J.H.: Collision detection algorithm to distinguish between intended contact and unexpected collision. Advanced Robotics (16), 1825-1840 (May 2012). https://doi.org/10.1080/01691864.2012.685259
3. Chu, V., Gutierrez, R.A., Chernova, S., Thomaz, A.L.: Real-time multisensory affordance-based control for adaptive object manipulation. In: 2019 International Conference on Robotics and Automation (ICRA). IEEE (May 2019). https://doi.org/10.1109/icra.2019.8793860
4. Cook, P.R.: Physically informed sonic modeling (PhISM): Synthesis of percussive sounds. Computer Music Journal (3), 38-49 (1997)
5. Giannakopoulos, T., Pikrakis, A.: Audio features. In: Introduction to Audio Analysis, pp. 59-103. Elsevier, Oxford (2014). https://doi.org/10.1016/b978-0-08-099388-1.00004-2
6. Haddadin, S., De Luca, A., Albu-Schaffer, A.: Robot collisions: A survey on detection, isolation, and identification. IEEE Transactions on Robotics (6), 1292-1312 (Dec 2017). https://doi.org/10.1109/tro.2017.2723903
7. Haninger, K., Surdilovic, D.: Multimodal environment dynamics for interactive robots: Towards fault detection and task monitoring. In: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE (Oct 2018). https://doi.org/10.1109/iros.2018.8593650
8. Li, Z., Ye, J., Wu, H.: A virtual sensor for collision detection and distinction with conventional industrial robots. Sensors 19(10), 2368 (May 2019). https://doi.org/10.3390/s19102368
9. Müller, A., Guido, S.: Introduction to Machine Learning with Python: A Guide for Data Scientists. O'Reilly Media, Sebastopol, CA, first edition (2017)
10. Nakadai, K., Takahashi, T., Okuno, H.G., Nakajima, H., Hasegawa, Y., Tsujino, H.: Design and implementation of robot audition system HARK, open source software for listening to three simultaneous speakers. Advanced Robotics 24