Non-invasive Cognitive-level Human Interfacing for the Robotic Restoration of Reaching & Grasping

Accepted manuscript, IEEE/EMBS Neural Engineering (NER) 2021
Ali Shafti and A. Aldo Faisal

Dept. of Computing and Dept. of Bioengineering, Behaviour Analytics Lab, Data Science Institute, MRC London Institute of Medical Sciences, Imperial College London, SW7 2AZ, London, UK. {a.shafti, a.faisal}@imperial.ac.uk

Abstract — Assistive and wearable robotics have the potential to support humans with different types of motor impairments to become independent and fulfil their activities of daily living successfully. The success of these robot systems, however, relies on the ability to meaningfully decode human action intentions and carry them out appropriately. Neural interfaces have been explored for use in such systems with several successes; however, they tend to be invasive and require training periods in the order of months. We present a robotic system for human augmentation, capable of actuating the user's arm and fingers for them, effectively restoring the capability of reaching, grasping and manipulating objects, controlled solely through the user's eye movements. We combine wearable eye tracking, the visual context of the environment and the structural grammar of human actions to create a cognitive-level assistive robotic setup that enables users to fulfil activities of daily living, while conserving interpretability and the agency of the user. The interface is worn, calibrated and ready to use within 5 minutes. Users learn to control and make successful use of the system with an additional 5 minutes of interaction. The system is tested with 5 healthy participants, showing an average success rate of . on first attempt across 6 tasks.

I. INTRODUCTION
Assistive robotic systems represent an immense opportunity to overcome disabilities arising from motor impairments and to allow those who have suffered such losses to regain their physical interaction capabilities. From exoskeleton stroke rehabilitation devices [1] to prosthetic systems [2], assistive robotics presents the possibility to enhance, augment, rehabilitate and/or replace human capabilities and functionalities, such as upper limb movement.

The effective use of robotic systems as motor assistive devices relies heavily on safe and efficient human-robot interfaces that allow the reliable decoding of human intention of action. Neural interfaces have been thoroughly explored within this context. The use of surface EMG is one non-invasive approach, and has been widely applied in the control of robotic prostheses [2], though in some cases involving invasive surgical nerve transfer procedures to create additional recording sites for myoelectric control signals [3]. Other approaches involve Brain-Machine Interfaces. Musallam et al. have shown the possibility of identifying high-level cognitive signals, through electrode arrays implanted in monkeys, that define movement goals of reach in visual coordinates [4].
Fig. 1: Our gaze-driven assistive robotic reach and support system. Top: The user wears eye trackers that have been augmented with a depth camera for scene understanding and 3D gaze estimation. Head pose is tracked optically. The Universal Robots UR10 is used to actuate the user's arm to restore reach, and the BioServo Carbonhand robotic glove restores grasp. Bottom: (A) (1) The user looks at the orange with an intention of action; this is detected by the system, the task identified and motion planning performed, resulting in the robot arm moving the user's arm to the orange and the robot glove making the grip (2). Once the orange is grasped, and the arm moved away to open the field of view, the user looks at the bowl with an intention of action, which results in the system assisting them in reaching the bowl and dropping the orange in it (4). (B) Here, the user has similarly grasped a cup; once the user looks at the bowl with an intention of action, the rest of the sequence changes to one of pouring: in (2) we see the reach to the bowl, and in (3) and (4) the change of pose to enable pouring. (C) Examples of the egocentric view of the user, with gaze, object labelling and intention decoding running.

Hochberg et al. show the use of cortical neuronal ensemble signals, obtained through a surgically implanted microelectrode array, for the control of a robotic arm in reach and grasp tasks by two tetraplegic users [5]. Collinger et al. implant two 96-channel intracortical microelectrode arrays in an individual with tetraplegia and show that, after 13 weeks of training, they can control 7 degrees of freedom of an anthropomorphic robotic arm to fulfil tasks [6]. Ajiboye et al. use a similar implant approach with a tetraplegic participant, allowing them to reach and grasp with their own limbs through the combination of robotic support and functional electrical stimulation [7]. While these results show the great potential and impact of robotic systems as assistive devices, they all rely on invasive surgical procedures followed by long training times.

Here, we present the use of a non-invasive and intuitive robot interface: our eyes. Eye movement is often preserved in the presence of other severe motor impairments [8], and can serve as a window to our intentions: research has shown that our gaze patterns change depending on the task at hand [9]. Finally, eye movements, being the natural interface between humans and their surrounding environment, make for an intuitive control interface. We previously showed the feasibility of using the eyes as a means to control robotic systems, e.g. to augment human capability [10], allowing users to draw and write while eating and drinking [11], as well as assisting them in reaching and grasping using their own arm and hand [12], [13]. We have also demonstrated other multi-modal approaches to head-gaze estimation [14] and neural interfaces [15] for BMI use. Others have shown the feasibility of using gaze as an interface for robot control, be it for augmentation in cases such as surgical robotics [16], or assistive robots that fulfil tasks for the user [17]. Here, we present an expansion of our previous work [13], enabling users to reach and grasp using their own arm and hand, assisted by a robotic arm and glove, with gaze as the interface.

II. MATERIALS AND METHODS
The overview of our system's architecture can be seen in Figure 2. We use the SMI ETG 2W eye trackers (SensoMotoric Instruments Gesellschaft für innovative Sensorik mbH, Teltow, Germany), augmented with the Intel RealSense D435 RGB-D camera (Intel Corporation, Santa Clara, California, USA) to provide the 3D context of gaze and the surrounding environment. We use OptiTrack Flex 13 cameras (NaturalPoint, Inc. DBA OptiTrack, Corvallis, Oregon, USA) to track the user's head pose through optical markers (see Figure 1). We use our own method for absolute 3D gaze point estimation, which we presented in detail in [13].

To understand the context of gaze, we create a real-time object detection system using a Convolutional Neural Network (CNN) implemented in TensorFlow, which runs on the RGB stream and labels objects in the user's egocentric field of view; details were previously presented in [18]. Through this we can infer the intention of the user.
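Our full absolute 3D gaze estimation method, including head-pose fusion and calibration, is detailed in [13]. As a rough illustration of the core step only, the sketch below back-projects a 2D gaze pixel (already mapped into the depth camera's frame) into a 3D point using the RealSense SDK; the function name and stream settings are illustrative assumptions, not our published pipeline.

```python
# Minimal sketch: deprojecting a 2D gaze pixel into a 3D point using the
# RGB-D depth stream. Illustrative only; the paper's method is in [13].
import pyrealsense2 as rs

pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
pipeline.start(config)

def gaze_to_3d(gaze_px, gaze_py):
    """Deproject a gaze pixel (from the eye tracker, mapped into the
    depth camera's image) to a 3D point in camera coordinates."""
    frames = pipeline.wait_for_frames()
    depth = frames.get_depth_frame()
    intrinsics = depth.profile.as_video_stream_profile().get_intrinsics()
    d = depth.get_distance(int(gaze_px), int(gaze_py))  # metres at that pixel
    # Returns [x, y, z] in the depth camera's coordinate frame.
    return rs.rs2_deproject_pixel_to_point(intrinsics, [gaze_px, gaze_py], d)
```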
Fig. 2: Block diagram of our system architecture.
Intention of action is identified based on the user's gaze relative to object bounding boxes: we assign the right-most portion of the bounding box to indicate intention through fixation. This is applicable to all objects, and gives our users the agency to inspect objects and to indicate intent when required.

Once an action intention is detected, our system identifies the sequence of actions for the user's intended task by considering the current human-robot state and parsing it through our Action Grammars approach. Our behaviour, similar to our language, can be represented as having rules (a grammar) on how to combine actions and in which order to use them to create a meaningful sequence. By extracting the grammars of action through observation of human behaviour, we are able to reduce the dimensionality of the robot action selection problem, as the system faces a smaller set of action choices based on the current human-robot state. For the purposes of this work, we consider a simulated dining table scenario, including small and large containers, and non-container objects, leading to possibilities of pick and place, as well as pick, pour (from small container to large container) and place. Grammars for these sequences are hand-derived, and are parsed through a finite state machine implementation. Further details on this implementation can be found in [13].

We use the Robot Operating System (ROS) [19] for integration, running on a Linux workstation (ROS Kinetic, Ubuntu 16.04). The Motive software used for optical tracking, as well as the eye trackers integrated with the RGB-D camera, run on Windows 10 workstations networked with the ROS master. Our robotic setup consists of the Universal Robots UR10 (Universal Robots A/S, Odense, Denmark) and the soft-robotic BioServo Carbonhand (Bioservo Technologies AB, Kista, Sweden). Both robot systems are linked to the human and controlled through ROS: the UR10 for reaching, and the Carbonhand for grasping. Users of the system wear the Carbonhand, and attach their arm to the UR10's end-effector at the wrist. We use a fixed-joint arm attachment to have full control over the user's hand pose for improved manipulation. This, however, means that we need to motion plan for human arm kinematics. We calculate the optimal reaching orientation for the grasp of each object, based on the object's location with respect to the user's elbow location. The default start point is selected to keep the user's lower arm parallel to the ground, and the upper arm perpendicular to it. Thus, by knowing the initial location of the user's elbow and the target point, we are able to identify the necessary waypoints for the robotic arm to reach, with an optimal orientation of the arm. These are selected by initially keeping the elbow location constant, and positioning the forearm so that it is aligned with the line connecting the elbow and the target location in space. From there, the arm can be safely moved along the elbow-target line to approach the target location. Once reaching is complete, the robotic glove can be activated to grasp or drop. We combine the glove's sensor and motor data to infer grasp success, which informs our action grammars module for further planning.

For added safety, our fixed attachment uses an electromagnet connected through a Robotiq FT-150 force/torque sensor (Robotiq Inc., Quebec, Canada) to the UR10's end-effector. This allows us to have closed-loop control of the release of the user's arm for safety, i.e. in cases where unsafe forces or torques are detected at the attachment (see Figure 1).
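To make the fixation-based intention trigger concrete, the sketch below checks whether gaze dwells inside the right-most region of a detected object's bounding box for a sustained period. The region fraction and dwell threshold are placeholder values, not our published parameters (see [13]).

```python
# Illustrative sketch of the fixation-based intention trigger.
import time

REGION_FRACTION = 0.3   # assumed: right-most fraction that signals intent
DWELL_SECONDS = 1.0     # assumed: fixation time required to trigger

class IntentionDetector:
    def __init__(self):
        self._dwell_start = None
        self._current_label = None

    def update(self, gaze_xy, detections):
        """gaze_xy: (x, y) gaze pixel; detections: list of
        (label, (x_min, y_min, x_max, y_max)). Returns the intended
        object's label once a fixation completes, else None."""
        gx, gy = gaze_xy
        for label, (x0, y0, x1, y1) in detections:
            trigger_x = x1 - REGION_FRACTION * (x1 - x0)
            if trigger_x <= gx <= x1 and y0 <= gy <= y1:
                if label != self._current_label:
                    self._current_label = label
                    self._dwell_start = time.time()
                elif time.time() - self._dwell_start >= DWELL_SECONDS:
                    return label  # sustained fixation -> intention of action
                return None
        self._current_label = self._dwell_start = None
        return None
```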
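The hand-derived grammar for the dining-table scenario can be rendered as a small finite state machine keyed on the current human-robot state (what is held) and the fixated object, admitting pick-and-place for all objects and pouring only from a small container into a large one. This is a simplified illustration of the structure described above, not the implementation from [13]; the state and class names are ours.

```python
# Simplified action-grammar FSM for the dining-table scenario.
# (what the user currently holds, what they fixate on) -> robot actions
GRAMMAR = {
    ("nothing", "graspable"):               ("reach", "grasp"),
    ("non_container", "large_container"):   ("reach", "drop"),
    ("non_container", "surface"):           ("reach", "drop"),
    ("small_container", "large_container"): ("reach", "pour"),
    ("small_container", "surface"):         ("reach", "drop"),
}

def plan(held, target):
    """Parse the current human-robot state against the grammar; returns
    the admissible action sequence, or None if the grammar forbids the
    combination (e.g., pouring onto the table)."""
    return GRAMMAR.get((held, target))
```

Because each state admits only a few transitions, the robot's action selection problem stays small and every chosen sequence remains explainable at the task/action level.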
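The waypoint construction for reaching can be expressed as plain vector geometry: keep the elbow fixed, place the wrist so the forearm lies on the elbow-to-target line, then move along that line to the target. The sketch below is our reading of that procedure under assumed names and a fixed forearm length; the actual controller runs through ROS on the UR10.

```python
# Sketch of reach waypoint construction (illustrative geometry only).
import numpy as np

FOREARM_LENGTH = 0.26  # metres; assumed per-user measured value

def reach_waypoints(elbow, target):
    """elbow, target: 3D points in the robot base frame. Returns wrist
    waypoints: (1) forearm aligned with the elbow-target line, then
    (2) the target itself, approached along that same line."""
    elbow, target = np.asarray(elbow, float), np.asarray(target, float)
    direction = target - elbow
    direction /= np.linalg.norm(direction)  # unit vector elbow -> target
    # Waypoint 1: wrist placed so the forearm lies on the elbow-target line.
    align = elbow + FOREARM_LENGTH * direction
    # Waypoint 2: translate the arm along the line to the target.
    return [align, target]
```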
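Finally, the closed-loop safety release amounts to monitoring the wrist force/torque sensor and de-energising the electromagnet when readings exceed safe limits. A minimal ROS sketch follows; the topic names, message routing and numeric limits are placeholders, not our deployed configuration.

```python
# Sketch of the closed-loop safety release at the magnetic attachment.
import rospy
from geometry_msgs.msg import WrenchStamped
from std_msgs.msg import Bool

MAX_FORCE_N, MAX_TORQUE_NM = 30.0, 5.0  # assumed safety limits

def on_wrench(msg, magnet_pub):
    f, t = msg.wrench.force, msg.wrench.torque
    force = (f.x**2 + f.y**2 + f.z**2) ** 0.5
    torque = (t.x**2 + t.y**2 + t.z**2) ** 0.5
    if force > MAX_FORCE_N or torque > MAX_TORQUE_NM:
        magnet_pub.publish(Bool(data=False))  # release the user's arm

if __name__ == "__main__":
    rospy.init_node("safety_release")
    magnet_pub = rospy.Publisher("/electromagnet/enable", Bool, queue_size=1)
    rospy.Subscriber("/ft_sensor/wrench", WrenchStamped,
                     on_wrench, callback_args=magnet_pub)
    rospy.spin()
```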
III. EXPERIMENTAL EVALUATION
Our experiment involves simulated dining table scenario tasks. The objects in use are: apple, orange, cup, bowl and dining table. The tasks are defined as follows:
• Grasp {Orange, Apple}, drop in/on {Bowl, Table}.
• Grasp {Cup}, pour into {Bowl}, drop on {Table}.
• Grasp {Cup}, drop on {Table}.
We tested all of the above combinations, leading to a total of 6 options, with 5 participants. Participants are allowed to repeat each option up to 5 times in case of failure; if they succeed, they move immediately to the next option. Each participant starts by receiving a description of the entire setup and what to expect from the experiment. They then wear the robotic glove and have their arm attached to the robotic arm. The eye trackers are worn and participants go through the initial calibration. Once calibration is complete, the participants are given 5 minutes to test out the setup and get used to the environment. Trials are randomised in terms of task and object placement. Performance in completing tasks, and subjective evaluation of the system's usability through the System Usability Scale (SUS) questionnaire [20], are reported.

IV. RESULTS & DISCUSSION
We tested the system with 5 healthy participants, who were directed to fully relax their right arm and hand and not help the system with actuation, to simulate paralysis. A video of the system can be seen here: https://youtu.be/VHcipV4UAUs
A. Task Performance
Across all participants and all tasks, for the reach actions, of participants were successful on their first attempt. A successful reach means computing the 3D gaze point of interest for the user, detecting an intention of action, motion planning, and moving their arm to the target point with the correct pose to enable the grasping of the intended object. For the grasping action, i.e. grasping the intended object after a successful reach, of participants made successful grasps on their first attempt. For the subsequent dropping or pouring of objects, except for a single case where participant 2 had a failed pour and only succeeded on their second attempt, all other attempts across all participants and all tasks were limited to 1. Therefore, . of drop/pour actions were successful on the first attempt. For drop actions alone this was . Therefore, at the task level, except for the one pouring failure, all participants succeeded at all 6 tasks on their 1st attempt. These results show the feasibility of using the combination of gaze and action grammars as an interface for assistive robotic systems. All 5 participants, previously naive to our non-invasive system, were able to use it successfully after an initial 5-minute calibration, followed by 5 minutes of free use for training.

B. Subjective questionnaire
All participants filled in a System Usability Scale [20] form after the experiment. Results are shown in Figure 3, separated into positive and negative statements, i.e. agreeing is the good result in Figure 3.A whereas disagreeing is the good result in Figure 3.B.

For all of the positive statements shown in Figure 3.A we see no participant disagreeing. All 5 participants agree that they found the system easy to use, with strongly agreeing. All participants found that the various functionalities of the system were well integrated. On whether they imagine the system would be easy to use for most people, agree (half of them 'strongly'), whereas 1 participant is neutral. On whether they felt confident when using the system, 4 participants agree and 1 is neutral.

Looking at the negative statements summarised in Figure 3.B, we see that of participants disagree (strongly disagree) with the system being characterised as unnecessarily complex. On whether the system's behaviour was unpredictable, disagree, are neutral, and 1 participant agrees. On whether the system is cumbersome to use, of participants disagree, strongly. On whether participants needed to learn a lot to use the system, we see disagreeing, and 1 participant neutral.

Overall, we see a clear majority agreeing with positive statements and disagreeing with negative ones, indicating that our participants are happy with the system. Note that, due to the short calibration and learning requirements of our setup, the entire experiment, from entering the room until departure, took a maximum of 30 minutes for all participants. This is perhaps not enough time to get fully used to the system.
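For reference, the standard SUS aggregate score (Brooke [20]) combines the positive and negative statements on a 0-100 scale: odd-numbered (positive) items contribute (response - 1), even-numbered (negative) items contribute (5 - response), and the sum is scaled by 2.5. We report per-statement agreement (Figure 3) rather than this aggregate, so the snippet below is purely illustrative.

```python
# Standard SUS scoring; responses use a 1-5 Likert scale
# (1 = strongly disagree, 5 = strongly agree), in SUS item order.
def sus_score(responses):
    assert len(responses) == 10
    total = sum((r - 1) if i % 2 == 0 else (5 - r)
                for i, r in enumerate(responses))
    return total * 2.5

# A participant agreeing with every positive statement and disagreeing
# with every negative one scores the maximum of 100.
print(sus_score([5, 1] * 5))  # -> 100.0
```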
Fig. 3: Results of the System Usability Scale (SUS) questionnaire filled in by our 5 participants after their experience with our system: (A) positive statements, where agreeing is good; (B) negative statements, where disagreeing is good.

While the majority agree that they felt confident with the system, we believe this would have improved with more time using the system. On the system's behaviour, we have one participant characterising it as unpredictable. This is perhaps the main message: the need for further feedback mechanisms, so that participants are better aware of the course of action taken by the robotic system. Our Action Grammars approach allows us to explain the behaviour of the robot at a human-interpretable level, i.e. the task/action level; feedback mechanisms will be pursued as future work.

V. CONCLUSIONS
We present a cognitive-level brain interface through gaze for the assistive robotic restoration of reaching and grasping. While many successful efforts within Brain-Machine Interface research for assistive robot control are based on invasive, surgically implanted and training-heavy methods [4]–[7], our approach to interfacing with the human mind and decoding intentions is non-invasive, requires 5 minutes to be worn and calibrated, and shows a very high success rate after 5 minutes of training in healthy controls. Musallam et al. argue for the use of cognitive-level signals representative of goals in visual coordinates [4]. Here we take this concept further by removing the need for any invasive procedures, interfacing with the human entirely through their eye movements and on the cognitive level.

Our system detaches users from the multiple degrees of freedom involved in the robot control and motion planning problem, allowing them to simply look at an object that they want to manipulate, with an intention to do so, at which point the system handles all the lower-level controllers necessary to perform their intended task successfully, safely and efficiently. Our very positive subjective questionnaire results support this, showing that participants found the system easy to use and learn, well integrated, and felt confident using it, among other positive feedback. Based on the feedback we have received, we are improving our system with additional features, such as 3D mapping of the task environment and a dedicated kinematic solver for the human arm, enabling the system not only to plan for and avoid obstacles, but also to move the human arm along natural human motion paths, improving the user experience while conserving safety and interpretability.

REFERENCES
[1] M. Bortole, A. Venkatakrishnan, F. Zhu, J. C. Moreno, G. E. Francisco, J. L. Pons, and J. L. Contreras-Vidal, "The H2 robotic exoskeleton for gait rehabilitation after stroke: early findings from a clinical study," J. Neuroeng. Rehabil., vol. 12, no. 1, p. 54, 2015.
[2] D. Farina, N. Jiang, H. Rehbaum, A. Holobar, B. Graimann, H. Dietl, and O. C. Aszmann, "The extraction of neural information from the surface EMG for the control of upper-limb prostheses: emerging avenues and challenges," IEEE Trans. Neural Syst. Rehabilitation Eng., vol. 22, no. 4, pp. 797–809, 2014.
[3] L. J. Hargrove, L. A. Miller, K. Turner, and T. A. Kuiken, "Myoelectric pattern recognition outperforms direct control for transhumeral amputees with targeted muscle reinnervation: a randomized clinical trial," Sci. Rep., vol. 7, no. 1, pp. 1–9, 2017.
[4] S. Musallam, B. Corneil, B. Greger, H. Scherberger, and R. A. Andersen, "Cognitive control signals for neural prosthetics," Science, vol. 305, no. 5681, pp. 258–262, 2004.
[5] L. R. Hochberg, D. Bacher, B. Jarosiewicz, N. Y. Masse, J. D. Simeral, J. Vogel, S. Haddadin, J. Liu, S. S. Cash, P. Van Der Smagt, et al., "Reach and grasp by people with tetraplegia using a neurally controlled robotic arm," Nature, vol. 485, no. 7398, pp. 372–375, 2012.
[6] J. L. Collinger, B. Wodlinger, J. E. Downey, W. Wang, E. C. Tyler-Kabara, D. J. Weber, A. J. McMorland, M. Velliste, M. L. Boninger, and A. B. Schwartz, "High-performance neuroprosthetic control by an individual with tetraplegia," The Lancet, vol. 381, no. 9866, pp. 557–564, 2013.
[7] A. B. Ajiboye, F. R. Willett, D. R. Young, W. D. Memberg, B. A. Murphy, J. P. Miller, B. L. Walter, J. A. Sweet, H. A. Hoyen, M. W. Keith, et al., "Restoration of reaching and grasping movements through brain-controlled muscle stimulation in a person with tetraplegia: a proof-of-concept demonstration," The Lancet, vol. 389, no. 10081, pp. 1821–1830, 2017.
[8] W. W. Abbott and A. A. Faisal, "Ultra-low-cost 3D gaze estimation: an intuitive high information throughput compliment to direct brain-machine interfaces," J. Neural Eng., vol. 9, no. 4, p. 046016, 2012.
[9] A. Borji and L. Itti, "Defending Yarbus: Eye movements reveal observers' task," J. Vis., vol. 14, no. 3, pp. 29–29, 2014.
[10] B. Noronha, S. Dziemian, G. A. Zito, C. Konnaris, and A. A. Faisal, "'Wink to grasp'—comparing eye, voice & EMG gesture control of grasp with soft-robotic gloves," in IEEE ICORR, pp. 1043–1048, IEEE, 2017.
[11] S. Dziemian, W. W. Abbott, and A. A. Faisal, "Gaze-based teleprosthetic enables intuitive continuous control of complex robot arm use: Writing & drawing," in IEEE BioRob, pp. 1277–1282, IEEE, 2016.
[12] R. O. Maimon-Mor, J. Fernandez-Quesada, G. A. Zito, C. Konnaris, S. Dziemian, and A. A. Faisal, "Towards free 3D end-point control for robotic-assisted human reaching using binocular eye tracking," in IEEE ICORR, pp. 1049–1054, IEEE, 2017.
[13] A. Shafti, P. Orlov, and A. A. Faisal, "Gaze-based, context-aware robotic system for assisted reaching and grasping," in IEEE ICRA, pp. 863–869, IEEE, 2019.
[14] N. Sim, C. Gavriel, W. W. Abbott, and A. A. Faisal, "The head mouse—head gaze estimation 'in-the-wild' with low-cost inertial sensors for BMI use," pp. 735–738, IEEE, 2013.
[15] S. Fara, C. S. Vikram, C. Gavriel, and A. A. Faisal, "Robust, ultra low-cost MMG system with brain-machine-interface applications," pp. 723–726, IEEE, 2013.
[16] G. P. Mylonas, A. Darzi, and G. Zhong Yang, "Gaze-contingent control for minimally invasive robotic surgery," Comput. Aided Surg., vol. 11, no. 5, pp. 256–266, 2006.
[17] R. M. Aronson and H. Admoni, "Eye gaze for assistive manipulation," in ACM/IEEE HRI, pp. 552–554, 2020.
[18] C. Auepanwiriyakul, A. Harston, P. Orlov, A. Shafti, and A. A. Faisal, "Semantic fovea: real-time annotation of ego-centric videos with gaze context," in ACM ETRA, pp. 1–3, 2018.
[19] M. Quigley, K. Conley, B. Gerkey, J. Faust, T. Foote, J. Leibs, R. Wheeler, and A. Y. Ng, "ROS: an open-source robot operating system," in ICRA WS on Open Source Software, 2009.
[20] J. Brooke, "SUS: a 'quick and dirty' usability scale," in Usability Evaluation in Industry, 1996.