HAIR: Head-mounted AR Intention Recognition
David Puljiz
Intelligent Process Automation and Robotics Lab (IPR), Institute for Anthropomatics and Robotics, Karlsruhe Institute of Technology, Karlsruhe, Germany
[email protected]
Bowen Zhou
Intelligent Process Automation and Robotics Lab (IPR), Institute for Anthropomatics and Robotics, Karlsruhe Institute of Technology, Karlsruhe, Germany
Ke Ma
Intelligent Process Automation and Robotics Lab (IPR), Institute for Anthropomatics and Robotics, Karlsruhe Institute of Technology, Karlsruhe, Germany
Björn Hein
Intelligent Process Automation and Robotics Lab (IPR), Institute for Anthropomatics and Robotics, Karlsruhe Institute of Technology, Karlsruhe, Germany
Figure 1: The view from the HoloLens on the left, showing the user looking at the goal "Sphere" and approaching it with the hand. On the right, the current and previous probabilities. The user rotated from the first goal, the sphere, through the other two goals and back towards the sphere. To note is how the output proclaims the user irrational when they are facing away from all the defined goals.
ABSTRACT
Human teams exhibit both implicit and explicit intention sharing. To further the development of human-robot collaboration, intention recognition is crucial on both sides. Present approaches rely on a vast sensor suite on and around the robot to achieve intention recognition. This relegates intuitive human-robot collaboration purely to such bulky systems, which are inadequate for large-scale, real-world scenarios due to their complexity and cost. In this paper we propose an intention recognition system that is based purely on a portable head-mounted display. In addition, robot intention visualisation is also supported. We present experiments to show the quality of our human goal estimation component and some basic interactions with an industrial robot. HAIR should raise the quality of interaction between robots and humans, instead of such interactions raising the hair on the necks of the human coworkers.
KEYWORDS
Human Intention Estimation, Augmented Reality, Human-robot Collaboration, Head Mounted Displays
ACM Reference Format:
David Puljiz, Bowen Zhou, Ke Ma, and Björn Hein. 2021. HAIR: Head-mounted AR Intention Recognition. In VAM-HRI '21: Virtual, Augmented and Mixed Reality Human-Robot Interaction Workshop. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
INTRODUCTION
Communicating intentions between members of a team is paramount for successful cooperation and task completion. Previous work in the field of Augmented Reality (AR) human-robot interaction (HRI) focused on either improving robot programming [13] or visualising robot motions [14]. Although quite important for collaboration, such systems still lack the estimation of the human intention from the robot's side. Several such systems have been proposed, such as [1], where the human is tracked and their goal estimated inside a robot cell. Such systems, however, require a large overhead in the complexity and cost of the robot cells. With the advent of Head-Mounted Displays (HMDs), the possibility arises of a fully portable, completely worn system providing both robot intention visualisation and human intention estimation.

Figure 2: YOLOv4 classification of HoloLens camera data. Bounding-box classification approaches, together with the known HoloLens egomotion and depth data, can be used to define the interaction objects/goals as well as to outsource part of the environmental sensing from the robot to the human.

A similar system based on an HMD and intended for human-robot collaborative task planning was presented by Chakraborti et al. [3]. In that system, however, the human coworker had to explicitly select and reserve objects they wished to interact with, slowing down task execution and increasing the physical and mental workload of the human worker.

Here we propose a system that implicitly evaluates the intentions of the human, thus minimising the increase in workload. The proposed system is based on Microsoft's HoloLens HMD and is aimed at a collaborative scenario between a single human and an industrial manipulator. The system is robot agnostic and completely portable, requiring only a very short set-up at the beginning of the interaction. This guarantees that a single human worker can interface with multiple robots one after another, without the need for specialised robot cells or sensors around any of those robots.

The system takes as input the pose of the HMD in the world coordinate system and the positions of the hand joints in the world coordinate system, as well as a set of possible spatial goals, which can be added and removed during the interaction itself. The output is a set of probabilities over the goals the human wants to approach, as well as the action they wish to perform.

This paper presents our current work and tests, aimed mostly at robust goal estimation. To the best of the authors' knowledge, such an intention estimation algorithm using a completely worn system has not yet been developed.
REFERENCING
First and foremost, a common coordinate system must be established between the HMD and the robotic manipulator. Referencing can be done in a variety of ways, with perhaps the most popular being the use of QR codes or other preset visual markers [6]. Although these offer continuous instead of one-shot referencing, as well as very good precision, they require a setup step that we would like to avoid. Manually selecting the robot base, as presented in [13], is more flexible yet also more imprecise. We have proposed several referencing methods in [12], with the semi-automatic one, consisting of a rough user guess followed by a refinement step, offering the best balance between accuracy and computational time. The refinement step consists of filtering a point cloud captured with the HMD and using a registration algorithm to fit the model of the manipulator into the filtered point cloud, using the user guess as the starting point of the registration algorithm. The user guess prevents the common problem of registration algorithms getting stuck in local minima, and we found that even a basic ICP algorithm does a good job of refining the user's guess. Another approach is a fully automatic one without a user guess. Such a referencing algorithm, similar to our automatic method proposed in [12], was proposed by Ostanin et al. [8]. It clusters the point cloud captured by the HMD using the DBSCAN clustering algorithm and then performs model matching between the clusters and a model of the robot.
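To make the refinement step concrete, the following is a minimal sketch of user-guess-plus-ICP referencing using Open3D. The function name, voxel size and correspondence distance are our own illustrative choices, not the exact implementation of [12], which operates directly on HoloLens depth data rather than on point-cloud objects handed in by a caller.

```python
import open3d as o3d

reg = o3d.pipelines.registration

def refine_robot_pose(scene_cloud, robot_model_cloud, user_guess, voxel=0.02):
    """Refine a rough user guess of the robot pose with point-to-point ICP.

    scene_cloud, robot_model_cloud: open3d.geometry.PointCloud
    user_guess: 4x4 homogeneous transform from the manual rough alignment,
    used as the initial estimate of the registration.
    """
    # Downsampling filters the noisy HMD point cloud and speeds up ICP
    scene = scene_cloud.voxel_down_sample(voxel)
    model = robot_model_cloud.voxel_down_sample(voxel)
    result = reg.registration_icp(
        model, scene,
        max_correspondence_distance=5 * voxel,
        init=user_guess,
        estimation_method=reg.TransformationEstimationPointToPoint())
    return result.transformation  # refined model-to-scene transform
```

Passing the user guess as the initial transform is what keeps the registration from settling into a local minimum far from the true pose, mirroring the semi-automatic method of [12].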
GOAL DEFINITION
Secondly, the possible spatial goals of the human and the robot need to be defined. In case the robot does not possess a full map of its surroundings, the HMD can also provide that, as we demonstrated in [11]. This can include possible goals and regions of interest, such as a table or a conveyor belt. If the goals are specific objects, here too the HMD can provide the types and positions of those objects. One such possibility is through the use of bounding-box classifiers such as YOLOv4 [2]. In Fig. 2, one can see the result of running YOLOv4 on the HoloLens camera data. Having the egomotion data of the HoloLens, as well as data from the HMD's depth sensor, allows a full spatial definition of the objects and therefore of the possible goals. The user should also be able to add and remove goals manually during the interaction step. The goal estimation algorithm was therefore selected to allow such a modality.
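As an illustration of how a bounding-box detection could be lifted to a spatial goal, the sketch below back-projects the box centre through a pinhole model using the depth sensor and the HMD egomotion. The intrinsics, the depth lookup and the pose input are placeholders for the actual HoloLens sensor streams, not a documented API.

```python
import numpy as np

def detection_to_goal(bbox, depth_at, K, T_world_cam):
    """Lift a 2D bounding-box detection to a 3D goal in world coordinates.

    bbox: (x_min, y_min, x_max, y_max) in pixels, e.g. from YOLOv4
    depth_at: callable (u, v) -> depth in metres from the HMD depth sensor
    K: 3x3 camera intrinsic matrix
    T_world_cam: 4x4 camera-to-world transform from the HoloLens egomotion
    """
    u = 0.5 * (bbox[0] + bbox[2])      # bounding-box centre in pixels
    v = 0.5 * (bbox[1] + bbox[3])
    z = depth_at(u, v)                 # metric depth at the centre pixel
    # Back-project the pixel through the pinhole camera model
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    p_cam = np.array([x, y, z, 1.0])
    return (T_world_cam @ p_cam)[:3]   # goal position in the world frame
```

The returned world-frame position can then be registered as one of the spatial goals used by the estimator described below.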
Figure 3: The HMM states of the human goal intention estimation system as presented in [10]. It consists of $g$ goal states, a state of unknown intention $G_?$ and a state of the human acting irrationally $G_x$.

GOAL ESTIMATION
Finally, having a common coordinate system, a mapped working environment and possible goals, one can infer the goals using a human intention recognition (HIR) algorithm. We base our HIR algorithm on previous work by Petković et al. [10], where a hidden Markov model framework was used to estimate the goal of the human in an automated warehouse. The approach is quite general and with minimal modification can be adapted to our use case. In this section we present a brief overview of the calculation; for more details please refer to the original paper.

Instead of the position of the human coworker as in the original paper, we consider the position of the hand in relation to the goal objects. To simplify the calculations, we assume there is an almost straight line between the hand position and each goal object. By doing that we can forego the complex path-planning step to determine the modulated distance and instead use the Euclidean distance to calculate the vector $\mathbf{d}$ that represents the distance of the hand to each goal. As in [10], we define an additional 32 points $p_i$ on a circle around the previous hand position $l'$, with a radius $r$ equal to the distance between the current $l$ and the previous $l'$ hand positions. We calculate the vector $\mathbf{d}$ for each point $p_i$ and append them to the modulated distance matrix $\mathbf{D}$.

Additionally, we consider the gaze validation $\mathbf{s}$ of the HMD, the motivation being that the user is more likely to look approximately towards the goal of the hand motion than towards other goals. The gaze validation is calculated as:
$$s_i = \begin{cases} \mathbf{g} \cdot \dfrac{\mathbf{o}_i - \mathbf{h}}{\lVert \mathbf{o}_i - \mathbf{h} \rVert}, & \mathbf{g} \cdot \dfrac{\mathbf{o}_i - \mathbf{h}}{\lVert \mathbf{o}_i - \mathbf{h} \rVert} \geq 0, \\ 0, & \text{otherwise,} \end{cases} \tag{1}$$
where $\mathbf{g}$ is the HMD orientation in the world coordinate system, $\mathbf{o}_i$ is the position of object $i$ and $\mathbf{h}$ is the position of the HMD. We expand the motion validation vector $\mathbf{v}$ as follows:
$$v_j = \frac{\max_{1 \leq i \leq n} D_{ij} - d_j}{\max_{1 \leq i \leq n} D_{ij} - \min_{1 \leq i \leq n} D_{ij}} \cdot s_j. \tag{2}$$
The rest follows exactly the algorithm described in [10]. We use the same transition matrix with $g$ goals, $\mathbf{T}_{(g+2) \times (g+2)}$, defined as:
$$\mathbf{T} = \begin{bmatrix} 1-\alpha & 0 & \cdots & 0 & \alpha & 0 \\ 0 & 1-\alpha & \cdots & 0 & \alpha & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\ 0 & 0 & \cdots & 1-\alpha & \alpha & 0 \\ \beta & \beta & \cdots & \beta & 1 - g\beta - \gamma & \gamma \\ 0 & 0 & \cdots & 0 & \delta & 1-\delta \end{bmatrix}. \tag{3}$$
This transition matrix corresponds to the hidden Markov model (HMM) architecture visible in Fig. 3. The parameter $\alpha$ captures the worker's tendency to change their mind, while the parameter couple $\beta$ and $\gamma$ sets the threshold for estimating the intention for each goal location. Increasing $\beta$ leads to quicker inference of the worker's intentions, and increasing $\gamma$ speeds up the decision-making process. The parameter $\delta$ captures the model's reluctance to return to estimating the other goal probabilities once it has estimated that the worker is irrational. We performed several tests to determine the optimal values of these parameters, which are described in the "Experiments" section.

The worker intention is estimated using the Viterbi algorithm [5], which takes as inputs the hidden state set $S = \{G_1, \ldots, G_g, G_?, G_x\}$, the hidden state transition matrix $\mathbf{T}$, the initial state $\Pi$, the sequence of observations $O$, and the emission matrix $\mathbf{B}$.

The emission matrix $\mathbf{B}$ is calculated using the motion validation vector $\mathbf{v}$. Since the observation is the validation vector $\mathbf{v}$ with continuous element values, the input to the Viterbi algorithm was modified by introducing an expandable emission matrix $\mathbf{B}_{k \times (g+2)}$, where $k$ is the recorded number of observations and the elements are functions of the observation value. Once a new validation vector $\mathbf{v}$ is calculated, the emission matrix is expanded with the row $\mathbf{B}'$, where the element $B'_i$ stores the probability of observing $\mathbf{v}$ from hidden state $G_i$. The average of the last $m$ vectors $\mathbf{v}$ is also calculated and the maximum average value $\phi$ is selected. It is used as an indicator of whether the worker is behaving irrationally, i.e., is not moving towards any predefined goal. The value of the hyperparameter $m$ indicates how much evidence is to be collected before the worker is declared irrational. If the worker has been moving towards at least one goal in the last $m$ iterations ($\phi > 0.5$), $\mathbf{B}'$ is calculated as:
$$\mathbf{B}' = \zeta \cdot \left[ \tanh(\mathbf{v}) \quad \tanh(1 - \Delta) \quad 0 \right], \tag{4}$$
and otherwise as:
$$\mathbf{B}' = \zeta \cdot \left[ \mathbf{0}_{1 \times g} \quad \tanh(0.5) \quad \tanh(1 - \phi) \right], \tag{5}$$
where $\zeta$ is a normalising constant and $\Delta$ is calculated as the difference between the largest and second-largest element of $\mathbf{v}$.

Finally, the initial probabilities of the worker's intentions are set as:
$$\Pi = \left[ \mathbf{0}_{1 \times g} \quad 1 \quad 0 \right], \tag{6}$$
indicating that the initial state is $G_?$ and the model does not know which goal the worker desires the most. The Viterbi algorithm outputs the most probable hidden state sequence and the probabilities $P(G_i)$ of each hidden state at each step. These probabilities are the worker's intention estimates. Goals can also be added and removed during runtime, making such an intention estimation framework quite flexible.
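A condensed sketch of the per-frame quantities from Equations (1)–(3) is given below. The variable names are ours, the circle of points $p_i$ is assumed to lie in a horizontal plane for simplicity, and the Viterbi bookkeeping and emission-matrix expansion of [10] are omitted.

```python
import numpy as np

def gaze_validation(g, h, goals):
    """Eq. (1): s_i = g . (o_i - h)/||o_i - h||, clipped at zero."""
    dirs = goals - h                                    # (n_goals, 3)
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    return np.maximum(dirs @ g, 0.0)

def motion_validation(hand, prev_hand, goals, g, h, n_circle=32):
    """Eq. (2): hand-to-goal distances normalised against points on a
    circle around the previous hand position, weighted by the gaze."""
    r = np.linalg.norm(hand - prev_hand)
    angles = np.linspace(0.0, 2.0 * np.pi, n_circle, endpoint=False)
    # Circle points p_i around the previous hand position (horizontal
    # plane assumed here for simplicity)
    circle = prev_hand + r * np.stack(
        [np.cos(angles), np.sin(angles), np.zeros(n_circle)], axis=1)
    D = np.linalg.norm(circle[:, None, :] - goals[None, :, :], axis=2)
    d = np.linalg.norm(hand - goals, axis=1)
    v = (D.max(axis=0) - d) / (D.max(axis=0) - D.min(axis=0) + 1e-9)
    return v * gaze_validation(g, h, goals)

def transition_matrix(n_goals, alpha, beta, gamma, delta):
    """Eq. (3): n_goals goal states followed by G_? and G_x."""
    q, x = n_goals, n_goals + 1                 # indices of G_? and G_x
    T = np.zeros((n_goals + 2, n_goals + 2))
    for i in range(n_goals):
        T[i, i] = 1.0 - alpha                   # stay at the current goal
        T[i, q] = alpha                         # drift back to 'unknown'
    T[q, :n_goals] = beta                       # G_? -> each goal
    T[q, q] = 1.0 - n_goals * beta - gamma      # stay unknown
    T[q, x] = gamma                             # G_? -> irrational
    T[x, q] = delta                             # G_x -> unknown
    T[x, x] = 1.0 - delta                       # stay irrational
    return T
```

With three goals, `transition_matrix(3, alpha, beta, gamma, delta)` yields the 5x5 matrix of Equation (3), whose rows can be checked to sum to one.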
Figure 4: The result of the hand detection presented in [7] on HoloLens RGB camera data. One can see robust performance even during object handling.

ACTION ESTIMATION
Though estimating the goal of the human motion is extremely important for replanning robot motions to keep the interaction both safe and efficient, estimating the actions the human wishes to perform could bring additional information and flexibility to intention estimation systems.

Although the first generation of the HoloLens possesses inbuilt hand-tracking capabilities, these are quite limited and only four gestures can be tracked and classified. For more robust hand following and classification we expanded the hand tracking by using the work presented in [7] on the HoloLens' RGB camera data. The algorithm tracks 21 hand joints and works with occlusions, surface contacts and object handling.

The detected hand joints are used to classify actions: intention to grasp an object, having grasped an object, pointing, and stop. More actions can be classified in the future. The stop and pointing gestures are used as simple cues to control the robot. In addition, common gestures of fear or distress shall be classified as stop gestures, allowing the system an indirect reaction to stress.

We presently only detect and use the right hand; however, with a slight overhead the algorithm can detect both hands, provided there is no significant overlap between them.
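The exact mapping from the 21 tracked joints to the four actions is not detailed here. As a purely hypothetical illustration, a stop gesture could be detected with a simple geometric heuristic over the joint positions; the joint indices, names and thresholds below are our assumptions, not the actual classifier.

```python
import numpy as np

# Hypothetical indices into a 21-joint hand model (wrist plus four joints
# per finger); the real indexing depends on the tracker of [7].
WRIST, THUMB_TIP, INDEX_TIP, MIDDLE_TIP, RING_TIP, PINKY_TIP = 0, 4, 8, 12, 16, 20

def is_stop_gesture(joints, palm_normal, gaze_dir, extended_thresh=0.15):
    """Heuristic 'stop' detector: all five fingertips far from the wrist
    (open hand) with the palm facing roughly towards the user."""
    tips = joints[[THUMB_TIP, INDEX_TIP, MIDDLE_TIP, RING_TIP, PINKY_TIP]]
    extended = np.linalg.norm(tips - joints[WRIST], axis=1) > extended_thresh
    facing_user = float(palm_normal @ gaze_dir) < -0.5  # normal opposes gaze
    return bool(extended.all()) and facing_user
```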
ROBOT INTENTION VISUALISATION
The benefits of HMDs extend also to visualising robot intention. Instead of adapting robot motions to make them more legible [4], one can use holograms to signal the desired goal. In [15] it was shown that holographic information is adequate to show the goal of the robot, and can even resolve ambiguities when intention is expressed via synthesised voice. General motion intent can also be effectively visualised using holographic cues [14]. In our work we chose to indicate the goal via a hologram containing a 3D sound source (spatial sound), as well as via virtual execution: having a hologram of the robot execute the motion before the real robot performs it, as shown in Fig. 5.
EXPERIMENTS
The experiments were aimed at testing the performance of the goal intention estimation. We used three goals in a circular pattern, from left to right: a green cylinder, a red cube and a blue sphere, as shown in Fig. 7.

The first set of experiments aimed to find the optimal set of parameters $\alpha$, $\beta$, $\gamma$ and $\delta$ for our use case. Here we looked at the goal states and the transitions between them. The path was a simple left-to-right one, first going towards the green cylinder, then the red cube and finally the blue sphere. Fig. 6 shows the behaviour of the parameter $\alpha$. A low value makes the algorithm too certain, spending almost no time in the unknown state, while a high value makes the estimated goals jump too much. The parameter $\alpha$ was therefore set to a value between these two extremes. With $\alpha$ set, we tested the behaviour when changing the parameter $\beta$. A low $\beta$ maintains the unknown intention state for too long, while a too high $\beta$ lowers the general certainty, an unwanted behaviour, even though it eliminates the insecurities between state transitions. The value of $\beta$ was set to $\beta = 0.05$. In Fig. 8 one can see the behaviour when changing the parameter $\beta$.

The parameters $\gamma$ and $\delta$ did not significantly influence the outputs and were kept at the same values as in [10].

Figure 5: An example of visualisation of the intentions of the robot to the human coworker: virtual execution with a holographic robot and plan visualisation.
Figure 6: The effect of increasing the parameter $\alpha$. The parameter captures the worker's tendency to change their mind. A low $\alpha$ will make the algorithm "too sure" about the intention, while a too high $\alpha$ produces a chaotic and unusable output.

The second experiment used a different goal order: starting at the cube, moving to the cylinder, back to the cube and finally to the sphere. One can see that the transition from cylinder to cube lasts slightly longer than from cube to cylinder. This is due to the fact that the algorithm is reluctant to estimate an already visited or skipped goal. One can also see the long transition between the cube and the sphere, as the algorithm prefers the goal that has already been visited two times. This shows that the estimation follows our intuition.

Finally, the third experiment shows what happens when the user does a complete rotation and faces away from all three goals. Again the algorithm performs quite intuitively and proclaims the user "irrational", as all the possible goals are completely on the other side.

Additionally, we tested simple interactions between an industrial manipulator and the human user. In the first one, the robot was selecting goals randomly. Should the goal intention estimation detect that the human is moving to the same goal, the robot would stop and select a new goal. Additionally, we used the same framework to navigate the manipulator to the estimated goal of the human, showing that the framework can also be beneficial in teleoperation scenarios.

Figure 7: The interaction setup with an industrial robot; the three spatial goals are represented by colour and shape. In this experiment the user took direct control of the robot, and the human intention estimation is used to detect to which object the user wishes the robot to move, illustrating another use case of intention estimation.
CONCLUSION
In this work we presented a completely portable, robot-agnostic system for intention estimation and visualisation in human-robot collaboration scenarios. Our system does not require any special set-up or sensors on or around the robot and is capable of estimating the human coworker's goals and actions as well as visualising the goals and intentions of the robot coworker.

Having an intention estimation system, in addition to explicit intention declarations, can greatly reduce the mental and physical workload of the user, while providing constant, information-rich data to the robot, thereby improving the safety and efficiency of robot motions.

We have shown that the goal prediction part of the HAIR system works as intended, and indeed the estimated goal intentions follow a reasoning that humans might find intuitive and agree with.

Predicting the goal and motion of the human coworker can increase both safety for the human and the efficiency of robot motions. The goal estimation system of [10] was integrated into a mobile-robot fleet management system of a simulated automated warehouse. In [9] it was shown that the proposed system markedly improved warehouse efficiency compared to no goal estimation or even a simplistic one. It is to be expected that such an efficiency increase would also be observed in interactions with a robotic manipulator. Further testing is, however, needed to support that claim.

Likewise, the action estimation component, as well as the entire system, needs to be evaluated in user studies. More specifically, the change in mental and physical workload between various intention-sharing modalities is of great interest and quite important in proving the claim that the intention estimation algorithms presented here significantly decrease the workload compared to explicitly stating the goals.

As HMDs become ever more common, and the number of robot coworkers per human coworker continues to increase, intuitive HRI using systems that are cheap, simple and portable becomes essential. Lowering the complexity and price of each robot by exploiting wearables will lead to a wider use of robots and increased human-robot collaboration. We hope that the research presented here provides the first stepping stones towards such a system. HAIR should raise the quality of interaction between robots and humans, instead of such interactions raising the hair on the necks of the human coworkers.
Figure 8: The effect of increasing the parameter $\beta$. The parameter couple $\beta$ and $\gamma$ sets the threshold for estimating intentions for each goal location. A low $\beta$ will make the algorithm estimate the unknown intention too much, rendering it unusable, while a too high $\beta$ lowers the general certainty, an unwanted behaviour, even though it eliminates the insecurities between transitions.

Figure 9: Three tests with different goal orders. On the left, the user was selecting goals left to right: cylinder, cube, then sphere. In the middle, the user starts with the cube, moves to the cylinder, back to the cube and finally goes to the sphere. In the test on the right, the starting goal is the sphere, then the cube, then the cylinder, after which the user turns around completely and ends back on the sphere.

REFERENCES
[1] L. Bascetta, G. Ferretti, P. Rocco, H. Ardö, H. Bruyninckx, E. Demeester, and E. Di Lello. 2011. Towards safe human-robot interaction in robotic cells: An approach based on visual tracking and intention estimation. In 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). 2971–2978. https://doi.org/10.1109/IROS.2011.6094642
[2] Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. 2020. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv preprint arXiv:2004.10934 (2020).
[3] T. Chakraborti, S. Sreedharan, A. Kulkarni, and S. Kambhampati. 2018. Projection-Aware Task Planning and Execution for Human-in-the-Loop Operation of Robots in a Mixed-Reality Workspace. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). 4476–4482. https://doi.org/10.1109/IROS.2018.8593830
[4] A. D. Dragan, S. Bauman, J. Forlizzi, and S. S. Srinivasa. 2015. Effects of Robot Motion on Human-Robot Collaboration. In 2015 10th ACM/IEEE International Conference on Human-Robot Interaction (HRI). 51–58.
[5] G. D. Forney. 1973. The Viterbi algorithm. Proc. IEEE 61, 3 (1973), 268–278. https://doi.org/10.1109/PROC.1973.9030
[6] D. Krupke, F. Steinicke, P. Lubos, Y. Jonetzko, M. Görner, and J. Zhang. 2018. Comparison of Multimodal Heading and Pointing Gestures for Co-Located Mixed Reality Human-Robot Interaction. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). 1–9. https://doi.org/10.1109/IROS.2018.8594043
[7] Franziska Mueller, Florian Bernard, Oleksandr Sotnychenko, Dushyant Mehta, Srinath Sridhar, Dan Casas, and Christian Theobalt. 2018. GANerated Hands for Real-Time 3D Hand Tracking from Monocular RGB. In Proceedings of Computer Vision and Pattern Recognition (CVPR). 11 pages. https://handtracker.mpi-inf.mpg.de/projects/GANeratedHands/
[8] M. Ostanin, S. Mikhel, A. Evlampiev, V. Skvortsova, and A. Klimchik. 2020. Human-robot interaction for robotic manipulator programming in Mixed Reality. In 2020 IEEE International Conference on Robotics and Automation (ICRA). 2805–2811. https://doi.org/10.1109/ICRA40945.2020.9196965
[9] Tomislav Petković, Jakub Hvězda, Tomáš Rybecký, Ivan Marković, Miroslav Kulich, Libor Přeučil, and Ivan Petrović. 2020. Human Intention Recognition for Human Aware Planning in Integrated Warehouse Systems. arXiv:2005.11202 [cs.RO]
[10] Tomislav Petković, David Puljiz, Ivan Marković, and Björn Hein. 2019. Human intention estimation based on hidden Markov model motion validation for safe flexible robotized warehouses. Robotics and Computer-Integrated Manufacturing 57 (2019), 182–196. https://doi.org/10.1016/j.rcim.2018.11.004
[11] David Puljiz, Franziska Krebs, Fabian Bösing, and Björn Hein. 2020. What the HoloLens Maps Is Your Workspace: Fast Mapping and Set-up of Robot Cells via Head Mounted Displays and Augmented Reality. arXiv preprint arXiv:2005.12651 (2020).
[12] David Puljiz, Katharina S. Riesterer, Björn Hein, and Torsten Kröger. 2019. Referencing between a Head-Mounted Device and Robotic Manipulators. In Proceedings of the 2nd Workshop on Virtual, Mixed and Augmented Reality Human-Robot Interaction, HRI 2019. http://arxiv.org/abs/1904.02480
[13] C. P. Quintero, S. Li, M. K. Pan, W. P. Chan, H. F. Machiel Van der Loos, and E. Croft. 2018. Robot Programming Through Augmented Trajectories in Augmented Reality. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). 1838–1844. https://doi.org/10.1109/IROS.2018.8593700
[14] Michael Walker, Hooman Hedayati, Jennifer Lee, and Daniel Szafir. 2018. Communicating Robot Motion Intent with Augmented Reality. In Proceedings of the 2018 ACM/IEEE International Conference on Human-Robot Interaction (Chicago, IL, USA) (HRI '18). Association for Computing Machinery, New York, NY, USA, 316–324. https://doi.org/10.1145/3171221.3171253
[15] T. Williams, M. Bussing, S. Cabrol, E. Boyle, and N. Tran. 2019. Mixed Reality Deictic Gesture for Multi-Modal Robot Communication. In 2019 14th ACM/IEEE International Conference on Human-Robot Interaction (HRI).