Towards Markerless Grasp Capture
Samarth Brahmbhatt, Charles C. Kemp, and James Hays
Institute for Robotics and Intelligent Machines, Georgia Tech; Argo AI
[email protected], [email protected], [email protected]

Abstract
Humans excel at grasping objects and manipulating them. Capturing human grasps is important for understanding grasping behavior and reconstructing it realistically in Virtual Reality (VR). However, grasp capture – capturing the pose of a hand grasping an object, and orienting it w.r.t. the object – is difficult because of the complexity and diversity of the human hand, and occlusion. Reflective markers and magnetic trackers traditionally used to mitigate this difficulty introduce undesirable artifacts in images and can interfere with natural grasping behavior. We present preliminary work on a completely markerless algorithm for grasp capture from a video depicting a grasp. We show how recent advances in 2D hand pose estimation can be used with well-established optimization techniques. Uniquely, our algorithm can also capture hand-object contact in detail and integrate it into the grasp capture process. This is work in progress; more details are available at https://contactdb.cc.gatech.edu/grasp_capture.html.
1. Introduction
Humans, with their complex hands made of soft tissue enveloping a rigid skeletal structure, excel at grasping and manipulating objects. Capturing human grasps of household objects can enable a better understanding of grasping behavior, which can improve a wide variety of VR and human-computer interaction applications. While hand- and object-pose capture and estimation have been studied extensively in isolation, there is a lack of large-scale grasp capture datasets and algorithms. We define grasp capture as capture of both the hand and object pose in a scene depicting grasping. Partial occlusion of the object by the hand and vice versa make grasp capture and prediction difficult. As mentioned in Section 2, the only large-scale dataset employs wired magnetic trackers taped to the hand and object [5]. However, this method has the drawback of introducing unwanted artifacts in the RGB images and potentially interfering with natural grasping behavior.

Figure 1: Grasp capture for a scene depicting a cellphone being grasped. The multi-colored hand model shows the estimated hand pose and the dark blue model shows the estimated object pose.

In this paper we focus on capturing the hand pose through the 6-DOF palm pose and 20 joint angles, and the 6-DOF object pose. Since a single image is often not enough to estimate both the hand and object pose, we record a video of a human participant grasping a household object. The participant rotates and translates their hand in 3D space to present the grasp to an RGB-D camera from various perspectives (see Figure 1 for an example frame).

Hand-object contact is either ignored, or enforced without any ground-truth contact observation, in traditional grasp capture pipelines. Observing ground-truth contact has so far been very difficult, but recently Brahmbhatt et al. [1] created a large-scale dataset of detailed hand-object contact maps through thermal imaging. Different from other grasp capture pipelines, ours can also capture such contact maps and utilize them to improve capture accuracy.
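For concreteness, the quantities our pipeline estimates can be summarized in a small container like the following; this is an illustrative sketch, and the names are not from our code:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GraspCapture:
    """Quantities estimated for one grasp (illustrative sketch)."""
    palm_pose: np.ndarray     # 4x4 homogeneous transform: the 6-DOF palm pose wTp
    joint_angles: np.ndarray  # (20,) hand model joint angles, in radians
    object_pose: np.ndarray   # 4x4 homogeneous transform: the 6-DOF object pose wTo
```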
2. Related Work
Hand pose estimation is a highly researched topic, and many datasets are available publicly to train models for hand pose estimation. Hand pose is captured through data gloves [9, 7], manually annotated joint locations [15], magnetic trackers [20, 22], or fitting a hand model to depth images after manual initialization [17]. These methods capture only free hands rather than hands grasping objects.

However, as mentioned in Section 1, grasp capture also involves capturing the object pose. Relatively few works have addressed this problem. The First Person Hand Action Benchmark [5] is the only large-scale real-world dataset capturing both hand and object pose. 3D joint locations and object pose are captured through taped magnetic trackers. In addition to a limited working volume (Hasson et al. [6] mention in Section 5.2 that the object poses are imprecise and result in penetration of the hand inside the object by 1.1 cm on average), taping these long wired sensors to the hand introduces artifacts in the RGB images and can potentially interfere with natural grasping behavior.

A large number of works estimate the pose of non-grasping hands in a model-based [18, 8] or model-free [21, 14, 11] manner. Garcia-Hernando et al. [5] note that hand pose estimation in images depicting grasping benefits from including such grasping images in the training dataset. Tekin et al. [16] predict both the hand and object pose by predicting 3D hand joint and object bounding box locations. Hasson et al. [6] predict hand parameters and approximate the object with a predicted genus-0 geometry. Relying on predicted geometry reduces applicability to grasp capture for creating datasets, where object geometry is known in detail. In addition, it is not clear how accurate such algorithms will be on images from a data collection location, which can differ significantly from their training datasets.
3. Grasp Capture
As mentioned in Section 1, our aim is to capture both the hand pose (6-DOF palm pose and joint angles) and the 6-DOF object pose from a video of a human participant grasping a household object. The objects in our experiments are 3D printed at real-life scale using detailed mesh models downloaded from online repositories. Our data collection protocol builds on the protocol from ContactDB [1], in which participants hold the object for 5 s and then place it on a turntable, where it is scanned with a calibrated RGB-D-Thermal camera rig. We propose to utilize the object holding time for grasp capture.
Stage 1: The object is first placed on the turntable, where its 6-DOF pose ${}^{w}T_o$ is estimated using the depth camera point cloud and the known object 3D model.

Stage 2: The grasp video recording starts when the participant reaches for the object. The participants are instructed to hold their joints steady after a transient phase (termed grasp adjustment) in which they pick the object up and settle into a comfortable grasp. Frames of the video after this instant are used to detect 2D hand joints using the OpenPose library [14] (Figure 2). These 2D detections $x^{(i)}$ are treated as observations from a mobile virtual camera that is observing a stationary hand (the problem is inverted; in reality we have a stationary camera and a moving hand). A Structure from Motion (SfM) problem is set up using these 2D detections, and optimized using the GTSAM library [3] to recover the 3D joint locations $X$ as well as the virtual camera poses $\{{}^{w}T_c^{(i)}\}_{i=1}^{N}$, with ${}^{w}T_c^{(1)}$ anchored to the origin (Figure 3). SfM minimizes the following re-projection error:

\[
\mathcal{L}\left(X, \{{}^{w}T_c^{(i)}\}_{i=1}^{N}\right) = \sum_{i=1}^{N} \left\lVert x^{(i)} - \pi\left(X;\, {}^{w}T_c^{(i)}, K\right) \right\rVert \tag{1}
\]

where $\pi(\cdot)$ is the camera projection function.

Figure 2: OpenPose [14] is used to detect 2D joint locations in the grasp video frames.

Figure 3: Structure from Motion (SfM) is used to recover 3D joint locations (green dots connected by white lines) and virtual camera poses (3D axes with blue pointing along the camera axis). Green lines connect consecutive camera poses.

Stage 3: The hand pose is estimated by fitting a hand model (we use HumanHand20DOF from GraspIt! [10]) to the 3D joint locations $X$, in two steps (Figure 4): 1) the palm pose ${}^{w}T_p$ is recovered from the locations of the 6 rigid hand points (wrist base + bases of the 5 fingers) through the Umeyama transform [19], which estimates a 3D similarity transform, and 2) the joint angles are recovered through inverse kinematics after the hand is transformed by ${}^{w}T_p$.

Figure 4: Fitting a hand model to the 3D joint locations. Left: hand transformed by the palm pose. Right: hand after inverse kinematics fitting. Green dots: target 3D joint locations (from SfM); red dots: corresponding locations on the hand model.

Stage 4: The participant places the object back on the turntable, which starts rotating, and the object is scanned by the RGB-D-Thermal camera to construct the contact map according to the ContactDB [1] protocol (Figure 5).

Figure 5: The contact map for the grasp depicted in Figure 2, captured using ContactDB [1].
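To make Stage 2 concrete, here is a minimal sketch of how the objective in Eq. (1) could be set up with GTSAM's Python bindings; the data layout (a per-frame list of 2D joint detections) and the noise parameters are illustrative assumptions, not details of our implementation:

```python
import numpy as np
import gtsam
from gtsam.symbol_shorthand import X, L  # X(i): camera poses, L(j): 3D joints

def build_sfm_graph(detections, K):
    """detections[i][j] = (u, v) OpenPose detection of joint j in frame i;
    K is a gtsam.Cal3_S2 holding the calibrated camera intrinsics."""
    graph = gtsam.NonlinearFactorGraph()
    pix_noise = gtsam.noiseModel.Isotropic.Sigma(2, 2.0)  # ~2 px detection noise

    # One factor per (frame, joint) observation -- the summand of Eq. (1):
    # || x^(i) - pi(X; wTc^(i), K) ||
    for i, frame in enumerate(detections):
        for j, (u, v) in enumerate(frame):
            graph.add(gtsam.GenericProjectionFactorCal3_S2(
                gtsam.Point2(u, v), pix_noise, X(i), L(j), K))

    # Anchor the first virtual camera pose at the origin, as in Stage 2; a
    # prior on one landmark (not shown) can additionally fix the scale gauge.
    prior_noise = gtsam.noiseModel.Diagonal.Sigmas(np.full(6, 1e-6))
    graph.add(gtsam.PriorFactorPose3(X(0), gtsam.Pose3(), prior_noise))
    return graph

# Given initial estimates in a gtsam.Values object `initial`:
# result = gtsam.LevenbergMarquardtOptimizer(build_sfm_graph(dets, K), initial).optimize()
# joints = [result.atPoint3(L(j)) for j in range(n_joints)]
```

Stage 3's palm-pose recovery is the standard Umeyama alignment [19]; a self-contained NumPy version, with the hand model's rigid points as the source and their SfM estimates as the target, could look like:

```python
import numpy as np

def umeyama_similarity(P, Q):
    """Least-squares similarity transform (c, R, t) with Q ~ c * R @ P + t.
    P, Q: (N, 3) arrays of corresponding points (hand-model rigid points
    and their SfM estimates, respectively)."""
    mu_p, mu_q = P.mean(axis=0), Q.mean(axis=0)
    Pc, Qc = P - mu_p, Q - mu_q
    var_p = (Pc ** 2).sum() / len(P)
    Sigma = Qc.T @ Pc / len(P)                    # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(Sigma)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:  # avoid reflections
        S[2, 2] = -1
    R = U @ S @ Vt                                # rotation
    c = np.trace(np.diag(D) @ S) / var_p          # scale
    t = mu_q - c * R @ mu_p                       # translation
    return c, R, t
```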
Stage 5: The contact map from Stage 4 can be used to further refine the grasp capture by enforcing the observed contact relation, i.e. attracting the closest hand segment to contacted points, repelling it away from non-contacted points, and penalizing intersection of the hand and object (Figure 1). We follow the grasp optimization stage of the ContactGrasp algorithm [2] to perform this refinement.

The virtual camera poses estimated in Stage 2 can be used to propagate the object and palm pose to all frames of the grasp video: ${}^{c}T_o^{(i)} = \left[{}^{w}T_c^{(i)}\right]^{-1} T_{adj}\, {}^{w}T_o$ and ${}^{c}T_p^{(i)} = \left[{}^{w}T_c^{(i)}\right]^{-1} {}^{w}T_p$. Here, $T_{adj}$ is the change in the object pose during the grasp adjustment mentioned in Stage 2. To summarize, the proposed algorithm captures the hand and object pose for all frames in a video depicting a grasp from various perspectives, without requiring gloves, reflective markers or magnetic trackers.

Grasp adjustment (Stage 2), which involves in-hand manipulation [4], introduces an unknown change $T_{adj}$ in the object pose. We plan to estimate $T_{adj}$ through ICP initialized at ${}^{w}T_o$. Another caveat is that OpenPose requires a visible head and shoulders to initialize the hand detector. Since it is not desirable to record the participants' body and face for privacy reasons, we plan to develop a hand detector based on skin color segmentation.
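The pose propagation above is simply a composition of 4x4 homogeneous transforms; a minimal sketch (function and variable names are illustrative):

```python
import numpy as np

def propagate_poses(wTc, T_adj, wTo, wTp):
    """Propagate object and palm pose to every frame of the grasp video.
    wTc: list of 4x4 virtual camera poses from Stage 2 (world frame).
    T_adj: 4x4 change in object pose during grasp adjustment.
    wTo, wTp: 4x4 object and palm poses in the world frame."""
    cTo, cTp = [], []
    for wTc_i in wTc:
        cTw = np.linalg.inv(wTc_i)     # [wTc^(i)]^{-1}
        cTo.append(cTw @ T_adj @ wTo)  # object pose in frame i
        cTp.append(cTw @ wTp)          # palm pose in frame i
    return cTo, cTp
```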
4. Future Work
We plan to improve the grasp capture algorithm described in Section 3 in two aspects:

• Utilizing a more expressive hand model (e.g. MANO [13]) will allow a better fit to individual hand characteristics. Currently, the only identity-dependent parameter in our algorithm is the scale estimated during palm fitting (Stage 3). Research in hand modeling has shown [13] that many more parameters are needed to capture the diversity of human hands.

• Integrating hand fitting into the SfM problem (Eq. 1) will reduce the number of stages in the pipeline and make it less brittle. Denoting the hand parameters by $\Phi$, we plan to recover the hand pose and virtual camera poses jointly by minimizing the following cost function (see the sketch after this list):

\[
\mathcal{L}\left(\Phi, \{{}^{w}T_c^{(i)}\}_{i=1}^{N}\right) = \sum_{i=1}^{N} \left\lVert x^{(i)} - \pi\left(J(\Phi);\, {}^{w}T_c^{(i)}, K\right) \right\rVert \tag{2}
\]

where $J(\Phi)$ gives the 3D joint locations from $\Phi$.
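As an illustration, Eq. (2) could be prototyped with a generic least-squares solver before committing to a factor-graph formulation; in this sketch, joints_from_params ($J(\Phi)$) and project ($\pi$) are hypothetical placeholders for the hand model's forward kinematics and the camera projection:

```python
import numpy as np
from scipy.optimize import least_squares

def make_residuals(detections, K, joints_from_params, project, n_phi):
    """detections: (N, n_joints, 2) array of 2D joints x^(i); K: intrinsics.
    joints_from_params and project are placeholders for J(Phi) and pi(.)."""
    n_frames = len(detections)

    def residuals(theta):
        phi = theta[:n_phi]                        # hand parameters Phi
        cams = theta[n_phi:].reshape(n_frames, 6)  # one 6-DOF pose per frame
        joints = joints_from_params(phi)           # (n_joints, 3)
        # Stack the summands of Eq. (2): x^(i) - pi(J(Phi); wTc^(i), K)
        return np.concatenate([
            (detections[i] - project(joints, cams[i], K)).ravel()
            for i in range(n_frames)])

    return residuals

# theta0 stacks initial hand parameters and per-frame camera poses:
# sol = least_squares(make_residuals(dets, K, J_fn, pi_fn, n_phi), theta0)
```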
5. Conclusion
In summary, this paper presents preliminary work on a completely markerless grasp capture algorithm that utilizes well-established geometric optimization techniques and recent advances in 2D hand keypoint detection. In addition to the hand and object pose, our algorithm also captures detailed hand-object contact, which is an important component of grasping. We also discuss ways to improve the proposed algorithm. Markerless grasp capture can enable a better understanding of human grasping behavior, and can generate datasets for training models to predict various aspects of grasping, like physically plausible hand pose, occurrence of contact at object and hand locations, and potentially even the locations and directions of forces being applied to the object [12]. These models have applications in VR and human-computer interaction.
References

[1] Samarth Brahmbhatt, Cusuh Ham, Charles C. Kemp, and James Hays. ContactDB: Analyzing and predicting grasp contact via thermal imaging. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2019.
[2] Samarth Brahmbhatt, Ankur Handa, James Hays, and Dieter Fox. ContactGrasp: Functional multi-finger grasp synthesis from contact. arXiv preprint arXiv:1904.03754, 2019.
[3] Frank Dellaert. Factor graphs and GTSAM: A hands-on introduction. Technical report, Georgia Institute of Technology, 2012.
[4] Charlotte E. Exner. In-hand manipulation skills. Development of Hand Skills in the Child, pages 35–45, 1992.
[5] Guillermo Garcia-Hernando, Shanxin Yuan, Seungryul Baek, and Tae-Kyun Kim. First-person hand action benchmark with RGB-D videos and 3D hand pose annotations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 409–419, 2018.
[6] Yana Hasson, Gül Varol, Dimitrios Tzionas, Igor Kalevatykh, Michael J. Black, Ivan Laptev, and Cordelia Schmid. Learning joint reconstruction of hands and manipulated objects. In CVPR, 2019.
[7] Guido Heumer, Heni Ben Amor, Matthias Weber, and Bernhard Jung. Grasp recognition with uncalibrated data gloves - a comparison of classification methods. In , pages 19–26. IEEE, 2007.
[8] Nikolaos Kyriazis and Antonis Argyros. Physically plausible 3D scene tracking: The single actor hypothesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9–16, 2013.
[9] Yun Lin and Yu Sun. Grasp planning based on strategy extracted from demonstration. In , pages 4458–4463. IEEE, 2014.
[10] Andrew T. Miller and Peter K. Allen. GraspIt! A versatile simulator for robotic grasping. IEEE Robotics & Automation Magazine, 11(4):110–122, 2004.
[11] Gyeongsik Moon, Ju Yong Chang, and Kyoung Mu Lee. V2V-PoseNet: Voxel-to-voxel prediction network for accurate 3D hand and human pose estimation from a single depth map. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5079–5088, 2018.
[12] Grégory Rogez, James S. Supancic, and Deva Ramanan. Understanding everyday hands in action from RGB-D images. In Proceedings of the IEEE International Conference on Computer Vision, pages 3889–3897, 2015.
[13] Javier Romero, Dimitrios Tzionas, and Michael J. Black. Embodied hands: Modeling and capturing hands and bodies together. ACM Transactions on Graphics (Proc. SIGGRAPH Asia), 36(6), Nov. 2017.
[14] Tomas Simon, Hanbyul Joo, Iain Matthews, and Yaser Sheikh. Hand keypoint detection in single images using multiview bootstrapping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1145–1153, 2017.
[15] Srinath Sridhar, Antti Oulasvirta, and Christian Theobalt. Interactive markerless articulated hand motion tracking using RGB and depth data. In Proceedings of the IEEE International Conference on Computer Vision, pages 2456–2463, 2013.
[16] Bugra Tekin, Federica Bogo, and Marc Pollefeys. H+O: Unified egocentric recognition of 3D hand-object poses and interactions. arXiv preprint arXiv:1904.05349, 2019.
[17] Jonathan Tompson, Murphy Stein, Yann LeCun, and Ken Perlin. Real-time continuous pose recovery of human hands using convolutional networks. ACM Transactions on Graphics, 33, August 2014.
[18] Dimitrios Tzionas, Luca Ballan, Abhilash Srikantha, Pablo Aponte, Marc Pollefeys, and Juergen Gall. Capturing hands in action using discriminative salient points and physics simulation. International Journal of Computer Vision, 118(2):172–193, 2016.
[19] Shinji Umeyama. Least-squares estimation of transformation parameters between two point patterns. IEEE Transactions on Pattern Analysis & Machine Intelligence, (4):376–380, 1991.
[20] Aaron Wetzler, Ron Slossberg, and Ron Kimmel. Rule of thumb: Deep derotation for improved fingertip detection. arXiv preprint arXiv:1507.05726, 2015.
[21] Qi Ye, Shanxin Yuan, and Tae-Kyun Kim. Spatial attention deep net with partial PSO for hierarchical hybrid hand pose estimation. In European Conference on Computer Vision, pages 346–361. Springer, 2016.
[22] Shanxin Yuan, Qi Ye, Bjorn Stenger, Siddhant Jain, and Tae-Kyun Kim. BigHand2.2M benchmark: Hand pose dataset and state of the art analysis. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.