DeepHPS: End-to-end Estimation of 3D Hand Pose and Shape by Learning from Synthetic Depth
Jameel Malik, Ahmed Elhayek, Fabrizio Nunnari, Kiran Varanasi, Kiarash Tamaddon, Alexis Heloir, Didier Stricker
AV group, DFKI Kaiserslautern, Germany; DFKI-MMCI, SLSI group, Saarbruecken, Germany; NUST-SEECS, Pakistan
Abstract
Articulated hand pose and shape estimation is an important problem for vision-based applications such as augmented reality and animation. In contrast to the existing methods, which optimize only for joint positions, we propose a fully supervised deep network which learns to jointly estimate a full 3D hand mesh representation and pose from a single depth image. To this end, a CNN architecture is employed to estimate parametric representations, i.e. hand pose, bone scales and complex shape parameters. Then, a novel hand pose and shape layer, embedded inside our deep framework, produces 3D joint positions and the hand mesh. Lack of sufficient training data with varying hand shapes limits the generalized performance of learning-based methods. Also, manually annotating real data is suboptimal. Therefore, we present SynHand5M: a million-scale synthetic dataset with accurate joint annotations, segmentation masks and mesh files of depth maps. Among model-based learning (hybrid) methods, we show improved results on our dataset and two of the public benchmarks, i.e. NYU and ICVL. Also, by employing a joint training strategy with real and synthetic data, we recover the 3D hand mesh and pose from real images in 3.7 ms.
1. Introduction
3D hand pose estimation is essential for many computer vision applications such as activity recognition, human-computer interaction and modeling user intent. However, the advent of virtual and augmented reality technologies makes it necessary to reconstruct the 3D hand surface together with the pose. Recent years have seen great progress on the pose estimation task, primarily due to significant developments in deep learning and the availability of low-cost commodity depth sensors. However, the stated problem is still far from being solved due to many challenging factors, including large variations in hand shapes, view-point changes, many degrees of freedom (DoFs), a constrained parameter space, self-similarity and occlusions.

Figure 1: Real hand pose and shape recovery: We describe a deep network for recovering the 3D hand pose and shape of NYU [43] depth images by learning from synthetic depth. Note that we infer 3D pose and shape even in cases of missing depth and occluded fingers.

Large amounts of training data, enriched with all possible variations in each of the challenging aspects stated above, are a key requirement for deep learning based methods to generalize well and achieve significant gains in accuracy. The recent real dataset [53] gathers a sufficient number of annotated images; however, it is very limited in hand shape variation (i.e. only 10 subjects). Progress in essential tasks such as estimation of the hand surface and hand-part segmentation is hampered, as manual supervision for such problems at large scale is extremely expensive. In this paper, we generate a synthetic dataset that addresses these problems. It not only allows us to create virtually infinite training data, with large variations in shapes and view-points, but it also produces annotations that are highly accurate even in the case of occlusions. One weakness of synthetic datasets is their limited realism. A solution to this problem has been proposed by [32, 18], where a generative adversarial network is employed to improve the realism of synthetic images. However, producing realistic images is not the same problem as improving the recognition rates of a convolutional neural network (CNN) model. In this paper, we address this latter problem, and specifically focus on a wide variation of hand shapes, including extreme shapes that are not very common (in contrast to [30]). We present SynHand5M: a new million-scale synthetic dataset with accurate ground truth joint positions, angles, mesh files, and segmentation masks of depth frames; see Figure 2. Our SynHand5M dataset opens up new possibilities for advanced hand analysis.

Currently, CNN-based discriminative methods are the state of the art; they estimate 3D joint positions directly from depth images [8, 21, 4, 27].
However, a major weakness of these methods is that the predictions are coarse, with no explicit consideration of kinematic and geometric constraints. Sinha et al. [35] propose to estimate the 3D shape surface from a depth image or hand joint angles using a CNN. However, their approach neither estimates hand pose nor considers kinematic and physical constraints. Also, these methods generalize poorly to unseen hand shapes [52]. On the other hand, building a personalized hand model requires a different, generative approach that optimizes a complex energy function to generate the hand pose [29, 24, 26, 40, 42]. However, person-specific hand model calibration clearly restricts the generalization of these methods to varying hand shapes. Hybrid methods combine the advantages of both discriminative and generative approaches [6, 34, 23, 41, 36, 50, 54]. To the best of our knowledge, none of the existing works explicitly addresses the problem of jointly estimating the full hand shape surface, bone-lengths and pose in a single deep framework.

In this paper, we address the problem of generalizing 3D hand pose and surface geometry estimation over varying hand shapes. We propose to embed a novel hand pose and shape layer (HPSL) inside a deep network to jointly optimize for 3D hand pose and shape surface. The proposed CNN architecture simultaneously estimates the hand pose parameters, bone scales and shape parameters. All these parameters are fed to the HPSL, which implements not only a new forward kinematics function, but also the fitting of a morphable hand model and linear blend skinning, to produce both 3D joint positions and the 3D hand surface; see Figure 3. The whole pipeline is trained in an end-to-end manner. In sum, our contributions are:

1. A novel deep network layer which performs: (a) forward kinematics using a new combination of hand pose and bone scale parameters; (b) reconstruction of a morphable hand model from hand shape parameters and the morph targets; (c) linear blend skinning to animate the 3D hand surface; see Section 4.2.
2. A novel end-to-end framework for simultaneous hand pose and shape estimation; see Section 3.
3. A new million-scale synthetic hand pose dataset that offers accurate ground truth joint angles, 3D joint positions, 3D mesh vertices, and segmentation masks; see Section 5. The synthetic dataset will be publicly available.

Figure 2: The SynHand5M dataset contains 5 million images. (a) The dataset ground truth components: hand poses (joint angles and 3D positions), depth maps, mesh files, and hand-part segmentation. (b) Samples illustrating the big variation in shape.
2. Related Work
Depth-based hand pose estimation has been extensively studied in the computer vision community. We refer the reader to the survey [39] for a detailed overview of the field. Recently, a comprehensive analysis and investigation of the state of the art, along with future challenges, has been presented by [52]. The approaches can be roughly divided into generative, discriminative and hybrid methods. In this section, we briefly review the existing hand pose benchmarks. Then, we focus our discussion on CNN-based discriminative and hybrid methods.
Existing Benchmarks.
Common shortcomings in existing real hand datasets are low variation in hand shape, inaccurate ground truth annotations, an insufficient amount of training data, low complexity (e.g. occlusion) of hand poses, and limited view-point coverage. The most commonly used benchmarks are NYU [43], ICVL [41] and MSRA15 [38]. The NYU hand pose dataset uses a model-based direct search method for annotating ground truth, which is quite accurate, and it covers a good range of complex hand poses. However, its training set has a single hand shape. The ICVL dataset uses a guided Latent Tree Model (LTM) based search method and mostly contains highly inaccurate ground truth annotations. Moreover, it uses only one hand model [53]. MSRA15 employs an iterative optimization method [26] for annotating, followed by manual refinement. It uses a limited set of hand poses; however, it has large view-point coverage. The major limitations of this dataset are its limited size and low annotation accuracy. Recently, Yuan et al. [53] proposed a million-scale real hand pose dataset, but it has low variation in hand shape (i.e. only 10 subjects). Some other very small real hand pose datasets, such as Dexter-1 [37], ASTAR [49] and MSRA14 [26], are not suited for large-scale training. Several works have focused on creating synthetic hand pose datasets. MSRC [31] is a synthetic benchmark; however, it has only one hand model and limited pose space coverage. In [35, 19], medium-scale synthetic hand datasets are used to train CNN models, but they are not publicly available. Given the hard problem of collecting and annotating a large-scale real hand pose dataset, we propose the first million-scale synthetic benchmark, which consists of more than 5 million depth images together with ground truth joint positions, angles, mesh files, and segmentation masks.

CNN-based Discriminative Methods.
Recent works such as [17, 2, 47, 9, 7, 27] exceed the accuracy of random decision forest (RDF) based discriminative methods [31, 38, 46, 48, 13]. A few works have used either RGB or RGB-D data to predict 3D joint positions [56, 25, 33, 20]. In [7], Ge et al. directly regress 3D joint coordinates using a 3D CNN. Recently, [17] introduced a voxel-to-voxel regression framework which exploits a one-to-one relationship between the voxelized input depth and the output 3D heatmaps. [9, 47] introduce a powerful region ensemble strategy which integrates the outputs from multiple regressors on different regions of the depth input. Chen et al. [2] extended [47] with an iterative, pose-guided region ensemble strategy. In [35], discriminative hand shape estimation is proposed. Although the accuracy of these methods is the state of the art, they impose no explicit geometric and physical constraints on the estimated pose. Also, these methods still fail to generalize to unseen hand shapes [52].
CNN-based Hybrid Methods.
Tompson et al. [43] employed a CNN for estimating 2D heatmaps, and thereafter apply inverse kinematics for hand pose recovery. Extending this work, [6] utilize a 3D CNN for 2D heatmap estimation and afterwards regress 3D joint positions. Oberweger et al. [23] utilize three CNNs combined in a feedback loop to regress 3D joint positions; the network comprises an initial pose estimator, a synthesizer and finally a pose update network. Ye et al. [51] present a hybrid framework using a hierarchical spatial attention mechanism and hierarchical PSO. Wan et al. [44] implicitly model the dependencies in the hand skeleton by learning a shared latent space. In [55], a forward kinematics layer, with physical constraints and a fixed hand model, is implemented in an end-to-end training framework. Malik et al. [14] further extend this work by introducing a flexible hand geometry in the training pipeline; their algorithm simultaneously estimates bone-lengths and hand pose. In [45], a multi-task cascade network is employed to predict 2D/3D joint heatmaps along with 3D joint offsets. Dibra et al. [5] introduce an end-to-end training pipeline to refine the hand pose using an unlabeled dataset. All of the above methods cast the problem of hand pose estimation as 3D joint regression only. Our argument is that, given the inherent 3D surface geometry information in depth inputs, a differentiable hand pose and shape layer can be embedded in the deep learning framework to regress not only the 3D joint positions but also the full 3D mesh of the hand.
3. Method Overview
We aim to jointly estimate the locations of the J = 22 hand joints and the ϑ = 1193 vertices of the hand mesh from a single depth image D_I. Our hand skeleton in rest pose is shown in Figure 3(b). It has J hand joints defined on the model's DoFs. The hand root has 6 DoFs: 3 for global orientation and 3 for global translation. All other DoFs are defined for joint articulations. The pose vector is initialized for the rest pose, called θ_init. Any other pose Θ can be constructed by adding a change δθ to the rest pose, i.e. Θ = θ_init + δθ. The bone-lengths B are initialized by averaging over all bone-lengths of the different hand shapes in our synthetic dataset. In order to add flexibility to the hand skeleton, bone scales α are associated with the bone-lengths. Our hand mesh has ϑ vertices and the corresponding triangular faces. The neutral hand surface is shown in Figure 3(b). We use hand shape parameters β which allow us to formulate the surface geometry of a desired hand shape in the reference pose; see Section 5.

Our pipeline is shown in Figure 3(a). First, a new CNN architecture estimates δθ, α and β given a depth input D_I. This architecture consists of a PoseCNN, which estimates δθ, and a ShapeCNN, which estimates α and β. Thereafter, a new non-linear hand pose and shape layer (HPSL) performs forward kinematics, hand shape surface reconstruction and linear blend skinning. The outputs of the layer are the 3D joint positions and the hand surface vertices. These outputs are used to compute the standard Euclidean losses for joint positions and vertices; see Equation 2. The complete pipeline is trained end-to-end in a fully supervised manner.
4. Joint Hand Pose and Shape Estimation
In this section, we discuss the components of our pipeline, which are shown in Figure 3(a). We explain the novel Hand Pose and Shape Layer (HPSL) in detail because it is the main component which allows us to jointly estimate hand pose and shape surface.
Our CNN architecture comprises three parallel CNNs to learn δθ, α and β, given D_I. The PoseCNN leverages one of the state-of-the-art CNNs [9] to estimate the joint angle changes δθ. Note that this CNN was originally used to regress 3D hand joint positions; see Section 2. We refer the reader to [9] for the network details of the Region Ensemble network (REN). In our implementation, the final regressor in REN outputs the δθ vector. The ShapeCNN consists of two simpler CNNs similar to [22], called α-CNN and β-CNN. Each of them has convolutional layers, the first two of which are followed by max-pooling layers. These are followed by two fully connected (FC) layers with dropout. After the second FC layer, the final FC layers in the α-CNN and β-CNN output the α and β parameters, respectively. All layers use ReLU as the activation function.

Figure 3: (a) An overview of our method for simultaneous 3D hand pose and surface estimation. A depth image D_I is passed through three CNNs to estimate pose parameters δθ, bone scales α and shape parameters β. These parameters are sent to the HPSL, which generates the hand joint positions P and hand surface vertices V. (b) Our hand model overlaid with the neutral hand shape b_0. The bone colors illustrate the bone-length scales α.

The HPSL is a non-linear differentiable layer, embedded inside the deep network as shown in Figure 3(a). The task of the layer is to produce the 3D joint positions P ∈ R^{3×J} and the hand mesh vertices V ∈ R^{3×ϑ}, given the pose parameters Θ, the hand bone scales α and the shape parameters β. The layer function can be written as:

(P, V) = HPSL(Θ, β, α)    (1)

We compute the respective gradients in the layer for back-propagation.
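As a concrete illustration, the supervision applied to the HPSL outputs of Equation 1 can be sketched in a few lines of NumPy. This is an illustrative re-implementation, not the authors' Caffe code; array shapes are assumptions:

```python
import numpy as np

def euclidean_losses(P, P_gt, V, V_gt):
    """Half squared-Euclidean losses on the HPSL outputs of Eq. (1):
    L_J = 1/2 ||P - P_GT||^2 on the 3D joint positions,
    L_V = 1/2 ||V - V_GT||^2 on the mesh vertices."""
    L_J = 0.5 * np.sum((P - P_gt) ** 2)
    L_V = 0.5 * np.sum((V - V_gt) ** 2)
    return L_J, L_V
```

Both terms are summed over all coordinates of all joints and vertices, so a perfect prediction yields zero loss.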
The Euclidean 3D joint location and 3D vertex location losses are given as:

L_J = ½ ‖P − P_GT‖², L_V = ½ ‖V − V_GT‖²    (2)

where L_J and L_V are the 3D joint and vertex losses, respectively, and P_GT and V_GT are the vectors of 3D ground truth joint positions and mesh vertices, respectively. The various functions inside the layer are detailed as follows.

Hand Skeleton Bone-lengths Adaptation: In order to adapt the bone-lengths of the hand skeleton during training over the varying hand shapes in the dataset, [14] propose various bone-length scaling strategies. Following a similar approach, we assign a separate scale parameter s_p for the bone-lengths in the palm and different scales for the finger bones, as shown in Figure 3(b). The HPSL acquires the scaling parameters α = [s_p, s_1, s_2, s_3, s_4, s_5] from the ShapeCNN during the training process.

Morphable Hand Model Formulation: Given the shape parameters β learned by our ShapeCNN, we reconstruct the hand shape surface by implementing a morphable hand model inside our HPSL. A morphable hand model Ψ ∈ R^{3×ϑ} is a set of 3D vertices representing a particular hand shape. Any morphable hand model can be expressed as a linear combination of principal hand shape components, called morph targets b_t [11]. Our principal hand shape components are defined for Length, Mass, Size, Palm Length, Fingers Inter-distance, Fingers Length and Fingers Tip-Size. They represent offsets from a neutral hand shape b_0, similar to the one shown in Figure 3(b). Each learned shape parameter β_t defines the amount of contribution of a principal shape component b_t towards the formulation of the final hand morphable model. Hence, a hand morphable model Ψ can be formulated using the following equation:

Ψ(β) = b_0 + Σ_t β_t (b_t − b_0)    (3)

Forward Kinematics and Geometric Skinning: To estimate the 3D hand joint positions and surface vertices, we implement forward kinematics and geometric skinning functions inside our HPSL. As this layer is part of our deep network, it is essential to compute and back-propagate the gradients of these functions. The rest of this section addresses the definition of these functions and their gradients. The deformation of the hand skeleton from the reference pose θ_init to the current pose Θ can be obtained by transforming each joint j_i along the kinematic chain by simple rigid transformation matrices. In our algorithm, these matrices are updated based on the bone scales α and the changes in pose parameters δθ, which are learned by our ShapeCNN and PoseCNN, respectively. The kinematics equation of joint j_i can be written as:

j_i = F_{j_i}(Θ, α) = M_{j_i} [0, 0, 0, 1]^T = ( Π_{k ∈ S_{j_i}} [R_{φ_k}(θ_k)] × [T_{φ_k}(α B)] ) [0, 0, 0, 1]^T    (4)

where M_{j_i} represents the transformation matrix from the zero pose (i.e. the joint at position [0, 0, 0, 1]^T) to the current pose, S_{j_i} is the set of joints along the kinematic chain from j_i to the root joint, and φ_k is one of the rotation axes of joint k.

For animating the 3D hand mesh, we use linear blend skinning [12] to deform the set of vertices ϑ according to the underlying hand skeleton's kinematic transformations. The skinning weights ω_i define the skeleton-to-skin bindings; their values represent the influence of the joints on their associated vertices. Normally, the weights of each vertex are assumed to be convex (i.e. Σ_{i=1}^{n} ω_i = 1 and ω_i > 0).
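The blend-shape combination of Equation 3 is linear in β, so its gradient with respect to β_t is simply the offset field (b_t − b_0). A minimal NumPy sketch, together with a finite-difference check of that gradient; note that the skinning factor ω_i C_{j_i} of the full layer gradient is deliberately omitted here, and all shapes and names are illustrative rather than the paper's implementation:

```python
import numpy as np

def morph_hand(beta, b0, targets):
    """Eq. (3): Psi(beta) = b0 + sum_t beta_t * (b_t - b0).
    b0:      (V, 3) neutral hand vertices
    targets: (T, V, 3) morph targets b_t, one per shape component
    beta:    (T,) learned shape parameters"""
    offsets = targets - b0[None]                    # (T, V, 3)
    return b0 + np.tensordot(beta, offsets, axes=1)

def dpsi_dbeta(b0, targets, t):
    """Analytic gradient of Psi w.r.t. beta_t: the t-th offset field."""
    return targets[t] - b0

# Finite-difference check of the analytic gradient (central differences).
rng = np.random.default_rng(0)
b0 = rng.normal(size=(6, 3))
targets = rng.normal(size=(2, 6, 3))
beta = rng.normal(size=2)
eps = 1e-6
e0 = np.array([eps, 0.0])
numeric = (morph_hand(beta + e0, b0, targets) -
           morph_hand(beta - e0, b0, targets)) / (2 * eps)
assert np.allclose(numeric, dpsi_dbeta(b0, targets, 0), atol=1e-6)
```

Because Ψ is linear in β, the central difference matches the analytic offset field up to floating-point error.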
The transformation of a vertex v_κ ∈ Ψ can be defined as:

v′_κ = Υ_{v_κ}(Θ, β, α) = Σ_{i ∈ P_{v_κ}} ω_i C_{j_i} v_κ(β) = Σ_{i ∈ P_{v_κ}} ω_i C_{j_i} ( b_0^{v_κ} + Σ_t β_t (b_t^{v_κ} − b_0^{v_κ}) )    (5)

where P_{v_κ} is the set of joints influencing the vertex v_κ, and C_{j_i} is the transformation matrix of each joint j_i from its reference pose θ_init to its actual position in the current animated posture. C_{j_i} can be represented as:

C_{j_i} = M_{j_i} (M*_{j_i})^{−1}    (6)

where (M*_{j_i})^{−1} is the inverse of the reference-pose transformation matrix.

Gradients computation: For the backward pass in the HPSL, we compute the gradients of the following equation with respect to the layer inputs:

HPSL(Θ, β, α) = ( F(Θ, α), Υ(Θ, β, α) )    (7)

Each vertex v_κ = HPSL_{v_κ}(Θ, β, α) in the reconstructed hand morphable model Ψ is deformed using Equation 5. Hence, its gradient with respect to a shape parameter β_t can be computed as:

∂(HPSL_{v_κ}) / ∂β_t = Σ_i ω_i C_{j_i} (b_t^{v_κ} − b_0^{v_κ})

According to Equation 7, the bone scales influence both the joint positions and the vertex positions. Hence, the resultant gradient with respect to a hand scale parameter α_s can be calculated as:

∂(HPSL) / ∂α_s = ∂F / ∂α_s + ∂Υ / ∂α_s

To compute the partial derivative of F with respect to α_s, we need to differentiate each joint with respect to its associated scale parameter. The gradient of a joint with respect to α_s can be computed by replacing the scaled translation matrix containing α_s by its derivative while keeping all other matrices the same; see the supplementary document. In a similar way, the gradient of a vertex v_κ with respect to α_s can be computed by:

∂Υ_{v_κ} / ∂α_s = Σ_i ω_i (∂C_{j_i} / ∂α_s) v_κ = Σ_i ω_i [ M_{j_i} ((M*_{j_i})^{−1})′ + (M_{j_i})′ (M*_{j_i})^{−1} ] v_κ

Likewise, for the pose parameters Θ, we compute:

∂(HPSL) / ∂θ_p = ∂F / ∂θ_p + ∂Υ / ∂θ_p

Accordingly, the derivative of a joint with respect to a pose parameter θ_p is obtained by simply replacing the rotation matrix of θ_p by its derivative; see the supplementary document. The derivative of a vertex v_κ with respect to θ_p is computed by:

∂Υ_{v_κ} / ∂θ_p = Σ_i ω_i (∂C_{j_i} / ∂θ_p) v_κ = Σ_i ω_i [ (M_{j_i})′ (M*_{j_i})^{−1} ] v_κ

More details about the gradient computation can be found in the supplementary document.
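The kinematic chain of Equation 4 and the skinning of Equations 5 and 6 can be sketched on a toy planar chain. This is a hypothetical two-bone "finger" with z-axis rotations only (the real layer uses the full per-joint rotation axes φ_k), intended only to make the matrix composition concrete:

```python
import numpy as np

def rot_z(a):
    """4x4 homogeneous rotation about the z axis, R(theta_k) in Eq. (4)."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0, 0], [s, c, 0, 0],
                     [0, 0, 1, 0], [0, 0, 0, 1.0]])

def trans_x(d):
    """4x4 translation along x by a scaled bone length, T(alpha*B)."""
    M = np.eye(4)
    M[0, 3] = d
    return M

def chain_transforms(thetas, bone_lengths, scales):
    """Cumulative joint transforms M_ji along one kinematic chain."""
    Ms, M = [], np.eye(4)
    for th, B, s in zip(thetas, bone_lengths, scales):
        M = M @ rot_z(th) @ trans_x(s * B)
        Ms.append(M.copy())
    return Ms

def joint_positions(Ms):
    """Eq. (4): j_i = M_ji [0, 0, 0, 1]^T."""
    return [M @ np.array([0, 0, 0, 1.0]) for M in Ms]

def skin_vertex(v, weights, Ms, Ms_rest):
    """Eqs. (5)-(6): v' = sum_i w_i * M_ji * inv(M*_ji) * v."""
    vh = np.append(v, 1.0)
    out = np.zeros(4)
    for w, M, Mr in zip(weights, Ms, Ms_rest):
        out += w * (M @ np.linalg.inv(Mr) @ vh)  # C_ji = M_ji inv(M*_ji)
    return out[:3]
```

In the rest pose every C_{j_i} is the identity, so the skinning leaves the mesh unchanged; bending a joint rotates the vertices bound to it about that joint.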
5. Synthetic Dataset
There are two main objectives in creating our synthetic dataset. The first is to jointly recover the full hand shape surface and pose, given that no ground truth hand surface information is available in the public benchmarks; see Section 6.2. The second objective is to provide training data with sufficient variation in hand shapes and poses such that a CNN model can be pre-trained to improve recognition rates on real benchmarks; see Section 6.3. This problem is different from generating a very realistic hand shape, where a real-world statistical hand model [30] can be applied. However, variation in shape is more challenging for real-world databases: e.g. the BigHand2.2M [53] database was captured from only 10 users, and the MANO [30] database was built from the contributions of a small cohort of users. Instead, we generate a bigger hand shape variation, which may not be present in a given cohort of human users.

Our SynHand5M dataset offers over 5 million images, split into train and test sets; see Figure 2(a) for the SynHand5M components. SynHand5M uses the hand model generated by ManuelBastioniLAB [15], a procedural full-body generator distributed as an add-on of the Blender [1] 3D authoring software. Our virtual camera simulates a Creative Senz3D Interactive Gesture Camera [3]. It renders images at a resolution of 320x240 using a diagonal field of view of 74 degrees. In the default position, the hand palm faces the camera orthogonally and the fingers point up. We procedurally modulate many parameters controlling the hand and generate images by rendering the view from the virtual camera. The parameters characterizing the hand model belong to three categories: hand shape, pose and view point.

Without constraints, the hand generator can easily lead to impossible hand shapes. So, in order to define realistic range limits for modulating hand shapes, we relied on the DINED [16] anthropometric database. DINED is a repository collecting the results of several anthropometric databases, including the CAESAR surface anthropometry survey [28]. We manually tuned the ranges of the hand shape parameters (see Section 4.2) in order to cover 99% of the measured population in this dataset; see the supplementary document for more details.

To modulate the hand pose, we manipulate the DoFs of our hand model; see Figure 3(b). For each finger, rotations are applied to the flexion of all phalanges plus the abduction of the proximal phalanx. Additionally, in order to increase the realism of the closed-fist configuration, the roll of the middle, ring, and pinky fingers is derived from the abduction angle of the same phalanx. The rotation limits are set to bring the hand from a closed fist to an over-extended aperture, respecting anatomical constraints and preventing the fingers from entering the palm.

The hand can rotate about three DoFs to generate different view points: roll around its longitudinal axis (i.e. along the fingers), rotation around the palm's orthogonal axis (i.e. rolling in front of the camera), and rotation around its transversal axis (i.e. flexion/extension of the wrist).
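A sketch of how such bounded shape modulation might look. The parameter names follow Section 4.2, but the numeric ranges below are placeholders, not the DINED-derived limits from the supplementary document:

```python
import numpy as np

# Placeholder [lo, hi] ranges per shape parameter; in the paper the actual
# limits are tuned against the DINED anthropometric data.
SHAPE_RANGES = {
    "length": (-1.0, 1.0),
    "mass": (-1.0, 1.0),
    "size": (-1.0, 1.0),
    "palm_length": (-0.8, 0.8),
    "fingers_inter_distance": (-0.5, 0.5),
    "fingers_length": (-0.8, 0.8),
    "fingers_tip_size": (-0.5, 0.5),
}

def sample_shape_params(rng):
    """Draw one hand shape uniformly inside the allowed ranges."""
    return {name: rng.uniform(lo, hi)
            for name, (lo, hi) in SHAPE_RANGES.items()}
```

Sampling uniformly inside hard range limits is one simple way to guarantee that every generated hand stays within the anthropometrically plausible region.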
6. Experiments and Results
In this section, we provide the implementation details and quantitative and qualitative evaluations of the proposed algorithm and the proposed dataset. We use three evaluation metrics: mean 3D joint location error (JLE), 3D vertex location error (VLE), and the percentage of images within certain error thresholds in mm.

Recent CNN-based discriminative methods such as [7, 47, 17, 27] outperform CNN-based hybrid methods; see Section 2. However, due to direct joint regression, discriminative methods neither explicitly account for hand shapes nor consider kinematic constraints [55, 14]. Moreover, in contrast to hybrid methods, discriminative methods generalize poorly to unseen hand shapes; see [52]. Our proposed hybrid method does not exceed the accuracy of recent discriminative works, but it does not suffer from these limitations; therefore, it is not fair to compare with those methods. However, we compare with the state-of-the-art hybrid methods and show improved performance. Notably, we propose the first algorithm that jointly regresses hand pose, bone-lengths and shape surface in a single network.

Figure 4: Quantitative evaluation. (a) shows the results of our algorithm (DeepHPS) on the ICVL test set, when trained on ICVL and when fine-tuned on ICVL. (b) is the same but with NYU. To fine-tune, we pretrain DeepHPS on our SynHand5M. Our results on ICVL show improved accuracy over the state-of-the-art hybrid methods (e.g. LRF [41] and DeepModel [55]). On NYU, the results are better than the state-of-the-art hybrid methods (e.g. DeepPrior [22], DeepPrior-Refine [22], Feedback [23], DeepModel [55] and Lie-X [48]). The curves show the number of frames with error within certain thresholds.
For training, we pre-process the raw depth data for standardization and depth invariance. We start by computing the centroid of the hand region in the depth image. The obtained 3D hand center location (i.e. the palm center) is used to crop the depth frame; the camera intrinsics (i.e. the focal length) and a bounding box are used during the crop. The pre-processed depth image is normalized to the range [−1, 1], and the annotations in camera coordinates are likewise normalized by the bounding box size and clipped to [−1, 1].

We use Caffe [10], an open-source training framework for deep networks. The complete pipeline is trained end-to-end until convergence, using SGD with a momentum of 0.9. The framework is executed on a desktop equipped with an Nvidia GeForce GTX GPU. One forward pass takes 3.7 ms to generate the 3D hand joint positions and shape surface. For simplicity, we name our method DeepHPS.

In this subsection, we evaluate our complete pipeline using SynHand5M. Moreover, we devise a joint training strategy for real and synthetic datasets to show qualitative hand surface reconstruction on real images.
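The depth standardization step can be sketched as follows. The cube half-size and the treatment of missing pixels are assumptions chosen for illustration; the paper's exact crop parameters are not reproduced here:

```python
import numpy as np

def normalize_depth_crop(crop, center_z, cube_half=150.0):
    """Center a cropped depth patch at the palm depth, scale to [-1, 1].
    crop:      2D depth patch in mm (0 marks missing depth)
    center_z:  depth of the detected hand centroid in mm
    cube_half: assumed half-size of the crop cube in mm"""
    d = crop.astype(np.float64).copy()
    d[d == 0] = center_z + cube_half      # missing depth -> far plane
    d = (d - center_z) / cube_half        # zero-center and scale
    return np.clip(d, -1.0, 1.0)
```

Mapping every crop to the same zero-centered range makes the network input invariant to the absolute distance of the hand from the camera.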
Evaluation on the synthetic dataset: The complete pipeline is trained end-to-end using SynHand5M for pose and shape recovery. For a fair comparison, we train the state-of-the-art model-based learning methods [55, 14] on SynHand5M; [14] works for varying hand shapes, in contrast to the closely related method [55]. The quantitative results are shown in Table 1. Our method clearly exceeds the accuracy of the compared methods and additionally reconstructs the full hand surface. The qualitative results are shown in Figure 6. The estimated joint positions are overlaid on the depth images, while the reconstructed hand surface is shown from two different views, named 3D View1 and 3D View2. For better visualization, view 2 is similar to the ground truth view. The results demonstrate that our DeepHPS model infers the correct hand shape surface even in cases of occlusion of several fingers and large variation in view points.

Figure 5: Real hand pose and shape recovery: More results on hand pose and surface reconstruction of NYU [43] images. Despite the unavailability of ground truth hand mesh vertices, our algorithm produces plausible hand shapes.
Evaluation on the NYU real dataset : In order to jointlytrain our whole pipeline on both real and synthetic data, wefound closely matching common joint positions in Syn-Hand5M and the NYU dataset. These common joints aredifferent from the joints used for the public comparisons[43]. The loss equation is; L = L J + L V (8)where is an indicator function which specifies whetherthe ground truth for mesh vertices is available or not. In oursetup, it is 1 for synthetic images and 0 for real images. Forreal images, backpropagation from surface reconstructionpart is disabled.The qualitative pose and surface shape results on sampleNYU real images are shown in Figure 1 and 5. Despite ofthe missing ground truth surface information and presenceof high camera noise in NYU images, the resulting handsurface is plausible and the algorithm performs well in caseof missing depth information and occluded hand parts. The public benchmarks do not provide ground truth handmesh files. Therefore, we provide quantitative results forpose inference on two of the real hand pose datasets (i.e.NYU and ICVL). For comparisons, NYU dataset use joint positions [43] whereas ICVL dataset [41] use jointpositions. Method \ Error(mm) 3D Joint Loc. 3D Vertex Loc.DeepModel [55] 11.36 –HandScales [14] 9.67 –DeepHPS [
Ours ] Table 1:
Quantitative Evaluation on SynHand5M : Weshow the 3D joint and vertex locations errors(mm). Ourmethod additionally outputs mesh vertices and outperformsmodel based learning methods [55, 14].Methods 3D Joint Location ErrorDeepPrior [22] 20.75mmDeepPrior-Refine [22] 19.72mmCrossing Nets [44] 15.5mmFeedback [23] 15.9mmDeepModel [55] 17.0mmLie-X [48] 14.5mmDeepHPS:NYU [Ours] 15.8mmDeepHPS:fine-tuned [
Ours ] Table 2:
Quantitative comparison on NYU [43]: Our fine-tuned DeepHPS model on the NYU dataset shows the state-of-the-art performance among hybrid methods.Methods 3D Joint Location ErrorLRF [41] 12.57mmDeepModel [55] 11.56mmCrossing Nets [44] 10.2mmDeepHPS:ICVL [Ours] 10.5mmDeepHPS:fine-tuned [
Ours ] Table 3:
Quantitative comparison on ICVL [41]: TheDeepHPS model fine-tuned on the ICVL dataset outper-forms the state-of-the-art hybrid methods.Our DeepHPS algorithm is trained on NYU and ICVLindividually, called DeepHPS:NYU and DeepHPS:ICVLmodels. Then, we fine-tune the pre-trained DeepHPS(on SynHand5M) with the NYU and ICVL, we callDeepHPS:fine-tuned models. The 3D joint location errorsof the trained models are calculated on
NYU and
ICVL test images respectively. The quantitative results areshown in Figure 4 and Tables 2 and 3. DeepHPS:fine-tunedmodels achieve an error improvement of . and . over DeepHPS:ICVL and DeepHPS:NYU models respec-tively.On the ICVL and NYU datasets, we achieve improve-ment in the joint location accuracy over the state-of-the-arthybrid methods. Failure case : Our framework works well in case of missingdepth information and occlusions. However, under severeocclusions and a lot of missing depth information, it mayfail to detect the correct pose and shape; see Figure 7.igure 6:
Synthetic hand pose and shape recovery: We show example estimated hand poses overlaid with the preprocessed depth images from our SynHand5M. We show the reconstructed surface from two different views (yellow) and the ground-truth surface (gray). 3D View 2 is similar to the ground-truth view. Our algorithm infers the correct 3D pose and shape even in very challenging conditions, such as occlusion of several fingers and large variations in viewpoint.

(a) (b)

Figure 7: Failure cases: (a) incorrect pose due to highly occluded hand parts; (b) incorrect pose and shape due to significant missing depth information.
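The masked loss of Eq. (8) used for joint real/synthetic training can be sketched as follows. This is a minimal NumPy illustration under our own naming (it is not the paper's actual implementation): the vertex term L_V is zeroed for real samples exactly as the indicator 𝟙 prescribes, so real images supervise only the joint term L_J.

```python
import numpy as np

def mixed_batch_loss(pred_joints, gt_joints, pred_verts, gt_verts, is_synthetic):
    """Per-batch loss L = L_J + 1[synthetic] * L_V (cf. Eq. 8).

    pred_joints, gt_joints: (B, J, 3) 3D joint positions.
    pred_verts, gt_verts:   (B, V, 3) mesh vertices (gt_verts is only
                            meaningful where is_synthetic == 1).
    is_synthetic:           (B,) indicator, 1 for synthetic, 0 for real.
    """
    # Mean squared 3D error over joints, per sample -> shape (B,)
    l_j = np.mean(np.sum((pred_joints - gt_joints) ** 2, axis=-1), axis=-1)
    # Mean squared 3D error over vertices, per sample -> shape (B,)
    l_v = np.mean(np.sum((pred_verts - gt_verts) ** 2, axis=-1), axis=-1)
    # Mask out the vertex term for real samples (no mesh ground truth)
    mask = is_synthetic.astype(np.float64)
    return float(np.mean(l_j + mask * l_v))
```

In a deep-learning framework the same masking makes the gradient of the surface branch vanish for real samples, which is what "backpropagation from the surface reconstruction part is disabled" amounts to.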
7. Conclusion and Future Work
In this work, we demonstrate the simultaneous recovery of hand pose and surface shape from a single depth image. For training, we synthetically generate a large-scale dataset with accurate joint positions, segmentation masks and hand meshes for the depth images. Our dataset will be a valuable addition for training and testing CNN-based models for 3D hand pose and shape analysis; furthermore, it improves the recognition rate of CNN models on hand pose datasets. In our algorithm, intermediate parametric representations are estimated by a CNN architecture. Then, a novel hand pose and shape layer embedded inside the deep network produces the 3D hand joint positions and the surface shape. Experiments show improved accuracy over the state-of-the-art hybrid methods. Furthermore, we demonstrate plausible results for the recovery of the hand surface on real images. Improving the performance of CNN-based hybrid methods is a promising research direction; these methods bear a lot of potential due to their inherent stability and scalability. In the future, we wish to extend our dataset with wider viewpoint coverage, object interactions and RGB images. Another aspect for future work is predicting fine-scale 3D surface detail on the hand, where real-world statistical hand models [30] possibly give better priors.
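To illustrate the kind of computation a differentiable pose-and-shape layer performs, forward kinematics turns estimated joint angles and (scaled) bone lengths into joint positions. The toy sketch below handles a single planar finger chain; the names and the 2D simplification are ours, and the paper's layer is considerably more general (full 3D kinematics plus mesh deformation).

```python
import numpy as np

def finger_chain_joints(bone_lengths, flexion_angles):
    """Accumulate joint positions along one planar finger chain.

    Each bone's flexion angle is expressed relative to its parent bone,
    so rotations compose as we walk down the chain.
    Returns an array of shape (num_bones + 1, 2), root at the origin.
    """
    joints = [np.zeros(2)]
    heading = 0.0
    for length, angle in zip(bone_lengths, flexion_angles):
        heading += angle  # compose this bone's rotation with its parent's
        step = length * np.array([np.cos(heading), np.sin(heading)])
        joints.append(joints[-1] + step)
    return np.stack(joints)
```

Because the bone lengths enter linearly, per-bone scale parameters (as estimated by our network) simply rescale the skeleton, which is why pose and bone scales can be predicted as separate parametric representations.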
Acknowledgements
This work was partially funded by NUST, Pakistan, and by the Federal Ministry of Education and Research of the Federal Republic of Germany as part of the research projects DYNAMICS (Grant number 01IW15003) and VIDETE (Grant number 01IW18002).

References

arXiv preprint arXiv:1708.03416, 2017.
[3] Creative. Senz3d interactive gesture camera. https://us.creative.com/p/web-cameras/creative-senz3d, March 2018.
[4] X. Deng, S. Yang, Y. Zhang, P. Tan, L. Chang, and H. Wang. Hand3D: Hand pose estimation using 3D neural network. arXiv preprint arXiv:1704.02224, 2017.
[5] E. Dibra, T. Wolf, C. Oztireli, and M. Gross. How to refine 3D hand pose estimation from unlabelled depth data? In 3DV, 2017.
[6] L. Ge, H. Liang, J. Yuan, and D. Thalmann. Robust 3D hand pose estimation in single depth images: from single-view CNN to multi-view CNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3593–3601, 2016.
[7] L. Ge, H. Liang, J. Yuan, and D. Thalmann. 3D convolutional neural networks for efficient and robust hand pose estimation from single depth images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[8] H. Guo, G. Wang, and X. Chen. Two-stream convolutional neural network for accurate RGB-D fingertip detection using depth and edge information. In Image Processing (ICIP), 2016 IEEE International Conference on, pages 2608–2612. IEEE, 2016.
[9] H. Guo, G. Wang, X. Chen, C. Zhang, F. Qiao, and H. Yang. Region ensemble network: Improving convolutional network for hand pose estimation. In ICIP, 2017.
[10] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 675–678. ACM, 2014.
[11] J. P. Lewis, K. Anjyo, T. Rhee, M. Zhang, F. H. Pighin, and Z. Deng. Practice and theory of blendshape facial models.
[12] J. P. Lewis, M. Cordner, and N. Fong. Pose space deformation: a unified approach to shape interpolation and skeleton-driven deformation. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, pages 165–172. ACM Press/Addison-Wesley Publishing Co., 2000.
[13] P. Li, H. Ling, X. Li, and C. Liao. 3D hand pose estimation using randomized decision forest with segmentation index points. In Proceedings of the IEEE International Conference on Computer Vision, pages 819–827, 2015.
[14] J. Malik, A. Elhayek, and D. Stricker. Simultaneous hand pose and skeleton bone-lengths estimation from a single depth image. In 3DV.
https://dined.io.tudelft.nl/, 2004.
[17] G. Moon, J. Y. Chang, and K. M. Lee. V2V-PoseNet: Voxel-to-voxel prediction network for accurate 3D hand and human pose estimation from a single depth map. arXiv preprint arXiv:1711.07399, 2017.
[18] F. Mueller, F. Bernard, O. Sotnychenko, D. Mehta, S. Sridhar, D. Casas, and C. Theobalt. GANerated hands for real-time 3D hand tracking from monocular RGB. arXiv preprint arXiv:1712.01057, 2017.
[19] F. Mueller, D. Mehta, O. Sotnychenko, S. Sridhar, D. Casas, and C. Theobalt. Real-time hand tracking under occlusion from an egocentric RGB-D sensor. pages 1163–1172, 2017.
[20] F. Mueller, D. Mehta, O. Sotnychenko, S. Sridhar, D. Casas, and C. Theobalt. Real-time hand tracking under occlusion from an egocentric RGB-D sensor. In Proceedings of the International Conference on Computer Vision (ICCV), volume 10, 2017.
[21] M. Oberweger and V. Lepetit. DeepPrior++: Improving fast and accurate 3D hand pose estimation. In ICCV Workshop, volume 840, page 2, 2017.
[22] M. Oberweger, P. Wohlhart, and V. Lepetit. Hands deep in deep learning for hand pose estimation. In CVWW, 2015.
[23] M. Oberweger, P. Wohlhart, and V. Lepetit. Training a feedback loop for hand pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 3316–3324, 2015.
[24] I. Oikonomidis, N. Kyriazis, and A. A. Argyros. Efficient model-based 3D tracking of hand articulations using Kinect. In BMVC, volume 1, page 3, 2011.
[25] P. Panteleris, I. Oikonomidis, and A. Argyros. Using a single RGB frame for real time 3D hand pose estimation in the wild. arXiv preprint arXiv:1712.03866, 2017.
[26] C. Qian, X. Sun, Y. Wei, X. Tang, and J. Sun. Realtime and robust hand tracking from depth. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1106–1113, 2014.
[27] M. Rad, M. Oberweger, and V. Lepetit. Feature mapping for learning fast and accurate 3D pose inference from synthetic images. arXiv preprint arXiv:1712.03904, 2017.
[28] K. Robinette, H. Daanen, and E. Paquet. The CAESAR project: a 3-D surface anthropometry survey. pages 380–386. IEEE Comput. Soc, 1999.
[29] K. Roditakis, A. Makris, and A. Antonis. Generative 3D hand tracking with spatially constrained pose sampling. In BMVC. IEEE, 2017.
[30] J. Romero, D. Tzionas, and M. J. Black. Embodied hands: Modeling and capturing hands and bodies together. ACM Transactions on Graphics (Proc. SIGGRAPH Asia), 36(6):245:1–245:17, Nov. 2017.
[31] T. Sharp, C. Keskin, D. Robertson, J. Taylor, J. Shotton, D. Kim, C. Rhemann, I. Leichter, A. Vinnikov, Y. Wei, et al. Accurate, robust, and flexible real-time hand tracking. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pages 3633–3642. ACM, 2015.
[32] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from simulated and unsupervised images through adversarial training. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 3, page 6, 2017.
[33] T. Simon, H. Joo, I. Matthews, and Y. Sheikh. Hand keypoint detection in single images using multiview bootstrapping. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, 2017.
[34] A. Sinha, C. Choi, and K. Ramani. DeepHand: Robust hand pose estimation by completing a matrix imputed with deep features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4150–4158, 2016.
[35] A. Sinha, A. Unmesh, Q. Huang, and K. Ramani. SurfNet: Generating 3D shape surfaces using deep residual networks. In Proc. CVPR, 2017.
[36] S. Sridhar, F. Mueller, M. Zollhöfer, D. Casas, A. Oulasvirta, and C. Theobalt. Real-time joint tracking of a hand manipulating an object from RGB-D input. In European Conference on Computer Vision, pages 294–310. Springer, 2016.
[37] S. Sridhar, A. Oulasvirta, and C. Theobalt. Interactive markerless articulated hand motion tracking using RGB and depth data. In Proceedings of the IEEE International Conference on Computer Vision, pages 2456–2463, 2013.
[38] X. Sun, Y. Wei, S. Liang, X. Tang, and J. Sun. Cascaded hand pose regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 824–832, 2015.
[39] J. S. Supancic, G. Rogez, Y. Yang, J. Shotton, and D. Ramanan. Depth-based hand pose estimation: data, methods, and challenges. In IEEE International Conference on Computer Vision, pages 1868–1876, 2015.
[40] A. Tagliasacchi, M. Schröder, A. Tkach, S. Bouaziz, M. Botsch, and M. Pauly. Robust articulated-ICP for real-time hand tracking. In Computer Graphics Forum, volume 34, pages 101–114. Wiley Online Library, 2015.
[41] D. Tang, H. Jin Chang, A. Tejani, and T.-K. Kim. Latent regression forest: Structured estimation of 3D articulated hand posture. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3786–3793, 2014.
[42] D. Tang, J. Taylor, P. Kohli, C. Keskin, T.-K. Kim, and J. Shotton. Opening the black box: Hierarchical sampling optimization for estimating human hand pose. In Proceedings of the IEEE International Conference on Computer Vision, pages 3325–3333, 2015.
[43] J. Tompson, M. Stein, Y. Lecun, and K. Perlin. Real-time continuous pose recovery of human hands using convolutional networks. ACM Transactions on Graphics (ToG), 33(5):169, 2014.
[44] C. Wan, T. Probst, L. Van Gool, and A. Yao. Crossing Nets: Combining GANs and VAEs with a shared latent space for hand pose estimation. IEEE, 2017.
[45] C. Wan, T. Probst, L. Van Gool, and A. Yao. Dense 3D regression for hand pose estimation. arXiv preprint arXiv:1711.08996, 2017.
[46] C. Wan, A. Yao, and L. Van Gool. Hand pose estimation from local surface normals. In European Conference on Computer Vision, pages 554–569. Springer, 2016.
[47] G. Wang, X. Chen, H. Guo, and C. Zhang. Region ensemble network: Towards good practices for deep 3D hand pose estimation. Journal of Visual Communication and Image Representation, 2018.
[48] C. Xu, L. N. Govindarajan, Y. Zhang, and L. Cheng. Lie-X: Depth image based articulated object pose estimation, tracking, and action recognition on Lie groups. International Journal of Computer Vision, pages 1–25, 2017.
[49] C. Xu, A. Nanjappa, X. Zhang, and L. Cheng. Estimate hand poses efficiently from single depth images. International Journal of Computer Vision, 116(1):21–45, 2016.
[50] Q. Ye and T.-K. Kim. Occlusion-aware hand pose estimation using hierarchical mixture density network. arXiv preprint arXiv:1711.10872, 2017.
[51] Q. Ye, S. Yuan, and T.-K. Kim. Spatial attention deep net with partial PSO for hierarchical hybrid hand pose estimation. In European Conference on Computer Vision, pages 346–361. Springer, 2016.
[52] S. Yuan, G. Garcia-Hernando, B. Stenger, G. Moon, J. Y. Chang, K. M. Lee, P. Molchanov, J. Kautz, S. Honari, L. Ge, et al. Depth-based 3D hand pose estimation: From current achievements to future goals. In IEEE CVPR, 2018.
[53] S. Yuan, Q. Ye, B. Stenger, S. Jain, and T.-K. Kim. BigHand2.2M benchmark: Hand pose dataset and state of the art analysis. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 2605–2613. IEEE, 2017.
[54] Y. Zhang, C. Xu, and L. Cheng. Learning to search on manifolds for 3D pose estimation of articulated objects. arXiv preprint arXiv:1612.00596, 2016.
[55] X. Zhou, Q. Wan, W. Zhang, X. Xue, and Y. Wei. Model-based deep hand pose estimation. In IJCAI, 2016.
[56] C. Zimmermann and T. Brox. Learning to estimate 3D hand pose from single RGB images. In