MVHM: A Large-Scale Multi-View Hand Mesh Benchmark for Accurate 3D Hand Pose Estimation
Liangjian Chen, Shih-Yao Lin, Yusheng Xie, Yen-Yu Lin, Xiaohui Xie
University of California, Irvine; Tencent America; Amazon*; National Chiao Tung University
{liangjc2,xhx}@ics.uci.edu, [email protected], [email protected], [email protected]

* Work done outside of Amazon.

Abstract
Estimating 3D hand poses from a single RGB image is challenging because depth ambiguity renders the problem ill-posed. Training hand pose estimators with 3D hand mesh annotations and multi-view images often results in significant performance gains. However, existing multi-view datasets are relatively small, with hand joints annotated by off-the-shelf trackers or automated through model predictions, both of which may be inaccurate and can introduce biases. Collecting large-scale multi-view 3D hand pose images with accurate mesh and joint annotations is valuable but strenuous. In this paper, we design a spin-matching algorithm that enables a rigged mesh model to match any target mesh ground truth. Based on this matching algorithm, we propose an efficient pipeline to generate a large-scale multi-view hand mesh (MVHM) dataset with accurate 3D hand mesh and joint labels. We further present a multi-view hand pose estimation approach to verify that training a hand pose estimator with our generated dataset greatly enhances its performance. Experimental results show that our approach achieves 0.990 in AUC on the MHP dataset, compared to the previous state of the art of 0.939 on this dataset. Our dataset is available at https://github.com/Kuzphi/MVHM.
1. Introduction
Estimating 3D hand poses from images has attracted increasing attention because it is essential to a wide range of applications such as human-computer interaction (HCI) [1, 27, 26], virtual reality (VR) [12, 18], augmented reality (AR) [7], medical diagnosis [20], and sign language understanding [43]. Although extensive research efforts have been made on this topic for decades, several challenges remain unsolved. One of the most crucial is the depth ambiguity present in single-view 3D hand pose estimation.
Figure 1. Our core idea. We build a synthetic dataset from a multi-view perspective, e.g., rendering hand images from different angles. With the aid of this dataset, a single-view method takes the image from each view and generates a possible hand pose candidate. Our multi-view method then takes the different single-view predictions as input and predicts the final result.

Conventional studies mainly focus on inferring 3D hand poses directly from either depth or RGB images. To address the problems caused by depth ambiguity, some previous studies [5, 19, 8, 9] leverage depth-map information. These works introduce depth maps into the training procedure in several ways, such as using depth maps as intermediate supervision [19] or adopting a depth regularizer [5]. On the other hand, recent studies [22, 50] point out that imposing 3D hand shape (mesh) supervision can boost the performance of both 3D hand pose and shape estimators. It is clear that the 3D hand shape carries richer hand structure information than hand keypoints alone. Furthermore, a preset mesh serves as a strong prior that reduces the degrees of freedom of the hand, thereby mitigating depth ambiguity. Along this line, several methods such as [3, 2, 45, 25, 40, 39, 48, 47, 46] have been proposed. Despite their potential, the aforementioned methods rely heavily on a preset hand model learned from a large number of accurate 3D mesh annotations. Hence, a large-scale dataset with accurate mesh vertex annotations is in great demand.

Accurate mesh ground truth is hard to annotate manually in general. The hand mesh annotations in most existing datasets are produced by hand shape estimators, which can be inaccurate because hand mesh estimation itself is an even more challenging task. Moreover, most existing methods that leverage mesh information for 3D hand pose estimation work from a single view. Mesh information alone is insufficient to resolve depth ambiguity, so 3D pose estimation remains ill-posed in these methods.

The issue of depth ambiguity can be tackled by multi-view vision according to epipolar geometry. Multi-view sensing systems capture hand images from cameras at different angles, so the depth of the hand can be accurately inferred as long as the camera parameters are known. Inspired by the above observations, we aim to build a large-scale multi-view hand mesh dataset that provides hand meshes and multi-view hand images simultaneously for training pose estimators.

In this work, we present an effective mechanism to synthesize 3D hand joint and mesh annotations and establish a large-scale multi-view hand mesh (MVHM) dataset. We acquire a hand mesh model with a rigging system, take 3D hand ground truth from existing datasets, and rig the hand model to match the given ground truth so that it performs various gestures. We render the hand model from different angles to collect multi-view images, together with the 3D keypoint and mesh annotations, to build MVHM. We then examine whether the generated MVHM dataset can be used to improve 3D hand pose estimators. To this end, a multi-view approach for inferring 3D hand poses is developed. The experimental results show that the resulting pose estimator is greatly boosted by leveraging the generated MVHM dataset and performs favorably against existing methods. This work makes three major contributions, summarized as follows:
1. We propose an effective mechanism for compiling a large-scale multi-view hand mesh (MVHM) dataset for 3D hand pose estimator training.
To the best of our knowledge, this is the first large-scale hand dataset with multi-view hand images, accurate mesh annotations, and hand joint keypoint labels.
2. We present a multi-view hand pose estimation approach based on an end-to-end trainable graph convolutional neural network, where information from multi-view images is utilized to predict 3D hand poses.
3. Our proposed approach achieves state-of-the-art performance on the benchmark MHP dataset in both single-view and multi-view settings.
2. Related Work
RGB cameras are much more widely used than depth sensors. Estimating 3D hand poses merely from monocular RGB images is therefore more practical, and it is an active topic in the literature [5, 10, 19, 29, 37, 41, 49, 24, 23, 11]. The pioneering work by Zimmermann and Brox [49] utilizes convolutional neural networks (CNNs) to extract image features and feeds camera parameters together with these features to a 3D lifting network, where depth information is then estimated. Based on [49], Iqbal et al. [19] leverage depth maps as intermediate supervision. Meanwhile, Cai et al. [5] propose a weakly supervised approach that reconstructs the depth map and uses it as a regularizer during model training.
3D hand pose estimation provides sparse joint locations. However, many computer vision applications would benefit more from hand shape information than from sparse joints. Therefore, 3D hand mesh estimation, an effective shape representation, has become an increasingly popular topic [16, 3, 2, 21, 45]. Most methods [3, 2, 45, 25, 40] are built around a pre-defined deformable hand mesh model called MANO [32]. Because of the high degrees of freedom and the complexity of hand gestures, searching for the right hand mesh in such a high-dimensional space is quite challenging. Methods using the MANO model therefore often rely on strong priors that constrain the model to regress only low-dimensional model parameters, and may ignore high-dimensional information. Ge et al. [16] argue that a mesh is graph-structured data and propose to directly regress the 3D mesh vertices through a graph convolutional neural network (GCN) defined on a pre-defined mesh graph.
Unlike single-view pose estimation, few research efforts focus on 3D hand pose estimation from multi-view data. Ge et al. [15] first introduce multi-view CNNs and formulate hand pose estimation as a multi-view estimation problem. Their method assumes that hand joint locations independently follow 3D Gaussian distributions and uses CNNs to estimate the mean and variance of the location distribution of each joint. The main drawbacks of their method are 1) its inability to be trained in an end-to-end manner and 2) its impractical assumption of independence among different joint locations. Simon et al. [33] propose a multi-view system that is trained to progressively improve hand keypoint detection. Their method works well for fine-tuning a well pre-trained estimator but cannot train a 3D hand pose estimator from scratch.
There exist extensive research efforts such as [36, 38, 35, 42, 44, 49, 34, 30, 28, 25, 50] on building hand datasets for 3D hand pose estimation. We summarize the publicly available hand datasets and our dataset in Table 1.

Table 1. Comparison between our dataset and publicly available datasets. "Auto" in the Annotation column means the annotation is made by some algorithm and therefore may not be accurate. "MANO" in the Mesh column means the mesh annotation is fitted by the MANO model.

| Dataset | RGB | Depth | Image Type | Resolution | Annotation | Dataset Size | Multi-View | Mesh |
|---|---|---|---|---|---|---|---|---|
| ICVL [36] | ✗ | ✓ | real | 320 × 320 | tracking | 18K | ✗ | ✗ |
| NYU [38] | ✗ | ✓ | real | 648 × 480 | tracking | 243K | ✗ | ✗ |
| MSRA [35] | ✗ | ✓ | real | 1920 × | | | ✗ | ✗ |
| BigHand2.2M [42] | ✗ | ✓ | real | 640 × 480 | marker | 2.2M | ✗ | ✗ |
| STB [44] | ✓ | ✓ | real | 640 × 480 | manual | 36K | ✗ | ✗ |
| RHP [49] | ✓ | ✓ | synthetic | 640 × 480 | synthetic | 44K | ✗ | ✗ |
| Dexter+Object [34] | ✓ | ✓ | real | 640 × 480 | manual | 3K | ✗ | ✗ |
| EgoDexter [30] | ✓ | ✓ | real | 640 × 480 | manual | 3K | ✗ | ✗ |
| MHP [17] | ✓ | ✗ | real | 480 × 480 | auto | 80K | ✓ | ✗ |
| FreiHand [50] | ✓ | ✗ | real | 224 × 224 | auto | 134K | ✗ | MANO |
| InterHand [28] | ✓ | ✗ | real | 512 × 334 | auto | 2.2M | ✓ | MANO |
| Youtube Hand [25] | ✓ | ✗ | real | 256 × 256 | auto | 47K | ✗ | MANO |
| Ours | ✓ | ✓ | synthetic | 256 × 256 | synthetic | 320K | ✓ | ✓ |
Most existing datasets do not contain mesh information, since labeling hand meshes manually is almost infeasible for human annotators. To address the issue of labor-intensive annotation, recent studies [50, 25, 9] propose semi-automatic ways to label RGB images. FreiHand (Zimmermann et al. [50]) uses an iterative process in which trained models first make predictions on the images and annotators are then asked to make the necessary adjustments. YoutubeHand (Kulon et al. [25]) runs OpenPose [6] to obtain 2D annotations, upon which the parameters of the MANO model are regressed. Thresholding on confidence scores is applied to remove samples with low confidence and hence ensure annotation quality. Despite this progress in the efficiency and efficacy of labeling RGB images, the accuracy of the annotations still relies heavily on the pre-trained models used in the process. In addition, these methods rely on the MANO model as the ground-truth mesh generator, which can lose high-dimensional information about hands, as discussed in Section 2. Compared to existing datasets, our dataset consists of large-scale RGB images and includes a variety of sequences. In addition, its synthetic nature provides 100% accurate annotations for both hand joints and meshes. We make the first attempt to collect a dataset that provides large-scale, multi-view training images, thereby enhancing pose estimator training with a multi-view perspective.
3. Generating the Multi-View Dataset
Currently, no existing dataset provides both large-scale mesh and multi-view annotations of 3D hands, although many potential applications could benefit from such a dataset. Therefore, we create a new dataset, the Multi-View Hand Mesh (MVHM) dataset, for training multi-view hand pose estimators. To obtain accurate mesh and joint annotations, we use a well-made hand model from TurboSquid, which provides around 2,000 mesh vertices as well as an armature system for forming various hand gestures. We render the images in the open-source software Blender.

Figure 2. An example of the hand joint and bone labels and their orders adopted in this work. (a) Joint labels. (b) Bone labels, which are used by Algorithm 1 during spin matching. (c) A failure case when directly rigging the hand mesh based on the bone coordinates without using Algorithm 1.

In order to generate images and meshes for different gestures, we deploy the NYU dataset [38], which provides various hand poses and accurate keypoint annotations, and rig the hand bones in our model to match the given ground truth. Each hand bone in the rigged system has seven degrees of freedom: three for its bone head, three for its bone tail, and one for its spin. The first six degrees of freedom determine the location of the bone and the last one represents its orientation. In order to rig the hand joints and the mesh surface correctly, we need to consider both location and orientation. Figure 2(c) shows an example of a distorted mesh obtained when we do not match the orientation but simply move the bone location.

For this purpose, we define a bone vector as the difference between the bone tail and head. Let u be the bone vector of the current bone. We take its adjacent bone's vector as the spin reference ref. We define the spin sign vector as u × ref × u and make sure this vector does not change after matching the bone with the ground truth. The detailed procedure is summarized in Algorithm 1.

Figure 3. Some examples of the MVHM dataset. Columns 1 to 8 show the images with 2D annotations from view 1 to view 8, Column 9 shows the mesh, and Column 10 shows the 3D annotation.

For each ground-truth gesture, we set 8 different camera positions that are evenly located on a circle within the plane perpendicular to the palm. All 8 cameras point to the center of the palm to ensure that the hand is located at the center of each rendered image. Figure 4 shows the scene in which we render hand images.

To increase the diversity of the collected MVHM, we randomly change the intensity of the light and the global illumination in Blender. In addition, we select background scenes from online sources and randomly use them as backgrounds during rendering. We render 320K images of resolution 256 × 256 for MVHM construction.

We emphasize that each sample in MVHM comes with full annotations of 21 hand joints and 2,651 mesh vertices. Following the setting in [49], each finger is fully represented by 4 keypoints (the metacarpophalangeal, proximal interphalangeal, and distal interphalangeal joints and the fingertip). Additionally, carpometacarpal joints are also labeled in MVHM. Figure 2(a) shows a sample of the hand joint labels. We also release the hand segmentation mask, the camera intrinsic matrix, and the optical flow for each sample. Nevertheless, in this paper we only use the multi-view and mesh information from the collected MVHM dataset for hand pose estimator training.
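To make the spin-matching step concrete, below is a minimal NumPy sketch (not the authors' code) of the spin sign vector u × ref × u and the spin angle used in Algorithm 1; the function names and the toy vectors are our own illustration.

```python
import numpy as np

def spin_sign(u, ref):
    """Spin sign vector e = u x ref x u used by the spin-matching step.

    u   : bone vector (tail - head)
    ref : adjacent bone's vector, used as the spin reference
    """
    return np.cross(np.cross(u, ref), u)

def spin_angle(u, ref_before, v, ref_after):
    """Angle (radians) to spin a bone so that its spin sign vector is
    preserved after the bone is moved to match the ground-truth keypoints.

    u, ref_before : bone and reference vectors in the rig's rest pose
    v, ref_after  : the same vectors computed from the ground-truth keypoints
    """
    e1 = spin_sign(u, ref_before)
    e2 = spin_sign(v, ref_after)
    cos = np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2) + 1e-9)
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

# toy example: bone along +x; the reference rotates from +y to +z,
# so the bone must be spun by roughly pi/2 to keep its sign vector aligned
u = np.array([1.0, 0.0, 0.0])
print(spin_angle(u, np.array([0.0, 1.0, 0.0]), u, np.array([0.0, 0.0, 1.0])))
```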
4. Methodology
Given an RGB image of a hand I ∈ R^{W×H×3}, our goal is to estimate the 3D joint locations of the hand P_j ∈ R^{K×3}, where W and H denote the image width and height, respectively, and K is the number of hand joints. Recent studies [16, 3] have demonstrated that using a mesh distance loss as intermediate supervision during training can boost the performance of the learned hand pose estimator.

Figure 4. Synthetic scene when rendering hands in MVHM.

Inspired by the approach in [16], we define a hand mesh as a bidirectional graph G(V, Λ), where V is the vertex set and Λ is the adjacency matrix. We assume that V contains N elements (i.e., points on the mesh), and our mesh estimator predicts the 3D locations P_m ∈ R^{N×3} of all vertices in V.

In our single-view approach, we use the stacked hourglass network [31] as the CNN backbone to extract hand features from an image. A graph convolutional network (GCN) is then applied to estimate the 3D pose and mesh. Figure 5 shows the architecture of our single-view network, which consists of three major components: the 2D evidence network, the mesh evidence network, and the 3D pose estimator. These components are elaborated below.

The 2D evidence network offers two main functionalities. First, it estimates hand keypoint heatmaps to obtain the 2D hand joint locations. Second, it extracts image features that serve as the input to the mesh evidence network. We denote the estimated heatmaps as H ∈ R^{K×H×W}. As shown in Figure 5, the hourglass backbone gives two outputs: the estimated hand joint heatmaps and the extracted image features.
Algorithm 1: Spin-matching algorithm for rigging the hand mesh based on 3D hand pose ground truth.
Input: C, the array of ground-truth 3D keypoints that the mesh should match; B, the array of hand bones in the rig system, where each bone has two attributes, head and tail, representing the beginning and end locations of the bone. B and C are stored in arrays whose orders are shown in Figure 2(a) and 2(b).
begin
  Move B[0].tail to location C[0];
  // spin the bones inside the palm
  for each palm bone i do
    // bone vectors
    u ← B[i].tail − B[i].head;  v ← C[i] − C[0];
    adj ← i + 4 (for the last palm bone, use the previous palm bone instead);
    // reference vectors
    ref_ori ← B[adj].tail − B[adj].head;  ref_aft ← C[adj] − C[0];
    Move B[i] to match the ground truth;
    // sign vectors
    e1 ← u × ref_ori × u;  e2 ← v × ref_aft × v;
    Spin B[i] by the angle between e1 and e2;
  end
  for each remaining bone i, in order, do
    if i is not a palm bone then
      Move B[i] to match the ground truth;
      B[i] performs the same spin as B[i−1];
    end
  end
end

The ground-truth heatmaps H̄ are obtained by smoothing each keypoint location with a Gaussian blur. To train the 2D evidence network, we apply the heatmap loss L_h to each hourglass block as supervision. The heatmap loss is defined by

L_h = \frac{1}{S \cdot K} \sum_{s=1}^{S} \sum_{k=1}^{K} \| H^{s}_{k} - \bar{H}_{k} \|_F ,   (1)

where S and K denote the number of hourglass blocks and keypoints, respectively.

Figure 5. Overview of our single-view method. Given a single-view RGB image, the 2D evidence network predicts its heatmaps and outputs the encoded image features. The mesh evidence network takes the image features as input and outputs the hand mesh. Based on the estimated mesh, the 3D pose estimator gives the final hand pose prediction.
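As an illustration of Eq. (1), the following sketch in PyTorch builds Gaussian-smoothed ground-truth heatmaps and evaluates the stacked heatmap loss; the tensor shapes, sigma value, and function names are assumptions made for this example, not the paper's implementation.

```python
import torch

def gaussian_heatmaps(keypoints_2d, height, width, sigma=2.0):
    """Ground-truth heatmaps: one Gaussian blob per 2D keypoint.
    keypoints_2d: (K, 2) tensor of (x, y) pixel coordinates.
    Returns a (K, H, W) tensor."""
    ys = torch.arange(height).view(1, height, 1).float()
    xs = torch.arange(width).view(1, 1, width).float()
    x0 = keypoints_2d[:, 0].view(-1, 1, 1)
    y0 = keypoints_2d[:, 1].view(-1, 1, 1)
    return torch.exp(-((xs - x0) ** 2 + (ys - y0) ** 2) / (2 * sigma ** 2))

def heatmap_loss(pred_heatmaps, gt_heatmaps):
    """Eq. (1): average Frobenius-norm error over S hourglass blocks and K joints.
    pred_heatmaps: (S, K, H, W) predictions from the S hourglass blocks.
    gt_heatmaps:   (K, H, W) Gaussian-smoothed ground truth."""
    S, K = pred_heatmaps.shape[:2]
    diff = pred_heatmaps - gt_heatmaps.unsqueeze(0)          # broadcast over the S blocks
    per_map = torch.linalg.norm(diff.flatten(2), dim=2)      # Frobenius norm for each (s, k)
    return per_map.sum() / (S * K)

# toy usage with two joints, two hourglass blocks, and 64x64 heatmaps
kp = torch.tensor([[32.0, 40.0], [10.0, 12.0]])
gt = gaussian_heatmaps(kp, 64, 64)
pred = gt.unsqueeze(0).repeat(2, 1, 1, 1) + 0.01 * torch.randn(2, 2, 64, 64)
print(heatmap_loss(pred, gt).item())
```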
Our mesh evidence network is built on a spectral GCN [4]. Given the image features extracted by the 2D evidence network, the mesh evidence network estimates the 3D hand mesh. A 3D hand mesh is represented by a set of vertex coordinates P_m ∈ R^{N×3}, where N is the number of vertices in the hand mesh. We represent the hand mesh as a graph G(V, Λ), where V is the vertex set and Λ is the adjacency matrix: Λ_{i,j} is 1 if there is an edge between vertex i and vertex j, and 0 otherwise.

Specifically, we first normalize the adjacency matrix Λ via the graph Laplacian operation and obtain the normalized matrix L = I − D^{-1/2} Λ D^{-1/2}, where D is the degree matrix of graph G and I is the identity matrix. Graph spectral decomposition is then used to decompose L as U A U^T, where A = diag(λ_1, λ_2, ..., λ_N) consists of the eigenvalues of L, and λ_max denotes the largest eigenvalue of L.

Following [13], we define the convolution kernel Â of the GCN as

\hat{A} = \mathrm{diag}\Big( \sum_{i=0}^{S} \alpha_i \lambda_1^{i}, \; \ldots, \; \sum_{i=0}^{S} \alpha_i \lambda_N^{i} \Big),   (2)

where the α_i are kernel parameters and S is a pre-set hyper-parameter that controls the receptive field. The GCN convolution operation is then defined by

F' = U \hat{A} U^{T} F \theta = \sum_{i=0}^{S} \alpha_i L^{i} F \theta_i ,   (3)

where F ∈ R^{N×F_in} and F' ∈ R^{N×F_out} denote the input and output features, respectively, and θ_i ∈ R^{F_in×F_out} are trainable parameters that refine the input features and control the output channel size.

Since our hand mesh surface is composed of 2,651 vertices, applying the above operation to every vertex incurs a huge computational cost because of the dense matrix multiplications required to compute L^i. Therefore, we utilize the Chebyshev polynomial approximation to reduce the complexity. The convolution operation becomes

F' = \sum_{i=0}^{S} \alpha_i T_i(\hat{L}) \theta_i ,   (4)

where T_k(x) is the k-th Chebyshev polynomial and \hat{L} = 2L / \lambda_{max} − I is used to normalize the input.

To enable our model to learn both local and global features, we adopt the coarse-to-fine scheme used in [13, 16] for generating hand meshes. We leverage the heavy-edge matching algorithm to coarsen the graph into three coarsening levels and record the mapping between graph nodes in every two consecutive levels. In the forward pass, our model first constructs the coarsest hand mesh and then up-samples nodes from the coarse graph to the fine graph based on the stored mappings.

At the last layer of the GCN, we set F_out to 3 to represent the 3D vertex coordinates. We apply the following loss between the ground-truth mesh and the predicted mesh as the mesh loss function:

L_m = \frac{1}{N} \| P_m - \bar{P}_m \|_F .   (5)

The proposed 3D pose estimator infers the depths of the 3D hand keypoints P_d from the hand mesh P_m predicted by the mesh evidence network. Taking P_m as input, we adopt a two-layer GCN with a structure similar to that of the mesh evidence network to predict pose features. These pose features are then fed to two fully connected layers to regress the depths of the 3D hand keypoint locations.
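The sketch below shows one way to realize the Chebyshev approximation of Eq. (4) as a graph convolution layer in PyTorch; the module name ChebConv, the weight initialization, and the toy chain graph are illustrative assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn

class ChebConv(nn.Module):
    """Minimal Chebyshev graph convolution in the spirit of Eq. (4)."""
    def __init__(self, in_features, out_features, order):
        super().__init__()
        # one weight matrix theta_i per Chebyshev order
        self.theta = nn.Parameter(torch.randn(order + 1, in_features, out_features) * 0.01)

    def forward(self, x, lap_hat):
        # x: (N, F_in) node features; lap_hat: (N, N) scaled Laplacian 2L/lambda_max - I
        terms = [x, lap_hat @ x]                                   # T_0(L)x and T_1(L)x
        for _ in range(2, self.theta.shape[0]):
            terms.append(2 * lap_hat @ terms[-1] - terms[-2])      # Chebyshev recurrence
        return sum(t @ w for t, w in zip(terms, self.theta))

# toy usage: a 4-node chain graph with 3-d input features and 8-d output features
adj = torch.tensor([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=torch.float32)
deg = torch.diag(adj.sum(1).pow(-0.5))
lap = torch.eye(4) - deg @ adj @ deg                               # normalized graph Laplacian
lap_hat = 2 * lap / torch.linalg.eigvalsh(lap).max() - torch.eye(4)
layer = ChebConv(in_features=3, out_features=8, order=2)
print(layer(torch.randn(4, 3), lap_hat).shape)                     # torch.Size([4, 8])
```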
The corresponding loss is defined by

L_d = \frac{1}{K} \| D - \bar{D} \|_F ,   (6)

where D ∈ R^K and D̄ ∈ R^K represent the predicted and ground-truth joint depths, respectively.

To infer the 3D hand keypoints, we use non-maximum suppression to obtain the 2D coordinates from the estimated heatmaps. With the estimated 2D coordinates and the depths predicted by the 3D pose estimator, we obtain the 3D coordinates in the camera coordinate system. Since the camera parameters are known, we can then infer the hand keypoints in the world coordinate system.

Figure 6. Overview of the multi-view method. The single-view method first predicts the hand pose for each view independently. A graph U-Net takes the concatenation of these single-view predictions as input and estimates the final pose. N(·) and C(·) represent the number of nodes in the graph and the feature size of each node, respectively.

Based on our single-view method, we propose a simple yet effective multi-view approach to hand pose estimation. Figure 6 illustrates the core idea of our approach. The single-view method predicts the 3D hand pose for each view independently. We concatenate these view-specific predictions along their coordinate channel. The concatenated prediction serves as the input features to a graph U-Net [14], which predicts the final 3D hand keypoints. We use the following distance as the loss function of our multi-view network:

L_s = \frac{1}{K} \| P_j - \bar{P}_j \|_F ,   (7)

where P_j and P̄_j represent the predicted and ground-truth 3D joint locations, respectively.
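To illustrate the inference step described above, namely 2D keypoints read from the heatmaps plus per-joint depths lifted to camera coordinates with the intrinsic matrix, here is a small PyTorch sketch; the intrinsic values, the plain argmax standing in for non-maximum suppression, and the heatmap scale factor are assumptions for this example.

```python
import torch

def heatmap_to_2d(heatmaps):
    """Pick the peak of each (H, W) heatmap as the 2D keypoint (x, y).
    A plain argmax stands in for the non-maximum suppression used in the paper."""
    K, H, W = heatmaps.shape
    idx = heatmaps.view(K, -1).argmax(dim=1)
    xs = (idx % W).float()
    ys = torch.div(idx, W, rounding_mode="floor").float()
    return torch.stack([xs, ys], dim=1)                           # (K, 2)

def backproject(kp_2d, depth, intrinsics):
    """Lift 2D keypoints plus per-joint depths to 3D camera coordinates.
    kp_2d: (K, 2) pixel coords, depth: (K,), intrinsics: (3, 3) camera matrix."""
    ones = torch.ones(kp_2d.shape[0], 1)
    homog = torch.cat([kp_2d, ones], dim=1)                       # (K, 3) homogeneous pixels
    rays = homog @ torch.inverse(intrinsics).T                    # normalized camera rays
    return rays * depth.unsqueeze(1)                              # scale by depth -> (K, 3)

# toy usage with a hypothetical intrinsic matrix and 21 joints
K_mat = torch.tensor([[320.0, 0.0, 128.0], [0.0, 320.0, 128.0], [0.0, 0.0, 1.0]])
heatmaps = torch.rand(21, 64, 64)
kp = heatmap_to_2d(heatmaps) * 4.0                                # assume heatmaps are 1/4 image size
depth = torch.full((21,), 0.5)
print(backproject(kp, depth, K_mat).shape)                        # torch.Size([21, 3])
```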
5. Experiment Setting
We evaluate our single-view approach on two benchmark hand pose datasets: the Stereo Tracking Benchmark (STB) dataset [44] and the Multi-view 3D Hand Pose (MHP) dataset [17]. The proposed multi-view approach is evaluated on the MHP dataset. Both the MHP and STB datasets provide real hand video sequences performed by different people in various backgrounds. The hand joint annotations of the STB dataset are manually labeled, while the annotations of the MHP dataset are obtained with the Leap Motion sensor. The MVHM dataset we build is used in all of our experiments. We aim to determine whether training hand pose estimators with the MVHM dataset improves them in different experimental settings.

Figure 7. Examples of the two hand pose datasets used for evaluation. The first row shows images with the annotated hand poses from the STB dataset [44], while the second row shows those from the MHP dataset [17].

For the STB dataset, we use its SK subset, which contains 6 different hand videos, to evaluate our approach. Following the train-validation split in [16], we take the first video as the validation set, and the remaining videos serve as the training set.

The MHP dataset includes 21 different hand motion videos. Each hand motion video provides hand RGB images and multiple types of annotations for each sample, including bounding boxes and 2D/3D hand joint locations. Figure 7 displays some examples of the STB and MHP datasets. We follow [5, 49] and apply the standard data pre-processing to both the STB and MHP datasets. During pre-processing, we first crop the images to remove the irrelevant background and ensure the hands are located at the center of the images. All the cropped images are then resized to a fixed resolution. Second, we follow the mechanism used in [5] and change the hand center from the palm center to the wrist joint for data in both the STB and MHP datasets.

We follow the settings of previous research [49, 16] and adopt the average end-point error (EPE_m) and the area under the curve (AUC) of the percentage of correct keypoints (PCK) within a threshold range as the metrics for evaluating model effectiveness. We report the AUC of PCK both between 0 mm and 50 mm and between 20 mm and 50 mm (a short sketch of these metrics is given after Table 2).

Table 2. Ablation studies of 3D hand pose estimation on the MHP and STB datasets. ↑: higher is better; ↓: lower is better. The measuring unit of EPE is millimeters (mm). SV stands for the single-view method and MV for the multi-view method.

| | AUC (0–50 mm) ↑ | AUC (20–50 mm) ↑ | EPE_m ↓ |
|---|---|---|---|
| MHP Dataset | | | |
| SV w/o MVHM | 0.604 | 0.802 | 22.13 |
| SV w/ MVHM | 0.660 | 0.857 | 18.09 |
| MV w/o MVHM | 0.832 | 0.985 | 8.43 |
| MV w/ MVHM | 0.895 | 0.990 | 5.20 |
| STB Dataset | | | |
| SV w/o MVHM | 0.820 | 0.987 | 8.95 |
| SV w/ MVHM | 0.832 | 0.991 | 8.38 |
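A short NumPy sketch of the evaluation metrics used here (EPE and the AUC of the PCK curve over a threshold range) follows; the sampling of 100 thresholds and the toy data are our own choices, not specified by the paper.

```python
import numpy as np

def epe(pred, gt):
    """Mean end-point error in mm. pred, gt: (N, K, 3) arrays of 3D joints (mm)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def auc_pck(pred, gt, t_min=20.0, t_max=50.0, steps=100):
    """Area under the PCK curve between t_min and t_max (mm), normalized to [0, 1]."""
    errors = np.linalg.norm(pred - gt, axis=-1).reshape(-1)
    thresholds = np.linspace(t_min, t_max, steps)
    pck = np.array([(errors <= t).mean() for t in thresholds])   # fraction of correct joints
    return float(np.trapz(pck, thresholds) / (t_max - t_min))

# toy usage with random joints in millimetres
gt = np.random.rand(10, 21, 3) * 100
pred = gt + np.random.randn(10, 21, 3) * 5
print(epe(pred, gt), auc_pck(pred, gt, 0.0, 50.0), auc_pck(pred, gt, 20.0, 50.0))
```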
Table 3. Results on the MHP dataset. ↑: higher is better.

| Method | AUC ↑ |
|---|---|
| Zimmermann et al. [50] | 0.717 |
| Cai et al. [5] | 0.928 |
| Chen et al. [8] | 0.939 |
| Our multi-view method | 0.991 |
We implement our single-view and multi-view approaches in Python with PyTorch. In the training phase, we fix the batch size and use the Adam solver with an initial learning rate of 0.01. Both models are trained on a server with four GeForce GTX 1080 Ti GPUs.

When training the single-view network, we use a multi-stage training strategy. In the first stage, we train the 2D evidence network with the heatmap loss L_h. In the second stage, we fix the weights of the 2D evidence network and train the mesh network with the mesh loss L_m. In the third stage, we fix the weights of both the 2D evidence network and the mesh network, and train the joint depth network with the loss L_d. In the final stage, the whole network is optimized end-to-end.

For training the multi-view network, we apply the same multi-stage strategy. In the first stage, we initialize the 3D hand single-view network with the weights pre-trained for the single-view network and keep this part fixed while training the 3D hand fusion network. In the second stage, we activate both networks and fine-tune the whole architecture end-to-end.
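The multi-stage schedule above can be expressed as a small freeze/unfreeze helper; the sketch below is a hypothetical illustration in PyTorch, and the sub-module names evidence_2d, mesh_net, and depth_net are placeholders rather than names from the released code.

```python
import torch.nn as nn

def set_trainable(module: nn.Module, flag: bool) -> None:
    """Freeze or unfreeze all parameters of a sub-network."""
    for p in module.parameters():
        p.requires_grad = flag

def configure_stage(model, stage: int):
    """Hypothetical staged schedule for a model with .evidence_2d, .mesh_net, .depth_net.
    Returns the loss terms that are active in this stage."""
    nets = [model.evidence_2d, model.mesh_net, model.depth_net]
    if stage == 1:                      # train the 2D evidence network with L_h
        flags, losses = [True, False, False], ["L_h"]
    elif stage == 2:                    # freeze the 2D net, train the mesh network with L_m
        flags, losses = [False, True, False], ["L_m"]
    elif stage == 3:                    # freeze 2D and mesh nets, train the depth network with L_d
        flags, losses = [False, False, True], ["L_d"]
    else:                               # final stage: end-to-end fine-tuning of everything
        flags, losses = [True, True, True], ["L_h", "L_m", "L_d"]
    for net, flag in zip(nets, flags):
        set_trainable(net, flag)
    return losses

# toy usage with stand-in sub-networks
class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.evidence_2d, self.mesh_net, self.depth_net = nn.Linear(4, 4), nn.Linear(4, 4), nn.Linear(4, 4)

print(configure_stage(Toy(), stage=2))   # ['L_m']
```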
6. Experimental Results
To evaluate the effectiveness of the proposed multi-view method, we compare our single-view method with our multi-view method on the MHP dataset, with and without using data from the MVHM dataset. Table 2 and Figure 8(a) show that utilizing the multi-view information from the MHP dataset itself boosts the testing performance in the two AUC metrics and EPE_m by large margins, i.e., 0.218, 0.183, and 13.80 mm, respectively. When additional data from the MVHM dataset are used, further substantial gains are achieved, which reveals the effectiveness of using the collected MVHM dataset for training.

Figure 8. Ablation studies and comparison with state-of-the-art methods for single-view pose estimation. (a) PCK results of different settings on the STB dataset. (b) Comparison results in PCK with state-of-the-art methods on the STB dataset. (c) PCK results under different settings on the MHP dataset. (d) Comparison results in PCK with state-of-the-art methods on the MHP dataset.

Three current state-of-the-art methods are chosen for comparison with our method on the MHP dataset: Zimmermann et al. [50] (0.717 in AUC), Cai et al. [5] (0.928 in AUC), and Chen et al. [8] (0.939 in AUC). Zimmermann et al. [50] only report the numerical result, so we include it in Table 3 but do not show it in Figure 8(b). Our multi-view method achieves 0.990 in AUC, outperforming these competing methods by a large margin. This experiment shows that the proposed multi-view method and the established MVHM dataset are both beneficial and work together to reach a new state-of-the-art performance on the MHP dataset.

To further validate the effectiveness of the generated mesh dataset MVHM beyond multi-view methods, we also conduct experiments with single-view methods. We compare the results when models are trained solely on the MHP/STB datasets against training on the MHP/STB datasets together with the MVHM dataset. Table 2, Figure 8(a), and Figure 8(c) show that, on both the MHP and STB datasets, adding the mesh data greatly enhances the performance by granting the model the ability to capture mesh-level features, thereby leading to better results.

We select seven powerful, recently published methods for comparison with the proposed method: PSO [3], ICPPSO [10], CHPR [44], Iqbal et al. [19], Cai et al. [5], Zimmermann and Brox [49], and Ge et al. [16]. (Cai et al. [5] do not report these results in their paper; here we report the re-implementation results by Chen et al. [8].) The AUC curves are plotted in Figure 8(d). Ge et al. [16] also utilize an additional dataset to train their model and obtain the state-of-the-art result, which demonstrates the effectiveness of their mesh dataset. In addition, they introduce more sophisticated mesh metrics such as a surface normal loss. Iqbal et al. [19] and Cai et al. [5] leverage additional depth-map information to derive their models and achieve good results. As a multi-view approach without complicated components, our method is on par with the methods of Ge et al. [16] and Iqbal et al. [19] and outperforms most of the others on single-view tasks.
7. Conclusions
Estimating 3D hand poses from monocular images is an ill-posed problem due to depth ambiguity. Nevertheless, multi-view images can make up for this deficiency. To this end, we build a multi-view hand mesh dataset, MVHM, to enable training 3D pose estimators with mesh supervision. We also present a multi-view method that effectively fuses single-view predictions. When tested on the real-world multi-view dataset MHP, our multi-view method, with the aid of the MVHM dataset, achieves state-of-the-art performance.
Acknowledgement.
The authors thank Kun Han for helping generate the dataset images used in this work. This work was supported in part by the Ministry of Science and Technology (MOST) under grants MOST 107-2628-E-009-007-MY3, MOST 109-2634-F-007-013, and MOST 109-2221-E-009-113-MY3, and by Qualcomm through a Taiwan University Research Collaboration Project.

References

[1] Shamama Anwar, Subham Kumar Sinha, Snehanshu Vivek, and Vishal Ashank. Hand gesture recognition: a survey. In Nanoelectronics, Circuits and Communication Systems, 2019.
[2] Seungryul Baek, Kwang In Kim, and Tae-Kyun Kim. Pushing the envelope for RGB-based dense 3D hand pose estimation via neural rendering. In CVPR, 2019.
[3] Adnane Boukhayma, Rodrigo de Bem, and Philip HS Torr. 3D hand shape and pose from images in the wild. In CVPR, 2019.
[4] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.
[5] Yujun Cai, Liuhao Ge, Jianfei Cai, and Junsong Yuan. Weakly-supervised 3D hand pose estimation from monocular RGB images. In ECCV, 2018.
[6] Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. OpenPose: realtime multi-person 2D pose estimation using Part Affinity Fields. TPAMI, 2018.
[7] Tejo Chalasani and Aljosa Smolic. Simultaneous segmentation and recognition: Towards more accurate ego gesture recognition. In ICCVW, 2019.
[8] Liangjian Chen, Shih-Yao Lin, Yusheng Xie, Yen-Yu Lin, Wei Fan, and Xiaohui Xie. DGGAN: Depth-image guided generative adversarial networks for disentangling RGB and depth images in 3D hand pose estimation. In WACV, 2020.
[9] Liangjian Chen, Shih-Yao Lin, Yusheng Xie, Yen-Yu Lin, and Xiaohui Xie. Temporal-aware self-supervised learning for 3D hand pose and mesh estimation in videos. In WACV, 2021.
[10] Liangjian Chen, Shih-Yao Lin, Yusheng Xie, Hui Tang, Yufan Xue, Yen-Yu Lin, Xiaohui Xie, and Wei Fan. TAGAN: Tonality-alignment generative adversarial networks for realistic hand pose synthesis. In BMVC, 2019.
[11] Liangjian Chen, Shih-Yao Lin, Yusheng Xie, Hui Tang, Yufan Xue, Xiaohui Xie, Yen-Yu Lin, and Wei Fan. Generating realistic training images based on tonality-alignment generative adversarial networks for hand pose estimation. arXiv preprint arXiv:1811.09916, 2018.
[12] Ulysse Côté-Allard, Cheikh Latyr Fall, Alexandre Drouin, Alexandre Campeau-Lecours, Clément Gosselin, Kyrre Glette, François Laviolette, and Benoit Gosselin. Deep learning for electromyographic hand gesture signal classification using transfer learning. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 27(4):760–771, 2019.
[13] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In NeurIPS, 2016.
[14] Hongyang Gao and Shuiwang Ji. Graph U-Nets. arXiv preprint arXiv:1905.05178, 2019.
[15] Liuhao Ge, Hui Liang, Junsong Yuan, and Daniel Thalmann. Robust 3D hand pose estimation in single depth images: from single-view CNN to multi-view CNNs. In CVPR, pages 3593–3601, 2016.
[16] Liuhao Ge, Zhou Ren, Yuncheng Li, Zehao Xue, Yingying Wang, Jianfei Cai, and Junsong Yuan. 3D hand shape and pose estimation from a single RGB image. In CVPR, 2019.
[17] Francisco Gomez-Donoso, Sergio Orts-Escolano, and Miguel Cazorla. Large-scale multiview 3D hand pose dataset. arXiv preprint arXiv:1707.03742, 2017.
[18] Yi-Ping Hung and Shih-Yao Lin. Re-anchorable virtual panel in three-dimensional space, Dec. 27 2016. US Patent 9,529,446.
[19] Umar Iqbal, Pavlo Molchanov, Thomas Breuel, Juergen Gall, and Jan Kautz. Hand pose estimation via latent 2.5D heatmap regression. In ECCV, 2018.
[20] M. R. Mohamad Ismail, C. K. Lam, K. Sundaraj, and M. H. F. Rahiman. Hand motion pattern recognition analysis of forearm muscle using MMG signals. Bulletin of Electrical Engineering and Informatics, 8(2):533–540, 2019.
[21] Hanbyul Joo, Tomas Simon, and Yaser Sheikh. Total capture: A 3D deformation model for tracking faces, hands, and bodies. In CVPR, 2018.
[22] Mia Kokic, Danica Kragic, and Jeannette Bohg. Learning to estimate pose and shape of hand-held objects from RGB images. arXiv preprint arXiv:1903.03340, 2019.
[23] Deying Kong, Yifei Chen, Haoyu Ma, Xiangyi Yan, and Xiaohui Xie. Adaptive graphical model network for 2D hand pose estimation. arXiv preprint arXiv:1909.08205, 2019.
[24] Deying Kong, Haoyu Ma, and Xiaohui Xie. SIA-GCN: A spatial information aware graph neural network with 2D convolutions for hand pose estimation. arXiv preprint arXiv:2009.12473, 2020.
[25] Dominik Kulon, Riza Alp Guler, Iasonas Kokkinos, Michael M. Bronstein, and Stefanos Zafeiriou. Weakly-supervised mesh-convolutional hand reconstruction in the wild. In CVPR, pages 4990–5000, 2020.
[26] Shih-Yao Lin, Yun-Chien Lai, Li-Wei Chan, and Yi-Ping Hung. Real-time 3D model-based gesture tracking for multimedia control. In ICPR, 2010.
[27] Shih-Yao Lin, Chuen-Kai Shie, Shen-Chi Chen, and Yi-Ping Hung. AirTouch panel: a re-anchorable virtual touch panel. In MM, 2013.
[28] Gyeongsik Moon, Shoou-I Yu, He Wen, Takaaki Shiratori, and Kyoung Mu Lee. InterHand2.6M: A dataset and baseline for 3D interacting hand pose estimation from a single RGB image.
[29] Franziska Mueller, Florian Bernard, Oleksandr Sotnychenko, Dushyant Mehta, Srinath Sridhar, Dan Casas, and Christian Theobalt. GANerated hands for real-time 3D hand tracking from monocular RGB. In CVPR, 2018.
[30] Franziska Mueller, Dushyant Mehta, Oleksandr Sotnychenko, Srinath Sridhar, Dan Casas, and Christian Theobalt. Real-time hand tracking under occlusion from an egocentric RGB-D sensor. In ICCVW, 2017.
[31] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In ECCV, pages 483–499, 2016.
[32] Javier Romero, Dimitrios Tzionas, and Michael J. Black. Embodied hands: Modeling and capturing hands and bodies together. ACM Transactions on Graphics, 36(6):245, 2017.
[33] Tomas Simon, Hanbyul Joo, Iain Matthews, and Yaser Sheikh. Hand keypoint detection in single images using multiview bootstrapping. In CVPR, 2017.
[34] Srinath Sridhar, Franziska Mueller, Michael Zollhöfer, Dan Casas, Antti Oulasvirta, and Christian Theobalt. Real-time joint tracking of a hand manipulating an object from RGB-D input. In ECCV, 2016.
[35] Xiao Sun, Yichen Wei, Shuang Liang, Xiaoou Tang, and Jian Sun. Cascaded hand pose regression. In CVPR, 2015.
[36] Danhang Tang, Hyung Jin Chang, Alykhan Tejani, and Tae-Kyun Kim. Latent regression forest: Structured estimation of 3D articulated hand posture. In CVPR, 2014.
[37] Bugra Tekin, Federica Bogo, and Marc Pollefeys. H+O: Unified egocentric recognition of 3D hand-object poses and interactions. In CVPR, 2019.
[38] Jonathan Tompson, Murphy Stein, Yann Lecun, and Ken Perlin. Real-time continuous pose recovery of human hands using convolutional networks. ACM Transactions on Graphics, 33, 2014.
[39] Zhenyu Wu, Duc Hoang, Shih-Yao Lin, Yusheng Xie, Liangjian Chen, Yen-Yu Lin, Zhangyang Wang, and Wei Fan. MM-Hand: 3D-aware multi-modal guided hand generative network for 3D hand pose synthesis. In MM, 2020.
[40] John Yang, Hyung Jin Chang, Seungeui Lee, and Nojun Kwak. SeqHAND: RGB-sequence-based 3D hand pose and shape estimation. arXiv preprint arXiv:2007.05168, 2020.
[41] Linlin Yang and Angela Yao. Disentangling latent hands for image synthesis and pose estimation. In CVPR, 2019.
[42] Shanxin Yuan, Qi Ye, Bjorn Stenger, Siddhant Jain, and Tae-Kyun Kim. BigHand2.2M benchmark: Hand pose dataset and state of the art analysis. In CVPR, 2017.
[43] Zahoor Zafrulla, Helene Brashear, Thad Starner, Harley Hamilton, and Peter Presti. American sign language recognition with the Kinect. In ICMI, 2011.
[44] Jiawei Zhang, Jianbo Jiao, Mingliang Chen, Liangqiong Qu, Xiaobin Xu, and Qingxiong Yang. 3D hand pose tracking and estimation using stereo matching. arXiv preprint arXiv:1610.07214, 2016.
[45] Xiong Zhang, Qiang Li, Wenbo Zhang, and Wen Zheng. End-to-end hand mesh recovery from a monocular RGB image. arXiv preprint arXiv:1902.09305, 2019.
[46] Zhengli Zhao, Sameer Singh, Honglak Lee, Zizhao Zhang, Augustus Odena, and Han Zhang. Improved consistency regularization for GANs. arXiv preprint arXiv:2002.04724, 2020.
[47] Zhengli Zhao, Samarth Sinha, Anirudh Goyal, Colin Raffel, and Augustus Odena. Top-k training of GANs: Improving GAN performance by throwing away bad samples. arXiv preprint arXiv:2002.06224, 2020.
[48] Zhengli Zhao, Zizhao Zhang, Ting Chen, Sameer Singh, and Han Zhang. Image augmentations for GAN training. arXiv preprint arXiv:2006.02595, 2020.
[49] Christian Zimmermann and Thomas Brox. Learning to estimate 3D hand pose from single RGB images. In CVPR, 2017.
[50] Christian Zimmermann, Duygu Ceylan, Jimei Yang, Bryan Russell, Max Argus, and Thomas Brox. FreiHAND: A dataset for markerless capture of hand pose and shape from single RGB images. In ICCV, 2019.